How can we get the position of text in the generated audio?

#12
by maifeeulasad - opened

It's really cool that we can now generate audio in real time with microsoft/VibeVoice-Realtime-0.5B. I was thinking about integrating it into my application, and then I ran into a critical UX requirement: it would be great if we could highlight the text that corresponds to the audio currently playing.

Does VibeVoice support this?

Opened an issue: https://github.com/microsoft/VibeVoice/issues/144

Thank you for your interest. Currently, the model cannot provide alignment information between generated speech and text.
