Just really bad quality in general? #301
Comments
Hey man, did you ever find a forced alignment solution that works? I'm also using 11labs and trying to do some forced alignment to fix the subtitles I'm generating.
@jahamed Yep, pyfoal works like a charm for me. Takes some headache to set up, but it's extremely good, at least with 11labs' output. Glad to save you all that time it took me to get there lmao
Thanks @Oleg-A-LLIto! Yes, seems like a pain to set up. I got a decent working example with Gentle Forced Aligner too, which runs more easily on a Mac. Very surprised there isn't an easier & more modern way to get this stuff working (at least in Node).
Hey man, I found another good library for this, a lot more modern + easier to use. Alignment is very good, thought you should know. It's working perfectly for me now.
For word-level timestamps, you should use whisperX with aeneas. Get the aeneas result, transform the data for the whisperX align model, profit.
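A rough sketch of the "transform" step suggested above, assuming aeneas was run at sentence or paragraph level with JSON output (a top-level `fragments` list whose entries carry `begin`, `end`, and `lines`) and that the whisperX align step takes segments shaped like `{"start", "end", "text"}`. The function name `aeneas_to_whisperx_segments` is made up for illustration, and dropping zero-length fragments is one possible choice, not something the thread prescribes.

```python
import json


def aeneas_to_whisperx_segments(sync_map_json: str) -> list[dict]:
    """Convert an aeneas JSON sync map into whisperX-style segments.

    Hypothetical helper: assumes aeneas JSON output with string
    timestamps in seconds and a list of text lines per fragment.
    """
    sync_map = json.loads(sync_map_json)
    segments = []
    for fragment in sync_map.get("fragments", []):
        text = " ".join(fragment.get("lines", [])).strip()
        start = float(fragment["begin"])
        end = float(fragment["end"])
        # Drop empty or zero-length fragments: they give the word-level
        # aligner no audio to work with. This loses their text, so you
        # may prefer to merge them into a neighbouring segment instead.
        if not text or end <= start:
            continue
        segments.append({"start": start, "end": end, "text": text})
    return segments
```

The resulting list would then be handed to whisperX's alignment step (as far as I know, `whisperx.load_align_model` followed by `whisperx.align`), which is not called here so the snippet stays runnable without the package installed.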
You might want to use the I just noticed some pretty rough results with a build that was falling back to python + subprocess for speech synthesis, but got much better results with one using the compiled
(But I haven't tried the other packages mentioned here...)
So, I'm using this to align the text I get from a TTS engine, a pretty good one, too (ElevenLabs). To me that sounds like a perfect task: no mic noise, no background sounds, English language, and really stable volume. Still, I'm not sure what I'm doing wrong here, but it works extremely poorly, to the point that the result is pretty much unusable. Half the words (and yes, I'm aligning per word) are crushed into 0-second intervals, and the rest are overly long intervals scattered around seemingly at random. I feel like I would get a much better result by just approximating the mapping from character/vowel counts. Given how bad it is, I'm guessing this isn't how Aeneas normally performs, so what could be causing such poor results? I'm not getting any errors, I'm running Windows 11, and I'm processing fairly small chunks of text.
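For reference, the character-count approximation mentioned above could look something like the sketch below. It is only a naive baseline for sanity-checking aligner output, not a replacement for forced alignment; the function name `approximate_word_timings` is hypothetical, and the total clip duration is assumed to be known.

```python
def approximate_word_timings(text: str, duration: float) -> list[tuple[str, float, float]]:
    """Spread `duration` seconds over the words in proportion to their length.

    Naive baseline: ignores pauses, punctuation, and speaking-rate changes.
    """
    words = text.split()
    total_chars = sum(len(word) for word in words)
    if total_chars == 0 or duration <= 0:
        return []
    timings = []
    consumed = 0  # characters assigned to earlier words
    for word in words:
        start = duration * consumed / total_chars
        consumed += len(word)
        end = duration * consumed / total_chars
        timings.append((word, start, end))
    return timings
```

Every word gets a non-zero interval and the intervals tile the whole clip, so comparing this against the aligner's output makes zero-length or wildly long word intervals easy to spot.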