Tacotron2 TTS viseme generation - text-to-speech

I am currently working on a project that uses Tacotron2 TTS to produce a human-like voice for a robot. I would also like to get the visemes from the TTS, so I can synchronize the robot's face animation with the voice. How can I get the visemes and the duration of each one with Tacotron2?
Thanks

Can you get the phonemes out? You can refer to these phoneme-viseme tables to do the conversion. You could try using espeak for the text -> phoneme conversion. If you only need a rough sync, you could compare the duration of the espeak output to the duration of your Tacotron2 output.
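Something along these lines (a rough sketch, not tested against your setup; it assumes espeak and an mbrola voice such as mb-en1 are installed, and the helper name is made up) should give you phonemes with per-phoneme durations in milliseconds, which you can then map to visemes and rescale to the length of the Tacotron2 audio:

    import subprocess

    def espeak_phoneme_durations(text, voice="mb-en1"):
        """Return (phoneme, duration_ms) pairs from espeak's mbrola .pho output."""
        out = subprocess.run(
            ["espeak", "-q", "-v", voice, "--pho", text],
            capture_output=True, text=True, check=True).stdout

        phonemes = []
        for line in out.splitlines():
            parts = line.split()
            if len(parts) >= 2:
                try:
                    # .pho lines are: phoneme name, duration in ms, optional pitch points
                    phonemes.append((parts[0], float(parts[1])))
                except ValueError:
                    continue  # skip any non-phoneme lines
        return phonemes

    # Map each phoneme to a viseme with a lookup table, then rescale the durations
    # by (tacotron2_audio_length / total_espeak_duration) for a rough sync.
    print(espeak_phoneme_durations("Hello robot"))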

Related

Best way to create text-to-speech voice variants

I need a minimum of 3-4 different TTS voices, but unfortunately I have only one.
This is because I have only one Italian neural voice (Diego); the others are all standard voices and their quality is much worse.
The final objective is to create a voice-over for at least 3-4 people, and I can't use the exact same voice for all of them.
For this reason, I would like to create some variants starting from the one neural voice I have, variants that give the impression of other people's voices without sounding unnatural.
Currently I have Adobe Audition, Audacity, Ircam Trax and ffmpeg, and apart from these I can use SSML with the API (in this case Microsoft Azure).
I don't know which effects to use, or to what degree, without damaging the voice.
In short, I am asking what the best way to do this is, using the software I have (or other software, if it will give better results).
Thanks!
What language are you using? If you are using English, I am sure you can find more than 3-4 neural voices; there are en-US, en-GB, en-CA and en-AU neural voices, and they all sound natural.
You can also tune the pitch using SSML to make the same voice sound different.
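For example, something like the following (a sketch only; the pitch and rate values are placeholders to experiment with, and the voice name assumes the Azure it-IT-DiegoNeural voice you mentioned):

    <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="it-IT">
      <voice name="it-IT-DiegoNeural">
        <prosody pitch="+8%" rate="-5%">This is one variant of the same voice.</prosody>
      </voice>
    </speak>

A second "speaker" could use, say, pitch="-6%" and a slightly faster rate; small offsets usually stay natural, while large ones start to sound processed.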
If you would like to create entirely different voices, try customvoice.ai with your own speech data (or your voice talents).
Or, what are the particular 'variants' you are looking for?

Is there a way to make Google Text to Speech, speak text for a desired duration?

I went through the documentation of Google Text to Speech SSML.
https://developers.google.com/assistant/actions/reference/ssml#prosody
There is a tag called <prosody> which, as per the W3 specification, can accept an attribute called duration: a value in seconds or milliseconds for the desired time to take to read the contained text.
So <speak><prosody duration='6s'>Hello, How are you?</prosody></speak> should take 6 seconds for Google Text-to-Speech to speak. But when I try it here https://cloud.google.com/text-to-speech/ , it's not working, and I also tried it in the REST API.
Does Google Text-to-Speech not take the duration attribute into account? If it doesn't, is there a way to achieve the same result?
There are two ways I know of to solve this:
First option: call Google's API twice. Use the first call to measure the duration of the spoken audio, and the second call to adjust the rate parameter accordingly.
Pros: Better audio quality? (this is subjective and depends on taste as well as the application's requirements)
Cons: Doubles the cost and processing time.
Second option:
Post-process the audio using a specialized tool such as ffmpeg (see the sketch after this list).
Pros: Cost effective and can be fast if implemented correctly.
Cons: Some knowledge of the concepts and the usage of an audio post-processing library is required (no need to become an expert though).
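For the second option, a rough sketch (assuming the ffmpeg and ffprobe binaries are on the PATH; the function name and file paths are made up):

    import subprocess

    def stretch_to_duration(in_path, out_path, target_seconds):
        """Time-stretch an audio file so it lasts roughly target_seconds."""
        # Measure the actual duration with ffprobe.
        probe = subprocess.run(
            ["ffprobe", "-v", "error", "-show_entries", "format=duration",
             "-of", "default=noprint_wrappers=1:nokey=1", in_path],
            capture_output=True, text=True, check=True)
        actual = float(probe.stdout.strip())

        # atempo > 1 speeds the audio up, < 1 slows it down.
        # Note: ffmpeg's atempo filter accepts roughly 0.5-2.0 per instance;
        # chain several atempo filters if you need a bigger change.
        tempo = actual / target_seconds
        subprocess.run(
            ["ffmpeg", "-y", "-i", in_path, "-filter:a", f"atempo={tempo:.3f}", out_path],
            check=True)

    stretch_to_duration("tts_output.mp3", "tts_6s.mp3", 6.0)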
As Mr Lister already mentioned, the documentation clearly says:
<prosody>
Used to customize the pitch, speaking rate, and volume of text
contained by the element. Currently the rate, pitch, and volume
attributes are supported.
The rate and volume attributes can be set according to the W3
specifications.
You can test it using the UI.
In particular you can use things like
rate="low"
or
rate="80%"
to adjust the speed. However, that is as far as you can go with Google TTS.
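For example (using the same sentence as in the question):

    <speak>
      <prosody rate="80%">Hello, How are you?</prosody>
    </speak>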
AWS Polly does support what you need, but only on Standard voices (not Neural).
Here is the documentation.
Setting a Maximum Duration for Synthesized Speech
Polly also has a UI to do a quick test.
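If I remember the Polly docs correctly, the SSML looks roughly like this (check the linked page for the exact attribute name and limits, and note it only applies to standard voices):

    <speak>
      <prosody amazon:max-duration="6s">Hello, How are you?</prosody>
    </speak>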

What are the inputs to a WaveNet?

I am trying to implement TTS. I have just read about WaveNet, but I am confused about local conditioning. The original paper, here, explains adding a time series for local conditioning, and this article explains that adding mel spectrogram features for local conditioning is fine. We know that WaveNet is a generative model that takes raw audio as input and generates high-quality audio output when conditioned.
My question is whether those mel spectrogram features come from the raw audio passed as input, or from some other audio.
Secondly, for implementing TTS, will the audio input be generated by some other TTS system whose output quality is then improved by WaveNet? Am I correct to think of it this way?
Please help, it is direly needed.
Thanks
Mel features are created by the actual TTS module from the text (Tacotron2, for example); then you run the vocoder module (WaveNet) to create speech.
It is better to try an existing implementation like nvidia/tacotron2 + nvidia/waveglow. WaveGlow is better than WaveNet, by the way, and much faster; WaveNet is very slow.
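A rough sketch of that two-stage pipeline, essentially the example published on NVIDIA's PyTorch Hub page lightly trimmed (it assumes the 'nvidia_tacotron2', 'nvidia_waveglow' and 'nvidia_tts_utils' hub entry points and a CUDA-capable machine):

    import torch

    hub_repo = 'NVIDIA/DeepLearningExamples:torchhub'

    # Text -> mel spectrogram (Tacotron2) and mel spectrogram -> waveform (WaveGlow).
    tacotron2 = torch.hub.load(hub_repo, 'nvidia_tacotron2', model_math='fp16')
    tacotron2 = tacotron2.to('cuda').eval()
    waveglow = torch.hub.load(hub_repo, 'nvidia_waveglow', model_math='fp16')
    waveglow = waveglow.remove_weightnorm(waveglow).to('cuda').eval()
    utils = torch.hub.load(hub_repo, 'nvidia_tts_utils')

    text = "Hello, this is a test."
    sequences, lengths = utils.prepare_input_sequence([text])

    with torch.no_grad():
        mel, _, _ = tacotron2.infer(sequences, lengths)  # mel features from text
        audio = waveglow.infer(mel)                      # raw waveform from mel

    # `audio` holds samples at 22050 Hz, ready to save as a WAV file.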

Pocket Sphinx slow using a grammar

I have been trying to use CMU's Pocket Sphinx to perform speech recognition on an Android tablet.
The tutorial on doing this can be found here. My problem is that recognition runs really slowly if I use a grammar of any significant size. Using a language model, I can achieve good accuracy and speed, so my temporary solution has been to generate a language model from my grammar and use that.
In my configuration, I set -bestpath = false. After that, I am at a loss as to how to speed things up.
Clarification: I understand that a large grammar will take a long time to initialize, but I don't think it should take a long time for recognition to run using it.
Is there anyone with experience using Pocket Sphinx and a grammar who can share their experience, configuration, etc.?
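For reference, a grammar-based setup along the lines described above (a sketch using the older pocketsphinx-python bindings; the model, dictionary and grammar paths are placeholders) looks roughly like this:

    from pocketsphinx import Decoder

    config = Decoder.default_config()
    config.set_string('-hmm', 'model/en-us')               # acoustic model
    config.set_string('-dict', 'model/cmudict-en-us.dict')
    config.set_string('-jsgf', 'commands.gram')            # the JSGF grammar
    config.set_boolean('-bestpath', False)                 # the setting mentioned above

    decoder = Decoder(config)
    decoder.start_utt()
    with open('utterance.raw', 'rb') as f:                 # 16 kHz, 16-bit mono PCM
        decoder.process_raw(f.read(), False, True)
    decoder.end_utt()
    print(decoder.hyp().hypstr if decoder.hyp() else '(no result)')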
We used PocketSphinx on a 1 GHz Android mobile, following tutorials available on the net (just do a Google search). It was quite quick to start up, but it hung for about 10 seconds after you stopped recording, even if you had only recorded two words. This was using the default "hub4" prerecorded grammar.

API to break voice into phonemes / synthesize new speech given speech samples?

You know those movies where the tech geeks record someone's voice, and their software breaks it into phonemes? Which they can then use to type in any phrase, and make it seem as if the target is saying it?
Does that software exist in an API version? I don't even know what to Google.
There is no such software. Breaking arbitrary speech into its constituent phonemes is only a partially solved problem: speech-to-text software is still imperfect, as is text-to-speech.
The idea is to reproduce the timbre of the target's voice. Even if you were able to segment the audio perfectly, reordering the phonemes would produce audio with unnatural cadence and intonation, not to mention splicing artifacts. At that point you're getting into smoothing, time-scaling, and pitch correction, all of which are possible and well-understood in theory, but operate poorly on real-world data, especially when the audio sample in question is as short as a single phoneme, and further when the timbre needs to be preserved.
These problems are compounded on the phonetic side by allophonic variation in sounds based on accent and surrounding phonemes; in order to faithfully produce even a low-quality approximation of the audio, you'd need a detailed understanding of the target's language, accent, and speech patterns.
Furthermore, your ultimate problem is one of social engineering, and people are not easy to fool when it comes to the voices of people they know. Even with a large corpus of input data, at best you could get a short low-quality sample, hardly enough for a conversation.
So while it's certainly possible, it's difficult; even if it existed, it wouldn't always be good enough.
SRI International (the company that created Siri for iOS) has an SDK called EduSpeak, which will take audio input and break it down into individual phonemes. I know this because I sat through a demo of the product about a week ago. During the demo, the presenter showed us an application that was created using the SDK. The application gave a few lines of text for the presenter to read. After reading the text, the application displayed a bar chart where each bar represented a phoneme from his speech. The height of each bar represented a score of how well each phoneme was pronounced (the presenter was not a native English speaker, so he received lower scores on certain phonemes compared to others). The presenter could also click on each individual bar to have only that individual phoneme played back using the original audio.
So yes, software exists that divides audio up by phoneme, and it does a very good job of it. Now, whether or not those phonemes can be re-assembled into speech is an open question. If we end up getting a trial version of the SDK, I'll try it out and let you know.
If your aim is to mimic someone else's voice, then another approach is to convert your own voice (instead of assembling phonemes). It is (surprisingly) called voice conversion, e.g. http://www.busim.ee.boun.edu.tr/~speech/projects/Voice_Conversion.htm
The technology is called "voice synthesis" and "voice recognition".
The Java API for this can be found here: Java voice JSAPI
Apple has an API for this: Apple speech
Microsoft has several; one is discussed here: Vista speech
Lyrebird is a start-up that is working on this very problem. Given samples of a person's voice and some written text, it can synthesize a spoken version of that written text in the voice of the person in the samples.
You can get interesting voice warping effects with a formant-aware pitch shift. Adobe Audition has a pretty good implementation. Antares produces some interesting vocal effects VST plugins.
These techniques use some form of linear predictive coding (LPC) to treat the voice as a source-filter model. LPC works on speech signals by estimating the resonance of the vocal tract (formant), reversing its effect with an inverse filter, and then coding the resulting residual signal. The residual signal is ideally an impulse train that represents the glottal impulse. This allows the scaling of pitch and formants independently, which leads to a much better gender conversion result than simple pitch shifting.
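As an illustration of that source-filter idea (a sketch only, not any particular product's implementation; it assumes librosa and scipy, uses a placeholder file path, and for simplicity fits one LPC model to the whole signal where real systems work frame by frame):

    import librosa
    from scipy.signal import lfilter

    # Load a short speech sample (placeholder path).
    y, sr = librosa.load("speech.wav", sr=16000)

    # Estimate the vocal-tract (formant) filter with linear prediction.
    order = 16                       # a typical LPC order for 16 kHz speech
    a = librosa.lpc(y, order=order)  # all-pole filter coefficients, a[0] == 1

    # Inverse-filter the signal to get the residual (approximate glottal excitation).
    residual = lfilter(a, [1.0], y)

    # Re-synthesize by running the residual back through the all-pole filter.
    # Modifying `a` (formants) or the residual (pitch) before this step gives
    # independent control over timbre and pitch.
    resynth = lfilter([1.0], a, residual)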
I dunno about a commercially available solution, but the concept isn't entirely out of the range of possibility. For example, the University of Delaware has fairly decent software for doing just that.
http://www.modeltalker.com