Google Translate: Text-to-speech (TTS) output in MP3 from IME Pinyin input

Using Google Translate, it is possible to generate TTS MP3 audio files from English or Chinese input with a simple URL. For example:
URL for English TTS for the word "English":
http://translate.google.com/translate_tts?tl=en&q=English
URL for Chinese TTS for the word 中文:
http://translate.google.com/translate_tts?tl=zh_CN&q=中文
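For reference, a minimal C# sketch that fetches one of these URLs and saves the response as an MP3 file (the output file name and the User-Agent header are assumptions, and the endpoint's parameters or availability may change):

```csharp
using System;
using System.Net;

class TranslateTtsDownload
{
    static void Main()
    {
        // Query text from the example above; Uri.EscapeDataString handles the UTF-8 percent-encoding.
        string query = Uri.EscapeDataString("中文");
        string url = "http://translate.google.com/translate_tts?tl=zh_CN&q=" + query;

        using (var client = new WebClient())
        {
            // Assumption: some requests are rejected without a browser-like User-Agent.
            client.Headers[HttpRequestHeader.UserAgent] = "Mozilla/5.0";
            client.DownloadFile(url, "zhongwen.mp3");
        }

        Console.WriteLine("Saved zhongwen.mp3");
    }
}
```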
How can I do the same using IME Hanyu Pinyin with tone marks as input? For example, I want to generate the TTS audio file for zhōng wén (instead of 中文).
I have searched high and low on this website for a solution but was not successful. I do apologise if I have overlooked a solution previously offered here. Thanks

Related

How to build a new voice (language) for Festival using HTS

I want to build my own TTS (text-to-speech) app using HTS (the HMM-based speech synthesis system) for the Arabic language.
I have not been able to find any step-by-step instructions on how to build the synthesizer using HTS. What I have done so far is download the sample speaker-dependent demo from the HTS website, train on that data, and test it in Festival (English speaker).
Now I don't know which files I should change in the HTS demo to build my own voice (language).
First build a Festival unit-selection voice for your language by following the Building Synthetic Voices guide.
Once you have the voice and the required lab and utt files, run the training.pl script from HTS with the paths updated to point to your database, and it will build the voice for you.

Voice to Text API with Language Detection or Confidence Rate?

I am trying to develop an application that needs to detect which of two possible languages is being spoken in an audio stream and transcribe the audio to text.
Most voice-to-text APIs require the language to be specified before recognition. The Google Translate website offers voice-to-text with language detection, so I was wondering whether there is any API that can recognize the language from audio.

MS Speech Platform 11 Helena Spanish voice does not play interrogative intonation for TTS

I am working on a VB.NET application that uses the MS Speech Platform 11 SDK. I have noticed a problem with the MS Helena Spanish voice when playing Spanish text as audio. If the text contains a question, that is, text enclosed in an upside-down and a normal question mark, the audio is not spoken with the intonation of a question. When I use the MS Helen English voice, the text is spoken with the intonation of a question, as you would expect. I have tried to get around this by adjusting different attributes, such as pitch, for the Spanish voice, but have had no success so far. I will continue to investigate.
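For concreteness, one way such pitch adjustments are typically attempted with the Speech Platform is SSML prosody markup passed to the synthesizer. The sketch below is C# rather than the original VB.NET, and the voice name and prosody values are assumptions:

```csharp
using Microsoft.Speech.Synthesis;

class HelenaQuestionTest
{
    static void Main()
    {
        // A Spanish question wrapped in prosody markup to nudge pitch and rate.
        string ssml =
            "<speak version=\"1.0\" xmlns=\"http://www.w3.org/2001/10/synthesis\" xml:lang=\"es-ES\">" +
            "<prosody pitch=\"+15%\" rate=\"-10%\">¿Dónde está la estación?</prosody>" +
            "</speak>";

        using (var synth = new SpeechSynthesizer())
        {
            // Voice name is an assumption; check synth.GetInstalledVoices() for the exact name on your system.
            synth.SelectVoice("Microsoft Server Speech Text to Speech Voice (es-ES, Helena)");
            synth.SetOutputToDefaultAudioDevice();
            synth.SpeakSsml(ssml);
        }
    }
}
```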
Please let me know if you have any questions or need further information.
Thanks,
Gil

Google Voice Recognition on Movies

I've had excellent results with the Google speech recognition API on natural dialogue; however, for audio from YouTube videos or movies, recognition is poor or nonexistent.
Recordings of my own voice made on an iPhone 4, in both Spanish and English, are recognized, but with the same phone recording a movie it is almost impossible, even in a scene where a character is talking with little background noise. I have only succeeded once.
I tried to clean up the sound with SoX (Sound eXchange) using the noisered and compand effects, without any success.
Any ideas? Or are these simply sounds that the Google API cannot identify, no matter how much you process them? Would I have better success with other speech recognition software?
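For reference, the SoX clean-up described above is typically a two-pass noiseprof/noisered workflow. A hedged C# sketch that shells out to sox (file names and parameter values are placeholders, and sox is assumed to be on the PATH):

```csharp
using System.Diagnostics;

class SoxCleanup
{
    static void Sox(string args)
    {
        // Run the sox binary with the given arguments and wait for it to finish.
        using (var p = Process.Start(new ProcessStartInfo("sox", args) { UseShellExecute = false }))
        {
            p.WaitForExit();
        }
    }

    static void Main()
    {
        // 1. Build a noise profile from a section of the recording that contains only background noise.
        Sox("noise-only.wav -n noiseprof noise.prof");

        // 2. Apply noise reduction using that profile (0.21 is a common starting strength).
        Sox("movie-clip.wav cleaned.wav noisered noise.prof 0.21");

        // 3. Optionally compress the dynamic range (illustrative companding parameters).
        Sox("cleaned.wav final.wav compand 0.3,1 6:-70,-60,-20 -5 -90 0.2");
    }
}
```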
The Google voice recognizer (like most other recognizers) does not cope well with reverberation. In most video scenes the distance between the speaker and the microphone is more than 1-3 meters. Try putting your phone on a table and recognizing something from 3 meters away: it will not get you anywhere, even though the sound quality will seem very good.

Convert audio (wav file) to text using SAPI?

My task is to convert an audio file, not live speech from a person, into text.
For example, if "Hello there" is stored in a WAV file, it should be transcribed and the string "Hello there" shown on screen.
Code in any language is welcome, but C# is the priority.
SAPI can certainly do what you want. Start with an in-proc recognizer, connect up your audio as a file stream, set dictation mode, and off you go.
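A minimal C# sketch of that flow, using the managed System.Speech wrapper (which drives an in-proc SAPI recognizer under the hood); the WAV file name is a placeholder and the file must be in a PCM format the engine accepts:

```csharp
using System;
using System.Speech.Recognition;

class WavToText
{
    static void Main()
    {
        // In-proc recognizer (as opposed to the shared desktop recognizer).
        using (var engine = new SpeechRecognitionEngine())
        {
            // Dictation mode: free-form text rather than a fixed command grammar.
            engine.LoadGrammar(new DictationGrammar());

            // Feed the recognizer from a WAV file instead of a microphone.
            engine.SetInputToWaveFile("hello-there.wav");

            // Recognize until the input is exhausted (Recognize returns null at end of stream
            // or when nothing could be recognized).
            RecognitionResult result;
            while ((result = engine.Recognize()) != null)
            {
                Console.WriteLine(result.Text);
            }
        }
    }
}
```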
Now the disappointing bit. You probably won't get terribly good results; in fact, I suspect that unless you're very lucky, you'll probably get total garbage.
There are several problems:
Dictation really only works well once the SR engine has been trained. If you're lucky (like me), you can get OK results, but if the speaker has an accent, training is a must.
Training only works well for a single voice. If you've got multiple speakers in a single audio file, it's not going to work well.
The audio model for dictation (and Speech Recognition in general) assumes that you're using a close-talk microphone (i.e., a microphone right next to your face, to minimize noise pickup). If your WAV files have extra noise, accuracy will go down dramatically.
Dragon Naturally Speaking Professional has support for transcription, but it still requires training and a single voice. (I do believe that DNS has a custom audio model that works well for voice recorders.) I haven't used it myself, so I don't know how well it would work in your situation.
Now, if you are looking for specific keywords, other people have had success using "Audio Mining": running the recognizer over an audio stream looking for specific keywords.
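A hedged sketch of that keyword-spotting approach with the same managed wrapper, swapping the dictation grammar for a small fixed keyword grammar (the keywords and file name are made up for illustration):

```csharp
using System;
using System.Speech.Recognition;

class KeywordSpotting
{
    static void Main()
    {
        using (var engine = new SpeechRecognitionEngine())
        {
            // A small fixed grammar: the engine only listens for these words.
            var keywords = new Choices("invoice", "refund", "cancel");
            engine.LoadGrammar(new Grammar(new GrammarBuilder(keywords)));

            engine.SetInputToWaveFile("call-recording.wav");

            // Report each keyword hit along with the engine's confidence score.
            RecognitionResult result;
            while ((result = engine.Recognize()) != null)
            {
                Console.WriteLine($"Heard '{result.Text}' (confidence {result.Confidence:F2})");
            }
        }
    }
}
```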