How to use phonetic or phoneme pronunciation in Google Text-to-Speech?

I have been trying for a while to get phonetic or phoneme pronunciation working with Google's Text-to-Speech, but have not managed to get it performing consistently.
I have managed to get limited results using https://tophonetics.com/
It translated "The cow went mad." to "ðə kaʊ wɛnt mæd.", but the 'the' ('ðə') was not audible, and when I tried "ðɪs ɪz səm fəˈnɛtɪk tɛkst ˈɪnˌpʊt" the results were no better.
Are there any SSML codes to define phonetic blocks of text, so that a format like "D,Is Iz sVm f#n'EtIk t'Ekst 'InpUt" could be used instead of "ðɪs ɪz səm fəˈnɛtɪk tɛkst ˈɪnˌpʊt"?

Google Text-to-Speech has supported the <phoneme> tag since at least spring 2021.
However, there are a lot of potential gotchas to overcome:
- The demo page filters out <phoneme> tags on the client side before they even reach the API. (It does the same with the <voice> tag, as pointed out here.)
- As with Microsoft Azure Text-to-Speech (see the other answer for details), each language only supports a limited set of phonemes ("letters"). If you use an unsupported one, the <phoneme> tag is completely ignored without any warning. So the official example <phoneme alphabet="ipa" ph="ˌmænɪˈtoʊbə">manitoba</phoneme> does not work with any English variant but en-US, since all the others lack the "o"/"oʊ" phoneme.
- It's unclear whether you need to use the v1beta1 API (which I can confirm is working) or whether v1 is also OK; see the sketch below.
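For reference, here is a minimal sketch of a v1beta1 request with a <phoneme> tag, assuming the google-cloud-texttospeech Python client (pip install google-cloud-texttospeech) and configured application credentials:

    # Minimal sketch: synthesize SSML containing a <phoneme> tag via v1beta1.
    from google.cloud import texttospeech_v1beta1 as tts

    client = tts.TextToSpeechClient()

    # en-US has the "oʊ" phoneme, so the official example should work here.
    ssml = '<speak><phoneme alphabet="ipa" ph="ˌmænɪˈtoʊbə">manitoba</phoneme></speak>'

    response = client.synthesize_speech(
        input=tts.SynthesisInput(ssml=ssml),
        voice=tts.VoiceSelectionParams(language_code="en-US"),
        audio_config=tts.AudioConfig(audio_encoding=tts.AudioEncoding.MP3),
    )

    with open("manitoba.mp3", "wb") as out:
        out.write(response.audio_content)

Note that this has to go through the API directly; pasting the same SSML into the demo page will not work because of the client-side filtering mentioned above.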

There is the SSML tag <phoneme> that serves your purpose.
Unfortunately, it is currently not supported in Google Cloud Text-to-Speech. The available subset of SSML tags for Google Cloud is listed in the documentation, and the <phoneme> tag is not in this list. An experiment using Google Cloud's Text-to-Speech demo confirms that the phonemes are ignored: the content of the tag is read as ordinary text, as @Trevor has already remarked in the comments.
The <phoneme> tag is, however, being supported by Microsoft Azure Text-to-Speech and Amazon Polly. In both cases, the available phonemes are limited to those available in the language being used (see here for Azure and here for Polly). The Azure documentation isn't 100% clear about the exclusion of out-of-language phonemes, but practical experiments with the Azure Text-to-Speech demo confirm that they're not working properly. In some cases, they at least seem to be replaced by the nearest available equivalent in the language used.
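For illustration, an Azure request would wrap the tag in SSML along these lines (the voice name is an assumption; any neural voice of the matching locale should do):

    <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
      <voice name="en-US-JennyNeural">
        I say <phoneme alphabet="ipa" ph="təˈmɑːtoʊ">tomato</phoneme>.
      </voice>
    </speak>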
Being restricted to the phonemes of one language severely limits the usefulness of the <phoneme> tag. For example, you can't use the feature to embed correctly pronounced content in a second language, as the second language will usually have some phonemes that are not available in the first. Concrete pairs in which each language has phonemes that are unavailable in the other include English/German, Spanish/German, and English/Spanish.

Related

Using Microsoft attributes with Google TTS

In my application I am already using Google TTS, but I am amazed by Microsoft TTS because it provides a lot more useful attributes than Google. Since I am more familiar with Google, I would like to keep my implementation but would still like to be able to use MS attributes like:
<mstts:express-as style="cheerful">
That'd be just amazing!
</mstts:express-as>
Is that possible?
There are no style attributes in Google Text-to-Speech, but you can change the Standard voice to a WaveNet voice[1].
The WaveNet voice synthesizes speech with more human-like emphasis and inflection on syllables, phonemes, and words. You can see all the supported voices in Google Text-to-Speech[2].
[1]https://cloud.google.com/text-to-speech/docs/wavenet#wavenet_voices
[2]https://cloud.google.com/text-to-speech/docs/voices
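For illustration, switching from a Standard to a WaveNet voice is just a matter of the voice name. A minimal sketch, assuming the google-cloud-texttospeech Python client (the voice names are examples from the supported-voices list[2]):

    from google.cloud import texttospeech as tts

    client = tts.TextToSpeechClient()
    response = client.synthesize_speech(
        input=tts.SynthesisInput(text="That'd be just amazing!"),
        voice=tts.VoiceSelectionParams(
            language_code="en-US",
            name="en-US-Wavenet-C",  # vs. "en-US-Standard-C" for the basic voice
        ),
        audio_config=tts.AudioConfig(audio_encoding=tts.AudioEncoding.MP3),
    )
    with open("output.mp3", "wb") as out:
        out.write(response.audio_content)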

Documentation for multi-programming-language API

I'm part of a team working on an SDK that is exposed in several programming languages, currently ObjC, C#, ActionScript, and Java (Android), and later we'll have even more languages.
We want to have documentation which is made up of two parts:
Human readable documentation
API Reference
There are links between the two parts: from the human-readable docs we have links to specific classes or methods, and from the API reference we may link to a document that explains the context in which the class or method is used.
We currently use a combination of Sphinx for the human-readable documentation and language-specific tools for the API reference, such as Doxygen or ASDoc.
I saw that Leap Motion were able to generate complete documentation for multiple programming languages (not human languages) with cross-links between the programming languages.
The Question
Does anyone know how to build such a documentation system, so that we won't have to duplicate each change to the human-readable docs for every language, and still have cross-links between the languages?
I put together the Leap Motion documentation. I use Sphinx to create the package of docs and Breathe, a Sphinx plug-in, to basically import XML files generated by Doxygen into the Sphinx project for the doxygenated API references (C++, C#, Java, and Objective-C). For links from the so-called "human-readable" pages to the API references, I generate RST substitutions from the .tag files which Doxygen will generate for you. Links from the API reference to the "human-readable" pages are normal, relative hyperlinks (which I should add more of).
I use the conditional content features of Sphinx to generate a separate set of docs (both "human-readable" and API) for each programming language. Thus these articles can be customized for each programming language where needed and have the correct code examples for the current language. Because each doc set has the same structure, it isn't hard to switch from one language to another.
I did add some custom JavaScript to the page templates to help switch between languages.
tl;dr: Sphinx, Breathe, Doxygen and a small amount of custom JavaScript.
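To make the tl;dr concrete, the Breathe side of this is a few lines in the Sphinx conf.py. A hypothetical sketch (the project name and paths are assumptions, not Leap Motion's actual configuration):

    # conf.py - wire Breathe into Sphinx so it can read Doxygen's XML output.
    extensions = ["breathe"]

    # Point Breathe at the XML that Doxygen emits (requires GENERATE_XML = YES
    # in the Doxyfile).
    breathe_projects = {"MySDK": "build/doxygen/xml"}
    breathe_default_project = "MySDK"

The RST sources then pull in symbols with directives like .. doxygenclass::, and the per-language doc sets are produced by building with tags (e.g. sphinx-build -t java ...) that select the matching .. only:: blocks.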
If you would like to discuss this further, you can post a question to our (Leap Motion) developer forum. I'll see it (Stackoverflow isn't the proper place for an ongoing discussion).
Hi Ido Ran,
The tools you've specified are among the best in the industry for documentation purposes. I'm afraid there is no tool yet that covers both the human-readable docs and the API reference. Of them all, my personal favorite is Doxygen, which is somewhat multi-use (human and API). Hope this helps.

SSML using Chrome TTS

I'm trying to give a little more clarity to TTS sentences by indicating emphasis, etc. I'm using the Chrome TTS API, which indicates that it accepts SSML-formatted documents in addition to raw text.
After many attempts, and after reading a few comments on the web, it doesn't look like this is actually supported, or possibly it is up to individual voices to implement.
Does anyone know:
Has SSML been abandoned under Chrome?
If not, is there any indication whether they expect to support it via native voice, or they're hoping that someone else will implement?
Do any Chrome voices currently exist that support this?
Thanks!
I'm a Chrome engineer. SSML support has not been implemented yet, but it's planned. Obviously not all engines would support it, but when we implement SSML support we'll also implement support for stripping SSML from engines that don't support it.
Sorry the documentation is misleading here.
Star this bug to express interest and get notified when it's fixed: https://code.google.com/p/chromium/issues/detail?id=88072
If anyone's looking at this later, you can control prosody on Mac Chrome using Apple's native command syntax, at least for the default voices:
    the square root of [[pbas +4]] 2 [[char LTRL]]a[[char NORM]] to the [[pbas +4]] 14 [[char LTRL]]x[[char NORM]]
Documented here.
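If you want to experiment with these embedded commands outside Chrome, one quick way is the macOS say utility, which drives the same system voices. A minimal sketch (assuming macOS):

    # Feed Apple's embedded speech commands to the macOS `say` utility.
    import subprocess

    utterance = "the square root of [[pbas +4]] 2 [[char LTRL]]a[[char NORM]]"
    subprocess.run(["say", utterance], check=True)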

Arabic Spell Checker API

I searched for a spell-checker API from Google, but Arabic is not supported there. Can anyone help me find an API that supports Arabic, or suggest any other good approach?
AFAIK none of Google's spell check APIs are documented/supported, and as such they could break at any time.
It is possible to spell-check using the Google Translate API:
http://translate.google.com/translate_a/t?client=t&text=كيف حاك&sc=1
Returns:
[[["كيف حاك","كيف حاك","",""]],,"ar",,,,,["كيف \u003cb\u003e\u003ci\u003eحالك\u003c/i\u003e\u003c/b\u003e","كيف حالك"],[],33]
Scroll to the right and you'll see the corrected results كيف حالك at the end.
The sc=1 parameter includes the spell check in the results. You'll have to figure out how to parse the JSON results to get the corrected spelling.
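A rough sketch of making that request from Python (the endpoint is unofficial and may break or be blocked at any time):

    import requests

    # Query the unofficial translate endpoint; sc=1 requests spell-check data.
    # The body is JavaScript-style rather than strict JSON (note the empty
    # slots like ",,"), so it needs custom parsing before use.
    resp = requests.get(
        "http://translate.google.com/translate_a/t",
        params={"client": "t", "text": "كيف حاك", "sc": "1"},
    )
    print(resp.text)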
You could try this one:
https://github.com/Arabic-Spell-Checker/spellchecker-api-client
It has a REST API and provides dictionary management functionalities.
Disclaimer: I am the developer of this.

Speech Recognition API

I need to automatically transcribe some short MP3s as part of a proof of concept I am working on. I am currently looking into cloud solutions or web API services to send the MP3 as a simple HTTP request and receive a transcription back.
The only free/open-source solution I have found is here, but the demos don't seem to work (at least not on the files I need to transcribe). I have found some enterprise solutions for call centers, but so far nothing I can simply integrate into a project.
Are there any web based speech recognition services available? One that is able to filter out small noise would be a plus.
Here is an unofficial method to access Google's ASR capability. I just tested it yesterday and it still works: you can get JSON-style ASR output with words and associated confidence scores from a FLAC audio file sampled at 16 kHz.
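As a rough sketch, such a request can be made from Python like this (the URL and parameters here are assumptions about that unofficial endpoint, which Google may change or shut down at any time):

    import requests

    # Hypothetical sketch of the unofficial Google ASR endpoint; treat the
    # URL and parameters as illustrative only.
    with open("utterance.flac", "rb") as f:
        audio = f.read()

    resp = requests.post(
        "http://www.google.com/speech-api/v1/recognize",
        params={"lang": "en-us", "client": "chromium"},
        headers={"Content-Type": "audio/x-flac; rate=16000"},
        data=audio,
    )
    print(resp.json())  # hypotheses with confidence scores, as described above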
You can also try the speech recognition engine of Windows 7 to produce subtitles. Here is the tool for that.
This may be a good match. Also, their techcrunch profile (See this) lists competitors as: SimulScribe, SpinVox, Vlingo, Nuance, Microsoft, Google
Some of these links may be helpful.
Vlingo, Bing and Google have recognizers in the cloud, but I don't think they make them publicly programmable. I believe they are accessible only from their authorized clients.
For a proof of concept (and low volume), have you considered just using the desktop speech engines that come with Windows 7? The question "What is the difference between System.Speech.Recognition and Microsoft.Speech.Recognition?" may be helpful. The MS desktop recognizers ship with a dictation grammar, and it sounds like that is what you will need.