Using different intonations with Watson text to speech - text-to-speech

I am developing a PoC using Watson text to speech and Watson conversation.
Sometimes, the chatbot needs to ask a question, so I'd like text to speech to synthesize the voice using an interrogative intonation.
Is this possible?

Watson Text to Speech supports SSML, and has expressive SSML tags.
The one you want is Uncertainty, which is defined as "conveys an uncertain, interrogative message".
Example:
<express-as type="Uncertainty">
Could she still be in the office? She told me that she might leave early.
</express-as>
More details on its usage are here:
https://console.bluemix.net/docs/services/text-to-speech/SSML-expressive.html#the-express-as-element
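If the chatbot's replies are generated at runtime, the wrapping can be done in code before the text is sent to the service. A minimal sketch in Python, assuming you only need to build the SSML string; the helper name `as_question` is made up for illustration and is not part of any Watson SDK:

```python
from xml.sax.saxutils import escape


def as_question(text: str) -> str:
    """Wrap text in the express-as element shown above.

    escape() protects characters such as & and < that would
    otherwise break the SSML markup.
    """
    return f'<express-as type="Uncertainty">{escape(text)}</express-as>'


if __name__ == "__main__":
    # The resulting string is what you would pass as the text to synthesize.
    print(as_question("Could she still be in the office?"))
```

Check the linked documentation for which voices accept expressive SSML before relying on it.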

Yes, you can certainly use text-to-speech (TTS) for output and speech-to-text (STT) for input. You would need to use a middleware or app layer to drive the conversation and route the input/output to the other services (see "how to use" in the docs).
I have used the following TJBot recipe as a simple and good starter for some projects: https://github.com/damiancummins/tell_the_time

Unfortunately, concatenative TTS may have trouble producing the correct intonation for questions. If you think it happens consistently or too often, please open a bug.
If a specific question gets incorrect intonation, try rephrasing it a little if possible. A useful trick for this voice could be to use a double question mark '??'.
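The double question mark trick can be applied automatically to the chatbot's output. A small sketch, assuming the reply is a plain string; `emphasize_question` is a hypothetical helper name, and whether the extra mark actually improves intonation depends on the voice:

```python
def emphasize_question(text: str) -> str:
    """Turn a single trailing '?' into '??'; leave other text unchanged."""
    stripped = text.rstrip()
    if stripped.endswith("?") and not stripped.endswith("??"):
        return stripped + "?"
    return text


if __name__ == "__main__":
    print(emphasize_question("Could she still be in the office?"))
```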

Related

Using Microsoft attributes with Google TTS

In my application, I am already using Google TTS but I am amazed by Microsoft TTS because they are providing a lot more useful attributes than Google. Since I am more familiar with Google, I would like to keep my implementation but would still like to be able to use MS attributes like:
<mstts:express-as style="cheerful">
That'd be just amazing!
</mstts:express-as>
Is that possible?
There are no style attributes in Google Text-to-Speech, but you can change the Standard voice to a WaveNet voice[1].
The WaveNet voice synthesizes speech with more human-like emphasis and inflection on syllables, phonemes, and words. You can see all the supported voices in Google Text-to-Speech[2].
[1]https://cloud.google.com/text-to-speech/docs/wavenet#wavenet_voices
[2]https://cloud.google.com/text-to-speech/docs/voices
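Switching to a WaveNet voice is a matter of changing the voice name in the request. A rough sketch of the JSON body for the REST `text:synthesize` method, built in Python; the voice name `en-US-Wavenet-D` and the MP3 encoding are assumptions picked for illustration, so check the voices list above for the names available to you:

```python
import json


def build_synthesize_request(text: str,
                             voice_name: str = "en-US-Wavenet-D",
                             language_code: str = "en-US") -> dict:
    """Build the request body; sending it (with credentials) is left out."""
    return {
        "input": {"text": text},
        "voice": {"languageCode": language_code, "name": voice_name},
        "audioConfig": {"audioEncoding": "MP3"},
    }


if __name__ == "__main__":
    body = build_synthesize_request("That'd be just amazing!")
    print(json.dumps(body, indent=2))
```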

SSML using Chrome TTS

I'm trying to give a little more clarity to TTS sentences by indicating emphasis, etc. I'm using the Chrome TTS API, which indicates that it accepts SSML-formatted documents in addition to raw text.
After many attempts, and reading a few comments on the web, it doesn't look like this is actually supported, or possibly that this is up to individual voices for implementation.
Does anyone know:
Has SSML been abandoned under Chrome?
If not, is there any indication whether they expect to support it via native voices, or are they hoping that someone else will implement it?
Do any Chrome voices currently exist that support this?
Thanks!
I'm a Chrome engineer. SSML support has not been implemented yet, but it's planned. Obviously not all engines would support it, but when we implement SSML support we'll also implement support for stripping SSML from engines that don't support it.
Sorry the documentation is misleading here.
Star this bug to express interest and get notified when it's fixed: https://code.google.com/p/chromium/issues/detail?id=88072
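Until that lands, an application can do the stripping itself before handing text to an engine that does not understand SSML. A minimal sketch in Python of that idea, using only the standard library; this illustrates the technique and is not how Chrome implements it, and `strip_ssml` is a made-up name:

```python
import xml.etree.ElementTree as ET


def strip_ssml(ssml: str) -> str:
    """Return only the spoken text of an SSML document.

    Input that is not well-formed XML is assumed to be plain text
    and is returned unchanged.
    """
    try:
        root = ET.fromstring(ssml)
    except ET.ParseError:
        return ssml
    # itertext() yields the text of every element in document order.
    text = "".join(root.itertext())
    # Collapse the whitespace left behind by removed markup.
    return " ".join(text.split())


if __name__ == "__main__":
    print(strip_ssml('<speak>Hello <emphasis level="strong">world</emphasis>!</speak>'))
```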
If anyone's looking at this later, you can control prosody on Mac Chrome using Apple's native command syntax, at least for the default voices:
the square root of [[pbas +4]] 2 [[char LTRL]]a[[char NORM]] to the [[pbas +4]] 14 [[char LTRL]]x[[char NORM]]
Documented here.

Convert festival tts to flite tts

I currently have a TTS which is built using Festival and Festvox. I need to convert these voices and build a TTS in Flite. Apparently you can do the conversion using Festvox (the Festvox and Flite websites say so, but give no proper steps on how to do it). Can someone please help me out with it, as I am new to this area?
Thanks in advance.
Just in case anyone else was wondering the same, I found the steps mentioned in this document useful. Also, subscribe to the mailing lists and feel free to ask questions.
I must mention, though, that I never implemented the TTS using "flite"; I went ahead with "espeak".

Speech Recognition API

I need to automatically transcribe some short MP3s as part of a proof of concept I am working on. I am currently looking into cloud solutions or web API services to send the MP3 as a simple HTTP request and receive a transcription back.
The only free/open source solution I have found is here, but the demos don't seem to work (at least not on the files I need to transcribe). I have found some enterprise solutions for call centers, but so far nothing I can simply integrate into a project.
Are there any web based speech recognition services available? One that is able to filter out small noise would be a plus.
Here is an unofficial method to access Google's ASR capability. I tested it yesterday and it still works: you can get JSON-style ASR output with words and associated confidence scores from FLAC audio sampled at 16 kHz.
You can also try the speech recognition engine of Windows 7 to produce subtitles. Here is the tool for that.
This may be a good match. Also, their techcrunch profile (See this) lists competitors as: SimulScribe, SpinVox, Vlingo, Nuance, Microsoft, Google
Some of these links may be helpful.
Vlingo, Bing and Google have recognizers in the cloud, but I don't think they make them publicly programmable. I believe they are accessible only from their authorized clients.
For a proof of concept (and low volume), have you considered just using the desktop speech engines that come in Windows 7? What is the difference between System.Speech.Recognition and Microsoft.Speech.Recognition? may be helpful. The MS desktop recognizers ship with a dictation grammar and it sounds like that is what you will need.

VBA speech recognition /audio input / voice command

Speech recognition may be too grand a term for this problem.
I want my VBA program to wait for the user to say something like "next" or "continue" before it carries on processing.
This is the equivalent of the traditional "Press any key to continue" loop.
This should be fairly simple. All the examples I have found do complicated things like defining lexica and registering callback functions for recognition events. All very nice, but not necessary in my case.
Maybe I can/should use some other (audio) library instead of Speechlib (Microsoft Speech Object Library)
Thanks for any advice.
I think this is a bad idea.
So what would happen if I stepped away with the TV or radio on and someone said "Next"?
That would be pretty funny, I think... or not.
There's no drop-dead simple way to do speech recognition. You have to define a grammar (so the SR engine knows what to listen for) and a recognition handler (so the SR engine can tell you when it's heard something).
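The two pieces fit together roughly as follows. This is a language-neutral sketch in Python rather than VBA, and it does not use SpeechLib or any real recognizer: the recognition results are simulated as (phrase, confidence) pairs, and every name in it is hypothetical. The confidence threshold is one way to reduce false triggers from background speech.

```python
from typing import FrozenSet, Iterable, Optional, Tuple

# The "grammar": the only phrases the program is willing to act on.
GRAMMAR: FrozenSet[str] = frozenset({"next", "continue"})


def wait_for_command(results: Iterable[Tuple[str, float]],
                     grammar: FrozenSet[str] = GRAMMAR,
                     min_confidence: float = 0.8) -> Optional[str]:
    """Act as the recognition handler: consume results until one matches.

    Returns the matched phrase, or None if the stream ends first.
    """
    for phrase, confidence in results:
        word = phrase.strip().lower()
        if word in grammar and confidence >= min_confidence:
            return word
    return None


if __name__ == "__main__":
    # Simulated recognizer output: unrelated speech, a low-confidence hit,
    # then a confident command.
    simulated = [("hello", 0.9), ("Next", 0.5), ("continue", 0.95)]
    print(wait_for_command(simulated))
```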