Google Cloud Text-to-Speech word timestamps

I'm generating speech through Google Cloud's text-to-speech API and I'd like to highlight words as they are spoken.
Is there a way of getting timestamps for spoken words or sentences?

You can do this using SSML and the v1beta1 version of Google Cloud's Text-to-Speech API: https://cloud.google.com/text-to-speech/docs/reference/rest/v1beta1/text/synthesize#TimepointType
Add <mark> SSML tags at the points in the text that you want timestamps for (for example, at the end of each sentence).
Set TimepointType to SSML_MARK. If this field is not set, timepoints are not returned by default.
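For example, the SSML input might look like this (the mark names are arbitrary; the API reports the time at which each named mark is reached):
<speak>
  <mark name="sent1"/>Hello, how are you?
  <mark name="sent2"/>I am fine, thank you.
</speak>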

Google's text-to-speech API supports this in the v1beta1 release, at the time of writing.
In Python (as an example) you will need to change the import from:
from google.cloud import texttospeech as tts
to:
from google.cloud import texttospeech_v1beta1 as tts
You must use SSML, not plain text, and include <mark> tags in the XML.
The synthesis request needs the enable_time_pointing flag to be set. In Python this looks like:
response = client.synthesize_speech(
    request=tts.SynthesizeSpeechRequest(
        ...
        enable_time_pointing=[
            tts.SynthesizeSpeechRequest.TimepointType.SSML_MARK
        ]
    )
)
For a runnable example, see my answer on this question.
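In case it is useful, here is a rough sketch of what a complete request could look like (the voice, audio encoding and SSML content are placeholder choices, and credentials are assumed to be configured in the environment):

from google.cloud import texttospeech_v1beta1 as tts

client = tts.TextToSpeechClient()

ssml = """
<speak>
  <mark name="sent1"/>Hello, how are you?
  <mark name="sent2"/>I am fine, thank you.
</speak>
"""

response = client.synthesize_speech(
    request=tts.SynthesizeSpeechRequest(
        input=tts.SynthesisInput(ssml=ssml),
        voice=tts.VoiceSelectionParams(language_code="en-US"),
        audio_config=tts.AudioConfig(audio_encoding=tts.AudioEncoding.MP3),
        enable_time_pointing=[
            tts.SynthesizeSpeechRequest.TimepointType.SSML_MARK
        ],
    )
)

# Each timepoint gives the name of a <mark> and the time (in seconds)
# at which it is reached in the generated audio.
for timepoint in response.timepoints:
    print(timepoint.mark_name, timepoint.time_seconds)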

This question seems to have gotten quite popular so I thought I'd share what I ended up doing. This method will probably only work with English or similar languages.
I first split the text on any punctuation that causes a break in speaking. Each "sentence" is converted to speech separately. The resulting audio files have a seemingly random amount of silence at the end, which needs to be removed before joining them; this can be done with the FFmpeg silencedetect filter. You can then join the audio files with an appropriate gap. Approximate word timestamps can be linearly interpolated within each sentence.
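A rough sketch of those two pieces, assuming FFmpeg is installed (the silence threshold, minimum silence duration and the character-length weighting are all values you would tune):

import re
import subprocess

def trailing_silence_start(path, noise="-40dB", min_dur=0.1):
    """Run FFmpeg's silencedetect filter and return the start time of the
    last detected silence, i.e. roughly where the useful audio ends."""
    cmd = ["ffmpeg", "-i", path, "-af",
           f"silencedetect=noise={noise}:d={min_dur}", "-f", "null", "-"]
    # silencedetect reports its findings as log lines on stderr
    log = subprocess.run(cmd, capture_output=True, text=True).stderr
    starts = re.findall(r"silence_start: ([0-9.]+)", log)
    return float(starts[-1]) if starts else None

def interpolate_word_times(words, sentence_start, sentence_duration):
    """Spread approximate word timestamps linearly across a sentence,
    weighting each word by its character length."""
    total_chars = sum(len(w) for w in words) or 1
    timestamps, t = [], sentence_start
    for w in words:
        timestamps.append((w, t))
        t += sentence_duration * len(w) / total_chars
    return timestamps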

Related

NER - Extract long entities - voice chatbot

I'm building a voice chatbot to do some specific tasks (intents), e.g. translation.
The issue is that I'm dealing with long entities:
input from user: "translate to German The Eminem Show 20th Anniversary launched earlier this year"
I need to extract the following entities:
("German", "LanguageTo")
("The Eminem Show 20th Anniversary launched earlier this year", "text")
I tried using spaCy to train a custom NER model, but it does badly on long entities (it doesn't catch the whole "text" entity).
"CRF" and "DIETClassifier" within Rasa are better, but still not really good.
Do you think extracting the long "text" entity is not an NER task? I would be delighted to hear any recommendations!
NB: the text I'm getting from the user (as it is a voice chatbot) has no punctuation or casing (the full text is lowercase) and could be much longer than the example I gave.
You're right that this isn't really an NER problem - while in the most general sense NER covers any selection of text from input, many NER models are designed for short proper nouns. A side effect of that is that they're sensitive to where the spans start and end, and have trouble representing long spans.
In the case of spaCy, the spancat component was designed to have less edge sensitivity, and should be a better fit for problems like the one you have. It's still kind of a difficult problem, but should do better than NER.
Backing up a bit, you might want to consider whether you actually need to use a model to find things like the language to translate to - you could just use a list of languages, for example. You could also have an inflexible command structure if you have a small number of well-defined commands.
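For the simplest version of that last idea, something like this may already be enough (the language list and the fixed "translate to <language> <text>" pattern are assumptions about your commands):

# Minimal rule-based parsing for a fixed "translate to <language> <text>" command;
# no model involved, just a known list of languages.
LANGUAGES = {"german", "french", "spanish", "italian"}  # extend as needed

def parse_translate_command(utterance: str):
    tokens = utterance.lower().split()
    if tokens[:2] == ["translate", "to"] and len(tokens) > 2 and tokens[2] in LANGUAGES:
        return {"LanguageTo": tokens[2], "text": " ".join(tokens[3:])}
    return None

parse_translate_command(
    "translate to german the eminem show 20th anniversary launched earlier this year"
)
# -> {'LanguageTo': 'german', 'text': 'the eminem show 20th anniversary launched earlier this year'}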
I would recommend you use Whisper from OpenAI. It automatically adds punctuation where appropriate, so you could likely do the entity/text separation afterwards. You could also use POS tagging from spaCy to detect the parts of speech and extract the language.
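A minimal sketch of the Whisper suggestion (the model size and file name are placeholders):

import whisper  # pip install openai-whisper

model = whisper.load_model("base")
result = model.transcribe("user_utterance.wav")
print(result["text"])  # transcript with punctuation and casing restored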

Is there a way to make Google Text-to-Speech speak text for a desired duration?

I went through the documentation of Google Text to Speech SSML.
https://developers.google.com/assistant/actions/reference/ssml#prosody
So there is a tag called <prosody> which, as per the W3 specification referenced in that documentation, can accept an attribute called duration: a value in seconds or milliseconds for the desired time it should take to read the contained text.
So <speak><prosody duration='6s'>Hello, How are you?</prosody></speak> should take 6 seconds for Google Text-to-Speech to speak. But when I try it here https://cloud.google.com/text-to-speech/ , it's not working, and I also tried it via the REST API.
Does Google Text-to-Speech not take the duration attribute into account? If it doesn't, is there a way to achieve the same thing?
There are two ways I know of to solve this:
First option: call Google's API twice: use the first call to measure the duration of the spoken audio, and the second call to adjust the rate parameter accordingly (see the sketch after this list).
Pros: Better audio quality? (this is subjective and depends on taste as well as the application's requirements)
Cons: Doubles the cost and processing time.
Second option: post-process the audio using a specialized library such as FFmpeg.
Pros: Cost effective and can be fast if implemented correctly.
Cons: Some knowledge of the concepts and the usage of an audio post-processing library is required (no need to become an expert though).
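Here is a rough sketch of the first option, requesting LINEAR16/WAV audio so the duration can be measured with the standard library (the text, voice and target duration are placeholders):

import io
import wave
from google.cloud import texttospeech as tts

client = tts.TextToSpeechClient()
voice = tts.VoiceSelectionParams(language_code="en-US")
synthesis_input = tts.SynthesisInput(text="Hello, how are you?")
target_seconds = 6.0  # the duration you want the speech to fill

def synthesize(speaking_rate=1.0):
    audio = client.synthesize_speech(
        input=synthesis_input,
        voice=voice,
        audio_config=tts.AudioConfig(
            audio_encoding=tts.AudioEncoding.LINEAR16,  # WAV output, easy to measure
            speaking_rate=speaking_rate,
        ),
    ).audio_content
    with wave.open(io.BytesIO(audio)) as wav:
        return audio, wav.getnframes() / wav.getframerate()

# First call: measure the natural duration at the default rate.
_, natural_seconds = synthesize()
# Second call: scale the speaking rate so the speech roughly fills the target.
audio, _ = synthesize(speaking_rate=natural_seconds / target_seconds)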
As Mr Lister already mentioned, the documentation clearly says:
<prosody>
Used to customize the pitch, speaking rate, and volume of text
contained by the element. Currently the rate, pitch, and volume
attributes are supported.
The rate and volume attributes can be set according to the W3
specifications.
You can test this using the UI.
In particular you can use things like
rate="low"
or
rate="80%"
to adjust the speed. However that is as far as you can go with Google TTS.
AWS Polly does support what you need, but only on Standard voices (not Neural).
Here is the documentation.
Setting a Maximum Duration for Synthesized Speech
Polly also has a UI to do a quick test.
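Per that documentation, the SSML looks roughly like this (standard voices only; note that the attribute lives in the amazon namespace):
<speak>
  <prosody amazon:max-duration="6s">Hello, how are you?</prosody>
</speak>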

Tacotron2 TTS viseme generation

I am currently working on a project that uses Tacotron2 TTS to produce a human-like voice for a robot. I would also like to get the visemes from the TTS, so I can synchronize the robot's face animation with the voice. How can I get the visemes, and the duration of each one, with Tacotron2?
Thanks
Can you get the phonemes out? You can refer to these phoneme-viseme tables to do the conversion. You could try using espeak to do the text-to-phoneme conversion. If you don't mind just a rough sync, you could compare the duration of the espeak output to your Tacotron2 output.
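A minimal sketch of the espeak route, invoked from Python here (the exact phoneme mnemonics you get back depend on the espeak version and voice):

import subprocess

# -q: suppress audio output, -x: print espeak's phoneme mnemonics for the text
phonemes = subprocess.run(
    ["espeak", "-q", "-x", "Hello, how are you?"],
    capture_output=True, text=True,
).stdout
print(phonemes)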

What are the inputs to a WaveNet?

I am trying to implement TTS. I have just read about WaveNet, but I am confused about local conditioning. The original paper here explains adding a time series for local conditioning, and this article explains that adding mel spectrogram features for local conditioning is fine. We know that WaveNet is a generative model that takes raw audio as input and generates high-quality audio output when conditioned.
My question is whether the mel spectrogram features mentioned are extracted from the raw audio passed as the input, or from some other audio.
Secondly, for implementing TTS, will the audio input be generated by some other TTS system whose output quality is then improved by WaveNet? Am I correct to think of it this way?
Please help, it is direly needed.
Thanks
Mel features are created from the text by the actual TTS module (Tacotron2, for example); then you run the vocoder module (WaveNet) to create speech.
It is better to try an existing implementation like NVIDIA/tacotron2 + NVIDIA/waveglow. WaveGlow is better than WaveNet, by the way, and much faster; WaveNet is very slow.
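If it helps, usage is roughly like NVIDIA's published torch.hub example (the entry-point names and keyword arguments are recalled from that example and may have changed; a CUDA-capable GPU is assumed):

import torch

# Pretrained Tacotron2 (text -> mel spectrogram) and WaveGlow (mel -> waveform)
tacotron2 = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub',
                           'nvidia_tacotron2', model_math='fp32')
waveglow = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub',
                          'nvidia_waveglow', model_math='fp32')
tacotron2 = tacotron2.to('cuda').eval()
waveglow = waveglow.remove_weightnorm(waveglow).to('cuda').eval()

utils = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_tts_utils')
sequences, lengths = utils.prepare_input_sequence(["Hello, how are you?"])

with torch.no_grad():
    mel, _, _ = tacotron2.infer(sequences, lengths)  # mel features come from the text
    audio = waveglow.infer(mel)                      # the vocoder turns mel into audio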

How can I parse a captcha image when the data on it changes?

How can I parse a captcha image or get the data from it? The data is part of the image and changes with every reload. How do I get the data shown in the image? Can I do anything with the data URL of the image?
The following is an example captcha:
http://enquiry.indianrail.gov.in/ntes/CaptchaServlet?action=getNewCaptchaImg&t=1400870602238
Using OCR (Optical Character Recognition) is the first step. Below are two examples of tools/APIs that can help you with that.
Try Tesseract.
Tesseract is probably the most accurate open source OCR engine
available. Combined with the Leptonica Image Processing Library it can
read a wide variety of image formats and convert them to text in over
60 languages.
for more info check: https://code.google.com/p/tesseract-ocr/
You can also try OCRopus
OCRopus is an OCR system written in Python, NumPy, and SciPy focusing
on the use of large scale machine learning for addressing problems in
document analysis.
for more info check: https://code.google.com/p/ocropus/
For detailed info with a code sample on how to do this, check Ben Boyter's article Decoding CAPTCHA's at: http://www.boyter.org/decoding-captchas/
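As a quick starting point with Tesseract via the pytesseract wrapper (the file name is a placeholder, and real captchas usually need more preprocessing than the simple thresholding shown here):

import pytesseract
from PIL import Image

# Convert to grayscale and apply a crude threshold before handing the image to Tesseract.
img = Image.open("captcha.png").convert("L")
img = img.point(lambda px: 255 if px > 128 else 0)
print(pytesseract.image_to_string(img).strip())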