Does Google's WaveNet support phonetic input (SSML phoneme elements)? - text-to-speech

I am working with a product that uses phonetic input to make TTS generate proper pronunciations for names. I don't see phoneme tags in Google's WaveNet TTS documentation https://cloud.google.com/text-to-speech/docs/ssml, but perhaps I'm missing it.
If any developers from Google are listening, can they share their plans to add phonetic input?
Thanks

Since these voices are based on neural networks trained "end to end" (text -> net -> sound), they probably never had a separate phoneme step like (text -> phoneme -> net -> sound).
This is to be expected, as phoneme selection is meant to be the job of the neural network itself, eliminating an unnecessary phase.
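If you want to test it empirically, the standard W3C SSML phoneme element looks like the snippet below. This is only a sketch, assuming the usual google-cloud-texttospeech Python client; the voice name and IPA string are just examples, and whether a WaveNet voice actually honors the element is exactly what is being asked here.

from google.cloud import texttospeech

client = texttospeech.TextToSpeechClient()

# SSML containing a standard <phoneme> element with an IPA pronunciation
ssml = (
    '<speak>Her name is '
    '<phoneme alphabet="ipa" ph="ˈkɛɹə">Cara</phoneme>.'
    '</speak>'
)

response = client.synthesize_speech(
    input=texttospeech.SynthesisInput(ssml=ssml),
    voice=texttospeech.VoiceSelectionParams(
        language_code="en-US", name="en-US-Wavenet-D"
    ),
    audio_config=texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3
    ),
)

with open("output.mp3", "wb") as out:
    out.write(response.audio_content)

If the element is not supported, the voice will presumably just read the enclosed text with its default pronunciation, which makes for an easy A/B check.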

Related

NER - Extract long entities - voice chatbot

I'm building a voice chatbot to handle some specific tasks (intents), e.g. translation.
The issue is that I have long entities:
input from user: "translate to German The Eminem Show 20th Anniversary launched earlier this year"
I need to extract following entities:
("German", "LanguageTo")
("The Eminem Show 20th Anniversary launched earlier this year", "text")
I tried using spaCy to train a custom NER, but it does badly on long entities (it doesn't catch the whole "text" entity). "CRF" and "DIETClassifier" within Rasa are better, but still not really good.
Do you think extracting the long "text" entity is not a NER task? I would be delighted to hear any recommendations!
NB: the text I get from the user (as it is a voice chatbot) has no punctuation or casing (the full text is lowercase) and could be much longer than the example I gave.
You're right that this isn't really an NER problem - while in the most general sense NER covers any selection of text from input, many NER models are designed for short proper nouns. A side effect of that is that they're sensitive to where the spans start and end, and have trouble representing long spans.
In the case of spaCy, the spancat component was designed to have less edge sensitivity, and should be a better fit for problems like the one you have. It's still kind of a difficult problem, but should do better than NER.
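For concreteness, here is a minimal sketch of how spancat training data could be annotated, assuming spaCy v3.x; the offsets and labels just mirror the example above.

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")

text = "translate to german the eminem show 20th anniversary launched earlier this year"
doc = nlp.make_doc(text)

# spancat reads spans from doc.spans (the default key is "sc")
doc.spans["sc"] = [
    doc.char_span(13, 19, label="LanguageTo"),   # "german"
    doc.char_span(20, len(text), label="text"),  # everything after the language
]

# Serialize, then build a config with a spancat component (see the spaCy docs) and run:
#   python -m spacy train config.cfg --paths.train train.spacy --paths.dev dev.spacy
DocBin(docs=[doc]).to_disk("train.spacy")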
Backing up a bit, you might want to consider whether you actually need to use a model to find things like the language to translate to - you could just use a list of languages, for example. You could also have an inflexible command structure if you have a small number of well-defined commands.
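As a toy illustration of that last point (the names and the fixed command shape here are hypothetical), the language could simply be matched against a list and everything after it treated as the text:

LANGUAGES = {"german", "french", "spanish", "arabic"}

def parse_command(utterance: str):
    tokens = utterance.lower().split()
    # expected shape: "translate to <language> <text ...>"
    if len(tokens) >= 4 and tokens[:2] == ["translate", "to"] and tokens[2] in LANGUAGES:
        return {"LanguageTo": tokens[2], "text": " ".join(tokens[3:])}
    return None

print(parse_command("translate to german the eminem show 20th anniversary launched earlier this year"))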
I would recommend you use Whisper from OpenAI. It automatically adds punctuation where appropriate, so you could likely do the entity/text separation from that. You could also use POS tagging from spaCy to detect parts of speech and extract the language.
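A minimal sketch with the open-source openai-whisper package (the model size and file name are just examples):

import whisper

model = whisper.load_model("base")               # small model; larger ones are more accurate
result = model.transcribe("user_utterance.wav")  # restores punctuation and casing
print(result["text"])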

What are some techniques to improve contextual accuracy of semantic search engine using BERT?

I am implementing a semantic search engine using BERT (with cosine distance). To a certain extent the method is able to match sentences on high-level context. However, when it comes to the narrower context of a sentence, it runs into several issues.
Example:
If my search term is "Wrong Product", the search engine might match with sentences like "I bought a washing machine but it was defective".
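For reference, the setup is roughly the following; this is a minimal sketch assuming the sentence-transformers package with a BERT-style encoder, and the model name and corpus are placeholders.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # any BERT-style sentence encoder

corpus = [
    "I bought a washing machine but it was defective",
    "The delivery arrived on time",
]
query = "Wrong Product"

corpus_emb = model.encode(corpus, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)

print(util.cos_sim(query_emb, corpus_emb))  # cosine similarity against each corpus sentence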
I understand that fixing these issues is always a matter of finding some equilibrium and living with minor errors.
However, if there are any rules or techniques that have improved the accuracy of your semantic search implementations, please do share.

IPA (International Phonetic Alphabet) Transcription with Tensorflow

I'm looking into designing a software platform that will aid linguists and anthropologists in their study of previously unstudied languages. Statistics show that around 1,000 languages exist that have never been studied by a person outside of their respective speaker groups.
My goal is to utilize TensorFlow to make a platform that will allow linguists to study and document these languages more efficiently, and to help them create writing systems for the ones that don't have one already. One of their current methods of accomplishing such a task is three-fold: 1) record a native speaker conversing in the language, 2) listen to that recording and try to transcribe it into the IPA, 3) from the phonetics, analyze the phonemics and phonotactics of the language to eventually create a writing system for the speakers.
My proposed platform would cut that research time down from a minimum of a year to a maximum of six months. Before I start, I have some questions...
What would be required to train TensorFlow to transcribe live audio into the IPA? Has this already been done? And if so, how would I make use of a previous solution for this project? Is a project like this even possible with TensorFlow? If not, what would you recommend using instead?
My apologies for the magnitude of this question. I don't have much experience in the realm of machine learning, as I am just beginning the research process for this project. Any help is appreciated!
I guess I will take a first shot at answering this. Since the question is pretty general, my answer will have to be pretty general as well.
What would be required? At the very least you would need a large dataset of pre-transcribed data: ideally a large amount of spoken-language audio mapped to characters of the phonetic alphabet, so the system could learn the sound of individual characters rather than whole transcribed words. If such a dataset doesn't exist, a less granular dataset could be used, mapping single words to their transcriptions. Then you would need a model, that is, the actual neural network architecture implemented in code. And lastly you would need some computing resources. This is not something you can train casually; you would either have to buy time on a cloud-based machine learning platform (like Google Cloud ML) or build a fairly expensive machine to train on at home.
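To make the "model" part concrete, here is a rough sketch of the kind of architecture commonly used for this sort of sequence transcription (a CTC-style recurrent network over audio features). All sizes and the symbol inventory are placeholders, not a tuned recipe.

import tensorflow as tf

NUM_IPA_SYMBOLS = 100   # placeholder size of the target symbol inventory
NUM_MEL_BINS = 80       # placeholder number of spectrogram features per frame

inputs = tf.keras.Input(shape=(None, NUM_MEL_BINS))  # variable-length (time, features)
x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(256, return_sequences=True))(inputs)
x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(256, return_sequences=True))(x)
logits = tf.keras.layers.Dense(NUM_IPA_SYMBOLS + 1)(x)  # +1 for the CTC blank label

model = tf.keras.Model(inputs, logits)
# Training would pair these per-frame logits with transcribed symbol sequences via tf.nn.ctc_loss.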
Has this been done? I don't know, but I don't think so. There have been published papers reporting various degrees of success at training systems to transcribe speech. Here is one, for example: http://deeplearning.stanford.edu/lexfree/lexfree.pdf. Since the alphabet you want to transcribe into is specifically designed to capture the way words sound, rather than just write the words down, you might have more success training such a model.
Is it possible with TensorFlow? Yes, most likely. TensorFlow is well suited for implementing most modern deep learning architectures. Unless you end up designing some really weird and very original model for this purpose, TensorFlow should work just fine.
Edit: after some thought, for part 1 you would have to use a dataset mapping spoken words to their transcriptions, since I expect that the same sound pronounced in isolation would differ from that sound as it occurs inside a word.
This has actually been done, albeit in PyTorch, by a group at CMU: https://github.com/xinjli/allosaurus
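Assuming the interface shown in that repository's README (worth double-checking against the current version), usage is roughly:

# pip install allosaurus
from allosaurus.app import read_recognizer

model = read_recognizer()             # loads the pretrained universal phone recognizer
print(model.recognize("sample.wav"))  # prints an IPA-style phone sequence for the audio file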

Does syntaxnet support spell checking

I have collected lots of words for a specific domain and can train a custom model based on them.
In my use case there are lots of hard-copy tables/sheets related to the domain, which I OCR into structured documents. The OCR is not perfect, though, and it regularly misrecognizes some words. Can I use SyntaxNet to do spell checking to clean up the OCRed results?

Non-Speech Noise or Sound Recognition Software?

I'm working on some software for children, and looking to add the ability for the software to respond to a number of non-speech sounds. For instance, clapping, barking, whistling, fart noises, etc.
I've used CMU Sphinx and the Windows Speech API in the past; however, as far as I can tell, neither of these has any support for non-speech noises, and in fact I believe they actively filter them out.
In general I'm looking for "How do I get this functionality" but I suspect it may help if I break it down into three questions that are my guesses for what to search for next:
1) Is there a way to use one of the main speech recognition engines to recognize non-word sounds by changing an acoustic model or pronunciation lexicon?
2) Or is there already an existing library to do non-word noise recognition?
3) I have a bit of familiarity with Hidden Markov Models and the underlying tech of voice recognition from college, but no good estimate of how difficult it would be to create a very small noise/sound recognizer from scratch (suppose <20 noises to be recognized). If 1) and 2) fail, any estimate of how long it would take to roll my own?
Thanks
Yes, you can use speech recognition software like CMU Sphinx to recognize non-speech sounds. For this, you need to create your own acoustic and language models and define a lexicon restricted to your task. But to train the corresponding acoustic model, you must have enough training data with the sounds of interest annotated.
In short, the sequence of steps is the following:
First, prepare resources for training: lexicon, dictionary, etc. The process is described here: http://cmusphinx.sourceforge.net/wiki/tutorialam. In your case, however, you need to redefine the phoneme set and the lexicon. Namely, you should model fillers as real words (so, no ++ around them), and you don't need to define the full phoneme set. There are many possibilities, but probably the simplest one is to have a single model for all speech phonemes. Thus, your lexicon will look like:
CLAP CLAP
BARK BARK
WHISTLE WHISTLE
FART FART
SPEECH SPEECH
Second, prepare training data with labels: something similar to VoxForge, but the text annotations must contain only labels from your lexicon. Of course, the non-speech sounds must be labeled correctly as well. A good question here is where to get a large enough amount of such data, but I guess it should be possible.
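For example, the transcription file for acoustic model training would contain lines roughly like the following, with one utterance per line and the utterance id in parentheses (check the sphinxtrain tutorial linked above for the exact format; the file names here are made up):
<s> CLAP CLAP SPEECH </s> (recording_0001)
<s> BARK </s> (recording_0002)
<s> WHISTLE SPEECH WHISTLE </s> (recording_0003)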
Having that, you can train your model. The task is simpler than speech recognition; for instance, you don't need to use triphones, just monophones.
Assuming equal prior probability of any sound/speech, the simplest language model can be a loop-like grammar (http://cmusphinx.sourceforge.net/wiki/tutoriallm):
#JSGF V1.0;
/**
 * JSGF grammar for the sound/speech labels defined above
*/
grammar foo;
public <foo> = (CLAP | BARK | WHISTLE | FART | SPEECH)+ ;
This is the very basic approach to using an ASR toolkit for your task. It can be further improved by fine-tuning the HMM configurations, using statistical language models, and using fine-grained phoneme modeling (e.g. distinguishing vowels and consonants instead of having a single SPEECH model; it depends on the nature of your training data).
Outside the framework of speech recognition, you can build a simple static classifier that will analyze the input data frame by frame. Convolutional neural networks that operate over spectrograms perform quite well for this task.
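As a rough sketch of that alternative (shapes, class names, and hyperparameters are placeholders, not a tuned recipe), a small Keras CNN over mel-spectrogram windows might look like:

import tensorflow as tf

NUM_CLASSES = 5              # e.g. clap, bark, whistle, fart, speech
INPUT_SHAPE = (128, 64, 1)   # (mel bins, frames, channels) for one analysis window

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=INPUT_SHAPE),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])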
I don't know of any existing libraries you can use; I suspect you may have to roll your own.
Would this paper be of interest? It has some technical detail, and they seem to be able to recognise claps and differentiate them from whistles.
http://www.cs.bham.ac.uk/internal/courses/robotics/halloffame/2001/team14/sound.htm