voice recognition splitting words to phoneme level - voice-recognition

I am thinking of developing voice recognition software for my native language and I am thinking of using CMUSphinx-4 for that. there is a CMU dictionary file which contains English words that maps with split of the original word to its phoneme boundaries. for example,
ABANDONED => [ 'AH', 'B', 'AE', 'N', 'D', 'AH', 'N', 'D' ]
I cannot understand the logic behind this and I want to develop an algorithm for this conversation of words. If anyone can knows an algorithm for this conversion or how this splitting is happening please share it with me.

original word to its phoneme boundaries
"Boundaries" is a wrong word here. It maps word to a phoneme sequence, nothing about boundaries
If anyone can knows an algorithm for this conversion or how this splitting is happening please share it with me.
Dictionary construction is covered in our tutorial
http://cmusphinx.sourceforge.net/wiki/tutorialdict

Related

NER - Extract long entities - voice chatbot

Building a voice Chatbot to do some specific tasks (intents), e.g translation,
Issue is I m having long entities:
input from user: "translate to German The Eminem Show 20th Anniversary launched earlier this year"
I need to extract following entities:
("German", "LanguageTo")
("The Eminem Show 20th Anniversary launched earlier this year", "text")
I tried using Spacy to train custom ner, but it is doing bad on long entities (not catching the whole "text" entity),
"CRF" and "DIETClassifier" within Rasa are better, but not really good,
Do you think extracting the long "text" entity is not a NER task? Any recommendations I would be delighted!
NB: text I m getting from the user (as it is a voice chatbot) has no punctuation nor casing (full text is lowercase) and could be much longer than the example I gave
You're right that this isn't really an NER problem - while in the most general sense NER covers any selection of text from input, many NER models are designed for short proper nouns. A side effect of that is that they're sensitive to where the spans start and end, and have trouble representing long spans.
In the case of spaCy, the spancat component was designed to have less edge sensitivity, and should be a better fit for problems like the one you have. It's still kind of a difficult problem, but should do better than NER.
Backing up a bit, you might want to consider whether you actually need to use a model to find things like the language to translate to - you could just use a list of languages, for example. You could also have an inflexible command structure if you have a small number of well-defined commands.
I would recommend you use whisper from openAi. It adds automatically punctuation when fit and thus you could likely do the entity/text separation. You could also use POS tagging from spacy to detect parts of your speech and extract language.

How to successfully convert math papers to plain text

Goals:
1.Develop a canonical method to use plain text to uniquely represent STEM papers in general and math papers in particular.
Develop softwares that can convert existing typed STEM papers into that canonical form with 100% accuracy. Note that I can't tolerate any inaccuracy simply because as a single individual I can't proofread millions of papers to correct inaccuracy in conversion even at the rate of 0.001 errors per paper on average.
Problems:
All PDF to text, TeX to text etc programs I have seen here on Stackoverflow and elsewhere such as PyMuPDF do not really work due to math symbols that can not be processed.
2.PDF is really hard to process.
3.TeX is really hard to process because of numerous macros STEM paper authors tend to add to their source files which tend to break LatexML and other converters. It is very easy to process my own papers because I don't use a lot of new commands. However there are many authors whose papers contain \def macros which can not even be processed by de-macro. To actually get TeX to work, assuming that I can even get source files of most papers on arXiv at all, I will pretty much have to actually write my own variant of TeX engine that somehow expand all required macros and produce a plain text document.
Is there any other way to solve this problem? Currently the target format I prefer is pretty much just plain text + math symbols written in LaTeX without formatting other than those that are semantically significant such as \mathcal{A} and A being separate entities. I can learn to set up a neural network to train it to understand these printed math symbols assuming that my laptop is sufficiently powerful. There are literally just less than 200 symbols for the network to learn and their shapes should be very easy to recognize due to lack of variation. Shall I do that?
Yes you can try that. Recognition of symbols, with subsequent transformation of them into LaTeX format(for example, for every square root to write \sqrt).
You can further refer to the issue of recognition to this paper:
https://www.sciencedirect.com/science/article/abs/pii/003132039090113Y -
Recognition of handwritten symbols
Torfinn Taxt,Jórunn B.Ólafsdóttir,MortenDæhlen∥
http://neuralnetworksanddeeplearning.com/chap1.html - here you can find out more, with code samples, on implementation of neural network to handwritten manuscripts.

IPA (International Phonetic Alphabet) Transcription with Tensorflow

I'm looking into designing a software platform that will aid linguists and anthropologists in their study of previously unstudied languages. Statistics show that around 1,000 languages exist that have never been studied by a person outside of their respective speaker groups.
My goal is to utilize TensorFlow to make a platform that will allow linguists to study and document these languages more efficiently, and to help them create written systems for the ones that don't have a written system already. One of their current methods of accomplishing such a task is three-fold: 1) Record a native speaker conversing in the language, 2) Listening to that recording and trying to transcribe it into the IPA, 3) From the phonetics, analyzing the phonemics and phonotactics of the language to eventually create a written system for the speaker.
My proposed platform would cut that research time down from a minimum of a year to a maximum of six months. Before I start, I have some questions...
What would be required to train TensorFlow to transcribe live audio into the IPA? Has this already been done? and if so, how would I utilize a previous solution for this project? Is a project like this even possible with TensorFlow? if not, what would you recommend using instead?
My apologies for the magnitude of this question. I don't have much experience in the realm of machine learning, as I am just beginning the research process for this project. Any help is appreciated!
I guess I will take a first shot at answering this. Since the question is pretty general, my answer will have to be pretty general as well.
What would be required. At the very least you would have to have a large dataset of pre-transcribed data. Ideally a large amount of spoken language audio mapped to characters in the phonetic alphabet, so the system could learn the sound of individual characters rather than whole transcribed words. If such a dataset doesn't exist, a less granular dataset could be used, mapping single words to their transcriptions. Then you would need a model, that is the actual neural network architecture implemented in code. And lastly you would need some computing resources. This is not something you can train casually, you would either have to buy some time in a cloud based machine learning framework (like Google Cloud ML) or build a fairly expensive machine to train at home.
Has this been done? I don't know. I don't think so. There have been published papers reporting various degrees of success at training systems to transcribe speech. Here is one, for example, http://deeplearning.stanford.edu/lexfree/lexfree.pdf It seems that since the alphabet you want to transcribe to is specifically designed to capture the way words sound rather than just write down the words you might have more success at training such a model.
Is it possible with TensorFlow. Yes, most likely. TensorFlow is well suited for implementing most modern deep learning architectures. Unless you end up designing some really weird and very original model for this purpose, TensorFlow should work just fine.
Edit: after some thought in part 1, you would have to use a dataset mapping spoken words to their transcriptions, since I expect that the same sound pronounced separately would be different from when the same sound is used in a word.
This has actually been done, albeit in PyTorch, by a group at CMU: https://github.com/xinjli/allosaurus

How to build short sentences with a small letter set restriction?

I'm looking for a way to write a program that creates short german sentences with a restricted letter set. The sentences can be nonsense but should grammatically be correct. The following examples only contain the letters "aeilmnost":
"Antonia ist mit Tina im Tal."
"Tamina malt mit lila Tinte Enten."
"Tina nimmt alle Tomaten mit."
For this task I need a dictionary like this one (found in the answer to "Where can I find a parsable list of German words?"). The research area for programatically create text is NLG - Natural Language Generation. On the NLG-Wiki I found a large table of NLG systems. I picked two from the list, which could be appropriate:
SimpleNLG - a Java API, which has also an adaption for the german language
KOMET - multilingual generation, from University Bremen
Do you have worked with a NLG library and have some advice which one to use for building short sentences with a letter set restriction?
Can you recommend a paper to this topic?
Grammatically correct is a pretty fuzzy area, since grammar is not to strictly defined as one might think. What you really want here though, is a part-of-speech tagger, and a markov chain.
Specifically a markov chain says that given a certain state (the first word for instance) there's just a certain chance of moving on to another state (the next word). They are relatively easy to write from scracth, but I've got a gist here in python that shows how they work if you want an example.
Once you've got that I would suggest a part-of-speech-based markov chain, combined with just checking to see if words are constructed from your desired character set. In general the algorithm would go something like this:
Pick first word at random, checking that it is constructed solely from your desired set of characters
Use the Markov Chain to predict the next word
Check if that word is an appropriate part of speech, and that it conforms to the desired character set.
If not, predict another word until it is the case.
If so, then repeat starting at 2 to completion.
Hope that's what you're looking for. Let me know if you have any more questions.
As Slater Tyranus already said, Markov chains certainly form the basis of this task. I am going to suggest a more heavy-duty approach. It is considerably more work, but is likely to give much better results in terms of grammatical correctness.
Language Model based on PCFG parse trees: A language model works by assigning a probability to a sequence of words. It requires training data, however, in order to be built first. In your case, the training process should disregard words containing letters outside the limited set.
While theoretically a language model based on parse trees is much more likely to serve your purpose, there is one caveat: due to the kind of letter-based restriction you have, data sparsity will certainly raise its ugly head. Backoff techniques (e.g. Katz's backoff model) can help a bit, but it will essentially depend on whether or not you can train on enough enough data.
As far as readily available parsers are concerned, the Stanford NLP group provides a German parser based on the Negra corpus, as mentioned in their home page.

API to break voice into phonemes / synthesize new speech given speech samples?

You know those movies where the tech geeks record someone's voice, and their software breaks it into phonemes? Which they can then use to type in any phrase, and make it seem as if the target is saying it?
Does that software exist in an API Version? I don't even know what to Google.
There is no such software. Breaking arbitrary speech into its constituent phonemes is only a partially solved problem: speech-to-text software is still imperfect, as is text-to-speech.
The idea is to reproduce the timbre of the target's voice. Even if you were able to segment the audio perfectly, reordering the phonemes would produce audio with unnatural cadence and intonation, not to mention splicing artifacts. At that point you're getting into smoothing, time-scaling, and pitch correction, all of which are possible and well-understood in theory, but operate poorly on real-world data, especially when the audio sample in question is as short as a single phoneme, and further when the timbre needs to be preserved.
These problems are compounded on the phonetic side by allophonic variation in sounds based on accent and surrounding phonemes; in order to faithfully produce even a low-quality approximation of the audio, you'd need a detailed understanding of the target's language, accent, and speech patterns.
Furthermore, your ultimate problem is one of social engineering, and people are not easy to fool when it comes to the voices of people they know. Even with a large corpus of input data, at best you could get a short low-quality sample, hardly enough for a conversation.
So while it's certainly possible, it's difficult; even if it existed, it wouldn't always be good enough.
SRI International (the company that created Siri for iOS) has an SDK called EduSpeak, which will take audio input and break it down into individual phonemes. I know this because I sat through a demo of the product about a week ago. During the demo, the presenter showed us an application that was created using the SDK. The application gave a few lines of text for the presenter to read. After reading the text, the application displayed a bar chart where each bar represented a phoneme from his speech. The height of each bar represented a score of how well each phoneme was pronounced (the presenter was not a native English speaker, so he received lower scores on certain phonemes compared to others). The presenter could also click on each individual bar to have only that individual phoneme played back using the original audio.
So yes, software exists that divides audio up by phoneme, and it does a very good job of it. Now, whether or not those phonemes can be re-assembled into speech is an open question. If we end up getting a trial version of the SDK, I'll try it out and let you know.
If your aim is to mimic someone else's voice, then another attitude is to convert your own voice (instead of assembling phonemes). It is (surprisingly) called voice conversion, e.g http://www.busim.ee.boun.edu.tr/~speech/projects/Voice_Conversion.htm
The technology is called "voice synthesis" and "voice recognition"
The java API for this can be found here Java voice JSAPI
Apple has an API for this Apple speech
Microsoft has several ...one is discussed here Vista speech
Lyrebird is a start-up that is working on this very problem. Given samples of a person's voice and some written text, it can synthesize a spoken version of that written text in the voice of the person in the samples.
You can get interesting voice warping effects with a formant-aware pitch shift. Adobe Audition has a pretty good implementation. Antares produces some interesting vocal effects VST plugins.
These techniques use some form of linear predictive coding (LPC) to treat the voice as a source-filter model. LPC works on speech signals by estimating the resonance of the vocal tract (formant), reversing its effect with an inverse filter, and then coding the resulting residual signal. The residual signal is ideally an impulse train that represents the glottal impulse. This allows the scaling of pitch and formants independently, which leads to a much better gender conversion result than simple pitch shifting.
I dunno about a commercially available solution, but the concept isn't entirely out of the range of possibility. For example, the University of Delaware has fairly decent software for doing just that.
http://www.modeltalker.com