CMU Sphinx: how to add keywords in addition to existing vocabulary? - voice-recognition

CMU Sphinx comes with a large vocabulary of English words. That is fine, but I want to emphasize certain words that I will be using as commands, and some of these words are not English words. How can I make sure that Sphinx understands both my special command keywords and the rest of the English dictionary words? And how can I make sure that my command keywords take precedence over the rest of the English vocabulary?
Using Sphinx, there is a call that I have attempted to use for this purpose:
ps_add_word(ps, "OKAY", "OW K EY", 1);
However, none of the words I add this way appear to be recognized any more frequently than any other word.

It is not possible at runtime at the moment. You have to add the word to a grammar or language model. You can find more details about language models in the CMUSphinx tutorial:
http://cmusphinx.sourceforge.net/wiki/tutoriallm
You can also read the advanced LM tutorial to understand how to update the current language model:
http://cmusphinx.sourceforge.net/wiki/tutoriallmadvanced
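To make the command keywords take priority, one option (configured up front rather than at runtime) is pocketsphinx's keyword-spotting search together with a dictionary that contains the custom pronunciations. Below is a minimal sketch using the older SWIG-based Python bindings; the file names and the made-up command word are illustrative assumptions, and exact option/function names vary between pocketsphinx versions.
import os
from pocketsphinx import Decoder, get_model_path
# keyphrases.list: one keyword phrase per line with a detection threshold;
# lower thresholds make the keyword fire more eagerly.
with open("keyphrases.list", "w") as f:
    f.write("OKAY ROBOTO /1e-20/\n")
config = Decoder.default_config()
config.set_string("-hmm", os.path.join(get_model_path(), "en-us"))  # acoustic model
# custom.dict = a copy of the standard dictionary with extra lines such as
#   OKAY OW K EY
#   ROBOTO R OW B OW T OW   (hypothetical non-English command word)
config.set_string("-dict", "custom.dict")
config.set_string("-kws", "keyphrases.list")  # keyword-spotting search for the command words
decoder = Decoder(config)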

Related

Does MS speech support custom vocabulary

I have a requirement to write an application which takes an audio file and identifies precisely at which points in the file specific words are spoken. These are not English words, but rather Aramaic words, so they would have to be added as additional vocabulary. Does MS speech recognition support this? Thanks
Yes. There are several options, depending on the nature of your specific words.
One is simply using a phrase list: https://learn.microsoft.com/en-US/azure/cognitive-services/speech-service/improve-accuracy-phrase-list?tabs=terminal&pivots=programming-language-csharp
Another is called Custom Speech: https://learn.microsoft.com/en-US/azure/cognitive-services/speech-service/custom-speech-overview
The first one is easier to test and implement, as you will not need audio data for training.
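For the phrase-list route, here is a minimal sketch with the Speech SDK for Python (the question does not name a programming language, so the key, region, file name and placeholder Aramaic terms are all assumptions); word-level timestamps are what give you the positions in the audio.
import azure.cognitiveservices.speech as speechsdk
speech_config = speechsdk.SpeechConfig(subscription="<your-key>", region="<your-region>")
speech_config.request_word_level_timestamps()  # per-word offsets, to know where a word was said
audio_config = speechsdk.audio.AudioConfig(filename="recording.wav")
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)
# Bias recognition toward the custom (e.g. Aramaic) terms; no training data needed.
phrase_list = speechsdk.PhraseListGrammar.from_recognizer(recognizer)
for phrase in ["<aramaic term 1>", "<aramaic term 2>"]:
    phrase_list.addPhrase(phrase)
result = recognizer.recognize_once()
print(result.text)  # the word-level timings come back in the detailed JSON result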

SpaCy different Language Models

I'm making some progress :) developing my little OCR project.
I was wondering if my idea is possible in this case!
After extracting the text from images (OCR), I use NLP (spaCy) to identify two entities (LOC and PER). I write them to a dictionary and later to JSON data. That works well.
Now I'm wondering if I can improve my identified entities.
One way I can imagine is to use the right language model for the text.
I have various texts in German, English, Spanish and French.
At the moment I'm using the
But now I have no idea how to fit langdetect into this.
Have a great week!
Greets
Here is a link that you might find useful when it comes to detecting a language (there are multiple options, including langdetect): How to detect language
You can create a dictionary with the languages you plan to detect and match it against langdetect's output. I guess you have the rest sorted out.
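For example, you could detect the language with langdetect and pick the matching spaCy pipeline from such a dictionary. A small sketch (the model names below are the small spaCy pipelines and are assumptions; swap in whichever ones you have installed):
import spacy
from langdetect import detect
MODELS = {
    "de": "de_core_news_sm",
    "en": "en_core_web_sm",
    "es": "es_core_news_sm",
    "fr": "fr_core_news_sm",
}
_loaded = {}  # cache so each model is loaded only once
def nlp_for(text):
    lang = detect(text)                        # e.g. "de", "en", "es", "fr"
    name = MODELS.get(lang, "en_core_web_sm")  # fall back to English
    if name not in _loaded:
        _loaded[name] = spacy.load(name)
    return _loaded[name]
ocr_text = "..."  # the text extracted by your OCR step
nlp = nlp_for(ocr_text)
doc = nlp(ocr_text)
# The English model labels people/places as PERSON/GPE, the other models as PER/LOC.
entities = [(ent.text, ent.label_) for ent in doc.ents
            if ent.label_ in {"LOC", "GPE", "PER", "PERSON"}]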

Best way to index text consisting of multilingual words in Elasticsearch

I'm new to Elasticsearch. The docs on the official site only cover the basics and don't contain specific examples. Because it is a little disorganized, in my view, I can't figure out how to get started on my goal.
I have crawled a lot of torrents; they are published in many different languages.
I see there is analysis in Elasticsearch to deal with input text, but I don't understand the workflow. As far as I can tell, Elasticsearch does not run all analyzers over the input data.
It seems I should assign an analyzer to process a text.
Take a text such as: no game no life 游戏人生 ノーゲーム・ノーライフ, which contains three languages. How can I know which three analyzers I have to use? It also seems too heavy to run every analyzer over this text.
I have seen an article, Three Principles for Multilingual Indexing in Elasticsearch, that talks about this. However, I am a beginner and a non-native English speaker, and it is hard to understand without an example.
Please give me some guidance.
Thank you.
I would probably create two fields (or more, one per expected language) and apply a different language-specific analyzer to each of them. Then, when you search, you would search across all of those fields.
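Here is a sketch of that multi-field idea against the elasticsearch-py 8.x client; the index name, field names, and the choice of the built-in english and cjk analyzers are assumptions (the kuromoji and smartcn analysis plugins handle Japanese and Chinese better if you need that).
from elasticsearch import Elasticsearch
es = Elasticsearch("http://localhost:9200")
# One sub-field per expected language, each with its own analyzer.
es.indices.create(
    index="torrents",
    mappings={
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "standard",
                "fields": {
                    "en": {"type": "text", "analyzer": "english"},
                    "cjk": {"type": "text", "analyzer": "cjk"},
                },
            }
        }
    },
)
es.index(index="torrents", document={"title": "no game no life 游戏人生 ノーゲーム・ノーライフ"})
es.indices.refresh(index="torrents")
# Search all language variants of the field at once.
resp = es.search(
    index="torrents",
    query={"multi_match": {"query": "游戏人生", "fields": ["title", "title.en", "title.cjk"]}},
)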

Phrase and word suggestions using Lucene.net

I need to create Google-like suggestions using Lucene.net. I am currently using ShingleAnalyzerWrapper for phrase suggestions, successfully. But I need to fall back to word suggestions if no phrase is found.
I am completely new to the Lucene world, and I need to implement this in a short time. I would appreciate any advice.
Thanks.
Edit
I want simple answers to my questions.
Should I use SpellChecker?
How should I index phrases?
How should I search for phrases (and what if there are misspelled words)?
If you are new to Lucene, this might not be that easy. However, at a higher level, what you need to do is check the results of the phrase query and, if it comes back with zero results, simply create a new query without the phrase.
I am not sure how your phrases are set up, but you could (see the sketch after this list):
- do a keyword search on the phrase and eliminate stopwords: the phrase "the big bus" could become "big bus" or just "bus"
- add a slop setting to your phrase search
- use a fuzzy search
- use a "more like this" search
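A rough Python sketch of that fallback flow: run_query is a hypothetical stand-in for your own QueryParser + IndexSearcher call, and the query strings use standard Lucene query-parser syntax ("..."~N is a sloppy phrase, term~ is a fuzzy term).
STOPWORDS = {"the", "a", "an", "of", "to"}  # illustrative list, not Lucene's
def suggest(user_input, run_query):
    hits = run_query('"%s"' % user_input)    # 1. exact phrase
    if hits:
        return hits
    hits = run_query('"%s"~2' % user_input)  # 2. same phrase, allowing slop
    if hits:
        return hits
    # 3. individual words: drop stopwords, tolerate misspellings via fuzzy terms
    words = [w for w in user_input.lower().split() if w not in STOPWORDS]
    return run_query(" OR ".join(w + "~" for w in words))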
I would recommend the book "Lucene in Action", as it covers Lucene 3.0.3. It is for Java, but the current Lucene.net version is also 3.0.3, so there is symmetry between the two APIs and the examples in the book. The book dedicates a chapter to what you are looking for and the strategies involved: suggestions on a non-exact match (spell checking, suggesting close documents, etc.).

Lucene to bring cheeseburger when searching for burger

I would like that, if a Lucene document contains the word cheeseburger and a user searches for burger, this document comes up. I see that I will probably need a custom analyzer to break this compound word into cheese and burger. However, breaking words may also bring up irrelevant results.
For example: if, when indexing production, we index product and ion as well, then when a user searches for ion, documents containing production will come up, which is not relevant.
So a simple word breaker won't cut it. I need a way of knowing that cheeseburger is associated with burger and cheese, but that production is not associated with ion.
Is there a more intelligent process to achieve this?
Does this have a name, just as reducing words to their root form is called stemming?
Depending on how accurate you want your synonymy to be, you might need to look into approaches such as Latent Semantic Analysis (LSA) and its variants, such as LDA. A simpler approach would be to use an ontology such as WordNet to augment your searches. A WordNet Lucene index is available. However, if your scenario includes domain-specific vocabulary, you might need to generate a "mapping" ontology.
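As a concrete example of the WordNet route, here is a small sketch with NLTK's WordNet interface (run nltk.download("wordnet") once; the exact synsets returned depend on the WordNet version):
from nltk.corpus import wordnet as wn
def related_terms(word):
    # collect synonyms plus more general terms (hypernyms) for a word
    terms = set()
    for synset in wn.synsets(word):
        terms.update(lemma.name() for lemma in synset.lemmas())
        for hyper in synset.hypernyms():
            terms.update(lemma.name() for lemma in hyper.lemmas())
    terms.discard(word)
    return terms
# "cheeseburger" expands to hamburger/beefburger/burger, while "production"
# never expands to "ion".
print(related_terms("cheeseburger"))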
You should look at DictionaryCompoundWordTokenFilter which uses a brute-force algorithm to split compound nouns based on a dictionary.
In most cases you can simply use wildcard queries with a leading wildcard, e.g. *burger. You only have to enable support for leading wildcards on your query parser:
// assumes searchedAttributes and analyzer are defined elsewhere in your code
parser = new QueryParser(LuceneVersion.getVersion(), searchedAttributes, analyzer);
parser.setAllowLeadingWildcard(true);  // leading wildcards are disabled by default
Take care:
Leading wildcards might slow your search down.
If you need a more specific solution, I would suggest going with stemming. It is really a matter of finding the right analyzer.
There are stemming implementations for several languages e.g. the SnowballAnalyzer (http://lucene.apache.org/core/3_6_0/api/all/org/apache/lucene/analysis/snowball/SnowballAnalyzer.html).
Best regards,
Chris
Getting associations by looking at the word itself is not going to scale to other words. For example, you cannot know that "whopper" is associated with burger and "big-mac" is associated with cheese just by looking at the words. To make the search aware of the associations, you probably need a database of associations like "A is a B" or "A contains B". (As Mikos has mentioned, I think WordNet provides such a database.) Then, when you see B in a query, you translate the query so that it also searches for A.
I think the underlying question is: how big is the collection you are indexing? If you are indexing a collection where all of the synonyms and related words are already known, then the index can just include the synonyms and related words directly, like 'cheeseburger' including the related words 'cheese' and 'burger'. (An approach used successfully in the LOINC standard medical terms Lucene index.)
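A hypothetical sketch of that direct-inclusion approach: RELATED and index_document are made-up names, and the point is simply that the related words are stored in a second field at index time and searched alongside the original text.
RELATED = {
    "cheeseburger": ["cheese", "burger", "hamburger"],
    # "production" deliberately has no entry, so it never expands to "ion"
}
def expand(text):
    extra = []
    for token in text.lower().split():
        extra.extend(RELATED.get(token, []))
    return extra
def add_to_index(doc_id, text, index_document):
    index_document(doc_id, {
        "body": text,                            # original text, analyzed as usual
        "body_related": " ".join(expand(text)),  # queried together with "body"
    })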
If you are trying to solve the general problem for a whole human language (English, Chinese, etc.) then you have to move to some kind of semantic analysis as mentioned above.
It might be useful to talk with the subject matter experts of the area you are indexing to see how they search for terms -- what synonyms/related words do they use, do they have defined lists of synonyms/related words, do they need/use stemming, etc. This should give you some idea as to which approach (direct synonym/related-word inclusion or semantic analysis) you need to pursue.