How large must a corpus be to create a language model for Sphinx? - voice-recognition

I would like to know how many documents or sentences or words I need to process in order to get a good language model of a domain and use it in voice recognition tools such as CMU Sphinx.

To create a decent language model for a small domain it's usually enough to have about 100 mb of texts. You can mix them with a generic language model to get a better generalization of the language model.
To create a generic language model developers use very big corpora. For example there is a Google 1TB corpus which contains millions of words and terabyte of data. The trigram part of it is about 40Gb of bigram counts but it must be a hundred terabytes of texts.

adding to Nikolay's answer:
This is not a trivial task. Generating a language model is a time- and resource-intensive task.
If you want to have a "good" language model, you will need a large or very large text corpus to train a language model (think in the order of magnitude of several years of wall street journal texts).
"good" means: if the language model will be able to generalize from the training data to new and previously unseen input data
You should look at the documentation of the Sphinx and the HTK language model toolkits.
Please check these two threads:
Building openears compatible language model
Ruby Text Analysis
You could take a more general Language Model, based on a bigger corpus and interpolate your smaller Language Model with it .. e.g a back-off language model ... but that's not a trivial task.
see: http://en.wikipedia.org/wiki/Katz's_back-off_model

Related

What are some common data augmentation techniques for code?

I understand that there are data augmentation techniques for natural language such as word/sentence shuffling, word replacement with synonyms and syntax-tree manipulation.
However, I have hard time finding data augmentation techniques for code snippets(e.g. java, c++, etc).
For example, natural language (e.g. I am a human) and code snippet (e.g. def function(foo): print(foo)) have quite different syntactic and semantic characteristics. Thus, I don't think the data augmentation techniques for natural language (e.g. word replacement with synonyms) can be applied to code snippets.
Could someone tell me what are some common data augmentation used for code? Thank you.

Scaling up SURF lookups

I am currently trying to recognise DVD covers in generic photos. My initial test involved using 100 DVD covers and 10 test cases of photos that contained them, and with some tweaking of the find_obj.cpp example in OpenCV I was able to get recognition working.
However now I need to do this on a much larger database, and I am aware that the FLANN method will not scale up well to meet this requirement. How do people here recommend I scale up my SURF recognition in an SQL database?
If you really want to scale your system to several orders of magnitude, nearest neighbors search (FLANN) will not be sufficient.
In such a case what you need is to build a visual vocabulary (a.k.a bag of words) by quantizing your descriptors, and create an inverted index.
I recommend you to refer to the Scalable Recognition with a Vocabulary Tree paper that is the reference publication for such a topic.

Suitability of Naive Bayes classifier in Mahout to classifying websites

I'm currently working on a project that requires a database categorising websites (e.g. cnn.com = news). We only require broad classifications - we don't need every single URL classified individually. We're talking to the usual vendors of such databases, but most quotes we've had back are quite expensive and often they impose annoying requirements - like having to use their SDKs to query the database.
In the meantime, I've also been exploring the possibility of building such a database myself. I realise that this is not a 5 minute job, so I'm doing plenty of research.
From reading various papers on the subject, it seems a Naive Bayes classifier is generally the standard approach for doing this. However, many of the papers suggest enhancements to improve its accuracy in web classification - typically by making use of other contextual information, such as hyperlinks, header tags, multi-word phrases, the URL, word frequency and so on.
I've been experimenting with Mahout's Naive Bayes classifier against the 20 Newsgroup test dataset, and I can see its applicability to website classification, but I'm concerned about its accuracy for my use case.
Is anyone aware of the feasibility of extending the Bayes classifier in Mahout to take into account additional attributes? Any pointers as to where to start would be much appreciated.
Alternatively, if I'm barking up entirely the wrong tree please let me know!
You can control the input about as much as you'd like. In the end the input is just a feature vector. The feature vector's features can be words, or bigrams -- but they can also be whatever you want. So, yes, you can inject new features by modifying the input as you like.
How best to weave in those features is another topic entirely -- there's not one best way to convert them to numbers. Mahout in Action covers this reasonably well FWIW.

How to extract semantic relatedness from a text corpus

The goal is to assess semantic relatedness between terms in a large text corpus, e.g. 'police' and 'crime' should have a stronger semantic relatedness than 'police' and 'mountain' as they tend to co-occur in the same context.
The simplest approach I've read about consists of extracting IF-IDF information from the corpus.
A lot of people use Latent Semantic Analysis to find semantic correlations.
I've come across the Lucene search engine: http://lucene.apache.org/
Do you think it is suitable to extract IF-IDF?
What would you recommend to do what I'm trying to do, both in terms of technique and software tools (with a preference for Java)?
Thanks in advance!
Mulone
Yes, Lucene gets TF-IDF data. The Carrot^2 algorithm is an example of a semantic extraction program built on Lucene. I mention it since, as a first step, they create a correlation matrix. Of course, you probably can build this matrix yourself easily.
If you deal with a ton of data, you may want to use Mahout for the harder linear algebra parts.
It is very easy if you have lucene index. For example to get correllation you can use simple formula count(term1 and term2)/ count(term1)* count(term2). Where count is hits from you search results. Moreover you can easility calculate other semntica metrics such as chi^2, info gain. All you need is to get formula and convert it to terms of count from Query

Non-Speech Noise or Sound Recognition Software?

I'm working on some software for children, and looking to add the ability for the software to respond to a number of non-speech sounds. For instance, clapping, barking, whistling, fart noises, etc.
I've used CMU Sphinx and the Windows Speech API in the past, however, as far as I can tell neither of these have any support for non-speech noises, and in fact I believe actively filter them out.
In general I'm looking for "How do I get this functionality" but I suspect it may help if I break it down into three questions that are my guesses for what to search for next:
Is there a way to use one of the main speech recognition engines to recognize non-word sounds by changing an acoustic model or pronunciation lexicon?
(or) Is there already an existing library to do non-word noise recognition?
(or) I have a bit of familiarity with Hidden Markov Models and the underlying tech of voice recognition from college, but no good estimate on how difficult it would be to create a very small noise/sound recognizer from scratch (suppose <20 noises to be recognized). If 1) and 2) fail, any estimation on how long it would take to roll my own?
Thanks
Yes, you can use speech recognition software like CMU Sphinx for recognition of non-speech sounds. For this, you need to create your own acoustical and language models and define the lexicon restricted to your task. But to train the corresponding acoustic model, you must have enough training data with annotated sounds of interest.
In short, the sequence of steps is the following:
First, prepare resources for training: lexicon, dictionary etc. The process is described here: http://cmusphinx.sourceforge.net/wiki/tutorialam. But in your case, you need to redefine phoneme set and the lexicon. Namely, you should model fillers as real words (so, no ++ around) and you don't need to define the full phoneme set. There are many possibilities, but probably the most simple one is to have a single model for all speech phonemes. Thus, your lexicon will look like:
CLAP CLAP
BARK BARK
WHISTLE WHISTLE
FART FART
SPEECH SPEECH
Second, prepare training data with labels: Something similar to VoxForge, but text annotations must contain only labels from your lexicon. Of course, non-speech sounds must be labeled correctly as well. Good question here is where to get large enough amount of such data. But I guess it should be possible.
Having that, you can train your model. The task is simpler compared to speech recognition, for instance, you don't need to use triphones, just monophones.
Assuming equal prior probability of any sound/speech, the simplest language model can be a loop-like grammar (http://cmusphinx.sourceforge.net/wiki/tutoriallm):
#JSGF V1.0;
/**
* JSGF Grammar for Hello World example
*/
grammar foo;
public <foo> = (CLAP | BARK | WHISTLE | FART | SPEECH)+ ;
This is the very basic approach to using ASR toolkit for your task. In can be further improved by fine-tuning HMMs configurations, using statistical language models and using fine-grained phonemes modeling (e.g. distinguishing vowels and consonants instead of having single SPEECH model. It depends on nature of your training data).
Outside the framework of speech recognition, you can build a simple static classifier that will analyze the input data frame by frame. Convolutional neural networks that operate over spectrograms perform quite well for this task.
I don't know any existing libraries you can use, I suspect you may have to roll your own.
Would this paper be of interest? It has some technical detail, they seem to be able to recognise claps and differentiate them from whistles.
http://www.cs.bham.ac.uk/internal/courses/robotics/halloffame/2001/team14/sound.htm