Spell-checking homophones and numbers with spacy

I am working with a pipeline where speech is translated into text. The speech would be a sort of technical professional jargon, with occurrences of acronyms as well as phrases which describe amounts and values.
There is a high chance that the stock speech-to-text translator will misrepresent terms and phrases I am interested in. I have a corpus where I can tag those words and phrases.
Is there a way to use spaCy (or maybe some other NLP tool) to apply spell-checking, so that, for instance, 'pee-age' or 'pee-oh-to' would be corrected to 'pH' and 'pO2', respectively?
And for phrases where numbers and amounts are spelled out, could 'one twenty cee-cee', for example, be corrected to '120cc'?
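For illustration, here is a minimal sketch of the kind of normalization I have in mind, assuming spaCy v3 and a hand-built table that maps common mis-transcriptions (tagged in my corpus) to canonical terms. The table entries and the normalize helper below are made up for the example.

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")  # tokenizer only; no statistical model needed

# hypothetical mapping: transcription variant -> canonical form
CANONICAL = {
    "pee-age": "pH",
    "pee oh two": "pO2",
    "one twenty cee-cee": "120cc",
}

matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
for variant in CANONICAL:
    matcher.add(variant, [nlp.make_doc(variant)])

def normalize(text):
    doc = nlp(text)
    # prefer earlier, longer matches when spans overlap
    matches = sorted(matcher(doc), key=lambda m: (m[1], -(m[2] - m[1])))
    out, last_end = [], 0
    for match_id, start, end in matches:
        if start < last_end:
            continue  # skip a shorter match inside a span already replaced
        out.append(doc[last_end:start].text)
        out.append(CANONICAL[nlp.vocab.strings[match_id]])
        last_end = end
    out.append(doc[last_end:].text)
    return " ".join(part for part in out if part)

print(normalize("the pee-age was low so we gave one twenty cee-cee"))
# -> the pH was low so we gave 120cc

The open question is how to go beyond a fixed lookup table (e.g. fuzzy or phonetic matching against it), which is what I am really asking about.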
Thanks in advance

Related

How to extract the keywords on which the Universal Sentence Encoder was trained?

I am using the Universal Sentence Encoder to encode some documents into 512-dimensional embeddings. These are then used to find items similar to a search query, which is also encoded using USE. USE works pretty well on general English words in the search query and documents, but performs really badly when the search query contains rare keywords such as people's names. I am thinking of enabling a re-ranker over the search results that takes into account the number of rare words present in the search query and the document retrieved. This should boost the scores of documents which contain known words while reducing the scores of documents that contain unknown words.
My question is: how do I get the vocabulary of the Universal Sentence Encoder (the keywords it was trained on) to implement such a re-ranker?
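For concreteness, here is a rough sketch of the re-ranker I have in mind. The USE vocabulary is the missing piece; in this sketch I fall back to word frequencies from my own corpus to decide what counts as rare, and the threshold and boost weights are made up.

from collections import Counter

def build_frequencies(documents):
    # word -> number of occurrences across the corpus
    counts = Counter()
    for doc in documents:
        counts.update(doc.lower().split())
    return counts

def rerank(query, candidates, counts, rare_threshold=3, boost=0.2):
    """candidates: list of (doc_text, use_score) pairs from the USE retrieval step."""
    rare_terms = {w for w in query.lower().split() if counts[w] <= rare_threshold}
    rescored = []
    for text, use_score in candidates:
        words = set(text.lower().split())
        hits = len(rare_terms & words)      # rare query words the document contains
        misses = len(rare_terms - words)    # rare query words it is missing
        rescored.append((text, use_score + boost * hits - boost * misses))
    return sorted(rescored, key=lambda item: item[1], reverse=True)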

How to choose num_words parameter for keras Tokenizer?

from keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(num_words=my_max)
I am using the keras preprocessing Tokenizer to process a corpus of text for a machine learning model. One of the parameters for the Tokenizer is num_words, which defines the number of words in the dictionary. How should this parameter be picked? I could choose a huge number and guarantee that every word would be included, but certain words that only appear once might be more useful if grouped together as a single "out of vocabulary" token. What is the strategy for setting this parameter?
My particular use case is a model processing tweets, so every entry is below 140 characters and there is some overlap in the types of words that are used. The model is for a Kaggle competition about extracting the text that exemplifies a sentiment (e.g. "my boss is bullying me" returns "bullying me").
The base question here is "What kinds of words establish sentiment, and how often do they occur in tweets?"
Which, of course, has no hard and fast answer.
Here's how I would solve this:
Preprocess your data so you remove conjunctions, stop words, and "junk" from tweets.
Get the number of unique words in your corpus. Are all of these words essential to convey sentiment?
Analyze the words with the top frequencies. Are these words that convey sentiment? Could they be removed in your preprocessing? The tokenizer keeps the num_words most frequent words, so these high-frequency words are the ones most likely to end up in your dictionary.
Then, I would begin experimenting with different values, and see the effects on your output.
Apologies for no "real" answer. I would argue that there is no single true strategy for choosing this value. Instead, the answer should come from leveraging the characteristics and statistics of your data.
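As a small sketch of that kind of analysis (assuming keras is available and tweets is a list of strings), you can fit an unlimited Tokenizer first and see how much of the corpus various vocabulary sizes would cover; the candidate sizes below are arbitrary.

from keras.preprocessing.text import Tokenizer

def coverage_report(tweets, candidate_sizes=(1000, 5000, 10000, 20000)):
    tok = Tokenizer()            # no num_words limit: count everything
    tok.fit_on_texts(tweets)
    # word_counts maps each word to its frequency in the corpus
    freqs = sorted(tok.word_counts.values(), reverse=True)
    total = sum(freqs)
    print(f"unique words: {len(freqs)}")
    for n in candidate_sizes:
        covered = sum(freqs[:n])
        print(f"num_words={n}: covers {covered / total:.1%} of all tokens")

A num_words that already covers, say, 95%+ of the tokens is usually a reasonable starting point for experimentation.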

Machine Learning text comparison model

I am creating a machine learning model that essentially scores how well one text matches another.
For example; “the cat and a dog”, “a dog and the cat”. The model needs to be able to identify that some words (“cat”/“dog”) are more important/significant than others (“a”/“the”). I am not interested in conjunction words etc. I would like to be able to tell the model which words are the most “significant” and have it determine how correct text 1 is to text 2, with the “significant” words bearing more weight than others.
It also needs to be able to recognise that phrases don’t necessarily have to be in the same order. The two above sentences should be an extremely high match.
What is the basic algorithm I should use to go about this? Is there an alternative to just creating a dataset with thousands of example texts and a score of correctness?
I am only after a broad overview/flowchart/process/algorithm.
I think TF-IDF might be a good fit for your problem, because:
Emphasis on words occurring in many documents (say, 90% of your sentences/documents contain the conjunction 'and') is much smaller, essentially giving more weight to the more document-specific phrasing (this is the IDF part).
Ordering does not matter in Term Frequency (TF), as opposed to methods using sliding windows etc.
It is very lightweight compared to representation-oriented methods like the one mentioned above.
Big drawback: depending on the size of your corpus, your data may have too many dimensions (as many dimensions as unique words); you could use stemming/lemmatization to mitigate this problem to some degree.
You can calculate the similarity between two TF-IDF vectors using, for example, cosine similarity.
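For instance, a minimal sketch with scikit-learn (assuming it fits your setup; the built-in stop-word list is just an optional shortcut for dropping 'a', 'and', 'the'):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = ["the cat and a dog", "a dog and the cat"]

vectorizer = TfidfVectorizer(stop_words="english")  # removes 'a', 'and', 'the'
tfidf = vectorizer.fit_transform(texts)             # one TF-IDF vector per text

score = cosine_similarity(tfidf[0], tfidf[1])[0, 0]
print(f"similarity: {score:.2f}")                   # ~1.0, since word order is ignored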
EDIT: Whoops, this question is 8 months old, sorry for the bump; maybe it will be of use to someone else though.

What is the Metaphone 3 Algorithm?

I want to code the Metaphone 3 algorithm myself. Is there a description? I know the source code is available for sale but that is not what I am looking for.
Since the author (Lawrence Philips) decided to commercialize the algorithm itself, it is more than likely that you will not find a description. A good place to ask would be the mailing list: https://lists.sourceforge.net/lists/listinfo/aspell-metaphone
but you can also check out the source code (i.e. the code comments) in order to understand how the algorithm works:
http://code.google.com/p/google-refine/source/browse/trunk/main/src/com/google/refine/clustering/binning/Metaphone3.java?r=2029
From Wikipedia, the Metaphone algorithm is
Metaphone is a phonetic algorithm, an algorithm published in 1990 for indexing words by their English pronunciation. It fundamentally improves on the Soundex algorithm by using information about variations and inconsistencies in English spelling and pronunciation to produce a more accurate encoding, which does a better job of matching words and names which sound similar [...]
Metaphone 3 specifically
[...] achieves an accuracy of approximately 99% for English words, non-English words familiar to Americans, and first names and family names commonly found in the United States, having been developed according to modern engineering standards against a test harness of prepared correct encodings.
The overview of the algorithm is:
The Metaphone algorithm operates by first removing non-English letters and characters from the word being processed. Next, all vowels are discarded unless the word begins with an initial vowel, in which case all vowels except the initial one are discarded. Finally, all consonants and groups of consonants are mapped to their Metaphone code. The rules for grouping consonants and mapping the groups to Metaphone codes are fairly complicated; for a full list of these conversions, check out the comments in the source code section.
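To make that three-step structure concrete, here is a deliberately toy sketch in Python. The consonant table is made up and far simpler than the real Metaphone rule set, which is context-sensitive; this only illustrates the overall shape of the algorithm.

import re

# made-up, drastically simplified consonant mapping (real Metaphone rules
# depend on surrounding letters and are far more numerous)
TOY_MAP = {"b": "P", "p": "P", "d": "T", "t": "T", "g": "K", "k": "K",
           "c": "K", "s": "S", "z": "S", "f": "F", "v": "F"}
VOWELS = set("aeiou")

def toy_metaphone(word):
    word = re.sub(r"[^a-z]", "", word.lower())    # 1. drop non-letters
    if not word:
        return ""
    code = word[0].upper() if word[0] in VOWELS else ""
    for ch in word:
        if ch in VOWELS:
            continue                              # 2. drop non-initial vowels
        code += TOY_MAP.get(ch, ch.upper())       # 3. map consonants to codes
    return code

print(toy_metaphone("bad"), toy_metaphone("pat"))  # both 'PT': voiced/unvoiced pairs fold together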
Now, onto your real question:
If you are interested in the specifics of the Metaphone 3 algorithm, I think you are out of luck (short of buying the source code, understanding it and re-creating it on your own): the whole point of not making the algorithm (of which the source you can buy is an instance) public is that you cannot recreate it without paying the author for their development effort (providing the "precise algorithm" you are looking for is equivalent to providing the actual code itself). Consider the above quotes: the development of the algorithm involved a "test harness of [...] encodings". Unless you happen to have such test harness or are able to create one, you will not be able to replicate the algorithm.
On the other hand, implementations of the first two iterations (Metaphone and Double Metaphone) are freely available (the above Wikipedia link contains a score of links to implementations in various languages for both), which means you have a good starting point in understanding what the algorithm is about exactly, then improve on it as you see fit (e.g. by creating and using an appropriate test harness).
The link posted by @Bo now points to the entire source code of the (now defunct) project. Here is a new link that points directly to the Metaphone 3 source code:
https://searchcode.com/codesearch/view/2366000/
by Lawrence Philips

Metaphone 3 is designed to return an approximate phonetic key (and an alternate approximate phonetic key when appropriate) that should be the same for English words, and most names familiar in the United States, that are pronounced similarly. The key value is not intended to be an exact phonetic, or even phonemic, representation of the word. This is because a certain degree of 'fuzziness' has proven to be useful in compensating for variations in pronunciation, as well as misheard pronunciations. For example, although americans are not usually aware of it, the letter 's' is normally pronounced 'z' at the end of words such as "sounds".

The 'approximate' aspect of the encoding is implemented according to the following rules:

(1) All vowels are encoded to the same value - 'A'. If the parameter encodeVowels is set to false, only initial vowels will be encoded at all. If encodeVowels is set to true, 'A' will be encoded at all places in the word that any vowels are normally pronounced. 'W' as well as 'Y' are treated as vowels. Although there are differences in the pronunciation of 'W' and 'Y' in different circumstances that lead to their being classified as vowels under some circumstances and as consonants in others, for the purposes of the 'fuzziness' component of the Soundex and Metaphone family of algorithms they will always be treated here as vowels.

(2) Voiced and un-voiced consonant pairs are mapped to the same encoded value. This means that:
'D' and 'T' -> 'T'
'B' and 'P' -> 'P'
'G' and 'K' -> 'K'
'Z' and 'S' -> 'S'
'V' and 'F' -> 'F'

In addition to the above voiced/unvoiced rules, 'CH' and 'SH' -> 'X', where 'X' represents the "-SH-" and "-CH-" sounds in Metaphone 3 encoding.
I thought it was wrong to have the general community be denied an algorithm (not code).
I am selling source, so the algorithm is not hidden. I am asking $40.00 for a copy of the source code, asking other people who are charging for their software or services that use Metaphone 3 to pay me a licensing fee, and also asking that the source code not be distributed by other people (except for an exception I made for Google Refine - I can only request that you do not redistribute the copy of Metaphone 3 found there separately from the Refine package).
Actually, Metaphone 3 is an algorithm with many very specific rules that came out of analyzing test cases, so it is not only a pure algorithm but also comes with extra domain knowledge. Obtaining that knowledge and those specific rules took the author a great deal of effort. That is why this algorithm is not open source.
There is an open-source alternative, though: Double Metaphone.
See here: https://commons.apache.org/proper/commons-codec/apidocs/org/apache/commons/codec/language/DoubleMetaphone.html
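If Python is an option, a quick way to experiment with Double Metaphone is the third-party metaphone package (assuming pip install metaphone is acceptable in your setup):

from metaphone import doublemetaphone

# returns a (primary, alternate) pair of phonetic codes; words that sound
# alike should share the primary (or alternate) code
print(doublemetaphone("Philips"))
print(doublemetaphone("Filipz"))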
This is not a commercial post and I have no relationship with the owner, but it is worth saying that an implementation of Metaphone 3 is available as commercial software from its creator at amporphics.com. It looks like his personal store. It is a Java app, but I bought the Python version and it works fine.
The Why Metaphone3? page says:
One common solution to spelling variation is the database approach. Some very impressive work has been done accumulating personal name variations from all over the world. (Of course, we are always very pleased when the companies that retail these databases advertise that they also use some version of Metaphone to improve their flexibility :-) )

But - there are some problems with this approach: they only work well until they encounter a spelling variation or a new word or name that is not already in their database. Then they don't work at all.

Metaphone 3 is an algorithmic approach that will deliver a phonetic lookup key for anything you enter into it.

Personal names, that is, first names and family names, are not the same as company names. In fact, the name of a company or agency may contain words of any kind, not just names. Database solutions usually don't cover possible spelling variations, or for that matter misspellings, for regular 'dictionary' words. Or if they do, not very thoroughly.

Metaphone 3 was developed to account for all spelling variations commonly found in English words, first and last names found in the United States and Europe, and non-English words whose native pronunciations are familiar to Americans. It doesn't care what kind of a word you are trying to match.
For what it is worth, we licensed the code since it is affordable and easy to use. I can't speak to its performance yet. There are good alternatives on PyPI, but I can't find them at the moment.

Search for (Very) Approximate Substrings in a Large Database

I am trying to search for long, approximate substrings in a large database. For example, a query could be a 1000-character substring that could differ from the match by a Levenshtein distance of several hundred edits. I have heard that indexed q-grams could do this, but I don't know the implementation details. I have also heard that Lucene could do it, but is Lucene's Levenshtein algorithm fast enough for hundreds of edits? Perhaps something out of the world of plagiarism detection? Any advice is appreciated.
Q-grams could be one approach, but there are others such as BLAST and BLASTP, which are used for protein and nucleotide matching, etc.
The Simmetrics library is a comprehensive collection of string distance approaches.
Lucene does not seem to be the right tool here. In addition to Mikos' fine suggestions, I have heard about AGREP, FASTA and Locality-Sensitive Hashing (LSH). I believe that an efficient method should first prune the search space heavily, and only then do more sophisticated scoring on the remaining candidates.
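To make the prune-then-score idea concrete, here is a rough sketch of q-gram candidate filtering followed by exact Levenshtein verification on the survivors. It is purely illustrative: the thresholds are made up, and it compares whole records, whereas true substring matching would need a windowed or local-alignment verification step on top of the same filtering idea.

from collections import defaultdict

def qgrams(s, q=3):
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def build_index(records, q=3):
    # map each q-gram to the ids of the records that contain it
    index = defaultdict(set)
    for rid, text in enumerate(records):
        for g in qgrams(text, q):
            index[g].add(rid)
    return index

def levenshtein(a, b):
    # standard O(len(a) * len(b)) dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]

def search(query, records, index, q=3, min_shared_grams=20, max_dist=300):
    # 1. prune: keep only records sharing enough q-grams with the query
    votes = defaultdict(int)
    for g in qgrams(query, q):
        for rid in index.get(g, ()):
            votes[rid] += 1
    candidates = [rid for rid, v in votes.items() if v >= min_shared_grams]
    # 2. verify: run the expensive edit-distance check only on the survivors
    results = []
    for rid in candidates:
        d = levenshtein(query, records[rid])
        if d <= max_dist:
            results.append((rid, d))
    return sorted(results, key=lambda item: item[1])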