ElasticSearch Stemming - lucene

I am using Elasticsearch and I want to set up basic stemming for English, so that, for example, a search for fighter also returns fight or any other word that contains the fight root.
I am a little confused about how to implement this. I was reading through the analyzers, tokenizers and filters, and there are multiple stemming algorithms that can be used in Elasticsearch. I am just not sure which combination to use: snowball, stemmer, porter stem or synonym filters.
Also, an example of the mapping would be really helpful.

Please mind the difference between stemming and lemmatisation. A stemming algorithm applies a series of rules (and/or dictionary lookups, as is the case for KStem, for example) and doesn't guarantee that the result will be a proper linguistic 'root' (i.e. lemma).
So, for instance, both 'marinate' and 'marines' will be converted to 'marin' by the Porter stemmer, which is considered quite 'aggressive': it tends to produce the same stem for a large number of words. There are more conservative ones, for example the S-Stemmer, which only converts plural forms to singular (org.apache.lucene.analysis.en.EnglishMinimalStemFilter).
Comparisons of stemming methods in research papers seem to favor KStem as the most effective for English texts, but the choice of stemmer depends heavily on the vocabulary of your documents. Your aim is not to optimize the stemmer's performance but the performance of the search engine as a whole, so measuring the stemmer in isolation from the other elements of your system (especially query expansion) is not a good idea in practice.
The best solution is to try a number of the different stemmers available in Elasticsearch (an example mapping can be seen here) and observe the precision and recall of the results. If you don't have a test suite of queries, your best bet is to run 'typical' queries and watch for 'strange' results (a sign that the stemmer is too aggressive) or 'good' results being omitted (a sign that it is too conservative).
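To make the "try several stemmers" advice concrete, here is a minimal sketch that builds one index body per candidate stemmer, written as a Python dict that serializes to the JSON Elasticsearch expects. The analyzer name `english_stemmed`, the filter name `my_stemmer` and the `title` field are made up for illustration. The `language` values are the ones I believe the `stemmer` token filter accepts for English (`light_english` should correspond to KStem), and the `mappings` layout assumes a typeless (7.x or later) version, so check both against the documentation for your release.

```python
import json


def build_index_body(stemmer_language="light_english"):
    """Return index settings plus a mapping that stems one text field.

    All names here (my_stemmer, english_stemmed, title) are hypothetical.
    """
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # Generic stemmer filter; the algorithm is picked by "language".
                    "my_stemmer": {"type": "stemmer", "language": stemmer_language}
                },
                "analyzer": {
                    "english_stemmed": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "my_stemmer"],
                    }
                },
            }
        },
        "mappings": {
            "properties": {
                "title": {"type": "text", "analyzer": "english_stemmed"}
            }
        },
    }


# Roughly ordered from most conservative to most aggressive (assumed values).
CANDIDATES = ["minimal_english", "light_english", "porter2"]

if __name__ == "__main__":
    for language in CANDIDATES:
        print(json.dumps(build_index_body(language), indent=2))
```

Creating one test index per candidate and running the same set of queries against each makes the precision/recall comparison described above straightforward.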


Azure Search - issues with Phonetic Analyzer

Our clients query our Azure Search index, mostly for people's names. We are using the Lucene analyzer for all of our fields. We build the query string by turning the client's input name into a phrase and adding a proximity value of 3. Because we search using a phrase, we cannot use the Fuzzy Search capability of the Lucene analyzer, as it only works on single words.
We were therefore in search of a solution for being able to bring back results with names that weren't spelled exactly as the client input them. We came across the phonetic analyzer, and have just implemented the Metaphone algorithm into our index. We've run some tests and while it gets us closer to what we need, we still see some issues:
1. The analyzer's scope is so wide that it's bringing back a lot of false positives. For example, when searching on Kenneth Gooden, it brings back Kenneth Cotton. That's just a little too far to be considered phonetically similar, in our opinion. Can the sensitivity be tweaked in any way, or can something be done to boost some other parameter to remedy this?
2. When doing a search on Barry Soper, the first and highest-scored result that comes back is "Barry Spear." The second result, scored lower, is "Soper, Barry Russell." To a certain extent, I can maybe see why it's scored that way (b/c of the 2nd one being last name first) but then... not really. The 2nd result contains both exact terms within the required proximity. Maybe Azure Search gives priority to the order of words in the phrase before applying the analyzer? Still doesn't make sense to me. (Side note - this query also brings back "Barh Super" - see issue #1 above)
I would like to know if someone could offer suggestions to tweak Azure Search's behavior to work more along the lines of what we need, OR, perhaps suggest an alternative to the phonetic analyzer. We haven't tried any of the other available phonetic algorithms yet, only b/c it seems Metaphone is the best and most commonly used. But we're open to suggestions regarding the other algorithms as well.
You are correct that the fuzzy operator only works on single terms. In this case, you can use a custom analyzer (with a phonetic token filter) or the Synonyms feature (in preview). I am not sure what you meant by "we have just implemented the Metaphone algorithm into our index", but there are several phonetic token filters you can choose from in the Azure Search custom analysis stack. Synonyms is a newer feature only available in preview; you can take a look here. For synonyms, you will need to define synonym rules, say 'Nate, Nathan, Nathaniel', and at query time, searching for one automatically includes the results for the others.
So how should you use these building blocks to control relevance for your search? One way to model this is to use a separate field for each expansion strategy. For example, instead of a single field for the name, you can have three fields, say 'name', 'name_synonym', and 'name_phonetic'. The first field, 'name', is for exact matches; 'name_synonym' has synonyms enabled; and the third uses a phonetic analyzer and broadens the search the most. You can then use a scoring profile to boost scores from matches in each field. You can give a boost value of 10 for exact matches, 5 for synonyms and 1 for phonetic expansions, for example. Your search will be issued against these three internal fields.
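As a rough sketch of the three-field idea, the snippet below builds a scoring profile with the weights suggested above, in the shape I understand Azure Search index definitions to use (`scoringProfiles` with `text.weights`). The profile name `nameBoost` is invented, and `toy_score` is only an illustration of the intended ordering: the real engine scales each field's relevance score by its weight rather than adding fixed numbers.

```python
import json

# Field names and boost values taken from the paragraph above.
FIELD_WEIGHTS = {"name": 10, "name_synonym": 5, "name_phonetic": 1}


def build_scoring_profile(profile_name, weights):
    """Return a scoring profile entry that boosts matches per field."""
    return {"name": profile_name, "text": {"weights": dict(weights)}}


def toy_score(matched_fields, weights=FIELD_WEIGHTS):
    """Toy model only: add up the weight of every field that matched."""
    return sum(weights[field] for field in matched_fields)


if __name__ == "__main__":
    profile = build_scoring_profile("nameBoost", FIELD_WEIGHTS)  # hypothetical name
    print(json.dumps({"scoringProfiles": [profile]}, indent=2))
    # An exact hit matches all three fields, a sound-alike only the phonetic one.
    print(toy_score(["name", "name_synonym", "name_phonetic"]))
    print(toy_score(["name_phonetic"]))
```

With weights like these, a document that matches the exact 'name' field should outrank one that only matches through 'name_phonetic'.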
Regarding your question as to why 'Soper, Barry Russell' is ranked lower than 'Barry Spear': after phonetic analysis, the words 'soper' and 'spear' reduce to the same form at both indexing and query time and are treated as if they were identical terms. In computing the score and ranking, the search engine uses the analyzed form of the terms, and the degree of phonetic similarity has no influence on the score. That's why secondary factors, like field length, play a more significant role in the relevance score.
Hope this helps. I provided one example to model this but you could also take a look at term boosting in the full lucene query syntax.
Let me know if you have any additional questions.

Lucene to bring cheeseburger when searching for burger

If a Lucene document contains the word cheeseburger and a user searches for burger, I would like this document to come up. I see that I will probably need a custom analyzer to break this compound word into cheese and burger. However, breaking words may also bring irrelevant results.
Ex: if, when indexing production, we index product and ion as well, then when the user searches for ion, documents containing production will come up, which is not relevant.
So a simple word breaker won't cut it. I need a way of knowing that cheeseburger is associated with burger and cheese, but that production is not associated with ion.
Is there a more intelligent process to achieve this?
Does this have a name, the way stemming is the name for reducing words to their root form?
Depending on how accurate you want your synonymy to be, you might need to look into approaches such as Latent Semantic Analysis (LSA) and its variants such as LDA etc. A simpler approach would be to use an Ontology such as Wordnet to augment your searches. A wordnet Lucene index is available. However if your scenario includes domain-specific vocab then you might need to generate a "mapping" Ontology.
You should look at DictionaryCompoundWordTokenFilter which uses a brute-force algorithm to split compound nouns based on a dictionary.
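To illustrate the dictionary-based idea (this is a simplified sketch, not how Lucene's DictionaryCompoundWordTokenFilter is implemented), the function below only splits a word when dictionary entries of a minimum length cover it completely. The tiny dictionary and the `min_part` default are assumptions for the example. The minimum length is what keeps production from being split into product + ion here.

```python
def split_compound(word, dictionary, min_part=4):
    """Return the dictionary parts that cover `word` entirely, or None.

    Prefers the longest matching part at each position and never returns
    the whole word as a single part.
    """
    word = word.lower()

    def cover(rest):
        if not rest:
            return []
        for end in range(len(rest), min_part - 1, -1):
            head = rest[:end]
            if head in dictionary and head != word:
                tail = cover(rest[end:])
                if tail is not None:
                    return [head] + tail
        return None

    return cover(word)


def index_tokens(word, dictionary, min_part=4):
    """Tokens to index: the original word plus its parts, if it splits."""
    parts = split_compound(word, dictionary, min_part)
    return [word.lower()] + (parts or [])


if __name__ == "__main__":
    words = {"cheese", "burger", "product", "ion"}  # toy dictionary
    print(index_tokens("cheeseburger", words))
    print(index_tokens("production", words))
```

Lowering `min_part` to 3 makes production split again, which shows how sensitive this kind of decompounding is to the dictionary and its thresholds.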
In most cases you can simply use wildcard queries with a leading wildcard, e.g. *burger. You only have to enable support for leading wildcards on your query parser:
parser = new QueryParser(LuceneVersion.getVersion(), searchedAttributes, analyzer);
parser.setAllowLeadingWildcard(true); // leading wildcards are rejected by default
Take care:
Leading wildcards might slow your search down.
If you need a more specific solution, I would suggest going with stemming. It is really a matter of finding the right analyzer.
There are stemming implementations for several languages e.g. the SnowballAnalyzer (http://lucene.apache.org/core/3_6_0/api/all/org/apache/lucene/analysis/snowball/SnowballAnalyzer.html).
Getting associations by looking at the word is not going to scale to other words. For example, you cannot know "whopper" is associated with burger and "big-mac" is associated with cheese just by looking at the words. To make the search aware of the associations, you probably need a database of associations like "A is a B" or "A contains B". (As Mikos has mentioned, I think WordNet provides such a database.) Then, when you see B in a query, you translate the query so that it also searches for A.
I think the underlying question is -- how big is the collection you are indexing? If you are indexing some collection where all of the synonyms and related words are already known, then the index can just include the synonyms and related words directly, like 'cheeseburger' including the related words 'cheese' and 'burger'. (An approach successfully used in the LOINC standard medical terms Lucene index.)
If you are trying to solve the general problem for a whole human language (English, Chinese, etc.) then you have to move to some kind of semantic analysis as mentioned above.
It might be useful to talk with the subject matter experts of the area you are indexing to see how they search for terms -- what synonyms/related words do they use, do they have defined lists of synonyms/related words, do they need/use stemming, etc. This should give you some idea as to which approach (direct synonym/related-word inclusion or semantic analysis) you need to pursue.

What is the easiest way to implement terms association mining in Solr?

Association mining seems to give good results for retrieving related terms in text corpora. There are several works on this topic, including the well-known LSA method. The most straightforward way to mine associations is to build a docs x terms co-occurrence matrix and find the terms that occur in the same documents most often. In my previous projects I implemented it directly in Lucene by iterating over TermDocs (which I got by calling IndexReader.termDocs(Term)). But I can't see anything similar in Solr.
So, my needs are:
To retrieve the most associated terms within particular field.
To retrieve the term, that is closest to the specified one within particular field.
I will rate answers in the following way:
Ideally I would like to find Solr's component that directly covers specified needs, that is, something to get associated terms directly.
If this is not possible, I'm seeking for the way to get co-occurrence matrix information for specified field.
If this is not an option too, I would like to know the most straightforward way to 1) get all terms and 2) get ids (numbers) of documents these terms occur in.
You can export a Lucene (or Solr) index to Mahout, and then use Latent Dirichlet Allocation. If LDA is not close enough to LSA for your needs, you can just take the correlation matrix from Mahout, and then use Mahout to take the singular value decomposition.
I don't know of any LSA components for Solr.
Since there are still no answers to my questions, I will write up my own thoughts and accept them. Nevertheless, if someone proposes a better solution, I'll happily accept it instead of mine.
I'll go with the co-occurrence matrix, since it is the core part of association mining. In general, Solr provides all the functions needed to build this matrix in some way, though they are not as efficient as direct access with Lucene. To construct the matrix we need:
All terms or at least the most frequent ones, because rare terms won't affect result of association mining by their nature.
Documents where these terms occur, again, at least top documents.
Both these tasks may be easily done with standard Solr components.
To retrieve terms TermsComponent or faceted search may be used. We can get only top terms (by default) or all terms (by setting max number of terms to take, see documentation of particular feature for details).
Getting the documents that contain a given term is simply a search for that term. The weak point here is that we need one request per term, and there may be thousands of terms. Another weak point is that neither simple nor faceted search provides the number of occurrences of the term in each found document.
With this in hand, it is easy to build the co-occurrence matrix. To mine associations, it is possible to use other software such as Weka or to write your own implementation of, say, the Apriori algorithm.
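Once the terms and the ids of the documents they occur in have been fetched, the matrix itself is small to build. A minimal sketch, assuming the Solr results have already been collected into a dict mapping each term to a set of document ids (the sample data is invented):

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(term_to_docs):
    """Count, for every pair of terms, the documents they share.

    term_to_docs maps a term to the set of ids of documents containing it.
    Returns a Counter keyed by alphabetically ordered (term_a, term_b).
    """
    counts = Counter()
    for a, b in combinations(sorted(term_to_docs), 2):
        shared = len(term_to_docs[a] & term_to_docs[b])
        if shared:
            counts[(a, b)] = shared
    return counts


def most_associated(term, counts, top_n=3):
    """Return up to top_n (other_term, shared_docs) pairs for `term`."""
    scored = []
    for (a, b), shared in counts.items():
        if term == a:
            scored.append((b, shared))
        elif term == b:
            scored.append((a, shared))
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))[:top_n]


if __name__ == "__main__":
    postings = {"burger": {1, 2, 3}, "cheese": {2, 3}, "salad": {3, 4}}
    counts = cooccurrence_counts(postings)
    print(most_associated("burger", counts))
```

Raw shared-document counts favor frequent terms, so a real implementation would probably normalize them (for example by each term's document frequency) before ranking.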
You can get the number of occurrences of the current term in each found document with the following query:
http://ip:port/solr/someinstance/select?defType=func&fl=termfreq(field,xxx),*&fq={!frange l=1}termfreq(field,xxx)&indent=on&q=termfreq(field,xxx)&sort=termfreq(field,xxx) desc&wt=json
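The URL above contains spaces and braces, so it needs encoding before it is sent. A small sketch that builds the same request with the standard library; the host, port and core name in the usage example are placeholders, and the term is assumed to be a plain word that needs no quoting inside the function query:

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def termfreq_query_url(base_url, field, term):
    """Build a select URL that returns and sorts by termfreq(field, term)."""
    tf = f"termfreq({field},{term})"
    params = {
        "defType": "func",
        "q": tf,
        "fq": "{!frange l=1}" + tf,  # keep only documents containing the term
        "fl": tf + ",*",
        "sort": tf + " desc",
        "wt": "json",
        "indent": "on",
    }
    return base_url.rstrip("/") + "/select?" + urlencode(params)


if __name__ == "__main__":
    # Placeholder host and core name.
    url = termfreq_query_url("http://localhost:8983/solr/someinstance", "text", "burger")
    print(url)
    print(parse_qs(urlsplit(url).query)["fq"])
```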

Searching Natural Language Sentence Structure

What's the best way to store and search a database of natural language sentence structure trees?
Using OpenNLP's English Treebank Parser, I can get fairly reliable sentence structure parsings for arbitrary sentences. What I'd like to do is create a tool that can extract all the doc strings from my source code, generate these trees for all sentences in the doc strings, store these trees and their associated function name in a database, and then allow a user to search the database using natural language queries.
So, given the sentence "This uploads files to a remote machine." for the function upload_files(), I'd have the tree:
(TOP
  (S
    (NP (DT This))
    (VP
      (VBZ uploads)
      (NP (NNS files))
      (PP (TO to) (NP (DT a) (JJ remote) (NN machine))))
    (. .)))
If someone entered the query "How can I upload files?", equating to the tree:
(TOP
  (SBARQ
    (WHADVP (WRB How))
    (SQ (MD can) (NP (PRP I)) (VP (VB upload) (NP (NNS files))))
    (. ?)))
how would I store and query these trees in a SQL database?
I've written a simple proof-of-concept script that can perform this search using a mix of regular expressions and network graph parsing, but I'm not sure how I'd implement this in a scalable way.
And yes, I realize my example would be trivial to retrieve using a simple keyword search. The idea I'm trying to test is how I might take advantage of grammatical structure, so I can weed-out entries with similar keywords, but a different sentence structure. For example, with the above query, I wouldn't want to retrieve the entry associated with the sentence "Checks a remote machine to find a user that uploads files." which has similar keywords, but is obviously describing a completely different behavior.
Relational databases cannot store knowledge in a natural way. What you actually need is a knowledge base or ontology (though it may be built on top of a relational database). It holds data in triplets <subject, predicate, object>, so your phrase will be stored as <upload_file(), upload, file>. There are a lot of tools and methods for searching inside such KBs (for example, Prolog is a language that was designed to do it). So, all you have to do is translate sentences from natural language into KB triplets or an ontology graph, translate the user query into incomplete triplets (your question will look like <?, upload, file>) or conjunctive queries, and then search your KB. OpenNLP will help you with the translation, and the rest depends on the concrete techniques and technologies you decide to use.
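The incomplete-triplet lookup can be sketched in a few lines, with an in-memory list standing in for a real knowledge base. The `Triple` type, the `match` helper and the second entry are invented for the example; `None` plays the role of the `?` in <?, upload, file>.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


def match(kb, subject=None, predicate=None, obj=None):
    """Return every triple that agrees with the non-None parts of the pattern."""
    pattern = (subject, predicate, obj)
    return [
        triple
        for triple in kb
        if all(want is None or want == have for want, have in zip(pattern, triple))
    ]


if __name__ == "__main__":
    kb = [
        Triple("upload_file()", "upload", "file"),
        Triple("check_machine()", "check", "machine"),  # hypothetical entry
    ]
    # <?, upload, file>: which function uploads files?
    print([t.subject for t in match(kb, predicate="upload", obj="file")])
```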
I agree with ffriend that you need to take a different approach that builds on existing work on knowledge bases and natural language search. Storing context-free parse trees in a relational database isn't the problem, but it is going to be very difficult to do a meaningful comparison of parse trees as part of a search. When you are just interested in taking advantage of a little knowledge about grammatical relations, parse trees are really too complicated. If you simplify the parse into dependency triples, you can make the search problem much easier and get at the grammatical relations you were interested in in the first place. For instance, you could use the Stanford dependency parser, which generates a context-free parse and then extracts dependency triples from it. It produces output like this for "This function uploads files to a remote machine":
det(function-2, This-1)
nsubj(uploads-3, function-2)
dobj(uploads-3, files-4)
det(machine-8, a-6)
amod(machine-8, remote-7)
prep_to(uploads-3, machine-8)
In your database, you could store a simplified subset of these triples associated with the function, e.g.:
upload_file(): subj(uploads, function)
upload_file(): obj(uploads, file)
upload_file(): prep(uploads, machine)
When people search, you can find the function that has the most overlapping triples or something along those lines, where you probably also want to weight the different dependency relations or allow partial matches, etc. You probably also want to reduce the words in the triples to lemmas, maybe POS depending on what you need.
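A minimal sketch of the "most overlapping triples" idea, assuming the triples have already been extracted and lemmatized. The relation weights, the `find_uploader()` entry and the choice to store only triples attached to each sentence's main verb are assumptions made for the example, not part of any parser's output.

```python
# Assumed weights: matching the object counts more than other relations.
RELATION_WEIGHTS = {"obj": 2.0, "subj": 1.0, "prep": 1.0}


def score(query_triples, doc_triples, weights=RELATION_WEIGHTS):
    """Sum the weights of the triples shared by the query and one entry."""
    shared = set(query_triples) & set(doc_triples)
    return sum(weights.get(relation, 1.0) for relation, _head, _dep in shared)


def search(query_triples, index):
    """Return function names with a non-zero score, best match first."""
    ranked = [(score(query_triples, triples), name) for name, triples in index.items()]
    return [name for s, name in sorted(ranked, key=lambda p: (-p[0], p[1])) if s > 0]


if __name__ == "__main__":
    index = {
        "upload_file()": {
            ("subj", "upload", "function"),
            ("obj", "upload", "file"),
            ("prep", "upload", "machine"),
        },
        # Hypothetical entry for "Checks a remote machine to find a user ...";
        # only main-verb triples are kept here.
        "find_uploader()": {("obj", "check", "machine")},
    }
    query = {("subj", "upload", "i"), ("obj", "upload", "file")}  # "How can I upload files?"
    print(search(query, index))
```

If triples from relative clauses were stored as well, the second entry would also contain obj(upload, file), so how much of each sentence is kept matters as much as the scoring.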
There are plenty of people who have worked on natural language search (like Powerset), so be sure to search for existing approaches. My proposed approach here is really minimal and I can think of tons of examples where it will have problems, but I think something along these lines could work reasonably well for a restricted domain.
This is not a complete answer, but if you want to perform linguistically sophisticated queries on your trees, the best bet is to pre-process your parser output and search it with tgrep2.
Tgrep/tgrep2 are, as far as I know, the most flexible and full-featured packages for searching parse trees. This is not a MySQL-based solution as you requested, but I thought you might be interested to know about this option.
Tgrep2 allows you to ask questions about parents, descendants and siblings, whereas other solutions would not retain the full tree structure of the parse or allow such sophisticated queries.

What are the things should we consider while writing a Spell Checker?

I want to write a very simple Spell Checker. The spell checker will try to match the input word with equivalent words from the dictionary.
What can be done to find those 'equivalent words'? What analysis can be performed on two words to mark them as equivalent?
Before investing too much in trying to unravel that, I'd first look at existing implementations like Aspell or netspell, for two main reasons:
Not much point in re-inventing the wheel. Spell checking is much trickier than it first appears and it makes sense to build on work that has already been done
If your interest is finding out how to do it, the source code and community will be a great benefit should you decide to implement your own anyway
Much depends on your use case. For example:
Is your dictionary very small (about twenty words)? In this case it probably is better to precompute all possible nearby mistaken words and use a table/hash lookup.
What is your error model? Aspell has at least two (one for spelling errors caused by nearby letters on the keyboard, and the other for spelling errors caused by the way a word sounds).
How dynamic is your dictionary? Can you afford to do a massive preparation in order to get an efficient retrieval?
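For the small-dictionary case mentioned in the list above, precomputing nearby misspellings could look like the sketch below. It generates every string one edit away from each dictionary word (the same four operations described in Peter Norvig's article: delete, transpose, replace, insert) and stores them in a hash table. The function names and the lowercase ASCII alphabet are choices made for this example.

```python
import string


def edits1(word, alphabet=string.ascii_lowercase):
    """All strings exactly one delete, transpose, replace or insert away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    ]
    replaces = [left + c + right[1:] for left, right in splits if right for c in alphabet]
    inserts = [left + c + right for left, right in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)


def build_lookup(dictionary):
    """Map every one-edit variant (and each word itself) to its source words."""
    table = {}
    for word in dictionary:
        for variant in edits1(word) | {word}:
            table.setdefault(variant, set()).add(word)
    return table


if __name__ == "__main__":
    table = build_lookup(["receive", "believe"])
    print(table.get("recieve"))
    print(table.get("beleive"))
```

The table grows with roughly 54n + 25 entries per word of length n before deduplication, which is why this only suits very small dictionaries.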
You may need a "word equivalence" measure like Double Metaphone, in addition to edit distance.
You can get some feel by reading Peter Norvig's great description of spelling correction.
And, of course, whenever possible, steal code. Do not reinvent the wheel without a reason - a reason could be a very special domain, a special way your users make spelling mistakes, or just to learn how it's done.
Edit Distance is the theory you need to write a spell checker. You also need a dictionary. Most UNIX systems come with a dictionary already installed for your locale.
I just finished implementing a spell checker and used a combination of the following to get a list of "suggested" words:
Phonetic hashing of the "misspelled" word to lookup a hash of identical dictionary hashed real words (for java check out Apache Commons Codec for a suitable library). The phonetic hash of your dictionary file can be precomputed.
Edit distance between the input and the potentials (this is reasonably expensive so you need to reduce the list first with something like a phonetic hash, assuming a higher volume load - in my case, a server based spell check)
A known list of common misspellings, e.g. recieve vs. receive.
An ordered list of the most common words in the English language
Essentially I weighted each potential word primarily based on edit-distance and commonality. e.g. if word probability is a percentage, then
weight = edit-distance * 100 / probability
(lower weights are better)
But then I also override any result with the known common misspellings (i.e. these always float to the top of the suggested results).
There may be better ways, but this worked pretty well.
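The weighting and override described above can be sketched as follows. The candidate tuples (word, edit distance, probability in percent) and their numbers are invented for illustration, and the function names are mine rather than anything from the original implementation.

```python
def weight(distance, probability_percent):
    """weight = edit-distance * 100 / probability (lower is better)."""
    return distance * 100 / probability_percent


def rank_suggestions(word, candidates, known_misspellings):
    """Order candidates by weight, then move a known correction to the top.

    candidates is a list of (candidate, edit_distance, probability_percent).
    known_misspellings maps a misspelling to its correct form.
    """
    ranked = sorted(candidates, key=lambda c: (weight(c[1], c[2]), c[0]))
    words = [candidate for candidate, _distance, _probability in ranked]
    override = known_misspellings.get(word)
    if override is not None:
        words = [override] + [w for w in words if w != override]
    return words


if __name__ == "__main__":
    # Made-up distances and probabilities.
    candidates = [("ten", 1, 4.0), ("the", 2, 5.0)]
    print(rank_suggestions("teh", candidates, {}))
    print(rank_suggestions("teh", candidates, {"teh": "the"}))
```

Without the override, a close but less likely word can outrank the intended one, which is presumably why the known-misspellings list is applied last.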
You may also wish to ignore ALL CAPS words, initials etc, so choosing what to ignore is also something to think about.
Under Linux/Unix you have ispell. Why reinvent the wheel?