I would like a Lucene document that contains the word cheeseburger to come up when a user searches for burger. I see that I will probably need a custom analyzer to break this compound word into cheese and burger. However, breaking up words may also bring irrelevant results.
For example: if, when indexing production, we also index product and ion, then when the user searches for ion, documents containing production will come up, which is not relevant.
So a simple word breaker won't cut it. I need a way of knowing that cheeseburger is associated with burger and cheese, but that production is not associated with ion.
Is there a more intelligent process to achieve this?
Does this have a name, the way stemming is the name for reducing words to their root form?
Depending on how accurate you want your synonymy to be, you might need to look into approaches such as Latent Semantic Analysis (LSA) and its variants (LDA, etc.). A simpler approach would be to use an ontology such as WordNet to augment your searches. A WordNet Lucene index is available. However, if your scenario includes domain-specific vocabulary, then you might need to generate a "mapping" ontology.
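For example, with a recent Lucene version you could load the WordNet prolog synonym file (wn_s.pl) into a SynonymMap and apply it in a custom analyzer. This is only a sketch; the file location and the analyzer wiring are assumptions:

import java.io.FileReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.synonym.SynonymGraphFilter;
import org.apache.lucene.analysis.synonym.SynonymMap;
import org.apache.lucene.analysis.synonym.WordnetSynonymParser;

public class WordnetSynonymAnalyzer {
    public static Analyzer build() throws Exception {
        // Parse the WordNet prolog synonym file into a SynonymMap (file name is an assumption).
        WordnetSynonymParser parser = new WordnetSynonymParser(true, true, new StandardAnalyzer());
        parser.parse(new FileReader("wn_s.pl"));
        final SynonymMap wordnetMap = parser.build();

        // Analyzer that expands each term with its WordNet synonyms at analysis time.
        return new Analyzer() {
            @Override
            protected TokenStreamComponents createComponents(String fieldName) {
                StandardTokenizer tokenizer = new StandardTokenizer();
                TokenStream stream = new LowerCaseFilter(tokenizer);
                stream = new SynonymGraphFilter(stream, wordnetMap, true);
                return new TokenStreamComponents(tokenizer, stream);
            }
        };
    }
}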
You should look at DictionaryCompoundWordTokenFilter, which uses a brute-force algorithm to split compound nouns based on a dictionary.
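A minimal sketch of wiring it into a custom analyzer with a recent Lucene version, assuming a small hand-built dictionary of word parts; because production's parts would not be in the dictionary, it would be left alone:

import java.util.Arrays;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.compound.DictionaryCompoundWordTokenFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class CompoundWordAnalyzer extends Analyzer {
    // Only parts listed here are ever split off, so "cheeseburger" also yields
    // "cheese" and "burger", while "production" stays intact as long as
    // "ion" and "product" are not in the dictionary.
    private final CharArraySet dictionary =
            new CharArraySet(Arrays.asList("cheese", "burger"), true);

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        StandardTokenizer tokenizer = new StandardTokenizer();
        TokenStream stream = new LowerCaseFilter(tokenizer);
        stream = new DictionaryCompoundWordTokenFilter(stream, dictionary);
        return new TokenStreamComponents(tokenizer, stream);
    }
}

Using this analyzer at index time means a search for burger matches documents containing cheeseburger, while a search for ion never matches production.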
In most cases you can simply use wildcard queries with a leading wildcard, e.g. *burger. You only have to enable support for leading wildcards on your query parser:
QueryParser parser = new MultiFieldQueryParser(Version.LUCENE_36, searchedAttributes, analyzer);
parser.setAllowLeadingWildcard(true);
Take care:
Leading wildcards might slow your search down.
If you need a more specific solution, I would suggest going with stemming. It's really a matter of finding the right analyzer.
There are stemming implementations for several languages e.g. the SnowballAnalyzer (http://lucene.apache.org/core/3_6_0/api/all/org/apache/lucene/analysis/snowball/SnowballAnalyzer.html).
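For example, with the Lucene 3.6 API linked above you could plug the English Snowball stemmer in wherever the analyzer is passed (a sketch only):

import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
import org.apache.lucene.util.Version;

// English Snowball stemmer: e.g. "searches", "searched" and "searching"
// all reduce to the stem "search" at both index and query time.
SnowballAnalyzer analyzer = new SnowballAnalyzer(Version.LUCENE_36, "English");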
Best regards,
Chris
Getting associations by looking at the word is not going to scale to other words. For example, you cannot know "whopper" is associated with burger and "big-mac" is associated with cheese just by looking at the words. To make the search aware of the associations, you probably need a database of associations like "A is a B" or "A contains B". (As Mikos has mentioned, I think WordNet provides such a database.) Then, when you see B in a query, you translate the query so that it also searches for A.
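A minimal sketch of that query-time translation using Lucene's synonym support, assuming a recent Lucene version and a hand-built association list:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.synonym.SynonymGraphFilter;
import org.apache.lucene.analysis.synonym.SynonymMap;
import org.apache.lucene.util.CharsRef;

public class AssociationQueryAnalyzer extends Analyzer {
    private final SynonymMap associations;

    public AssociationQueryAnalyzer() throws Exception {
        // Hand-built association database: "a whopper is a burger",
        // "a cheeseburger contains a burger". Seeing "burger" in a query
        // therefore also searches for the associated terms.
        SynonymMap.Builder builder = new SynonymMap.Builder(true);
        builder.add(new CharsRef("burger"), new CharsRef("whopper"), true);
        builder.add(new CharsRef("burger"), new CharsRef("cheeseburger"), true);
        associations = builder.build();
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        StandardTokenizer tokenizer = new StandardTokenizer();
        TokenStream stream = new SynonymGraphFilter(tokenizer, associations, true);
        return new TokenStreamComponents(tokenizer, stream);
    }
}

Passing an analyzer like this to the query parser makes the expansion happen at query time, so the index itself stays unchanged.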
I think the underlying question is -- how big is the collection you are indexing? If you are indexing some collection where all of the synonyms and related words are already known, then the index can just include the synonyms and related words directly, like 'cheeseburger' including the related words 'cheese' and 'burger'. (An approach successfully used in the LOINC standard medical terms Lucene index.)
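A sketch of that kind of direct inclusion at indexing time, using the current Lucene API; the field names and the source of the related words are assumptions for illustration:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;

// Index the original term plus its known related words in the same document,
// so a later search for "burger" or "cheese" also hits the "cheeseburger" entry.
void indexTerm(IndexWriter writer, String term, String[] relatedWords) throws Exception {
    Document doc = new Document();
    doc.add(new StringField("term", term, Field.Store.YES));          // e.g. "cheeseburger"
    for (String related : relatedWords) {
        doc.add(new TextField("related", related, Field.Store.YES));  // e.g. "cheese", "burger"
    }
    writer.addDocument(doc);
}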
If you are trying to solve the general problem for a whole human language (English, Chinese, etc.) then you have to move to some kind of semantic analysis as mentioned above.
It might be useful to talk with the subject matter experts of the area you are indexing to see how they search for terms -- what synonyms/related words do they use, do they have defined lists of synonyms/related words, do they need/use stemming, etc. This should give you some idea as to which approach (direct synonym/related-word inclusion or semantic analysis) you need to pursue.
Our clients query our Azure Search index, mostly for people's names. We are using the Lucene analyzer for all of our fields. We build the query string by turning the client's input name into a phrase and adding a proximity of 3. Because we search using a phrase, we cannot use the Fuzzy Search capability of the Lucene analyzer, as it only works on single words.
We were therefore in search of a solution for being able to bring back results with names that weren't spelled exactly as the client input them. We came across the phonetic analyzer, and have just implemented the Metaphone algorithm into our index. We've run some tests and while it gets us closer to what we need, we still see some issues:
The analyzer's scope is so wide that it's bringing back a lot of false positives. For example, when searching on Kenneth Gooden, it brings back Kenneth Cotton. That's just a little too far to be considered phonetically similar, in our opinion. Can the sensitivity be tweaked in any way, or can something be done to boost some other parameter to remedy this?
When doing a search on Barry Soper, the first and highest-scored result that comes back is "Barry Spear." The second result, scored lower, is "Soper, Barry Russell." To a certain extent, I can maybe see why it's scored that way (b/c of the 2nd one being last name first) but then... not really. The 2nd result contains both exact terms within the required proximity. Maybe Azure Search gives priority to the order of words in the phrase before applying the analyzer? Still doesn't make sense to me. (Side note - this query also brings back "Barh Super" - see issue #1 above)
I would like to know if someone could offer suggestions to tweak Azure Search's behavior to work more along the lines of what we need, OR perhaps suggest an alternative to the phonetic analyzer. We haven't tried any of the other available phonetic algorithms yet, only because it seems Metaphone is the best and most commonly used. But we're open to suggestions regarding the other algorithms as well.
Thanks.
You are correct that the fuzzy operator only works on single terms. In this case, you can use a custom analyzer (phonetic tokenfilter) or the Synonyms feature (in preview). I am not sure what you meant by "we have just implemented the Metaphone algorithm into our index", but there are several phonetic tokenfilters you can choose from in the Azure Search custom analysis stack. Synonyms is a newer feature only available in preview; you can take a look here. For synonyms, you will need to define synonym rules, say 'Nate, Nathan, Nathaniel' for example, and at query time, searching for one automatically includes the results for the others.
Okay, then how should I use these building blocks in a way to control relevance for my search? One way to model this is to use a separate field for each expansion strategy. For example, instead of a single field for the name, you can have three fields, say 'name', 'name_synonym', and 'name_phonetic'. The first field 'name' is for exact matches, the 'name_synonym' field has synonyms enabled, and the third uses a phonetic analyzer and broadens the search the most. You can then use a scoring profile to boost scores from matches in each field. You can give a boost value of 10 for exact matches, 5 for synonyms and 1 for phonetic expansions, for example. Your search will be issued against these three internal fields.
Regarding your question as to why 'Soper, Barry Russell' is ranked lower than 'Barry Spear': after phonetic analysis, the words 'soper' and 'spear' reduce to the same form both at indexing and query time and are treated as if they were identical terms. In computing the score and ranking, the search engine uses the analyzed form of the terms, and phonetic similarity has no influence on the score. That's why secondary factors, like field length, play a more significant role in the relevance score.
Hope this helps. I provided one example of how to model this, but you could also take a look at term boosting in the full Lucene query syntax.
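For example, with queryType=full and the three hypothetical fields above, the boosts could also be expressed directly in the query string (a sketch; the boost values are arbitrary):

name:"barry soper"^10 OR name_synonym:"barry soper"^5 OR name_phonetic:"barry soper"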
Let me know if you have any additional questions.
Nate
I am trying to design a database with search-ability at its core. My knowledge of database design and SQL is all self-taught and still fairly beginner-level, so my questions may possibly have easy answers.
Suppose I have a single table containing a large number of records. For example, suppose that each record contains details of a different computer application (name, developer, version number, etc). A list of keywords is associated with each record, such as a list of the programming languages used to write the applications.
I wish to be able to enter one or more keywords (each separated by a space) into a search box, and I wish to have all associated records returned. How should I design the database to store the keywords, and what SQL query would I need to apply to the search text? (The search should be uppercase/lowercase independent.)
My next challenge would then be to order search results by relevance, and to allow entire key-phrases as well as keywords to be associated with each record. For example, if I type "Visual Basic" into the search field, I want the first results to have exactly the key-phrase "Visual Basic" associated with them. The next results should all have both keywords "Visual" and "Basic" associated with them, and the remaining results should have only one of these keywords. Again, please could anyone advise on how to implement this?
The final challenge I believe would be much harder: how much 'intelligent interpretation' can I design my database and SQL code to handle? For example, if I search for "CSS", can I get the records with the key-phrase "Cascading Style Sheets" to appear? Can I also get SQL to identify and search for similar words, such as plurals of search phrases or, for example, "programmer" or "programming" when "program" is input? Thanks!
Learn relational algebra, normalization rules, and SQL.
Start with entity relationships. Sounds like you could have an APPLICATION table as parent for a FEATURE child table, with a one-to-many relationship between the two. You'll query them by JOINing one to the other:
SELECT A.NAME AS APPLICATION_NAME, F.NAME AS FEATURE_NAME
FROM APPLICATION AS A
JOIN FEATURE AS F
ON F.APP_ID = A.ID
Your challenges would not suggest SQL and relations to me. I would think more in terms of a parser, an indexer and search engine like Lucene, and a NoSQL document database like MongoDB.
I've come to the conclusion, after a LOT of research, that #duffymo's answer is hinting in the right direction. For the benefit of other n00bs like me, here's the conclusion I've drawn:
Many open source search engine server apps are out there to install for free. Lucene was the first one I had ever heard of, but others do exist, and I think my favourite at the moment is Sphinx. As far as I can tell, the 'indexer' that #duffymo mentions is built into it. I have learnt that the indexer is the program that examines my database for keywords and automatically keeps a record of which results should be returned for different input queries. I have also now learnt that the terminology for the behaviour I was looking for (and which Sphinx has) is 'stemming'. I'm still not sure what role a parser plays in all this...
A more basic approach would be to use SQL itself. Whilst I was already aware of the most basic of these (i.e. using the LIKE keyword with wildcards), I also discovered something a little more powerful: natural language / full-text search. For anyone not interested in installing a server app, I recommend you look this up.
Also, I see no reason why I would need to use NoSQL instead of SQL (as #duffymo has suggested), so I'm going to stick with SQL for the moment (at least until I come across some good entry-level books to learn NoSQL from). Furthermore, I have very little intention of learning relational algebra until I know why I should and how it would be useful. The message here is that other beginners shouldn't be put off by these things, as I don't think Sphinx requires any knowledge of them.
While I like #duffymo's answer, I will also suggest you research SPARQL and the WordNet project for your semantic equivalence questions.
If you choose Oracle, you can use the Spatial option's triple store to implement a SPARQL endpoint and do some very nice searching, like your CSS = Cascading Style Sheets example.
I am using Elasticsearch and I want to set up basic stemming for English. So basically, searching for fighter returns fight or any word that contains the fight root.
I am a little confused about how to implement this. I was reading through the analyzers, tokenizers and filters, and there are multiple stemming algorithms that can be used in Elasticsearch. I am just not sure which combination to use - snowball, stemmer, porter stem or synonym filters.
Also, an example of the mapping would be really helpful.
Please mind the difference between stemming and lemmatisation. A stemming algorithm applies a series of rules (and/or dictionary lookups, as is the case e.g. for KStem) and doesn't guarantee that the result will be a proper linguistic 'root' (i.e. lemma).
So, for instance, both 'marinate' and 'marines' will be converted to 'marin' by the Porter stemmer, which is considered quite an 'aggressive' one -- it tends to produce the same stem for a big number of words. There are more conservative ones, for example the S-Stemmer, which only converts plural to singular forms (org.apache.lucene.analysis.en.EnglishMinimalStemFilter).
Comparisons of stemming methods found in research papers seem to favor KStem as the most effective for English texts, but the choice of stemmer highly depends on the vocabulary of your documents. You don't aim to optimize stemmer performance, but rather the performance of the search engine, so measuring it in isolation from the other elements of your system (especially query expansion) is not a good idea in practice.
The best solution is to try a number of the different stemmers that are available in Elasticsearch (an example mapping can be seen here) and observe the precision and recall of the results. If you don't have a test suite of queries, then your best bet is to perform 'typical' queries and watch out for 'strange' results (effects of the stemmer being too aggressive) or 'good' results being omitted (the stemmer being too conservative).
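For the mapping question above, a minimal sketch of an index definition with an English stemming analyzer might look like this (field and analyzer names are made up; adjust the syntax to your Elasticsearch version):

{
  "settings": {
    "analysis": {
      "filter": {
        "english_stemmer": { "type": "stemmer", "language": "english" }
      },
      "analyzer": {
        "english_stemming": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "english_stemmer"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": { "type": "text", "analyzer": "english_stemming" }
    }
  }
}

Swapping the filter's "language" value (or using a "snowball" or "kstem" filter instead) lets you compare the different stemmers against the same queries.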
For a toy project, I want to implement an automated question answering system with Lucene and I'm trying to figure out a reasonable way to implement it. The basic operation is as follows:
1) The user will enter a question.
2) The system will identify the keywords in the question.
3) The keywords will be searched in a large knowledgebase and matching sentences will be shown as answers.
My knowledgebase (i.e., corpus) is not structured. It is just a large, continuous text (say, a user manual without any chapters). I mean that the only structure is that sentences and paragraphs are identified.
I plan to treat each sentence or paragraph as a separate document. To present the answer in context, I may consider keeping one sentence/paragraph before/after the indexed one as a payload. I would like to know if that makes sense. Also, I'm wondering if there are other tried and well-known approaches for that kind of system. As an example, another approach that comes to mind is to index large chunks of the corpus as documents with the token positions, then process the vicinity of found keywords to construct my answers.
I would appreciate direct recommendations based on experience or intuition, but also tutorials or introductory materials to question-answering systems with Lucene in mind.
Thanks.
It's not an unreasonable approach to take.
One enhancement you might consider is incorporating learning feedback, so that you can continually improve the scoring of content vs search terms. To do this you would ask users to rate the answers that come back ('helpful vs unhelpful'), that way you can start to rank documents against keywords based on the historical data. You could classify potential documents as helpful/unhelpful for given keywords by using a simple Bayesian classifier.
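As a very small sketch of that feedback loop (a Laplace-smoothed helpful/unhelpful ratio standing in for a full Bayesian classifier; the vote logging is assumed to exist on your side):

import java.util.HashMap;
import java.util.Map;

// Tracks helpful/unhelpful votes per (keyword, documentId) pair and turns them
// into a smoothed "probability of being helpful" usable as a boost factor.
public class FeedbackScorer {
    private final Map<String, int[]> votes = new HashMap<>(); // {helpful, unhelpful} counts

    public void recordVote(String keyword, String docId, boolean helpful) {
        int[] counts = votes.computeIfAbsent(keyword + "|" + docId, k -> new int[2]);
        counts[helpful ? 0 : 1]++;
    }

    // Laplace-smoothed estimate; returns 0.5 until any feedback has been collected.
    public double probabilityHelpful(String keyword, String docId) {
        int[] counts = votes.getOrDefault(keyword + "|" + docId, new int[2]);
        return (counts[0] + 1.0) / (counts[0] + counts[1] + 2.0);
    }
}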
Indexing each sentence as a document will give you some problems. You've pointed out one: you would need to store the surrounding text as payloads. That means you'll need to store each sentence three times (as itself, and again as the before/after context of its neighbours), and you'll have to manually dig into the payload.
If you want to go the route of each sentence being a document, I would recommend coming up with an ID for each sentence and storing that as a separate field. Then you can display [ID-1, ID, ID+1] in each result.
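A sketch of that ID scheme with the current Lucene API (field names are made up):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

// Index each sentence once, with its position stored as a separate keyword field.
void indexSentence(IndexWriter writer, int id, String sentence) throws Exception {
    Document doc = new Document();
    doc.add(new StringField("sentence_id", Integer.toString(id), Field.Store.YES));
    doc.add(new TextField("text", sentence, Field.Store.YES));
    writer.addDocument(doc);
}

// At display time, fetch ID-1 and ID+1 to show the hit in context.
TopDocs neighbour(IndexSearcher searcher, int id) throws Exception {
    return searcher.search(new TermQuery(new Term("sentence_id", Integer.toString(id))), 1);
}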
The bigger question though is: how should you break up the text into documents? Identifying semantically related areas seems difficult, so doing it by sentence/paragraph might be the only way to go. A better way would be if you could find which text is the header of a section, and then put everything in that section as a document.
You might also want to use the index (if your corpus has one). The terms there could be boosted, as they are presumably more important.
Instead of Lucene, which does text indexing, search and retrieval, I think using something like Apache Mahout would help with this. Mahout treats text as knowledge, which makes answering the question better than just text matching. Mahout is a machine learning and data mining framework which fits this domain better. Just a very high-level thought.
--Sai
I want to write a very simple spell checker. The spell checker will try to match the input word with equivalent words from the dictionary.
What can be done to find those 'equivalent words'? What analysis can be performed on two words to mark them equivalent?
Before investing too much in trying to unravel that, I'd first look at already existing implementations like Aspell or NetSpell, for two main reasons:
Not much point in re-inventing the wheel. Spell checking is much trickier than it first appears and it makes sense to build on work that has already been done
If your interest is finding out how to do it, the source code and community will be a great benefit should you decide to implement your own anyway
Much depends on your use case. For example:
Is your dictionary very small (about twenty words)? In this case it probably is better to precompute all possible nearby mistaken words and use a table/hash lookup.
What is your error model? Aspell has at least two (one for spelling errors caused by nearby letters on the keyboard, and the other for spelling errors caused by the way a word sounds).
How dynamic is your dictionary? Can you afford to do a massive preparation in order to get an efficient retrieval?
You may need a "word equivalence" measure like Double Metaphone, in addition to edit distance.
You can get some feel by reading Peter Norvig's great description of spelling correction.
And, of course, whenever possible, steal code. Do not reinvent the wheel without a reason - a reason could be a very special domain, a special way your users make spelling mistakes, or just to learn how it's done.
Edit Distance is the theory you need to write a spell checker. You also need a dictionary. Most UNIX systems come with a dictionary already installed for your locale.
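For reference, the usual dynamic-programming formulation of Levenshtein (edit) distance:

// Number of single-character insertions, deletions and substitutions needed
// to turn one word into the other; e.g. "recieve" -> "receive" has distance 2.
static int editDistance(String a, String b) {
    int[][] d = new int[a.length() + 1][b.length() + 1];
    for (int i = 0; i <= a.length(); i++) d[i][0] = i;
    for (int j = 0; j <= b.length(); j++) d[0][j] = j;
    for (int i = 1; i <= a.length(); i++) {
        for (int j = 1; j <= b.length(); j++) {
            int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
            d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1), d[i - 1][j - 1] + cost);
        }
    }
    return d[a.length()][b.length()];
}

Suggestions are then the dictionary words with the smallest distance to the misspelled input.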
I just finished implementing a spell checker and used a combination of the following in getting a list of "suggested" words
Phonetic hashing of the "misspelled" word, to look up dictionary words with an identical phonetic hash (for Java, check out Apache Commons Codec for a suitable library). The phonetic hashes of your dictionary file can be precomputed.
Edit distance between the input and the potentials (this is reasonably expensive, so you need to reduce the list first with something like a phonetic hash, assuming a higher-volume load - in my case, a server-based spell check)
A known list of common misspellings, e.g. recieve vs. receive.
An ordered list of the most common words in the English language
Essentially I weighted each potential word primarily based on edit-distance and commonality. e.g. if word probability is a percentage, then
weight = edit-distance * 100 / probability
(lower weights are better)
But then I also override any result with the known common misspellings (i.e. these always float to the top as the suggested result).
There may be better ways, but this worked pretty well.
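A rough sketch of that weighting, assuming the phonetic buckets were precomputed with Apache Commons Codec's DoubleMetaphone, that wordProbability holds each word's relative frequency as a percentage, and reusing an editDistance function like the one sketched earlier (all of these are assumptions, not part of the answer above):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import org.apache.commons.codec.language.DoubleMetaphone;

// phoneticBuckets: precomputed map from DoubleMetaphone hash -> dictionary words.
List<String> suggest(String input, Map<String, List<String>> phoneticBuckets,
                     Map<String, Double> wordProbability) {
    DoubleMetaphone metaphone = new DoubleMetaphone();
    List<String> candidates =
            phoneticBuckets.getOrDefault(metaphone.doubleMetaphone(input), new ArrayList<>());
    // Lower weight = better suggestion: weight = edit-distance * 100 / probability.
    candidates.sort((a, b) -> Double.compare(
            editDistance(input, a) * 100.0 / wordProbability.getOrDefault(a, 0.01),
            editDistance(input, b) * 100.0 / wordProbability.getOrDefault(b, 0.01)));
    return candidates;
}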
You may also wish to ignore ALL CAPS words, initials etc, so choosing what to ignore is also something to think about.
Under Linux/Unix you have ispell. Why reinvent the wheel?