Phrase and word suggestions using Lucene.net

Need to create Google-like suggestions using Lucene.net. I am currently using ShingleAnalyzerWrapper for phrase suggestions, and it works well. But I need to fall back to single-word suggestions if no phrase is found.
I am completely new to the Lucene world. I need to implement this in a short time. I would appreciate any advice.
Thanks.
Edit
I want simple answers to my questions.
Should I use SpellChecker?
How should I index phrases?
How should I search for phrases (and what if they contain misspelled words?)?

If you are new to Lucene, this might not be that easy. However, what you need to do at a higher level is check your results from the phrase query, and if it comes back with zero results, simply create a new Query without the phrase.
I am not sure how your phrase is set up, but you could do any of the following (a slop-plus-fuzzy fallback is sketched after the list):
- keyword search on the phrase with stopwords eliminated: the phrase "the big bus" could become "big bus" or just "bus"
- add a slop setting to your phrase search
- use Fuzzy search
- use a MoreLikeThis search
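A rough sketch of that fallback against the recent Lucene Java API (Lucene.net mirrors it closely); the field name "title", the slop of 2, and the default fuzzy edit distance are all assumptions for illustration:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.TopDocs;

public class SuggestionFallback {
    public static TopDocs search(IndexSearcher searcher, String[] words) throws Exception {
        // First try the whole phrase, tolerating small gaps between the words.
        PhraseQuery.Builder phrase = new PhraseQuery.Builder();
        for (String w : words) {
            phrase.add(new Term("title", w));
        }
        phrase.setSlop(2);
        TopDocs hits = searcher.search(phrase.build(), 10);
        if (hits.totalHits.value > 0) {
            return hits;
        }
        // Zero phrase hits: fall back to OR-ing the words together as fuzzy terms.
        BooleanQuery.Builder fallback = new BooleanQuery.Builder();
        for (String w : words) {
            fallback.add(new FuzzyQuery(new Term("title", w)), BooleanClause.Occur.SHOULD);
        }
        return searcher.search(fallback.build(), 10);
    }
}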
I would recommend the book "Lucene in Action", as it covers Lucene 3.0.3. It is written for Java; however, the current Lucene.net version is also 3.0.3, so there is symmetry between the two APIs and the book's examples carry over. The book dedicates a chapter to what you are looking for and the strategies involved: suggesting searches on a non-exact match (spell checking, suggesting close documents, etc.).

Related

find indexed terms in non-indexed document/string

Sorry if I'm using the wrong terminology here, I'm new to Lucene :D
Assume that I have indexed all titles of the English Wikipedia in Lucene.
Let's say I'm visiting a news website. Within the article I want to convert all phrases (that match a title in the Wikipedia) into a link to the Wikipedia page.
To clarify: I don't want to put the news article into the Lucene index, but rather use the indexed WP titles to find matches within a given string (the article). We also don't want to bother with the JS/HTML stuff, just focus on Lucene for now.
I'd also like to match greedily: i.e. if the text contains "Stack Overflow", I'd like to link to SO rather than to "Stack" and "Overflow". But if I can get shorter matches as well, that would be neat, too. (I kind of want to do both, but I'll settle for either one if having both is difficult.)
Naive solution: I can see that I'd be able to query for single words iteratively and, whenever I get a hit, try the current word plus the next word, extending until I miss. Then convert the last match into a link and continue after that, until I'm through the complete document.
But, that seems really awkward and I have a suspicion that Lucene might have some functionality that could support me here (or at least I hope so :D), but I have no clue what I'd be looking for.
Lucene's inverted index should make this a pretty fast operation, so I guess this might perform reasonably well, even with the naive approach.
Any pointers? I'm stuck :3
There is such a thing, it's called the Tagger Handler:
Given a dictionary (a Solr index) with a name-like field, you can post text to this request handler and it will return every occurrence of one of those names with offsets and other document metadata desired. It’s used for named entity recognition (NER).
It seems a bit fiddly to set up, but it's exactly what I wanted :D
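For reference, a minimal configuration sketch (the handler class is real Solr; the field name name_tag and core layout are assumptions, so check the Solr Reference Guide for the full tagger setup):

<!-- solrconfig.xml: expose the tagger on its own endpoint -->
<requestHandler name="/tag" class="solr.TaggerRequestHandler">
  <lst name="defaults">
    <str name="field">name_tag</str>
  </lst>
</requestHandler>

Posting the article text to /tag with overlaps=LONGEST_DOMINANT_RIGHT should give the greedy, longest-match behavior described above, returning each match with its offsets.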

Solr spellcheck vs fuzzy search

I don't quite understand the difference between Apache Solr's spellcheck and fuzzy search functionality.
I understand that fuzzy search matches your search term against indexed values based on some difference expressed as a distance.
I also understand that spellcheck gives you suggestions based on how close your search term is to a value in the index.
So to me those two things do not seem that different, though I am sure that is due to my shortcomings in understanding each feature thoroughly.
If anyone could provide an explanation preferably via an example, I would greatly appreciate it.
Thanks,
Bob
I'm not a Solr professional, but I'll try to explain.
Fuzzy search is a simple instruction for Solr to use a kind of spellchecking during requests. Solr's standard query parser supports fuzzy search and you can use it without any additional settings, for example: roam~ or roam~1. This matching uses the Damerau-Levenshtein (edit distance) algorithm.
To use spellchecking you need to configure it in solrconfig.xml (please see here). This gives you some flexibility in how to implement spellchecking (there are a couple of OOTB implementations); for example, you can use a separate index for spellcheck, thereby decreasing the load on the main index. Spellchecking also uses its own URL, e.g. /spell, so it is not part of the search query the way a fuzzy query is.
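To illustrate the difference (the core name mycore and handler paths are assumptions based on a default-style setup):

# Fuzzy search is part of the query itself; no extra configuration is needed.
http://localhost:8983/solr/mycore/select?q=roam~1

# Spellcheck is a separately configured component exposed on its own handler.
http://localhost:8983/solr/mycore/spell?q=roam&spellcheck=true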
Why should I use spellchecking or fuzzy search? I guess it depends on your server load, because fuzzy search is more expensive and not recommended by the Solr team.
P.S. This is my understanding of fuzzy search and spellchecking, so if somebody has a more correct and clearer explanation, please advise us on how to deal with them.

Lucene to bring cheeseburger when searching for burger

I would like a Lucene document that contains the word cheeseburger to come up when a user searches for burger. I see that I will probably need a custom analyzer to break this compound word into cheese and burger. However, breaking words may also bring irrelevant results.
Ex: if, when indexing production, we index product and ion as well, then when a user searches for ion, documents containing production will come up, which is not relevant.
So a simple word breaker won't cut it. I need a way of knowing that cheeseburger is associated to burger and cheese, but that production is not associated to ion.
Is there a more intelligent process to achieve this?
Does this have a name, the way stemming is the name for reducing words to their root form?
Depending on how accurate you want your synonymy to be, you might need to look into approaches such as Latent Semantic Analysis (LSA) and its variants such as LDA etc. A simpler approach would be to use an Ontology such as Wordnet to augment your searches. A wordnet Lucene index is available. However if your scenario includes domain-specific vocab then you might need to generate a "mapping" Ontology.
You should look at DictionaryCompoundWordTokenFilter, which uses a brute-force algorithm to split compound nouns based on a dictionary.
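A minimal sketch of wiring that filter into an analyzer with the Java Lucene API (the two-word dictionary is an assumption for illustration; the filter keeps the original token and adds any dictionary words it finds inside it):

import java.util.Arrays;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.compound.DictionaryCompoundWordTokenFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;

public class CompoundAnalyzer extends Analyzer {
    // Tiny ignore-case dictionary; a real one would list all known sub-words.
    private final CharArraySet dictionary =
            new CharArraySet(Arrays.asList("cheese", "burger"), true);

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new WhitespaceTokenizer();
        // "cheeseburger" is emitted plus the sub-tokens "cheese" and "burger".
        TokenStream result = new DictionaryCompoundWordTokenFilter(source, dictionary);
        return new TokenStreamComponents(source, result);
    }
}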
In most cases you can simply use wildcard queries with a leading wildcard: *burger. You only have to enable support for leading wildcards on your query parser:
QueryParser parser = new QueryParser(Version.LUCENE_36, defaultField, analyzer);
parser.setAllowLeadingWildcard(true);
Query query = parser.parse("*burger");
Take care:
Leading wildcards might slow your search down.
If you need a more specific solution, I would suggest going with stemming. It's really a matter of finding the right analyzer.
There are stemming implementations for several languages e.g. the SnowballAnalyzer (http://lucene.apache.org/core/3_6_0/api/all/org/apache/lucene/analysis/snowball/SnowballAnalyzer.html).
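For example, a one-liner against the Lucene 3.6 Java API (the language name "English" selects the English Snowball stemmer):

Analyzer analyzer = new SnowballAnalyzer(Version.LUCENE_36, "English");
// "burgers" and "burger" now reduce to the same root token at index and query time.

Note that stemming alone still won't split cheeseburger into cheese and burger; it only normalizes inflected forms.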
Best regards,
Chris
Getting associations by looking at the word is not going to scale to other words. For example, you cannot know "whopper" is associated with burger and "big-mac" is associated with cheese just by looking at the words. To make the search aware of the associations, you probably need a database of associations like "A is a B" or "A contains B". (As Mikos has mentioned, I think WordNet provides such a database.) Then, when you see B in a query, you translate the query so that it also searches for A.
I think the underlying question is -- how big is the collection you are indexing? If you are indexing some collection where all of the synonyms and related words are already known, then the index can just include the synonyms and related words directly, like 'cheeseburger' including the related words 'cheese' and 'burger'. (An approach successfully used in the LOINC standard medical terms Lucene index.)
If you are trying to solve the general problem for a whole human language (English, Chinese, etc.) then you have to move to some kind of semantic analysis as mentioned above.
It might be useful to talk with the subject matter experts of the area you are indexing to see how they search for terms -- what synonyms/related words do they use, do they have defined lists of synonyms/related words, do they need/use stemming, etc. This should give you some idea as to which approach (direct synonym/related-word inclusion or semantic analysis) you need to pursue.

lucene / solr remove common phrases (stop phrases)

I would like to eliminate from the search query the words/phrases that bring no meaning to the query (we could call them stop phrases). Example:
"How to .."
"Where can I find .."
"What is the meaning of .."
etc.
Where to find / how to compute a list of 'common phrases' for English and for French?
How to implement it in Solr (Is there anything more advanced than the stopwords feature?)
I think that you shouldn't try to completely get rid of these phrases, because they reveal the intent of the searcher. You can try to leverage their presence by using a natural language question answering system like Ephyra. There is even a project aimed at integrating it with Lucene. I haven't used it myself, but maybe at least evaluating it is worth a try.
If you are determined to remove them, then I think that you need to write a custom QueryParser that filters the query, delegating the further processing to a parser of your choice.
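A lighter-weight alternative is to strip known stop phrases from the raw query string before it ever reaches the parser; a minimal Java sketch (the phrase list is an assumption for illustration, and per-language lists would be loaded from a file in practice):

import java.util.List;

public class StopPhraseStripper {
    // Hypothetical stop-phrase list; extend it per language.
    private static final List<String> STOP_PHRASES = List.of(
            "how to", "where can i find", "what is the meaning of");

    public static String strip(String query) {
        String result = query.toLowerCase();
        for (String phrase : STOP_PHRASES) {
            result = result.replace(phrase, " ");
        }
        return result.trim().replaceAll("\\s+", " ");
    }
}

For example, StopPhraseStripper.strip("How to install Solr") yields "install solr", which you then hand to the regular query parser.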

Semantic analysis using Solr

I'm considering about adding semantic analysis to my Solr installation, but I don't exactly know where to start.
Basically, I'd like Solr to be able to find "similar" words (taken from the body of the indexed documents).
For example, if I search for "music", I should be able to query the semantic engine and obtain "rock", "pop", etc. (of course, only if these words appeared near "music" in some of the indexed documents).
I found this project, but I don't know if it is the correct place to start:
http://code.google.com/p/semanticvectors/
Semantic indexing is a good place to start. However, in my experience, these kinds of technologies don't work that well in practice. You often end up with very bizarre results. Also, because of Google, people have a certain expectation of how keyword search should behave, i.e. your search term should appear in the matching document.
You may use the Lucene Wordnet contrib package to look for synonyms.
Optimizing Findability in Lucene and Solr gives other ways to expand queries.
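If you go the synonym route, here is a sketch of wiring a hand-built synonym map into an analyzer using the recent Lucene Java API (the music/rock/pop pairs are assumptions for illustration; a real setup would load pairs from WordNet or from co-occurrence statistics):

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.synonym.SynonymGraphFilter;
import org.apache.lucene.analysis.synonym.SynonymMap;
import org.apache.lucene.util.CharsRef;

public class SynonymExpansionAnalyzer extends Analyzer {
    private final SynonymMap map;

    public SynonymExpansionAnalyzer() throws IOException {
        SynonymMap.Builder builder = new SynonymMap.Builder(true);
        // Hand-picked pairs purely for illustration; true keeps the original term.
        builder.add(new CharsRef("music"), new CharsRef("rock"), true);
        builder.add(new CharsRef("music"), new CharsRef("pop"), true);
        this.map = builder.build();
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new WhitespaceTokenizer();
        // Expands matching tokens in place; true = ignore case when matching.
        TokenStream result = new SynonymGraphFilter(source, map, true);
        return new TokenStreamComponents(source, result);
    }
}

Using this as the query-time analyzer makes a search for "music" also match documents containing "rock" or "pop".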