I need Lucene to search for synonyms as well as the actual keyword. That is, if I search for "CI", I want it to search for CI OR "continuous integration". At the moment I look for keywords that I have synonyms for and replace them with the "OR-ed" version, but I suspect there is a better way to do this. My method will not work for complex queries such as "x AND y OR NOT z".
That's pretty much how I was planning on implementing this functionality. I was going to build my own version, but then I ran across WordNet.Net, which seems to address the issue of building the synonyms. There is a WordNet extension to Lucene.Net which rewrites the query, so I'm guessing that is really the standard way of handling this.
At least in the Java version of Lucene, you could write yourself a recursive function that digs through the BooleanQuery objects that the QueryParser builds; every time it finds a TermQuery, it replaces it with a BooleanQuery that ORs the original term with the new term you want added to the query.
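The recursive rewrite described above can be sketched without Lucene at all. The following is a minimal Python illustration, not Lucene's API: `Term`, `Phrase`, `Bool`, `SYNONYMS`, and `expand` are hypothetical names standing in for TermQuery, PhraseQuery, BooleanQuery, and whatever synonym source you use.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union


@dataclass(frozen=True)
class Term:
    text: str


@dataclass(frozen=True)
class Phrase:
    words: Tuple[str, ...]


@dataclass(frozen=True)
class Bool:
    op: str  # "AND", "OR" or "NOT"
    clauses: Tuple["Query", ...]


Query = Union[Term, Phrase, Bool]

# Hypothetical synonym table; in practice this might come from WordNet.
SYNONYMS: Dict[str, List[str]] = {"ci": ["continuous integration"]}


def to_node(text: str) -> Query:
    """Turn a synonym string into a Term (one word) or a Phrase (several)."""
    words = tuple(text.split())
    return Term(words[0]) if len(words) == 1 else Phrase(words)


def expand(query: Query, synonyms: Dict[str, List[str]]) -> Query:
    """Walk the query tree and OR every term with its synonyms."""
    if isinstance(query, Bool):
        # Recurse into nested boolean clauses, keeping the operator as-is.
        return Bool(query.op, tuple(expand(c, synonyms) for c in query.clauses))
    if isinstance(query, Term):
        alternatives = synonyms.get(query.text.lower(), [])
        if alternatives:
            return Bool("OR", (query,) + tuple(to_node(a) for a in alternatives))
    # Phrases and terms without synonyms are left untouched.
    return query
```

Because the rewrite works on the parsed tree rather than on the query string, it is not affected by how the terms are combined, so a query like "x AND y OR NOT z" is handled the same way as a single keyword.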
I want to upgrade my Alfresco server to 5.2, and all my custom webscripts use lucene queries. Since lucene indexing has been removed from Alfresco 5.x and solr indexing is not instantaneous, I am planning to use fts_alfresco search. While testing I found that a few lucene queries can be used for fts_alfresco search without modification. So my question is: will I be able to do fts_alfresco search using lucene queries? If not, is there a better way to migrate all my lucene queries to fts_alfresco?
Thanks in advance.
You will need to test/check your queries since there are small differences (for instance, date range query is not the same), but in general there's no reason why you would not be able to use FTS.
I'm not sure comprehensive documentation exists that lists all those small differences, though. If you find it, please share.
"Alfresco FTS is compatible with most, if not all of the examples here.."
https://community.alfresco.com/docs/DOC-4673-search
I need to create Google-like suggestions using Lucene.net. I am currently using ShingleAnalyzerWrapper for phrase suggestions, and that works. But I also need to fall back to single-word suggestions when no phrase is found.
I am completely new to the Lucene world and need to implement this in a short time. I would appreciate any advice.
Thanks.
Edit
I want simple answers to my questions.
Should I use SpellChecker?
How should I index phrases?
How should I search for phrases? (What if there are misspelled words?)
If you are new to Lucene, this might not be that easy. However, at a high level, what you need to do is check the results of your phrase query and, if it comes back with zero results, simply create a new Query without the phrase.
I am not sure how your phrase is set up, but you could do:
- keyword search on the phrase and eliminate stopwords. "the big bus" phrase could become "big bus" or just "bus"
- add slop setting to your phrase search
- use Fuzzy search
- More like this search
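The fallback idea above (try the phrase first, then relax the query) can be sketched as follows. This is a toy in-memory illustration in Python, not Lucene.net code: `search`, `phrase_hits`, `keyword_hits`, `fuzzy_hits`, and the `STOPWORDS` set are all hypothetical, and the standard library's `difflib` stands in for a real fuzzy query. Slop and "more like this" are not covered here.

```python
import difflib

# Assumed stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "of", "to"}


def tokenize(text):
    return text.lower().split()


def phrase_hits(docs, words):
    """Documents containing the words as one consecutive sequence."""
    n = len(words)
    if n == 0:
        return []
    hits = []
    for doc in docs:
        tokens = tokenize(doc)
        if any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1)):
            hits.append(doc)
    return hits


def keyword_hits(docs, words):
    """Documents containing at least one of the words."""
    wanted = set(words)
    return [doc for doc in docs if wanted & set(tokenize(doc))]


def fuzzy_hits(docs, words, cutoff=0.8):
    """Documents containing a term that is close to one of the words."""
    vocabulary = sorted({t for doc in docs for t in tokenize(doc)})
    close = [m for w in words
             for m in difflib.get_close_matches(w, vocabulary, n=3, cutoff=cutoff)]
    return keyword_hits(docs, close)


def search(docs, query):
    """Try the exact phrase, then keywords without stopwords, then fuzzy terms."""
    words = tokenize(query)
    hits = phrase_hits(docs, words)
    if hits:
        return "phrase", hits
    keywords = [w for w in words if w not in STOPWORDS]
    hits = keyword_hits(docs, keywords)
    if hits:
        return "keyword", hits
    return "fuzzy", fuzzy_hits(docs, keywords)
```

Returning the strategy name alongside the hits makes it easy to tell the user that the results are for a relaxed query rather than the exact phrase.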
I would recommend the book "Lucene In Action", as it covers Lucene 3.0.3. It is for Java, however the current Lucene.net version is 3.0.3 so there is symmetry between the two APIs and examples in the book. The book dedicates a chapter to what you are looking for and the strategies involved in doing: suggested search on a non-exact match (spell checking, suggesting close documents etc.)
I am editing a Lucene.Net implementation (2.3.2) at work to include stemming and automatic wildcarding (adding a * at the end of words).
I have found that exact words with wildcarding don't work. (so stack* works for stackoverflow, but stackoverflow* does not get a hit), and was wondering what causes this, and how it might be fixed.
Thanks in advance. (Also thanks for not asking why I am implementing both automatic wildcarding and stemming.)
I am about to make the query always prefix query so I don't have to do any nasty adding "*"s to queries, so we will see if anything becomes clear then.
Edit: Only words that are stemmed do not work wildcarded. Example Silicate* doesn't work, but silic* does.
The reason it doesn't work is that you stem the content, thus changing the Term.
For example consider the word "valve". The snowball analyzer will stem it down to "valv".
So at search time, since you stem the input query, both "valve" and "valves" will be stemmed down to "valv". A TermQuery using the stemmed Term "valv" will yield a match on both "valve" and "valves" occurrences.
But now, since in the Index you stored the Term "valv", a query for "valve*" will not match anything. That is because the QueryParser does not run the Analyzer on Wildcard Queries.
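The mismatch can be reproduced with a toy example. The sketch below is plain Python, not Lucene: `toy_stem` is a crude stand-in for a real stemmer (it is not the Snowball algorithm), and `build_index`, `term_query`, and `prefix_query` are hypothetical helpers that mimic an index of stemmed terms, an analyzed term query, and an unanalyzed wildcard query.

```python
def toy_stem(word):
    """Crude stand-in for a real stemmer: strip a trailing "es", "s" or "e"."""
    for suffix in ("es", "s", "e"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def build_index(words):
    """Index time: only the stemmed form of each word is stored."""
    return {toy_stem(w.lower()) for w in words}


def term_query(index, word):
    """Plain term query: the query text is stemmed like the content."""
    return toy_stem(word.lower()) in index


def prefix_query(index, pattern):
    """Wildcard query: the prefix is compared as typed, without stemming."""
    prefix = pattern.rstrip("*").lower()
    return any(term.startswith(prefix) for term in index)


index = build_index(["valve", "valves"])   # the index holds only "valv"
print(term_query(index, "valves"))         # stemmed to "valv", so it matches
print(prefix_query(index, "val*"))         # "valv" starts with "val"
print(prefix_query(index, "valve*"))       # "valv" does not start with "valve"
```

The last query fails for the same reason as `Silicate*` in the question: the full word is longer than the stem that was actually indexed.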
There is the AnalyzingQueryParser that can handle some of these cases, but I don't think it was in the 2.3.x versions of Lucene. Anyway, it's not a universal fit; the documentation says:
Warning: This class should only be used with analyzers that do not use stopwords or that add tokens. Also, several stemming analyzers are inappropriate: for example, GermanAnalyzer will turn Häuser into hau, but H?user will become h?user when using this parser and thus no match would be found (i.e. using this parser will be no improvement over QueryParser in such cases).
The solution mentioned in the duplicate I linked works for all cases, but you will get bigger indexes.
I would like to eliminate from the search query the words/phrases that bring no meaning to the query (we could call them stop phrases). Example:
"How to .."
"Where can I find .."
"What is the meaning of .."
etc.
Where to find / how to compute a list of 'common phrases' for English and for French?
How to implement it in Solr (Is there anything more advanced than the stopwords feature?)
I think that you shouldn't try to completely get rid of these phrases, because they reveal the intent of the searcher. You can try to leverage their existence by using a natural language question answering system like Ephyra. There is even a project aimed at integrating it with Lucene. I haven't used it myself, but maybe at least evaluating it is worth a try.
If you are determined to remove them, then I think that you need to write custom QueryParser that will filter the query, delegating the further processing to a parser of your choice.
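A minimal sketch of that filter-then-delegate idea, in Python rather than as an actual Solr QueryParser: `STOP_PHRASES`, `strip_stop_phrases`, and `parse` are hypothetical names, the phrase list is just the three examples from the question, and only phrases at the start of the query are removed.

```python
import re

# Assumed list, taken from the examples in the question.
STOP_PHRASES = ["how to", "where can i find", "what is the meaning of"]

# Longest phrases first so that a longer phrase wins over a shorter prefix.
_LEADING = re.compile(
    r"^\s*(?:"
    + "|".join(re.escape(p) for p in sorted(STOP_PHRASES, key=len, reverse=True))
    + r")\b\s*",
    re.IGNORECASE,
)


def strip_stop_phrases(query):
    """Remove one leading stop phrase; keep the query if nothing else is left."""
    stripped = _LEADING.sub("", query)
    return stripped if stripped.strip() else query


def parse(query, delegate=str.split):
    """Filter the query, then hand it to the parser of your choice."""
    return delegate(strip_stop_phrases(query))
```

Here `delegate` defaults to `str.split` only to keep the sketch runnable; in a real setup it would be whatever parser normally handles the query.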
I'm considering adding semantic analysis to my Solr installation, but I don't exactly know where to start.
Basically, I'd like Solr to be able to find "similar" words (taken from the body of the indexed documents).
For example, if I search for "music", I should be able to query the semantic engine and obtain "rock", "pop", etc. (of course if these words appeared near to music in some of the indexed documents).
I found this project, but I don't know if it is the correct place to start:
http://code.google.com/p/semanticvectors/
Semantic indexing is a good place to start. However, in my experience, these kinds of technologies don't work that well in practice. You often end up with very bizarre results. Also, because of Google, people have a certain expectation of how keyword search should behave, i.e. your search term should appear in the matching document.
You may use the Lucene Wordnet contrib package to look for synonyms.
Optimizing Findability in Lucene and Solr gives other ways to expand queries.
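For a rough feel of the "words that appear near music" idea from the question, co-occurrence counts over a small window are the simplest possible approximation. The Python sketch below is an illustration only: `cooccurring` is a hypothetical helper, it is not how the semanticvectors project or the WordNet package works, and it does no stopword filtering or weighting.

```python
from collections import Counter


def cooccurring(docs, target, window=3, top=5):
    """Return the words seen most often within `window` tokens of `target`."""
    counts = Counter()
    target = target.lower()
    for doc in docs:
        tokens = doc.lower().split()
        for i, token in enumerate(tokens):
            if token != target:
                continue
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                # Skip the target itself, including repeated occurrences.
                if j != i and tokens[j] != target:
                    counts[tokens[j]] += 1
    return [word for word, _ in counts.most_common(top)]
```

On real text the top results tend to be dominated by common words, which is one source of the bizarre results mentioned above.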