Lucene synonym expansion, stemming, spell check and more

I am using Lucene to index my database and then perform a phrase search on a specific field (field name: keyword).
I am currently using the following code:
String userQuery = request.getParameter("query");
//create standard analyzer object
analyzer = new StandardAnalyzer(Version.LUCENE_30);
Analyzer analyze=AnalyzerUtil.getPorterStemmerAnalyzer(analyzer);
//create File object of our index directory
File file = new File(LUCENE_INDEX_DIRECTORY);
//create index reader object
reader = IndexReader.open(FSDirectory.open(file),true);
//create index searcher object
searcher = new IndexSearcher(reader);
//create topscore document collector
collector = TopScoreDocCollector.create(1000, false);
//create query parser object
parser = new QueryParser(Version.LUCENE_30,"keyword", analyze);
parser.setAllowLeadingWildcard(true);
//parse the query and get reference to Query object
query = parser.parse(userQuery);
//********Line 1***********************
//search the query
searcher.search(query, collector);
hits = collector.topDocs().scoreDocs;
//check whether the search returns any result
if (hits.length > 0) {
    // Code to retrieve hits
}
This code works fine for stemming, but now I also want to expand my query to do synonym search: if I enter "Man" and my Lucene index has an entry "male", it should still return that entry as a hit.
I tried adding this at Line 1 in the above code:
query = SynExpand.expand(userQuery, searcher, analyze, "keyword", serialVersionUID);
But it doesn't give me any results.
I also want to introduce spell checking, so that if I enter "ubelievable" instead of "unbelievable" it still gives me a result.
I have no idea why synonym expansion isn't working for me or how to do spell checking. If someone could guide me, I would be really grateful.
Thanks!

Fuzzy search can be done with a term modifier in the query syntax, namely by appending a tilde to the term:
keyword:ubelievable~
See Lucene Parser Syntax for more details and other types of queries that may be interesting to you.
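As a rough illustration of what the tilde relies on: fuzzy matching is based on edit distance, that is, how many single-character insertions, deletions or substitutions turn one term into another. The sketch below is plain Java, not Lucene's own implementation, and the class name EditDistanceSketch is made up for this example.

```java
final class EditDistanceSketch {
    /** Classic Levenshtein distance, computed with two rolling rows. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Turning the empty prefix of a into the first j chars of b takes j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // deleting the first i chars of a
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        // The misspelling from the question is one insertion away from the real word.
        System.out.println(distance("ubelievable", "unbelievable")); // prints 1
    }
}
```

A small distance relative to the term length is what lets "ubelievable~" still reach "unbelievable" in the index.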
There are two ways of dealing with synonyms. The query expansion you are trying to use relies on WordNet. As SynExpand's documentation says, you should first run Syns2Index to build the synonym index that the expansion searches. This is the easy way, but it works only with English words.
If you need to support multiple languages or add your own synonyms, you can use synonym injection during indexing. The idea is to write your own analyzer that injects synonyms from your own dictionary into the indexed documents. This may sound hard to implement, but fortunately there's an excellent example in the Lucene in Action book (the source code is available for free; see the lia.analysis.synonym package). That said, I highly recommend getting your own copy of this nice book.
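To make the injection idea concrete, here is a minimal plain-Java sketch. It is not the code from the book and does not use Lucene classes: the class name, the tiny dictionary and the whitespace tokenization are all assumptions for illustration. In a real analyzer this logic would live in a token filter that emits each synonym at the same position as the original token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class SynonymInjectionSketch {
    // Hypothetical dictionary; a real one would be loaded from your own data.
    static final Map<String, List<String>> SYNONYMS = Map.of(
            "man", List.of("male"),
            "male", List.of("man"));

    /** For each token position, returns the original token followed by its synonyms. */
    static List<List<String>> inject(String text) {
        List<List<String>> positions = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            List<String> atPosition = new ArrayList<>();
            atPosition.add(token);
            atPosition.addAll(SYNONYMS.getOrDefault(token, List.of()));
            positions.add(atPosition);
        }
        return positions;
    }

    public static void main(String[] args) {
        // "male" is indexed together with "man", so a query for "man" can find it.
        System.out.println(inject("tall male"));
    }
}
```

Because the synonyms are written into the index, the query side needs no expansion step; the cost is a larger index and a re-index whenever the dictionary changes.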

Related

Lucene: query parser is not working as expected

I'm using Lucene.Net, but I'm sure it still applies to the non-.NET flavour.
This is my query:
Collection:drwho AND Format:"Blu-ray"
This is what the query parser does to it:
{+Collection:drwho +Format:"blu ray"}
This is clearly not what I am after. This is the code I'm using:
Dim analyzer = New StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29)
Dim qp = New QueryParser(Lucene.Net.Util.Version.LUCENE_29, Nothing, analyzer)
Dim q As Query = qp.Parse(query)
Any ideas on why the query is being butchered? According to http://lucene.apache.org/java/3_4_0/queryparsersyntax.html, I cannot for the life of me see what is wrong with my query...
For NOT_ANALYZED fields, you should either create a TermQuery in your code or use KeywordAnalyzer, because such fields require an exact match between the term in the index and the term in your query (your input is stored as Blu-ray in the index). Other analyzers process the input and, for example, convert Blu-ray to blu ray, as you have already noticed.
If you change your field to ANALYZED and use StandardAnalyzer while indexing, your query would also work in its current form.
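The mismatch can be sketched in plain Java. The two toy methods below only approximate the behaviour described above (lowercasing and splitting on punctuation versus keeping the input as one term); they are not the real StandardAnalyzer or KeywordAnalyzer, and all names are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class AnalyzerMismatchSketch {
    /** Toy stand-in for an analyzing tokenizer: lowercase, split on non-alphanumerics. */
    static List<String> toyStandardAnalyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    /** Toy stand-in for keyword analysis: the whole input is a single term. */
    static List<String> toyKeywordAnalyze(String text) {
        return List.of(text);
    }

    public static void main(String[] args) {
        String indexedTerm = "Blu-ray"; // stored untouched in a NOT_ANALYZED field
        // The analyzed query terms never equal the stored term, so nothing matches.
        System.out.println(toyStandardAnalyze("Blu-ray").contains(indexedTerm));
        // With keyword-style analysis the query term is identical to the stored term.
        System.out.println(toyKeywordAnalyze("Blu-ray").contains(indexedTerm));
    }
}
```

The general rule the sketch illustrates is that the query must be analyzed the same way the field was indexed.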

Apache lucene and text meaning

I have a question about the search process in Lucene.
I use this code for searching:
Directory directory = FSDirectory.GetDirectory(@"c:\index");
Analyzer analyzer = new StandardAnalyzer();
QueryParser qp = new QueryParser("content", analyzer);
qp.SetDefaultOperator(QueryParser.Operator.AND);
Query query = qp.Parse(searchString);
In one document I've set a field to "I want to go shopping", and in another document I've set it to "I wanna go shopping".
The meaning of both sentences is the same!
Is there any good way for Lucene to understand the meaning of sentences, or to somehow normalize them? For example, I could save the fields as "I wanna /want to/ go shopping" and remove the comment from the results with a regexp.
Lucene provides filters to normalize words and even to map similar words to each other.
PorterStemFilter:
Stemming reduces words to their roots.
For example, wanted and wants would both be reduced to the root want, so a search for any of those words would match the document.
However, wanna does not reduce to the root want, so stemming alone may not work in this case.
SynonymFilter:
This filter maps similar words to each other based on a configuration file.
So wanna can be mapped to want, and a search for either of them will match the document.
You would need to add these filters to your analysis chain.
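The effect of such a chain can be sketched in plain Java. Everything here is an assumption for illustration: the mapping table is hypothetical, and naiveStem is a deliberately crude suffix stripper, not the Porter algorithm. The point is only that the same chain runs at index time and at query time, so both sentences end up sharing the term want.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class NormalizationChainSketch {
    // Hypothetical mapping of colloquial forms to a canonical word.
    static final Map<String, String> CANONICAL = Map.of("wanna", "want");

    /** Crude stand-in for a stemmer: strips a trailing "ed" or "s" from longer tokens. */
    static String naiveStem(String token) {
        if (token.length() > 4 && token.endsWith("ed")) {
            return token.substring(0, token.length() - 2);
        }
        if (token.length() > 3 && token.endsWith("s")) {
            return token.substring(0, token.length() - 1);
        }
        return token;
    }

    /** Lowercase and tokenize, then apply the synonym mapping, then the stemmer. */
    static List<String> normalize(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty()) {
                continue;
            }
            terms.add(naiveStem(CANONICAL.getOrDefault(token, token)));
        }
        return terms;
    }

    public static void main(String[] args) {
        List<String> indexed = normalize("I want to go shopping");
        List<String> query = normalize("I wanna go shopping");
        // With the AND operator, every query term must be present in the document.
        System.out.println(indexed.containsAll(query));
    }
}
```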

Is there any Lucene highlighter that does not require the original text - but rather can work on term positions etc

I have been reading the new 2nd edition of Lucene in Action, and it gives an example of highlighting, but unfortunately it requires the original text so it can get the positions of terms etc. The highlighter is the official one in contrib, which implies it's the sponsored or official highlighter.
Does anyone know of another highlighter that does not require the original text but works using the term positions (sorry if I got the terminology wrong)?
Both the standard highlighter and FastVectorHighlighter can use the index if you store term vectors for the field. (FVH can only use the index, in fact.) You can see an example of this on page 274 of Lucene in Action. The relevant line of code is:
TokenStream stream = TokenSources.getAnyTokenStream(searcher.getIndexReader(), sd.doc, "title", doc, analyzer);
That will get the token stream from the index.

Parse a search string (into NHibernate Criteria)

I would like to implement an advanced search for my project.
The search right now takes all the strings the user enters and builds one big disjunction with the Criteria API.
This works fine, but now I would like to implement more features: AND, OR and brackets ().
I am having a hard time parsing the string and building criteria from it. I found this Stack Overflow question, but it didn't really help (the asker didn't make clear what he wanted).
I found another article, but it supports much more and spits out SQL statements.
Another thing I've heard mentioned a lot is Lucene, but I'm not sure whether it would really help me.
I've been searching around a little bit and I've found the Lucene.Net WhitespaceAnalyzer and the QueryParser.
It changes the search A AND B OR C into something like +A +B C, which is a good step in the right direction (plus it handles brackets).
The next step would be to turn the converted string into a set of conjunctions and disjunctions.
The Java example I found used the query builder, which I couldn't find in NHibernate.
Any more ideas?
I guess you haven't heard about NHibernate Search yet.
NHibernate Search uses Lucene underneath and gives you all the options of using AND, OR and the rest of the query grammar.
All you have to do is add attributes to your entities for indexing, and NHibernate will index them at a predefined location.
After that you can search this index with all the power that Lucene exposes and get your domain-level entity objects in return.
using (IFullTextSession s = Search.CreateFullTextSession(sf.OpenSession(new SearchInterceptor()))) {
    QueryParser qp = new QueryParser("id", new StopAnalyzer());
    IQuery NHQuery = s.CreateFullTextQuery(qp.Parse("Summary:series"), typeof(Book));
    IList result = NHQuery.List();
}
Powerful, isn’t it?
What I am basically doing right now is parsing the input string with the Lucene.Net parser API.
This gives me a uniform and simplified syntax. (Pseudocode)
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
void Search(string queryString)
{
    Analyzer analyzer = new WhitespaceAnalyzer();
    QueryParser luceneParser = new QueryParser("name", analyzer);
    Query luceneQuery = luceneParser.Parse(queryString);
    // Split the normalized Lucene query string into its clauses
    string[] words = luceneQuery.ToString().Split(' ');
    foreach (string word in words)
    {
        // Parse each clause of the Lucene.Net query string here
    }
}
After that I am parsing this string manually, creating the disjunctions and conjunctions.
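A minimal sketch of that manual step, written in Java to match the other Lucene examples on this page (the same logic ports directly to C#). It assumes a flat string such as "+A +B C" and does not handle brackets or quoted phrases; the class name and the bucket names are made up for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ClauseSketch {
    /** Groups whitespace-separated clauses by their "+" or "-" prefix. */
    static Map<String, List<String>> classify(String queryString) {
        Map<String, List<String>> clauses = new LinkedHashMap<>();
        clauses.put("must", new ArrayList<>());
        clauses.put("should", new ArrayList<>());
        clauses.put("mustNot", new ArrayList<>());
        for (String clause : queryString.trim().split("\\s+")) {
            if (clause.isEmpty()) {
                continue;
            }
            String bucket = "should";
            if (clause.startsWith("+")) {
                bucket = "must";
                clause = clause.substring(1);
            } else if (clause.startsWith("-")) {
                bucket = "mustNot";
                clause = clause.substring(1);
            }
            // Drop an optional "field:" prefix such as "name:".
            int colon = clause.indexOf(':');
            String term = colon >= 0 ? clause.substring(colon + 1) : clause;
            if (!term.isEmpty()) {
                clauses.get(bucket).add(term);
            }
        }
        return clauses;
    }

    public static void main(String[] args) {
        System.out.println(classify("+name:A +name:B name:C"));
    }
}
```

The must terms would become a conjunction and the should terms a disjunction. Keep in mind that in Lucene a document has to match every must clause, while should clauses only influence scoring once a must clause is present, so "+A +B C" is not the same as "(A AND B) OR C".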

Lucene (Java) - How to specify default search field programmatically?

I have the following code and would appreciate your advice.
QueryParser queryParser = new QueryParser(searchTerm, analyzer);
Query query = queryParser.parse(searchTerm);
My first question: is this "doubled"? I have the string to search for (= searchTerm) in the constructor as well as in the parse() method. Is this really required? (For further usage I need a Query object.) If I do it this way, could it introduce negative side effects?
I am also unable to specify the default field to search programmatically. In my queries I write "content:House" and it searches the field "content". But how can I specify programmatically that "content" is my default field, so that a user only has to enter "House" (and Lucene then automatically searches the "content" field)?
Thank you so much
jan
The first argument to the QueryParser constructor is the default search field, even if the javadoc doesn't make that obvious.
So you want this:
QueryParser queryParser = new QueryParser("content", analyzer);
Query query = queryParser.parse(searchTerm);