Lucene: query parser is not working as expected

I'm using Lucene.Net, but I'm sure this still applies to the non-.Net flavour.
This is my query:
Collection:drwho AND Format:"Blu-ray"
This is what the query parser does to it:
{+Collection:drwho +Format:"blu ray"}
This is clearly not what I am after. This is the code I'm using:
Dim analyzer = New StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29)
Dim qp = New QueryParser(Lucene.Net.Util.Version.LUCENE_29, Nothing, analyzer)
Dim q As Query = qp.Parse(query)
Any ideas on why the query is being butchered? According to http://lucene.apache.org/java/3_4_0/queryparsersyntax.html, I cannot for the life of me see what is wrong with my query...

For NOT_ANALYZED fields you should either create a TermQuery in your code or use KeywordAnalyzer, since such a field requires an exact match between the term in the index and the term in your query (your input is stored as Blu-ray in the index), whereas other analyzers process the input and convert Blu-ray into blu ray, for example, as you have already noticed.
If you change your field to ANALYZED and use StandardAnalyzer while indexing, your query would also work in its current form.
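To illustrate the TermQuery route, here is a minimal sketch in Java (the Lucene.Net calls are analogous); the field names and values are taken from the original query, and the MUST clauses mirror the AND:
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

// Build the query by hand so the NOT_ANALYZED Format field is matched exactly,
// bypassing the analyzer that lower-cases and splits "Blu-ray" into "blu ray".
BooleanQuery q = new BooleanQuery();
q.add(new TermQuery(new Term("Collection", "drwho")), Occur.MUST);
q.add(new TermQuery(new Term("Format", "Blu-ray")), Occur.MUST);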

Related

Lucene substring matching

What is the best way to check if String a is part of String b in Lucene? For example: a = "capital" and b = "Berlin is a capital of Germany". In this case b contains a and fits the requirement.
I think your problem can be treated as checking whether a field contains a certain term or not.
A basic TermQuery should be enough to solve your problem: with most analyzers, "Berlin is a capital of Germany" will be analyzed into the terms "berlin", "capital", "germany" (if you use the basic stop words).
// code in Scala
new TermQuery(new Term("contents", "capital"))
You can also use PhraseQuery to solve your problem (though your problem is not the most suitable scenario for PhraseQuery).
val query = new PhraseQuery();
query.add(new Term("contents", "capital"))
Lucene in Action, 2nd edition, section 3.4 "Lucene's diverse queries" introduces all the kinds of Query used in Lucene. I suggest you have a read; it might help.
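For completeness, a usage sketch in Java showing how the TermQuery above is actually run against an index (the index directory path is an assumption; the "contents" field is the one used above):
import java.io.File;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

// Open the index read-only and count documents whose "contents" field contains the term "capital".
IndexSearcher searcher = new IndexSearcher(FSDirectory.open(new File("indexDir")), true);
TopDocs hits = searcher.search(new TermQuery(new Term("contents", "capital")), 10);
System.out.println(hits.totalHits); // > 0 when some document contains "capital"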

querying for a string'ed number in lucene finds nothing

I have an existing index with some documents I'm trying to search.
When I search a "real textual" field, everything is OK.
When I try to search a field which is a number, the search gives 0 results.
The code is something like this (it is pylucene but the concept is the same):
dir = SimpleFSDirectory(File(indexDir))
analyzer = StandardAnalyzer(Version.LUCENE_CURRENT)
searcher = IndexSearcher(dir)
query = QueryParser(Version.LUCENE_CURRENT, "id", analyzer).parse("902")
hits = searcher.search(query, MAX)
print hits.totalHits #gives me 0
A Luke search (id:902) gives me empty results as well.
When I look at the Overview tab in Luke, it says this field is UTF-8 (string).
Anything I'm doing wrong?
edit:
It appears this happens on fields that are indexed and have no norms (according to the flags in Luke).
Can someone explain it?
I don't like answering my own questions but I believe this answer is an important reference.
The solution is to use a NumericRangeQuery with the number you are looking for as both the lower and the upper bound (this time in Java):
NumericRangeQuery.newIntRange("id", Integer.valueOf(902), Integer.valueOf(902),
true, true)
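For reference, a minimal sketch of both sides in Java; it assumes the id field is (re)indexed as a numeric trie field, which is what NumericRangeQuery matches against:
import org.apache.lucene.document.Document;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.search.NumericRangeQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;

// Indexing side: add the id as a numeric field so exact/range numeric queries can find it.
Document doc = new Document();
doc.add(new NumericField("id").setIntValue(902));

// Search side: an "exact match" is a range with the same lower and upper bound.
// 'searcher' is an open IndexSearcher, as in the question's code.
Query q = NumericRangeQuery.newIntRange("id", 902, 902, true, true);
TopDocs hits = searcher.search(q, 10);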
Are you using SimpleAnalyzer while indexing? It strips out numbers. Make sure you are using the same analyzer while indexing and searching.

Lucene synonym expansion, stemming, spell check and more

I am using Lucene to index my database and then perform a phrase search on a specific field (field name: keyword).
I am using following code currently:
String userQuery = request.getParameter("query");
//create standard analyzer object
analyzer = new StandardAnalyzer(Version.LUCENE_30);
Analyzer analyze=AnalyzerUtil.getPorterStemmerAnalyzer(analyzer);
//create File object of our index directory
File file = new File(LUCENE_INDEX_DIRECTORY);
//create index reader object
reader = IndexReader.open(FSDirectory.open(file),true);
//create index searcher object
searcher = new IndexSearcher(reader);
//create topscore document collector
collector = TopScoreDocCollector.create(1000, false);
//create query parser object
parser = new QueryParser(Version.LUCENE_30,"keyword", analyze);
parser.setAllowLeadingWildcard(true);
//parse the query and get reference to Query object
query = parser.parse(userQuery);
//********Line 1***********************
//search the query
searcher.search(query, collector);
hits = collector.topDocs().scoreDocs;
//check whether the search returns any result
if(hits.length>0){//Code to retrieve hits}
This code works fine for stemming, but now I also want to expand my query to do a synonym search: if I enter "Man" and my Lucene index has an entry "male", it should still give me that as a hit.
I tried to add this at Line 1 in the above code:
query = SynExpand.expand(userQuery, searcher, analyze, "keyword", serialVersionUID);
But it doesn't give me any result.
I also want to introduce spell checking, so that if I enter "ubelievable" instead of "unbelievable" it would still give me a result.
I have no idea why synonym expansion isn't working for me, or how to do the spell checking. If someone could guide me, I would be really grateful.
Thanks!
Fuzzy search can be done with a query term modifier, namely by adding a tilde:
keyword:ubelievable~
See Lucene Parser Syntax for more details and other types of queries that may be interesting to you.
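Programmatically, the same thing can be expressed as a FuzzyQuery; a sketch in Java, with the "keyword" field name taken from the question's code:
import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.Query;

// Equivalent to parsing "keyword:ubelievable~": matches terms within the default
// edit-distance threshold, so "unbelievable" in the index can still be found.
Query fuzzy = new FuzzyQuery(new Term("keyword", "ubelievable"));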
There are two ways of dealing with synonyms. The query expansion you are trying to use relies on WordNet. As SynExpand's documentation says, you must first run Syns2Index to build the synonym index before you can use expansion. This is the easy way, but it only works with English words.
If you need to support multiple languages or add your own synonyms, you can use synonym injection during indexing. The idea is to write your own analyzer that injects synonyms from your own dictionary into the indexed documents. This may sound hard to implement, but fortunately there's an excellent example in the Lucene in Action book (the source code is available for free; see the lia.analysis.synonym package, though I highly recommend getting your own copy of this nice book).
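For illustration, a minimal sketch of such a filter in Java, modeled loosely on the lia.analysis.synonym example; the Map-based dictionary and the class itself are my own simplification, and it assumes the Lucene 3.0 attribute API (TermAttribute):
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.AttributeSource;

// Injects synonyms from a simple in-memory dictionary (e.g. "man" -> {"male"})
// at the same position as the original token, so phrase queries keep working.
public class SimpleSynonymFilter extends TokenFilter {
  private final Map<String, String[]> synonyms;
  private final Deque<String> pending = new ArrayDeque<String>();
  private final TermAttribute termAtt = addAttribute(TermAttribute.class);
  private final PositionIncrementAttribute posAtt = addAttribute(PositionIncrementAttribute.class);
  private AttributeSource.State savedState;

  public SimpleSynonymFilter(TokenStream in, Map<String, String[]> synonyms) {
    super(in);
    this.synonyms = synonyms;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!pending.isEmpty()) {
      restoreState(savedState);              // reuse offsets/type of the original token
      termAtt.setTermBuffer(pending.pop());  // emit the synonym...
      posAtt.setPositionIncrement(0);        // ...at the same position
      return true;
    }
    if (!input.incrementToken()) {
      return false;
    }
    String[] syns = synonyms.get(termAtt.term());
    if (syns != null) {
      savedState = captureState();
      for (String s : syns) {
        pending.push(s);
      }
    }
    return true;
  }
}
Wrap your analyzer's token stream with this filter at indexing time, so a document containing one word also gets its synonyms as index terms.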

Parse a search string (into NHibernate Criterias)

I would like to implement an advanced search for my project.
The search right now uses all the strings the user enters and makes one big disjunction with criteria API.
This works fine, but now I would like to implement more features: AND, OR and brackets()
I'm having a hard time parsing the string and building criteria from it. I have found this Stack Overflow question, but it didn't really help (he didn't make it clear what he wanted).
I found another article, but it supports much more and spits out SQL statements.
Another thing I've heard mentioned a lot is Lucene, but I'm not sure whether it would really help me.
I've been searching around a little bit and I've found the Lucene.Net WhitespaceAnalyzer and the QueryParser.
It changes the search A AND B OR C into something like +A +B C, which is a good step in the correct direction (plus it handles brackets).
The next step would be to get the converted string into a set of conjunctions and disjunctions.
The Java example I found was using the query builder, which I couldn't find in NHibernate.
Any more ideas?
Guess you haven't heard about NHibernate Search until now.
NHibernate Search uses Lucene underneath and gives you all the options of using AND, OR and the rest of the query grammar.
All you have to do is attribute your entities for indexing, and NHibernate will index them at a predefined location.
You can then search this index with the power that Lucene exposes and get your domain-level entity objects in return.
using (IFullTextSession s = Search.CreateFullTextSession(sf.OpenSession(new SearchInterceptor())))
{
    QueryParser qp = new QueryParser("id", new StopAnalyzer());
    IQuery NHQuery = s.CreateFullTextQuery(qp.Parse("Summary:series"), typeof(Book));
    IList result = NHQuery.List();
}
Powerful, isn’t it?
What I am basically doing right now is parsing the input string with the Lucene.Net parse API.
This gives me a uniform and simplified syntax. (Pseudocode)
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
void Search(string queryString)
{
    Analyzer analyzer = new WhitespaceAnalyzer();
    QueryParser luceneParser = new QueryParser("name", analyzer);
    Query luceneQuery = luceneParser.Parse(queryString);
    string[] words = luceneQuery.ToString().Split(' ');
    foreach (string word in words)
    {
        // Parsing the lucene.net string
    }
}
After that I am parsing this string manually, creating the disjunctions and conjunctions.
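For reference, instead of splitting the query's string form you can also walk the parsed Query object directly; a sketch in Java for brevity (the Lucene.Net equivalents are analogous):
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;

// Recursively visit the parsed query tree: each BooleanClause says whether its
// sub-query is a MUST (+, conjunction), SHOULD (disjunction) or MUST_NOT clause.
void collectClauses(Query q) {
    if (q instanceof BooleanQuery) {
        for (BooleanClause clause : ((BooleanQuery) q).getClauses()) {
            collectClauses(clause.getQuery()); // a nested BooleanQuery corresponds to a bracketed group
        }
    } else {
        // leaf query (e.g. a TermQuery) -> translate it into a criterion
    }
}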

Using MultiFieldQueryParser

I am using MultiFieldQueryParser for parsing strings like a.a., b.b., etc.
But after parsing, it's removing the dots in the string.
What am I missing here?
Thanks.
I'm not sure the MultiFieldQueryParser does what you think it does. Also...I'm not sure I know what you're trying to do.
I do know that with any query parser, strings like 'a.a.' and 'b.b.' will have the periods stripped out because, at least with the default Analyzer, all punctuation is treated as white space.
As far as the MultiFieldQueryParser goes, that's just a QueryParser that allows you to specify multiple default fields to search. So with the query
title:"Of Mice and Men" "John Steinbeck"
The string "John Steinbeck" will be looked for in all of your default fields whereas "Of Mice and Men" will only be searched for in the title field.
What analyzer is your parser using? If it's StopAnalyzer then the dot could be a stop word and thus ignored. Same thing if it's StandardAnalyzer, which cleans up the input (including removing dots).
(Repeating my answer from the dupe. One of these should be deleted).
The StandardAnalyzer specifically handles acronyms, and converts C.F.A. (for example) to cfa. This means you should be able to do the search, as long as you make sure you use the same analyzer for the indexing and for the query parsing.
I would suggest you run some more basic test cases to eliminate other factors. Try to use an ordinary QueryParser instead of a multi-field one.
Here's some code I wrote to play with the StandardAnalyzer:
StringReader testReader = new StringReader("C.F.A. C.F.A word");
StandardAnalyzer analyzer = new StandardAnalyzer();
TokenStream tokenStream = analyzer.tokenStream("title", testReader);
System.out.println(tokenStream.next());
System.out.println(tokenStream.next());
System.out.println(tokenStream.next());
The output for this, by the way, was:
(cfa,0,6,type=<ACRONYM>)
(c.f.a,7,12,type=<HOST>)
(word,13,17,type=<ALPHANUM>)
Note, for example, that if the acronym doesn't end with a dot then the analyzer assumes it's an internet host name, so searching for "C.F.A" will not match "C.F.A." in the text.
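A quick way to confirm this on the query-parsing side (a sketch; the "title" field name matches the snippet above, and the old no-Version constructors mirror it):
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

// With the same StandardAnalyzer at query time, "C.F.A." is reduced to the indexed token "cfa".
QueryParser parser = new QueryParser("title", new StandardAnalyzer());
Query q = parser.parse("C.F.A.");
System.out.println(q); // should print something like: title:cfa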