Querying for a stringified number in Lucene finds nothing

I have an existing index with some documents I'm trying to search.
When I search a "real textual" field, everything is OK.
When I try to search a field which is a number, the search gives 0 results.
The code is something like this (it is pylucene but the concept is the same):
dir = SimpleFSDirectory(File(indexDir))
analyzer = StandardAnalyzer(Version.LUCENE_CURRENT)
searcher = IndexSearcher(dir)
query = QueryParser(Version.LUCENE_CURRENT, "id", analyzer).parse("902")
hits = searcher.search(query, MAX)
print hits.totalHits #gives me 0
A Luke search (id:902) gives me empty results as well.
When I look at the Overview tab in Luke, it says this field is UTF-8 (string).
Am I doing anything wrong?
Edit:
It appears this happens on fields that are indexed and have no norms (according to the flags in Luke).
Can someone explain it?

I don't like answering my own questions but I believe this answer is an important reference.
The solution is to use a NumericRangeQuery whose lower and upper bounds are both the number you are looking for (this time in Java):
Query query = NumericRangeQuery.newIntRange("id", Integer.valueOf(902), Integer.valueOf(902), true, true);

Are you using SimpleAnalyzer while indexing? It strips out numbers. Make sure you are using the same analyzer for indexing and for searching.
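To see why an analyzer mismatch could produce zero hits, here is a rough plain-Java imitation of what SimpleAnalyzer does (keep runs of letters, lowercase them, drop everything else). It does not use Lucene at all; `LetterTokenizerSketch` and `tokenize` are made-up names for illustration, and the real analyzer may differ in details.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class LetterTokenizerSketch {
    // Rough imitation of SimpleAnalyzer: keep runs of letters, lowercase them,
    // and drop everything else (digits, punctuation, whitespace).
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString().toLowerCase(Locale.ROOT));
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString().toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("902"));          // no letter runs, so no tokens
        System.out.println(tokenize("Order id 902")); // only the letter runs survive
    }
}
```

If the indexing side behaved like this, a value such as 902 would never make it into the index as a term, so no query could find it.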

Related

How to search in lucene where there is only a single token for a field

I am creating an index where the documents are only a single term.
I am indexing domain names, so the field "domain" would look like:
example.com
thisiscool.com
justtesting.org
cnn.com
I am creating my search terms etc. programmatically. Because each document's field is just a single term, my searches don't seem to work as written: if I add multiple terms to a boolean query, it never finds anything.
How should I be searching given I have only a single term? I want to make this as efficient as possible.
Query term = new TermQuery(new Term("domain", "this"));
Query term2 = new TermQuery(new Term("domain", "cool"));
// add both terms to the boolean query (bq), each one required
bq.add(term, Occur.MUST);
bq.add(term2, Occur.MUST);
indexSearcher.search(bq, 100);
I was expecting to get "thisiscool.com" back, but I get 0 hits. My guess is that Lucene can't break the domain down into tokens, so it will never find a document that has both tokens "this" and "cool".
How should I be searching given this scenario?
Apply a wildcard to your search clause.
Query term = new TermQuery("domain", "this*");
Query term2 = new TermQuery("domain", "cool*"); // *cool* won't work sadly
However, that might not work, because the logic results in a query like this, where the domain has to begin with "this" as well as with "cool":
bq.add(term, Occur.MUST)
bq.add(term2, Occur.MUST)
=> +domain:this* +domain:cool*
Query term = new TermQuery("domain", "this*cool*");
=> +domain:this*cool* // probably gets hits
If you're using newer versions then you can use regular expressions in queries:
http://lucene.apache.org/core/6_6_0/core/org/apache/lucene/util/automaton/RegExp.html
The above example isn't actually how you should do this. I tested it out, and it doesn't even really work. What you'll want to do is build specialized queries, such as PrefixQuery, WildcardQuery, or RegexpQuery.
Additionally, if you're not using QueryParser or something else that takes an Analyzer, queries have to match exactly what's in your index. If domain is a TextField, it might have been lowercased or otherwise modified during analysis, so you'll need to know that too.
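Since each domain is indexed as one term, it may help to picture the term dictionary as a sorted set of whole strings. The sketch below is plain Java rather than Lucene (`TermDictionarySketch`, `exactLookup`, and `prefixLookup` are made-up names), and it only illustrates why an exact lookup for "this" finds nothing while a prefix scan does.

```java
import java.util.Arrays;
import java.util.SortedSet;
import java.util.TreeSet;

class TermDictionarySketch {
    // One term per document, as in the question: the whole domain name.
    static final TreeSet<String> TERMS = new TreeSet<>(Arrays.asList(
            "example.com", "thisiscool.com", "justtesting.org", "cnn.com"));

    // Analogous to a TermQuery: the query text must equal an indexed term.
    public static boolean exactLookup(String term) {
        return TERMS.contains(term);
    }

    // Analogous to a PrefixQuery: scan the sorted terms that start with the prefix.
    public static SortedSet<String> prefixLookup(String prefix) {
        return TERMS.subSet(prefix, prefix + Character.MAX_VALUE);
    }

    public static void main(String[] args) {
        System.out.println(exactLookup("this"));  // "this" alone is not an indexed term
        System.out.println(prefixLookup("this")); // terms beginning with "this"
        System.out.println(prefixLookup("cool")); // nothing begins with "cool"
    }
}
```

This is also why requiring both a "this" prefix and a "cool" prefix can never match: a single term cannot begin with both.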
I'd just use regex.
String pattern = "this.*cool.*"; // the pattern has to match the whole term, hence the trailing .*
Query q = new RegexpQuery(new Term("domain", pattern));
It can be slow, but as long as the pattern doesn't start with a wildcard it should be perfectly fine. I'm also not entirely sure how to ignore case with this, but that might be default.
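One thing worth checking with regex queries is anchoring: as far as I know, a Lucene RegexpQuery has to match the entire term, which is also how `Matcher.matches()` behaves in plain Java. The sketch below uses java.util.regex purely as a stand-in (its syntax is not identical to Lucene's RegExp, and `RegexTermSketch` / `matchingTerms` are illustrative names) to show which of the sample domains a whole-term pattern would select.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class RegexTermSketch {
    static final List<String> TERMS = Arrays.asList(
            "example.com", "thisiscool.com", "justtesting.org", "cnn.com");

    // Return the terms that the pattern matches in full (not just a substring).
    public static List<String> matchingTerms(String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> hits = new ArrayList<>();
        for (String term : TERMS) {
            if (pattern.matcher(term).matches()) {
                hits.add(term);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(matchingTerms("this.*cool"));   // term would have to end in "cool"
        System.out.println(matchingTerms("this.*cool.*")); // allows the ".com" suffix
    }
}
```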

Space issue in Lucene.NET C#

I want to search for a phrase that contains a space in a full-text search.
Example: Tom is a very good boy in class.
I want to search for the keyword "very good".
I'm using a whitespace tokenizer to create and search the index, but it does not find the keyword if it contains a space.
Code:
Query searchItemQuery = new WildcardQuery(new Term(string-field-name, searchkeyword.ToLower()));
I've tried splitting the keyword, but that does not work properly.
Can anyone suggest a solution for this problem?
Thanks,
Vijay
Since you are working with a tokenized string, every word is a separate term.
In order to find a phrase consisting of multiple terms, you need to use PhraseQuery instead of WildcardQuery.
Like this:
PhraseQuery phraseQuery = new PhraseQuery();
phraseQuery.Add(new Term(string-field-name, "very"));
phraseQuery.Add(new Term(string-field-name, "good"));
Note also that you are currently using a wildcard query. Wildcards in phrase queries are a bit complex. Check this post for details: Lucene - Wildcards in phrases
And finally, I would suggest using QueryParser instead of constructing the query manually.
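To make the "every word is a separate term" point concrete, here is a small plain-Java sketch (no Lucene involved; `PhraseMatchSketch`, `tokenize`, and `containsPhrase` are hypothetical names). It imitates whitespace tokenization plus lowercasing and then checks for the phrase tokens at consecutive positions, which is roughly what a PhraseQuery with zero slop asks for.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class PhraseMatchSketch {
    // Imitates a whitespace tokenizer plus lowercasing; punctuation stays attached.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // True if the phrase tokens occur at consecutive positions in the document.
    public static boolean containsPhrase(String document, String phrase) {
        List<String> doc = tokenize(document);
        List<String> wanted = tokenize(phrase);
        if (wanted.isEmpty()) {
            return false;
        }
        for (int start = 0; start + wanted.size() <= doc.size(); start++) {
            if (doc.subList(start, start + wanted.size()).equals(wanted)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String doc = "Tom is a very good boy in class.";
        System.out.println(tokenize(doc));                    // no single token is "very good"
        System.out.println(containsPhrase(doc, "very good")); // adjacent tokens "very", "good"
    }
}
```

The same sketch also shows a side effect of whitespace tokenization: the last token keeps its trailing period, so a search for "class" alone would not match it.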

Lucene: query parser is not working as expected

I'm using Lucene.Net, but I'm sure this still applies to the non-.NET flavour.
This is my query:
Collection:drwho AND Format:"Blu-ray"
This is what the query parser does to it:
{+Collection:drwho +Format:"blu ray"}
This is clearly not what I am after. This is the code I'm using:
Dim analyzer = New StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29)
Dim qp = New QueryParser(Lucene.Net.Util.Version.LUCENE_29, Nothing, analyzer)
Dim q As Query = qp.Parse(query)
Any ideas on why the query is being butchered? According to http://lucene.apache.org/java/3_4_0/queryparsersyntax.html, I cannot for the life of me see what is wrong with my query...
For NOT_ANALYZED fields you should either create a TermQuery in your code or use KeywordAnalyzer, because such fields require an exact match between the term in the index and the term in your query (your input is stored as Blu-ray in the index). Other analyzers process the input and, as you have already noticed, convert Blu-ray to blu ray, for example.
If you change your field to ANALYZED and use StandardAnalyzer while indexing, your query would also work in current form.
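The mismatch can be pictured with a plain-Java sketch. The two methods below are only rough stand-ins (they are not Lucene classes, and the names `AnalyzerMismatchSketch`, `standardLike`, and `keywordLike` are invented here): one splits on non-alphanumerics and lowercases, roughly like StandardAnalyzer did to this query, and the other leaves the input as a single untouched token, like KeywordAnalyzer or a NOT_ANALYZED field.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

class AnalyzerMismatchSketch {
    // Very rough stand-in for StandardAnalyzer: split on non-alphanumerics, lowercase.
    public static List<String> standardLike(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Stand-in for KeywordAnalyzer (or a NOT_ANALYZED field): one untouched token.
    public static List<String> keywordLike(String text) {
        return Collections.singletonList(text);
    }

    public static void main(String[] args) {
        System.out.println(standardLike("Blu-ray")); // what the query side produces
        System.out.println(keywordLike("Blu-ray"));  // what the index side stored
    }
}
```

The two sides only agree when the same kind of analysis is applied to both, which is the point of either switching the query to KeywordAnalyzer/TermQuery or re-indexing the field as ANALYZED.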

Apache lucene and text meaning

I have a question about the search process in Lucene.
I use this code for searching:
Directory directory = FSDirectory.GetDirectory(@"c:\index");
Analyzer analyzer = new StandardAnalyzer();
QueryParser qp = new QueryParser("content", analyzer);
qp.SetDefaultOperator(QueryParser.Operator.AND);
Query query = qp.Parse(searchString);
In one document I've set a field to "I want to go shopping", and in another document I've set it to "I wanna go shopping".
The meaning of both sentences is the same!
Is there a good way to make Lucene understand the meaning of sentences, or to normalize the sentences somehow? For example, I could store the field as "I wanna /want to/ go shopping" and remove the comment with a regexp in the results.
Lucene provides filters to normalize words and even to map similar words to each other.
PorterStemFilter:
Stemming reduces words to their roots.
For example, wanted and wants would both be reduced to the root want, so a search for any of those words would match the document.
However, wanna does not reduce to the root want, so stemming alone may not work in this case.
SynonymFilter:
Lets you map similar words to each other, for example from a configuration file.
So wanna can be mapped to want, and a search for either of them will match the document.
You would need to add these filters to your analysis chain.
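As a rough picture of what such an analysis chain does, here is a plain-Java sketch (not Lucene code). The synonym table, the stop-word list, and the names `SynonymChainSketch` and `analyze` are all assumptions made for this example; a real chain would use Lucene's own filters and whatever synonym data you configure.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class SynonymChainSketch {
    // Hypothetical synonym table: map informal variants onto one canonical token.
    static final Map<String, String> SYNONYMS = new HashMap<>();
    // Hypothetical stop-word list; real analyzers ship their own.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("to"));

    static {
        SYNONYMS.put("wanna", "want");
    }

    // Lowercase, split on whitespace, drop stop words, then apply the synonym map.
    public static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (raw.isEmpty() || STOP_WORDS.contains(raw)) {
                continue;
            }
            tokens.add(SYNONYMS.getOrDefault(raw, raw));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("I want to go shopping"));
        System.out.println(analyze("I wanna go shopping"));
    }
}
```

Because the same chain runs at index time and at query time, both sentences end up as the same token sequence, so an AND query built from either one would match both documents.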

How to make Lucene match all words in query?

I am using Lucene to allow a user to search for words in a large number of documents. Lucene seems to default to returning all documents containing any of the words entered.
Is it possible to change this behaviour? I know that '+' can be used to force a term to be included, but I would like to make that the default action.
Ideally I would like functionality similar to Google's: '-' to exclude words and "abc xyz" to group words.
Just to clarify:
I also thought of inserting '+' into all spaces in the query. I just wanted to avoid having to detect grouped terms (brackets, quotes etc.) and potentially breaking the query. Is there another approach?
This looks similar to the Lucene Sentence Search question. If you're interested, this is how I answered that question:
String defaultField = ...;
Analyzer analyzer = ...;
QueryParser queryParser = new QueryParser(defaultField, analyzer);
queryParser.setDefaultOperator(QueryParser.Operator.AND);
Query query = queryParser.parse("Searching is fun");
Like Adam said, there's no need to do anything to the query string. QueryParser's setDefaultOperator does exactly what you're asking for.
Why not just pre-parse the user's search input and adjust it to fit your criteria using the Lucene query syntax before passing it on to Lucene? Alternatively, you could just create some help documentation on how to use the standard syntax to create a specific query and let the user decide how the query should be performed.
Lucene has an extensive query language, described here, that covers everything you want except for + being the default, but that is something you can simply handle by replacing spaces with +. So the only thing you need to do is define the format you want people to enter their search queries in (I would strongly advise adhering to the default Lucene syntax), and then you can write the transformations from your own syntax to the Lucene syntax.
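For what it's worth, the pre-parsing idea could look roughly like the plain-Java sketch below. It is only an illustration under simplifying assumptions: it keeps double-quoted phrases together and leaves clauses that already start with '+' or '-' alone, but it does not handle parentheses, escaped quotes, or the AND/OR/NOT keywords. `RequireAllTermsSketch`, `splitClauses`, and `requireAll` are made-up names.

```java
import java.util.ArrayList;
import java.util.List;

class RequireAllTermsSketch {
    // Split on whitespace, but keep double-quoted phrases together.
    public static List<String> splitClauses(String input) {
        List<String> clauses = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;
            }
            if (Character.isWhitespace(c) && !inQuotes) {
                if (current.length() > 0) {
                    clauses.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            clauses.add(current.toString());
        }
        return clauses;
    }

    // Prefix '+' to every clause that does not already start with '+' or '-'.
    public static String requireAll(String input) {
        StringBuilder out = new StringBuilder();
        for (String clause : splitClauses(input)) {
            if (out.length() > 0) {
                out.append(' ');
            }
            char first = clause.charAt(0);
            if (first != '+' && first != '-') {
                out.append('+');
            }
            out.append(clause);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(requireAll("lucene -solr \"abc xyz\" +search"));
    }
}
```

The number of cases this sketch ignores is a fair argument for letting QueryParser handle the default operator instead of rewriting the query string by hand.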
The behavior is hard-coded in method addClause(List, int, int, Query) of class org.apache.lucene.queryParser.QueryParser, so the only way to change the behavior (other than the workarounds above) is to change that method. The end of the method looks like this:
if (required && !prohibited)
clauses.addElement(new BooleanClause(q, BooleanClause.Occur.MUST));
else if (!required && !prohibited)
clauses.addElement(new BooleanClause(q, BooleanClause.Occur.SHOULD));
else if (!required && prohibited)
clauses.addElement(new BooleanClause(q, BooleanClause.Occur.MUST_NOT));
else
throw new RuntimeException("Clause cannot be both required and prohibited");
Changing "SHOULD" to "MUST" should make clauses (e.g. words) required by default.