Lucene substring matching

What is the best way to check whether String a is part of String b in Lucene? For example: a = "capital" and b = "Berlin is a capital of Germany". In this case b contains a and fits the requirement.

I think your problem can be treated as checking whether a field contains a certain term.
A basic TermQuery should be enough to solve it: with most analyzers, "Berlin is a capital of Germany" is analyzed into the terms "berlin", "capital" and "germany" (if you use the basic stop words).
// code in Scala
new TermQuery(new Term("contents", "capital"))
You can also use PhraseQuery, though your problem is not the most suitable scenario for it.
val query = new PhraseQuery();
query.add(new Term("contents", "capital"))
Lucene in Action, 2nd edition, section 3.4 "Lucene's diverse queries", introduces the kinds of Query available in Lucene. I suggest you read it; it might help.
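To make the point above concrete, here is a minimal Python sketch of what "analyze, then look up a term" means. It does not use Lucene; `analyze`, `contains_term` and the stop-word list are hypothetical stand-ins for illustration, and a real analyzer may tokenize differently. Note that the match is against whole terms, so a fragment of a word would not be found this way.

```python
import re

# Hypothetical stop-word list for illustration; real analyzers ship their own.
STOP_WORDS = {"a", "an", "is", "of", "the"}

def analyze(text):
    """Lowercase, split on non-alphanumeric characters, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def contains_term(text, term):
    """Rough analogue of a term query: exact match against the analyzed tokens."""
    return term in analyze(text)

b = "Berlin is a capital of Germany"
print(analyze(b))                   # ['berlin', 'capital', 'germany']
print(contains_term(b, "capital"))  # True
print(contains_term(b, "capit"))    # False: only whole terms match
```

If arbitrary substrings (such as "capit") must match, a plain term lookup is not enough, and a different query type or a different indexing strategy would be needed.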

Related

Space issue in Lucene.NET C#

I want to search for a phrase that contains a space in a full text search.
Ex: Tom is a very good boy in class.
I want to search for the keyword "very good".
I'm using a whitespace tokenizer to create and search the index, but it does not find the keyword if it contains a space.
Code:
Query searchItemQuery = new WildcardQuery(new Term(string-field-name, searchkeyword.ToLower()));
I've tried splitting the keyword, but it does not work properly.
Can anyone suggest a solution for this problem?
Thanks,
Vijay

Since you are working with a tokenized string, every word is a separate term.
In order to find a phrase consisting of multiple terms, you need to use PhraseQuery instead of WildcardQuery.
Like this:
PhraseQuery phraseQuery = new PhraseQuery();
phraseQuery.Add(new Term(string-field-name, "very"));
phraseQuery.Add(new Term(string-field-name, "good"));
Note also that you are using a wildcard query. Wildcards in phrase queries are a bit complex. Check this post for details: Lucene - Wildcards in phrases
Finally, I would suggest considering QueryParser instead of constructing the query manually.
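The reason a single-term lookup fails while a phrase lookup succeeds can be sketched in a few lines of Python. This is not Lucene code: `whitespace_tokens` and `phrase_match` are hypothetical helpers, and lowercasing both sides is an assumption made here to mirror the `ToLower()` call in the question.

```python
def whitespace_tokens(text):
    """Split on whitespace only; lowercasing is an assumption of this sketch."""
    return text.lower().split()

def phrase_match(text, phrase):
    """True if the phrase's terms occur at consecutive positions in the text."""
    doc = whitespace_tokens(text)
    terms = whitespace_tokens(phrase)
    n = len(terms)
    return n > 0 and any(doc[i:i + n] == terms for i in range(len(doc) - n + 1))

sentence = "Tom is a very good boy in class."

# No single term equals "very good", so a one-term lookup cannot match.
print("very good" in whitespace_tokens(sentence))  # False

# Matching the two terms at adjacent positions does work.
print(phrase_match(sentence, "very good"))         # True
print(phrase_match(sentence, "good very"))         # False
```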

querying for a string'ed number in lucene finds nothing

I have an existing index with some documents I'm trying to search.
When I search a "real textual" field, everything is OK.
When I try to search a field which is a number, the search gives 0 results.
The code is something like this (it is pylucene but the concept is the same):
dir = SimpleFSDirectory(File(indexDir))
analyzer = StandardAnalyzer(Version.LUCENE_CURRENT)
searcher = IndexSearcher(dir)
query = QueryParser(Version.LUCENE_CURRENT, "id", analyzer).parse("902")
hits = searcher.search(query, MAX)
print hits.totalHits #gives me 0
A Luke search (id:902) gives me empty results as well.
When I look at the Overview tab in Luke, it says this field is UTF-8 (string).
Anything I'm doing wrong?
edit:
It appears this happens on fields that are indexed and have no norms (according to the flags in Luke).
Can someone explain it?

I don't like answering my own questions, but I believe this answer is an important reference.
The solution is to use a NumericRangeQuery whose lower and upper bounds are both the number you seek (this time in Java):
NumericRangeQuery.newIntRange("id", Integer.valueOf(902), Integer.valueOf(902),
true, true)

Are you using SimpleAnalyzer while indexing? It strips out numbers. Make sure you use the same analyzer for indexing and for searching.
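If a letters-only analyzer is indeed the cause, the effect can be illustrated with a small Python sketch. It does not call Lucene; `letters_only_tokens` is a hypothetical stand-in for a tokenizer that keeps only runs of letters, which is roughly how SimpleAnalyzer behaves.

```python
import re

def letters_only_tokens(text):
    """Keep runs of letters only, lowercased; digits and punctuation act as separators."""
    return [t.lower() for t in re.findall(r"[^\W\d_]+", text)]

def whitespace_tokens(text):
    """Keep every whitespace-separated token as-is, digits included."""
    return text.split()

print(letters_only_tokens("902"))      # []  -> nothing is indexed for this value
print(letters_only_tokens("Item 902")) # ['item']
print(whitespace_tokens("Item 902"))   # ['Item', '902']
```

With the first tokenizer the value 902 never reaches the index, so no query for it can match, whichever analyzer is used at search time.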

Lucene: query parser is not working as expected

I'm using Lucene.Net, but I'm sure it still applies to the non-.NET flavour.
This is my query:
Collection:drwho AND Format:"Blu-ray"
This is what the query parser does to it:
{+Collection:drwho +Format:"blu ray"}
This is clearly not what I am after. This is the code I'm using:
Dim analyzer = New StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29)
Dim qp = New QueryParser(Lucene.Net.Util.Version.LUCENE_29, Nothing, analyzer)
Dim q As Query = qp.Parse(query)
Any ideas on why the query is being butchered? According to http://lucene.apache.org/java/3_4_0/queryparsersyntax.html, I cannot for the life of me see what is wrong with my query...

For NOT_ANALYZED fields, you should either create a TermQuery in your code or use KeywordAnalyzer, because matching requires the term in the index and the term in your query to be identical (your input is stored as Blu-ray in the index). Other analyzers process the input and convert Blu-ray to blu ray, for example, as you have already noticed.
If you change your field to ANALYZED and use StandardAnalyzer while indexing, your query will also work in its current form.
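The mismatch described above can be sketched in Python without Lucene. `standard_like_analyze` and `keyword_analyze` are hypothetical stand-ins (the first lowercases and splits on punctuation, the second returns the input untouched); they only approximate what the real analyzers do.

```python
import re

def standard_like_analyze(text):
    """Rough stand-in for an analyzer that lowercases and splits on punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())

def keyword_analyze(text):
    """Rough stand-in for a keyword analyzer: the whole input is one term."""
    return [text]

# A NOT_ANALYZED field stores the value as a single, unmodified term.
indexed_terms = {"Blu-ray"}

analyzed_query = standard_like_analyze("Blu-ray")  # ['blu', 'ray']
keyword_query = keyword_analyze("Blu-ray")         # ['Blu-ray']

print(all(t in indexed_terms for t in analyzed_query))  # False: no such terms in the index
print(all(t in indexed_terms for t in keyword_query))   # True: exact term match
```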

Parse a search string (into NHibernate Criterias )

I would like to implement an advanced search for my project.
Right now the search takes all the strings the user enters and builds one big disjunction with the Criteria API.
This works fine, but now I would like to implement more features: AND, OR and brackets ().
I am having a hard time parsing the string and building criteria from it. I have found this Stackoverflow question, but it didn't really help (he didn't make it clear what he wanted).
I found another article, but it supports much more and spits out SQL statements.
Another thing I've heard mentioned a lot is Lucene, but I'm not sure whether it would really help me.
I've been searching around a little bit and I've found the Lucene.Net WhitespaceAnalyzer and the QueryParser.
It changes the search A AND B OR C into something like +A +B C, which is a good step in the correct direction (plus it handles brackets).
The next step would be to get the converted string into a set of conjunctions and disjunctions.
The Java example I found was using the query builder which I couldn't find in NHibernate.
Any more ideas ?

I guess you haven't heard about NHibernate Search until now.
NHibernate Search uses Lucene underneath and gives you all the options of using AND, OR and the rest of the query grammar.
All you have to do is add attributes to your entities for indexing, and NHibernate will index them at a predefined location.
You can then search this index with the power that Lucene exposes and get your domain-level entity objects in return.
using (IFullTextSession s = Search.CreateFullTextSession(sf.OpenSession(new SearchInterceptor()))) {
    // "id" is the default field for terms that carry no field prefix
    QueryParser qp = new QueryParser("id", new StopAnalyzer());
    IQuery NHQuery = s.CreateFullTextQuery(qp.Parse("Summary:series"), typeof(Book));
    IList result = NHQuery.List();
}
Powerful, isn’t it?

What I am basically doing right now is parsing the input string with the Lucene.Net parse API.
This gives me a uniform and simplified syntax. (Pseudocode)
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
void Search(string queryString)
{
    Analyzer analyzer = new WhitespaceAnalyzer();
    QueryParser luceneParser = new QueryParser("name", analyzer);
    // Let Lucene normalize the user input into its uniform syntax
    Query luceneQuery = luceneParser.Parse(queryString);
    string[] words = luceneQuery.ToString().Split(' ');
    foreach (string word in words)
    {
        // Parse each clause of the normalized Lucene.Net string
    }
}
After that I am parsing this string manually, creating the disjunctions and conjunctions.
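The manual step described above might look roughly like the following Python sketch. It is an illustration only: `split_clauses` is a hypothetical helper, it assumes a flat normalized string such as `+A +B C` without field prefixes, and it does not handle brackets or quoted phrases.

```python
def split_clauses(normalized_query):
    """Sort the clauses of a flat, normalized query string by their prefix.

    '+' marks a required clause (conjunction), '-' a prohibited one,
    and no prefix an optional clause (disjunction).
    """
    must, must_not, should = [], [], []
    for clause in normalized_query.split():
        if clause.startswith("+"):
            must.append(clause[1:])
        elif clause.startswith("-"):
            must_not.append(clause[1:])
        else:
            should.append(clause)
    return {"must": must, "must_not": must_not, "should": should}

print(split_clauses("+A +B C"))
# {'must': ['A', 'B'], 'must_not': [], 'should': ['C']}
```

The `must` list would map to a conjunction of criteria and the `should` list to a disjunction; nested brackets would need a recursive version of the same idea.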

Using MultiFieldQueryParser

I am using MultiFieldQueryParser to parse strings like a.a., b.b., etc.
But after parsing, it removes the dots from the string.
What am I missing here?
Thanks.

I'm not sure the MultiFieldQueryParser does what you think it does. Also...I'm not sure I know what you're trying to do.
I do know that with any query parser, strings like 'a.a.' and 'b.b.' will have the periods stripped out because, at least with the default Analyzer, all punctuation is treated as white space.
As far as the MultiFieldQueryParser goes, that's just a QueryParser that allows you to specify multiple default fields to search. So with the query:
title:"Of Mice and Men" "John Steinbeck"
The string "John Steinbeck" will be looked for in all of your default fields whereas "Of Mice and Men" will only be searched for in the title field.
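The default-field expansion described above can be sketched in Python. This is not the Lucene implementation: `expand_clause` is a hypothetical helper, the clauses are passed in already split, and the default fields `title` and `author` are assumptions chosen for the example.

```python
def expand_clause(clause, default_fields):
    """Return (field, text) pairs for one query clause.

    A clause with an explicit 'field:' prefix stays on that field;
    a clause without one is searched in every default field.
    """
    field, sep, text = clause.partition(":")
    if sep and not field.startswith('"'):
        return [(field, text.strip('"'))]
    return [(f, clause.strip('"')) for f in default_fields]

defaults = ["title", "author"]  # hypothetical default fields

print(expand_clause('title:"Of Mice and Men"', defaults))
# [('title', 'Of Mice and Men')]
print(expand_clause('"John Steinbeck"', defaults))
# [('title', 'John Steinbeck'), ('author', 'John Steinbeck')]
```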

What analyzer is your parser using? If it's StopAnalyzer, then the dot could be a stop word and is thus ignored. The same applies if it's StandardAnalyzer, which cleans up input (including removing dots).

(Repeating my answer from the dupe. One of these should be deleted).
The StandardAnalyzer specifically handles acronyms, and converts C.F.A. (for example) to cfa. This means you should be able to do the search, as long as you make sure you use the same analyzer for the indexing and for the query parsing.
I would suggest you run some more basic test cases to eliminate other factors. Try to use an ordinary QueryParser instead of a multi-field one.
Here's some code I wrote to play with the StandardAnalyzer:
StringReader testReader = new StringReader("C.F.A. C.F.A word");
StandardAnalyzer analyzer = new StandardAnalyzer();
TokenStream tokenStream = analyzer.tokenStream("title", testReader);
System.out.println(tokenStream.next());
System.out.println(tokenStream.next());
System.out.println(tokenStream.next());
The output for this, by the way, was:
(cfa,0,6,type=<ACRONYM>)
(c.f.a,7,12,type=<HOST>)
(word,13,17,type=<ALPHANUM>)
Note, for example, that if the acronym doesn't end with a dot then the analyzer assumes it's an internet host name, so searching for "C.F.A" will not match "C.F.A." in the text.
