MultiFieldQueryParser fuzzy searching - lucene

I'm trying to implement a fuzzy search which takes a string like "Samsung Galaxy S7 tablet" and searches across multiple fields (Category, Brand, Model), but I'm unable to set the sensitivity of the fuzzy query.
The following code
var line = "Samsung Galaxy S7 tablet";
Analyzer analyzer = new StandardAnalyzer(LuceneVersion.LUCENE_48);
string[] searchfields = new string[] { "Category", "Brand", "Model" };
var query = new BooleanQuery();
var parser = new MultiFieldQueryParser(LuceneVersion.LUCENE_48, searchfields, analyzer);
string[] terms = line.Split(new[] { " " }, StringSplitOptions.RemoveEmptyEntries);
foreach (string term in terms)
{
var q = parser.Parse(term.Replace("~", "") + "~0.8");
query.Add(q, Occur.MUST);
}
always produces a query like
+(Category:samsung~2 Brand:samsung~2 Model:samsung~2) +(Category:galaxy~2 Brand:galaxy~2 Model:galaxy~2) +(Category:s7~2 Brand:s7~2 Model:s7~2) +(Category:tablet~2 Brand:tablet~2 Model:tablet~2)
Any idea how can I set the sensitivity of the fuzzy query. Is there a better way to perform this type of search using Lucene?
Thank you!

Related

Apache Lucene fuzzy search for multi-worded phrases

I have the following Apache Lucene 7 application:
StandardAnalyzer standardAnalyzer = new StandardAnalyzer();
Directory directory = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(standardAnalyzer);
IndexWriter writer = new IndexWriter(directory, config);
Document document = new Document();
document.add(new TextField("content", new FileReader("document.txt")));
writer.addDocument(document);
writer.close();
IndexReader reader = DirectoryReader.open(directory);
IndexSearcher searcher = new IndexSearcher(reader);
Query fuzzyQuery = new FuzzyQuery(new Term("content", "Company"), 2);
TopDocs results = searcher.search(fuzzyQuery, 5);
System.out.println("Hits: " + results.totalHits);
System.out.println("Max score:" + results.getMaxScore())
when I use it with :
new FuzzyQuery(new Term("content", "Company"), 2);
the application works fine and returns the following result:
Hits: 1
Max score:0.35161147
but when I try to search with multi term query, for example:
new FuzzyQuery(new Term("content", "Company name"), 2);
it returns the following result:
Hits: 0
Max score:NaN
Anyway, the phrase Company name exists in the source document.txt file.
How to properly use FuzzyQuery in this case in order to be able to do the fuzzy search for multi-word phrases.
UPDATED
Based on the provided solution I have tested it on the following text information:
Company name: BlueCross BlueShield Customer Service
1-800-521-2227
of Texas Preauth-Medical 1-800-441-9188
Preauth-MH/CD 1-800-528-7264
Blue Card Access 1-800-810-2583
For the following query:
SpanQuery[] clauses = new SpanQuery[2];
clauses[0] = new SpanMultiTermQueryWrapper<FuzzyQuery>(new FuzzyQuery(new Term("content", "BlueCross"), 2));
clauses[1] = new SpanMultiTermQueryWrapper<FuzzyQuery>(new FuzzyQuery(new Term("content", "BlueShield"), 2));
SpanNearQuery query = new SpanNearQuery(clauses, 0, true);
the search works fine:
Hits: 1
Max score:0.5753642
but when I try to corrupt a little bit the search query(for example from BlueCross to BlueCros)
SpanQuery[] clauses = new SpanQuery[2];
clauses[0] = new SpanMultiTermQueryWrapper<FuzzyQuery>(new FuzzyQuery(new Term("content", "BlueCros"), 2));
clauses[1] = new SpanMultiTermQueryWrapper<FuzzyQuery>(new FuzzyQuery(new Term("content", "BlueShield"), 2));
SpanNearQuery query = new SpanNearQuery(clauses, 0, true);
it stops working and returns:
Hits: 0
Max score:NaN
The problem here is the following, you're using TextField, which is tokenizing field. E.g. your text "Company name is working on something" would be effectively split by spaces (and others delimeters). So, even if you have the text Company name, during indexation it will become Company, name, is, etc.
In this case this TermQuery won't be able to find what you're looking for. The trick which going to help you would look like this:
SpanQuery[] clauses = new SpanQuery[2];
clauses[0] = new SpanMultiTermQueryWrapper(new FuzzyQuery(new Term("content", "some"), 2));
clauses[1] = new SpanMultiTermQueryWrapper(new FuzzyQuery(new Term("content", "text"), 2));
SpanNearQuery query = new SpanNearQuery(clauses, 0, true);
However, I wouldn't recommend this approach much, especially if your load would be big and you're planning on searching on a 10 term long company names. One should be aware, that those query are potentially heavy to execute.
The following problem with BlueCros is the following. By default Lucene uses StandardAnalyzer for TextField. So it means it effectively lowercase the terms, basically it means that BlueCross in the content field becomes bluecross.
Fuzzy difference between BlueCros and bluecross is 3, that's the reason you do not have a match.
Simple proposal would be to convert term in query to the lowercase, by doing something like .toLowerCase()
In general, one should prefer to use same analyzers during the query time as well (e.g. during construction of the query)
For Lucene.Net it can be like this.
private string _IndexPath = #"Your Index Path";
private Directory _Directory;
private Searcher _IndexSearcher;
private MultiPhraseQuery _MultiPhraseQuery;
_Directory = FSDirectory.Open(_IndexPath);
IndexReader indexReader = IndexReader.Open(_Directory, true);
string field = "Name" // Your field name
string keyword = "big red fox"; // your search term
float fuzzy = 0,7f; // between 0-1
using (_IndexSearcher = new IndexSearcher(indexReader))
{
// "big red fox" to [big,red,fox]
var keywordSplit = keyword.Split();
_MultiPhraseQuery = new MultiPhraseQuery();
FuzzyTermEnum[] _FuzzyTermEnum = new FuzzyTermEnum[keywordSplit.Length];
Term[] _Term = new Term[keywordSplit.Length];
for (int i = 0; i < keywordSplit.Length; i++)
{
_FuzzyTermEnum[i] = new FuzzyTermEnum(indexReader, new Term(field, keywordSplit[i]),fuzzy);
_Term[i] = _FuzzyTermEnum[i].Term;
if (_Term[i] == null)
{
_MultiPhraseQuery.Add(new Term(field, keywordSplit[i]));
}
else
{
_MultiPhraseQuery.Add(_FuzzyTermEnum[i].Term);
}
}
var results = _IndexSearcher.Search(_MultiPhraseQuery, indexReader.MaxDoc);
foreach (var loopDoc in results.ScoreDocs.OrderByDescending(s => s.Score))
{
//YourCode Here
}
}

Lucene problems searchinh hyphenated field

I'm having some problems with Lucene that are driving me nuts. I have the following field:
doc.Add(new Field("cataloguenumber", i.CatalogueNumber.ToLower(), Field.Store.YES, Field.Index.ANALYZED));
Which will contain a catalogue number that looks something like this:
DF-GH5
DF-FJ4
DF-DOG
AC-DP
AC-123
AC-DOCO
i.e. two characters followed by a hyphen followed by 2-5 alphanumeric characters.
I'm trying to run a boolean query to allow users to search over the data:
// specify the search fields, lucene search in multiple fields
string[] searchfields = new string[] { "cataloguenumber", "title", "author", "categories", "year", "length", "keyword", "description" };
// Making a boolean query for searching and get the searched hits
BooleanQuery mainQuery = new BooleanQuery();
QueryParser parser;
//Add filter for main keyword
parser = new MultiFieldQueryParser(Lucene.Net.Util.Version.LUCENE_30, searchfields, new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30));
parser.AllowLeadingWildcard = true;
mainQuery.Add(parser.Parse(GetMainSearchQueryString(SearchPhrase)), Occur.MUST);
The system is working fine for all fields EXCEPT cataloguenumber which for whatever reason is not working at all.
Ideally we would like to be able to search by full or partial cataloguenumber so for example "DF-" should return all items prefixed DF
Does anyone know how I can make this work?
Thanks very much in advance
Olly
A common source of problems is to use different analyzers on index-time and query-time. You should be able to get good results by using a StandardAnalyzer - it treats the text DF-GH5 as a single token so you will be able to search using fx df-gh5 or df-* but make sure to use it for the IndexWriter and the QueryParser.
Here is a simple example which builds an in-memory index with a single document, and tries to query the index by cataloguenumber.
public static void Test()
{
// Use an in-memory index.
RAMDirectory indexDirectory = new RAMDirectory();
// Make sure to use the same analyzer for indexing
Analyzer analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
// Add single document to the index.
using (IndexWriter writer = new IndexWriter(indexDirectory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED))
{
Document document = new Document();
document.Add(new Field("content", "This is just some text", Field.Store.YES, Field.Index.ANALYZED));
document.Add(new Field("cataloguenumber", "DF-GH5", Field.Store.YES, Field.Index.ANALYZED));
writer.AddDocument(document);
}
var parser = new MultiFieldQueryParser(
Lucene.Net.Util.Version.LUCENE_30,
new[] { "cataloguenumber", "content" },
analyzer);
var searcher = new IndexSearcher(indexDirectory);
DoSearch("df-gh5", parser, searcher);
DoSearch("df-*", parser, searcher);
}
private static void DoSearch(string queryString, MultiFieldQueryParser parser, IndexSearcher searcher)
{
var query = parser.Parse(queryString);
TopDocs docs = searcher.Search(query, 10);
foreach (ScoreDoc scoreDoc in docs.ScoreDocs)
{
Document searchHit = searcher.Doc(scoreDoc.Doc);
string cataloguenumber = searchHit.GetValues("cataloguenumber").FirstOrDefault();
string content = searchHit.GetValues("content").FirstOrDefault();
Console.WriteLine($"Found object: {cataloguenumber} {content}");
}
}

Can we use SpanNearQuery in phonetic index?

I've implemented a lucene-based software to index more than 10 millions of person's names and these names can be written on different ways like "Luíz" and "Luis". The index was created using the phonetic values of the respective tokens (a custom analyzer was created).
Currently, I'm using QueryParser to query for a given name with good results. But, in the book "Lucene in Action" is mentioned that SpanNearQuery can improve my queries using the proximity of tokens. I've played with the SpanNearQuery against a non-phonetic index of name and the results were superior compared to QueryParser.
As we should query using the same analyzer used to indexing, I couldn't find how I can use my custom phonetic analyzer and SpanNearQuery at same time, or rephrasing:
how can I use SpanNearQuery on the phonetic index?
Thanks in advance.
My first thought is: Wouldn't a phrase query with slop do the job? That would certainly be the easiest way:
"term1 term2"~5
This will use your phonetic analyzer, and produce a proximity query with the resulting tokens.
So, if you really do need to use SpanQueries here (perhaps you are using fuzzy queries or wildcards or some such, or PhraseQuery has been leering menacingly at you and you want nothing more to do with it), you'll need to do the analysis yourself. You can do this by getting a TokenStream from Analyzer.tokenStream, and iterating through the analyzed tokens.
If you are using a phonetic algorithm that produces a single code per term (soundex, for example):
SpanNearQuery.Builder nearBuilder = new SpanNearQuery.Builder("text", true);
nearBuilder.setSlop(4);
TokenStream stream = analyzer.tokenStream("text", queryStringToParse);
stream.addAttribute(CharTermAttribute.class);
stream.reset();
while(stream.incrementToken()) {
CharTermAttribute token = stream.getAttribute(CharTermAttribute.class);
nearBuilder.addClause(new SpanTermQuery(new Term("text", token.toString())));
}
Query finalQuery = nearBuilder.build();
stream.close();
If you are using a double metaphone, where you can have 1-2 terms at the same position, it's a bit more complex, as you'll need to consider those position increments:
SpanNearQuery.Builder nearBuilder = new SpanNearQuery.Builder("text", true);
nearBuilder.setSlop(4);
TokenStream stream = analyzer.tokenStream("text", "through and through");
stream.addAttribute(CharTermAttribute.class);
stream.addAttribute(PositionIncrementAttribute.class);
stream.reset();
String queuedToken = null;
while(stream.incrementToken()) {
CharTermAttribute token = stream.getAttribute(CharTermAttribute.class);
PositionIncrementAttribute increment = stream.getAttribute(PositionIncrementAttribute.class);
if (increment.getPositionIncrement() == 0) {
nearBuilder.addClause(new SpanOrQuery(
new SpanTermQuery(new Term("text", queuedToken)),
new SpanTermQuery(new Term("text", token.toString()))
));
queuedToken = null;
}
else if (increment.getPositionIncrement() >= 1 && queuedToken != null) {
nearBuilder.addClause(new SpanTermQuery(new Term("text", queuedToken)));
queuedToken = token.toString();
}
else {
queuedToken = token.toString();
}
}
if (queuedToken != null) {
nearBuilder.addClause(new SpanTermQuery(new Term("text", queuedToken)));
}
Query finalQuery = nearBuilder.build();
stream.close();

Lucene MultiFieldQuery with WildcardQuery

Currently I have an issue with the Lucene search (version 2.9).
I have a search term and I need to use it on several fields. Therefore, I have to use MultiFieldQueryParser. On the other hand, I have to use the WhildcardQuery(), because our customer wants to search for a term in a phrase (e.g. "CMH" should match "KRC250/CMH/830/T/H").
I have tried to replace the slashes ('/') with stars ('*') and use a BooleanQuery with enclosed stars for the term.
Unfortunately whichout any success.
Does anyone have any Idea?
Yes, if the field shown is a single token, setting setAllowLeadingWildcard to be true would be necessary, like:
parser.setAllowLeadingWildcard(true);
Query query = parser.parse("*CMH*");
However:
You don't mention how the field is analyzed. By default, the StandardAnalyzer is used, which will split it into tokens at slashes (or asterisks, when indexing data). If you are using this sort of analysis, you could simply create a TermQuery searching for "cmh" (StandardAnalyzer includes a LowercaseFilter), or simply:
String[] fields = {"this", "that", "another"};
QueryParser parser = MultiFieldQueryParser(Version.LUCENE_29, fields, analyzer) //Assuming StandardAnalyzer
Query simpleQuery = parser.parse("CMH");
//Or even...
Query slightlyMoreComplexQuery = parser.parse("\"CMH/830/T\"");
I don't understand what you mean by a BooleanQuery with enclosed stars, if you can include code to elucidate that, it might help.
Sorry, maybe I have described it a little bit wrong.
I took something like this:
BooleanQuery bq = new BooleanQuery();
foreach (string field in fields)
{
foreach (string tok in tokArr)
{
bq.Add(new WildcardQuery(new Term(field, " *" + tok + "* ")), BooleanClause.Occur.SHOULD);
}
}
return bq;
but unfortunately it did not work.
I have modified it like this
string newterm = string.Empty;
string[] tok = term.Split(new[] { ' ', '/' }, StringSplitOptions.RemoveEmptyEntries);
tok.ForEach(x => newterm += x.EnsureStartsWith(" *").EnsureEndsWith("* "));
var version = Lucene.Net.Util.Version.LUCENE_29;
var analyzer = new StandardAnalyzer(version);
var parser = new MultiFieldQueryParser(version, fields, analyzer);
parser.SetDefaultOperator(QueryParser.Operator.AND);
parser.SetAllowLeadingWildcard(true);
return parser.Parse(newterm);
and my customer love it :-)

Lucene: the same query parsed from String and build via Query API doesn't yield same results

I have the following code:
public static void main(String[] args) throws Throwable {
String[] texts = new String[]{
"starts_with k mer",
"starts_with mer",
"starts_with bleue est mer",
"starts_with mer est bleue",
"starts_with mer bla1 bla2 bla3 bla4 bla5",
"starts_with bleue est la mer",
"starts_with la mer est bleue",
"starts_with la mer"
};
//write:
Set<String> stopWords = new HashSet<String>();
StandardAnalyzer stdAn = new StandardAnalyzer(Version.LUCENE_36, stopWords);
Directory fsDir = FSDirectory.open(INDEX_DIR);
IndexWriterConfig iwConf = new IndexWriterConfig(Version.LUCENE_36,stdAn);
iwConf.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
IndexWriter indexWriter = new IndexWriter(fsDir,iwConf);
for(String text:texts) {
Document document = new Document();
document.add(new Field("title",text,Store.YES,Index.ANALYZED));
indexWriter.addDocument(document);
}
indexWriter.commit();
//read
IndexReader indexReader = IndexReader.open(fsDir);
IndexSearcher indexSearcher = new IndexSearcher(indexReader);
//get query:
//Query query = getQueryFromString("mer");
Query query = getQueryFromAPI("mer");
//explain
System.out.println("======== Query: "+query+"\n");
TopDocs hits = indexSearcher.search(query, 10);
for (ScoreDoc scoreDoc : hits.scoreDocs) {
Document doc = indexSearcher.doc(scoreDoc.doc);
System.out.println(">>> "+doc.get("title"));
System.out.println("Explain:");
System.out.println(indexSearcher.explain(query, scoreDoc.doc));
}
}
private static Query getQueryFromString(String searchString) throws Throwable {
Set<String> stopWords = new HashSet<String>();
Query query = new QueryParser(Version.LUCENE_36, "title",new StandardAnalyzer(Version.LUCENE_36, stopWords)).parse("("+searchString+") \"STARTS_WITH "+searchString+"\"");
return query;
}
private static Query getQueryFromAPI(String searchString) throws Throwable {
Set<String> stopWords = new HashSet<String>();
Query searchStringTermsMatchTitle = new QueryParser(Version.LUCENE_36, "title", new StandardAnalyzer(Version.LUCENE_36, stopWords)).parse(searchString);
PhraseQuery titleStartsWithSearchString = new PhraseQuery();
titleStartsWithSearchString.add(new Term("title","STARTS_WITH".toLowerCase()+" "+searchString));
BooleanQuery query = new BooleanQuery(true);
BooleanClause matchClause = new BooleanClause(searchStringTermsMatchTitle, Occur.SHOULD);
query.add(matchClause);
BooleanClause startsWithClause = new BooleanClause(titleStartsWithSearchString, Occur.SHOULD);
query.add(startsWithClause);
return query;
}
Basically I'm indexing some strings, and then I have two methods for creating a Lucene Query from user input, one that simply builds the corresponding Lucene query String "manually" (via string concatenation) and another that uses Lucene's API for building queries. They seem to be building the same query, as the debug output of the query shows the exact same query string, but the search results are not the same:
running the query built via String concatenation yields (for argument "mer"):
title:mer title:"starts_with mer"
and ideed in this case when I search with it I get documents that match the title:"starts_with mer" part first. Here's the explain on the first result:
>>> starts_with mer
Explain:
1.2329358 = (MATCH) sum of:
0.24658716 = (MATCH) weight(title:mer in 1), product of:
0.4472136 = queryWeight(title:mer), product of:
0.882217 = idf(docFreq=8, maxDocs=8)
0.50692016 = queryNorm
0.55138564 = (MATCH) fieldWeight(title:mer in 1), product of:
1.0 = tf(termFreq(title:mer)=1)
0.882217 = idf(docFreq=8, maxDocs=8)
0.625 = fieldNorm(field=title, doc=1)
0.9863486 = (MATCH) weight(title:"starts_with mer" in 1), product of:
0.8944272 = queryWeight(title:"starts_with mer"), product of:
1.764434 = idf(title: starts_with=8 mer=8)
0.50692016 = queryNorm
1.1027713 = fieldWeight(title:"starts_with mer" in 1), product of:
1.0 = tf(phraseFreq=1.0)
1.764434 = idf(title: starts_with=8 mer=8)
0.625 = fieldNorm(field=title, doc=1)
running the query built via Lucene query helper tools yields an apparently identical query:
title:mer title:"starts_with mer"
but this time the results are not the same, since in fact the title:"starts_with mer" part is not matched. Here's an explain of the first result:
>>> starts_with mer
Explain:
0.15185544 = (MATCH) sum of:
0.15185544 = (MATCH) weight(title:mer in 1), product of:
0.27540696 = queryWeight(title:mer), product of:
0.882217 = idf(docFreq=8, maxDocs=8)
0.312176 = queryNorm
0.55138564 = (MATCH) fieldWeight(title:mer in 1), product of:
1.0 = tf(termFreq(title:mer)=1)
0.882217 = idf(docFreq=8, maxDocs=8)
0.625 = fieldNorm(field=title, doc=1)
My question is: whay don't I get the same results? I'd really like to be able to use the Query helper tools here, especially since there's the BooleanQuery(disableCoord) option which I'd like to use and I really don't know how to express direclly into Lucene query string. (Yes, my example passes "true" there, I've also tried with "false", same result).
===UPDATE
femtoRgon's answer is great: the problem was that I was adding the whole search string as a term, instead of first splitting it into terms and then adding each one to the query.
The answer femtoRgon gives works ok if the input string consists of one term: in this case, separatedly adding the "STARTS_WITH" text as one term, and then adding the search string as a 2nd term works.
However if the user inputs something that would be tokenzied by more than one term, you'd have to first split it into terms (preferably using the same analyzers and/or tokenizers that you used when indexing - to get consistent results) and then add each term to the query.
What I ended up doing is making a function that splits the query string into terms, using the same analyzer that I used for indexing:
private static List<String> getTerms(String text) throws Throwable {
Analyzer analyzer = getAnalyzer();
StringReader textReader = new StringReader(text);
TokenStream tokenStream = analyzer.tokenStream(FIELD_NAME_TITLE, textReader);
tokenStream.reset();
List<String> terms = new ArrayList<String>();
CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
while (tokenStream.incrementToken()) {
String term = charTermAttribute.toString();
terms.add(term);
}
textReader.close();
tokenStream.close();
analyzer.close();
return terms;
}
Then I first add the "STARTS_WITH" as one term, and then each of the elements in the list as a separate term:
PhraseQuery titleStartsWithSearchString = new PhraseQuery();
titleStartsWithSearchString.add(new Term("title","STARTS_WITH".toLowerCase()));
for(String term:getTerms(searchString)) {
titleStartsWithSearchString.add(new Term("title",term));
}
I believe the problem you are running into is that you are adding the entire phrase to your PhraseQuery as a single term. In the index, and in the query parsed by the QueryParser, this will be split into terms "starts_with" and "mer", which must be found consecutively. However, in the query you have constructed, you have a single term in your PhraseQuery instead, the term "starts_with mer", which doesn't exist as a single term in the index.
You should be able to change the bit where you are constructing the PhraseQuery to:
PhraseQuery titleStartsWithSearchString = new PhraseQuery();
titleStartsWithSearchString.add(new Term("title","STARTS_WITH".toLowerCase())
titleStartsWithSearchString.add(new Term("title",searchString));