Getting re-written list of terms for FuzzyQuery - lucene

I'm using Lucene.NET 4.8-beta00005
Given a fuzzy query
var fuzzyQuery = new FuzzyQuery(new Term(NameField, term));
What is the proper way of getting the list of terms/transpositions (and the boosts FuzzyQuery assigns to each of them) that will end up in the BooleanQuery after the query is rewritten?
I'm looking at FuzzyQuery.Rewrite and then extracting the resulting list from the BooleanQuery from within the re-written query, but the documentation on Rewrite seems to suggest using FuzzyQuery.GetTermsEnum instead, and I cannot figure out how to use FuzzyQuery.GetTermsEnum.
I need this for custom scoring with a CustomScoreProvider: a document match would score 100 when all terms from the query exactly match the document, and I would use the boosts that FuzzyQuery assigns to the transpositions to adjust the score so that non-exact matches don't result in a score of 100.

Related

Lucene.NET 2.9 - MultiFieldQueryParser, boosted fields, stemming and prefixes

I have a system where the search queries multiple fields with different boost values. It is running on Lucene.NET 2.9.4 because it's an Umbraco (6.x) site, and that's what version of Lucene.NET the CMS uses.
My client asked me if I could add stemming, so I wrote a custom analyzer that does Standard / Lowercase / Stop / PorterStemmer. The stemming filter seems to work fine.
But now, when I try to use my new analyzer with the MultiFieldQueryParser, it's not finding anything.
The MultiFieldQueryParser is returning a query containing stemmed words - e.g. if I search for "the figure", what I get as part of the query it returns is:
keywords:figur^4.0 Title:figur^3.0 Collection:figur^2.0
i.e. it's searching the correct fields and applying the correct boosts, but trying to do an exact search on stemmed terms on indexes that contained unstemmed words.
I think what's actually needed is for the MultiFieldQueryParser to return a list of clauses of type PrefixQuery, so that it outputs a query like
keywords:figur*^4.0 Title:figur*^3.0 Collection:figur*^2.0
If I just add a wildcard to the end of the term and feed that into the parser, the stemmer doesn't kick in, i.e. it builds a query that looks for "figure*".
Is there any way to combine MultiFieldQueryParser boosting and prefix queries?
You need to reindex using your custom analyzer. Applying a stemmer only at query time is useless. You might kludge together something using wildcards, but it would remain an ugly, unreliable kludge.

Examine lucene.net custom query after analyzer tokenizes

I'm using Examine in Umbraco to query Lucene index of content nodes. I have a field "completeNodeText" that is the concatenation of all the node properties (to keep things simple and not search across multiple fields).
I'm accepting user-submitted search terms. When the search term is multiple words (i.e., "firstterm secondterm"), I want the resulting query to be an OR query: bring me back results where completeNodeText is firstterm OR secondterm.
I want:
{+completeNodeText:"firstterm ? secondterm"}
but instead, I'm getting:
{+completeNodeText:"firstterm secondterm"}
If I search for "firstterm OR secondterm" instead of "firstterm secondterm", then the generated query is correctly: {+completeNodeText:"firstterm ? secondterm"}
I'm using the following API calls:
var searcher = ExamineManager.Instance.SearchProviderCollection["ExternalSearcher"];
var searchCriteria = searcher.CreateSearchCriteria();
var query = searchCriteria.Field("completeNodeText", term).Compile();
Is there an easy way to force Examine to generate this "OR" query? Or do I have to manually construct the raw query by calling the StandardAnalyzer to tokenize the user input and concatenating together a query by iterating through the tokens? And bypassing the entire Examine fluent query API?
I don't think that question mark means what you think it means.
It looks like you are generating a PhraseQuery, but you want two disjoint TermQueries. In Lucene query syntax, a phrase query is enclosed in quotes.
"firstterm secondterm"
A phrase query looks for precisely that phrase, with the two terms appearing consecutively and in order. Placing an OR within a phrase query does not perform any sort of boolean logic, but rather treats it as the word "OR". The question mark is a placeholder used in PhraseQuery.toString() to represent a removed stop word (see LUCENE-1396). You are still performing a phrase query, but now it expects a three-word phrase: firstterm, followed by a removed stop word, followed by secondterm.
To simply search for two separate terms, get rid of the quotes.
firstterm secondterm
This will search for any document containing either of those terms (with a higher score given to documents containing both).
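If the user input does have to be split by hand, that step might look like the minimal sketch below. It assumes plain whitespace splitting stands in for the analyzer, and OrQueryBuilder and toOrQuery are hypothetical names, not part of Examine or Lucene. Escaping of special query characters is left out.

```java
import java.util.ArrayList;
import java.util.List;

final class OrQueryBuilder {
    // Hypothetical helper: joins one field:term clause per input word with OR,
    // so the words are searched as separate terms rather than as one phrase.
    static String toOrQuery(String field, String userInput) {
        List<String> clauses = new ArrayList<>();
        for (String token : userInput.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                clauses.add(field + ":" + token);
            }
        }
        return String.join(" OR ", clauses);
    }

    public static void main(String[] args) {
        // prints: completeNodeText:firstterm OR completeNodeText:secondterm
        System.out.println(toOrQuery("completeNodeText", "firstterm secondterm"));
    }
}
```

The resulting string contains no quotes, so a query parser would treat it as separate term clauses rather than a phrase.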

Lucene term query

In my application we have deals, and each deal has a target user group which may include several fields like gender, age and city. For the gender part, a deal's target could be MALE, FEMALE or BOTH. I wanted to find deals which are either for males or for both. I created the following query but it doesn't work...
TermQuery maleQuery = new TermQuery(new Term("gender","MALE"));
TermQuery bothQuery = new TermQuery(new Term("gender","BOTH"));
BooleanQuery query = new BooleanQuery();
query.add(maleQuery,BooleanClause.Occur.SHOULD);
query.add(bothQuery,BooleanClause.Occur.SHOULD);
Please tell me if I am making a mistake. Somehow it seems to return only MALE deals, and not BOTH.
I am using version 4.2.1 and Standard Analyzer as the analyzer.
Several solutions are possible:
Use a QueryParser to construct the query instead of using TermQuery, with the same analyzer that was used at indexing time. For example, in my case it would be:
Query query = new QueryParser(version, "gender", new StandardAnalyzer(version)).parse("MALE BOTH");
Use a different analyzer for indexing that does a case insensitive indexing.
(this one applies for StandardAnalyzer, for other analyzers solutions may be different) LOWERCASE your search terms before searching.
Explanation
A brief explanation of the situation:
I used a StandardAnalyzer for indexing which lower cases input tokens so that a case insensitive search could be materialised.
Then I used a QueryParser configured with the same analyzer to construct a query instance for the user's query at the front end. The search worked because the parser, working in accordance with the StandardAnalyzer, lowercased the user's terms and so performed a case-insensitive search.
Then I needed to filter the search results, for which I wrote TermQuery instances instead of using the parser. In those I used capitalized text, which was not indexed that way, so the search failed.
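The mismatch described above can be reproduced without Lucene. The sketch below is a toy stand-in, not Lucene code: ToyIndex is a hypothetical class whose add method lowercases values the way StandardAnalyzer lowercases tokens, while lookup matches the term exactly and without analysis, as a TermQuery does.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class ToyIndex {
    private final Map<String, List<Integer>> postings = new HashMap<>();

    // Index time: lowercase the value, as a lowercasing analyzer would.
    void add(int docId, String value) {
        String term = value.toLowerCase(Locale.ROOT);
        postings.computeIfAbsent(term, k -> new ArrayList<>()).add(docId);
    }

    // Search time: exact term match with no analysis applied.
    List<Integer> lookup(String term) {
        return postings.getOrDefault(term, Collections.emptyList());
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.add(1, "MALE");
        index.add(2, "BOTH");
        System.out.println(index.lookup("MALE")); // empty: the index holds "male"
        System.out.println(index.lookup("male")); // finds document 1
    }
}
```

Lowercasing the search term before the lookup, or running it through the same analysis as at indexing time, makes the two sides agree again.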
Your query seems perfectly valid; I'd look for something else that might be wrong (e.g. are you by any chance applying a LowerCaseFilter to the MALE/BOTH/FEMALE terms when indexing?).
You might want to read this article on how to combine various queries into a single BooleanQuery.

Lucene: how to boost some specific field

In my case, documents have two fields, for example "title" and "views". "views" represents the number of times people have visited the document, e.g. "title":"iphone", "views":"10".
I have to develop a strategy that assigns some weight to views, such that the relevance score is calculated as score(title)*0.8+score(views)*0.2. Can Lucene do this? I would also like to know whether there are algorithms related to this question.
If you get here after 2020, note that in Lucene 8.5.2:
Document.setBoost() doesn't exist anymore.
Field.setBoost() doesn't exist anymore.
Query.setBoost() doesn't exist anymore.
The ways to go:
Wrap your Query (any Query, but probably a TermQuery in this case) in a BoostQuery
Query boosted = new BoostQuery(query, 2f);
Use the caret ^ symbol in your query parser syntax.
Specify boosts in MultiFieldQueryParser.
Use PerFieldSimilarityWrapper and adjust score per field.
Here is how you can do that:
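Query titleQuery, viewsQuery; // build these for the "title" and "views" fields first
titleQuery.setBoost(0.8f); // setBoost takes a float, so use float literals
viewsQuery.setBoost(0.2f);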
BooleanQuery query = new BooleanQuery();
query.add(titleQuery, Occur.MUST); // or Occur.SHOULD if this clause is optional
query.add(viewsQuery, Occur.SHOULD); // or Occur.MUST if this clause is required
// use query to search documents
The score will be proportional to 0.8*score(titleQuery) + 0.2*score(viewsQuery) (up to a multiplicative constant).
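As a worked example of that weighting, the sketch below combines two per-clause scores by hand. The input scores (2.0 and 10.0) are made-up numbers for illustration, WeightedScore.combine is a hypothetical helper, and Lucene's real scoring applies further normalization, so this only shows the proportions.

```java
final class WeightedScore {
    // Weighted sum with the 0.8 / 0.2 weights from the question.
    static double combine(double titleScore, double viewsScore) {
        return 0.8 * titleScore + 0.2 * viewsScore;
    }

    public static void main(String[] args) {
        // 0.8 * 2.0 + 0.2 * 10.0 = 1.6 + 2.0
        System.out.println(combine(2.0, 10.0));
    }
}
```

With these inputs the views clause contributes more than the title clause despite its smaller weight, which is why the raw range of the views score matters when choosing the weights.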
To leverage your views field, you will probably need to use a ValueSourceQuery.
You can boost in 3 ways. Depending on your needs you might want to employ a combination:
Document level boosting - while indexing - by calling document.setBoost() before a document is added to the index.
Document's Field level boosting - while indexing - by calling field.setBoost() before adding a field to the document (and before adding the document to the index).
Query level boosting - during search, by setting a boost on a query clause, calling Query.setBoost().
source: http://lucene.apache.org/core/old_versioned_docs/versions/3_0_0/scoring.html

Lucene: search within search using FuzzyQuery

I need to make a FuzzyQuery using an index that contains around 8 million lines. That kind of query is pretty slow, needing about 20 seconds for every match. The fact is that I can narrow down the results using another field to about 5000 hits before doing the fuzzy search. For this to work, I should be able to make a search by the "narrower" field first, and then use the fuzzy search within those results.
According to the lucene FAQ, the only thing I have to do is a BooleanQuery, where the "narrower" should be required (BooleanClause.Occur.MUST in lucene 3).
Now I have tried two different approaches:
a) Using the Query Parser, with an input like:
narrower:+narrowing_text fuzzy:fuzzy_text~0.9
b) Constructing a BooleanQuery with a TermQuery and a FuzzyQuery
Neither worked; I'm getting about the same times as when the narrower is not used.
Also, just to check that the times would be much better if the narrower were working, I reindexed only the 5000 items that match the narrower, and the search went fast as hell.
In case anyone wonders, I'm using pylucene 3.0.2.
Doppleganger, you can probably use a Filter, specifically a QueryWrapperFilter.
Follow the example from Lucene in Action. You may have to make some modifications for use in Python, but otherwise it should be simple:
Create the query that narrows this down to 5000 hits.
Use it to build a QueryWrapperFilter.
Use the filter in a search involving the fuzzy query.
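The idea behind these steps, filtering first and running the expensive fuzzy comparison only on the survivors, can be sketched without Lucene. The code below is a toy illustration in plain Java, not the QueryWrapperFilter API: NarrowThenFuzzy, editDistance and search are hypothetical names, and a Levenshtein distance threshold stands in for FuzzyQuery's similarity.

```java
import java.util.ArrayList;
import java.util.List;

final class NarrowThenFuzzy {
    // Classic two-row dynamic-programming Levenshtein distance.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Each row is {narrowerValue, fuzzyValue}.
    static List<String> search(List<String[]> rows, String narrower, String fuzzy, int maxEdits) {
        List<String> hits = new ArrayList<>();
        for (String[] row : rows) {
            if (!row[0].equals(narrower)) {
                continue; // cheap exact filter first
            }
            if (editDistance(row[1], fuzzy) <= maxEdits) { // costly check only on survivors
                hits.add(row[1]);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"a", "lucene"});
        rows.add(new String[] {"a", "lucane"});
        rows.add(new String[] {"b", "lucene"});
        System.out.println(search(rows, "a", "lucene", 1));
    }
}
```

Rows that fail the narrower check never reach editDistance, which is the effect the filter is meant to have on the fuzzy search.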