Lucene: how to boost a specific field

In my case, documents have two fields, for example "title" and "views". "views" holds the number of times people have visited the document, e.g. "title":"iphone", "views":"10".
I need a strategy that assigns weights to these fields, so that the relevance score is calculated as score(title)*0.8 + score(views)*0.2. Can Lucene do this? I would also like to know whether there are any algorithms related to this problem.

If you get here after 2020: as of Lucene 8.5.2,
Document.setBoost() doesn't exist anymore.
Field.setBoost() doesn't exist anymore.
Query.setBoost() doesn't exist anymore.
The ways to go:
Wrap your Query (any Query, but probably a TermQuery in this case) in a BoostQuery:
Query boosted = new BoostQuery(query, 2f);
Use the caret ^ symbol in your query parser syntax.
Specify boosts in MultiFieldQueryParser.
Use PerFieldSimilarityWrapper and adjust the score per field.
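As a rough illustration of the caret syntax, the sketch below renders one search term against several fields, each with its own boost, as a query string. The class BoostedQueryString and the field name "body" are made up for this example and are not part of Lucene; the resulting string would still have to be handed to a query parser.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not a Lucene class: builds "field:term^boost" clauses.
final class BoostedQueryString {
    // Emits one clause per field, each suffixed with its caret boost.
    static String build(String term, Map<String, Float> boostByField) {
        StringJoiner joiner = new StringJoiner(" ");
        for (Map.Entry<String, Float> e : boostByField.entrySet()) {
            joiner.add(e.getKey() + ":" + term + "^" + e.getValue());
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        Map<String, Float> boosts = new LinkedHashMap<>();
        boosts.put("title", 2.0f); // matches in the title count double
        boosts.put("body", 1.0f);  // "body" is an assumed second field
        System.out.println(build("iphone", boosts));
    }
}
```

Running main prints title:iphone^2.0 body:iphone^1.0, which has the same shape as a hand-written query using the ^ operator.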

Here is how you can do that:
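Query titleQuery, viewsQuery; // assumed to be constructed elsewhere for the "title" and "views" fields
titleQuery.setBoost(0.8f); // setBoost takes a float, so the f suffix is needed
viewsQuery.setBoost(0.2f);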
BooleanQuery query = new BooleanQuery();
query.add(titleQuery, Occur.MUST); // or Occur.SHOULD if this clause is optional
query.add(viewsQuery, Occur.SHOULD); // or Occur.MUST if this clause is required
// use query to search documents
The score will be proportional to 0.8*score(titleQuery) + 0.2*score(viewsQuery) (up to a multiplicative constant).
To leverage your views field, you will probably need to use a ValueSourceQuery.
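To make the weighted sum and the remark about the multiplicative constant concrete, here is a small plain-Java model. It does not use Lucene; the class WeightedScore and the per-field scores are invented for illustration. It shows that scaling both weights by the same factor (0.8/0.2 versus 4/1) changes the scores but not the ranking.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Plain-Java model of a weighted score; the per-field scores are made-up inputs.
final class WeightedScore {
    // Combined score: titleWeight * titleScore + viewsWeight * viewsScore.
    static double combine(double titleScore, double viewsScore,
                          double titleWeight, double viewsWeight) {
        return titleWeight * titleScore + viewsWeight * viewsScore;
    }

    // Returns document indices ordered by descending combined score.
    // Each row of fieldScores is {titleScore, viewsScore} for one document.
    static List<Integer> rank(double[][] fieldScores, double titleWeight, double viewsWeight) {
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < fieldScores.length; i++) {
            order.add(i);
        }
        order.sort(Comparator.comparingDouble(
                (Integer i) -> -combine(fieldScores[i][0], fieldScores[i][1],
                                        titleWeight, viewsWeight)));
        return order;
    }

    public static void main(String[] args) {
        double[][] scores = { {1.0, 0.1}, {0.5, 0.9}, {0.9, 0.8} };
        System.out.println(rank(scores, 0.8, 0.2)); // weights from the question
        System.out.println(rank(scores, 4.0, 1.0)); // same ratio, five times larger
    }
}
```

Both calls print the same order, because only the ratio between the two weights affects which document comes first.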

You can boost in 3 ways. Depending on your needs, you might want to employ a combination:
Document level boosting - while indexing - by calling document.setBoost() before a document is added to the index.
Document's Field level boosting - while indexing - by calling field.setBoost() before adding a field to the document (and before adding the document to the index).
Query level boosting - during search, by setting a boost on a query clause, calling Query.setBoost().
source: http://lucene.apache.org/core/old_versioned_docs/versions/3_0_0/scoring.html

Related

Getting re-written list of terms for FuzzyQuery

I'm using Lucene.NET 4.8-beta00005
Given a fuzzy query
var fuzzyQuery = new FuzzyQuery(new Term(NameField, term));
What is the proper way of getting the list of terms/transpositions (and the boosts assigned by FuzzyQuery to each of them) that will end up in a BooleanQuery after the query is re-written?
I'm looking at FuzzyQuery.Rewrite and then extracting the resulting list from the BooleanQuery from within the re-written query, but the documentation on Rewrite seems to suggest using FuzzyQuery.GetTermsEnum instead, and I cannot figure out how to use FuzzyQuery.GetTermsEnum.
I need this for my custom scoring using a CustomScoreProvider, where I would score a document match at 100 when all terms from the query exactly match the document, and would also use the boosts assigned by FuzzyQuery to the transpositions to adjust the scoring so that non-exact matches don't result in a score of 100.

Boost search results in Lucene via the presence of a field value

I am using Lucene.net via Kentico. I am trying to boost results that have a particular value in a field. For example:
myfield:"myvalue"^2
Unfortunately this is treated as a search term and alters the scores (via tf and idf etc) anyway.
Is there a way of boosting a result based on the presence of a value, but not including that value as a search term?
Update:
So I want to boost the score of records that contain that value in that field only; it's not a search value in any way.
Failing that, as I am actually using two indexes, could I apply a boost to a particular index? For example, so that items from index-1 have a slightly higher score overall than those from index-2.
If you added this field in the "Search Condition", then behind the scenes it adds a "+" to the value, so the Lucene query is rendered as:
+(myfield:"myvalue"^2)
Which then requires the field.
I believe (you will have to test) that if you add a Smart Search Filter, set the value to myfield:"myValue"^2 and then set "Filter is conditional" to false, this should properly add your field to the Lucene query as a boost. Then just wrap the filter in <div style="display:none"></div> to hide it.
Point that to your Results and see if it does the trick!

Hibernate Search - possible to get new Lucene query after facets applied?

A Lucene Query is generated as so:
Query luceneQuery = builder.all().createQuery();
Then facets are applied.
I'm not sure whether, when facets are applied, the luceneQuery is ANDed and ORed with other Queries, resulting in a new Lucene Query. Alternatively, perhaps a bunch of BitSets are applied to the original Query to refine the results. (I don't know.)
If a new query is generated I'd like to retrieve it. If not, I need a rethink. That's the crux of the question.
Why:
I'm applying a faceted search on a field with multiple possible values.
E.g. TMovie.class many-to-many TTag.class (multiple-value-facet)
I'm filtering on TMovie where TTag is some value.
Anyway, the filtering works but there is a known problem whereby the Facet-counts returned are incorrect.
Detailed here: Add faceting over multivalued to application using Hibernate Search and https://forum.hibernate.org/viewtopic.php?f=9&t=1010472
I'm using this solution:
http://sujitpal.blogspot.ie/2007/04/lucene-search-within-search-with.html (see comment on new API under article)
The BitSet solution (in this example at least) generates counts based on the original Lucene Query. This works perfectly. However.....
If alternate (different, not TTags) facets are applied to the original query some complications arise.
The Bitset solution calculates on the original Lucene query. It does not calculate on the lucene query now reduced by the application of alternate Facets (a different FacetSelection) (or even TTag Facets themselves for that matter). I.e. the count calculations are irrespective of any other FacetSelection Facets applied.
So...
A. can I get the new Lucene query after facets are applied? The BitSet solution applied to this would be correct.
B. Any other alternative suggestions?
Thanks so much.. All comments welcome.
John
Regarding your first question, applying a facet does not modify the original query; it uses a custom Collector called FacetCollector - see https://github.com/hibernate/hibernate-search/blob/master/engine/src/main/java/org/hibernate/search/query/collector/impl/FacetCollector.java. Under the hood the collector uses a Lucene FieldCache for doing the facet count. That is also the root of the limitation for multi-value faceting: FieldCache does not support multiple values per field.
Anyway, no additional queries are applied during faceting and the original query is unmodified. The benefit of course is performance. The solution you are pointing to probably works as well, but relies on running multiple queries. However, it might be a valid workaround for your use case.
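For reference, the BitSet counting idea discussed in the question can be sketched with java.util.BitSet alone. This is not Hibernate Search or Lucene code; the class BitSetFacetCounts, the document ids and the tag names are assumptions made for the example. The point it illustrates is that counts stay consistent with other facet selections only if the result set is narrowed by those selections before counting.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of BitSet-based facet counting; document ids and tags are made up.
final class BitSetFacetCounts {
    // Convenience factory: a BitSet with the given document ids set.
    static BitSet bits(int... docIds) {
        BitSet b = new BitSet();
        for (int id : docIds) {
            b.set(id);
        }
        return b;
    }

    // Narrows a result set by one additional facet selection.
    static BitSet narrow(BitSet results, BitSet selection) {
        BitSet narrowed = (BitSet) results.clone(); // and() mutates, so copy first
        narrowed.and(selection);
        return narrowed;
    }

    // Counts, per facet value, the documents present in both the value's
    // BitSet and the current result set.
    static Map<String, Integer> count(Map<String, BitSet> docsByValue, BitSet currentResults) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, BitSet> e : docsByValue.entrySet()) {
            counts.put(e.getKey(), narrow(e.getValue(), currentResults).cardinality());
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, BitSet> tags = new LinkedHashMap<>();
        tags.put("comedy", bits(0, 1, 3));
        tags.put("drama", bits(1, 2, 4));
        BitSet original = bits(0, 1, 2, 3);   // hits of the original query
        BitSet otherFacet = bits(1, 2);       // docs matching some other selected facet
        System.out.println(count(tags, original));
        System.out.println(count(tags, narrow(original, otherFacet)));
    }
}
```

The first call counts against the original query only; the second counts against the hits that survive the other facet selection, which is the number a user would expect to see next to each tag.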

Examine lucene.net custom query after analyzer tokenizes

I'm using Examine in Umbraco to query Lucene index of content nodes. I have a field "completeNodeText" that is the concatenation of all the node properties (to keep things simple and not search across multiple fields).
I'm accepting user-submitted search terms. When the search term is multiple words (i.e., "firstterm secondterm"), I want the resulting query to be an OR query: bring me back results where completeNodeText contains firstterm OR secondterm.
I want:
{+completeNodeText:"firstterm ? secondterm"}
but instead, I'm getting:
{+completeNodeText:"firstterm secondterm"}
If I search for "firstterm OR secondterm" instead of "firstterm secondterm", then the generated query is correctly: {+completeNodeText:"firstterm ? secondterm"}
I'm using the following API calls:
var searcher = ExamineManager.Instance.SearchProviderCollection["ExternalSearcher"];
var searchCriteria = searcher.CreateSearchCriteria();
var query = searchCriteria.Field("completeNodeText", term).Compile();
Is there an easy way to force Examine to generate this "OR" query? Or do I have to manually construct the raw query by calling the StandardAnalyzer to tokenize the user input and concatenating a query together by iterating through the tokens, bypassing the entire Examine fluent query API?
I don't think that question mark means what you think it means.
It looks like you are generating a PhraseQuery, but you want two disjoint TermQueries. In Lucene query syntax, a phrase query is enclosed in quotes.
"firstterm secondterm"
A phrase query looks for precisely that phrase, with the two terms appearing consecutively and in order. Placing an OR within a phrase query does not perform any sort of boolean logic, but rather treats it as the word "OR". The question mark is a placeholder used in PhraseQuery.toString() to represent a removed stop word (see LUCENE-1396). You are still performing a phrase query, but now it expects a three-word phrase: firstterm, followed by a removed stop word, followed by secondterm.
To simply search for two separate terms, get rid of the quotes.
firstterm secondterm
This will search for any document containing either of those terms (with a higher score given to documents containing both).
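The difference between the two behaviours can be modelled in a few lines of plain Java. This is only a sketch of the matching semantics, not Lucene or Examine code; the class PhraseVsTerms and its whitespace tokenizer are stand-ins invented for the example.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Plain-Java model of phrase vs. term matching; this is not Lucene code.
final class PhraseVsTerms {
    // Lowercasing whitespace tokenizer (a crude stand-in for a real analyzer).
    static List<String> tokens(String text) {
        return Arrays.asList(text.toLowerCase().trim().split("\\s+"));
    }

    // Phrase semantics: all query tokens must appear consecutively and in order.
    static boolean matchesPhrase(String doc, String phrase) {
        return Collections.indexOfSubList(tokens(doc), tokens(phrase)) >= 0;
    }

    // OR semantics: at least one query token appears anywhere in the document.
    static boolean matchesAnyTerm(String doc, String query) {
        List<String> docTokens = tokens(doc);
        for (String term : tokens(query)) {
            if (docTokens.contains(term)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String doc = "firstterm appears long before secondterm";
        System.out.println(matchesPhrase(doc, "firstterm secondterm"));  // false
        System.out.println(matchesAnyTerm(doc, "firstterm secondterm")); // true
    }
}
```

A document where the two words are separated by other text fails the phrase check but passes the any-term check, which mirrors the quoted versus unquoted query forms.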

Lucene: search within search using FuzzyQuery

I need to make a FuzzyQuery using an index that contains around 8 million lines. That kind of query is pretty slow, needing about 20 seconds for every match. The fact is that I can narrow down the results using another field to about 5000 hits before doing the fuzzy search. For this to work, I should be able to make a search by the "narrower" field first, and then use the fuzzy search within those results.
According to the lucene FAQ, the only thing I have to do is a BooleanQuery, where the "narrower" should be required (BooleanClause.Occur.MUST in lucene 3).
Now I have tried two different approaches:
a) Using the Query Parser, with an input like:
narrower:+narrowing_text fuzzy:fuzzy_text~0.9
b) Constructing a BooleanQuery with a TermQuery and a FuzzyQuery
Neither worked; I'm getting about the same times as when the narrower is not used.
Also, just to check that if the narrower was working the times should be much better, I reindexed only the 5000 items that match the narrower, and the search went fast as hell.
In case anyone wonders, I'm using pylucene 3.0.2.
Doppleganger, you can probably use a Filter, specifically a QueryWrapperFilter.
Follow the example from Lucene in Action. You may have to make some modifications for use in python, but otherwise it should be simple:
Create the query that narrows this down to 5000 hits.
Use it to build a QueryWrapperFilter.
Use the filter in a search involving the fuzzy query.
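The order of operations in these steps can be sketched in plain Java as below. This models only the idea of filtering first and fuzzy-matching the survivors; it is not Lucene or PyLucene code and says nothing about how Lucene executes a filtered fuzzy query internally. The class NarrowThenFuzzy, the Row type and the maxEdits threshold are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of "narrow first, then fuzzy match"; not Lucene code.
final class NarrowThenFuzzy {
    // Hypothetical record holding the two fields from the question.
    static final class Row {
        final String narrower;
        final String fuzzy;
        Row(String narrower, String fuzzy) {
            this.narrower = narrower;
            this.fuzzy = fuzzy;
        }
    }

    // Classic dynamic-programming Levenshtein distance using two rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Step 1: exact filter on the narrower field.
    // Step 2: the expensive fuzzy comparison runs on the survivors only.
    static List<String> search(List<Row> rows, String narrowing, String fuzzyText, int maxEdits) {
        List<String> hits = new ArrayList<>();
        for (Row row : rows) {
            if (!row.narrower.equals(narrowing)) {
                continue; // filtered out before any fuzzy work is done
            }
            if (editDistance(row.fuzzy, fuzzyText) <= maxEdits) {
                hits.add(row.fuzzy);
            }
        }
        return hits;
    }
}
```

In this model a row whose narrower field does not match is never compared fuzzily, so a close spelling in the wrong group is not returned.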