I'm using Lucene to index components with names and types. Some components are more important and thus get a bigger boost. However, I cannot get my boost to work properly. I still get some components appearing later (getting a worse score), even though they have a higher boost.
Note that the indexing is done on one field only and I've set the boost to that field alone. I'm using Lucene in Java.
I don't think it has anything to do with the field length. I've seen components with the same name (but different type) get the wrong score.
Use Searcher.explain to find out how the score for each document is derived. One of the key criteria in the score is the length of the field: a match in a shorter field gets a higher score.
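For example (a minimal sketch; the searcher and query are assumed to exist already):

    TopDocs hits = searcher.search(query, 10);
    for (ScoreDoc sd : hits.scoreDocs) {
        // the Explanation tree breaks the score into tf, idf, boost, fieldNorm, ...
        Explanation explanation = searcher.explain(query, sd.doc);
        System.out.println(explanation.toString());
    }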
I suggest you use Luke to see exactly what is stored in your index. Are you using document boosting? See the scoring documentation to check possible explanations.
Boost is just one factor in the Lucene score for a hit. But it should work. Can you give a more complete example of the behavior you are seeing, and what you would expect?
As I recall, boosting is intended to make one field more important than another. If you have only one field, boosting won't change the order of the results at all.
Added: no, it looks like you can indeed boost specific documents. Oops!
Make sure field.omitNorms is set to false on the field you want to boost.
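For example (a sketch against the Lucene 3.x-era API, to match Searcher.explain above; the field names are illustrative), since field boosts are folded into the norm, omitting norms silently discards them:

    Document doc = new Document();
    Field nameField = new Field("name", componentName, Field.Store.YES, Field.Index.ANALYZED);
    nameField.setOmitNorms(false); // the default, but make sure nothing flips it
    nameField.setBoost(5.0f);      // this boost is stored inside the norm
    doc.add(nameField);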
I am trying to change the scoring in Apache Lucene 5.3, and for my formula I need the document length (the number of tokens in the document). I understood from answers to similar questions that there is no easy way to do it, because Lucene doesn't keep it in the index. So I thought that while indexing I would create a Map from docID to document length, and then use it in query evaluation. But I have no idea where I should put this map and where I should update it.
You are exactly right: storing this when the document is indexed is the best approach. The place to store it is in the norm (not to be confused with the queryNorm; that's something different). Norms provide a single value stored with the field, which is made available at query time for scoring.
In your Similarity implementation, this should go into the computeNorm method, which exposes the information you need through the FieldInvertState, particularly FieldInvertState.getLength(). Norms are made available at search time through LeafReader.getNormValues.
If you are extending TFIDFSimilarity, instead, you just need to implement the encodeNormValue, decodeNormValue and lengthNorm methods.
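A minimal sketch of that idea against the Lucene 5.3 API (the class name is illustrative): override lengthNorm so the norm carries the raw token count rather than the usual 1/sqrt(length). Note that DefaultSimilarity's norm encoding is a lossy single byte, so long documents come back approximated; for exact lengths you would extend TFIDFSimilarity and supply your own encodeNormValue/decodeNormValue, as described above.

    import org.apache.lucene.index.FieldInvertState;
    import org.apache.lucene.search.similarities.DefaultSimilarity;

    public class DocLengthSimilarity extends DefaultSimilarity {
        @Override
        public float lengthNorm(FieldInvertState state) {
            // store the token count of the field instead of 1/sqrt(length)
            return state.getLength();
        }
    }

Set it both at index time (IndexWriterConfig.setSimilarity) and at search time (IndexSearcher.setSimilarity); the value is then readable per document via LeafReader.getNormValues(field).get(docId).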
So basically _boost, which was a mapping option to give a field a certain boost, is now deprecated.
The page suggests using "function score instead of boost". But function score means:
The function_score allows you to modify the score of documents that are retrieved by a query
So it's not even an alternative. Function score just modifies the score of the documents at query time.
How do I alter the relevance of a field at mapping time?
That option is not valid anymore? Removed and no replacement?
The option is no longer valid and there is no direct replacement. The problem is that index-time boosting was removed from Lucene 4.0, upon which Elasticsearch runs. Elasticsearch then used its own implementation, which had its own issues. A good write-up on the issues can be found here: http://blog.brusic.com/2014/02/document-boosting-in-elasticsearch.html and the issue deprecating boost at index time here: https://github.com/elasticsearch/elasticsearch/issues/4664
To summarize, it basically was not working in a transparent and understandable way - you could boost one document by 100 and another by 50, hit the same keyword and yet get the same score. So the decision was made to remove it and rely on function score queries, which have the benefit of being much more transparent and predictable in their impact on scoring.
If you feel strongly that function score queries do not meet your needs and use case, I'd open an issue on GitHub and explain your case.
Function score query can be used to boost the whole document. If you want to use field boost, you can do so with a multi match query or a term query.
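For example, with the Elasticsearch Java API (a hedged sketch; the field names and values are illustrative):

    // Per-field boost at query time: matches on "name" weigh 3x.
    QueryBuilder multi = QueryBuilders.multiMatchQuery("google", "name", "description")
            .field("name", 3.0f);

    // Or boost a single term query as a whole.
    QueryBuilder term = QueryBuilders.termQuery("type", "domain").boost(2.0f);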
I don't know about your case, but I believe you have strong reasons to boost documents at index time. It is generally recommended to boost at query time, as index-time boosting requires reindexing the data whenever your boost criteria change. That said, in my application we have implemented both index-time and query-time boosting. We are using:
Index-time boosting (document boosting), to boost some documents which we know will always be a TOP HIT for our search, e.g. searching for the word "google" should always put a document containing "google.com" as the top hit. We achieve this using a custom boost field and a custom boosting script. Please see this link: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-scripting.html
Query-time boosting (per-field boosting): we use the ES Java APIs to execute our queries and apply field-level boosting to each field at query time, as this is highly flexible and allows us to change the field-level boosting without reindexing the whole data set.
You can have a look at this, it might be helpful for you: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-function-score-query.html#_field_value_factor
I have described my complete use case here; hopefully you will find it helpful.
I'm looking to have the ability to access the length (in terms) of a specific field of a document post-indexing. Preferably, if there is a way without re-indexing I would like to do that. But if re-indexing in a certain way will give easy access to this value, that would also serve.
http://blog.mikemccandless.com/2012/03/new-index-statistics-in-lucene-40.html
That link (scroll down a bit and find the mention of length) talks about accessing the value at indexing time. I wish to be able to do so post-indexing. The link also talks about saving the value away in a doc value, but it gives no examples of how to do so.
If anyone could provide examples of saving the document length, or accessing it post-indexing, it would be incredibly helpful. Thanks.
The mention of that statistic in the article is in reference to a FieldInvertState. Once you have that, it should be fairly straightforward to get the statistics you are looking for (just call getLength, getUniqueTermCount, or whatever you need).
The FieldInvertState is passed into the Similarity, particularly to the call Similarity.computeNorm. The norm value is calculated and stored at index time, rather than evaluated at query time, so making effective use of it would require you to reindex.
The typical way to make use of this would be to create a custom Similarity, possibly extending DefaultSimilarity. Simply overriding the lengthNorm method of DefaultSimilarity would be the simplest approach. Its standard implementation is:
return (float)(1.0 / Math.sqrt(numTerms));
Which you could override with whatever you like.
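For instance, a sketch that removes the length penalty entirely (using the Lucene 3.x-style signature shown above; the class name is illustrative):

    import org.apache.lucene.search.DefaultSimilarity;

    public class NoLengthNormSimilarity extends DefaultSimilarity {
        @Override
        public float lengthNorm(String fieldName, int numTerms) {
            return 1.0f; // treat all field lengths the same
        }
    }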
That would work to tweak scoring based on a custom length-based calculation. If that's not what you are looking for, and you instead just need to be able to fetch that information, I would think simply storing the field, and getting the length from the field value returned when you fetch a Document, would be the simplest implementation.
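A sketch of that simpler route (the field names and the countTokens helper are hypothetical), again assuming a Lucene 3.x-style API to match the snippet above:

    Document doc = new Document();
    doc.add(new Field("body", text, Field.Store.YES, Field.Index.ANALYZED));
    // Count tokens with the same analyzer and store the length alongside.
    int length = countTokens(analyzer, "body", text);
    doc.add(new Field("body_length", String.valueOf(length),
                      Field.Store.YES, Field.Index.NOT_ANALYZED));

    // Hypothetical helper: run the analyzer over the text and count tokens.
    static int countTokens(Analyzer analyzer, String field, String text) throws IOException {
        TokenStream ts = analyzer.tokenStream(field, new StringReader(text));
        ts.reset();
        int count = 0;
        while (ts.incrementToken()) {
            count++;
        }
        ts.end();
        ts.close();
        return count;
    }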
Is there any way to retrieve all the terms in a particular field which is unfortunately not stored? I cannot rebuild the index. Positional information is not necessary; I just need the list of terms.
UPDATE
I've built a sample index with one stored field and another unstored field, and tested it with Luke. I was wondering whether I could get access to all terms just like Luke does. This may not be the brightest idea, but it might work.
Lucene uses two different concepts: indexing and storing. If you want to extract the terms, you don't need to store anything. You can use Luke, as well as iterate over the terms through the API. For the Java API, see: How can I get the list of unique terms from a specific field in Lucene?
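A sketch of the API route (Lucene 3.x-style; the field name is illustrative), which works whether or not the field was stored, since it reads the indexed terms:

    TermEnum termEnum = reader.terms(new Term("myfield", ""));
    try {
        do {
            Term t = termEnum.term();
            // terms are ordered by field, so stop once we leave "myfield"
            if (t == null || !"myfield".equals(t.field())) break;
            System.out.println(t.text());
        } while (termEnum.next());
    } finally {
        termEnum.close();
    }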
Luke is Open Source, so just look at how Luke does it.
I know it's possible to get the top terms within a Lucene index, but is there a way to get the top terms based on a subset of a Lucene index?
I.e. What are the top terms in the Index for documents within a certain date range?
Ideally there'd be a utility somewhere to do this, but I'm not aware of one. However, it's not too hard to do this "by hand" in a reasonably efficient way. I'll assume that you already have a Query and/or Filter object that you can use to define the subset of interest.
First, build a list in memory of all of the document IDs in your index subset. You can use IndexSearcher.search(Query, Filter, HitCollector) to do this very quickly; the HitCollector documentation includes an example that seems like it ought to work, or you can use some other container to store your doc IDs.
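For instance (a sketch against the older HitCollector API this answer refers to; the query and filter are assumed to exist):

    final List<Integer> docIds = new ArrayList<Integer>();
    searcher.search(query, filter, new HitCollector() {
        @Override
        public void collect(int doc, float score) {
            docIds.add(doc); // just remember which documents are in the subset
        }
    });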
Next, initialize an empty HashMap (or whatever) to map terms to total frequency, and populate the map by invoking one of the IndexReader.getTermFreqVector methods for every document and field of interest. The three-argument form seems simpler, but either should be just fine. For the three-argument form, you'd make a TermVectorMapper whose map method checks if term is in the map, associates it with frequency if not, or adds frequency to the existing value if so. Be sure to use the same TermVectorMapper object across all of the calls to getTermFreqVector in this pass, rather than instantiating a new one for each document in the loop. You can also speed things up quite a bit by overriding isIgnoringPositions() and isIgnoringOffsets(); your object should return true for both of those. It looks like your TermVectorMapper might also be forced to define a setExpectations method, but that one doesn't need to do anything.
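Put together, the mapper might look like this (a hedged sketch; the class and field names are illustrative):

    class FreqMapper extends TermVectorMapper {
        final Map<String, Integer> freqs = new HashMap<String, Integer>();

        @Override
        public void setExpectations(String field, int numTerms,
                                    boolean storeOffsets, boolean storePositions) {
            // nothing to do, but the abstract method must be defined
        }

        @Override
        public void map(String term, int frequency,
                        TermVectorOffsetInfo[] offsets, int[] positions) {
            Integer prev = freqs.get(term);
            freqs.put(term, prev == null ? frequency : prev + frequency);
        }

        @Override
        public boolean isIgnoringPositions() { return true; }

        @Override
        public boolean isIgnoringOffsets() { return true; }
    }

    // Reuse one mapper instance across every document in the subset.
    FreqMapper mapper = new FreqMapper();
    for (int docId : docIds) {
        reader.getTermFreqVector(docId, "body", mapper);
    }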
Once you've built your map, just sort the map items by descending frequency and read off however many top terms you like. If you know in advance how many terms you want, you might prefer to do some kind of fancy heap-based algorithm to find the top k items in linear time instead of using an O(n log n) sort. I imagine the plain old sort will be plenty fast in practice. But it's up to you.
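The plain-sort version might look like this (continuing the sketch above; k is however many top terms you want):

    List<Map.Entry<String, Integer>> entries =
            new ArrayList<Map.Entry<String, Integer>>(mapper.freqs.entrySet());
    Collections.sort(entries, new Comparator<Map.Entry<String, Integer>>() {
        public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
            return b.getValue().compareTo(a.getValue()); // descending frequency
        }
    });
    int k = 25; // however many top terms you want
    List<Map.Entry<String, Integer>> topTerms = entries.subList(0, Math.min(k, entries.size()));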
If you prefer, you can combine the first two stages by having your HitCollector invoke getTermFreqVector directly. This should certainly produce equally correct results, and intuitively seems like it would be simpler and better, but the docs seem to warn that doing so is likely to be quite a bit slower than the two-pass approach (on the same page as the HitCollector example above). Or I could be misinterpreting their warning. If you're feeling ambitious, you could try it both ways, compare, and let us know.
Counting up the TermVectors will work, but will be slow if there are a lot of documents to iterate over. Also note that if by top terms you mean docFreq, then don't use the count in the TermFreqVector; just count each term as binary (present or not).
Alternatively, you could iterate the terms like facet counts. Use a cached filter for every term; their BitSets can be used for a fast intersection count.
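A sketch of that idea (older Lucene 2.9/3.x API; candidateTerms and docIds are assumed from earlier): build a bit set for the subset once, then intersect it with each term's cached filter:

    // Bit set for the date-range subset (docIds collected earlier).
    OpenBitSet subset = new OpenBitSet(reader.maxDoc());
    for (int docId : docIds) {
        subset.set(docId);
    }

    for (Term term : candidateTerms) {
        // Cache one filter per term so repeated counts are cheap.
        Filter f = new CachingWrapperFilter(new QueryWrapperFilter(new TermQuery(term)));
        OpenBitSet termBits = new OpenBitSet(reader.maxDoc());
        DocIdSetIterator it = f.getDocIdSet(reader).iterator();
        int doc;
        while ((doc = it.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
            termBits.set(doc);
        }
        long count = OpenBitSet.intersectionCount(subset, termBits); // fast popcount
    }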