So basically _boost, which was a mapping option to give a field a certain boost, is now deprecated.
The page suggests using "function score instead of boost". But function score means:
The function_score allows you to modify the score of documents that are retrieved by a query
So it's not even an alternative. Function score just modifies the score of the documents at query time.
How do I alter the relevance of a field at mapping time?
Is that option not valid anymore? Was it removed with no replacement?
The option is no longer valid and there is no direct replacement. The problem is that index-time boosting was removed from Lucene 4.0, upon which Elasticsearch runs. Elasticsearch then used its own implementation, which had its own issues. A good write-up on the issues can be found here: http://blog.brusic.com/2014/02/document-boosting-in-elasticsearch.html and the issue deprecating boost at index time here: https://github.com/elasticsearch/elasticsearch/issues/4664
To summarize, it basically was not working in a transparent and understandable way: you could boost one document by 100 and another by 50, hit the same keyword, and yet get the same score. So the decision was made to remove it and rely on function score queries, which have the benefit of being much more transparent and predictable in their impact on scoring.
If you feel strongly that function score queries do not meet your needs and use case, I'd open an issue on GitHub and explain your case.
A function score query can be used to boost the whole document. If you want a per-field boost, you can apply one with a multi_match query or a term query, as sketched below.
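A minimal sketch with the 1.x-era Elasticsearch Java API; the field names name and type and the boost values are made up for illustration:

    import org.elasticsearch.index.query.QueryBuilder;
    import org.elasticsearch.index.query.QueryBuilders;

    // Per-field boost inside a multi_match: matches in "name" count 3x.
    QueryBuilder perField = QueryBuilders.multiMatchQuery("google")
            .field("name", 3.0f)   // boosted field
            .field("type");        // unboosted field

    // The same idea with a single boosted term query.
    QueryBuilder boostedTerm = QueryBuilders.termQuery("name", "google").boost(3.0f);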
I don't know your case, but I assume you have strong reasons to boost documents at index time. Boosting at query time is generally recommended, since index-time boosting requires reindexing the data whenever your boost criteria change. That being said, in my application we have implemented both index- and query-time boosting. We are using:
Index-time boosting (document boosting) to boost documents that we know will always be the top hit for our search. E.g., searching for the word "google" should always put a document containing "google.com" at the top. We do this using a custom boost field and a custom boosting script. Please see this link: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-scripting.html
Query-time boosting (per-field boosting): we are using the ES Java APIs to execute our queries, and we apply field-level boosts at query time. This is highly flexible and allows us to change the field-level boosting without reindexing the whole data set.
You can have a look at this; it might be helpful for you: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-function-score-query.html#_field_value_factor
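As a rough sketch of that approach, again with the 1.x-era Java API (field_value_factor requires ES 1.2+), and a hypothetical popularity field holding the per-document boost value:

    import org.elasticsearch.index.query.QueryBuilder;
    import org.elasticsearch.index.query.QueryBuilders;
    import org.elasticsearch.index.query.functionscore.ScoreFunctionBuilders;

    // Multiply each matching document's score by its stored "popularity" value.
    QueryBuilder q = QueryBuilders
            .functionScoreQuery(QueryBuilders.matchQuery("name", "google"))
            .add(ScoreFunctionBuilders.fieldValueFactorFunction("popularity"));

Since the boost value lives in a regular field, changing it means updating a document rather than reindexing the whole data set.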
I have described my complete use case here; hopefully you will find it helpful.
Related
I am trying to change the scoring in Apache Lucene 5.3, and for my formula I need the document length (the number of tokens in the document). I understood from answers to similar questions that there is no easy way to do it, because Lucene doesn't keep it in the index. So I thought that while indexing I would create a map from docID to document length, and then use it during query evaluation. But I have no idea where I should put this map and where I would update it.
You are exactly right: storing this when the document is indexed is the best approach. The place to store it is in the norm (not to be confused with the queryNorm; that's something different). Norms provide a single value stored with the field, which is made available at query time for scoring.
In your Similarity implementation, this should go into the computeNorm method, which exposes the information you need through the FieldInvertState, particularly FieldInvertState.getLength(). Norms are made available at search time through LeafReader.getNormValues.
If you are extending TFIDFSimilarity instead, you just need to implement the encodeNormValue, decodeNormValue and lengthNorm methods.
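A minimal sketch against the Lucene 5.3 API, extending DefaultSimilarity rather than implementing TFIDFSimilarity from scratch. Note that DefaultSimilarity compresses norms to a single byte, so the recovered length is approximate:

    import org.apache.lucene.index.FieldInvertState;
    import org.apache.lucene.search.similarities.DefaultSimilarity;

    public class LengthNormSimilarity extends DefaultSimilarity {
        @Override
        public float lengthNorm(FieldInvertState state) {
            // Store the raw token count instead of the usual 1/sqrt(numTerms).
            // DefaultSimilarity encodes this into a single lossy byte, so
            // long documents come back with reduced precision.
            return state.getLength();
        }
    }

    // Set this similarity on both IndexWriterConfig and IndexSearcher, then
    // read the value back per segment at search time:
    //   NumericDocValues norms = leafReader.getNormValues("body");
    //   float approxLength = similarity.decodeNormValue(norms.get(docID));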
I don't quite understand the difference between Apache Solr's spellcheck and fuzzy search functionality.
I understand that fuzzy search matches your search term with the indexed value based on some difference expressed in distance.
I also understand that spellcheck gives you suggestions based on how close your search term is to a value in the index.
So to me those two things are not that different, though I am sure this is due to my shortcoming in understanding each feature thoroughly.
If anyone could provide an explanation preferably via an example, I would greatly appreciate it.
I'm not a Solr professional, but I'll try to explain.
Fuzzy search is a simple instruction for Solr to use a kind of spellchecking during requests. Solr's standard query parser supports fuzzy search, and you can use it without any additional settings, for example: roam~ or roam~1. This so-called spellchecking uses the Damerau-Levenshtein distance (edit distance) algorithm.
To use spellchecking, you need to configure it in solrconfig.xml (please see here). It gives you a degree of flexibility in how to implement spellchecking (there are a couple of OOTB implementations); for example, you can use a separate index for spellcheck and thereby decrease the load on the main index. Spellchecking also uses another URL, /spell, so it is not a search query like a fuzzy query.
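To make the difference concrete, a pair of hypothetical requests (the /spell handler name and the field name depend on how your solrconfig.xml and schema are set up):

    /select?q=name:roam~1              (fuzzy search via the standard query parser)
    /spell?q=roamm&spellcheck=true     (spellcheck suggestions via its own request handler)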
Why should you use spellchecking or fuzzy search? I guess it depends on your server load, because fuzzy search is more expensive and is not recommended by the Solr team.
P.S. This is my understanding of fuzzy search and spellchecking, so if somebody has a more correct and clear explanation, please give us advice on how to deal with them.
I'm looking to have the ability to access the length (in terms) of a specific field of a document post-indexing. Preferably, if there is a way without re-indexing I would like to do that. But if re-indexing in a certain way will give easy access to this value, that would also serve.
http://blog.mikemccandless.com/2012/03/new-index-statistics-in-lucene-40.html
That link (scroll down a bit and find the mention of length) talks of accessing the value at indexing time. I wish to be able to do so post-indexing. The link also talks about saving the value away in a doc value, but it gives no examples of how to do so.
If anyone could provide examples of saving the document length, or accessing it post-indexing, it would be incredibly helpful. Thanks.
The mention of that statistic in the article is in reference to a FieldInvertState. Once you have that, it should be fairly straightforward to get the statistics you are looking for (just call getLength, getUniqueTermCount or whatever you need).
The FieldInvertState is passed into the Similarity, particularly to the call Similarity.computeNorm. The norm value is calculated and stored at index time, rather than evaluated at query time, so making effective use of it would require you to reindex.
The typical way to make use of this would be to create a custom Similarity, possibly extending DefaultSimilarity. Simply overriding the lengthNorm method of DefaultSimilarity would be the simplest approach. Its standard implementation is:
return (float)(1.0 / Math.sqrt(numTerms));
Which you could override with whatever you like.
That would work to tweak scoring based on a custom length-based calculation. If that's not what you are looking for, and you instead just need to fetch that information, I would think simply storing the field and getting the length from the field value returned when you fetch a Document would be the simplest implementation. Alternatively, you could save the length into a doc values field at index time, as sketched below.
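A self-contained sketch against the Lucene 5.x API; names like FieldLengthDemo and body_length are made up for illustration:

    import java.io.IOException;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.NumericDocValuesField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.LeafReaderContext;
    import org.apache.lucene.index.NumericDocValues;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;

    public class FieldLengthDemo {

        // Count tokens using the same analyzer that indexes the field.
        static long countTokens(Analyzer analyzer, String field, String text) throws IOException {
            try (TokenStream ts = analyzer.tokenStream(field, text)) {
                ts.reset();
                long n = 0;
                while (ts.incrementToken()) n++;
                ts.end();
                return n;
            }
        }

        public static void main(String[] args) throws IOException {
            Analyzer analyzer = new StandardAnalyzer();
            Directory dir = new RAMDirectory();
            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
                String text = "the quick brown fox";
                Document doc = new Document();
                doc.add(new TextField("body", text, Field.Store.NO));
                // Save the token count into a doc values field at index time.
                doc.add(new NumericDocValuesField("body_length",
                        countTokens(analyzer, "body", text)));
                writer.addDocument(doc);
            }
            // Post-indexing: read the stored length back, per segment.
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                for (LeafReaderContext leaf : reader.leaves()) {
                    NumericDocValues lengths = leaf.reader().getNumericDocValues("body_length");
                    System.out.println("doc 0 length = " + lengths.get(0));
                }
            }
        }
    }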
How can we know the index in Lucene is correct?
Detail
I created a simple program that created Lucene indexes and stored them in a folder. Using the diagnostic tool Luke, I could look inside an index and view the content.
I realise Lucene is a standard framework for building a search engine, but I wanted to be sure that Lucene indexes every term that exists in a file.
Can I verify that the Lucene index creation is dependable? That not even a single term went missing?
You could always build a small program that performs the same analysis you use when indexing your content and then, for every term, queries your index to make sure that the document is among the results (a sketch of such a check follows). Repeat for all the content. But personally, I would not waste time on this: if you can open your index in Luke and make a couple of queries, everything is most probably fine.
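A minimal sketch of such a check (Lucene 5.x-era API; allTermsPresent and its parameters are made-up names):

    import java.io.IOException;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;

    // Re-analyze the original text with the indexing analyzer and check that
    // every emitted term finds the expected document.
    static boolean allTermsPresent(IndexSearcher searcher, Analyzer analyzer,
                                   String field, String text, int expectedDoc)
            throws IOException {
        int maxDoc = Math.max(1, searcher.getIndexReader().maxDoc());
        try (TokenStream ts = analyzer.tokenStream(field, text)) {
            CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            boolean ok = true;
            while (ok && ts.incrementToken()) {
                TopDocs hits = searcher.search(
                        new TermQuery(new Term(field, termAtt.toString())), maxDoc);
                boolean found = false;
                for (ScoreDoc sd : hits.scoreDocs) {
                    if (sd.doc == expectedDoc) { found = true; break; }
                }
                ok = found; // false means a term went missing
            }
            ts.end();
            return ok;
        }
    }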
Often, the real question is whether or not the analysis you did on the content will be appropriate for the queries that will be made against your index. You have to make sure that your index will have a good balance between recall and precision.
I'm using Lucene to index components with names and types. Some components are more important and thus get a bigger boost. However, I cannot get my boost to work properly: I still see some components appear later (get a worse score), even though they have a higher boost.
Note that the indexing is done on one field only and I've set the boost to that field alone. I'm using Lucene in Java.
I don't think it has anything to do with the field length. I've seen components with the same name (but different type) get the wrong score.
Use Searcher.explain to find out how the scores for each document are derived. One of the key criteria in the score is the length of the field: a match in a shorter field gets a higher score.
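A minimal sketch, assuming an IndexSearcher named searcher and the TopDocs from your search:

    import org.apache.lucene.search.Explanation;

    // Prints the full score breakdown (term frequency, idf, norms, boosts)
    // for the first hit, so you can see why it outranked the others.
    Explanation explanation = searcher.explain(query, topDocs.scoreDocs[0].doc);
    System.out.println(explanation.toString());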
I suggest you use Luke to see exactly what is stored in your index. Are you using document boosting? See the scoring documentation to check possible explanations.
Boost is just one factor in the Lucene score for a hit. But it should work. Can you give a more complete example of the behavior you are seeing, and what you would expect?
As I recall, boosting is intended to make one field more important than another. If you have only one field, boosting won't change the order of the results at all.
Added: no, it looks like you can indeed boost specific documents. Oops!
Make sure field.omitNorms is set to false on the field you want to boost.
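A sketch with the Lucene 3.x-era Field API (the field and variable names are made up); the index-time field boost is folded into the norm, so omitting norms silently drops it:

    // Keep norms on the boosted field, otherwise the boost is discarded.
    Field nameField = new Field("name", componentName, Field.Store.YES, Field.Index.ANALYZED);
    nameField.setOmitNorms(false); // norms carry the boost
    nameField.setBoost(2.0f);      // index-time field boost
    doc.add(nameField);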