How to improve ranking of search in Apache Solr

I am implementing a search engine using Apache Solr. I want to rank results by how frequently each term has been searched. For example, suppose my index has 5 words: Down 99, Drawn 46, Dark 86, Dull 75, Dirty 63.
The numbers show how many times users have searched for each word.
When the next user types D, the response should be in descending order of previous search count: Down, Dark, Dull, Dirty, Drawn.
The results will change over time, since a word's search frequency changes after every search. How can I implement this in Solr? Any help would be appreciated. Thanks in advance.
Regards, A.S.Danyal

As vinod writes, you'll have to keep track of actual searches yourself - there is nothing built into Solr to handle this for you. However, once you DO have the search statistics available, you can implement the feature with a separate collection / core that holds searches and their popularity, and query against that. Each document would be a search term plus a count of how often that term has been searched, i.e. document: search, search_count.
You can also pass search_count through a logarithmic function and let the result contribute to the score of the search terms. This is useful if more than just the search field should influence the score (such as the active category, etc.), because the logarithm keeps very popular terms from drowning out the other signals.
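As a rough illustration of the logarithmic idea, here is a minimal Python sketch (not Solr code). The function name blended_score, the weight parameter and the fixed text_score of 1.0 are assumptions made for the example; the counts are the ones from the question.

```python
import math

def blended_score(text_score: float, search_count: int, weight: float = 1.0) -> float:
    """Combine a relevance score with a log-damped popularity signal."""
    # log10(1 + count) grows slowly: 99 vs 46 searches differ by far less than 2x
    return text_score + weight * math.log10(1 + search_count)

counts = {"Down": 99, "Drawn": 46, "Dark": 86, "Dull": 75, "Dirty": 63}

# Assume every term matched the prefix "D" equally well (text_score = 1.0)
ranked = sorted(counts, key=lambda t: blended_score(1.0, counts[t]), reverse=True)
print(ranked)
```

With equal text scores the order is decided by popularity alone; once other fields contribute to text_score, the damped count acts as a tie-breaker rather than the dominant factor.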
Depending on search volume, you probably don't need to update these values after every single search - updating them once a day or every couple of hours will usually be good enough. Keep track of the terms whose search volume has changed since the last update, and update those documents in a batch job at regular intervals.
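The approach described above can be sketched in plain Python. This is an in-memory stand-in for the separate search-term core, not a Solr client: the class SearchPopularity and its methods record_search, flush and suggest are hypothetical names, and a real setup would send the changed documents to Solr inside flush.

```python
from collections import Counter

class SearchPopularity:
    """Hypothetical in-memory stand-in for the separate search-term core."""

    def __init__(self, initial=None):
        self.indexed = Counter(initial or {})  # counts visible to queries
        self.pending = Counter()               # searches seen since the last batch

    def record_search(self, term):
        # Cheap per-search bookkeeping; the "index" is not touched here
        self.pending[term] += 1

    def flush(self):
        """Batch job: apply pending deltas and return the terms that changed."""
        changed = sorted(self.pending)
        self.indexed.update(self.pending)  # Counter.update adds the counts
        self.pending.clear()
        return changed

    def suggest(self, prefix, limit=10):
        # Case-insensitive prefix match, most-searched terms first
        p = prefix.lower()
        hits = [t for t in self.indexed if t.lower().startswith(p)]
        hits.sort(key=lambda t: (-self.indexed[t], t))
        return hits[:limit]

store = SearchPopularity({"Down": 99, "Drawn": 46, "Dark": 86, "Dull": 75, "Dirty": 63})
before = store.suggest("D")
for _ in range(60):
    store.record_search("Drawn")
unflushed = store.suggest("D")   # still the old order, the batch has not run
changed = store.flush()
after = store.suggest("D")       # Drawn now has 106 searches
print(before, unflushed, changed, after, sep="\n")
```

Only the terms returned by flush need to be re-sent to the index, which keeps the batch job small even when the term list is large.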

Solr doesn't provide this kind of feature.
One way to achieve this is to build an index of the search terms users have entered, which you can do by mining your search logs.

Related

Querying Apache Solr based on score values

I am working on an image retrieval task. I have a dataset of Wikipedia images with their textual descriptions in XML files (1 XML file per image). I have indexed those XML files in Solr. When retrieving them, I want to apply a threshold on the score so that documents with a low score do not appear in the results (because they are not of much importance). For example, I want to retrieve all documents with a similarity score greater than or equal to 2.0. I have already tried range queries like score:[2.0 TO *] but can't get them working. Does anyone have any idea how I can do that?
What's the motivation for wanting to do this? The reason I ask, is
score is a relative thing determined by Lucene based on your index
statistics. It is only meaningful for comparing the results of a
specific query with a specific instance of the index. In other words,
it isn't useful to filter on b/c there is no way of knowing what a
good cutoff value would be.
http://lucene.472066.n3.nabble.com/score-filter-td493438.html
Also, take a look here - http://wiki.apache.org/lucene-java/ScoresAsPercentages
So, in general it's a bad idea to cut off at some fixed value, because you can never know which threshold is best. For one query a sensible cutoff might be score=2, for another score=0.5, etc.
These two links should explain why you DON'T want to do it.
P.S. If you still want to do it, take a look here - https://stackoverflow.com/a/15765203/2663985
P.P.S. I recommend fixing your search queries instead, so that they return results with higher precision (http://en.wikipedia.org/wiki/Precision_and_recall)
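For readers who still want a cutoff despite the caveats above: one approach often suggested for Solr is a filter query that uses the frange parser over query($q), which evaluates to the score of the main query. The Python sketch below only builds the request URL. The host, the core name images, the field description and the helper name threshold_params are assumptions for illustration, not values from the question.

```python
from urllib.parse import urlencode

def threshold_params(user_query, min_score):
    """Build Solr request parameters that drop documents scoring below min_score."""
    return {
        "q": user_query,
        "fl": "*,score",
        # frange filters on a function value; query($q) is the main query's score
        "fq": "{!frange l=%s}query($q)" % min_score,
    }

params = threshold_params("description:castle", 2.0)
# Hypothetical host and core name
url = "http://localhost:8983/solr/images/select?" + urlencode(params)
print(url)
```

The threshold remains specific to one query against one state of the index, so it would need tuning per use case.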

Lucene Paging with search

Hello, I am currently using Lucene 4.6.1.
In my design I need to be able to search and page through possibly many results, so I have some general questions about optimization.
First, in "search(Query query, int n)", what is the purpose of the parameter "n"? Is "n" different from "totalHits"? How should this number be chosen, and based on what criteria?
Second, it seems that there are two general approaches to paging. I can either use "searchAfter" or process the "ScoreDoc[]" array given a page size.
Which approach do most people currently recommend, and what design considerations does it involve?
searchAfter can be used for efficient "deep paging".
A tutorial on using it with Solr
http://heliosearch.org/solr/paging-and-deep-paging/
The int passed to search is the maximum number of hits the search will return. totalHits, from TopDocs, is the total number of hits for the query. It may be more or less than the value passed in.
Not clear to me what you mean by processing the ScoreDoc array. searchAfter is specifically intended to be used for pagination. Use it.
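To make the difference between the two paging styles concrete, here is a small Python model of the idea (it does not call Lucene). The helpers ranked, page_by_offset and page_after are hypothetical names; hits are (score, doc_id) pairs ordered by descending score with ties broken by ascending doc id.

```python
def ranked(hits):
    # Highest score first; ties broken by ascending doc id
    return sorted(hits, key=lambda h: (-h[0], h[1]))

def page_by_offset(hits, page, page_size):
    """Collect the top (page + 1) * page_size hits, then keep only the last page."""
    n = (page + 1) * page_size          # plays the role of "n" in search(query, n)
    top = ranked(hits)[:n]
    return top[page * page_size:]

def page_after(hits, after, page_size):
    """searchAfter-style: keep only hits that rank after the last hit already seen."""
    order = ranked(hits)
    if after is not None:
        key = (-after[0], after[1])
        order = [h for h in order if (-h[0], h[1]) > key]
    return order[:page_size]

hits = [(1.5, 0), (0.2, 1), (1.5, 2), (0.9, 3), (0.7, 4)]

offset_pages = [page_by_offset(hits, p, 2) for p in range(3)]

cursor_pages, last = [], None
while True:
    page = page_after(hits, last, 2)
    if not page:
        break
    cursor_pages.append(page)
    last = page[-1]                     # remember the last hit as the cursor

print(offset_pages)
print(cursor_pages)
```

Both loops return the same pages. The difference is cost: with offsets, n grows with the page number, while the cursor style only ever asks for one page of hits beyond the last one seen. This toy version sorts everything for simplicity, which a real searcher would not do.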

Elasticsearch - higher scoring if higher frequency of term

I have 2 documents, and am searching for the keyword "Twitter". Suppose both documents are blog posts with a "tags" field.
Document A has ONLY 1 term in the "tags" field, and it's "Twitter".
Document B has 100 terms in the "tags" field, but 3 of them are "Twitter".
Elasticsearch gives the higher score to Document A even though Document B has a higher term frequency, because B's score is "diluted" by having more terms. How do I give Document B a higher score, since it has a higher frequency of the search term?
I know Elasticsearch/Lucene performs some normalization based on the number of terms in the document. How can I disable this normalization, so that Document B gets the higher score?
As the other answer says, it would be interesting to see whether you get the same result on a single shard. I think you would, because the result depends on the norms for the tags field, which are taken into account when computing the score with the default tf/idf similarity.
In fact, Lucene does take into account the term frequency, in other words the number of times the term appears within the field (1 or 3 in your case), and the inverse document frequency, in other words how frequent the term is in the index, which is used to weigh it against the other terms in the query (in your case it makes no difference, since you are searching for a single term).
But there's another factor called norms, which rewards shorter fields and also takes into account any index-time boosting, which can be per field (in the mapping) or even per document. You can verify that norms are the reason for your result by enabling the explain option in your search request and looking at the explain output.
I guess the fact that the first document contains only that tag makes it more important than the ones that contain the tag multiple times but a lot of other tags as well. If you don't like this behaviour you can simply disable norms in your mapping for the tags field. Norms are enabled by default if the field is "index":"analyzed" (the default). You can either switch to "index":"not_analyzed" if you don't want your tags field to be analyzed (this usually makes sense, but depends on your data and domain) or add the "omit_norms": true option to the mapping for your tags field.
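The effect of norms on the two documents from the question can be worked through with a simplified version of Lucene's classic tf/idf formula: tf is the square root of the term frequency and the length norm is 1 / sqrt(number of terms in the field). The Python sketch below is an approximation only. It leaves out idf and query normalization (identical for both documents in a single-term query) as well as the lossy encoding Lucene applies to stored norms, and classic_score is a hypothetical helper name.

```python
import math

def classic_score(term_freq, field_length, use_norms=True):
    """Simplified classic tf/idf contribution of one term in one field."""
    tf = math.sqrt(term_freq)
    # The length norm rewards short fields; without norms it is a constant 1.0
    norm = 1.0 / math.sqrt(field_length) if use_norms else 1.0
    return tf * norm

doc_a = {"term_freq": 1, "field_length": 1}     # only tag is "Twitter"
doc_b = {"term_freq": 3, "field_length": 100}   # 3 of 100 tags are "Twitter"

with_norms = (classic_score(**doc_a), classic_score(**doc_b))
without_norms = (classic_score(use_norms=False, **doc_a),
                 classic_score(use_norms=False, **doc_b))

print("with norms:    A=%.3f B=%.3f" % with_norms)
print("without norms: A=%.3f B=%.3f" % without_norms)
```

With norms, Document A wins because its field is 100 times shorter; with norms omitted, only the term frequency is left and Document B comes out ahead.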
Are the documents found on different shards? From the Elasticsearch documentation:
"When a query is executed on a specific shard, it does not take into account term frequencies and other search engine information from the other shards. If we want to support accurate ranking, we would need to first execute the query against all shards and gather the relevant term frequencies, and then, based on it, execute the query."
The solution is to specify the search type. Use dfs_query_and_fetch search type to execute an initial scatter phase which goes and computes the distributed term frequencies for more accurate scoring.
You can read more here.

Understanding the relationship between boosting a document in Lucene at index time and its corresponding score at search time

When indexing, I boost certain documents, but they do not appear on the top of the list of retrieved documents. I looked at the score of those documents, and somehow, the score of the documents retrieved is always NaN.
What is the relationship between a boost of a document at index time and its score at retrieve time? I thought these would be correlated, and further, I thought I would get a wide range of scores in my scoredocs, not just NaN. If you can shed some light on this I would be grateful.
I have read http://lucene.apache.org/java/2_3_2/api/org/apache/lucene/search/Similarity.html
and can't figure out what is missing.
Here is the simple boosting code:
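// Boost selected documents slightly above the default of 1.0
if (myCondition)
{
    myDocument.SetBoost(1.1f);
}
// Add the same document instance that was boosted above
myIndexWriter.AddDocument(myDocument);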
I'm going to take a wild guess here, since you haven't provided a sample of your search code, but a common reason why the score of retrieved docs is NaN is that you use a Sort. When sorting, the score of the documents is usually not needed, so score computation is disabled by default.
If you use a Sort for your search and want the score, check the setDefaultFieldSortScoring method of the IndexSearcher class. This method lets you enable scoring for documents in a search that uses a Sort.
http://lucene.apache.org/java/2_9_4/api/all/org/apache/lucene/search/IndexSearcher.html#setDefaultFieldSortScoring(boolean, boolean)

Bubbling up newest content in lucene search results

I am storing various articles in my lucene index.
When a user searches for articles that contain a specific term or phrase, I need to show all the articles (anywhere between 1000 and 10000 articles) but with the newest articles "bubbled up" in the search results.
I believe you can bubble up a search result in Lucene using "Date field Boosting".
Can someone please give me the details of how to go about this?
Thanks in advance!
I would implement the SortComparatorSource interface. You should write a new ScoreDocComparator, whose compare() function compares two dates. Then you will need to sort your searches using the new sorter. This advice is taken from chapter 6 of Lucene in Action.
You can use the setBoost method to set the "boost" for a particular document in the index at index time. Since the default boost value is 1.0, setting a value less than 1.0 will make the document "less relevant" in search results. By tying the boost value of a document to its age (lower boost the older the document gets), you can make newer content seem more relevant in search results.
Note in the documentation for setBoost that the boost value set at indexing time is not available for retrieved documents (boost works, you just can't read the value back at retrieval time to see if you applied the correct value at index time).
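One way to tie the boost to a document's age is an exponential decay, sketched here in Python rather than Lucene code. The function age_boost, the one-year half-life, the fixed "today" and the sample articles are all assumptions for illustration; the boost stays in the range (0, 1], matching the idea of lowering it as a document gets older.

```python
from datetime import date

def age_boost(published: date, today: date, half_life_days: float = 365.0) -> float:
    """Boost of 1.0 for a brand-new document, halving every half_life_days."""
    age_days = max((today - published).days, 0)
    return 0.5 ** (age_days / half_life_days)

today = date(2014, 1, 1)  # fixed so the example is reproducible

# (title, relevance score from the text match, publication date)
articles = [
    ("older, slightly better match", 1.0, date(2012, 1, 1)),
    ("newer, slightly weaker match", 0.8, date(2013, 12, 1)),
]

ranked = sorted(articles, key=lambda a: a[1] * age_boost(a[2], today), reverse=True)
for title, score, published in ranked:
    print(title, round(score * age_boost(published, today), 3))
```

Since an index-time boost is fixed when the document is written, a scheme like this would presumably require re-indexing documents periodically so that their boost keeps reflecting their current age.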