Bubbling up newest content in lucene search results

I am storing various articles in my lucene index.
When a user searches for articles that contain a specific term or phrase, I need to show all the articles (anywhere between 1,000 and 10,000 of them), but with the newest articles "bubbled up" in the search results.
I believe you can bubble up a search result in Lucene using "Date field Boosting".
Can someone please give me the details of how to go about this?
Thanks in advance!

I would implement the SortComparatorSource interface. You should write a new ScoreDocComparator, whose compare() function compares two dates. Then you will need to sort your searches using the new sorter. This advice is taken from chapter 6 of Lucene in Action.
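The comparison logic such a sorter needs can be sketched in plain Java. This is only an illustration of "newest first" ordering, not the Lucene SortComparatorSource API itself; the NewestFirst and Hit names are hypothetical and stand in for whatever your search hits look like.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class NewestFirst {
    // Hypothetical stand-in for a search hit; not a Lucene class.
    static final class Hit {
        final String title;
        final LocalDate published;

        Hit(String title, LocalDate published) {
            this.title = title;
            this.published = published;
        }
    }

    // Orders hits so the most recent publication date comes first.
    static final Comparator<Hit> BY_DATE_DESC =
            Comparator.comparing((Hit h) -> h.published).reversed();

    // Returns the titles of the hits, newest first, without modifying the input list.
    static List<String> sortedTitles(List<Hit> hits) {
        List<Hit> copy = new ArrayList<>(hits);
        copy.sort(BY_DATE_DESC);
        List<String> titles = new ArrayList<>();
        for (Hit h : copy) {
            titles.add(h.title);
        }
        return titles;
    }
}
```

Note that a pure date sort ignores relevance entirely, which may or may not be what you want when every hit already matches the query.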

You can use the setBoost method to set the "boost" for a particular document in the index at index time. Since the default boost value is 1.0, setting a value less than 1.0 will make the document "less relevant" in search results. By tying the boost value of a document to its age (lower boost the older the document gets), you can make newer content seem more relevant in search results.
Note in the documentation for setBoost that the boost value set at indexing time is not available for retrieved documents (boost works, you just can't read the value back at retrieval time to see if you applied the correct value at index time).
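One possible way to derive such a boost from a document's age is sketched below. The decay formula and the halfLifeDays parameter are assumptions chosen for illustration, not something Lucene prescribes; the resulting float would be the value you pass to setBoost.

```java
class AgeBoost {
    // Returns 1.0 for a brand-new document and decays toward 0 as the document ages.
    // halfLifeDays is a hypothetical tuning knob: the age at which the boost drops to 0.5.
    static float boostFor(long ageDays, long halfLifeDays) {
        if (ageDays < 0) {
            ageDays = 0; // treat future-dated documents as brand new
        }
        return (float) (1.0 / (1.0 + (double) ageDays / halfLifeDays));
    }
}
```

With a half-life of 30 days, a 30-day-old document gets 0.5 and a 90-day-old one gets 0.25. Keep in mind that the boost is fixed when the document is indexed, so it reflects the age at indexing time until the document is reindexed.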

Related

SOLR and Ratio of Matching Words

Using SOLR version 4.3, it appears that SOLR is valuing the percentage of matching terms more than the number of matching terms.
For example, we search for Dog, and a document with just the word dog and three other words is returned. We have another article with hundreds of words that contains the word dog 27 times.
I would expect the second article to be returned first. However, the one where dog is one of only four words is returned first. I was hoping to find out what in SOLR controls this so I can make the appropriate modifications. I have looked at the SOLR documentation and have seen COORD mentioned, but it seems to indicate that the article with 27 references should be returned first. Any help would be appreciated.

For 4.x Solr still used regular TF/IDF as its scoring formula, and you can see the Lucene implementation detailed in the documentation for TFIDFSimilarity.
For your question, the two factors that affect the score are:
The length of the field, as given in norm():
norm(t,d) encapsulates a few (indexing time) boost and length factors:
Field boost - set by calling field.setBoost() before adding the field to a document.
lengthNorm - computed when the document is added to the index in accordance with the number of tokens of this field in the document, so that shorter fields contribute more to the score. LengthNorm is computed by the Similarity class in effect at indexing.
.. while the number of terms matching (not their frequency), is given by coord():
coord(q,d) is a score factor based on how many of the query terms are found in the specified document. Typically, a document that contains more of the query's terms will receive a higher score than another document with fewer query terms. This is a search time factor computed in coord(q,d) by the Similarity in effect at search time.
There are a few settings in your schema that can affect how Solr scores the documents in your example:
omitNorms
If true, omits the norms associated with this field (this disables length normalization for the field, and saves some memory)
.. this will remove the norm() part of the score.
omitTermFreqAndPositions
If true, omits term frequency, positions, and payloads from postings for this field.
.. and this will remove the boost from multiple occurrences of the same term. Be aware that this will remove positions as well, making phrase queries impossible.
But you should also consider upgrading Solr, as the BM25 similarity that's the default from 6.x usually performs better. I can't remember if a version is available for 4.3.
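To see why the short document can win, here is a rough sketch of the two parts of the classic TF/IDF formula that differ between the documents: tf(freq) = sqrt(freq) and lengthNorm = 1/sqrt(number of tokens). The token counts (4 and 400) are assumptions based on the question, and the sketch ignores boosts and the lossy one-byte encoding of norms. idf, queryNorm and coord are the same for both documents with a one-term query, so they are left out.

```java
class ClassicTfIdfSketch {
    // Term frequency factor of the classic similarity: sqrt(freq).
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    // Length normalization without boosts: 1 / sqrt(number of tokens in the field).
    static double lengthNorm(int numTokens) {
        return 1.0 / Math.sqrt(numTokens);
    }

    // The part of the score that differs between two documents for a one-term query.
    static double partialScore(int freq, int numTokens) {
        return tf(freq) * lengthNorm(numTokens);
    }

    public static void main(String[] args) {
        // Short document: "dog" once among 4 tokens.
        System.out.println(partialScore(1, 4));
        // Long document: "dog" 27 times among an assumed 400 tokens.
        System.out.println(partialScore(27, 400));
    }
}
```

The short document comes out at 0.5 and the long one at roughly 0.26, so under these assumptions the length norm outweighs the extra occurrences. Setting omitNorms removes the lengthNorm factor from the comparison.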

How to improve ranking of search in Apache Solr

I am implementing a search engine using Apache Solr. I want to improve results on the basis of the most frequent searches. For example, consider that my index has 5 words:
Down 99
Drawn 46
Dark 86
Dull 75
Dirty 63
The numbers show how many times users searched for a particular word.
When the next user comes in and types D, the response should be in descending order of previous searches, i.e. in the order Down, Dark, Dull, Dirty, Drawn.
The results will change from time to time, as the search frequency of each word changes after every search. How can I implement this in Solr? Any help would mean a lot. Thanking you in anticipation.
Regards A.S.Danyal

As vinod writes, you'll have to keep track of actual searches yourself - there is nothing built-in to Solr to handle this for you. However, when you DO have the search statistics available, you can implement the feature by having a separate collection / core with searches and their popularity that you search against. Each document would be a search term and the frequency of how often that document is searched, i.e. document: search, search_count.
You can also use a logarithmic function to use the score of a search_count to affect the score of the search terms, for example if you have more than just the search as a field to influence the score (such as active category, etc.).
Depending on search volume, you probably don't need to update these values after each single search - just updating it once a day or every other hour will usually be good enough. Keep track of the terms that have changed in search volume since the last update, and update those documents in a batch job in certain intervals.
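The ordering that such a separate collection would produce can be sketched in memory with plain Java. This is not Solr code; the PopularSuggestions class and its method names are hypothetical, and the map stands in for documents with search and search_count fields. The dampenedWeight method shows one way a logarithm could tone down the raw count if it is combined with other score factors.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class PopularSuggestions {
    // Returns the terms starting with the prefix, most frequently searched first.
    static List<String> suggest(Map<String, Integer> searchCounts, String prefix) {
        String p = prefix.toLowerCase(Locale.ROOT);
        List<Map.Entry<String, Integer>> matches = new ArrayList<>();
        for (Map.Entry<String, Integer> e : searchCounts.entrySet()) {
            if (e.getKey().toLowerCase(Locale.ROOT).startsWith(p)) {
                matches.add(e);
            }
        }
        // Sort by search_count, descending.
        matches.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        List<String> terms = new ArrayList<>();
        for (Map.Entry<String, Integer> e : matches) {
            terms.add(e.getKey());
        }
        return terms;
    }

    // Logarithmic dampening of a search count: ln(1 + count).
    static double dampenedWeight(int searchCount) {
        return Math.log1p(searchCount);
    }
}
```

With the counts from the question, suggest(counts, "D") yields Down, Dark, Dull, Dirty, Drawn.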

Solr doesn't provide this kind of feature.
One way to achieve this is by using logs: you will need an index of the search terms users have entered, which can be built by mining your search logs.

Lucene query documents that don't have a specific field

I am using Lucene on Android to search my content. I have two types of documents, and one has a trashed field which is either true or false. The other type of document does not have that field. I want to return all documents that have trashed:false or don't have the trashed field.
I have tried adding -trashed:true to my query, which returns all the correct documents, but it messes up the offsets of the search: they surround a different word and not the one I am searching for.
EDIT:
I have to add this to every search query I perform. I have an index of approximately 20,000 documents, and I would really like to avoid rebuilding it because I already had my users rebuild their indices in my last release. Note: this is on Android devices, so reindexing all of their documents takes a long time and a lot of battery.
Thanks for the help.

I can think of the following solution.
1) If you can rebuild the index:
Add a trashed:na field value to the docs for which "trashed" is not applicable.
To get all the docs where trashed is false or "trashed" is not applicable, you can use the following.
Query: trashed:false OR trashed:na
2) If you cannot rebuild the index, I am not sure...

Boosting Lucene Terms When Building the Index

Is it possible to specify that certain terms are more important than others when creating the index (not when querying it)?
Consider for example a synonym filter:
doc 1: "this is a nice car"
doc 2: "this is a nice vehicle"
I want to add the term vehicle to the first doc and the term car to the second doc, but if the index is later queried with the word car, the first document should score higher than the second one, and if it is queried for vehicle, it should be the other way around.
Will calling setBoost on the fields before adding them to their respective documents do the trick?
Or maybe I should add the synonyms to a different field name?
Or am I looking at this from the wrong point of view?
Thanks

Setting a boost on a field affects all terms in that field, so this wouldn't work in your case.
But it should be possible using Lucene payloads (a byte array that can be set for every term). You would use them to set term-specific boosts (vehicle to 0.5 for doc 1, for example). Then you implement your own Similarity and override the scorePayload() method to decode that boost, and use PayloadTermQuery, which lets the boost stored in the payload for that term contribute to the score.
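Since a payload is just a byte array, the boost has to be packed into bytes at index time and unpacked at scoring time. A minimal sketch of that round trip using only java.nio is shown below; the PayloadBoostCodec class is hypothetical, and the 4-byte big-endian layout is an assumption rather than something the answer specifies.

```java
import java.nio.ByteBuffer;

class PayloadBoostCodec {
    // Packs a per-term boost into 4 bytes (big-endian IEEE 754 float).
    static byte[] encode(float boost) {
        return ByteBuffer.allocate(4).putFloat(boost).array();
    }

    // Reads the boost back from a payload, starting at the given offset.
    static float decode(byte[] payload, int offset) {
        return ByteBuffer.wrap(payload, offset, 4).getFloat();
    }
}
```

At index time you would attach encode(0.5f) to the synonym term, and your scorePayload() override would call something like decode(...) on the bytes it receives.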

Understanding the relationship between boosting a document in lucene at index time and its corresponding score at search time

When indexing, I boost certain documents, but they do not appear at the top of the list of retrieved documents. I looked at the scores of those documents, and somehow the score of every retrieved document is NaN.
What is the relationship between the boost of a document at index time and its score at retrieval time? I thought these would be correlated, and further, I thought I would get a wide range of scores in my scoredocs, not just NaN. If you can shed some light on this I would be grateful.
I have read http://lucene.apache.org/java/2_3_2/api/org/apache/lucene/search/Similarity.html
and can't figure out what is missing.
Here is the simple boosting code:
// Boost the document before it is added to the index
if (myCondition)
{
    myDocument.SetBoost(1.1f);
}
myIndexWriter.AddDocument(myDocument);

I'm going to take a wild guess here, since you haven't provided a sample of your search code, but a common reason why the score of retrieved docs is NaN is that you use a Sort. When sorting, the score of the documents is usually not needed, so scoring is disabled by default.
If you use a Sort for your search and want the score, check the setDefaultFieldSortScoring method of the IndexSearcher class. This method lets you enable scoring of the documents in a search that uses a Sort.
http://lucene.apache.org/java/2_9_4/api/all/org/apache/lucene/search/IndexSearcher.html#setDefaultFieldSortScoring(boolean, boolean)