Is it possible to boost recent documents in a RavenDB query?
This question is exactly what I want to do but refers to native Lucene, not RavenDB.
For example, if I have a Document like this
public class Document
{
    public string Title { get; set; }
    public DateTime DateCreated { get; set; }
}
How can I boost documents whose dates are closer to a given date, e.g. DateTime.UtcNow?
I do not want to OrderByDescending(x => x.DateCreated), as there are other search parameters that need to affect the results.
You can boost during indexing; this has been in RavenDB for quite some time, but it isn't in the documentation at all. However, there are some unit tests that illustrate it here.
Those tests show a single boost value, but it can easily be calculated from other document values instead. You have the full document available to you since this is done when the index entries are written. You should be able to combine this with the technique described in the post you referenced.
Map = docs => from doc in docs
              select new
              {
                  Title = doc.Title.Boost(doc.DateCreated.Ticks / 1000000f)
              };
You could also boost the entire document instead of just the Title field, which might be useful if you have other fields in your search algorithm:
Map = docs => from doc in docs
              select new
              {
                  doc.Title
              }.Boost(doc.DateCreated.Ticks / 1000000f);
You may need to experiment to find the right value for the boost amount. There are 10,000 ticks in a millisecond, which is why I divide by such a large number.
Also, be careful that the DateTime you're working with is in UTC, or, if you don't have control over where it comes from, use a DateTimeOffset instead. Why? Because you're computing a duration from a reference point, and you don't want the result to be ambiguous across time zones or around daylight saving time changes.
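Putting it together, a minimal sketch of what the whole-document boost might look like inside a complete index definition (the index name Documents_ByRecency is made up for illustration; this assumes DateCreated is stored in UTC):
using Raven.Client.Indexes;

public class Documents_ByRecency : AbstractIndexCreationTask<Document>
{
    public Documents_ByRecency()
    {
        // Newer documents get proportionally larger boosts,
        // since Ticks grows monotonically with the date.
        Map = docs => from doc in docs
                      select new
                      {
                          doc.Title
                      }.Boost(doc.DateCreated.Ticks / 1000000f);
    }
}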
I'm using Sitecore with Lucene, and I'm trying to facet on an integer field, so that I can get all of the existing values for that field. I have the following search result class with a definition for the field:
public class ContentTypeSearchResultItem : Sitecore.ContentSearch.SearchTypes.SearchResultItem
{
    [Sitecore.ContentSearch.IndexField("crop_heat_units")]
    public int CropHeatUnits { get; set; }
}
In my query, I have:
query = query.FacetOn(x => x.CropHeatUnits);
I have a number of other facets of type ID or IEnumerable<Guid>, and those work as I expect, but the crop_heat_units facet gives me weird results, such as chufacet.Values[0].Name = \u0001\0\0\0\0\0\0\0\u000e\b. Some of the other values are #\b\0\0\0\0 and 8\u0010\0\0\0\0\0.
In Sitecore, the values of the Crop Heat Units field are things like "2075" and "2200".
Each numeric value is indexed as a trie structure within Lucene: each term is logically assigned to larger and larger predefined brackets, which are simply lower-precision representations of the value.
So, the simple solution is to change int to string for your CropHeatUnits field definition and remove it from the fieldmap. Then your queries and facets should work as expected. If you want to use the CropHeatUnits values as integers, you will need to convert the string values back to integers after retrieving them from Lucene.
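As a sketch, the adjusted result item might look like this (same class as in the question, with only the property type changed):
public class ContentTypeSearchResultItem : Sitecore.ContentSearch.SearchTypes.SearchResultItem
{
    // Declared as string so the raw stored values ("2075", "2200")
    // come back verbatim in facets instead of trie-encoded terms.
    [Sitecore.ContentSearch.IndexField("crop_heat_units")]
    public string CropHeatUnits { get; set; }
}
You can then convert after retrieval, e.g. int chu = int.Parse(item.CropHeatUnits);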
In my Elasticsearch index I have documents that have multiple tokens at the same position.
I want to get a document back when I match at least one token at every position.
The order of the tokens is not important.
How can I accomplish that? I use Elasticsearch 0.90.5.
Example:
I index a document like this.
{
"field":"red car"
}
I use a synonym token filter that adds synonyms at the same positions as the original token.
So now in the field, there are 2 positions:
Position 1: "red"
Position 2: "car", "automobile"
My solution for now:
To be able to ensure that all positions match, I index the maximum position as well.
{
"field":"red car",
"max_position": 2
}
I have a custom similarity that extends DefaultSimilarity and returns 1 for tf(), idf() and lengthNorm(). The resulting score is then the number of matching terms in the field.
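As an illustration of what that similarity does (shown here with Lucene.NET method names; an actual Elasticsearch similarity plugin would be Java):
using Lucene.Net.Search;

public class TermCountSimilarity : DefaultSimilarity
{
    // With tf, idf and lengthNorm all flattened to 1, the score of a
    // match query degenerates to the number of matching terms.
    public override float Tf(float freq) { return 1f; }
    public override float Idf(int docFreq, int numDocs) { return 1f; }
    public override float LengthNorm(string fieldName, int numTerms) { return 1f; }
}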
Query:
{
    "custom_score": {
        "query": {
            "match": {
                "field": "a car is an automobile"
            }
        },
        "_script": "_score*100/doc[\"max_position\"]+_score"
    },
    "min_score": "100"
}
Problem with my solution:
The above search should not match the document, because there is no token "red" in the query string. But it matches anyway: Elasticsearch counts the matches for "car" and "automobile" as two matches, which gives a score of 2 and hence a script score of 102, satisfying the "min_score".
If you needed to guarantee 100% matches against the query terms you could use minimum_should_match. This is the more common case.
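For reference, a minimal sketch of that common case against the example field (0.90-era query syntax; note this still would not solve the indexed-terms problem below):
{
    "query": {
        "match": {
            "field": {
                "query": "a car is an automobile",
                "minimum_should_match": "100%"
            }
        }
    }
}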
Unfortunately, in your case, you wish to guarantee 100% matches of the indexed terms. To do this, you'll have to drop down to the Lucene level and write a custom Similarity class (in Java - here's boilerplate you can fork), because you need access to low-level index information that is not exposed to the Query DSL:
Per document/field scanned in the query scorer, you need:
- Number of analyzed terms matched ("overlap" is the Lucene terminology; it is used in the coord() method of the DefaultSimilarity class)
- Number of total analyzed terms in the field: look at this thread for a couple of different ways to get this information: How to count the number of terms for each document in lucene index?
Then your custom similarity (you can probably even extend DefaultSimilarity) will need to detect queries where terms matched < total terms and multiply their score by zero.
Since query and index-time analysis have already happened at this level of scoring, the total number of indexed terms will already be expanded to include synonyms, as should the query terms, avoiding the false-positive "a car is an automobile" issue above.
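As a rough illustration of the overlap half of this (Lucene.NET syntax; on its own this only enforces that 100% of the query terms matched - the indexed-terms side still needs the per-document term count described above):
using Lucene.Net.Search;

public class AllOrNothingSimilarity : DefaultSimilarity
{
    // overlap = number of query terms that matched this document,
    // maxOverlap = total number of query terms. Returning 0 for any
    // partial match multiplies the whole score by zero.
    public override float Coord(int overlap, int maxOverlap)
    {
        return overlap == maxOverlap ? 1.0f : 0.0f;
    }
}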
I am trying to boost certain documents, but they don't get boosted. Please tell me what I am missing. Thanks!
In my index code I have:
if (myCondition)
{
    myDocument.SetBoost(1.1f);
}
myIndexWriter.AddDocument(myDocument);
then in my search code I retrieve the collection of documents from the ScoreDocs object into myDocuments collection and:
foreach (Lucene.Net.Documents.Document doc in myDocuments)
{
    float tempboost = doc.GetBoost();
}
I place a conditional breakpoint inside the foreach loop, set to break when tempboost is not 1, and the breakpoint is never hit.
What did I miss?
Many thanks!
From the Lucene javadoc (Java version, but the same behavior applies to Lucene.NET):
public float getBoost()

Returns, at indexing time, the boost factor as set by setBoost(float).

Note that once a document is indexed this value is no longer available from the index. At search time, for retrieved documents, this method always returns 1. This however does not mean that the boost value set at indexing time was ignored - it was just combined with other indexing time factors and stored elsewhere, for better indexing and search performance.
Note: for those of you who get NaN when retrieving the score, use the following line:
searcher.SetDefaultFieldSortScoring(true,true);
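Combining the two points, a minimal sketch of verifying the boost through hit scores instead of GetBoost() (the directory and query variables are assumed to exist; member casing follows Lucene.NET 3.0.3 and older versions differ slightly):
using System;
using Lucene.Net.Search;
using Lucene.Net.Store;

static void PrintScores(Directory directory, Query query)
{
    var searcher = new IndexSearcher(directory, true);
    // Without this, searches sorted by a field can report NaN as the score.
    searcher.SetDefaultFieldSortScoring(true, true);

    TopDocs hits = searcher.Search(query, 10);
    foreach (ScoreDoc sd in hits.ScoreDocs)
    {
        // The index-time boost shows up here as a higher score;
        // doc.GetBoost() on a retrieved Document always returns 1.
        Console.WriteLine("doc {0} score {1}", sd.Doc, sd.Score);
    }
}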
Imagine I have a huge database of threads and posts (about 10,000,000 records) from different forum sites, including several subforums, that serve as my Lucene documents.
Now I am trying to calculate a feature called "OnTopicness" for each post, based on the terms used in it. In fact, this feature is not much more than a simple cosine similarity between two document vectors; it will be stored in the database and therefore has to be calculated only once per post:
Forum-OnTopicness: cosine similarity between my post and a virtual document consisting of all other posts in the specified forum (including all threads in the forum)
Thread-OnTopicness: cosine similarity between my post and a virtual document consisting of all other posts in the specified thread
Since the Lucene.NET API doesn't offer a method to calculate a document-to-document or document-to-index cosine similarity, I read that I could either parse one of the documents as a query and search for the other document in the results, or manually calculate the similarity using TermFreqVectors and DocFrequencies.
I tried the second approach because it sounds faster, but ran into a problem: the IndexReader.GetTermFreqVector() method takes the internal docNumber as a parameter, which I don't know if I just pass two Documents to my GetCosineSimilarity method:
public void GetCosineSimilarity(Document doc1, Document doc2)
{
    using (IndexReader reader = IndexReader.Open(FSDirectory.Open(indexDir), true))
    {
        // how do I get the docNumbers?
        TermFreqVector tfv1 = reader.GetTermFreqVector(???, "PostBody");
        TermFreqVector tfv2 = reader.GetTermFreqVector(???, "PostBody");
        ...
        // assuming that I have the TermFreqVectors, how would I continue here?
    }
}
Besides that, how would you create the mentioned "virtual document" for either a whole forum or a thread? Should I just concatenate the PostBody fields of all contained posts and parse them into a new document, or can I create an index for them and somehow compare my post to this entire index?
As you can see, as a Lucene newbie, I am still not sure about my overall index design and could definitely use some general advice. Help is highly appreciated - thanks!
Take a look at MoreLikeThisQuery in
https://svn.apache.org/repos/asf/incubator/lucene.net/trunk/src/contrib/Queries/Similar/
Its source may be useful.
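If you instead go the manual TermFreqVector route from the question, a rough sketch might look like this (Lucene.NET 3.0.3-style API; the unique "Id" field is an assumption, and term vectors must have been enabled on "PostBody" at indexing time). It computes a plain term-frequency cosine, without idf weighting:
using System;
using System.Collections.Generic;
using Lucene.Net.Index;
using Lucene.Net.Search;

static int GetDocNumber(IndexSearcher searcher, string postId)
{
    // Look the post up by its unique id to obtain the internal doc number.
    TopDocs hits = searcher.Search(new TermQuery(new Term("Id", postId)), 1);
    return hits.TotalHits > 0 ? hits.ScoreDocs[0].Doc : -1;
}

static float CosineSimilarity(IndexReader reader, int doc1, int doc2)
{
    var a = ToFrequencyMap(reader.GetTermFreqVector(doc1, "PostBody"));
    var b = ToFrequencyMap(reader.GetTermFreqVector(doc2, "PostBody"));
    float dot = 0f, normA = 0f, normB = 0f;
    foreach (var pair in a)
    {
        normA += pair.Value * pair.Value;
        int fb;
        if (b.TryGetValue(pair.Key, out fb)) dot += pair.Value * fb;
    }
    foreach (int f in b.Values) normB += f * f;
    return (normA == 0f || normB == 0f)
        ? 0f
        : dot / (float)(Math.Sqrt(normA) * Math.Sqrt(normB));
}

static Dictionary<string, int> ToFrequencyMap(ITermFreqVector tfv)
{
    // Pair up the parallel term/frequency arrays from the term vector.
    var map = new Dictionary<string, int>();
    string[] terms = tfv.GetTerms();
    int[] freqs = tfv.GetTermFrequencies();
    for (int i = 0; i < terms.Length; i++) map[terms[i]] = freqs[i];
    return map;
}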
Take a look at S-Space. It is a free open-source Java package that does a lot of the things you want to do, e.g. compute cosine similarity between documents.
I understand how Lucene.net can work for text indexing. Will I be able to efficiently search for documents based on a given date range? Or will Lucene.net just use text matching to match the dates?
Lucene.Net will just use text matching, so you'd need to format the dates correctly before adding to the index:
public static string Serialize(DateTime dateTime)
{
    return dateTime.ToString("yyyyMMddHHmmss", CultureInfo.InvariantCulture);
}

public static DateTime Deserialize(string str)
{
    return DateTime.ParseExact(str, "yyyyMMddHHmmss", CultureInfo.InvariantCulture);
}
You can then, for example, perform a range-based query over the serialized values to filter by date (e.g. from 20060101000000 to 20071231235959 to include all dates in 2006 and 2007).
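Such a range query might look like this (assuming the field is named "created"; TermRangeQuery is the Lucene.NET 2.9+ name for what older versions called RangeQuery):
using Lucene.Net.Search;

// Matches everything from the start of 2006 through the end of 2007,
// inclusive, relying on the lexicographic order of yyyyMMddHHmmss strings.
var query = new TermRangeQuery("created",
                               "20060101000000",
                               "20071231235959",
                               true,   // include lower bound
                               true);  // include upper bound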
I ran into trouble when I converted dates to yyyyMMddHHmmssff. When I tried sorting the data, I got an exception along the lines of "too big to convert". After searching, I found that you need two fields instead: one for yyyyMMdd and the other for HHmmss, and then sort on both of them together (e.g. with a Sort[] containing the two fields). That solved the issue.