Using Lucene when indexing I boost certain documents, but their score at search is still 1 - lucene

I am trying to boost certain documents, but they don't get boosted. Please tell me what I am missing. Thanks!
In my index code I have:
if (myCondition)
{
    myDocument.SetBoost(1.1f);
}
myIndexWriter.AddDocument(document);
Then, in my search code, I retrieve the documents from the ScoreDocs object into a myDocuments collection and loop over them:
foreach (Lucene.Net.Documents.Document doc in myDocuments)
{
    float tempboost = doc.GetBoost();
}
I placed a conditional breakpoint inside the foreach loop that breaks if tempboost is not 1, and the breakpoint is never hit.
What did I miss?
Many thanks!

From the Lucene Javadoc (Java version, but the same behavior applies to Lucene.Net):
public float getBoost()
Returns, at indexing time, the boost factor as
set by setBoost(float).
Note that once a document is indexed this value is no longer available
from the index. At search time, for retrieved documents, this method
always returns 1. This however does not mean that the boost value set
at indexing time was ignored - it was just combined with other
indexing time factors and stored elsewhere, for better indexing and
search performance.
Note: if you get NaN when retrieving the score, use the following line:
searcher.SetDefaultFieldSortScoring(true,true);
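
To make the Javadoc's "combined with other indexing time factors" remark more concrete, here is a minimal sketch in plain Java (no Lucene dependency). It assumes the classic TF-IDF similarity of the Lucene 2.x/3.x line, where the index-time norm of a field is the document boost times the field boost times a length norm of 1/sqrt(numTerms). The class and method names are hypothetical and only illustrate the arithmetic. As far as I know, the real library additionally compresses this value into a single byte, so a small boost such as 1.1 may be partly or wholly lost to rounding.

```java
public class BoostNormSketch {
    // Assumed classic length normalization: fields with fewer terms get a higher norm.
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // Assumed index-time norm: document boost * field boost * length norm.
    // The document boost is folded into this product and is not stored separately,
    // which is why it cannot be read back from a retrieved document.
    static float norm(float docBoost, float fieldBoost, int numTerms) {
        return docBoost * fieldBoost * lengthNorm(numTerms);
    }

    public static void main(String[] args) {
        // Same four-term field, once without and once with the 1.1 boost from the question.
        System.out.println("unboosted norm: " + norm(1.0f, 1.0f, 4));
        System.out.println("boosted norm:   " + norm(1.1f, 1.0f, 4));
    }
}
```

The boost therefore shows up in the score of a hit, not in the value returned by GetBoost() on a retrieved document.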

Related

How can I write a ravenDB query in the studio that finds all fields that are not empty

I'm new to Lucene, day 1 new. I've read a tutorial on Lucene and spent a while trying to work out how to find a non-null value.
So I have a document called Inspect
The document has two fields I'm interested in: Inspect and Direct.
{
"Inspect": "Feather",
"Direct": {}
}
I want to find all documents where Inspect = "Feather" and Direct is not empty.
I am also interested in finding documents where Direct is also empty.
I am doing this in the ravenDB studio, so I am using lucene. I have tried a few things like
Inspect: Feather
And NOT
Direct: [[NULL_VALUE]]
However this doesn't seem to work. Any advice or some direction would be much appreciated.
Cheers
You need to run a query like this:
Inspect: Feather AND NOT Direct.Count: 0
Comparing against a null value fails because Direct is not null; it is an empty object. With .Count you are instead counting the number of properties in the object, which seems to be what you want.
@stacka Hi! I'm also rather new to RavenDB, but I have some ideas that may help you. First of all, use the '-' (minus) character instead of NOT. It's a convention. Second, a query cannot run against the database if a property it uses is not indexed, so you should create an index that includes the field you want to query against. Hope this helps.

Sitecore 7 + Lucene: Query-Time Boosting: how?

For items of a certain template, our users can indicate that the item should be shown on top of the list.
For this, we have added a field in the index "ShowOnTop".
Now when searching for items of this template (to build the list page), we would like to have these "ShowOnTop" items to effectively be returned on top of the other items.
This field however should not affect other site search (general search).
We think this could be possible by applying Query-Time Boosting to these documents. But, how can we achieve this?
To boost at query time, simply use the Boost(value) method. The snippet below uses a search predicate, since it sounds like you might be doing some advanced searching where the added flexibility of predicates could come in handy:
var queryPredicate = PredicateBuilder.True<SearchResult>();
queryPredicate = queryPredicate.And(i => i.Headline.Contains(model.Query).Boost(50));
Probably the best way would be to apply a Sort based on that field, something along the lines of:
Sort sort = new Sort(new SortField("ShowOnTop", SortField.STRING, true), true);
var hits = new SearchHits(context.Searcher.Search(query, sort));
You could also add it as a heavily boosted optional query term and make the rest of the query required (as a whole), like:
ShowOnTop:true^10000 +(the rest of the query)
With a large enough boost factor, those terms should always come up first unless there is a really drastic difference in relevance.
The easiest way is to create a rule under /sitecore/system/Settings/Rules/Indexing and Search/... that filters on your ShowOnTop field (I used a checkbox and compared the value with 1) and adjusts the boost by 99999999.
You can either add this rule as Global Rule or you can add it as Item rule and assign the rule from within the item.
Good luck!

compute term vector without indexing in lucene 4

I am migrating my code from Lucene 3.5 to Lucene 4.1 but I am having some problems with getting the term vector without indexing.
The problem is, given a text string and an Analyzer, I need to compute the term vector (technically, find the terms and their frequencies tf). Obviously, it can be achieved by writing the index (using IndexWriter) and then reading them back (using IndexReader) but I reckon it would be expensive. Furthermore, I don't need document frequency (df). Thus, I think an indexing-free solution is suitable.
In Lucene 2 and 3, a simple technique for the above purpose is to use QueryTermVector which extends TermFreqVector and has a constructor taking a string and an Analyzer. Unfortunately, QueryTermVector (along with TermFreqVector) has been removed in Lucene 4 and it seems the migration documentation did not mention anything about QueryTermVector.
Do you have a solution for this problem in Lucene 4? Thank you very much.
If you just need to know the terms/frequencies, you can just obtain the single tokens directly from the analyzer (you can get the TF by counting them, e.g. by using a Map or a Multiset).
This is how you do it in Lucene 4.0:
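TokenStream ts = analyzer.tokenStream(field, new StringReader(text));
CharTermAttribute charTermAttribute = ts.addAttribute(CharTermAttribute.class);
ts.reset(); // Lucene 4 requires reset() before the first incrementToken() call
while (ts.incrementToken()) {
    String term = charTermAttribute.toString();
    // term contains your token
}
ts.end();   // finish end-of-stream processing
ts.close(); // release the stream's resources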
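
The counting step mentioned above (collecting term frequencies in a Map) could look roughly like the following sketch. To stay self-contained it does not use Lucene: the hypothetical tokenize helper is only a stand-in for the analyzer loop and simply lowercases the text and splits on non-alphanumeric characters. In real code each term would come from charTermAttribute.toString() inside the incrementToken() loop.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TermFrequencies {
    // Stand-in for the analyzer (an assumption for this sketch):
    // lowercase the text and split on anything that is not a letter or digit.
    static String[] tokenize(String text) {
        return text.toLowerCase().split("[^\\p{L}\\p{N}]+");
    }

    // Counts how often each term occurs; insertion order is kept for readable output.
    static Map<String, Integer> count(String text) {
        Map<String, Integer> tf = new LinkedHashMap<>();
        for (String term : tokenize(text)) {
            if (term.isEmpty()) {
                continue; // split() can yield an empty leading token
            }
            tf.merge(term, 1, Integer::sum); // add 1 to the term's current count
        }
        return tf;
    }

    public static void main(String[] args) {
        System.out.println(count("to be or not to be"));
    }
}
```

Because no index is written, this gives term frequencies only; document frequencies would still require an index.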

How to calculate "OnTopicness" of documents using Lucene.NET

Imagine I have a huge database of threads and posts (about 10.000.000 records) from different forum sites including several subforums that serve as my lucene documents.
Now I am trying to calculate a feature called "OnTopicness" for each post based on the terms used in it. In fact, this feature is not much more than a simple cosine similarity between two document vectors. It will be stored in the database and therefore has to be calculated only once per post:
Forum-OnTopicness: cosine similarity between my post and a virtual
document consisting of all other posts in the specified forum (including
all threads in the forum)
Thread-OnTopicness: cosine similarity between my post and a virtual
document consisting of all other posts in the specified thread
Since the Lucene.NET API doesn't offer a method to calculate a document-document or document-index cosine similarity, I read that I could either parse one of the documents as query and search for the other document in the results or that I could manually calculate the similarity using TermFreqVectors and DocFrequencies.
I tried the second approach because it sounds faster, but ran into a problem: the IndexReader.GetTermFreqVector() method takes the internal docNumber as a parameter, which I don't know if I just pass two documents to my GetCosineSimilarity method:
public void GetCosineSimilarity(Document doc1, Document doc2)
{
    using (IndexReader reader = IndexReader.Open(FSDirectory.Open(indexDir), true))
    {
        // how do I get the docNumbers?
        TermFreqVector tfv1 = reader.GetTermFreqVector(???, "PostBody");
        TermFreqVector tfv2 = reader.GetTermFreqVector(???, "PostBody");
        ...
        // assuming that I have the TermFreqVectors, how would I continue here?
    }
}
Besides that, how would you create the mentioned "virtual document" for either a whole forum or a thread? Should I just concatenate the PostBody fields of all contained posts and parse them into a new document, or can I create an index for them and somehow compare my post to this entire index?
As you can see, as a Lucene newbie, I am still not sure about my overall index design and could definitely use some general advice. Help is highly appreciated - thanks!
Take a look at MoreLikeThisQuery in
https://svn.apache.org/repos/asf/incubator/lucene.net/trunk/src/contrib/Queries/Similar/
Its source may be useful.
Take a look at S-Space. It is a free open-source Java package that does a lot of the things you want to do, e.g. compute cosine similarity between documents.
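
For the manual route mentioned in the question, the similarity step itself is small once the term frequencies are available. The following is a minimal sketch in plain Java rather than Lucene.NET, with hypothetical names, assuming each document (or "virtual document") has already been reduced to a map from term to raw frequency. It uses raw term frequencies only; weighting by document frequency would be an extension.

```java
import java.util.HashMap;
import java.util.Map;

public class CosineSimilarity {
    // Euclidean length of a term-frequency vector.
    static double length(Map<String, Integer> v) {
        double sum = 0.0;
        for (int f : v.values()) {
            sum += (double) f * f;
        }
        return Math.sqrt(sum);
    }

    // Cosine similarity: dot product over shared terms divided by both vector lengths.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other;
            }
        }
        double lenA = length(a);
        double lenB = length(b);
        if (lenA == 0.0 || lenB == 0.0) {
            return 0.0; // an empty document is treated as similar to nothing
        }
        return dot / (lenA * lenB);
    }

    public static void main(String[] args) {
        Map<String, Integer> post = new HashMap<>();
        post.put("lucene", 2);
        post.put("index", 1);
        Map<String, Integer> thread = new HashMap<>();
        thread.put("lucene", 5);
        thread.put("search", 3);
        System.out.println(cosine(post, thread));
    }
}
```

A "virtual document" for a thread or forum would then simply be the sum of the term-frequency maps of its posts, so no concatenated text needs to be re-indexed.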

FastVectorHighlighter.Net returning null on GetBestFragment

I have a large index on which Highlighter.Net works fine, but FastVectorHighlighter returns null as the best fragment on some documents.
The searcher works fine; it is just the highlighter. The field has been indexed in the same manner for all documents, so I fail to understand why it highlights some documents but not all.
Using Lucene.Net 2.9.2, built from trunk rev942061
Are you setting FieldMatch to true?
Problem solved by the nice people at lucene-net-user. I was passing the document sequence number in the Hits object, where I should have been passing the sequence number in the Lucene index. The full discussion is in the lucene-net-user mailing list thread.