Lucene payload scoring

I want to figure out how payload scoring works in Lucene. Since I don't understand where PayloadFunction fits in, I don't think I really understand how it works. I tried googling for it, but couldn't find much apart from advice to go through the source. Well, it would be nice if someone could explain it here; else, source code it is :)

There are three parts to it. First, you generate payloads during analysis, using PayloadAttribute: you add this attribute to the terms you want to carry a payload.
class MyFilter extends TokenFilter {

    private final PayloadAttribute attr;

    public MyFilter(TokenStream input) {
        super(input);
        attr = addAttribute(PayloadAttribute.class);
    }

    @Override
    public final boolean incrementToken() throws IOException {
        if (input.incrementToken()) {
            // Attach an encoded float payload to the current token
            attr.setPayload(new Payload(PayloadHelper.encodeFloat(42)));
            return true;
        }
        attr.setPayload(null);
        return false;
    }
}
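To plug the filter into analysis, an analyzer along these lines would work (a sketch against the Lucene 3.x API; the whitespace tokenizer is just an illustrative choice):
class MyAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Tokenize on whitespace, then let MyFilter attach the payloads
        return new MyFilter(new WhitespaceTokenizer(reader));
    }
}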
Then, during searching, you use the special query class PayloadTermQuery. This class behaves like SpanTermQuery, but also keeps track of payloads in the index. With a custom Similarity implementation you can score each payload occurrence in a document.
public class MySimilarity extends DefaultSimilarity {

    @Override
    public float scorePayload(int docID, String fieldName,
                              int start, int end, byte[] payload,
                              int offset, int length) {
        if (payload != null) {
            // Score this occurrence with the float we encoded at analysis time
            return PayloadHelper.decodeFloat(payload, offset);
        } else {
            return 1.0f;
        }
    }
}
Finally, using a PayloadFunction you can aggregate the payload scores over a document to produce the final document score.
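Putting the three parts together, a search could look like the following sketch (the field and term names are made up; AveragePayloadFunction is one of the bundled PayloadFunction implementations, next to MinPayloadFunction and MaxPayloadFunction):
IndexSearcher searcher = new IndexSearcher(reader);
searcher.setSimilarity(new MySimilarity());

// Average the payload scores of all "lucene" occurrences per document
PayloadTermQuery query = new PayloadTermQuery(
        new Term("body", "lucene"),
        new AveragePayloadFunction());

TopDocs results = searcher.search(query, 10);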

Related

I need a way to score the lucene documents using term frequency only. Is there any flag that needs to be changed for this?

Say I have two documents, with D1 containing the term "lucene" twice and D2 containing the term "lucene" three times. I want Lucene to score D2 higher than D1. Note that D1 has only two words (i.e. "lucene lucene") while D2 has 100 words, of which 3 are "lucene". The default Lucene scoring model will score D1 higher than D2. I want to disable this behavior and rank D2 higher than D1. That's my project requirement.
You'll need to implement a Similarity which does what you want. You could extend Similarity directly, but you'll probably find it's simpler to just copy ClassicSimilarity (DefaultSimilarity, before version 5.4) and stub out the things you don't want to impact your score (i.e. return a constant). For instance, here's a very simple implementation that would simply return the frequency of the terms in the query:
import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.search.similarities.TFIDFSimilarity;
import org.apache.lucene.util.BytesRef;

public class SimpleSimilarity extends TFIDFSimilarity {

    // Comments describe briefly what these methods do in the *standard* implementation,
    // not what they do in this implementation (which, for most of them, is nothing at all).

    public SimpleSimilarity() {}

    // Boosts results which match more query terms
    @Override
    public float coord(int overlap, int maxOverlap) {
        return 1f;
    }

    // Constant per query, normalizes scores somewhat based on the query
    @Override
    public float queryNorm(float sumOfSquaredWeights) {
        return 1f;
    }

    // Norms should be disabled when using this similarity.
    // They are useless to it, and would just be wasted space.
    @Override
    public final long encodeNormValue(float f) {
        return 1L;
    }

    @Override
    public final float decodeNormValue(long norm) {
        return 1f;
    }

    // Weighs shorter fields more heavily
    @Override
    public float lengthNorm(FieldInvertState state) {
        return 1f;
    }

    // Higher frequency terms (more matches) scored higher
    @Override
    public float tf(float freq) {
        // return (float)Math.sqrt(freq); // the standard tf impl
        return freq;
    }

    // Scores closer matches higher when using a sloppy phrase query
    @Override
    public float sloppyFreq(int distance) {
        return 1.0f;
    }

    // ClassicSimilarity doesn't really do much with payloads. This is unmodified.
    @Override
    public float scorePayload(int doc, int start, int end, BytesRef payload) {
        return 1f;
    }

    // Weighs matches on rarer terms more heavily
    @Override
    public float idf(long docFreq, long numDocs) {
        return 1f;
    }

    @Override
    public String toString() {
        return "SimpleSimilarity";
    }
}
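If it helps, here is a minimal usage sketch (my addition, assuming Lucene 5.x; directory, analyzer and reader stand in for your own setup). The similarity has to be in effect in both places, because encodeNormValue/lengthNorm run at index time while tf/idf/coord/queryNorm run at query time:
// Index time: norms are computed and encoded here
IndexWriterConfig config = new IndexWriterConfig(analyzer);
config.setSimilarity(new SimpleSimilarity());
IndexWriter writer = new IndexWriter(directory, config);

// Query time: tf/idf/coord/queryNorm are applied here
IndexSearcher searcher = new IndexSearcher(reader);
searcher.setSimilarity(new SimpleSimilarity());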

Disable length normalization on a single field?

I have a field with the following requirements:
It must be boosted at index time, therefore 'omitNorms' must remain 'false'
However, it must NOT be subject to field length normalization (i.e. whether a term is found in 1 of 10 words or 1 of 1000 should not affect scoring; both should be weighted equally)
On at least one other field, I DO in fact want field length normalization, so I don't think applying a custom Similarity broadly on the Searcher is appropriate.
How does one boost a single field at index time, but disable the effect of field length normalization?
You can use PerFieldSimilarityWrapper to apply a different Similarity implementation to each field:
public class MySimilarity extends PerFieldSimilarityWrapper {

    Similarity standardSim = new ClassicSimilarity();
    Similarity nolengthSim = new SimilarityWithoutLengthNorm();

    @Override
    public Similarity get(String fieldName) {
        if (fieldName.equals("someField")) {
            return nolengthSim;
        } else {
            return standardSim;
        }
    }

    // These two methods must be implemented here, as their
    // calculation is not field specific
    @Override
    public float queryNorm(float valueForNormalization) {
        return standardSim.queryNorm(valueForNormalization);
    }

    @Override
    public float coord(int overlap, int maxOverlap) {
        return standardSim.coord(overlap, maxOverlap);
    }
}
Where SimilarityWithoutLengthNorm looks something like:
public class SimilarityWithoutLengthNorm extends ClassicSimilarity {

    @Override
    public float lengthNorm(FieldInvertState state) {
        return 1f;
    }
}
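One caveat worth a sketch (my addition): lengthNorm is baked into the index-time norms, so the per-field wrapper should also be set on the IndexWriterConfig, and already-indexed documents need reindexing before the change takes effect:
IndexWriterConfig config = new IndexWriterConfig(analyzer);
config.setSimilarity(new MySimilarity()); // the per-field wrapper from above
IndexWriter writer = new IndexWriter(directory, config);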

Store and retrieve string arrays in HBase

I've read this answer (How to store complex objects into hadoop Hbase?) regarding storing string arrays in HBase.
It suggests using the ArrayWritable class to serialize the array. With WritableUtils.toByteArray(Writable... writable) I get a byte[] which I can store in HBase.
When I retrieve the rows again, I get a byte[] which I somehow have to transform back into an ArrayWritable.
But I can't find a way to do this. Maybe you know an answer, or am I doing something fundamentally wrong in serializing my String[]?
You can apply the following method to get back the ArrayWritable (taken from my earlier answer, see here).
public static <T extends Writable> T asWritable(byte[] bytes, Class<T> clazz)
        throws IOException {
    T result = null;
    DataInputStream dataIn = null;
    try {
        result = clazz.newInstance();
        ByteArrayInputStream in = new ByteArrayInputStream(bytes);
        dataIn = new DataInputStream(in);
        result.readFields(dataIn);
    } catch (InstantiationException e) {
        // should not happen
        assert false;
    } catch (IllegalAccessException e) {
        // should not happen
        assert false;
    } finally {
        IOUtils.closeQuietly(dataIn);
    }
    return result;
}
This method just deserializes the byte array to the correct object type, based on the provided class type token.
E.g.:
Let's assume you have a custom ArrayWritable:
public class TextArrayWritable extends ArrayWritable {
    public TextArrayWritable() {
        super(Text.class);
    }
}
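For the write side (which you already have via WritableUtils.toByteArray), storing such an array could look like this sketch, where row, family and qualifier are your existing byte[] keys and htable is a classic HTable:
TextArrayWritable taw = new TextArrayWritable();
taw.set(new Writable[] { new Text("foo"), new Text("bar") });

// Serialize the Writable and store the resulting bytes in the cell
Put put = new Put(row);
put.add(family, qualifier, WritableUtils.toByteArray(taw));
htable.put(put);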
Now you issue a single HBase get:
...
Get get = new Get(row);
Result result = htable.get(get);
byte[] value = result.getValue(family, qualifier);

TextArrayWritable tawReturned = asWritable(value, TextArrayWritable.class);
Text[] texts = (Text[]) tawReturned.toArray();
for (Text t : texts) {
    System.out.print(t + " ");
}
...
Note:
You may have already found the readCompressedStringArray() and writeCompressedStringArray() methods in WritableUtils, which seem suitable if you have your own String array-backed Writable class. Before using them, I'd warn you that they can cause a serious performance hit due to the overhead of gzip compression/decompression.

How to customize data serialization based on content in WCF?

I'm trying to serialize a union-like data type. There is an enum field indicating the type of data stored in the union, and a variety of possible field types.
The desired result is DataContractSerializer-produced XML which contains just the enum and the relevant field.
Possible solutions, none of which have been attempted yet, are:
Use a custom serializer and mark the union properties with a custom attribute, similar to this question. The custom serializer would strip out the members not required.
Use ISerializationSurrogate and serialize a different object which just contains the relevant data.
Don't use separate fields in the union, use one object field (this could be used as part of the implementation of the ISerializationSurrogate approach).
Other... ?
For example:
[DataContract]
public class WCFTestUnion
{
    public enum EUnionType
    {
        [EnumMember]
        Bool,
        [EnumMember]
        String,
        [EnumMember]
        Dictionary,
        [EnumMember]
        Invalid
    };

    EUnionType unionType = EUnionType.Invalid;
    bool boolValue = true;
    string stringValue = "Hello";
    IDictionary<object, object> dictionaryValue = null;

    // Could use custom attribute here?
    [DataMember]
    public bool BoolValue
    {
        get { return this.boolValue; }
        set { this.boolValue = value; }
    }

    // Could use custom attribute here?
    [DataMember]
    public string StringValue
    {
        get { return this.stringValue; }
        set { this.stringValue = value; }
    }

    // Could use custom attribute here?
    [DataMember]
    public IDictionary<object, object> DictionaryValue
    {
        get { return this.dictionaryValue; }
        set { this.dictionaryValue = value; }
    }

    [DataMember]
    public EUnionType UnionType
    {
        get { return this.unionType; }
        set { this.unionType = value; }
    }
} // Ends class WCFTestUnion
Test
class TestSerializeUnion
{
    internal static void Test()
    {
        Console.WriteLine("===TestSerializeUnion.Test()===");
        WCFTestUnion u = new WCFTestUnion();
        u.UnionType = WCFTestUnion.EUnionType.Dictionary;
        u.DictionaryValue = new Dictionary<object, object>();
        u.DictionaryValue[1] = "one";
        u.DictionaryValue["two"] = 2;

        System.Runtime.Serialization.DataContractSerializer serialize =
            new System.Runtime.Serialization.DataContractSerializer(typeof(WCFTestUnion));
        System.IO.Stream stream = new System.IO.MemoryStream();
        serialize.WriteObject(stream, u);

        stream.Seek(0, System.IO.SeekOrigin.Begin);
        byte[] buffer = new byte[stream.Length];
        int length = checked((int)stream.Length);
        int read = stream.Read(buffer, 0, length);
        while (read < stream.Length)
        {
            read += stream.Read(buffer, 0, length - read);
        }

        string xml = Encoding.Default.GetString(buffer);
        System.Xml.XmlDocument doc = new System.Xml.XmlDocument();
        doc.LoadXml(xml);
        System.Xml.XmlTextWriter xmlwriter = new System.Xml.XmlTextWriter(Console.Out);
        xmlwriter.Formatting = System.Xml.Formatting.Indented;
        doc.WriteContentTo(xmlwriter);
        xmlwriter.Flush();
        Console.WriteLine();
    }
} // Ends class TestSerializeUnion
Output:
<WCFTestUnion xmlns="http://schemas.datacontract.org/2004/07/WCFTestServiceContracts" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
  <BoolValue>true</BoolValue>
  <DictionaryValue xmlns:a="http://schemas.microsoft.com/2003/10/Serialization/Arrays">
    <a:KeyValueOfanyTypeanyType>
      <a:Key i:type="b:int" xmlns:b="http://www.w3.org/2001/XMLSchema">1</a:Key>
      <a:Value i:type="b:string" xmlns:b="http://www.w3.org/2001/XMLSchema">one</a:Value>
    </a:KeyValueOfanyTypeanyType>
    <a:KeyValueOfanyTypeanyType>
      <a:Key i:type="b:string" xmlns:b="http://www.w3.org/2001/XMLSchema">two</a:Key>
      <a:Value i:type="b:int" xmlns:b="http://www.w3.org/2001/XMLSchema">2</a:Value>
    </a:KeyValueOfanyTypeanyType>
  </DictionaryValue>
  <StringValue>Hello</StringValue>
  <UnionType>Dictionary</UnionType>
</WCFTestUnion>
Desired output (only the field in use is serialized, along with the enum):
<WCFTestUnion xmlns="http://schemas.datacontract.org/2004/07/WCFTestServiceContracts" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
  <DictionaryValue xmlns:a="http://schemas.microsoft.com/2003/10/Serialization/Arrays">
    <a:KeyValueOfanyTypeanyType>
      <a:Key i:type="b:int" xmlns:b="http://www.w3.org/2001/XMLSchema">1</a:Key>
      <a:Value i:type="b:string" xmlns:b="http://www.w3.org/2001/XMLSchema">one</a:Value>
    </a:KeyValueOfanyTypeanyType>
    <a:KeyValueOfanyTypeanyType>
      <a:Key i:type="b:string" xmlns:b="http://www.w3.org/2001/XMLSchema">two</a:Key>
      <a:Value i:type="b:int" xmlns:b="http://www.w3.org/2001/XMLSchema">2</a:Value>
    </a:KeyValueOfanyTypeanyType>
  </DictionaryValue>
  <UnionType>Dictionary</UnionType>
</WCFTestUnion>
You do have several options here. What you use depends on the complexity of this scenario (where else you have to do something like this, how often and in what ways you have to serialize this data, performance, etc.). Take a look at these options, ask away if you have more questions, but mostly, I recommend you just play and experiment with multiple strategies from the list below before picking one or a hybrid solution.
Use a DataContractResolver. It provides a mechanism for dynamically mapping types to and from wire representations during serialization and deserialization, giving you the flexibility to support far more types than you can out of the box.
Use IObjectReference. You can have a class which implements this interface and returns a reference to a different object after it has been deserialized.
Use a data contract surrogate. This is different from the serialization surrogates you're referring to, but also similar. I think these might work out nicely for you.
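One more lightweight option, as an addition of mine rather than part of the list above: DataContractSerializer omits members whose value equals the type's default when EmitDefaultValue is false, so exposing each union member as a nullable "view" that returns null unless it is the active one produces roughly the desired XML. A sketch, reusing EUnionType from the question (WCFTestUnionSlim is a hypothetical name):
[DataContract]
public class WCFTestUnionSlim
{
    [DataMember]
    public WCFTestUnion.EUnionType UnionType { get; set; }

    private bool boolValue;
    private string stringValue;
    private IDictionary<object, object> dictionaryValue;

    // bool is a value type, so expose it as bool? to get a null "not present" state
    [DataMember(EmitDefaultValue = false)]
    public bool? BoolValue
    {
        get { return UnionType == WCFTestUnion.EUnionType.Bool ? boolValue : (bool?)null; }
        set { boolValue = value ?? false; }
    }

    [DataMember(EmitDefaultValue = false)]
    public string StringValue
    {
        get { return UnionType == WCFTestUnion.EUnionType.String ? stringValue : null; }
        set { stringValue = value; }
    }

    [DataMember(EmitDefaultValue = false)]
    public IDictionary<object, object> DictionaryValue
    {
        get { return UnionType == WCFTestUnion.EUnionType.Dictionary ? dictionaryValue : null; }
        set { dictionaryValue = value; }
    }
}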

How can I use Lucene's PriorityQueue when I don't know the max size at create time?

I built a custom collector for Lucene.Net, but I can't figure out how to order (or page) the results. Every time Collect gets called, I can add the result to an internal PriorityQueue, which I understand is the correct way to do this.
I extended the PriorityQueue, but it requires a size parameter on creation. You have to call Initialize in the constructor and pass in the max size.
However, in a collector, the searcher just calls Collect when it gets a new result, so I don't know how many results I have when I create the PriorityQueue. Based on this, I can't figure out how to make the PriorityQueue work.
I realize I'm probably missing something simple here...
PriorityQueue is not SortedList or SortedDictionary.
It is a sorting implementation that returns the top M results (your PriorityQueue's size) out of N elements. You can add as many items as you want with InsertWithOverflow, but it will only ever hold the top M elements.
Suppose your search resulted in 1000000 hits. Would you return all of the results to the user?
A better way would be to return the top 10 elements to the user (using PriorityQueue(10)), and
if the user requests the next 10 results, make a new search with PriorityQueue(20) and return the next 10 elements, and so on.
This is the trick most search engines, like Google, use.
Every time Commit gets called, I can add the result to an internal PriorityQueue.
I can't understand the relationship between Commit and search, so I will append a sample usage of PriorityQueue:
public class CustomQueue : Lucene.Net.Util.PriorityQueue<Document>
{
    public CustomQueue(int maxSize) : base()
    {
        Initialize(maxSize);
    }

    public override bool LessThan(Document a, Document b)
    {
        // Compare a & b on whatever your sort criterion is,
        // e.g. a stored field ("field1" is just a placeholder):
        return string.CompareOrdinal(a.Get("field1"), b.Get("field1")) < 0;
    }
}
public class MyCollector : Lucene.Net.Search.Collector
{
    CustomQueue _queue = null;
    IndexReader _currentReader;

    public MyCollector(int maxSize)
    {
        _queue = new CustomQueue(maxSize);
    }

    public override bool AcceptsDocsOutOfOrder()
    {
        return true;
    }

    public override void Collect(int doc)
    {
        _queue.InsertWithOverflow(_currentReader.Document(doc));
    }

    public override void SetNextReader(IndexReader reader, int docBase)
    {
        _currentReader = reader;
    }

    public override void SetScorer(Scorer scorer)
    {
    }
}
searcher.Search(query, new MyCollector(10)); // First page.
searcher.Search(query, new MyCollector(20)); // 2nd page.
searcher.Search(query, new MyCollector(30)); // 3rd page.
EDIT for @nokturnal
public class MyPriorityQueue<TObj, TComp> : Lucene.Net.Util.PriorityQueue<TObj>
    where TComp : IComparable<TComp>
{
    Func<TObj, TComp> _KeySelector;

    public MyPriorityQueue(int size, Func<TObj, TComp> keySelector) : base()
    {
        _KeySelector = keySelector;
        Initialize(size);
    }

    public override bool LessThan(TObj a, TObj b)
    {
        return _KeySelector(a).CompareTo(_KeySelector(b)) < 0;
    }

    public IEnumerable<TObj> Items
    {
        get
        {
            int size = Size();
            for (int i = 0; i < size; i++)
                yield return Pop();
        }
    }
}
var pq = new MyPriorityQueue<Document, string>(3, doc => doc.GetField("SomeField").StringValue);
foreach (var item in pq.Items)
{
    // items come out in queue order
}
The reason Lucene's PriorityQueue is size-limited is that it uses a fixed-size implementation that is very fast.
Think about what a reasonable maximum number of results to return at a time is, and use that number; the "waste" when there are few results is not that bad compared to the benefit it gains.
On the other hand, if you have such a huge number of results that you cannot hold them, how are you going to serve/display them? Keep in mind that these are the "top" hits, so as you iterate through the results you will be hitting less and less relevant ones anyway.