Disable length normalization on a single field? - lucene

I have a field with the following requirements:
It must be boosted at index time, therefore 'omitNorms' must remain 'false'
However, it must NOT be subject to field length normalization (ie. just because a term is found in 1:10 words vs. 1:1000 should not affect scoring -- both should be equally weighted)
On at least one other field, I DO in fact want field length normalization, and so I do not suspect applying a custom Similarity broadly on the Searcher is appropriate.
How does one boost a single field at index time, but disable the effect of field length normalization?

You can use PerFieldSimilarityWrapper to use a different similarity implementation for each field:
public class MySimilarity extends PerFieldSimilarityWrapper {
Similarity standardSim = new ClassicSimilarity();
Similarity nolengthSim = new SimilarityWithoutLengthNorm();
#Override
public Similarity get(String fieldName) {
if (fieldName.equals("someField")) {
return nolengthSim;
}
else {
return standardSim;
}
}
//These two methods must be implemented here, as their
//calculation is not field specific
#Override
public float queryNorm (float valueForNormalization) {
return standardSim.queryNorm(valueForNormalization);
}
#Override
public float coord (int overlap, int maxOverlap) {
return standardSim.coord(overlap, maxOverlap);
}
}
Where SimilarityWithoutLengthNorm looks something like:
public class SimilarityWithoutLengthNorm extends ClassicSimilarity{
#Override
public float lengthNorm(FieldInvertState state) {
return 1;
}
}

Related

Best Practice for OOP function with multiple possible control flows

In my project, I have this special function that does needs to evaluate the following:
State -- represented by an enum -- and there are about 6 different states
Left Argument
Right Argument
Left and Right arguments are represented by strings, but their values can be the following:
"_" (a wildcard)
"1" (an integer string)
"abc" (a normal string)
So as you can see, to cover all every single possibility, there's about 2 * 3 * 6 = 36 different logics to evaluate and of course, using if-else in one giant function will not be feasible at all. I have encapsulated the above 3 input into an object that I'll pass to my function.
How would one try to use OOP to solve this. Would it make sense to have 6 different subclasses of the main State class with an evaluate() method, and then in their respective methods, I have if else statements to check:
if left & right arg are wildcards, do something
if left is number, right is string, do something else
Repeat for all the valid combinations in each State subclass
This feels like the right direction, but it also feels like theres alot of duplicate logic (for example check if both args are wildcards, or both strings etc.) for all 6 subclasses. Then my thought is to abstract it abit more and make another subclass:
For each state subclass, I have stateWithTwoWildCards, statewithTwoString etc.
But I feel like this is going way overboard and over-engineering and being "too" specific (I get that this technically adheres tightly to SOLID, especially SRP and OCP concepts). Any thoughts on this?
Possibly something like template method pattern can be useful in this case. I.e. you will encapsulate all the checking logic in the base State.evaluate method and create several methods which subclasses will override. Something along this lines:
class StateBase
def evaluate():
if(bothWildcards)
evalBothWildcards()
else if(bothStrings)
evalBothStrings()
else if ...
def evalBothWildcards():
...
def evalBothStrings():
...
Where evalBothWildcards, evalBothStrings, etc. will be overloaded in inheritors.
there's about 2 * 3 * 6 = 36 different logics to evaluate
We can apply divide and conquer technique.
you have 6 states. It is possible to use Chain of Responibility pattern here to choose appropriate state handler
when desired state handler is found, then we can apply desired function. The appropriate function can be considered as strategy. So it is a place where Strategy pattern can be applied.
we can separate strategies by appropriate states and put them in simple factory to get desired strategy by key.
This is what we will do. So let's see it more thoroughly.
Chain of responsibility pattern
If you have a lot if else statements, it is possible to use Chain of Responsibility pattern. As wiki says about Chain of Responsibility:
The chain-of-responsibility pattern is a behavioral design pattern
consisting of a source of command objects and a series of processing
objects. Each processing object contains logic that defines the
types of command objects that it can handle; the rest are passed to
the next processing object in the chain. A mechanism also exists for
adding new processing objects to the end of this chain
So let's dive in code. Let me show an example via C#.
So this is our Argument class which has Left and Right operands:
public class Arguments
{
public string Left { get; private set; }
public string Right { get; private set; }
public MyState MyState { get; private set; }
public MyKey MyKey => new MyKey(MyState, Left);
public Arguments(string left, string right, MyState myState)
{
Left = left;
Right = right;
MyState = myState;
}
}
And this is your 6 states:
public enum MyState
{
One, Two, Three, Four, Five, Six
}
This is start of Decorator pattern. This is an abstraction of StateHandler which defines behaviour to to set next handler:
public abstract class StateHandler
{
public abstract MyState State { get; }
private StateHandler _nextStateHandler;
public void SetSuccessor(StateHandler nextStateHandler)
{
_nextStateHandler = nextStateHandler;
}
public virtual IDifferentLogicStrategy Execute(Arguments arguments)
{
if (_nextStateHandler != null)
return _nextStateHandler.Execute(arguments);
return null;
}
}
and its concrete implementations of StateHandler:
public class OneStateHandler : StateHandler
{
public override MyState State => MyState.One;
public override IDifferentLogicStrategy Execute(Arguments arguments)
{
if (arguments.MyState == State)
return new StrategyStateFactory().GetInstanceByMyKey(arguments.MyKey);
return base.Execute(arguments);
}
}
public class TwoStateHandler : StateHandler
{
public override MyState State => MyState.Two;
public override IDifferentLogicStrategy Execute(Arguments arguments)
{
if (arguments.MyState == State)
return new StrategyStateFactory().GetInstanceByMyKey(arguments.MyKey);
return base.Execute(arguments);
}
}
and the third state handler looks like this:
public class ThreeStateHandler : StateHandler
{
public override MyState State => MyState.Three;
public override IDifferentLogicStrategy Execute(Arguments arguments)
{
if (arguments.MyState == State)
return new StrategyStateFactory().GetInstanceByMyKey(arguments.MyKey);
return base.Execute(arguments);
}
}
Strategy pattern
Let's pay attention to the following row of code:
return new StrategyStateFactory().GetInstanceByMyKey(arguments.MyKey);
The above code is an example of using Strategy pattern. We have different ways or strategies to handle
your cases. Let me show a code of strategies of evaluation of your expressions.
This is an abstraction of strategy:
public interface IDifferentLogicStrategy
{
string Evaluate(Arguments arguments);
}
And its concrete implementations:
public class StrategyWildCardStateOne : IDifferentLogicStrategy
{
public string Evaluate(Arguments arguments)
{
// your logic here to evaluate "_" (a wildcard)
return "StrategyWildCardStateOne";
}
}
public class StrategyIntegerStringStateOne : IDifferentLogicStrategy
{
public string Evaluate(Arguments arguments)
{
// your logic here to evaluate "1" (an integer string)
return "StrategyIntegerStringStateOne";
}
}
And the third strategy:
public class StrategyNormalStringStateOne : IDifferentLogicStrategy
{
public string Evaluate(Arguments arguments)
{
// your logic here to evaluate "abc" (a normal string)
return "StrategyNormalStringStateOne";
}
}
Simple factory
There is no pattern like simple factory. However, it is a place where we can get instances of strategies by key. So by doing this we avoided to use multiple if else statements to choose correct strategy.
So, we need a place where we can store strategies by state and argument value. At first, let's create MyKey struct. It will have help us to differentiate State and arguments:
public struct MyKey
{
public readonly MyState MyState { get; }
public readonly string ArgumentValue { get; } // your three cases: "_",
// an integer string, a normal string
public MyKey(MyState myState, string argumentValue)
{
MyState = myState;
ArgumentValue = argumentValue;
}
public override bool Equals([NotNullWhen(true)] object? obj)
{
return obj is MyKey mys
&& mys.MyState == MyState
&& mys.ArgumentValue == ArgumentValue;
}
public override int GetHashCode()
{
unchecked // Overflow is fine, just wrap
{
int hash = 17;
hash = hash * 23 + MyState.GetHashCode();
hash = hash * 23 + ArgumentValue.GetHashCode();
return hash;
}
}
}
and then we can create a simple factory:
public class StrategyStateFactory
{
private Dictionary<MyKey, IDifferentLogicStrategy>
_differentLogicStrategyByStateAndValue =
new Dictionary<MyKey, IDifferentLogicStrategy>()
{
{ new MyKey(MyState.One, "_"), new StrategyWildCardStateOne() },
{ new MyKey(MyState.One, "intString"),
new StrategyIntegerStringStateOne() },
{ new MyKey(MyState.One, "normalString"),
new StrategyNormalStringStateOne() }
};
public IDifferentLogicStrategy GetInstanceByMyKey(MyKey myKey)
{
return _differentLogicStrategyByStateAndValue[myKey];
}
}
So we've written our strategies and we've stored these strategies in simple factory StrategyStateFactory.
Then we need to check the above implementation:
StateHandler chain = new OneStateHandler();
StateHandler secondStateHandler = new TwoStateHandler();
StateHandler thirdStateHandler = new ThreeStateHandler();
chain.SetSuccessor(secondStateHandler);
secondStateHandler.SetSuccessor(thirdStateHandler);
Arguments arguments = new Arguments("_", "_", MyState.One);
IDifferentLogicStrategy differentLogicStrategy = chain.Execute(arguments);
string evaluatedResult =
differentLogicStrategy.Evaluate(arguments); // output: "StrategyWildCardStateOne"
I believe I gave basic idea how it can be done.

I need a way to score the lucene documents using term frequency only. Is there any flag that needs to be changed for this?

If i have two documents with D1 having term "lucene" twice and D2 having term "lucene" thrice. I want lucene to score D2 higher than D1. Note here that, D1 has only two words (i.e. lucene lucene) while D3 has 100 words out of which 3 words are lucene. Default lucene scoring model will score D1 higher than D2. I want to disable this mode and rank D2 higher than D1. That's my project requirement.
You'll need to implement a Similarity which does what you want. You could implement directly on Similarity, but you'll probably find it's simpler to just copy ClassicSimilarity (DefaultSimilarity, before version 5.4), and stub out the things you don't want to impact your score (ie. return a constant). For instance, here's a very simple implementation that would simply return the frequency of the terms in the query:
import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.search.similarities.TFIDFSimilarity;
import org.apache.lucene.util.BytesRef;
public class SimpleSimilarity extends TFIDFSimilarity {
//Comments describe briefly what these methods do in the *standard* implementation.
//Not what they do in this implementation (which, for most of them, is nothing at all)
public SimpleSimilarity() {}
//boosts results which match more query terms
#Override
public float coord(int overlap, int maxOverlap) {
return 1f;
}
//constant per query, normalizes scores somewhat based on query
#Override
public float queryNorm(float sumOfSquaredWeights) {
return 1f;
}
//Norms should be disabled when using this similarity
//They are useless to it, and would just be wasted space.
#Override
public final long encodeNormValue(float f) {
return 1L;
}
#Override
public final float decodeNormValue(long norm) {
return 1f;
}
//Weighs shorter fields more heavily
#Override
public float lengthNorm(FieldInvertState state) {
return 1f;
}
//Higher frequency terms (more matches) scored higher
#Override
public float tf(float freq) {
//return (float)Math.sqrt(freq); The standard tf impl
return freq;
}
//Scores closer matches higher when using a sloppy phrase query
#Override
public float sloppyFreq(int distance) {
return 1.0f;
}
//ClassicSimilarity doesn't really do much with payloads. This is unmodified
#Override
public float scorePayload(int doc, int start, int end, BytesRef payload) {
return 1f;
}
//Weigh matches on rarer terms more heavily.
#Override
public float idf(long docFreq, long numDocs) {
return 1f;
}
#Override
public String toString() {
return "SimpleSimilarity";
}
}

Refactoring code using Strategy Pattern

I have a GiftCouponPayment class. It has a business strategy logic which can change frequently - GetCouponValue(). At present the logic is “The coupon value should be considered as zero when the Coupon Number is less than 2000”. In a future business strategy it may change as “The coupon value should be considered as zero when the Coupon Issued Date is less than 1/1/2000”. It can change to any such strategies based on the managing department of the company.
How can we refactor the GiftCouponPayment class using Strategy pattern so that the class need not be changed when the strategy for GetCouponValue method?
UPDATE: After analyzing the responsibilities, I feel, "GiftCoupon" will be a better name for "GiftCouponPayment" class.
C# CODE
public int GetCouponValue()
{
int effectiveValue = -1;
if (CouponNumber < 2000)
{
effectiveValue = 0;
}
else
{
effectiveValue = CouponValue;
}
return effectiveValue;
}
READING
Strategy Pattern - multiple return types/values
GiftCouponPayment class should pass GiftCoupon to different strategy classes. So your strategy interface (CouponValueStrategy) should contain a method:
int getCouponValue(GiftCoupon giftCoupon)
Since each Concrete strategy implementing CouponValueStrategy has access to GiftCoupon, each can implement an algorithm based on Coupon number or Coupon date etc.
You can inject a "coupon value policy" into the coupon object itself and call upon it to compute the coupon value. In such cases, it is acceptable to pass this into the policy so that the policy can ask the coupon for its required attributes (such as coupon number):
public interface ICouponValuePolicy
{
int ComputeCouponValue(GiftCouponPayment couponPayment);
}
public class GiftCouponPayment
{
public ICouponValuePolicy CouponValuePolicy {
get;
set;
}
public int GetCouponValue()
{
return CouponValuePolicy.ComputeCouponValue(this);
}
}
Also, it seems like your GiftCouponPayment is really responsible for two things (the payment and the gift coupon). It might make sense to extract a GiftCoupon class that contains CouponNumber, CouponValue and GetCouponValue(), and refer to this from the GiftCouponPayment.
When your business - logic changes, it's quite natural that your code will have to change as well.
You could perhaps opt to move the expiration-detection logic into a specification class:
public class CouponIsExpiredBasedOnNumber : ICouponIsExpiredSpecification
{
public bool IsExpired( Coupon c )
{
if( c.CouponNumber < 2000 )
return true;
else
return false;
}
}
public class CouponIsExpiredBasedOnDate : ICouponIsExpiredSpecification
{
public readonly DateTime expirationDate = new DateTime (2000, 1, 1);
public bool IsExpired( Coupon c )
{
if( c.Date < expirationDate )
return true;
else
return false;
}
}
public class Coupon
{
public int GetCouponValue()
{
ICouponIsExpiredSpecification expirationRule = GetExpirationRule();
if( expirationRule.IsExpired(this) )
return 0;
else
return this.Value;
}
}
The question you should ask yourself: is it necessary to make it this complex right now ? Can't you make it as simple as possible to satisfy current needs, and refactor it later, when the expiration-rule indeed changes ?
The behavior that you wish to be dynamic is the coupon calculation - which can dependent on any number of things: coupon date, coupon number, etc. I think that a provider pattern would be more appropriate, to inject a service class which calculates the coupon value.
The essence of this is moving the business logic outside of the GiftCouponPayment class, and using a class I'll call "CouponCalculator" to encapsulate the business logic. This class uses an interface.
interface ICouponCalculator
{
int Calculate (GiftCouponPayment payment);
}
public class CouponCalculator : ICouponCalculator
{
public int Calculate (GiftCouponPayment payment)
{
if (payment.CouponNumber < 2000)
{
return 0;
}
else
{
return payment.CouponValue;
}
}
}
Now that you have this interface and class, add a property to the GiftCouponPayment class, then modify your original GetCouponValue() method:
public class GiftCouponPayment
{
public int CouponNumber;
public int CouponValue;
public ICouponCalculator Calculator { get; set; }
public int GetCouponValue()
{
return Calculator.Calculate(this);
}
}
When you construct the GiftCouponPayment class, you will assign the Calculator property:
var payment = new GiftCouponPayment() { Calculator = new CouponCalculator(); }
var val = payment.GetCouponValue(); // uses CouponCalculator class to get value
If this seems like a lot of work just to move the calculation logic outside of the GiftCouponPayment class, well, it is! But if this is your requirement, it does provide several things:
1. You won't need to change the GiftCouponPayment class to adjust the calculation logic.
2. You could create additional classes that implement ICalculator, and a factory pattern to decide which class to inject into GiftCouponPayment when it is constructed. This speaks more to your original desire for a "strategy" pattern - as this would be useful if the logic becomes very complex.

How can I use Lucene's PriorityQueue when I don't know the max size at create time?

I built a custom collector for Lucene.Net, but I can't figure out how to order (or page) the results. Everytime Collect gets called, I can add the result to an internal PriorityQueue, which I understand is the correct way to do this.
I extended the PriorityQueue, but it requires a size parameter on creation. You have to call Initialize in the constructor and pass in the max size.
However, in a collector, the searcher just calls Collect when it gets a new result, so I don't know how many results I have when I create the PriorityQueue. Based on this, I can't figure out how to make the PriorityQueue work.
I realize I'm probably missing something simple here...
PriorityQueue is not SortedList or SortedDictionary.
It is a kind of sorting implementation where it returns the top M results(your PriorityQueue's size) of N elements. You can add with InsertWithOverflow as many items as you want, but it will only hold only the top M elements.
Suppose your search resulted in 1000000 hits. Would you return all of the results to user?
A better way would be to return the top 10 elements to the user(using PriorityQueue(10)) and
if the user requests for the next 10 result, you can make a new search with PriorityQueue(20) and return the next 10 elements and so on.
This is the trick most search engines like google uses.
Everytime Commit gets called, I can add the result to an internal PriorityQueue.
I can not undestand the relationship between Commit and search, Therefore I will append a sample usage of PriorityQueue:
public class CustomQueue : Lucene.Net.Util.PriorityQueue<Document>
{
public CustomQueue(int maxSize): base()
{
Initialize(maxSize);
}
public override bool LessThan(Document a, Document b)
{
//a.GetField("field1")
//b.GetField("field2");
return //compare a & b
}
}
public class MyCollector : Lucene.Net.Search.Collector
{
CustomQueue _queue = null;
IndexReader _currentReader;
public MyCollector(int maxSize)
{
_queue = new CustomQueue(maxSize);
}
public override bool AcceptsDocsOutOfOrder()
{
return true;
}
public override void Collect(int doc)
{
_queue.InsertWithOverflow(_currentReader.Document(doc));
}
public override void SetNextReader(IndexReader reader, int docBase)
{
_currentReader = reader;
}
public override void SetScorer(Scorer scorer)
{
}
}
searcher.Search(query,new MyCollector(10)) //First page.
searcher.Search(query,new MyCollector(20)) //2nd page.
searcher.Search(query,new MyCollector(30)) //3rd page.
EDIT for #nokturnal
public class MyPriorityQueue<TObj, TComp> : Lucene.Net.Util.PriorityQueue<TObj>
where TComp : IComparable<TComp>
{
Func<TObj, TComp> _KeySelector;
public MyPriorityQueue(int size, Func<TObj, TComp> keySelector) : base()
{
_KeySelector = keySelector;
Initialize(size);
}
public override bool LessThan(TObj a, TObj b)
{
return _KeySelector(a).CompareTo(_KeySelector(b)) < 0;
}
public IEnumerable<TObj> Items
{
get
{
int size = Size();
for (int i = 0; i < size; i++)
yield return Pop();
}
}
}
var pq = new MyPriorityQueue<Document, string>(3, doc => doc.GetField("SomeField").StringValue);
foreach (var item in pq.Items)
{
}
The reason Lucene's Priority Queue is size limited is because it uses a fixed size implementation that is very fast.
Think about what is the reasonable maximum number of results to get back at a time and use that number, the "waste" for when the results are few is not that bad for the benefit it gains.
On the other hand, if you have such a huge number of results that you cannot hold them, then how are you going to be serving/displaying them? Keep in mind that this is for "top" hits so as you iterate through the results you will be hitting less and less relevant ones anyway.

Lucene payload scoring

I want to figure out how payload scoring works in lucene. Since I don't understand where PayloadFunction fits in, I think I don't really understand how it works. Tried googling for it, but couldn't find much apart from advice to go through source. Well, it would be nice if someone can explain it here, else source code it is :)
There are three parts of it. First of all you should generate payloads during analysis. This could be done using PayloadAttribute. You just need to add this attribute to terms you want during analysis.
class MyFilter extends TokenFilter {
private PayloadAttribute attr;
public MyFilter() {
attr = addAttribute(PayloadAttribute.class);
}
public final boolean incrementToken() throws IOException {
if (input.incrementToken()) {
Payload p = new Payload(PayloadHelper.encodeFloat(42));
attr.setPayload(p);
} else {
attr.setPayload(null);
}
}
Then during searching you should use special query class PayloadTermQuery. This class behaves as SpanTermQuery but do track of payloads in index. Using custom Similarity implementation you could score each payload occurrence in document.
public class MySimilarity extends DefaultSimilarity {
public float scorePayload(int docID, String fieldName,
int start, int end, byte[] payload,
int offset, int length) {
if (payload != null) {
return PayloadHelper.decodeFloat(payload, offset);
} else {
return 1.0f;
}
}
}
Finally, using PayloadFunction you could aggregate payload scores over document to produce final document score.