I need a way to score Lucene documents using term frequency only. Is there a flag that needs to be changed for this?

If I have two documents, D1 containing the term "lucene" twice and D2 containing it three times, I want Lucene to score D2 higher than D1. Note that D1 has only two words in total (i.e. "lucene lucene") while D2 has 100 words, of which 3 are "lucene". The default Lucene scoring model will score D1 higher than D2 because of length normalization. I want to disable this behaviour and rank D2 higher than D1. That's my project requirement.

You'll need to implement a Similarity that does what you want. You could extend Similarity directly, but you'll probably find it's simpler to just copy ClassicSimilarity (DefaultSimilarity, before version 5.4) and stub out the things you don't want to impact your score (i.e. return a constant). For instance, here's a very simple implementation that would simply return the frequency of the terms in the query:
import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.search.similarities.TFIDFSimilarity;
import org.apache.lucene.util.BytesRef;

public class SimpleSimilarity extends TFIDFSimilarity {
    // The comments describe briefly what these methods do in the *standard* implementation,
    // not what they do in this implementation (which, for most of them, is nothing at all).

    public SimpleSimilarity() {}

    // Boosts results which match more query terms
    @Override
    public float coord(int overlap, int maxOverlap) {
        return 1f;
    }

    // Constant per query; normalizes scores somewhat based on the query
    @Override
    public float queryNorm(float sumOfSquaredWeights) {
        return 1f;
    }

    // Norms should be disabled when using this similarity.
    // They are useless to it, and would just be wasted space.
    @Override
    public final long encodeNormValue(float f) {
        return 1L;
    }

    @Override
    public final float decodeNormValue(long norm) {
        return 1f;
    }

    // Weighs shorter fields more heavily
    @Override
    public float lengthNorm(FieldInvertState state) {
        return 1f;
    }

    // Higher frequency terms (more matches) are scored higher
    @Override
    public float tf(float freq) {
        // return (float) Math.sqrt(freq); // the standard tf impl
        return freq;
    }

    // Scores closer matches higher when using a sloppy phrase query
    @Override
    public float sloppyFreq(int distance) {
        return 1.0f;
    }

    // ClassicSimilarity doesn't really do much with payloads. This is unmodified.
    @Override
    public float scorePayload(int doc, int start, int end, BytesRef payload) {
        return 1f;
    }

    // Weighs matches on rarer terms more heavily
    @Override
    public float idf(long docFreq, long numDocs) {
        return 1f;
    }

    @Override
    public String toString() {
        return "SimpleSimilarity";
    }
}
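As a rough sketch of how you might wire this in (not part of the original answer; the analyzer, directory and reader variables are assumed to exist, and the exact IndexWriterConfig constructor varies by Lucene version), the similarity needs to be set both at index time and at search time:
// Hedged usage sketch; variable names are placeholders.
IndexWriterConfig config = new IndexWriterConfig(analyzer);
config.setSimilarity(new SimpleSimilarity());
IndexWriter writer = new IndexWriter(directory, config);

IndexSearcher searcher = new IndexSearcher(reader);
searcher.setSimilarity(new SimpleSimilarity());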

Related

Disable length normalization on a single field?

I have a field with the following requirements:
It must be boosted at index time, therefore 'omitNorms' must remain 'false'.
However, it must NOT be subject to field length normalization (i.e. whether a term is found in 1 of 10 words vs. 1 of 1000 should not affect scoring -- both should be weighted equally).
On at least one other field I DO in fact want field length normalization, so I suspect that applying a custom Similarity broadly on the Searcher is not appropriate.
How does one boost a single field at index time, but disable the effect of field length normalization?
You can use PerFieldSimilarityWrapper to use a different similarity implementation for each field:
import org.apache.lucene.search.similarities.ClassicSimilarity;
import org.apache.lucene.search.similarities.PerFieldSimilarityWrapper;
import org.apache.lucene.search.similarities.Similarity;

public class MySimilarity extends PerFieldSimilarityWrapper {
    Similarity standardSim = new ClassicSimilarity();
    Similarity nolengthSim = new SimilarityWithoutLengthNorm();

    @Override
    public Similarity get(String fieldName) {
        if (fieldName.equals("someField")) {
            return nolengthSim;
        } else {
            return standardSim;
        }
    }

    // These two methods must be implemented here, as their
    // calculation is not field specific
    @Override
    public float queryNorm(float valueForNormalization) {
        return standardSim.queryNorm(valueForNormalization);
    }

    @Override
    public float coord(int overlap, int maxOverlap) {
        return standardSim.coord(overlap, maxOverlap);
    }
}
Where SimilarityWithoutLengthNorm looks something like:
public class SimilarityWithoutLengthNorm extends ClassicSimilarity {
    @Override
    public float lengthNorm(FieldInvertState state) {
        return 1f;
    }
}
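To sketch how this fits together with the index-time boost from the question (this is an assumption-laden example, not from the original answer: it presumes a pre-7.0 Lucene where per-field boosts via Field.setBoost still exist, and all field/variable names are placeholders), the wrapper is set on both the writer config and the searcher, and norms are left enabled on the boosted field:
// Hedged sketch; "someField", analyzer and directory are placeholders.
IndexWriterConfig config = new IndexWriterConfig(analyzer);
config.setSimilarity(new MySimilarity());
IndexWriter writer = new IndexWriter(directory, config);

Document doc = new Document();
Field boosted = new TextField("someField", "some text", Field.Store.YES);
boosted.setBoost(2.0f); // index-time boost; requires norms, i.e. omitNorms stays false
doc.add(boosted);
writer.addDocument(doc);
writer.close();

IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(directory));
searcher.setSimilarity(new MySimilarity()); // "someField" now ignores field length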

Layering repeated logic atop methods (OOP architecture)

All 3 methods below need to perform a cache check (pseudo code):
public class Calculus
{
private map<string, IntegralResult> _intCache;
private map<string, DerivativeResult> _derCache;
private map<string, decimal> _mulCache;
public IntegralResult integrate(string func)
{
// result already cached?
if (_intCache.contains(func)) return _intCache.get(func);
// perform integration
_intCache.put(func, result);
return result;
}
public DerivativeResult derivative(string func)
{
// result already cached?
if (_derCache.contains(func)) return _derCache.get(func);
// perform derivative
_derCache.put(func, result);
return result;
}
/* notice private */
private decimal multiply(decimal a, decimal b)
{
// result already cached?
string key = (string)a + ";" + (string)b;
if (_mulCache.contains(key)) return _mulCache.get(key);
// perform multiplication
_mulCache.put(key, result);
return result;
}
}
What's the best way to rewrite this class? I am looking for applicable design patterns that can be used to layer specific logic on top of methods.
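One common way to attack this (a sketch only, with invented names such as Memoizer, not something from the original post) is to pull the "check cache, compute, store" step out into a small generic memoizing helper, so each public method only supplies the key and the actual computation:
import java.math.BigDecimal;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Generic memoizer: the cache check and the store live in one place.
class Memoizer<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();

    V get(K key, Function<K, V> compute) {
        // computeIfAbsent does the lookup, and only runs compute on a miss
        return cache.computeIfAbsent(key, compute);
    }
}

class Calculus {
    private final Memoizer<String, BigDecimal> mulCache = new Memoizer<>();

    BigDecimal multiply(BigDecimal a, BigDecimal b) {
        String key = a + ";" + b;
        return mulCache.get(key, k -> a.multiply(b)); // the actual work only runs on a cache miss
    }

    // integrate(...) and derivative(...) would delegate to their own
    // Memoizer instances in exactly the same way.
}
The same idea can also be layered on from the outside via the Decorator pattern (a caching wrapper class that delegates to the real Calculus), or via a proxy/AOP-style caching annotation if the codebase supports one.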

Trouble employing BeanItemContainer and TreeTable in Vaadin

I have reviewed multiple examples of how to construct a TreeTable from a Container datasource, and of simply adding items by iterating over an Object[][]. Still, I'm stuck on my use case.
I have a bean like so...
public class DSRUpdateHourlyDTO implements UniquelyKeyed<AssetOwnedHourlyLocatableId>, Serializable
{
private static final long serialVersionUID = 1L;
private final AssetOwnedHourlyLocatableId id = new AssetOwnedHourlyLocatableId();
private String commitStatus;
private BigDecimal economicMax;
private BigDecimal economicMin;
public void setCommitStatus(String commitStatus) { this.commitStatus = commitStatus; }
public void setEconomicMax(BigDecimal economicMax) { this.economicMax = economicMax; }
public void setEconomicMin(BigDecimal economicMin) { this.economicMin = economicMin; }
public String getCommitStatus() { return commitStatus; }
public BigDecimal getEconomicMax() { return economicMax; }
public BigDecimal getEconomicMin() { return economicMin; }
public AssetOwnedHourlyLocatableId getId() { return id; }
@Override
public AssetOwnedHourlyLocatableId getKey() {
return getId();
}
}
The AssetOwnedHourlyLocatableId is a compound id. It looks like...
public class AssetOwnedHourlyLocatableId implements Serializable, AssetOwned, HasHour, Locatable,
UniquelyKeyed<AssetOwnedHourlyLocatableId> {
private static final long serialVersionUID = 1L;
private String location;
private String hour;
private String assetOwner;
@Override
public String getLocation() {
return location;
}
@Override
public void setLocation(final String location) {
this.location = location;
}
@Override
public String getHour() {
return hour;
}
@Override
public void setHour(final String hour) {
this.hour = hour;
}
@Override
public String getAssetOwner() {
return assetOwner;
}
@Override
public void setAssetOwner(final String assetOwner) {
this.assetOwner = assetOwner;
}
}
I want to generate a grid where the hours are pivoted into column headers and the location is the only other additional column header.
E.g.,
Location 1 2 3 4 5 6 ... 24
would be the column headers.
Underneath each column you might see...
> L1
> Commit Status Status1 .... Status24
> Eco Min EcoMin1 .... EcoMin24
> Eco Max EcoMax1 .... EcoMax24
> L2
> Commit Status Status1 .... Status24
> Eco Min EcoMin1 .... EcoMin24
> Eco Max EcoMax1 .... EcoMax24
So, if I'm provided a List<DSRUpdateHourlyDTO> I want to convert it into the presentation format described above.
What would be the best way to do this?
I have a few additional functional requirements.
I want to be able to toggle between read-only and editable views of the same table.
I want to be able to complete a round-trip to a datasource (e.g., JPAContainerSource).
I (will eventually) want to filter items by any part of the compound id.
My challenge is in the adaptation. I understand the simple use case well, where I could take the list, put it straight into a BeanItemContainer, and use addNestedContainerProperty and setVisibleColumns. Pivoting properties into columns is what's stumping me.
As it turns out this was an ill-conceived question.
For data entry purposes, one could use a BeanItemContainer whose columns include the nested container property hour from the composite id, and instead of a TreeTable use a Table with commitStatus, ecoMin and ecoMax as columns. The limitation is that you'd only ever query for / submit one assetOwner and location's worth of data.
As for display, where you don't need to restrict yourself to one assetOwner and location's worth of data, you could pivot the hour information as originally described: convert the original bean into another bean suitable for display, where each hour is its own column.
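A sketch of what that conversion might look like (class, field and method names here are invented for illustration and are not part of the original answer): group the DTOs by location, then build one display row per location and metric with one value per hour; the resulting PivotedRow beans could then be fed to a BeanItemContainer backing a plain Table.
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical display bean: one row per (location, metric), one entry per hour.
class PivotedRow {
    String location;
    String metric; // "Commit Status", "Eco Min" or "Eco Max"
    Map<String, String> valueByHour = new LinkedHashMap<>();
}

class HourlyPivoter {
    static List<PivotedRow> pivot(List<DSRUpdateHourlyDTO> dtos) {
        // location -> metric -> row
        Map<String, Map<String, PivotedRow>> byLocation = new LinkedHashMap<>();
        for (DSRUpdateHourlyDTO dto : dtos) {
            String loc = dto.getId().getLocation();
            String hour = dto.getId().getHour();
            Map<String, PivotedRow> metricRows =
                    byLocation.computeIfAbsent(loc, k -> new LinkedHashMap<>());
            put(metricRows, loc, "Commit Status", hour, dto.getCommitStatus());
            put(metricRows, loc, "Eco Min", hour, String.valueOf(dto.getEconomicMin()));
            put(metricRows, loc, "Eco Max", hour, String.valueOf(dto.getEconomicMax()));
        }
        List<PivotedRow> result = new ArrayList<>();
        byLocation.values().forEach(rows -> result.addAll(rows.values()));
        return result;
    }

    private static void put(Map<String, PivotedRow> rows, String loc, String metric,
                            String hour, String value) {
        PivotedRow row = rows.computeIfAbsent(metric, k -> new PivotedRow());
        row.location = loc;
        row.metric = metric;
        row.valueByHour.put(hour, value);
    }
}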

How can I use Lucene's PriorityQueue when I don't know the max size at create time?

I built a custom collector for Lucene.Net, but I can't figure out how to order (or page) the results. Every time Collect gets called, I can add the result to an internal PriorityQueue, which I understand is the correct way to do this.
I extended the PriorityQueue, but it requires a size parameter on creation. You have to call Initialize in the constructor and pass in the max size.
However, in a collector, the searcher just calls Collect when it gets a new result, so I don't know how many results I have when I create the PriorityQueue. Based on this, I can't figure out how to make the PriorityQueue work.
I realize I'm probably missing something simple here...
PriorityQueue is not a SortedList or SortedDictionary.
It is a kind of sorting implementation that returns the top M results (your PriorityQueue's size) out of N elements. You can add as many items as you want with InsertWithOverflow, but it will only ever hold the top M elements.
Suppose your search resulted in 1,000,000 hits. Would you return all of those results to the user?
A better way would be to return the top 10 elements to the user (using a PriorityQueue of size 10), and if the user requests the next 10 results, you can make a new search with a PriorityQueue of size 20 and return the next 10 elements, and so on.
This is the trick most search engines, like Google, use.
"Every time Commit gets called, I can add the result to an internal PriorityQueue."
I can't quite follow the relationship between Commit and the search, so I will append a sample usage of PriorityQueue instead:
public class CustomQueue : Lucene.Net.Util.PriorityQueue<Document>
{
public CustomQueue(int maxSize): base()
{
Initialize(maxSize);
}
public override bool LessThan(Document a, Document b)
{
// e.g. compare a field value of the two documents
return string.CompareOrdinal(a.Get("field1"), b.Get("field1")) < 0;
}
}
public class MyCollector : Lucene.Net.Search.Collector
{
CustomQueue _queue = null;
IndexReader _currentReader;
public MyCollector(int maxSize)
{
_queue = new CustomQueue(maxSize);
}
public override bool AcceptsDocsOutOfOrder()
{
return true;
}
public override void Collect(int doc)
{
_queue.InsertWithOverflow(_currentReader.Document(doc));
}
public override void SetNextReader(IndexReader reader, int docBase)
{
_currentReader = reader;
}
public override void SetScorer(Scorer scorer)
{
}
}
searcher.Search(query, new MyCollector(10)); // First page.
searcher.Search(query, new MyCollector(20)); // 2nd page.
searcher.Search(query, new MyCollector(30)); // 3rd page.
EDIT for @nokturnal
public class MyPriorityQueue<TObj, TComp> : Lucene.Net.Util.PriorityQueue<TObj>
where TComp : IComparable<TComp>
{
Func<TObj, TComp> _KeySelector;
public MyPriorityQueue(int size, Func<TObj, TComp> keySelector) : base()
{
_KeySelector = keySelector;
Initialize(size);
}
public override bool LessThan(TObj a, TObj b)
{
return _KeySelector(a).CompareTo(_KeySelector(b)) < 0;
}
public IEnumerable<TObj> Items
{
get
{
int size = Size();
for (int i = 0; i < size; i++)
yield return Pop();
}
}
}
var pq = new MyPriorityQueue<Document, string>(3, doc => doc.GetField("SomeField").StringValue);
foreach (var item in pq.Items)
{
}
The reason Lucene's PriorityQueue is size-limited is that it uses a fixed-size implementation that is very fast.
Think about what a reasonable maximum number of results to return at a time is and use that number; the "waste" when there are only a few results is not that bad compared with the benefit it gains.
On the other hand, if you have such a huge number of results that you cannot hold them, how are you going to serve/display them? Keep in mind that these are the "top" hits, so as you iterate through the results you will be hitting less and less relevant ones anyway.

Lucene payload scoring

I want to figure out how payload scoring works in Lucene. Since I don't understand where PayloadFunction fits in, I think I don't really understand how it works. I tried googling for it, but couldn't find much apart from advice to go through the source. It would be nice if someone could explain it here; otherwise, source code it is :)
There are three parts to it. First of all, you should generate payloads during analysis. This can be done using PayloadAttribute: you just need to add this attribute to the terms you want during analysis.
class MyFilter extends TokenFilter {
    private final PayloadAttribute attr;

    public MyFilter(TokenStream input) {
        super(input);
        attr = addAttribute(PayloadAttribute.class);
    }

    @Override
    public final boolean incrementToken() throws IOException {
        if (input.incrementToken()) {
            Payload p = new Payload(PayloadHelper.encodeFloat(42));
            attr.setPayload(p);
            return true;
        } else {
            attr.setPayload(null);
            return false;
        }
    }
}
Then, during searching, you should use the special query class PayloadTermQuery. This class behaves like SpanTermQuery but also keeps track of payloads in the index. Using a custom Similarity implementation, you can score each payload occurrence in a document.
public class MySimilarity extends DefaultSimilarity {
    @Override
    public float scorePayload(int docID, String fieldName,
                              int start, int end, byte[] payload,
                              int offset, int length) {
        if (payload != null) {
            return PayloadHelper.decodeFloat(payload, offset);
        } else {
            return 1.0f;
        }
    }
}
Finally, using a PayloadFunction you can aggregate the payload scores over a document to produce the final document score.
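To make that last step concrete, here is a hedged sketch of the query side (assuming the Lucene 3.x-era payload API from org.apache.lucene.search.payloads; the reader variable, the field name "contents" and the term "lucene" are placeholders): the PayloadFunction passed to PayloadTermQuery decides how the per-occurrence values returned by scorePayload are combined, e.g. averaged.
// Hedged sketch; reader, field and term names are placeholders.
IndexSearcher searcher = new IndexSearcher(reader);
searcher.setSimilarity(new MySimilarity()); // so scorePayload() above is actually used

// AveragePayloadFunction averages the per-occurrence payload scores;
// MinPayloadFunction and MaxPayloadFunction are the other built-in choices.
PayloadTermQuery query = new PayloadTermQuery(
        new Term("contents", "lucene"),
        new AveragePayloadFunction());

TopDocs top = searcher.search(query, 10);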