Lucene Scoring for Overlap ranking

I'm new to Lucene and trying to understand how I can use it with a simpler scoring function.
I have objects in my dataset with 5-10 terms attached to each of them. By default, Lucene ranks the objects with TF-IDF similarity.
TF-IDF does not make sense here because my data does not have varying term frequencies. How can I change the default scoring function so that documents are ranked by the number of overlapping keywords?
Doc1 = {system engineering artificial intelligence}
Doc2 = {architecture logic programming}
Doc3 = {system architecture engineering}
For the query Query = {system architecture}, I want a ranking where Doc3 is ranked higher than Doc1 and Doc2.

One thing I could try is something like this:
Query query = new BooleanQuery.Builder()
    .add(new TermQuery(new Term("text", "system")), Occur.SHOULD)
    .add(new TermQuery(new Term("text", "architecture")), Occur.SHOULD)
    .build();
In this case Doc3 will be ranked higher than Doc1 and Doc2, but the SHOULD clauses also let documents that match only one of the terms into the results, and everything is still scored with the default similarity rather than by a pure overlap count.
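A variant that should come closer to a pure overlap count (a sketch, assuming a recent Lucene version, not something I have verified) is to wrap each TermQuery in a ConstantScoreQuery, so every matching term contributes a flat 1.0 and the BooleanQuery simply sums the contributions:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.ConstantScoreQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

// Each matched term scores exactly 1.0, so a document's total score is the
// number of query terms it contains: Doc3 -> 2.0, Doc1 and Doc2 -> 1.0.
Query overlapQuery = new BooleanQuery.Builder()
    .add(new ConstantScoreQuery(new TermQuery(new Term("text", "system"))), Occur.SHOULD)
    .add(new ConstantScoreQuery(new TermQuery(new Term("text", "architecture"))), Occur.SHOULD)
    .build();

Alternatively, recent Lucene versions ship a BooleanSimilarity (set via searcher.setSimilarity(new BooleanSimilarity())) that ignores term frequency and length normalization entirely, and BooleanQuery.Builder.setMinimumNumberShouldMatch can exclude documents that match only one of the terms.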

Related

How to rank by the specific field in the same score when using DisjunctionMaxQuery in Lucene?

I am searching an index with three fields ("name", "addr" and "fullname"), using a DisjunctionMaxQuery to rank the results by the maximum score of the three fields. When hits have the same score, Lucene ranks them by doc id (the lower doc id comes first).
But I don't want to rank by doc id in that case; I would like to rank by field. If two hits have the same (max) score, I expect the hit whose max score comes from the field "name" to be placed before the hit whose max score comes from another field.
I think a custom Collector & HitQueue would be a good idea, and overriding PriorityQueue.lessThan could change the order in the priority queue. Unfortunately, a ScoreDoc carries too little information, and it's hard to find out which field produced the max score for each hit.
Does anyone know how to solve this?
The simplest approach would be to boost the fields you want to win a tie slightly more than the others:
DisjunctionMaxQuery query = new DisjunctionMaxQuery(0.0f);
Query subQueryOne = new TermQuery(new Term("one", searchterm));
subQueryOne.setBoost(1.2f);
Query subQueryTwo = new TermQuery(new Term("two", searchterm));
subQueryTwo.setBoost(1.1f);
Query subQueryThree = new TermQuery(new Term("three", searchterm));
subQueryThree.setBoost(1.0f);
query.add(subQueryOne);
query.add(subQueryTwo);
query.add(subQueryThree);
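Note that Query.setBoost was removed in later Lucene versions. If you are on Lucene 6.x or newer, the same idea can be sketched with BoostQuery and the immutable DisjunctionMaxQuery constructor (the field names below follow the question, and searchterm is assumed to be your query string):

import java.util.Arrays;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BoostQuery;
import org.apache.lucene.search.DisjunctionMaxQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

// The slightly different boosts only matter when two hits would otherwise tie:
// a max score coming from "name" then beats one coming from "addr" or "fullname".
Query query = new DisjunctionMaxQuery(
    Arrays.<Query>asList(
        new BoostQuery(new TermQuery(new Term("name", searchterm)), 1.2f),
        new BoostQuery(new TermQuery(new Term("addr", searchterm)), 1.1f),
        new BoostQuery(new TermQuery(new Term("fullname", searchterm)), 1.0f)),
    0.0f);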

How to compare a tsvector against another tsvector?

I'm trying to get "possibly related" records for a given record.
There's a tsvector column (tsv) on the table, so I'm wondering how to convert the source row's tsv into a tsquery, and then find the most closely related matches like a normal ranked search.
SELECT title,
       link,
       image,
       intro,
       created_at,
       updated_at,
       ts_rank_cd(tsv, q.match::tsquery) AS rank
FROM items,
     (SELECT tsv AS match
      FROM items
      WHERE id = 1234) AS q
WHERE id <> 1234
ORDER BY rank DESC
LIMIT 10;
Is there a nice way to achieve this?
I did some poking around and it didn't seem like there was an easy way of doing this. To do it effectively you would probably need your own C functions that provide a distance from one tsvector to another (then you could use KNN searches).
Either way it is not easy and is likely a significant amount of work, but it seems like a generally applicable problem, so the wider community might be interested in a solution.
Note this is not as trivial as it sounds. Suppose I write a book about Albert Lord's The Singer of Tales and his emphasis on poetic formulas, and suppose I call it "Albert Lord and the Ring of Words." That title produces a tsvector of Albert:1 Lord:2 Ring:5 Words:7, while "The Lord of the Rings" is Lord:2 Ring:5, which would create a very false sense of similarity. If you have any categorization involved, you would want to leverage that as well.
You could perhaps compare tsvector with similarity from the pg_trgm extension.
Something like this:
SELECT title,
       similarity(STRIP(to_tsvector('english', title))::text,
                  STRIP(to_tsvector('english', 'The Lord of the Rings'))::text) AS sim
FROM (VALUES
        ('Albert Lord and the Ring of Words'),
        ('The Ring of Words'),
        ('Albert Lord')
     ) t(title)
ORDER BY sim DESC;

Lucene inverted index access count

In Lucene, I want to know the number of accesses to the inverted index.
Lucene presumably has an inverted index like this:
cat    dog
-----  -----
d01    d02
d02    d01
d03    d03
-----  -----
If I use the query "cat dog", Lucene will access the inverted index entries consecutively.
If I ask for the top-2 results, Lucene can return d01 and d02 after only 4 accesses.
In that case, I want to know the number of accesses (in this example, 4).
Currently, I use Lucene like this.
Query q = new QueryParser(Version.LUCENE_35, "title", analyzer).parse(querystr);
int hitsPerPage = 10;
IndexSearcher searcher = new IndexSearcher(index, true);
TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
searcher.search(q, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
Thank you.
Asymptotically, if there are p matches and you're finding the top k, the time will be p log k. In your case p = 6 (two postings lists of three entries each) and k = 2, so 6 log 2 = 6. (Of course with such small numbers, this formula gives ridiculous results.)
See this for more info.
Note that "top two" doesn't mean "first two", but rather "two highest scoring". Depending on the weights in your example, it's possible that Lucene could ignore d03.

Count and sub count in lucene

My fields in lucene are product_name, type and sub_types.
I am querying on type with abc, which gives me the products whose type is abc.
These abc products have sub_types pqr and xyz.
I can get the total count for that query using TopScoreDocCollector.getTotalHits().
But I want to get the count per sub_type, i.e. for pqr and for xyz.
How can I get it?
Any reply would be of great help to me.
Thanks in advance.
One way to do this is to create a filter based on your abc query, and then use that filter to constrain results for the sub-type queries.
IndexSearcher searcher = // searcher to use
int nDocs = 100; // number of docs to retrieve
QueryParser parser = // query parser to use
Query typeQuery = parser.parse("type:abc");
Filter f = new CachingWrapperFilter(new QueryWrapperFilter(typeQuery));
Query subtypeQuery = parser.parse("sub_type:xyz");
TopDocs results = searcher.search(subtypeQuery, f, nDocs);
Another thought: if you know up-front which sub-type you're interested in, you can simply add both a type and a sub-type to the query: +type:abc +sub_type:xyz.
Finally, if you have these kinds of queries, you might consider indexing your data with Solr, whose faceting support gives you such per-value counts directly.
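For the second suggestion, here is a sketch of getting one count per sub_type (assuming the Lucene 3.x API from the question, with searcher being the IndexSearcher above; TotalHitCountCollector only counts hits, it does not keep top docs):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TotalHitCountCollector;

// One cheap counting query per sub_type of interest: +type:abc +sub_type:<value>
for (String subType : new String[] { "pqr", "xyz" }) {
    BooleanQuery q = new BooleanQuery();
    q.add(new TermQuery(new Term("type", "abc")), Occur.MUST);
    q.add(new TermQuery(new Term("sub_type", subType)), Occur.MUST);

    TotalHitCountCollector counter = new TotalHitCountCollector();
    searcher.search(q, counter);
    System.out.println(subType + ": " + counter.getTotalHits());
}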

Weighted Keyword Search

Hello: I want to do a "weighted search" on products that are tagged with keywords.
(So: not full-text search, but an n-to-m relation.) Here it is:
Table 'product':
sku - the primary key
name
Table 'keywords':
kid - keyword id
keyword_de - German language String (e.g. 'Hund','Katze','Maus')
keyword_en - English language String (e.g. 'Dog','Cat','Mouse')
Table 'product_keyword' (the cross-table)
sku \__ combined primary key
kid /
What I want is a score for every product that "contains" at least one relevant keyword. If I search for ('Dog','Elephant','Maus') I want that
Dog credits a score of 1.003,
Elephant of 1.002
Maus of 1.001
So the least important search term starts at 1.001, and every more important term adds another 0.001. That way, a lower score limit of 3.0 would equal an "AND" query (all three keywords must be found), and a lower score limit of 1.0 would equal an "OR". Anything in between means a more or less complete match. In particular, sorting by this score puts the most relevant search results first (regardless of the lower limit)...
I guess I will have to do something with
IF( keyword1 == 'dog', 1.001, 0) + IF...
maybe inside a SUM() and probably with a GROUP BY at the end of a JOIN over the cross table, eh? But I am fairly clueless how to tackle this.
What would be feasible is to look up the keyword ids beforehand; that's a cheap query. Then the keywords table can be ignored and it all comes down to joining the cross table and the product table...
I have PHP at hand to automatically prepare a fairly lengthy SQL statement, but I would like to avoid multiple SQL statements, particularly since I will limit the query result (most often to "LIMIT 0, 20") for paged output, so looping a very large number of intermediate results through a script would be no good...
Thank you very much if you can help me with this :-)
I think a lot of this is in the Lucene engine (http://lucene.apache.org/java/docs/index.html), which is available for PHP in the Zend Framework: http://framework.zend.com/manual/en/zend.search.lucene.html.
EDIT:
If you want to do the weighted thing you are talking about, I guess you could use something like this:
select p.sku,
       sum(case k.keyword_en
             when 'Dog' then 1001
             when 'Cat' then 1002
             when 'Mouse' then 1003
             else 0
           end) as totalscore
from products p
left join product_keyword pk on p.sku = pk.sku
inner join keywords k on k.kid = pk.kid
where k.keyword_en in ('Dog', 'Cat', 'Mouse')
group by p.sku
(Edit 2: forgot the group by clause.)