Is it possible to obtain, alter and replace the tfidf document representations in Lucene?

Hej guys,
I'm working on some ranking related research. I would like to index a collection of documents with Lucene, take the tfidf representations (of each document) it generates, alter them, put them back into place and observe how the ranking over a fixed set of queries changes accordingly.
Is there any non-hacky way to do this?

Your question is too vague to have a clear answer, especially regarding what you plan to do in this step:
take the tfidf representations (of each document) it generates, alter them
Lucene stores raw values for scoring:
CollectionStatistics
TermStatistics
Per term/doc pair stats: PostingsEnum
Per field/doc pair: norms
All this data is managed by Lucene and is used to compute a score for a given query term. A custom Similarity class can be used to change the formula that generates this score.
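To make the point about raw values concrete, here is a minimal sketch of how a TF-IDF style term score can be derived from the kinds of statistics listed above. The class and method names are hypothetical, and the formula is a simplified one in the spirit of Lucene's classic TF-IDF similarity; the exact formula differs between Lucene versions, so treat this as an illustration rather than Lucene's implementation.

```java
// Hypothetical sketch: a simplified TF-IDF term score computed from raw statistics.
// This is NOT Lucene code; it only mirrors the general shape of a classic TF-IDF formula.
final class TfIdfSketch {

    // Term frequency component: dampened with a square root.
    public static double tf(long freq) {
        return Math.sqrt(freq);
    }

    // Inverse document frequency from collection-level counts.
    public static double idf(long docFreq, long docCount) {
        return 1.0 + Math.log((docCount + 1.0) / (docFreq + 1.0));
    }

    // Score of one term in one document; lengthNorm stands in for the per-field norm.
    public static double termScore(long freq, long docFreq, long docCount, double lengthNorm) {
        return tf(freq) * idf(docFreq, docCount) * lengthNorm;
    }
}
```

Because the score is computed at query time from these inputs, changing the formula (for example in a custom Similarity) is far easier than rewriting the stored values themselves.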
But you have to consider that a search query is made of multiple terms, and the way the scores of individual terms are combined can be changed as well. You could use existing Query classes (e.g. BooleanQuery, DisjunctionMax) but you could also write your own.
So it really depends on what you want to do with all of this, but note that changing the raw values stored by Lucene is going to be rather hard. You would have to write a custom Lucene codec and probably most of the query stack to take advantage of your new data.
One nice option to consider is storing arbitrary byte[] payloads. This way you can store a value that was computed outside of Lucene and use it in a custom similarity or query.
Please see the following tutorials: Getting Started with Payloads and Custom Scoring with Lucene Payloads. They may give you some ideas.

Related

Dictionary API used for stressed syllables

This might end up being a very general question, but hopefully it will be useful to others as well.
I want to be able to request a word that has x syllables with the stress on the y-th syllable. I've found plenty of APIs that return both of these, such as Wordnik, but I'm not sure how to approach the search aspect. The URL to get the syllables is
GET /word.json/{word}/hyphenation
but I won't know the word ahead of time to make this request. They also have this:
GET /words.json/randomWords
which returns a list of random words.
Is there a way to achieve what I want with this API without asking for random words over and over and checking if they meet my needs? That just seems like it would be really slow and push me over my usage limits.
Do I need to build my own data structure with the words and syllables to query locally?
I doubt you'll find this kind of specialized query on any of the big dictionary APIs. You'll need to download an English dictionary and create your own data structure to do this kind of thing.
The Moby Project has a hyphenated dictionary with about 185,000 words in it. There are many other dictionary projects available. A good place to start looking is http://www.dicts.info/dictionaries.php.
Once you've downloaded the dictionary, you'll need to preprocess it to build your data structure. You should be able to construct a dictionary or hash map that is indexed by (syllables, emphasis), and whose data member is a list of words. So you'd have an entry like (4, 2) (4-syllable word with emphasis on the 2nd syllable), and a list of all such words.
To query it, then, you'd just pack the query into a structure and look up that key in the hash map. Then pick a random word from the resulting list.
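The lookup structure described above could be sketched like this. The class name, method names and sample words are illustrative assumptions, and parsing the downloaded dictionary file is left out because it depends on that file's format; the sketch assumes the syllable count and stressed syllable are already known for each word.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Random;

// Hypothetical sketch: an index from (syllable count, stressed syllable) to matching words.
final class StressIndex {
    // The key is a two-element list [syllables, stressedSyllable]; lists compare by value.
    private final Map<List<Integer>, List<String>> index = new HashMap<>();
    private final Random random;

    public StressIndex(Random random) {
        this.random = random;
    }

    // Called once per dictionary entry during preprocessing.
    public void add(String word, int syllables, int stressedSyllable) {
        index.computeIfAbsent(List.of(syllables, stressedSyllable), k -> new ArrayList<>())
             .add(word);
    }

    // All words with the given shape, or an empty list if none were indexed.
    public List<String> lookup(int syllables, int stressedSyllable) {
        return index.getOrDefault(List.of(syllables, stressedSyllable), List.of());
    }

    // A random word with the given shape, if any exists.
    public Optional<String> randomWord(int syllables, int stressedSyllable) {
        List<String> candidates = lookup(syllables, stressedSyllable);
        if (candidates.isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(candidates.get(random.nextInt(candidates.size())));
    }
}
```

After loading, a call such as randomWord(4, 2) is a single hash lookup plus a random pick, so no API calls or usage limits are involved at query time.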

Querying Apache Solr based on score values

I am working on an image retrieval task. I have a dataset of Wikipedia images with their textual descriptions in XML files (one XML file per image). I have indexed those XML files in Solr. Now, when retrieving them, I want to apply a threshold on the score values, so that documents with a low score are not included in the result (because they are not of much importance). For example, I want to retrieve all documents with a similarity score greater than or equal to 2.0. I have already tried range queries like score:[2.0 TO *] but can't get it working. Does anyone have any idea how I can do that?
What's the motivation for wanting to do this? The reason I ask, is score is a relative thing determined by Lucene based on your index statistics. It is only meaningful for comparing the results of a specific query with a specific instance of the index. In other words, it isn't useful to filter on b/c there is no way of knowing what a good cutoff value would be.
http://lucene.472066.n3.nabble.com/score-filter-td493438.html
Also, take a look here - http://wiki.apache.org/lucene-java/ScoresAsPercentages
So, in general, cutting off at a fixed value is a bad idea, because you never know which threshold is best. For a good query the relevant documents might score around 2, for a bad query around 0.5, and so on.
These two links should explain why you don't want to do it.
P.S. If you still want to do it, take a look here - https://stackoverflow.com/a/15765203/2663985
P.P.S. I recommend fixing your search queries instead, so that they return results with high precision (http://en.wikipedia.org/wiki/Precision_and_recall)
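For the case in the P.S., one commonly used approach is a function range filter over the main query's score, written as fq={!frange l=2.0}query($q). The sketch below only builds the request parameters as a URL-encoded string; the class name is hypothetical, and whether a fixed cutoff is sensible for your index remains subject to the caveats above.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: builds Solr request parameters that keep only documents
// whose score for the main query is at least minScore (frange bounds are inclusive by default).
final class ScoreCutoffParams {

    public static String build(String userQuery, double minScore) {
        // query($q) re-evaluates the main query as a function; frange filters on its value.
        String filter = "{!frange l=" + minScore + "}query($q)";
        return "q=" + encode(userQuery)
                + "&fq=" + encode(filter)
                + "&fl=" + encode("*,score");
    }

    private static String encode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }
}
```

The resulting string can be appended to the select handler URL of the core being queried.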

Lucene: how do I assign weights to the different search terms at query time?

I have a Lucene indexed corpus of more than 1 million documents.
I am searching for named entities such as "Susan Witting" by using the Lucene Java API for queries.
I would like to expand my queries by also searching for "Sue Witting" for example but would like that term to have a lower weight than the main query term.
How can I go about doing that?
I found information about the boosting option in the Lucene manual, but it seems to be set at indexing time and it needs fields.
You can boost each query clause independently. See the Query Javadoc.
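If you want to give different weights to the individual words of a query, then
Query#setBoost(float)
is not useful, because it applies to the query as a whole. A better way is the ^ boost syntax of the query parser. Note that the Term constructor does not interpret ^ (it would treat the whole string as literal term text), so the string has to go through the parser. In recent Lucene versions this looks like:
Query query = new QueryParser("some_key", analyzer).parse("stand^3 firm^2 always");
This gives different weights to words in the same query. Here the word stand is boosted by three, firm by two, and always has the default boost value. The analyzer variable is assumed to be the Analyzer that was used at index time.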
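For the name-variant case in the question, the same ^ syntax of the classic query parser also works on quoted phrases. A minimal sketch that assembles such a query string from phrases and weights follows; the class name is hypothetical, and the phrases are assumed not to contain double quotes or other characters that would need escaping.

```java
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper: turns phrases and their weights into a boosted query string
// in classic query parser syntax, e.g. "Susan Witting"^1.0 OR "Sue Witting"^0.5
final class BoostedQueryBuilder {

    public static String build(Map<String, Float> phraseBoosts) {
        StringJoiner joiner = new StringJoiner(" OR ");
        for (Map.Entry<String, Float> entry : phraseBoosts.entrySet()) {
            // Quote the phrase so the boost applies to the phrase as a unit.
            joiner.add("\"" + entry.getKey() + "\"^" + entry.getValue());
        }
        return joiner.toString();
    }
}
```

Passing a LinkedHashMap with "Susan Witting" mapped to 1.0f and "Sue Witting" mapped to 0.5f yields a string that can be handed to the query parser, so the variant matches but contributes less to the score.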

Boosting Lucene Terms When Building the Index

Is it possible to specify that certain terms are more important than others when creating the index (not when querying it)?
Consider for example a synonym filter:
doc 1: "this is a nice car"
doc 2: "this is a nice vehicle"
I want to add the term vehicle to the first doc and the term car to the second doc,
but I want the first document to score higher than the second one if the index is later queried with the word car, and the other way around if it is queried for vehicle.
Will calling setBoost on the fields before adding them to their respective documents do the trick?
Or maybe I should add the synonyms to a different field name?
Or am I looking at this from the wrong point of view?
Thanks
Setting a boost on a field affects all terms in that field, so this wouldn't work in your case.
But it should be possible using Lucene payloads (a byte array that can be set for every term). You would use them to set term-specific boosts (vehicle to 0.5 for doc 1, for example). Then you implement your own Similarity and override the scorePayload() method to decode that boost, and use PayloadTermQuery, which lets the boost stored in the payload for that term contribute to the score.
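The encode and decode steps for such a per-term boost could look like the sketch below, which packs a float into a 4-byte big-endian array using only the standard library. The class name is hypothetical; Lucene ships a PayloadHelper class with encodeFloat and decodeFloat methods for the same purpose, so in a real Similarity you would likely use that instead.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: storing a term-specific boost as a 4-byte payload.
final class BoostPayload {

    // Index time: turn the boost (e.g. 0.5f for the synonym) into payload bytes.
    public static byte[] encode(float boost) {
        return ByteBuffer.allocate(Float.BYTES).putFloat(boost).array();
    }

    // Query time: read the boost back from the payload bytes starting at offset.
    public static float decode(byte[] payload, int offset) {
        return ByteBuffer.wrap(payload, offset, Float.BYTES).getFloat();
    }

    // How a payload-aware scorer might fold the boost into a raw term score.
    public static float apply(float rawScore, byte[] payload) {
        return rawScore * decode(payload, 0);
    }
}
```

With this scheme the original word would carry a boost of 1.0 and the injected synonym 0.5, so a query for car ranks doc 1 above doc 2 and a query for vehicle does the opposite.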

Compound Queries with Redis

For learning purposes I'm trying to write a simple structured document store in Redis. In my example application I'm indexing millions of documents that look a little like the following.
<book id="1234">
<title>Quick Brown Fox</title>
<year>1999</year>
<isbn>309815</isbn>
<author>Fred</author>
</book>
I'm writing a little query language that allows me to say YEAR = 1999 AND TITLE="Quick Brown Fox" (again, just for my learning, I don't care that I'm reinventing the wheel!) and this should return the IDs of the matching documents (1234 in this case). The AND and OR expressions can be arbitrarily nested.
For each document I'm generating keys as follows
BOOK_TITLE.QUICK_BROWN_FOX = 1234
BOOK_YEAR.1999 = 1234
I'm using SADD to plop these documents in a series of sets in the form KEYNAME.VALUE = { REFS }.
When I do the querying, I parse the expression into an AST. A simple expression such as YEAR=1999 maps directly to a SMEMBERS command which gets me the set of matching documents back. However, I'm not sure how to most efficiently perform the AND and OR parts.
Given a query such as:
(TITLE=Dental Surgery OR TITLE=DIY Appendectomy)
AND
(YEAR = 1999 AND AUTHOR = FOO)
I currently make the following requests to Redis to answer these queries.
-- Stage one generates the intermediate results and returns RANDOMLY_GENERATED_KEY3
SUNIONSTORE RANDOMLY_GENERATED_KEY1 BOOK_TITLE.DENTAL_SURGERY BOOK_TITLE.DIY_APPENDECTOMY
SINTERSTORE RANDOMLY_GENERATED_KEY2 BOOK_YEAR.1999 BOOK_AUTHOR.FOO
SINTERSTORE RANDOMLY_GENERATED_KEY3 RANDOMLY_GENERATED_KEY1 RANDOMLY_GENERATED_KEY2
-- Retrieving the top level results just requires the last key generated
SMEMBERS RANDOMLY_GENERATED_KEY3
When I encounter an AND I use SINTERSTORE based on the two child keys (and similarly for OR I use SUNIONSTORE). I randomly generate a key to store the results in (and set a short TTL so I don't fill Redis up with cruft). By the end of this series of commands the return value is a key that I can use to retrieve the results with SMEMBERS. The reason I've used the store functions is that I don't want to transport all the matching document references back to the server, so I use temporary keys to store the result on the Redis instance and then only bring back the matching results at the end.
My question is simply, is this the best way to make use of Redis as a document store?
I'm using a similar approach with sorted sets to implement full text indexing. The overall approach is good, though there are a couple of fairly simple improvements you could make.
Rather than using randomly generated keys, you can use the query (or a short form thereof) as the key. That lets you reuse the sets that have already been calculated, which could significantly improve performance if you have queries across two large sets that are commonly combined in similar ways.
Handling title as a complete string will result in a very large number of single member sets. It may be better to index individual words in the title and filter the final results for an exact match if you really need it.
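The first suggestion (deriving the result key from the query instead of generating it randomly) could be sketched as follows. The class, the Node type and the Q:(...) key format are illustrative assumptions; the sketch only produces the SINTERSTORE and SUNIONSTORE command strings and does not talk to Redis.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical sketch: compiles a parsed query into Redis commands whose result keys
// are derived from the query itself, so equal sub-queries map to the same key.
final class RedisQueryCompiler {

    // A query node: either a FIELD=value leaf or an AND/OR of two children.
    public static final class Node {
        final String op; // "LEAF", "AND" or "OR"
        final String field;
        final String value;
        final Node left;
        final Node right;

        private Node(String op, String field, String value, Node left, Node right) {
            this.op = op;
            this.field = field;
            this.value = value;
            this.left = left;
            this.right = right;
        }
    }

    public static Node leaf(String field, String value) {
        return new Node("LEAF", field, value, null, null);
    }

    public static Node and(Node left, Node right) {
        return new Node("AND", null, null, left, right);
    }

    public static Node or(Node left, Node right) {
        return new Node("OR", null, null, left, right);
    }

    // Appends the commands needed for this node and returns the key holding its result.
    public static String compile(Node node, List<String> commands) {
        if (node.op.equals("LEAF")) {
            // Leaves map straight onto the index sets, e.g. BOOK_YEAR.1999
            return "BOOK_" + node.field.toUpperCase(Locale.ROOT) + "."
                    + node.value.toUpperCase(Locale.ROOT).replace(' ', '_');
        }
        String a = compile(node.left, commands);
        String b = compile(node.right, commands);
        // Sort the operands so that "A AND B" and "B AND A" share one key.
        String first = a.compareTo(b) <= 0 ? a : b;
        String second = a.compareTo(b) <= 0 ? b : a;
        boolean isAnd = node.op.equals("AND");
        String key = "Q:(" + first + (isAnd ? "&" : "|") + second + ")";
        commands.add((isAnd ? "SINTERSTORE " : "SUNIONSTORE ") + key + " " + first + " " + second);
        return key;
    }
}
```

A client could check EXISTS on the returned key before sending a store command and skip the work on a hit, while still setting a TTL with EXPIRE so that stale results disappear once the underlying sets change.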