How do you read the values of individual features from a FeatureField in Lucene?

I'm using Lucene 7.6.0 and I've indexed a series of documents with a FeatureField named "features", that stores query-independent evidence (e.g., "indegree", "pagerank"). If I'm not mistaken, the theory is that these are stored as a term vector, where "indegree" and "pagerank" are stored as terms and their values are stored as the corresponding term frequencies.
I've tested some queries where I combined BM25 and each individual feature, and some return a different ranking, when compared to BM25 alone, but some others seem to have no effect. This might just be a coincidence, which is fine, but I would like to check whether the values were correctly indexed. How do I do this?
I've tried using Luke to inspect the index, but there is no term vector associated with the "features" field. The active flags for "features" are only "Idf", but I honestly can't find a way to access the frequencies for each document. The best I was able to do, in order to check whether the field had any value, was something like:
IndexReader reader = DirectoryReader.open(
        FSDirectory.open(Paths.get("/tmp/lucene-index")));
System.out.println(reader.totalTermFreq(new Term("features", "indegree")));
This printed the number 33344, which does not match the value I indexed (a single document with indegree 10); however, I suspect the value might be encoded somehow.
I know this API is still experimental, but I was wondering if anyone knew if it would be possible to retrieve the feature values, either for each document or globally somehow (maybe an anonymous vector, without a link to the corresponding documents).

I was able to verify that the ranking by each feature matches the order of the data that I have. I also believe I was able to fairly accurately reverse the provided relevance score to obtain the original feature value (I say "fairly", because I found what seem to be slight rounding errors; let me know if it's an actual error instead). The code I used was the following:
IndexReader reader = DirectoryReader.open(
        FSDirectory.open(Paths.get("/tmp/lucene-index")));
IndexSearcher searcher = new IndexSearcher(reader);
searcher.setSimilarity(new BM25Similarity(1.2f, 0.75f));

float w = 1.8f;
float k = 1f;
float a = 0.6f;

Query query = FeatureField.newSigmoidQuery("features", "indegree", w, k, a);
TopDocs hits = searcher.search(query, 5);

for (int i = 0; i < hits.scoreDocs.length; i++) {
    Document doc = searcher.doc(hits.scoreDocs[i].doc);
    float featureValue = (float) Math.pow(
            (hits.scoreDocs[i].score / w * Math.pow(k, a))
                    / (1 - hits.scoreDocs[i].score / w),
            1 / a);
    System.out.println(featureValue + "\t" + doc.get("doc_id"));
}

reader.close();
The equation for featureValue is simply the sigmoid scaling of the static feature S (the "indegree" in this case), score = w * S^a / (S^a + k^a), solved for S given the relevance score: S = ((score/w) * k^a / (1 - score/w))^(1/a). You can find the original equation in the paper cited in Lucene's JavaDoc for FeatureField: https://dl.acm.org/citation.cfm?doid=1076034.1076106
Please let me know if you find any error with this solution, or if there is an easier way to inspect the index.
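As for the raw 33344 returned by totalTermFreq in the question: FeatureField appears to store the feature value by packing the high bits of its float representation into the term frequency (in Lucene 7.6 the encoding looks like Float.floatToIntBits(value) >>> 15, which truncates the low mantissa bits). If that assumption holds, the stored frequency can be shifted back to recover the value; a minimal sketch in plain Java (the class and method names are mine):

```java
public class FeatureFreqDecoder {
    // Assumes FeatureField (Lucene 7.6) encodes a feature value v as
    // Float.floatToIntBits(v) >>> 15; shifting back recovers v up to
    // the truncated low mantissa bits.
    public static float decodeFeatureFreq(long freq) {
        return Float.intBitsToFloat((int) (freq << 15));
    }

    public static void main(String[] args) {
        // 33344 was the frequency observed for a document indexed with indegree 10
        System.out.println(decodeFeatureFreq(33344L)); // prints 10.0
    }
}
```

If this encoding is right, it may also explain the slight rounding errors mentioned above: the low 15 bits of the float are discarded at index time.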

Related

Lucene ignores / overwrites fuzzy edit distance in QueryParser

Given the following QueryParser with a FuzzySearch term in the query string:
fun fuzzyquery() {
    val query = QueryParser("term", GermanAnalyzer()).parse("field:search~4")
    println(query)
}
The resulting Query will actually have this representation:
field:search~2
So, the ~4 gets rewritten to ~2. I traced the code down to the following implementation:
QueryParserBase
protected Query newFuzzyQuery(Term term, float minimumSimilarity, int prefixLength) {
    String text = term.text();
    int numEdits = FuzzyQuery.floatToEdits(minimumSimilarity,
            text.codePointCount(0, text.length()));
    return new FuzzyQuery(term, numEdits, prefixLength);
}
FuzzyQuery
public static int floatToEdits(float minimumSimilarity, int termLen) {
    if (minimumSimilarity >= 1.0F) {
        return (int) Math.min(minimumSimilarity, 2.0F);
    } else {
        return minimumSimilarity == 0.0F
                ? 0
                : Math.min((int) ((1.0D - (double) minimumSimilarity) * (double) termLen), 2);
    }
}
As is clearly visible, any value higher than 2 will always get reset to 2. Why is this and how can I correctly get the fuzzy edit distance I want into the query parser?
This may cross the border into "not an answer" - but it is too long for a comment (or a few comments):
Why is this?
That was a design decision, it would seem. It's mentioned in the documentation here.
"The value is between 0 and 2"
There is an old article here which gives an explanation:
"Larger differences are far more expensive to compute efficiently and are not processed by Lucene.".
I don't know how official that is, however.
More officially, from the JavaDoc for the FuzzyQuery class, it states:
"At most, this query will match terms up to 2 edits. Higher distances (especially with transpositions enabled), are generally not useful and will match a significant amount of the term dictionary."
How can I correctly get the fuzzy edit distance I want into the query parser?
You cannot, unless you customize the source code.
The best (least worst?) alternative, I think, is probably the one mentioned in the above referenced FuzzyQuery Javadoc:
"If you really want this, consider using an n-gram indexing technique (such as the SpellChecker in the suggest module) instead."
In this case, one price to be paid will be a potentially much larger index - and even then, n-grams are not really equivalent to edit distances. I don't know if this would meet your needs.
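To see the clamp concretely, the floatToEdits logic quoted in the question can be transcribed into a standalone snippet (no Lucene dependency; this is just the code shown above with no behavior changes):

```java
public class FloatToEditsDemo {
    // Standalone transcription of FuzzyQuery.floatToEdits, for illustration only
    static int floatToEdits(float minimumSimilarity, int termLen) {
        if (minimumSimilarity >= 1.0f) {
            // values >= 1 are treated as edit counts and clamped to 2
            return (int) Math.min(minimumSimilarity, 2.0f);
        } else if (minimumSimilarity == 0.0f) {
            return 0;
        }
        // legacy similarity in (0, 1) is converted to edits, also capped at 2
        return Math.min((int) ((1.0 - minimumSimilarity) * termLen), 2);
    }

    public static void main(String[] args) {
        // "search" has 6 code points; a requested distance of 4 is clamped to 2
        System.out.println(floatToEdits(4.0f, 6)); // prints 2
    }
}
```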

Getting Term Frequencies For Query

In Lucene, a query can be composed of many sub-queries. (such as TermQuery objects)
I'd like a way to iterate over the documents returned by a search, and for each document, to then iterate over the sub-queries.
For each sub-query, I'd like to get the number of times it matched. (I'm also interested in the fieldNorm, etc.)
I can get access to that data by using indexSearcher.explain, but that feels quite hacky because I would then need to parse the "description" member of each nested Explanation object to try and find the term frequency, etc. (also, calling "explain" is very slow, so I'm hoping for a faster approach)
The context here is that I'd like to experiment with re-ranking Lucene's top N search results, and to do that it's obviously helpful to extract as many "features" as possible about the matches.
Via looking at the source code for classes like TermQuery, the following appears to be a basic approach:
// For each document... (scoreDoc.doc is an integer)
Weight weight = weightCache.get(query);
if (weight == null) {
    weight = query.createWeight(indexSearcher, true);
    weightCache.put(query, weight);
}

IndexReaderContext context = indexReader.getContext();
List<LeafReaderContext> leafContexts = context.leaves();
int n = ReaderUtil.subIndex(scoreDoc.doc, leafContexts);
LeafReaderContext leafReaderContext = leafContexts.get(n);

Scorer scorer = weight.scorer(leafReaderContext);
int deBasedDoc = scoreDoc.doc - leafReaderContext.docBase;
int thisDoc = scorer.iterator().advance(deBasedDoc);

float freq = 0;
if (thisDoc == deBasedDoc) {
    freq = scorer.freq();
}
The weightCache is a Map from Query to Weight, and is useful so that you don't have to re-create the Weight object for every document you process (otherwise, the code runs about 10x slower).
Is this approximately what I should be doing? Are there any obvious ways to make this run faster? (it takes approx 2 ms for 280 documents, as compared to about 1 ms to perform the query itself)
Another challenge with this approach is that it requires code to navigate through your Query object to try and find the sub-queries. For example, if it's a BooleanQuery, you call query.clauses() and recurse on them to look for all leaf TermQuery objects, etc. Not sure if there is a more elegant / less brittle way to do that.
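The recursion described in the last paragraph can be sketched without Lucene's types; in real code the stand-ins below would be BooleanQuery (via query.clauses()) and TermQuery (all class names here are hypothetical stand-ins, not Lucene API):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class QueryTreeDemo {
    // Hypothetical stand-ins for Query / TermQuery / BooleanQuery
    interface Q {}
    static class TermQ implements Q {
        final String term;
        TermQ(String term) { this.term = term; }
    }
    static class BoolQ implements Q {
        final List<Q> clauses;
        BoolQ(Q... clauses) { this.clauses = Arrays.asList(clauses); }
    }

    // Recursively collect all leaf TermQ objects, mirroring the
    // query.clauses() recursion you would do on a real BooleanQuery
    static void collectLeaves(Q q, List<TermQ> out) {
        if (q instanceof TermQ) {
            out.add((TermQ) q);
        } else if (q instanceof BoolQ) {
            for (Q clause : ((BoolQ) q).clauses) {
                collectLeaves(clause, out);
            }
        }
    }

    public static void main(String[] args) {
        Q query = new BoolQ(new TermQ("foo"),
                new BoolQ(new TermQ("bar"), new TermQ("baz")));
        List<TermQ> leaves = new ArrayList<>();
        collectLeaves(query, leaves);
        System.out.println(leaves.size()); // prints 3
    }
}
```

The brittleness remains: each new Query subtype (phrase, span, etc.) needs its own branch in the real version.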

Lucene 4.9: Get TF-IDF for a few selected documents from an Index

I've seen this or similar questions a lot on Stack Overflow as well as other online sources. However, it looks like the corresponding part of Lucene's API has changed quite a lot, so to sum it up: I did not find any example that works on the latest Lucene version.
What I have:
Lucene Index + IndexReader + IndexSearcher
a bunch of documents (and their IDs)
What I want:
For every term that occurs in at least one of the selected documents, I want to get its TF-IDF for each document.
Or, to put it differently: for any term that occurs in any of the selected documents, I want its TF-IDF value, e.g., as an array (i.e., one TF-IDF value for each of the selected documents).
Any help is highly appreciated! :-)
Here's what I've come up with so far, but there are two problems:
It uses a temporarily created RAMDirectory which contains only the selected documents. Is there any way to work on the original index, or does that not make sense?
It does not compute document-based TF-IDF but somehow only index-based, i.e., across all documents. This means for each term I only get one TF-IDF value, not one for each document and term.
public void getTfidf(IndexReader reader, Writer out, String field) throws IOException {
    Bits liveDocs = MultiFields.getLiveDocs(reader);
    TermsEnum termEnum = MultiFields.getTerms(reader, field).iterator(null);
    BytesRef term = null;
    TFIDFSimilarity tfidfSim = new DefaultSimilarity();
    int docCount = reader.numDocs();

    while ((term = termEnum.next()) != null) {
        String termText = term.utf8ToString();
        Term termInstance = new Term(field, term);
        // term and doc frequency across all documents
        long indexTf = reader.totalTermFreq(termInstance);
        long indexDf = reader.docFreq(termInstance);
        double tfidf = tfidfSim.tf(indexTf) * tfidfSim.idf(docCount, indexDf);
        // store it, but that's not the problem
    }
}
totalTermFreq does what it sounds like: it provides the frequency across the entire index. The TF in the calculation should be the term frequency within a single document, not across the entire index. That's why everything you get here is constant: all of your variables are constant across the entire index; none are dependent on the document. To get the term frequency for a particular document, you should use DocsEnum.freq(). Perhaps something like:
while ((term = termEnum.next()) != null) {
    Term termInstance = new Term(field, term);
    long indexDf = reader.docFreq(termInstance);
    DocsEnum docs = termEnum.docs(MultiFields.getLiveDocs(reader), null);
    while (docs.nextDoc() != DocsEnum.NO_MORE_DOCS) {
        double tfidf = tfidfSim.tf(docs.freq()) * tfidfSim.idf(docCount, indexDf);
        // store tfidf for the document docs.docID()
    }
}
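For reference, with DefaultSimilarity the per-document product above reduces to sqrt(freq) * (1 + ln(numDocs / (docFreq + 1))). A plain-Java check of that arithmetic (a sketch, assuming the classic Lucene 4.x DefaultSimilarity formulas; the class is mine, not Lucene API):

```java
public class TfIdfDemo {
    // DefaultSimilarity (Lucene 4.x): tf = sqrt(freq),
    // idf = 1 + ln(numDocs / (docFreq + 1))
    static double tf(double freq) {
        return Math.sqrt(freq);
    }

    static double idf(long numDocs, long docFreq) {
        return 1 + Math.log((double) numDocs / (docFreq + 1));
    }

    public static void main(String[] args) {
        // e.g. a term occurring 4 times in a document, present in 9 of 99 docs
        System.out.println(tf(4) * idf(99, 9)); // sqrt(4) * (1 + ln(9.9))
    }
}
```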

Lucene SpellChecker Prefer Permutations or special scoring

I'm using Lucene.NET 3.0.3
How can I modify the scoring of the SpellChecker (or queries in general) using a given function?
Specifically, I want the SpellChecker to score any results that are permutations of the searched word higher than the rest of the suggestions, but I don't know where this should be done.
I would also accept an answer explaining how to do this with a normal query. I have the function, but I don't know if it would be better to make it a query or a filter or something else.
I think the best way to go about this would be to use a customized Comparator in the SpellChecker object.
Check out the source code of the default comparator here:
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.lucene/lucene-spellchecker/3.6.0/org/apache/lucene/search/spell/SuggestWordScoreComparator.java?av=f
Pretty simple stuff, should be easy to extend if you already have the algorithm you want to use to compare the two Strings.
Then you can set it up to use your comparator with SpellChecker.SetComparator.
I think I mentioned the possibility of using a Filter for this in a previous question to you, but I don't think that's really the right way to go, looking at it a bit more.
EDIT---
Indeed, no Comparator is available in 3.0.3, so I believe you'll need to access the scoring through a StringDistance object. The Comparator would be nicer, since the scoring has already been applied and is passed into it to do what you please with. Extending a StringDistance may be a bit less concrete, since you will have to apply your rules as part of the score itself.
You'll probably want to extend LevensteinDistance (source code), which is the default StringDistance implementation, but of course feel free to try JaroWinklerDistance as well; I'm not really that familiar with that algorithm.
Primarily, you'll want to override getDistance and apply your scoring rules there, after getting a baseline distance from the standard (parent) implementation's getDistance call.
I would probably implement something like the following (assuming you have a helper method boolean isPermutation(String, String)):
class CustomDistance extends LevensteinDistance {
    @Override
    public float getDistance(String target, String other) {
        float distance = super.getDistance(target, other);
        if (isPermutation(target, other)) {
            distance = distance + (1 - distance) / 2;
        }
        return distance;
    }
}
This calculates a score half again closer to 1 for a result that is a permutation (that is, if the default algorithm gave distance = .6, this would return distance = .8, etc.). Distances returned must be between 0 and 1. My example is just one idea of a possible scoring for it, but you will likely need to tune your algorithm somewhat. I'd be cautious about simply returning 1.0 for all permutations, since that would be certain to prefer 'isews' over 'weis' when searching for 'weiss', and it would also lose the ability to sort the closeness of different permutations ('isews' and 'wiess' would be equal matches to 'weiss' in that case).
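The "half again closer to 1" adjustment is easy to sanity-check in isolation (plain Java; the class and method names are mine):

```java
public class PermutationBoostDemo {
    // Moves a base distance halfway toward 1.0, as suggested above,
    // so better base matches still rank higher among permutations
    static float boost(float distance) {
        return distance + (1 - distance) / 2;
    }

    public static void main(String[] args) {
        // a 0.6 base distance becomes 0.8; a perfect 1.0 stays 1.0
        System.out.println(boost(0.6f));
        System.out.println(boost(1.0f));
    }
}
```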
Once you have your Custom StringDistance it can be passed to SpellChecker either through the Constructor, or with SpellChecker.setStringDistance
From femtoRgon's advice, here's what I ended up doing:
public class PermutationDistance : SpellChecker.Net.Search.Spell.StringDistance
{
    public float GetDistance(string target, string other)
    {
        LevenshteinDistance l = new LevenshteinDistance();
        float distance = l.GetDistance(target, other);
        distance = distance + ((1 - distance) * PermutationScore(target, other));
        return distance;
    }

    public bool IsPermutation(string a, string b)
    {
        char[] ac = a.ToLower().ToCharArray();
        char[] bc = b.ToLower().ToCharArray();
        Array.Sort(ac);
        Array.Sort(bc);
        a = new string(ac);
        b = new string(bc);
        return a == b;
    }

    public float PermutationScore(string a, string b)
    {
        char[] ac = a.ToLower().ToCharArray();
        char[] bc = b.ToLower().ToCharArray();
        Array.Sort(ac);
        Array.Sort(bc);
        a = new string(ac);
        b = new string(bc);
        LevenshteinDistance l = new LevenshteinDistance();
        return l.GetDistance(a, b);
    }
}
Then:
_spellChecker.setStringDistance(new PermutationDistance());
List<string> suggestions = _spellChecker.SuggestSimilar(word, 10).ToList();

Image comparison against a database of images or keys

I've just spent most of today trying to find some sort of function to generate keys for known images, for later comparison to determine what the image is. I have attempted to use SIFT and SURF descriptors, both of which are too slow (and patented for commercial use). My latest attempt was creating a dct hash using:
int mm_dct_imagehash(const char *file, float sigma, uint64_t *hash) {
    if (!file) return -1;
    if (!hash) return -2;
    *hash = 0;

    IplImage *img = cvLoadImage(file, CV_LOAD_IMAGE_GRAYSCALE);
    if (!img) return -3;
    cvSmooth(img, img, CV_GAUSSIAN, 7, 7, sigma, sigma);

    IplImage *img_resized = cvCreateImage(cvSize(32, 32), img->depth, img->nChannels);
    if (!img_resized) return -4;
    cvResize(img, img_resized, CV_INTER_CUBIC);

    IplImage *img_prime = cvCreateImage(cvSize(32, 32), IPL_DEPTH_32F, img->nChannels);
    if (!img_prime) return -5;
    cvConvertScale(img_resized, img_prime, 1, 0);

    IplImage *dct_img = cvCreateImage(cvSize(32, 32), IPL_DEPTH_32F, img->nChannels);
    if (!dct_img) return -6;
    cvDCT(img_prime, dct_img, CV_DXT_FORWARD);

    /* midpoint of the 8x8 low-frequency block (skipping the DC term at 0,0) */
    cvSetImageROI(dct_img, cvRect(1, 1, 8, 8));
    double minval, maxval;
    cvMinMaxLoc(dct_img, &minval, &maxval, NULL, NULL, NULL);
    double medval = (maxval + minval) / 2;

    int i, j;
    for (i = 1; i <= 8; i++) {
        const float *row = (const float *)(dct_img->imageData + i * dct_img->widthStep);
        for (j = 1; j <= 8; j++) {
            /* shift first, then set the low bit; shifting after setting
               (as originally written) drops the final bit and leaves bit 0 always zero */
            (*hash) <<= 1;
            if (row[j] > medval) {
                (*hash) |= 1;
            }
        }
    }

    cvReleaseImage(&img);
    cvReleaseImage(&img_resized);
    cvReleaseImage(&img_prime);
    cvReleaseImage(&dct_img);
    return 0;
}
This did generate something of the type I was looking for, but when I tried comparing it to a database of known hashes, I had as many false positives as I had positives. And so, I'm back at it and thought I might ask the experts.
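One note on the database comparison step: 64-bit DCT hashes like this are usually matched by Hamming distance (the number of differing bits), not exact equality, since similar images produce hashes that differ in a few bits. A minimal sketch in plain Java (the class name and the threshold mentioned in the comment are illustrative, not from any library):

```java
public class HashCompare {
    // Number of differing bits between two 64-bit perceptual hashes
    static int hammingDistance(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    public static void main(String[] args) {
        long h1 = 0xF0F0F0F0F0F0F0F0L;
        long h2 = 0xF0F0F0F0F0F0F0F1L;
        // distances below a small threshold (e.g. <= 10 of 64 bits)
        // are commonly taken to indicate similar images
        System.out.println(hammingDistance(h1, h2)); // prints 1
    }
}
```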
Would any of you know/have a function that could give me some sort of identifier/checksum for provided images, one that remains similar across similar images, so it could be used to quickly identify images via comparison against a database? In short: which category of stored checksums does a given image best match?
I'm not looking for theories, concepts, papers or ideas, but actually working solutions. I'm not spending another day digging at a dead end, and appreciate anyone who takes the time to put together some code.
With a bit more research, I know that the AutoIt devs designed PixelChecksum to use the Adler-32 algorithm. I guess the next step is to find a C implementation and get it to process pixel data. Any suggestions are welcome!
A Google search for "microsoft image hashing" turns up, near the top, the two best papers on the subject I am aware of. Both offer practical solutions.
The short answer is that there's no out-of-the-box working solution for your problem. Additionally, the Adler-32 algorithm will not solve it.
Unfortunately, comparing images by visual similarity using image signatures (or a related concept) is a very active and open research topic. For example, you said that you had many false positives in your tests. However, what counts as a correct or incorrect result is subjective and will depend on your application.
In my opinion, the only way to solve your problem is to find an adequate image descriptor for your application and then use it to compare the images. Note that comparing descriptors extracted from images is not a trivial task.