In Lucene, a query can be composed of many sub-queries (such as TermQuery objects).
I'd like a way to iterate over the documents returned by a search, and for each document, to then iterate over the sub-queries.
For each sub-query, I'd like to get the number of times it matched. (I'm also interested in the fieldNorm, etc.)
I can get access to that data by using indexSearcher.explain, but that feels quite hacky because I would then need to parse the "description" member of each nested Explanation object to try and find the term frequency, etc. (also, calling "explain" is very slow, so I'm hoping for a faster approach)
The context here is that I'd like to experiment with re-ranking Lucene's top N search results, and to do that it's obviously helpful to extract as many "features" as possible about the matches.
From looking at the source code of classes like TermQuery, the following appears to be a basic approach:
// For each document... (scoreDoc.doc is an index-wide document id)
Weight weight = weightCache.get(query);
if (weight == null)
{
    weight = query.createWeight(indexSearcher, true);
    weightCache.put(query, weight);
}

// Locate the leaf (segment) that contains this document.
IndexReaderContext context = indexReader.getContext();
List<LeafReaderContext> leafContexts = context.leaves();
int n = ReaderUtil.subIndex(scoreDoc.doc, leafContexts);
LeafReaderContext leafReaderContext = leafContexts.get(n);

// Create a scorer for that leaf and advance it to the segment-relative doc id.
Scorer scorer = weight.scorer(leafReaderContext);
int deBasedDoc = scoreDoc.doc - leafReaderContext.docBase;
int thisDoc = scorer.iterator().advance(deBasedDoc);

float freq = 0;
if (thisDoc == deBasedDoc)
{
    freq = scorer.freq();
}
The weightCache is a Map<Query, Weight> and is useful so that you don't have to re-create the Weight object for every document you process (otherwise, the code runs about 10x slower).
Is this approximately what I should be doing? Are there any obvious ways to make this run faster? (it takes approx 2 ms for 280 documents, as compared to about 1 ms to perform the query itself)
Another challenge with this approach is that it requires code that navigates your Query object to find the sub-queries. For example, if it's a BooleanQuery, you call query.clauses() and recurse on the clauses to find all leaf TermQuery objects, etc. I'm not sure if there is a more elegant / less brittle way to do that.
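For what it's worth, a minimal sketch of that recursion, assuming only BooleanQuery and BoostQuery wrappers need to be unwrapped (any other compound query types you use would need their own branches):

// Recursively collect the leaf TermQuery objects of a query tree.
// Classes are from org.apache.lucene.search; extend with branches for
// DisjunctionMaxQuery, ConstantScoreQuery, etc. as needed.
static void collectTermQueries(Query query, List<TermQuery> out) {
    if (query instanceof TermQuery) {
        out.add((TermQuery) query);
    } else if (query instanceof BooleanQuery) {
        for (BooleanClause clause : ((BooleanQuery) query).clauses()) {
            collectTermQueries(clause.getQuery(), out);
        }
    } else if (query instanceof BoostQuery) {
        collectTermQueries(((BoostQuery) query).getQuery(), out);
    }
}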
Given the following QueryParser call with a fuzzy search term in the query string:
fun fuzzyquery() {
    val query = QueryParser("term", GermanAnalyzer()).parse("field:search~4")
    println(query)
}
The resulting Query will actually have this representation:
field:search~2
So, the ~4 gets rewritten to ~2. I traced the code down to the following implementation:
QueryParserBase
protected Query newFuzzyQuery(Term term, float minimumSimilarity, int prefixLength) {
    String text = term.text();
    int numEdits = FuzzyQuery.floatToEdits(minimumSimilarity, text.codePointCount(0, text.length()));
    return new FuzzyQuery(term, numEdits, prefixLength);
}
FuzzyQuery
public static int floatToEdits(float minimumSimilarity, int termLen) {
    if (minimumSimilarity >= 1.0F) {
        return (int) Math.min(minimumSimilarity, 2.0F);
    } else {
        return minimumSimilarity == 0.0F ? 0 : Math.min((int) ((1.0D - (double) minimumSimilarity) * (double) termLen), 2);
    }
}
As is clearly visible, any value higher than 2 will always get reset to 2. Why is this and how can I correctly get the fuzzy edit distance I want into the query parser?
This may cross the border into "not an answer" - but it is too long for a comment (or a few comments):
Why is this?
That was a design decision, it would seem. It's mentioned in the documentation here.
"The value is between 0 and 2"
There is an old article here which gives an explanation:
"Larger differences are far more expensive to compute efficiently and are not processed by Lucene.".
I don't know how official that is, however.
More officially, the JavaDoc for the FuzzyQuery class states:
"At most, this query will match terms up to 2 edits. Higher distances (especially with transpositions enabled), are generally not useful and will match a significant amount of the term dictionary."
How can I correctly get the fuzzy edit distance I want into the query parser?
You cannot, unless you customize the source code.
The best (least worst?) alternative, I think, is probably the one mentioned in the above referenced FuzzyQuery Javadoc:
"If you really want this, consider using an n-gram indexing technique (such as the SpellChecker in the suggest module) instead."
In this case, one price to be paid will be a potentially much larger index - and even then, n-grams are not really equivalent to edit distances. I don't know if this would meet your needs.
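For completeness, here is a rough sketch of what that SpellChecker route can look like. The paths, the analyzer and the source field are placeholders, and I haven't verified that this fits the original use case:

// Build an n-gram based spell-check index from an existing field, then ask
// for terms similar to "search". All names here are placeholders.
Directory spellDir = FSDirectory.open(Paths.get("/tmp/spell-index"));
SpellChecker spellChecker = new SpellChecker(spellDir);
try (IndexReader reader = DirectoryReader.open(
        FSDirectory.open(Paths.get("/tmp/lucene-index")))) {
    spellChecker.indexDictionary(
            new LuceneDictionary(reader, "field"),
            new IndexWriterConfig(new GermanAnalyzer()),
            true);
    // Up to 10 terms from the spell-check index that are "close" to the input;
    // these could then be OR-ed together as TermQuery clauses.
    String[] suggestions = spellChecker.suggestSimilar("search", 10);
}
spellChecker.close();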
I'm using Lucene 7.6.0 and I've indexed a series of documents with a FeatureField named "features" that stores query-independent evidence (e.g., "indegree", "pagerank"). If I'm not mistaken, the theory is that these are stored as a term vector, where "indegree" and "pagerank" are stored as terms and their values are stored as the corresponding term frequencies.
I've tested some queries where I combined BM25 and each individual feature, and some return a different ranking, when compared to BM25 alone, but some others seem to have no effect. This might just be a coincidence, which is fine, but I would like to check whether the values were correctly indexed. How do I do this?
I've tried using Luke to inspect the index, but there is no term vector associated with the "features" field. The active flags for "features" are only "Idf", but I honestly can't find a way to access the frequencies for each document. The best I was able to do, in order to check whether the field had any value, was something like:
IndexReader reader = DirectoryReader.open(
        FSDirectory.open(Paths.get("/tmp/lucene-index")));
System.out.println(reader.totalTermFreq(new Term("features", "indegree")));
This printed the number 33344, which does not match the value I indexed (a single document with an indegree of 10); however, I suspect this might be encoded somehow.
I know this API is still experimental, but I was wondering if anyone knew if it would be possible to retrieve the feature values, either for each document or globally somehow (maybe an anonymous vector, without a link to the corresponding documents).
I was able to verify that the ranking by each feature matches the order of the data that I have. I also believe I was able to fairly reverse the provided relevance score to obtain the original feature value (I say "fairly", because I found what seem to be slight rounding errors; let me know if it's an error instead). The code I used was the following:
IndexReader reader = DirectoryReader.open(
        FSDirectory.open(Paths.get("/tmp/lucene-index")));
IndexSearcher searcher = new IndexSearcher(reader);
searcher.setSimilarity(new BM25Similarity(1.2f, 0.75f));

float w = 1.8f;
float k = 1f;
float a = 0.6f;

Query query = FeatureField.newSigmoidQuery("features", "indegree", w, k, a);
TopDocs hits = searcher.search(query, 5);

for (int i = 0; i < hits.scoreDocs.length; i++) {
    Document doc = searcher.doc(hits.scoreDocs[i].doc);
    float featureValue = (float) Math.pow(
            (hits.scoreDocs[i].score / w * Math.pow(k, a))
                    / (1 - hits.scoreDocs[i].score / w),
            1 / a
    );
    System.out.println(featureValue + "\t" + doc.get("doc_id"));
}

reader.close();
The equation for featureValue is simply the sigmoid scaling of the static feature S (the "indegree" in this case) solved for S, based on the relevance score. You can find the original equation in the paper cited in Lucene's JavaDoc for FeatureField: https://dl.acm.org/citation.cfm?doid=1076034.1076106
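As a side note on the totalTermFreq value from my question (33344 for a document with an indegree of 10): that number is consistent with FeatureField storing the feature value's float bits shifted right by 15 bits as the term frequency. I haven't confirmed this against the Lucene source, so treat the following as an assumption-based check rather than documented behaviour:

// Assumption: FeatureField encodes a feature value v as the term frequency
// Float.floatToIntBits(v) >>> 15; if so, shifting back recovers a truncated
// version of the original value.
long freq = 33344L;                                  // value reported by totalTermFreq
float decoded = Float.intBitsToFloat((int) (freq << 15));
System.out.println(decoded);                         // prints 10.0 for this input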
Please let me know if you find any error with this solution, or if there is an easier way to inspect the index.
I wrote the following code:
val src = (0 until 1000000).toList()
val dest = ArrayList<Double>(src.size / 2 + 1)
for (i in src)
{
    if (i % 2 == 0) dest.add(Math.sqrt(i.toDouble()))
}
IntelliJ (in my case Android Studio) is asking me if I want to replace the for loop with operations from stdlib. This results in the following code:
val src = (0 until 1000000).toList()
val dest = ArrayList<Double>(src.size / 2 + 1)
src.filter { it % 2 == 0 }
    .mapTo(dest) { Math.sqrt(it.toDouble()) }
Now I must say, I like the changed code. I find it easier to write than for loops when I come across similar situations. However, upon reading what the filter function does, I realized that this is a lot slower than the for loop. The filter function creates a new list containing only the elements from src that match the predicate. So there is one more list created and one more loop in the stdlib version of the code. Of course, for small lists it might not matter, but in general this does not sound like a good alternative. Especially if you chain more methods like this, you can get a lot of additional loops that could be avoided by writing a for loop.
My question is what is considered good practice in Kotlin. Should I stick to for loops or am I missing something and it does not work as I think it works.
If you are concerned about performance, what you need is Sequence. For example, your above code will be
val src = (0 until 1000000).toList()
val dest = ArrayList<Double>(src.size / 2 + 1)
src.asSequence()
    .filter { it % 2 == 0 }
    .mapTo(dest) { Math.sqrt(it.toDouble()) }
In the above code, filter returns another Sequence, which represents an intermediate step. Nothing is really created yet, no object or array creation (except a new Sequence wrapper). Only when mapTo, a terminal operator, is called is the resulting collection created.
If you have learned Java 8 Stream, you may find the above explanation somewhat familiar. Actually, Sequence is roughly the Kotlin equivalent of the Java 8 Stream. They share a similar purpose and performance characteristics. The only difference is that Sequence isn't designed to work with ForkJoinPool, and is thus a lot easier to implement.
When there are multiple steps involved or the collection may be large, it's suggested to use a Sequence instead of a plain .filter {...}.mapTo{...}. I also suggest you use the Sequence form instead of your imperative form because it's easier to understand. The imperative form may become complex, and thus hard to understand, when there are 5 or more steps involved in the data processing. If there is just one step, you don't need a Sequence, because it just creates garbage and gives you nothing useful.
You're missing something. :-)
In this particular case, you can use an IntProgression:
val progression = 0 until 1_000_000 step 2
You can then create your desired list of square roots in various ways:
// may make the list larger than necessary
// its internal array is copied each time the list grows beyond its capacity
// code is very straight forward
progression.map { Math.sqrt(it.toDouble()) }
// will make the list the exact size needed
// no copies are made
// code is more complicated
progression.mapTo(ArrayList(progression.last / 2 + 1)) { Math.sqrt(it.toDouble()) }
// will make the list the exact size needed
// a single intermediate list is made
// code is minimal and makes sense
progression.toList().map { Math.sqrt(it.toDouble()) }
My advice would be to choose whichever coding style you prefer. Kotlin is both an object-oriented and a functional language, meaning both of your propositions are correct.
Usually, functional constructs favor readability over performance; however, in some cases, procedural code will also be more readable. You should try to stick with one style as much as possible, but don't be afraid to switch some code if you feel like it's better suited to your constraints, either readability, performance, or both.
The converted code does not need the manual creation of the destination list, and can be simplified to:
val src = (0 until 1000000).toList()
val dest = src.filter { it % 2 == 0 }
    .map { Math.sqrt(it.toDouble()) }
And as mentioned in the excellent answer by #glee8e you can use a sequence to do a lazy evaluation. The simplified code for using a sequence:
val src = (0 until 1000000).toList()
val dest = src.asSequence() // change to lazy
    .filter { it % 2 == 0 }
    .map { Math.sqrt(it.toDouble()) }
    .toList() // create the final list
Note that the toList() at the end changes the sequence back into a final list, which is the only copy made during the processing. You can omit that step to keep it as a sequence.
It is important to highlight the comments by #hotkey saying that you should not always assume that another iteration or a copy of a list causes worse performance than lazy evaluation. #hotkey says:
Sometimes several loops, even if they copy the whole collection, show good performance because of good locality of reference. See: Kotlin's Iterable and Sequence look exactly same. Why are two types required?
And excerpted from that link:
... in most cases it has good locality of reference thus taking advantage of CPU cache, prediction, prefetching etc. so that even multiple copying of a collection still works good enough and performs better in simple cases with small collections.
#glee8e says that there are similarities between Kotlin sequences and Java 8 streams, for detailed comparisons see: What Java 8 Stream.collect equivalents are available in the standard Kotlin library?
I've seen this or similar questions a lot on Stack Overflow as well as other online sources. However, it looks like the corresponding part of Lucene's API has changed quite a lot, so to sum it up: I did not find any example that works on the latest Lucene version.
What I have:
Lucene Index + IndexReader + IndexSearcher
a bunch of documents (and their IDs)
What I want:
For all terms that occur in at least one of the selected documents, I want to get the TF-IDF for each document.
Or to say it differently: I want to get for any term that occurs in any of the selected documents its TF-IDF value, e.g., as an array (i.e., one TF-IDF value for each of the selected documents).
Any help is highly appreciated! :-)
Here's what I've come up with so far, but there are 2 problems:
It uses a temporarily created RAMDirectory which contains only the selected documents. Is there any way to work on the original index, or does that not make sense?
It does not compute document-based TF-IDF but somehow only index-based values, i.e., across all documents. This means that for each term I only get one TF-IDF value, and not one for each document and term.
public void getTfidf(IndexReader reader, Writer out, String field) throws IOException {
    Bits liveDocs = MultiFields.getLiveDocs(reader);
    TermsEnum termEnum = MultiFields.getTerms(reader, field).iterator(null);
    BytesRef term = null;
    TFIDFSimilarity tfidfSim = new DefaultSimilarity();
    int docCount = reader.numDocs();

    while ((term = termEnum.next()) != null) {
        String termText = term.utf8ToString();
        Term termInstance = new Term(field, term);
        // term and doc frequency in all documents
        long indexTf = reader.totalTermFreq(termInstance);
        long indexDf = reader.docFreq(termInstance);
        double tfidf = tfidfSim.tf(indexTf) * tfidfSim.idf(docCount, indexDf);
        // store it, but that's not the problem
    }
}
totalTermFreq does what it sounds like: it provides the frequency across the entire index. The TF in the calculation should be the term frequency within the document, not across the entire index. That's why everything you get here is constant; all of your variables are constant across the entire index, and none of them depend on the document. In order to get the term frequency for a document, you should use DocsEnum.freq(). Perhaps something like:
while ((term = termEnum.next()) != null) {
    Term termInstance = new Term(field, term);
    long indexDf = reader.docFreq(termInstance);
    // Walk the postings for this term: one entry per document containing it.
    DocsEnum docs = termEnum.docs(liveDocs, null);
    while (docs.nextDoc() != DocsEnum.NO_MORE_DOCS) {
        // docs.freq() is the term frequency within the current document.
        double tfidf = tfidfSim.tf(docs.freq()) * tfidfSim.idf(docCount, indexDf);
        // store it, e.g. keyed by docs.docID()
    }
}
Premise:
I am using ActionScript with two ArrayCollections containing objects with values to be matched...
I need a solution for this (ideally a library in the framework that does it better); otherwise, any suggestions are appreciated...
Let's assume I have two lists of elements A and B (no duplicate values). I need to compare them and remove all the elements present in both, so at the end I should have:
in A all the elements that are in A but not in B
in B all the elements that are in B but not in A
Right now I do something like this:
for (var i:int = 0 ; i < a.length ;)
{
    var isFound:Boolean = false;
    for (var j:int = 0 ; j < b.length ;)
    {
        if (a.getItemAt(i).nome == b.getItemAt(j).nome)
        {
            isFound = true;
            a.removeItemAt(i);
            b.removeItemAt(j);
            break;
        }
        j++;
    }
    if (!isFound)
        i++;
}
I loop over both arrays, and if I find a match I remove the items from both of them (and don't increment the loop index, so the for loop still progresses correctly).
I was wondering if there is a better (and less CPU-consuming) way to do it (and I'm sure there is)...
If you must use a list and you don't need the abilities of ArrayCollection, I suggest simply converting it to AS3 Vectors. The performance increase according to this (http://www.mikechambers.com/blog/2008/09/24/actioscript-3-vector-array-performance-comparison/) is about 60% compared to Arrays. I believe Arrays are already about 3x faster than ArrayCollections, from some article I once read. Unfortunately, this solution is still O(n^2) in time.
As an aside, the reason why Vectors are faster than ArrayCollections is because you provide type-hinting to the VM. The VM knows exactly how large each object is in the collection and performs optimizations based on that.
Another optimization on the Vectors is to sort the data by nome first, before doing the comparisons. You can then add a check to break out of the inner loop once the nome from list B could no longer be found further down in list A due to the ordering.
If you want to go MUCH faster than that, use an associative array (an Object in AS3). Of course, this may require more refactoring effort. I am assuming object.nome is a unique string/id for the objects. Simply use the value of nome as the key in objectA and objectB. By doing it this way, you might not need to loop through every element in each list to do the comparison.
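A rough sketch of that keyed-lookup idea, written here in Java only for concreteness (an AS3 Object keyed by nome plays the same role as the HashMap; all names are illustrative):

import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Sketch: index each collection by its nome key, then drop every key that
// appears in both maps. Roughly O(n + m) instead of O(n * m).
class SymmetricDifferenceSketch {
    static void removeCommon(Map<String, Object> byNomeA, Map<String, Object> byNomeB) {
        Iterator<String> it = byNomeA.keySet().iterator();
        while (it.hasNext()) {
            if (byNomeB.remove(it.next()) != null) {
                it.remove(); // present in both lists: drop from A as well
            }
        }
    }
}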