How can I get top terms for a subset of documents in a Lucene index?

I know it's possible to get the top terms in a Lucene index, but is there a way to get the top terms for a subset of a Lucene index?
For example, what are the top terms in the index for documents within a certain date range?

Ideally there'd be a utility somewhere to do this, but I'm not aware of one. However, it's not too hard to do this "by hand" in a reasonably efficient way. I'll assume that you already have a Query and/or Filter object that you can use to define the subset of interest.
First, build a list in memory of all of the document IDs in your index subset. You can use IndexSearcher.search(Query, Filter, HitCollector) to do this very quickly; the HitCollector documentation includes an example that seems like it ought to work, or you can use some other container to store your doc IDs.
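For example, a minimal sketch against the older HitCollector API mentioned above (later Lucene versions use Collector instead); the Searcher, Query and Filter are assumed to come from your own setup:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.HitCollector;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;

static List<Integer> collectSubsetDocIds(Searcher searcher, Query query, Filter filter)
        throws IOException {
    final List<Integer> docIds = new ArrayList<Integer>();
    searcher.search(query, filter, new HitCollector() {
        public void collect(int doc, float score) {
            docIds.add(doc);   // remember every document id in the subset
        }
    });
    return docIds;
}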
Next, initialize an empty HashMap (or whatever) to map terms to total frequency, and populate the map by invoking one of the IndexReader.getTermFreqVector methods for every document and field of interest. The three-argument form seems simpler, but either should be fine. For the three-argument form, you'd make a TermVectorMapper whose map method checks whether the term is already in the map, associates it with the given frequency if not, and adds the frequency to the existing value if so. Be sure to use the same TermVectorMapper object across all of the calls to getTermFreqVector in this pass, rather than instantiating a new one for each document in the loop. You can also speed things up quite a bit by overriding isIgnoringPositions() and isIgnoringOffsets(); your object should return true for both. It looks like your TermVectorMapper will also be forced to define a setExpectations method, but that one doesn't need to do anything.
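A rough sketch of such a mapper (Lucene 2.x/3.x API; the reader, the docIds list from the previous step, and the field name "contents" are assumptions):

import java.io.IOException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermVectorMapper;
import org.apache.lucene.index.TermVectorOffsetInfo;

static Map<String, Integer> countTerms(IndexReader reader, List<Integer> docIds)
        throws IOException {
    final Map<String, Integer> termFreqs = new HashMap<String, Integer>();
    TermVectorMapper mapper = new TermVectorMapper() {
        public void setExpectations(String field, int numTerms,
                                    boolean storeOffsets, boolean storePositions) {
            // nothing to do here
        }
        public void map(String term, int frequency,
                        TermVectorOffsetInfo[] offsets, int[] positions) {
            Integer current = termFreqs.get(term);
            termFreqs.put(term, current == null ? frequency : current + frequency);
        }
        public boolean isIgnoringPositions() { return true; }
        public boolean isIgnoringOffsets() { return true; }
    };
    for (int docId : docIds) {
        reader.getTermFreqVector(docId, "contents", mapper);  // same mapper reused for every doc
    }
    return termFreqs;
}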
Once you've built your map, just sort the map entries by descending frequency and read off however many top terms you like. If you know in advance how many terms you want, you might prefer a heap- or selection-based algorithm that finds the top k items more cheaply than a full O(n log n) sort, but I imagine the plain old sort will be plenty fast in practice. It's up to you.
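For instance, with a plain sort over the map built above:

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

static List<Map.Entry<String, Integer>> topTerms(Map<String, Integer> termFreqs, int k) {
    List<Map.Entry<String, Integer>> entries =
        new ArrayList<Map.Entry<String, Integer>>(termFreqs.entrySet());
    Collections.sort(entries, new Comparator<Map.Entry<String, Integer>>() {
        public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
            return b.getValue().compareTo(a.getValue());  // descending frequency
        }
    });
    return entries.subList(0, Math.min(k, entries.size()));
}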
If you prefer, you can combine the first two stages by having your HitCollector invoke getTermFreqVector directly. This should certainly produce equally correct results, and intuitively seems like it would be simpler and better, but the docs seem to warn that it is likely to be quite a bit slower than the two-pass approach (on the same page as the HitCollector example above). Or I could be misinterpreting their warning. If you're feeling ambitious you could try it both ways, compare, and let us know.

Counting up the term vectors will work, but will be slow if there are a lot of documents to iterate over. Also note that if by top terms you mean document frequency (docFreq), then don't use the counts in the TermFreqVector; just count each term once per document (i.e., as binary).
Alternatively, you could iterate over the terms as you would for facet counts. Use a cached filter for every term; the filters' bit sets can be intersected with the subset's bit set for a fast count.
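A sketch of that filter idea, assuming Lucene 3.x classes (OpenBitSet, CachingWrapperFilter) and that subsetBits already holds the bits of the documents in your date range:

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.CachingWrapperFilter;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.QueryWrapperFilter;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.util.OpenBitSet;
import org.apache.lucene.util.OpenBitSetDISI;

// document frequency of 'term' inside the subset described by 'subsetBits'
static long subsetDocFreq(IndexReader reader, OpenBitSet subsetBits, Term term)
        throws IOException {
    Filter termFilter = new CachingWrapperFilter(new QueryWrapperFilter(new TermQuery(term)));
    OpenBitSetDISI termBits =
        new OpenBitSetDISI(termFilter.getDocIdSet(reader).iterator(), reader.maxDoc());
    return OpenBitSet.intersectionCount(subsetBits, termBits);
}

Keep the per-term filter objects around if you call this repeatedly, otherwise the CachingWrapperFilter buys you nothing.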

Dict vs Record in elm

While implementing a simple app I ran into the problem of trying to update a nested record. I found a solution online but it really seems like a whole lot of bloated code.
As I was looking for alternatives I found Dictionaries. These seem like a solution to that problem: if I use a dictionary inside of a record I can avoid all that bloated code and get nested updates.
Seeing dictionaries and records next to each other made me wonder: why would I use a record instead of a dictionary, or vice versa? The two seem really similar to me, so I am not sure I see the advantage of one over the other. Of course I can see that there is a difference in syntax, but is that all?
I learned somewhere that the access time complexity of Dict is O(log(n)) (does it do a binary search on the keys?), but I can't find the access time complexity for records. I am wondering if that is O(1) and whether that is one of the advantages.
Either way, they both seem to map to a single data structure in other languages (e.g. Python's dictionaries, JS objects, Java hash tables), so why do we need two in Elm?
Dicts and records might seem very similar when coming from JavaScript, but in a statically typed language they are actually very different. I think just about the only property they have in common is that they are both key-value containers.
The biggest differences, I think, are that Dicts are homogeneous, meaning values must be of the same type, and "dynamically" keyed and sized, meaning keys are not statically checked (i.e. at compile time) and key-value pairs can be added at runtime. Records, on the other hand, include the key names and value types in the record type, which means they can hold values of different types, but also can't have keys added or removed at runtime without changing the type itself.
The benefits of easily being able to insert into and update a Dict are something you pay for when you try to get values back out. Dict.get returns a Maybe which you'll then have to handle, because the type doesn't give any guarantee that it contains anything at all. You also won't get a compiler error if you mistype the name of a key.
Overall, a Dict forsakes most of the benefits of static typing. I think a good rule of thumb is that if you know the key names, you should most likely go with records. If you don't, go with Dict.
You also seem about right regarding performance, but I think that's a secondary concern. Record access should be equivalent to accessing the elements of an array by index, since so much information is known at compile time that it can essentially be compiled down to a fixed-size array.

Getting the document length during query evaluation in Apache Lucene 5.3

I am trying to change the scoring in Apache Lucene 5.3, and for my formula I need the document length (the number of tokens in the document). I understood from answers to similar questions that there is no easy way to do this, because Lucene doesn't keep it in the index. So I thought that while indexing I would create a map from docID to document length, and then use it during query evaluation. But I have no idea where I should put this map and where I should update it.
You are exactly right, storing this when the document is indexed is the best approach. The place to store it is in the norm (not to be confused with the queryNorm, that's something different). Norms provide a single value stored with the field, which is made available at query time for scoring.
In your Similarity implementation, this should go into the computeNorm method, which exposes the information you need through the FieldInvertState, particularly FieldInvertState.getLength(). Norms are made available at search time through LeafReader.getNormValues.
If you are extending TFIDFSimilarity instead, you just need to implement the encodeNormValue, decodeNormValue and lengthNorm methods.
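If your formula doesn't have to be TF-IDF-shaped, a lower-effort variant (a sketch, not taken from the answer above) is to extend SimilarityBase, which already encodes the field length into the norm and decodes it for you at query time; note the length is approximate, since it is compressed into a single byte by default:

import org.apache.lucene.search.similarities.BasicStats;
import org.apache.lucene.search.similarities.SimilarityBase;

public class LengthAwareSimilarity extends SimilarityBase {

    @Override
    protected float score(BasicStats stats, float freq, float docLen) {
        // docLen is the (approximate) number of tokens in the field for this document;
        // plug it into whatever formula you need; this one is purely illustrative
        return freq / (1.0f + docLen);
    }

    @Override
    public String toString() {
        return "LengthAwareSimilarity";
    }
}

Remember to set the same Similarity both on the IndexWriterConfig at index time and on the IndexSearcher at query time, otherwise the norms won't line up.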

Finding document/field length in Lucene 4

I'm looking to have the ability to access the length (in terms) of a specific field of a document post-indexing. Preferably, if there is a way without re-indexing I would like to do that. But if re-indexing in a certain way will give easy access to this value, that would also serve.
http://blog.mikemccandless.com/2012/03/new-index-statistics-in-lucene-40.html
That link (scroll down a bit and find the mention of length) talks about accessing the value at indexing time. I wish to be able to do so post-indexing. The link also talks about saving the value away in a doc value, but it gives no examples of how to do so.
If anyone could provide examples of saving the document length, or accessing it post-indexing, it would be incredibly helpful. Thanks.
The mention of that statistic in the article is in reference to a FieldInvertState. Once you have that, it should be fairly straightforward to get the statistics you are looking for (just call getLength, getUniqueTermCount, or whatever you need).
The FieldInvertState is passed into the Similarity, particularly to the call Similarity.computeNorm. The norm value is calculated and stored at index time, rather than evaluated at query time, so making effective use of it would require you to reindex.
The typical way to make use of this would be to create a custom Similarity, possibly extending DefaultSimilarity. Simply overriding the lengthNorm method of DefaultSimilarity would be the simplest approach. Its standard implementation is:
return (float)(1.0 / Math.sqrt(numTerms));
Which you could override with whatever you like.
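For instance, a minimal sketch of such an override (Lucene 4.x; the formula is purely illustrative):

import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.search.similarities.DefaultSimilarity;

public class MyLengthSimilarity extends DefaultSimilarity {
    @Override
    public float lengthNorm(FieldInvertState state) {
        // dampen the length penalty compared to 1/sqrt(numTerms)
        return (float) (1.0 / Math.log(2 + state.getLength()));
    }
}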
That would work for tweaking scoring based on a custom length-based calculation. If that's not what you are looking for, and you just need to be able to fetch that information, then simply storing the field and getting the length from the field value returned when you fetch a Document would be the simplest implementation.
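If you go the explicit route, a hedged sketch (Lucene 4.x API; the field names "body"/"bodyLength" and the tokenCount value are illustrative, and counting the tokens is left to your own analysis step) would be to store the length in a NumericDocValuesField at index time and read it back afterwards:

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.MultiDocValues;
import org.apache.lucene.index.NumericDocValues;

static void addDoc(IndexWriter writer, String text, long tokenCount) throws IOException {
    Document doc = new Document();
    doc.add(new TextField("body", text, Field.Store.NO));
    doc.add(new NumericDocValuesField("bodyLength", tokenCount));  // length computed elsewhere
    writer.addDocument(doc);
}

static long fieldLength(IndexReader reader, int docId) throws IOException {
    NumericDocValues lengths = MultiDocValues.getNumericValues(reader, "bodyLength");
    return lengths == null ? -1 : lengths.get(docId);  // null if no document has the field
}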

What is the easiest way to implement terms association mining in Solr?

Association mining seems to give good results for retrieving related terms in text corpora. There are several works on this topic, including the well-known LSA method. The most straightforward way to mine associations is to build a co-occurrence matrix of docs x terms and find the terms that occur in the same documents most often. In my previous projects I implemented it directly in Lucene by iterating over TermDocs (obtained by calling IndexReader.termDocs(Term)). But I can't see anything similar in Solr.
So, my needs are:
To retrieve the most associated terms within a particular field.
To retrieve the term that is closest to a specified one within a particular field.
I will rate answers in the following way:
Ideally I would like to find a Solr component that directly covers the specified needs, that is, something that returns associated terms directly.
If this is not possible, I'm looking for a way to get co-occurrence matrix information for a specified field.
If this is not an option either, I would like to know the most straightforward way to 1) get all terms and 2) get the ids (numbers) of the documents these terms occur in.
You can export a Lucene (or Solr) index to Mahout, and then use Latent Dirichlet Allocation. If LDA is not close enough to LSA for your needs, you can just take the correlation matrix from Mahout, and then use Mahout to compute the singular value decomposition.
I don't know of any LSA components for Solr.
Since there are still no answers to my question, I have to write up my own thoughts and accept them. Nevertheless, if someone proposes a better solution, I'll happily accept it instead of mine.
I'll go with the co-occurrence matrix, since it is the most fundamental part of association mining. In general, Solr provides all the functions needed to build this matrix in some way, though they are not as efficient as direct access with Lucene. To construct the matrix we need:
All terms, or at least the most frequent ones, because rare terms won't affect the result of association mining by their nature.
The documents in which these terms occur; again, at least the top documents.
Both of these tasks can easily be done with standard Solr components.
To retrieve terms, the TermsComponent or faceted search may be used. We can get only the top terms (by default) or all terms (by setting the maximum number of terms to return; see the documentation of the particular feature for details).
Getting the documents that contain a given term is simply a search for that term. The weak point here is that we need one request per term, and there may be thousands of terms. Another weak point is that neither simple nor faceted search provides information about the number of occurrences of the term in a found document.
With this, it is easy to build the co-occurrence matrix. To mine associations, one can use other software such as Weka, or write one's own implementation of, say, the Apriori algorithm.
You can get the count of occurrences of the term in each found document with the following query:
http://ip:port/solr/someinstance/select?defType=func&fl=termfreq(field,xxx),*&fq={!frange l=1}termfreq(field,xxx)&indent=on&q=termfreq(field,xxx)&sort=termfreq(field,xxx) desc&wt=json
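For completeness, a rough SolrJ sketch of the two steps (top terms via the TermsComponent, then one termfreq query per term); the core URL, field name and limits are placeholders, this assumes SolrJ 5.x's HttpSolrClient, and term values would need escaping in real use:

import java.util.List;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.client.solrj.response.TermsResponse;
import org.apache.solr.common.SolrDocument;

public class CooccurrenceSketch {
    public static void main(String[] args) throws Exception {
        SolrClient solr = new HttpSolrClient("http://localhost:8983/solr/someinstance");

        // step 1: top terms of the field via the TermsComponent
        SolrQuery termsQuery = new SolrQuery();
        termsQuery.setRequestHandler("/terms");
        termsQuery.setTerms(true);
        termsQuery.addTermsField("text");
        termsQuery.setTermsLimit(1000);
        List<TermsResponse.Term> topTerms =
            solr.query(termsQuery).getTermsResponse().getTerms("text");

        // step 2: for each term, fetch the documents it occurs in along with its in-document frequency
        for (TermsResponse.Term term : topTerms) {
            String tf = "termfreq(text," + term.getTerm() + ")";
            SolrQuery q = new SolrQuery(tf);
            q.set("defType", "func");
            q.setFields("id", tf);
            q.addFilterQuery("{!frange l=1}" + tf);
            QueryResponse rsp = solr.query(q);
            for (SolrDocument d : rsp.getResults()) {
                // d.getFieldValue(tf) is termfreq for this doc; accumulate it into your matrix here
            }
        }
        solr.close();
    }
}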

Lucene boost: I need to make it work better

I'm using Lucene to index components with names and types. Some components are more important and thus get a bigger boost. However, I cannot get my boost to work properly: I still see some components appear later (get a worse score), even though they have a higher boost.
Note that the indexing is done on one field only and I've set the boost to that field alone. I'm using Lucene in Java.
I don't think it has anything to do with the field length. I've seen components with the same name (but different type) get the wrong score.
Use Searcher.explain to find out how the scores for each document are derived. One of the key criteria in the score is the length of the field: a match in a shorter field gets a higher score.
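For example (Lucene Java; the searcher and query come from your own code):

import java.io.IOException;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

static void explainHits(IndexSearcher searcher, Query query) throws IOException {
    TopDocs hits = searcher.search(query, 10);
    for (ScoreDoc sd : hits.scoreDocs) {
        // prints the tf, idf, fieldNorm and boost contributions for each hit
        System.out.println(searcher.explain(query, sd.doc));
    }
}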
I suggest you use Luke to see exactly what is stored in your index. Are you using document boosting? See the scoring documentation to check possible explanations.
Boost is just one factor in the Lucene score for a hit. But it should work. Can you give a more complete example of the behavior you are seeing, and what you would expect?
As I recall, boosting is intended to make one field more important than another. If you have only one field, boosting won't change the order of the results at all.
Added: no, it looks like you can indeed boost specific documents. Oops!
Make sure field.omitNorms is set to false on the field you want to boost.
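A small index-time sketch of field boosting with norms kept (Lucene 4.x-era API; the field name, value and boost factor are only illustrative):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.TextField;

static Document buildComponentDoc(String componentName, float importanceBoost) {
    FieldType type = new FieldType(TextField.TYPE_NOT_STORED);
    type.setOmitNorms(false);              // the boost is folded into the norm; omitting norms discards it
    Field nameField = new Field("name", componentName, type);
    nameField.setBoost(importanceBoost);   // more important component types get a larger boost
    Document doc = new Document();
    doc.add(nameField);
    return doc;
}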