How to get document ID in CustomScoreProvider? - lucene

In short, I am trying to determine a document's true document ID in method CustomScoreProvider.CustomScore which only provides a document "ID" relative to a sub-IndexReader.
More info: I am trying to boost my documents' scores by precomputed boost factors (imagine an in-memory structure that maps Lucene's document ids to boost factors). Unfortunately I cannot store the boosts in the index for a couple of reasons: boosting will not be used for all queries, plus the boost factors can change regularly and that would trigger a lot of reindexing.
Instead I'd like to boost the score at query time and thus I've been working with CustomScoreQuery/CustomScoreProvider. The boosting takes place in method CustomScoreProvider.CustomScore:
public override float CustomScore(int doc, float subQueryScore, float valSrcScore) {
    float baseScore = subQueryScore * valSrcScore; // the default computation
    // boost -- THIS IS WHERE THE PROBLEM IS
    float boostedScore = baseScore * MyBoostCache.GetBoostForDocId(doc);
    return boostedScore;
}
My problem is with the doc parameter passed to CustomScore. It is not the true document id -- it is relative to the subreader used for that index segment. (The MyBoostCache class is my in-memory structure mapping Lucene's doc ids to boost factors.) If I knew the reader's docBase I could figure out the true id (id = doc + docBase).
Any thoughts on how I can determine the true id, or perhaps there's a better way to accomplish what I'm doing?
(I am aware that the id I'm trying to get is subject to change and I've already taken steps to make sure the MyBoostCache is always up to date with the latest ids.)

I was able to achieve this by passing the IndexSearcher to my CustomScoreProvider, using it to determine which of the searcher's subreaders the CustomScoreProvider is working with, and then summing the MaxDoc of the preceding subreaders to determine the docBase.
private int DocBase { get; set; }

public MyScoreProvider(IndexReader reader, IndexSearcher searcher) {
    DocBase = GetDocBaseForIndexReader(reader, searcher);
}

private static int GetDocBaseForIndexReader(IndexReader reader, IndexSearcher searcher) {
    // get all segment readers for the searcher
    IndexReader rootReader = searcher.GetIndexReader();
    var subReaders = new List<IndexReader>();
    ReaderUtil.GatherSubReaders(subReaders, rootReader);
    // sequentially loop through the subreaders until we find the specified reader, adjusting our offset along the way
    int docBase = 0;
    for (int i = 0; i < subReaders.Count; i++)
    {
        if (subReaders[i] == reader)
            break;
        docBase += subReaders[i].MaxDoc();
    }
    return docBase;
}

public override float CustomScore(int doc, float subQueryScore, float valSrcScore) {
    float baseScore = subQueryScore * valSrcScore;
    float boostedScore = baseScore * MyBoostCache.GetBoostForDocId(doc + DocBase);
    return boostedScore;
}

Related

Android custom keyboard suggestions

I am building a custom keyboard for Android, one that at least supports autocomplete suggestions. To achieve this, I am storing every word that the user types (not password fields) in a Room database table with a simple model: the word and its frequency. For showing the suggestions, I am using a Trie which is populated by words from this database table. My query basically orders the table by the frequency of the word and limits the results to 5K (I do not want to overpopulate the Trie; these 5K words can be considered the user's favourite words that he uses often and needs suggestions for). My actual problem is the ORDER BY clause: this is a rapidly growing data set, and sorting, let's say, 0.1M words to get 5K words seems like overkill. How can I rework this approach to improve the efficiency of this entire suggestions logic?
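For reference, a minimal sketch of the kind of DAO query the question describes (the entity/table name, column names, and method name are illustrative assumptions, not taken from the actual project):
import androidx.room.Dao;
import androidx.room.Query;
import java.util.List;

// Hypothetical sketch of the query described above: load the 5K most
// frequent words so they can be inserted into the suggestion Trie.
@Dao
public interface WordDao {
    // Assumes a table "word" with columns "word" and "frequency".
    @Query("SELECT word FROM word ORDER BY frequency DESC LIMIT 5000")
    List<String> topWordsForSuggestions();
}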
If not already implemented, add an index on the frequency column: @ColumnInfo(index = true).
Another option could be to add a table that maintains the highest 5k, supported by yet another table (the support table) that has one row, with columns for: the highest frequency (not really required), the lowest frequency in the current 5k, and a third column for the number of rows currently held. After adding/updating a word you could then determine whether or not the new/updated word should be added to the 5k table (perhaps a fourth column for the primary key of the lowest-frequency row to facilitate efficient deletion).
So:
if the number currently held is less than 5k, insert into or update the 5k table and increment the number currently held in the support table;
otherwise, if the word's frequency is lower than the lowest, skip it;
otherwise, update the 5k table if the word already exists in it;
otherwise, delete the lowest, insert the replacement and then update the lowest accordingly in the support table.
Note that the 5K table would probably only need to store the rowid as a pointer/reference/map to the core table.
rowid is a column that virtually all tables will have in Room (virtual tables are an exception, as are tables that have the WITHOUT ROWID attribute, but Room does not, as far as I am aware, facilitate WITHOUT ROWID tables).
The rowid can be up to twice as fast as other indexes. I would suggest using @PrimaryKey Long id=null; (Java) or @PrimaryKey var id: Long? = null (Kotlin) and NOT using @PrimaryKey(autoGenerate = true).
autoGenerate = true equates to SQLite's AUTOINCREMENT, about which the SQLite documentation says "The AUTOINCREMENT keyword imposes extra CPU, memory, disk space, and disk I/O overhead and should be avoided if not strictly needed. It is usually not needed."
See https://www.sqlite.org/rowidtable.html, and also https://sqlite.org/autoinc.html
Curiously/funnily, the support table mentioned above isn't that far away from what AUTOINCREMENT does under the hood.
A table (sqlite_sequence), with a row per table that has AUTOINCREMENT, is used to store the table name and the highest rowid ever allocated.
Without AUTOINCREMENT, but with <column_name> INTEGER PRIMARY KEY and no value (or null) for the primary key column, SQLite generates a value that is 1 greater than max(rowid).
With AUTOINCREMENT/autoGenerate = true, the generated value is the greater of max(rowid) and the value stored for that table in the sqlite_sequence table (hence the overheads).
Of course, those overheads will very likely be insignificant in comparison to sorting 0.1M rows.
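As a small illustration of the point above (the entity and field names here are examples only, not part of the demonstration that follows), the recommended and discouraged primary key declarations look like this:
import androidx.room.Entity;
import androidx.room.PrimaryKey;

// Illustrative only: entity/field names are examples, not from the demo below.
@Entity
class ExampleWord {
    // Recommended: a plain @PrimaryKey with a nullable Long; when null is inserted,
    // SQLite generates max(rowid) + 1 without the AUTOINCREMENT overhead.
    @PrimaryKey
    Long id = null;

    // Discouraged alternative (equates to AUTOINCREMENT, hence the overheads described above):
    // @PrimaryKey(autoGenerate = true)
    // long id;

    String word;
    long frequency;
}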
Demonstration
The following is a demonstration, albeit just using a basic Word table as the source.
First the two tables (@Entity annotated classes).
Word :-
@Entity(
        indices = {@Index(value = {"word"}, unique = true)}
)
class Word {
    @PrimaryKey
    Long wordId = null;
    @NonNull
    String word;
    @ColumnInfo(index = true)
    long frequency;

    Word() {}

    @Ignore
    Word(String word, long frequency) {
        this.word = word;
        this.frequency = frequency;
    }
}
WordSubset, aka the table with the highest-occurring 5000 frequencies; it simply has a reference/map/link to the underlying/actual word :-
@Entity(
        foreignKeys = {
                @ForeignKey(
                        entity = Word.class,
                        parentColumns = {"wordId"},
                        childColumns = {"wordIdMap"},
                        onDelete = ForeignKey.CASCADE,
                        onUpdate = ForeignKey.CASCADE
                )
        }
)
class WordSubset {
    public static final long SUBSET_MAX_SIZE = 5000;
    @PrimaryKey
    long wordIdMap;

    WordSubset() {}

    @Ignore
    WordSubset(long wordIdMap) {
        this.wordIdMap = wordIdMap;
    }
}
Note the constant SUBSET_MAX_SIZE, hard-coded just the once, so a single change is enough to adjust it (lowering it after rows have been added may cause issues).
WordSubsetSupport :- this will be a single-row table that contains the highest and lowest frequencies (the highest is not really needed), the number of rows in the WordSubset table, and a reference/map to the word with the lowest frequency.
@Entity(
        foreignKeys = {
                @ForeignKey(
                        entity = Word.class,
                        parentColumns = {"wordId"},
                        childColumns = {"lowestWordIdMap"}
                )
        }
)
class WordSubsetSupport {
    @PrimaryKey
    Long wordSubsetSupportId = null;
    long highestFrequency;
    long lowestFrequency;
    long countOfRowsInSubsetTable;
    @ColumnInfo(index = true)
    long lowestWordIdMap;

    WordSubsetSupport() {}

    @Ignore
    WordSubsetSupport(long highestFrequency, long lowestFrequency, long countOfRowsInSubsetTable, long lowestWordIdMap) {
        this.highestFrequency = highestFrequency;
        this.lowestFrequency = lowestFrequency;
        this.countOfRowsInSubsetTable = countOfRowsInSubsetTable;
        this.lowestWordIdMap = lowestWordIdMap;
        this.wordSubsetSupportId = 1L;
    }
}
For access, an abstract class CombinedDao (rather than an interface, as an abstract class in Java allows methods/functions with a body; a Kotlin interface allows these anyway) :-
@Dao
abstract class CombinedDao {

    @Insert(onConflict = OnConflictStrategy.IGNORE)
    abstract long insert(Word word);
    @Insert(onConflict = OnConflictStrategy.IGNORE)
    abstract long insert(WordSubset wordSubset);
    @Insert(onConflict = OnConflictStrategy.IGNORE)
    abstract long insert(WordSubsetSupport wordSubsetSupport);

    @Query("SELECT * FROM wordsubsetsupport LIMIT 1")
    abstract WordSubsetSupport getWordSubsetSupport();
    @Query("SELECT count() FROM wordsubsetsupport")
    abstract long getWordSubsetSupportCount();
    @Query("SELECT countOfRowsInSubsetTable FROM wordsubsetsupport")
    abstract long getCountOfRowsInSubsetTable();
    @Query("UPDATE wordsubsetsupport SET countOfRowsInSubsetTable=:updatedCount")
    abstract void updateCountOfRowsInSubsetTable(long updatedCount);
    @Query("UPDATE wordsubsetsupport " +
            "SET countOfRowsInSubsetTable = (SELECT count(*) FROM wordsubset), " +
            "lowestWordIdMap = (SELECT word.wordId FROM wordsubset JOIN word ON wordsubset.wordIdMap = word.wordId ORDER BY frequency ASC LIMIT 1)," +
            "lowestFrequency = (SELECT coalesce(min(frequency),0) FROM wordsubset JOIN word ON wordsubset.wordIdMap = word.wordId)," +
            "highestFrequency = (SELECT coalesce(max(frequency),0) FROM wordsubset JOIN word ON wordsubset.wordIdMap = word.wordId)")
    abstract void autoUpdateWordSupportTable();
    @Query("DELETE FROM wordsubset WHERE wordIdMap= (SELECT wordsubset.wordIdMap FROM wordsubset JOIN word ON wordsubset.wordIdMap = word.wordId ORDER BY frequency ASC LIMIT 1)")
    abstract void deleteLowestFrequency();

    @Transaction
    @Query("")
    int addWord(Word word) {
        /* Try to add the word, setting the wordId value according to the result.
           The result will be the wordId generated (1 or greater) or, if the word already exists, -1. */
        word.wordId = insert(word);
        /* If the word was added and not rejected as a duplicate, then it may need to be added to the WordSubset table */
        if (word.wordId > 0) {
            /* Are there any rows in the support table? If not, then add the very first entry/row */
            if (getWordSubsetSupportCount() < 1) {
                /* Need to add the word to the subset */
                insert(new WordSubset(word.wordId));
                /* Can now add the first (and only) row to the support table */
                insert(new WordSubsetSupport(word.frequency, word.frequency, 1, word.wordId));
                autoUpdateWordSupportTable();
                return 1;
            }
            /* If there are fewer than the maximum number of rows in the subset table then
               1) insert the new subset row, and
               2) update the support table accordingly */
            if (getCountOfRowsInSubsetTable() < WordSubset.SUBSET_MAX_SIZE) {
                insert(new WordSubset(word.wordId));
                autoUpdateWordSupportTable();
                return 2;
            }
            /* Last case is that the subset table is at the maximum number of rows and
               the frequency of the added word is greater than the lowest frequency in the
               subset, so
               1) the row with the lowest frequency is removed from the subset table,
               2) the added word is added to the subset, and
               3) the support table is updated accordingly */
            if (getCountOfRowsInSubsetTable() >= WordSubset.SUBSET_MAX_SIZE) {
                WordSubsetSupport currentWordSubsetSupport = getWordSubsetSupport();
                if (word.frequency > currentWordSubsetSupport.lowestFrequency) {
                    deleteLowestFrequency();
                    insert(new WordSubset(word.wordId));
                    autoUpdateWordSupportTable();
                    return 3;
                }
            }
            return 4; /* indicates word added but does not qualify for addition to the subset */
        }
        return -1;
    }
}
The addWord method/function is the only method that needs to be used, as it automatically maintains the WordSubset and WordSubsetSupport tables.
TheDatabase is a pretty standard @Database annotated class, other than that it allows use of the main thread, for the sake of convenience and brevity of the demo :-
@Database(entities = {Word.class, WordSubset.class, WordSubsetSupport.class}, version = TheDatabase.DATABASE_VERSION, exportSchema = false)
abstract class TheDatabase extends RoomDatabase {
    abstract CombinedDao getCombinedDao();

    private static volatile TheDatabase instance = null;

    public static TheDatabase getInstance(Context context) {
        if (instance == null) {
            instance = Room.databaseBuilder(context, TheDatabase.class, DATABASE_NAME)
                    .addCallback(cb)
                    .allowMainThreadQueries()
                    .build();
        }
        return instance;
    }

    private static Callback cb = new Callback() {
        @Override
        public void onCreate(@NonNull SupportSQLiteDatabase db) {
            super.onCreate(db);
        }
        @Override
        public void onOpen(@NonNull SupportSQLiteDatabase db) {
            super.onOpen(db);
        }
    };

    public static final String DATABASE_NAME = "the_database.db";
    public static final int DATABASE_VERSION = 1;
}
Finally, activity code that randomly generates and adds 10,000 words (or thereabouts, as some could be duplicate words), each word having a frequency that is also randomly generated (between 0 and 9999) :-
public class MainActivity extends AppCompatActivity {

    TheDatabase db;
    CombinedDao dao;

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        setContentView(R.layout.activity_main);
        db = TheDatabase.getInstance(this);
        dao = db.getCombinedDao();

        for (int i = 0; i < 10000; i++) {
            Word currentWord = generateRandomWord();
            Log.d("ADDINGWORD", "Adding word " + currentWord.word + " frequency is " + currentWord.frequency);
            dao.addWord(currentWord); // add the word that was generated and logged above
        }
    }

    public static final String ALPHABET = "abcdefghijklmnopqrstuvwxyz";

    private Word generateRandomWord() {
        // abs() is via a static import of java.lang.Math.abs
        Random r = new Random();
        int wordLength = (abs(r.nextInt()) % 24) + 1;
        int frequency = abs(r.nextInt()) % 10000;
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < wordLength; i++) {
            int letter = abs(r.nextInt()) % (ALPHABET.length());
            sb.append(ALPHABET.substring(letter, letter + 1));
        }
        return new Word(sb.toString(), frequency);
    }
}
Obviously the results will differ per run; also, the demo is really only designed to be run once (although it could be run again).
After running and inspecting the database with App Inspection, the support table (in this instance) showed:
countOfRowsInSubsetTable is 5000, so the subset table has been filled to its capacity/limit.
The highest frequency encountered is 9999 (as could well be expected).
The lowest frequency in the subset is 4690, and that is for the word with the wordId of 7412.
The subset table on its own means little, as it just contains a map to the actual word. So it's more informative to use a query to look at what it contains; such a query shows that the word whose wordId is 7412 is the one with the lowest frequency of 4690 (as expected according to the support table).
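For that kind of inspection, a query along the following lines could be used; this is a hypothetical addition to the CombinedDao above (the method name and projection are illustrative, and java.util.List would need to be imported), not part of the demo as posted:
// Hypothetical addition to CombinedDao for inspecting the subset.
// Joins the 5K subset back to the word table so the actual words and
// frequencies can be seen, lowest frequency first.
@Query("SELECT word.* FROM wordsubset JOIN word ON wordsubset.wordIdMap = word.wordId ORDER BY frequency ASC")
abstract List<Word> getSubsetWordsLowestFirst();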

What is the best method to determine file size resource usage in SenseNet?

In order to charge appropriately for resource usage, i.e. database storage, we need to know the size of our clients' files. Is there a simple way to calculate the resource usage for a client's Workspace?
If you just want to know the size of the files in the workspace, you can use the function below, although total resource usage is likely much higher.
Calculate file size -- useful, but not close to total storage:
public static int DocumentFileSizeMB(string path)
{
    var size = 0;
    var results = ContentQuery.Query(SafeQueries.TypeInTree, null, "File", path);
    if (results != null && results.Count > 0)
    {
        var longsize = results.Nodes.Sum(n => n.GetFullSize());
        size = (int)(longsize / 1000000);
    }
    return size;
}
To get a better idea of storage space used, call the SenseNet function GetTreeSize() on a node. However, this still doesn't give the full resource usage, because content related to the node, such as index tables, log entries, etc., is stored outside the node's subtree and is not included in the calculation.
A better method, but still not the full resource usage:
public static int NodeStorageSizeMB(string path)
{
    var size = 0;
    var node = Node.LoadNode(path);
    if (node != null)
    {
        size = (int)(node.GetTreeSize() / 1000000); // Use 10**6 as Mega, not 1024*1024, which is "mebibyte".
    }
    return size;
}

Sage: Iterate over increasing sequences

I have a problem that I am unwilling to believe hasn't been solved before in Sage.
Given a pair of integers (d,n) as input, I'd like to receive a list (or set, or whatever) of all nondecreasing sequences of length d all of whose entries are no greater than n.
Similarly, I'd like another function which returns all strictly increasing sequences of length d whose entries are no greater than n.
For example, for d = 2, n = 3, I'd receive the output:
[[1,2], [1,3], [2,3]]
or
[[1,1], [1,2], [1,3], [2,2], [2,3], [3,3]]
depending on whether I'm using increasing or nondecreasing.
Does anyone know of such a function?
Edit: Of course, if there is such a method for nonincreasing or decreasing sequences, I can modify that to fit my purposes. I just need something to iterate over sequences.
I needed this algorithm too and I finally managed to write one today. I will share the code here, but I only started to learn coding last week, so it is not pretty.
Idea: Input = (r,d). Step 1) Create a class "ListAndPosition" that has a list L of arrays Integer[r+1], and an integer q between 0 and r. Step 2) Create a method that receives a ListAndPosition (L,q) and sequentially screens the arrays in L, checking whether the integer at position q is less than the one at position q+1; if so, it adds a new array at the bottom of the list with that entry incremented. When done, the method calls itself again with the new list and q-1 as input.
The code for Step 1)
import java.util.ArrayList;

public class ListAndPosition {

    public static Integer r = 5;
    public final ArrayList<Integer[]> L;
    public int q;

    public ListAndPosition(ArrayList<Integer[]> L, int q) {
        this.L = L;
        this.q = q;
    }

    public ArrayList<Integer[]> getList() {
        return L;
    }

    public int getPosition() {
        return q;
    }

    public void decreasePosition() {
        q--;
    }

    public void showList() {
        for (int i = 0; i < L.size(); i++) {
            for (int j = 0; j < r + 1; j++) {
                System.out.print("" + L.get(i)[j]);
            }
            System.out.println("");
        }
    }
}
The code for Step 2)
import java.util.ArrayList;

public class NonDecreasingSeqs {

    public static Integer r = 5;
    public static Integer d = 3;

    public static void main(String[] args) {
        //Creating the first array
        Integer[] firstArray;
        firstArray = new Integer[r + 1];
        for (int i = 0; i < r; i++) {
            firstArray[i] = 0;
        }
        firstArray[r] = d;
        //Creating the starting listAndDim
        ArrayList<Integer[]> L = new ArrayList<Integer[]>();
        L.add(firstArray);
        ListAndPosition Lq = new ListAndPosition(L, r - 1);
        System.out.println("" + nonDecSeqs(Lq).size());
    }

    public static ArrayList<Integer[]> nonDecSeqs(ListAndPosition Lq) {
        int iterations = r - 1 - Lq.getPosition();
        System.out.println("How many arrays in the list after " + iterations + " iterations? " + Lq.getList().size());
        System.out.print("Should we stop the iteration?");
        if (0 < Lq.getPosition()) {
            System.out.println(" No, position = " + Lq.getPosition());
            for (int i = 0; i < Lq.getList().size(); i++) {
                //Showing particular array
                System.out.println("Array of L #" + i + ":");
                for (int j = 0; j < r + 1; j++) {
                    System.out.print("" + Lq.getList().get(i)[j]);
                }
                System.out.print("\nCan it be modified at position " + Lq.getPosition() + "?");
                if (Lq.getList().get(i)[Lq.getPosition()] < Lq.getList().get(i)[Lq.getPosition() + 1]) {
                    System.out.println(" Yes, " + Lq.getList().get(i)[Lq.getPosition()] + "<" + Lq.getList().get(i)[Lq.getPosition() + 1]);
                    {
                        Integer[] tempArray = new Integer[r + 1];
                        for (int j = 0; j < r + 1; j++) {
                            if (j == Lq.getPosition()) {
                                tempArray[j] = new Integer(Lq.getList().get(i)[j]) + 1;
                            } else {
                                tempArray[j] = new Integer(Lq.getList().get(i)[j]);
                            }
                        }
                        Lq.getList().add(tempArray);
                    }
                    System.out.println("New list");
                    Lq.showList();
                } else {
                    System.out.println(" No, " + Lq.getList().get(i)[Lq.getPosition()] + "=" + Lq.getList().get(i)[Lq.getPosition() + 1]);
                }
            }
            System.out.print("Old position = " + Lq.getPosition());
            Lq.decreasePosition();
            System.out.println(", new position = " + Lq.getPosition());
            nonDecSeqs(Lq);
        } else {
            System.out.println(" Yes, position = " + Lq.getPosition());
        }
        return Lq.getList();
    }
}
Remark: I needed my sequences to start at 0 and end at d.
This is probably not a very good answer to your question. But you could, in principle, use Partitions and the max_slope=-1 argument. Messing around with filtering lists of IntegerVectors sounds equally inefficient and depressing for other reasons.
If this has a canonical name, it might be in the list of sage-combinat functionality, and there is even a base class you could perhaps use for integer lists, which is basically what you are asking about. Maybe you could actually get what you want using IntegerListsLex? Hope this proves helpful.
This question can be solved by using the class "UnorderedTuples" described here:
http://doc.sagemath.org/html/en/reference/combinat/sage/combinat/tuple.html
To return all nondecreasing sequences of length d with entries between 0 and n-1, you may type:
UnorderedTuples(range(n),d)
This returns the nondecreasing sequence as a list. I needed an immutable object (because the sequences would become keys of a dictionary). So I used the "tuple" method to turn the lists into tuples:
immutables = []
for s in UnorderedTuples(range(n),d):
    immutables.append(tuple(s))
return immutables
And I also wrote a method which picks out only the increasing sequences:
def isIncreasing(list):
    for i in range(len(list) - 1):
        if list[i] >= list[i+1]:
            return False
    return True
The method that returns only strictly increasing sequences would look like:
immutables = []
for s in UnorderedTuples(range(n),d):
    if isIncreasing(s):
        immutables.append(tuple(s))
return immutables

Optimized recalculating all pairs shortest path when removing vertexes dynamically from an undirected graph

I use the following Dijkstra implementation to calculate all-pairs shortest paths in an undirected graph. After calling calculateAllPaths(), dist[i][j] contains the shortest path length between i and j (or Integer.MAX_VALUE if no such path is available).
The problem is that some vertices of my graph are removed dynamically, and I currently have to recalculate all paths from scratch to update the dist matrix. I'm seeking a solution that optimizes update speed by avoiding unnecessary calculations when a vertex is removed from the graph. I have already searched for a solution and I know there are algorithms such as LPA* for this, but they seem very complicated and I suspect a simpler approach may solve my problem.
public static void calculateAllPaths()
{
    for (int j = graph.length / 2 + graph.length % 2; j >= 0; j--)
    {
        calculateAllPathsFromSource(j);
    }
}

public static void calculateAllPathsFromSource(int s)
{
    final boolean visited[] = new boolean[graph.length];
    for (int i = 0; i < dist.length; i++)
    {
        if (i == s)
        {
            continue;
        }
        //visit next node
        int next = -1;
        int minDist = Integer.MAX_VALUE;
        for (int j = 0; j < dist[s].length; j++)
        {
            if (!visited[j] && dist[s][j] < minDist)
            {
                next = j;
                minDist = dist[s][j];
            }
        }
        if (next == -1)
        {
            continue;
        }
        visited[next] = true;
        for (int v = 0; v < graph.length; v++)
        {
            if (v == next || graph[next][v] == -1)
            {
                continue;
            }
            int md = dist[s][next] + graph[next][v];
            if (md < dist[s][v])
            {
                dist[s][v] = dist[v][s] = md;
            }
        }
    }
}
If you know that vertices are only being removed dynamically, then instead of just storing the best-path matrix dist[i][j], you could also store the vertex sequence of each such path. Say, instead of dist[i][j] you make a custom class myBestPathInfo, and each instance in the array myBestPathInfo[i][j] contains members for the best distance as well as the sequence of vertices on the best path. Preferably, the best path is described as an ordered set of vertex objects, where the latter are of reference type and unique for each vertex (though shared between several myBestPathInfo instances). Such objects could include a boolean property isActive (true/false).
Whenever a vertex is removed, you traverse the stored best paths for each vertex-vertex pair to check whether any vertex on the path has been deactivated. Then, only for the broken paths (those containing deactivated vertices) do you re-run Dijkstra's algorithm, as sketched below.
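A minimal sketch of that bookkeeping, under the assumption that each pair's best path is stored as a list of shared vertex objects (all class, field and method names here are illustrative, not from the question's code):
import java.util.List;

// Sketch only: assumes the best path for every (i, j) pair is stored as a list of
// shared VertexInfo objects, so deactivating a vertex marks every path through it as stale.
class VertexInfo {
    final int id;
    boolean isActive = true;
    VertexInfo(int id) { this.id = id; }
}

class BestPathInfo {
    int distance;              // current best distance for this (i, j) pair
    List<VertexInfo> path;     // vertices on the current best path (shared objects)
}

class AllPairsPaths {
    BestPathInfo[][] best;     // best[i][j] for all pairs
    VertexInfo[] vertices;     // one shared object per vertex

    // Called when vertex v is removed from the graph.
    void removeVertex(int v) {
        vertices[v].isActive = false;
        for (int i = 0; i < best.length; i++) {
            for (int j = 0; j < best[i].length; j++) {
                BestPathInfo info = best[i][j];
                if (info == null || info.path == null) continue;
                // Only pairs whose stored path runs through a deactivated vertex need recomputing.
                for (VertexInfo u : info.path) {
                    if (!u.isActive) {
                        recompute(i, j);   // placeholder: re-run Dijkstra for this pair/source
                        break;
                    }
                }
            }
        }
    }

    void recompute(int i, int j) {
        // e.g. re-run something like calculateAllPathsFromSource(i) from the question,
        // restricted to the vertices that are still active
    }
}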
Another solution would be to solve the shortest path for all pairs using linear programming (LP) techniques. A removed vertex can easily be implemented as an additional constraint in your program (e.g. flow into the vertex <= 0 and flow out of the vertex <= 0), after which the re-solving of the shortest path LPs can use the previous optimal solution as a basic feasible solution (BFS) in the dual LPs. This property holds since adding a constraint in the primal LP is equivalent to adding a variable in the dual; hence, the previously optimal primal BFS will remain feasible in the dual after the additional constraints (effectively warm-starting the simplex solver for the LPs).

Lucene: Iterate all entries

I have a Lucene index which I would like to iterate over (for a one-time evaluation at the current stage in development).
I have 4 documents, each with a few hundred thousand up to a million entries, which I want to iterate over to count the number of words for each entry (~2-10) and calculate the frequency distribution.
What I am doing at the moment is this:
for (int i = 0; i < reader.maxDoc(); i++) {
    if (reader.isDeleted(i))
        continue;
    Document doc = reader.document(i);
    Field text = doc.getField("myDocName#1");
    String content = text.stringValue();
    int wordLen = countNumberOfWords(content);
    //store
}
So far, it is iterating over something. Debugging confirms that it is at least operating on the terms stored in the documents, but for some reason it only processes a small part of the stored terms. What am I doing wrong? I simply want to iterate over all documents and everything that is stored in them.
Firstly you need to ensure you index with TermVectors enabled
doc.add(new Field(TITLE, page.getTitle(), Field.Store.YES, Field.Index.ANALYZED, TermVector.WITH_POSITIONS_OFFSETS));
Then you can use IndexReader.getTermFreqVector to count terms
TopDocs res = indexSearcher.search(YOUR_QUERY, null, 1000);
// iterate over the documents in res, omitted for brevity
reader.getTermFreqVector(res.scoreDocs[i].doc, YOUR_FIELD, new TermVectorMapper() {
    public void map(String termval, int freq, TermVectorOffsetInfo[] offsets, int[] positions) {
        // increment the frequency count of termval by freq
        freqs.increment(termval, freq);
    }
    public void setExpectations(String arg0, int arg1, boolean arg2, boolean arg3) {}
});