Splitting a Lucene index into two halves

What is the best way to split an existing Lucene index into two halves, i.e. so that each split contains half of the total number of documents in the original index?

The easiest way to split an existing index (without reindexing all the documents) is to:
Make another copy of the existing index (i.e. cp -r myindex mycopy)
Open the first index, and delete half the documents (range 0 to maxDoc / 2)
Open the second index, and delete the other half (range maxDoc / 2 to maxDoc)
Optimize both indices
This is probably not the most efficient way, but it requires very little coding to do.
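For reference, a rough sketch of those steps against the Lucene.NET 3.x API (the helper name and directory handling are illustrative, not from the original answer):
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;
using Lucene.Net.Store;
using Lucene.Net.Util;

// Sketch: given two copies of the same index, keep the lower half of the
// documents in one copy and the upper half in the other, then optimize.
static void KeepHalf(Directory dir, bool keepLowerHalf)
{
    using (var reader = IndexReader.Open(dir, false)) // false = writable reader
    {
        int mid = reader.MaxDoc / 2;
        for (int i = 0; i < reader.MaxDoc; i++)
        {
            bool inLowerHalf = i < mid;
            if (inLowerHalf != keepLowerHalf)
                reader.DeleteDocument(i); // marks the doc as deleted
        }
    }
    // Optimize merges segments and physically drops the deleted documents.
    var analyzer = new StandardAnalyzer(Version.LUCENE_30);
    using (var writer = new IndexWriter(dir, analyzer, false, IndexWriter.MaxFieldLength.UNLIMITED))
    {
        writer.Optimize();
    }
}
// usage: KeepHalf(copy1, true); KeepHalf(copy2, false);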

Recent versions of Lucene have dedicated tools to do this (IndexSplitter and MultiPassIndexSplitter, under contrib/misc).
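If your version includes them, the command-line invocation looks roughly like the following (jar names and paths are placeholders; check the tool's usage output for your release):
java -cp lucene-core.jar:lucene-misc.jar org.apache.lucene.index.MultiPassIndexSplitter -out /path/to/out -num 2 /path/to/index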

A fairly robust mechanism is to use a checksum of the document, modulo the number of indexes, to decide which index it will go into.
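A minimal sketch of that idea (the key argument and helper name are hypothetical):
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical sketch: route a document to one of numIndexes target
// indexes based on a stable checksum of its key field.
static int ChooseIndex(string docKey, int numIndexes)
{
    using (var md5 = MD5.Create())
    {
        byte[] hash = md5.ComputeHash(Encoding.UTF8.GetBytes(docKey));
        uint bucket = BitConverter.ToUInt32(hash, 0);
        return (int)(bucket % (uint)numIndexes);
    }
}
Because the checksum is stable, the same document key always lands in the same index, so the split is reproducible and re-runnable.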

This question was one of the first I found when I was researching answers to this problem, so I'm leaving my solution here for future generations. In my case, I needed to split my index along specific lines, not arbitrarily down the middle or into thirds or what have you. This is a C# solution using Lucene 3.0.3.
My app's index is over 300GB in size, which was becoming a little unmanageable. Each document in the index is associated with one of the manufacturing plants that use the app. There is no business reason for one plant to ever search another plant's data, so I needed to cleanly divide the index along those lines. Here's the code I wrote to do so:
var distinctPlantIDs = databaseRepo.GetDistinctPlantIDs();
var sourceDir = GetOldIndexDir();
foreach (var plantID in distinctPlantIDs)
{
    var query = new TermQuery(new Term("PlantID", plantID.ToString()));
    var targetDir = GetNewIndexDirForPlant(plantID); //returns a unique directory where this plant's index will go

    //read each plant's documents and write them to the new index
    using (var analyzer = new StandardAnalyzer(Version.LUCENE_30, CharArraySet.EMPTY_SET))
    using (var sourceSearcher = new IndexSearcher(sourceDir, true)) //true = read-only
    using (var destWriter = new IndexWriter(targetDir, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED))
    {
        var numHits = sourceSearcher.DocFreq(query.Term);
        if (numHits <= 0) continue;
        var hits = sourceSearcher.Search(query, numHits).ScoreDocs;
        foreach (var hit in hits)
        {
            var doc = sourceSearcher.Doc(hit.Doc);
            destWriter.AddDocument(doc);
        }
        destWriter.Optimize();
        destWriter.Commit();
    }

    //delete the documents out of the old index
    using (var analyzer = new StandardAnalyzer(Version.LUCENE_30, CharArraySet.EMPTY_SET))
    using (var sourceWriter = new IndexWriter(sourceDir, analyzer, false, IndexWriter.MaxFieldLength.UNLIMITED))
    {
        sourceWriter.DeleteDocuments(query);
        sourceWriter.Commit();
    }
}
The part that deletes the records out of the old index is there because, in my case, one plant's records took up the majority of the index (over two thirds). So in my real version there is some extra code to handle that plant last; instead of splitting it out like the others, it optimizes the remaining index (which at that point is just that plant) and then moves it to its new directory.
Anyway, hope this helps someone out there.

Related

Lucene calculate term vectors for existing index

With Lucene.net I would like to get the term vectors as described in this stackoverflow question.
The problem is, the index is already generated with the field indexed and stored, but without term vectors.
FieldType type = new FieldType();
type.setIndexed(true);
type.setStored(true);
type.setStoreTermVectors(false);
Theoretically, it should be possible to re-calculate the term vectors for each document and then store them in the index.
Do you know how this could be possible, without deleting the complete Lucene index?
As mentioned in my comments in the question, you can generate term vector data on-the-fly, which may help you to avoid a complete rebuild of your indexed data.
In my scenario, I want to find the offset positions of my search term in the matched document.
I don't want to oversell this approach - it's absolutely not a substitute for re-indexing - but if your queries are basic, it may help.
Step 1: Perform whatever query you are currently performing.
For each document in the list of hits, you will then need to re-process the relevant field from that document - so, either you already have the field data stored in your existing index, or you will need to retrieve it from its original source.
Step 2: For each such field, you can re-use the same analyzer to build a token stream on-the-fly. The token stream can be configured with different attributes, such as:
token attributes
offset attributes
and others (see here)
Example:
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;

const LuceneVersion AppLuceneVersion = LuceneVersion.LUCENE_48;

String? fieldName = null;
String fieldContent = "Foo Bar Baz Bar Bat";
String searchTerm = "bar";

var analyzer = new StandardAnalyzer(AppLuceneVersion);
var ts = analyzer.GetTokenStream(fieldName, fieldContent);
var charTermAttr = ts.AddAttribute<ICharTermAttribute>();
var offsetAttr = ts.AddAttribute<IOffsetAttribute>();
try
{
    ts.Reset();
    Console.WriteLine("");
    Console.WriteLine("Token: " + searchTerm);
    while (ts.IncrementToken())
    {
        if (searchTerm.Equals(charTermAttr.ToString()))
        {
            var start = offsetAttr.StartOffset;
            var end = offsetAttr.EndOffset;
            Console.WriteLine(String.Format(" > offset: {0}-{1}", start, end));
        }
    }
    ts.End();
}
finally
{
    // dispose the token stream so the analyzer can be reused
    ts.Dispose();
}
The above example assumes one of the hits from step 1 was a field containing "Foo Bar Baz Bar Bat" - with a search term of bar.
The output generated is:
Token: bar
> offset: 4-7
> offset: 12-15
So, as you can see, you are not re-executing a query - you are just re-processing a token stream. The more complex the original search term is, the harder it will be to make this approach work the way you probably need it to.

How to get the matching spans of a Span Term Query in Lucene 5?

In Lucene, to get the words around a term it is advised to use span queries. There is a good walkthrough at http://lucidworks.com/blog/accessing-words-around-a-positional-match-in-lucene/
The spans are supposed to be accessed using the getSpans() method.
SpanTermQuery fleeceQ = new SpanTermQuery(new Term("content", "fleece"));
Spans spans = fleeceQ.getSpans(searcher.getIndexReader());
Then in Lucene 4 the API changed and the getSpans() method got more complex, and finally, in the latest Lucene release (5.3.0), this method was removed (apparently moved to the SpanWeight class).
So, which is the current way of accessing spans matched by a span term query?
The way to do it would be as follows.
// "searcher" is an IndexSearcher built over the same reader
LeafReader pseudoAtomicReader = SlowCompositeReaderWrapper.wrap(reader);
Term term = new Term("field", "fox");
SpanTermQuery spanTermQuery = new SpanTermQuery(term);
SpanWeight spanWeight = spanTermQuery.createWeight(searcher, false);
Spans spans = spanWeight.getSpans(pseudoAtomicReader.getContext(), SpanWeight.Postings.POSITIONS);
Support for iterating over the spans via spans.next() is also gone in version 5.3 of Lucene. To iterate over the spans you can do:
int nxtDoc = 0;
while ((nxtDoc = spans.nextDoc()) != Spans.NO_MORE_DOCS) {
    System.out.println(spans.toString());
    int id = nxtDoc;
    System.out.println("doc_id=" + id);
    Document doc = reader.document(id);
    System.out.println(doc.getField("field"));
    System.out.println(spans.nextStartPosition());
    System.out.println(spans.endPosition());
}

Lucene phrase query with wildcards

I came up with a solution to programmatically create a query to search for a phrase with wildcards, using this code:
public static Query createPhraseQuery(String[] phraseWords, String field) {
    SpanQuery[] queryParts = new SpanQuery[phraseWords.length];
    for (int i = 0; i < phraseWords.length; i++) {
        WildcardQuery wildQuery = new WildcardQuery(new Term(field, phraseWords[i]));
        queryParts[i] = new SpanMultiTermQueryWrapper<WildcardQuery>(wildQuery);
    }
    return new SpanNearQuery(queryParts, //words
            0,    //max distance
            true  //exact order
    );
}
An example creation and a call to the toString() method will output:
String[] phraseWords = new String[]{"foo*", "b*r"};
Query phraseQuery = createPhraseQuery(phraseWords, "text");
System.out.println(phraseQuery.toString());
outputs:
spanNear([SpanMultiTermQueryWrapper(text:foo*), SpanMultiTermQueryWrapper(text:b*r)], 0, true)
This works great, and fast enough for most cases. For instance, if I create such a query and search with it, it will output the desired results, for example:
Sentence with foo bar.
Foolies beer drinkers.
...
And not something like:
Bar fooes.
Foo has bar.
I mentioned that the query works fast enough in most cases. Currently I have an index of approximately 200GB, and the average search time is between 0.1 and 3 seconds, depending on many factors such as caching and the size of the subsets of documents matching each single word in the phrase, since Lucene will perform set intersections between the found terms.
Example:
Suppose I want to query the phrase "an* karenjin*" (which I will split into ["an*", "karenjin*"] and then turn into a query using the createPhraseQuery method), and I want it to match sentences containing "ana karenjina", "ani karenjinoj", "ane karenjine", ... (different cases due to Croatian grammar).
This query is so slow that I haven't waited long enough to get results (over 1h), and it sometimes causes a GC overhead limit exceeded exception.
This behaviour is somewhat expected, since "an*" itself matches a huge number of documents. I am aware that I could query "an? karenjin*", which gives results in 30-40 sec (faster, but still slow).
This is where I am confused.
If I query just "karenjin*" it gives results in 1 sec. Therefore I tried querying "an* karenjin*" with a "karenjin*" filter, using WildcardQuery and QueryWrapperFilter. And it is still unacceptably slow (I killed the process before it returned anything).
The documentation says that a Filter reduces the search space of a Query. So I tried to use the filter:
Filter filter = new QueryWrapperFilter(new WildcardQuery(new Term("text", "karenjin*")));
And query:
Query query = createPhraseQuery(new String[]{"an*", "karenjin*"}, "text");
Then search (after several warm-up queries):
Sort sort = new Sort(new SortField("insertTime", SortField.Type.STRING, true));
TopDocs docs = searcher.search(query, filter, 100, sort);
OK, so what is my question?
How come querying:
Query query = new WildcardQuery(new Term("text", "karenjin*"));
on its own is fast, but using the Filter described above is still slow?
Yes, wildcards can be performance hogs, especially if they match a lot of terms, but what you describe does seem surprisingly bad. It's hard to say for sure why that is occurring, but here is an attempt.
I'll assume:
Query query = new WildcardQuery(new Term("text", "an*"));
on its own, is performing very badly, as described. Since the wildcards you are looking for are both prefix-style queries, it's a better idea to use a PrefixQuery instead.
Query query = new PrefixQuery(new Term("text", "an"));
Though I don't think that will make much of a difference, if any at all. What might make a difference is changing your rewrite method. You could try limiting the number of terms the query is rewritten into:
MultiTermQuery query = new PrefixQuery(new Term("text", "an"));
//or
//MultiTermQuery query = new WildcardQuery(new Term("text", "an*"));
//keep only the 10 top-scoring terms when expanding the prefix
query.setRewriteMethod(new MultiTermQuery.TopTermsScoringBooleanQueryRewrite(10));

Lucene fuzzy search on a phrase (FuzzyQuery + SpanQuery)

I am looking for a way to code a Lucene fuzzy query that searches all the documents which are relevant to an exact phrase. If I search for "mosa employee appreciata", a document containing "most employees appreciate" should be returned as a result.
I tried to use:
FuzzyQuery query = new FuzzyQuery(new Term("contents", "mosa employee appreicata"));
Unfortunately, it empirically doesn't work. The FuzzyQuery employs the edit distance; theoretically, "mosa employee appreciata" should be matched with "most employees appreciate", provided the appropriate distance is given. It seems a bit odd.
Any clues? Thank you.
There are two likely problems here. First: I'm guessing the "contents" field is being analyzed, so that "most employees appreciate" is not one term, but rather three. Querying it as a single term is not appropriate in this case.
However, even if the content listed were a single term, a second likely problem is that there is too much distance between the terms to get a match. The Damerau-Levenshtein distance between "mosa employee appreicata" and "most employees appreciate" is 4 (the approximate distance, incidentally, between my average first shot at spelling "Damerau-Levenshtein" and the correct spelling). FuzzyQuery, as of 4.0, handles edit distances of no more than 2, due to performance constraints and the assumption that larger distances are usually not particularly relevant.
If you need to perform a phrase query with fuzzy terms, you should look into either MultiPhraseQuery, or combine a set of SpanQueries (especially SpanMultiTermQueryWrapper and SpanNearQuery) to meet your needs.
SpanQuery[] clauses = new SpanQuery[3];
clauses[0] = new SpanMultiTermQueryWrapper<FuzzyQuery>(new FuzzyQuery(new Term("contents", "mosa")));
clauses[1] = new SpanMultiTermQueryWrapper<FuzzyQuery>(new FuzzyQuery(new Term("contents", "employee")));
clauses[2] = new SpanMultiTermQueryWrapper<FuzzyQuery>(new FuzzyQuery(new Term("contents", "appreicata")));
SpanNearQuery query = new SpanNearQuery(clauses, 0, true);
And since none of the individual terms have an edit distance greater than 2, this should be more effective.
ComplexPhraseQueryParser handles fuzzy searching on phrase words - i.e., you specify which words should be fuzzy searched and which should not. It works as follows:
Query query = new ComplexPhraseQueryParser("content", analyzer)
.parse("some test~ query~ blah blah");
Seems to work nicely. Not sure about performance, but it appears to work well on small data sets.
I had some (very small) mileage with the following:
String[] searchTerms = searchString.split(" ");
FuzzyLikeThisQuery fltq = new FuzzyLikeThisQuery(searchTerms.length, new StandardAnalyzer());
Arrays.stream(searchTerms)
      .forEach(term -> fltq.addTerms(term, FIELD, SIMILARITY_IN_EDITS, PREFIX_LENGTH));
This query matches strings that are far too distant from the indexed content. The strings that don't match are ones where each of the terms is more than 2 edits away from the terms used in the indexed content.
Please use at your own peril.
The answer from femtoRgon is great! Thank you.
There is another way to solve this problem.
//declare a MultiPhraseQuery
MultiPhraseQuery childrenInOrder = new MultiPhraseQuery();
//use FuzzyTermEnum to enumerate your query string
FuzzyTermEnum fuzzyEnumeratedTerms1 = new FuzzyTermEnum(reader, new Term(searchField, "mosa"));
FuzzyTermEnum fuzzyEnumeratedTerms2 = new FuzzyTermEnum(reader, new Term(searchField, "employee"));
FuzzyTermEnum fuzzyEnumeratedTerms3 = new FuzzyTermEnum(reader, new Term(searchField, "appreicata"));
//this basically pulls the possible terms out of the index
Term termHolder1 = fuzzyEnumeratedTerms1.term();
Term termHolder2 = fuzzyEnumeratedTerms2.term();
Term termHolder3 = fuzzyEnumeratedTerms3.term();
//put the possible terms into the MultiPhraseQuery,
//falling back to the literal term when no fuzzy match was found
if (termHolder1 == null) {
    childrenInOrder.add(new Term(searchField, "mosa"));
} else {
    childrenInOrder.add(fuzzyEnumeratedTerms1.term());
}
if (termHolder2 == null) {
    childrenInOrder.add(new Term(searchField, "employee"));
} else {
    childrenInOrder.add(fuzzyEnumeratedTerms2.term());
}
if (termHolder3 == null) {
    childrenInOrder.add(new Term(searchField, "appreicata"));
} else {
    childrenInOrder.add(fuzzyEnumeratedTerms3.term());
}
//close the enums - it is important to close them
fuzzyEnumeratedTerms1.close();
fuzzyEnumeratedTerms2.close();
fuzzyEnumeratedTerms3.close();

Lucene.net IndexWriter after UpdateDocument doubles the size of the index, even with Optimize?

I'm creating the index in a normal way:
var directory = FSDirectory.Open(...);
var analyzer = ...
var indexWriter = new IndexWriter(directory, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);
indexWriter.SetWriteLockTimeout(30000);
indexWriter.AddDocument(doc1);
indexWriter.AddDocument(doc2);
indexWriter.AddDocument(...);
indexWriter.Commit();
indexWriter.Optimize();
indexWriter.Close();
This creates an index of 5.8 MB.
Now I need to update exactly 2 documents, with 1 word added to each of them, so the size of the index should increase either by a very small amount or not at all:
var indexWriter = new IndexWriter(directory, analyzer, false, IndexWriter.MaxFieldLength.UNLIMITED);
indexWriter.SetWriteLockTimeout(30000);
//UpdateDocument takes a Term identifying the existing document (field/value here are illustrative)
indexWriter.UpdateDocument(new Term("id", "doc1"), doc1);
indexWriter.UpdateDocument(new Term("id", "doc2"), doc2);
indexWriter.Commit();
indexWriter.Optimize();
indexWriter.Close();
This operation DOUBLES the size of the index: it leaves the _0.cfs file at the size the index was previously (5.8 MB) and creates a whole new index of the same size in _2.xxx files. So for a two-document, one-word-each change, it doubles the index!
It also keeps doing this if I repeat the operation, so the index just doubles forever.
My thought was that the Optimize call should optimize the index, not cause things like this?
How do I stop it from doubling my index?
Thanks!
This is usually caused by having IndexReaders/IndexSearchers open on the index while you optimize. IndexReaders see a snapshot of the index from the moment they were opened, so they keep a lock on the files, and the IndexWriter cannot remove them when it is closed.
After optimize, you should refresh your IndexReaders/IndexSearchers, either by re-creating them or by using the Reopen() method on IndexReader. Once the IndexReaders/IndexSearchers are refreshed, if you create an IndexWriter and close it immediately, you should see the files disappear.
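In Lucene.NET 3.x that refresh might look like this sketch (assuming a long-lived reader field; Reopen() returns a new instance only if the index has changed):
// Refresh a long-lived reader so its snapshot (and the old segment
// files it pins) can be released after Optimize.
var refreshed = reader.Reopen();
if (!ReferenceEquals(refreshed, reader))
{
    reader.Dispose(); // releases the files held by the old snapshot
    reader = refreshed;
}
// Opening and then closing an IndexWriter at this point lets it delete
// the now-unreferenced files.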
That being said, if you decide to optimize live indexes (which you should only do after deleting lots of documents), you should always expect the index to temporarily grow to 3X its "normal" size.