Lucene calculate term vectors for existing index - lucene

With Lucene.net I would like to get the term vectors as described in this stackoverflow question.
The problem is, the index is already generated with the field indexed and stored, but without term vectors.
FieldType type = new FieldType();
type.setIndexed(true);
type.setStored(true);
type.setStoreTermVectors(false);
Theoretically, it should be possible to re-calculate the term vectors for each document and then store it in the index.
Do you know how this could be possible, without deleting the complete Lucene index?

As mentioned in my comments in the question, you can generate term vector data on-the-fly, which may help you to avoid a complete rebuild of your indexed data.
In my scenario, I want to find the offset positions of my search term in the matched document.
I don't want to oversell this approach - it's absolutely not a substitute for re-indexing - but if your queries are basic, it may help.
Step 1: Perform whatever query you are currently performing.
For each document in the list of hits, you will then need to re-process the relevant field from that document - so, either you already have the field data stored in your existing index, or you will need to retrieve it from its original source.
Step 2: For each such field, you can re-use the same analyzer to build a token stream on-the-fly. The token stream can be configured with different attributes, such as:
token attributes
offset attributes
and others (see here)
Example:
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;
const LuceneVersion AppLuceneVersion = LuceneVersion.LUCENE_48;
String? fieldName = null;
String fieldContent = "Foo Bar Baz Bar Bat";
String searchTerm = "bar";
var analyzer = new StandardAnalyzer(AppLuceneVersion);
var ts = analyzer.GetTokenStream(fieldName, fieldContent);
var charTermAttr = ts.AddAttribute<ICharTermAttribute>();
var offsetAttr = ts.AddAttribute<IOffsetAttribute>();
try
{
ts.Reset();
Console.WriteLine("");
Console.WriteLine("Token: " + searchTerm);
while (ts.IncrementToken())
{
if (searchTerm.Equals(charTermAttr.ToString()))
{
var start = offsetAttr.StartOffset;
var end = offsetAttr.EndOffset;
Console.WriteLine(String.Format(" > offset: {0}-{1}", start, end));
}
}
ts.End();
}
catch (Exception)
{
throw;
}
The above example assumes one of the hits from step 1 was a field containing "Foo Bar Baz Bar Bat" - with a search term of bar.
The output generated is:
Token: bar
> offset: 4-7
> offset: 12-15
So, as you can see, you are not re-executing a query - you are just re-processing a token stream. The more complex the original search term is, the harder it will be to make this approach work the way you probably need it to.

Related

Get the position of matches in Lucene

Is it possible to find the position of words with a match when the indexed field isn't stored?
for example:
Query: "fox over dog"
Indexed text of matched doc: "The quick brown fox jumps over the lazy dog"
What I want: [4,6,9]
Note1: I know text can be highlighted using Lucene but I want the position of the words
Note2: The field isn't set to be stored by Lucene**
I have not done this for practical purposes - just to give a pseudo code and pointers that you can experiment with to reach to correct solution.
Also, you have not specified your Lucene version, I am using Lucene 6.0.0 with Java.
1.While Indexing, set these two booleans for your specific field for which positions are desired. Lucene will be able to give that data if indexing has stored that information otherwise not.
FieldType txtFieldType = new FieldType(
TextField.TYPE_NOT_STORED);
txtFieldType.setStoreTermVectors(true);
txtFieldType.setStoreTermVectorPositions(true);
2.At your searcher, you need to use Terms , TermsEnum & PostingsEnum like below,
`Terms terms = searcher.getIndexReader().getTermVector(hit.doc, "TEXT_FIELD");`
if(terms.hasPositions()){
TermsEnum termsEnum = terms.iterator();
PostingsEnum postings = null;
while(termsEnum.next() != null){
postings = termsEnum.postings(postings ,PostingsEnum.ALL);
while(postings.nextDoc() != PostingsEnum.NO_MORE_DOCS){
System.out.println(postings.nextPosition());
}
You need to do some of your own analysis to arrive at the data that you need but your first need to save meta data as pointed in point # 1.
}
}
searcher is IndexSearcher instance, hit.doc is doc id and hit is a ScoreDoc .

Search by exact words in a phrase using Umbraco Examine

I have some description field per content and those are:
For content1:
The quick brown fox jumps over the lazy dog. And the lazy dog is good.
For content2:
The lazy fog is crazy.
Now, when I use keyword = lazy dog, I want to give result as content1 and not content2
I tried like:
BaseSearchProvider searcher = ExamineManager.Instance.SearchProviderCollection["MySearch"];
ISearchCriteria criteria =
searcher.CreateSearchCriteria()
.GroupedAnd( new List<string> { "description" }, "lazy dog") )
.Compile();
ISearchResults result = searcher.Search( criteria );
But it didn't gave me desired results, it give me results: content1 and content2.
What should I do in order to get as content1 result ?
By default examine is compiling this query to:
+(+description:lazy dog)
and based on it it's returning the results with both: lazy and dog words.
What you want to achieve is:
+(+description:"lazy dog")
First of what you need to try is to escape the phrase. In your case it will be:
BaseSearchProvider searcher = ExamineManager.Instance.SearchProviderCollection["MySearch"];
ISearchCriteria criteria =
searcher.CreateSearchCriteria()
.GroupedAnd( new List<string> { "description" }, "lazy dog".Escape()) )
.Compile();
ISearchResults result = searcher.Search( criteria );
Can't test it now, but there were some problems with it in the past from what I remember. The second option and a life saver for you, may be building the search query manually and using the raw query.
BaseSearchProvider searcher = ExamineManager.Instance.SearchProviderCollection["MySearch"];
ISearchCriteria criteria = searcher.CreateSearchCriteria();
var query = criteria.RawQuery("+description:\"lazy dog\"");
ISearchResults result = searcher.Search( query );
And it should return you correct = matched result only. Personally, I've used also some boosting of specific words to just point some results higher in the score list, but if you want to have only matched items, try above solutions and let me know if it helped you.
If you want to deal with more than one property, you can either use some fluent API methods like GroupedAnd or GroupedOr (depending of the desired behaviour of search) or build more advanced raw query.
For the first option, check Grouped Operations documentation: https://github.com/Shazwazza/Examine/wiki/Grouped-Operations.
For the second scenario it would be the best to analyze how it's done e.g. in ezSearch package (which btw. is awesome!): https://github.com/umco/umbraco-ezsearch/blob/master/Src/Our.Umbraco.ezSearch/Web/UI/Views/MacroPartials/ezSearch.cshtml.

Lucene fuzzy search on a phrase (FuzzyQuery + SpanQuery)

I am looking for a way of coding the lucene fuzzy query that searches all the documents, which are relevant to an exact phrase. If I search "mosa employee appreciata", a document contains "most employees appreciate" will be returned as the result.
I tried to use:
FuzzyQeury = new FuzzyQuery(new Term("contents","mosa employee appreicata"))
Unfortunately, it empirically doesn't work. The FuzzyQuery employs the editor distance, theoretically, "mosa employee appreciata" should be matched with "most employees appreciate" provide the appropriate distance is given. It seems a bit odd.
Any clues? Thank you.
There are two likely problems here. First: I'm guessing the "contents" field is being analyzed such that "most employees apreciate" is not a term, but rather three terms. Defining as a single term is not appropriate in this case.
However, even if the content listed is a single term, a second likely problem we have is that there is too much distance between the terms to get a match. The Damerau-Levenshtein distance between mosa employee appreicata and most employees appreciate is 4 (the approximate distance, incidentally, between my average first shot at spelling
"Damerau-Levenshtein" and the correct spelling). Fuzzy Query, as of 4.0, handles edit distances of no more than 2, due to performance constraints, and the assumption that larger distances are usually not particularly relevant.
If you need to perform a phrase query with fuzzy terms, you should look into either MultiPhraseQuery, or combine a set of SpanQueries (especially SpanMultiTermQueryWrapper and SpanNearQuery) to meet your needs.
SpanQuery[] clauses = new SpanQuery[3];
clauses[0] = new SpanMultiTermQueryWrapper(new FuzzyQuery(new Term("contents", "mosa")));
clauses[1] = new SpanMultiTermQueryWrapper(new FuzzyQuery(new Term("contents", "employee")));
clauses[2] = new SpanMultiTermQueryWrapper(new FuzzyQuery(new Term("contents", "appreicata")));
SpanNearQuery query = new SpanNearQuery(clauses, 0, true)
And since none of the individual terms have an edit distance greater than 2, this should be more effective.
ComplexPhraseQueryParser handles fuzzy searching on phrase words - i.e., specify the words that should be fuzzy searched and those that should not. Works as follows
Query query = new ComplexPhraseQueryParser("content", analyzer)
.parse("some test~ query~ blah blah");
Seems to work nicely. Not sure about performance, however but seems to work well on small data sets.
I had some (very small) millage with the following:
String[] searchTerms = searchString.split(" ");
FuzzyLikeThisQuery fltw = new FuzzyLikeThisQuery(searchTerms.length, new StandardAnalyzer());
Arrays.stream(searchTerms)
.forEach(term -> fltq.addTerms(term, FIELD, SIMILARITY_IN_EDITS, PREFIX_LENGTH);
This query matches far too distant strings with the index. String that don't match are ones where each of the terms are distant by more than 2 edits from the terms used in the indexed content.
Please use at your own peril.
The answer from femtoRgon is great! Thank you.
There is another way to solve this problem.
//declare a mutilphrasequery
MultiPhraseQuery childrenInOrder = new MultiPhraseQuery();
//user fuzzytermenum to enumerate your query string
FuzzyTermEnum fuzzyEnumeratedTerms1 = new FuzzyTermEnum(reader, new Term(searchField,"mosa"));
FuzzyTermEnum fuzzyEnumeratedTerms2 = new FuzzyTermEnum(reader, new Term(searchField,"employee"));
FuzzyTermEnum fuzzyEnumeratedTerms3 = new FuzzyTermEnum(reader, new Term(searchField,"appreicata"));
//this basically pull out the possbile terms from the index
Term termHolder1 = fuzzyEnumeratedTerms1.term();
Term termHolder2 = fuzzyEnumeratedTerms2.term();
Term termHolder3 = fuzzyEnumeratedTerms3.term();
//put the possible terms into multiphrasequery
if (termHolder1==null){
childrenInOrder.add(new Term(searchField,"mosa"));
}else{
childrenInOrder.add(fuzzyEnumeratedTerms1.term());
}
if (termHolder2==null){
childrenInOrder.add(new Term(searchField,"employee"));
}else{
childrenInOrder.add(fuzzyEnumeratedTerms2.term());
}
if (termHolder3==null){
childrenInOrder.add(new Term(searchField,"appreicata"));
}else{
childrenInOrder.add(fuzzyEnumeratedTerms3.term());
}
//close it - it is important to close it
fuzzyEnumeratedTerms1.close();
fuzzyEnumeratedTerms2.close();
fuzzyEnumeratedTerms3.close();

splitting lucene index into two halves

what is the best way to split an existing Lucene index into two halves i.e. each split should contain half of the total number of documents in the original index
The easiest way to split an existing index (without reindexing all the documents) is to:
Make another copy of the existing index (i.e. cp -r myindex mycopy)
Open the first index, and delete half the documents (range 0 to maxDoc / 2)
Open the second index, and delete the other half (range maxDoc / 2 to maxDoc)
Optimize both indices
This is probably not the most efficient way, but it requires very little coding to do.
Recent versions of Lucene have a dedicated tool to do this (IndexSplitter and MultiPassIndexSplitter under contrib/misc).
A fairly robust mechanism is to use a checksum of the document, modulo the number of indexes, to decide which index it will go into.
This question was one of the first I found when I was researching answers to this problem, so I'm leaving my solution here for future generations. In my case, I needed to split my index along specific lines, not arbitrarily down the middle or into thirds or what have you. This is a C# solution using Lucene 3.0.3.
My app's index is over 300GB in size, which was becoming a little unmanageable. Each document in the index is associated to one of the manufacturing plants that uses the app. There is no business reason that one plant would ever search for another plant's data, so I needed to cleanly divide the index along those lines. Here's the code I wrote to do so:
var distinctPlantIDs = databaseRepo.GetDistinctPlantIDs();
var sourceDir = GetOldIndexDir();
foreach (var plantID in distinctPlantIDs)
{
var query = new TermQuery(new Term("PlantID", plantID.ToString()));
var targetDir = GetNewIndexDirForPlant(plantID); //returns a unique directory where this plant's index will go
//read each plant's documents and write them to the new index
using (var analyzer = new StandardAnalyzer(Version.LUCENE_30, CharArraySet.EMPTY_SET))
using (var sourceSearcher = new IndexSearcher(sourceDir, true))
using (var destWriter = new IndexWriter(targetDir, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED))
{
var numHits = sourceSearcher.DocFreq(query.Term);
if (numHits <= 0) continue;
var hits = sourceSearcher.Search(query, numHits).ScoreDocs;
foreach (var hit in hits)
{
var doc = sourceSearcher.Doc(hit.Doc);
destWriter.AddDocument(doc);
}
destWriter.Optimize();
destWriter.Commit();
}
//delete the documents out of the old index
using (var analyzer = new StandardAnalyzer(Version.LUCENE_30, CharArraySet.EMPTY_SET))
using (var sourceWriter = new IndexWriter(sourceIndexDir, analyzer, false, IndexWriter.MaxFieldLength.UNLIMITED))
{
sourceWriter.DeleteDocuments(query);
sourceWriter.Commit();
}
}
That part that deletes the records out of the old index is there because in my case, one plant's records took up the majority of the index (over 2/3rds). So in my real version there is some extra code to do that plant last, and instead of splitting it out like the others it will optimize the remaining index (which is just that plant) and then move it to its new directory.
Anyway, hope this helps someone out there.

nutch field problem

I was using something like:
Field notdirectory = new Field("notdirectory","1", Field.Store.NO, Field.Index.UN_TOKENIZED);
and queries like "notdirectory:1" can be processed quite well all the time.
But recently I've changed the "Field.Store.NO, Field.Index.UN_TOKENIZED" to index a non-numeric string:
Field stateField = new Field("state","irn_" + state, Field.Store.NO, Field.Index.UN_TOKENIZED);
and queries like "state:irn_CA" can never fetch any results any more,even though I watch through hadoop logs that "irn_CA" is added to "state" field in fact.
So I doubt for Fields that satisfy "Field.Store.NO, Field.Index.UN_TOKENIZED",only numeric Fields can searchable,but I didn't see any documents about that.
So what's the true reason for this?
I think, you are using StandardAnalyzer for parsing the input query, which will tokenize your input query "irn_CA" into two tokens - "irn" and "CA". Since the index has "irn_CA" as single token, it won't match.
Try using KeywordAnalyzer for while searching. It will generate single token for the query string and match the indexed token correctly.
I think the searcher bean forces everything to lowercase...so make the state is in lower case when adding to the index:
Field stateField = new Field("state","irn_" + state.toLowerCase(), Field.Store.NO, Field.Index.UN_TOKENIZED);
and when you query: 'state:irn_ca' instead of 'state:irn_CA'.
I also note you prefixed with 'irn_' - good call, otherwise the highlighter flags up the the query.