Get the position of matches in Lucene


Is it possible to find the positions of the matching words when the indexed field isn't stored?
For example:
Query: "fox over dog"
Indexed text of the matched doc: "The quick brown fox jumps over the lazy dog"
What I want: [4,6,9]
Note 1: I know text can be highlighted using Lucene, but I want the positions of the words.
Note 2: The field isn't set to be stored by Lucene.

I have not done this in practice; what follows is pseudocode plus pointers that you can experiment with to reach a correct solution.
You have not specified your Lucene version; I am using Lucene 6.0.0 with Java.
1. While indexing, set these two flags on the field whose positions you want. Lucene can only return position data if it was recorded at index time.
FieldType txtFieldType = new FieldType(TextField.TYPE_NOT_STORED);
txtFieldType.setStoreTermVectors(true);
txtFieldType.setStoreTermVectorPositions(true);
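2. At search time, use Terms, TermsEnum and PostingsEnum as below:
Terms terms = searcher.getIndexReader().getTermVector(hit.doc, "TEXT_FIELD");
// terms is null when the document has no term vector for this field
if (terms != null && terms.hasPositions()) {
    TermsEnum termsEnum = terms.iterator();
    PostingsEnum postings = null;
    BytesRef term;
    while ((term = termsEnum.next()) != null) {
        postings = termsEnum.postings(postings, PostingsEnum.ALL);
        while (postings.nextDoc() != PostingsEnum.NO_MORE_DOCS) {
            // A term may occur several times; read one position per occurrence.
            int freq = postings.freq();
            for (int i = 0; i < freq; i++) {
                System.out.println(term.utf8ToString() + ": " + postings.nextPosition());
            }
        }
    }
}
You will need to do some analysis of your own to arrive at the data you need (for example, keeping only the terms that appear in the query), but you first need to save the metadata as described in point 1.
searcher is an IndexSearcher instance, hit is a ScoreDoc, and hit.doc is the document ID.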
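To make the target output concrete, here is a minimal plain-Java sketch of the filtering step. It is not Lucene code: it assumes simple whitespace tokenization and lowercasing in place of an analyzer, and the class and method names (MatchPositions, positions) are hypothetical. It numbers words from 1, as the question does; Lucene's own positions start at 0, so values read from a term vector would need shifting by one to match.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-Java illustration only; it does not use Lucene or its analyzers.
class MatchPositions {

    // Returns the 1-based positions of the words in text that also occur in query.
    static List<Integer> positions(String query, String text) {
        Set<String> wanted = new HashSet<>(
                Arrays.asList(query.toLowerCase(Locale.ROOT).trim().split("\\s+")));
        String[] words = text.toLowerCase(Locale.ROOT).trim().split("\\s+");
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            if (wanted.contains(words[i])) {
                result.add(i + 1); // 1-based, as in the question
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints [4, 6, 9]
        System.out.println(positions("fox over dog",
                "The quick brown fox jumps over the lazy dog"));
    }
}
```

With term vectors, the same filter would be applied to the terms returned by TermsEnum rather than to words split from the raw text.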

Related

Lucene calculate term vectors for existing index

With Lucene.NET I would like to get the term vectors as described in this Stack Overflow question.
The problem is that the index has already been generated with the field indexed and stored, but without term vectors.
FieldType type = new FieldType();
type.setIndexed(true);
type.setStored(true);
type.setStoreTermVectors(false);
Theoretically, it should be possible to recalculate the term vectors for each document and then store them in the index.
Do you know how this could be done without deleting the complete Lucene index?

As mentioned in my comments in the question, you can generate term vector data on-the-fly, which may help you to avoid a complete rebuild of your indexed data.
In my scenario, I want to find the offset positions of my search term in the matched document.
I don't want to oversell this approach - it's absolutely not a substitute for re-indexing - but if your queries are basic, it may help.
Step 1: Perform whatever query you are currently performing.
For each document in the list of hits, you will then need to re-process the relevant field from that document - so, either you already have the field data stored in your existing index, or you will need to retrieve it from its original source.
Step 2: For each such field, you can re-use the same analyzer to build a token stream on-the-fly. The token stream can be configured with different attributes, such as:
token attributes
offset attributes
and others (see here)
Example:
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;

const LuceneVersion AppLuceneVersion = LuceneVersion.LUCENE_48;
string? fieldName = null;
string fieldContent = "Foo Bar Baz Bar Bat";
string searchTerm = "bar";

var analyzer = new StandardAnalyzer(AppLuceneVersion);
// Dispose the token stream when finished with it.
using (var ts = analyzer.GetTokenStream(fieldName, fieldContent))
{
    var charTermAttr = ts.AddAttribute<ICharTermAttribute>();
    var offsetAttr = ts.AddAttribute<IOffsetAttribute>();
    ts.Reset();
    Console.WriteLine("");
    Console.WriteLine("Token: " + searchTerm);
    while (ts.IncrementToken())
    {
        // The analyzer lowercases tokens, so the search term must be lowercase too.
        if (searchTerm.Equals(charTermAttr.ToString()))
        {
            var start = offsetAttr.StartOffset;
            var end = offsetAttr.EndOffset;
            Console.WriteLine(String.Format(" > offset: {0}-{1}", start, end));
        }
    }
    ts.End();
}
The above example assumes one of the hits from step 1 was a field containing "Foo Bar Baz Bar Bat" - with a search term of bar.
The output generated is:
Token: bar
> offset: 4-7
> offset: 12-15
So, as you can see, you are not re-executing a query - you are just re-processing a token stream. The more complex the original search term is, the harder it will be to make this approach work the way you probably need it to.
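For readers working in Java, the same idea can be sketched without any Lucene dependency. This is only an illustration under stated assumptions: a regular expression over letters and digits stands in for the analyzer, and the class and method names (TermOffsets, offsets) are hypothetical. It reports start and end character offsets in the same form as the output above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Plain-Java stand-in for re-processing a token stream; not Lucene's analyzer.
class TermOffsets {

    // A "token" here is any run of letters or digits (an assumption of this sketch).
    private static final Pattern WORD = Pattern.compile("[\\p{L}\\p{N}]+");

    // Returns {start, end} character offsets of every token equal to term,
    // comparing lowercased tokens; end is exclusive.
    static List<int[]> offsets(String term, String content) {
        List<int[]> found = new ArrayList<>();
        Matcher m = WORD.matcher(content);
        while (m.find()) {
            if (m.group().toLowerCase(Locale.ROOT).equals(term)) {
                found.add(new int[] {m.start(), m.end()});
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println("Token: bar");
        for (int[] o : offsets("bar", "Foo Bar Baz Bar Bat")) {
            System.out.println(" > offset: " + o[0] + "-" + o[1]);
        }
    }
}
```

A real analyzer may also stem, fold accents, or drop stop words, so offsets from this sketch only line up with Lucene's for simple text and simple terms.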

Can I clear the stopword list in lucene.net in order for exact matches to work better?

When dealing with exact matches I'm given a real world query like this:
"not in education, employment, or training"
Converted to a Lucene query with stopwords removed gives:
+Content:"? ? education employment ? training"
Here's a more contrived example:
"there is no such thing"
Converted to a Lucene query with stopwords removed gives:
+Content:"? ? ? ? thing"
My goal is to have searches like these match only the exact match as the user entered it.
Could one solution be to clear the stopword list? Would this have adverse effects? If so, what? (My google-fu failed.)

This all depends on the analyzer you are using. The StandardAnalyzer strips out stop words; in fact, it gets its stop-word list from the StopAnalyzer.
Use the WhitespaceAnalyzer, or create your own by inheriting from the analyzer that most closely suits your needs and modifying it.
Alternatively, if you like the StandardAnalyzer, you can create one with a custom stop-word list:
// This is the default stop-word list, in case you want to use or filter it
var defaultStopWords = StopAnalyzer.ENGLISH_STOP_WORDS_SET;
// Create a new StandardAnalyzer with custom stop words
var sa = new StandardAnalyzer(
    Version.LUCENE_29, // depends on your version
    new HashSet<string> // pass in your own stop-word list
    {
        "hello",
        "world"
    });
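To see why the stop-word list matters for exact phrase matching, the following plain-Java sketch reproduces the "?" gaps from the question. It is an illustration only, not Lucene code: the stop-word set is a hypothetical subset chosen to cover the two example phrases, tokens are split on anything that is not a letter or digit, and the names (StopWordGaps, phrasePattern) are made up for this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-Java illustration of stop-word gaps in a phrase; not Lucene's analyzer.
class StopWordGaps {

    // Hypothetical subset of an English stop-word list, enough for the examples.
    static final Set<String> STOP_WORDS = new HashSet<>(
            Arrays.asList("not", "in", "or", "there", "is", "no", "such"));

    // Lowercases and tokenizes phrase, replacing each stop word with "?".
    static String phrasePattern(String phrase, Set<String> stopWords) {
        List<String> out = new ArrayList<>();
        for (String token : phrase.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue; // produced by leading punctuation
            }
            out.add(stopWords.contains(token) ? "?" : token);
        }
        return String.join(" ", out);
    }

    public static void main(String[] args) {
        String phrase = "not in education, employment, or training";
        // Prints: ? ? education employment ? training
        System.out.println(phrasePattern(phrase, STOP_WORDS));
        // Prints: not in education employment or training
        System.out.println(phrasePattern(phrase, Collections.<String>emptySet()));
    }
}
```

With an empty stop-word set every word keeps its slot in the phrase, which is the behaviour the question is after. The trade-off is that very common words are then indexed and searched like any other term.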

With Lucene, determine which name in a SQL Table exists in a document

I have a SQL table with people’s names and I want to find out which name is in a document that I’ve indexed with Lucene.
Is there a way to find out other than searching for each name individually?

I think you can achieve that by using a WildcardQuery and searching for just the "*" string.
TopDocs hits = searcher.search(new WildcardQuery(new Term(AppConstants.NAME, "*")), 20);
if (null == hits.scoreDocs || hits.scoreDocs.length <= 0) {
    System.out.println("No Hits Found");
    return;
}
for (ScoreDoc hit : hits.scoreDocs) {
    Document doc = searcher.doc(hit.doc);
    // ... build a list of names, cross-check with the DB table, etc.
}
In the code above, AppConstants.NAME is the name of the name field, and it is assumed that reader and searcher have already been initialized.
In place of the hit limit of 20, you can specify any number, depending on the number of rows in the table.
Be prepared for the wildcard search to be slow; its speed depends heavily on your index size.
When asking a question about Lucene, always specify the Lucene version and the API technology (Java or .NET).
The Java code above works for me with Lucene 6.0.0.

Lucene query result : get the words in the returned documents that were found by the query

To highlight the matching words in the documents returned by a Lucene query, I need the search result to include the words that caused each document to match my request.
For example :
Lucene query : "dog cat"
Result : ["dogs are nice","dog and cats are friends"]
How can I achieve this with Lucene? I can't handle this manually, because the returned words (cats, dogs) may differ from the requested words (cat, dog).

Use Lucene's Highlighter. Something like this:
// By default, this formatter wraps highlights in <b>, but that is configurable.
Formatter formatter = new SimpleHTMLFormatter();
QueryScorer scorer = new QueryScorer(query);
Highlighter highlighter = new Highlighter(formatter, scorer);
// You can set a fragmenter as well; by default, SimpleFragmenter splits the text into fragments of 100 chars.
String highlightedSnippet = highlighter.getBestFragment(myAnalyzer, fieldName, fieldContent);

How to find similar documents

How do you find documents similar to a given document in Lucene? I do not know what the text is; I only know which document it is. Is there a way to find similar documents in Lucene? I am a newbie, so I may need some hand-holding.

You may want to check the MoreLikeThis feature of Lucene.
MoreLikeThis constructs a Lucene query based on terms within a document to find other similar documents in the index.
http://lucene.apache.org/java/3_0_1/api/contrib-queries/org/apache/lucene/search/similar/MoreLikeThis.html
Sample code (Java):
MoreLikeThis mlt = new MoreLikeThis(reader); // Pass the index reader
mlt.setFieldNames(new String[] {"title", "author"}); // Specify the fields for similarity
Query query = mlt.like(docID); // Pass the doc id
TopDocs similarDocs = searcher.search(query, 10); // Use the searcher
if (similarDocs.totalHits == 0) {
    // Handle the case where no similar documents were found
}
