How to find similar documents - lucene

How do you find a similar documents of a given document in Lucene. I do not know what the text is i only know what the document is. Is there a way to find similar documents in lucene. I am a newbie so I may need some hand holding.

you may want to check the MoreLikeThis feature of lucene.
MoreLikeThis constructs a lucene query based on terms within a document to find other similar documents in the index.
http://lucene.apache.org/java/3_0_1/api/contrib-queries/org/apache/lucene/search/similar/MoreLikeThis.html
Sample code example (java reference) -
MoreLikeThis mlt = new MoreLikeThis(reader); // Pass the index reader
mlt.setFieldNames(new String[] {"title", "author"}); // specify the fields for similiarity
Query query = mlt.like(docID); // Pass the doc id
TopDocs similarDocs = searcher.search(query, 10); // Use the searcher
if (similarDocs.totalHits == 0)
// Do handling
}

Related

Get the position of matches in Lucene

Is it possible to find the position of words with a match when the indexed field isn't stored?
for example:
Query: "fox over dog"
Indexed text of matched doc: "The quick brown fox jumps over the lazy dog"
What I want: [4,6,9]
Note1: I know text can be highlighted using Lucene but I want the position of the words
Note2: The field isn't set to be stored by Lucene**
I have not done this for practical purposes - just to give a pseudo code and pointers that you can experiment with to reach to correct solution.
Also, you have not specified your Lucene version, I am using Lucene 6.0.0 with Java.
1.While Indexing, set these two booleans for your specific field for which positions are desired. Lucene will be able to give that data if indexing has stored that information otherwise not.
FieldType txtFieldType = new FieldType(
TextField.TYPE_NOT_STORED);
txtFieldType.setStoreTermVectors(true);
txtFieldType.setStoreTermVectorPositions(true);
2.At your searcher, you need to use Terms , TermsEnum & PostingsEnum like below,
`Terms terms = searcher.getIndexReader().getTermVector(hit.doc, "TEXT_FIELD");`
if(terms.hasPositions()){
TermsEnum termsEnum = terms.iterator();
PostingsEnum postings = null;
while(termsEnum.next() != null){
postings = termsEnum.postings(postings ,PostingsEnum.ALL);
while(postings.nextDoc() != PostingsEnum.NO_MORE_DOCS){
System.out.println(postings.nextPosition());
}
You need to do some of your own analysis to arrive at the data that you need but your first need to save meta data as pointed in point # 1.
}
}
searcher is IndexSearcher instance, hit.doc is doc id and hit is a ScoreDoc .

With Lucene, determine which name in a SQL Table exists in a document

I have a SQL table with people’s names and I want to find out which name is in a document that I’ve indexed with Lucene.
Is there a way find out other than searching for each name individually?
I guess, you can achieve that by using WildcardQuery and searching only for "*" string.
TopDocs hits = searcher.search(new WildcardQuery(new Term(AppConstants.NAME, "*")), 20);
if (null == hits.scoreDocs || hits.scoreDocs.length <= 0) {
System.out.println("No Hits Found");
return;
}
for (ScoreDoc hit : hits.scoreDocs) {
Document doc = searcher.doc(hit.doc);
.......build a list of names or whatever or cross check with DB table etc etc......
}
In above code, AppConstants.NAME is name field name and its assumed that reader and searcher have already been initialized.
In place of hit limit of 20 , you can specify any number that you want depending on number of rows in table.
Be prepared that search with wild card will be slow and depends heavily on your index size.
When asking question on Lucene, always specify Lucene version and API technology - Java or .NET.
Above Java code works for me with Lucene 6.0.0.

Find typo with Lucene

I would like to use Lucene to index/search text. The text can contain mistyped words, names, etc. What is the most simple way of getting Lucene to find a document containing
"this is Licene"
when user searches for
"Lucene"?
This is only for a demo app, so we need the most simple solution.
Lucene's fuzzy queries and based on Levenshtein edit distance.
Use a fuzzy query in the QueryParser, with syntax like:
Lucene~0.5
Or create a FuzzyQuery, passing in the maximum number of edits, something like:
Query query = new FuzzyQuery(new Term("field", "lucene"), 1);
Note: FuzzyQuery, in Lucene 4.x, does not support greater edit distances than 2.
Another option you could try is using the Lucene SpellChecker:
http://lucene.apache.org/core/6_4_0/suggest/org/apache/lucene/search/spell/SpellChecker.html
It is a out of box, and very easy to use:
SpellChecker spellchecker = new SpellChecker(spellIndexDirectory);
// To index a field of a user index:
spellchecker.indexDictionary(new LuceneDictionary(my_lucene_reader, a_field));
// To index a file containing words:
spellchecker.indexDictionary(new PlainTextDictionary(new File("myfile.txt")));
String[] suggestions = spellchecker.suggestSimilar("misspelt", 5);
By default, it is using the LevensteinDistance, but you could provide your own customized Edit Distance.

Lucene.net PerFieldAnalyzerWrapper

I've read on how to use the per field analyzer wrapper, but can't get it to work with a custom analyzer of mine. I can't even get the analyzer to run the constructor, which makes me believe I'm actually calling the per field analyzer incorrectly.
Here's what I'm doing:
Create the per field analyzer:
PerFieldAnalyzerWrapper perFieldAnalyzer = new PerFieldAnalyzerWrapper(srchInfo.GetAnalyzer(true));
perFieldAnalyzer.AddAnalyzer("<special field>", dta);
Add all the fields do document as usual, including a special field that we analyze differently.
And add document using the analyzer like this:
iw.AddDocument(doc, perFieldAnalyzer);
Am I on the right track?
The problem was related to my reliance on CMSs (Kentico) built-in Lucene helper classes. Basically, using those classes you need to specify the custom analyzer at index-level through the CMS and I did not wish to do that. So I ended up using Lucene.net directly almost everywhere gaining the flexibility of using any custom analyzer I want
I also did some changes to how I structure data and ended up using the tried-and-true KeywordAnalyzer to analyze document tags. Previously I was trying to do some custom tokenization magic on comma separated values like [tag1, tag2, tag with many parts] and could not get it reliably working with multi-parted tags. I still kept that field, but started adding multiple "tag" fields to the document, each storing one tag. So now I have N "tag" fields for "N" tags, each analyzed as a keyword, meaning each tag (one word or many) is a single token.
I think I overthinked it with my initial approach.
Here is what I ended up with.
On Indexing:
KeywordAnalyzer ka = new KeywordAnalyzer();
PerFieldAnalyzerWrapper perFieldAnalyzer = new PerFieldAnalyzerWrapper(srchInfo.GetAnalyzer(true));
perFieldAnalyzer.AddAnalyzer("documenttags_t", ka);
-- Some procedure to compile all documents by reading from DB and putting into Lucene docs
foreach(var doc in docs)
{
iw.AddDocument(doc, perFieldAnalyzer);
}
On Searching:
KeywordAnalyzer ka = new KeywordAnalyzer();
PerFieldAnalyzerWrapper perFieldAnalyzer = new PerFieldAnalyzerWrapper(srchInfo.GetAnalyzer(true));
perFieldAnalyzer.AddAnalyzer("documenttags_t", ka);
string baseQuery = "documenttags_t:\"" + tagName + "\"";
Query query = _parser.Parse(baseQuery);
var results = _searcher.Search(query, sortBy)

With Lucene, if someone searches for "red barn", how do you also return results that contain "redbarn"?

I have a working search where if someone searches for two separate words (like "red barn", Lucene does a great job of returning records that have "red barn", "barn red", and "red tractor next to the big brown barn". That's great, but the results do not return anything that contains "redbarn" (unless you specifically search for "redbarn", but then you don't get "red barn" records).
I'm just using the standard analyzer at the moment, but am not sure what needs to change in order to get the all records I'd like.
If it matters, I'm using the NEST client on top of ElasticSearch (which is Lucene under the hood). I've researched the various analyzers and properties available but haven't found the right combination to do this.
The best way to do this is to write an analyzer that would tokenize "redbarn" as ["red", "barn"]. Lucene already does it for german languages, you can look at DictionaryCompoundWordTokenFilter for example.
Standard analyzer works for most cases but if you need detailed text analysis you need to write your own analyzer.
WorldDelimeterFilter that comes with Solr should solve your problem. Solr is built on top of lucene so you should not have a problem if you use the filters that come with solr.See the example below:
public class CustomAnalyzer extends Analyzer {
public TokenStream tokenStream(String fieldName, Reader reader) {
TokenStream ts = new WhitespaceTokenizer(reader);
ts = new WordDelimiterFilter(ts, 1, 1, 1, 1, 1);
ts = new LowerCaseFilter(ts);
return ts;
}
}