Lucene - Specialized TokenStream/Analyzer given a set of indexable keywords

I have the following situation.
I have a collection of documents to index, but I need to be selective about what I index.
Selection criterion: the document must contain at least one of the keywords from a given Set.
That part is easy: I can check whether any of those keywords are present in the document and only then index the document.
The tricky part (for me, anyway) is that I want to index only these keywords, and the keywords can be multi-word phrases or regular expressions.
What these keywords are is irrelevant to this post, because I can abstract that out: I can generate the list of keywords that need to be indexed.
Is there an existing TokenStream, Analyzer, Filter combination that I can use?
And if there isn't, please could someone point me in the right direction.
If my question isn't clear enough:
HashSet<String> impKeywords = new HashSet<String>(Arrays.asList("Java", "Lucene")); // HashSet has no array constructor
I have a class Content which I use, say:
Content content = new Content("I am only interested in Java, Lucene, Nutch, Luke, CommonLisp.");
And, say I have a method to get matching keywords:
HashSet<String> matchingKeywords = content.getMatchingKeywords(impKeywords); // returns a set with "Java" and "Lucene"
And if there are matchingKeywords, only then proceed to index the document; so:
if(!matchingKeywords.isEmpty()) {
// prepare document for indexing, and index.
// But what should be my Analyzer and TokenStream?
}
I want to be able to create an Analyzer with a TokenStream that only returns these matching keywords, so only these tokens are indexed.
End notes: One possibility appears to be to add, for each document, a variable number of fields, one per matching keyword, where these fields are indexed but not analyzed using Field.Index.NOT_ANALYZED. However, it would be better if I could find a pre-existing Analyzer/TokenStream for this purpose instead of playing around with fields.

Following @femtoRgon's advice, I resolved the problem as follows.
As explained in the question, I have:
HashSet<String> impKeywords = new HashSet<String>(Arrays.asList("Java", "Lucene")); // HashSet has no array constructor
And I have a class Content which I use, say as follows:
Content content = new Content("I am only interested in Java, Lucene, Nutch, Luke, CommonLisp.");
And, I have a method to get matching keywords:
HashSet<String> matchingKeywords = content.getMatchingKeywords(impKeywords); // returns a set with "Java" and "Lucene" for this example `content`.
And if there are matchingKeywords, only then proceed to index the document; so while indexing I did:
if(!matchingKeywords.isEmpty()) {
Document doc = new Document();
for(String keyword: matchingKeywords) {
doc.add(new Field("keyword", keyword, Field.Store.YES, Field.Index.NOT_ANALYZED));
}
iwriter.addDocument(doc); // iwriter is the instance of IndexWriter
}
Then, while searching I created the following boolean query:
BooleanQuery boolQuery = new BooleanQuery();
for(String queryKeyword: searchKeywords) {
boolQuery.add(new TermQuery(new Term("keyword", queryKeyword)), BooleanClause.Occur.SHOULD);
}
ScoreDoc[] hits = isearcher.search(boolQuery, null, 1000).scoreDocs; // isearcher is the instance of IndexSearcher
Hope this answer helps someone with similar needs.
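The getMatchingKeywords method is not shown in the question or the answer. Here is a minimal sketch of one way it might work, assuming each keyword is interpreted as a regular expression (so plain words and multi-word phrases are just literal patterns) and matched case-sensitively on word boundaries. The class name KeywordMatcher and the static method shape are hypothetical, not the asker's actual Content class.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Pattern;

public class KeywordMatcher {
    // Returns the subset of keywords (each interpreted as a regex) found in text.
    public static Set<String> getMatchingKeywords(String text, Set<String> keywords) {
        Set<String> matches = new LinkedHashSet<String>();
        for (String keyword : keywords) {
            // \b anchors keep "Java" from matching inside "JavaScript".
            Pattern p = Pattern.compile("\\b(?:" + keyword + ")\\b");
            if (p.matcher(text).find()) {
                matches.add(keyword);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        Set<String> impKeywords = new LinkedHashSet<String>(
                Arrays.asList("Java", "Lucene", "Common ?Lisp", "Solr"));
        String text = "I am only interested in Java, Lucene, Nutch, Luke, CommonLisp.";
        // Each returned keyword would become one NOT_ANALYZED "keyword" field.
        System.out.println(getMatchingKeywords(text, impKeywords));
    }
}
```

Because the returned strings are the keywords themselves rather than the matched text, a regex keyword is indexed under its pattern; whether that is the desired term to search on is a design choice left open here.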

Related

Index field using: new TextField(String fieldName, Reader reader)

I've been trying to index a field using the readerValue() that Lucene provides in the Fields. The thing is that the terms are not being indexed. This is the interesting part of the code:
IndexWriterConfig config = new IndexWriterConfig(new SimpleAnalyzer());
IndexWriter indexWriter = new IndexWriter(directory, config);
indexWriter.deleteAll();
String str = "Some random text to be indexed";
Reader reader = new StringReader(str);
Document doc = new Document();
doc.add(new TextField("content", reader));
indexWriter.addDocument(doc);
Now, if I index that text as a String with the other TextField constructor, it works fine. But like this the terms do not seem to be indexed; instead I get null when I try to read the value of the field after a search:
QueryParser queryParser = new QueryParser("content", new SimpleAnalyzer());
Query query = queryParser.parse(text);
TopDocs topDocs = indexSearcher.search(query,10);
for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
Document document = indexSearcher.doc(scoreDoc.doc);
Reader r = document.getField("content").readerValue();
I really can't see the problem, maybe it is some dumb thing that I missed, or maybe I'm using it wrong? Thanks in advance for any help.

By default, TextField is unstored. The behavior you are seeing is expected for an unstored field: you should be able to search on it, but not retrieve it from the index. The constructor that takes a String argument for the field contents lets you choose whether to store the field, hence the different behavior.
The store option is not available on the Reader constructor because Lucene explicitly disallows a stored field from being set with a Reader or TokenStream value. If you want to store the field, you will simply need to get the string value from the Reader yourself.
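Getting the string value from the Reader can be done with the JDK alone. Below is a minimal sketch; the class name ReaderText and the helper readAll are hypothetical, and the Lucene call appears only as a comment so the block runs without Lucene on the classpath.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class ReaderText {
    // Drains a Reader into a String so the text can be passed to a stored field.
    public static String readAll(Reader reader) {
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[4096];
        try {
            int n;
            while ((n = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
        } catch (IOException e) {
            // Rethrown unchecked to keep the call site short in this sketch.
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = readAll(new StringReader("Some random text to be indexed"));
        System.out.println(text);
        // With Lucene available, the string can then go into a stored field:
        // doc.add(new TextField("content", text, Field.Store.YES));
    }
}
```

Note that this loads the whole value into memory, which gives up the streaming benefit that the Reader constructor exists for.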

After Lucene search, get character offsets of all matched words in document? (not just preview snippet)

I am creating a search engine for a large number of HTML documents using lucene.
I know I can use PostingsHighlighter and friends to show snippets, with bold words, similar to Google Search results, also similar to this random lucene-based example.
However, unlike these examples, I need a solution that preserves highlighted words, even after the matched document is opened by the user, similar to Google Books.
Some words are hyphenated, in the form <div> ... an inter-</div><div...>national audience ...</div> I am thinking I need to convert these to plain text first, and write some code to merge words that were hyphenated, before I send them to lucene.
Once the resulting document is opened by the user, I'm hoping that I can use lucene to get character offsets of each matched word in the document.
I will have to cross-reference the offsets in the plain text back to the original HTML, and write code to highlight <b> the words based on said offsets.
<div> ... an <b>inter-</b></div><div...><b>national</b> audience ...</div>
How can I get what I need from lucene? Surely I don't have to write my own search for this 'final inch'?

OK, I figured out something I can get started with. :)
To index:
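StandardAnalyzer analyzer = new StandardAnalyzer();
Directory index = FSDirectory.open(new File("...").toPath());
IndexWriterConfig config = new IndexWriterConfig(analyzer);
IndexWriter writer = new IndexWriter(index, config); // writer was not declared in the original snippet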
addDoc(writer, "...", "...");
addDoc(writer, "...", "...");
addDoc(writer, "...", "...");
// documents need to be read from the data source..
// only add once, or else your docs will be duplicated as you continue to use the system
writer.close();
Specify that offsets are stored for highlighting:
private static final FieldType typeOffsets;
static {
typeOffsets = new FieldType(TextField.TYPE_STORED);
typeOffsets.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
}
The addDoc method:
void addDoc(IndexWriter writer, String title, String body) throws IOException {
Document doc = new Document();
doc.add(new Field("title", title, typeOffsets));
doc.add(new Field("body", body, typeOffsets));
// you can also add and store a TextField that does not have offsets,
// like a file ID that you wouldn't search on but need in order to reference the original doc.
writer.addDocument(doc);
}
Perform your first search
String q = "...";
String[] fields = new String[] {"title", "body"};
QueryParser parser = new MultiFieldQueryParser(fields, analyzer);
Query query = parser.parse(q);
IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(index));
PostingsHighlighter highlighter = new PostingsHighlighter();
TopDocs topDocs = searcher.search(query, 10, Sort.RELEVANCE);
Get highlighted snippets with highlighter.highlightFields(fields, query, searcher, topDocs). You can iterate over the results.
When you want to highlight the end document (i.e. after the search is completed and user selected the result), use this solution (needs minor edits). It works by using NullFragmenter to turn the whole thing into one snippet.
public static String highlight(String pText, String pQuery) throws Exception
{
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
QueryParser parser = new QueryParser(Version.LUCENE_30, "", analyzer);
Highlighter highlighter = new Highlighter(new QueryScorer(parser.parse(pQuery)));
highlighter.setTextFragmenter(new NullFragmenter());
String text = highlighter.getBestFragment(analyzer, "", pText);
if (text != null)
{
return text;
}
return pText;
}
Edit: You can actually use PostingsHighlighter for this last step instead of Highlighter, but you have to override getBreakIterator and supply a BreakIterator that treats the whole document as one sentence.
Edit: You can override getFormatter to capture the offsets, rather than trying to parse the <b> tags normally output by PostingsHighlighter.
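What the last edit captures is, in effect, a list of (start, end) character offsets per matched word. As a Lucene-free illustration of that data shape only, the sketch below finds offsets with a case-insensitive whole-word regex scan over plain text. It does not use the index, the analyzer, or PostingsHighlighter, so it will not agree with Lucene on stemming or tokenization; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TermOffsets {
    // Returns {start, end} character offsets (end exclusive) of each whole-word match.
    public static List<int[]> find(String text, String term) {
        List<int[]> offsets = new ArrayList<int[]>();
        Pattern p = Pattern.compile("\\b" + Pattern.quote(term) + "\\b",
                Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(text);
        while (m.find()) {
            offsets.add(new int[] { m.start(), m.end() });
        }
        return offsets;
    }

    public static void main(String[] args) {
        String text = "an international audience, International news";
        for (int[] span : find(text, "international")) {
            // Each span can be mapped back to the original HTML for <b> wrapping.
            System.out.println(span[0] + "-" + span[1] + ": " + text.substring(span[0], span[1]));
        }
    }
}
```

Offsets found this way refer to the plain text, so the cross-referencing back to the original HTML described in the question is still a separate step.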

PhraseQuery+Lucene 4.6 is not working for PDF Word search

I am using Lucene 4.6 with a PhraseQuery to search for words in a PDF. My code is below. I am able to get the output text from the PDF, and the query prints as contents:"Following are the", but the number of hits is 0. Any suggestions? Thanks in advance.
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
// Store the index in memory:
Directory directory = new RAMDirectory();
// To store an index on disk, use this instead:
//Directory directory = FSDirectory.open("/tmp/testindex");
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_46, analyzer);
IndexWriter iwriter = new IndexWriter(directory, config);
iwriter.deleteAll();
iwriter.commit();
Document doc = new Document();
PDDocument document = null;
try {
document = PDDocument.load(strFilepath);
}
catch (IOException ex) {
System.out.println("Exception Occured while Loading the document: " + ex);
}
String output=new PDFTextStripper().getText(document);
System.out.println(output);
//String text = "This is the text to be indexed";
doc.add(new Field("contents", output, TextField.TYPE_STORED));
iwriter.addDocument(doc);
iwriter.close();
// Now search the index
DirectoryReader ireader = DirectoryReader.open(directory);
IndexSearcher isearcher = new IndexSearcher(ireader);
String sentence = "Following are the";
//IndexSearcher searcher = new IndexSearcher(directory);
if(output.contains(sentence)){
System.out.println("");
}
PhraseQuery query = new PhraseQuery();
String[] words = sentence.split(" ");
for (String word : words) {
query.add(new Term("contents", word));
}
ScoreDoc[] hits = isearcher.search(query, null, 1000).scoreDocs;
// Iterate through the results:
if(hits.length>0){
System.out.println("Searched text existed in the PDF.");
}
ireader.close();
directory.close();
}
catch(Exception e){
System.out.println("Exception: "+e.getMessage());
}

There are two reasons why your PhraseQuery is not working.
1. StandardAnalyzer uses ENGLISH_STOP_WORDS_SET, which contains a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with. These words are removed from the TokenStream while indexing. So when you search for "Following are the", the terms are and the are not in the index, and a PhraseQuery that requires them can never match.
The solution is to construct the analyzer with an empty stop set while indexing, so that StopFilter does not remove any words from the TokenStream:
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46, CharArraySet.EMPTY_SET);
2. StandardAnalyzer also uses LowerCaseFilter, so all tokens are normalized to lower case. Following is indexed as following, which means a term query for "Following" finds nothing. Call .toLowerCase() on your sentence before building the PhraseQuery and you should get results.
Also have a look at Unicode Standard Annex #29, which StandardTokenizer follows. From a brief look, characters such as APOSTROPHE, QUOTATION MARK, FULL STOP, SMALL COMMA and many others are ignored under certain conditions while indexing.
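The effect of the two filters can be illustrated without Lucene. The sketch below is a rough model only: it splits on non-alphanumeric characters instead of following UAX #29, and the class name StopWordDemo and method analyze are hypothetical. It shows which terms would end up in the index with and without stop-word removal.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class StopWordDemo {
    // The English stop words listed in the answer above.
    static final Set<String> STOP_WORDS = new HashSet<String>(Arrays.asList(
        "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in",
        "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the",
        "their", "then", "there", "these", "they", "this", "to", "was", "will", "with"));

    // Rough model: split on non-letters/digits, lower-case, optionally drop stop words.
    public static List<String> analyze(String text, boolean removeStopWords) {
        List<String> tokens = new ArrayList<String>();
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            if (raw.isEmpty()) continue;
            String token = raw.toLowerCase(Locale.ROOT);
            if (removeStopWords && STOP_WORDS.contains(token)) continue;
            tokens.add(token);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Default behaviour: only "following" survives, so the phrase cannot match.
        System.out.println(analyze("Following are the", true));
        // Empty stop set: all three terms are indexed, in lower case.
        System.out.println(analyze("Following are the", false));
    }
}
```

Either way, the terms passed to PhraseQuery have to match the indexed form exactly, which is why the query side needs the same lower-casing.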

Why I can't get the doc that added recently by IndexWriter in the search result in Lucene 4.0?

As the title says, I have run into a puzzling problem.
I have built an index for my test program, and then I use IndexWriter to add a document to the index. The code is:
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_CURRENT, analyzer);
IndexWriter iwriter = new IndexWriter(directory, config);
Document doc1 = new Document();
doc1.add(new Field("name", "张三", Field.Store.YES, Field.Index.ANALYZED));
doc1.add(new IntField("year", 2013, Field.Store.YES));
doc1.add(new TextField("content", "123456789", Field.Store.YES));
iwriter.addDocument(doc1);
iwriter.commit();
iwriter.close();
When I search this index, I can't get this doc. I do get the correct result count (one more than before), but when I print doc.get("name"), the output is wrong.
The code in search part is:
DirectoryReader ireader = DirectoryReader.open(directory);
System.out.println(ireader.numDeletedDocs());
IndexSearcher isearcher = new IndexSearcher(ireader);
// Parse a simple query that searches for "text":
QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "name", analyzer);
Query query = parser.parse("张");
ScoreDoc[] hits = isearcher.search(query, null, 1000).scoreDocs;
System.out.println(hits.length);
In results, there is a "Name: 李四".
I'm sure that I use StandardAnalyzer during both indexing and searching, and StandardAnalyzer turns each Chinese character into a single token. Why do I get "李四" when I search for "张"? Is there anything wrong with how I add the doc? Or is the doc ID mismatched?

Did you (re)open the index after adding the doc? Lucene searches only return the documents that existed as of the time the index was opened for searching.
[edit...]
Use IndexReader.open() or openIfChanged() to open the index again; in Lucene 4.0 the public static entry point is DirectoryReader.openIfChanged(oldReader). openIfChanged() has the advantage that it returns null if you can still use the old reader instance (because the index has not changed).
(If I recall correctly, DirectoryReader.Open() just opens the index directory, so the higher-level Lucene code does not realize that the index has changed if you just call DirectoryReader.Open.)

Lucene Autocomplete with multiple words using Shingle filter

I am trying to build autocomplete using Lucene's Dictionary and spellcheck classes, but so far I have only managed to make it work for single terms.
I googled and found that we need to use the Shingle Matrix filter to get this done. Can someone experienced with Lucene show me how to do it?
All I need is it has to generate words for autocomplete with phrases. For example, if I have a doc like this : "This is a long line with very long rant with too many words in it", Then I should be able to generate words like "long line", "long rant", "many words" etc...
Possible ?
Thanks.

writer = new IndexWriter(dir,
new ShingleAnalyzerWrapper(new StandardAnalyzer(
Version.LUCENE_CURRENT,
Collections.emptySet()),3),
false,
IndexWriter.MaxFieldLength.UNLIMITED);
This did the job for me...

You can write your own Analyzer by inheriting from Lucene.Net.Analysis.Analyzer and overriding its TokenStream method. There you can wrap the tokenizer in a ShingleFilter to get multi-word tokens from the token stream:
public override Lucene.Net.Analysis.TokenStream TokenStream(String fieldName, System.IO.TextReader reader)
{
Lucene.Net.Analysis.TokenStream tokenStream = new Lucene.Net.Analysis.Standard.StandardTokenizer(Lucene.Net.Util.Version.LUCENE_30, reader);
tokenStream = new Lucene.Net.Analysis.Shingle.ShingleFilter(tokenStream, maxShingleSize);
return tokenStream;
}
maxShingleSize sets the maximum number of words in a multi-word unit.
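To make the idea of a shingle concrete, here is a plain-Java sketch of word n-gram generation. It is a simplified stand-in, not Lucene's ShingleFilter: it splits on whitespace, emits only multi-word shingles (no single words), and ignores position gaps. The class name Shingles is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class Shingles {
    // Emits every run of 2..maxShingleSize adjacent words, joined by a single space.
    public static List<String> shingles(String text, int maxShingleSize) {
        String[] words = text.trim().split("\\s+");
        List<String> out = new ArrayList<String>();
        for (int start = 0; start < words.length; start++) {
            StringBuilder sb = new StringBuilder(words[start]);
            // Grow the shingle one word at a time until the size limit or end of text.
            for (int size = 2; size <= maxShingleSize && start + size <= words.length; size++) {
                sb.append(' ').append(words[start + size - 1]);
                out.add(sb.toString());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String doc = "This is a long line with very long rant with too many words in it";
        // With size 2 the output includes "long line", "long rant" and "many words".
        System.out.println(shingles(doc, 2));
    }
}
```

The output also contains shingles such as "is a" or "in it"; filtering those out (for example with stop words) would be a separate step for an autocomplete dictionary.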
