I am adding a document to a Lucene index as follows:
Document doc = new Document();
String stringObj = (String)field.get(obj);
doc.add(new TextField(fieldName, stringObj.toLowerCase(), org.apache.lucene.document.Field.Store.YES));
indexWriter.addDocument(doc);
I am doing a wildcard search as follows:
searchTerm = "*" + searchTerm + "*";
term = new Term(field, searchTerm.toLowerCase());
Query query = new WildcardQuery(term);
TotalHitCountCollector collector = new TotalHitCountCollector();
indexSearcher.search(query, collector);
if(collector.getTotalHits() > 0){
TopDocs hits = indexSearcher.search(query, collector.getTotalHits());
}
When I have a string with the value "this", it is not added to the index, so I get no results when searching for "this". I am using a StandardAnalyzer.
Common English words such as prepositions, pronouns, and articles are treated as stop words and removed before indexing, which is why "this" never reaches the index. You can define a custom analyzer, or give your analyzer a custom stop word list. That way you control which words are dropped and keep the stop words you need.
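To make the effect concrete, here is a minimal plain-Java sketch of what a stop-word filter does. It is not Lucene code: the class StopWordDemo, the method analyze, and the tiny stop list are invented for illustration, and a real analyzer tokenizes more carefully.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; this does not use or reproduce Lucene's analyzers.
class StopWordDemo {

    // A tiny, assumed stop list; a real analyzer's list is longer.
    static final Set<String> SAMPLE_STOP_WORDS = Set.of("a", "is", "the", "this");

    // Lowercases the text, splits on non-alphanumeric characters, and drops stop words.
    static List<String> analyze(String text, Set<String> stopWords) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!raw.isEmpty() && !stopWords.contains(raw)) {
                tokens.add(raw);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // With the sample stop list, "this" never becomes a token.
        System.out.println(analyze("This is a query", SAMPLE_STOP_WORDS)); // [query]
        // With an empty stop list, every word is kept.
        System.out.println(analyze("This is a query", Set.of()));          // [this, is, a, query]
    }
}
```

In Lucene itself the equivalent change is to construct the analyzer with your own stop set. StandardAnalyzer has a constructor that accepts a CharArraySet (older releases also require a Version argument), and passing CharArraySet.EMPTY_SET should disable stop-word removal. The index has to be rebuilt afterwards, because the stop words were dropped at indexing time.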
Related
For fun and learning I am trying to build a part-of-speech (POS) tagger with OpenNLP and Lucene 7.4. The goal is that, once indexed, I can search for a sequence of POS tags and find all sentences that match that sequence. I already have the indexing part working, but I am stuck on the query part. I am aware that Solr might have some functionality for this, and I already checked its code (which was not so self-explanatory after all). But my goal is to understand and implement this in Lucene 7, not in Solr, as I want to be independent of any search engine on top.
Idea
Input sentence 1: The quick brown fox jumped over the lazy dogs.
Applied Lucene OpenNLP tokenizer results in: [The][quick][brown][fox][jumped][over][the][lazy][dogs][.]
Next, applying Lucene OpenNLP POS tagging results in: [DT][JJ][JJ][NN][VBD][IN][DT][JJ][NNS][.]
Input sentence 2: Give it to me, baby!
Applied Lucene OpenNLP tokenizer results in: [Give][it][to][me][,][baby][!]
Next, applying Lucene OpenNLP POS tagging results in: [VB][PRP][TO][PRP][,][UH][.]
Query: JJ NN VBD matches part of sentence 1, so sentence 1 should be returned. (At this point I am only interested in exact matches, i.e. let's leave aside partial matches, wildcards etc.)
Indexing
First, I created my own class com.example.OpenNLPAnalyzer:
public class OpenNLPAnalyzer extends Analyzer {
protected TokenStreamComponents createComponents(String fieldName) {
try {
ResourceLoader resourceLoader = new ClasspathResourceLoader(ClassLoader.getSystemClassLoader());
TokenizerModel tokenizerModel = OpenNLPOpsFactory.getTokenizerModel("en-token.bin", resourceLoader);
NLPTokenizerOp tokenizerOp = new NLPTokenizerOp(tokenizerModel);
SentenceModel sentenceModel = OpenNLPOpsFactory.getSentenceModel("en-sent.bin", resourceLoader);
NLPSentenceDetectorOp sentenceDetectorOp = new NLPSentenceDetectorOp(sentenceModel);
Tokenizer source = new OpenNLPTokenizer(
AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY, sentenceDetectorOp, tokenizerOp);
POSModel posModel = OpenNLPOpsFactory.getPOSTaggerModel("en-pos-maxent.bin", resourceLoader);
NLPPOSTaggerOp posTaggerOp = new NLPPOSTaggerOp(posModel);
// Perhaps we should also use a lower-case filter here?
TokenFilter posFilter = new OpenNLPPOSFilter(source, posTaggerOp);
// Very important: token types are not indexed, so we need to store them as payloads, otherwise we cannot search on them
TypeAsPayloadTokenFilter payloadFilter = new TypeAsPayloadTokenFilter(posFilter);
return new TokenStreamComponents(source, payloadFilter);
}
catch (IOException e) {
throw new RuntimeException(e); // keep the original exception as the cause
}
}
}
Note that we are using a TypeAsPayloadTokenFilter wrapped around OpenNLPPOSFilter. This means our POS tags will be indexed as payloads, and our query, whatever it looks like, will have to search on payloads as well.
Querying
This is where I am stuck. I have no clue how to query on payloads, and whatever I try does not work. Note that I am using Lucene 7; it seems that payload querying changed several times across older versions. Documentation is extremely scarce. It's not even clear what the proper field name to query is now: is it "word", or "type", or something else? For example, I tried this code, which does not return any search results:
// Step 1: Indexing
final String body = "The quick brown fox jumped over the lazy dogs.";
Directory index = new RAMDirectory();
OpenNLPAnalyzer analyzer = new OpenNLPAnalyzer();
IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);
IndexWriter writer = new IndexWriter(index, indexWriterConfig);
Document document = new Document();
document.add(new TextField("body", body, Field.Store.YES));
writer.addDocument(document);
writer.close();
// Step 2: Querying
final int topN = 10;
DirectoryReader reader = DirectoryReader.open(index);
IndexSearcher searcher = new IndexSearcher(reader);
final String fieldName = "body"; // What is the correct field name here? "body", or "type", or "word" or anything else?
final String queryText = "JJ";
Term term = new Term(fieldName, queryText);
SpanQuery match = new SpanTermQuery(term);
BytesRef pay = new BytesRef("type"); // Don't understand what to put here as an argument
SpanPayloadCheckQuery query = new SpanPayloadCheckQuery(match, Collections.singletonList(pay));
System.out.println(query.toString());
TopDocs topDocs = searcher.search(query, topN);
Any help is very much appreciated here.
Why don't you use TypeAsSynonymFilter instead of TypeAsPayloadTokenFilter and just run a normal query? So in your Analyzer:
:
TokenFilter posFilter = new OpenNLPPOSFilter(source, posTaggerOp);
TypeAsSynonymFilter typeAsSynonymFilter = new TypeAsSynonymFilter(posFilter);
return new TokenStreamComponents(source, typeAsSynonymFilter);
And indexing side:
static final String FIELD = "body"; // field name used for both indexing and searching
static Directory index() throws Exception {
Directory index = new RAMDirectory();
OpenNLPAnalyzer analyzer = new OpenNLPAnalyzer();
IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);
IndexWriter writer = new IndexWriter(index, indexWriterConfig);
writer.addDocument(doc("The quick brown fox jumped over the lazy dogs."));
writer.addDocument(doc("Give it to me, baby!"));
writer.close();
return index;
}
static Document doc(String body){
Document document = new Document();
document.add(new TextField(FIELD, body, Field.Store.YES));
return document;
}
And searching side:
static void search(Directory index, String searchPhrase) throws Exception {
final int topN = 10;
DirectoryReader reader = DirectoryReader.open(index);
IndexSearcher searcher = new IndexSearcher(reader);
QueryParser parser = new QueryParser(FIELD, new WhitespaceAnalyzer());
Query query = parser.parse(searchPhrase);
System.out.println(query);
TopDocs topDocs = searcher.search(query, topN);
System.out.printf("%s => %d hits\n", searchPhrase, topDocs.totalHits);
for(ScoreDoc scoreDoc: topDocs.scoreDocs){
Document doc = searcher.doc(scoreDoc.doc);
System.out.printf("\t%s\n", doc.get(FIELD));
}
}
And then use them like this:
public static void main(String[] args) throws Exception {
Directory index = index();
search(index, "\"JJ NN VBD\""); // search the sequence of POS tags
search(index, "\"brown fox\""); // search a phrase
search(index, "\"fox brown\""); // search a phrase (no hits)
search(index, "baby"); // search a word
search(index, "\"TO PRP\""); // search the sequence of POS tags
}
The result looks like this:
body:"JJ NN VBD"
"JJ NN VBD" => 1 hits
The quick brown fox jumped over the lazy dogs.
body:"brown fox"
"brown fox" => 1 hits
The quick brown fox jumped over the lazy dogs.
body:"fox brown"
"fox brown" => 0 hits
body:baby
baby => 1 hits
Give it to me, baby!
body:"TO PRP"
"TO PRP" => 1 hits
Give it to me, baby!
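One way to picture why this works: TypeAsSynonymFilter emits each token's type (here the POS tag) as an additional token at the same position as the word, so a phrase query only has to find the requested terms at consecutive positions. The plain-Java sketch below models that idea. The class PositionalPhraseDemo and its methods are hypothetical and are not part of Lucene.

```java
import java.util.List;
import java.util.Set;

// Hypothetical model of a positional index; not Lucene code.
class PositionalPhraseDemo {

    // Sentence 1 from the question, truncated to its first five tokens.
    // Each position holds every term indexed there: the word and its POS tag.
    static final List<Set<String>> SENTENCE_1 = List.of(
            Set.of("The", "DT"), Set.of("quick", "JJ"), Set.of("brown", "JJ"),
            Set.of("fox", "NN"), Set.of("jumped", "VBD"));

    // Returns true if the phrase terms occur at consecutive positions, in order.
    static boolean matchesPhrase(List<Set<String>> positions, List<String> phrase) {
        if (phrase.isEmpty()) {
            return false;
        }
        for (int start = 0; start + phrase.size() <= positions.size(); start++) {
            boolean all = true;
            for (int i = 0; i < phrase.size(); i++) {
                if (!positions.get(start + i).contains(phrase.get(i))) {
                    all = false;
                    break;
                }
            }
            if (all) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(matchesPhrase(SENTENCE_1, List.of("JJ", "NN", "VBD"))); // true
        System.out.println(matchesPhrase(SENTENCE_1, List.of("brown", "fox")));    // true
        System.out.println(matchesPhrase(SENTENCE_1, List.of("fox", "brown")));    // false
    }
}
```

Because words and tags share positions, a mixed phrase such as "brown NN" would also match in this model.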
This is my code to perform a PhraseQuery using Lucene. While it is clear how to get the score for each matching document in the index, I do not understand how to extract the total number of matches within a single document.
The following is my code performing the query:
PhraseQuery.Builder builder = new PhraseQuery.Builder();
builder.add(new Term("contents", "word1"), 0);
builder.add(new Term("contents", "word2"), 1);
builder.add(new Term("contents", "word3"), 2);
builder.setSlop(3);
PhraseQuery pq = builder.build();
int hitsPerPage = 10;
IndexReader reader = DirectoryReader.open(index);
IndexSearcher searcher = new IndexSearcher(reader);
TopDocs docs = searcher.search(pq, hitsPerPage);
ScoreDoc[] hits = docs.scoreDocs;
System.out.println("Found " + hits.length + " hits.");
for(int i=0;i<hits.length;++i)
{
int docId = hits[i].doc;
Document d = searcher.doc(docId);
System.out.println(docId + " " + hits[i].score);
}
Is there a method to extract the total number of matches for each document rather than the score?
Approach A. This might not be the best way, but it gives you quick insight. You can use the explain() method of the IndexSearcher class, which returns an Explanation whose printed form contains a lot of scoring detail, including the phrase frequency in the document. It takes the document number, not the Document object. Add this code inside your for loop:
System.out.println(searcher.explain(pq, docId));
Approach B. A more systematic way is to do what explain() itself does. To compute the phrase frequency, explain() builds a scorer object for the phrase query and calls freq() on it. Most of the methods and classes involved are private or protected, so I am not sure you can really use them. However, it might be helpful to look at the code of explain() in the PhraseWeight class inside PhraseQuery, and at the ExactPhraseScorer class. (Some of these classes are not public, so you need to download the source code to see them.)
I'm trying to compute tf-idf value of each term in a document. So, I iterate through the terms in a document and want to find the frequency of the term in the whole corpus and the number of documents in which the term appears. Following is my code:
// @param index path to index directory
// @param docNbr the document number in the index
public void readingIndex(String index, int docNbr) {
IndexReader reader = DirectoryReader.open(FSDirectory.open(new File(index)));
Document doc = reader.document(docNbr);
System.out.println("Processing file: "+doc.get("id"));
Terms termVector = reader.getTermVector(docNbr, "contents");
TermsEnum itr = termVector.iterator(null);
BytesRef term = null;
while ((term = itr.next()) != null) {
String termText = term.utf8ToString();
long termFreq = itr.totalTermFreq(); //FIXME: this only returns the frequency in this doc
long docCount = itr.docFreq(); //FIXME: docCount = 1 in all cases
System.out.println("term: "+termText+", termFreq = "+termFreq+", docCount = "+docCount);
}
reader.close();
}
Although the documentation says totalTermFreq() returns the total number of occurrences of this term across all documents, when testing I found it only returns the frequency of the term in the document given by docNbr, and docFreq() always returns 1.
How can I get frequency of a term across the whole index?
Update
Of course, I can create a map from each term to its frequency, then iterate through every document to count the total number of times a term occurs. However, I thought Lucene should have a built-in method for that purpose.
Thank you,
IndexReader.totalTermFreq(Term) will provide this for you. Your calls to the similar methods on the TermsEnum do return statistics for all documents in the enumeration, but a TermsEnum obtained from a term vector covers only that one document. Asking the reader instead gives you the statistics for all documents in the index. Something like:
String termText = term.utf8ToString();
Term termInstance = new Term("contents", term);
long termFreq = reader.totalTermFreq(termInstance);
long docCount = reader.docFreq(termInstance);
System.out.println("term: "+termText+", termFreq = "+termFreq+", docCount = "+docCount);
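To make the two statistics concrete, here is a small plain-Java model of what they mean over a whole corpus versus a single document. It is only a sketch: the class TermStatsDemo, its methods, and the three-document corpus are invented, and a real index stores these numbers rather than rescanning text.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; not Lucene code.
class TermStatsDemo {

    // An assumed three-document corpus of whitespace-separated terms.
    static final List<String> CORPUS = List.of(
            "apple banana apple",
            "banana cherry",
            "apple");

    // Occurrences of the term in one document.
    static long countInDoc(String doc, String term) {
        return Arrays.stream(doc.split("\\s+")).filter(term::equals).count();
    }

    // Total occurrences of the term across all given documents.
    static long totalTermFreq(List<String> docs, String term) {
        long total = 0;
        for (String doc : docs) {
            total += countInDoc(doc, term);
        }
        return total;
    }

    // Number of given documents that contain the term at least once.
    static int docFreq(List<String> docs, String term) {
        int count = 0;
        for (String doc : docs) {
            if (countInDoc(doc, term) > 0) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Whole corpus: what the reader-level statistics correspond to.
        System.out.println(totalTermFreq(CORPUS, "apple")); // 3
        System.out.println(docFreq(CORPUS, "apple"));       // 2
        // Single document only: what a term-vector enumeration sees.
        List<String> onlyFirst = CORPUS.subList(0, 1);
        System.out.println(totalTermFreq(onlyFirst, "apple")); // 2
        System.out.println(docFreq(onlyFirst, "apple"));       // 1
    }
}
```

The single-document case reproduces the symptom from the question: the total equals the in-document frequency, and the document count is always 1.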
Currently I have an issue with the Lucene search (version 2.9).
I have a search term that I need to apply to several fields, so I have to use MultiFieldQueryParser. On the other hand, I have to use WildcardQuery, because our customer wants to search for a term inside a phrase (e.g. "CMH" should match "KRC250/CMH/830/T/H").
I have tried replacing the slashes ('/') with stars ('*') and using a BooleanQuery with the term enclosed in stars.
Unfortunately without any success.
Does anyone have any idea?
Yes, if the field shown is indexed as a single token, calling setAllowLeadingWildcard(true) would be necessary, like:
parser.setAllowLeadingWildcard(true);
Query query = parser.parse("*CMH*");
However:
You don't mention how the field is analyzed. By default, the StandardAnalyzer is used, which will split it into tokens at slashes (or asterisks, when indexing data). If you are using this sort of analysis, you could simply create a TermQuery searching for "cmh" (StandardAnalyzer includes a LowerCaseFilter, so indexed terms are lowercase), or simply:
String[] fields = {"this", "that", "another"};
QueryParser parser = new MultiFieldQueryParser(Version.LUCENE_29, fields, analyzer); // assuming StandardAnalyzer
Query simpleQuery = parser.parse("CMH");
//Or even...
Query slightlyMoreComplexQuery = parser.parse("\"CMH/830/T\"");
I don't understand what you mean by a BooleanQuery with enclosed stars. If you can include code to elucidate that, it might help.
Sorry, maybe I have described it a little bit wrong.
I took something like this:
BooleanQuery bq = new BooleanQuery();
foreach (string field in fields)
{
foreach (string tok in tokArr)
{
bq.Add(new WildcardQuery(new Term(field, " *" + tok + "* ")), BooleanClause.Occur.SHOULD);
}
}
return bq;
but unfortunately it did not work.
I have modified it like this
string newterm = string.Empty;
string[] tok = term.Split(new[] { ' ', '/' }, StringSplitOptions.RemoveEmptyEntries);
tok.ForEach(x => newterm += x.EnsureStartsWith(" *").EnsureEndsWith("* "));
var version = Lucene.Net.Util.Version.LUCENE_29;
var analyzer = new StandardAnalyzer(version);
var parser = new MultiFieldQueryParser(version, fields, analyzer);
parser.SetDefaultOperator(QueryParser.Operator.AND);
parser.SetAllowLeadingWildcard(true);
return parser.Parse(newterm);
and my customer loves it :-)
I am trying to delete documents from a Lucene index.
I want to delete only the specified file from the index.
My program below deletes documents that can be found using a KeywordAnalyzer, but the file name I need can only be found using a StandardAnalyzer. Is there any way to apply StandardAnalyzer to my Term, or, instead of a Term, how can I use QueryParser to delete the documents from the Lucene index?
try{
File INDEX_DIR= new File("D:\\merge lucene\\abc\\");
Directory directory = FSDirectory.open(INDEX_DIR);
IndexReader indexReader = IndexReader.open(directory,false);
Term term= new Term("path","fileindex23005.htm");
int l= indexReader.deleteDocuments(term);
indexReader.close();
System.out.println("documents deleted");
}
catch(Exception x){x.printStackTrace();}
I assume you are using Lucene 3.6 or before, otherwise IndexReader.deleteDocuments no longer exists. You should, however, be using IndexWriter instead, anyway.
If you can only find the document using query parser, then just run a normal query, then iterate through the documents returned, and delete them by docnum, along the lines of:
Query query = queryParser.parse("My Query!");
ScoreDoc[] docs = searcher.search(query, 100).scoreDocs;
for (ScoreDoc doc : docs) {
indexReader.deleteDocument(doc.doc);
}
Or better yet (simpler, uses non-defunct, non-deprecated functionality), just use an IndexWriter, and pass it the query directly:
Query query = queryParser.parse("My Query!");
writer.deleteDocuments(query);
Adding this for future reference, for anyone like me whose deleteDocuments lives on IndexWriter. You may use
indexWriter.deleteDocuments(Term... terms)
instead of the deleteDocuments(query) method; this is less hassle if you only have to match one field. Be aware that this method treats multiple terms as an OR condition: a document matching any of the terms is deleted. The code below matches STATE=Tx in the stored documents and deletes the matching records.
indexWriter.deleteDocuments(
new Term("STATE", "Tx")
);
For combining different fields with AND condition, we may use following code:
BooleanQuery.Builder builder = new BooleanQuery.Builder();
// Note: year is stored as an int, not as a String, when the document is created.
// A Term here would need 2016 as a String, which will not match documents that store year as an int.
Query yearQuery = IntPoint.newExactQuery("year", 2016);
Query stateQuery = new TermQuery(new Term("STATE", "TX"));
Query cityQuery = new TermQuery(new Term("CITY", "CITY NAME"));
builder.add(yearQuery, BooleanClause.Occur.MUST);
builder.add(stateQuery, BooleanClause.Occur.MUST);
builder.add(cityQuery, BooleanClause.Occur.MUST);
indexWriter.deleteDocuments(builder.build());
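The difference between the OR behaviour of multiple terms and the AND behaviour of MUST clauses can be modelled without Lucene. In the sketch below a document is just a field-to-value map; the class DeleteSemanticsDemo, its methods, and the sample data are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical model: a document is a field->value map; this is not Lucene code.
class DeleteSemanticsDemo {

    static final List<Map<String, String>> DOCS = List.of(
            Map.of("STATE", "TX", "CITY", "Austin"),
            Map.of("STATE", "TX", "CITY", "Dallas"),
            Map.of("STATE", "CA", "CITY", "Austin"),
            Map.of("STATE", "CA", "CITY", "Fresno"));

    // OR semantics: a document is deleted if it matches ANY of the terms.
    // Returns the documents that survive.
    static List<Map<String, String>> deleteMatchingAny(List<Map<String, String>> docs,
                                                       Map<String, String> terms) {
        List<Map<String, String>> kept = new ArrayList<>();
        for (Map<String, String> doc : docs) {
            boolean any = terms.entrySet().stream()
                    .anyMatch(t -> t.getValue().equals(doc.get(t.getKey())));
            if (!any) {
                kept.add(doc);
            }
        }
        return kept;
    }

    // AND semantics: a document is deleted only if it matches ALL of the terms.
    // An empty term set deletes nothing. Returns the documents that survive.
    static List<Map<String, String>> deleteMatchingAll(List<Map<String, String>> docs,
                                                       Map<String, String> terms) {
        List<Map<String, String>> kept = new ArrayList<>();
        for (Map<String, String> doc : docs) {
            boolean all = !terms.isEmpty() && terms.entrySet().stream()
                    .allMatch(t -> t.getValue().equals(doc.get(t.getKey())));
            if (!all) {
                kept.add(doc);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, String> terms = Map.of("STATE", "TX", "CITY", "Austin");
        // OR: everything in TX or named Austin goes; only CA/Fresno survives.
        System.out.println(deleteMatchingAny(DOCS, terms).size()); // 1
        // AND: only TX/Austin goes; the other three survive.
        System.out.println(deleteMatchingAll(DOCS, terms).size()); // 3
    }
}
```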
As @dillippattnaik pointed out, multiple terms result in OR. I have updated his code to make it AND using BooleanQuery:
BooleanQuery query = new BooleanQuery
{
{ new TermQuery( new Term( "year", "2016" ) ), Occur.MUST },
{ new TermQuery( new Term( "STATE", "TX" ) ), Occur.MUST },
{ new TermQuery( new Term( "CITY", "CITY NAME" ) ), Occur.MUST }
};
indexWriter.DeleteDocuments( query );