Lucene 4.5: Searching a StringField for a multi-term query

I'm trying to query a StringField for an index created with Lucene 4.5 with a string made up of multiple terms.
Let us suppose we create a Document object using the following code snippet/pseudocode.
Directory dir = FSDirectory.open(new File(indexPath));
Analyzer analyzer = new EnglishAnalyzer(Version.LUCENE_45);
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_45, analyzer);
iwc.setOpenMode(OpenMode.CREATE);
IndexWriter writer = new IndexWriter(dir, iwc);
Document doc = new Document();
Field title = new StringField("Title", minQuery, Field.Store.YES);
doc.add(title);
writer.addDocument(doc);
Now suppose I query the index created above using the following code (again, this is just a sketch of the actual code I'm using):
IndexReader reader = DirectoryReader.open(FSDirectory.open(new File(indexPath)));
BM25Similarity bm25sim = new BM25Similarity();
IndexSearcher searcher = new IndexSearcher(reader);
searcher.setSimilarity(bm25sim);
Analyzer analyzer = new EnglishAnalyzer(Version.LUCENE_45);
QueryParser parser = new QueryParser(Version.LUCENE_45, "Content", analyzer);
Query query = parser.parse("Title:\"washington dc\"");
TopDocs result = searcher.search(query, 1);
When I run the code above, I get the following exception at the searcher.search(query, 1) statement:
Exception in thread "main" java.lang.IllegalStateException: field "Title" was
indexed without position data; cannot run PhraseQuery (term=washington)
I've looked around and I cannot find a way to overcome this issue. It looks like in past versions of Lucene you could add the Field.Index.ANALYZED option when creating the field, but I've not been able to do something like that in my case.
Any idea?

Your query is getting analyzed as full-text, rather than as one atomic string. In order to allow the query parser to effectively decide on the appropriate analyzer to use for different fields, you can use a PerFieldAnalyzerWrapper, with KeywordAnalyzer being the appropriate analyzer to apply to a StringField.
Map<String,Analyzer> analyzerMap = new HashMap<String,Analyzer>();
analyzerMap.put("Title", new KeywordAnalyzer());
PerFieldAnalyzerWrapper analyzer =
new PerFieldAnalyzerWrapper(new EnglishAnalyzer(Version.LUCENE_45), analyzerMap);
QueryParser parser = new QueryParser(Version.LUCENE_45, "Content", analyzer);
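To see why the analyzer choice matters here, the following is a toy model in plain Java, not Lucene code. It assumes a StringField behaves like a single stored term and that a full-text analyzer roughly lowercases and splits on whitespace; the class and method names are made up for illustration. In real Lucene the mismatch surfaces as the PhraseQuery exception above rather than as a silent miss, but the cause is the same: the query is split into terms that the field never indexed.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: this is not Lucene code. It mimics how a StringField stores
// its whole value as one term, while an analyzed query is split into words.
final class KeywordVsAnalyzedSketch {

    // Stand-in for keyword analysis: the entire input becomes a single term.
    static List<String> keywordTerms(String text) {
        return Collections.singletonList(text);
    }

    // Stand-in for a full-text analyzer: lowercase and split on whitespace.
    static List<String> analyzedTerms(String text) {
        return Arrays.asList(text.toLowerCase().trim().split("\\s+"));
    }

    // A query matches only if every term it produced exists in the index.
    static boolean matches(Set<String> indexedTerms, List<String> queryTerms) {
        return indexedTerms.containsAll(queryTerms);
    }

    public static void main(String[] args) {
        // The field was indexed as one atomic term, as a StringField would be.
        Set<String> index = new HashSet<>(keywordTerms("washington dc"));
        // Analyzed query yields "washington" and "dc": neither is in the index.
        System.out.println(matches(index, analyzedTerms("washington dc"))); // false
        // Keyword-style query yields the single term "washington dc".
        System.out.println(matches(index, keywordTerms("washington dc")));  // true
    }
}
```

Mapping "Title" to KeywordAnalyzer in the wrapper plays the role of keywordTerms in this sketch, while every other field keeps the full-text behaviour.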


Querying part-of-speech tags with Lucene 7 OpenNLP

For fun and learning I am trying to build a part-of-speech (POS) tagger with OpenNLP and Lucene 7.4. The goal is that, once the text is indexed, I can search for a sequence of POS tags and find all sentences that match the sequence. I already have the indexing part, but I am stuck on the query part. I am aware that Solr might have some functionality for this, and I already checked the code (which was not so self-explanatory after all). But my goal is to understand and implement this in Lucene 7, not in Solr, as I want to be independent of any search engine on top.
Idea
Input sentence 1: The quick brown fox jumped over the lazy dogs.
Applied Lucene OpenNLP tokenizer results in: [The][quick][brown][fox][jumped][over][the][lazy][dogs][.]
Next, applying Lucene OpenNLP POS tagging results in: [DT][JJ][JJ][NN][VBD][IN][DT][JJ][NNS][.]
Input sentence 2: Give it to me, baby!
Applied Lucene OpenNLP tokenizer results in: [Give][it][to][me][,][baby][!]
Next, applying Lucene OpenNLP POS tagging results in: [VB][PRP][TO][PRP][,][UH][.]
Query: JJ NN VBD matches part of sentence 1, so sentence 1 should be returned. (At this point I am only interested in exact matches, i.e. let's leave aside partial matches, wildcards etc.)
Indexing
First, I created my own class com.example.OpenNLPAnalyzer:
public class OpenNLPAnalyzer extends Analyzer {
protected TokenStreamComponents createComponents(String fieldName) {
try {
ResourceLoader resourceLoader = new ClasspathResourceLoader(ClassLoader.getSystemClassLoader());
TokenizerModel tokenizerModel = OpenNLPOpsFactory.getTokenizerModel("en-token.bin", resourceLoader);
NLPTokenizerOp tokenizerOp = new NLPTokenizerOp(tokenizerModel);
SentenceModel sentenceModel = OpenNLPOpsFactory.getSentenceModel("en-sent.bin", resourceLoader);
NLPSentenceDetectorOp sentenceDetectorOp = new NLPSentenceDetectorOp(sentenceModel);
Tokenizer source = new OpenNLPTokenizer(
AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY, sentenceDetectorOp, tokenizerOp);
POSModel posModel = OpenNLPOpsFactory.getPOSTaggerModel("en-pos-maxent.bin", resourceLoader);
NLPPOSTaggerOp posTaggerOp = new NLPPOSTaggerOp(posModel);
// Perhaps we should also use a lower-case filter here?
TokenFilter posFilter = new OpenNLPPOSFilter(source, posTaggerOp);
// Very important: Tokens are not indexed, we need to store them as payloads, otherwise we cannot search on them
TypeAsPayloadTokenFilter payloadFilter = new TypeAsPayloadTokenFilter(posFilter);
return new TokenStreamComponents(source, payloadFilter);
}
catch (IOException e) {
throw new RuntimeException(e.getMessage());
}
}
}
Note that we are using a TypeAsPayloadTokenFilter wrapped around OpenNLPPOSFilter. This means our POS tags will be indexed as payloads, and our query, whatever it ends up looking like, will have to search on payloads as well.
Querying
This is where I am stuck. I have no clue how to query on payloads, and whatever I try does not work. Note that I am using Lucene 7, it seems that in older versions querying on payload has changed several times. Documentation is extremely scarce. It's not even clear what the proper field name is now to query - is it "word" or "type" or anything else? For example, I tried this code which does not return any search results:
// Step 1: Indexing
final String body = "The quick brown fox jumped over the lazy dogs.";
Directory index = new RAMDirectory();
OpenNLPAnalyzer analyzer = new OpenNLPAnalyzer();
IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);
IndexWriter writer = new IndexWriter(index, indexWriterConfig);
Document document = new Document();
document.add(new TextField("body", body, Field.Store.YES));
writer.addDocument(document);
writer.close();
// Step 2: Querying
final int topN = 10;
DirectoryReader reader = DirectoryReader.open(index);
IndexSearcher searcher = new IndexSearcher(reader);
final String fieldName = "body"; // What is the correct field name here? "body", or "type", or "word" or anything else?
final String queryText = "JJ";
Term term = new Term(fieldName, queryText);
SpanQuery match = new SpanTermQuery(term);
BytesRef pay = new BytesRef("type"); // Don't understand what to put here as an argument
SpanPayloadCheckQuery query = new SpanPayloadCheckQuery(match, Collections.singletonList(pay));
System.out.println(query.toString());
TopDocs topDocs = searcher.search(query, topN);
Any help is very much appreciated here.
Why don't you use TypeAsSynonymFilter instead of TypeAsPayloadTokenFilter and just run a normal query? In your Analyzer:
:
TokenFilter posFilter = new OpenNLPPOSFilter(source, posTaggerOp);
TypeAsSynonymFilter typeAsSynonymFilter = new TypeAsSynonymFilter(posFilter);
return new TokenStreamComponents(source, typeAsSynonymFilter);
And indexing side:
static Directory index() throws Exception {
Directory index = new RAMDirectory();
OpenNLPAnalyzer analyzer = new OpenNLPAnalyzer();
IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);
IndexWriter writer = new IndexWriter(index, indexWriterConfig);
writer.addDocument(doc("The quick brown fox jumped over the lazy dogs."));
writer.addDocument(doc("Give it to me, baby!"));
writer.close();
return index;
}
static Document doc(String body){
Document document = new Document();
document.add(new TextField(FIELD, body, Field.Store.YES));
return document;
}
And searching side:
static void search(Directory index, String searchPhrase) throws Exception {
final int topN = 10;
DirectoryReader reader = DirectoryReader.open(index);
IndexSearcher searcher = new IndexSearcher(reader);
QueryParser parser = new QueryParser(FIELD, new WhitespaceAnalyzer());
Query query = parser.parse(searchPhrase);
System.out.println(query);
TopDocs topDocs = searcher.search(query, topN);
System.out.printf("%s => %d hits\n", searchPhrase, topDocs.totalHits);
for(ScoreDoc scoreDoc: topDocs.scoreDocs){
Document doc = searcher.doc(scoreDoc.doc);
System.out.printf("\t%s\n", doc.get(FIELD));
}
}
And then use them like this:
public static void main(String[] args) throws Exception {
Directory index = index();
search(index, "\"JJ NN VBD\""); // search the sequence of POS tags
search(index, "\"brown fox\""); // search a phrase
search(index, "\"fox brown\""); // search a phrase (no hits)
search(index, "baby"); // search a word
search(index, "\"TO PRP\""); // search the sequence of POS tags
}
The result looks like this:
body:"JJ NN VBD"
"JJ NN VBD" => 1 hits
The quick brown fox jumped over the lazy dogs.
body:"brown fox"
"brown fox" => 1 hits
The quick brown fox jumped over the lazy dogs.
body:"fox brown"
"fox brown" => 0 hits
body:baby
baby => 1 hits
Give it to me, baby!
body:"TO PRP"
"TO PRP" => 1 hits
Give it to me, baby!

Lucene problems searching a hyphenated field

I'm having some problems with Lucene that are driving me nuts. I have the following field:
doc.Add(new Field("cataloguenumber", i.CatalogueNumber.ToLower(), Field.Store.YES, Field.Index.ANALYZED));
Which will contain a catalogue number that looks something like this:
DF-GH5
DF-FJ4
DF-DOG
AC-DP
AC-123
AC-DOCO
i.e. two characters followed by a hyphen followed by 2-5 alphanumeric characters.
I'm trying to run a boolean query to allow users to search over the data:
// specify the search fields, lucene search in multiple fields
string[] searchfields = new string[] { "cataloguenumber", "title", "author", "categories", "year", "length", "keyword", "description" };
// Making a boolean query for searching and get the searched hits
BooleanQuery mainQuery = new BooleanQuery();
QueryParser parser;
//Add filter for main keyword
parser = new MultiFieldQueryParser(Lucene.Net.Util.Version.LUCENE_30, searchfields, new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30));
parser.AllowLeadingWildcard = true;
mainQuery.Add(parser.Parse(GetMainSearchQueryString(SearchPhrase)), Occur.MUST);
The system is working fine for all fields EXCEPT cataloguenumber which for whatever reason is not working at all.
Ideally we would like to be able to search by full or partial cataloguenumber so for example "DF-" should return all items prefixed DF
Does anyone know how I can make this work?
Thanks very much in advance
Olly
A common source of problems is using different analyzers at index time and query time. You should be able to get good results by using a StandardAnalyzer: it treats the text DF-GH5 as a single token, so you will be able to search using, for example, df-gh5 or df-*. Just make sure to use it for both the IndexWriter and the QueryParser.
Here is a simple example which builds an in-memory index with a single document, and tries to query the index by cataloguenumber.
public static void Test()
{
// Use an in-memory index.
RAMDirectory indexDirectory = new RAMDirectory();
// Make sure to use the same analyzer for indexing and searching.
Analyzer analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
// Add single document to the index.
using (IndexWriter writer = new IndexWriter(indexDirectory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED))
{
Document document = new Document();
document.Add(new Field("content", "This is just some text", Field.Store.YES, Field.Index.ANALYZED));
document.Add(new Field("cataloguenumber", "DF-GH5", Field.Store.YES, Field.Index.ANALYZED));
writer.AddDocument(document);
}
var parser = new MultiFieldQueryParser(
Lucene.Net.Util.Version.LUCENE_30,
new[] { "cataloguenumber", "content" },
analyzer);
var searcher = new IndexSearcher(indexDirectory);
DoSearch("df-gh5", parser, searcher);
DoSearch("df-*", parser, searcher);
}
private static void DoSearch(string queryString, MultiFieldQueryParser parser, IndexSearcher searcher)
{
var query = parser.Parse(queryString);
TopDocs docs = searcher.Search(query, 10);
foreach (ScoreDoc scoreDoc in docs.ScoreDocs)
{
Document searchHit = searcher.Doc(scoreDoc.Doc);
string cataloguenumber = searchHit.GetValues("cataloguenumber").FirstOrDefault();
string content = searchHit.GetValues("content").FirstOrDefault();
Console.WriteLine($"Found object: {cataloguenumber} {content}");
}
}

Apache Lucene 5.5.3 - Searching a string ending with special character

I'm using Apache Lucene 5.5.3. I'm using org.apache.lucene.analysis.standard.StandardAnalyzer in my code and using below code snippet to create index.
Document doc = new Document();
doc.add(new TextField("userName", getUserName(), Field.Store.YES));
Now if I search for the string 'ALL-', I don't get any search results, but if I search for the string 'ALL-Categories', I do get some search results.
The same thing happens for strings with other special characters such as '+', '.', '!' etc.
Below is my search code:
Directory directory = new RAMDirectory();
IndexReader reader = DirectoryReader.open(directory);
IndexSearcher searcher = new IndexSearcher(reader);
Document document = new Document();
document.add(new TextField("body", "ALL-THE GLITTERS IS NOT GOLD", Field.Store.YES));
IndexWriter writer = new IndexWriter(directory, new IndexWriterConfig(buildAnalyzer()));
writer.addDocument(document);
writer.commit();
Builder builder = new BooleanQuery.Builder();
Query query1 = new QueryParser(IndexAttribute.USER_NAME, buildAnalyzer()).parse(searchQUery+"*");
Query query2 = new QueryParser(IndexAttribute.IS_VETERAN, buildAnalyzer()).parse(""+isVeteran);
builder.add(query1, BooleanClause.Occur.MUST);
builder.add(query2, BooleanClause.Occur.MUST);
Query q = builder.build();
TopDocs docs = searcher.search(q, 10);
ScoreDoc[] hits = docs.scoreDocs;
private static Analyzer buildAnalyzer() throws IOException {
return CustomAnalyzer.builder().withTokenizer("whitespace").addTokenFilter("lowercase")
.addTokenFilter("standard").build();
}
Please advise.
See the section Escaping Special Characters in the query parser syntax documentation for the list of special characters in Lucene 5.5.3.
As suggested there, you need to place a \ before each special character; alternatively, you can use the method public static String escape(String s) of the QueryParser class to achieve the same.
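As a rough illustration of what escaping does to a search string such as 'ALL-', here is a hand-rolled stand-in written in plain Java. It is not the Lucene method itself; the class name and the character list are assumptions based on the special characters listed in the query parser syntax documentation. In real code, prefer calling QueryParser.escape directly.

```java
// Hand-rolled stand-in for illustration only; real code should call
// QueryParser.escape(String) from Lucene instead of this class.
final class QueryEscapeSketch {

    // Characters treated as query syntax (assumed from the syntax documentation):
    // \ + - ! ( ) : ^ [ ] " { } ~ * ? | & /
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&/";

    // Prefixes every special character with a backslash.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Escape the user input first, then append the wildcard so that
        // the trailing * keeps its meaning as a prefix search.
        System.out.println(escape("ALL-") + "*"); // prints ALL\-*
    }
}
```

The order matters: if the wildcard were appended before escaping, the * would be escaped too and treated as a literal character.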
I got the solution with a combination of WildcardQuery, StringField and MultiFieldQueryParser. In addition to using these classes, we have to escape the space in the query string.

How to delete Documents from a Lucene Index using Term or QueryParser

I am trying to delete documents from a Lucene index.
I want to delete only the specified file from the Lucene index.
The program below deletes documents that can be found using a keyword analyzer, but my file name can only be found using StandardAnalyzer. Is there any way to apply StandardAnalyzer to my Term, or how can I use QueryParser instead of a Term to delete the documents from the Lucene index?
try{
File INDEX_DIR= new File("D:\\merge lucene\\abc\\");
Directory directory = FSDirectory.open(INDEX_DIR);
IndexReader indexReader = IndexReader.open(directory,false);
Term term= new Term("path","fileindex23005.htm");
int l= indexReader.deleteDocuments(term);
indexReader.close();
System.out.println("documents deleted");
}
catch(Exception x){x.printStackTrace();}
I assume you are using Lucene 3.6 or before, otherwise IndexReader.deleteDocuments no longer exists. You should, however, be using IndexWriter instead, anyway.
If you can only find the document using query parser, then just run a normal query, then iterate through the documents returned, and delete them by docnum, along the lines of:
Query query = queryParser.parse("My Query!");
ScoreDoc[] docs = searcher.search(query, 100).scoreDocs;
for (ScoreDoc doc : docs) {
indexReader.deleteDocument(doc.doc);
}
Or better yet (simpler, uses non-defunct, non-deprecated functionality), just use an IndexWriter, and pass it the query directly:
Query query = queryParser.parse("My Query!");
writer.deleteDocuments(query);
Adding this for future reference for anyone like me: where deleting documents is done on the IndexWriter, you may use
indexWriter.deleteDocuments(Term... terms)
instead of the deleteDocuments(query) method, which is less hassle if you only have to match one field. Be aware that this method treats the terms as an OR condition if multiple terms are passed, so it will delete every document that matches any of the terms. The code below matches STATE=Tx in the stored documents and deletes the matching records.
indexWriter.deleteDocuments(
new Term("STATE", "Tx")
);
To combine different fields with an AND condition, we may use the following code:
BooleanQuery.Builder builder = new BooleanQuery.Builder();
// Note: year is stored as an int, not as a String, when the document is created.
// A Term here would need 2016 as a String, which will not match documents that store year as an int.
Query yearQuery = IntPoint.newExactQuery("year", 2016);
Query stateQuery = new TermQuery(new Term("STATE", "TX"));
Query cityQuery = new TermQuery(new Term("CITY", "CITY NAME"));
builder.add(yearQuery, BooleanClause.Occur.MUST);
builder.add(stateQuery, BooleanClause.Occur.MUST);
builder.add(cityQuery, BooleanClause.Occur.MUST);
indexWriter.deleteDocuments(builder.build());
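The OR versus AND difference described above can be sketched with a toy model in plain Java. This is not Lucene code: documents are simple field-to-value maps, and the class, helper names and sample values are all made up for illustration. It only shows which documents survive under each rule.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only, not Lucene code: each document is a plain field-to-value map.
final class DeleteSemanticsSketch {

    // Hypothetical helper that builds a two-field document.
    static Map<String, String> doc(String state, String city) {
        Map<String, String> d = new HashMap<>();
        d.put("STATE", state);
        d.put("CITY", city);
        return d;
    }

    // Returns the documents that survive the delete.
    // requireAll=false mimics deleteDocuments(Term...): any matching term deletes.
    // requireAll=true mimics a BooleanQuery of MUST clauses: all terms must match.
    static List<Map<String, String>> survivors(
            List<Map<String, String>> docs, String[][] terms, boolean requireAll) {
        List<Map<String, String>> kept = new ArrayList<>();
        for (Map<String, String> d : docs) {
            int hits = 0;
            for (String[] t : terms) {
                // t[0] is the field name, t[1] the value to match exactly.
                if (t[1].equals(d.get(t[0]))) {
                    hits++;
                }
            }
            boolean delete = requireAll
                    ? (terms.length > 0 && hits == terms.length)
                    : hits > 0;
            if (!delete) {
                kept.add(d);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Map<String, String>> docs = new ArrayList<>();
        docs.add(doc("TX", "Austin"));
        docs.add(doc("TX", "Dallas"));
        docs.add(doc("CA", "Austin"));
        docs.add(doc("CA", "Fresno"));
        String[][] terms = { {"STATE", "TX"}, {"CITY", "Austin"} };
        System.out.println(survivors(docs, terms, false).size()); // OR: 1 document left
        System.out.println(survivors(docs, terms, true).size());  // AND: 3 documents left
    }
}
```

With the OR rule, three of the four sample documents are removed, which is why passing several terms to deleteDocuments(Term...) can delete far more than intended.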
As @dillippattnaik pointed out, multiple terms result in OR. I have updated his code to make it AND using BooleanQuery:
BooleanQuery query = new BooleanQuery
{
{ new TermQuery( new Term( "year", "2016" ) ), Occur.MUST },
{ new TermQuery( new Term( "STATE", "TX" ) ), Occur.MUST },
{ new TermQuery( new Term( "CITY", "CITY NAME" ) ), Occur.MUST }
};
indexWriter.DeleteDocuments( query );

How to set Lucene standard analyzer for PhraseQuery search?

I'm under the impression from a variety of tutorials out there on Lucene that if I do something like:
IndexWriter writer = new IndexWriter(indexPath, new StandardAnalyzer(Version.LUCENE_CURRENT), true, IndexWriter.MaxFieldLength.LIMITED);
Document doc = new Document();
Field title = new Field("title", titlefield, Field.Store.YES, Field.Index.ANALYZED);
doc.add(title);
writer.addDocument(doc);
writer.optimize();
writer.close();
IndexReader ireader = IndexReader.open(indexPath);
IndexSearcher indexsearcher = new IndexSearcher(ireader);
Term term1 = new Term("title", "one");
Term term2 = new Term("title", "two");
PhraseQuery query = new PhraseQuery();
query.add(term1);
query.add(term2);
query.setSlop(2);
that Lucene should return all documents whose title field contains "one" and "two" within 2 words of each other. But I don't get any results because I'm not using the StandardAnalyzer to search. How can I do a proximity search in Lucene then? Does the following QueryParser allow proximity searches (using the tilde)?
QueryParser queryParser = new QueryParser("title",new StandardAnalyzer());
Query query = queryParser.parse("test");
Yes, when you parse a query using QueryParser you will be able to do proximity searching.
In general, it is always recommended to use the same analyzer for indexing and searching.
BR,
Chris