How to set Lucene standard analyzer for PhraseQuery search? - lucene

I'm under the impression from a variety of tutorials out there on Lucene that if I do something like:
IndexWriter writer = new IndexWriter(indexPath, new StandardAnalyzer(Version.LUCENE_CURRENT), true, IndexWriter.MaxFieldLength.LIMITED);
Document doc = new Document();
Field title = new Field("title", titlefield, Field.Store.YES, Field.Index.ANALYZED);
doc.add(title);
writer.addDocument(doc);
writer.optimize();
writer.close();
IndexReader ireader = IndexReader.open(indexPath);
IndexSearcher indexsearcher = new IndexSearcher(ireader);
Term term1 = new Term("title", "one");
Term term2 = new Term("title", "two");
PhraseQuery query = new PhraseQuery();
query.add(term1);
query.add(term2);
query.setSlop(2);
that Lucene should return all documents whose title field contains "one" and "two" within 2 words of each other. But I don't get any results because I'm not using the StandardAnalyzer to search. How can I do a proximity search in Lucene then? Does the following QueryParser allow for proximity searches (using the tilde)?
QueryParser queryParser = new QueryParser("title",new StandardAnalyzer());
Query query = queryParser.parse("test");

Yes, when you parse a query using QueryParser you will be able to do proximity searching.
In general, it is always recommended to use the same analyzer for indexing and searching.
BR,
Chris
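On the tilde part of the question: in Lucene's classic query syntax a proximity search is written as a quoted phrase followed by a tilde and the maximum distance, for example "one two"~2. Below is a minimal sketch of building such a string before passing it to queryParser.parse(...). The class and method names (ProximityQueryString, proximity) are hypothetical helpers for illustration, not part of Lucene.

```java
// Hypothetical helper, not part of Lucene: builds a classic query-syntax
// proximity expression such as "one two"~2.
class ProximityQueryString {
    static String proximity(String phrase, int slop) {
        if (slop < 0) {
            throw new IllegalArgumentException("slop must be >= 0");
        }
        // Escape backslashes and embedded double quotes so the phrase
        // stays a single quoted unit in the query string.
        String escaped = phrase.replace("\\", "\\\\").replace("\"", "\\\"");
        return "\"" + escaped + "\"~" + slop;
    }

    public static void main(String[] args) {
        // prints "one two"~2 (including the double quotes)
        System.out.println(proximity("one two", 2));
    }
}
```

Because the parser runs its analyzer over the quoted phrase, the terms should be normalized the same way as at index time, provided the same analyzer is used on both sides.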

Related

Lucene problems searching hyphenated field

I'm having some problems with Lucene that are driving me nuts. I have the following field:
doc.Add(new Field("cataloguenumber", i.CatalogueNumber.ToLower(), Field.Store.YES, Field.Index.ANALYZED));
Which will contain a catalogue number that looks something like this:
DF-GH5
DF-FJ4
DF-DOG
AC-DP
AC-123
AC-DOCO
i.e. two characters followed by a hyphen followed by 2-5 alphanumeric characters.
I'm trying to run a boolean query to allow users to search over the data:
// specify the search fields, lucene search in multiple fields
string[] searchfields = new string[] { "cataloguenumber", "title", "author", "categories", "year", "length", "keyword", "description" };
// Making a boolean query for searching and get the searched hits
BooleanQuery mainQuery = new BooleanQuery();
QueryParser parser;
//Add filter for main keyword
parser = new MultiFieldQueryParser(Lucene.Net.Util.Version.LUCENE_30, searchfields, new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30));
parser.AllowLeadingWildcard = true;
mainQuery.Add(parser.Parse(GetMainSearchQueryString(SearchPhrase)), Occur.MUST);
The system is working fine for all fields EXCEPT cataloguenumber which for whatever reason is not working at all.
Ideally we would like to be able to search by full or partial cataloguenumber so for example "DF-" should return all items prefixed DF
Does anyone know how I can make this work?
Thanks very much in advance
Olly
A common source of problems is using different analyzers at index time and query time. You should be able to get good results with a StandardAnalyzer: it treats the text DF-GH5 as a single token, so you will be able to search using e.g. df-gh5 or df-*, but make sure to use it for both the IndexWriter and the QueryParser.
Here is a simple example which builds an in-memory index with a single document, and tries to query the index by cataloguenumber.
public static void Test()
{
// Use an in-memory index.
RAMDirectory indexDirectory = new RAMDirectory();
// Make sure to use the same analyzer for indexing and searching.
Analyzer analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
// Add single document to the index.
using (IndexWriter writer = new IndexWriter(indexDirectory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED))
{
Document document = new Document();
document.Add(new Field("content", "This is just some text", Field.Store.YES, Field.Index.ANALYZED));
document.Add(new Field("cataloguenumber", "DF-GH5", Field.Store.YES, Field.Index.ANALYZED));
writer.AddDocument(document);
}
var parser = new MultiFieldQueryParser(
Lucene.Net.Util.Version.LUCENE_30,
new[] { "cataloguenumber", "content" },
analyzer);
var searcher = new IndexSearcher(indexDirectory);
DoSearch("df-gh5", parser, searcher);
DoSearch("df-*", parser, searcher);
}
private static void DoSearch(string queryString, MultiFieldQueryParser parser, IndexSearcher searcher)
{
var query = parser.Parse(queryString);
TopDocs docs = searcher.Search(query, 10);
foreach (ScoreDoc scoreDoc in docs.ScoreDocs)
{
Document searchHit = searcher.Doc(scoreDoc.Doc);
string cataloguenumber = searchHit.GetValues("cataloguenumber").FirstOrDefault();
string content = searchHit.GetValues("content").FirstOrDefault();
Console.WriteLine($"Found object: {cataloguenumber} {content}");
}
}

Apache Lucene 5.5.3 - Searching a string ending with special character

I'm using Apache Lucene 5.5.3. I'm using org.apache.lucene.analysis.standard.StandardAnalyzer in my code and using below code snippet to create index.
Document doc = new Document();
doc.add(new TextField("userName", getUserName(), Field.Store.YES));
Now if I search for a string 'ALL-' , then I'm not getting any search results but if I search for a string 'ALL-Categories', then I'm getting some search results.
The same thing is happening for a string with special characters '+' , '.', '!' etc.
Below is my search code:-
Directory directory = new RAMDirectory();
IndexReader reader = DirectoryReader.open(directory);
IndexSearcher searcher = new IndexSearcher(reader);
Document document = new Document();
document.add(new TextField("body", "ALL-THE GLITTERS IS NOT GOLD", Field.Store.YES));
IndexWriter writer = new IndexWriter(directory, new IndexWriterConfig(buildAnalyzer()));
writer.addDocument(document);
writer.commit();
Builder builder = new BooleanQuery.Builder();
Query query1 = new QueryParser(IndexAttribute.USER_NAME, buildAnalyzer()).parse(searchQUery+"*");
Query query2 = new QueryParser(IndexAttribute.IS_VETERAN, buildAnalyzer()).parse(""+isVeteran);
builder.add(query1, BooleanClause.Occur.MUST);
builder.add(query2, BooleanClause.Occur.MUST);
Query q = builder.build();
TopDocs docs = searcher.search(q, 10);
ScoreDoc[] hits = docs.scoreDocs;
private static Analyzer buildAnalyzer() throws IOException {
return CustomAnalyzer.builder().withTokenizer("whitespace").addTokenFilter("lowercase")
.addTokenFilter("standard").build();
}
Please suggest how I can fix this.
Please refer to the section "Escaping Special Characters" in the Lucene 5.5.3 query parser syntax documentation for the list of special characters.
As suggested there, you need to place a \ before each special character, or alternatively you can use the method public static String escape(String s) of the QueryParser class to achieve the same.
I got the solution with a combination of WildcardQuery, StringField and MultiFieldQueryParser. In addition to using these classes, we have to escape the space in the query string.
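To make the backslash escaping mentioned above concrete, here is a stdlib-only sketch of what such an escape step does. It is a hand-written stand-in, not Lucene's code: the class name QueryEscapeSketch is hypothetical, and the character list is an assumption based on the classic query-syntax documentation. In real code, prefer QueryParser.escape(String).

```java
// Hand-rolled stand-in for illustration only; in real code prefer
// QueryParser.escape(String). The character list is an assumption based on
// the classic query-syntax documentation.
class QueryEscapeSketch {
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&/";

    static String escape(String input) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            // Prefix every query-syntax character with a backslash.
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Escape the user input first, then append the wildcard so that
        // the trailing * keeps its special meaning.
        System.out.println(escape("ALL-") + "*"); // prints ALL\-*
    }
}
```

Note the order: the user-supplied part is escaped before the wildcard is appended, otherwise the * itself would be escaped and treated as a literal character.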

Lucene phrase query does not work

I can't figure out how to make a phrase query work. It returns exact matches, but the slop option doesn't seem to make a difference.
Here's my code:
static void Main(string[] args)
{
using (Directory directory = new RAMDirectory())
{
Analyzer analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29);
using (IndexWriter writer = new IndexWriter(directory, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED))
{
// index a few documents
writer.AddDocument(createDocument("1", "henry morgan"));
writer.AddDocument(createDocument("2", "henry junior morgan"));
writer.AddDocument(createDocument("3", "henry immortal jr morgan"));
writer.AddDocument(createDocument("4", "morgan henry"));
}
// search for documents that have "henry morgan" in them
String sentence = "henry morgan";
IndexSearcher searcher = new IndexSearcher(directory, true);
PhraseQuery query = new PhraseQuery()
{
//allow inverse order
Slop = 3
};
query.Add(new Term("contents", sentence));
// display search results
List<string> results = new List<string>();
Console.WriteLine("Looking for \"{0}\"...", sentence);
TopDocs topDocs = searcher.Search(query, 100);
foreach (ScoreDoc scoreDoc in topDocs.ScoreDocs)
{
var matchedContents = searcher.Doc(scoreDoc.Doc).Get("contents");
results.Add(matchedContents);
Console.WriteLine("Found: {0}", matchedContents);
}
}
}
private static Document createDocument(string id, string content)
{
Document doc = new Document();
doc.Add(new Field("id", id, Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.Add(new Field("contents", content, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
return doc;
}
I thought that all options except document with id=3 are supposed to match, but only the first one does. Did I miss something?
From Lucene in Action, 2nd edition, section 3.4.6, "Searching by phrase: PhraseQuery":
PhraseQuery uses this information to locate documents where terms are within a certain distance of one another
Sure, a plain TermQuery would do the trick to locate this document
knowing either of those words, but in this case we only want documents
that have phrases where the words are either exactly side by side
(quick fox) or have one word in between (quick [irrelevant] fox)
So PhraseQuery works on individual terms, and the sample code in that chapter confirms it. Because you use StandardAnalyzer, "henry morgan" becomes the two tokens henry and morgan after analysis. Therefore, you cannot add "henry morgan" as a single Term.
/*
Sets the number of other words permitted between words
in query phrase. If zero, then this is an exact phrase search.
*/
public void setSlop(int s) { slop = s; }
The definition of setSlop may further explain the case.
After a little change on your code, I got it nailed.
// code in Scala
val query = new PhraseQuery();
query.setSlop(3)
List("henry", "morgan").foreach { word =>
query.add(new Term("contents", word))
}
In this case, all four documents will be matched. If you have any further problems, I suggest you read that chapter in Lucene in Action, 2nd edition.
That might help.
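To see why all four documents fall within a slop of 3, here is a simplified, stdlib-only model of sloppy matching for a two-term phrase. It is a sketch under assumptions, not Lucene's implementation: the class name SlopSketch is hypothetical, tokens are split on whitespace only, and the model covers just the two-term case.

```java
import java.util.Arrays;
import java.util.List;

// Simplified model of sloppy matching for a two-term phrase.
// This is NOT Lucene's implementation, only an illustration.
class SlopSketch {
    // Returns the smallest slop at which the phrase (first, second) would
    // match the text in this model, or -1 if either term is missing.
    static int requiredSlop(String text, String first, String second) {
        List<String> tokens = Arrays.asList(text.toLowerCase().split("\\s+"));
        int best = -1;
        for (int i = 0; i < tokens.size(); i++) {
            if (!tokens.get(i).equals(first)) continue;
            for (int j = 0; j < tokens.size(); j++) {
                if (!tokens.get(j).equals(second)) continue;
                // Shift the second term back by its offset in the phrase (1),
                // so an exact in-order match has distance 0.
                int distance = Math.abs((j - 1) - i);
                if (best < 0 || distance < best) best = distance;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(requiredSlop("henry morgan", "henry", "morgan"));             // 0
        System.out.println(requiredSlop("henry junior morgan", "henry", "morgan"));      // 1
        System.out.println(requiredSlop("henry immortal jr morgan", "henry", "morgan")); // 2
        System.out.println(requiredSlop("morgan henry", "henry", "morgan"));             // 2
    }
}
```

In this model the reversed pair "morgan henry" costs 2 because the two terms have to move past each other, which is why a slop of at least 2 is needed to allow inverse order.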

Lucene doesn't search number fields

I'm trying to index and then search an integer field with Lucene, but it doesn't find anything (text fields search fine).
Document doc = new Document();
//UserType = 1
doc.add(new IntField("userType", user.getType().getId(), Field.Store.YES));
FSDirectory dir = FSDirectory.open(FileSystems.getDefault().getPath(indexDir));
IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
writer = new IndexWriter(dir, config);
writer.addDocument(doc);
For searching I tried the following queries:
1) new QueryParser(defautField, new StandardAnalyzer()).parse("userType:1");
2) new QueryParser(defautField, new StandardAnalyzer()).parse("userType:[1 TO 1]");
3) new QueryParser(defautField, new StandardAnalyzer()).parse("userType:\"1\"");
But it doesn't work.
QueryParser doesn't handle numerics. You can search using NumericRangeQuery:
Query query = NumericRangeQuery.newIntRange("userType", 1, 1, true, true);

Lucene 4.5. Searching a StringField for a multi term query

I'm trying to query a StringField for an index created with Lucene 4.5 with a string made up of multiple terms.
Let us suppose we create a Document object using the following code snippet/pseudocode.
Directory dir = FSDirectory.open(new File(indexPath));
Analyzer analyzer = new EnglishAnalyzer(Version.LUCENE_45);
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_45, analyzer);
iwc.setOpenMode(OpenMode.CREATE);
IndexWriter writer = new IndexWriter(dir, iwc);
Document doc = new Document();
Field title = new StringField("Title", minQuery, Field.Store.YES);
doc.add(title);
writer.addDocument(doc);
Now suppose I query the index created above using the following code (again, this is just a sketch of the actual code I'm using):
IndexReader reader = DirectoryReader.open(FSDirectory.open(new File(indexPath)));
BM25Similarity bm25sim = new BM25Similarity();
IndexSearcher searcher = new IndexSearcher(reader);
searcher.setSimilarity(bm25sim);
Analyzer analyzer = new EnglishAnalyzer(Version.LUCENE_45);
QueryParser parser = new QueryParser(Version.LUCENE_45, "Content", analyzer);
Query query = parser.parse("Title:\"washington dc\"");
TopDocs result = searcher.search(query, 1);
When I run the code above, I get the following exception at the searcher.search(query, 1) statement:
Exception in thread "main" java.lang.IllegalStateException: field "Title" was
indexed without position data; cannot run PhraseQuery (term=washington)
I've looked around and I cannot find a way to overcome this issue. It looks like in past versions of Lucene you could add the Field.Index.ANALYZED option when creating the field, but I've not been able to do something like that in my case.
Any idea?
Your query is getting analyzed as full-text, rather than as one atomic string. In order to allow the query parser to effectively decide on the appropriate analyzer to use for different fields, you can use a PerFieldAnalyzerWrapper, with KeywordAnalyzer being the appropriate analyzer to apply to a StringField.
Map<String,Analyzer> analyzerMap = new HashMap<String,Analyzer>();
analyzerMap.put("Title", new KeywordAnalyzer());
PerFieldAnalyzerWrapper analyzer =
new PerFieldAnalyzerWrapper(new EnglishAnalyzer(Version.LUCENE_45), analyzerMap);
QueryParser parser = new QueryParser(Version.LUCENE_45, "Content", analyzer);
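As a rough picture of what the per-field wrapper changes, here is a toy, stdlib-only model of per-field analysis. Nothing in it is a Lucene class: PerFieldSketch, KEYWORD and FULL_TEXT are hypothetical names, and the lowercase-and-split function is only a stand-in for a real full-text analyzer such as EnglishAnalyzer.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Toy model of per-field analysis; these are not Lucene classes.
class PerFieldSketch {
    // Keyword-style analysis: the whole input is one token, unchanged.
    static final Function<String, List<String>> KEYWORD = Collections::singletonList;

    // Stand-in for a full-text analyzer: lowercase and split on whitespace.
    static final Function<String, List<String>> FULL_TEXT =
            s -> Arrays.asList(s.toLowerCase().split("\\s+"));

    static final Map<String, Function<String, List<String>>> PER_FIELD = new HashMap<>();
    static {
        PER_FIELD.put("Title", KEYWORD);
    }

    // Picks the analysis registered for the field, else the default.
    static List<String> analyze(String field, String text) {
        return PER_FIELD.getOrDefault(field, FULL_TEXT).apply(text);
    }

    public static void main(String[] args) {
        System.out.println(analyze("Title", "washington dc"));   // [washington dc]
        System.out.println(analyze("Content", "Washington DC")); // [washington, dc]
    }
}
```

With the Title field mapped to keyword-style analysis, the quoted query text stays one token, which lines up with how a StringField stores its value as a single unanalyzed term, so no phrase query with positions is needed.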