Lucene doesn't search number fields

I'm trying to index and then search an integer field with Lucene, but it doesn't find anything (text fields search fine).
Document doc = new Document();
//UserType = 1
doc.add(new IntField("userType", user.getType().getId(), Field.Store.YES));
FSDirectory dir = FSDirectory.open(FileSystems.getDefault().getPath(indexDir));
IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
writer = new IndexWriter(dir, config);
writer.addDocument(doc);
For searching I tried the following queries:
1) new QueryParser(defaultField, new StandardAnalyzer()).parse("userType:1");
2) new QueryParser(defaultField, new StandardAnalyzer()).parse("userType:[1 TO 1]");
3) new QueryParser(defaultField, new StandardAnalyzer()).parse("userType:\"1\"");
But none of them work.

QueryParser doesn't handle numeric fields. You can search using NumericRangeQuery instead:
Query query = NumericRangeQuery.newIntRange("userType", 1, 1, true, true);

Related

Apache Lucene fuzzy search for multi-worded phrases

I have the following Apache Lucene 7 application:
StandardAnalyzer standardAnalyzer = new StandardAnalyzer();
Directory directory = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(standardAnalyzer);
IndexWriter writer = new IndexWriter(directory, config);
Document document = new Document();
document.add(new TextField("content", new FileReader("document.txt")));
writer.addDocument(document);
writer.close();
IndexReader reader = DirectoryReader.open(directory);
IndexSearcher searcher = new IndexSearcher(reader);
Query fuzzyQuery = new FuzzyQuery(new Term("content", "Company"), 2);
TopDocs results = searcher.search(fuzzyQuery, 5);
System.out.println("Hits: " + results.totalHits);
System.out.println("Max score:" + results.getMaxScore());
When I use it with:
new FuzzyQuery(new Term("content", "Company"), 2);
the application works fine and returns the following result:
Hits: 1
Max score:0.35161147
but when I try to search with a multi-word term, for example:
new FuzzyQuery(new Term("content", "Company name"), 2);
it returns the following result:
Hits: 0
Max score:NaN
However, the phrase Company name does exist in the source document.txt file.
How do I properly use FuzzyQuery in this case so that I can do a fuzzy search for multi-word phrases?
UPDATED
Based on the provided solution I have tested it on the following text information:
Company name: BlueCross BlueShield Customer Service
1-800-521-2227
of Texas Preauth-Medical 1-800-441-9188
Preauth-MH/CD 1-800-528-7264
Blue Card Access 1-800-810-2583
For the following query:
SpanQuery[] clauses = new SpanQuery[2];
clauses[0] = new SpanMultiTermQueryWrapper<FuzzyQuery>(new FuzzyQuery(new Term("content", "BlueCross"), 2));
clauses[1] = new SpanMultiTermQueryWrapper<FuzzyQuery>(new FuzzyQuery(new Term("content", "BlueShield"), 2));
SpanNearQuery query = new SpanNearQuery(clauses, 0, true);
the search works fine:
Hits: 1
Max score:0.5753642
but when I corrupt the search query slightly (for example, changing BlueCross to BlueCros)
SpanQuery[] clauses = new SpanQuery[2];
clauses[0] = new SpanMultiTermQueryWrapper<FuzzyQuery>(new FuzzyQuery(new Term("content", "BlueCros"), 2));
clauses[1] = new SpanMultiTermQueryWrapper<FuzzyQuery>(new FuzzyQuery(new Term("content", "BlueShield"), 2));
SpanNearQuery query = new SpanNearQuery(clauses, 0, true);
it stops working and returns:
Hits: 0
Max score:NaN
The problem is that you're using TextField, which is a tokenized field. For example, the text "Company name is working on something" is effectively split on spaces (and other delimiters). So even though the text contains Company name, at index time it becomes the separate terms Company, name, is, etc.
In this case a query on the single term "Company name" cannot find what you're looking for, because no such term exists in the index. A trick that helps looks like this:
SpanQuery[] clauses = new SpanQuery[2];
clauses[0] = new SpanMultiTermQueryWrapper<FuzzyQuery>(new FuzzyQuery(new Term("content", "some"), 2));
clauses[1] = new SpanMultiTermQueryWrapper<FuzzyQuery>(new FuzzyQuery(new Term("content", "text"), 2));
SpanNearQuery query = new SpanNearQuery(clauses, 0, true);
However, I wouldn't recommend this approach much, especially if your load is heavy and you plan to search for company names that are 10 terms long. Be aware that such queries are potentially expensive to execute.
The problem with BlueCros is the following. Your IndexWriter is configured with StandardAnalyzer, which is applied to the TextField. It lowercases the terms, so BlueCross in the content field is indexed as bluecross.
The edit distance between BlueCros and bluecross is 3 (two case changes plus the missing s), which exceeds the maximum of 2 edits; that's why you get no match.
A simple fix is to convert the query term to lowercase, for example with .toLowerCase().
In general, you should use the same analyzer at query time as at index time (e.g. when constructing the query).
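To make the distance argument concrete, here is a minimal sketch that computes a plain Levenshtein distance for the two spellings. The class and method names are made up for illustration, and this is a textbook implementation, not Lucene's own automaton-based matching (which by default also counts a transposition as a single edit; that makes no difference for this pair).

```java
import java.util.Locale;

final class EditDistanceDemo {
    /** Classic Levenshtein distance (insert, delete, substitute) using two rolling rows. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Distance from the empty prefix of a to each prefix of b
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        String indexedTerm = "bluecross"; // what the lowercasing analyzer put in the index
        String rawQuery = "BlueCros";     // what was passed to FuzzyQuery
        System.out.println(levenshtein(rawQuery, indexedTerm));                            // 3
        System.out.println(levenshtein(rawQuery.toLowerCase(Locale.ROOT), indexedTerm));   // 1
    }
}
```

With the raw query the distance is above the limit of 2; after lowercasing it drops to 1, which is within the limit.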
For Lucene.Net it can be like this.
private string _IndexPath = @"Your Index Path";
private Directory _Directory;
private Searcher _IndexSearcher;
private MultiPhraseQuery _MultiPhraseQuery;
_Directory = FSDirectory.Open(_IndexPath);
IndexReader indexReader = IndexReader.Open(_Directory, true);
string field = "Name"; // Your field name
string keyword = "big red fox"; // your search term
float fuzzy = 0.7f; // between 0-1
using (_IndexSearcher = new IndexSearcher(indexReader))
{
// "big red fox" to [big,red,fox]
var keywordSplit = keyword.Split();
_MultiPhraseQuery = new MultiPhraseQuery();
FuzzyTermEnum[] _FuzzyTermEnum = new FuzzyTermEnum[keywordSplit.Length];
Term[] _Term = new Term[keywordSplit.Length];
for (int i = 0; i < keywordSplit.Length; i++)
{
_FuzzyTermEnum[i] = new FuzzyTermEnum(indexReader, new Term(field, keywordSplit[i]),fuzzy);
_Term[i] = _FuzzyTermEnum[i].Term;
if (_Term[i] == null)
{
_MultiPhraseQuery.Add(new Term(field, keywordSplit[i]));
}
else
{
_MultiPhraseQuery.Add(_FuzzyTermEnum[i].Term);
}
}
var results = _IndexSearcher.Search(_MultiPhraseQuery, indexReader.MaxDoc);
foreach (var loopDoc in results.ScoreDocs.OrderByDescending(s => s.Score))
{
//YourCode Here
}
}

Apache Lucene 5.5.3 - Searching a string ending with special character

I'm using Apache Lucene 5.5.3. I'm using org.apache.lucene.analysis.standard.StandardAnalyzer in my code and using below code snippet to create index.
Document doc = new Document();
doc.add(new TextField("userName", getUserName(), Field.Store.YES));
Now if I search for the string 'ALL-', I get no search results, but if I search for the string 'ALL-Categories', I do get some results.
The same thing happens for strings with special characters such as '+', '.', '!' etc.
Below is my search code:
Directory directory = new RAMDirectory();
IndexReader reader = DirectoryReader.open(directory);
IndexSearcher searcher = new IndexSearcher(reader);
Document document = new Document();
document.add(new TextField("body", "ALL-THE GLITTERS IS NOT GOLD", Field.Store.YES));
IndexWriter writer = new IndexWriter(directory, new IndexWriterConfig(buildAnalyzer()));
writer.addDocument(document);
writer.commit();
Builder builder = new BooleanQuery.Builder();
Query query1 = new QueryParser(IndexAttribute.USER_NAME, buildAnalyzer()).parse(searchQUery+"*");
Query query2 = new QueryParser(IndexAttribute.IS_VETERAN, buildAnalyzer()).parse(""+isVeteran);
builder.add(query1, BooleanClause.Occur.MUST);
builder.add(query2, BooleanClause.Occur.MUST);
Query q = builder.build();
TopDocs docs = searcher.search(q, 10);
ScoreDoc[] hits = docs.scoreDocs;
private static Analyzer buildAnalyzer() throws IOException {
return CustomAnalyzer.builder().withTokenizer("whitespace").addTokenFilter("lowercase")
.addTokenFilter("standard").build();
}
So, please advise.
Please refer to the section Escaping Special Characters for the list of special characters in Lucene 5.5.3.
As suggested there, you need to place a \ before each special character; alternatively, you can use the method public static String escape(String s) of the QueryParser class to achieve the same.
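As a rough illustration of what backslash-escaping does to the input, here is a hand-rolled sketch. It is not Lucene's code: the class name is hypothetical and the character list is my reading of the documented query syntax operators, so in real code prefer QueryParser.escape.

```java
final class QueryEscapeSketch {
    // Characters the classic query syntax treats as operators (assumed list, see lead-in)
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&/";

    /** Prefixes every special character with a backslash and leaves the rest untouched. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String userInput = "ALL-";
        // Escape the user input first, then append the unescaped * so it stays a wildcard
        System.out.println(escape(userInput) + "*"); // ALL\-*
    }
}
```

The order matters: escaping after appending the * would turn the wildcard into a literal asterisk.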
I got the solution with a combination of WildcardQuery, StringField and MultiFieldQueryParser. In addition to using these classes, we have to escape the space in the query string.

Lucene white space analyzer ignoring phrases?

I'm updating a document in Lucene, but when I search for the full value in one of the fields no results come back. If I search for just one word, then I get a result back.
This example comes from chapter 2 of the Lucene in Action 2nd Edition book and I'm using the Lucene 3 Java Library.
Here's the main logic
"Document fields show new value when updated, and not old value" in {
getHitCount("city", "Amsterdam") must equal(1)
val update = new Document
update add new Field("id", "1", Field.Store.YES, Field.Index.NOT_ANALYZED)
update add new Field("country", "Netherlands", Field.Store.YES, Field.Index.NO)
update add new Field("contents", "Den Haag has a lot of museums", Field.Store.NO, Field.Index.ANALYZED)
update add new Field("city", "Den Haag", Field.Store.YES, Field.Index.ANALYZED)
wr updateDocument(new Term("id", "1"), update)
wr close
getHitCount("city", "Amsterdam") must equal(0)
getHitCount("city", "Den Haag") must equal(1)
}
It's the last line in the above that fails - the hit count is 0. If I change the query to either "Den" or "Haag" then I get 1 hit.
Here is all the setup and dependencies. Note how the writer uses a white space query analyzer as the book suggests. Is this the problem?
override def beforeEach{
dir = new RAMDirectory
val wri = writer
for (i <- 0 to ids.length - 1) {
val doc = new Document
doc add new Field("id", ids(i), Field.Store.YES, Field.Index.NOT_ANALYZED)
doc add new Field("country", unindexed(i), Field.Store.YES, Field.Index.NO)
doc add new Field("contents", unstored(i), Field.Store.NO, Field.Index.ANALYZED)
doc add new Field("city", text(i), Field.Store.YES, Field.Index.ANALYZED)
wri addDocument doc
}
wri close
wr = writer
}
var dir: RAMDirectory = _
def writer = new IndexWriter(dir, new WhitespaceAnalyzer, IndexWriter.MaxFieldLength.UNLIMITED)
var wr: IndexWriter = _
def getHitCount(field: String, q: String): Int = {
val searcher = new IndexSearcher(dir)
val query = new TermQuery(new Term(field, q))
val hitCount = searcher.search(query, 1).totalHits
searcher.close()
hitCount
}
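You may want to look at PhraseQuery instead of TermQuery. WhitespaceAnalyzer indexes "Den Haag" as the two terms Den and Haag, so a TermQuery for the single term "Den Haag" matches nothing.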
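The difference can be pictured with a toy positional index. This is only a sketch of the idea, with made-up class and method names and no Lucene classes involved: a term lookup needs one token that equals the whole query string, while a phrase lookup checks that the words sit at consecutive positions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TermVsPhraseSketch {
    /** Builds term -> positions for one field value, splitting on whitespace only. */
    static Map<String, List<Integer>> index(String fieldValue) {
        Map<String, List<Integer>> postings = new HashMap<>();
        String[] tokens = fieldValue.trim().split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            postings.computeIfAbsent(tokens[pos], k -> new ArrayList<>()).add(pos);
        }
        return postings;
    }

    /** Term lookup: the query string must equal one indexed token exactly. */
    static boolean termMatch(Map<String, List<Integer>> postings, String term) {
        return postings.containsKey(term);
    }

    /** Exact phrase lookup: every word must appear at consecutive positions, in order. */
    static boolean phraseMatch(Map<String, List<Integer>> postings, String... words) {
        if (words.length == 0) {
            return false;
        }
        List<Integer> none = Collections.emptyList();
        for (int start : postings.getOrDefault(words[0], none)) {
            boolean ok = true;
            for (int i = 1; i < words.length && ok; i++) {
                ok = postings.getOrDefault(words[i], none).contains(start + i);
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> city = index("Den Haag");
        System.out.println(termMatch(city, "Den Haag"));      // false
        System.out.println(termMatch(city, "Den"));           // true
        System.out.println(phraseMatch(city, "Den", "Haag")); // true
    }
}
```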

How to set Lucene standard analyzer for PhraseQuery search?

I'm under the impression from a variety of tutorials out there on Lucene that if I do something like:
IndexWriter writer = new IndexWriter(indexPath, new StandardAnalyzer(Version.LUCENE_CURRENT), true, IndexWriter.MaxFieldLength.LIMITED);
Document doc = new Document();
Field title = new Field("title", titlefield, Field.Store.YES, Field.Index.ANALYZED);
doc.add(title);
writer.addDocument(doc);
writer.optimize();
writer.close();
IndexReader ireader = IndexReader.open(indexPath);
IndexSearcher indexsearcher = new IndexSearcher(ireader);
Term term1 = new Term("title", "one");
Term term2 = new Term("title", "two");
PhraseQuery query = new PhraseQuery();
query.add(term1);
query.add(term2);
query.setSlop(2);
that Lucene should return all documents whose title field contains "one" and "two" within 2 words of each other. But I don't get any results because I'm not using the StandardAnalyzer to search. How can I do a proximity search in Lucene then? Does the following QueryParser allow proximity searches (using the tilde)?
QueryParser queryParser = new QueryParser("title",new StandardAnalyzer());
Query query = queryParser.parse("test");
Yes, when you parse a query using QueryParser you will be able to do proximity searching.
In general it is always recommended to use the same analyzer for indexing and searching.
BR,
Chris
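To show what "within 2 words" means, here is a simplified sketch of an in-order proximity check. The class and method names are hypothetical, and it only covers the case where the two words keep their order; Lucene's slop also allows reordered words at a higher cost. The text is lowercased before comparison, which is a rough stand-in for applying the same analyzer on both sides.

```java
import java.util.Locale;

final class ProximitySketch {
    /**
     * Returns true if `first` is followed by `second` with at most `slop`
     * other tokens in between. Query words are expected in lowercase.
     */
    static boolean inOrderWithinSlop(String text, String first, String second, int slop) {
        // Lowercase and split on anything that is not a letter or digit
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
        for (int i = 0; i < tokens.length; i++) {
            if (!tokens[i].equals(first)) {
                continue;
            }
            // j - i - 1 is the number of tokens between the two positions
            for (int j = i + 1; j < tokens.length && j - i - 1 <= slop; j++) {
                if (tokens[j].equals(second)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(inOrderWithinSlop("One and then Two", "one", "two", 2));         // true
        System.out.println(inOrderWithinSlop("One small step, then Two", "one", "two", 2)); // false
    }
}
```

In the query parser syntax the corresponding request would be written as "one two"~2.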

Different analyzers for each field

How can I enable different analyzers for each field in a document I'm indexing with Lucene? Example:
RAMDirectory dir = new RAMDirectory();
IndexWriter iw = new IndexWriter(dir, new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_CURRENT), true, IndexWriter.MaxFieldLength.UNLIMITED);
Document doc = new Document();
Field field1 = new Field("field1", someText1, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS);
Field field2 = new Field("field2", someText2, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS);
doc.Add(field1);
doc.Add(field2);
iw.AddDocument(doc);
iw.Commit();
The analyzer is an argument to the IndexWriter, but I want to use StandardAnalyzer for field1 and SimpleAnalyzer for field2, how can I do that? The same applies when searching, of course. The correct analyzer must be applied for each field.
PerFieldAnalyzerWrapper is what you are looking for. The equivalent of this in Lucene.net is here.
Map<String, Analyzer> analyzerMap = new HashMap<String, Analyzer>();
analyzerMap.put(fieldone, new IKAnalyzer4PinYin(false, IKAnalyzer4PinYin.PINYIN));
analyzerMap.put(fieldtwo, new IKAnalyzer4PinYin(false, IKAnalyzer4PinYin.PINYIN_SHOUZIMU));
PerFieldAnalyzerWrapper wrapper = new PerFieldAnalyzerWrapper(new IKAnalyzer4PinYin(false), analyzerMap);
IndexWriterConfig iwConfig = new IndexWriterConfig(Version.LUCENE_40 , wrapper);
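The idea behind the wrapper can be sketched without Lucene: look up an analyzer by field name and fall back to a default when the field has no entry. Everything below is a toy model with hypothetical names; the two "analyzers" are plain functions that only loosely imitate a keyword-style and a letter-splitting, lowercasing analyzer.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

final class PerFieldSketch {
    private final Function<String, List<String>> defaultAnalyzer;
    private final Map<String, Function<String, List<String>>> perField;

    PerFieldSketch(Function<String, List<String>> defaultAnalyzer,
                   Map<String, Function<String, List<String>>> perField) {
        this.defaultAnalyzer = defaultAnalyzer;
        this.perField = new HashMap<>(perField);
    }

    /** Picks the analyzer registered for the field, or the default if there is none. */
    List<String> analyze(String field, String text) {
        return perField.getOrDefault(field, defaultAnalyzer).apply(text);
    }

    /** Keyword-style: the whole value becomes a single token. */
    static List<String> keyword(String text) {
        return Collections.singletonList(text);
    }

    /** Lowercases and splits on non-letters (assumes the text starts with a letter). */
    static List<String> lowercaseLetters(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+"));
    }

    public static void main(String[] args) {
        Map<String, Function<String, List<String>>> map = new HashMap<>();
        map.put("field2", PerFieldSketch::lowercaseLetters);
        PerFieldSketch wrapper = new PerFieldSketch(PerFieldSketch::keyword, map);
        System.out.println(wrapper.analyze("field1", "Den Haag")); // [Den Haag]
        System.out.println(wrapper.analyze("field2", "Den Haag")); // [den, haag]
    }
}
```

The same wrapper instance would be handed to both the writer and the query parser, so each field is analyzed identically at index and search time.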
Necromancing.
For C#:
Lucene.Net.Util.LuceneVersion version = Lucene.Net.Util.LuceneVersion.LUCENE_48;
Dictionary<string, Lucene.Net.Analysis.Analyzer> fieldAnalyzers =
new Dictionary<string, Lucene.Net.Analysis.Analyzer>(System.StringComparer.OrdinalIgnoreCase);
fieldAnalyzers["YourFieldName"] = new Lucene.Net.Analysis.Core.KeywordAnalyzer();
Lucene.Net.Analysis.Miscellaneous.PerFieldAnalyzerWrapper wrapper =
new Lucene.Net.Analysis.Miscellaneous.PerFieldAnalyzerWrapper(
new Lucene.Net.Analysis.Core.KeywordAnalyzer(), fieldAnalyzers);
Lucene.Net.Index.IndexWriterConfig writerConfig = new Lucene.Net.Index.IndexWriterConfig(version, wrapper);