Neo4j\Lucene multi-term wildcard at the end of a query

I'm trying to build auto-suggestion on top of a Lucene full-text index.
The main issue is how to do auto-suggestion (autocomplete) for multi-term phrases. For example, for the input
nosql dat*
results can be
nosql database
nosql data
but not
perfect nosql database
What is the correct Lucene query syntax for auto-suggestion that matches on the first words of a multi-term query, with a wildcard at the end?

I had a similar requirement.
Lucene has span queries, which let a query use the positions of words in the text.
I implemented it in Lucene using SpanFirstQuery (see the Lucene docs).
Here I use SpanNearQuery to force all the words to be adjacent and in order, and
SpanFirstQuery to force them to be at the start of the text.
// querystr is the raw user input; VALUE is the name of the indexed field
if (querystr.contains(" ")) { // more than one word?
    String[] words = querystr.split(" ");
    SpanQuery[] clausesWildCard = new SpanQuery[words.length];
    for (int i = 0; i < words.length; i++) {
        if (i == words.length - 1) { // last word: add a prefix (wildcard) clause
            PrefixQuery pq = new PrefixQuery(new Term(VALUE, words[i]));
            clausesWildCard[i] = new SpanMultiTermQueryWrapper<PrefixQuery>(pq);
        } else { // earlier words must match exactly
            Term clause = new Term(VALUE, words[i]);
            clausesWildCard[i] = new SpanTermQuery(clause);
        }
    }
    // slop 0, in order: the words must be adjacent and in the given order
    SpanQuery allTheWordsNear = new SpanNearQuery(clausesWildCard, 0, true);
    // the whole match must end within the first words.length positions
    prefixquery = new SpanFirstQuery(allTheWordsNear, words.length);
}
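The matching rule that this span query enforces can be illustrated without Lucene: every word except the last must equal the leading tokens of the text, in order, and the last word only has to be a prefix of the next token. The sketch below is plain Java, not Lucene code. The class and method names are hypothetical, and the tokenization (whitespace split plus lowercasing) is an assumption that only roughly approximates what an analyzer does.

```java
import java.util.Locale;

// Hypothetical helper, not part of Lucene: mimics which texts a
// SpanFirstQuery(SpanNearQuery(clauses, 0, true), n) with a trailing prefix clause accepts.
final class LeadingPhrasePrefix {

    // True when the query words match the first tokens of the text in order,
    // with the last query word treated as a prefix.
    static boolean matches(String text, String query) {
        String[] tokens = text.trim().toLowerCase(Locale.ROOT).split("\\s+");
        String[] words = query.trim().toLowerCase(Locale.ROOT).split("\\s+");
        if (words.length > tokens.length) {
            return false;
        }
        int last = words.length - 1;
        for (int i = 0; i < last; i++) {
            if (!tokens[i].equals(words[i])) {
                return false; // an earlier word differs from the leading token
            }
        }
        return tokens[last].startsWith(words[last]);
    }

    public static void main(String[] args) {
        String query = "nosql dat";
        for (String text : new String[] {"nosql database", "nosql data", "perfect nosql database"}) {
            System.out.println(text + " -> " + matches(text, query));
        }
    }
}
```

With the query from the question, the first two texts are accepted and "perfect nosql database" is rejected, because its first token is not "nosql".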

Related

Apache Lucene fuzzy search for multi-worded phrases

I have the following Apache Lucene 7 application:
StandardAnalyzer standardAnalyzer = new StandardAnalyzer();
Directory directory = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(standardAnalyzer);
IndexWriter writer = new IndexWriter(directory, config);
Document document = new Document();
document.add(new TextField("content", new FileReader("document.txt")));
writer.addDocument(document);
writer.close();
IndexReader reader = DirectoryReader.open(directory);
IndexSearcher searcher = new IndexSearcher(reader);
Query fuzzyQuery = new FuzzyQuery(new Term("content", "Company"), 2);
TopDocs results = searcher.search(fuzzyQuery, 5);
System.out.println("Hits: " + results.totalHits);
System.out.println("Max score:" + results.getMaxScore());
When I use it with:
new FuzzyQuery(new Term("content", "Company"), 2);
the application works fine and returns the following result:
Hits: 1
Max score:0.35161147
But when I search with a multi-term query, for example:
new FuzzyQuery(new Term("content", "Company name"), 2);
it returns the following result:
Hits: 0
Max score:NaN
However, the phrase Company name does exist in the source document.txt file.
How do I properly use FuzzyQuery in this case to run a fuzzy search for multi-word phrases?
UPDATED
Based on the provided solution I have tested it on the following text information:
Company name: BlueCross BlueShield Customer Service
1-800-521-2227
of Texas Preauth-Medical 1-800-441-9188
Preauth-MH/CD 1-800-528-7264
Blue Card Access 1-800-810-2583
For the following query:
SpanQuery[] clauses = new SpanQuery[2];
clauses[0] = new SpanMultiTermQueryWrapper<FuzzyQuery>(new FuzzyQuery(new Term("content", "BlueCross"), 2));
clauses[1] = new SpanMultiTermQueryWrapper<FuzzyQuery>(new FuzzyQuery(new Term("content", "BlueShield"), 2));
SpanNearQuery query = new SpanNearQuery(clauses, 0, true);
the search works fine:
Hits: 1
Max score:0.5753642
But when I corrupt the search query slightly (for example, changing BlueCross to BlueCros):
SpanQuery[] clauses = new SpanQuery[2];
clauses[0] = new SpanMultiTermQueryWrapper<FuzzyQuery>(new FuzzyQuery(new Term("content", "BlueCros"), 2));
clauses[1] = new SpanMultiTermQueryWrapper<FuzzyQuery>(new FuzzyQuery(new Term("content", "BlueShield"), 2));
SpanNearQuery query = new SpanNearQuery(clauses, 0, true);
it stops working and returns:
Hits: 0
Max score:NaN
The problem is that you are using TextField, which is a tokenized field. For example, the text "Company name is working on something" is split on whitespace (and other delimiters). So even though your text contains Company name, at index time it becomes the separate terms Company, name, is, etc.
A FuzzyQuery on the single term "Company name" therefore cannot find what you're looking for. A trick that will help looks like this:
SpanQuery[] clauses = new SpanQuery[2];
clauses[0] = new SpanMultiTermQueryWrapper<FuzzyQuery>(new FuzzyQuery(new Term("content", "some"), 2));
clauses[1] = new SpanMultiTermQueryWrapper<FuzzyQuery>(new FuzzyQuery(new Term("content", "text"), 2));
SpanNearQuery query = new SpanNearQuery(clauses, 0, true);
However, I wouldn't recommend this approach much, especially under heavy load or if you plan to search for company names that are 10 terms long. Be aware that these queries are potentially expensive to execute.
The problem with BlueCros is the following. The TextField is analyzed with the StandardAnalyzer configured on the IndexWriter, which lowercases terms, so BlueCross in the content field is indexed as bluecross.
The edit distance between BlueCros and bluecross is 3, which exceeds the maximum of 2 edits allowed here; that is why there is no match.
A simple fix is to lowercase the query term, for example with .toLowerCase().
In general, prefer using the same analyzer at query time (i.e. when constructing the query) as at index time.
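The distance of 3 can be checked by hand with a small edit-distance routine. The sketch below is plain Java with a hypothetical class name. It computes the classic Levenshtein distance, whereas Lucene's FuzzyQuery matches with automata and by default also counts a transposition as one edit; that difference does not affect this pair of strings.

```java
import java.util.Locale;

// Hypothetical helper for illustration only; this is not how Lucene evaluates FuzzyQuery.
final class EditDistance {

    // Classic two-row dynamic-programming Levenshtein distance (insert, delete, substitute).
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; // reuse the two rows instead of allocating a full matrix
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        String indexed = "bluecross"; // the term as StandardAnalyzer stores it
        String raw = "BlueCros";      // the query term as typed
        System.out.println(levenshtein(raw, indexed));
        System.out.println(levenshtein(raw.toLowerCase(Locale.ROOT), indexed));
    }
}
```

The raw term needs two case substitutions plus one insertion, while the lowercased term needs only the insertion, which is within the limit of 2.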
For Lucene.Net, it can look like this:
private string _IndexPath = @"Your Index Path";
private Directory _Directory;
private Searcher _IndexSearcher;
private MultiPhraseQuery _MultiPhraseQuery;
_Directory = FSDirectory.Open(_IndexPath);
IndexReader indexReader = IndexReader.Open(_Directory, true);
string field = "Name"; // your field name
string keyword = "big red fox"; // your search term
float fuzzy = 0.7f; // minimum similarity, between 0 and 1
using (_IndexSearcher = new IndexSearcher(indexReader))
{
    // "big red fox" to [big,red,fox]
    var keywordSplit = keyword.Split();
    _MultiPhraseQuery = new MultiPhraseQuery();
    FuzzyTermEnum[] _FuzzyTermEnum = new FuzzyTermEnum[keywordSplit.Length];
    Term[] _Term = new Term[keywordSplit.Length];
    for (int i = 0; i < keywordSplit.Length; i++)
    {
        _FuzzyTermEnum[i] = new FuzzyTermEnum(indexReader, new Term(field, keywordSplit[i]), fuzzy);
        _Term[i] = _FuzzyTermEnum[i].Term;
        if (_Term[i] == null)
        {
            _MultiPhraseQuery.Add(new Term(field, keywordSplit[i]));
        }
        else
        {
            _MultiPhraseQuery.Add(_FuzzyTermEnum[i].Term);
        }
    }
    var results = _IndexSearcher.Search(_MultiPhraseQuery, indexReader.MaxDoc);
    foreach (var loopDoc in results.ScoreDocs.OrderByDescending(s => s.Score))
    {
        // your code here
    }
}

Can we use SpanNearQuery in phonetic index?

I've implemented Lucene-based software to index more than 10 million personal names, and these names can be written in different ways, like "Luíz" and "Luis". The index was created using the phonetic values of the respective tokens (with a custom analyzer).
Currently, I'm using QueryParser to query for a given name, with good results. But the book "Lucene in Action" mentions that SpanNearQuery can improve queries by using the proximity of tokens. I tried SpanNearQuery against a non-phonetic index of names, and the results were better than with QueryParser.
Since we should query with the same analyzer that was used for indexing, I couldn't work out how to use my custom phonetic analyzer together with SpanNearQuery. In other words:
How can I use SpanNearQuery on the phonetic index?
Thanks in advance.
My first thought is: Wouldn't a phrase query with slop do the job? That would certainly be the easiest way:
"term1 term2"~5
This will use your phonetic analyzer, and produce a proximity query with the resulting tokens.
So, if you really do need to use SpanQueries here (perhaps you are using fuzzy queries or wildcards or some such, or PhraseQuery has been leering menacingly at you and you want nothing more to do with it), you'll need to do the analysis yourself. You can do this by getting a TokenStream from Analyzer.tokenStream, and iterating through the analyzed tokens.
If you are using a phonetic algorithm that produces a single code per term (soundex, for example):
SpanNearQuery.Builder nearBuilder = new SpanNearQuery.Builder("text", true);
nearBuilder.setSlop(4);
TokenStream stream = analyzer.tokenStream("text", queryStringToParse);
stream.addAttribute(CharTermAttribute.class);
stream.reset();
while (stream.incrementToken()) {
    CharTermAttribute token = stream.getAttribute(CharTermAttribute.class);
    nearBuilder.addClause(new SpanTermQuery(new Term("text", token.toString())));
}
Query finalQuery = nearBuilder.build();
stream.close();
If you are using a double metaphone, where you can have 1-2 terms at the same position, it's a bit more complex, as you'll need to consider those position increments:
SpanNearQuery.Builder nearBuilder = new SpanNearQuery.Builder("text", true);
nearBuilder.setSlop(4);
TokenStream stream = analyzer.tokenStream("text", "through and through");
stream.addAttribute(CharTermAttribute.class);
stream.addAttribute(PositionIncrementAttribute.class);
stream.reset();
String queuedToken = null;
while (stream.incrementToken()) {
    CharTermAttribute token = stream.getAttribute(CharTermAttribute.class);
    PositionIncrementAttribute increment = stream.getAttribute(PositionIncrementAttribute.class);
    if (increment.getPositionIncrement() == 0) {
        // second code at the same position: match either one
        nearBuilder.addClause(new SpanOrQuery(
            new SpanTermQuery(new Term("text", queuedToken)),
            new SpanTermQuery(new Term("text", token.toString()))
        ));
        queuedToken = null;
    }
    else if (increment.getPositionIncrement() >= 1 && queuedToken != null) {
        // previous position had a single code: flush it and queue the new one
        nearBuilder.addClause(new SpanTermQuery(new Term("text", queuedToken)));
        queuedToken = token.toString();
    }
    else {
        queuedToken = token.toString();
    }
}
if (queuedToken != null) {
    nearBuilder.addClause(new SpanTermQuery(new Term("text", queuedToken)));
}
Query finalQuery = nearBuilder.build();
stream.close();

Lucene MultiFieldQuery with WildcardQuery

Currently I have an issue with Lucene search (version 2.9).
I have a search term that I need to apply to several fields, so I have to use MultiFieldQueryParser. On the other hand, I have to use WildcardQuery, because our customer wants to search for a term inside a phrase (e.g. "CMH" should match "KRC250/CMH/830/T/H").
I have tried replacing the slashes ('/') with stars ('*') and using a BooleanQuery with the term enclosed in stars.
Unfortunately, without any success.
Does anyone have any idea?
Yes, if the field shown is a single token, setting setAllowLeadingWildcard to be true would be necessary, like:
parser.setAllowLeadingWildcard(true);
Query query = parser.parse("*CMH*");
However:
You don't mention how the field is analyzed. By default, the StandardAnalyzer is used, which will split it into tokens at slashes (or asterisks, when indexing data). If you are using this sort of analysis, you could simply create a TermQuery searching for "cmh" (StandardAnalyzer includes a LowerCaseFilter), or simply:
String[] fields = {"this", "that", "another"};
QueryParser parser = new MultiFieldQueryParser(Version.LUCENE_29, fields, analyzer); // assuming StandardAnalyzer
Query simpleQuery = parser.parse("CMH");
//Or even...
Query slightlyMoreComplexQuery = parser.parse("\"CMH/830/T\"");
I don't understand what you mean by a BooleanQuery with enclosed stars, if you can include code to elucidate that, it might help.
Sorry, maybe I described it a little bit wrong.
I tried something like this:
BooleanQuery bq = new BooleanQuery();
foreach (string field in fields)
{
    foreach (string tok in tokArr)
    {
        bq.Add(new WildcardQuery(new Term(field, " *" + tok + "* ")), BooleanClause.Occur.SHOULD);
    }
}
return bq;
but unfortunately it did not work.
I have modified it like this:
string newterm = string.Empty;
string[] tok = term.Split(new[] { ' ', '/' }, StringSplitOptions.RemoveEmptyEntries);
tok.ForEach(x => newterm += x.EnsureStartsWith(" *").EnsureEndsWith("* "));
var version = Lucene.Net.Util.Version.LUCENE_29;
var analyzer = new StandardAnalyzer(version);
var parser = new MultiFieldQueryParser(version, fields, analyzer);
parser.SetDefaultOperator(QueryParser.Operator.AND);
parser.SetAllowLeadingWildcard(true);
return parser.Parse(newterm);
and my customer loves it :-)

Is there any way to extract all the tokens from solr?

How can one extract all the tokens from Solr? Not from one document, but from all the documents indexed in Solr?
Thanks!
You can do something like this (a sketch against the Lucene 4.x API):
IndexReader reader = DirectoryReader.open(dir);
Fields fields = MultiFields.getFields(reader); // null when the index has no indexed fields
if (fields != null) {
    for (String field : fields) {
        Terms terms = fields.terms(field);
        if (terms != null) {
            TermsEnum termsEnum = terms.iterator(null);
            BytesRef term;
            while ((term = termsEnum.next()) != null) {
                // do something with the term text
                String text = term.utf8ToString();
            }
        }
    }
}
This iterates over all fields and over every term in each field. See the TermsEnum documentation for its other methods, such as docFreq().

Is it possible to iterate through documents stored in Lucene Index?

I have some documents stored in a Lucene index with a docId field.
I want to get all the docIds stored in the index. There is one complication: the index holds about 300,000 documents, so I would prefer to fetch these docIds in chunks of 500. Is it possible to do so?
IndexReader reader = // create IndexReader
for (int i = 0; i < reader.maxDoc(); i++) {
    if (reader.isDeleted(i))
        continue;
    Document doc = reader.document(i);
    String docId = doc.get("docId");
    // do something with docId here...
}
Lucene 4:
Bits liveDocs = MultiFields.getLiveDocs(reader);
for (int i = 0; i < reader.maxDoc(); i++) {
    if (liveDocs != null && !liveDocs.get(i))
        continue;
    Document doc = reader.document(i);
}
See LUCENE-2600 on this page for details: https://lucene.apache.org/core/4_0_0/MIGRATE.html
There is a query class named MatchAllDocsQuery, I think it can be used in this case:
Query query = new MatchAllDocsQuery();
TopDocs topDocs = getIndexSearcher.search(query, RESULT_LIMIT);
Document numbers (or ids) are consecutive numbers from 0 to IndexReader.maxDoc()-1. These numbers are not persistent and are valid only for the currently opened IndexReader. You can check whether a document is deleted with the IndexReader.isDeleted(int documentNumber) method.
If you use .document(i) as in the examples above and skip over deleted documents, be careful when using this method to paginate results.
For example, with a list of 10 docs per page, you need the docs for page 6. Your input might be something like offset=60, count=10 (documents 60 to 69).
IndexReader reader = // create IndexReader
for (int i = offset; i < offset + 10; i++) {
    if (reader.isDeleted(i))
        continue;
    Document doc = reader.document(i);
    String docId = doc.get("docId");
}
You will have some problems with the deleted ones because you should not start from offset=60, but from offset=60 + the number of deleted documents that appear before 60.
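One way to see the offset problem is to count only live documents while scanning. The sketch below is plain Java with no Lucene dependency: the BitSet is a stand-in for the reader's deleted-document flags, and the class name, method name, and parameters are all hypothetical.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical helper: the BitSet stands in for IndexReader.isDeleted(int).
final class LiveDocPager {

    // Returns the document numbers on the requested page, where offset and count
    // are measured in live documents rather than raw document numbers.
    static List<Integer> page(BitSet deleted, int maxDoc, int offset, int count) {
        List<Integer> result = new ArrayList<>();
        int liveSeen = 0;
        for (int doc = 0; doc < maxDoc && result.size() < count; doc++) {
            if (deleted.get(doc)) {
                continue; // deleted documents do not consume an offset slot
            }
            if (liveSeen >= offset) {
                result.add(doc);
            }
            liveSeen++;
        }
        return result;
    }

    public static void main(String[] args) {
        BitSet deleted = new BitSet();
        deleted.set(1);
        deleted.set(4);
        // 10 document numbers, 2 of them deleted; second page of 3 live documents
        System.out.println(page(deleted, 10, 3, 3));
    }
}
```

This scans from document 0 on every call, so the cost grows with the offset. For deep pages it may be worth remembering the last document number returned and resuming from there, which is valid only while the same reader stays open.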
An alternative I found is something like this:
is = getIndexSearcher(); //new IndexSearcher(indexReader)
//get all results without any conditions attached.
Term term = new Term([[any mandatory field name]], "*");
Query query = new WildcardQuery(term);
topCollector = TopScoreDocCollector.create([[int max hits to get]], true);
is.search(query, topCollector);
TopDocs topDocs = topCollector.topDocs(offset, count);
Note: replace the text between [[ ]] with your own values.
I ran this on a large index with 1.5 million entries and got 10 random results in less than a second.
Admittedly it is slower, but at least you can ignore deleted documents if you need pagination.