Lucene MultiFieldQuery with WildcardQuery - lucene

Currently I have an issue with the Lucene search (version 2.9).
I have a search term and I need to use it on several fields, therefore I have to use MultiFieldQueryParser. On the other hand, I have to use a WildcardQuery, because our customer wants to search for a term inside a phrase (e.g. "CMH" should match "KRC250/CMH/830/T/H").
I have tried to replace the slashes ('/') with stars ('*') and use a BooleanQuery with enclosed stars for the term.
Unfortunately, without any success.
Does anyone have any idea?

Yes, if the field shown is a single token, setting setAllowLeadingWildcard to true would be necessary, like:
parser.setAllowLeadingWildcard(true);
Query query = parser.parse("*CMH*");
However:
You don't mention how the field is analyzed. By default, the StandardAnalyzer is used, which will split it into tokens at slashes (or at asterisks, when indexing data). If you are using this sort of analysis, you could simply create a TermQuery searching for "cmh" (StandardAnalyzer includes a LowerCaseFilter), or:
String[] fields = {"this", "that", "another"};
QueryParser parser = new MultiFieldQueryParser(Version.LUCENE_29, fields, analyzer); // assuming StandardAnalyzer
Query simpleQuery = parser.parse("CMH");
//Or even...
Query slightlyMoreComplexQuery = parser.parse("\"CMH/830/T\"");
I don't understand what you mean by a BooleanQuery with enclosed stars, if you can include code to elucidate that, it might help.

Sorry, maybe I described it a little incorrectly.
I tried something like this:
BooleanQuery bq = new BooleanQuery();
foreach (string field in fields)
{
    foreach (string tok in tokArr)
    {
        bq.Add(new WildcardQuery(new Term(field, " *" + tok + "* ")), BooleanClause.Occur.SHOULD);
    }
}
return bq;
but unfortunately it did not work.
I have modified it like this:
string newterm = string.Empty;
string[] tok = term.Split(new[] { ' ', '/' }, StringSplitOptions.RemoveEmptyEntries);
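// Note: EnsureStartsWith / EnsureEndsWith (and ForEach on the array) are not standard .NET APIs; they are presumably custom extension methods.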
tok.ForEach(x => newterm += x.EnsureStartsWith(" *").EnsureEndsWith("* "));
var version = Lucene.Net.Util.Version.LUCENE_29;
var analyzer = new StandardAnalyzer(version);
var parser = new MultiFieldQueryParser(version, fields, analyzer);
parser.SetDefaultOperator(QueryParser.Operator.AND);
parser.SetAllowLeadingWildcard(true);
return parser.Parse(newterm);
and my customer loves it :-)

Related

What is an alternative for Lucene Query's extractTerms?

In Lucene 4.6.0 there was the method extractTerms that provided the extraction of terms from a query (Query 4.6.0). However, as of Lucene 6.2.1 it no longer exists (Query Lucene 6.2.1). Is there a valid alternative for it?
What I'd need is to parse the terms (and corresponding fields) of a Query built by QueryParser.
Maybe not the best answer but one way is to use the same analyzer and tokenize the query string:
Analyzer anal = new StandardAnalyzer();
TokenStream ts = anal.tokenStream("title", query); // query is the query string
CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
ts.reset();
while (ts.incrementToken()) {
    System.out.println(termAtt.toString());
}
ts.end();
ts.close();
anal.close();
I have temporarily solved my problem with the following code. Smarter alternatives will be well accepted:
QueryParser qp = new QueryParser("title", a);
Query q = qp.parse(query);
Set<Term> termQuerySet = new HashSet<Term>();
Weight w = searcher.createWeight(q, true, 3.4f);
w.extractTerms(termQuerySet);

Hibernate Search with Lucene Phone Number Analyzer issues

Our database contains thousands of numbers in various formats and what I am attempting to do is remove all punctuation at index time and store only the digits and then when a user types digits into a keyword field, only match on those digits. I thought that a custom analyzer was the way to go but I think I am missing an important step...
@Override
protected TokenStreamComponents createComponents(String fieldName) {
    log.debug("Creating Components for Analyzer...");
    final Tokenizer source = new KeywordTokenizer();
    LowerCaseFilter lcFilter = new LowerCaseFilter(source);
    PatternReplaceFilter prFilter = new PatternReplaceFilter(lcFilter,
            Pattern.compile("[^0-9]"), "", true);
    TrimFilter trimFilter = new TrimFilter(prFilter);
    return new TokenStreamComponents(source, trimFilter);
}
...
@KeywordSearch
@Analyzer(impl = com.jjkane.common.search.analyzer.PhoneNumberAnalyzer.class)
@Field(name = "phone", index = org.hibernate.search.annotations.Index.YES, analyze = Analyze.YES, store = Store.YES)
public String getPhone() {
    return this.phone;
}
This may just be ignorance on my part in attempting to do this... From all the documentation, it seems like I am on the right track, but the query never matches unless I submit (555)555-5555 as an exact match to what was in my db. If I put in 5555555555, I get nothing...
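A quick way to narrow this down is to check what the custom analyzer actually emits, along the lines of the tokenStream example earlier in this thread (a minimal sketch; PhoneNumberAnalyzer is the analyzer from the snippet above):
Analyzer analyzer = new PhoneNumberAnalyzer();
try (TokenStream ts = analyzer.tokenStream("phone", "(555)555-5555")) {
    CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
        System.out.println(termAtt.toString()); // expected: 5555555555
    }
    ts.end();
}
If the indexed token really is 5555555555 but queries still miss, the usual remaining suspect is the query side, i.e. whether the keyword query is analyzed with the same analyzer.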

How to search special characters in Hibernate Search?

I'm new to Hibernate Lucene search. For the last few days I have been working on searching for keywords that contain special characters. I am using MultiFieldQueryParser for exact phrase matching as well as Boolean search. In this process I am unable to get results for search keywords like "Having 1+ years of experience", while if I do not put quotes around the search keyword I do get results. What I observed in the execution of the Lucene query is that it escapes the special symbols (+). I am using StandardAnalyzer.class. I think that if I use WhitespaceAnalyzer it will not escape the special characters, but it may affect the Boolean searching like +java +php (i.e. java AND php), because it may treat them as normal text. Please suggest an approach.
The following is my snippet:
Session session = getSession();
FullTextSession fullTextSession = Search.getFullTextSession(session);
MultiFieldQueryParser parser = new MultiFieldQueryParser(new String[] { "student.skills.skill",
        "studentProfileSummary.profileTitle", "studentProfileSummary.currentDesignation" },
        new StandardAnalyzer());
parser.setDefaultOperator(Operator.OR);
org.apache.lucene.search.Query luceneQuery = null;
QueryBuilder qb = fullTextSession.getSearchFactory().buildQueryBuilder().forEntity(Student.class).get();
BooleanQuery boolQuery = new BooleanQuery();
if (StringUtils.isEmpty(zipcode) != true && StringUtils.isBlank(zipcode) != true) {
    boolQuery.add(
            qb.keyword().onField("personal.locations.postalCode").matching(zipcode).createQuery(),
            BooleanClause.Occur.MUST);
}
if (StringUtils.isEmpty(query) != true && StringUtils.isBlank(query) != true) {
    try {
        luceneQuery = parser.parse(query.toUpperCase());
    } catch (ParseException e) {
        luceneQuery = parser.parse(parser.escape(query.toUpperCase()));
    }
    boolQuery.add(luceneQuery, BooleanClause.Occur.MUST);
}
boolQuery.add(qb.keyword().onField("vStatus").matching(1).createQuery(), BooleanClause.Occur.MUST);
boolQuery.add(qb.keyword().onField("status").matching(1).createQuery(), BooleanClause.Occur.MUST);
boolQuery.add(qb.range().onField("studentProfileSummary.profilePercentage").from(80).to(100).createQuery(),
        BooleanClause.Occur.MUST);
FullTextQuery createFullTextQuery = fullTextSession.createFullTextQuery(boolQuery, Student.class);
createFullTextQuery.setProjection("id", "studentProfileSummary.profileTitle", "firstName", "lastName");
if (isEmptyFilter == false) {
    createFullTextQuery.setFirstResult((int) pageNumber);
    createFullTextQuery.setMaxResults((int) end);
}
return createFullTextQuery.list();
The key to controlling such effects is indeed in the Analyzers you choose to use. As you noticed, the StandardAnalyzer is going to remove/ignore some symbols, as they are commonly not used.
Since the StandardAnalyzer is good for most English natural language, but you also want to handle special symbols, the typical solution is to index the text into multiple fields and assign a different Analyzer to each field. You can then generate queries targeting both fields and combine the scores they obtain. You can even customize the weight each field should have and experiment with different Similarity implementations to obtain various effects; a sketch of the multi-field idea follows.
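A minimal sketch of that approach (field names, analyzer choices and boosts are illustrative, not from the original post), indexing the same text once with StandardAnalyzer and once with WhitespaceAnalyzer and querying both fields:
Analyzer standard = new StandardAnalyzer();
Map<String, Analyzer> perField = new HashMap<>();
perField.put("title_ws", new WhitespaceAnalyzer());
Analyzer analyzer = new PerFieldAnalyzerWrapper(standard, perField); // default analyzer plus a per-field override

// Index time: store the same value in both fields.
Document doc = new Document();
doc.add(new TextField("title_std", "Having 1+ years of experience", Field.Store.YES));
doc.add(new TextField("title_ws", "Having 1+ years of experience", Field.Store.YES));

// Query time: target both fields; the per-field boosts here are arbitrary.
Map<String, Float> boosts = new HashMap<>();
boosts.put("title_std", 1.0f);
boosts.put("title_ws", 2.0f);
MultiFieldQueryParser parser = new MultiFieldQueryParser(
        new String[] { "title_std", "title_ws" }, analyzer, boosts);
Query q = parser.parse(MultiFieldQueryParser.escape("1+ years"));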
But in your specific example of "1+ years" you might want to consider what you expect it to find. Should it match a string like "6 years"?
Then you probably want to implement a custom analyzer which specifically looks for such patterns and generates multiple matching tokens, like the sequence {"1 year", "2 years", "3 years", ...}. That is going to be effective but will only match that specific sequence of terms, so you may want to look for more advanced extensions from the Lucene community, as you can plug many more extensions into it.

Lucene not indexing String field with value "this"

I am adding the document to the Lucene index as follows:
Document doc = new Document();
String stringObj = (String)field.get(obj);
doc.add(new TextField(fieldName, stringObj.toLowerCase(), org.apache.lucene.document.Field.Store.YES));
indexWriter.addDocument(doc);
I am doing a wildcard search as follows:
searchTerm = "*" + searchTerm + "*";
term = new Term(field, searchTerm.toLowerCase());
Query query = new WildcardQuery(term);
TotalHitCountCollector collector = new TotalHitCountCollector();
indexSearcher.search(query, collector);
if (collector.getTotalHits() > 0) {
    TopDocs hits = indexSearcher.search(query, collector.getTotalHits());
}
When I have a string with the value "this", it is not getting added to the index, hence I do not get a result when searching for "this". I am using a StandardAnalyzer.
Common terms of the English language, like prepositions and pronouns, are marked as stop words and omitted before indexing. You can define a custom analyzer, or a custom stop word list for your analyzer. That way you will be able to omit the words you don't want indexed and keep the stop words that you need.
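For example, a minimal sketch (the field name and the directory variable are illustrative) using StandardAnalyzer with an empty stop word set, so tokens like "this" are kept:
Analyzer analyzer = new StandardAnalyzer(CharArraySet.EMPTY_SET); // no stop words removed

IndexWriterConfig config = new IndexWriterConfig(analyzer);
IndexWriter writer = new IndexWriter(directory, config); // directory: your index Directory

Document doc = new Document();
doc.add(new TextField("content", "this", Field.Store.YES));
writer.addDocument(doc);
writer.close();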

How to delete Documents from a Lucene Index using Term or QueryParser

I am trying to delete documents from a Lucene index.
I want to delete only the specified file from the Lucene index.
My program below deletes documents that can be found with a KeywordAnalyzer, but my required filename can only be found using a StandardAnalyzer. Is there any way to set a StandardAnalyzer for my Term, or, instead of a Term, how can I use QueryParser to delete the documents from the Lucene index?
try {
    File INDEX_DIR = new File("D:\\merge lucene\\abc\\");
    Directory directory = FSDirectory.open(INDEX_DIR);
    IndexReader indexReader = IndexReader.open(directory, false);
    Term term = new Term("path", "fileindex23005.htm");
    int l = indexReader.deleteDocuments(term);
    indexReader.close();
    System.out.println("documents deleted");
} catch (Exception x) {
    x.printStackTrace();
}
I assume you are using Lucene 3.6 or before, otherwise IndexReader.deleteDocuments no longer exists. You should, however, be using IndexWriter instead, anyway.
If you can only find the document using query parser, then just run a normal query, then iterate through the documents returned, and delete them by docnum, along the lines of:
Query query = queryParser.parse("My Query!");
ScoreDoc[] docs = searcher.search(query, 100).scoreDocs;
for (ScoreDoc doc : docs) {
    indexReader.deleteDocument(doc.doc);
}
Or better yet (simpler, uses non-defunct, non-deprecated functionality), just use an IndexWriter, and pass it the query directly:
Query query = queryParser.parse("My Query!");
writer.deleteDocuments(query);
Adding for future reference, for someone like me: since deleteDocuments is available on IndexWriter, you may use
indexWriter.deleteDocuments(Term... terms)
instead of the deleteDocuments(query) method, to have less hassle when you only have to match a single field. Be aware that this method treats the terms as an OR condition if multiple terms are passed, so it will match any of them and delete all matching records. The code below matches STATE = Tx in the stored documents and deletes the matching records.
indexWriter.deleteDocuments(
new Term("STATE", "Tx")
);
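For contrast, a short sketch of the OR behaviour when several terms are passed (the CITY value here is only illustrative): this single call deletes every document whose STATE is Tx or whose CITY is Austin.
indexWriter.deleteDocuments(
        new Term("STATE", "Tx"),
        new Term("CITY", "Austin")
);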
For combining different fields with an AND condition, we may use the following code:
BooleanQuery.Builder builder = new BooleanQuery.Builder();
// Note: year is stored as an int, not as a String, when the document is created.
// A Term here (which would need 2016 as a String) would not match documents stored with year as an int.
Query yearQuery = IntPoint.newExactQuery("year", 2016);
Query stateQuery = new TermQuery(new Term("STATE", "TX"));
Query cityQuery = new TermQuery(new Term("CITY", "CITY NAME"));
builder.add(yearQuery, BooleanClause.Occur.MUST);
builder.add(stateQuery, BooleanClause.Occur.MUST);
builder.add(cityQuery, BooleanClause.Occur.MUST);
indexWriter.deleteDocuments(builder.build());
As @dillippattnaik pointed out, multiple terms result in OR. I have updated his code to make it AND using a BooleanQuery:
BooleanQuery query = new BooleanQuery
{
    { new TermQuery( new Term( "year", "2016" ) ), Occur.MUST },
    { new TermQuery( new Term( "STATE", "TX" ) ), Occur.MUST },
    { new TermQuery( new Term( "CITY", "CITY NAME" ) ), Occur.MUST }
};
indexWriter.DeleteDocuments( query );