What is an alternative for Lucene Query's extractTerms?

In Lucene 4.6.0 there was a method extractTerms that provided the extraction of terms from a query (Query 4.6.0). However, as of Lucene 6.2.1, it no longer exists (Query Lucene 6.2.1). Is there a valid alternative for it?
What I need is to extract the terms (and corresponding fields) of a Query built by QueryParser.

Maybe not the best answer, but one way is to use the same analyzer and tokenize the query string:
Analyzer analyzer = new StandardAnalyzer();
TokenStream ts = analyzer.tokenStream("title", query); // query is the raw query string
CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
ts.reset();
while (ts.incrementToken()) {
    System.out.println(termAtt.toString());
}
ts.end();   // finalize stream state
ts.close(); // release resources
analyzer.close();

I have temporarily solved my problem with the following code. Smarter alternatives will be well accepted:
QueryParser qp = new QueryParser("title", analyzer); // same analyzer used at index time
Query q = qp.parse(query);
Set<Term> termQuerySet = new HashSet<>();
Weight w = searcher.createWeight(q, true, 3.4f); // boost value is arbitrary here
w.extractTerms(termQuerySet); // fills the set with Term objects (field + text)
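For reference: in more recent Lucene versions (8.x and later, so beyond the 6.2.1 in the question), Query.visit with a QueryVisitor does the same job without building a Weight. A minimal sketch:
Set<Term> terms = new HashSet<>();
// QueryVisitor.termCollector gathers the terms of every leaf query, field included
q.visit(QueryVisitor.termCollector(terms));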

Related

Lucene Porter Stemmer - get original unstemmed word

I have worked out how to use Lucene's Porter Stemmer but would like to also retrieve the original, un-stemmed word. So, to this end, I added a CharTermAttribute to the TokenStream before creating the PorterStemFilter, as follows:
Analyzer analyzer = new StandardAnalyzer();
TokenStream original = analyzer.tokenStream("StandardTokenStream", new StringReader(inputText));
TokenStream stemmed = new PorterStemFilter(original);
CharTermAttribute originalWordAttribute = original.addAttribute(CharTermAttribute.class);
CharTermAttribute stemmedWordAttribute = stemmed.addAttribute(CharTermAttribute.class);
stemmed.reset();
while (stemmed.incrementToken()) {
    System.out.println(stemmedWordAttribute + " " + originalWordAttribute);
}
Unfortunately, both attributes return the stemmed word.
Is there a way to get the original word as well?
Lucene's PorterStemFilter can be combined with Lucene's KeywordRepeatFilter. KeywordRepeatFilter emits each token twice, marking the first copy as a keyword; PorterStemFilter leaves keyword tokens untouched, so you end up with both the unstemmed and the stemmed form.
Modifying your approach:
Analyzer analyzer = new StandardAnalyzer();
TokenStream original = analyzer.tokenStream("StandardTokenStream", new StringReader(inputText));
TokenStream repeated = new KeywordRepeatFilter(original);
TokenStream stemmed = new PorterStemFilter(repeated);
CharTermAttribute stemmedWordAttribute = stemmed.addAttribute(CharTermAttribute.class);
stemmed.reset();
while (stemmed.incrementToken()) {
    // first token of each pair is the unstemmed (keyword-marked) copy
    String originalWord = stemmedWordAttribute.toString();
    if (!stemmed.incrementToken()) {
        break; // guard against an unpaired trailing token
    }
    String stemmedWord = stemmedWordAttribute.toString();
    System.out.println(originalWord + " " + stemmedWord);
}
This is fairly crude, but shows the approach.
Example input:
testing giraffe book passing
Resulting output:
testing test
giraffe giraff
book book
passing pass
For each pair of tokens, if the second matches the first (book book), then there was no stemming.
Normally, you would use this with RemoveDuplicatesTokenFilter to remove the duplicate book term - but if you do that I think it becomes much harder to track the stemmed/unstemmed pairs - so for your specific scenario, I did not use that de-duplication filter.
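If you want the pairing logic to be less fragile, one variation (a sketch of my own, not part of the original answer) is to read the KeywordAttribute that KeywordRepeatFilter sets on the unstemmed copy, instead of assuming tokens always arrive in strict pairs:
KeywordAttribute kwAtt = stemmed.addAttribute(KeywordAttribute.class);
stemmed.reset();
String pendingOriginal = null;
while (stemmed.incrementToken()) {
    if (kwAtt.isKeyword()) {
        pendingOriginal = stemmedWordAttribute.toString(); // the unstemmed copy
    } else {
        System.out.println(pendingOriginal + " " + stemmedWordAttribute);
    }
}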

How to search special characters in Hibernate Search?

I'm new to Hibernate Search. For the past few days I have been working on searching keywords that contain special characters. I am using MultiFieldQueryParser for exact phrase matching as well as Boolean search, but I am unable to get results for search keywords like "Having 1+ years of experience"; if I do not put quotes around the search keyword, then I do get results. What I observed from the execution of the Lucene query is that it escapes the special symbols (+). I am using StandardAnalyzer. I think that if I used WhitespaceAnalyzer it would not escape the special characters, but it might affect Boolean searching like +java +php (i.e. java AND php), because those would be treated as normal text. Please suggest something.
The following is my snippet:
Session session = getSession();
FullTextSession fullTextSession = Search.getFullTextSession(session);
MultiFieldQueryParser parser = new MultiFieldQueryParser(new String[] { "student.skills.skill",
        "studentProfileSummary.profileTitle", "studentProfileSummary.currentDesignation" },
        new StandardAnalyzer());
parser.setDefaultOperator(Operator.OR);
org.apache.lucene.search.Query luceneQuery = null;
QueryBuilder qb = fullTextSession.getSearchFactory().buildQueryBuilder().forEntity(Student.class).get();
BooleanQuery boolQuery = new BooleanQuery();
if (!StringUtils.isBlank(zipcode)) {
    boolQuery.add(
            qb.keyword().onField("personal.locations.postalCode").matching(zipcode).createQuery(),
            BooleanClause.Occur.MUST);
}
if (!StringUtils.isBlank(query)) {
    try {
        luceneQuery = parser.parse(query.toUpperCase());
    } catch (ParseException e) {
        luceneQuery = parser.parse(parser.escape(query.toUpperCase()));
    }
    boolQuery.add(luceneQuery, BooleanClause.Occur.MUST);
}
boolQuery.add(qb.keyword().onField("vStatus").matching(1).createQuery(), BooleanClause.Occur.MUST);
boolQuery.add(qb.keyword().onField("status").matching(1).createQuery(), BooleanClause.Occur.MUST);
boolQuery.add(qb.range().onField("studentProfileSummary.profilePercentage").from(80).to(100).createQuery(),
        BooleanClause.Occur.MUST);
FullTextQuery createFullTextQuery = fullTextSession.createFullTextQuery(boolQuery, Student.class);
createFullTextQuery.setProjection("id", "studentProfileSummary.profileTitle", "firstName", "lastName");
if (!isEmptyFilter) {
    createFullTextQuery.setFirstResult((int) pageNumber);
    createFullTextQuery.setMaxResults((int) end);
}
return createFullTextQuery.list();
The key to controlling such effects is indeed in the Analyzers you choose to use. As you noticed, the StandardAnalyzer removes/ignores some symbols, as they are commonly not used.
Since the StandardAnalyzer is good with most English natural language, but you also want to handle special symbols, the typical solution is to index the text into multiple fields, assigning a different Analyzer to each field. You can then generate queries targeting both fields and combine the scores they obtain. You can even customize the weight that each field should have and experiment with different Similarity implementations to obtain various effects.
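To illustrate the two-field idea, here is a minimal sketch; the field names and the analyzer pairing are assumptions for illustration, not from the original post (PerFieldAnalyzerWrapper lives in org.apache.lucene.analysis.miscellaneous):
Map<String, Analyzer> perField = new HashMap<>();
perField.put("skillSymbols", new WhitespaceAnalyzer()); // keeps tokens like "1+" intact
Analyzer analyzer = new PerFieldAnalyzerWrapper(new StandardAnalyzer(), perField);

// Index the same text into both fields:
Document doc = new Document();
String text = "Having 1+ years of experience";
doc.add(new TextField("skillText", text, Field.Store.NO));    // analyzed by StandardAnalyzer
doc.add(new TextField("skillSymbols", text, Field.Store.NO)); // analyzed by WhitespaceAnalyzer

// At query time, target both fields with the same wrapper so each field
// is analyzed consistently with how it was indexed:
MultiFieldQueryParser mfp = new MultiFieldQueryParser(
        new String[] { "skillText", "skillSymbols" }, analyzer);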
But in your specific example of "1+ years", you might want to consider what you expect it to find. Should it match a string like "6 years"?
Then you probably want to implement a custom analyzer which specifically looks for such patterns and generates multiple matching tokens, like the sequence {"1 year", "2 years", "3 years", ...}. That is going to be effective, but it will only match that specific sequence of terms, so maybe you want to look for more advanced extensions from the Lucene community, as you can plug many more extensions into it.

Can we use SpanNearQuery in phonetic index?

I've implemented Lucene-based software to index more than 10 million person names, and these names can be written in different ways, like "Luíz" and "Luis". The index was created using the phonetic values of the respective tokens (a custom analyzer was created).
Currently, I'm using QueryParser to query for a given name, with good results. But the book "Lucene in Action" mentions that SpanNearQuery can improve my queries using the proximity of tokens. I've played with SpanNearQuery against a non-phonetic index of names, and the results were superior compared to QueryParser.
Since we should query using the same analyzer used for indexing, I couldn't find how I can use my custom phonetic analyzer and SpanNearQuery at the same time. Rephrasing:
how can I use SpanNearQuery on the phonetic index?
Thanks in advance.
My first thought is: Wouldn't a phrase query with slop do the job? That would certainly be the easiest way:
"term1 term2"~5
This will use your phonetic analyzer, and produce a proximity query with the resulting tokens.
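For instance, a sketch (assuming the field is called "text" and phoneticAnalyzer is your custom analyzer):
QueryParser parser = new QueryParser("text", phoneticAnalyzer);
Query q = parser.parse("\"luiz silva\"~5"); // terms get phonetically encoded by the analyzer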
So, if you really do need to use SpanQueries here (perhaps you are using fuzzy queries or wildcards or some such, or PhraseQuery has been leering menacingly at you and you want nothing more to do with it), you'll need to do the analysis yourself. You can do this by getting a TokenStream from Analyzer.tokenStream, and iterating through the analyzed tokens.
If you are using a phonetic algorithm that produces a single code per term (soundex, for example):
SpanNearQuery.Builder nearBuilder = new SpanNearQuery.Builder("text", true);
nearBuilder.setSlop(4);
TokenStream stream = analyzer.tokenStream("text", queryStringToParse);
CharTermAttribute token = stream.addAttribute(CharTermAttribute.class);
stream.reset();
while (stream.incrementToken()) {
    nearBuilder.addClause(new SpanTermQuery(new Term("text", token.toString())));
}
Query finalQuery = nearBuilder.build();
stream.end();
stream.close();
If you are using a double metaphone, where you can have 1-2 terms at the same position, it's a bit more complex, as you'll need to consider those position increments:
SpanNearQuery.Builder nearBuilder = new SpanNearQuery.Builder("text", true);
nearBuilder.setSlop(4);
TokenStream stream = analyzer.tokenStream("text", "through and through");
CharTermAttribute token = stream.addAttribute(CharTermAttribute.class);
PositionIncrementAttribute increment = stream.addAttribute(PositionIncrementAttribute.class);
stream.reset();
String queuedToken = null;
while (stream.incrementToken()) {
    if (increment.getPositionIncrement() == 0) {
        // same position as the previous token: OR the two phonetic forms together
        nearBuilder.addClause(new SpanOrQuery(
                new SpanTermQuery(new Term("text", queuedToken)),
                new SpanTermQuery(new Term("text", token.toString()))
        ));
        queuedToken = null;
    } else if (increment.getPositionIncrement() >= 1 && queuedToken != null) {
        // previous token stood alone; flush it and queue the current one
        nearBuilder.addClause(new SpanTermQuery(new Term("text", queuedToken)));
        queuedToken = token.toString();
    } else {
        queuedToken = token.toString();
    }
}
if (queuedToken != null) {
    nearBuilder.addClause(new SpanTermQuery(new Term("text", queuedToken)));
}
Query finalQuery = nearBuilder.build();
stream.end();
stream.close();

Lucene MultiFieldQuery with WildcardQuery

Currently I have an issue with Lucene search (version 2.9).
I have a search term, and I need to use it on several fields; therefore, I have to use MultiFieldQueryParser. On the other hand, I have to use WildcardQuery, because our customer wants to search for a term within a phrase (e.g. "CMH" should match "KRC250/CMH/830/T/H").
I have tried to replace the slashes ('/') with stars ('*') and use a BooleanQuery with enclosing stars for the term.
Unfortunately, without any success.
Does anyone have any idea?
Yes, if the field shown is a single token, setting setAllowLeadingWildcard to true would be necessary, like:
parser.setAllowLeadingWildcard(true);
Query query = parser.parse("*CMH*");
However:
You don't mention how the field is analyzed. By default, the StandardAnalyzer is used, which will split it into tokens at slashes (or asterisks, when indexing data). If you are using this sort of analysis, you could simply create a TermQuery searching for "cmh" (StandardAnalyzer includes a LowerCaseFilter), or simply:
String[] fields = {"this", "that", "another"};
QueryParser parser = new MultiFieldQueryParser(Version.LUCENE_29, fields, analyzer); // assuming StandardAnalyzer
Query simpleQuery = parser.parse("CMH");
// Or even...
Query slightlyMoreComplexQuery = parser.parse("\"CMH/830/T\"");
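The TermQuery variant would look like this (a sketch; the field name is just one of the examples above):
Query termQuery = new TermQuery(new Term("that", "cmh")); // lowercase, matching the analyzed index term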
I don't understand what you mean by a BooleanQuery with enclosed stars, if you can include code to elucidate that, it might help.
Sorry, maybe I have described it a little bit wrong.
I took something like this:
BooleanQuery bq = new BooleanQuery();
foreach (string field in fields)
{
    foreach (string tok in tokArr)
    {
        bq.Add(new WildcardQuery(new Term(field, " *" + tok + "* ")), BooleanClause.Occur.SHOULD);
    }
}
return bq;
but unfortunately it did not work.
I have modified it like this
string newterm = string.Empty;
string[] tok = term.Split(new[] { ' ', '/' }, StringSplitOptions.RemoveEmptyEntries);
tok.ForEach(x => newterm += x.EnsureStartsWith(" *").EnsureEndsWith("* "));
var version = Lucene.Net.Util.Version.LUCENE_29;
var analyzer = new StandardAnalyzer(version);
var parser = new MultiFieldQueryParser(version, fields, analyzer);
parser.SetDefaultOperator(QueryParser.Operator.AND);
parser.SetAllowLeadingWildcard(true);
return parser.Parse(newterm);
and my customer loves it :-)

How to delete Documents from a Lucene Index using Term or QueryParser

I am trying to delete documents from a Lucene index.
I want to delete only the specified file from the Lucene index.
The following program deletes documents that can be matched using a KeywordAnalyzer, but my required filename can only be matched using a StandardAnalyzer. So is there any way to set a StandardAnalyzer on my Term, or, instead of a Term, how can I use a QueryParser to delete the documents from the Lucene index?
try {
    File INDEX_DIR = new File("D:\\merge lucene\\abc\\");
    Directory directory = FSDirectory.open(INDEX_DIR);
    IndexReader indexReader = IndexReader.open(directory, false);
    Term term = new Term("path", "fileindex23005.htm");
    int l = indexReader.deleteDocuments(term);
    indexReader.close();
    System.out.println("documents deleted");
} catch (Exception x) {
    x.printStackTrace();
}
I assume you are using Lucene 3.6 or earlier; otherwise IndexReader.deleteDocuments no longer exists. You should, however, be using IndexWriter instead anyway.
If you can only find the document using query parser, then just run a normal query, then iterate through the documents returned, and delete them by docnum, along the lines of:
Query query = queryParser.parse("My Query!");
ScoreDoc[] docs = searcher.search(query, 100).scoreDocs;
for (ScoreDoc doc : docs) {
    indexReader.deleteDocument(doc.doc);
}
Or better yet (simpler, uses non-defunct, non-deprecated functionality), just use an IndexWriter, and pass it the query directly:
Query query = queryParser.parse("My Query!");
writer.deleteDocuments(query);
Adding for future reference, for someone like me: where deleteDocuments is on IndexWriter, you may use
indexWriter.deleteDocuments(Term... terms)
instead of the deleteDocuments(query) method, to have less hassle if you only have to match one field. Be aware that this method treats the terms as an OR condition if multiple terms are passed, so it will match any of them and delete all matching records. The code below will match state=Tx in the stored documents and delete the matching records.
indexWriter.deleteDocuments(
new Term("STATE", "Tx")
);
For combining different fields with an AND condition, we may use the following code:
BooleanQuery.Builder builder = new BooleanQuery.Builder();
// Note: year is stored as an int, not a String, when the document is created.
// A Term here (which would need 2016 as a String) would not match documents
// stored with year as an int.
Query yearQuery = IntPoint.newExactQuery("year", 2016);
Query stateQuery = new TermQuery(new Term("STATE", "TX"));
Query cityQuery = new TermQuery(new Term("CITY", "CITY NAME"));
builder.add(yearQuery, BooleanClause.Occur.MUST);
builder.add(stateQuery, BooleanClause.Occur.MUST);
builder.add(cityQuery, BooleanClause.Occur.MUST);
indexWriter.deleteDocuments(builder.build());
As @dillippattnaik pointed out, multiple terms result in OR. I have updated his code to make it AND using a BooleanQuery:
BooleanQuery query = new BooleanQuery
{
{ new TermQuery( new Term( "year", "2016" ) ), Occur.MUST },
{ new TermQuery( new Term( "STATE", "TX" ) ), Occur.MUST },
{ new TermQuery( new Term( "CITY", "CITY NAME" ) ), Occur.MUST }
};
indexWriter.DeleteDocuments( query );