Using Lucene's API, can I boost terms? - lucene

With Lucene's query parser, it's possible to boost terms (causing them to weight higher in the search results) by appending "^n", e.g. "apple^5 pie" would assign five times more importance to the term "apple". Is it possible to do this when constructing a query using the API? Note that I'm not wanting to boost fields or documents, but individual terms within a field.

You simply need to use the setBoost(float) method of the Query class.
For example
TermQuery tq1 = new TermQuery(new Term("text", "term1"));
tq1.setBoost(5f);
TermQuery tq2 = new TermQuery(new Term("text", "term2"));
tq2.setBoost(0.8f);
BooleanQuery query = new BooleanQuery();
query.add(tq1, Occur.SHOULD);
query.add(tq2, Occur.SHOULD);
This is equivalent to parsing the query text:term1^5 text:term2^0.8.

Related

Lucene substring matching

What is the best way to check if String a is part of String b in Lucene. For example: a = "capital" and b = "Berlin is a capital of Germany". In this case b contains a and fits the requirement.
I think your problem can be treated as some field contains certain term or not.
The basic TermQuery should be enough to solve your problem, in most analyzers, "Berlin is a capital of Germany" will be analyzed as "berlin", "capital" "germany"(if you use the basic stop words)
// code in Scala
new TermQuery(new Term("contents", "capital"))
you can also use PhraseQuery to solve your problem(though, your problem is not the most suitable scenario for PhraseQuery).
val query = new PhraseQuery();
query.add(new Term("contents", "capital"))
Lucene In Action 2nd, 3.4 Lucene’s diverse queries introduces all kinds of Query used in Lucene. I suggest you have a read and that might help.

Lucene synonym expansion,stemming,spell check and more

I am using Lucene to index my database and then perform a phrase search on a specific field(field name: keyword).
I am using following code currently:
String userQuery = request.getParameter("query");
//create standard analyzer object
analyzer = new StandardAnalyzer(Version.LUCENE_30);
Analyzer analyze=AnalyzerUtil.getPorterStemmerAnalyzer(analyzer);
//create File object of our index directory
File file = new File(LUCENE_INDEX_DIRECTORY);
//create index reader object
reader = IndexReader.open(FSDirectory.open(file),true);
//create index searcher object
searcher = new IndexSearcher(reader);
//create topscore document collector
collector = TopScoreDocCollector.create(1000, false);
//create query parser object
parser = new QueryParser(Version.LUCENE_30,"keyword", analyze);
parser.setAllowLeadingWildcard(true);
//parse the query and get reference to Query object
query = parser.parse(userQuery);
//********Line 1***********************
//search the query
searcher.search(query, collector);
hits = collector.topDocs().scoreDocs;
//check whether the search returns any result
if(hits.length>0){//Code to retrieve hits}
This code works fine for stemming, but now I want to also expand my query to do synonym search like if I enter "Man" and my lucene index has a entry "male", it would still be able to give me that as a hit.
I tried to add this at Line 1 in the above code query=SynExpand.expand(userQuery,
searcher, analyze,"keyword",serialVersionUID);
But it doesn't give me any result.
I also want to introduce spell check, where in if I enter "ubelievable" instead of "unbelievable" it would still give me a result.
I have no idea why synonym expansion isn't working for me and how to do spelling check.Please if someone could guide me I will be really grateful.
Thanks!
Fuzzy search may be done by query keyword modifier, namely by adding tilde:
keyword:ubelievable~
See Lucene Parser Syntax for more details and other types of queries that may be interesting to you.
There are 2 ways of dealing with synonyms. Query expansion you are trying to use relies on WordNet. As SynExpand's documentation says, you should first invoke Syns2Index to use expansion. This is easy way, but it works only with English words.
If you need to add support for multiple languages or add your own synonyms, you can use synonym injection during indexing. The idea is to write your own analyzer that will inject synonyms from your own dictionary into indexed documents. This may sound hard to implement, but fortunately there's excellent example in Lucene in Action book (source code is available for free, see lia.analysis.synonym package. Though, I highly recommend to get your copy of this nice book).

Prevent KeywordTokenizer from creating multiple key-value pairs

I use the Lucene java QueryParser with KeywordAnalyzer. A query like topic:(hello world) is broken up in to multiple parts by the KeywordTokenizer so the resulting Query object looks like this topic:(hello) topic:(world) i.e. Instead of one, I now have two key-value pairs. I would like the QueryParser to interpret hello world as one value, without using double quotes. What is the best way to do so?
Parsing topic:("hello world") results in a single key value combination but, using double quotes is not an option.
I am not using the Lucene search engine. I am using Lucene's QueryParser just for parsing the query, not for searching. The text Hello World is entered at runtime, by the user so that can change. I would like KeywordTokenizer to treat Hello World as one Token instead of parsing splitting it in to two Tokens.
You'll need to use a BooleanQuery. Here's a code snippet using the .NET port of Lucene. This should work with both the KeywordAnalyzer and the StandardAnalyzer.
var luceneAnalyzer = new KeywordAnalyzer();
var query1 = new QueryParser("Topic", luceneAnalyzer).Parse("hello");
var query2 = new QueryParser("Topic", luceneAnalyzer).Parse("world");
BooleanQuery filterQuery = new BooleanQuery();
filterQuery.Add(query1, BooleanClause.Occur.MUST);
filterQuery.Add(query1, BooleanClause.Occur.MUST);
TopDocs results = searcher.Search(filterQuery);
You can construct the query directly as follows. This preserves the space.
Query query = new TermQuery(new Term("field", "value has space"));
If you print query as System.out.println(query); you will see the following result.
field:value has space

Parse a search string (into NHibernate Criterias )

I would like to implement an advanced search for my project.
The search right now uses all the strings the user enters and makes one big disjunction with criteria API.
This works fine, but now I would like to implement more features: AND, OR and brackets()
I have got a hard time parsing the string - and building criterias from the string. I have found this Stackoverflow question, but it didn't really help (he didn't make it clear what he wanted).
I found another article, but this supports much more and spits out sql statements.
Another thing I've heard mention a lot is Lucene - but I'm not sure if this really would help me.
I've been searching around a little bit and I've found the Lucene.Net WhitespaceAnalyzer and the QueryParser.
It changes the search A AND B OR C into something like +A +B C, which is a good step in the correct direction (plus it handles brackets).
The next step would be to get the converted string into a set of conjunctions and disjunctions.
The Java example I found was using the query builder which I couldn't find in NHibernate.
Any more ideas ?
Guess you haven't heard about Nhibernate Search till now
Nhibernate Search uses lucene underneath and gives u all the options of using AND, OR, grammar.
All you have to do is attribute your entities for indexing and Nhibernate will index it at a predefined location.
Next time you can search this index with the power that lucene exposes and then get your domain level entity objects in return.
using (IFullTextSession s = Search.CreateFullTextSession(sf.OpenSession(new SearchInterceptor()))) {
QueryParser qp = new QueryParser("id", new StopAnalyzer());
IQuery NHQuery = s.CreateFullTextQuery(qp.Parse("Summary:series"), typeof(Book));
IList result = NHQuery.List();
Powerful, isn’t it?
What I am basically doing right now is parsing the input string with the Lucene.Net parse API.
This gives me a uniform and simplified syntax. (Pseudocode)
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
void Function Search (string queryString)
{
Analyzer analyzer = new WhitespaceAnalyzer();
QueryParser luceneParser = new QueryParser("name", analyzer);
Query luceneQuery = luceneParser.Parse(queryString);
string[] words = luceneQuery.ToString().Split(' ');
foreach (string word in words)
{
//Parsing the lucene.net string
}
}
After that I am parsing this string manually, creating the disjunctions and conjunctions.

Boost factor in MultiFieldQueryParser

Can I boost different fields in MultiFieldQueryParser with different factors?
Also, what is the maximum boost factor value I can assign to a field?
Thanks a ton!
Ed
MultiFieldQueryParser has a [constructor][1] that accepts a map of boosts. You use it with something like this:
String[] fields = new String[] { "title", "keywords", "text" };
HashMap<String,Float> boosts = new HashMap<String,Float>();
boosts.put("title", 10);
boosts.put("keywords", 5);
MultiFieldQueryParser queryParser = new MultiFieldQueryParser(
fields,
new StandardAnalyzer(),
boosts
);
As for the maximum boost, I'm not sure, but you shouldn't think about boost in absolute terms anyway. Just use a ratio of boosts that makes sense. Also see this question.
[1]: https://lucene.apache.org/core/4_4_0/queryparser/org/apache/lucene/queryparser/classic/MultiFieldQueryParser.html#MultiFieldQueryParser(org.apache.lucene.util.Version, java.lang.String[], org.apache.lucene.analysis.Analyzer, java.util.Map)