How to use MultiFieldQueryParser along with boost factor? - lucene

I have indexed documents in Lucene based on three fields: title, address, city. Now I want to build my query say, C A B so that I can retrieve the documents as follows:
C must be present in the title field of the documents and either A or B must be present in either of address and city fields of the matched documents. The documents that have A present in either of those fields should get higher score or higher boost. Here A, B, C may be single terms or phrases.
I am new to Lucene. I do not have any experience of framing such complex queries. In this context I have read the post Boost factor in MultiFieldQueryParser
But this post does not answer my question. So if anyone please help me to solve this I will be really grateful.

title:C AND (address:A^2 OR city:A^2 OR address:B OR city:B)

Don't get caught up on reading about MultiFieldQueryParser, that isn't really what you need for this. Standard QueryParser syntax will serve your purposes you fine.
See the Lucene QueryParser syntax documentation
A query like:
+title:C +((address:A city:A)^2 address:B city:B)
Should do nicely.
To explain a bit:
+title:C - require a match on title:C. No results will be returned that don't match this condition.
+(....) - require a match on the subquery contained inside. As long as a match is found on any one of the optional queries contained within the parentheses is matched, this will be satisfied.
(address:A city:A)^2 - You prefer a match on A, these two queries are boosted more heavily with ^2.

Related

Boosting CloudSearch results which match facet

I'm using AWS CloudSearch for a search index, and the user can currently search over it for records which match the name field and a few others. However, we have users in different languages and I would like to give a boost to results which match their local language. Every record has a locale field, which could be a facet. However, I don't want to simply exclude results which don't match, and nor do I want to simply sort it so that everything in their language always comes first regardless of 'relevance' - I simply want to give a 'boost' to any result where locale=<my locale>.
In other words, I would like highly relevant matches in a different locale to still beat barely relevant matches in my own language, but relevant matches in my own language should definitely rank higher than matches in a different language.
Is there a way to do this when I query CloudSearch or should I just do the reordering client side once I have fetched all of the results?
So this was painfully convoluted but I did manage to get it to work in the end. If you are doing a structured search (q.parser=structured) then you can perform a query that receives a boost if the locale field matches a given value.
Sadly, where it gets a bit cumbersome, is that it doesn't really seem to be designed for boosting things that match something other than your main query, so by default it filters out anything that doesn't match the locale and excludes them entirely from the results. So you have to combine two versions of the query with an (or) - one with the boost and one without.
So my basic query (which in my case was already (or 'un' (prefix 'un'))) now becomes (or (and <OLD_QUERY> (term field=locale boost=2.5 'en')) <OLD_QUERY>)
In other words: EITHER a match for the original query where locale='en', boosted by 2.5, OR a match for that same query in any locale without a boost.
Painful, but it works!

Is it possible to order lucene documents by matching term?

I'm using Lucene 4.10.3 with Java 1.7
I'm wondering whether it's possible to order query results the matching term?
Simply put, if my documents conatin a text field;
The query is
text:a*
I want documents with ab, then ac, then ad etc.
The real case is more complex however, what I'm actually trying to accomplish is to "stuff" a relational DB into my lucene Index (probably not the best idea?).
An appropriate example would be :
I have documents representing books in a library. every book has a title and also a list of people who has borrowed this book and the date of borrowing.
when a user searches for a book with title containing "JAVA", I want to give priority to books that were borrowed by this user. This could be accomplished by adding a TextField "borrowers", adding a SHOULD clause on it and ordering by score)
also, if there are several books with "JAVA" that this user has borrowed before, I want to show the most recent borrowed ones first. so I thought to create a TextField "borrowers" that will look like
borrowers : "user1__20150505 user2__20150506" etc.
I will add a BooleanClause borrowers: user1* and order by matching term.
any other solution ideas will be welcome
I understand your real problem is more complex, but maybe this is helpful anyway.
You could first search for Tokens in the index that match your query, then for each matching token executing a query using this token specifically.
See https://lucene.apache.org/core/6_0_1/core/org/apache/lucene/index/TermsEnum.html for that. Just seek to the prefix and iterate until the prefix stops matching.
In general it is sometimes easy to just issue two queries. For example one within the corpus of books the user as borrowed before and another witin the whole corpus.
These approaches may not work, but in that case you could implement a custom Scorer somehow mapping the ordering to a number.
See http://opensourceconnections.com/blog/2014/03/12/using-customscorequery-for-custom-solrlucene-scoring/

Lucene.net filters: do they restrict the initial search space or just exclude filtered docs from search results?

I am trying to determine whether Lucene.net filters restrict the search space or just exclude documents from search matches based on the documents returned by the filter. That is, if a filter allows for documents A,B, and C but not D and E and a user's query matches B,C, and D, would the filter prevent D from even being considered as a query match, or would the query matches include B,C, and D and then just have D excluded by the filter after the query runs? I'm not finding conclusive information anywhere on this front. The closest I've come is this post from a year and a half ago: http://java.dzone.com/news/fast-lucene-search-filters which suggests that filters are applied after the query matches are returned. Keep in mind that I'm using the current version of Lucene.net, NOT Lucene for Java.
In current Lucene.Net, Filters are applied after query execution. Therefore, documents excluded by the filter would still be scored per the query criterias and then removed from the result set.
That is changing in 4.x, see Lucene-1536 for more informations.
Now you might ask, is it worth is to use filters? The answer is yes, if you cache the Filter for re-use.

All of these words feature

I have a "description" field indexed in Lucene.This field contains a book's description.
How do i achieve "All of these words" functionality on this field using BooleanQuery class?
For example if a user types in "top selling book" then it should return books which have all of these words in its description.
Thanks!
There are two pieces to get this to work:
You need the incoming documents to be analysed properly, so that individual words are tokenised and indexed separately
The user query needs to be tokenised, and the tokens combined with the AND operator.
For #1, there are a number of Analyzers and Tokenizers that come with Lucene - have a look in the org.apache.lucene.analysis package. There are options for many different languages, stemming, stopwords and so on.
For #2, there are again a lot of query parsers that come with Lucene, mainly in the org.apache.lucene.queryParser packagage. MultiFieldQueryParser might be good for you: to require every term to be present, just call
QueryParser.setDefaultOperator(QueryParser.AND_OPERATOR)
Lucene in Action, although a few versions old, is still accurate and extremely useful for more information on analysis and query parsing.
I believe if you add all query parts (one per term) via
BooleanQuery.add(Query, BooleanClause.Occur)
and set that second parameter to the constant BooleanClause.Occur.MUST, then you should get what you want. The equivalent query syntax would be "+term1+term2 +term3 ...".

Prevent "Too Many Clauses" on lucene query

In my tests I suddenly bumped into a Too Many Clauses exception when trying to get the hits from a boolean query that consisted of a termquery and a wildcard query.
I searched around the net and on the found resources they suggest to increase the BooleanQuery.SetMaxClauseCount().
This sounds fishy to me.. To what should I up it? How can I rely that this new magic number will be sufficient for my query? How far can I increment this number before all hell breaks loose?
In general I feel this is not a solution. There must be a deeper problem..
The query was +{+companyName:mercedes +paintCode:a*} and the index has ~2.5M documents.
the paintCode:a* part of the query is a prefix query for any paintCode beginning with an "a". Is that what you're aiming for?
Lucene expands prefix queries into a boolean query containing all the possible terms that match the prefix. In your case, apparently there are more than 1024 possible paintCodes that begin with an "a".
If it sounds to you like prefix queries are useless, you're not far from the truth.
I would suggest you change your indexing scheme to avoid using a Prefix Query. I'm not sure what you're trying to accomplish with your example, but if you want to search for paint codes by first letter, make a paintCodeFirstLetter field and search by that field.
ADDED
If you're desperate, and are willing to accept partial results, you can build your own Lucene version from source. You need to make changes to the files PrefixQuery.java and MultiTermQuery.java, both under org/apache/lucene/search. In the rewrite method of both classes, change the line
query.add(tq, BooleanClause.Occur.SHOULD); // add to query
to
try {
query.add(tq, BooleanClause.Occur.SHOULD); // add to query
} catch (TooManyClauses e) {
break;
}
I did this for my own project and it works.
If you really don't like the idea of changing Lucene, you could write your own PrefixQuery variant and your own QueryParser, but I don't think it's much better.
It seems like you are using this on a field that is sort of a Keyword type (meaning there will not be multiple tokens in your data source field).
There is a suggestion here that seems pretty elegant to me: http://grokbase.com/t/lucene.apache.org/java-user/2007/11/substring-indexing-to-avoid-toomanyclauses-exception/12f7s7kzp2emktbn66tdmfpcxfya
The basic idea is to break down your term into multiple fields with increasing length until you are pretty sure you will not hit the clause limit.
Example:
Imagine a paintCode like this:
"a4c2d3"
When indexing this value, you create the following field values in your document:
[paintCode]: "a4c2d3"
[paintCode1n]: "a"
[paintCode2n]: "a4"
[paintCode3n]: "a4c"
By the time you query, the number of characters in your term decide which field to search on. This means that you will perform a prefix query only for terms with more of 3 characters, which greatly decreases the internal result count, preventing the infamous TooManyBooleanClausesException. Apparently this also speeds up the searching process.
You can easily automate a process that breaks down the terms automatically and fills the documents with values according to a name scheme during indexing.
Some issues may arise if you have multiple tokens for each field. You can find more details in the article