Lucene exact ordering - lucene

I've had a long-standing problem understanding how to implement a decent Lucene sort or ranking. Say I have a list of cities and their populations. If someone searches for "new" or "london", I want the list of prefix matches ordered by population. I have that working with a prefix search and a reversed sort on a population field, e.g. New Mexico, New York; or London, Londonderry.
However I also always want the exact matching name to be at the top. So in the case of "London" the list should show "London, London, Londonderry" where the first London is in the UK and the second London is in Connecticut, even if Londonderry has a higher population than London CT.
Does anyone have a single query solution?

dlamblin, let me see if I get this correctly: you want to make a prefix-based query, then sort the results by population, and maybe combine the sort order with a preference for exact matches.
I suggest you separate the search from the sort and use a CustomSorter for the sorting:
Here's a blog entry describing a custom sorter.
The classic Lucene book describes this well.

The API documentation for SortComparator says:
"There is a distinct Comparable for each unique term in the field - if some documents have the same term in the field, the cache array will have entries which reference the same Comparable."
You can apply a FieldSortedHitQueue to the SortComparator, which has a Comparator field for which the API says:
"Stores a comparator corresponding to each field being sorted by."
Thus the term can be sorted accordingly.

My current solution is to create an exact searcher and a prefix searcher, both sorted by reverse population, and then copy out all my hits starting from the exact hits, moving to the prefix hits. It makes paging my results slightly more annoying than I think it should be.
Also I used a hash to eliminate duplicates but later changed the prefix searcher into a boolean query of a prefix search (MUST) with an exact search (MUST NOT), to have Lucene remove the duplicates. Though this seemed even more wasteful.
Edit: Moved to a comment (since the feature now exists): Yuval F Thank you for your blog post ... How would the sort comparator know that the name field "london" exactly matches the search term "london" if it cannot access the search term?
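The two-pass merge described in this answer can be sketched outside Lucene as follows. This is plain Python, not Lucene code: `City`, `CITIES`, `search`, and the population figures are hypothetical stand-ins for the index and the two sorted searches, used only to show the ordering and the paging.

```python
from typing import List, NamedTuple

class City(NamedTuple):
    name: str
    region: str
    population: int

# Hypothetical sample data; the populations are illustrative, not real figures.
CITIES = [
    City("Londonderry", "UK", 90_000),
    City("London", "CT", 30_000),
    City("London", "UK", 8_000_000),
    City("New York", "NY", 8_400_000),
]

def search(term: str, offset: int = 0, limit: int = 10) -> List[City]:
    """Exact matches first, then the remaining prefix matches,
    each group ordered by population, largest first."""
    t = term.lower()
    exact = [c for c in CITIES if c.name.lower() == t]
    # Mirrors "prefix (MUST) with exact (MUST NOT)": no duplicates to remove.
    prefix_only = [c for c in CITIES
                   if c.name.lower().startswith(t) and c.name.lower() != t]
    by_population = lambda c: -c.population
    merged = sorted(exact, key=by_population) + sorted(prefix_only, key=by_population)
    # Paging is a slice over the merged list.
    return merged[offset:offset + limit]
```

The same order falls out of a single sort with the key `(c.name.lower() != t, -c.population)`, which also shows why a comparator needs access to the search term to do this in one pass.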

Related

Is it possible to order lucene documents by matching term?

I'm using Lucene 4.10.3 with Java 1.7
I'm wondering whether it's possible to order query results by the matching term?
Simply put, suppose my documents contain a text field.
The query is
text:a*
I want documents with ab, then ac, then ad etc.
The real case is more complex, however. What I'm actually trying to accomplish is to "stuff" a relational DB into my Lucene index (probably not the best idea?).
An appropriate example would be:
I have documents representing books in a library. Every book has a title and also a list of people who have borrowed the book, with the date of borrowing.
When a user searches for a book with a title containing "JAVA", I want to give priority to books that were borrowed by this user. (This could be accomplished by adding a TextField "borrowers", adding a SHOULD clause on it and ordering by score.)
Also, if there are several books with "JAVA" that this user has borrowed before, I want to show the most recently borrowed ones first. So I thought of creating a TextField "borrowers" that will look like
borrowers : "user1__20150505 user2__20150506" etc.
I will add a BooleanClause borrowers:user1* and order by matching term.
Any other solution ideas are welcome.
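The intended ordering over the "borrowers" field can be sketched like this. It is plain Python working on the `user__yyyymmdd` token layout from the question, not Lucene code; the function names and the dictionary shape of a book are hypothetical.

```python
from typing import Dict, List, Optional

def latest_borrow(borrowers: str, user: str) -> Optional[str]:
    """Most recent yyyymmdd date for `user` in a borrowers field, or None."""
    prefix = user + "__"
    dates = [tok[len(prefix):] for tok in borrowers.split() if tok.startswith(prefix)]
    # yyyymmdd strings sort chronologically, so max() is the latest date.
    return max(dates) if dates else None

def rank_books(books: List[Dict[str, str]], user: str) -> List[str]:
    """Titles borrowed by `user` first (most recent first), then the rest in input order."""
    def key(book: Dict[str, str]):
        date = latest_borrow(book["borrowers"], user)
        return (date is None, -int(date) if date else 0)
    # sorted() is stable, so books the user never borrowed keep their order.
    return [b["title"] for b in sorted(books, key=key)]
```

Including the `__` separator in the prefix keeps `user1` from matching `user10`, which a bare `user1*` prefix query would not do.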
I understand your real problem is more complex, but maybe this is helpful anyway.
You could first search for tokens in the index that match your query, then for each matching token execute a query using that token specifically.
See https://lucene.apache.org/core/6_0_1/core/org/apache/lucene/index/TermsEnum.html for that. Just seek to the prefix and iterate until the prefix stops matching.
In general it is sometimes easier to just issue two queries. For example, one within the corpus of books the user has borrowed before and another within the whole corpus.
These approaches may not work, but in that case you could implement a custom Scorer somehow mapping the ordering to a number.
See http://opensourceconnections.com/blog/2014/03/12/using-customscorequery-for-custom-solrlucene-scoring/
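The "seek to the prefix and iterate until the prefix stops matching" idea from this answer can be illustrated on a sorted list of terms. This is only a model of the technique in plain Python; `terms_with_prefix` is a hypothetical helper and does not use the TermsEnum API.

```python
import bisect
from typing import Iterator, List

def terms_with_prefix(sorted_terms: List[str], prefix: str) -> Iterator[str]:
    """Seek to the first term >= prefix, then yield terms while they still match."""
    i = bisect.bisect_left(sorted_terms, prefix)
    while i < len(sorted_terms) and sorted_terms[i].startswith(prefix):
        yield sorted_terms[i]
        i += 1
```

Because the terms are visited in sorted order, running one query per yielded term and concatenating the results gives documents for "ab", then "ac", then "ad", and so on.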

Lucene.NET - do an AND search multiple words on multiple fields

I define a Document object for my product entity which has several fields: Title, Brand, Category, Size, Color, Material.
Now I want to let users do an AND search across multiple fields. Any document in which one, two or more fields together contain all the search words should be returned.
For example, when a user enters "gucci shirt red" I want to return all documents whose fields match all 3 tokens "gucci", "shirt" AND "red". So all of the documents below should be returned:
1. Documents whose Title contains all 3 words, for example Title = "Gucci Modern Shirt Red" or "Gucci blue shirt"...
2. Documents with Title = "Gucci classical shirt" AND Color = "red"
3. Documents with Category = "mens shirt" AND Brand = "gucci" AND Color = "red"
4. etc.
I know that Lucene supports the + operator, which makes a term a MUST in the search query. For example, I can translate the above keywords to the query "+gucci +shirt +red", and then I'm sure documents like example (1) above will be returned. But does it work for cases (2) and (3) above?
When doing these types of queries I like to create a master BooleanQuery and add several sub-queries that work together to give the best result:
TermQuery: (exact match), someone types in the exact match of the title
PhraseQuery: (use slop), so if you have "Gucci Modern Shirt Red" and someone types in "Gucci Shirt" (notice the one-word gap) it would match
FuzzyQuery: (slow on large (> 50 million records) or non-memory indexes) to account for potential misspellings
Boolean SubQuery: with all of the terms separated and OR'ed. Queries matching 1 out of 4 words will have a low score, whereas 3 out of 4 words will have a higher score.
Query parser (as mentioned above, with potential field boosts)
Other: e.g. synonym search on phrases etc.
I would OR all of these types and then filter the results using a Collector with a minimum score.
The reason I like the master BooleanQuery approach is that you can have a setting where a user chooses "the type" of query, maybe from simple to advanced. It is easy to add or remove query types on the fly, and the query can be built easily, giving predictable results. With boosting of records or similarity you are working within the internal Lucene algorithm, and the results are sometimes not clear.
Performance: I have done queries like this using Lucene 3.0.x on indexes with > 100M records NOT IN MEMORY, and it works quickly, giving sub-second responses. FuzzyQuery does slow things down, but as stated before, that can be made into an advanced search option (or "Search again with...").
No. When a term is not given a field explicitly in the query, it is searched in the default field, which appears to be "title" in your case. You would need a query more like:
+shirt +color:red +brand:gucci
for instance.
Or, one common usage is to set up a catch all field, in which all (or a large subset) of searchable data is mashed together, allowing you to search everything in a very loose fashion, on that field, in which case you would just use something like:
all:(+shirt +gucci +red)
Or, if you made that field your default field instead:
+shirt +gucci +red
As you indicated.
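As a rough illustration of the catch-all field, the sketch below builds an extra "all" field at indexing time and checks required terms against it. It is plain Python on dictionaries, not Lucene or Lucene.NET code; `with_catch_all` and `matches_all_terms` are hypothetical names, and the whitespace tokenization is an assumption standing in for a real analyzer.

```python
from typing import Dict, List

def with_catch_all(doc: Dict[str, str], fields: List[str], name: str = "all") -> Dict[str, str]:
    """Return a copy of `doc` with one extra field that concatenates `fields`."""
    out = dict(doc)
    out[name] = " ".join(doc[f] for f in fields if doc.get(f))
    return out

def matches_all_terms(doc: Dict[str, str], text: str, field: str = "all") -> bool:
    """True if every whitespace-separated term of `text` is a word in `field`."""
    words = set(doc.get(field, "").lower().split())
    return all(term in words for term in text.lower().split())
```

The trade-off is that the catch-all field no longer knows which original field a word came from, so per-field boosts are not possible on it.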
You could use MultiFieldQueryParser. Add Title, Color, Brand etc. to it.
If you search for "gucci shirt red", the parser (with its default operator set to AND) would produce a query like
+(Title:gucci Color:gucci Brand:gucci) +(Title:shirt Color:shirt Brand:shirt) +(Title:red Color:red Brand:red)
This should solve the problem.
Also, if for the above query you want products whose brand is gucci to be shown first, you could apply a boost to that field.
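To make the semantics of that expanded query concrete, here is a small sketch that builds the same kind of query string and evaluates the equivalent condition in memory. It is plain Python rather than Lucene.NET; `expand_query` and `matches` are hypothetical helpers, and whitespace splitting stands in for the analyzer.

```python
from typing import Dict, List

def expand_query(text: str, fields: List[str]) -> str:
    """Build '+(f1:t f2:t) +(f1:u f2:u)': every term required, in any field."""
    groups = []
    for term in text.lower().split():
        groups.append("+(" + " ".join(f"{f}:{term}" for f in fields) + ")")
    return " ".join(groups)

def matches(doc: Dict[str, str], text: str, fields: List[str]) -> bool:
    """True if every term occurs as a whole word in at least one of `fields`."""
    words = {f: set(doc.get(f, "").lower().split()) for f in fields}
    return all(any(term in words[f] for f in fields) for term in text.lower().split())
```

With this condition, a document with Category = "mens shirt", Brand = "gucci" and Color = "red" matches "gucci shirt red", while a document whose only text is Title = "Gucci blue shirt" does not, because "red" appears in none of its fields.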

Solr: How can I get all documents ordered by score with a list of keywords?

I have a Solr 3.1 database containing Emails with two fields:
datetime
text
For the query I have two parameters:
date of today
keyword array("important thing", "important too", "not so important, but more than average")
Is it possible to create a query to
get ALL documents of this day AND
sort them by relevancy, so that the email that contains the most of my keywords (important things) scores best?
The part with the date is not very complicated:
fq=datetime:[YY-MM-DDT00:00:00.000Z TO YY-MM-DDT23:59:59.999Z]
I know that you can boost the keywords this way:
q=text:"first keyword"^5 OR text:"second one"^2 OR text:"minus scoring"^0.5 OR text:"*"
But how do I use the keywords only to sort this list and get ALL entries, instead of doing a real query and getting only a few entries back?
Thanks for help!
You need to specify your terms in the main query and then change your date query to be a filter query on these results by adding the following.
fq=datetime:[YY-MM-DDT00:00:00.000Z TO YY-MM-DDT23:59:59.999Z]
So you should have something like this:
q=<terms go here>&fq=datetime:[YY-MM-DDT00:00:00.000Z TO YY-MM-DDT23:59:59.999Z]
Edit: A little more about filter queries (as suggested by rfreak).
From Solr Wiki - FilterQuery Guidance - "Now, what is a filter query? It is simply a part of a query that is factored out for special treatment. This is achieved in Solr by specifying it using the fq (filter query) parameter instead of the q (main query) parameter. The same result could be achieved leaving that query part in the main query. The difference will be in query efficiency. That's because the result of a filter query is cached and then used to filter a primary query result using set intersection."
These should be sorted by relevancy score already, that is just the default behavior of Solr. You can see the score by adding that field.
fl=*,score
If you use the Full Interface for Make A Query on the Admin Interface of your Solr installation at http://<yourserver:port#>/<instancename>/admin/form.jsp you will see where you can specify the filter query, fields, and other options. You can check out the Solr Wiki for more details on the options and how they are used.
I hope that this helps you.
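Putting the pieces of this answer together, the request parameters could be assembled roughly as below. This is a sketch in Python using only the standard library. `build_params` and `build_query_string` are hypothetical helpers, the `*:*` match-all clause is an assumption added so that emails containing none of the keywords are still returned, and `rows` is set explicitly because Solr otherwise returns only the first 10 results.

```python
from typing import Dict, List, Tuple
from urllib.parse import urlencode

def build_params(keywords: List[Tuple[str, float]], day: str, rows: int = 1000) -> Dict[str, str]:
    """keywords: (phrase, boost) pairs; day: a 'YYYY-MM-DD' string."""
    clauses = [f'text:"{phrase}"^{boost}' for phrase, boost in keywords]
    clauses.append("*:*")  # assumption: match-all clause so every email of the day is returned
    return {
        "q": " OR ".join(clauses),
        # The date restriction is a filter query, so it does not affect the score.
        "fq": f"datetime:[{day}T00:00:00.000Z TO {day}T23:59:59.999Z]",
        "fl": "*,score",
        "rows": str(rows),
    }

def build_query_string(keywords: List[Tuple[str, float]], day: str, rows: int = 1000) -> str:
    """URL-encode the parameters for appending to the select URL."""
    return urlencode(build_params(keywords, day, rows))
```

The keyword boosts then decide the order, while the filter query alone decides which emails are in the result set.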
You could do a first query for:
fq=datetime:[YY-MM-DDT00:00:00.000Z TO YY-MM-DDT23:59:59.999Z]
which gives all documents that match the range. Then use CachingWrapperFilter for the second query to find documents in the DocSet from the first query which have at least one keyword. They will be relevance-ranked per tf-idf. You may want to use ConstantScoreQuery for the first query to get the list of matching doc ids in the fastest possible way.
Sorting by relevance is the default behavior of Solr/Lucene.
If your results are unsatisfactory, try putting the keywords in quotes.
Edit: Following the answer from Paige Cook, use something like this:
q="important thing"&fq=datetime:[YY-MM-DDT00:00:00.000Z TO YY-MM-DDT23:59:59.999Z]
Second update: on reflection, quotes are not a good idea, because in that case you will only receive "important thing" mails, but not "important too".
The point is which keywords you are using. Searching for -- important thing -- gives the highest scores to "important thing" mails, but Lucene does not know how to score "important too" or "not so important, but more than average" in relation to your keywords.
Another idea would be to search only for "important". But the field values "important thing" and "important too" get nearly the same score, because the searched keyword (in this case "important") makes up 50% of each field value.
So you probably have to change your keywords. It could work after changing "important too" into "also an important mail", to get the best ratio of the search word "important" to the field value, so that the shortest mail description gets the highest score.

Lucene Fuzzy Search for customer names and partial address

I went through all the existing questions and posts but couldn't find anything very relevant.
I have a file with millions of records containing a person's first name, last name, address1, address2, country code, and date of birth. I would like to check my list of customers against this file on a daily basis (my customer list is updated daily, and the file is also updated daily).
For first name and last name I would like a fuzzy match (maybe Lucene FuzzyQuery / Levenshtein distance, 90% match), and for the remaining fields, country and date of birth, I want an exact match.
I am new to Lucene, but from looking at a number of posts, it looks like it's possible.
My questions are:
How should I index my input file? I need to build an index on the combination of FN, LN, country, DOB and use that index for search.
How can I use Lucene's fuzzy query here?
Is there any other way I can implement the same?
Rushik, here are a few ideas:
Consider using Solr. It is much easier to start using it, rather than bare Lucene.
Build a Lucene/Solr index of the file. It appears that a document per customer is enough, if you use a multi-valued field or two different fields for addresses.
Do you have a unique id per person? To use Solr, you need one. In Lucene, you can get away without using a unique id.
Store the country code as a "keyword". If you only require exact match for date of birth, you may do the same. For range queries, you will need another representation.
I assume your customer list is smaller than the file. A possible policy would be to daily index the changes in the file (Here a unique id is really handy - otherwise you need to delete by query, which may miss the mark). Then you can optimize the index, and after that run a search for your updated customer list.
What you describe is a BooleanQuery whose clauses are fuzzy queries for the first and last names and term queries for the other fields. You can create the query programmatically or using the query parser.
Consider using soundex for names as described here.
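The matching rule described above (fuzzy on names, exact on country and date of birth) can be prototyped without an index to check what a given threshold accepts. The sketch below is plain Python. The similarity formula `1 - distance / longest length` is an assumption for illustration and is not necessarily the one Lucene's FuzzyQuery uses, and the record keys (`first`, `last`, `country`, `dob`) are hypothetical.

```python
from typing import Dict

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1]; 1.0 means identical."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

def is_match(customer: Dict[str, str], record: Dict[str, str], threshold: float = 0.9) -> bool:
    """Exact match on country and date of birth, fuzzy match on both names."""
    return (customer["country"] == record["country"]
            and customer["dob"] == record["dob"]
            and similarity(customer["first"], record["first"]) >= threshold
            and similarity(customer["last"], record["last"]) >= threshold)
```

Under this formula a 90% threshold tolerates a single edit only in names of ten or more characters ("Jonathan" versus "Jonathon" scores 0.875), so it is worth testing the threshold on real names before settling on it.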
Some academic papers on this subject are well worth reading (google for the free PDFs):
A Comparison of Personal Name Matching: Techniques and Practical Issues (2006)
Overview of Record Linkage and Current Research Directions (2006)
A Parallel Open Source Data Linkage System (2004)
You should also consider the following libraries/frameworks:
Duke: https://github.com/larsga/Duke
Febrl: http://sourceforge.net/projects/febrl/
(Answered for future visitors.)

Make lucene treat all terms in a field as a single term

In my Lucene documents I have a field "company" where the company name is tokenized.
I need the tokenization for a certain part of my application.
But for this query, I need to be able to create a PrefixQuery over the whole company field.
Example:
My Brand
my
brand
brahmin farm
brahmin
farm
Regularly querying for "bra" would return both documents because they both have a term starting with bra.
The result I want though, would only return the last entry because the first term starts with bra.
Any suggestions?
Create another indexed field, where the company name is not tokenized. When necessary, search on that field rather than the tokenized company name field.
If you want fast searches, you need to have index entries that point directly at the records of interest. There might be something that you can do with the proximity data to filter records, but it will be slow. I see the problem as: how can a "contains" query over a complete field be performed efficiently?
You might be able to minimize the increase in index size by creating (for each current field) a "first term" field and "remaining terms" field. This would eliminate duplication of the first term in two fields. For "normal" queries, you look for query terms in either of these fields. For "startswith" queries, you search only the "first term" field. But this seems like more trouble than it's worth.
Use a SpanQuery to only search the first term position. A PrefixQuery wrapped by SpanMultiTermQueryWrapper wrapped by SpanPositionRangeQuery:
<SpanPositionRangeQuery: spanPosRange(SpanMultiTermQueryWrapper(company:bra*), 0, 1)>
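The difference between the approaches in these answers can be seen on the two example companies. The sketch below models them in plain Python rather than with Lucene queries; the function names are hypothetical, and lowercasing plus whitespace splitting stands in for the analyzer.

```python
from typing import List

def tokenize(name: str) -> List[str]:
    return name.lower().split()

def any_term_prefix(name: str, prefix: str) -> bool:
    """Models a regular PrefixQuery on the tokenized field: any term may match."""
    return any(tok.startswith(prefix) for tok in tokenize(name))

def first_term_prefix(name: str, prefix: str) -> bool:
    """Models the span approach: only the term at position 0 may match."""
    tokens = tokenize(name)
    return bool(tokens) and tokens[0].startswith(prefix)

def whole_field_prefix(name: str, prefix: str) -> bool:
    """Models a prefix search on a separate, untokenized copy of the field."""
    return name.lower().startswith(prefix.lower())
```

In this model the first-term check and the untokenized field agree for a prefix such as "bra", but only the untokenized field can match a prefix that continues past the first word, such as "brahmin f".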