Is it possible to order Lucene documents by matching term?

I'm using Lucene 4.10.3 with Java 1.7.
I'm wondering whether it's possible to order query results by the matching term.
Simply put, if my documents contain a text field and the query is
text:a*
I want documents with ab first, then ac, then ad, etc.
The real case is more complex, however. What I'm actually trying to accomplish is to "stuff" a relational DB into my Lucene index (probably not the best idea?).
An appropriate example would be:
I have documents representing books in a library. Every book has a title and also a list of people who have borrowed the book, with the date of borrowing.
When a user searches for a book with a title containing "JAVA", I want to give priority to books that were borrowed by this user. (This could be accomplished by adding a TextField "borrowers", adding a SHOULD clause on it, and ordering by score.)
Also, if there are several books with "JAVA" that this user has borrowed before, I want to show the most recently borrowed ones first. So I thought to create a TextField "borrowers" that will look like
borrowers : "user1__20150505 user2__20150506" etc.
I would then add a BooleanClause borrowers: user1* and order by matching term.
Any other solution ideas are welcome.

I understand your real problem is more complex, but maybe this is helpful anyway.
You could first search the index for tokens that match your query, then execute a separate query for each matching token.
See https://lucene.apache.org/core/6_0_1/core/org/apache/lucene/index/TermsEnum.html for that. Just seek to the prefix and iterate until the prefix stops matching.
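The "seek to the prefix and iterate until the prefix stops matching" step can be illustrated without Lucene. The following is only a minimal sketch: a java.util.TreeSet stands in for the field's sorted term dictionary, and the class and method names (PrefixTerms, matching) are made up for the example. Lucene orders terms by bytes rather than by Java string order, so treat this as an illustration of the idea, not of the TermsEnum API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Stand-in for a field's sorted term dictionary; this is NOT the Lucene API.
class PrefixTerms {
    /** Returns every term that starts with the prefix, in sorted term order. */
    static List<String> matching(TreeSet<String> terms, String prefix) {
        List<String> out = new ArrayList<>();
        // tailSet(prefix, true) plays the role of seeking to the first term >= prefix.
        for (String term : terms.tailSet(prefix, true)) {
            if (!term.startsWith(prefix)) {
                break; // terms are sorted, so the prefix cannot match again later
            }
            out.add(term);
        }
        return out;
    }
}
```

With the borrowers field from the question, seeking to "user1__" would yield that user's terms in ascending date order, so reversing the list gives the most recent borrowing first. Each returned term could then be used in its own exact-term query, as suggested above.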
In general it is sometimes easiest to just issue two queries: for example, one within the corpus of books the user has borrowed before and another within the whole corpus.
If these approaches do not work, you could implement a custom scorer that maps the ordering to a number.
See http://opensourceconnections.com/blog/2014/03/12/using-customscorequery-for-custom-solrlucene-scoring/

Related

Lucene Fuzzy Search for customer names and partial address

I went through the existing questions and posts but couldn't find anything relevant.
I have a file with millions of records containing first name, last name, address1, address2, country code, and date of birth. I would like to check my list of customers against this file on a daily basis (my customer list is updated daily, and so is the file).
For first name and last name I would like a fuzzy match (maybe Lucene FuzzyQuery / Levenshtein distance with a 90% match), and for the remaining fields, country and date of birth, I want an exact match.
I am new to Lucene, but judging by the number of posts on the subject, it looks like it's possible.
My questions are:
How should I index my input file? I need to build an index on the combination of FN, LN, country, and DOB and use that index for search.
How can I use Lucene's fuzzy query here?
Is there any other way I can implement the same?
Rushik, here are a few ideas:
Consider using Solr. It is much easier to start using it, rather than bare Lucene.
Build a Lucene/Solr index of the file. It appears that a document per customer is enough, if you use a multi-valued field or two different fields for addresses.
Do you have a unique id per person? To use Solr, you need one. In Lucene, you can get away without using a unique id.
Store the country code as a "keyword". If you only require exact match for date of birth, you may do the same. For range queries, you will need another representation.
I assume your customer list is smaller than the file. A possible policy would be to daily index the changes in the file (Here a unique id is really handy - otherwise you need to delete by query, which may miss the mark). Then you can optimize the index, and after that run a search for your updated customer list.
What you describe is a BooleanQuery whose clauses are fuzzy queries for the first and last names and term queries for the other fields. You can create the query programmatically or using the query parser.
Consider using soundex for names as described here.
Some academic papers on this subject are well worth reading (google for the free PDFs):
A Comparison of Personal Name Matching: Techniques and Practical Issues (2006)
Overview of Record Linkage and Current Research Directions (2006)
A Parallel Open Source Data Linkage System (2004)
You should also consider the following libraries/frameworks:
Duke: https://github.com/larsga/Duke
Febrl: http://sourceforge.net/projects/febrl/
(Answered for future visitors.)
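To make the matching rule from the question concrete (fuzzy on first and last name, exact on country and date of birth), here is a minimal plain-Java sketch. It does not use Lucene; the class name CustomerMatch, the field order in the arrays, and the maxEdits threshold are assumptions chosen for illustration. In Lucene the same shape would be a BooleanQuery with fuzzy clauses for the names and term clauses for the other fields, as described above.

```java
import java.util.Locale;

// Plain-Java illustration of "fuzzy names, exact country and DOB"; not Lucene code.
class CustomerMatch {
    /** Classic dynamic-programming Levenshtein edit distance. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    static boolean fuzzyEquals(String a, String b, int maxEdits) {
        return editDistance(a.toLowerCase(Locale.ROOT), b.toLowerCase(Locale.ROOT)) <= maxEdits;
    }

    /** Assumed field order: first name, last name, country code, date of birth. */
    static boolean matches(String[] customer, String[] record, int maxEdits) {
        return fuzzyEquals(customer[0], record[0], maxEdits)
            && fuzzyEquals(customer[1], record[1], maxEdits)
            && customer[2].equals(record[2])
            && customer[3].equals(record[3]);
    }
}
```

Comparing every customer against every record this way would be far too slow for millions of records; the point of the index is to find the candidate records quickly, and a check like this only illustrates what counts as a match.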

Pattern for searching entire DB record, not specific field

More and more, I'm seeing searches that not only find a substring in a specific column, but appear to search in all columns. An example is Amazon, where you can search for "Arnold" and it finds both the movie Running Man starring Arnold Schwarzenegger and the Gund toy Arnold the Snoring Pig. I don't know what the term is for this type of search (Wide search? Global search?), and that bugs me. But what I really want to know is what the normal pattern is for accomplishing this type of search in a QUICK way.
The obvious, and slow, way to do it would be to search for the substring "Arnold" in the title, "Arnold" in the author, "Arnold" in the description, etc.
The first quick solution that comes to mind is to store a mapping for each word used to describe a product to the product itself, and then search that word mapping. That could be quick, but doesn't seem very space-efficient to me.
There are probably a hundred ways to accomplish this, some of which probably don't even use a database. But what is the norm?
I've done this in the past by storing an XML version of items in an XML column in the table, then searching in that column instead of the others.
Maybe they're not storing the data the way you expect.
They could, for example, store all titles, authors, descriptions, and every other searchable field in one table with an attribute to distinguish the field's type.
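The word-to-product mapping mentioned in the question is usually called an inverted index, and it is what full-text engines such as Lucene build internally. A minimal in-memory sketch follows; the class name WordIndex and the simple split-on-non-word-characters tokenisation are assumptions for illustration only.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Minimal inverted index: each word maps to the ids of the products whose fields contain it.
class WordIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    /** Indexes every word of every searchable field (title, author, description, ...). */
    void add(int productId, String... fields) {
        for (String field : fields) {
            for (String word : field.toLowerCase(Locale.ROOT).split("\\W+")) {
                if (!word.isEmpty()) {
                    postings.computeIfAbsent(word, k -> new TreeSet<>()).add(productId);
                }
            }
        }
    }

    /** Looks up one whole word, regardless of which field it came from. */
    Set<Integer> search(String word) {
        Set<Integer> hit = postings.get(word.toLowerCase(Locale.ROOT));
        return hit == null ? Collections.<Integer>emptySet() : hit;
    }
}
```

A lookup is a single hash access per query word instead of a substring scan over every column, which is where the speed comes from; the cost is the extra storage the question worries about.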

Tag suggestion (not tag autocomplete)

AJAX autocomplete is fairly simple to implement. However, I wonder how to handle smart tag suggestion like this on SO.
To clarify the difference between autocomplete and suggestion:
autocomplete: foo [foobar, foobaz]
suggestion: foo [barfoo, foobar, foobaz], or even better, with 'did you mean' feature: [barfoo, foobar, foobaz, fobar, fobaz]
I suppose I need some full text search in tags (all letters indexed, not just words). There would be no problem doing it with regex or other patterns for a limited number of tags (even client side).
But how to implement this feature for big number of tags?
Is there any particular reason (besides URL) the tags on SO are dash separated? What about Unicode characters in tags?
I store the tags in the table with the following columns: id, tagname.
My SQL query returns objects with following fields: id, tagname, count
(I use Doctrine ORM and pgsql as default db driver.)
I would go with SELECTing them from the database by REGEXP at every keypress. I did this on my sites and there was no performance problem (I do not have a heavily loaded server, though). If you do not like this idea, I would cache all the 1-5 letter combinations that users enter and refresh them on a daily basis in a separate table. If this table is indexed, then you have a very fast implementation.
To elaborate more on the second approach:
Briefly: 1. Make a table SEARCHTABLE representing a 1-n relationship between keywords (limit them to 3-4 letters) and primary IDs of tags. 2. INDEX both fields. 3. Every time the user makes a search, look in SEARCHTABLE, and if the combination is there, use that: very fast, as everything is indexed. If not, do the regexp search and put all the results into SEARCHTABLE.
Notes:
You should invalidate the table if you add tags, but this should happen much less often than a search. When invalidating the table you do not necessarily have to TRUNCATE it; you can easily rebuild it, taking all keywords into account.
If you want to speed it up, you can "pregenerate" all two- or even three-letter searches.
If you care enough, you should use the information from the (n-1)-letter keywords to generate the n-letter keyword. It speeds things up tremendously. Imagine that the user has typed "mo" and you have shown them the appropriate result from SEARCHTABLE. Then when she types "n", giving "mon", you only need to search through the already selected items to generate the new response.
Hope it is more comprehensive now.
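The caching scheme above, including the trick of reusing the (n-1)-letter result, can be sketched in memory as follows. This is only an illustration under assumptions: the class name TagSuggestCache is made up, a HashMap stands in for SEARCHTABLE, and String.contains stands in for the REGEXP query.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// In-memory stand-in for SEARCHTABLE: typed text -> tags containing it.
class TagSuggestCache {
    private final List<String> allTags;
    private final Map<String, List<String>> cache = new HashMap<>();
    int fullScans = 0; // how often the whole tag list had to be scanned

    TagSuggestCache(List<String> allTags) {
        this.allTags = new ArrayList<>(allTags);
    }

    List<String> suggest(String typed) {
        List<String> cached = cache.get(typed);
        if (cached != null) {
            return cached;
        }
        // A tag containing "mon" also contains "mo", so the cached result for the
        // shorter text is a complete candidate list for the longer one.
        List<String> candidates = typed.length() > 1
                ? cache.get(typed.substring(0, typed.length() - 1))
                : null;
        if (candidates == null) {
            candidates = allTags;
            fullScans++;
        }
        List<String> result = new ArrayList<>();
        for (String tag : candidates) {
            if (tag.contains(typed)) {
                result.add(tag);
            }
        }
        cache.put(typed, result);
        return result;
    }

    /** Call when tags are added or removed. */
    void invalidate() {
        cache.clear();
    }
}
```

For example, with the tags monday, lemon, moon and java, typing "mo" scans all four tags once, and typing "mon" afterwards only filters the three cached matches.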

First Name Variations in a Database

I am trying to determine what the best way is to find variations of a first name in a database. For example, I search for Bill Smith. I would like it return "Bill Smith", obviously, but I would also like it to return "William Smith", or "Billy Smith", or even "Willy Smith". My initial thought was to build a first name hierarchy, but I do not know where I could obtain such data, if it even exists.
Since users can search the directory, I thought this would be a key feature. For example, people I went to school with called me Joe, but I always go by Joseph now. So, I was looking at doing a phonetic search on the last name, either with NYSIIS or Double Metaphone and then searching on the first name using this name hierarchy. Is there a better way to do this - maybe some sort of graded relevance using a full text search on the full name instead of a two part search on the first and last name? Part of me thinks that if I stored a name as a single value instead of multiple values, it might facilitate more search options at the expense of being able to address a user by the first name.
As far as platform, I am using SQL Server 2005 - however, I don't have a problem shifting some of the matching into the code; for example, pre-seeding the phonetic keys for a user, since they wouldn't change.
Any thoughts or guidance would be appreciated. Countless searches have pretty much turned up empty. Thanks!
Edit: It seems that there are two very distinct camps on the functionality and I am definitely sitting in the middle right now. I could see the argument of a full-text search - most likely done with a lack of data normalization, and a multi-part approach that uses different criteria for different parts of the name.
The problem ultimately comes down to user intent. The Bill / William example is a good one, because it shows the mutation of a first name based upon the formality of the usage. I think that building a name hierarchy is the more accurate (and extensible) solution, but is going to be far more complex. The fuzzy search approach is easier to implement at the expense of accuracy. Is this a fair comparison?
Resolution: Upon doing some tests, I have determined to go with an approach where the initial registration will take a full name and I will split it out into multiple fields (forename, surname, middle, suffix, etc.). Since I am sure that it won't be perfect, I will allow the user to edit the "parts", including adding a maiden or alternate name. As far as searching goes, with either solution I am going to need to maintain what variations exist, either in a database table or as a thesaurus. Neither has an advantage over the other in this case. I think it is going to come down to performance, and I will have to actually run some benchmarks to determine which is best. Thank you, everyone, for your input!
In my opinion you should either do a feature right and make it complete, or you should leave it off to avoid building a half-assed intelligence into a computer program that still gets it wrong most of the time ("Looks like you're writing a letter", anyone?).
In case of human names, a computer will get it wrong most of the time, doing it right and complete is impossible, IMHO. Maybe you can hack something that does the most common English names. But actually, the intelligence to look for both "Bill" and "William" is built into almost any English speaking person - I would leave it to them to connect the dots.
The term you are looking for is Hypocorism:
http://en.wikipedia.org/wiki/Hypocorism
And Wikipedia lists many of them. You could bang out some Python or Perl to scrape that page and put it in a db.
I would go with a structure like this:
create table given_names (
id int primary key,
name text not null unique
);
create table hypocorisms (
id int references given_names(id),
name text not null,
primary key (id, name)
);
insert into given_names values (1, 'William');
insert into hypocorisms values (1, 'Bill');
insert into hypocorisms values (1, 'Billy');
Then you could write a function/sproc to normalize a name:
normalize_given_name('Bill'); --returns William
One issue you will face is that different names can have the same hypocorism (Albert -> Al, Alan -> Al)
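One way to live with that ambiguity is to have the lookup return a set of candidate given names instead of a single one. The sketch below mirrors the two tables above in memory; the class name GivenNames and its methods are hypothetical, and a real implementation would more likely be a query against the hypocorisms table.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// In-memory analogue of the given_names / hypocorisms tables (illustrative only).
class GivenNames {
    private final Map<String, Set<String>> formalByNickname = new HashMap<>();

    void addHypocorism(String formal, String nickname) {
        formalByNickname
                .computeIfAbsent(nickname.toLowerCase(Locale.ROOT), k -> new TreeSet<>())
                .add(formal);
    }

    /** Returns every given name the input may stand for, or the input itself if unknown. */
    Set<String> normalize(String name) {
        Set<String> formal = formalByNickname.get(name.toLowerCase(Locale.ROOT));
        return formal == null ? Collections.singleton(name) : formal;
    }
}
```

A search for "Al Smith" would then expand to both Alan and Albert, while "Bill Smith" expands only to William.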
I think your basic approach is solid. I don't think fulltext is going to help you. For seeding, behindthename.com seems to have large amount of the data you want.
Are you using SQL Server 2005 Express with Advanced Services? To me it sounds like you would benefit from Full Text indexing, and more specifically from CONTAINS and CONTAINSTABLE. Here is a link describing the uses of CONTAINSTABLE:
http://msdn.microsoft.com/en-us/library/ms189760.aspx
and here is the download link for SQL Server 2005 With Advanced Services:
http://www.microsoft.com/downloads/details.aspx?familyid=4C6BA9FD-319A-4887-BC75-3B02B5E48A40&displaylang=en
Hope this helps,
Andrew
You can use SQL Server Full Text Search and do an inflectional or thesaurus search with FORMSOF.
Basically like:
SELECT ProductId, ProductName
FROM ProductModel
WHERE CONTAINS(CatalogDescription, ' FORMSOF(THESAURUS, metal) ')
Check out:
http://en.wikipedia.org/wiki/SQL_Server_Full_Text_Search#Inflectional_Searches
http://msdn.microsoft.com/en-us/library/ms345119.aspx
http://www.mssqltips.com/tip.asp?tip=1491
Not sure what your application is, but if your users know at the time of sign up that people from their past might be searching the database for them, you could offer them the chance in the user profile to define other names they might be known as (including last names, women change these all the time and makes finding them much harder!) and that they want people to be able to search on. Store these in a separate related table. Then search on that. Just make the structure such that you can define one name as the main name (the one you use for everything except the search.)
You'll find that you're dabbling in an area known as "Natural Language Processing" and you'll need to do several things, most of which can be found under the topic of stemming.
Simplistic stemming simply breaks the word apart, but more advanced algorithms associate words that mean the same thing - for instance Google might use stemming to convert "cat" and "kitten" to "feline" and search for all three, weighing the actual word provided by the user as slightly heavier so exact matches return before stemmed matches.
It's a known problem, and there are open source stemmers available.
-Adam
No, Full Text searches will not help to solve your problem.
I think you might want to take a look at some of the following links: (Funny, no one mentioned SoundEx till now)
SoundEx - MSDN
SoundEx - Google results
InformIT - Tolerant Search algorithms
Basically SoundEx allows you to evaluate the level of similarity in similar sounding words. The function is also available on SQL 2005.
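For readers who want to see what such a phonetic key looks like, here is a minimal sketch of the classic American Soundex rules in Java. It is an independent illustration, not SQL Server's implementation, and its output may differ from the SOUNDEX function for some inputs.

```java
import java.util.Locale;

// Minimal American Soundex sketch; not guaranteed to match SQL Server's SOUNDEX().
class Soundex {
    // Digit for each letter A..Z; '0' marks letters that are not coded (vowels, H, W, Y).
    private static final String CODES = "01230120022455012623010202";

    static String encode(String name) {
        StringBuilder out = new StringBuilder();
        char last = '0';
        for (char raw : name.toUpperCase(Locale.ROOT).toCharArray()) {
            if (raw < 'A' || raw > 'Z') {
                continue; // ignore anything that is not a basic Latin letter
            }
            char code = CODES.charAt(raw - 'A');
            if (out.length() == 0) {
                out.append(raw); // the first letter is kept as-is
                last = code;
            } else if (raw == 'H' || raw == 'W') {
                // H and W do not separate letters with the same code: keep 'last' unchanged
            } else {
                if (code != '0' && code != last) {
                    out.append(code);
                }
                last = code; // a vowel resets 'last', so a repeated code after it is kept
            }
            if (out.length() == 4) {
                break;
            }
        }
        while (out.length() < 4) {
            out.append('0'); // pad short keys
        }
        return out.toString();
    }
}
```

With this sketch, "Smith" and "Smyth" share a key, which is the kind of match Soundex is good at. "Bill" and "William" do not share a key, so a phonetic key alone does not cover nickname variations of the kind the question asks about.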
As a side issue, instead of returning similar results, it might prove more intuitive to the user to use an AJAX-based script to deliver similar-sounding names before the user initiates his/her search. That way you can show the user "similar names" or "did you mean..." kind of data.
Here's an idea for automatically finding "name synonyms" like Bill/William. That problem has been studied in the broader context of synonyms in general: inducing them from statistics of which words commonly appear in the same contexts in a large text corpus like the Web. You could try combining that approach with a list of names like Moby Names; I don't know if it's been done before.
Here are some pointers.

All of these words feature

I have a "description" field indexed in Lucene. This field contains a book's description.
How do I achieve "All of these words" functionality on this field using the BooleanQuery class?
For example if a user types in "top selling book" then it should return books which have all of these words in its description.
Thanks!
There are two pieces to get this to work:
You need the incoming documents to be analysed properly, so that individual words are tokenised and indexed separately
The user query needs to be tokenised, and the tokens combined with the AND operator.
For #1, there are a number of Analyzers and Tokenizers that come with Lucene - have a look in the org.apache.lucene.analysis package. There are options for many different languages, stemming, stopwords and so on.
For #2, there are again a lot of query parsers that come with Lucene, mainly in the org.apache.lucene.queryParser package. MultiFieldQueryParser might be good for you: to require every term to be present, just call the following on your parser instance:
QueryParser.setDefaultOperator(QueryParser.AND_OPERATOR)
Lucene in Action, although a few versions old, is still accurate and extremely useful for more information on analysis and query parsing.
I believe if you add all query parts (one per term) via
BooleanQuery.add(Query, BooleanClause.Occur)
and set that second parameter to the constant BooleanClause.Occur.MUST, then you should get what you want. The equivalent query syntax would be "+term1 +term2 +term3 ...".
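The semantics of "every clause is MUST" can be shown without Lucene. In the sketch below, the class AllWords and its whitespace/non-word tokenisation are assumptions for illustration; a real index would use whatever Analyzer the field was indexed with, and toQuerySyntax does not escape Lucene's special query characters.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Plain-Java illustration of "all of these words"; not Lucene code.
class AllWords {
    static Set<String> tokens(String text) {
        Set<String> out = new HashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    /** True only if every query word occurs in the description (the MUST semantics). */
    static boolean matchesAll(String description, String query) {
        return tokens(description).containsAll(tokens(query));
    }

    /** Builds the "+term1 +term2 ..." form; special characters are NOT escaped here. */
    static String toQuerySyntax(String query) {
        StringBuilder sb = new StringBuilder();
        for (String t : query.trim().split("\\s+")) {
            if (t.isEmpty()) {
                continue;
            }
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append('+').append(t);
        }
        return sb.toString();
    }
}
```

For the example in the question, toQuerySyntax("top selling book") gives the string "+top +selling +book", and a description lacking any one of the three words is rejected by matchesAll.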