Lucene Proximity Search with multiple words - lucene

I am trying to build a query to search an Lucene index of names with name variants.
The index was built with Lucene.NET version 2.9.2
The user enters for example "Margaret White".
Without a name variant option, my query becomes "Margaret White"~1 and it works.
Now, I can look up name variants against both firstname and surname to produce an extended list.
eg. in this case (and I only include some as an example, since the list can be 100 or more sometimes) we can have
Margaret / Margrett White / Whyte
The query "margrett white"~1 OR "margaret white"~1 OR "margrett whyte"~1 OR "margaret whyte"~1
gives me the correct result but given a possible 100 x 100 variant combinations, the query string woudl be cumbersome to say the least.
I have tried various ways to achieve a more compact query but nothing seems to work.
Can anyone give me any pointers or alternative approach. I have control over the index creation process and wonder if there is something I can do at that stage?
Thanks for looking
Roger

Do the synonym filter in your indexing process instead of at query time. Just map "white", "whyte", ... to some single word; say "white". Same for "Margaret."
Then your query will just be "margaret white"~1

I faced a similar problem and solved it by writing my own query parser and manually instantiating query primitives. Writing your own query parser is not necessarily easy, but it gives you a lot of flexibility. In my new query language, I use within/N to specify a proximity query. With it, the following complex queries become possible:
margaret within/3 white
margaret within/3 (white or whyte)
Or even more complex queries
("first name" within/3 margaret) within/10 ("last name" within/3 (white or whyte))

Related

How to tackle efficient searching of a string that could have multiple variations?

My title sounds complicated, but the situation is very simple. People search on my site using a term such as "blackfriday".
When they conduct the search, my SQL code needs to look in various places such as a ProductTitle and ProductDescription field to find this term. For example:
SELECT *
FROM dbo.Products
WHERE ProductTitle LIKE '%blackfriday%' OR
ProductDescription LIKE '%blackfriday%'
However, the term appears differently in the database fields. It is most like to appear with a space between the words as such "Black Friday USA 2015". So without going through and adding more combinations to the WHERE clause such as WHERE ProductTitle LIKE '%Black-Friday%', is there a better way to accomplish this kind of fuzzy searching?
I have full-text search enabled on the above fields but its really not that good when I use the CONTAINS clause. And of course other terms may not be as neat as this example.
I should start by saying that "variations (of a string)" is a bit vague. You could mean plurality, verb tenses, synonyms, and/or combined words (or, ignoring spaces and punctuation between 2 words) like the example you posted: "blackfriday" vs. "black friday" vs "black-friday". I have a few solutions of which 1 or more together may work for you depending on your use case.
Ignoring punctuation
Full Text searches already ignore punctuation and match them to spaces. So black-friday will match black friday whether using FREETEXT or CONTAINS. But it won't match blackfriday.
Synonyms and combined words
Using FREETEXT or FREETEXTTABLE for your full text search is a good way to handle synonyms and some matching of combined words (I don't know which ones). You can customize the thesaurus to add more combined words assuming it's practical for you to come up with such a list.
Handling combinations of any 2 words
Maybe your use case calls for you to match poorly formatted text or hashtags. In that case I have a couple of ideas:
Write the full text query to cover each combination of words using a dictionary. For example your data layer can rewrite a search for black friday as CONTAINS(*, '"black friday" OR "blackfriday"'). This may have to get complex, for example would black friday treehouse have to be ("black friday" OR "blackfriday") AND ("treehouse" OR "tree house")? You would need a dictionary to figure out that "treehouse" is made up of 2 words and thus can be split.
If it's not practical to use a dictionary for the words being searched for (I don't know why, maybe acronyms or new memes) you could create a long query to cover every letter combination. So searching for do-re-mi could be "do re mi" OR "doremi" OR "do remi" OR "dore mi" OR "d oremi" OR "d o remi" .... Yes it will be a lot of combinations, but surprisingly it may run quickly because of how full text efficiently looks up words in the index.
A hack / workaround if searching for multiple variations is very important.
Define which fields in the DB are searchable (e.g ProductTitle, ProductDescription)
Before saving these fields in the DB, replace each space (or consecutive spaces by a placeholder e.g "%")
Search the DB for variation matches employing the placeholder
Do the reverse process when displaying these fields on your site (i.e replace placeholder with space)
Alternatively you can enable regex matching for your users (meaning they can define a regex either explicitly or let your app build one from their search term). But it is slower and probably error-prone to do it this way
After looking into everything, I have settled for using SQL's FREETEXT full-text search. Its not ideal, or accurate, but for now it will have to do.
My answer is probably inadequate but do you have any scenarios which wont be addressed by query below.
SELECT *
FROM dbo.Products
WHERE ProductTitle LIKE '%black%friday%' OR
ProductDescription LIKE '%black%friday%'

Lucene.NET - do an AND search multiple words on multiple fields

I define a Document object for my product entity which has several fields: Title, Brand, Category, Size, Color, Material.
Now I want to support user to do an AND search on multiple fields. Any document that have one, two or more fields contain all the search words will be responded.
For example, when user enter "gucci shirt red" I want to return all documents that have fields matched with all 3 tokens "gucci", "shirt" AND "red". So all documents below will be responded:
1.Documents with title contains all the 3 words, for example Title = "Gucci Modern Shirt Red" or "Gucci blue shirt"...
2.Documents with Title = "Gucci classical shirt" AND Color = "red"
3.Documents with Category = "mens shirt" AND Brand = "gucci" AND Color = "red"
4.etc..
I know that Lucene support operator + that do a MUST for search query. For example I can translate the above keyword to query "+gucci +shirt +red" then I'm sure documents of example (1) above will definitely be responded. But does it work for cases (2) and (3) above ?
When doing these types of queries I like to: create a master BooleanQuery and add several sub-queries that work together to give the best result:
TermQuery: (exact match), someone types in the exact match of the title
PhraseQuery: (use slop), so if you have "Gucci Modern Shirt Red" and someone types in "Gucci Shirt" (notice one word gap) it would match
FuzzyQuery: (slow on large(> 50 million records)/non-memory indexes) to account for potential misspellings
Boolean SubQuery: with all of the terms seperated and OR'ed. Queries matching 1 our of 4 words will have low score however 3/4 words will have a higher score.
Query Parse (as mentioned above with potential field boosts)
Other: i.e. Synonym search on phrases etc.
I would OR all of these types and then filter them out using a Collector minimum score.
The reason I like the master BooleanQuery approach is that you can have a setting where a user chooses "the type" of query. Maybe as simple -> advanced and it is easy to add/remove query types rather quickly on the fly and the query can be built pretty easily giving predicitve results. Boosting records/similarity you are working within the internal Lucene algorithm and results are not sometimes clear.
Performance: I have done queries like this using Lucene 3.0.x on indexes with > 100M records NOT IN MEMORY and it works pretty quickly giving sub-second responses. Fuzzy Query does slow things down, but as stated before that can be made into an advanced search option (or "Search again with...")
No, when not given a a field to search explicitly in the query, it will go to the default field, which it would appear is the "title" in your case. You would need a query more like:
+shirt +color:red +brand:gucci
for instance.
Or, one common usage is to set up a catch all field, in which all (or a large subset) of searchable data is mashed together, allowing you to search everything in a very loose fashion, on that field, in which case you would just use something like:
all:(+shirt +gucci +red)
Or, if you made that field your default field instead:
+shirt +gucci +red
As you indicated.
You could use MultiFieldQueryParser. Add Title, color, brand etc to this.
If you search for "gucci shirt red" then using above Parser would return query like
+((Title:gucci Color:gucci Brand:gucci) (Title:shirt Color:shirt Brand:shirt) (Title:red Color:red Brand:red)
This should solve the problem.
Also, if you want that lets say, for above query you want to show brand with gucci products to be shown 1st then you could apply boost to this field.

SQL - searching database with the LIKE operator

Given your data stored somewhere in a database:
Hello my name is Tom I like dinosaurs to talk about SQL.
SQL is amazing. I really like SQL.
We want to implement a site search, allowing visitors to enter terms and return relating records. A user might search for:
Dinosaurs
And the SQL:
WHERE articleBody LIKE '%Dinosaurs%'
Copes fine with returning the correct set of records.
How would we cope however, if a user mispells dinosaurs? IE:
Dinosores
(Poor sore dino). How can we search allowing for error in spelling? We can associate common misspellings we see in search with the correct spelling, and then search on the original terms + corrected term, but this is time consuming to maintain.
Any way programatically?
Edit
Appears SOUNDEX could help, but can anyone give me an example using soundex where entering the search term:
Dinosores wrocks
returns records instead of doing:
WHERE articleBody LIKE '%Dinosaurs%' OR articleBody LIKE '%Wrocks%'
which would return squadoosh?
If you're using SQL Server, have a look at SOUNDEX.
For your example:
select SOUNDEX('Dinosaurs'), SOUNDEX('Dinosores')
Returns identical values (D526) .
You can also use DIFFERENCE function (on same link as soundex) that will compare levels of similarity (4 being the most similar, 0 being the least).
SELECT DIFFERENCE('Dinosaurs', 'Dinosores'); --returns 4
Edit:
After hunting around a bit for a multi-text option, it seems that this isn't all that easy. I would refer you to the link on the Fuzzt Logic answer provided by #Neil Knight (+1 to that, for me!).
This stackoverflow article also details possible sources for implentations for Fuzzy Logic in TSQL. Once respondant also outlined Full text Indexing as a potential that you might want to investigate.
Perhaps your RDBMS has a SOUNDEX function? You didn't mention which one was involved here.
SQL Server's SOUNDEX
Just to throw an alternative out there. If SSIS is an option, then you can use Fuzzy Lookup.
SSIS Fuzzy Lookup
I'm not sure if introducing a separate "search engine" is possible, but if you look at products like the Google search appliance or Autonomy, these products can index a SQL database and provide more searching options - for example, handling misspellings as well as synonyms, search results weighting, alternative search recommendations, etc.
Also, SQL Server's full-text search feature can be configured to use a thesaurus, which might help:
http://msdn.microsoft.com/en-us/library/ms142491.aspx
Here is another SO question from someone setting up a thesaurus to handle common misspellings:
FORMSOF Thesaurus in SQL Server
Short answer, there is nothing built in to most SQL engines that can do dictionary-based correction of "fat fingers". SoundEx does work as a tool to find words that would sound alike and thus correct for phonetic misspellings, but if the user typed in "Dinosars" missing the final U, or truly "fat-fingered" it and entered "Dinosayrs", SoundEx would not return an exact match.
Sounds like you want something on the level of Google Search's "Did you mean __?" feature. I can tell you that is not as simple as it looks. At a 10,000-foot level, the search engine would look at each of those keywords and see if it's in a "dictionary" of known "good" search terms. If it isn't, it uses an algorithm much like a spell-checker suggestion to find the dictionary word that is the closest match (requires the fewest letter substitutions, additions, deletions and transpositions to turn the given word into the dictionary word). This will require some heavy procedural code, either in a stored proc or CLR Db function in your database, or in your business logic layer.
You can also try the SubString(), to eliminate the first 3 or so characters . Below is an example of how that can be achieved
SELECT Fname, Lname
FROM Table1 ,Table2
WHERE substr(Table1.Fname, 1,3) || substr(Table1.Lname,1 ,3) = substr(Table2.Fname, 1,3) || substr(Table2.Lname, 1 , 3))
ORDER BY Table1.Fname;

Lucene Index and Query Design Question - Searching People

I have recently just started working with Lucene (specifically, Lucene.Net) and have successfully created several indicies and have no problem with any of them. Previously having worked with Endeca, I find that Lucene is lightweight, powerful, and has a much lower learning curve (due mostly to a concise API).
However, I have one specific index/query situation which I am having problems wrapping my head around. What I have is a person directory. People can be searched for in this application, with the goal of returning both exact and approximate matches. Right now, in the index I concatenate the "FirstName" and "LastName" into a single field called "FullName", adding a space between the two. So FirstName:Jon with LastName:Smith yield FullName:Jon Smith. I do anticipate the possibility of middle names and possibly suffix, but that is not important at the moment.
I would like to do the equivalent of a fuzzy search on the name, so someone searching for "John Smith" would still get back "Jon Smith". I had thought about a multisearch, however, this becomes more involved if his name was actually "Jon Del Carmen" or "Jon Paul Del Carmen". I have nothing in what the user types in to delineate the first name or last name pieces.
The only thought that I have is that I could replace spaces in the concatenated value with a character that would not be discarded. If I did this when I built the document for the index and also when I parsed the query, I could treat it as one larger word, right? Is there another way to do this that would work for both simple names ("Jon Smith") and also more complex names ("Jon Paul Del Carmen")?
Any advice would truly be appreciated. Thanks in advance!
Edit: Additional detail follows.
In Luke, I put in the following query:
FullName:jonn smith~
It is being parsed as:
FullName:jonn CreatedOn:smith~0.5
With an Explanation of:
BooleanQuery:boost=1.0000
clauses=2, maxClauses=1024
Clause 0: SHOULD
TermQuery:boost=1.0000
Term: field='FullName' text='jonn'
Cluase 1: SHOULD
FuzzyQuery: boost=1.0000
prefixLen=0, minSimilarity=0.5000
org.apache.lucene.search.FuzzyTermEnum: diff=-1.0000
FilteredTermEnum: Exception null
"CreatedOn" is another Field in the index. I tried putting quotes around the term "jonn smith", but it then treats it like a phrasequery, instead. I am sure that the problem is that I am just not doing something right, but being so green at all of this, I am not sure what that something truly is.
My problem was with how I was building the index. What I ended up doing was making sure that it was not tokenizing the FullName, and the query started returning the correct results. The Explain results from above were due to an ID10T error on my part and is now returning correctly.

Best way to implement a stored procedure with full text search

I would like to run a search with MSSQL Full text engine where given the following user input:
"Hollywood square"
I want the results to have both Hollywood and square[s] in them.
I can create a method on the web server (C#, ASP.NET) to dynamically produce a sql statement like this:
SELECT TITLE
FROM MOVIES
WHERE CONTAINS(TITLE,'"hollywood*"')
AND CONTAINS(TITLE, '"square*"')
Easy enough. HOWEVER, I would like this in a stored procedure for added speed benefit and security for adding parameters.
Can I have my cake and eat it too?
I agreed with above, look into AND clauses
SELECT TITLE
FROM MOVIES
WHERE CONTAINS(TITLE,'"hollywood*" AND "square*"')
However you shouldn't have to split the input sentences, you can use variable
SELECT TITLE
FROM MOVIES
WHERE CONTAINS(TITLE,#parameter)
by the way
search for the exact term (contains)
search for any term in the phrase (freetext)
The last time I had to do this (with MSSQL Server 2005) I ended up moving the whole search functionality over to Lucene (the Java version, though Lucene.Net now exists I believe). I had high hopes of the full text search but this specific problem annoyed me so much I gave up.
Have you tried using the AND logical operator in your string? I pass in a raw string to my sproc and stuff 'AND' between the words.
http://msdn.microsoft.com/en-us/library/ms187787.aspx