I was going thru all the existing questions posts but couldn't get something much relevant.
I have file with millions of records for person first name, last name, address1, address2, country code, date of birth - I would like to check my list of customers with above file on daily basis (my customer list also get updated daily and file also gets updated daily).
For first name and last name I would like fuzzy match (may be lucene fuzzyquery/levenshtein distance 90% match) and for remaining fields country and date of birth I wanted exact match.
I am new to Lucene, but by looking at number of posts, looks like its possible.
My questions are:
How should I index my input file? I need to build index on combination of FN, LN, country, DOB and use the index for search
How I can use Fuzzy query of Lucene here?
Is there any other way I can implement the same?
Rushik, here are a few ideas:
Consider using Solr. It is much easier to start using it, rather than bare Lucene.
Build a Lucene/Solr index of the file. It appears that a document per customer is enough, if you use a multi-valued field or two different fields for addresses.
Do you have a unique id per person? To use Solr, you need one. In Lucene, you can get away without using a unique id.
Store the country code as a "keyword". If you only require exact match for date of birth, you may do the same. For range queries, you will need another representation.
I assume your customer list is smaller than the file. A possible policy would be to daily index the changes in the file (Here a unique id is really handy - otherwise you need to delete by query, which may miss the mark). Then you can optimize the index, and after that run a search for your updated customer list.
What you describe is a BooleanQuery, Whose clauses are fuzzy queries for the first and last names and term queries for the other fields. You can create the query programmaticaly or using the query parser.
Consider using soundex for names as described here.
Some academic papers on this subject are well worth reading (google for the free PDFs):
A Comparison of Personal Name Matching: Techniques and Practical Issues (2006)
Overview of Record Linkage and Current Research Directions (2006)
A Parallel Open Source Data Linkage System (2004)
You should also consider the following libraries/frameworks:
Duke: https://github.com/larsga/Duke
Febrl: http://sourceforge.net/projects/febrl/
(Answered for future visitors.)
Related
I'm using Lucene 4.10.3 with Java 1.7
I'm wondering whether it's possible to order query results the matching term?
Simply put, if my documents conatin a text field;
The query is
text:a*
I want documents with ab, then ac, then ad etc.
The real case is more complex however, what I'm actually trying to accomplish is to "stuff" a relational DB into my lucene Index (probably not the best idea?).
An appropriate example would be :
I have documents representing books in a library. every book has a title and also a list of people who has borrowed this book and the date of borrowing.
when a user searches for a book with title containing "JAVA", I want to give priority to books that were borrowed by this user. This could be accomplished by adding a TextField "borrowers", adding a SHOULD clause on it and ordering by score)
also, if there are several books with "JAVA" that this user has borrowed before, I want to show the most recent borrowed ones first. so I thought to create a TextField "borrowers" that will look like
borrowers : "user1__20150505 user2__20150506" etc.
I will add a BooleanClause borrowers: user1* and order by matching term.
any other solution ideas will be welcome
I understand your real problem is more complex, but maybe this is helpful anyway.
You could first search for Tokens in the index that match your query, then for each matching token executing a query using this token specifically.
See https://lucene.apache.org/core/6_0_1/core/org/apache/lucene/index/TermsEnum.html for that. Just seek to the prefix and iterate until the prefix stops matching.
In general it is sometimes easy to just issue two queries. For example one within the corpus of books the user as borrowed before and another witin the whole corpus.
These approaches may not work, but in that case you could implement a custom Scorer somehow mapping the ordering to a number.
See http://opensourceconnections.com/blog/2014/03/12/using-customscorequery-for-custom-solrlucene-scoring/
I have a business requirement where we need to do somce crazy name matching against records stored in the database and I was wondering if there is any easy way to do it using SQL Server.
Name Stored in the DB : Austin K
Name to be Matched from UI : Austin Kierland
That's just a sample. In reality, there could be whole lot of different permutations and combinations.
If it's other way round, I could've used wild character but in this case, the name in the database is smaller than the search criteria.
Any suggestions?
Realistically - no. Databases were meant for comparing absolute values, not for messy comparisons. The way they store their data internally just isn't fit for really messy matching. Actually even a superpowerful dedicated search engine like Google, that has a LOT of messy matching features, wouldn't be able to pull off your example without prior knowledge.
I don't know how the requirement is precisely worded, but I'd either shoot the feature request with "technically impossible", or implement a rule set for which messy matches are tried - for your example, you could easily 'hard code' that multiple searches are executed when capitalized words are entered, shortening them so a single letter. No idea if that's a solution to your problem though.
You can do a normal search using the LIKE operator which determines whether a specific character string matches a specified pattern. The problem you will run into is the probability of the returning of multiple records or incorrect people. I've had similar requirement myself for a business app and the best solution to the issue is to require other qualifying values rather then just name. If you do a partial name search without other qualifying data you are certainly going to come across the false positive matches and/or multiple records. In my case I built a web service that checks eligibility allowing text search for first & last name but also added date of birth, primary person SSN, and gender which ensured the matching person was in deed the person intended to search for. If my situation was like yours in which name was the only search criteria my recommendation to the business would be we cannot perform the search until qualifying data is entered into the database otherwise there is no accurate way to query the results they are looking for.
I'm creating a self-help FAQ type application and one of the requirements is that the end user has to be able to search for FAQ topics. I have three models of note, listed below with their relevant (i.e. searchable) columns:
Topic: Name, Description
Question: Name, Answer
Problem: Name, Solution
All three tables are linked to Topic via a TopicID column. The idea is to provide a single textbox where the user can enter a search query, something either as a sentence (e.g. "How do I perform X") or a phrase (e.g. "Performing X" or "Perform X"), and provide all Topics/Questions/Problems that have any of the words they entered in either the name or description/answer/solution fields; the model will only ever have those columns searchable and I don't have to worry about filtering out the common words like "How" and such (It would be nice but isn't a requirement as it's not an exact match but a fuzzy match).
For reasons outside of my control, I have to use a Stored Procedure. My question is what would be the most appropriate way to handle a search like this; I've seen similar questions regarding multiple columns but really there is not a variable number of columns, there are always two columns per table that are actually searchable. The issue is that the search query could, in theory, be nearly anything - a sentence, a phrase, a comma-separated list of terms (e.g. "x,y,z"), so I would have to split the search term into components (e.g. split on whitespace) and then search each pair of columns for every term? Is that reasonably easy to do in SQL Server? The alternative, a little messier, is to just pull all the data back and then split the query and filter the results in the server-side code as there shouldn't ever be that many items entered, but I would feel a little dirty doing something like that ;-)
Suggest creating a new Full Text Catalog, and assign the table and columns to that catalog. Ensure your catalog is being updated at the right frequency for your needs.
You can then query this catalog using the FREETEXT predicate. It sounds like you need to match on those suffixes like 'ing', so suggest FREETEXT over CONTAINS in this case.
You can use a variable in this search, so it'll be easy to fit into a stored proc.
declare #token varchar(256);
select #token = 'perform';
select * from Problem
where freetext(Name, #token)
or freetext(Solution, #token);
--this will match 'perform' and 'performing'
More and more, I'm seeing searches that not only find a substring in a specific column, but they appear to search in all columns. An example is in Amazon, where you can search for "Arnold" and it finds both the movie Running Man starring Arnold Schwarzeneggar, and the Gund toy Arnold the Snoring Pig. I don't know what the term is for this type of search (Wide search? Global search?), and that bugs me. But what I really want to know is what is the normal pattern for accomplishing this type of search in a QUICK way.
The obvious, and slow, way to do it would be to search for the substring "Arnold" in the title, "Arnold" in the author, "Arnold" in the description, etc.
The first quick solution that comes to mind is to store a mapping for each word used to describe a product to the product itself, and then search that word mapping. That could be quick, but doesn't seem very space-efficient to me.
There are probably a hundred ways to accomplish this, some of which probably don't even use a database. But what is the norm?
I've done this in the past by storing an XML version of items in an XML column in the table, then searching in that column instead of the others.
Maybe they're not storing the data the way you expect.
They could, for example, store all titles, authors, descriptions, and every other searchable field in one table with an attribute to distinguish the field's type.
I'm looking for a pattern for performing a dynamic search on multiple tables.
I have no control over the legacy (and poorly designed) database table structure.
Consider a scenario similar to a resume search where a user may want to perform a search against any of the data in the resume and get back a list of resumes that match their search criteria. Any field can be searched at anytime and in combination with one or more other fields.
The actual sql query gets created dynamically depending on which fields are searched. Most solutions I've found involve complicated if blocks, but I can't help but think there must be a more elegant solution since this must be a solved problem by now.
Yeah, so I've started down the path of dynamically building the sql in code. Seems godawful. If I really try to support the requested ability to query any combination of any field in any table this is going to be one MASSIVE set of if statements. shiver
I believe I read that COALESCE only works if your data does not contain NULLs. Is that correct? If so, no go, since I have NULL values all over the place.
As far as I understand (and I'm also someone who has written against a horrible legacy database), there is no such thing as dynamic WHERE clauses. It has NOT been solved.
Personally, I prefer to generate my dynamic searches in code. Makes testing convenient. Note, when you create your sql queries in code, don't concatenate in user input. Use your #variables!
The only alternative is to use the COALESCE operator. Let's say you have the following table:
Users
-----------
Name nvarchar(20)
Nickname nvarchar(10)
and you want to search optionally for name or nickname. The following query will do this:
SELECT Name, Nickname
FROM Users
WHERE
Name = COALESCE(#name, Name) AND
Nickname = COALESCE(#nick, Nickname)
If you don't want to search for something, just pass in a null. For example, passing in "brian" for #name and null for #nick results in the following query being evaluated:
SELECT Name, Nickname
FROM Users
WHERE
Name = 'brian' AND
Nickname = Nickname
The coalesce operator turns the null into an identity evaluation, which is always true and doesn't affect the where clause.
Search and normalization can be at odds with each other. So probably first thing would be to get some kind of "view" that shows all the fields that can be searched as a single row with a single key getting you the resume. then you can throw something like Lucene in front of that to give you a full text index of those rows, the way that works is, you ask it for "x" in this view and it returns to you the key. Its a great solution and come recommended by joel himself on the podcast within the first 2 months IIRC.
What you need is something like SphinxSearch (for MySQL) or Apache Lucene.
As you said in your example lets imagine a Resume that will composed of several fields:
List item
Name,
Adreess,
Education (this could be a table on its own) or
Work experience (this could grow to its own table where each row represents a previous job)
So searching for a word in all those fields with WHERE rapidly becomes a very long query with several JOINS.
Instead you could change your framework of reference and think of the Whole resume as what it is a Single Document and you just want to search said document.
This is where tools like Sphinx Search do. They create a FULL TEXT index of your 'document' and then you can query sphinx and it will give you back where in the Database that record was found.
Really good search results.
Don't worry about this tools not being part of your RDBMS it will save you a lot of headaches to use the appropriate model "Documents" vs the incorrect one "TABLES" for this application.