Lucene Index and Query Design Question - Searching People - indexing

I recently started working with Lucene (specifically, Lucene.Net) and have successfully created several indexes without any problems. Having previously worked with Endeca, I find that Lucene is lightweight, powerful, and has a much gentler learning curve (due mostly to a concise API).
However, I have one specific index/query situation which I am having problems wrapping my head around. What I have is a person directory. People can be searched for in this application, with the goal of returning both exact and approximate matches. Right now, in the index I concatenate the "FirstName" and "LastName" into a single field called "FullName", adding a space between the two. So FirstName:Jon with LastName:Smith yields FullName:Jon Smith. I do anticipate the possibility of middle names and possibly a suffix, but that is not important at the moment.
I would like to do the equivalent of a fuzzy search on the name, so someone searching for "John Smith" would still get back "Jon Smith". I had thought about a multisearch, however, this becomes more involved if his name was actually "Jon Del Carmen" or "Jon Paul Del Carmen". I have nothing in what the user types in to delineate the first name or last name pieces.
The only thought that I have is that I could replace spaces in the concatenated value with a character that would not be discarded. If I did this when I built the document for the index and also when I parsed the query, I could treat it as one larger word, right? Is there another way to do this that would work for both simple names ("Jon Smith") and also more complex names ("Jon Paul Del Carmen")?
Any advice would truly be appreciated. Thanks in advance!
Edit: Additional detail follows.
In Luke, I put in the following query:
FullName:jonn smith~
It is being parsed as:
FullName:jonn CreatedOn:smith~0.5
With an Explanation of:
clauses=2, maxClauses=1024
Clause 0: SHOULD
Term: field='FullName' text='jonn'
Clause 1: SHOULD
FuzzyQuery: boost=1.0000
prefixLen=0, minSimilarity=0.5000 diff=-1.0000
FilteredTermEnum: Exception null
"CreatedOn" is another Field in the index. I tried putting quotes around the term "jonn smith", but it then treats it like a phrasequery, instead. I am sure that the problem is that I am just not doing something right, but being so green at all of this, I am not sure what that something truly is.

My problem was with how I was building the index. What I ended up doing was making sure that the FullName field was not tokenized, and the query started returning the correct results. The Explain results above were due to an ID10T error on my part, and the query is now parsed correctly.
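To illustrate why an untokenized FullName helps, here is a minimal Python sketch (not Lucene code) of edit-distance matching on the whole name treated as a single term. The `similarity` score is a simplified stand-in chosen for illustration and is not claimed to be Lucene's exact FuzzyQuery formula.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution
        prev = cur
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Illustrative score in [0, 1]; an assumption, not Lucene's formula."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


# The whole name is compared as one term, spaces included.
print(edit_distance("john smith", "jon smith"))
print(similarity("john paul del carmen", "jon paul del carmen"))
```

Because the space is just another character of the single term, multi-word names such as "Jon Paul Del Carmen" need no special handling.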


Clean unstructured place name to a structured format

I have around 300k unstructured records like those in the screenshot below. I'm trying to use Google Refine or OpenRefine to clean them up, but I can't find a proper way to do it; I'm new to this tool. Any help would be greatly appreciated. The tool is also quite slow on 300k records: whenever I try something, it takes a long time to process and produce output.
Alternatively, please suggest any other open-source tools or techniques for doing this.
As Owen said in the comments, your question is probably too broad to receive an acceptable answer. We can only provide you with a general procedure to follow.
In Open Refine, you'll need to create a column based on the messy column and apply transformations to delete unwanted characters. You'll have to use regular expressions, but for that you need to be able to identify patterns. It's not clear to me why the "ST" of "Nat.secu ST." is important, but not the "US" in "Massy Intertech US", nor even the "36" in "Plowk 36" (Google doesn't know this word, so I'm not sure it is an organisation name).
On the basis of your fifteen lines, however, we can distinguish some clear patterns. For example, it looks like you'll have to remove the tokens (sequences of characters without spaces) at the end of the string that contain a #. For that, the GREL formula in Open Refine could look like this:
Here is a screencast if it's not clear to you.
But sometimes a company name may contain a #, in which case you will need to create more complex rules. For example, remove the token only if the string contains more than two words.
if(value.split(' ').length() > 2, value.replace(/\b\w+#\w+\b$/, ''), value)
And so on for the other patterns that you'll find (for example, any number sequence at the end that contains more than 4 numbers and one - between them)
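If you want to prototype such a rule outside Open Refine, the conditional GREL formula above can be approximated in Python. This is a sketch under assumptions: the function name and the sample strings are made up for illustration, and unlike the GREL version it also trims the trailing space left behind.

```python
import re

# Same pattern as in the GREL formula: a token containing '#' at the end.
TRAILING_HASH_TOKEN = re.compile(r"\b\w+#\w+\b$")


def strip_trailing_hash_token(value: str) -> str:
    """Drop a trailing token containing '#', but only when the string has
    more than two words, so short names that contain a '#' are kept."""
    if len(value.split(" ")) > 2:
        return TRAILING_HASH_TOKEN.sub("", value).rstrip()
    return value


# Hypothetical sample values, not taken from the asker's data.
print(strip_trailing_hash_token("Massy Intertech US AB#123"))
print(strip_trailing_hash_token("Plowk AB#12"))
```

Running the rules over a small sample in a script first can be quicker than waiting for the tool to reprocess all 300k rows on each attempt.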
Feel free to check out the Open Refine documentation in case of doubt.

Lucene, certain keywords in queries (e.g. "TO" in range queries) are case sensitive

In Lucene, searches look case-insensitive to the user by default due to the standard analyzer. That is what users expect, and that works fine.
However, a few words, like "TO" in range queries or "AND"/"OR", are case sensitive. That's not what users expect.
Is there a reason for this? Lucene basically "just works" by default, so I am a little surprised by that. Maybe there's a good reason behind it and I shouldn't touch it.
How would I go about making those keywords case insensitive? As the rest of the query is case insensitive by default, I could just convert the entire query to uppercase? Are there any problems I'm going to encounter if I do that? Is there a better way?
Is there a reason for this?
The real question here might not be "why does Lucene do this?", but rather "why does Google do this?", as I believe Google's use of this pattern predates Lucene's. Regardless, the reasoning isn't too hard to deduce. There needs to be a way of differentiating the word "and" from the query operator "AND".
Say my query is: Jack and Jill went up the hill
I'm just searching a phrase that happens to contain the word "and". The end result I want is (eliminating stop words, and such):
field:jack field:jill field:went field:up field:hill
Rather than:
+field:jack +field:jill field:went field:up field:hill
If the word is uppercased, it's a decent indicator the user intended the word as an operator.
If every "and" became an operator, users might be confused about why a search for "bread and butter pickles" (which becomes +bread +butter pickles) turns up hits about toast, but not about other types of pickles.
The same goes for lists of things, like "Abby, Ben, Chris, Dave and Elmer" (which becomes abby ben chris +dave +elmer): every hit would be required to contain Dave and Elmer, while the rest of the names would be optional.
How to make them case insensitive?
Uppercasing the whole thing, or every instance of AND, OR or TO, could be a bit problematic. Take these, for example:
[to TO tz] works, [TO TO TZ] throws an exception
and another thing works, AND ANOTHER THING throws an exception
You could check for a ParseException after uppercasing, and try parsing the original query in that case. Might create a bit of an inconsistency, but it beats just failing entirely.
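The fallback described above could be sketched roughly as follows. This is Python for illustration only, not Lucene code: `parse` and `parse_error` are hypothetical parameters standing in for your query parser and the exception it raises.

```python
import re

# Only the keywords discussed above; extend the pattern as needed.
_OPERATORS = re.compile(r"\b(?:and|or|to)\b", re.IGNORECASE)


def parse_lenient(query, parse, parse_error=Exception):
    """Uppercase operator-like words, then parse; if that fails, parse the
    query exactly as the user typed it.

    `parse` is a hypothetical callable wrapping the real query parser and
    `parse_error` is the exception type that parser raises.
    """
    promoted = _OPERATORS.sub(lambda m: m.group(0).upper(), query)
    try:
        return parse(promoted)
    except parse_error:
        return parse(query)
```

Note that this still turns "Jack and Jill" into an AND query, so it trades away the distinction described earlier in exchange for case-insensitive operators.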

Testing phrases to see if they match each other

I have a large number of phrases (~ several million), each less than six or seven words and the large majority less than five, and I would like to see if they "phrase match" each other. This is a search engine marketing term - essentially, A phrase matches B if A is contained in B. Right now, they are stored in a db (postgres), and I am performing a join on regexes (see this question). It is running impossibly slowly even after trying all basic optimization tricks (indexing, etc) and trying the suggestions provided.
Is there an easier way to do this? I am not averse to a non-DB solution. Is there any reason to think that regexes are overkill and are taking way longer than a different solution?
An ideal algorithm for doing substring matching is Aho-Corasick.
Although you will have to read the data out of the database to use it, it is tremendously fast, when compared to more naive methods.
See here for a related question on substring matching:
And here for an Aho-Corasick implementation in Java:
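For a feel of how the algorithm works, here is a compact Python sketch of Aho-Corasick (a character-level version written for illustration; it is not the Java implementation referred to above). It builds one automaton from all phrases and then reports every phrase occurring inside a given text in a single pass.

```python
from collections import deque


def build_automaton(patterns):
    """Build the trie (goto), failure links (fail) and outputs (out)."""
    goto, fail, out = [{}], [0], [[]]
    for pat in patterns:
        state = 0
        for ch in pat:
            nxt = goto[state].get(ch)
            if nxt is None:
                nxt = len(goto)
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = nxt
            state = nxt
        out[state].append(pat)

    # Breadth-first pass: a state's failure link points to the longest
    # proper suffix of its path that is also a path in the trie.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(text, automaton):
    """Return (start_index, pattern) for every pattern found in text."""
    goto, fail, out = automaton
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            hits.append((i - len(pat) + 1, pat))
    return hits


automaton = build_automaton(["red shoes", "shoes", "hoe"])
print(find_all("cheap red shoes", automaton))
```

Since this matches raw characters, "hoe" is found inside "shoes". If a phrase match must respect word boundaries, one option is to pad both the phrases and the text with a leading and trailing space before matching.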
It would be great to get a little more context as to why you need to see which phrases are subsets of others: for example, it seems strange that the DB would be built in such a way anyway: you're having to do the work now because the DB is not in an appropriate format, so it makes sense that you should 'fix' the DB or the way in which it is built, instead.
It depends massively on what you are doing with the data and why, but I have found it useful in the past to break things down into single words and pairs of words, then link resources or phrases to those singles/pairs.
For example to implement a search I have done:
Source text: Testing phrases to see
testing phrases
phrases to
to see
To see if another phrase was similar (granted, not contained within) you would break down the other phrase in the same way and count the number of phrases common between them.
It has the nice side effect of still matching if you were to use (for example) "see phrases to testing", because the individual words would match. Because the order is different, though, most of the pairs wouldn't, so consecutive words (phrases) are taken into account at the same time: the number of matches wouldn't be as high, which makes it good for use as a 'score' in matching.
As I say that -kind- of thing has worked for me, but it would be great to hear some more background/context, so we can see if we can find a better solution.
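The single-word and word-pair breakdown described above can be sketched in a few lines of Python. The function names and the simple additive score are assumptions made for illustration; any weighting of words versus pairs would work the same way.

```python
def words(phrase):
    """Lowercased set of individual words."""
    return set(phrase.lower().split())


def word_pairs(phrase):
    """Set of consecutive word pairs, e.g. ('testing', 'phrases')."""
    tokens = phrase.lower().split()
    return set(zip(tokens, tokens[1:]))


def overlap_score(a, b):
    """Shared single words plus shared consecutive pairs."""
    return len(words(a) & words(b)) + len(word_pairs(a) & word_pairs(b))


source = "Testing phrases to see"
print(overlap_score(source, source))                    # 4 words + 3 pairs
print(overlap_score(source, "see phrases to testing"))  # 4 words + 1 pair
```

A reordered phrase still scores on the shared words but loses most of the pair matches, which is what makes the total usable as a rough similarity score.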
When you have the 'cleaned column' from MaasSQL's previous answer, you could, depending on the way "phrase match" works exactly (I don't know), sort this column based on the length of the containing string.
Then make sure you run the comparison query in a converging manner in a procedure instead of a flat query, by stepping through your table (with a cursor) and eliminating candidates for comparison through WHERE statements and through deleting candidates that have already been tested (completely). You may need a temporary table to do this.
What did I mean by the 'WHERE' statements above? Well, if the comparison value is in a column sorted on length, you'll never have to test whether a longer string matches inside a shorter string.
And as for deleting candidates: starting with the shortest strings, once you've tested all strings of a certain length, you can remove them from the comparison table, as no later test will ever match against them.
Of course, this requires a bit more programming than just one SQL statement. And is dependent on the way "phrase match" works exactly.
DTS or SSIS may be your friend here as well.
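The length-based pruning can be sketched outside the database as well. The following Python is only an illustration of the idea, assuming that "phrase match" means whole-word containment; it is still quadratic in the worst case, so it shows the pruning rather than a solution that scales to millions of rows.

```python
def contained_pairs(phrases):
    """Return (short, long) pairs where `short` occurs in `long` on word
    boundaries. Phrases are sorted by length so that a phrase is only
    ever tested against strictly longer ones."""
    ordered = sorted(set(phrases), key=len)
    matches = []
    for i, short in enumerate(ordered):
        needle = f" {short} "
        for candidate in ordered[i + 1:]:
            # Distinct phrases of equal length cannot contain each other.
            if len(candidate) > len(short) and needle in f" {candidate} ":
                matches.append((short, candidate))
    return matches


print(contained_pairs(["red shoes", "cheap red shoes", "shoes", "hat"]))
```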

First Name Variations in a Database

I am trying to determine what the best way is to find variations of a first name in a database. For example, I search for Bill Smith. I would like it return "Bill Smith", obviously, but I would also like it to return "William Smith", or "Billy Smith", or even "Willy Smith". My initial thought was to build a first name hierarchy, but I do not know where I could obtain such data, if it even exists.
Since users can search the directory, I thought this would be a key feature. For example, people I went to school with called me Joe, but I always go by Joseph now. So, I was looking at doing a phonetic search on the last name, either with NYSIIS or Double Metaphone, and then searching on the first name using this name hierarchy. Is there a better way to do this, maybe some sort of graded relevance using a full-text search on the full name instead of a two-part search on the first and last name? Part of me thinks that if I stored a name as a single value instead of multiple values, it might facilitate more search options at the expense of being able to address a user by the first name.
As far as platform, I am using SQL Server 2005 - however, I don't have a problem shifting some of the matching into the code; for example, pre-seeding the phonetic keys for a user, since they wouldn't change.
Any thoughts or guidance would be appreciated. Countless searches have pretty much turned up empty. Thanks!
Edit: It seems that there are two very distinct camps on the functionality and I am definitely sitting in the middle right now. I could see the argument of a full-text search - most likely done with a lack of data normalization, and a multi-part approach that uses different criteria for different parts of the name.
The problem ultimately comes down to user intent. The Bill / William example is a good one, because it shows the mutation of a first name based upon the formality of the usage. I think that building a name hierarchy is the more accurate (and extensible) solution, but is going to be far more complex. The fuzzy search approach is easier to implement at the expense of accuracy. Is this a fair comparison?
Resolution: After doing some tests, I have decided to go with an approach where the initial registration will take a full name and I will split it out into multiple fields (forename, surname, middle, suffix, etc.). Since I am sure that it won't be perfect, I will allow the user to edit the "parts", including adding a maiden or alternate name. As far as searching goes, with either solution I am going to need to maintain what variations exist, either in a database table or as a thesaurus. Neither has an advantage over the other in this case. I think it is going to come down to performance, and I will have to actually run some benchmarks to determine which is best. Thank you, everyone, for your input!
In my opinion you should either do a feature right and make it complete, or you should leave it off to avoid building a half-assed intelligence into a computer program that still gets it wrong most of the time ("Looks like you're writing a letter", anyone?).
In case of human names, a computer will get it wrong most of the time, doing it right and complete is impossible, IMHO. Maybe you can hack something that does the most common English names. But actually, the intelligence to look for both "Bill" and "William" is built into almost any English speaking person - I would leave it to them to connect the dots.
The term you are looking for is Hypocorism:
And Wikipedia lists many of them. You could bang out some Python or Perl to scrape that page and put it in a db.
I would go with a structure like this:
create table given_names (
id int primary key,
name text not null unique
);
create table hypocorisms (
id int references given_names(id),
name text not null,
primary key (id, name)
);
insert into given_names values (1, 'William');
insert into hypocorisms values (1, 'Bill');
insert into hypocorisms values (1, 'Billy');
Then you could write a function/sproc to normalize a name:
normalize_given_name('Bill'); --returns William
One issue you will face is that different names can have the same hypocorism (Albert -> Al, Alan -> Al)
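A self-contained way to try this out is Python's built-in sqlite3 module. This is only a sketch: the Albert and Alan rows are added to show the ambiguity just mentioned, and the `normalize_given_name` helper here is an illustrative assumption, written to return every candidate instead of a single name.

```python
import sqlite3


def build_db():
    """In-memory database using the two tables from the answer."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        create table given_names (
            id integer primary key,
            name text not null unique
        );
        create table hypocorisms (
            id integer references given_names(id),
            name text not null,
            primary key (id, name)
        );
        insert into given_names values (1, 'William'), (2, 'Albert'), (3, 'Alan');
        insert into hypocorisms values (1, 'Bill'), (1, 'Billy'), (2, 'Al'), (3, 'Al');
    """)
    return conn


def normalize_given_name(conn, name):
    """Return all given names the nickname may stand for, or the input
    itself when no hypocorism is known."""
    rows = conn.execute(
        """
        select g.name
        from given_names g
        join hypocorisms h on h.id = g.id
        where lower(h.name) = lower(?)
        order by g.name
        """,
        (name,),
    ).fetchall()
    return [row[0] for row in rows] or [name]


conn = build_db()
print(normalize_given_name(conn, "Bill"))
print(normalize_given_name(conn, "Al"))
```

Returning a list leaves it to the search layer to decide whether to query all candidates or to rank them.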
I think your basic approach is solid. I don't think fulltext is going to help you. For seeding, seems to have large amount of the data you want.
Are you using SQL Server 2005 Express with Advanced Services? It sounds to me like you would benefit from full-text indexing, and more specifically from CONTAINS and CONTAINSTABLE. Here is a link describing the uses of CONTAINSTABLE:
and here is the download link for SQL Server 2005 With Advanced Services:
Hope this helps,
You can use the SQL Server Full Text Search and do a thesaurus (or inflectional) search.
Basically like:
SELECT ProductId, ProductName
FROM ProductModel
WHERE CONTAINS(CatalogDescription, ' FORMSOF(THESAURUS, metal) ')
Check out:
Not sure what your application is, but if your users know at the time of sign up that people from their past might be searching the database for them, you could offer them the chance in the user profile to define other names they might be known as (including last names, women change these all the time and makes finding them much harder!) and that they want people to be able to search on. Store these in a separate related table. Then search on that. Just make the structure such that you can define one name as the main name (the one you use for everything except the search.)
You'll find that you're dabbling in an area known as "Natural Language Processing" and you'll need to do several things, most of which can be found under the topic of stemming.
Simplistic stemming simply breaks the word apart, but more advanced algorithms associate words that mean the same thing - for instance Google might use stemming to convert "cat" and "kitten" to "feline" and search for all three, weighing the actual word provided by the user as slightly heavier so exact matches return before stemmed matches.
It's a known problem, and there are open source stemmers available.
No, Full Text searches will not help to solve your problem.
I think you might want to take a look at some of the following links: (Funny, no one mentioned SoundEx till now)
SoundEx - MSDN
SoundEx - Google results
InformIT - Tolerant Search algorithms
Basically SoundEx allows you to evaluate the level of similarity in similar sounding words. The function is also available on SQL 2005.
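To see what a Soundex key looks like, here is a small Python sketch of the classic American Soundex rules. It is written for illustration only; the built-in SQL Server function may differ in edge cases, so treat the two as separate implementations.

```python
# Consonant groups that share a digit in classic Soundex.
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name: str) -> str:
    """Four-character Soundex key: first letter plus three digits."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate repeated codes
        code = _CODES.get(ch, "")
        if code and code != prev:
            key += code
        prev = code  # vowels reset prev, so a repeated code counts again
    return (key + "000")[:4]


print(soundex("Smith"), soundex("Smyth"))
print(soundex("Robert"), soundex("Rupert"))
```

Names that produce the same key are treated as sounding alike, so the key can be computed once per user and stored for lookup.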
As a side issue, instead of returning similar results, it might prove more intuitive to the user to use an AJAX-based script to deliver similar-sounding names before the user initiates his/her search. That way you can show the user "similar names" or "did you mean..." kind of data.
Here's an idea for automatically finding "name synonyms" like Bill/William. That problem has been studied in the broader context of synonyms in general: inducing them from statistics of which words commonly appear in the same contexts in a large text corpus like the Web. You could try combining that approach with a list of names like Moby Names; I don't know if it's been done before.
Here are some pointers.

Prevent "Too Many Clauses" on lucene query

In my tests I suddenly bumped into a Too Many Clauses exception when trying to get the hits from a boolean query that consisted of a termquery and a wildcard query.
I searched around the net, and the resources I found suggest increasing the limit with BooleanQuery.SetMaxClauseCount().
This sounds fishy to me.. To what should I up it? How can I rely that this new magic number will be sufficient for my query? How far can I increment this number before all hell breaks loose?
In general I feel this is not a solution. There must be a deeper problem..
The query was +{+companyName:mercedes +paintCode:a*} and the index has ~2.5M documents.
The paintCode:a* part of the query is a prefix query for any paintCode beginning with an "a". Is that what you're aiming for?
Lucene expands prefix queries into a boolean query containing all the possible terms that match the prefix. In your case, apparently there are more than 1024 possible paintCodes that begin with an "a".
If it sounds to you like prefix queries are useless, you're not far from the truth.
I would suggest you change your indexing scheme to avoid using a Prefix Query. I'm not sure what you're trying to accomplish with your example, but if you want to search for paint codes by first letter, make a paintCodeFirstLetter field and search by that field.
If you're desperate, and are willing to accept partial results, you can build your own Lucene version from source. You need to make changes to two files, both under org/apache/lucene/search. In the rewrite method of both classes, change the line
query.add(tq, BooleanClause.Occur.SHOULD); // add to query
to
try {
query.add(tq, BooleanClause.Occur.SHOULD); // add to query
} catch (TooManyClauses e) {
break; // stop adding terms; the partial result is kept
}
I did this for my own project and it works.
If you really don't like the idea of changing Lucene, you could write your own PrefixQuery variant and your own QueryParser, but I don't think it's much better.
It seems like you are using this on a field that is sort of a Keyword type (meaning there will not be multiple tokens in your data source field).
There is a suggestion here that seems pretty elegant to me:
The basic idea is to break down your term into multiple fields with increasing length until you are pretty sure you will not hit the clause limit.
Imagine a paintCode like this:
When indexing this value, you create the following field values in your document:
[paintCode]: "a4c2d3"
[paintCode1n]: "a"
[paintCode2n]: "a4"
[paintCode3n]: "a4c"
At query time, the number of characters in your term decides which field to search on. This means that you will perform a prefix query only for terms with more than 3 characters, which greatly decreases the internal result count and prevents the infamous TooManyClauses exception. Apparently this also speeds up the searching process.
You can easily automate a process that breaks down the terms automatically and fills the documents with values according to a name scheme during indexing.
Some issues may arise if you have multiple tokens for each field. You can find more details in the article
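The naming scheme above is easy to automate. The following Python sketch (not Lucene code) generates the extra field values at indexing time and picks the field to search at query time; the helper names and the tuple returned by `choose_query` are assumptions made for illustration.

```python
def prefix_fields(field, value, max_len=3):
    """Field values to index for one keyword, e.g. paintCode1n..paintCode3n."""
    fields = {field: value}
    for n in range(1, min(max_len, len(value)) + 1):
        fields[f"{field}{n}n"] = value[:n]
    return fields


def choose_query(field, term, max_len=3):
    """Short terms become an exact term query on the matching prefix field;
    only longer terms fall back to a real prefix query on the full field."""
    if len(term) <= max_len:
        return ("term", f"{field}{len(term)}n", term)
    return ("prefix", field, term)


print(prefix_fields("paintCode", "a4c2d3"))
print(choose_query("paintCode", "a"))
print(choose_query("paintCode", "a4c2"))
```

With this split, a search for "a" is a single-term lookup on paintCode1n, so it never expands into one clause per matching paint code.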