Access Query - Compare Multiple User Selections Against Each Other - sql

I'm running into a conceptual problem that I cannot seem to conquer in my mind.
Let's say I want a user to enter what they're currently wearing into a database via a form. Throwing 'T-Shirt' and 'Blue' into a new row is incredibly easy. However, let's say I want to compare one user's entries against other users', and rank them in order from most similar to least.
This becomes a huge nightmare when you consider the amount of options available.
Undershirt
Overshirt
Jacket
Scarf/Necklaces
Headwear
Pants
Underwear
Leggings
Socks
Footwear
Accessories
As I see it, I could hard-code in the 11 categories above and let a user make selections from drop-down boxes tailored to each category. Now, let's use an example of 'Undershirt' and 'Overshirt'. Depending on the person, a long-sleeved shirt could be used as either; they're still wearing one. If I make users put values into categories, User A might put it in one and User B might put it in another, and the items would never get compared because they sit in separate categories.
Now, instead of hard-coding in categories (and thus limiting how much a user can enter), I could put each item into its own row and search by User ID. But let's say a person enters shorts one day, and the next day jeans and a shirt. How can I make sure those outfits are compared separately (e.g., dress compared to shorts, dress compared to jeans+shirt) and not all together (dress compared to shorts+jeans+shirt)?
As to the actual comparison, each item vs. each other item could be scored via a 2D lookup table (row Dress vs. column Jeans would net a zero; row Dress vs. column Dress would net a one).
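For illustration, a minimal Python sketch of such a lookup table; the item names, the scores, and the greedy pairing of whole outfits are all hypothetical choices, not something prescribed by the question:

# Sparse 2D lookup of pairwise item scores; unlisted pairs default to 1.0/0.0.
SIMILARITY = {
    ("dress", "jeans"): 0.0,
    ("jeans", "trousers"): 0.8,
}

def item_similarity(a, b):
    """Score for an unordered pair of items."""
    if a == b:
        return 1.0
    return SIMILARITY.get((a, b), SIMILARITY.get((b, a), 0.0))

def outfit_similarity(outfit_a, outfit_b):
    """Greedily pair each item with its best remaining match and average."""
    if not outfit_a or not outfit_b:
        return 0.0
    remaining = list(outfit_b)
    total = 0.0
    for item in outfit_a:
        if not remaining:
            break
        best = max(remaining, key=lambda other: item_similarity(item, other))
        total += item_similarity(item, best)
        remaining.remove(best)
    return total / max(len(outfit_a), len(outfit_b))

print(outfit_similarity(["dress"], ["jeans", "t-shirt"]))  # 0.0

Because items live in their own rows rather than in fixed category slots, a long-sleeved shirt scores the same whether one user filed it as an undershirt and another as an overshirt.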

The appropriate design for this would depend on the acceptable margin of error. If there is zero acceptable error, then you must present the users with the categories and have them answer yes/no for each one, or select from a limited set of possible answers.
HANDS:
gloves
mittens
brass knuckles
[Caveat: a user could be wearing brass knuckles inside the mittens. You have to take into account whether values are mutually exclusive or not. Barefoot <> no socks: someone who is barefoot is not wearing socks, but someone not wearing socks may be wearing docksiders.]
FEET1:
anklet socks
sheer stockings
fishnet stockings
ragg wool hiking socks
kneesocks
gym socks
no socks
FEET2:
moccasins
running shoes
sandals
wing-tips
uggs
spike heels
...
HEAD:
sombrero
beret
baseball hat
pirate's hat
beanie
knitted cap
NECK:
scarf
mock turtleneck aka dickie
Et cetera et cetera ad nauseam.
Or, if the margin of error is very generous, you could allow simple freeform text entry and match/partial-match on words. For slightly less error, you could set up a synonyms table and match on the synonyms of the supplied words.
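A minimal Python sketch of that synonyms-table idea (the mappings are invented for the example):

# Sketch: map freeform words to canonical terms before matching (mappings hypothetical).
SYNONYMS = {
    "tee": "t-shirt", "tshirt": "t-shirt", "t-shirt": "t-shirt",
    "sneakers": "running shoes", "trainers": "running shoes",
}

def canonical(word):
    word = word.lower().strip(".,")
    return SYNONYMS.get(word, word)

def words_match(entry_a, entry_b):
    """True if the two freeform entries share any canonical word."""
    a = {canonical(w) for w in entry_a.split()}
    b = {canonical(w) for w in entry_b.split()}
    return bool(a & b)

print(words_match("blue tee", "T-shirt, red"))  # True: both map to 't-shirt'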

As a general rule, get the database design right and worry about reporting later. If this is not just a thought exercise, you may like to say what you are actually comparing, because with the above, a person is quite likely to say "tuxedo" or "evening dress", and let the details be inferred, whereas in some other area, this may not be possible. Even so, it seems that you would need a minimum of three columns (fields) for each item:
Timestamp
Major category (jeans, trousers, skirt)
Item (Levi's, tweeds, mini)
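For instance, a minimal Python stand-in for such one-row-per-item records (the field names are illustrative, not an actual Access schema):

from dataclasses import dataclass
from datetime import datetime

@dataclass
class WornItem:
    timestamp: datetime  # when the outfit was recorded
    category: str        # major category: jeans, trousers, skirt, ...
    item: str            # specific item: Levi's, tweeds, mini, ...

rows = [
    WornItem(datetime(2015, 1, 1, 8, 0), "jeans", "Levi's"),
    WornItem(datetime(2015, 1, 1, 8, 0), "overshirt", "flannel"),
]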
If accuracy is particularly important, you will need a trained interviewer :)
I have just noticed underwear in that list, which is even more complicated, because what would qualify as full underwear for a lady of a certain age is by no means the same as that for a gentleman of ten years.


Need to format table from BigQuery for specific words in the body of text

I'm working with Google BigQuery to scrape the reddit comments database. I'll start with the query I'm working on:
SELECT
  DATE(SEC_TO_TIMESTAMP(created_utc)) AS date,
  subreddit,
  author AS comment_author,
  ups AS upvotes,
  LOWER(body)
FROM
  [fh-bigquery:reddit_comments.2015_01]
WHERE
  body CONTAINS 'acid'
  OR body CONTAINS 'ecstasy'
  OR body CONTAINS 'fire'
  OR body CONTAINS 'heroin'
LIMIT 10;
I need to scrape the reddit database for a list of about 30 drug-related words (the query above is limited to a few for brevity).
I'm having trouble with two things:
I want to be able to correctly query the DB, but a lot of the results that are returned do not meet the criteria, i.e. they do not contain any of the matching words.
I want to be able to create a column which displays the specific word that was matched... so if it matched the word 'drug', that word would appear in a 'word_matched' column, along with the body, author, date, etc.
I've tried regular expressions as well for matching the words, but that doesn't seem to be helping either:
WHERE (REGEXP_MATCH(body,'drug|acid|ecstacy|fire|heroin|joint|marijuana|weed|bud|ganja|hash|blazing|blaze|meth|molly|pcp|shrooms|speed|uppers|valium|xanax|tripping|smoke|liquor|beer|alcohol|booze|acid|benzos|blow|cocaine|crack|crank|dank|dope|downers'))
Any and all help will be greatly appreciated. Thanks all!
Below I address both points of the question.
1. Have in the output only matching words, and not words that are merely part of another/different word. This is easy to accomplish using the REGEXP_MATCH function.
2. Have a column which consists of all matching words. (I think it makes more sense to have all matching words rather than just one, as asked in the question.)
SELECT
  [date],
  subreddit,
  comment_author,
  upvotes,
  GROUP_CONCAT(word) AS matches,
  body
FROM (
  SELECT
    [date],
    subreddit,
    comment_author,
    upvotes,
    body,
    word
  FROM (
    SELECT
      DATE(SEC_TO_TIMESTAMP(created_utc)) AS [date],
      subreddit,
      author AS comment_author,
      ups AS upvotes,
      LOWER(body) AS body
    FROM
      [fh-bigquery:reddit_comments.2015_01]
    WHERE REGEXP_MATCH(body, r'\b(drug|ecstacy|fire|heroin|joint|marijuana|weed|bud|ganja|hash|blazing|blaze|meth|molly|pcp|shrooms|speed|uppers|valium|xanax|tripping|smoke|liquor|beer|alcohol|booze|acid|benzos|blow|cocaine|crack|crank|dank|dope|downers)\b')
  ) x
  CROSS JOIN (
    SELECT SPLIT(list, '|') AS word
    FROM (SELECT 'drug|ecstacy|fire|heroin|joint|marijuana|weed|bud|ganja|hash|blazing|blaze|meth|molly|pcp|shrooms|speed|uppers|valium|xanax|tripping|smoke|liquor|beer|alcohol|booze|acid|benzos|blow|cocaine|crack|crank|dank|dope|downers' AS list)
  ) y
  HAVING body CONTAINS word
)
GROUP BY [date], subreddit, comment_author, upvotes, body
LIMIT 1000
The above solution provides the list of matching words on a best-effort basis, so please note:
If the matches column consists of one word, it is for sure an exact match.
But if this column consists of several words, at least one of them is an exact match; the others may not be exact matches.
I think for a lengthy body it is still valuable to have those, at least as a hint of what to look for. For example:
drug,meth,heroin,alcohol,benzos it also inhibits the reuptake of serotonin and norepinephrine which gives a hell of a lot worse withdrawal symptoms than most other drugs(incl. heroin, meth, coke and etc.). from what i have heard the only things that rival tramadol it terms of withdrawal are benzos and alcohol.
liquor,beer,alcohol,booze 1. reinforce #3 - it is not cheap to live here. not by any stretch. expect to pay more than the rest of the country pays for everything. even franchises that operate nation-wide have special wa/perth pricing. 2. petrol has literally just dropped to $1 this past month, i wouldn't go as far as quoting that as our average price just yet. average is still between $1.20-1.30. 3. parking is free at beaches & parks, do not expect to get free parking anywhere in the city though. if you're using public parking in the city all day, expect to pay $50 unless you get in early. 4. forget bribing the cops, don't even call them "mate". last time i was pulled over (last week, random stop) i said "evening mate" as i was handing him my license and was responded with "don't call me mate, i'm not your friend, i don't know you". 5. unlike the rest of the world, regular stores do not sell alcohol here. liquor stores only, don't expect to buy beer from a gas station or grocery store. 6. rent is expensive, food is expensive, booze is expensive, being alive is expensive.
drug,meth,heroin,beer that's simply not true. first there's a difference between legalization and decriminalization. second, some european countries have places to go to safely use drugs. there is middle ground between allowing heroin to be sold all over town and having users go to prison. heroin, meth and some other drugs are not good things for society and their use should encouraged by making it as easy to buy as a 6 pack of beer. i'm not really sure why you can't see a middle ground because it's clearly not as black and white as you say. you can go after the dealers while leaving the users alone.
drug,fire,joint,smoke not a story about a rave, but still relevant i think: i was working a job called "fire watch," which is just what it sounds like, at a nine inch nails concert a few years ago. our comrades, the security workers, were far from seasoned professionals. they were mostly college temps with a yellow security tee shirt and a flashlight; they didn't even have radios. the job is basically to make sure people don't go into restricted areas. ...but this one boy scout took it upon himself to tame the metal masses. mid-concert, he pulled me close and shouted "they're smoking pot!" i shrugged, and shot him an "and?" look. i guess he thought i should care because technically a joint is a tiny dangerous drug fire, and i was on the fire crew. he then proceeded to disappear into the crowd, shoving people out of the way on his heroic journey toward the countless smoke puff origins. the next time i saw him he was bleeding out of his face and getting a flashlight in the eyes from an onsite emt. i guess it's pretty harsh to say that he deserved the beating, but it's hard to argue that he didn't go asking for it. i guess the moral of my story is that security people are just people, and some people's shittyness is inflamed when combined with authority. it sounds like your event just happened to be warded by a gaggle of douches, probably being captained by king fuckwad who really wanted to be a cop, but couldn't pass the exams.
Note: if you need a list of only exact matches, it is still relatively easy to do with BigQuery user-defined functions.
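If post-processing outside BigQuery is an option, a small Python sketch shows the same exact-word filtering with the standard library's re module; the \b anchors enforce whole-word matches (the word list here is a hypothetical subset of the full one):

import re

# Hypothetical subset of the full word list.
WORDS = ["drug", "heroin", "meth", "beer", "alcohol"]
PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, WORDS)) + r")\b")

def exact_matches(body):
    """Return the distinct whole words from WORDS that occur in body."""
    return sorted(set(PATTERN.findall(body.lower())))

print(exact_matches("Something about methadone and beer"))  # ['beer'], not 'meth'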
I suggest debugging this using REGEXP_EXTRACT. I tried running your query, and it kept finding things like "meth" in "something", which might be what you're seeing. You probably want to check for word boundaries around the match, since some of the words you are searching for are contained in several normal, non-drug-related words.
Something like the following should help in debugging:
SELECT
  DATE(SEC_TO_TIMESTAMP(created_utc)) AS date,
  subreddit,
  author AS comment_author,
  ups AS upvotes,
  REGEXP_EXTRACT(body, '(drug|acid|ecstacy|fire|heroin|joint|marijuana|weed|bud|ganja|hash|blazing|blaze|meth|molly|pcp|shrooms|speed|uppers|valium|xanax|tripping|smoke|liquor|beer|alcohol|booze|acid|benzos|blow|cocaine|crack|crank|dank|dope|downers)') AS match,
  LOWER(body),
FROM
  [fh-bigquery:reddit_comments.2015_01]
WHERE (REGEXP_MATCH(body,'drug|acid|ecstacy|fire|heroin|joint|marijuana|weed|bud|ganja|hash|blazing|blaze|meth|molly|pcp|shrooms|speed|uppers|valium|xanax|tripping|smoke|liquor|beer|alcohol|booze|acid|benzos|blow|cocaine|crack|crank|dank|dope|downers'))
LIMIT 10;

TSQL grouping on fuzzy column

I would like to group all the merchant transactions from a single table and just get a count. The problem is that the merchant, let's say redbox, will have 'redbox' plus a store number added to the end (redbox 4562, redbox*1234). I will also include the category for grouping purposes.
Category      Merchant
restaurant    bruger king 123 main st
restaurant    burger king 456 abc ave
restaurant    mc donalds * 45877d2d
restaurant    mc 'donalds *888544d
restaurant    subway 454545
travel        subway MTA
gas station   mc donalds gas
travel        nyc taxi
travel        nyc-taxi
The question: how can I group the merchants when they have addresses or store locations added on to them? All I need is a count for each merchant.
The short answer is there is no way to accurately do this, especially with just pure SQL.
You can find exact matches, and you can find wildcard matches using the LIKE operator or a (potentially huge) series of regular expressions, but you cannot find similar matches nor can you find potential misspellings of matches.
There are a few potential approaches I can think of to solve this problem, depending on what type of application you're building.
First, normalize the merchant data in your database. I'd recommend against storing the exact, unprocessed string such as Bruger King in your database. If you come across a merchant that doesn't match a known set of merchants, ask the user if it already matches something in your database. When data goes in, process it then and match it to an existing known merchant.
Store a similarity coefficient. You might have some luck using something like a Jaccard index to judge how similar two strings are. Perhaps after stripping out the numbers, this could work fairly well. At the very least, it could allow you to create a user interface that attempts to guess which merchant it is. Also, some database engines have full-text indexing operators that can describe things like "similar to" or "sounds like". Those could be worth investigating; a rough sketch of the Jaccard idea follows below.
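A minimal Python sketch, assuming word tokens with digits stripped (the tokenization and the sample strings are illustrative, not a prescribed method):

import re

def tokens(merchant):
    """Lower-case word tokens; digits and punctuation are stripped out."""
    return set(re.findall(r"[a-z]+", merchant.lower()))

def jaccard(a, b):
    """Jaccard index of the two token sets (0.0 to 1.0)."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

print(jaccard("burger king 456 abc ave", "bruger king 123 main st"))  # ~0.14: only 'king' is shared

Note that token-level Jaccard cannot bridge misspellings like bruger/burger; character n-grams or an edit-distance measure would handle those better.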
Remember merchant matches per user. If a user corrects bruger king 123 main st to Burger King, store that relation and remember it in the future without having to prompt the user. This data could also be used to help other users correct their data.
But what if there is no UI? Perhaps you're trying to do some automated data processing. I really see no way to handle this without some sort of human intervention, though some of the techniques described above could help automate this process. I'd also look at the source of your data. Perhaps there's a distinct merchant ID you can use as a key, or perhaps there exists somewhere a list of all known merchants (maybe credit card companies provide this API?) If there's boat loads of data to process, another option would be to partially automate it using a service such as Amazon's Mechanical Turk.
You can use LIKE
SELECT COUNT(*) AS [COUNT], 'BURGER KING' AS Merchant
FROM <tables>
WHERE Merchant LIKE '%king%'
UNION ALL
SELECT COUNT(*), 'JACK IN THE BOX'
FROM <tables>
WHERE Merchant LIKE 'jack in the box%'
You may have to move the wildcards around depending on how the records were spelled out.
It depends a bit on which database you use, but most have some kind of REGEXP_INSTR or similar function you can use to find the first index of a pattern. You can then write something like this:
SELECT SUBSTR(merchant, 1, REGEXP_INSTR(merchant, '[0-9]') - 1), COUNT(*)
FROM Expenses
GROUP BY SUBSTR(merchant, 1, REGEXP_INSTR(merchant, '[0-9]') - 1)
(The - 1 keeps the first digit itself out of the group key.) This assumes that the merchant name doesn't contain a number and the store number does. You may still need to strip out any special characters with a REPLACE (like *, -, etc.).
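Outside of SQL, the same cut-at-the-first-digit idea can be sketched in Python (the regexes and sample rows are illustrative only):

import re
from collections import Counter

def merchant_key(merchant):
    """Keep the part before the first digit, drop punctuation, squeeze spaces."""
    head = re.split(r"\d", merchant, maxsplit=1)[0]
    head = re.sub(r"[^a-z ]", " ", head.lower())
    return " ".join(head.split())

rows = ["burger king 456 abc ave", "bruger king 123 main st",
        "mc donalds * 45877d2d", "mc 'donalds *888544d", "nyc taxi", "nyc-taxi"]
print(Counter(merchant_key(r) for r in rows))

Note that misspellings still split groups ('burger king' vs 'bruger king'); that is where the similarity-based approaches in the first answer come in.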

Program to optimize cost

This is my problem set for one of my CS class and I am kind of stuck. Here is the summary of the problem.
Create a program that will:
1) take a list of grocery stores and its available items and prices
2) take a list of required items that you need to buy
3) output a supermarket where you can get all your items with the cheapest price
input: supermarkets.list, [tomato, orange, turnip]
output: supermarket_1
The list looks something like
supermarket_1
$2.00 tomato
$3.00 orange
$4.00 tomato, orange, turnip
supermarket_2
$3.00 tomato
$2.00 orange
$3.00 turnip
$15.00 tomato, orange, turnip
If we want to buy a tomato and an orange, then the optimal solution would be buying
the $4.00 combination from supermarket_1. Note that it is possible for an item to be bought
twice, so if we wanted to buy 2 tomatoes, buying from supermarket_1 would still be the
optimal solution.
So far, I have been able to put the dataset into a data structure that I hope will allow me to easily do operations on it. I basically have a dictionary of supermarkets and the value would point to a another dictionary containing the mapping from each entry to its price.
supermarket_1 --> [turnip --> $2.00]
                  [orange --> $1.50]
One way is to use brute force: get all combinations, find whichever ones satisfy the requirements, and take the one with the minimum price. So far, this is what I can come up with. There is no assumption that the price of a combination of two items would be less than buying each separately.
Any suggestions or hints are welcome.
Finding the optimal solution for a specific supermarket is a generalization of the set cover problem, which is NP-complete. The reduction goes as follows:
Given an instance of the set cover problem, just define a cost function assigning 1 to each combination, apply an algorithm that solves your problem, and you obtain an optimal solution of the set cover instance. (Finding the minimal price hence corresponds to finding the minimum number of covering sets.) Thus, your problem is NP-hard, and you cannot expect to find a solution that runs in polynomial time.
You really should implement the brute-force method you mentioned; I too recommend doing this as a first step. If the performance is not sufficient, you can try a MIP formulation and a solver like CPLEX, or you will have to develop a heuristic approach.
For a single supermarket, it is rather trivial to formulate a mixed integer program (MIP). Let x_i be the integer number of times product combination i is used in a solution, c_i its cost, and w_ij the number of times product j is contained in product combination i. Then you are minimizing
sum_i x_i * c_i
subject to constraints
sum_i x_i * w_ij >= r_j,
where r_j is the number of times product j is required.
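For reference, a minimal Python sketch of the brute-force search recommended above, for a single supermarket; the offer data and the cap on the number of chosen offers are assumptions made to keep the enumeration finite:

from itertools import combinations_with_replacement

# Offers for one supermarket: (price, items covered). Numbers are hypothetical.
OFFERS = [(2.00, ("tomato",)), (3.00, ("orange",)),
          (4.00, ("tomato", "orange", "turnip"))]

def covers(chosen, required):
    """True if the chosen offers supply every required item often enough."""
    supplied = [item for _, items in chosen for item in items]
    return all(supplied.count(item) >= required.count(item) for item in set(required))

def cheapest(required, max_offers=4):
    """Try every multiset of up to max_offers offers; exponential, per the NP-hardness above."""
    best = None
    for k in range(1, max_offers + 1):
        for combo in combinations_with_replacement(OFFERS, k):
            if covers(combo, required):
                price = sum(p for p, _ in combo)
                if best is None or price < best:
                    best = price
    return best

print(cheapest(["tomato", "orange"]))   # 4.0 via the bundle
print(cheapest(["tomato", "tomato"]))   # 4.0 via two single tomatoes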
Well, you have one method, so implement it now so you have something that works to submit. A brute-force solution should not take long to code up, then you can get some performance data and you can think about the problem more deeply. Guesstimate the number of supermarkets in a reasonable shopping range in a large city. Create that many supermarket records and link them to product tables with random-ish prices (this is more work than the solution).
Run your brute-force solution. Does it work? If it outputs a solution, 'manually' add up the prices and list them against three other 'supermarket' records taken at random, (just pick a number), showing that the total is less or equal. Modify the price of an item on your list so that the solution is no longer cheap and re-run, so that you get a different solution.
Is it so fast that no further work is justified? If so, say so in the conclusions section of your report and post the lot to your prof/TA. You understood the task, thought about it, came up with a solution, implemented it, tested it with a representative dataset and so shown that functionality is demonstrable and performance is adequate - your assignment is over, go to the bar, think about the next one over a beer.
I am not sure what you mean by "brute force" solution.
Why don't you just calculate the cost of your list of items in each of the supermarkets, and then select the minimum? Complexity would be in O(#items x #supermarkets) which is good.
Regarding your data structure, you could also simply use a matrix (two-dimensional array) price[super][item], and use ids for your supermarkets/items.

Algorithm for almost similar values search

I have a Persons table in SQL Server 2008.
My goal is to find Persons who have almost similar addresses.
The address is described with columns state, town, street, house, apartment, postcode and phone.
Due to some specific differences in some states (not US) and the human factor (mistakes in addresses, etc.), addresses are not filled in following the same pattern.
Most common mistakes in addresses
Case sensitivity
Someone wrote "apt.", another one "apartment" or "ap." (although addresses aren't written in English)
Spaces, dots, commas
Differences in writing street names, like 'Dr. Jones str." or "Doctor Jones street" or "D. Jon. st." or "Dr Jones st" etc.
The main problem is that data isn't in the same pattern, so it's really difficult to find similar addresses.
Is there any algorithm for this kind of issue?
Thanks in advance.
UPDATE
As I mentioned, the address is separated into different columns. Should I generate a string concatenating the columns, or apply your steps to each column separately?
I assume I shouldn't concatenate the columns, but if I compare columns separately, how should I organize it? Should I find similarities for each column and union them, or intersect them, or something else?
Should I collect some statistics, or use some kind of learning algorithm?
Suggest approaching it thus:
Create word-level n-grams (a trigram/4-gram might do it) from the various entries
Do a many x many string comparison and cluster the entries by string distance. Someone suggested Levenshtein; there are better ones for this kind of task: Jaro-Winkler distance and Smith-Waterman work better. A library such as SimMetrics would make life a lot easier
Once you have clusters of n-grams, you can resolve the whole string using the constituent subgrams, i.e. D.Jones St => Davy Jones St. => DJones St.
Should not be too hard; this is an all-too-common problem.
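A small Python sketch of the n-gram step; difflib's SequenceMatcher stands in for Jaro-Winkler here purely because it ships with the standard library (SimMetrics would be the better tool):

from difflib import SequenceMatcher

def word_ngrams(text, n=3):
    """Word-level n-grams, e.g. trigrams of an address string."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))]

def similarity(a, b):
    # Stand-in for Jaro-Winkler / Smith-Waterman.
    return SequenceMatcher(None, a, b).ratio()

addr1 = "1 Jones Street Sometown SomeCountry"
addr2 = "1 Jones St. Sometown SomeCountry"
pairs = [(g1, g2, similarity(g1, g2))
         for g1 in word_ngrams(addr1) for g2 in word_ngrams(addr2)]
print(max(pairs, key=lambda t: t[2]))  # the best-matching pair of trigrams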
Update: Based on your update above, here are the suggested steps.
Concatenate your columns into a single string, perhaps via a db "view". For example:
create view vwAddress
as
select top 10000
    state, town, street, house, apartment, postcode,
    state + town + street + house + apartment + postcode as Address
from ...
Write a separate application (say in Java or C#/VB.NET), use an algorithm like Jaro-Winkler to estimate the string distance for the combined address in a many x many comparison, and write the results into a separate table:
address1 | address n | similarity
You can use SimMetrics to get the similarity like this:
JaroWinkler objJw = new JaroWinkler();
double sim = objJw.GetSimilarity(address1, addressN);
You could also trigram it, so that an address such as "1 Jones Street, Sometown, SomeCountry" becomes "1 Jones Street", "Jones Street Sometown", and so on, and compare the trigrams (or even 4-grams) for higher accuracy.
Finally, you can order by similarity to get a cluster of the most similar addresses and decide on an appropriate threshold. Not sure why you are stuck.
I would try to do the following (a code sketch of these steps follows below):
split up the address into multiple words, getting rid of punctuation at the same time
check all the words for patterns that are typically written differently and replace them with a common form (e.g. replace apartment, ap., ... with apt; replace Doctor with Dr.; ...)
put all the words back into one string, alphabetically sorted
compare all the addresses using a fuzzy string comparison algorithm, e.g. Levenshtein
tweak the parameters of the Levenshtein algorithm (e.g. you want to allow more differences on longer strings)
finally, do a manual check of the strings
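A minimal Python sketch of these steps, assuming a hand-made replacement map and a plain dynamic-programming Levenshtein:

# Hypothetical abbreviation map; extend it to match the data you actually see.
REPLACEMENTS = {"apartment": "apt", "ap": "apt", "doctor": "dr", "street": "st", "str": "st"}

def normalize(address):
    """Split, strip punctuation, map known variants, then sort the words."""
    words = [w.strip(".,").lower() for w in address.split()]
    words = [REPLACEMENTS.get(w, w) for w in words if w]
    return " ".join(sorted(words))

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

a = normalize("Dr. Jones str. 5, apartment 3")
b = normalize("Doctor Jones street 5 apt 3")
print(a, "|", b, "|", levenshtein(a, b))  # identical after normalization: distance 0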
Of course, the solution to keep your data 'in shape' is to have explicit fields for each of your characteristics in your database. Otherwise, you will end up doing this exercise every few months.
The main problem I see here is exactly defining equality.
Even if someone writes Jon. and another Jone. you will never be able to say if they are the same. (Jon -> Jonethan, Joneson, Jonedoe, whatever ;)
I work at a firm where we have to handle exactly this problem, and I'm afraid I have to tell you that this kind of checking of address lists for navigation systems is done "by hand" most of the time. Abbreviations are sometimes context dependent, and there are other things that make this difficult. Of course, replacing strings etc. is done with Python, but telling you the MEANING of such an abbreviation can only be done by a script in a few cases. ("St." can be "Saint" or "Street". How to decide? Impossible... this is human work.)
Another big problem is, as you said: is "DJones" a street or a person? Or both? Which one is meant here? Is this DJones the same as Dr Jones, or the same as Don Jones? It's impossible to decide!
You can do some work with lists as presented in another answer here, but it will give you plenty of "false positives".
You have a postcode field!!! So why don't you just buy a postcode table for your country and use that to clean up your street/town/region/province information?
I did a project like this in the last century. Basically it was a consolidation of two customer files after a merger, and involved names and addresses from three different sources.
Firstly, as many posters have suggested, convert all the common words, abbreviations, and spelling mistakes to a common form: "Apt.", "Apartment", etc. to "Apt".
Then look through the name and identify the first letter of the first name, plus the first surname. (Not that easy: consider "Dr. Med. Sir Henry de Baskerville Smythe". But don't worry: where there are ambiguities, just take both!) So if you're lucky you get HBASKERVILLE and HSMYTHE. Now get rid of all the vowels, as that's where most spelling variations occur, so now you have HBSKRVLL HSMTH.
You would also get these strings from "H. Baskerville", "Sir Henry Baskerville Smith", and unfortunately "Harold Smith", but we are talking fuzzy matching here!
Perform a similar exercise on the street, apartment, and postcode fields. But do not throw away the original data!
Now you come to the interesting bit: first you compare each of the original strings and give, say, 50 points for each string that matches exactly. Then go through your "normalised" strings and give, say, 20 points for each one that matches exactly. Then go through all the strings and give, say, 5 points for each four-character-or-more substring they have in common. For each pair compared you will end up with some scores > 150, which you can consider a certain match, some scores less than 50, which you can consider not matched, and some in between, which have some probability of matching.
You need some more tweaking to improve this, adding various rules like "subtract 20 points for a surname of 'smith'". You really have to keep running and tweaking until you are happy with the resulting matches, but once you look at the results you get a pretty good feel for which score to consider a "match" and which are the false positives you need to get rid of.
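A loose Python sketch of that scoring scheme; counting distinct four-character windows is a crude approximation of "each four-character-or-more substring in common":

def substring_points(a, b, min_len=4, points=5):
    """Points for distinct min_len-character chunks of a that also appear in b."""
    found = {a[i:i + min_len] for i in range(len(a) - min_len + 1) if a[i:i + min_len] in b}
    return points * len(found)

def score(raw_a, raw_b, norm_a, norm_b):
    """Illustrative weights from the answer: 50 raw, 20 normalised, 5 per chunk."""
    total = 0
    for fa, fb in zip(raw_a, raw_b):
        if fa == fb:
            total += 50
    for fa, fb in zip(norm_a, norm_b):
        if fa == fb:
            total += 20
        total += substring_points(fa, fb)
    return total

print(score(["H. Baskerville"], ["Sir Henry Baskerville Smith"],
            ["HBSKRVLL"], ["HBSKRVLL HSMTH"]))  # 25: five shared 4-char chunks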
I think the amount of data could affect what approach works best for you.
I had a similar problem when indexing music from compilation albums with various artists. Sometimes the artist came first, sometimes the song name, with various separator styles.
What I did was to count the number of occurrences in other entries with the same value, to make an educated guess whether it was the song name or the artist.
Perhaps you can use soundex or similar algorithm to find stuff that are similar.
EDIT: (Maybe I should clarify that I assumed that artist names were more likely to recur than song names.)
One important thing that you mention in the comments is that you are going to do this interactively.
This allows you to parse user input and, at the same time, validate guesses on any abbreviations and correct a lot of mistakes (the way, for example, phone number entry works in some contact management systems: the system makes a best effort to parse and correct the country code, area code and the number, but ultimately the user is presented with the guess and has the chance to correct the input).
If you want to do it really well, then keeping databases/dictionaries of postcodes, towns, streets, abbreviations and their variations can improve data validation and pre-processing.
So, at least you would have fully qualified addresses. If you can do this for all the input, you will have all the data categorized, and matches can then be strict on certain fields and less strict on others, with the matching score calculated according to the weights you assign.
After you have consistently pre-processed the input then n-grams should be able to find similar addresses.
Have you looked at SQL Server Integration Services for this? The Fuzzy Lookup component allows you to find 'Near matches': http://msdn.microsoft.com/en-us/library/ms137786.aspx
For new input, you could call the package from .Net code, passing the value row to be checked as a set of parameters, you'd probably need to persist the token index for this to be fast enough for user interaction though.
There's an example of address matching here: http://msdn.microsoft.com/en-us/magazine/cc163731.aspx
I'm assuming that response time is not critical and that the problem is finding an existing address in a database, not merging duplicates. I'm also assuming the database contains a large number of addresses (say 3 million), rather than a number that could be cleaned up economically by hand or by Amazon's Mechanical Turk.
Pre-computation - Identify address fragments with high information content.
Identify all the unique words used in each database field and count their occurrences.
Eliminate very common words and abbreviations. (Street, st., appt, apt, etc.)
When presented with an input address,
Identify the most unique word and search (Street LIKE '%Jones%') for existing addresses containing those words.
Use the pre-computed statistics to estimate how many addresses will be in the results set
If the estimated results set is too large, select the second-most unique word and combine it in the search (Street LIKE '%Jones%' AND Town LIKE '%Anytown%')
If the estimated results set is too small, select the second-most unique word and combine it in the search (Street LIKE '%Aardvark%' OR Town LIKE '%Anytown')
If the actual results set is too large/small, repeat the query, adding further terms as before.
The idea is to find enough fragments with high information content in the address which can be searched for to give a reasonable number of alternatives, rather than to find the most optimal match. For more tolerance to misspelling, trigrams, tetra-grams or soundex codes could be used instead of words.
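A hedged Python sketch of that word-rarity idea, searching a single column for brevity; the counts, the stopword list, and the crude min() estimate for combined terms are all placeholders:

# Hypothetical precomputed word counts over the address column.
WORD_COUNTS = {"street": 1200000, "jones": 4100, "aardvark": 3, "anytown": 52000}
STOPWORDS = {"street", "st", "apt", "appt"}

def ranked_terms(address):
    """Input words ordered rarest-first; unknown words are treated as rarest."""
    words = [w.strip(".,").lower() for w in address.split()]
    return sorted({w for w in words if w and w not in STOPWORDS},
                  key=lambda w: WORD_COUNTS.get(w, 0))

def build_query(address, max_estimated=500):
    """Add rarest terms with AND until the estimated result set looks small enough."""
    clauses, estimate = [], None
    for term in ranked_terms(address):
        clauses.append("Street LIKE '%" + term + "%'")
        counts = [e for e in (estimate, WORD_COUNTS.get(term, 0)) if e is not None]
        estimate = min(counts)  # crude: a real system would estimate the AND jointly
        if estimate <= max_estimated:
            break
    return "SELECT * FROM Addresses WHERE " + " AND ".join(clauses)

print(build_query("Jones Street, Anytown"))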
Obviously if you have lists of actual states / towns / streets then some data clean-up could take place both in the database and in the search address. (I'm very surprised the Armenian postal service does not make such a list available, but I know that some postal services charge excessive amounts for this information. )
As a practical matter, most systems I see in use try to look up people's accounts by their phone number if possible: obviously whether that is a practical solution depends upon the nature of the data and its accuracy.
(Also consider the lateral-thinking approach: could you find a mail-order mail-list broker company which will clean up your database for you? They might even be willing to pay you for use of the addresses.)
I've found a great article.
By adding some DLLs as SQL user-defined functions, we can use string comparison algorithms from the SimMetrics library.
Check it out: http://anastasiosyal.com/archive/2009/01/11/18.aspx
The possibilities of such variations are countless, and even if such an algorithm exists, it can never be fool-proof. You can't have a spell checker for nouns, after all.
What you can do is provide a drop-down list of previously entered field values, so that the user can select one if a particular name already exists.
It's better to have separate fields for each value, like apartments and so on.
You could throw all addresses at a web service like Google Maps (I don't know whether this one is suitable, though) and see whether they come up with identical GPS coordinates.
One method could be to apply the Levenshtein distance algorithm to the address fields. This will allow you to compare the strings for similarity.
Edit
After looking at the kinds of address differences you are dealing with, this may not be helpful after all.
Another idea is to use learning. For example you could learn, for each abbreviation and its place in the sentence, what the abbreviation means.
3 Jane Dr. -> Dr (in 3rd position (or last)) means Drive
Dr. Jones St -> Dr (in 1st position) means Doctor
You could, for example, use decision trees and have a user train the system. Probably a few examples of each use would be enough. You wouldn't confidently classify single-letter abbreviations like the D. in D.Jones, which could be David Jones or Dr. Jones. But after a first level of translation you could look up a street index of the town and see if you can expand the D. into a street name.
Again, you would run each address through the decision tree before storing it.
It feels like there should be some commercial products doing this out there.
A possibility is to have a dictionary table in the database that maps all the variants to the 'proper' version of the word:
*Value* | *Meaning*
Apt. | Apartment
Ap. | Apartment
St. | Street
Then you run each word through the dictionary before you compare.
Edit: this alone is too naive to be practical (see comment).

help with tree-like structure

I've got some financial data to store and manipulate. Let's say I have 2 divisions, with offices in 2 cities, in 2 currencies, and 4 bank accounts. (It's actually more complex than that.) I want to show a list like this:
Electronics
Chicago
Dollars
Account 2 -> transactions in acct2 in $ in chicago/electronics
Euros
Account 1 -> transactions in acct1 in E in chicago/electronics
Account 3 -> etc.
Account 4
Brussels
Dollars
Account 1
Euros
Account 3
Account 4
Dessert Toppings
Chicago
Dollars
Account 1
Account 4
Euros
Account 2
Account 4
Brussels
Dollars
Account 2
Euros
Account 3
Account 4
So at each level except the top, the category can appear in multiple places. I've been reading around about the various methods, but none of the examples seem to address my particular use case, where nodes can appear in more than one place in the hierarchy. (Maybe there's a different name for this than "tree" or "hierarchy".)
I guess my hierarchy is actually something like Division > City > Currency with 'Electronics' and 'Euros' merely instances of each level, but I'm not quite sure how that helps or hurts.
A few notes: this is for a demo site, so the dataset won't be large -- ease of set-up and maintenance is more important than query efficiency. (I'm actually considering just building a data object by hand, though I'd much rather do it the right way.) Also, FWIW, we're working in php with an ms access back-end, so any libraries out there that make this easy in that environment would be helpful. (I've found a couple of implementations of the nested set pattern already.)
Are you sure you want to use a hierarchical design for this? To me, the hierarchy seems more a consequence of the desired output format than something intrinsic to your data structure.
And what if you have to display the data in a different order, like City > Currency > Division? Wouldn't that be very cumbersome?
You could use a plain structure instead, with a table for Branches, one for Cities, one for Currencies, and then one Accounts table with Branch_ID, City_ID, and Currency_ID as foreign keys.
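If it helps, a small Python sketch (PHP would be analogous) of how flat rows with those foreign keys roll up into the Division > City > Currency listing; the table data is invented for the example:

from collections import defaultdict

# Flat account rows, as the plain structure suggests.
rows = [
    ("Electronics", "Chicago", "Dollars", "Account 2"),
    ("Electronics", "Chicago", "Euros", "Account 1"),
    ("Electronics", "Brussels", "Dollars", "Account 1"),
]

def nest(rows):
    """Group flat rows into Division > City > Currency > [accounts]."""
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for division, city, currency, account in rows:
        tree[division][city][currency].append(account)
    return tree

def dump(node, indent=0):
    for key, value in node.items():
        print("  " * indent + key)
        if isinstance(value, dict):
            dump(value, indent + 1)
        else:
            for account in value:
                print("  " * (indent + 1) + account)

dump(nest(rows))

Reordering the display to City > Currency > Division is then just a matter of reordering the fields before grouping, which is exactly the flexibility the flat design buys you.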
I'm not sure what database platform you're using. But if you're using MS SQL Server, then you should check out recursive queries using common table expressions (CTEs). They're easy to use and are designed for exactly the type of situation you've illustrated (a bill of materials, for instance). Check out this website for more detail: http://www.mssqltips.com/tip.asp?tip=1520
Good luck!