Object Model to identify customers with similar address - sql

Let's say I have a table of customers and each customer has an address. My task is to design an object model that groups the customers by similar address. Example:
John 123 Main St, #A; Los Angeles, CA 90032
Jane 92 N. Portland Ave, #1; Pasadena, CA 91107
Peter 92 N. Portland Avenue, #2; Pasadena, CA 91107
Lester 92 N Portland Av #4; Pasadena, CA 91107
Mark 123 Main Street, #C; Los Angeles, CA 90032
The query should somehow return:
1 Similar_Address_Key1
5 Similar_Address_Key1
2 Similar_Address_Key2
3 Similar_Address_Key2
4 Similar_Address_Key2
What is the best way to accomplish this? Notice the addresses are NOT consistent (some addresses have "Avenue", others have "Av", and the apartment numbers are different). The existing name/address data cannot be corrected, so doing a GROUP BY (Address) on the table itself is out of the question.
I was thinking to add a SIMILAR_ADDRESSES table that takes an address, evaluates it and gives it a key, so something like:
cust_key address similar_addr_key
1 123 Main St, #A; Los Angeles, CA 90032 1
2 92 N. Portland Ave, #1; Pasadena, CA 91107 2
3 92 N. Portland Avenue, #2; Pasadena, CA 91107 2
4 92 N. Portland Av #4; Pasadena, CA 91107 2
5 123 Main Street, #C; Los Angeles, CA 90032 1
Then group by the similar address key. But the question is how to best accomplish the "evaluation" part. One way would be to modify the addresses in the SIMILAR_ADDRESSES table so that they are consistent, ignoring things like apt, #, or suite, and assign a "key" to each exact match. A different approach I thought about was to feed the address to a Geolocator service, save the latitude/longitude values to a table, and use these values to generate a similar address key.
Any ideas?
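The first approach mentioned above (make the addresses consistent, ignore unit designators, and key on the exact match) could be sketched roughly as follows. This is only a minimal illustration in Python: the function name similar_address_key, the abbreviation map, and the unit pattern are all hypothetical, and a real address list would need a much larger set of rules.

```python
import re
from collections import defaultdict

# Hypothetical abbreviation map; a real one would need many more entries.
STREET_SUFFIXES = {
    "street": "st", "st": "st",
    "avenue": "ave", "ave": "ave", "av": "ave",
}

# Unit designators to ignore, e.g. "#A", "apt 4", "suite 12", "unit 3".
UNIT_PATTERN = re.compile(r"#\s*\w+|\b(?:apt|suite|ste|unit)\b\.?\s*\w+")


def similar_address_key(address: str) -> str:
    """Return a normalized string that is equal for 'similar' addresses."""
    text = UNIT_PATTERN.sub(" ", address.lower())
    text = re.sub(r"[^\w\s]", " ", text)  # drop punctuation such as , ; .
    tokens = [STREET_SUFFIXES.get(tok, tok) for tok in text.split()]
    return " ".join(tokens)


# Sample rows from the question: (cust_key, name, address).
customers = [
    (1, "John", "123 Main St, #A; Los Angeles, CA 90032"),
    (2, "Jane", "92 N. Portland Ave, #1; Pasadena, CA 91107"),
    (3, "Peter", "92 N. Portland Avenue, #2; Pasadena, CA 91107"),
    (4, "Lester", "92 N Portland Av #4; Pasadena, CA 91107"),
    (5, "Mark", "123 Main Street, #C; Los Angeles, CA 90032"),
]

# Group customer keys by their normalized address.
groups = defaultdict(list)
for cust_key, _name, address in customers:
    groups[similar_address_key(address)].append(cust_key)

for key, cust_keys in groups.items():
    print(cust_keys, "->", key)
```

The normalized string (or a surrogate number assigned to it) would play the role of similar_addr_key in the SIMILAR_ADDRESSES table. Addresses that differ in ways the rules do not cover would still land in separate groups.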

Related

Presenting Data uniformly between two different table presentations with SQL

Hello Everyone I have a problem…
Table 1 (sorted) is laid out like this:
User ID Producer ID Company Number
JWROSE 23401 234
KXPEAR 23903 239
LMWEEM 27902 279
KJMORS 18301 183
Table 2 (unsorted) looks like this:
Client Name City Company Number
Rajat Smith London JWROSE
Robert Singh Cleveland KXPEAR
Alberto Johnson New York City LMWEEM
Betty Lee Dallas KJMORS
Chase Galvez Houston 23401
Hassan Jackson Seattle 23903
Tooti Fruity Boise 27902
Joe Trump Tokyo 18301
Donald Biden Cairo 234
Mike Harris Rome 239
Kamala Pence Moscow 279
Adolf Washington Bangkok 183
Now, Table 1 has all of the User IDs and Producer IDs correctly paired with the Company Number on each row.
I want to pull all the data, correctly matched and sorted:
Client Name City User ID Producer ID Company Number
Rajat Smith London JWROSE 23401 234
Robert Singh Cleveland KXPEAR 23903 239
Alberto Johnson New York City LMWEEM 27902 279
Betty Lee Dallas KJMORS 18301 183
Chase Galvez Houston JWROSE 23401 234
Hassan Jackson Seattle KXPEAR 23903 239
Tooti Fruity Boise LMWEEM 27902 279
Joe Trump Tokyo KJMORS 18301 183
Donald Biden Cairo JWROSE 23401 234
Mike Harris Rome KXPEAR 23903 239
Kamala Pence Moscow LMWEEM 27902 279
Adolf Washington Bangkok KJMORS 18301 183
Query:
Select
b.client_name,
b.city,
a.user_id,
a.producer_id,
a.company_number
From Table 1 A
Left Join Table 2 B On a.company….
And this is where I don't know what to do, because both tables have all the same variables, but Company Number in Table 2 is mixed with User IDs and Producer IDs. However, we know which Company Number those IDs are associated with.
As I mention in the comments, and others do, the real problem is your design. "The fact that UserID is clearly a varchar, while the other 2 columns are an int really does not make this any better", and makes this not simple (and certainly not SARGable).
To get the data in the correct order, as well, you need a column to order it on which the data lacks. I have therefore added a pseudo column, MissingIDColumn, to represent this missing column you need to add to your data; which you can do when you fix the design:
SELECT T2.ClientName,
T2.City,
T1.UserID,
T1.ProducerID,
T1.CompanyNumber
FROM (VALUES('JWROSE',23401,234),
('KXPEAR',23903,239),
('LMWEEM',27902,279),
('KJMORS',18301,183))T1(UserID,ProducerID,CompanyNumber)
JOIN (VALUES(1,'Rajat Smith ','London ','JWROSE'),
(2,'Robert Singh ','Cleveland ','KXPEAR'),
(3,'Alberto Johnson ','New York City','LMWEEM'),
(4,'Betty Lee ','Dallas ','KJMORS'),
(5,'Chase Galvez ','Houston ','23401'),
(6,'Hassan Jackson ','Seattle ','23903'),
(7,'Tooti Fruity ','Boise ','27902'),
(8,'Joe Trump ','Tokyo ','18301'),
(9,'Donald Biden ','Cairo ','234'),
(10,'Mike Harris ','Rome ','239'),
(11,'Kamala Pence ','Moscow ','279'),
(12,'Adolf Washington','Bangkok ','183'))T2(MissingIDColumn,ClientName,City,CompanyNumber)
ON T2.CompanyNumber IN (T1.UserID,CONVERT(varchar(6),T1.ProducerID),CONVERT(varchar(6),T1.CompanyNumber))
ORDER BY MissingIDColumn;
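The matching rule in the ON clause above (the mixed column in Table 2 may equal any of the three Table 1 identifiers once they are compared as text) can also be illustrated outside SQL. The following is a rough Python sketch using a subset of the sample rows; the variable names are made up for the illustration, and it assumes no identifier appears in more than one Table 1 row.

```python
# Table 1 rows: (user_id, producer_id, company_number).
table1 = [
    ("JWROSE", 23401, 234),
    ("KXPEAR", 23903, 239),
    ("LMWEEM", 27902, 279),
    ("KJMORS", 18301, 183),
]

# A subset of Table 2 rows: (client_name, city, mixed_identifier).
# The list order stands in for the missing ID column.
table2 = [
    ("Rajat Smith", "London", "JWROSE"),
    ("Robert Singh", "Cleveland", "KXPEAR"),
    ("Chase Galvez", "Houston", "23401"),
    ("Hassan Jackson", "Seattle", "23903"),
    ("Donald Biden", "Cairo", "234"),
    ("Mike Harris", "Rome", "239"),
]

# Map every identifier, converted to text, to its full Table 1 row.
lookup = {}
for row in table1:
    for identifier in row:
        lookup[str(identifier)] = row

# Inner join: keep Table 2 rows whose mixed identifier is known.
result = [
    (name, city, *lookup[mixed_id])
    for name, city, mixed_id in table2
    if mixed_id in lookup
]

for line in result:
    print(line)
```

Converting the numbers to text mirrors the CONVERT(varchar(6), ...) calls in the query, and it is the same reason the join cannot use an index efficiently.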

Pandas loc() function

I am trying to slice items from a CSV file. Here is an example:
df1 = pandas.read_csv("supermarkets.csv")
df1
ID Address City State Country Name Employees
0 1 3666 21st St San Francisco CA 94114 USA Madeira 8
1 2 735 Dolores St San Francisco CA 94119 USA Bready Shop 15
2 3 332 Hill St San Francisco California 94114 USA Super River 25
3 4 3995 23rd St San Francisco CA 94114 USA Ben's Shop 10
4 5 1056 Sanchez St San Francisco California USA Sanchez 12
5 6 551 Alvarado St San Francisco CA 94114 USA Richvalley 20
df2 = df1.loc["735 Dolores St":"332 Hill St","City":"Country"]
df2
In output I am only getting this output
City State Country
How do I correct?
As you can read in pandas documentation .loc[] can access a group of rows and columns by label(s) or a boolean array.
You cannot directly select using the values in the Series.
In your example df1.loc["735 Dolores St":"332 Hill St","City":"Country"] you are getting an empty selection because only "City":"Country" is a valid accessor.
"735 Dolores St":"332 Hill St" will return an empty row selection as they are not labels on the index.
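If you want specific rows, you can select them by their index labels with df1.loc[[1, 2], "City":"Country"], or by position with df1.iloc[[1, 2], 2:5]. Note that .iloc accepts only integer positions, so it cannot take "City":"Country".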
df.loc is primarily label based and commonly slices the rows using an index. In this case, you can use the numeric index or set address as index
print(df)
ID Address City State Country Name Employees
0 1 3666 21st San Francisco CA 94114 USA Madeira 8
1 2 735 Dolores San Francisco CA 94114 USA Bready Shop 15
2 3 332 Hill San Francisco CA 94114 USA Super River 25
df2=df.loc[1:2,'City':'Country']
print(df2)
City State Country
1 San Francisco CA 94114 USA
2 San Francisco CA 94114 USA
Or
df2=df.set_index('Address').loc['735 Dolores':'332 Hill','City':'Country']
print(df2)
City State Country
Address
735 Dolores San Francisco CA 94114 USA
332 Hill San Francisco CA 94114 USA
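A third option, if you would rather keep the default numeric index, is to build a boolean mask from the Address values and pass it to .loc. The sketch below rebuilds a few rows of the sample data by hand instead of reading supermarkets.csv, so the DataFrame contents are an approximation of the file.

```python
import pandas as pd

# A hand-built stand-in for the first rows of supermarkets.csv.
df1 = pd.DataFrame({
    "ID": [1, 2, 3],
    "Address": ["3666 21st St", "735 Dolores St", "332 Hill St"],
    "City": ["San Francisco"] * 3,
    "State": ["CA 94114", "CA 94119", "California 94114"],
    "Country": ["USA"] * 3,
    "Name": ["Madeira", "Bready Shop", "Super River"],
    "Employees": [8, 15, 25],
})

# True for the rows whose Address is one of the wanted values.
wanted = ["735 Dolores St", "332 Hill St"]
mask = df1["Address"].isin(wanted)

# Rows are chosen by the boolean mask, columns by a label slice.
df2 = df1.loc[mask, "City":"Country"]
print(df2)
```

Unlike a label slice on the index, the mask picks exactly the listed addresses regardless of where they sit in the frame.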

How do I drop a row from this data frame?

ID Address City State Country Name Employees
0 1 3666 21st St San Francisco CA 94114 USA Madeira 8
1 2 735 Dolores St San Francisco CA 94119 USA Bready Shop 15
2 3 332 Hill St San Francisco Cal USA Super River 25
3 4 3995 23rd St San Francisco CA 94114 USA Ben's Shop 10
4 5 1056 Sanchez St San Francisco California USA Sanchez 12
5 6 551 Alvarado St San Francisco CA 94114 USA Richvalley 20
df=df.drop(['3666 21st St'], axis=1, inplace=True)
I am using this code and still, it's showing an error stating that :
KeyError: "['3666 21st St'] not found in axis"
Can anyone help me solve this?
The drop method only works on the index or column names. There are 2 ways to do what you want:
1. Make the Address column the index, then drop the value(s) you want to drop. You should use axis=0 for this, and not axis=1. The default is axis=0. Do not use inplace=True if you are assigning the output.
2. Use a Boolean filter instead of drop.
The 1st method is preferred if the Address values are all distinct. The index of a data frame is effectively a sequence of row labels, so it doesn't make much sense to have duplicate row labels:
df.set_index('Address', inplace=True)
df.drop(['3666 21st St'], inplace=True)
The 2nd method is therefore preferred if the Address column is not distinct:
is_bad_address = df['Address'] == '3666 21st St'
# Alternative if you have multiple bad addresses:
# is_bad_address = df['Address'].isin(['3666 21st St'])
df = df.loc[~is_bad_address]
You need to consult the Pandas documentation for the correct usage of the axis= and inplace= keyword arguments. You are using both of them incorrectly. DO NOT COPY AND PASTE CODE WITHOUT UNDERSTANDING HOW IT WORKS.
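To see both mistakes and both fixes side by side, here is a small self-contained sketch. The three-row DataFrame is a cut-down stand-in for the data in the question, and the variable names are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Address": ["3666 21st St", "735 Dolores St", "332 Hill St"],
    "Name": ["Madeira", "Bready Shop", "Super River"],
    "Employees": [8, 15, 25],
})

# Mistake 1: axis=1 looks for a COLUMN named '3666 21st St'.
try:
    df.drop(["3666 21st St"], axis=1)
    raised_key_error = False
except KeyError:
    raised_key_error = True

# Mistake 2: with inplace=True, drop returns None, so assigning
# the result back would replace the DataFrame with None.
returned = df.copy().drop(index=0, inplace=True)

# Method 1: make Address the index, then drop by row label.
by_index = df.set_index("Address").drop(["3666 21st St"])

# Method 2: keep the rows that do not match, using a Boolean filter.
by_filter = df.loc[df["Address"] != "3666 21st St"]

print(raised_key_error, returned)
print(by_index)
print(by_filter)
```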

modified list input when character value has embedded blanks

I am preparing for the SAS Base test. In chapter 17 of the test book, Reading Free-format Data, there is an example of how to read character values with embedded blanks and nonstandard values, such as numbers with commas. I tested it, and the result is not what the book describes.
data cityrank;
infile datalines;
input rank city & $12. pop86: comma.;
datalines;
1 NEW YORK 7,262,700
2 LOS ANGELES 3,259,340
3 CHICAGO 3,009,530
4 HOUSTON 1,728,910
5 PHILADELPHIA 1,642,900
6 DETROIT 1,086,220
7 DAN DIEGO 1,015,190
8 DALLAS 1,003,520
9 SAN ANTONIA 914,350
;
What I got is shown below; the data set has 4 observations.
rank city pop86
1 NEW YORK 7,2 2
3 CHICAGO 3,00 4
5 PHILADELPHIA 6
7 DAN DIEGO 1, 8
Am I wrong somewhere typing the program? I have checked again and again that I copy it correctly.
How to modify this program?
Thank you!
I'm guessing from the typos that you didn't copy-paste this, but you typed it in instead.
As such, you (or the book writers) made another typo: there are two spaces after the city names, not one (or at least, should be). That's what the & does: it says "wait for two consecutive delimiters" (allowing a single delimiter to be ignored, so New York is read into one variable instead of split).
So this would be correct:
data cityrank;
infile datalines;
input rank city & $12. pop86: comma.;
datalines;
1 NEW YORK  7,262,700
2 LOS ANGELES  3,259,340
3 CHICAGO  3,009,530
4 HOUSTON  1,728,910
5 PHILADELPHIA  1,642,900
6 DETROIT  1,086,220
7 SAN DIEGO  1,015,190
8 DALLAS  1,003,520
9 SAN ANTONIO  914,350
;
run;
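If it helps to see the delimiter rule outside SAS, here is a rough Python imitation of what the & modifier does to the city field: a single blank is kept as part of the value, and only two or more consecutive blanks end it. The function read_record is made up for this illustration, and it does not reproduce SAS's behavior of carrying on into the next data line when no double blank is found (which is why the original run produced only 4 observations).

```python
import re


def read_record(line: str):
    """Parse 'rank city  pop86', where city ends at two or more blanks."""
    rank_text, rest = line.strip().split(" ", 1)
    # Split once on a run of 2+ blanks; a single blank stays inside city.
    parts = re.split(r" {2,}", rest.strip(), maxsplit=1)
    if len(parts) != 2:
        return None  # no double blank: the end of city cannot be found
    city, pop_text = parts
    # Strip the commas, roughly what the comma. informat does.
    return int(rank_text), city, int(pop_text.replace(",", ""))


# Two blanks after the city name: the record parses as intended.
good = read_record("1 NEW YORK  7,262,700")

# One blank after the city name: the city field never terminates.
bad = read_record("1 NEW YORK 7,262,700")

print(good)
print(bad)
```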

How do I generate a crosswalk ID between two SQL tables

I have a SQL table consisting of names, addresses and some associated numerical data paired with a code. The table is structured such that each number-code pair has its own row with redundant address info. An abbreviated version is below; let's call it tblPeopleData:
Name Address ArbitraryCode ArbitraryData
----------------------------------------------------------------------------
John Adams 45 Main St, Rochester NY a 111
John Adams 45 Main St, Rochester NY a 231
John Adams 45 Main St, Rochester NY a 123
John Adams 45 Main St, Rochester NY b 111
John Adams 45 Main St, Rochester NY c 111
John Adams 45 Main St, Rochester NY d 123
John Adams 45 Main St, Rochester NY d 124
Jane McArthur 12 1st Ave, Chicago IL a 111
Jane McArthur 12 1st Ave, Chicago IL a 231
Jane McArthur 12 1st Ave, Chicago IL a 123
Jane McArthur 12 1st Ave, Chicago IL b 111
Jane McArthur 12 1st Ave, Chicago IL c 111
Jane McArthur 12 1st Ave, Chicago IL e 123
Jane McArthur 12 1st Ave, Chicago IL e 124
My problem is that this table is absolutely massive (~10 million rows) and I'm trying to split it up to make traversal less staggeringly sluggish.
What I've done so far is to make a table of just addresses, using something like:
SELECT DISTINCT Address FROM tblPeopleData (etc.)
Leaving me with:
Name Address
------------------------------------------
John Adams 45 Main St, Rochester NY
Jane McArthur 12 1st Ave, Chicago IL
...just a list of addresses. I want to be able to look up each address and see which names reside at that address, so I assigned each address a UniqueID, such that now I have (this table is around ~500,000 rows in my dataset):
Name Address AddressID
--------------------------------------------------------
John Adams 45 Main St, Rochester NY 000001
Jane McArthur 12 1st Ave, Chicago IL 000002
In order to be able to look up people by address though, I need this AddressID field added to tblPeopleData, such that each address in tblPeopleData is associated with its AddressID and this is added to every row, such that I would have:
Name Address ArbitraryCode ArbitraryData AddressID
----------------------------------------------------------------------------------------
John Adams 45 Main St, Rochester NY a 111 00001
John Adams 45 Main St, Rochester NY a 231 00001
John Adams 45 Main St, Rochester NY a 123 00001
John Adams 45 Main St, Rochester NY b 111 00001
John Adams 45 Main St, Rochester NY c 111 00001
John Adams 45 Main St, Rochester NY d 123 00001
John Adams 45 Main St, Rochester NY d 124 00001
Jane McArthur 12 1st Ave, Chicago IL a 111 00002
Jane McArthur 12 1st Ave, Chicago IL a 231 00002
Jane McArthur 12 1st Ave, Chicago IL a 123 00002
Jane McArthur 12 1st Ave, Chicago IL b 111 00002
Jane McArthur 12 1st Ave, Chicago IL c 111 00002
Jane McArthur 12 1st Ave, Chicago IL e 123 00002
Jane McArthur 12 1st Ave, Chicago IL e 124 00002
How do I make this jump from having UniqueIDs for AddressID in my unique addresses table, to adding these all to each row with a corresponding address back in my tblPeopleData?
Just backfill the calculated AddressID back to tblPeopleData - you can combine an UPDATE with a FROM (like you would do in a select)
UPDATE pd
SET AddressID = a.AddressID
FROM tblPeopleData pd
INNER JOIN tblAddressData a
ON pd.Address = a.Address
You would alter the table to have the address id:
alter table tblPeopleData add AddressId int references Address(AddressId);
Then you can update the value using a JOIN:
update tblPeopleData pd JOIN
Address a
ON pd.Address = a.Address
SET pd.AddressId = a.AddressId;
You will definitely want an index on Address(Address) for this.
Then, you can drop the old column:
alter table tblPeopleData drop column Address;
Note:
It might be faster to save the results in a temporary table, because the update is going to generate lots and lots of log records. For this, truncate the original table, and re-load the data:
SELECT . . . , a.AddressId
INTO tmp_tblPeopleData
FROM tblPeopleData pd JOIN
Address a
ON pd.Address = a.Address;
TRUNCATE TABLE tblPeopleData;
INSERT INTO tblPeopleData( . . .)
SELECT . . .
FROM tmp_tblPeopleData;
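The whole backfill can be tried end to end on a throwaway database. The sketch below uses Python's built-in sqlite3 module with an in-memory SQLite database, which is an assumption made purely so the example runs anywhere; the UPDATE is written as a correlated subquery rather than the UPDATE ... JOIN forms above, because join syntax in UPDATE differs between database engines. Table and column names follow the question, and only four sample rows are loaded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tblPeopleData (
        Name TEXT, Address TEXT, ArbitraryCode TEXT, ArbitraryData INTEGER
    );
    INSERT INTO tblPeopleData VALUES
        ('John Adams',    '45 Main St, Rochester NY', 'a', 111),
        ('John Adams',    '45 Main St, Rochester NY', 'b', 111),
        ('Jane McArthur', '12 1st Ave, Chicago IL',   'a', 231),
        ('Jane McArthur', '12 1st Ave, Chicago IL',   'e', 124);

    -- One row per distinct address; AddressID is generated automatically.
    -- The UNIQUE constraint also creates the index on Address.
    CREATE TABLE tblAddressData (
        AddressID INTEGER PRIMARY KEY,
        Address   TEXT UNIQUE
    );
    INSERT INTO tblAddressData (Address)
        SELECT DISTINCT Address FROM tblPeopleData;

    -- Add the new column, then backfill it from the address table.
    ALTER TABLE tblPeopleData ADD COLUMN AddressID INTEGER;
    UPDATE tblPeopleData
       SET AddressID = (SELECT a.AddressID
                          FROM tblAddressData a
                         WHERE a.Address = tblPeopleData.Address);
""")

# Every person should now map to exactly one AddressID.
rows = conn.execute(
    "SELECT Name, COUNT(DISTINCT AddressID), COUNT(*) "
    "FROM tblPeopleData GROUP BY Name ORDER BY Name"
).fetchall()
missing = conn.execute(
    "SELECT COUNT(*) FROM tblPeopleData WHERE AddressID IS NULL"
).fetchone()[0]

print(rows, missing)
```

On a table with millions of rows, the index on the address column is what keeps the per-row lookup from becoming a full scan each time.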