Pandas loc() function - pandas

I am trying to slice item from CSV here is example [enter image description here][1]
df1 = pandas.read_csv("supermarkets.csv")
df1
ID Address City State Country Name Employees
0 1 3666 21st St San Francisco CA 94114 USA Madeira 8
1 2 735 Dolores St San Francisco CA 94119 USA Bready Shop 15
2 3 332 Hill St San Francisco California 94114 USA Super River 25
3 4 3995 23rd St San Francisco CA 94114 USA Ben's Shop 10
4 5 1056 Sanchez St San Francisco California USA Sanchez 12
5 6 551 Alvarado St San Francisco CA 94114 USA Richvalley 20
df2 = df1.loc["735 Dolores St":"332 Hill St","City":"Country"]
df2
In output I am only getting this output
City State Country
How do I correct?

As you can read in pandas documentation .loc[] can access a group of rows and columns by label(s) or a boolean array.
You cannot directly select using the values in the Series.
In your example df1.loc["735 Dolores St":"332 Hill St","City":"Country"] you are getting an empty selection because only "City":"Country" is a valid accessor.
"735 Dolores St":"332 Hill St" will return an empty row selection as they are not labels on the index.
You can try selecting by index with .iloc[[1,2], "City":"Country"] if you want specific rows.

df.loc is primarily label based and commonly slices the rows using an index. In this case, you can use the numeric index or set address as index
print(df)
ID Address City State Country Name Employees
0 1 3666 21st San Francisco CA 94114 USA Madeira 8
1 2 735 Dolores San Francisco CA 94114 USA Bready Shop 15
2 3 332 Hill San Francisco CA 94114 USA Super River 25
df2=df.loc[1:2,'City':'Country']
print(df2)
City State Country
1 San Francisco CA 94114 USA
2 San Francisco CA 94114 USA
Or
df2=df.set_index('Address').loc['735 Dolores':'332 Hill','City':'Country']
print(df2)
City State Country
Address
735 Dolores San Francisco CA 94114 USA
332 Hill San Francisco CA 94114 USA

Related

Python Pandas: add missing row of two dataframe and keep the extra columns [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Error: pandas hashtable keyerror
(3 answers)
Closed 2 years ago.
I would like to add the missing row dataframe df1 and keep the extra columns information
In [183]: df1
Out[183]:
City Country Region
0 Chicago US N.America
1 San Franciso US N.America
2 Boston US N.America
3 London UK Europe
4 Beijing China Asia
5 Omaha US N.America
In [183]: df2
Out[183]:
City
0 Chicago
1 San Franciso
2 Sao Paulo
3 Boston
4 London
5 Beijing
6 Tokyo
7 Omaha
The desired result after the merge is
City Country Region
0 Chicago US N.America
1 San Franciso US N.America
2 Sao Paulo nan nan
3 Boston US N.America
4 London UK Europe
5 Beijing China Asia
6 Tokyo nan nan
7 Omaha US N.America
I am trying with pd.merge(df2, df1, on='City', how='outer') but return keyerror.
Try the code below, using pd.merge, left_join, your desired output:
merged = pd.merge(df2,df1,how='left',on='City')
print(merged)
City Country Region
0 Chicago US N.America
1 San Fransicsco NaN NaN
2 Sao Paolo NaN NaN
3 Boston US N.America
4 London UK Europe
5 Beijing China Asia
6 Tokyo NaN NaN
7 Omaha US N.America
If you want to use an outer join, you can get this result using the below code:
merged_outer = pd.merge(df2, df1, on='City', how='outer')
print(merged_outer)
City Country Region
0 Chicago US N.America
1 San Fransicsco NaN NaN
2 Sao Paolo NaN NaN
3 Boston US N.America
4 London UK Europe
5 Beijing China Asia
6 Tokyo NaN NaN
7 Omaha US N.America
8 San Franciso US N.America
DF1 & DF2 respectively:
df1
City Country Region
0 Chicago US N.America
1 San Franciso US N.America
2 Boston US N.America
3 London UK Europe
4 Beijing China Asia
5 Omaha US N.America
df2
City
0 Chicago
1 San Fransicsco
2 Sao Paolo
3 Boston
4 London
5 Beijing
6 Tokyo
7 Omaha

Count number of times each item in list occurs in a pandas dataframe column with comma separates vales

I have a list :
citylist = ['New York', 'San Francisco', 'Los Angeles', 'Chicago', 'Miami']
and a pandas Dataframe df1 with these values
first last city email
John Travis New York a#email.com
Jim Perterson San Franciso, Los Angeles b#email.com
Nancy Travis Chicago b1#email.com
Jake Templeton Los Angeles b3#email.com
John Myers New York b4#email.com
Peter Johnson San Franciso, Chicago b5#email.com
Aby Peters Los Angeles b6#email.com
Amy Thomas San Franciso b7#email.com
Jessica Thompson Los Angeles, Chicago, New York b8#email.com
I want to count the number of times each city from citylist occurs in the dataframe column 'city':
New York 3
San Francisco 3
Los Angeles 4
Chicago 3
Miami 0
Currently I have
dftest = df1.groupby(by='city', as_index=False).agg({'id': pd.Series.nunique})
and it ends counting "Los Angeles, Chicago, New York" as 1 unique value
Is there any way to get counts as I have show above?
Thanks
Try this:
Fix data first:
df1['city'] = df1['city'].str.replace('Franciso', 'Francisco')
Use this:
(df1['city'].str.split(', ')
.explode()
.value_counts(sort=False)
.reindex(citylist, fill_value=0))
Output:
New York 3
San Francisco 3
Los Angeles 4
Chicago 3
Miami 0
Name: city, dtype: int64
You can use Series.str.count:
pd.Series([df['city'].str.count(c).sum() for c in citylist], index=citylist)
Another more efficient approach as suggested by #ScottBoston
pd.Series({c:sum(c in i for i in df['city']) for c in citylist})
New York 3
San Francisco 0
Los Angeles 4
Chicago 3
Miami 0
dtype: int64

How do I drop a row from this data frame?

ID Address City State Country Name Employees
0 1 3666 21st St San Francisco CA 94114 USA Madeira 8
1 2 735 Dolores St San Francisco CA 94119 USA Bready Shop 15
2 3 332 Hill St San Francisco Cal USA Super River 25
3 4 3995 23rd St San Francisco CA 94114 USA Ben's Shop 10
4 5 1056 Sanchez St San Francisco California USA Sanchez 12
5 6 551 Alvarado St San Francisco CA 94114 USA Richvalley 20
df=df.drop(['3666 21st St'], axis=1, inplace=True)
I am using this code and still, it's showing an error stating that :
KeyError: "['3666 21st St'] not found in axis"
Can anyone help me solve this?
The drop method only works on the index or column names. There are 2 ways to do what you want:
Make the Address column the index, then drop the value(s) you want to drop. You should use axis=0 for this, and not axis=1. The default is axis=0. Do not use inplace=True if you are assigning the output.
Use a Boolean filter instead of drop.
The 1st method is preferred if the Address values are all distinct. The index of a data frame is effectively a sequence of row labels, so it doesn't make much sense to have duplicate row labels:
df.set_index('Address', inplace=True)
df.drop(['3666 21st St'], inplace=True)
The 2nd method is therefore preferred if the Address column is not distinct:
is_bad_address = df['Address'] == '3666 21st St'
# Alternative if you have multiple bad addresses:
# is_bad_address = df['Address'].isin(['366 21st St'])
df = df.loc[~is_bad_address]
You need to consult the Pandas documentation for the correct usage of the axis= and inplace= keyword arguments. You are using both of them incorrectly. DO NOT COPY AND PASTE CODE WITHOUT UNDERSTANDING HOW IT WORKS.

Object Model to identify customers with similar address

Let's say I have a table of customers and each customer has an address. My task is to design an object model that allows to group the customers by similar address. Example:
John 123 Main St, #A; Los Angeles, CA 90032
Jane 92 N. Portland Ave, #1; Pasadena, CA 91107
Peter 92 N. Portland Avenue, #2; Pasadena, CA 91107
Lester 92 N Portland Av #4; Pasadena, CA 91107
Mark 123 Main Street, #C; Los Angeles, CA 90032
The query should somehow return:
1 Similar_Address_Key1
5 Similar_Address_Key1
2 Similar_Address_key2
3 Similar_Address_key2
4 Similar_Address_key2
What is the best way to accomplish this? Notice the addresses are NOT consistent (some address have "Avenue" others have "Av" and the apartment numbers are different). The existing data of names/address cannot be corrected so doing a GROUP BY (Address) on the table itself is out of the question.
I was thinking to add a SIMILAR_ADDRESSES table that takes an address, evaluates it and gives it a key, so something like:
cust_key address similar_addr_key
1 123 Main St, #A; Los Angeles, CA 90032 1
2 92 N. Portland Ave, #1; Pasadena, CA 91107 2
3 92 N. Portland Avenue, #2; Pasadena, CA 91107 2
4 92 N. Portland Av #4; Pasadena, CA 91107 2
5 123 Main Street, #C; Los Angeles, CA 90032 1
Then group by the similar address key. But the question is how to best accomplish the "evaluation" part. One way would be to modify the address in the SIMILAR_ADDRESSES table so that they are consistent and ignoring things like apt, #, or suite and assign a "key" to each exact match. Another different approach I thought about was to feed the address to a Geolocator service and save the latitude/longitude values to a table and use these values to generate a similar address key.
Any ideas?

How do I generate a crosswalk ID between two SQL tables

I have a SQL table consisting of names, addresses and some associated numerical data paired with a code. The table is structured such that each number-code pair has its own row with redundant address info. abbreviated version below, let's call it tblPeopleData
Name Address ArbitraryCode ArbitraryData
----------------------------------------------------------------------------
John Adams 45 Main St, Rochester NY a 111
John Adams 45 Main St, Rochester NY a 231
John Adams 45 Main St, Rochester NY a 123
John Adams 45 Main St, Rochester NY b 111
John Adams 45 Main St, Rochester NY c 111
John Adams 45 Main St, Rochester NY d 123
John Adams 45 Main St, Rochester NY d 124
Jane McArthur 12 1st Ave, Chicago IL a 111
Jane McArthur 12 1st Ave, Chicago IL a 231
Jane McArthur 12 1st Ave, Chicago IL a 123
Jane McArthur 12 1st Ave, Chicago IL b 111
Jane McArthur 12 1st Ave, Chicago IL c 111
Jane McArthur 12 1st Ave, Chicago IL e 123
Jane McArthur 12 1st Ave, Chicago IL e 124
My problem is that this table is absolutely massive (~10 million rows) and I'm trying to split it up to make traversal less staggeringly sluggish.
What I've done so far is to make a table of just addresses, using something like:
SELECT DISTINCT Address FROM tblPeopleData (etc.)
Leaving me with:
Name Address
------------------------------------------
John Adams 45 Main St, Rochester NY
Jane McArthur 12 1st Ave, Chicago IL
...just a list of addresses. I want to be able to look up each address and see which names reside at that address, so I assigned each address a UniqueID, such that now I have (this table is around ~500,000 rows in my dataset):
Name Address AddressID
--------------------------------------------------------
John Adams 45 Main St, Rochester NY 000001
Jane McArthur 12 1st Ave, Chicago IL 000002
In order to be able to look up people by address though, I need this AddressID field added to tblPeopleData, such that each address in tblPeopleData is associated with its AddressID and this is added to every row, such that I would have:
Name Address ArbitraryCode ArbitraryData AddressID
----------------------------------------------------------------------------------------
John Adams 45 Main St, Rochester NY a 111 00001
John Adams 45 Main St, Rochester NY a 231 00001
John Adams 45 Main St, Rochester NY a 123 00001
John Adams 45 Main St, Rochester NY b 111 00001
John Adams 45 Main St, Rochester NY c 111 00001
John Adams 45 Main St, Rochester NY d 123 00001
John Adams 45 Main St, Rochester NY d 124 00001
Jane McArthur 12 1st Ave, Chicago IL a 111 00002
Jane McArthur 12 1st Ave, Chicago IL a 231 00002
Jane McArthur 12 1st Ave, Chicago IL a 123 00002
Jane McArthur 12 1st Ave, Chicago IL b 111 00002
Jane McArthur 12 1st Ave, Chicago IL c 111 00002
Jane McArthur 12 1st Ave, Chicago IL e 123 00002
Jane McArthur 12 1st Ave, Chicago IL e 124 00002
How do I make this jump from having UniqueIDs for AddressID in my unique addresses table, to adding these all to each row with a corresponding address back in my tbPeopleData?
Just backfill the calculated AddressID back to tblPeopleData - you can combine an UPDATE with a FROM (like you would do in a select)
UPDATE tblPeopleData
SET AddressID = a.AddressID
FROM tblPeopleData pd
INNER JOIN tblAddressData a
ON pd.Address = a.Address
You would alter the table to have the address id:
alter table tblPeopleData add AddressId int references Address(AddressId);
Then you can update the value using a JOIN:
update tblPeopleData pd JOIN
Address a
ON pd.Address = a.Address
pd.AddressId = a.AddressId;
You will definitely want an index on Address(Address) for this.
Then, you can drop the old column:
alter table drop column Address;
Note:
It might be faster to save the results in a temporary table, because the update is going to generate lots and lots of log records. For this, truncate the original table, and re-load the data:
SELECT . . . , a.AddressId
INTO tmp_tblPeopleData
FROM tblPeopleData pd JOIN
Address a
ON pd.Address = a.Address;
TRUNCATE TABLE tblPeopleData;
INSERT INTO tblPeopleData( . . .)
SELECT . . .
FROM tmp_tblPeopleData;