How do I drop a row from this data frame? - pandas

ID Address City State Country Name Employees
0 1 3666 21st St San Francisco CA 94114 USA Madeira 8
1 2 735 Dolores St San Francisco CA 94119 USA Bready Shop 15
2 3 332 Hill St San Francisco Cal USA Super River 25
3 4 3995 23rd St San Francisco CA 94114 USA Ben's Shop 10
4 5 1056 Sanchez St San Francisco California USA Sanchez 12
5 6 551 Alvarado St San Francisco CA 94114 USA Richvalley 20
df=df.drop(['3666 21st St'], axis=1, inplace=True)
I am using this code, but it still shows an error stating:
KeyError: "['3666 21st St'] not found in axis"
Can anyone help me solve this?

The drop method only works on the index or column names. There are 2 ways to do what you want:
1. Make the Address column the index, then drop the value(s) you want to drop. You should use axis=0 for this, not axis=1 (axis=0 is the default). Do not use inplace=True if you are assigning the output.
2. Use a Boolean filter instead of drop.
The 1st method is preferred if the Address values are all distinct. The index of a data frame is effectively a sequence of row labels, so it doesn't make much sense to have duplicate row labels:
df.set_index('Address', inplace=True)
df.drop(['3666 21st St'], inplace=True)
The 2nd method is therefore preferred if the Address column is not distinct:
is_bad_address = df['Address'] == '3666 21st St'
# Alternative if you have multiple bad addresses:
# is_bad_address = df['Address'].isin(['3666 21st St'])
df = df.loc[~is_bad_address]
You need to consult the Pandas documentation for the correct usage of the axis= and inplace= keyword arguments. You are using both of them incorrectly. DO NOT COPY AND PASTE CODE WITHOUT UNDERSTANDING HOW IT WORKS.
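Putting both methods together on the sample data (a minimal sketch; the DataFrame construction below is only a partial reconstruction of the table above, not your actual CSV):
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5, 6],
    'Address': ['3666 21st St', '735 Dolores St', '332 Hill St',
                '3995 23rd St', '1056 Sanchez St', '551 Alvarado St'],
    'Name': ['Madeira', 'Bready Shop', 'Super River',
             "Ben's Shop", 'Sanchez', 'Richvalley'],
    'Employees': [8, 15, 25, 10, 12, 20],
})

# Method 1: make Address the index, then drop by row label (axis=0 is the default)
by_label = df.set_index('Address').drop(['3666 21st St'])

# Method 2: keep the default index and filter with a Boolean mask
by_mask = df.loc[df['Address'] != '3666 21st St']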

Related

Count number of times each item in a list occurs in a pandas dataframe column with comma-separated values

I have a list:
citylist = ['New York', 'San Francisco', 'Los Angeles', 'Chicago', 'Miami']
and a pandas DataFrame df1 with these values:
first last city email
John Travis New York a#email.com
Jim Perterson San Franciso, Los Angeles b#email.com
Nancy Travis Chicago b1#email.com
Jake Templeton Los Angeles b3#email.com
John Myers New York b4#email.com
Peter Johnson San Franciso, Chicago b5#email.com
Aby Peters Los Angeles b6#email.com
Amy Thomas San Franciso b7#email.com
Jessica Thompson Los Angeles, Chicago, New York b8#email.com
I want to count the number of times each city from citylist occurs in the dataframe column 'city':
New York 3
San Francisco 3
Los Angeles 4
Chicago 3
Miami 0
Currently I have
dftest = df1.groupby(by='city', as_index=False).agg({'id': pd.Series.nunique})
and it ends up counting "Los Angeles, Chicago, New York" as 1 unique value.
Is there any way to get the counts as I have shown above?
Thanks
Try this:
Fix data first:
df1['city'] = df1['city'].str.replace('Franciso', 'Francisco')
Use this:
(df1['city'].str.split(', ')
     .explode()
     .value_counts(sort=False)
     .reindex(citylist, fill_value=0))
Output:
New York 3
San Francisco 3
Los Angeles 4
Chicago 3
Miami 0
Name: city, dtype: int64
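For reference, the two steps above combined into a runnable sketch (the DataFrame here is a hypothetical reconstruction of the sample data, keeping only the city column):
import pandas as pd

citylist = ['New York', 'San Francisco', 'Los Angeles', 'Chicago', 'Miami']
df1 = pd.DataFrame({'city': ['New York', 'San Franciso, Los Angeles', 'Chicago',
                             'Los Angeles', 'New York', 'San Franciso, Chicago',
                             'Los Angeles', 'San Franciso',
                             'Los Angeles, Chicago, New York']})

# Fix the misspelling, then split, explode, and count
df1['city'] = df1['city'].str.replace('Franciso', 'Francisco')
counts = (df1['city'].str.split(', ')
              .explode()
              .value_counts(sort=False)
              .reindex(citylist, fill_value=0))
print(counts)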
You can use Series.str.count:
pd.Series([df1['city'].str.count(c).sum() for c in citylist], index=citylist)
Another, more efficient approach, as suggested by @ScottBoston:
pd.Series({c: sum(c in i for i in df1['city']) for c in citylist})
New York 3
San Francisco 0
Los Angeles 4
Chicago 3
Miami 0
dtype: int64

Pandas loc() function

I am trying to slice items from a CSV file; here is an example:
df1 = pandas.read_csv("supermarkets.csv")
df1
ID Address City State Country Name Employees
0 1 3666 21st St San Francisco CA 94114 USA Madeira 8
1 2 735 Dolores St San Francisco CA 94119 USA Bready Shop 15
2 3 332 Hill St San Francisco California 94114 USA Super River 25
3 4 3995 23rd St San Francisco CA 94114 USA Ben's Shop 10
4 5 1056 Sanchez St San Francisco California USA Sanchez 12
5 6 551 Alvarado St San Francisco CA 94114 USA Richvalley 20
df2 = df1.loc["735 Dolores St":"332 Hill St","City":"Country"]
df2
In the output I am only getting this:
City State Country
How do I correct this?
As you can read in the pandas documentation, .loc[] can access a group of rows and columns by label(s) or a boolean array.
You cannot directly select using the values in the Series.
In your example df1.loc["735 Dolores St":"332 Hill St","City":"Country"] you are getting an empty selection because only "City":"Country" is a valid accessor.
"735 Dolores St":"332 Hill St" will return an empty row selection as they are not labels on the index.
You can try selecting specific rows by their integer index labels with .loc[[1, 2], "City":"Country"], or purely by position with .iloc[[1, 2], 2:5].
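If you do want to select rows by the Address values themselves without changing the index, a Boolean mask also works (a sketch, assuming df1 as loaded above):
mask = df1['Address'].isin(['735 Dolores St', '332 Hill St'])
df2 = df1.loc[mask, 'City':'Country']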
.loc is primarily label based and commonly slices rows using the index. In this case, you can use the numeric index or set Address as the index:
print(df)
ID Address City State Country Name Employees
0 1 3666 21st San Francisco CA 94114 USA Madeira 8
1 2 735 Dolores San Francisco CA 94114 USA Bready Shop 15
2 3 332 Hill San Francisco CA 94114 USA Super River 25
df2=df.loc[1:2,'City':'Country']
print(df2)
City State Country
1 San Francisco CA 94114 USA
2 San Francisco CA 94114 USA
Or
df2=df.set_index('Address').loc['735 Dolores':'332 Hill','City':'Country']
print(df2)
City State Country
Address
735 Dolores San Francisco CA 94114 USA
332 Hill San Francisco CA 94114 USA

Modified list input when a character value has embedded blanks

I am preparing for the SAS Base test. In the test book, chapter 17, Reading Free-Format Data, there is an example about how to read character values with embedded blanks and nonstandard values, such as numbers with commas. I tested it, and the result is not what the book describes.
data cityrank;
infile datalines;
input rank city & $12. pop86: comma.;
datalines;
1 NEW YORK 7,262,700
2 LOS ANGELES 3,259,340
3 CHICAGO 3,009,530
4 HOUSTON 1,728,910
5 PHILADELPHIA 1,642,900
6 DETROIT 1,086,220
7 DAN DIEGO 1,015,190
8 DALLAS 1,003,520
9 SAN ANTONIA 914,350
;
What I got is shown below; the data set has 4 obs.
rank city pop86
1 NEW YORK 7,2 2
3 CHICAGO 3,00 4
5 PHILADELPHIA 6
7 DAN DIEGO 1, 8
Did I make a typo somewhere in the program? I have checked again and again that I copied it correctly.
How do I modify this program?
Thank you!
I'm guessing from the typos that you didn't copy-paste this, but you typed it in instead.
As such, you (or the book's authors) made another typo: there should be two spaces after the city names, not one. That is what the & modifier does: it tells SAS to keep reading until it finds two consecutive delimiters, so a single embedded blank is ignored and "NEW YORK" is read into one variable instead of being split.
So this would be correct:
data cityrank;
infile datalines;
input rank city & $12. pop86: comma.;
datalines;
1 NEW YORK  7,262,700
2 LOS ANGELES  3,259,340
3 CHICAGO  3,009,530
4 HOUSTON  1,728,910
5 PHILADELPHIA  1,642,900
6 DETROIT  1,086,220
7 SAN DIEGO  1,015,190
8 DALLAS  1,003,520
9 SAN ANTONIO  914,350
;
run;

Object Model to identify customers with similar address

Let's say I have a table of customers and each customer has an address. My task is to design an object model that allows grouping the customers by similar address. Example:
John 123 Main St, #A; Los Angeles, CA 90032
Jane 92 N. Portland Ave, #1; Pasadena, CA 91107
Peter 92 N. Portland Avenue, #2; Pasadena, CA 91107
Lester 92 N Portland Av #4; Pasadena, CA 91107
Mark 123 Main Street, #C; Los Angeles, CA 90032
The query should somehow return:
1 Similar_Address_Key1
5 Similar_Address_Key1
2 Similar_Address_key2
3 Similar_Address_key2
4 Similar_Address_key2
What is the best way to accomplish this? Notice the addresses are NOT consistent (some addresses have "Avenue" while others have "Av", and the apartment numbers differ). The existing name/address data cannot be corrected, so doing a GROUP BY (Address) on the table itself is out of the question.
I was thinking to add a SIMILAR_ADDRESSES table that takes an address, evaluates it and gives it a key, so something like:
cust_key address similar_addr_key
1 123 Main St, #A; Los Angeles, CA 90032 1
2 92 N. Portland Ave, #1; Pasadena, CA 91107 2
3 92 N. Portland Avenue, #2; Pasadena, CA 91107 2
4 92 N. Portland Av #4; Pasadena, CA 91107 2
5 123 Main Street, #C; Los Angeles, CA 90032 1
Then group by the similar address key. But the question is how to best accomplish the "evaluation" part. One way would be to modify the addresses in the SIMILAR_ADDRESSES table so that they are consistent (ignoring things like apt, #, or suite) and assign a "key" to each exact match. Another approach I thought about was to feed each address to a geolocation service, save the latitude/longitude values to a table, and use those values to generate a similar address key.
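Roughly what I have in mind for the "evaluation" step (a hedged Python sketch; the normalization rules, the SUFFIXES map, and normalize_address are just illustrative assumptions, not a complete solution):
import re

# Hypothetical suffix map; extend as needed
SUFFIXES = {'street': 'st', 'avenue': 'ave', 'av': 'ave'}

def normalize_address(addr):
    addr = addr.lower()
    addr = re.sub(r'#\s*\w+|\b(apt|suite|unit)\b\s*\w*', ' ', addr)  # drop unit info
    addr = re.sub(r'[.,;]', ' ', addr)                               # drop punctuation
    return ' '.join(SUFFIXES.get(w, w) for w in addr.split())

# Addresses that normalize to the same string get the same similar_addr_key
keys = {}
for cust_key, addr in [(1, '123 Main St, #A; Los Angeles, CA 90032'),
                       (5, '123 Main Street, #C; Los Angeles, CA 90032')]:
    similar_key = keys.setdefault(normalize_address(addr), len(keys) + 1)
    print(cust_key, similar_key)  # both rows map to key 1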
Any ideas?

How do I Sum a total based on Grouping

I've got data (which changes every time) in 2 columns - basically state and number. This is an example:
Example Data
State Total
Connecticut 624
Georgia 818
Washington 10
Arkansas 60
New Jersey 118
Ohio 2,797
N. Carolina 336
Illinois 168
California 186
Utah 69
Texas 183
Minnesota 172
Kansas 945
Florida 113
Arizona 1,430
S. Dakota 293
Puerto Rico 184
Each state needs to be grouped. The groupings are as follows:
Groupings
**US Group 1**
California
District of Columbia
Florida
Hawaii
Illinois
Michigan
Nevada
New York
Pennsylvania
Texas
**US Group 3**
Iowa
Idaho
Kansas
Maine
Missouri
Montana
North Dakota
Nebraska
New Hampshire
South Dakota
Utah
Wyoming
Every other state belongs in US Group 2.
What I am trying to do is sum a total for each group. So in this example I would have totals of:
Totals
650 in Group 1 (4 states)
6365 in Group 2 (9 states)
1307 in Group 3 (3 states)
What I would like to do, each time I get a new spreadsheet with this data, is avoid creating IF/COUNTIF/SUMIF formulas from scratch every time. I figure it would be much more efficient to select my data and run a macro which does that (possibly checking against some legend or lookup table).
Can anyone point me in the right direction? I have been banging my head against the VBA editor for 2 days now...
Here is one way.
Step 1: Create a named range for each of your groups.
Step 2: Try this formula: =SUMPRODUCT(SUMIF(A2:A18,Group1,B2:B18))
Formula Breakdown:
A2:A18 contains the state names
Group1 is the named range that contains each of the states in group 1
B2:B18 contains the values you want to sum.
It's important that your state names and the values you want summed cover the same number of rows. You should also standardize your state names: having S. Dakota in your data and South Dakota in your named range won't work. Either add the different variations of the state names to your list, or standardize your data coming in.
To get a clear visual of what the formula is doing, use the Evaluate Formula button on the Formulas Tab, it will be much better than me trying to explain it.
EDIT
Try this formula for summing up values that are not in Group1 or Group3:
=SUMPRODUCT(--(NOT(ISNUMBER(MATCH(A2:A18,Group1,0)))),--(NOT(ISNUMBER(MATCH(A2:A18,Group3,0)))),B2:B18)
This seemed to work on my end. Basically it works by only summing values in B2:B18 where both MATCH functions return N/A (meaning the state is not in either defined group list).
Use a VLOOKUP with a mapping of your states to groups. Then, based on the group number, add the value if it's found, or add 0.