Getting KeyError: 'BasePay' for the BasePay column even though it is present in the DataFrame; it goes missing when using the mean() function.
My pandas version is 0.23.3, on Python 3.6.3.
>>> import pandas as pd
>>> import numpy as np
>>> salDataF = pd.read_csv('Salaries.csv', low_memory=False)
>>> salDataF.head()
Id EmployeeName JobTitle BasePay OvertimePay OtherPay ... TotalPay TotalPayBenefits Year Notes Agency Status
0 1 NATHANIEL FORD GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY 167411.18 0.0 400184.25 ... 567595.43 567595.43 2011 NaN San Francisco NaN
1 2 GARY JIMENEZ CAPTAIN III (POLICE DEPARTMENT) 155966.02 245131.88 137811.38 ... 538909.28 538909.28 2011 NaN San Francisco NaN
2 3 ALBERT PARDINI CAPTAIN III (POLICE DEPARTMENT) 212739.13 106088.18 16452.6 ... 335279.91 335279.91 2011 NaN San Francisco NaN
3 4 CHRISTOPHER CHONG WIRE ROPE CABLE MAINTENANCE MECHANIC 77916.0 56120.71 198306.9 ... 332343.61 332343.61 2011 NaN San Francisco NaN
4 5 PATRICK GARDNER DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT) 134401.6 9737.0 182234.59 ... 326373.19 326373.19 2011 NaN San Francisco NaN
[5 rows x 13 columns]
>>> salDataF.groupby('Year').mean()
Id TotalPay TotalPayBenefits Notes
Year
2011 18080.0 71744.103871 71744.103871 NaN
2012 54542.5 74113.262265 100553.229232 NaN
2013 91728.5 77611.443142 101440.519714 NaN
2014 129593.0 75463.918140 100250.918884 NaN
>>> EmpSal = salDataF.groupby('Year').mean()['BasePay']
KeyError: 'BasePay'
Here is the problem: BasePay is not numeric, so salDataF.groupby('Year').mean() excludes all non-numeric columns by design.
The solution is to first try astype:
salDataF['BasePay'] = salDataF['BasePay'].astype(float)
...and if that fails because some values are non-numeric, use to_numeric with errors='coerce' to convert them to NaN:
salDataF['BasePay'] = pd.to_numeric(salDataF['BasePay'], errors='coerce')
Then it is better to select the column before calling mean:
EmpSal = salDataF.groupby('Year')['BasePay'].mean()
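To see the behaviour in isolation, here is a minimal sketch with made-up data standing in for Salaries.csv (note that older pandas, like 0.23 here, silently drops object columns from mean(), while recent versions raise unless numeric_only=True):
import pandas as pd

# Hypothetical mini-frame; the 'Not Provided' string makes the column dtype object
demo = pd.DataFrame({'Year': [2011, 2011, 2012],
                     'BasePay': ['100.0', '200.0', 'Not Provided']})

# astype(float) would raise ValueError on 'Not Provided', so coerce instead
demo['BasePay'] = pd.to_numeric(demo['BasePay'], errors='coerce')

# Select the column before aggregating
print(demo.groupby('Year')['BasePay'].mean())
# Year
# 2011    150.0
# 2012      NaN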
The following is a sample from my data frame:
import pandas as pd
import numpy as np
d = ['SKODASUPERB', 'SKODASUPERB', 'SKODASUPERB', 'MERCEDES-BENZE CLASS',
     'ASTON MARTINVIRAGE', 'ASTON MARTINVIRAGE', 'ASTON MARTINVIRAGE',
     'TOYOTAHIACE', 'TOYOTAAVENSIS', 'TOYOTAAVENSIS', 'TOYOTAAVENSIS',
     'TOYOTAAVENSIS', 'FERRARI360', 'FERRARILAFERRARI']
x = ['SUV', 'SUV', 'nan', 'nan', 'SPORTS', 'SPORTS', 'SPORTS', 'nan',
     'SEDAN', 'SEDAN', 'SEDAN', 'SEDAN', 'SPORT', 'SPORT']
df = pd.DataFrame({'make_model': d, 'body': x})
df.body = df.body.replace('nan', np.NaN)
df.head()
Out[24]:
make_model body
0 SKODASUPERB SUV
1 SKODASUPERB SUV
2 SKODASUPERB NaN
3 MERCEDES-BENZE CLASS NaN
4 ASTON MARTINVIRAGE SPORTS
There are some null values in the 'body' column
df.body.isnull().sum()
Out[25]: 3
So I am trying to fill the null values in the body column using the mode of the body type for a particular make_model. For instance, 2 observations of SKODASUPERB have body 'SUV' and 1 observation has a null body. The mode of body for SKODASUPERB would therefore be 'SUV', and I want 'SUV' to be filled in for the third observation too. For this I am using the following code:
make_model_list = df.make_model.unique().tolist()
for x in make_model_list:
    df.loc[(df['make_model'] == x) & (df['body'].isnull()), 'body'] = \
        df.loc[(df['make_model'] == x) & (df['body'].isnull()), 'body'] \
        .fillna(df.loc[df['make_model'] == x, 'body'].mode())
Unfortunately, the loop is breaking, as some observations don't have a mode value:
df.body.isnull().sum()
Out[30]: 3
How can I force the loop to run even if there is no mode body value for a particular make_model? I know that I can use the continue statement, but I am not sure how to write it.
Assuming that make_model and body are distinct values:
# Build a donor table holding the modal body for each make_model (non-null rows only)
donor = df.dropna().groupby(by=['make_model']).agg(pd.Series.mode).reset_index()
df = df.merge(donor, how='left', on=['make_model'])
# Fill the original column (body_x) from the donor column (body_y), then tidy up
df['body_x'].fillna(df.body_y, inplace=True)
df.drop(columns=['body_y'], inplace=True)
df.columns = ['make_model', 'body']
df
make_model body
0 SKODASUPERB SUV
1 SKODASUPERB SUV
2 SKODASUPERB SUV
3 MERCEDES-BENZE CLASS NaN
4 ASTON MARTINVIRAGE SPORTS
5 ASTON MARTINVIRAGE SPORTS
6 ASTON MARTINVIRAGE SPORTS
7 TOYOTAHIACE NaN
8 TOYOTAAVENSIS SEDAN
9 TOYOTAAVENSIS SEDAN
10 TOYOTAAVENSIS SEDAN
11 TOYOTAAVENSIS SEDAN
12 FERRARI360 SPORT
13 FERRARILAFERRARI SPORT
Finally, I have worked out a solution; it was just a matter of using try/except. This solution works perfectly for the purpose of my project and has filled 95% of the missing values. I have slightly changed the data to show that this method is effective:
d = ['SKODASUPERB', 'SKODASUPERB', 'SKODASUPERB', 'MERCEDES-BENZE CLASS',
     'ASTON MARTINVIRAGE', 'ASTON MARTINVIRAGE', 'ASTON MARTINVIRAGE',
     'TOYOTAHIACE', 'TOYOTAAVENSIS', 'TOYOTAAVENSIS', 'TOYOTAAVENSIS',
     'TOYOTAAVENSIS', 'FERRARI360', 'FERRARILAFERRARI']
x = ['SUV', 'SUV', 'nan', 'nan', 'SPORTS', 'SPORTS', 'nan', 'nan',
     'SEDAN', 'SEDAN', 'nan', 'SEDAN', 'SPORT', 'SPORT']
df = pd.DataFrame({'make_model': d, 'body': x})
df.body = df.body.replace('nan', np.NaN)
df
Out[6]:
make_model body
0 SKODASUPERB SUV
1 SKODASUPERB SUV
2 SKODASUPERB NaN
3 MERCEDES-BENZE CLASS NaN
4 ASTON MARTINVIRAGE SPORTS
5 ASTON MARTINVIRAGE SPORTS
6 ASTON MARTINVIRAGE NaN
7 TOYOTAHIACE NaN
8 TOYOTAAVENSIS SEDAN
9 TOYOTAAVENSIS SEDAN
10 TOYOTAAVENSIS NaN
11 TOYOTAAVENSIS SEDAN
12 FERRARI360 SPORT
13 FERRARILAFERRARI SPORT
df.body.isnull().sum()
Out[7]: 5
My Solution
for x in make_model_list:
    try:
        df.loc[(df['make_model'] == x) & (df['body'].isnull()), 'body'] = \
            df.loc[(df['make_model'] == x) & (df['body'].isnull()), 'body'] \
            .fillna(df.loc[df['make_model'] == x, 'body'].value_counts().index[0])
    except IndexError:
        # value_counts() is empty when a make_model has no non-null body at all
        pass
df.body.isnull().sum()
Out[9]: 2 #null values have dropped from 5 to 2.
Those 2 null values couldn't be filled because there was no frequent or mode value for them at all.
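A loop-free alternative (a sketch, not part of the original post): transform each group with its mode when one exists, leaving groups that have no known body untouched.
def fill_with_group_mode(s):
    # mode() returns an empty Series when the group has no non-null values
    m = s.mode()
    return s.fillna(m.iloc[0]) if not m.empty else s

df['body'] = df.groupby('make_model')['body'].transform(fill_with_group_mode)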
Working with a pandas DataFrame: how can I separate each Scope and its value into new Type columns, with the values added in separate rows? Please see the first image.
If all columns are ordered in pairs, it is possible to filter the even and odd columns and recreate the DataFrame:
df1 = pd.DataFrame({'Verified By':df.iloc[:, 1::2].stack(dropna=False).to_numpy(),
'tCO2':df.iloc[:, ::2].stack(dropna=False).to_numpy()})
print (df1)
Verified By tCO2
0 Cventure LLC 12.915
1 Cventure LLC 61.801
2 NaN 78.551
3 NaN 5.712
4 NaN 49.513
5 Cventure LLC 24.063
6 Carbon Trust 679.000
7 NaN 4.445
8 Cventure LLC 56290.000
Another idea is to split the column names by the first 2 spaces and reshape with DataFrame.stack:
df.columns = df.columns.str.split(n=2, expand=True)
df1 = df.stack([0,1]).droplevel(0)
df1.index = df1.index.map(lambda x: f'{x[0]} {x[1]}')
df1 = df1.rename_axis('Scope').reset_index()
print (df1)
Scope Verified By tCO2
0 Scope 1 Cventure LLC 12.915
1 Scope 2 Cventure LLC 61.801
2 Scope 3 NaN 78.551
3 Scope 1 NaN 5.712
4 Scope 2 NaN 49.513
5 Scope 3 Cventure LLC 24.063
6 Scope 1 Carbon Trust 679.000
7 Scope 2 NaN 4.445
8 Scope 3 Cventure LLC 56290.000
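For reference, here is a hypothetical input frame consistent with both outputs above (the real data appears only in the question's image, so the column names and values here are reconstructed, not original):
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[12.915, 'Cventure LLC', 61.801, 'Cventure LLC', 78.551, np.nan],
     [5.712, np.nan, 49.513, np.nan, 24.063, 'Cventure LLC'],
     [679.0, 'Carbon Trust', 4.445, np.nan, 56290.0, 'Cventure LLC']],
    columns=['Scope 1 tCO2', 'Scope 1 Verified By',
             'Scope 2 tCO2', 'Scope 2 Verified By',
             'Scope 3 tCO2', 'Scope 3 Verified By'])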
I have found a .txt file with the names of more than 5000 cities around the world. The link is here. The text within is all messy. I would like, in Python, to read the file and store it in a list, so that I can search for the name of a city whenever I want.
I tried loading it as a dataframe with
import pandas as pd
cities = pd.read_csv('cities15000.txt',error_bad_lines=False)
However, everything looks very messy.
Is there an easier way to achieve this?
Thanks in advance!
The linked file is like a CSV (Comma Separated Values), but instead of commas it uses tabs as the field separator. Set the sep parameter of the pd.read_csv function to '\t', i.e. the tab character, and pass header=None, since the file has no header row.
In [18]: import pandas as pd
...:
...: pd.read_csv('cities15000.txt', sep = '\t', header = None)
Out[18]:
0 1 2 3 4 5 ... 13 14 15 16 17 18
0 3040051 les Escaldes les Escaldes Ehskal'des-Ehndzhordani,Escaldes,Escaldes-Engo... 42.50729 1.53414 ... NaN 15853 NaN 1033 Europe/Andorra 2008-10-15
1 3041563 Andorra la Vella Andorra la Vella ALV,Ando-la-Vyey,Andora,Andora la Vela,Andora ... 42.50779 1.52109 ... NaN 20430 NaN 1037 Europe/Andorra 2020-03-03
2 290594 Umm Al Quwain City Umm Al Quwain City Oumm al Qaiwain,Oumm al Qaïwaïn,Um al Kawain,U... 25.56473 55.55517 ... NaN 62747 NaN 2 Asia/Dubai 2019-10-24
3 291074 Ras Al Khaimah City Ras Al Khaimah City Julfa,Khaimah,RAK City,RKT,Ra's al Khaymah,Ra'... 25.78953 55.94320 ... NaN 351943 NaN 2 Asia/Dubai 2019-09-09
4 291580 Zayed City Zayed City Bid' Zayed,Bid’ Zayed,Madinat Za'id,Madinat Za... 23.65416 53.70522 ... NaN 63482 NaN 124 Asia/Dubai 2019-10-24
... ... ... ... ... ... ... ... ... ... .. ... ... ...
24563 894701 Bulawayo Bulawayo BUQ,Bulavajas,Bulavajo,Bulavejo,Bulawayo,bu la... -20.15000 28.58333 ... NaN 699385 NaN 1348 Africa/Harare 2019-09-05
24564 895061 Bindura Bindura Bindura,Bindura Town,Kimberley Reefs,Биндура -17.30192 31.33056 ... NaN 37423 NaN 1118 Africa/Harare 2010-08-03
24565 895269 Beitbridge Beitbridge Bajtbridz,Bajtbridzh,Beitbridge,Beitbridzas,Be... -22.21667 30.00000 ... NaN 26459 NaN 461 Africa/Harare 2013-03-12
24566 1085510 Epworth Epworth Epworth -17.89000 31.14750 ... NaN 123250 NaN 1508 Africa/Harare 2012-01-19
24567 1106542 Chitungwiza Chitungwiza Chitungviza,Chitungwiza,Chytungviza,Citungviza... -18.01274 31.07555 ... NaN 340360 NaN 1435 Africa/Harare 2019-09-05
[24568 rows x 19 columns]
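To get just the city names as a Python list (assuming the GeoNames layout shown above, where column 1 holds the plain city name):
import pandas as pd

cities = pd.read_csv('cities15000.txt', sep='\t', header=None)
city_names = cities[1].tolist()  # column 1 is the city name
print('Bulawayo' in city_names)  # True for the sample shown above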
I have a couple of data frames. I want to use data from 2 columns of the first data frame to mark the rows that are present in the second data frame.
The first data frame (df1) looks like this:
Sup4 Seats Primary Seats Back up Seats
Pa 3 2 1
Ka 2 1 1
Ga 1 0 1
Gee 1 1 0
Re 2 2 0
The second data frame (df2) looks like this:
Sup4 First Last Primary Seats Backup Seats Rating
Pa Peter He NaN NaN 2.3
Ka Sonia Du NaN NaN 2.99
Ga Agnes Bla NaN NaN 3.24
Gee Jeffery Rus NaN NaN 3.5
Gee John Cro NaN NaN 1.3
Pa Pavol Rac NaN NaN 1.99
Pa Ciara Lee NaN NaN 1.88
Re David Wool NaN NaN 2.34
Re Stefan Rot NaN NaN 2
Re Franc Bor NaN NaN 1.34
Ka Tania Le NaN NaN 2.35
The output I require is, for each Sup4 name, grouped and sorted by Rating from highest to lowest, with the seat columns marked based on the df1 columns Primary Seats and Backup Seats.
I did the grouping and sorting for the first Sup4 name, Pa, as a sample, and I have to do it for all the names:
Sup4  First  Last  Primary Seats  Backup Seats  Rating
Pa    Peter  He    M                            2.3
Pa    Pavol  Rac   M                            1.99
Pa    Ciara  Lee                  M             1.88
Ka    Sonia  Du    M                            2.99
Ka    Tania  Le                   M             2.35
Ga    Agnes  Bla                  M             3.24
... continues like this
I have tried up to the grouping and sorting:
sorted_df = df2.sort_values(['Sup4','Rating'],ascending=[True,False])
However, I need help passing the df1 column values to mark in the second dataframe.
Solution #1:
You can do a merge, but you need to include some logic to update your Seats columns. Also, it is important to mention that you need to decide what to do with data of unequal lengths: Gee and Re have unequal lengths in the two dataframes. More information on that in Solution #2.
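Both solutions below assume pandas and NumPy are imported. For reproducibility, here is a sketch that rebuilds the sample frames, with values transcribed from the question's tables:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'Sup4': ['Pa', 'Ka', 'Ga', 'Gee', 'Re'],
    'Seats': [3, 2, 1, 1, 2],
    'Primary Seats': [2, 1, 0, 1, 2],
    'Back up Seats': [1, 1, 1, 0, 0],
})
df2 = pd.DataFrame({
    'Sup4': ['Pa', 'Ka', 'Ga', 'Gee', 'Gee', 'Pa', 'Pa', 'Re', 'Re', 'Re', 'Ka'],
    'First': ['Peter', 'Sonia', 'Agnes', 'Jeffery', 'John', 'Pavol', 'Ciara',
              'David', 'Stefan', 'Franc', 'Tania'],
    'Last': ['He', 'Du', 'Bla', 'Rus', 'Cro', 'Rac', 'Lee', 'Wool', 'Rot', 'Bor', 'Le'],
    'Primary Seats': np.nan,
    'Backup Seats': np.nan,
    'Rating': [2.3, 2.99, 3.24, 3.5, 1.3, 1.99, 1.88, 2.34, 2.0, 1.34, 2.35],
})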
df3 = (pd.merge(df2[['Sup4', 'First', 'Last', 'Rating']], df1, on='Sup4')
.sort_values(['Sup4', 'Rating'], ascending=[True, False]))
s = df3.groupby('Sup4', sort=False).cumcount() + 1
df3['Backup Seats'] = np.where(s - df3['Primary Seats'] > 0, 'M', '')
df3['Primary Seats'] = np.where(s <= df3['Primary Seats'], 'M', '')
df3 = df3[['Sup4', 'First', 'Last', 'Primary Seats', 'Backup Seats', 'Rating']]
df3
Out[1]:
   Sup4    First  Last Primary Seats Backup Seats Rating
5    Ga    Agnes   Bla                          M   3.24
6   Gee  Jeffery   Rus             M                 3.5
7   Gee     John   Cro                          M    1.3
3    Ka    Sonia    Du             M                2.99
4    Ka    Tania    Le                          M   2.35
0    Pa    Peter    He             M                 2.3
1    Pa    Pavol   Rac             M                1.99
2    Pa    Ciara   Lee                          M   1.88
8    Re    David  Wool             M                2.34
9    Re   Stefan   Rot             M                 2.0
10   Re    Franc   Bor                          M   1.34
Solution #2:
After doing this solution, I realized Solution #1 would be much simpler, but I thought I might as well include this. It also gives you insight into what to do with values that have unequal sizes in the two dataframes. You can reindex the first dataframe and use combine_first(), but you have to do some preparation. Again, you need to decide what to do with data of unequal lengths. In my answer, I have simply excluded Sup4 groups with unequal lengths to guarantee that the indices align when finally calling combine_first():
# The purpose of `mtch` is to check whether the row count per Sup4 in the second
# dataframe equals the seat count in the first. If not, the Sup4 groups with
# unequal lengths in the two dataframes are excluded.
mtch = df1.groupby('Sup4')['Seats'].first().eq(df2.groupby('Sup4').size())
df1 = df1.sort_values('Sup4', ascending=True)[df1['Sup4'].isin(mtch[mtch].index)]
#`reindex` the dataframe, get the cumulative count, and manipulate data with `np.where`
df1 = df1.reindex(df1.index.repeat(df1['Seats'])).reset_index(drop=True)
s = df1.groupby('Sup4').cumcount() + 1
df1['Backup Seats'] = np.where(s - df1['Primary Seats'] > 0, 'M', '')
df1['Primary Seats'] = np.where(s <= df1['Primary Seats'], 'M', '')
#like df1, in df2 we exclude groups with uneven lengths and sort
df2 = (df2[df2['Sup4'].isin(mtch[mtch].index)]
.sort_values(['Sup4', 'Rating'], ascending=[True, False]).reset_index(drop=True))
#can use `combine_first` since we have ensured that the data is sorted and of equal lengths in both dataframes
df3 = df2.combine_first(df1)
#order columns and only include required columns
df3 = df3[['Sup4', 'First', 'Last', 'Primary Seats', 'Backup Seats', 'Rating']]
df3
Out[1]:
  Sup4  First  Last Primary Seats Backup Seats Rating
0   Ga  Agnes   Bla                          M   3.24
1   Ka  Sonia    Du             M                2.99
2   Ka  Tania    Le                          M   2.35
3   Pa  Peter    He             M                 2.3
4   Pa  Pavol   Rac             M                1.99
5   Pa  Ciara   Lee                          M   1.88
I have this df:
data = pd.read_csv('attacks.csv', encoding="latin-1")
new_data = data.loc[:,'Name':'Investigator or Source']
new_data.head(5)
Name Sex Age Injury Fatal (Y/N) Time Species Investigator or Source
0 Julie Wolfe F 57 No injury to occupant, outrigger canoe and pad... N 18h00 White shark R. Collier, GSAF
1 Adyson McNeely F 11 Minor injury to left thigh N 14h00 -15h00 NaN K.McMurray, TrackingSharks.com
2 John Denges M 48 Injury to left lower leg from surfboard skeg N 07h45 NaN K.McMurray, TrackingSharks.com
3 male M NaN Minor injury to lower leg N NaN 2 m shark B. Myatt, GSAF
4 Gustavo Ramos M NaN Lacerations to leg & hand shark PROVOKED INCIDENT N NaN Tiger shark, 3m A .Kipper
How can I get the unique values of the 'Species' category?
I'm trying with:
new_data["Species"].unique()
But it does not work.
Thank you!
You can also try:
uniqueSpecies = set(new_data["Species"])
In case you want to drop NaN:
uniqueSpecies = set(new_data["Species"].dropna())
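If new_data["Species"].unique() itself raised an error, one common cause (an assumption here, since the actual error message isn't shown) is stray whitespace in the column labels; a quick, hypothetical fix:
# Strip whitespace from the column labels, then access the column as usual
new_data.columns = new_data.columns.str.strip()
unique_species = new_data['Species'].dropna().unique()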