Pandas Groupby Multiple Conditions KeyError - pandas

I have a df called df_out with the column names shown below, but for some reason I cannot use the 'groupby' function with the column headers, since it keeps giving me KeyError: 'year'. I've researched and tried stripping white space, resetting the index, allowing white space before my groupby setting, etc., and I cannot get past this KeyError. The df_out looks like this:
df_out.columns
Out[185]:
Index(['year', 'month', 'BARTON CHAPEL', 'BARTON I', 'BIG HORN I',
'BLUE CREEK', 'BUFFALO RIDGE I', 'CAYUGA RIDGE', 'COLORADO GREEN',
'DESERT WIND', 'DRY LAKE I', 'EL CABO', 'GROTON', 'NEW HARVEST',
'PENASCAL I', 'RUGBY', 'TULE'],
dtype='object', name='plant_name')
But when I use df_out.head(), the output shows a leading 'plant_name' label, so this may be where the error is coming from, or at least related. Here is the output:
df_out.head()
Out[187]:
plant_name year month BARTON CHAPEL BARTON I BIG HORN I BLUE CREEK \
0 1991 1 6.432285 7.324126 5.170067 6.736384
1 1991 2 7.121324 6.973586 4.922693 7.473527
2 1991 3 8.125793 8.681317 5.796599 8.401855
3 1991 4 7.454972 8.037764 7.272292 7.961625
4 1991 5 7.012809 6.530013 6.626949 6.009825
plant_name BUFFALO RIDGE I CAYUGA RIDGE COLORADO GREEN DESERT WIND \
0 7.163790 7.145323 5.783629 5.682003
1 7.595744 7.724717 6.245952 6.269524
2 8.111411 9.626075 7.918871 6.657648
3 8.807458 8.618806 7.011444 5.848736
4 7.734852 6.267097 7.410013 5.099610
plant_name DRY LAKE I EL CABO GROTON NEW HARVEST PENASCAL I \
0 4.721089 10.747285 7.456640 6.921801 6.296425
1 5.095923 8.891057 7.239762 7.449122 6.484241
2 8.409637 12.238508 8.274046 8.824758 8.444960
3 7.893694 10.837139 6.381736 8.840431 7.282444
4 8.496976 8.636882 6.856747 7.469825 7.999530
plant_name RUGBY TULE
0 7.028360 4.110605
1 6.394687 5.257128
2 6.859462 10.789516
3 7.590153 7.425153
4 7.556546 8.085255
My groupby statement that is getting the KeyError looks like this and I'm trying to calculate the average by rows of year and month based on a subset of columns from df_out found in the list - 'west':
west=['BIG HORN I','DRY LAKE I', 'TULE']
westavg = df_out[df_out.columns[df_out.columns.isin(west)]].groupby(['year','month']).mean()
thank you very much,

Your code can be broken down as:
westavg = (df_out[df_out.columns[df_out.columns.isin(west)]]
.groupby(['year','month']).mean()
)
which is not working because ['year','month'] are not columns of df_out[df_out.columns[df_out.columns.isin(west)]].
Try:
west_cols = [c for c in df_out if c in west]
westavg = df_out.groupby(['year','month'])[west_cols].mean()
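As a minimal sketch of why the KeyError appears and how the suggested fix behaves, the snippet below uses a small made-up frame (the numbers are illustrative, not the asker's data). Selecting only the `west` columns first drops 'year' and 'month', so there is nothing left to group on; grouping first and selecting afterwards keeps them available.

```python
import pandas as pd

# Illustrative data only; values are invented for this sketch.
df_out = pd.DataFrame({
    "year": [1991, 1991, 1991, 1991],
    "month": [1, 1, 2, 2],
    "BIG HORN I": [5.0, 7.0, 4.0, 6.0],
    "DRY LAKE I": [4.0, 6.0, 5.0, 5.0],
    "TULE": [4.0, 4.0, 10.0, 12.0],
    "RUGBY": [7.0, 7.0, 6.0, 6.0],
})
west = ["BIG HORN I", "DRY LAKE I", "TULE"]
west_cols = [c for c in df_out.columns if c in west]

# The subset no longer contains the grouping keys, hence the KeyError.
subset = df_out[west_cols]
has_keys = {"year", "month"}.issubset(subset.columns)

# Group on the full frame first, then pick the columns to average.
westavg = df_out.groupby(["year", "month"])[west_cols].mean()

# Optional: a single west-wide average per (year, month).
west_row_mean = westavg.mean(axis=1)
```

If one combined number per year and month is wanted, `west_row_mean` averages across the selected plants after grouping.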

OK, with the help of Quang Hoang's answer, I understood the problem and came up with this version using .intersection, which I find a bit easier to follow:
westavg = df_out[df_out.columns.intersection(west)].mean(axis=1)
# gives the average of each row across the subset of columns listed in 'west'

Related

How to replace values of a column based on another data frame?

I have a column containing symbols of chemical elements and other substances. Something like this:
Commodities
sn
sulfuric acid
cu
sodium chloride
au
df1 = pd.DataFrame(['sn', 'sulfuric acid', 'cu', 'sodium chloride', 'au'], columns=['Commodities'])
And I have another data frame containing the symbols of the chemical elements and their respective names. Like this:
Name    Symbol
tin     sn
copper  cu
gold    au
df2 = pd.DataFrame({'Name': ['tin', 'copper', 'gold'], 'Symbol': ['sn', 'cu', 'au']})
I need to replace the symbols in the first dataframe (df1['Commodities']) with the names from the second one (df2['Name']), so that it produces the following output:
Commodities
tin
sulfuric acid
copper
sodium chloride
gold
I tried using for loops and lambda but got different results than expected. I have tried many things and googled, I think it's something basic, but I just can't find an answer.
Thank you in advance!
First, convert df2 to a dictionary that maps each symbol to its name:
replace_dict = dict(df2[['Symbol', 'Name']].to_dict('split')['data'])
# {'sn': 'tin', 'cu': 'copper', 'au': 'gold'}
Then use the replace function:
df1['Commodities'] = df1['Commodities'].replace(replace_dict)
print(df1)
'''
Commodities
0 tin
1 sulfuric acid
2 copper
3 sodium chloride
4 gold
'''
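Try:
for i, row in df2.iterrows():
    # str.replace substitutes substrings, so a symbol inside a longer word would also be replaced
    df1['Commodities'] = df1['Commodities'].str.replace(row['Symbol'], row['Name'], regex=False)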
which gives df1 as:
Commodities
0 tin
1 sulfuric acid
2 copper
3 sodium chloride
4 gold
EDIT: Note that it's very likely to be far more efficient to skip defining df2 at all and just zip your lists of names and symbols together and iterate over that.
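For comparison, here is a hedged sketch of an exact-match lookup that avoids both the loop and substring replacement. It assumes each symbol appears at most once in df2; the name `symbol_to_name` is introduced only for this illustration.

```python
import pandas as pd

df1 = pd.DataFrame(
    ['sn', 'sulfuric acid', 'cu', 'sodium chloride', 'au'],
    columns=['Commodities'],
)
df2 = pd.DataFrame({'Name': ['tin', 'copper', 'gold'],
                    'Symbol': ['sn', 'cu', 'au']})

# Series indexed by symbol, used as a lookup table (hypothetical helper name).
symbol_to_name = df2.set_index('Symbol')['Name']

# map() returns NaN where no symbol matches; fillna() restores the original text.
df1['Commodities'] = (
    df1['Commodities'].map(symbol_to_name).fillna(df1['Commodities'])
)
```

Because `map` only matches whole cell values, entries such as 'sulfuric acid' pass through unchanged.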

How can I set conditions for dataframes?

/we.tl/t-ghXIOjPznq
Here is my xlsx file.
https://imgur.com/b8kTbNV
I have a dataframe like this. I want to keep only the rows where the LITHOLOGY column is 1. To do that:
df2 = pd.read_excel('V131BLOG.xlsx')
LITHOLOGY = [1]
df2[df2.LITHOLOGY.isin(LITHOLOGY)]
There hasn't been a problem so far. I was able to filter as I wanted.
https://imgur.com/wcSvokM
In addition, I want to see the cells where the LITHOLOGY column is 1 only if the thickness is bigger than 15cms. What I mean is that the cumulative difference of consecutive cells in the DEPTH_MD column should be bigger than 10cms. I have not made any progress on this. What path should I follow?
As you can see in this figure (https://imgur.com/a/02nlUUl), there are consecutive groups of rows where the LITHOLOGY column is 1. But when you check the DEPTH_MD values, the upper group spans 10cms while the lower group spans 5cms. I want to create a dataframe that only contains the groups spanning more than 10cms in DEPTH_MD.
Input:
DEPTH_MD CALIPER GR LITHOLOGY SHALLOW DEEP
1980 329.00 26.8964 25.47160 2 2.99103 2.62130
1981 329.05 26.8574 32.54390 2 2.94772 2.58945
1982 329.10 27.1297 28.83750 1 2.90123 2.55601
1983 329.15 26.9742 17.91150 2 2.80383 2.52327
1984 329.20 28.3946 31.94310 2 2.76041 2.49050
1985 329.25 30.9402 17.63760 1 2.71992 2.46051
1986 329.30 35.2419 17.69170 1 2.67355 2.42852
1987 329.35 37.9206 17.74620 1 2.61838 2.33619
1988 329.40 39.9189 24.84460 2 2.56200 2.28671
1989 329.45 41.4947 7.03354 2 2.50669 2.23887
1990 329.50 41.5473 7.03354 2 2.42167 2.19944
1991 329.55 41.0158 10.58260 2 2.40039 2.17235
Expected output:
DEPTH_MD CALIPER GR LITHOLOGY SHALLOW DEEP
1985 329.25 30.9402 17.6376 1 2.71992 2.46051
1986 329.30 35.2419 17.6917 1 2.67355 2.42852
1987 329.35 37.9206 17.7462 1 2.61838 2.33619
Group the consecutive 'LITHOLOGY' rows then compute the thickness and finally broadcast to all rows:
df['THICKNESS'] = (
    df.groupby(df['LITHOLOGY'].ne(df['LITHOLOGY'].shift()).cumsum())['DEPTH_MD']
      .transform(lambda x: x.diff().sum())
)
out = df[(df['LITHOLOGY'] == 1) & (df['THICKNESS'] >= 0.1)]
Output:
>>> out
DEPTH_MD CALIPER GR LITHOLOGY SHALLOW DEEP THICKNESS
1985 329.25 30.9402 17.6376 1 2.71992 2.46051 0.1
1986 329.30 35.2419 17.6917 1 2.67355 2.42852 0.1
1987 329.35 37.9206 17.7462 1 2.61838 2.33619 0.1
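The same idea can be checked on a cut-down copy of the sample rows. This is a sketch under two assumptions that are not in the answer above: thickness is computed as max minus min of each run (equivalent to summing the differences when depth increases), and the result is rounded to 6 decimals so that floating-point noise does not push a 0.1 m run just below the threshold.

```python
import pandas as pd

# Subset of the question's sample rows (only the columns needed here).
df = pd.DataFrame(
    {
        "DEPTH_MD": [329.00, 329.05, 329.10, 329.15, 329.20,
                     329.25, 329.30, 329.35, 329.40],
        "LITHOLOGY": [2, 2, 1, 2, 2, 1, 1, 1, 2],
    },
    index=range(1980, 1989),
)

# A new run starts whenever LITHOLOGY differs from the previous row.
run_id = df["LITHOLOGY"].ne(df["LITHOLOGY"].shift()).cumsum()

# Thickness of each run, broadcast back to every row of that run.
df["THICKNESS"] = (
    df.groupby(run_id)["DEPTH_MD"]
      .transform(lambda x: x.max() - x.min())
      .round(6)  # assumption: guard against float noise around 0.1
)

out = df[(df["LITHOLOGY"] == 1) & (df["THICKNESS"] >= 0.1)]
```

The single-row run at index 1982 gets a thickness of 0 and is dropped, while the three-row run at 1985 to 1987 is kept.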

Applying a function to list of columns of a dataframe?

I scraped this table from this URL:
"https://www.patriotsoftware.com/blog/accounting/average-cost-living-by-state/"
Which looks like this:
State Annual Mean Wage (All Occupations) Median Monthly Rent Value of a Dollar
0 Alabama $44,930 $998 $1.15
1 Alaska $59,290 $1,748 $0.95
2 Arizona $50,930 $1,356 $1.04
3 Arkansas $42,690 $953 $1.15
4 California $61,290 $2,518 $0.87
And then I wrote this function to help me turn the strings into ints:
def money_string_to_int(s):
    return int(s.replace(",", "").replace("$", ""))
money_string_to_int("$1,23")
My function works when I apply it to only one column. I found this answer here about using on multiple columns: How to apply a function to multiple columns in Pandas
But my code below does not work and produces no errors:
ls = ['Annual Mean Wage (All Occupations)', 'Median Monthly Rent',
'Value of a Dollar']
ppe_table[ls] = ppe_table[ls].apply(money_string_to_int)
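Let's try the following. Note that regex=True is needed in recent pandas versions for '[$,]' to be treated as a character class, and the result is float because 'Value of a Dollar' has decimals:
df.set_index('State').apply(lambda x: x.str.replace('[$,]', '', regex=True).astype(float)).reset_index()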
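A self-contained sketch of that approach on two of the sample rows follows. The point it illustrates is that DataFrame.apply hands each whole column (a Series) to the function rather than individual strings, so the cleaning has to use the vectorized `.str` methods. The names `money_cols` and `cleaned` are introduced here for illustration only.

```python
import pandas as pd

ppe_table = pd.DataFrame({
    "State": ["Alabama", "Alaska"],
    "Annual Mean Wage (All Occupations)": ["$44,930", "$59,290"],
    "Median Monthly Rent": ["$998", "$1,748"],
    "Value of a Dollar": ["$1.15", "$0.95"],
})
money_cols = [
    "Annual Mean Wage (All Occupations)",
    "Median Monthly Rent",
    "Value of a Dollar",
]

cleaned = ppe_table.copy()
# Each `col` is a Series; strip '$' and ',' from every string, then convert.
cleaned[money_cols] = ppe_table[money_cols].apply(
    lambda col: col.str.replace(r"[$,]", "", regex=True).astype(float)
)
```

The wage and rent columns could be cast to int afterwards if whole numbers are preferred; the dollar-value column has to stay float.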

Averaging dataframes with many string columns and display back all the columns

I have struggled with this even after looking at the various past answers to no avail.
My data consists of numeric and non-numeric columns. I'd like to average the numeric columns and display my data on the GUI together with the information in the non-numeric columns. The non-numeric columns hold info such as names, rollno, and stream, while the numeric columns contain students' marks for various subjects. It works well when dealing with one dataframe but fails when I combine two or more dataframes: it returns only the average of the numeric columns and leaves the non-numeric columns undisplayed. Below is one of the codes I've tried so far.
df=pd.concat((df3,df5))
dfs =df.groupby(df.index,level=0).mean()
headers = list(dfs)
self.marks_table.setRowCount(dfs.shape[0])
self.marks_table.setColumnCount(dfs.shape[1])
self.marks_table.setHorizontalHeaderLabels(headers)
df_array = dfs.values
for row in range(dfs.shape[0]):
    for col in range(dfs.shape[1]):
        self.marks_table.setItem(row, col, QTableWidgetItem(str(df_array[row, col])))
A working code should return averages in something like this
STREAM ADM NAME KCPE ENG KIS
0 EAGLE 663 FLOYCE ATI 250 43 5
1 EAGLE 664 VERONICA 252 32 33
2 EAGLE 665 MACREEN A 341 23 23
3 EAGLE 666 BRIDGIT 286 23 2
Rather than
ADM KCPE ENG KIS
0 663.0 250.0 27.5 18.5
1 664.0 252.0 26.5 33.0
2 665.0 341.0 17.5 22.5
3 666.0 286.0 38.5 23.5
Sample data
Df1 = pd.DataFrame({
'STREAM':[NORTH,SOUTH],
'ADM':[437,238,439],
'NAME':[JAMES,MARK,PETER],
'KCPE':[233,168,349],
'ENG':[70,28,79],
'KIS':[37,82,79],
'MAT':[67,38,29]})
Df2 = pd.DataFrame({
'STREAM':[NORTH,SOUTH],
'ADM':[437,238,439],
'NAME':[JAMES,MARK,PETER],
'KCPE':[233,168,349],
'ENG':[40,12,56],
'KIS':[33,43,43],
'MAT':[22,58,23]})
Your question is not clear, so I am guessing the intent from its content. Your sample dataframes were malformed (unquoted strings and mismatched column lengths), so I have modified them by adding a stream called 'CENTRAL':
Df1 = pd.DataFrame({'STREAM':['NORTH','SOUTH', 'CENTRAL'],'ADM':[437,238,439], 'NAME':['JAMES','MARK','PETER'],'KCPE':[233,168,349],'ENG':[70,28,79],'KIS':[37,82,79],'MAT':[67,38,29]})
Df2 = pd.DataFrame({ 'STREAM':['NORTH','SOUTH','CENTRAL'],'ADM':[437,238,439], 'NAME':['JAMES','MARK','PETER'],'KCPE':[233,168,349],'ENG':[40,12,56],'KIS':[33,43,43],'MAT':[22,58,23]})
I have assumed you want to combine the two dataframes and find the average. Grouping on the non-numeric columns keeps them in the result:
df3 = pd.concat([Df2, Df1])  # DataFrame.append was removed in pandas 2.0
df3.groupby(['STREAM','ADM','NAME'], as_index=False).mean()
Outcome
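As a hedged, runnable sketch of this idea with the modified sample frames: stacking with pd.concat and grouping on the identifying columns averages the marks while carrying STREAM, ADM and NAME through. The variable name `result` is introduced here for illustration.

```python
import pandas as pd

Df1 = pd.DataFrame({
    'STREAM': ['NORTH', 'SOUTH', 'CENTRAL'], 'ADM': [437, 238, 439],
    'NAME': ['JAMES', 'MARK', 'PETER'], 'KCPE': [233, 168, 349],
    'ENG': [70, 28, 79], 'KIS': [37, 82, 79], 'MAT': [67, 38, 29],
})
Df2 = pd.DataFrame({
    'STREAM': ['NORTH', 'SOUTH', 'CENTRAL'], 'ADM': [437, 238, 439],
    'NAME': ['JAMES', 'MARK', 'PETER'], 'KCPE': [233, 168, 349],
    'ENG': [40, 12, 56], 'KIS': [33, 43, 43], 'MAT': [22, 58, 23],
})

# Stack both frames, then average the marks per student.
df3 = pd.concat([Df2, Df1], ignore_index=True)
result = df3.groupby(['STREAM', 'ADM', 'NAME'], as_index=False).mean()
```

With `as_index=False` the grouping columns stay as ordinary columns, so `list(result)` can feed the table headers directly.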

Pandas Dataframe: Divide Column entries by number of occurence

my Problem:
I have this DF:
df_problem = pd.DataFrame({"Share":['5%','6%','9%','9%', '9%'],"level_1":[0,0,1,2,3], 'BO':['Nestle', 'Procter', 'Nestle', 'Tesla', 'Jeff']})
The problem is that the 9% is actually divided among the three shareholders. So I want to give each of them their share of 3% and put it next to their names. It should then look like this:
df_solution = pd.DataFrame({"Share":['5%','6%','3%','3%', '3%'],"level_1":[0,0,0,1,2], 'BO': ['Nestle', 'Procter', 'Nestle', 'Tesla', 'Jeff']})
How do I do this in a simple way?
You could try something like this:
df_problem['Share'] = (df_problem['Share'].str.replace('%', '').astype(float) /
                       df_problem.groupby('Share')['BO']
                       .transform('count')).astype(str) + '%'
>>> df_problem
Share level_1 BO
0 5.0% 0 Nestle
1 6.0% 0 Procter
2 3.0% 1 Nestle
3 3.0% 2 Tesla
4 3.0% 3 Jeff
Please note that I convert the 'Share' values to float before dividing, which is why the results are printed as 5.0%, 6.0% and 3.0%.