Append new columns to a pandas dataframe in a groupby object - pandas

I would like to add columns to a pandas dataframe in a groupby object
import pandas as pd
import numpy as np

# create the dataframe
idx = ['a', 'b', 'c'] * 10
df = pd.DataFrame({
    'f1': np.random.randn(30),
    'f2': np.random.randn(30),
    'f3': np.random.randn(30),
    'f4': np.random.randn(30),
    'f5': np.random.randn(30)},
    index=idx)
colnum = [1, 2, 3, 4, 5]
newcol = ['a' + str(s) for s in colnum]
# group by the index
df1 = df.groupby(df.index)
I'm trying to loop over each group in the groupby object and add new columns to the dataframe of the current group:
for group in df1:
    tmp = group[1]
    for s in range(len(tmp.columns)):
        print(s)
        tmp.loc[:, newcol[s]] = tmp[[tmp.columns[s]]] * colnum[s]
    group[1] = tmp
I'm unable to assign the new dataframe back into the groupby object; I get:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
TypeError: 'tuple' object does not support item assignment
Is there a way to replace the dataframe in the groupby object with a new dataframe?

Based on your code (PS: df.mul([1,2,3,4,5]) works for your example output):
grouplist = []
for _, group in df1:
    tmp = group
    for s in range(len(tmp.columns)):
        print(s)
        tmp.loc[:, newcol[s]] = tmp[[tmp.columns[s]]] * colnum[s]
    grouplist.append(tmp)
grouplist[1]
Out[217]:
f1 f2 f3 f4 f5 a1 a2 \
b -0.262064 -1.148832 -1.835077 -0.244675 -0.215145 -0.262064 -2.297664
b -1.595659 -0.448111 -0.908683 -0.157839 0.208497 -1.595659 -0.896222
b 0.373039 -0.557571 1.154175 -0.172326 1.236915 0.373039 -1.115142
b -1.485564 1.508292 0.420220 -0.380387 -0.725848 -1.485564 3.016584
b -0.760250 -0.380997 -0.774745 -0.853975 0.041411 -0.760250 -0.761994
b 0.600410 1.822984 -0.310327 -0.281853 0.458621 0.600410 3.645968
b -0.707724 1.706709 -0.208969 -1.696045 -1.644065 -0.707724 3.413417
b -0.892057 1.225944 -1.027265 -1.519110 -0.861458 -0.892057 2.451888
b -0.454419 -1.989300 2.241945 -1.071738 -0.905364 -0.454419 -3.978601
b 1.171569 -0.827023 -0.404192 -1.495059 0.500045 1.171569 -1.654046
a3 a4 a5
b -5.505230 -0.978700 -1.075727
b -2.726048 -0.631355 1.042483
b 3.462526 -0.689306 6.184576
b 1.260661 -1.521547 -3.629239
b -2.324236 -3.415901 0.207056
b -0.930980 -1.127412 2.293105
b -0.626908 -6.784181 -8.220324
b -3.081796 -6.076439 -4.307289
b 6.725834 -4.286954 -4.526821
b -1.212577 -5.980235 2.500226
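The df.mul hint above avoids looping over groups entirely; a minimal sketch (column names a1..a5 as in the question, renaming via a dict since that works across pandas versions):

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # fixed seed so the example is reproducible
idx = ['a', 'b', 'c'] * 10
df = pd.DataFrame(np.random.randn(30, 5),
                  columns=['f1', 'f2', 'f3', 'f4', 'f5'],
                  index=idx)
colnum = [1, 2, 3, 4, 5]
newcol = ['a' + str(s) for s in colnum]

# Multiply every column by its weight in one vectorized call,
# rename the products to a1..a5, and join them onto the original frame.
scaled = df.mul(colnum).rename(columns=dict(zip(df.columns, newcol)))
out = pd.concat([df, scaled], axis=1)
```

Since the multiplier depends only on the column, not the group, there is no need to go through groupby at all.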

Build a decision Column by ANDing multiple columns in pandas

I have a pandas data frame which is shown below:
>>> x = [[1,2,3,4,5],[1,2,4,4,3],[2,4,5,6,7]]
>>> columns = ['a','b','c','d','e']
>>> df = pd.DataFrame(data = x, columns = columns)
>>> df
a b c d e
0 1 2 3 4 5
1 1 2 4 4 3
2 2 4 5 6 7
I have an array of objects (conditions) as shown below:
[
    {
        'header': 'a',
        'condition': '==',
        'values': [1]
    },
    {
        'header': 'b',
        'condition': '==',
        'values': [2]
    },
    ...
]
and an assignHeader which is:
assignHeader = 'decision'
Now I want to build up all the conditions from the conditions array by looping through it, for example something like this:
pConditions = []
for eachCondition in conditions:
    header = eachCondition['header']
    values = eachCondition['values']
    if eachCondition['condition'] == "==":
        pConditions.append(df[header].isin(values))
    else:
        pConditions.append(~df[header].isin(values))
df[assignHeader] = and(pConditions)
I was thinking of using the all operator in pandas but am unable to work out the right syntax. The list I shared can grow large and is dynamic, so I want to use this nested approach and check for equality. Does anyone know a way to do this?
Final Output:
conditions = [df['a']==1, df['b']==2]
>>> df['decision'] = (df['a']==1) & (df['b']==2)
>>> df
a b c d e decision
0 1 2 3 4 5 True
1 1 2 4 4 3 True
2 2 4 5 6 7 False
Here the conditions array will be variable. And I want to have a function which takes df, newheadername and conditions as input and returns the output as shown below:
>>> df
a b c d e decision
0 1 2 3 4 5 True
1 1 2 4 4 3 True
2 2 4 5 6 7 False
where newheadername = 'decision'
I was able to solve the problem using the code shown below. I am not sure if this is a fast way of getting things done, but would love to know your input in case you have anything specific to point out.
def andMerging(conditions, mergeHeader, df):
    if len(conditions) != 0:
        df[mergeHeader] = pd.concat(conditions, axis=1).all(axis=1)
    return df
where conditions are an array of pd.Series with boolean values.
And conditions are formatted as shown below:
def prepareForConditionMerging(conditionsArray, df):
    conditions = []
    for prop in conditionsArray:
        condition = prop['condition']
        values = prop['values']
        header = prop['header']
        if isinstance(values, str):
            values = [values]
        if condition == "==":
            conditions.append(df[header].isin(values))
        else:
            conditions.append(~df[header].isin(values))
    # Here we can add more conditions such as greater-than, less-than etc.
    return conditions
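In use, the two helpers boil down to a single concat-and-all step; a quick check on the sample frame from the question:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4, 5],
                   [1, 2, 4, 4, 3],
                   [2, 4, 5, 6, 7]],
                  columns=['a', 'b', 'c', 'd', 'e'])

# Two boolean Series, one per condition from the array
conditions = [df['a'].isin([1]), df['b'].isin([2])]
# all(axis=1) ANDs any number of boolean Series row-wise
df['decision'] = pd.concat(conditions, axis=1).all(axis=1)
```

Each Series in `conditions` shares the frame's index, so `concat` lines them up correctly before the row-wise AND.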

Pandas add a summary column that counts values that are not empty strings

I have a table that looks like this:
   A       B     C
1  foo
2  foobar  blah
3
I want to count the non-empty values in columns A, B and C to get a summary column like this:
   A       B     C  sum
1  foo              1
2  foobar  blah     2
3                   0
Here is how I'm trying to do it:
import pandas as pd
df = {'A': ["foo", "foobar", ""],
      'B': ["", "blah", ""],
      'C': ["", "", ""]}
df = pd.DataFrame(df)
print(df)
df['sum'] = df[['A', 'B', 'C']].notnull().sum(axis=1)
df['sum'] = (df[['A', 'B', 'C']] != "").sum(axis=1)
These last two lines are different ways to get what I want but they aren't working. Any suggestions?
df['sum'] = (df[['A', 'B', 'C']] != "").sum(axis=1)
Worked. Thanks for the assistance.
This one-liner worked for me :)
import numpy as np
df["sum"] = df.replace("", np.nan).T.count().reset_index().iloc[:, 1]
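A shorter equivalent (not from the thread, just a sketch) compares the whole frame against the empty string at once:

```python
import pandas as pd

df = pd.DataFrame({'A': ["foo", "foobar", ""],
                   'B': ["", "blah", ""],
                   'C': ["", "", ""]})
# ne("") yields a boolean frame; summing across columns counts non-empty cells
df['sum'] = df.ne("").sum(axis=1)
```

This avoids the NaN round-trip and the transpose entirely.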

pandas assigning values based on another column

My dataframe:
Terrain
M1
M2
F
G
S
B1
B2
I want to add another column Terrain_Type and assign values as follows: if Terrain is M1, M2, B1 or B2, Terrain_Type should be Composite; if Terrain is S, it should be Sod; and for F and G I would like to assign Gravel.
I have tried this code:
data['Terrain_Type'] = data['Terrain'].map({['M1','M2','B1','B2']:'Composite', 'S':'Sod',['F','G']:'Gravel'})
But it didn't work out. Could anyone suggest how to solve this error in my code?
You need to map with a valid dictionary; in what you have, you are using a list as a key, which is not allowed because lists are unhashable. So let's suppose the dictionary is like this:
import pandas as pd
data = pd.DataFrame({'Terrain':['M1','M2','F','G','S','B1','B2']})
d = {'Composite':['M1','M2','B1','B2'],'Sod':['S'],'Gravel':['F','G']}
We can create a reverse of this, which maps the terrain to the type:
new_dic = {}
for k, v in d.items():
    for x in v:
        new_dic[x] = k
new_dic
{'M1': 'Composite',
'M2': 'Composite',
'B1': 'Composite',
'B2': 'Composite',
'S': 'Sod',
'F': 'Gravel',
'G': 'Gravel'}
Then this will work:
data["Terrain_Type"] = data["Terrain"].map(new_dic)
data
  Terrain Terrain_Type
0 M1 Composite
1 M2 Composite
2 F Gravel
3 G Gravel
4 S Sod
5 B1 Composite
6 B2 Composite
L1 = ['M1','M2','B1','B2']
d1 = dict.fromkeys(L1, 'Composite')
L2 = ['F','G']
d2 = dict.fromkeys(L2, 'Gravel')
L3 = ['S']
d3 = dict.fromkeys(L3, 'Sod')
d = {**d1, **d2, **d3}
Map:
df['Terrain_Type'] = df['Terrain'].map(d)
Output:
Terrain Terrain_Type
0 M1 Composite
1 M2 Composite
2 F Gravel
3 G Gravel
4 S Sod
5 B1 Composite
6 B2 Composite
I believe the following will work for you :)
def get_terrain_type(row):
    if row in ["M1", "M2", "B1", "B2"]:
        return "Composite"
    elif row == "S":
        return "Sod"
    else:
        return "Gravel"

data["Terrain_Type"] = data["Terrain"].map(get_terrain_type)
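For a small number of buckets like this, np.select (not used in the answers above, offered as an alternative sketch) is another common spelling:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'Terrain': ['M1', 'M2', 'F', 'G', 'S', 'B1', 'B2']})
# The first matching condition wins; anything unmatched falls back to 'Gravel'
data['Terrain_Type'] = np.select(
    [data['Terrain'].isin(['M1', 'M2', 'B1', 'B2']),
     data['Terrain'].eq('S')],
    ['Composite', 'Sod'],
    default='Gravel')
```

Unlike map with a dictionary, the default argument covers everything not listed, so F and G need no explicit entry.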

How to replace pd.NamedAgg to a code compliant with pandas 0.24.2?

Hello, I am obliged to downgrade pandas to version 0.24.2.
As a result, pd.NamedAgg is no longer available.
import pandas as pd
import numpy as np

agg_cols = ['A', 'B', 'C']
agg_df = df.groupby(agg_cols).agg(
    max_foo=pd.NamedAgg(column='Foo', aggfunc=np.max),
    min_foo=pd.NamedAgg(column='Foo', aggfunc=np.min)
).reset_index()
Can you please help me change my code to make it compatible with version 0.24.2? Thank you.
Sample:
agg_df = df.groupby(agg_cols)['Foo'].agg(
    [('max_foo', np.max), ('min_foo', np.min)]
).reset_index()
df = pd.DataFrame({
    'A': list('a') * 6,
    'B': [4, 5, 4, 5, 5, 4],
    'C': [7] * 6,
    'Foo': [1, 3, 5, 7, 1, 0],
    'E': [5, 3, 6, 9, 2, 4],
    'F': list('aaabbb')
})
agg_cols = ['A', 'B', 'C']
agg_df = df.groupby(agg_cols).agg(
    max_foo=pd.NamedAgg(column='Foo', aggfunc=np.max),
    min_foo=pd.NamedAgg(column='Foo', aggfunc=np.min)
).reset_index()
print(agg_df)
print (agg_df)
A B C max_foo min_foo
0 a 4 7 5 0
1 a 5 7 7 1
Because only the one column Foo is being processed, select Foo after groupby and pass tuples of new column names and aggregate functions:
agg_df = df.groupby(agg_cols)['Foo'].agg(
    [('max_foo', np.max), ('min_foo', np.min)]
).reset_index()
print (agg_df)
A B C max_foo min_foo
0 a 4 7 5 0
1 a 5 7 7 1
Another idea is to pass a dictionary of lists of aggregate functions:
agg_df = df.groupby(agg_cols).agg({'Foo':['max', 'min']})
agg_df.columns = [f'{b}_{a}' for a, b in agg_df.columns]
agg_df = agg_df.reset_index()
print (agg_df)
A B C max_foo min_foo
0 a 4 7 5 0
1 a 5 7 7 1

Multi-column calculation in pandas

I've got this long algebra formula that I need to apply to a dataframe:
def experience_mod(A, B, C, D, T, W):
    E = (T - A)
    F = (C - D)
    xmod = (A + B + (E*W) + ((1-W)*F)) / (D + B + (F*W) + ((1-W)*F))
    return xmod

A = loss['actual_primary_losses']
B = loss['ballast']
C = loss['ExpectedLosses']
D = loss['ExpectedPrimaryLosses']
T = loss['ActualIncurred']
W = loss['weight']
How would I write this to calculate the experience_mod() for every row?
something like this?
loss['ExperienceRating'] = loss.apply(experience_mod(A,B,C,D,T,W) axis = 0)
Pandas and numpy, the library underlying it, support vectorized operations, so given two Series A and B, operations like A + B and A - B are valid.
Your function works as written: apply it to the columns directly and assign the result back to the new column ExperienceRating.
Here's a working example:
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: df = pd.DataFrame(np.random.randn(6,6), columns=list('ABCDTW'))
In [4]: df
Out[4]:
A B C D T W
0 0.049617 0.082861 2.289549 -0.783082 -0.691990 -0.071152
1 0.722605 0.209683 -0.347372 0.254951 0.468615 -0.132794
2 -0.301469 -1.849026 -0.334381 -0.365116 -0.238384 -1.999025
3 -0.554925 -0.859044 -0.637079 -1.040336 0.627027 -0.955889
4 -2.024621 -0.539384 0.006734 0.117628 -0.215070 -0.661466
5 1.942926 -0.433067 -1.034814 -0.292179 0.744039 0.233953
In [5]: def experience_mod(A, B, C, D, T, W):
...: E = (T-A)
...: F = (C-D)
...:
...: xmod = (A + B + (E*W) + ((1-W)*F))/(D + B + (F*W) + ((1-W)*F))
...:
...: return xmod
...:
In [6]: experience_mod(df["A"], df["B"], df["C"], df["D"], df["T"], df["W"])
Out[6]:
0 1.465387
1 -2.060483
2 1.000469
3 1.173070
4 7.406756
5 -0.449957
dtype: float64
In [7]: df['ExperienceRating'] = experience_mod(df["A"], df["B"], df["C"], df["D"], df["T"], df["W"])
In [8]: df
Out[8]:
A B C D T W ExperienceRating
0 0.049617 0.082861 2.289549 -0.783082 -0.691990 -0.071152 1.465387
1 0.722605 0.209683 -0.347372 0.254951 0.468615 -0.132794 -2.060483
2 -0.301469 -1.849026 -0.334381 -0.365116 -0.238384 -1.999025 1.000469
3 -0.554925 -0.859044 -0.637079 -1.040336 0.627027 -0.955889 1.173070
4 -2.024621 -0.539384 0.006734 0.117628 -0.215070 -0.661466 7.406756
5 1.942926 -0.433067 -1.034814 -0.292179 0.744039 0.233953 -0.449957
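For comparison, the row-wise apply the question was reaching for would be written like this (a sketch; the vectorized call above is simpler and faster, since apply calls the function once per row in Python):

```python
import numpy as np
import pandas as pd

def experience_mod(A, B, C, D, T, W):
    E = T - A
    F = C - D
    return (A + B + (E * W) + ((1 - W) * F)) / (D + B + (F * W) + ((1 - W) * F))

np.random.seed(0)  # fixed seed so the comparison is reproducible
df = pd.DataFrame(np.random.randn(6, 6), columns=list('ABCDTW'))
# axis=1 hands apply one row at a time; each field behaves like a scalar
df['ExperienceRating'] = df.apply(
    lambda r: experience_mod(r['A'], r['B'], r['C'], r['D'], r['T'], r['W']),
    axis=1)
```

Both spellings produce the same column; the apply version is only worth it when the per-row logic cannot be expressed with whole-column arithmetic.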