Multi-column calculation in pandas

I've got this long algebra formula that I need to apply to a dataframe:

def experience_mod(A, B, C, D, T, W):
    E = (T - A)
    F = (C - D)
    xmod = (A + B + (E * W) + ((1 - W) * F)) / (D + B + (F * W) + ((1 - W) * F))
    return xmod

A = loss['actual_primary_losses']
B = loss['ballast']
C = loss['ExpectedLosses']
D = loss['ExpectedPrimaryLosses']
T = loss['ActualIncurred']
W = loss['weight']

How would I write this to calculate experience_mod() for every row?
Something like this?

loss['ExperienceRating'] = loss.apply(experience_mod(A, B, C, D, T, W), axis=0)

Pandas and NumPy, the library it is built on, support vectorized operations, so given two Series or DataFrames A and B, operations like A + B, A - B, etc. are valid.
Your function works fine as it is; you just need to call it on the columns directly and assign the result back to the new column ExperienceRating.
Here's a working example:
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: df = pd.DataFrame(np.random.randn(6,6), columns=list('ABCDTW'))
In [4]: df
Out[4]:
          A         B         C         D         T         W
0  0.049617  0.082861  2.289549 -0.783082 -0.691990 -0.071152
1  0.722605  0.209683 -0.347372  0.254951  0.468615 -0.132794
2 -0.301469 -1.849026 -0.334381 -0.365116 -0.238384 -1.999025
3 -0.554925 -0.859044 -0.637079 -1.040336  0.627027 -0.955889
4 -2.024621 -0.539384  0.006734  0.117628 -0.215070 -0.661466
5  1.942926 -0.433067 -1.034814 -0.292179  0.744039  0.233953
In [5]: def experience_mod(A, B, C, D, T, W):
   ...:     E = (T-A)
   ...:     F = (C-D)
   ...:
   ...:     xmod = (A + B + (E*W) + ((1-W)*F))/(D + B + (F*W) + ((1-W)*F))
   ...:
   ...:     return xmod
   ...:
In [6]: experience_mod(df["A"], df["B"], df["C"], df["D"], df["T"], df["W"])
Out[6]:
0 1.465387
1 -2.060483
2 1.000469
3 1.173070
4 7.406756
5 -0.449957
dtype: float64
In [7]: df['ExperienceRating'] = experience_mod(df["A"], df["B"], df["C"], df["D"], df["T"], df["W"])
In [8]: df
Out[8]:
          A         B         C         D         T         W  ExperienceRating
0  0.049617  0.082861  2.289549 -0.783082 -0.691990 -0.071152          1.465387
1  0.722605  0.209683 -0.347372  0.254951  0.468615 -0.132794         -2.060483
2 -0.301469 -1.849026 -0.334381 -0.365116 -0.238384 -1.999025          1.000469
3 -0.554925 -0.859044 -0.637079 -1.040336  0.627027 -0.955889          1.173070
4 -2.024621 -0.539384  0.006734  0.117628 -0.215070 -0.661466          7.406756
5  1.942926 -0.433067 -1.034814 -0.292179  0.744039  0.233953         -0.449957
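If you specifically want the apply form from your question, a row-wise sketch would look roughly like this (assuming the loss DataFrame and column names from the question); note it needs axis=1, not axis=0, and it is usually far slower than the vectorized call above:

# Rough sketch only, using the column names from the question.
# axis=1 passes each row to the lambda as a Series, so experience_mod receives
# scalars; this is typically much slower than calling it on whole columns.
loss['ExperienceRating'] = loss.apply(
    lambda row: experience_mod(row['actual_primary_losses'], row['ballast'],
                               row['ExpectedLosses'], row['ExpectedPrimaryLosses'],
                               row['ActualIncurred'], row['weight']),
    axis=1)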

Related

How to replace pd.NamedAgg with code compliant with pandas 0.24.2?

Hello, I am obliged to downgrade my pandas version to '0.24.2'.
As a result, the function pd.NamedAgg is not recognized anymore.

import pandas as pd
import numpy as np

agg_cols = ['A', 'B', 'C']
agg_df = df.groupby(agg_cols).agg(
    max_foo=pd.NamedAgg(column='Foo', aggfunc=np.max),
    min_foo=pd.NamedAgg(column='Foo', aggfunc=np.min)
).reset_index()

Can you please help me change my code to make it compliant with version 0.24.2?
Thank you a lot.
Sample:
df = pd.DataFrame({
    'A': list('a') * 6,
    'B': [4, 5, 4, 5, 5, 4],
    'C': [7] * 6,
    'Foo': [1, 3, 5, 7, 1, 0],
    'E': [5, 3, 6, 9, 2, 4],
    'F': list('aaabbb')
})
agg_cols = ['A', 'B', 'C']

agg_df = df.groupby(agg_cols).agg(
    max_foo=pd.NamedAgg(column='Foo', aggfunc=np.max),
    min_foo=pd.NamedAgg(column='Foo', aggfunc=np.min)
).reset_index()
print (agg_df)
   A  B  C  max_foo  min_foo
0  a  4  7        5        0
1  a  5  7        7        1
Because only one column, Foo, is being processed, select the Foo column after groupby and pass tuples of new column names and aggregate functions:
agg_df = df.groupby(agg_cols)['Foo'].agg(
    [('max_foo', np.max), ('min_foo', np.min)]
).reset_index()
print (agg_df)
   A  B  C  max_foo  min_foo
0  a  4  7        5        0
1  a  5  7        7        1
Another idea is to pass a dictionary of lists of aggregate functions:
agg_df = df.groupby(agg_cols).agg({'Foo':['max', 'min']})
agg_df.columns = [f'{b}_{a}' for a, b in agg_df.columns]
agg_df = agg_df.reset_index()
print (agg_df)
   A  B  C  max_foo  min_foo
0  a  4  7        5        0
1  a  5  7        7        1

Lookup into dataframe from another with iloc

I have a dataframe with a location column to use for the lookup, as follows:

import pandas as pd
import numpy as np

i = ['dog', 'cat', 'bird', 'donkey'] * 100000
df1 = pd.DataFrame(np.random.randint(1, high=380, size=len(i)),
                   ['cat', 'bird', 'donkey', 'dog'] * 100000).reset_index()
df1.columns = ['animal', 'locn']
df1.head()
The dataframe to be looked up is as follows:

df = pd.DataFrame(np.random.randn(len(i), 2), index=i,
                  columns=list('AB')).rename_axis('animal').sort_index(0).reset_index()
df
I'm looking for a faster way to assign a column with the value of B for every record in df1.
df1.assign(val=[df[df.animal == a].iloc[b].B for a, b in zip(df1.animal, df1['locn'])])
...is pretty slow.
Use GroupBy.cumcount to create a counter of the animal column for the positions, so it is possible to use merge with a left join:
df['locn'] = df.groupby('animal').cumcount()
df1['new'] = df1.merge(df.reset_index(), on=['animal','locn'], how='left')['B']
Verify in a smaller dataframe:
np.random.seed(2019)

i = ['dog', 'cat', 'bird', 'donkey'] * 100
df1 = pd.DataFrame(np.random.randint(1, high=10, size=len(i)),
                   ['cat', 'bird', 'donkey', 'dog'] * 100).reset_index()
df1.columns = ['animal', 'locn']
print (df1)

df = pd.DataFrame(np.random.randn(len(i), 2), index=i,
                  columns=list('AB')).rename_axis('animal').sort_index(0).reset_index()

df1 = df1.assign(val=[df[df.animal == a].iloc[b].B for a, b in zip(df1.animal, df1['locn'])])

df['locn'] = df.groupby('animal').cumcount()
df1['new'] = df1.merge(df.reset_index(), on=['animal','locn'], how='left')['B']

locn = df.groupby('animal').cumcount()
df1 = df1.assign(new1 = df1.merge(df.reset_index().assign(locn = locn),
                                  on=['animal','locn'], how='left')['B'])
print (df1.head(10))
  animal  locn       val       new      new1
0     cat     9 -0.535465 -0.535465 -0.535465
1    bird     3  0.296240  0.296240  0.296240
2  donkey     6  0.222638  0.222638  0.222638
3     dog     9  1.115175  1.115175  1.115175
4     cat     7  0.608889  0.608889  0.608889
5    bird     9 -0.025648 -0.025648 -0.025648
6  donkey     1  0.324736  0.324736  0.324736
7     dog     1  0.533579  0.533579  0.533579
8     cat     8 -1.818238 -1.818238 -1.818238
9   bird     9 -0.025648 -0.025648 -0.025648
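Just as a side note (this variant is mine, not part of the answer above): the same lookup can also be sketched by indexing df by (animal, position) and reindexing with the pairs from df1, assuming every (animal, locn) pair actually exists in df:

# Sketch: build an (animal, position) index on df and look the df1 pairs up directly.
pos = df.groupby('animal').cumcount()
lookup = df.set_index([df['animal'], pos])['B']
df1['new2'] = lookup.reindex(
    pd.MultiIndex.from_arrays([df1['animal'], df1['locn']])
).values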

Fill pandas fields with tuples as elements by slicing

Sorry if this question has been asked before, but I did not find it here or anywhere else:
I want to fill some of the fields of a column with tuples. Currently I would have to resort to:
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4]})
df['b'] = ''
df['b'] = df['b'].astype(object)
mytuple = ('x', 'y')
for l in df[df.a % 2 == 0].index:
    df.set_value(l, 'b', mytuple)
with df being (which is what I want)
   a       b
0  1
1  2  (x, y)
2  3
3  4  (x, y)
This does not look very elegant to me and probably not very efficient. Instead of the loop, I would prefer something like
df.loc[df.a % 2 == 0, 'b'] = np.array([mytuple] * sum(df.a % 2 == 0), dtype=tuple)
which (of course) does not work. How can I improve my above method by using slicing?
In [57]: df.loc[df.a % 2 == 0, 'b'] = pd.Series([mytuple] * len(df.loc[df.a % 2 == 0])).values
In [58]: df
Out[58]:
   a       b
0  1
1  2  (x, y)
2  3
3  4  (x, y)
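A small side note, not part of the question or the answer: set_value was deprecated and later removed in newer pandas releases, so if you do keep a loop, the same thing can be sketched with DataFrame.at (the column still needs object dtype to hold tuples):

# Sketch of the question's loop using .at instead of the removed set_value.
for l in df[df.a % 2 == 0].index:
    df.at[l, 'b'] = mytuple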

pandas groupby by the dictionary

I have encountered a problem:
import pandas

df = pandas.DataFrame({"code": ['a', 'a', 'b', 'c', 'd'],
                       'data': [3, 4, 3, 6, 7]})
mat = {'group1': ['a', 'b'], 'group2': ['a', 'c'], 'group3': {'a', 'b', 'c', 'd'}}
The df looks like this:
code data
0 a 3
1 a 4
2 b 3
3 c 6
4 d 7
I want the mean of group1, group2, and group3. In this example the key group1 maps to the values 'a' and 'b', so I find the rows where code equals 'a' or 'b' in df; the mean of group1 is (3+4+3)/3.
group2 -> 'a','c' -> (3+4+6)/3
group3 -> 'a','b','c','d' -> (3+4+3+6+7)/5
I tried to use groupby, but it doesn't work.
Thanks!
IIUC, you can do something like the following:
In [133]: rules = {
...: 'grp1': ['a','b'],
...: 'grp2': ['a','c'],
...: 'grp3': list('abcd')
...: }
...:
...: r = pd.DataFrame(
...: [{r:df.loc[df.code.isin(rules[r]), 'data'].mean()}
...: for r in rules
...: ]
...: ).stack()
...:
In [134]: r
Out[134]:
0 grp1 3.333333
1 grp2 4.333333
2 grp3 4.600000
dtype: float64
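A slightly plainer variant, offered only as a sketch of my own (not part of the answer above), builds the same means as a Series keyed by group name, using the question's mat dictionary directly:

# Sketch: one mean per group, keyed by the group names from mat.
r = pd.Series({g: df.loc[df.code.isin(codes), 'data'].mean()
               for g, codes in mat.items()})
# Gives group1 -> 3.333..., group2 -> 4.333..., group3 -> 4.6, as described in the question.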

Pandas custom file format

I have a huge Pandas DataFrame that I need to write out to a format that RankLib can understand. An example with a target, a query ID and 3 features is this:
5 qid:4 1:12 2:0.6 3:13
1 qid:4 1:8 2:0.4 3:11
I have written my own function that iterates over the rows and writes them out like this:

data_file = open(filename, 'w')
for index, row in data.iterrows():
    line = str(row['score'])
    line += ' qid:' + str(row['srch_id'])
    counter = 0
    for feature in feature_columns:
        counter += 1
        line += ' ' + str(counter) + ':' + str(row[feature])
    data_file.write(line + '\n')
data_file.close()
Since I have about 200 features and 5m rows this is obviously very slow. Is there a better approach using the I/O of Pandas itself?
You can do it this way:
Data:
In [155]: df
Out[155]:
f1 f2 f3 score srch_id
0 12 0.6 13 5 4
1 8 0.4 11 1 4
2 11 0.7 14 2 10
In [156]: df.dtypes
Out[156]:
f1 int64
f2 float64
f3 int64
score object
srch_id int64
dtype: object
Solution:
feature_columns = ['f1', 'f2', 'f3']
cols2id = {col: str(i+1) for i, col in enumerate(feature_columns)}

def f(x):
    if x.name in feature_columns:
        return cols2id[x.name] + ':' + x.astype(str)
    elif x.name == 'srch_id':
        return 'qid:' + x.astype(str)
    else:
        return x

(df.apply(lambda x: f(x))[['score', 'srch_id'] + feature_columns]
 .to_csv('d:/temp/out.csv', sep=' ', index=False, header=None)
)
out.csv:
5 qid:4 1:12 2:0.6 3:13
1 qid:4 1:8 2:0.4 3:11
2 qid:10 1:11 2:0.7 3:14
cols2id helper dict:
In [158]: cols2id
Out[158]: {'f1': '1', 'f2': '2', 'f3': '3'}
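If the apply-based conversion is still too slow for roughly 5M rows, here is a rough sketch of a plain string-concatenation approach (assuming the same df and feature_columns as above; the output path is just a placeholder):

# Sketch: build every output line by vectorized string concatenation, then write once.
lines = df['score'].astype(str) + ' qid:' + df['srch_id'].astype(str)
for i, col in enumerate(feature_columns, start=1):
    lines = lines + ' ' + str(i) + ':' + df[col].astype(str)
with open('out.txt', 'w') as fh:
    fh.write('\n'.join(lines) + '\n')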