Multi-column calculation in pandas

I've got this long algebra formula that I need to apply to a dataframe:

def experience_mod(A, B, C, D, T, W):
    E = (T - A)
    F = (C - D)
    xmod = (A + B + (E * W) + ((1 - W) * F)) / (D + B + (F * W) + ((1 - W) * F))
    return xmod

A = loss['actual_primary_losses']
B = loss['ballast']
C = loss['ExpectedLosses']
D = loss['ExpectedPrimaryLosses']
T = loss['ActualIncurred']
W = loss['weight']

How would I write this to calculate experience_mod() for every row?
Something like this?

loss['ExperienceRating'] = loss.apply(experience_mod(A, B, C, D, T, W), axis=0)

Pandas and NumPy, the library it is built on, support vectorized operations, so given two Series or DataFrames A and B, operations like A + B, A - B, etc. are valid.
Your function works fine as it is; you just need to call it on the columns directly and assign the result back to the new column ExperienceRating.
Here's a working example:
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: df = pd.DataFrame(np.random.randn(6,6), columns=list('ABCDTW'))
In [4]: df
Out[4]:
          A         B         C         D         T         W
0  0.049617  0.082861  2.289549 -0.783082 -0.691990 -0.071152
1  0.722605  0.209683 -0.347372  0.254951  0.468615 -0.132794
2 -0.301469 -1.849026 -0.334381 -0.365116 -0.238384 -1.999025
3 -0.554925 -0.859044 -0.637079 -1.040336  0.627027 -0.955889
4 -2.024621 -0.539384  0.006734  0.117628 -0.215070 -0.661466
5  1.942926 -0.433067 -1.034814 -0.292179  0.744039  0.233953
In [5]: def experience_mod(A, B, C, D, T, W):
   ...:     E = (T-A)
   ...:     F = (C-D)
   ...:
   ...:     xmod = (A + B + (E*W) + ((1-W)*F))/(D + B + (F*W) + ((1-W)*F))
   ...:
   ...:     return xmod
   ...:
In [6]: experience_mod(df["A"], df["B"], df["C"], df["D"], df["T"], df["W"])
Out[6]:
0 1.465387
1 -2.060483
2 1.000469
3 1.173070
4 7.406756
5 -0.449957
dtype: float64
In [7]: df['ExperienceRating'] = experience_mod(df["A"], df["B"], df["C"], df["D"], df["T"], df["W"])
In [8]: df
Out[8]:
          A         B         C         D         T         W  ExperienceRating
0  0.049617  0.082861  2.289549 -0.783082 -0.691990 -0.071152          1.465387
1  0.722605  0.209683 -0.347372  0.254951  0.468615 -0.132794         -2.060483
2 -0.301469 -1.849026 -0.334381 -0.365116 -0.238384 -1.999025          1.000469
3 -0.554925 -0.859044 -0.637079 -1.040336  0.627027 -0.955889          1.173070
4 -2.024621 -0.539384  0.006734  0.117628 -0.215070 -0.661466          7.406756
5  1.942926 -0.433067 -1.034814 -0.292179  0.744039  0.233953         -0.449957
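If you specifically want the apply form from your question, a row-wise sketch would look roughly like this (assuming the loss DataFrame and column names from the question); note it needs axis=1, not axis=0, and it is usually far slower than the vectorized call above:

# Rough sketch only, using the column names from the question.
# axis=1 passes each row to the lambda as a Series, so experience_mod receives
# scalars; this is typically much slower than calling it on whole columns.
loss['ExperienceRating'] = loss.apply(
    lambda row: experience_mod(row['actual_primary_losses'], row['ballast'],
                               row['ExpectedLosses'], row['ExpectedPrimaryLosses'],
                               row['ActualIncurred'], row['weight']),
    axis=1)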

Related

How to replace pd.NamedAgg with code compliant with pandas 0.24.2?

Hello, I am obliged to downgrade my pandas version to '0.24.2'.
As a result, the function pd.NamedAgg is not recognized anymore.

import pandas as pd
import numpy as np

agg_cols = ['A', 'B', 'C']
agg_df = df.groupby(agg_cols).agg(
    max_foo=pd.NamedAgg(column='Foo', aggfunc=np.max),
    min_foo=pd.NamedAgg(column='Foo', aggfunc=np.min)
).reset_index()

Can you please help me change my code to make it compliant with version 0.24.2?
Thank you a lot.
Sample:
df = pd.DataFrame({
    'A': list('a') * 6,
    'B': [4, 5, 4, 5, 5, 4],
    'C': [7] * 6,
    'Foo': [1, 3, 5, 7, 1, 0],
    'E': [5, 3, 6, 9, 2, 4],
    'F': list('aaabbb')
})
agg_cols = ['A', 'B', 'C']

agg_df = df.groupby(agg_cols).agg(
    max_foo=pd.NamedAgg(column='Foo', aggfunc=np.max),
    min_foo=pd.NamedAgg(column='Foo', aggfunc=np.min)
).reset_index()
print (agg_df)
   A  B  C  max_foo  min_foo
0  a  4  7        5        0
1  a  5  7        7        1
Because only one column, Foo, is being processed, select the Foo column after groupby and pass tuples of new column names and aggregate functions:
agg_df = df.groupby(agg_cols)['Foo'].agg(
    [('max_foo', np.max), ('min_foo', np.min)]
).reset_index()
print (agg_df)
   A  B  C  max_foo  min_foo
0  a  4  7        5        0
1  a  5  7        7        1
Another idea is to pass a dictionary of lists of aggregate functions:
agg_df = df.groupby(agg_cols).agg({'Foo':['max', 'min']})
agg_df.columns = [f'{b}_{a}' for a, b in agg_df.columns]
agg_df = agg_df.reset_index()
print (agg_df)
   A  B  C  max_foo  min_foo
0  a  4  7        5        0
1  a  5  7        7        1

Lookup into dataframe from another with iloc

I have a dataframe with a location column to use for the lookup, as follows:

import pandas as pd
import numpy as np

i = ['dog', 'cat', 'bird', 'donkey'] * 100000
df1 = pd.DataFrame(np.random.randint(1, high=380, size=len(i)),
                   ['cat', 'bird', 'donkey', 'dog'] * 100000).reset_index()
df1.columns = ['animal', 'locn']
df1.head()
The dataframe to be looked up is as follows:

df = pd.DataFrame(np.random.randn(len(i), 2), index=i,
                  columns=list('AB')).rename_axis('animal').sort_index(0).reset_index()
df
I'm looking for a faster way to assign a column with the value of B for every record in df1.
df1.assign(val=[df[df.animal == a].iloc[b].B for a, b in zip(df1.animal, df1['locn'])])
...is pretty slow.
Use GroupBy.cumcount to create a counter of the animal column for the positions, so it is possible to use merge with a left join:
df['locn'] = df.groupby('animal').cumcount()
df1['new'] = df1.merge(df.reset_index(), on=['animal','locn'], how='left')['B']
Verify in a smaller dataframe:
np.random.seed(2019)

i = ['dog', 'cat', 'bird', 'donkey'] * 100
df1 = pd.DataFrame(np.random.randint(1, high=10, size=len(i)),
                   ['cat', 'bird', 'donkey', 'dog'] * 100).reset_index()
df1.columns = ['animal', 'locn']
print (df1)

df = pd.DataFrame(np.random.randn(len(i), 2), index=i,
                  columns=list('AB')).rename_axis('animal').sort_index(0).reset_index()

df1 = df1.assign(val=[df[df.animal == a].iloc[b].B for a, b in zip(df1.animal, df1['locn'])])

df['locn'] = df.groupby('animal').cumcount()
df1['new'] = df1.merge(df.reset_index(), on=['animal','locn'], how='left')['B']

locn = df.groupby('animal').cumcount()
df1 = df1.assign(new1 = df1.merge(df.reset_index().assign(locn = locn),
                                  on=['animal','locn'], how='left')['B'])
print (df1.head(10))
  animal  locn       val       new      new1
0     cat     9 -0.535465 -0.535465 -0.535465
1    bird     3  0.296240  0.296240  0.296240
2  donkey     6  0.222638  0.222638  0.222638
3     dog     9  1.115175  1.115175  1.115175
4     cat     7  0.608889  0.608889  0.608889
5    bird     9 -0.025648 -0.025648 -0.025648
6  donkey     1  0.324736  0.324736  0.324736
7     dog     1  0.533579  0.533579  0.533579
8     cat     8 -1.818238 -1.818238 -1.818238
9   bird     9 -0.025648 -0.025648 -0.025648
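Just as a side note (this variant is mine, not part of the answer above): the same lookup can also be sketched by indexing df by (animal, position) and reindexing with the pairs from df1, assuming every (animal, locn) pair actually exists in df:

# Sketch: build an (animal, position) index on df and look the df1 pairs up directly.
pos = df.groupby('animal').cumcount()
lookup = df.set_index([df['animal'], pos])['B']
df1['new2'] = lookup.reindex(
    pd.MultiIndex.from_arrays([df1['animal'], df1['locn']])
).values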

Fill pandas fields with tuples as elements by slicing

Sorry if this question has been asked before, but I did not find it here or anywhere else:
I want to fill some of the fields of a column with tuples. Currently I would have to resort to:
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4]})
df['b'] = ''
df['b'] = df['b'].astype(object)
mytuple = ('x', 'y')
for l in df[df.a % 2 == 0].index:
    df.set_value(l, 'b', mytuple)
with df being (which is what I want)
   a       b
0  1
1  2  (x, y)
2  3
3  4  (x, y)
This does not look very elegant to me and probably not very efficient. Instead of the loop, I would prefer something like
df.loc[df.a % 2 == 0, 'b'] = np.array([mytuple] * sum(df.a % 2 == 0), dtype=tuple)
which (of course) does not work. How can I improve my above method by using slicing?
In [57]: df.loc[df.a % 2 == 0, 'b'] = pd.Series([mytuple] * len(df.loc[df.a % 2 == 0])).values
In [58]: df
Out[58]:
   a       b
0  1
1  2  (x, y)
2  3
3  4  (x, y)
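A small side note, not part of the question or the answer: set_value was deprecated and later removed in newer pandas releases, so if you do keep a loop, the same thing can be sketched with DataFrame.at (the column still needs object dtype to hold tuples):

# Sketch of the question's loop using .at instead of the removed set_value.
for l in df[df.a % 2 == 0].index:
    df.at[l, 'b'] = mytuple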

pandas groupby by the dictionary

I have encountered a problem:
import pandas

df = pandas.DataFrame({"code": ['a', 'a', 'b', 'c', 'd'],
                       'data': [3, 4, 3, 6, 7]})
mat = {'group1': ['a', 'b'], 'group2': ['a', 'c'], 'group3': {'a', 'b', 'c', 'd'}}
The df looks like this:
code data
0 a 3
1 a 4
2 b 3
3 c 6
4 d 7
I want the mean of group1, group2, and group3. In this example the key group1 maps to the values 'a' and 'b', so I find the rows where code equals 'a' or 'b' in df; the mean of group1 is (3+4+3)/3.
group2 -> 'a','c' -> (3+4+6)/3
group3 -> 'a','b','c','d' -> (3+4+3+6+7)/5
I tried to use groupby, but it doesn't work.
Thanks!
IIUC, you can do something like the following:
In [133]: rules = {
...: 'grp1': ['a','b'],
...: 'grp2': ['a','c'],
...: 'grp3': list('abcd')
...: }
...:
...: r = pd.DataFrame(
...: [{r:df.loc[df.code.isin(rules[r]), 'data'].mean()}
...: for r in rules
...: ]
...: ).stack()
...:
In [134]: r
Out[134]:
0 grp1 3.333333
1 grp2 4.333333
2 grp3 4.600000
dtype: float64
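A slightly plainer variant, offered only as a sketch of my own (not part of the answer above), builds the same means as a Series keyed by group name, using the question's mat dictionary directly:

# Sketch: one mean per group, keyed by the group names from mat.
r = pd.Series({g: df.loc[df.code.isin(codes), 'data'].mean()
               for g, codes in mat.items()})
# Gives group1 -> 3.333..., group2 -> 4.333..., group3 -> 4.6, as described in the question.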

Pandas custom file format

I have a huge Pandas DataFrame that I need to write out to a format that RankLib can understand. An example with a target, a query ID and 3 features is this:
5 qid:4 1:12 2:0.6 3:13
1 qid:4 1:8 2:0.4 3:11
I have written my own function that iterates over the rows and writes them out like this:

data_file = open(filename, 'w')
for index, row in data.iterrows():
    line = str(row['score'])
    line += ' qid:' + str(row['srch_id'])
    counter = 0
    for feature in feature_columns:
        counter += 1
        line += ' ' + str(counter) + ':' + str(row[feature])
    data_file.write(line + '\n')
data_file.close()
Since I have about 200 features and 5m rows this is obviously very slow. Is there a better approach using the I/O of Pandas itself?
You can do it this way:
Data:
In [155]: df
Out[155]:
f1 f2 f3 score srch_id
0 12 0.6 13 5 4
1 8 0.4 11 1 4
2 11 0.7 14 2 10
In [156]: df.dtypes
Out[156]:
f1 int64
f2 float64
f3 int64
score object
srch_id int64
dtype: object
Solution:
feature_columns = ['f1', 'f2', 'f3']
cols2id = {col: str(i+1) for i, col in enumerate(feature_columns)}

def f(x):
    if x.name in feature_columns:
        return cols2id[x.name] + ':' + x.astype(str)
    elif x.name == 'srch_id':
        return 'qid:' + x.astype(str)
    else:
        return x

(df.apply(lambda x: f(x))[['score', 'srch_id'] + feature_columns]
 .to_csv('d:/temp/out.csv', sep=' ', index=False, header=None)
)
out.csv:
5 qid:4 1:12 2:0.6 3:13
1 qid:4 1:8 2:0.4 3:11
2 qid:10 1:11 2:0.7 3:14
cols2id helper dict:
In [158]: cols2id
Out[158]: {'f1': '1', 'f2': '2', 'f3': '3'}
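If the apply-based conversion is still too slow for roughly 5M rows, here is a rough sketch of a plain string-concatenation approach (assuming the same df and feature_columns as above; the output path is just a placeholder):

# Sketch: build every output line by vectorized string concatenation, then write once.
lines = df['score'].astype(str) + ' qid:' + df['srch_id'].astype(str)
for i, col in enumerate(feature_columns, start=1):
    lines = lines + ' ' + str(i) + ':' + df[col].astype(str)
with open('out.txt', 'w') as fh:
    fh.write('\n'.join(lines) + '\n')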