pd.DataFrame: find rows pairwise using groupby and change bogus values - pandas

My pd.DataFrame looks like this example but has about 10 million rows, hence I am looking for an efficient solution.
import pandas as pd
df = pd.DataFrame({'timestamp': ['2004-09-06', '2004-09-06', '2004-09-06', '2004-09-06', '2004-09-07', '2004-09-07'],
                   'opt_expiry': ['2005-12-16', '2005-12-16', '2005-12-16', '2005-12-16', '2005-06-17', '2005-06-17'],
                   'strike': [2, 2, 2.5, 2.5, 1.5, 1.5],
                   'type': ['c', 'p', 'c', 'p', 'c', 'p'],
                   'sigma': [0.25, 0.25, 0.001, 0.17, 0.195, 0.19],
                   'delta': [0.7, -0.3, 1, -0.25, 0.6, -0.4]}).set_index('timestamp', drop=True)
df.index = pd.to_datetime(df.index)
df.opt_expiry = pd.to_datetime(df.opt_expiry)
Out[2]:
opt_expiry strike type sigma delta
timestamp
2004-09-06 2005-12-16 2.0 c 0.250 0.70
2004-09-06 2005-12-16 2.0 p 0.250 -0.30
2004-09-06 2005-12-16 2.5 c 0.001 1.00
2004-09-06 2005-12-16 2.5 p 0.170 -0.25
2004-09-07 2005-06-17 1.5 c 0.195 0.60
2004-09-07 2005-06-17 1.5 p 0.190 -0.40
Here is what I am looking to achieve:
1) Find the pairs with identical timestamp, opt_expiry and strike:
groups = df.groupby(['timestamp', 'opt_expiry', 'strike'])
2) For each group, check whether the sum of the absolute deltas equals 1. If it does not, find the maximum of the two sigma values and assign it to both rows as the new, correct sigma. Pseudo code:
for name, group in groups:
    # if the sum of the absolute deltas is not 1
    if (abs(group.delta.iloc[0]) + abs(group.delta.iloc[1])) != 1:
        correct_sigma = group.sigma.max()
        group.sigma = correct_sigma
Expected output:
opt_expiry strike type sigma delta
timestamp
2004-09-06 2005-12-16 2.0 c 0.250 0.70
2004-09-06 2005-12-16 2.0 p 0.250 -0.30
2004-09-06 2005-12-16 2.5 c 0.170 1.00
2004-09-06 2005-12-16 2.5 p 0.170 -0.25
2004-09-07 2005-06-17 1.5 c 0.195 0.60
2004-09-07 2005-06-17 1.5 p 0.190 -0.40

Revised answer. I believe there could be a shorter solution out there; maybe put this up as a bounty.
Data
df = pd.DataFrame({'timestamp': ['2004-09-06', '2004-09-06', '2004-09-06', '2004-09-06', '2004-09-07', '2004-09-07'],
                   'opt_expiry': ['2005-12-16', '2005-12-16', '2005-12-16', '2005-12-16', '2005-06-17', '2005-06-17'],
                   'strike': [2, 2, 2.5, 2.5, 1.5, 1.5],
                   'type': ['c', 'p', 'c', 'p', 'c', 'p'],
                   'sigma': [0.25, 0.25, 0.001, 0.17, 0.195, 0.19],
                   'delta': [0.7, -0.3, 1, -0.25, 0.6, -0.4]}).set_index('timestamp', drop=True)
df
Working
Absolute delta for each row
df['absdelta'] = df['delta'].abs()
Absolute delta sum and maximum sigma for each group, in a new dataframe df2
df2 = df.groupby(['timestamp', 'opt_expiry', 'strike']).agg({'absdelta': 'sum', 'sigma': 'max'})
df2
Merge df2 back onto df on the full group key (timestamp, opt_expiry, strike), so identical strikes on different dates or expiries are not mixed together
df3 = df.reset_index().merge(df2.reset_index(), on=['timestamp', 'opt_expiry', 'strike'],
                             suffixes=('', '_right'))
df3
Mask the rows whose group's absolute delta sum is not equal to 1
m = df3['absdelta_right'] != 1
m
Using the mask, assign the group's maximum sigma to the rows masked above
df3.loc[m, 'sigma'] = df3.loc[m, 'sigma_right']
Drop the helper columns and restore the index to return to the original dataframe
df3.drop(columns=['absdelta', 'absdelta_right', 'sigma_right']).set_index('timestamp')
Output
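For what it's worth, a shorter, fully vectorised sketch along the lines hinted at above, using groupby().transform(). The helper names (grp, abs_sum, max_sigma) are just local names, and np.isclose replaces the exact != 1 test because the absolute deltas are floats and may not sum to exactly 1 — the tolerance is an assumption about what the question wants:
import numpy as np

grp = df.assign(absdelta=df['delta'].abs()).groupby(['timestamp', 'opt_expiry', 'strike'])
abs_sum = grp['absdelta'].transform('sum')    # per row: the group's sum of |delta|
max_sigma = grp['sigma'].transform('max')     # per row: the group's maximum sigma
# keep sigma where |delta| sums to (roughly) 1, otherwise take the group maximum
df['sigma'] = np.where(np.isclose(abs_sum, 1), df['sigma'], max_sigma)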

Related

Divide Dataframe Multiindex level 1 by the sum of level 0

I have created a DataFrame like this:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        'env': ['us', 'us', 'us', 'eu'],
        'name': ['first', 'first', 'first', 'second'],
        'default_version': ['2.0.1', '2.0.1', '2.0.1', '2.1.1'],
        'version': ['2.2.1', '2.2.2.4', '2.3', '2.2.24'],
        'count_events': [1, 8, 102, 244],
        'unique_users': [1, 3, 72, 111]
    }
)
df = df.pivot_table(index=['env', 'name', 'default_version'],
                    columns='version', values=['count_events', 'unique_users'],
                    aggfunc=np.sum)
Next, I'm looking to find the sum of all count_events at level=1 and the sum of all unique_users at level=1, so I can compute the percentage of count_events and unique_users for each version.
I have generated the sums with the following code, but I don't know how to compute the percentages.
sums = df.sum(level=0, axis=1)
sums.columns = pd.MultiIndex.from_product([sums.columns, ['SUM']])
final_result = pd.concat([df, sums], axis=1)
It would not be a problem to change the sum code if necessary.
You can reindex your sums to match the shape of the original data using a combination of reindex and set_axis:
In [14]: fraction = (
    ...:     df / (
    ...:         sums
    ...:         .reindex(df.columns.get_level_values(0), axis=1)
    ...:         .set_axis(df.columns, axis=1)
    ...:     )
    ...: ).fillna(0)
In [15]: fraction
Out[15]:
count_events unique_users
version 2.2.1 2.2.2.4 2.2.24 2.3 2.2.1 2.2.2.4 2.2.24 2.3
env name default_version
eu second 2.1.1 0.000000 0.000000 1.0 0.000000 0.000000 0.000000 1.0 0.000000
us first 2.0.1 0.009009 0.072072 0.0 0.918919 0.013158 0.039474 0.0 0.947368
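An alternative sketch that skips building the separate SUM columns, assuming only the fractions are needed: broadcast the level-0 sums back onto the original shape with a transpose plus transform, then divide (group_sums is just a helper name):
# sum within each level-0 column group and broadcast back to the original shape
group_sums = df.T.groupby(level=0).transform('sum').T
fraction = (df / group_sums).fillna(0)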

How to assign a new column after groupby in pandas

I want to groupby my data and create a new column via assignment.
Given the following data frame
import pandas as pd
import numpy as np
df = pd.DataFrame({'col1': ['x1', 'x1', 'x1', 'x2', 'x2', 'x2'], 'col2': [1, 2, 3, 4, 5, 6]})
df['col3']=df[['col1','col2']].groupby('col1').rolling(2).mean().reset_index()
Expected output = pd.DataFrame({'col1': ['x1', 'x1', 'x1', 'x2', 'x2', 'x2'], 'col2': [1, 2, 3, 4, 5, 6], 'col3': [np.nan, 1.5, 2.5, np.nan, 4.5, 5.5]})
However, this does not work. Is there a straightforward way to do it?
A combination of groupby, apply and assign:
df.groupby('col1', as_index = False).apply(lambda g: g.assign(col3 = g['col2'].rolling(2).mean())).reset_index(drop = True)
output:
col1 col2 col3
0 x1 1 NaN
1 x1 2 1.5
2 x1 3 2.5
3 x2 4 NaN
4 x2 5 4.5
5 x2 6 5.5
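For reference, a variant without apply: compute the grouped rolling mean and drop the group level so the result aligns back onto df by index (a sketch, not benchmarked against the apply version):
df['col3'] = (df.groupby('col1')['col2']
                .rolling(2).mean()
                .reset_index(level=0, drop=True))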

How to use apply for multiple pandas DataFrame columns?

I am trying hard to fill the NaN values in some columns, selected from a list defined earlier. The code is going down the else path and never makes the correct modifications...
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': [0.0, np.nan, np.nan, 100],
                    'C': [20, 0.0002, 10000, np.nan],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])
num_cols = ['B', 'C']
fill_mean = lambda col: col.fillna(col.mean()) if col.name in num_cols else col
df2.apply(fill_mean, axis=1)
You can do this much more simply with
df1.fillna(df1.mean())
This fills each numeric column's NaNs with that column's mean:
A B C D
0 A0 0.0 20.000000 D0
1 A1 50.0 0.000200 D1
2 A2 50.0 10000.000000 D2
3 A3 100.0 3340.000067 D3
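As for why the posted apply takes the else branch: with axis=1 each row is passed to the function, so col.name is the row index label (0, 1, 2, 3) and is never found in num_cols. Applying column-wise (the default axis=0) makes the check behave as intended; a sketch, assuming the frame is df1 as defined above:
num_cols = ['B', 'C']
fill_mean = lambda col: col.fillna(col.mean()) if col.name in num_cols else col
df1.apply(fill_mean)   # axis=0 (default): each column is passed, so col.name is the column label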
I am not sure if your desired output is just the mean of all columns (a single row). If that is the case, maybe the solution below could help.
df = df1.select_dtypes(include='float').mean().to_frame().T
df = pd.concat([df, df.reindex(columns = df1.select_dtypes(exclude='float').columns)], axis=1, sort=False)
print(df)
B C A D
0 50.0 3340.000067 NaN NaN

Pandas - Unexpected results when indexing a DataFrame containing missing entries

First, we create a large dataset with a MultiIndex whose first record contains missing values (np.NaN):
In [200]: data = []
     ...: val = 0
     ...: for ind_1 in range(3000):
     ...:     if ind_1 == 0:
     ...:         data.append({'ind_1': 0, 'ind_2': np.NaN, 'val': np.NaN})
     ...:     else:
     ...:         for ind_2 in range(3000):
     ...:             data.append({'ind_1': ind_1, 'ind_2': ind_2, 'val': val})
     ...:             val += 1
     ...: df = pd.DataFrame(data).set_index(['ind_1', 'ind_2'])
In [201]: df
Out[201]:
val
ind_1 ind_2
0 NaN NaN
1 0.0 0.0
1.0 1.0
2.0 2.0
3.0 3.0
... ...
2999 2995.0 8996995.0
2996.0 8996996.0
2997.0 8996997.0
2998.0 8996998.0
2999.0 8996999.0
[8997001 rows x 1 columns]
I want to select all rows where ind_1 < 3 and ind_2 < 3
First I create a MultiIndex i1 where ind_1 < 3:
In [202]: i1 = df.loc[df.index.get_level_values('ind_1') < 3].index
In [203]: i1
Out[203]:
MultiIndex([(0, nan),
(1, 0.0),
(1, 1.0),
(1, 2.0),
(1, 3.0),
(1, 4.0),
(1, 5.0),
(1, 6.0),
(1, 7.0),
(1, 8.0),
...
(2, 2990.0),
(2, 2991.0),
(2, 2992.0),
(2, 2993.0),
(2, 2994.0),
(2, 2995.0),
(2, 2996.0),
(2, 2997.0),
(2, 2998.0),
(2, 2999.0)],
names=['ind_1', 'ind_2'], length=6001)
Then I create a MultiIndex i2 where ind_2 is not greater than 2:
In [204]: i2 = df.loc[~(df.index.get_level_values('ind_2') > 2)].index
In [205]: i2
Out[205]:
MultiIndex([( 0, nan),
( 1, 0.0),
( 1, 1.0),
( 1, 2.0),
( 2, 0.0),
( 2, 1.0),
( 2, 2.0),
( 3, 0.0),
( 3, 1.0),
( 3, 2.0),
...
(2996, 2.0),
(2997, 0.0),
(2997, 1.0),
(2997, 2.0),
(2998, 0.0),
(2998, 1.0),
(2998, 2.0),
(2999, 0.0),
(2999, 1.0),
(2999, 2.0)],
names=['ind_1', 'ind_2'], length=8998)
Logically, the solution should be the intersection of these two sets
In [206]: df.loc[i1 & i2]
Out[206]:
val
ind_1 ind_2
1 0.0 0.0
1.0 1.0
2.0 2.0
2 0.0 3000.0
1.0 3001.0
2.0 3002.0
Why is the first record (0, nan) filtered out?
Use boolean arrays i1, i2 instead of indexes
In [27]: i1 = df.index.get_level_values('ind_1') < 3
In [28]: i2 = ~(df.index.get_level_values('ind_2') > 2)
In [29]: i1
Out[29]: array([ True, True, True, ..., False, False, False])
In [30]: i2
Out[30]: array([ True, True, True, ..., False, False, False])
In [31]: df.loc[i1 & i2]
Out[31]:
val
ind_1 ind_2
0 NaN NaN
1 0.0 0.0
1.0 1.0
2.0 2.0
2 0.0 3000.0
1.0 3001.0
2.0 3002.0
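The underlying reason, as I understand it: combining the two MultiIndex objects with & is a value-based intersection, and NaN never compares equal to NaN, so the (0, nan) label cannot match itself and is dropped, whereas the boolean arrays are combined position by position and keep that row. A quick illustration of the comparison involved:
import numpy as np
np.nan == np.nan              # False: NaN never equals NaN
float('nan') == float('nan')  # False as well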

pandas 0.20.3 DataFrame behavior changes for pyspark.ml.vectors object in a column

The following code works in pandas 0.20.0 but not in 0.20.3:
import pandas as pd
from pyspark.ml.linalg import Vectors
df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': [1, 2, 3, 4],
                   'C': [1, 2, 3, 4],
                   'D': [1, 2, 3, 4]},
                  index=[0, 1, 2, 3])
df.apply(lambda x: pd.Series(Vectors.dense([x["A"], x["B"]])), axis=1)
This produces the following with pandas 0.20.0:
0
0 [1.0, 1.0]
1 [2.0, 2.0]
2 [3.0, 3.0]
3 [4.0, 4.0]
but it is different in pandas 0.20.3:
0 1
0 1.0 1.0
1 2.0 2.0
2 3.0 3.0
3 4.0 4.0
How can I achieve the first behavior in 0.20.3?
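One version-independent way to get the 0.20.0-style single column of vectors is to build it outside apply, sidestepping apply's expansion of Series return values (assumed here to be what changed between these versions); vectors and out are hypothetical helper names, and this is a sketch rather than a fix for the version difference itself:
# build the dense vectors explicitly and wrap them in a one-column DataFrame
vectors = [Vectors.dense([a, b]) for a, b in zip(df['A'], df['B'])]
out = pd.DataFrame({0: vectors}, index=df.index)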