Split a dataframe column containing a list of lists into multiple rows in pandas

I have a dataframe df:
import pandas as pd
df = pd.DataFrame(
    [[[[3, 0.5, 0.4, 0.7, 5], [2, 0.5, 1, 0.8, 2], [1, 0.5, 1, 1, 2]], 'b'],
     [[[1, 0.5, 0.6, 0.01, 1], [2, 0.5, 0.3, 0.2, 3], [1, 0.8, 1.0, 0.04, 3]], 'd']],
    index=['row1', 'row2'],
    columns=['col1', 'col2'])
I would like to split col1, which contains a list of lists, into multiple rows as follows:
col1 col2
row1 [3,0.5, 0.4, 0.7, 5] b
row1 [2, 0.5, 1, 0.8, 2] b
row1 [1, 0.5, 1, 1, 2] b
row2 [1, 0.5, 0.6, 0.01, 1] d
row2 [2, 0.5, 0.3, 0.2, 3] d
row2 [1, 0.8, 1.0, 0.04, 3] d
and then split col1 into two columns, retaining only the second and third elements:
new_col1 new_col2 col2
row1 0.5 0.4 b
row1 0.5 1 b
row1 0.5 1 b
row2 0.5 0.6 d
row2 0.5 0.3 d
row2 0.8 1.0 d
How can this be done using pandas?

For the first step there may not be anything better than a loop:
frames = []
for row in df.index:
    col = df.loc[row, 'col1']
    N = len(col)
    frames.append(pd.DataFrame(
        [[c, df.loc[row, 'col2']] for c in col],
        index=[row] * N,
        columns=['col1', 'col2']))
df2 = pd.concat(frames)
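In pandas 0.25 or newer, DataFrame.explode should do this first step in one call, repeating the index and the other columns automatically; a minimal sketch:
df2 = df.explode('col1')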
For the second step, just add the new columns and delete the original one:
df3 = df2.copy()
df3['new_col1'] = [c[1] for c in df3['col1']]
df3['new_col2'] = [c[2] for c in df3['col1']]
del df3['col1']
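The element access can also be written with the .str accessor, which indexes into list-valued cells, so the second step collapses to (a sketch of the same idea):
df3 = df2.assign(new_col1=df2['col1'].str[1],
                 new_col2=df2['col1'].str[2]).drop(columns='col1')
Note the resulting column order is col2, new_col1, new_col2; reorder with df3[['new_col1', 'new_col2', 'col2']] if it matters.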

Related

Pandas rolling mean only for non-NaNs

I have a DataFrame:
import numpy as np
import pandas as pd

df = pd.DataFrame({'B': [0, 1, 2, np.nan, 4],
                   'A1': [1, 1, 2, 2, 2],
                   'A2': [1, 2, 3, 3, 3]})
I want to group by columns "A1" and "A2" and then apply a rolling mean on "B" with window 3. If fewer values are available, that is fine; the mean should still be computed. But I do not want any value where there is no original entry.
Result should be:
pd.DataFrame({'B': [0, 1, 2, np.nan, 3]})
Applying df.rolling(3, min_periods=1).mean() yields:
pd.DataFrame({'B': [0, 1, 2, 2, 3]})
Any ideas?
The reason is that the rolling mean with window=3 and min_periods=1 outputs scalars, not NaNs; a possible solution is to set NaN manually after rolling:
df = pd.DataFrame({'B': [0, 1, 2, np.nan, 4],
                   'A': [1, 1, 2, 2, 2]})
df['C'] = df['B'].rolling(3, min_periods=1).mean().mask(df['B'].isna())
df['D'] = df.groupby('A')['B'].rolling(3, min_periods=1).mean().droplevel(0).mask(df['B'].isna())
print(df)
     B  A    C    D
0  0.0  1  0.0  0.0
1  1.0  1  0.5  0.5
2  2.0  2  1.0  2.0
3  NaN  2  NaN  NaN
4  4.0  2  3.0  3.0
EDIT: For multiple grouping columns, remove the group levels with Series.droplevel:
df = pd.DataFrame({'B': [0, 1, 2, np.nan, 4],
                   'A1': [1, 1, 2, 2, 2],
                   'A2': [1, 2, 3, 3, 3]})
df['D'] = df.groupby(['A1','A2'])['B'].rolling(3, min_periods=1).mean().droplevel(['A1','A2']).mask(df['B'].isna())
print(df)
     B  A1  A2    D
0  0.0   1   1  0.0
1  1.0   1   2  1.0
2  2.0   2   3  2.0
3  NaN   2   3  NaN
4  4.0   2   3  3.0
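An alternative that avoids the droplevel bookkeeping is GroupBy.transform, which returns a result already aligned to the original index; a minimal sketch of the same computation:
df['D'] = (df.groupby(['A1', 'A2'])['B']
             .transform(lambda s: s.rolling(3, min_periods=1).mean())
             .mask(df['B'].isna()))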

How to assign a new column after groupby in pandas

I want to groupby my data and create a new column assignment.
Given the following data frame
import pandas as pd
import numpy as np
df = pd.DataFrame({'col1': ['x1', 'x1', 'x1', 'x2', 'x2', 'x2'], 'col2': [1, 2, 3, 4, 5, 6]})
df['col3']=df[['col1','col2']].groupby('col1').rolling(2).mean().reset_index()
Expected output = pd.DataFrame({'col1': ['x1', 'x1', 'x1', 'x2', 'x2', 'x2'], 'col2': [1, 2, 3, 4, 5, 6], 'col3': [np.nan, 1.5, 2.5, np.nan, 4.5, 5.5]})
However, this does not work. Is there a straightforward way to do it?
A combination of groupby, apply and assign:
df.groupby('col1', as_index=False).apply(
    lambda g: g.assign(col3=g['col2'].rolling(2).mean())).reset_index(drop=True)
output:
  col1  col2  col3
0   x1     1   NaN
1   x1     2   1.5
2   x1     3   2.5
3   x2     4   NaN
4   x2     5   4.5
5   x2     6   5.5
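Since GroupBy.transform keeps the original index, the new column can also be assigned directly without apply; a minimal sketch:
df['col3'] = df.groupby('col1')['col2'].transform(lambda s: s.rolling(2).mean())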

How to use apply for multiple Pandas dataset columns?

I am trying to fill the NaN values of some columns, selected from a list, with the column mean. The code always goes down the else path and never makes the intended modifications...
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': [0.0, np.nan, np.nan, 100],
'C': [20, 0.0002, 10000, np.nan],
'D': ['D0', 'D1', 'D2', 'D3']},
index=[0, 1, 2, 3])
num_cols = ['B', 'C']
fill_mean = lambda col: col.fillna(col.mean()) if col.name in num_cols else col
df1.apply(fill_mean, axis=1)
You can do this much more simply using
df1.fillna(df1.mean(numeric_only=True))
This fills each numeric column's NaNs with that column's mean:
    A      B             C   D
0  A0    0.0     20.000000  D0
1  A1   50.0      0.000200  D1
2  A2   50.0  10000.000000  D2
3  A3  100.0   3340.000067  D3
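For reference, the original apply took the else path because axis=1 passes rows, so col.name is a row label (0, 1, ...) that is never in num_cols. Applying column-wise (the default axis=0) makes the lambda behave as intended; a minimal sketch:
fill_mean = lambda col: col.fillna(col.mean()) if col.name in num_cols else col
df2 = df1.apply(fill_mean)  # default axis=0 passes each column in turn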
I am not sure if your desired output is just the mean of all columns (a single row). If that is the case, maybe the solution below could help.
df = df1.select_dtypes(include='float').mean().to_frame().T
df = pd.concat([df, df.reindex(columns = df1.select_dtypes(exclude='float').columns)], axis=1, sort=False)
print(df)
      B            C    A    D
0  50.0  3340.000067  NaN  NaN

pd.DataFrame: find rows pairwise using groupby and change bogus values

My pd.DataFrame looks like the example below but has about 10 million rows, hence I am looking for an efficient solution.
import pandas as pd
df = pd.DataFrame({'timestamp':['2004-09-06', '2004-09-06', '2004-09-06', '2004-09-06', '2004-09-07', '2004-09-07'],
'opt_expiry': ['2005-12-16', '2005-12-16', '2005-12-16', '2005-12-16', '2005-06-17', '2005-06-17'],
'strike': [2, 2, 2.5, 2.5, 1.5, 1.5],
'type': ['c', 'p', 'c', 'p', 'c', 'p'],
'sigma': [0.25, 0.25, 0.001, 0.17, 0.195, 0.19],
'delta': [0.7, -0.3, 1, -0.25, 0.6, -0.4]}).set_index('timestamp', drop=True)
df.index = pd.to_datetime(df.index)
df.opt_expiry = pd.to_datetime(df.opt_expiry)
Out[2]:
            opt_expiry  strike type  sigma  delta
timestamp
2004-09-06  2005-12-16     2.0    c  0.250   0.70
2004-09-06  2005-12-16     2.0    p  0.250  -0.30
2004-09-06  2005-12-16     2.5    c  0.001   1.00
2004-09-06  2005-12-16     2.5    p  0.170  -0.25
2004-09-07  2005-06-17     1.5    c  0.195   0.60
2004-09-07  2005-06-17     1.5    p  0.190  -0.40
here is what I am looking to achieve:
1) find the pairs with identical timestamp, opt_expiry and strike:
groups = df.groupby(['timestamp','opt_expiry','strike'])
2) for each group, check whether the sum of the absolute deltas equals 1. If it does not, find the maximum of the two sigma values and assign it to both rows as the new, correct sigma. Pseudo code:
for group in groups:
    # if the sum of absolute deltas != 1
    if (abs(group.delta[0]) + abs(group.delta[1])) != 1:
        correct_sigma = group.sigma.max()
        group.sigma = correct_sigma
Expected output:
            opt_expiry  strike type  sigma  delta
timestamp
2004-09-06  2005-12-16     2.0    c  0.250   0.70
2004-09-06  2005-12-16     2.0    p  0.250  -0.30
2004-09-06  2005-12-16     2.5    c  0.170   1.00
2004-09-06  2005-12-16     2.5    p  0.170  -0.25
2004-09-07  2005-06-17     1.5    c  0.195   0.60
2004-09-07  2005-06-17     1.5    p  0.190  -0.40
Revised answer. I believe there could be a shorter answer out there; maybe put it up as a bounty.
Data
df = pd.DataFrame({'timestamp':['2004-09-06', '2004-09-06', '2004-09-06', '2004-09-06', '2004-09-07', '2004-09-07'],
'opt_expiry': ['2005-12-16', '2005-12-16', '2005-12-16', '2005-12-16', '2005-06-17', '2005-06-17'],
'strike': [2, 2, 2.5, 2.5, 1.5, 1.5],
'type': ['c', 'p', 'c', 'p', 'c', 'p'],
'sigma': [0.25, 0.25, 0.001, 0.17, 0.195, 0.19],
'delta': [0.7, -0.3, 1, -0.25, 0.6, -0.4]}).set_index('timestamp', drop=True)
df
Working
Absolute delta for each row:
df['absdelta'] = df['delta'].abs()
Absolute-delta sum and maximum sigma for each group, in a new dataframe df2:
df2 = df.groupby(['timestamp','opt_expiry','strike']).agg({'absdelta':'sum','sigma':'max'})
df2
Merge df2 back onto df on all three keys (merging on strike alone could wrongly match rows across different timestamps):
df3 = df.reset_index().merge(df2.reset_index(),
                             on=['timestamp','opt_expiry','strike'],
                             suffixes=('', '_right')).set_index('timestamp')
df3
Mask rows whose group's absolute-delta sum is not equal to 1:
m = df3['absdelta_right'] != 1
m
Using the mask, assign the group's maximum sigma to the rows flagged above:
df3.loc[m,'sigma']=df3.loc[m,'sigma_right']
Slice to get back to the original dataframe's columns:
df3.iloc[:,:-4]
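For 10 million rows, a merge-free alternative built on GroupBy.transform may be shorter and faster; a sketch that groups on the timestamp index level together with the two columns:
df['absdelta'] = df['delta'].abs()
g = df.groupby(['timestamp', 'opt_expiry', 'strike'])
bad = (g['absdelta'].transform('sum') != 1).to_numpy()  # flag the bogus pairs
df.loc[bad, 'sigma'] = g['sigma'].transform('max').to_numpy()[bad]
Here .to_numpy() sidesteps index alignment on the duplicated timestamp index.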

How to set cells of matrix from matrix of columns indexes

I'd like to build a kernel from a list of positions and list of kernel centers. The kernel should be an indicator of the TWO closest centers to each position.
> import numpy as np
> x = np.array([0.1, 0.49, 1.9]).reshape((3, 1))  # Positions
> c = np.array([-2., 0.1, 0.2, 0.4, 0.5, 2.])     # centers
> print(x)
> print(c)
[[ 0.1 ]
[ 0.49]
[ 1.9 ]]
[-2. 0.1 0.2 0.4 0.5 2. ]
What I'd like to get out is:
array([[ 0, 1, 1, 0, 0, 0], # Index 1,2 closest to 0.1
[ 0, 0, 0, 1, 1, 0], # Index 3,4 closest to 0.49
[ 0, 0, 0, 0, 1, 1]]) # Index 4,5 closest to 1.9
I can get:
> dist = np.abs(x-c)
array([[ 2.1 , 0. , 0.1 , 0.3 , 0.4 , 1.9 ],
[ 2.49, 0.39, 0.29, 0.09, 0.01, 1.51],
[ 3.9 , 1.8 , 1.7 , 1.5 , 1.4 , 0.1 ]])
and:
> np.argsort(dist, axis=1)[:,:2]
array([[1, 2],
[4, 3],
[5, 4]])
Here I have a matrix of column indexes, but I can't see how to use them to set values at those columns in another matrix (using efficient numpy operations).
idx = np.argsort(dist, axis=1)[:,:2]
z = np.zeros(dist.shape)
z[idx]=1 # NOPE
z[idx,:]=1 # NOPE
z[:,idx]=1 # NOPE
One way would be to initialize a zeros array and then index into it with advanced indexing:
out = np.zeros(dist.shape,dtype=int)
out[np.arange(idx.shape[0])[:,None],idx] = 1
Alternatively, we could play around with extending dimensions to use broadcasting and come up with a one-liner:
out = (idx[...,None] == np.arange(dist.shape[1])).any(1).astype(int)
For performance, I would suggest using np.argpartition to get those indices:
idx = np.argpartition(dist, 2, axis=1)[:,:2]
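Putting the pieces together (note that, unlike argsort, argpartition does not return the two indices in sorted order, which is fine here because we only set an indicator):
import numpy as np

x = np.array([0.1, 0.49, 1.9]).reshape((3, 1))  # positions
c = np.array([-2., 0.1, 0.2, 0.4, 0.5, 2.])     # centers

dist = np.abs(x - c)                             # pairwise distances
idx = np.argpartition(dist, 2, axis=1)[:, :2]    # two closest centers per row
out = np.zeros(dist.shape, dtype=int)
out[np.arange(idx.shape[0])[:, None], idx] = 1   # advanced indexing sets the indicator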