Calculate median of column with multiple values per cell (ranges) - pandas

I have this code
df = pd.DataFrame({'R': {0: '1', 1: '2', 2: '3', 3: '4', 4: '5', 5: '6', 6: '7'}, 'a': {0: 1.0, 1: 1.0, 2: 2.0, 3: 3.0, 4: 3.0, 5: 2.0, 6: 3.0}, 'nv1': {0: [-1.0], 1: [-1.0], 2: [], 3: [], 4: [-2.0], 5: [-2.0, -1.0, -3.0, -1.0], 6: [-2.0, -1.0, -2.0, -1.0]}})
yielding the following dataframe:
R a nv1
0 1 1.0 [-1.0]
1 2 1.0 [-1.0]
2 3 2.0 []
3 4 3.0 []
4 5 3.0 [-2.0]
5 6 2.0 [-2.0, -1.0, -3.0, -1.0]
6 7 3.0 [-2.0, -1.0, -2.0, -1.0]
I need to calculate the median of each list in df['nv1']:
df['med'] = median of df['nv1']
Desired output is as follows:
R a nv1 med
1 1.0 [-1.0] -1
2 1.0 [-1.0] -1
3 2.0 []
4 3.0 []
5 3.0 [-2.0] -2
6 2.0 [-2.0, -1.0, -3.0, -1.0] -1.5
7 3.0 [-2.0, -1.0, -2.0, -1.0] -1.5
I tried both lines of code below independently, but I ran into errors:
df['nv1'] = pd.to_numeric(df['nv1'],errors = 'coerce')
df['med'] = df['nv1'].median()

Use np.median:
import numpy as np

df['med'] = df['nv1'].apply(np.median)
Output:
>>> df
R a nv1 med
0 1 1.0 [-1.0] -1.0
1 2 1.0 [-1.0] -1.0
2 3 2.0 [] NaN
3 4 3.0 [] NaN
4 5 3.0 [-2.0] -2.0
5 6 2.0 [-2.0, -1.0, -3.0, -1.0] -1.5
6 7 3.0 [-2.0, -1.0, -2.0, -1.0] -1.5
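Note that np.median of an empty list returns nan (and typically emits a RuntimeWarning), which is what produces the NaN entries above. If the warning is unwanted, one option (a sketch, not part of the original answer) is to guard against empty lists:
import numpy as np

# return NaN directly for empty lists instead of letting np.median warn
df['med'] = df['nv1'].apply(lambda v: np.median(v) if len(v) else np.nan)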
Or:
df['med'] = df['nv1'].explode().dropna().groupby(level=0).median()
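As an illustration of how the explode route handles the empty lists, a minimal sketch on a reduced frame (with an added astype(float), since explode returns an object-dtype column): an empty list explodes to NaN, dropna() removes it, so that group is absent from the groupby result and the index-aligned assignment leaves NaN in med.
import pandas as pd

tmp = pd.DataFrame({'nv1': [[-1.0], [], [-2.0, -1.0, -3.0, -1.0]]})
tmp['med'] = (tmp['nv1'].explode()      # one row per list element, NaN for empty lists
                        .dropna()
                        .astype(float)
                        .groupby(level=0)
                        .median())
print(tmp['med'].tolist())   # [-1.0, nan, -1.5]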

Related

Create series based on two pandas dataframe bool columns

How do I create a series based on two pandas dataframe bool columns?
round_up round_down is_round_up is_round_down High Low
0 0.75 0.7 False True 0.70532 0.69818
1 0.75 0.7 False True 0.70196 0.67268
2 0.75 0.7 False True 0.71243 0.69938
3 0.75 0.7 False True 0.70226 0.69884
4 0.75 0.7 False True 0.70292 0.69952
5 0.75 0.7 True True 0.75100 0.69000
The desired output is a series containing round_up where is_round_up is True, round_down where is_round_down is True, and round_up where both are True.
0 0.70
1 0.70
2 0.70
3 0.70
4 0.70
5 0.75
Test data
df = pd.DataFrame({'round_up': {0: 0.75,
1: 0.75,
2: 0.75,
3: 0.75,
4: 0.75,
5: 0.75},
'round_down': {0: 0.70,
1: 0.70,
2: 0.70,
3: 0.70,
4: 0.70,
5: 0.70},
'is_round_up': {0: False, 1: False, 2: False, 3: False, 4: False, 5:True},
'is_round_down': {0: True, 1: True, 2: True, 3: True, 4: True, 5:True},
'High': {0: 0.70532, 1: 0.70196, 2: 0.71243, 3: 0.70226, 4: 0.70292, 5:0.751},
'Low': {0: 0.69818, 1: 0.67268, 2: 0.69938, 3: 0.69884, 4: 0.69952, 5:0.69}})
Edit: To clarify, if both columns are False I would expect to see NaN for that row.
Select the columns in order of choice (up is preferred to down in case of both True), mask the False values, bfill and get the first column.
Advantage: the solution is scalable to any number of columns.
(df[['round_up', 'round_down']]
 .where(df[['is_round_up', 'is_round_down']].values)
 .bfill(axis=1)
 .iloc[:, 0]
 .rename('result')
)
output:
0 0.70
1 0.70
2 0.70
3 0.70
4 0.70
5 0.75
Name: result, dtype: float64
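To double-check the requirement from the edit (NaN when both flags are False), a quick sketch on a one-row frame:
import pandas as pd

test = pd.DataFrame({'round_up': [0.75], 'round_down': [0.70],
                     'is_round_up': [False], 'is_round_down': [False]})
out = (test[['round_up', 'round_down']]
       .where(test[['is_round_up', 'is_round_down']].values)
       .bfill(axis=1)
       .iloc[:, 0])
print(out.iloc[0])   # nan - both flags False leaves the row as NaN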

pd.df find rows pairwise using groupby and change bogus values

My pd.DataFrame looks like this example but has about 10 million rows, hence I am looking for an efficient solution.
import pandas as pd
df = pd.DataFrame({'timestamp':['2004-09-06', '2004-09-06', '2004-09-06', '2004-09-06', '2004-09-07', '2004-09-07'],
'opt_expiry': ['2005-12-16', '2005-12-16', '2005-12-16', '2005-12-16', '2005-06-17', '2005-06-17'],
'strike': [2, 2, 2.5, 2.5, 1.5, 1.5],
'type': ['c', 'p', 'c', 'p', 'c', 'p'],
'sigma': [0.25, 0.25, 0.001, 0.17, 0.195, 0.19],
'delta': [0.7, -0.3, 1, -0.25, 0.6, -0.4]}).set_index('timestamp', drop=True)
df.index = pd.to_datetime(df.index)
df.opt_expiry = pd.to_datetime(df.opt_expiry)
Out[2]:
opt_expiry strike type sigma delta
timestamp
2004-09-06 2005-12-16 2.0 c 0.250 0.70
2004-09-06 2005-12-16 2.0 p 0.250 -0.30
2004-09-06 2005-12-16 2.5 c 0.001 1.00
2004-09-06 2005-12-16 2.5 p 0.170 -0.25
2004-09-07 2005-06-17 1.5 c 0.195 0.60
2004-09-07 2005-06-17 1.5 p 0.190 -0.40
Here is what I am looking to achieve:
1) Find the pairs with identical timestamp, opt_expiry and strike:
groups = df.groupby(['timestamp','opt_expiry','strike'])
2) For each group, check whether the sum of the absolute deltas equals 1. If it does not, find the maximum of the two sigma values and assign that to both rows as the new, correct sigma. Pseudo code:
for group in groups:
# if sum of absolute deltas != 1
if (abs(group.delta[0]) + abs(group.delta[1])) != 1:
correct_sigma = group.sigma.max()
group.sigma = correct_sigma
Expected output:
opt_expiry strike type sigma delta
timestamp
2004-09-06 2005-12-16 2.0 c 0.250 0.70
2004-09-06 2005-12-16 2.0 p 0.250 -0.30
2004-09-06 2005-12-16 2.5 c 0.170 1.00
2004-09-06 2005-12-16 2.5 p 0.170 -0.25
2004-09-07 2005-06-17 1.5 c 0.195 0.60
2004-09-07 2005-06-17 1.5 p 0.190 -0.40
Revised answer. I believe there could be a shorter solution out there; maybe put this up as a bounty.
Data
df = pd.DataFrame({'timestamp':['2004-09-06', '2004-09-06', '2004-09-06', '2004-09-06', '2004-09-07', '2004-09-07'],
'opt_expiry': ['2005-12-16', '2005-12-16', '2005-12-16', '2005-12-16', '2005-06-17', '2005-06-17'],
'strike': [2, 2, 2.5, 2.5, 1.5, 1.5],
'type': ['c', 'p', 'c', 'p', 'c', 'p'],
'sigma': [0.25, 0.25, 0.001, 0.17, 0.195, 0.19],
'delta': [0.7, -0.3, 1, -0.25, 0.6, -0.4]}).set_index('timestamp', drop=True)
df
Working
Absolute delta for each row
df['absdelta']=df['delta'].abs()
Absolute delta sum and maximum sigma for each group in a new dataframe df2
df2=df.groupby(['timestamp','opt_expiry','strike']).agg({'absdelta':'sum','sigma':'max'})#.reset_index().drop(columns=['timestamp','opt_expiry'])
df2
Merge df2 with df
df3=df.merge(df2, left_on='strike', right_on='strike',
suffixes=('', '_right'))
df3
Mask the groups whose absolute delta sum is not equal to 1
m=df3['absdelta_right']!=1
m
Using the mask, apply the maximum sigma to the rows in the groups masked above
df3.loc[m,'sigma']=df3.loc[m,'sigma_right']
Slice to return to the original dataframe's columns
df3.iloc[:,:-4]
Output
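For comparison, a more compact route to the same result (a sketch, not the answer above) using groupby().transform: compute the group-wise absolute-delta sum and maximum sigma aligned to the original rows, then overwrite sigma where the sum is not 1. np.isclose is used because sums such as 0.7 + 0.3 are not exactly 1.0 in floating point.
import numpy as np

# assumes df is the question's frame: timestamp as index,
# opt_expiry/strike/type/sigma/delta as columns
g = df.groupby(['timestamp', 'opt_expiry', 'strike'])
abs_sum = g['delta'].transform(lambda s: s.abs().sum())   # |delta| sum per group
max_sigma = g['sigma'].transform('max')                   # max sigma per group
df['sigma'] = np.where(~np.isclose(abs_sum, 1), max_sigma, df['sigma'])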

Pandas - Unexpected results when indexing a DataFrame containing missing entries

First, we create a large dataset with a MultiIndex whose first record contains missing values (np.NaN)
In [200]: data = []
...: val = 0
...: for ind_1 in range(3000):
...: if ind_1 == 0:
...: data.append({'ind_1': 0, 'ind_2': np.NaN, 'val': np.NaN})
...: else:
...: for ind_2 in range(3000):
...: data.append({'ind_1': ind_1, 'ind_2': ind_2, 'val': val})
...: val += 1
...: df = pd.DataFrame(data).set_index(['ind_1', 'ind_2'])
In [201]: df
Out[201]:
val
ind_1 ind_2
0 NaN NaN
1 0.0 0.0
1.0 1.0
2.0 2.0
3.0 3.0
... ...
2999 2995.0 8996995.0
2996.0 8996996.0
2997.0 8996997.0
2998.0 8996998.0
2999.0 8996999.0
[8997001 rows x 1 columns]
I want to select all rows where ind_1 < 3 and ind_2 < 3
First I create a MultiIndex i1 where ind_1 < 3
In [202]: i1 = df.loc[df.index.get_level_values('ind_1') < 3].index
In [203]: i1
Out[203]:
MultiIndex([(0, nan),
(1, 0.0),
(1, 1.0),
(1, 2.0),
(1, 3.0),
(1, 4.0),
(1, 5.0),
(1, 6.0),
(1, 7.0),
(1, 8.0),
...
(2, 2990.0),
(2, 2991.0),
(2, 2992.0),
(2, 2993.0),
(2, 2994.0),
(2, 2995.0),
(2, 2996.0),
(2, 2997.0),
(2, 2998.0),
(2, 2999.0)],
names=['ind_1', 'ind_2'], length=6001)
Then I create a MultiIndex i2 where ind_2 < 3
In [204]: i2 = df.loc[~(df.index.get_level_values('ind_2') > 2)].index
In [205]: i2
Out[205]:
MultiIndex([( 0, nan),
( 1, 0.0),
( 1, 1.0),
( 1, 2.0),
( 2, 0.0),
( 2, 1.0),
( 2, 2.0),
( 3, 0.0),
( 3, 1.0),
( 3, 2.0),
...
(2996, 2.0),
(2997, 0.0),
(2997, 1.0),
(2997, 2.0),
(2998, 0.0),
(2998, 1.0),
(2998, 2.0),
(2999, 0.0),
(2999, 1.0),
(2999, 2.0)],
names=['ind_1', 'ind_2'], length=8998)
Logically, the solution should be the intersection of these two sets
In [206]: df.loc[i1 & i2]
Out[206]:
val
ind_1 ind_2
1 0.0 0.0
1.0 1.0
2.0 2.0
2 0.0 3000.0
1.0 3001.0
2.0 3002.0
Why is the first record (0, nan) filtered out?
Use boolean arrays i1, i2 instead of indexes. Boolean masks are combined position by position, so the row whose label contains NaN is kept.
In [27]: i1 = df.index.get_level_values('ind_1') < 3
In [28]: i2 = ~(df.index.get_level_values('ind_2') > 2)
In [29]: i1
Out[29]: array([ True, True, True, ..., False, False, False])
In [30]: i2
Out[30]: array([ True, True, True, ..., False, False, False])
In [31]: df.loc[i1 & i2]
Out[31]:
val
ind_1 ind_2
0 NaN NaN
1 0.0 0.0
1.0 1.0
2.0 2.0
2 0.0 3000.0
1.0 3001.0
2.0 3002.0
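A plausible explanation (an assumption about the mechanics, not stated above): i1 & i2 intersects the two indexes by label equality, and NaN never compares equal to itself, so the (0, nan) label is dropped from the label-based intersection, while the positional boolean masks keep that row.
import numpy as np

print(np.nan == np.nan)   # False - equality-based label matching can lose NaN entries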

Why does pandas change the index value in this example?

First we create a raw dataset with a MultiIndex -
In [166]: import numpy as np; import pandas as pd
In [167]: data_raw = pd.DataFrame([
...: {'frame': 1, 'face': np.NaN, 'lmark': np.NaN, 'x': np.NaN, 'y': np.NaN},
...: {'frame': 197, 'face': 0, 'lmark': 1, 'x': 969, 'y': 737},
...: {'frame': 197, 'face': 0, 'lmark': 2, 'x': 969, 'y': 740},
...: {'frame': 197, 'face': 0, 'lmark': 3, 'x': 970, 'y': 744},
...: {'frame': 197, 'face': 0, 'lmark': 4, 'x': 972, 'y': 748},
...: {'frame': 197, 'face': 0, 'lmark': 5, 'x': 973, 'y': 752},
...: {'frame': 300, 'face': 0, 'lmark': 1, 'x': 745, 'y': 367},
...: {'frame': 300, 'face': 0, 'lmark': 2, 'x': 753, 'y': 411},
...: {'frame': 300, 'face': 0, 'lmark': 3, 'x': 759, 'y': 455},
...: {'frame': 301, 'face': 0, 'lmark': 1, 'x': 741, 'y': 364},
...: {'frame': 301, 'face': 0, 'lmark': 2, 'x': 746, 'y': 408},
...: {'frame': 301, 'face': 0, 'lmark': 3, 'x': 750, 'y': 452}]).set_index(['frame', 'face', 'lmark'])
Next we calculate the z-scores for each lmark -
In [168]: ((data_raw - data_raw.mean(level='lmark')).abs()) / data_raw.std(level='lmark')
Out[168]:
x y
frame face lmark
1 NaN NaN NaN NaN
197 0.0 1.0 1.154565 1.154672
2.0 1.154260 1.154665
3.0 1.153946 1.154654
4.0 NaN NaN
5.0 NaN NaN
300 0.0 1.0 0.561956 0.570343
2.0 0.549523 0.569472
3.0 0.540829 0.568384
301 0.0 1.0 0.592609 0.584329
2.0 0.604738 0.585193
3.0 0.613117 0.586270
The index values don't change, as expected.
Now we filter out records where lmark > 3 -
In [170]: data_filtered = data_raw.loc[(slice(None), slice(None), [np.NaN, slice(3)]),:]
In [171]: data_filtered
Out[171]:
x y
frame face lmark
1 NaN NaN NaN NaN
197 0.0 1.0 969.0 737.0
2.0 969.0 740.0
3.0 970.0 744.0
300 0.0 1.0 745.0 367.0
2.0 753.0 411.0
3.0 759.0 455.0
301 0.0 1.0 741.0 364.0
2.0 746.0 408.0
3.0 750.0 452.0
and recalculate the z-scores -
In [172]: ((data_filtered - data_filtered.mean(level='lmark')).abs()) / data_filtered.std(level='lmark')
Out[172]:
x y
frame face lmark
1 NaN 1.0 NaN NaN
197 0.0 1.0 1.154565 1.154672
2.0 1.154260 1.154665
3.0 1.153946 1.154654
300 0.0 1.0 0.561956 0.570343
2.0 0.549523 0.569472
3.0 0.540829 0.568384
301 0.0 1.0 0.592609 0.584329
2.0 0.604738 0.585193
3.0 0.613117 0.586270
Why has the value of the first record's lmark index changed from NaN to 1.0?
I think it is a bug.
The solution is to use MultiIndex.remove_unused_levels:
data_filtered.index = data_filtered.index.remove_unused_levels()
a = ((data_filtered - data_filtered.mean(level='lmark')).abs()) / data_filtered.std(level='lmark')
print (a)
x y
frame face lmark
1 NaN NaN NaN NaN
197 0.0 1.0 1.154565 1.154672
2.0 1.154260 1.154665
3.0 1.153946 1.154654
300 0.0 1.0 0.561956 0.570343
2.0 0.549523 0.569472
3.0 0.540829 0.568384
301 0.0 1.0 0.592609 0.584329
2.0 0.604738 0.585193
3.0 0.613117 0.586270
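For context, a minimal sketch of the behaviour remove_unused_levels corrects (names here are illustrative): slicing with .loc keeps the full original level arrays on the MultiIndex, so values that no longer correspond to any remaining row are still carried around, and level-based statistics can align against those stale labels.
import pandas as pd

idx = pd.MultiIndex.from_tuples([(1, 'a'), (2, 'b')], names=['i', 'j'])
sub = pd.DataFrame({'v': [10, 20]}, index=idx).loc[[(1, 'a')]]
print(sub.index.levels)                          # [[1, 2], ['a', 'b']] - unused values kept
print(sub.index.remove_unused_levels().levels)   # [[1], ['a']]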

How to create a nested dictionary from pandas dataframe and again convert it to dataframe?

import pandas as pd
import numpy as np
d = {
'Fruit':['Guava','Orange','Lemon'],
'ID1':[1,2,11],
'ID2':[3,4,12],
'ID3':[5,6,np.nan],
'ID4':[7,8,14],
'ID5':[9,10,np.nan],
'ID6':[11,np.nan,np.nan],
'ID7':[13,np.nan,np.nan],
'ID8':[15,np.nan,np.nan],
'ID9':[17,np.nan,np.nan],
'Category':['Myrtaceae','Citrus','Citrus']
}
df = pd.DataFrame(data = d)
df
How do I convert the above dataframe to the following dictionary?
Expected Output:
{
 'Myrtaceae': {'Guava': {1, 3, 5, 7, 9, 11, 13, 15, 17}},
 'Citrus': {'Orange': {2, 4, 6, 8, 10, np.nan, np.nan, np.nan, np.nan},
            'Lemon': {11, 12, np.nan, 14, np.nan, np.nan, np.nan, np.nan, np.nan}},
}
And how do I convert the dictionary back to a dataframe?
Use a dictionary comprehension with groupby:
d = {k: v.set_index('Fruit').T.to_dict('list')
     for k, v in df.set_index('Category').groupby(level=0)}
print (d)
{'Citrus': {'Orange': [2.0, 4.0, 6.0, 8.0, 10.0, nan, nan, nan, nan],
'Lemon': [11.0, 12.0, nan, 14.0, nan, nan, nan, nan, nan]},
'Myrtaceae': {'Guava': [1.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0, 15.0, 17.0]}}
Or:
d = {k: v.drop('Category', axis=1).set_index('Fruit').T.to_dict('list')
     for k, v in df.groupby('Category')}
And then, to convert back to a dataframe:
df = (pd.concat({k: pd.DataFrame(v) for k, v in d.items()}, axis=1)
        .T
        .rename_axis(('Category','Fruit'))
        .rename(columns=lambda x: f'ID{x+1}')
        .reset_index())
print (df)
Category Fruit ID1 ID2 ID3 ID4 ID5 ID6 ID7 ID8 ID9
0 Citrus Orange 2.0 4.0 6.0 8.0 10.0 NaN NaN NaN NaN
1 Citrus Lemon 11.0 12.0 NaN 14.0 NaN NaN NaN NaN NaN
2 Myrtaceae Guava 1.0 3.0 5.0 7.0 9.0 11.0 13.0 15.0 17.0
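If the set-valued layout shown in the question's expected output is what is actually needed, the list values in d can be wrapped in set() afterwards (a small sketch; note that sets are unordered and NaN behaves oddly inside them):
# convert each fruit's list of IDs into a set, keeping the category nesting
d_sets = {cat: {fruit: set(vals) for fruit, vals in fruits.items()}
          for cat, fruits in d.items()}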