Pandas - Unexpected results when indexing a DataFrame containing missing entries

First, we create a large dataset with a MultiIndex whose first record contains missing values (np.NaN):
In [200]: data = []
     ...: val = 0
     ...: for ind_1 in range(3000):
     ...:     if ind_1 == 0:
     ...:         data.append({'ind_1': 0, 'ind_2': np.NaN, 'val': np.NaN})
     ...:     else:
     ...:         for ind_2 in range(3000):
     ...:             data.append({'ind_1': ind_1, 'ind_2': ind_2, 'val': val})
     ...:             val += 1
     ...: df = pd.DataFrame(data).set_index(['ind_1', 'ind_2'])
In [201]: df
Out[201]:
                    val
ind_1 ind_2
0     NaN           NaN
1     0.0           0.0
      1.0           1.0
      2.0           2.0
      3.0           3.0
...                 ...
2999  2995.0  8996995.0
      2996.0  8996996.0
      2997.0  8996997.0
      2998.0  8996998.0
      2999.0  8996999.0
[8997001 rows x 1 columns]
I want to select all rows where ind_1 < 3 and ind_2 < 3.
First I create a MultiIndex i1 where ind_1 < 3:
In [202]: i1 = df.loc[df.index.get_level_values('ind_1') < 3].index
In [203]: i1
Out[203]:
MultiIndex([(0, nan),
            (1, 0.0),
            (1, 1.0),
            (1, 2.0),
            (1, 3.0),
            (1, 4.0),
            (1, 5.0),
            (1, 6.0),
            (1, 7.0),
            (1, 8.0),
            ...
            (2, 2990.0),
            (2, 2991.0),
            (2, 2992.0),
            (2, 2993.0),
            (2, 2994.0),
            (2, 2995.0),
            (2, 2996.0),
            (2, 2997.0),
            (2, 2998.0),
            (2, 2999.0)],
           names=['ind_1', 'ind_2'], length=6001)
Then I create a MultiIndex i2 where ind_2 < 3. It is written as the negation of ind_2 > 2, so the NaN entry is kept (NaN > 2 is False):
In [204]: i2 = df.loc[~(df.index.get_level_values('ind_2') > 2)].index
In [205]: i2
Out[205]:
MultiIndex([(   0, nan),
            (   1, 0.0),
            (   1, 1.0),
            (   1, 2.0),
            (   2, 0.0),
            (   2, 1.0),
            (   2, 2.0),
            (   3, 0.0),
            (   3, 1.0),
            (   3, 2.0),
            ...
            (2996, 2.0),
            (2997, 0.0),
            (2997, 1.0),
            (2997, 2.0),
            (2998, 0.0),
            (2998, 1.0),
            (2998, 2.0),
            (2999, 0.0),
            (2999, 1.0),
            (2999, 2.0)],
           names=['ind_1', 'ind_2'], length=8998)
Logically, the solution should be the intersection of these two sets:
In [206]: df.loc[i1 & i2]
Out[206]:
                val
ind_1 ind_2
1     0.0       0.0
      1.0       1.0
      2.0       2.0
2     0.0    3000.0
      1.0    3001.0
      2.0    3002.0
Why is the first record (0, nan) filtered out?

Use boolean arrays for i1 and i2 instead of indexes, and combine the masks with &:
In [27]: i1 = df.index.get_level_values('ind_1') < 3
In [28]: i2 = ~(df.index.get_level_values('ind_2') > 2)
In [29]: i1
Out[29]: array([ True, True, True, ..., False, False, False])
In [30]: i2
Out[30]: array([ True, True, True, ..., False, False, False])
In [31]: df.loc[i1 & i2]
Out[31]:
                val
ind_1 ind_2
0     NaN       NaN
1     0.0       0.0
      1.0       1.0
      2.0       2.0
2     0.0    3000.0
      1.0    3001.0
      2.0    3002.0
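The boolean-mask approach above can be tried on a much smaller frame. The following is a minimal sketch, assuming a scaled-down version of the question's data (4 values of ind_1 after the NaN record, 5 values of ind_2); the names small, m1, m2 and result are illustrative and not from the original post.

```python
import numpy as np
import pandas as pd

# Scaled-down version of the question's data: one NaN record, then a small grid.
rows = [{'ind_1': 0, 'ind_2': np.nan, 'val': np.nan}]
val = 0
for ind_1 in range(1, 5):
    for ind_2 in range(5):
        rows.append({'ind_1': ind_1, 'ind_2': ind_2, 'val': val})
        val += 1
small = pd.DataFrame(rows).set_index(['ind_1', 'ind_2'])

# Element-wise boolean masks, one entry per row of the frame.
m1 = small.index.get_level_values('ind_1') < 3
# NaN > 2 evaluates to False, so the negation keeps the NaN record.
m2 = ~(small.index.get_level_values('ind_2') > 2)

# Combining masks is a plain element-wise AND; no index set operation is involved.
result = small.loc[m1 & m2]
print(result)
```

Because the masks are positional, the (0, NaN) record is selected like any other row.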

Related

Pandas rolling mean only for non-NaNs

I have a DataFrame:
df = pd.DataFrame({'B': [0, 1, 2, np.nan, 4],
                   'A1': [1, 1, 2, 2, 2],
                   'A2': [1, 2, 3, 3, 3]})
I want to group by columns "A1" and "A2" and then apply a rolling mean on "B" with window 3. If fewer values are available, that is fine; the mean should still be computed. But I do not want any value where there is no original entry.
Result should be:
pd.DataFrame({'B': [0, 1, 2, np.nan, 3]})
Applying df.rolling(3, min_periods=1).mean() yields:
pd.DataFrame({'B': [0, 1, 2, 2, 3]})
Any ideas?
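The reason is that with window=3 and min_periods=1 the rolling mean still outputs a number whenever at least one non-NaN value falls inside the window, so the NaN positions get filled. A possible solution is to set NaN manually after rolling: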
df = pd.DataFrame({'B': [0, 1, 2, np.nan, 4],
                   'A': [1, 1, 2, 2, 2]})
df['C'] = df['B'].rolling(3, min_periods=1).mean().mask(df['B'].isna())
df['D'] = df.groupby('A')['B'].rolling(3, min_periods=1).mean().droplevel(0).mask(df['B'].isna())
print(df)
     B  A    C    D
0  0.0  1  0.0  0.0
1  1.0  1  0.5  0.5
2  2.0  2  1.0  2.0
3  NaN  2  NaN  NaN
4  4.0  2  3.0  3.0
EDIT: For multiple grouping columns, remove the group levels with Series.droplevel:
df = pd.DataFrame({'B': [0, 1, 2, np.nan, 4],
                   'A1': [1, 1, 2, 2, 2],
                   'A2': [1, 2, 3, 3, 3]})
df['D'] = df.groupby(['A1','A2'])['B'].rolling(3, min_periods=1).mean().droplevel(['A1','A2']).mask(df['B'].isna())
print(df)
     B  A1  A2    D
0  0.0   1   1  0.0
1  1.0   1   2  1.0
2  2.0   2   3  2.0
3  NaN   2   3  NaN
4  4.0   2   3  3.0
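A possible variant, not from the original answer, is to let groupby(...).transform do the alignment so that no droplevel call is needed. This is a minimal sketch on the same example data; the intermediate name rolled is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'B': [0, 1, 2, np.nan, 4],
                   'A1': [1, 1, 2, 2, 2],
                   'A2': [1, 2, 3, 3, 3]})

# transform returns a Series aligned with df's original index,
# so the group keys never become extra index levels.
rolled = (df.groupby(['A1', 'A2'])['B']
            .transform(lambda s: s.rolling(3, min_periods=1).mean()))

# Blank out positions where the original value was missing.
df['D'] = rolled.mask(df['B'].isna())
print(df)
```

The masking step is the same as in the answer above; only the way the grouped result is mapped back onto the rows differs.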

List of the (row, col) of the n largest values in a numeric pandas DataFrame?

Given a pandas DataFrame of numeric values, how can one produce a list of the .loc cell locations that can then be used to obtain the corresponding n largest values in the entire DataFrame?
For example:
   A       B     C     D      E
X  1.3     3.6   33    61.38  0.3
Y  3.14    2.71  64    23.2   21
Z  1024    42    66    137    22.2
T  63.123  111   1.23  14.16  50.49
An n of 3 would produce the (row,col) pairs for the values 1024, 137 and 111.
These locations could then, as usual, be fed to .loc to extract those values from the DataFrame. i.e.
df.loc['Z','A']
df.loc['Z','D']
df.loc['T','B']
Note: It is easy to mistake this question for one that involves .idxmax. That isn't applicable due to the fact that there may be multiple values selected from a row and/or column in the n largest.
You could try:
>>> data = {0 : [1.3, 3.14, 1024, 63.123], 1: [3.6, 2.71, 42, 111], 2 : [33, 64, 66, 1.23], 3 : [61.38, 23.2, 137, 14.16], 4 : [0.3, 21, 22.2, 50.49] }
>>> df = pd.DataFrame(data)
>>> df
          0       1      2       3      4
0     1.300    3.60  33.00   61.38   0.30
1     3.140    2.71  64.00   23.20  21.00
2  1024.000   42.00  66.00  137.00  22.20
3    63.123  111.00   1.23   14.16  50.49
>>>
>>> a = list(zip(*df.stack().nlargest(3).index.labels))
>>> a
[(2, 0), (2, 3), (3, 1)]
>>> # then ...
>>> df.loc[a[0]]
1024.0
>>>
>>> # all sorted in decreasing order ...
>>> list(zip(*df.stack().nlargest(20).index.labels))
[(2, 0), (2, 3), (3, 1), (2, 2), (1, 2), (3, 0), (0, 3), (3, 4), (2, 1), (0, 2), (1, 3), (2, 4), (1, 4), (3, 3), (0, 1), (1, 0), (1, 1), (0, 0), (3, 2), (0, 4)]
Edit: In pandas 0.24.0, MultiIndex.labels was deprecated in favor of MultiIndex.codes (see Deprecations in What’s new in 0.24.0 (January 25, 2019)), and the attribute was removed in pandas 1.0. On versions without it, the above code raises AttributeError: 'MultiIndex' object has no attribute 'labels' and needs to be updated as follows:
>>> a = list(zip(*df.stack().nlargest(3).index.codes))
>>> a
[(2, 0), (2, 3), (3, 1)]
Edit 2: This question has become a "moving target", as the OP keeps changing it (this is my last update/edit). In the last update, OP's dataframe looks as follows:
>>> data = {'A' : [1.3, 3.14, 1024, 63.123], 'B' : [3.6, 2.71, 42, 111], 'C' : [33, 64, 66, 1.23], 'D' : [61.38, 23.2, 137, 14.16], 'E' : [0.3, 21, 22.2, 50.49] }
>>> df = pd.DataFrame(data, index=['X', 'Y', 'Z', 'T'])
>>> df
          A       B      C       D      E
X     1.300    3.60  33.00   61.38   0.30
Y     3.140    2.71  64.00   23.20  21.00
Z  1024.000   42.00  66.00  137.00  22.20
T    63.123  111.00   1.23   14.16  50.49
The desired output can be obtained using:
>>> a = df.stack().nlargest(3).index
>>> a
MultiIndex([('Z', 'A'),
            ('Z', 'D'),
            ('T', 'B')],
           )
>>>
>>> df.loc[a[0]]
1024.0
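To get the plain list of (row, col) pairs the question asks for, the stacked index can be converted with tolist(). A minimal sketch on the labelled frame; the names pairs and values are illustrative.

```python
import pandas as pd

data = {'A': [1.3, 3.14, 1024, 63.123], 'B': [3.6, 2.71, 42, 111],
        'C': [33, 64, 66, 1.23], 'D': [61.38, 23.2, 137, 14.16],
        'E': [0.3, 21, 22.2, 50.49]}
df = pd.DataFrame(data, index=['X', 'Y', 'Z', 'T'])

# stack() turns the frame into a Series indexed by (row, col) pairs;
# nlargest keeps the n biggest values in decreasing order.
pairs = df.stack().nlargest(3).index.tolist()
print(pairs)

# Each pair can be fed back to .loc to fetch the value.
values = [df.loc[row, col] for row, col in pairs]
print(values)
```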
Another approach: the trick is to apply np.unravel_index to the result of np.argsort.
Example:
import numpy as np
import pandas as pd
N = 5
df = pd.DataFrame([[11, 3, 50, -3],
                   [5, 73, 11, 100],
                   [75, 9, -2, 44]])
s_ix = np.argsort(df.values, axis=None)[::-1][:N]
labels = np.unravel_index(s_ix, df.shape)
labels = list(zip(*labels))
print(labels) # --> [(1, 3), (2, 0), (1, 1), (0, 2), (2, 3)]
print(df.loc[labels[0]]) # --> 100
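Note that np.unravel_index yields positions, which coincide with the labels above only because that frame uses the default integer index and columns. For a labelled frame such as the one in the question, one option is to map the positions through df.index and df.columns. A sketch under that assumption; the names flat_order, rows, cols and pairs are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.3, 3.14, 1024, 63.123], 'B': [3.6, 2.71, 42, 111],
                   'C': [33, 64, 66, 1.23], 'D': [61.38, 23.2, 137, 14.16],
                   'E': [0.3, 21, 22.2, 50.49]},
                  index=['X', 'Y', 'Z', 'T'])
n = 3

# Positions of the n largest values in the flattened (row-major) array.
flat_order = np.argsort(df.to_numpy(), axis=None)[::-1][:n]
rows, cols = np.unravel_index(flat_order, df.shape)

# Translate positions into row and column labels usable with .loc.
pairs = list(zip(df.index[rows], df.columns[cols]))
print(pairs)
```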

pd.df find rows pairwise using groupby and change bogus values

My pd.DataFrame looks like this example but has about 10 million rows, hence I am looking for an efficient solution.
import pandas as pd
df = pd.DataFrame({'timestamp': ['2004-09-06', '2004-09-06', '2004-09-06', '2004-09-06', '2004-09-07', '2004-09-07'],
                   'opt_expiry': ['2005-12-16', '2005-12-16', '2005-12-16', '2005-12-16', '2005-06-17', '2005-06-17'],
                   'strike': [2, 2, 2.5, 2.5, 1.5, 1.5],
                   'type': ['c', 'p', 'c', 'p', 'c', 'p'],
                   'sigma': [0.25, 0.25, 0.001, 0.17, 0.195, 0.19],
                   'delta': [0.7, -0.3, 1, -0.25, 0.6, -0.4]}).set_index('timestamp', drop=True)
df.index = pd.to_datetime(df.index)
df.opt_expiry = pd.to_datetime(df.opt_expiry)
Out[2]:
           opt_expiry  strike type  sigma  delta
timestamp
2004-09-06 2005-12-16     2.0    c  0.250   0.70
2004-09-06 2005-12-16     2.0    p  0.250  -0.30
2004-09-06 2005-12-16     2.5    c  0.001   1.00
2004-09-06 2005-12-16     2.5    p  0.170  -0.25
2004-09-07 2005-06-17     1.5    c  0.195   0.60
2004-09-07 2005-06-17     1.5    p  0.190  -0.40
Here is what I am looking to achieve:
1) Find the pairs with identical timestamp, opt_expiry and strike:
groups = df.groupby(['timestamp','opt_expiry','strike'])
2) For each group, check whether the sum of the absolute deltas equals 1. If it does not, find the maximum of the two sigma values and assign it to both rows as the new, correct sigma. Pseudo code:
for group in groups:
    # if sum of absolute deltas != 1
    if (abs(group.delta[0]) + abs(group.delta[1])) != 1:
        correct_sigma = group.sigma.max()
        group.sigma = correct_sigma
Expected output:
           opt_expiry  strike type  sigma  delta
timestamp
2004-09-06 2005-12-16     2.0    c  0.250   0.70
2004-09-06 2005-12-16     2.0    p  0.250  -0.30
2004-09-06 2005-12-16     2.5    c  0.170   1.00
2004-09-06 2005-12-16     2.5    p  0.170  -0.25
2004-09-07 2005-06-17     1.5    c  0.195   0.60
2004-09-07 2005-06-17     1.5    p  0.190  -0.40
Revised answer. I believe there could be a shorter answer out there. Maybe put it up as bounty
Data
df = pd.DataFrame({'timestamp': ['2004-09-06', '2004-09-06', '2004-09-06', '2004-09-06', '2004-09-07', '2004-09-07'],
                   'opt_expiry': ['2005-12-16', '2005-12-16', '2005-12-16', '2005-12-16', '2005-06-17', '2005-06-17'],
                   'strike': [2, 2, 2.5, 2.5, 1.5, 1.5],
                   'type': ['c', 'p', 'c', 'p', 'c', 'p'],
                   'sigma': [0.25, 0.25, 0.001, 0.17, 0.195, 0.19],
                   'delta': [0.7, -0.3, 1, -0.25, 0.6, -0.4]}).set_index('timestamp', drop=True)
df
Working
Absolute delta for each row:
df['absdelta'] = df['delta'].abs()
Sum of absolute deltas and maximum sigma for each group, in a new dataframe df2:
df2 = df.groupby(['timestamp', 'opt_expiry', 'strike']).agg({'absdelta': 'sum', 'sigma': 'max'})
df2
Merge df2 with df:
df3 = df.merge(df2, left_on='strike', right_on='strike',
               suffixes=('', '_right'))
df3
Mask the groups whose sum of absolute deltas is not equal to 1:
m = df3['absdelta_right'] != 1
m
Using the mask, apply the maximum sigma to the rows in the groups masked above:
df3.loc[m, 'sigma'] = df3.loc[m, 'sigma_right']
Slice to return to the original dataframe:
df3.iloc[:, :-4]
Output
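A shorter route, offered here as a sketch rather than as part of the original answer, is to compute the per-group quantities with groupby(...).transform, which avoids the merge altogether. It groups on all three keys (merging on strike alone relies on each strike value belonging to a single group, which holds in the sample data), and it assumes that "equals 1" should be tested with a floating-point tolerance via np.isclose. The names tmp, grouped, abs_sum, max_sigma and bogus are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'timestamp': ['2004-09-06', '2004-09-06', '2004-09-06', '2004-09-06', '2004-09-07', '2004-09-07'],
                   'opt_expiry': ['2005-12-16', '2005-12-16', '2005-12-16', '2005-12-16', '2005-06-17', '2005-06-17'],
                   'strike': [2, 2, 2.5, 2.5, 1.5, 1.5],
                   'type': ['c', 'p', 'c', 'p', 'c', 'p'],
                   'sigma': [0.25, 0.25, 0.001, 0.17, 0.195, 0.19],
                   'delta': [0.7, -0.3, 1, -0.25, 0.6, -0.4]}).set_index('timestamp')

# Work on a copy with timestamp as a regular column so all keys are columns.
tmp = df.reset_index()
tmp['absdelta'] = tmp['delta'].abs()
grouped = tmp.groupby(['timestamp', 'opt_expiry', 'strike'])

# transform broadcasts each group's aggregate back to its rows.
abs_sum = grouped['absdelta'].transform('sum').to_numpy()
max_sigma = grouped['sigma'].transform('max').to_numpy()

# Rows whose group does not sum to 1 (within tolerance) get the group maximum.
bogus = ~np.isclose(abs_sum, 1.0)
df['sigma'] = np.where(bogus, max_sigma, df['sigma'].to_numpy())
print(df)
```

Everything here is vectorized, so it should scale to large frames better than an explicit loop over groups, though that has not been measured on the 10 million row case.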

Why does pandas change the index value in this example?

First we create a raw dataset with a MultiIndex -
In [166]: import numpy as np; import pandas as pd
In [167]: data_raw = pd.DataFrame([
     ...:     {'frame': 1, 'face': np.NaN, 'lmark': np.NaN, 'x': np.NaN, 'y': np.NaN},
     ...:     {'frame': 197, 'face': 0, 'lmark': 1, 'x': 969, 'y': 737},
     ...:     {'frame': 197, 'face': 0, 'lmark': 2, 'x': 969, 'y': 740},
     ...:     {'frame': 197, 'face': 0, 'lmark': 3, 'x': 970, 'y': 744},
     ...:     {'frame': 197, 'face': 0, 'lmark': 4, 'x': 972, 'y': 748},
     ...:     {'frame': 197, 'face': 0, 'lmark': 5, 'x': 973, 'y': 752},
     ...:     {'frame': 300, 'face': 0, 'lmark': 1, 'x': 745, 'y': 367},
     ...:     {'frame': 300, 'face': 0, 'lmark': 2, 'x': 753, 'y': 411},
     ...:     {'frame': 300, 'face': 0, 'lmark': 3, 'x': 759, 'y': 455},
     ...:     {'frame': 301, 'face': 0, 'lmark': 1, 'x': 741, 'y': 364},
     ...:     {'frame': 301, 'face': 0, 'lmark': 2, 'x': 746, 'y': 408},
     ...:     {'frame': 301, 'face': 0, 'lmark': 3, 'x': 750, 'y': 452}]).set_index(['frame', 'face', 'lmark'])
Next we calculate the z-scores for each lmark -
In [168]: ((data_raw - data_raw.mean(level='lmark')).abs()) / data_raw.std(level='lmark')
Out[168]:
                         x         y
frame face lmark
1     NaN  NaN         NaN       NaN
197   0.0  1.0    1.154565  1.154672
           2.0    1.154260  1.154665
           3.0    1.153946  1.154654
           4.0         NaN       NaN
           5.0         NaN       NaN
300   0.0  1.0    0.561956  0.570343
           2.0    0.549523  0.569472
           3.0    0.540829  0.568384
301   0.0  1.0    0.592609  0.584329
           2.0    0.604738  0.585193
           3.0    0.613117  0.586270
The index values don't change, as expected.
Now we filter out records where lmark > 3 -
In [170]: data_filtered = data_raw.loc[(slice(None), slice(None), [np.NaN, slice(3)]),:]
In [171]: data_filtered
Out[171]:
                         x         y
frame face lmark
1     NaN  NaN         NaN       NaN
197   0.0  1.0       969.0     737.0
           2.0       969.0     740.0
           3.0       970.0     744.0
300   0.0  1.0       745.0     367.0
           2.0       753.0     411.0
           3.0       759.0     455.0
301   0.0  1.0       741.0     364.0
           2.0       746.0     408.0
           3.0       750.0     452.0
and recalculate the z-scores -
In [172]: ((data_filtered - data_filtered.mean(level='lmark')).abs()) / data_filtered.std(level='lmark')
Out[172]:
                         x         y
frame face lmark
1     NaN  1.0         NaN       NaN
197   0.0  1.0    1.154565  1.154672
           2.0    1.154260  1.154665
           3.0    1.153946  1.154654
300   0.0  1.0    0.561956  0.570343
           2.0    0.549523  0.569472
           3.0    0.540829  0.568384
301   0.0  1.0    0.592609  0.584329
           2.0    0.604738  0.585193
           3.0    0.613117  0.586270
Why has the value of the first record's lmark index changed from NaN to 1.0?
I think this looks like a bug.
A solution is to use MultiIndex.remove_unused_levels:
data_filtered.index = data_filtered.index.remove_unused_levels()
a = ((data_filtered - data_filtered.mean(level='lmark')).abs()) / data_filtered.std(level='lmark')
print(a)
                         x         y
frame face lmark
1     NaN  NaN         NaN       NaN
197   0.0  1.0    1.154565  1.154672
           2.0    1.154260  1.154665
           3.0    1.153946  1.154654
300   0.0  1.0    0.561956  0.570343
           2.0    0.549523  0.569472
           3.0    0.540829  0.568384
301   0.0  1.0    0.592609  0.584329
           2.0    0.604738  0.585193
           3.0    0.613117  0.586270
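What remove_unused_levels changes can be seen by inspecting the index levels directly. The following is a minimal sketch on a simplified two-level index without NaN entries (an assumption made to keep the example small); it shows only that filtering leaves stale level values behind, not why the NaN label was rewritten in the question. The names data, filtered, before, cleaned and after are illustrative.

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [(197, 1), (197, 2), (197, 3), (197, 4), (197, 5), (300, 1)],
    names=['frame', 'lmark'])
data = pd.DataFrame({'x': range(6)}, index=idx)

# Drop the rows with lmark > 3; the row labels change, the stored levels do not.
filtered = data[data.index.get_level_values('lmark') <= 3]
before = list(filtered.index.levels[1])
print(before)   # still lists 4 and 5, although no row uses them

# remove_unused_levels rebuilds the levels from the values actually present.
cleaned = filtered.index.remove_unused_levels()
after = list(cleaned.levels[1])
print(after)
```

The cleaned index labels the same rows as before; only its internal level storage differs.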

split dataframe column including list of lists on multiple rows into pandas

I have a dataframe df:
import pandas as pd
df = pd.DataFrame([
    [[[3, 0.5, 0.4, 0.7, 5], [2, 0.5, 1, 0.8, 2], [1, 0.5, 1, 1, 2]], 'b'],
    [[[1, 0.5, 0.6, 0.01, 1], [2, 0.5, 0.3, 0.2, 3], [1, 0.8, 1.0, 0.04, 3]], 'd']],
    index=['row1', 'row2'],
    columns=['col1', 'col2'])
I would like to split col1, including list of lists, on multiple lines as follows:
      col1                    col2
row1  [3, 0.5, 0.4, 0.7, 5]   b
row1  [2, 0.5, 1, 0.8, 2]     b
row1  [1, 0.5, 1, 1, 2]       b
row2  [1, 0.5, 0.6, 0.01, 1]  d
row2  [2, 0.5, 0.3, 0.2, 3]   d
row2  [1, 0.8, 1.0, 0.04, 3]  d
and next split col1 in 2 columns, retaining only the second and the third elements
      new_col1  new_col2  col2
row1  0.5       0.4       b
row1  0.5       1         b
row1  0.5       1         b
row2  0.5       0.6       d
row2  0.5       0.3       d
row2  0.8       1.0       d
How can this be done using pandas?
For the first step there may not be anything better than a loop:
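# df.ix and DataFrame.append are no longer available in current pandas,
# so this uses df.loc and a single pd.concat instead.
frames = []
for row in df.index:
    col = df.loc[row, 'col1']
    # One output row per inner list, repeating the index label and col2.
    frames.append(pd.DataFrame(
        [[c, df.loc[row, 'col2']] for c in col],
        index=[row] * len(col),
        columns=['col1', 'col2']))
df2 = pd.concat(frames)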
For the second step, just add the new columns and delete the original one:
df3 = df2.copy()
df3['new_col1'] = [c[1] for c in df3['col1']]
df3['new_col2'] = [c[2] for c in df3['col1']]
del df3['col1']
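On pandas 0.25 or later, the first step can also be written without a loop using DataFrame.explode, which was not available when the answer above was written. A minimal sketch; the names exploded and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame([
    [[[3, 0.5, 0.4, 0.7, 5], [2, 0.5, 1, 0.8, 2], [1, 0.5, 1, 1, 2]], 'b'],
    [[[1, 0.5, 0.6, 0.01, 1], [2, 0.5, 0.3, 0.2, 3], [1, 0.8, 1.0, 0.04, 3]], 'd']],
    index=['row1', 'row2'],
    columns=['col1', 'col2'])

# explode unpacks only the outer list: each inner list becomes its own row,
# with the index label and col2 repeated.
exploded = df.explode('col1')

# Keep the second and third element of every inner list.
exploded['new_col1'] = [c[1] for c in exploded['col1']]
exploded['new_col2'] = [c[2] for c in exploded['col1']]
result = exploded[['new_col1', 'new_col2', 'col2']]
print(result)
```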