Pandas dataframe finding largest N elements of each row with row-specific N - pandas

I have a DataFrame:
>>> df = pd.DataFrame({'row1' : [1,2,np.nan,4,5], 'row2' : [11,12,13,14,np.nan], 'row3':[22,22,23,24,25]}, index = 'a b c d e'.split()).T
>>> df
a b c d e
row1 1.0 2.0 NaN 4.0 5.0
row2 11.0 12.0 13.0 14.0 NaN
row3 22.0 22.0 23.0 24.0 25.0
and a Series that specifies the number of top N values I want from each row
>>> n_max = pd.Series([2,3,4])
What is the pandas way of using df and n_max to find the largest N elements of each row (breaking ties with a random pick, just as .nlargest() would do)?
The desired output is
a b c d e
row1 NaN NaN NaN 4.0 5.0
row2 NaN 12.0 13.0 14.0 NaN
row3 22.0 NaN 23.0 24.0 25.0
I know how to do this with a uniform/fixed N across all rows (say, N=4). Note the tie-breaking in row3:
>>> df.stack().groupby(level=0).nlargest(4).unstack().reset_index(level=1, drop=True).reindex(columns=df.columns)
a b c d e
row1 1.0 2.0 NaN 4.0 5.0
row2 11.0 12.0 13.0 14.0 NaN
row3 22.0 NaN 23.0 24.0 25.0
But the goal, again, is to have row-specific N. Looping through each row obviously doesn't count (for performance reasons). And I've tried using .rank() with a mask but tie breaking doesn't work there...

Based on @ScottBoston's comment on the OP, it is possible to use the following mask based on rank to solve this problem:
>>> n_max.index = df.index
>>> df_rank = df.stack(dropna=False).groupby(level=0).rank(ascending=False, method='first').unstack()
>>> selected = df_rank.le(n_max, axis=0)
>>> df[selected]
a b c d e
row1 NaN NaN NaN 4.0 5.0
row2 NaN 12.0 13.0 14.0 NaN
row3 22.0 NaN 23.0 24.0 25.0
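As a self-contained variation on the same rank-and-mask idea, here is a minimal sketch that ranks within each row directly via DataFrame.rank(axis=1) instead of stack/groupby. This is a swapped-in shortcut rather than the code from the answer above, and the variable name result is only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'row1': [1, 2, np.nan, 4, 5],
     'row2': [11, 12, 13, 14, np.nan],
     'row3': [22, 22, 23, 24, 25]},
    index=list('abcde')).T

# Row-specific N, labelled with df's index so the comparison aligns row by row
n_max = pd.Series([2, 3, 4], index=df.index)

# Rank within each row, largest first; method='first' gives tied values distinct ranks
df_rank = df.rank(axis=1, ascending=False, method='first')

# Keep a cell only if its rank is within that row's N; NaN ranks compare as False
selected = df_rank.le(n_max, axis=0)
result = df[selected]
print(result)
```

Because the ranks are computed per row, no reshaping is needed, and cells that were NaN to begin with stay NaN in the result.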

For performance, I would suggest NumPy -
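def mask_variable_largest_per_row(df, n_max):
    # a is the underlying array; the writes below are meant to modify df in place
    a = df.values
    m,n = a.shape
    # Number of non-NaN entries to reset per row
    nan_row_count = np.isnan(a).sum(1)
    n_reset = n-n_max.values-nan_row_count
    n_reset.clip(min=0, max=n-1, out = n_reset)
    # Column indices in ascending order of value per row (NaNs sort last)
    sidx = a.argsort(1)
    # Pick the first n_reset[i] sorted positions of row i
    mask = n_reset[:,None] > np.arange(n)
    c = sidx[mask]
    r = np.repeat(np.arange(m), n_reset)
    a[r,c] = np.nan
    return df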
Sample run -
In [182]: df
Out[182]:
a b c d e
row1 1.0 2.0 NaN 4.0 5.0
row2 11.0 12.0 13.0 14.0 NaN
row3 22.0 22.0 5.0 24.0 25.0
In [183]: n_max = pd.Series([2,3,2])
In [184]: mask_variable_largest_per_row(df, n_max)
Out[184]:
a b c d e
row1 NaN NaN NaN 4.0 5.0
row2 NaN 12.0 13.0 14.0 NaN
row3 NaN NaN NaN 24.0 25.0
Further boost: replacing numpy.argsort with numpy.argpartition should help, as we don't care about the order of the indices that get reset to NaN. A numpy.argpartition based version would be -
def mask_variable_largest_per_row_v2(df, n_max):
    a = df.values
    m,n = a.shape
    nan_row_count = np.isnan(a).sum(1)
    n_reset = n-n_max.values-nan_row_count
    n_reset.clip(min=0, max=n-1, out = n_reset)
    N = (n-n_max.values).max()
    N = np.clip(N, a_min=0, a_max=n-1)
    sidx = a.argpartition(N, axis=1) #sidx = a.argsort(1)
    mask = n_reset[:,None] > np.arange(n)
    c = sidx[mask]
    r = np.repeat(np.arange(m), n_reset)
    a[r,c] = np.nan
    return df
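To make the argpartition guarantee concrete, here is a minimal sketch on a made-up 1-D array (the values and names are illustrative only). It shows that the first k positions hold the k smallest values, but in no promised order.

```python
import numpy as np

a = np.array([5.0, 1.0, 4.0, 2.0, 3.0])
k = 3

# Indices arranged so that position k holds the element that would be there
# after a full sort; everything before it is <= that element
idx = np.argpartition(a, k)

smallest_k = a[idx[:k]]   # the k smallest values, in no guaranteed order
kth_value = a[idx[k]]     # the value at sorted position k
rest = a[idx[k + 1:]]     # everything else is >= kth_value

print(sorted(smallest_k), kth_value, rest)
```

One thing worth checking before relying on the v2 function: it partitions every row at a single N but then takes only the first n_reset[i] indices per row. For rows where n_reset[i] is smaller than N, argpartition does not promise that those are the smallest entries, so comparing the result against the argsort version on your own data seems prudent.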
Runtime test
Other approaches -
def pandas_rank_based(df, n_max):
    n_max.index = df.index
    df_rank = df.stack(dropna=False).groupby(level=0).rank\
        (ascending=False, method='first').unstack()
    selected = df_rank.le(n_max, axis=0)
    return df[selected]
Verification and timings -
In [387]: arr = np.random.rand(1000,1000)
...: arr.ravel()[np.random.choice(arr.size, 10000, replace=0)] = np.nan
...: df1 = pd.DataFrame(arr)
...: df2 = df1.copy()
...: df3 = df1.copy()
...: n_max = pd.Series(np.random.randint(0,1000,(1000)))
...:
...: out1 = pandas_rank_based(df1, n_max)
...: out2 = mask_variable_largest_per_row(df2, n_max)
...: out3 = mask_variable_largest_per_row_v2(df3, n_max)
...: print(np.nansum(out1-out2)==0) # Verify
...: print(np.nansum(out1-out3)==0) # Verify
...:
True
True
In [388]: arr = np.random.rand(1000,1000)
...: arr.ravel()[np.random.choice(arr.size, 10000, replace=0)] = np.nan
...: df1 = pd.DataFrame(arr)
...: df2 = df1.copy()
...: df3 = df1.copy()
...: n_max = pd.Series(np.random.randint(0,1000,(1000)))
...:
In [389]: %timeit pandas_rank_based(df1, n_max)
1 loops, best of 3: 559 ms per loop
In [390]: %timeit mask_variable_largest_per_row(df2, n_max)
10 loops, best of 3: 34.1 ms per loop
In [391]: %timeit mask_variable_largest_per_row_v2(df3, n_max)
100 loops, best of 3: 5.92 ms per loop
Pretty good speedups there over the pandas rank-based approach: roughly 16x for the argsort version and 90x+ for the argpartition version!

Related

Fill only the last of consecutive NaNs in Pandas with the mean of the previous and next valid values

If there is a single NaN, fill it with the mean of the next and previous values. If there are two consecutive NaNs, impute the second one with the mean of the next and previous valid values.
Series:
expected output:
The idea is to drop the consecutive missing values except the last one in each run, then use interpolate and assign the result back to those last missing positions by condition:
m = df['header'].isna()
mask = m & ~m.shift(-1, fill_value=False)
df.loc[mask, 'header'] = df.loc[mask | ~m, 'header'].interpolate()
print (df)
header
0 10.0
1 20.0
2 20.0
3 20.0
4 30.0
5 NaN
6 35.0
7 40.0
8 10.0
9 NaN
10 NaN
11 30.0
12 50.0
Details:
print (df.assign(m=m, mask=mask))
header m mask
0 10.0 False False
1 20.0 False False
2 20.0 True True
3 20.0 False False
4 30.0 False False
5 NaN True False
6 35.0 True True
7 40.0 False False
8 10.0 False False
9 NaN True False
10 NaN True False
11 30.0 True True
12 50.0 False False
print (df.loc[mask | ~m, 'header'])
0 10.0
1 20.0
2 NaN
3 20.0
4 30.0
6 NaN
7 40.0
8 10.0
11 NaN
12 50.0
Name: header, dtype: float64
The solution for interpolating per group is:
df.loc[mask, 'header'] = (df.loc[mask | ~m, 'header']
                            .groupby(df['groups'])
                            .transform(lambda x: x.interpolate()))
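For anyone who wants to run the non-grouped version end to end, here is a minimal sketch. The input values are an assumption, reconstructed from the printed results in this thread, since the original Series was not preserved; the name filled is only illustrative.

```python
import numpy as np
import pandas as pd

# Assumed input, reconstructed from the outputs printed in this thread
df = pd.DataFrame({'header': [10, 20, np.nan, 20, 30, np.nan, np.nan,
                              40, 10, np.nan, np.nan, np.nan, 50]})

m = df['header'].isna()
# True only for the last NaN of each run of consecutive NaNs
mask = m & ~m.shift(-1, fill_value=False)

# Drop the earlier NaNs of each run, then interpolate linearly over what is left
filled = df.loc[mask | ~m, 'header'].interpolate()

# Assignment aligns on the index, so only the masked rows are written
df.loc[mask, 'header'] = filled
print(df)
```

Dropping the earlier NaNs first is what makes the interpolated value the plain mean of the two neighbouring valid values, because default linear interpolation treats the remaining rows as equally spaced.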
You can try:
s = df['header']
m = s.isna()
df['header'] = s.ffill().add(s.bfill()).div(2).mask(m&m.shift(-1, fill_value=False))
output and intermediates:
header output ffill bfill m m&m.shift(-1)
0 10.0 10.0 10.0 10.0 False False
1 20.0 20.0 20.0 20.0 False False
2 NaN 20.0 20.0 20.0 True False
3 20.0 20.0 20.0 20.0 False False
4 30.0 30.0 30.0 30.0 False False
5 NaN NaN 30.0 40.0 True True
6 NaN 35.0 30.0 40.0 True False
7 40.0 40.0 40.0 40.0 False False
8 10.0 10.0 10.0 10.0 False False
9 NaN NaN 10.0 50.0 True True
10 NaN NaN 10.0 50.0 True True
11 NaN 30.0 10.0 50.0 True False
12 50.0 50.0 50.0 50.0 False False
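The ffill/bfill variant can be checked the same way. This is a minimal sketch on an assumed input (taken from the header column of the table above); the name out is illustrative.

```python
import numpy as np
import pandas as pd

# Assumed input, copied from the 'header' column of the table above
s = pd.Series([10, 20, np.nan, 20, 30, np.nan, np.nan,
               40, 10, np.nan, np.nan, np.nan, 50], name='header')

m = s.isna()
# Mean of the previous and next valid values at every position
mean_of_neighbours = s.ffill().add(s.bfill()).div(2)

# Blank out every NaN that is followed by another NaN, keeping only the last of each run
out = mean_of_neighbours.mask(m & m.shift(-1, fill_value=False))
print(out)
```

For rows that were not missing, ffill and bfill both return the value itself, so the mean leaves them unchanged.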

Performing calculation based off multiple rows in Pandas dataframe

Set up an example dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame([[10, False, np.nan, np.nan], [5, False, np.nan, np.nan], [np.nan, True, 'a', 'b'],[np.nan,True,'b','a']],
columns=['value', 'IsRatio', 'numerator','denominator'],index=['a','b','c','d'])
index value IsRatio numerator denominator
a 10 False nan nan
b 5 False nan nan
c nan True a b
d nan True b a
For rows where IsRatio is True, I would like to lookup the values for the numerator and denominator, and calculate a ratio.
For a single row I can use .loc
numerator_name = df.loc['c','numerator']
denominator_name = df.loc['c','denominator']
df.loc['c','value'] = int(df.loc[numerator_name]['value'])/int(df.loc[denominator_name]['value'])
This will calculate the ratio for a single row
index value IsRatio numerator denominator
a 10 False nan nan
b 5 False nan nan
c 2 True a b
d nan True b a
How can I generalise this to all rows? I think I might need an apply function but I can't figure it out.
You can use apply to apply your computation to each row (mind the axis=1 input argument):
df['value'] = df.apply(
lambda x: int(df.loc[x.numerator]['value']) / int(df.loc[x.denominator]['value'])
if x.IsRatio else x.value,
axis=1
)
The result is the following:
value IsRatio numerator denominator
a 10 False nan nan
b 5 False nan nan
c 2 True a b
d 0.5 True b a
Note: do not wrap the data in np.array when creating the example DataFrame, otherwise the IsRatio column has type str. So df should be defined as follows:
import pandas as pd
import numpy as np
df = pd.DataFrame([[10, False, np.nan, np.nan], [5, False, np.nan, np.nan], [np.nan, True, 'a', 'b'],[np.nan,True,'b','a']],
columns=['value', 'IsRatio', 'numerator','denominator'],index=['a','b','c','d'])
Otherwise, if the IsRatio column is actually of type str, you should modify the previous code as follows:
df['value'] = df.apply(
lambda x: int(df.loc[x.numerator]['value']) / int(df.loc[x.denominator]['value'])
if x.IsRatio == 'True' else x.value,
axis=1
)
value IsRatio numerator denominator
a 10 False nan nan
b 5 False nan nan
c 2 True a b
d 0.5 True b a
For a vectorised solution, numpy.where() is a good option (note that in this example the numerator and denominator columns hold the values directly rather than row labels).
df = pd.DataFrame(np.array([[10, False, np.nan, np.nan], [5, False, np.nan, np.nan], [np.nan, True, 10, 5],[np.nan,True,30,10]]),
columns=['value', 'IsRatio', 'numerator','denominator'],index=['a','b','c','d'])
df.IsRatio = df.IsRatio.astype(bool)
df.assign(v=np.where(df.IsRatio, df.numerator/df.denominator, df.value))
value IsRatio numerator denominator v
a 10 False nan nan 10
b 5 False nan nan 5
c nan True 10 5 2
d nan True 30 10 3
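If the numerator and denominator columns hold row labels, as in the question, the lookup itself can also be vectorised. The following is a minimal sketch using Series.map as the lookup step; it is not taken from either answer, and the names is_ratio, num and den are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[10, False, np.nan, np.nan], [5, False, np.nan, np.nan],
     [np.nan, True, 'a', 'b'], [np.nan, True, 'b', 'a']],
    columns=['value', 'IsRatio', 'numerator', 'denominator'],
    index=['a', 'b', 'c', 'd'])

is_ratio = df['IsRatio'].astype(bool)

# Look up each label in the 'value' column; NaN labels simply map to NaN
num = df['numerator'].map(df['value'])
den = df['denominator'].map(df['value'])

# Keep the existing value for plain rows, use the ratio where IsRatio is True
df['value'] = df['value'].where(~is_ratio, num / den)
print(df)
```

Both lookups read the value column before it is overwritten, so a ratio row that refers to another ratio row would still see NaN; that case would need a second pass.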

Compute rowmeans ignoring na in pandas, like na.rm in R

I have the following data:
a = pd.Series([1, 2, "NA"])
b = pd.Series(["NA", 2, 3])
df = pd.concat([a, b], axis=1)
# 0 1
# 0 1 NA
# 1 2 2
# 2 NA 3
Now I'd like to compute the rowmeans like in R with na.rm=T.
c.mean(skipna=True, axis=0)
# Series([], dtype: float64)
I was expecting:
# 0 1 # 1/1
# 1 2 # (2+2)/2
# 2 3 # 3/1
How do I achieve this?
You have mixed dtypes due to the presence of the str 'NA'; you need to convert to numeric types first:
In [118]:
df.apply(lambda x: pd.to_numeric(x, errors='coerce')).mean(axis=1)
Out[118]:
0 1
1 2
2 3
dtype: float64
If your original data was true NaN then it works as expected:
In [119]:
a = pd.Series([1, 2, np.nan])
b = pd.Series([np.nan, 2, 3])
df = pd.concat([a, b], axis=1)
df.mean(skipna=True, axis=1)
Out[119]:
0 1
1 2
2 3
dtype: float64
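Putting the conversion and the row mean together, a minimal self-contained sketch on the data from the question (row_means is an illustrative name):

```python
import pandas as pd

a = pd.Series([1, 2, "NA"])
b = pd.Series(["NA", 2, 3])
df = pd.concat([a, b], axis=1)

# errors='coerce' turns anything that is not a number (here the string "NA") into NaN
numeric = df.apply(lambda col: pd.to_numeric(col, errors='coerce'))

# axis=1 averages across columns for each row; NaN is skipped by default
row_means = numeric.mean(axis=1)
print(row_means)
```

The key difference from the attempt in the question is axis=1, which gives one mean per row rather than one per column.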

Assign a list with a missing value to a Pandas Series in Python

Something weird happened when I tried to assign a list with a missing value np.nan to a Pandas Series.
Below is the code to reproduce it.
import numpy as np
import pandas as pd
S = pd.Series(0, index = list('ABCDE'))
>>> S
A 0
B 0
C 0
D 0
E 0
dtype: int64
ind = [True, False, True, False, True]
x = [1, np.nan, 2]
>>> S[ind]
A 0
C 0
E 0
dtype: int64
Assign x to S[ind]
S[ind] = x
Something weird in S
>>> S
A 1
B 0
C 2
D 0
E NaN
dtype: float64
I am expecting S to be
>>> S
A 1
B 0
C NaN
D 0
E 2
dtype: float64
Can anyone give an explanation for this?
You can try this:
S[S[ind].index] = x
or
S[S.index[ind]] = x
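A minimal sketch of the label-based workaround, run end to end. As an assumption made for this example, S starts out as float rather than int, so that storing NaN needs no dtype change; the name labels is illustrative.

```python
import numpy as np
import pandas as pd

# Float Series (an assumption for this sketch) so NaN can be stored without upcasting
S = pd.Series(0.0, index=list('ABCDE'))
ind = [True, False, True, False, True]
x = [1, np.nan, 2]

# Turn the boolean list into explicit labels, then assign by label
labels = S.index[ind]
S.loc[labels] = x
print(S)
```

Assigning by explicit labels writes the list values to A, C and E in order, which is the result expected in the question.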

How do you filter out rows with NaN in a pandas DataFrame

I have a few entries in a pandas DataFrame that are NaN. How would I remove any row with a NaN?
Just use x.dropna():
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]:
In [3]: df = pd.DataFrame(np.random.randn(5, 2))
In [4]: df.iloc[0, 1] = np.nan
In [5]: df.iloc[4, 0] = np.nan
In [6]: print(df)
0 1
0 2.264727 NaN
1 0.229321 1.615272
2 -0.901608 -1.407787
3 -0.198323 0.521726
4 NaN 0.692340
In [7]: df2 = df.dropna()
In [8]: print(df2)
0 1
1 0.229321 1.615272
2 -0.901608 -1.407787
3 -0.198323 0.521726
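For a reproducible comparison, here is a small sketch with fixed values (the data and variable names are made up for illustration). It shows the default dropna() next to how='all' and an equivalent boolean mask.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1.0, np.nan, 3.0],
                   'y': [np.nan, np.nan, 6.0]})

# Default: drop a row if it contains at least one NaN
any_nan_dropped = df.dropna()

# how='all': drop a row only if every value in it is NaN
all_nan_dropped = df.dropna(how='all')

# The same filter as the default, written as an explicit boolean mask
by_mask = df[df.notna().all(axis=1)]

print(any_nan_dropped)
print(all_nan_dropped)
```

dropna() returns a new DataFrame and leaves df untouched, so the result has to be assigned to a name as in the answer above.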