Pandas dataframe finding largest N elements of each row with row-specific N - pandas
I have a DataFrame:
>>> df = pd.DataFrame({'row1' : [1,2,np.nan,4,5], 'row2' : [11,12,13,14,np.nan], 'row3':[22,22,23,24,25]}, index = 'a b c d e'.split()).T
>>> df
a b c d e
row1 1.0 2.0 NaN 4.0 5.0
row2 11.0 12.0 13.0 14.0 NaN
row3 22.0 22.0 23.0 24.0 25.0
and a Series that specifies the number of top N values I want from each row
>>> n_max = pd.Series([2,3,4])
What is Panda's way of using df and n_max to find the largest N elements of each (breaking ties with a random pick, just as .nlargest() would do)?
The desired output is
a b c d e
row1 NaN NaN NaN 4.0 5.0
row2 NaN 12.0 13.0 14.0 NaN
row3 22.0 NaN 23.0 24.0 25.0
I know how to do this with a uniform/fixed N across all rows (say, N=4). Note the tie-breaking in row3:
>>> df.stack().groupby(level=0).nlargest(4).unstack().reset_index(level=1, drop=True).reindex(columns=df.columns)
a b c d e
row1 1.0 2.0 NaN 4.0 5.0
row2 11.0 12.0 13.0 14.0 NaN
row3 22.0 NaN 23.0 24.0 25.0
But the goal, again, is to have row-specific N. Looping through each row obviously doesn't count (for performance reasons). And I've tried using .rank() with a mask but tie breaking doesn't work there...
Based on #ScottBoston's comment on the OP, it is possible to use the following mask based on rank to solve this problem:
>>> n_max.index = df.index
>>> df_rank = df.stack(dropna=False).groupby(level=0).rank(ascending=False, method='first').unstack()
>>> selected = df_rank.le(n_max, axis=0)
>>> df[selected]
a b c d e
row1 NaN NaN NaN 4.0 5.0
row2 NaN 12.0 13.0 14.0 NaN
row3 22.0 NaN 23.0 24.0 25.0
For performance, I would suggest NumPy -
def mask_variable_largest_per_row(df, n_max):
a = df.values
m,n = a.shape
nan_row_count = np.isnan(a).sum(1)
n_reset = n-n_max.values-nan_row_count
n_reset.clip(min=0, max=n-1, out = n_reset)
sidx = a.argsort(1)
mask = n_reset[:,None] > np.arange(n)
c = sidx[mask]
r = np.repeat(np.arange(m), n_reset)
a[r,c] = np.nan
return df
Sample run -
In [182]: df
Out[182]:
a b c d e
row1 1.0 2.0 NaN 4.0 5.0
row2 11.0 12.0 13.0 14.0 NaN
row3 22.0 22.0 5.0 24.0 25.0
In [183]: n_max = pd.Series([2,3,2])
In [184]: mask_variable_largest_per_row(df, n_max)
Out[184]:
a b c d e
row1 NaN NaN NaN 4.0 5.0
row2 NaN 12.0 13.0 14.0 NaN
row3 NaN NaN NaN 24.0 25.0
Further boost : Bringing in numpy.argpartition to replace the numpy.argsort should help, as we don't care about the order of indices to be reset as NaNs. Thus, a numpy.argpartition based one would be -
def mask_variable_largest_per_row_v2(df, n_max):
a = df.values
m,n = a.shape
nan_row_count = np.isnan(a).sum(1)
n_reset = n-n_max.values-nan_row_count
n_reset.clip(min=0, max=n-1, out = n_reset)
N = (n-n_max.values).max()
N = np.clip(N, a_min=0, a_max=n-1)
sidx = a.argpartition(N, axis=1) #sidx = a.argsort(1)
mask = n_reset[:,None] > np.arange(n)
c = sidx[mask]
r = np.repeat(np.arange(m), n_reset)
a[r,c] = np.nan
return df
Runtime test
Other approaches -
def pandas_rank_based(df, n_max):
n_max.index = df.index
df_rank = df.stack(dropna=False).groupby(level=0).rank\
(ascending=False, method='first').unstack()
selected = df_rank.le(n_max, axis=0)
return df[selected]
Verification and timings -
In [387]: arr = np.random.rand(1000,1000)
...: arr.ravel()[np.random.choice(arr.size, 10000, replace=0)] = np.nan
...: df1 = pd.DataFrame(arr)
...: df2 = df1.copy()
...: df3 = df1.copy()
...: n_max = pd.Series(np.random.randint(0,1000,(1000)))
...:
...: out1 = pandas_rank_based(df1, n_max)
...: out2 = mask_variable_largest_per_row(df2, n_max)
...: out3 = mask_variable_largest_per_row_v2(df3, n_max)
...: print np.nansum(out1-out2)==0 # Verify
...: print np.nansum(out1-out3)==0 # Verify
...:
True
True
In [388]: arr = np.random.rand(1000,1000)
...: arr.ravel()[np.random.choice(arr.size, 10000, replace=0)] = np.nan
...: df1 = pd.DataFrame(arr)
...: df2 = df1.copy()
...: df3 = df1.copy()
...: n_max = pd.Series(np.random.randint(0,1000,(1000)))
...:
In [389]: %timeit pandas_rank_based(df1, n_max)
1 loops, best of 3: 559 ms per loop
In [390]: %timeit mask_variable_largest_per_row(df2, n_max)
10 loops, best of 3: 34.1 ms per loop
In [391]: %timeit mask_variable_largest_per_row_v2(df3, n_max)
100 loops, best of 3: 5.92 ms per loop
Pretty good speedups there of 50x+ over the pandas built-in!
Related
Fill only last among of consecutive NaN in Pandas by mean of previous and next valid values
Fill only last among of consecutive NaN in Pandas by mean of previous and next valid values. If one NaN, then fill with mean of next and previous. If two consecutive NaN, impute second one with mean of next and previous valid values. Series: expected output:
Idea is remove consecutive missing values without last, then use interpolate and assign back last missing value by condition: m = df['header'].isna() mask = m & ~m.shift(-1, fill_value=False) df.loc[mask, 'header'] = df.loc[mask | ~m, 'header'].interpolate() print (df) header 0 10.0 1 20.0 2 20.0 3 20.0 4 30.0 5 NaN 6 35.0 7 40.0 8 10.0 9 NaN 10 NaN 11 30.0 12 50.0 Details: print (df.assign(m=m, mask=mask)) header m mask 0 10.0 False False 1 20.0 False False 2 20.0 True True 3 20.0 False False 4 30.0 False False 5 NaN True False 6 35.0 True True 7 40.0 False False 8 10.0 False False 9 NaN True False 10 NaN True False 11 30.0 True True 12 50.0 False False print (df.loc[mask | ~m, 'header']) 0 10.0 1 20.0 2 NaN 3 20.0 4 30.0 6 NaN 7 40.0 8 10.0 11 NaN 12 50.0 Name: header, dtype: float64 Solution for interpolate per groups is: df.loc[mask, 'header'] = df.loc[mask | ~m, 'header'].groupby(df['groups']) .transform(lambda x: x.interpolate())
You can try: s = df['header'] m = s.isna() df['header'] = s.ffill().add(s.bfill()).div(2).mask(m&m.shift(-1, fill_value=False)) output and intermediates: header output ffill bfill m m&m.shift(-1) 0 10.0 10.0 10.0 10.0 False False 1 20.0 20.0 20.0 20.0 False False 2 NaN 20.0 20.0 20.0 True False 3 20.0 20.0 20.0 20.0 False False 4 30.0 30.0 30.0 30.0 False False 5 NaN NaN 30.0 40.0 True True 6 NaN 35.0 30.0 40.0 True False 7 40.0 40.0 40.0 40.0 False False 8 10.0 10.0 10.0 10.0 False False 9 NaN NaN 10.0 50.0 True True 10 NaN NaN 10.0 50.0 True True 11 NaN 30.0 10.0 50.0 True False 12 50.0 50.0 50.0 50.0 False False
Performing calculation based off multiple rows in Pandas dataframe
Set up an example dataframe: import pandas as pd import numpy as np df = pd.DataFrame([[10, False, np.nan, np.nan], [5, False, np.nan, np.nan], [np.nan, True, 'a', 'b'],[np.nan,True,'b','a']], columns=['value', 'IsRatio', 'numerator','denominator'],index=['a','b','c','d']) index value IsRatio numerator denominator a 10 False nan nan b 5 False nan nan c nan True a b d nan True b a For rows where IsRatio is True, I would like to lookup the values for the numerator and denominator, and calculate a ratio. For a single row I can use .loc numerator_name = df.loc['c','numerator'] denominator_name = df.loc['c','denominator'] df.loc['c','value'] = int(df.loc[numerator_name]['value'])/int(df.loc[denominator_name]['value']) This will calculate the ratio for a single row index value IsRatio numerator denominator a 10 False nan nan b 5 False nan nan c 2 True a b d nan True b a How can I generalise this to all rows? I think I might need an apply function but I can't figure it out.
You can use apply to apply your computation to each row (mind the axis=1 input argument): df['value'] = df.apply( lambda x: int(df.loc[x.numerator]['value']) / int(df.loc[x.denominator]['value']) if x.IsRatio else x.value, axis=1 ) The result is the following: value IsRatio numerator denominator a 10 False nan nan b 5 False nan nan c 2 True a b d 0.5 True b a Note: you should remove np.array from the creation of the example DataFrame, otherwise the IsRatio column has type str. So df should be defined as follow: import pandas as pd import numpy as np df = pd.DataFrame([[10, False, np.nan, np.nan], [5, False, np.nan, np.nan], [np.nan, True, 'a', 'b'],[np.nan,True,'b','a']], columns=['value', 'IsRatio', 'numerator','denominator'],index=['a','b','c','d']) Otherwise, if IsRatio column is actually of type str, you should modify the previous code as following: df['value'] = df.apply( lambda x: int(df.loc[x.numerator]['value']) / int(df.loc[x.denominator]['value']) if x.IsRatio == 'True' else x.value, axis=1 ) value IsRatio numerator denominator a 10 False nan nan b 5 False nan nan c 2 True a b d 0.5 True b a
To do as a vectorised solution numpy where() is a good solution. df = pd.DataFrame(np.array([[10, False, np.nan, np.nan], [5, False, np.nan, np.nan], [np.nan, True, 10, 5],[np.nan,True,30,10]]), columns=['value', 'IsRatio', 'numerator','denominator'],index=['a','b','c','d']) # df.assign(v=np.where) df.IsRatio = df.IsRatio.astype(bool) df.assign(v=np.where(df.IsRatio, df.numerator/df.denominator, df.value)) value IsRatio numerator denominator v a 10 False nan nan 10 b 5 False nan nan 5 c nan True 10 5 2 d nan True 30 10 3
Compute rowmeans ignoring na in pandas, like na.rm in R
I have the following data: a = pd.Series([1, 2, "NA"]) b = pd.Series(["NA", 2, 3]) df = pd.concat([a, b], axis=1) # 0 1 # 0 1 NA # 1 2 2 # 2 NA 3 Now I'd like to compute the rowmeans like in R with na.rm=T. c.mean(skipna=True, axis=0) # Series([], dtype: float64) I was expecting: # 0 1 # 1/1 # 1 2 # (2+2)/2 # 2 3 # 3/1 How do I achieve this?
You have mixed dtypes due to presence of str 'NA', you need to convert to numeric types first: In [118]: df.apply(lambda x: pd.to_numeric(x, errors='force')).mean(axis=1) Out[118]: 0 1 1 2 2 3 dtype: float64 If your original data was true NaN then it works as expected: In [119]: a = pd.Series([1, 2, np.NaN]) b = pd.Series([np.NaN, 2, 3]) df = pd.concat([a, b], axis=1) df.mean(skipna=True, axis=1) Out[119]: 0 1 1 2 2 3 dtype: float64
Assign a list with a missing value to a Pandas Series in Python
Something wired when I tried to assign a list with missing value np.nan to a Pandas Series Below are the codes to reproduce the fact. import numpy as np import pandas as pd S = pd.Series(0, index = list('ABCDE')) >>> S A 0 B 0 C 0 D 0 E 0 dtype: int64 ind = [True, False, True, False, True] x = [1, np.nan, 2] >>> S[ind] A 0 C 0 E 0 dtype: int64 Assign x to S[ind] S[ind] = x Something wired in S >>> S A 1 B 0 C 2 D 0 E NaN dtype: float64 I am expecting S to be >>> S A 1 B 0 C NaN D 0 E 2 dtype: float64 Anyone can give an explanation for this?
You can try this: S[S[ind].index] = x or S[S.index[ind]] = x
How do you filter out rows with NaN in a panda's dataframe
I have a few entries in a panda dataframe that are NaN. How would I remove any row with a NaN?
Just use x.dropna(): In [1]: import pandas as pd In [2]: import numpy as np In [3]: In [3]: df = pd.DataFrame(np.random.randn(5, 2)) In [4]: df.iloc[0, 1] = np.nan In [5]: df.iloc[4, 0] = np.nan In [6]: print(df) 0 1 0 2.264727 NaN 1 0.229321 1.615272 2 -0.901608 -1.407787 3 -0.198323 0.521726 4 NaN 0.692340 In [7]: df2 = df.dropna() In [8]: print(df2) 0 1 1 0.229321 1.615272 2 -0.901608 -1.407787 3 -0.198323 0.521726