Masked array assignment - indexing

I have an N×N array A, an N×N array B, and an N×N mask (BitMatrix) M. I now want to copy the values of B into A, but only at the indices where M is true. What is the best way to do that?

You can use logical indexing:
julia> A = zeros(5,5); B = ones(5,5); M = rand(Bool, 5, 5)
5×5 Matrix{Bool}:
1 0 1 1 0
1 0 1 1 0
1 0 1 1 1
0 0 0 1 0
0 0 0 0 1
julia> A[M] = B[M]; A
5×5 Matrix{Float64}:
1.0 0.0 1.0 1.0 0.0
1.0 0.0 1.0 1.0 0.0
1.0 0.0 1.0 1.0 1.0
0.0 0.0 0.0 1.0 0.0
0.0 0.0 0.0 0.0 1.0
or simply write a loop:
julia> for i in eachindex(A, B, M)
           if M[i]
               A[i] = B[i]
           end
       end

Related

Fill only the last of consecutive NaNs in Pandas by the mean of the previous and next valid values

Fill only the last of a run of consecutive NaNs in Pandas with the mean of the previous and next valid values. If there is a single NaN, fill it with the mean of the next and previous values. If there are two consecutive NaNs, impute only the second one with the mean of the next and previous valid values.
Series:
header
0 10.0
1 20.0
2 NaN
3 20.0
4 30.0
5 NaN
6 NaN
7 40.0
8 10.0
9 NaN
10 NaN
11 NaN
12 50.0
expected output:
header
0 10.0
1 20.0
2 20.0
3 20.0
4 30.0
5 NaN
6 35.0
7 40.0
8 10.0
9 NaN
10 NaN
11 30.0
12 50.0
The idea is to drop all but the last missing value of each consecutive run, interpolate over what remains, and assign the interpolated values back to those last missing positions:
m = df['header'].isna()
# last NaN of each consecutive run: NaN here, but not followed by another NaN
mask = m & ~m.shift(-1, fill_value=False)
# interpolate over the valid values plus those last-NaN positions, then assign back
df.loc[mask, 'header'] = df.loc[mask | ~m, 'header'].interpolate()
print (df)
header
0 10.0
1 20.0
2 20.0
3 20.0
4 30.0
5 NaN
6 35.0
7 40.0
8 10.0
9 NaN
10 NaN
11 30.0
12 50.0
Details:
print (df.assign(m=m, mask=mask))
header m mask
0 10.0 False False
1 20.0 False False
2 20.0 True True
3 20.0 False False
4 30.0 False False
5 NaN True False
6 35.0 True True
7 40.0 False False
8 10.0 False False
9 NaN True False
10 NaN True False
11 30.0 True True
12 50.0 False False
print (df.loc[mask | ~m, 'header'])
0 10.0
1 20.0
2 NaN
3 20.0
4 30.0
6 NaN
7 40.0
8 10.0
11 NaN
12 50.0
Name: header, dtype: float64
A solution for interpolating per group is:
df.loc[mask, 'header'] = (df.loc[mask | ~m, 'header']
                          .groupby(df['groups'])
                          .transform(lambda x: x.interpolate()))
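As a usage sketch of that per-group variant (self-contained; the 'groups' column and the values here are hypothetical, not from the question's data):
import numpy as np
import pandas as pd

df = pd.DataFrame({'header': [10, np.nan, np.nan, 30, 5, np.nan, 15],
                   'groups': ['a', 'a', 'a', 'a', 'b', 'b', 'b']})
m = df['header'].isna()
mask = m & ~m.shift(-1, fill_value=False)
# grouping by df['groups'] aligns on the index, so it also works on the filtered selection
df.loc[mask, 'header'] = (df.loc[mask | ~m, 'header']
                            .groupby(df['groups'])
                            .transform(lambda x: x.interpolate()))
# header becomes [10, NaN, 20, 30, 5, 10, 15]: only the last NaN of each run is filled,
# separately within each group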
You can try:
s = df['header']
m = s.isna()
# mean of the nearest valid values before and after each position, then re-mask
# every NaN that is followed by another NaN (i.e. not the last of its run)
df['header'] = s.ffill().add(s.bfill()).div(2).mask(m & m.shift(-1, fill_value=False))
output and intermediates:
header output ffill bfill m m&m.shift(-1)
0 10.0 10.0 10.0 10.0 False False
1 20.0 20.0 20.0 20.0 False False
2 NaN 20.0 20.0 20.0 True False
3 20.0 20.0 20.0 20.0 False False
4 30.0 30.0 30.0 30.0 False False
5 NaN NaN 30.0 40.0 True True
6 NaN 35.0 30.0 40.0 True False
7 40.0 40.0 40.0 40.0 False False
8 10.0 10.0 10.0 10.0 False False
9 NaN NaN 10.0 50.0 True True
10 NaN NaN 10.0 50.0 True True
11 NaN 30.0 10.0 50.0 True False
12 50.0 50.0 50.0 50.0 False False

How to forward fill row values with function in pandas MultiIndex dataframe?

I have the following MultiIndex dataframe:
Close ATR condition
Date Symbol
1990-01-01 A 24 1 True
B 72 1 False
C 40 3 False
D 21 5 True
1990-01-02 A 65 4 True
B 19 2 True
C 43 3 True
D 72 1 False
1990-01-03 A 92 5 False
B 32 3 True
C 52 2 False
D 33 1 False
I perform the following calculation on this dataframe:
data.loc[data.index.levels[0][0], 'Shares'] = 0
data.loc[data.index.levels[0][0], 'Closed_P/L'] = 0
data = data.reset_index()
Equity = 10000

def calcs(x):
    global Equity
    # Skip first date
    if x.index[0] == 0:
        return x
    # calculate Shares where condition is True
    x.loc[x['condition'] == True, 'Shares'] = np.floor((Equity * 0.02 / x['ATR']).astype(float))
    # other calculations
    x['Closed_P/L'] = x['Shares'] * x['Close']
    Equity += x['Closed_P/L'].sum()
    return x

data = data.groupby('Date').apply(calcs)
data['Equity'] = data.groupby('Date')['Closed_P/L'].transform('sum')
data['Equity'] = data.groupby('Symbol')['Equity'].cumsum() + Equity
data = data.set_index(['Date','Symbol'])
The output is:
Close ATR condition Shares Closed_P/L Equity
Date Symbol
1990-01-01 A 24 1.2 True 0.0 0.0 10000.0
B 72 1.4 False 0.0 0.0 10000.0
C 40 3 False 0.0 0.0 10000.0
D 21 5 True 0.0 0.0 10000.0
1990-01-02 A 65 4 True 50.0 3250.0 17988.0
B 19 2 True 100.0 1900.0 17988.0
C 43 3 True 66.0 2838.0 17988.0
D 72 1 False NaN NaN 17988.0
1990-01-03 A 92 5 False NaN NaN 21796.0
B 32 3 True 119.0 3808.0 21796.0
C 52 2 False NaN NaN 21796.0
D 33 1 False NaN NaN 21796.0
I want to forward fill Shares values - grouped by Symbol - in case condition evaluates to False (except for first date). So the Shares value on 1990-01-02 for D should be 0 (because on 1990-01-01 the Shares value for D was 0 and the condition on 1990-01-02 is False). Also values for Shares on 1990-01-03 for A, C and D should be 50, 66 and 0 respectively based on the logic described above. How can I do that?
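One possible way to get that result (a sketch, not from the original post): after the groupby-apply above, 'Shares' is NaN exactly where condition evaluated to False, so a per-Symbol forward fill carries the last computed position forward.
# sketch (assumption): forward fill Shares within each Symbol group,
# relying on Shares being NaN wherever condition was False
data['Shares'] = data.groupby(level='Symbol')['Shares'].ffill()
Dependent columns such as Closed_P/L and Equity would then need to be recomputed from the filled Shares if they should reflect the carried-forward positions.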

replacing dictionary values into a dataframe

I have the following df on one side:
ACCOR SA ADMIRAL ADECCO BANKIA BANKINTER
ADMIRAL 0 0 0 0 0
ADECCO 0 0 0 0 0
BANKIA 0 0 0 0 0
and the following dict on the other:
{'ADMIRAL': 1, 'ADECCO': -1, 'BANKIA': -1}
where the df.index values correspond to the dict.keys
I would like to place the dict.values into the df, one value per row, to obtain this output:
ACCOR SA ADMIRAL ADECCO BANKIA BANKINTER
ADMIRAL 0 1 0 0 0
ADECCO 0 0 -1 0 0
BANKIA 0 0 0 -1 0
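For reference, a minimal setup reproducing the starting frame (an assumed reconstruction; the index and column names are taken from the tables above):
import pandas as pd

cols = ['ACCOR SA', 'ADMIRAL', 'ADECCO', 'BANKIA', 'BANKINTER']
df = pd.DataFrame(0, index=['ADMIRAL', 'ADECCO', 'BANKIA'], columns=cols)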
Loop over the dict items and set the values with at:
d = {'ADMIRAL': 1, 'ADECCO': -1, 'BANKIA': -1}
for k, v in d.items():
    df.at[k, k] = v
    # alternative
    # df.loc[k, k] = v
print (df)
ACCOR SA ADMIRAL ADECCO BANKIA BANKINTER
ADMIRAL 0 1 0 0 0
ADECCO 0 0 -1 0 0
BANKIA 0 0 0 -1 0
Another solution is to create a Series from the dict with an index built by MultiIndex.from_arrays, and then unstack it:
s = pd.Series(list(d.values()), index=pd.MultiIndex.from_arrays([list(d.keys()), list(d.keys())]))
df1 = s.unstack()
print (df1)
ADECCO ADMIRAL BANKIA
ADECCO -1.0 NaN NaN
ADMIRAL NaN 1.0 NaN
BANKIA NaN NaN -1.0
Then fill in the remaining cells from the original df with combine_first (df1's non-NaN values take precedence):
df = df1.combine_first(df)
print (df)
ACCOR SA ADECCO ADMIRAL BANKIA BANKINTER
ADECCO 0.0 -1.0 0.0 0.0 0.0
ADMIRAL 0.0 0.0 1.0 0.0 0.0
BANKIA 0.0 0.0 0.0 -1.0 0.0
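A possible alternative to the combine_first step (not from the original answers): DataFrame.update overwrites matching cells in place with the non-NaN values of another frame, aligning on index and columns, which also preserves the original column order:
# modifies df in place; cells not covered by df1 are left untouched
df.update(df1)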

Get frequency of items in a pandas column in given intervals of values stored in another pandas column

My dataframe:
class_lst = ["B","A","C","Z","H","K","O","W","L","R","M","Y","Q","X","X","G","G","G","G","G"]
value_lst = [1,0.999986,1,0.999358,0.999906,0.995292,0.998481,0.388307,0.99608,0.99829,1,0.087298,1,1,0.999993,1,1,1,1,1]
df = pd.DataFrame({'class': class_lst,
                   'val': value_lst})
For each interval of 'val' defined by ranges
ranges = np.arange(0.0, 1.1, 0.1)
I would like to get the frequency of 'val' items, as follows:
class range frequency
A (0, 0.10] 0
A (0.10, 0.20] 0
A (0.20, 0.30] 0
...
A (0.90, 1.00] 1
G (0, 0.10] 0
G (0.10, 0.20] 0
G (0.20, 0.30] 0
...
G (0.80, 0.90] 0
G (0.90, 1.00] 5
...
I tried
df.groupby(pd.cut(df.val, ranges)).count()
but the output looks like
class val
val
(0, 0.1] 1 1
(0.1, 0.2] 0 0
(0.2, 0.3] 0 0
(0.3, 0.4] 1 1
(0.4, 0.5] 0 0
(0.5, 0.6] 0 0
(0.6, 0.7] 0 0
(0.7, 0.8] 0 0
(0.8, 0.9] 0 0
(0.9, 1] 18 18
and does not match the expected one.
This might be a good start:
df["range"] = pd.cut(df['val'], ranges)
class val range
0 B 1.000000 (0.9, 1.0]
1 A 0.999986 (0.9, 1.0]
2 C 1.000000 (0.9, 1.0]
3 Z 0.999358 (0.9, 1.0]
4 H 0.999906 (0.9, 1.0]
5 K 0.995292 (0.9, 1.0]
6 O 0.998481 (0.9, 1.0]
7 W 0.388307 (0.3, 0.4]
8 L 0.996080 (0.9, 1.0]
9 R 0.998290 (0.9, 1.0]
10 M 1.000000 (0.9, 1.0]
11 Y 0.087298 (0.0, 0.1]
12 Q 1.000000 (0.9, 1.0]
13 X 1.000000 (0.9, 1.0]
14 X 0.999993 (0.9, 1.0]
15 G 1.000000 (0.9, 1.0]
16 G 1.000000 (0.9, 1.0]
17 G 1.000000 (0.9, 1.0]
18 G 1.000000 (0.9, 1.0]
19 G 1.000000 (0.9, 1.0]
and then
df.groupby(["class", "range"]).size()
class range
A (0.9, 1.0] 1
B (0.9, 1.0] 1
C (0.9, 1.0] 1
G (0.9, 1.0] 5
H (0.9, 1.0] 1
K (0.9, 1.0] 1
L (0.9, 1.0] 1
M (0.9, 1.0] 1
O (0.9, 1.0] 1
Q (0.9, 1.0] 1
R (0.9, 1.0] 1
W (0.3, 0.4] 1
X (0.9, 1.0] 2
Y (0.0, 0.1] 1
Z (0.9, 1.0] 1
This already gives the right bin for each class, along with its frequency.
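To also get the zero-count bins shown in the expected output, one possible follow-up (an assumption, not part of the original answer; it needs a pandas version whose groupby accepts observed=) is to tell the grouper to keep unobserved categories of the binned column:
# keep empty bins of the categorical 'range' column for every class
freq = (df.groupby(['class', 'range'], observed=False)
          .size()
          .rename('frequency')
          .reset_index())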

Pandas dataframe finding largest N elements of each row with row-specific N

I have a DataFrame:
>>> df = pd.DataFrame({'row1' : [1,2,np.nan,4,5], 'row2' : [11,12,13,14,np.nan], 'row3':[22,22,23,24,25]}, index = 'a b c d e'.split()).T
>>> df
a b c d e
row1 1.0 2.0 NaN 4.0 5.0
row2 11.0 12.0 13.0 14.0 NaN
row3 22.0 22.0 23.0 24.0 25.0
and a Series that specifies the number of top N values I want from each row
>>> n_max = pd.Series([2,3,4])
What is the pandas way of using df and n_max to find the largest N elements of each row (breaking ties with a random pick, just as .nlargest() would do)?
The desired output is
a b c d e
row1 NaN NaN NaN 4.0 5.0
row2 NaN 12.0 13.0 14.0 NaN
row3 22.0 NaN 23.0 24.0 25.0
I know how to do this with a uniform/fixed N across all rows (say, N=4). Note the tie-breaking in row3:
>>> df.stack().groupby(level=0).nlargest(4).unstack().reset_index(level=1, drop=True).reindex(columns=df.columns)
a b c d e
row1 1.0 2.0 NaN 4.0 5.0
row2 11.0 12.0 13.0 14.0 NaN
row3 22.0 NaN 23.0 24.0 25.0
But the goal, again, is to have a row-specific N. Looping through each row obviously doesn't count (for performance reasons). And I've tried using .rank() with a mask, but tie-breaking doesn't work there...
Based on #ScottBoston's comment on the OP, it is possible to use the following mask based on rank to solve this problem:
>>> n_max.index = df.index
>>> df_rank = df.stack(dropna=False).groupby(level=0).rank(ascending=False, method='first').unstack()
>>> selected = df_rank.le(n_max, axis=0)
>>> df[selected]
a b c d e
row1 NaN NaN NaN 4.0 5.0
row2 NaN 12.0 13.0 14.0 NaN
row3 22.0 NaN 23.0 24.0 25.0
For performance, I would suggest NumPy -
def mask_variable_largest_per_row(df, n_max):
    a = df.values
    m, n = a.shape
    # NaNs already present in each row
    nan_row_count = np.isnan(a).sum(1)
    # how many of the smallest valid values to blank out per row
    n_reset = n - n_max.values - nan_row_count
    n_reset.clip(min=0, max=n-1, out=n_reset)
    sidx = a.argsort(1)                # per-row column indices, ascending (NaNs sort last)
    mask = n_reset[:, None] > np.arange(n)
    c = sidx[mask]                     # column indices of the values to reset
    r = np.repeat(np.arange(m), n_reset)
    a[r, c] = np.nan                   # modify the underlying array in place
    return df
Sample run -
In [182]: df
Out[182]:
a b c d e
row1 1.0 2.0 NaN 4.0 5.0
row2 11.0 12.0 13.0 14.0 NaN
row3 22.0 22.0 5.0 24.0 25.0
In [183]: n_max = pd.Series([2,3,2])
In [184]: mask_variable_largest_per_row(df, n_max)
Out[184]:
a b c d e
row1 NaN NaN NaN 4.0 5.0
row2 NaN 12.0 13.0 14.0 NaN
row3 NaN NaN NaN 24.0 25.0
Further boost: bringing in numpy.argpartition to replace numpy.argsort should help, since we don't care about the order of the indices that are reset to NaN. A numpy.argpartition based version would be -
def mask_variable_largest_per_row_v2(df, n_max):
    a = df.values
    m, n = a.shape
    nan_row_count = np.isnan(a).sum(1)
    n_reset = n - n_max.values - nan_row_count
    n_reset.clip(min=0, max=n-1, out=n_reset)
    N = (n - n_max.values).max()
    N = np.clip(N, a_min=0, a_max=n-1)
    sidx = a.argpartition(N, axis=1)   # sidx = a.argsort(1)
    mask = n_reset[:, None] > np.arange(n)
    c = sidx[mask]
    r = np.repeat(np.arange(m), n_reset)
    a[r, c] = np.nan
    return df
Runtime test
Other approaches -
def pandas_rank_based(df, n_max):
    n_max.index = df.index
    df_rank = df.stack(dropna=False).groupby(level=0).rank(
        ascending=False, method='first').unstack()
    selected = df_rank.le(n_max, axis=0)
    return df[selected]
Verification and timings -
In [387]: arr = np.random.rand(1000,1000)
...: arr.ravel()[np.random.choice(arr.size, 10000, replace=0)] = np.nan
...: df1 = pd.DataFrame(arr)
...: df2 = df1.copy()
...: df3 = df1.copy()
...: n_max = pd.Series(np.random.randint(0,1000,(1000)))
...:
...: out1 = pandas_rank_based(df1, n_max)
...: out2 = mask_variable_largest_per_row(df2, n_max)
...: out3 = mask_variable_largest_per_row_v2(df3, n_max)
...: print np.nansum(out1-out2)==0 # Verify
...: print np.nansum(out1-out3)==0 # Verify
...:
True
True
In [388]: arr = np.random.rand(1000,1000)
...: arr.ravel()[np.random.choice(arr.size, 10000, replace=0)] = np.nan
...: df1 = pd.DataFrame(arr)
...: df2 = df1.copy()
...: df3 = df1.copy()
...: n_max = pd.Series(np.random.randint(0,1000,(1000)))
...:
In [389]: %timeit pandas_rank_based(df1, n_max)
1 loops, best of 3: 559 ms per loop
In [390]: %timeit mask_variable_largest_per_row(df2, n_max)
10 loops, best of 3: 34.1 ms per loop
In [391]: %timeit mask_variable_largest_per_row_v2(df3, n_max)
100 loops, best of 3: 5.92 ms per loop
Pretty good speedups there of 50x+ over the pandas built-in!