NumPy boolean bitwise comparison: unexpected behavior

I have code that takes a NumPy array of prices and finds discounts in a desired range.
For some reason, the output is not as expected.
My code is:
import numpy as np

# prices
prices = np.arange(5, (1-0.4-0.01)*5, -0.1)
# calculate discounts for each price
discounts_calculated = (max(prices)-prices)/max(prices)
# choose prices in range of (10,30) percent discount
chosen_inds = (0.1<=discounts_calculated) & (discounts_calculated<=0.3)
chosen_discounts = discounts_calculated[ chosen_inds ]
# show output
print(prices)
print(discounts_calculated)
print(chosen_discounts)
and the output is:
[5. 4.9 4.8 4.7 4.6 4.5 4.4 4.3 4.2 4.1 4. 3.9 3.8 3.7 3.6 3.5 3.4 3.3
3.2 3.1 3. ]
[0. 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.18 0.2 0.22 0.24 0.26
0.28 0.3 0.32 0.34 0.36 0.38 0.4 ]
[0.12 0.14 0.16 0.18 0.2 0.22 0.24 0.26 0.28 0.3 ]
Whereas the expected output should include the 0.1 element in the last array:
[5. 4.9 4.8 4.7 4.6 4.5 4.4 4.3 4.2 4.1 4. 3.9 3.8 3.7 3.6 3.5 3.4 3.3
3.2 3.1 3. ]
[0. 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.18 0.2 0.22 0.24 0.26
0.28 0.3 0.32 0.34 0.36 0.38 0.4 ]
[0.1 0.12 0.14 0.16 0.18 0.2 0.22 0.24 0.26 0.28 0.3 ]
How can I change the code so that the bitwise comparison works as expected?
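For what it's worth, the likely culprit is floating-point round-off rather than the & operator itself: np.arange builds each price by repeated addition, so the discount that should be exactly 0.1 can come out as something like 0.09999999999999998 and fail the >= 0.1 test. A minimal sketch of a tolerant comparison using np.isclose (the default tolerances are an assumption you may want to tune):
import numpy as np

prices = np.arange(5, (1 - 0.4 - 0.01) * 5, -0.1)
discounts_calculated = (max(prices) - prices) / max(prices)

# Accept values that are within floating-point tolerance of the bounds.
lower_ok = (discounts_calculated >= 0.1) | np.isclose(discounts_calculated, 0.1)
upper_ok = (discounts_calculated <= 0.3) | np.isclose(discounts_calculated, 0.3)
print(discounts_calculated[lower_ok & upper_ok])  # now includes the ~0.1 element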

Pandas show NaN value after fillna with mode [duplicate]

I read data from a CSV and fillna with the mode like this:
import pandas as pd

df = pd.read_csv(r'C:\Users\PC\Downloads\File.csv')
df.fillna(df.mode(), inplace=True)
It still shows NaN values like this:
1 2 3 4 5 6 7
0 0.0 0.0 4.7 0.0 138.0 0.15 0.15
1 0.0 1.0 3.5 0.0 132.0 0.38 0.18
2 0.0 0.0 4.0 0.0 132.0 0.30 0.11
3 0.0 1.0 3.9 0.0 146.0 0.75 0.37
4 0.0 1.0 3.5 0.0 132.0 0.45 0.22
5 0.0 NaN NaN NaN NaN 0.45 0.22
6 0.0 NaN NaN NaN NaN 0.30 0.11
7 0.0 0.0 4.5 0.0 136.0 NaN NaN
8 0.0 NaN NaN NaN NaN 0.30 0.37
9 0.0 NaN NaN NaN NaN 0.38 0.11
If I fillna with the mean there is no problem. How do I fillna with the mode?
Because DataFrame.mode can return multiple rows when several values tie for the maximum count, select the first row:
print (df.mode())
1 2 3 4 5 6 7
0 0.0 0.0 3.5 0.0 132.0 0.3 0.11
1 NaN 1.0 NaN NaN NaN NaN NaN
df.fillna(df.mode().iloc[0], inplace=True)
print (df)
1 2 3 4 5 6 7
0 0.0 0.0 4.7 0.0 138.0 0.15 0.15
1 0.0 1.0 3.5 0.0 132.0 0.38 0.18
2 0.0 0.0 4.0 0.0 132.0 0.30 0.11
3 0.0 1.0 3.9 0.0 146.0 0.75 0.37
4 0.0 1.0 3.5 0.0 132.0 0.45 0.22
5 0.0 0.0 3.5 0.0 132.0 0.45 0.22
6 0.0 0.0 3.5 0.0 132.0 0.30 0.11
7 0.0 0.0 4.5 0.0 136.0 0.30 0.11
8 0.0 0.0 3.5 0.0 132.0 0.30 0.37
9 0.0 0.0 3.5 0.0 132.0 0.38 0.11
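As a side note, here is a tiny sketch (with a made-up two-column frame) of why the original df.fillna(df.mode()) leaves NaNs: fillna with a DataFrame argument aligns on both index and columns, so only rows that happen to share the 0, 1, ... index of df.mode() get filled, while fillna with the Series df.mode().iloc[0] aligns on columns only and fills every row:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, np.nan], 'b': [np.nan, 2.0, 2.0]})

# DataFrame argument: aligns on index AND columns, so only row 0 is filled
print(df.fillna(df.mode()))

# Series argument: aligns on columns only, so every row is filled
print(df.fillna(df.mode().iloc[0]))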

Assign value to new df column with a condition, using the next bigger value if there's no exactly matching value

Assume we have two DataFrames:
DF1:
spec u_g target
G1 4.8 0.88
G2 2.1 0.76
WG2 1.4 0.71
WG2 1.2 0.68
WG2 1.0 0.52
WG3 0.8 0.65
WG3 0.7 0.53
SWG3 0.7 0.31
DF2:
id type u_g_1
1 WG2 1.4
2 WG2 1.4
3 WG2 1.0
4 G1 4.8
5 G1 4.9
6 G2 2.1
7 SWG3 0.7
8 WG3 0.8
9 WG3 0.7
10 WG2 1.1
11 nan 0
For every row in DF2 I'd like to look up whether the type matches an entry in DF1; if yes, check whether DF1 has an entry with u_g equal to u_g_1; if yes, take that target value and assign it to DF2.
If not, assign the target of the next bigger u_g for the same type to the new 'target' column.
DF2_modified:
id type u_g_1 u_g target
1 WG2 1.4 1.4 0.71
2 WG2 1.4 1.4 0.71
3 WG2 1.0 1.0 0.52
4 G1 4.8 4.8 0.88
5 G1 4.9 4.8 0.88
6 G2 2.1 2.1 0.76
7 SWG3 0.7 0.7 0.31
8 WG3 0.8 0.8 0.65
9 WG3 0.7 0.7 0.53
10 WG2 1.1 1.2 0.68
11 nan 0 nan nan
I tried it with
df2.merge(df1, left_on=['type', 'u_g_1'], right_on=['spec', 'u_g'], how='left')
This gives me target values where type == spec and u_g_1 == u_g, but not for rows where u_g_1 has no exact match in u_g. In that case I'd like to assign the target of the next bigger u_g value.
Can someone help?
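(For reference, a minimal construction of the two frames shown above, with values copied from the printouts, so the snippets below can be run as-is:)
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'spec': ['G1', 'G2', 'WG2', 'WG2', 'WG2', 'WG3', 'WG3', 'SWG3'],
                    'u_g': [4.8, 2.1, 1.4, 1.2, 1.0, 0.8, 0.7, 0.7],
                    'target': [0.88, 0.76, 0.71, 0.68, 0.52, 0.65, 0.53, 0.31]})
df2 = pd.DataFrame({'id': range(1, 12),
                    'type': ['WG2', 'WG2', 'WG2', 'G1', 'G1', 'G2',
                             'SWG3', 'WG3', 'WG3', 'WG2', np.nan],
                    'u_g_1': [1.4, 1.4, 1.0, 4.8, 4.9, 2.1, 0.7, 0.8, 0.7, 1.1, 0]})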
Use merge_asof with DataFrame.sort_values on both frames, then a final sort_values to restore the original order:
df = (pd.merge_asof(df2.sort_values('u_g_1'),
                    df1.sort_values('u_g'),
                    left_on='u_g_1',
                    left_by='type',
                    right_on='u_g',
                    right_by='spec')
        .sort_values('id', ignore_index=True))
print (df)
id type u_g_1 spec u_g target
0 1 WG2 1.4 WG2 1.4 0.71
1 2 WG2 1.4 WG2 1.4 0.71
2 3 WG2 1.0 WG2 1.0 0.52
3 4 G1 4.8 G1 4.8 0.88
4 5 G1 4.9 G1 4.8 0.88
5 6 G2 2.1 G2 2.1 0.76
6 7 SWG3 0.7 SWG3 0.7 0.31
7 8 WG3 0.8 WG3 0.8 0.65
8 9 WG3 0.7 WG3 0.7 0.53
9 10 WG2 1.1 WG2 1.0 0.52
10 11 NaN 0.0 NaN NaN NaN
EDIT: The same solution, with the default direction='backward' changed to direction='forward':
df = (pd.merge_asof(df2.sort_values('u_g_1'),
                    df1.sort_values('u_g'),
                    left_on='u_g_1',
                    left_by='type',
                    right_on='u_g',
                    right_by='spec',
                    direction='forward')
        .sort_values('id', ignore_index=True))
print (df)
id type u_g_1 spec u_g target
0 1 WG2 1.4 WG2 1.4 0.71
1 2 WG2 1.4 WG2 1.4 0.71
2 3 WG2 1.0 WG2 1.0 0.52
3 4 G1 4.8 G1 4.8 0.88
4 5 G1 4.9 NaN NaN NaN <- 4.9 is greater than 4.8, so NaN
5 6 G2 2.1 G2 2.1 0.76
6 7 SWG3 0.7 SWG3 0.7 0.31
7 8 WG3 0.8 WG3 0.8 0.65
8 9 WG3 0.7 WG3 0.7 0.53
9 10 WG2 1.1 WG2 1.2 0.68 <- 1.1 is less than 1.2, so it matches
10 11 NaN 0.0 NaN NaN NaN
Another idea with direction='nearest':
df = (pd.merge_asof(df2.sort_values('u_g_1'),
                    df1.sort_values('u_g'),
                    left_on='u_g_1',
                    left_by='type',
                    right_on='u_g',
                    right_by='spec',
                    direction='nearest')
        .sort_values('id', ignore_index=True))
print (df)
id type u_g_1 spec u_g target
0 1 WG2 1.4 WG2 1.4 0.71
1 2 WG2 1.4 WG2 1.4 0.71
2 3 WG2 1.0 WG2 1.0 0.52
3 4 G1 4.8 G1 4.8 0.88
4 5 G1 4.9 G1 4.8 0.88
5 6 G2 2.1 G2 2.1 0.76
6 7 SWG3 0.7 SWG3 0.7 0.31
7 8 WG3 0.8 WG3 0.8 0.65
8 9 WG3 0.7 WG3 0.7 0.53
9 10 WG2 1.1 WG2 1.2 0.68
10 11 NaN 0.0 NaN NaN NaN
EDIT2: Use direction='forward' first, then fill its missing values from the default direction='backward' result:
df0 = (pd.merge_asof(df2.sort_values('u_g_1'),
                     df1.sort_values('u_g'),
                     left_on='u_g_1',
                     left_by='type',
                     right_on='u_g',
                     right_by='spec')
         .set_index('id'))
print (df0)
type u_g_1 spec u_g target
id
11 NaN 0.0 NaN NaN NaN
7 SWG3 0.7 SWG3 0.7 0.31
9 WG3 0.7 WG3 0.7 0.53
8 WG3 0.8 WG3 0.8 0.65
3 WG2 1.0 WG2 1.0 0.52
10 WG2 1.1 WG2 1.0 0.52
1 WG2 1.4 WG2 1.4 0.71
2 WG2 1.4 WG2 1.4 0.71
6 G2 2.1 G2 2.1 0.76
4 G1 4.8 G1 4.8 0.88
5 G1 4.9 G1 4.8 0.88
df = (pd.merge_asof(df2.sort_values('u_g_1'),
                    df1.sort_values('u_g'),
                    left_on='u_g_1',
                    left_by='type',
                    right_on='u_g',
                    right_by='spec',
                    direction='forward')
        .set_index('id')
        .combine_first(df0)
        .sort_index()
        .reset_index())
print (df)
id type u_g_1 spec u_g target
0 1 WG2 1.4 WG2 1.4 0.71
1 2 WG2 1.4 WG2 1.4 0.71
2 3 WG2 1.0 WG2 1.0 0.52
3 4 G1 4.8 G1 4.8 0.88
4 5 G1 4.9 G1 4.8 0.88
5 6 G2 2.1 G2 2.1 0.76
6 7 SWG3 0.7 SWG3 0.7 0.31
7 8 WG3 0.8 WG3 0.8 0.65
8 9 WG3 0.7 WG3 0.7 0.53
9 10 WG2 1.1 WG2 1.2 0.68
10 11 NaN 0.0 NaN NaN NaN

resnet50 (bs 128) training steps are not showing up on Mac (Darwin Kernel Version 18.2.0), but show up when training on Linux (Ubuntu 18.04)

I was using the command below on both Linux and Mac; there is no problem on Linux:
python tf_cnn_benchmarks.py --model=resnet50_v2 --num_inter_threads=2 --batch_size=128 --num_batches=3000 --data_format NCHW --train_dir /tmp/output_dir --data_name=imagenet --data_dir /xxx/xxxx/xxxx/ --datasets_use_prefetch=False --save_model_steps=10000 --print_training_accuracy=True --summary_verbosity=2 --eval_during_training_every_n_epochs=1 --num_learning_rate_warmup_epochs=5 --num_epochs_per_decay=30 --learning_rate_decay_factor=0.1 --variable_update=parameter_server --optimizer=momentum --init_learning_rate=0.05 --num_eval_epochs=1 --save_summaries_steps=50
On my Mac, the training log looks like this:
TensorFlow: 1.14
Model: resnet50_v2
Dataset: imagenet
Mode: training + evaluation
SingleSess: False
Batch size: 128 global
128 per device
Num batches: x000
Num epochs: 0.x0
Devices: ['/gpu:0']
NUMA bind: False
Data format: NCHW
Optimizer: momentum
Variables: parameter_server
==========
Generating training model
Generating evaluation model
Initializing graph
Running warm up
Done with training
Running final evaluation at global_step xx10
10 12.7 examples/sec
20 13.1 examples/sec
The issue is: before "Done with training", it is supposed to show training steps like below, but it does not.
Running warm up
Done warm up
Step Img/sec total_loss top_1_accuracy top_5_accuracy
1 images/sec: 13.1 +/- 0.0 (jitter = 0.0) 7.493 0.000 0.000
10 images/sec: 31.6 +/- 4.5 (jitter = 14.2) 7.504 0.000 0.000
20 images/sec: 31.9 +/- 2.7 (jitter = 10.7) 7.490 0.000 0.008
30 images/sec: 32.9 +/- 2.2 (jitter = 10.1) 7.515 0.000 0.000
40 images/sec: 33.4 +/- 2.0 (jitter = 12.1) 7.495 0.016 0.016
....
On Linux, the logs shown are correct/as expected:
TensorFlow: 1.14
Model: resnet50_v2
Dataset: imagenet
Mode: training + evaluation
SingleSess: False
Batch size: 128 global
128 per device
Num batches: x000
Num epochs: 0.x0
Devices: ['/gpu:0']
Data format: NCHW
Optimizer: momentum
Variables: parameter_server
==========
Generating training model
Generating evaluation model
Initializing graph
Running warm up
Done warm up
Step Img/sec total_loss top_1_accuracy top_5_accuracy
1 images/sec: 13.1 +/- 0.0 (jitter = 0.0) 7.493 0.000 0.000
10 images/sec: 31.6 +/- 4.5 (jitter = 14.2) 7.504 0.000 0.000
20 images/sec: 31.9 +/- 2.7 (jitter = 10.7) 7.490 0.000 0.008
30 images/sec: 32.9 +/- 2.2 (jitter = 10.1) 7.515 0.000 0.000
40 images/sec: 33.4 +/- 2.0 (jitter = 12.1) 7.495 0.016 0.016
50 images/sec: 32.7 +/- 1.8 (jitter = 9.3) 7.495 0.000 0.000
60 images/sec: 33.5 +/- 1.6 (jitter = 9.4) 7.536 0.000 0.000
70 images/sec: 35.3 +/- 1.4 (jitter = 9.7) 7.521 0.000 0.000
.....
Thanks a lot!

How to interpolate 2D array from a coarser resolution to finer resolution

Suppose that I have emission data with shape (21600, 43200),
which corresponds to lat and lon, i.e.,
lat = np.arange(21600)*(-0.008333333)+90
lon = np.arange(43200)*0.00833333-180
I also have a scaling factor with shape (720, 1440, 7), which corresponds to lat, lon, and day of week, where
lat = np.arange(720)*0.25-90
lon = np.arange(1440)*0.25-180
For now, I want to apply the factor to the emission data, and I think I need to interpolate the factor from (720, 1440) to (21600, 43200). After that I can multiply the interpolated factor with the emission data to get the new emission output.
But I am having difficulty with the interpolation method.
Could anyone give me some suggestions?
Here's a complete example of the kind of interpolation you're trying to do. For example purposes I used emission data with shape (10, 20) and scale data with shape (5, 10). It uses scipy.interpolate.RectBivariateSpline, which is the recommended method for interpolating on regular grids:
import numpy as np
import scipy.interpolate as sci

def latlon(res):
    # lat has `res` points, lon has `2*res` points, like the real data
    return (np.arange(res)*(180/res) - 90,
            np.arange(2*res)*(360/(2*res)) - 180)

# fine grid: emission data of shape (10, 20)
lat_fine, lon_fine = latlon(10)
emission = np.ones(10*20).reshape(10, 20)

# coarse grid: scale data of shape (5, 10)
lat_coarse, lon_coarse = latlon(5)
scale = np.linspace(0, .5, num=5).reshape(-1, 1) + np.linspace(0, .5, num=10)

# fit a spline on the coarse grid, then evaluate it on the fine grid
f = sci.RectBivariateSpline(lat_coarse, lon_coarse, scale)
scale_interp = f(lat_fine, lon_fine)

with np.printoptions(precision=1, suppress=True, linewidth=9999):
    print('original emission data:\n%s\n' % emission)
    print('original scale data:\n%s\n' % scale)
    print('interpolated scale data:\n%s\n' % scale_interp)
    print('scaled emission data:\n%s\n' % (emission*scale_interp))
which outputs:
original emission data:
[[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]
original scale data:
[[0. 0.1 0.1 0.2 0.2 0.3 0.3 0.4 0.4 0.5]
[0.1 0.2 0.2 0.3 0.3 0.4 0.5 0.5 0.6 0.6]
[0.2 0.3 0.4 0.4 0.5 0.5 0.6 0.6 0.7 0.8]
[0.4 0.4 0.5 0.5 0.6 0.7 0.7 0.8 0.8 0.9]
[0.5 0.6 0.6 0.7 0.7 0.8 0.8 0.9 0.9 1. ]]
interpolated scale data:
[[0. 0. 0.1 0.1 0.1 0.1 0.2 0.2 0.2 0.2 0.3 0.3 0.3 0.4 0.4 0.4 0.4 0.5 0.5 0.5]
[0.1 0.1 0.1 0.1 0.2 0.2 0.2 0.3 0.3 0.3 0.3 0.4 0.4 0.4 0.5 0.5 0.5 0.5 0.6 0.6]
[0.1 0.2 0.2 0.2 0.2 0.3 0.3 0.3 0.3 0.4 0.4 0.4 0.5 0.5 0.5 0.5 0.6 0.6 0.6 0.6]
[0.2 0.2 0.2 0.3 0.3 0.3 0.4 0.4 0.4 0.4 0.5 0.5 0.5 0.5 0.6 0.6 0.6 0.7 0.7 0.7]
[0.3 0.3 0.3 0.3 0.4 0.4 0.4 0.4 0.5 0.5 0.5 0.6 0.6 0.6 0.6 0.7 0.7 0.7 0.8 0.8]
[0.3 0.3 0.4 0.4 0.4 0.5 0.5 0.5 0.5 0.6 0.6 0.6 0.6 0.7 0.7 0.7 0.8 0.8 0.8 0.8]
[0.4 0.4 0.4 0.5 0.5 0.5 0.5 0.6 0.6 0.6 0.7 0.7 0.7 0.7 0.8 0.8 0.8 0.8 0.9 0.9]
[0.4 0.5 0.5 0.5 0.5 0.6 0.6 0.6 0.7 0.7 0.7 0.7 0.8 0.8 0.8 0.9 0.9 0.9 0.9 0.9]
[0.5 0.5 0.6 0.6 0.6 0.6 0.7 0.7 0.7 0.7 0.8 0.8 0.8 0.9 0.9 0.9 0.9 1. 1. 1. ]
[0.5 0.5 0.6 0.6 0.6 0.6 0.7 0.7 0.7 0.7 0.8 0.8 0.8 0.9 0.9 0.9 0.9 1. 1. 1. ]]
scaled emission data:
[[0. 0. 0.1 0.1 0.1 0.1 0.2 0.2 0.2 0.2 0.3 0.3 0.3 0.4 0.4 0.4 0.4 0.5 0.5 0.5]
[0.1 0.1 0.1 0.1 0.2 0.2 0.2 0.3 0.3 0.3 0.3 0.4 0.4 0.4 0.5 0.5 0.5 0.5 0.6 0.6]
[0.1 0.2 0.2 0.2 0.2 0.3 0.3 0.3 0.3 0.4 0.4 0.4 0.5 0.5 0.5 0.5 0.6 0.6 0.6 0.6]
[0.2 0.2 0.2 0.3 0.3 0.3 0.4 0.4 0.4 0.4 0.5 0.5 0.5 0.5 0.6 0.6 0.6 0.7 0.7 0.7]
[0.3 0.3 0.3 0.3 0.4 0.4 0.4 0.4 0.5 0.5 0.5 0.6 0.6 0.6 0.6 0.7 0.7 0.7 0.8 0.8]
[0.3 0.3 0.4 0.4 0.4 0.5 0.5 0.5 0.5 0.6 0.6 0.6 0.6 0.7 0.7 0.7 0.8 0.8 0.8 0.8]
[0.4 0.4 0.4 0.5 0.5 0.5 0.5 0.6 0.6 0.6 0.7 0.7 0.7 0.7 0.8 0.8 0.8 0.8 0.9 0.9]
[0.4 0.5 0.5 0.5 0.5 0.6 0.6 0.6 0.7 0.7 0.7 0.7 0.8 0.8 0.8 0.9 0.9 0.9 0.9 0.9]
[0.5 0.5 0.6 0.6 0.6 0.6 0.7 0.7 0.7 0.7 0.8 0.8 0.8 0.9 0.9 0.9 0.9 1. 1. 1. ]
[0.5 0.5 0.6 0.6 0.6 0.6 0.7 0.7 0.7 0.7 0.8 0.8 0.8 0.9 0.9 0.9 0.9 1. 1. 1. ]]
Notes
The interpolation methods in scipy.interpolate expect both x and y to be strictly increasing, so you'll have to make sure that your emission data is arranged in a grid such that:
lat = np.arange(21600)*0.008333333 - 90
instead of:
lat = np.arange(21600)*(-0.008333333) + 90
like you have above. You can flip your emission data like so:
emission = emission[::-1, :]
If you're just looking for nearest neighbor or linear interpolation, you can use xarray's native da.interp method:
scaling_interped = scaling_factor.interp(
    lon=emissions.lon,
    lat=emissions.lat,
    method='nearest')  # or 'linear'
Note that this will dramatically increase the size of the array. Assuming these are 64-bit floats, the result will be approximately (21600*43200*7)*8/(1024**3) bytes, or about 48.7 GB. You could cut the in-memory size by a factor of 7 by chunking the array by day of week and doing the computation out of core with dask, as sketched below.
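A rough sketch of that chunked approach (the dimension name 'day' and the output filename are assumptions; use whatever your day-of-week dimension is actually called):
# chunk so each day-of-week slice (~7 GB once interpolated) is processed separately
scaling_lazy = scaling_factor.chunk({'day': 1})

scaling_interped = scaling_lazy.interp(lon=emissions.lon,
                                       lat=emissions.lat,
                                       method='nearest')

# nothing is computed until the result is consumed, e.g. written to disk
(emissions * scaling_interped).to_netcdf('scaled_emissions.nc')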
If you want to use an interpolation scheme other than nearest or linear, use the method suggested by tel.

cutting off the values at a threshold in pandas dataframe

I have a dataframe with 5 columns, all of which contain numerical values. The columns represent time steps. I have a threshold which, if reached within the time steps, stops the values from changing. So let's say the original values are [0, 1.5, 2, 4, 1] arranged in a row and the threshold is 2; then I want the manipulated row values to be [0, 1.5, 2, 2, 2].
Is there a way to do this without loops?
A bigger example:
>>> threshold = 0.25
>>> input
Out[75]:
0 1 2 3 4
130 0.10 0.20 0.12 0.25 0.20
143 0.11 0.27 0.12 0.28 0.35
146 0.30 0.20 0.12 0.25 0.20
324 0.06 0.20 0.12 0.15 0.20
>>> output
Out[75]:
0 1 2 3 4
130 0.10 0.20 0.12 0.25 0.25
143 0.11 0.27 0.27 0.27 0.27
146 0.30 0.30 0.30 0.30 0.30
324 0.06 0.20 0.12 0.15 0.20
Use:
df = df.where(df.ge(threshold).cumsum(axis=1).cumsum(axis=1).eq(1)).ffill(axis=1).fillna(df)
print (df)
0 1 2 3 4
130 0.10 0.20 0.12 0.25 0.25
143 0.11 0.27 0.27 0.27 0.27
146 0.30 0.30 0.30 0.30 0.30
324 0.06 0.20 0.12 0.15 0.20
Explanation:
Compare with the threshold using ge (>=):
print (df.ge(threshold))
0 1 2 3 4
130 False False False True False
143 False True False True True
146 True False False True False
324 False False False False False
Create the cumulative sum per row:
print (df.ge(threshold).cumsum(axis=1))
0 1 2 3 4
130 0 0 0 1 1
143 0 1 1 2 3
146 1 1 1 2 2
324 0 0 0 0 0
Apply cumsum again so only the first matched value becomes 1:
print (df.ge(threshold).cumsum(axis=1).cumsum(axis=1))
0 1 2 3 4
130 0 0 0 1 2
143 0 1 2 4 7
146 1 2 3 5 7
324 0 0 0 0 0
Compare with 1:
print (df.ge(threshold).cumsum(axis=1).cumsum(axis=1).eq(1))
0 1 2 3 4
130 False False False True False
143 False True False False False
146 True False False False False
324 False False False False False
Replace non-matched values with NaN:
print (df.where(df.ge(threshold).cumsum(axis=1).cumsum(axis=1).eq(1)))
0 1 2 3 4
130 NaN NaN NaN 0.25 NaN
143 NaN 0.27 NaN NaN NaN
146 0.3 NaN NaN NaN NaN
324 NaN NaN NaN NaN NaN
Forward fill missing values:
print (df.where(df.ge(threshold).cumsum(axis=1).cumsum(axis=1).eq(1)).ffill(axis=1))
0 1 2 3 4
130 NaN NaN NaN 0.25 0.25
143 NaN 0.27 0.27 0.27 0.27
146 0.3 0.30 0.30 0.30 0.30
324 NaN NaN NaN NaN NaN
Finally, restore the original values before the first match:
print (df.where(df.ge(threshold).cumsum(1).cumsum(1).eq(1)).ffill(axis=1).fillna(df))
0 1 2 3 4
130 0.10 0.20 0.12 0.25 0.25
143 0.11 0.27 0.27 0.27 0.27
146 0.30 0.30 0.30 0.30 0.30
324 0.06 0.20 0.12 0.15 0.20
A bit more complicated but I like it.
import numpy as np
import pandas as pd

v = df.values
a = v >= threshold
# NaN out every position from the first threshold crossing onward
b = np.where(np.logical_or.accumulate(a, axis=1), np.nan, v)
r = np.arange(len(a))
j = a.argmax(axis=1)  # column index of the first crossing in each row
b[r, j] = v[r, j]     # restore the crossing value itself
pd.DataFrame(b, df.index, df.columns).ffill(axis=1)
0 1 2 3 4
130 0.10 0.20 0.12 0.25 0.25
143 0.11 0.27 0.27 0.27 0.27
146 0.30 0.30 0.30 0.30 0.30
324 0.06 0.20 0.12 0.15 0.20
I like this one too:
v = df.values
a = v >= threshold
b = np.logical_or.accumulate(a, axis=1)  # True from the first crossing onward
r = np.arange(len(df))
g = a.argmax(1)                          # column of the first crossing per row
fill = pd.Series(v[r, g], df.index)      # the value at that first crossing
df.mask(b, fill, axis=0)                 # broadcast the fill value across each row
0 1 2 3 4
130 0.10 0.20 0.12 0.25 0.25
143 0.11 0.27 0.27 0.27 0.27
146 0.30 0.30 0.30 0.30 0.30
324 0.06 0.20 0.12 0.15 0.20
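(For completeness, a minimal construction of the example frame, with values copied from the question, so the snippets above can be run end to end:)
import pandas as pd

threshold = 0.25
df = pd.DataFrame([[0.10, 0.20, 0.12, 0.25, 0.20],
                   [0.11, 0.27, 0.12, 0.28, 0.35],
                   [0.30, 0.20, 0.12, 0.25, 0.20],
                   [0.06, 0.20, 0.12, 0.15, 0.20]],
                  index=[130, 143, 146, 324])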