Cutting off values at a threshold in a pandas DataFrame

I have a dataframe with 5 columns, all of which contain numerical values. The columns represent time steps. I have a threshold which, if reached within the time, stops the values from changing. So if the original values in a row are [0, 1.5, 2, 4, 1] and the threshold is 2, I want the manipulated row to be [0, 1.5, 2, 2, 2].
Is there a way to do this without loops?
A bigger example:
>>> threshold = 0.25
>>> input
        0     1     2     3     4
130  0.10  0.20  0.12  0.25  0.20
143  0.11  0.27  0.12  0.28  0.35
146  0.30  0.20  0.12  0.25  0.20
324  0.06  0.20  0.12  0.15  0.20
>>> output
        0     1     2     3     4
130  0.10  0.20  0.12  0.25  0.25
143  0.11  0.27  0.27  0.27  0.27
146  0.30  0.30  0.30  0.30  0.30
324  0.06  0.20  0.12  0.15  0.20
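The answers below refer to this frame as df; it can be rebuilt from the sample above like this (a sketch):

import pandas as pd

threshold = 0.25
df = pd.DataFrame(
    [[0.10, 0.20, 0.12, 0.25, 0.20],
     [0.11, 0.27, 0.12, 0.28, 0.35],
     [0.30, 0.20, 0.12, 0.25, 0.20],
     [0.06, 0.20, 0.12, 0.15, 0.20]],
    index=[130, 143, 146, 324],
)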

Use:

df = df.where(df.ge(threshold).cumsum(axis=1).cumsum(axis=1).eq(1)).ffill(axis=1).fillna(df)
print (df)
        0     1     2     3     4
130  0.10  0.20  0.12  0.25  0.25
143  0.11  0.27  0.27  0.27  0.27
146  0.30  0.30  0.30  0.30  0.30
324  0.06  0.20  0.12  0.15  0.20
Explanation:

Compare with the threshold using ge (>=):

print (df.ge(threshold))
         0      1      2      3      4
130  False  False  False   True  False
143  False   True  False   True   True
146   True  False  False   True  False
324  False  False  False  False  False
Compute the cumulative sum along each row:

print (df.ge(threshold).cumsum(axis=1))
     0  1  2  3  4
130  0  0  0  1  1
143  0  1  1  2  3
146  1  1  1  2  2
324  0  0  0  0  0
Apply cumsum again, so that only the first matched value has a running total of 1:

print (df.ge(threshold).cumsum(axis=1).cumsum(axis=1))
     0  1  2  3  4
130  0  0  0  1  2
143  0  1  2  4  7
146  1  2  3  5  7
324  0  0  0  0  0
Compare with 1 to flag the first value at or above the threshold:

print (df.ge(threshold).cumsum(axis=1).cumsum(axis=1).eq(1))
         0      1      2      3      4
130  False  False  False   True  False
143  False   True  False  False  False
146   True  False  False  False  False
324  False  False  False  False  False
Replace all non-matched values with NaN using where:

print (df.where(df.ge(threshold).cumsum(axis=1).cumsum(axis=1).eq(1)))
       0     1    2     3    4
130  NaN   NaN  NaN  0.25  NaN
143  NaN  0.27  NaN   NaN  NaN
146  0.3   NaN  NaN   NaN  NaN
324  NaN   NaN  NaN   NaN  NaN
Forward fill missing values:

print (df.where(df.ge(threshold).cumsum(axis=1).cumsum(axis=1).eq(1)).ffill(axis=1))
       0     1     2     3     4
130  NaN   NaN   NaN  0.25  0.25
143  NaN  0.27  0.27  0.27  0.27
146  0.3  0.30  0.30  0.30  0.30
324  NaN   NaN   NaN   NaN   NaN
Finally, restore the original values before the first match with fillna:

print (df.where(df.ge(threshold).cumsum(1).cumsum(1).eq(1)).ffill(axis=1).fillna(df))
        0     1     2     3     4
130  0.10  0.20  0.12  0.25  0.25
143  0.11  0.27  0.27  0.27  0.27
146  0.30  0.30  0.30  0.30  0.30
324  0.06  0.20  0.12  0.15  0.20
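The same chain, wrapped as a reusable helper (the function name is mine, not from the answer):

def cap_at_threshold(df, threshold):
    # Flag the first value at or above the threshold, hold it to the end of the row
    first_hit = df.ge(threshold).cumsum(axis=1).cumsum(axis=1).eq(1)
    return df.where(first_hit).ffill(axis=1).fillna(df)

out = cap_at_threshold(df, 0.25)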

A bit more complicated, but I like it:

import numpy as np

v = df.values
a = v >= threshold
# NaN out everything from the first hit onward...
b = np.where(np.logical_or.accumulate(a, axis=1), np.nan, v)
r = np.arange(len(a))
j = a.argmax(axis=1)  # column of the first hit per row (0 if none)
b[r, j] = v[r, j]     # ...then restore the first hit itself
pd.DataFrame(b, df.index, df.columns).ffill(axis=1)

        0     1     2     3     4
130  0.10  0.20  0.12  0.25  0.25
143  0.11  0.27  0.27  0.27  0.27
146  0.30  0.30  0.30  0.30  0.30
324  0.06  0.20  0.12  0.15  0.20
I like this one too:

v = df.values
a = v >= threshold
b = np.logical_or.accumulate(a, axis=1)  # True from the first hit onward
r = np.arange(len(df))
g = a.argmax(1)                          # column of the first hit per row
fill = pd.Series(v[r, g], df.index)      # the value to hold, per row
df.mask(b, fill, axis=0)

        0     1     2     3     4
130  0.10  0.20  0.12  0.25  0.25
143  0.11  0.27  0.27  0.27  0.27
146  0.30  0.30  0.30  0.30  0.30
324  0.06  0.20  0.12  0.15  0.20
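Both NumPy versions rely on the fact that argmax over a boolean array returns the position of the first True. For a row that never reaches the threshold, argmax returns 0, but the accumulated mask is all False there, so such a row (like 324 above) is left untouched.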

Related

How to remove rows so that the values in a column match a sequence

I'm looking for a more efficient method to deal with the following problem. I have a DataFrame with a column filled with values that randomly range from 1 to 4, and I need to remove all the rows that do not follow the sequence (1-2-3-4-1-2-3-...).
This is what I have:
               A  B
12/2/2022   0.02  2
14/2/2022   0.01  1
15/2/2022   0.04  4
16/2/2022  -0.02  3
18/2/2022  -0.01  2
20/2/2022   0.04  1
21/2/2022   0.02  3
22/2/2022  -0.01  1
24/2/2022   0.04  4
26/2/2022  -0.02  2
27/2/2022   0.01  3
28/2/2022   0.04  1
01/3/2022  -0.02  3
03/3/2022  -0.01  2
05/3/2022   0.04  1
06/3/2022   0.02  3
08/3/2022  -0.01  1
10/3/2022   0.04  4
12/3/2022  -0.02  2
13/3/2022   0.01  3
15/3/2022   0.04  1
...
This is what I need:
               A  B
14/2/2022   0.01  1
18/2/2022  -0.01  2
21/2/2022   0.02  3
24/2/2022   0.04  4
28/2/2022   0.04  1
03/3/2022  -0.01  2
06/3/2022   0.02  3
10/3/2022   0.04  4
15/3/2022   0.04  1
...
Since the dataframe is quite big, I need some sort of NumPy-based operation to accomplish this, the more efficient the better. My current solution is ugly and inefficient: basically, I run four loops like the one below, one for each step of the sequence (4-1, 1-2, 2-3, 3-4):
df_len = len(df)
df_len2 = 0
while df_len != df_len2:
    df_len = len(df)
    df.loc[(df.B.shift(1) == 4) & (df.B != 1), 'B'] = 0
    df = df[df['B'] != 0]
    df_len2 = len(df)
Use itertools.cycle to define a cycling range:

from itertools import cycle

c_rng = cycle(range(1, 5))  # cycled range: 1, 2, 3, 4, 1, ...
start = next(c_rng)         # starting point
df[[(v == start) and bool(start := next(c_rng)) for v in df.B]]
               A  B
14/2/2022   0.01  1
18/2/2022  -0.01  2
21/2/2022   0.02  3
24/2/2022   0.04  4
28/2/2022   0.04  1
03/3/2022  -0.01  2
06/3/2022   0.02  3
10/3/2022   0.04  4
15/3/2022   0.04  1
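Note that the assignment expression (:=) requires Python 3.8+, and bool(start := next(c_rng)) is always True here (the cycled values 1-4 are all truthy); its only job is to advance the cycle when a match is found.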
A simple improvement to speed this up is to not touch the dataframe within the loop, but just iterate over the values of B to construct a Boolean index, like this:
is_in_sequence = []
next_target = 1
for b in df.B:
    if b == next_target:
        is_in_sequence.append(True)
        next_target = next_target % 4 + 1
    else:
        is_in_sequence.append(False)
print(df[is_in_sequence])
               A  B
14/2/2022   0.01  1
18/2/2022  -0.01  2
21/2/2022   0.02  3
24/2/2022   0.04  4
28/2/2022   0.04  1
03/3/2022  -0.01  2
06/3/2022   0.02  3
10/3/2022   0.04  4
15/3/2022   0.04  1
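If the plain Python loop is still too slow on a very large frame, the same single pass can be JIT-compiled; a sketch assuming numba is installed (not part of the original answers):

import numba
import numpy as np

@numba.njit
def sequence_mask(values, period=4):
    # Single pass: keep values that continue the 1..period cycle
    mask = np.empty(len(values), dtype=np.bool_)
    target = 1
    for i in range(len(values)):
        if values[i] == target:
            mask[i] = True
            target = target % period + 1
        else:
            mask[i] = False
    return mask

result = df[sequence_mask(df.B.to_numpy())]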

How to extract a dataframe based on a condition in pandas?

Please help me. Below is the problem:
Write an expression to extract a new dataframe containing those days where the temperature reached at least 70 degrees, and assign that to the variable at_least_70. (You might need to think about what the different columns in the full dataframe represent to decide how to extract the subset of interest.)
After that, write another expression that computes how many days reached at least 70 degrees, and assign that to the variable num_at_least_70.
This is the original DataFrame:

           Date  Maximum Temperature  Minimum Temperature  Average Temperature  Precipitation  Snowfall  Snow Depth
0    2018-01-01                    5                    0                  2.5           0.04       1.0         3.0
1    2018-01-02                   13                    1                  7.0           0.03       0.6         4.0
2    2018-01-03                   19                   -2                  8.5           0.00       0.0         4.0
3    2018-01-04                   22                    1                 11.5           0.00       0.0         3.0
4    2018-01-05                   18                   -2                  8.0           0.09       1.2         4.0
..          ...                  ...                  ...                  ...            ...       ...         ...
360  2018-12-27                   33                   23                 28.0           0.00       0.0         1.0
361  2018-12-28                   40                   21                 30.5           0.07       0.0         0.0
362  2018-12-29                   50                   37                 43.5           0.04       0.0         0.0
363  2018-12-30                   37                   24                 30.5           0.02       0.7         1.0
364  2018-12-31                   35                   25                 30.0           0.00       0.0         0.0

[365 rows x 7 columns]
The code I wrote for the problem above is:
at_least_70 = dfc.loc[dfc['Minimum Temperature']>=70,['Date']]
print(at_least_70)
num_at_least_70 = at_least_70.count()
print(num_at_least_70)
The result it shows:

           Date
204  2018-07-24
240  2018-08-29
245  2018-09-03

Date    3
dtype: int64
But when I run the test case, it shows:
Incorrect!
You are not correctly extracting the subset.
As suggested by @HenryYik, test the Maximum Temperature column and drop the Date-only column selector:

at_least_70 = dfc.loc[dfc['Maximum Temperature'] >= 70,
                      ['Date', 'Maximum Temperature']]
num_at_least_70 = len(at_least_70)
Use boolean indexing, and to count the True values in the mask, use sum:

mask = dfc['Maximum Temperature'] >= 70
at_least_70 = dfc[mask]
num_at_least_70 = mask.sum()
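A minimal, self-contained sketch of the pattern (made-up data, not the grader's):

import pandas as pd

dfc = pd.DataFrame({'Date': ['2018-07-01', '2018-07-02', '2018-07-03'],
                    'Maximum Temperature': [68, 74, 71]})

mask = dfc['Maximum Temperature'] >= 70
at_least_70 = dfc[mask]        # full rows where the maximum reached 70
num_at_least_70 = mask.sum()   # 2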

Changing the values of columns based on another column

I have a dataframe like this:
Index  Latitude  Longitude  Wave Height  Wave Period
    0     7.101        101         0.30          4.1
    1     7.103        101         0.25          4.2
    2     7.105        101         0.50          4.4
    3         0          0         0.60          4.6
    4         0          0         0.70          4.8
    5     7.100        101         0.30          4.1
    6     7.100        101         0.30          4.3
    7     7.100        101         0.30          4.4
    8         0          0         0.60          4.6
    9         0          0         0.70          4.8
   10     7.100        101         0.30          4.1
I want to change the Wave Height and Wave Period values to zero wherever Latitude and Longitude both equal zero.
Desired output:
Index  Latitude  Longitude  Wave Height  Wave Period
    0     7.101        101         0.30          4.1
    1     7.103        101         0.25          4.2
    2     7.105        101         0.50          4.4
    3         0          0         0.00          0.0
    4         0          0         0.00          0.0
    5     7.100        101         0.30          4.1
    6     7.100        101         0.30          4.3
    7     7.100        101         0.30          4.4
    8         0          0         0.00          0.0
    9         0          0         0.00          0.0
   10     7.100        101         0.30          4.1
You could use DataFrame.loc:

df.loc[df['Latitude'].eq(0) & df['Longitude'].eq(0),
       ['Wave Height', 'Wave Period']] = 0
Output:
Index  Latitude  Longitude  Wave Height  Wave Period
    0     7.101        101         0.30          4.1
    1     7.103        101         0.25          4.2
    2     7.105        101         0.50          4.4
    3         0          0         0.00          0.0
    4         0          0         0.00          0.0
    5     7.100        101         0.30          4.1
    6     7.100        101         0.30          4.3
    7     7.100        101         0.30          4.4
    8         0          0         0.00          0.0
    9         0          0         0.00          0.0
   10     7.100        101         0.30          4.1
Use the NumPy function np.where:

import numpy as np

df["Wave Height"] = np.where((df["Latitude"] == 0) & (df["Longitude"] == 0), 0, df["Wave Height"])
df["Wave Period"] = np.where((df["Latitude"] == 0) & (df["Longitude"] == 0), 0, df["Wave Period"])

How to get the difference between a column of two dataframes, using indices from another dataframe?

I have two dataframes, one with ground-truth and one with predicted trajectories, plus a dataframe that matches ground-truth to predicted tracks at each frame. The ground-truth and predicted tracks are as follows:
df_pred_batch =

             CENTER_X  CENTER_Y  LENGTH  SPEED  ACCELERATION  HEADING
FrameId HId
0       0   -1.870000     -0.41    1.51  1.280         1.670     0.39
1       0   -1.730000     -0.36    1.51  1.440         1.660     0.40
2       0   -1.180000     -1.57    2.05  2.220         0.390     0.61
0       1   -1.540000     -1.83    2.05  2.140         0.390     0.61
1       1   -1.370000     -1.70    2.05  2.180         0.390     0.61
2       1   -1.590000     -0.29    1.51  1.610         1.630     0.41
1       2   -1.910000     -1.12    1.04  0.870         1.440     0.30
2       2   -1.810000     -1.09    1.04  1.010         1.440     0.27
0       3   17.190001     -3.15    1.80  2.178        -0.028     3.36
1       3   15.000000     -3.60    1.80  2.170        -0.020     3.38
df_gt_batch =

             CENTER_X  CENTER_Y  LENGTH  SPEED  ACCELERATION  HEADING
FrameId OId
1       0       -1.91     -1.12   1.040   0.87          1.44     0.30
2       0       -1.81     -1.09   1.040   1.01          1.44     0.27
0       1       -1.87     -0.41   1.510   1.28          1.67     0.39
1       1       -1.73     -0.36   1.510   1.44          1.66     0.40
2       1       -1.59     -0.29   1.510   1.61          1.63     0.41
0       2       -1.54     -1.83   2.056   2.14          0.39     0.61
1       2       -1.37     -1.70   2.050   2.18          0.39     0.61
2       2       -1.18     -1.57   2.050   2.22          0.39     0.61
0       3        1.71     -0.31   1.800   2.17         -0.02     3.36
1       3        1.50     -0.36   1.800   2.17         -0.02     3.38
2       3        1.29     -0.41   1.800   2.17         -0.01     3.40
Also, I know their matching at each timestamp:

matched_gt_pred =

   FrameId   Type  OId  HId
0        0  MATCH  1.0  0.0
1        0  MATCH  2.0  1.0
4        1  MATCH  1.0  0.0
5        1  MATCH  2.0  1.0
6        1  MATCH  0.0  2.0
9        2  MATCH  0.0  2.0
I would like to go over each row of matched_gt_pred, fetch the corresponding CENTER_X from df_pred_batch and df_gt_batch, and calculate the error.
For instance, the first row of matched_gt_pred says that at FrameId == 0, OId == 1 is matched to HId == 0. So I should take gt_center_x = df_gt_batch.loc[(0, 1), 'CENTER_X'] and pred_center_x = df_pred_batch.loc[(0, 0), 'CENTER_X'], and compute error = abs(gt_center_x - pred_center_x).
IIUC, I would reshape your df_gt_batch and df_pred_batch and use lookup:

gt_x = df_gt_batch['CENTER_X'].unstack().lookup(matched_gt_pred['FrameId'], matched_gt_pred['OId'])
pred_x = df_pred_batch['CENTER_X'].unstack().lookup(matched_gt_pred['FrameId'], matched_gt_pred['HId'])
matched_gt_pred['X Error'] = np.abs(gt_x - pred_x)
Output:

   FrameId   Type  OId  HId  X Error
0        0  MATCH  1.0  0.0      0.0
1        0  MATCH  2.0  1.0      0.0
4        1  MATCH  1.0  0.0      0.0
5        1  MATCH  2.0  1.0      0.0
6        1  MATCH  0.0  2.0      0.0
9        2  MATCH  0.0  2.0      0.0
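Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. On current versions, the same row/column lookup can be done with NumPy fancy indexing; a sketch for the ground-truth side:

wide = df_gt_batch['CENTER_X'].unstack()
gt_x = wide.to_numpy()[wide.index.get_indexer(matched_gt_pred['FrameId']),
                       wide.columns.get_indexer(matched_gt_pred['OId'])]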
Another option is to use reindex with a pd.MultiIndex:

matched_gt_pred['X Error'] = (
    df_pred_batch.reindex(pd.MultiIndex.from_arrays([matched_gt_pred['FrameId'], matched_gt_pred['HId']]))['CENTER_X'].to_numpy()
    - df_gt_batch.reindex(pd.MultiIndex.from_arrays([matched_gt_pred['FrameId'], matched_gt_pred['OId']]))['CENTER_X'].to_numpy()
)

Pandas: comparing rows within one dataframe to create a new column

I have a problem which I cannot seem to get my head round.
df1 is as follows:
Group  item  Quarter  price  quantity
1      A     2017Q3    0.10      1000
1      A     2017Q4    0.11      1000
1      A     2018Q1    0.11      1000
1      A     2018Q2    0.12      1000
1      A     2018Q3    0.11      1000
The desired result is a new dataframe, call it df2, with an additional column:
Group  item  Quarter  price  quantity  savings/lost
1      A     2017Q3    0.10      1000          0.00
1      A     2017Q4    0.11      1000          0.00
1      A     2018Q1    0.11      1000          0.00
1      A     2018Q2    0.12      1000          0.00
1      A     2018Q3    0.11      1000         10.00
1      A     2018Q4    0.13      1000        -20.00
Essentially, I want to go down each row, look at its quarter, find the same quarter last year, and do a calculation (this quarter's price minus last year's price, times the quantity). If there is no previous-year data, just put 0 in the last column.
And to complete the picture, there are more groups and items in there, and even more quarters like 2016Q1, 2017Q1, 2018Q1, although I only need to compare with the year before. Quarters are stored as strings.
Use pandas.DataFrame.shift.
The code below assumes that your Quarter column is sorted and there are no missing quarters:
# Input dataframe
   Group item Quarter  price  quantity
0      1    A  2017Q3   0.10      1000
1      1    A  2017Q4   0.11      1000
2      1    A  2018Q1   0.11      1000
3      1    A  2018Q2   0.12      1000
4      1    A  2018Q3   0.11      1000
5      1    A  2018Q4   0.13      1000

# Code to generate the new column 'savings/lost'
df['savings/lost'] = df['price'] * df['quantity'] - df['price'].shift(4) * df['quantity'].shift(4)

# Output dataframe
   Group item Quarter  price  quantity  savings/lost
0      1    A  2017Q3   0.10      1000           NaN
1      1    A  2017Q4   0.11      1000           NaN
2      1    A  2018Q1   0.11      1000           NaN
3      1    A  2018Q2   0.12      1000           NaN
4      1    A  2018Q3   0.11      1000          10.0
5      1    A  2018Q4   0.13      1000          20.0
Update
I have updated my code to handle two things: first, sorting by Quarter, and second, the missing-quarter scenario. For grouping on columns, refer to pandas.DataFrame.groupby and the many groupby questions already answered on this site; a per-group sketch follows at the end of this answer.
# Input dataframe
   Group item Quarter  price  quantity
0      1    A  2014Q3   0.10       100
1      1    A  2017Q2   0.16       800
2      1    A  2017Q3   0.17       700
3      1    A  2015Q4   0.13       400
4      1    A  2016Q1   0.14       500
5      1    A  2014Q4   0.11       200
6      1    A  2015Q2   0.12       300
7      1    A  2016Q4   0.15       600
8      1    A  2018Q1   0.18       600
9      1    A  2018Q2   0.19       500

# Code to do the operations
import numpy as np
import pandas as pd

df.index = pd.PeriodIndex(df.Quarter, freq='Q')
df.sort_index(inplace=True)
df2 = df.reset_index(drop=True)

# Year-over-year difference where the same quarter a year earlier exists
prev = df.reindex(df.index - 4)
df2['Profit'] = (df.price * df.quantity).values - (prev.price * prev.quantity).values

# Fall back to the previous row where the year-earlier quarter is missing
df2['Profit'] = np.where(np.in1d(df.index - 4, df.index.values),
                         df2.Profit,
                         ((df.price * df.quantity) - (df.price.shift(1) * df.quantity.shift(1))).values)
df2.Profit.fillna(0, inplace=True)

# Output dataframe
   Group item Quarter  price  quantity  Profit
0      1    A  2014Q3   0.10       100     0.0
1      1    A  2014Q4   0.11       200    12.0
2      1    A  2015Q2   0.12       300    14.0
3      1    A  2015Q4   0.13       400    30.0
4      1    A  2016Q1   0.14       500    18.0
5      1    A  2016Q4   0.15       600    38.0
6      1    A  2017Q2   0.16       800    38.0
7      1    A  2017Q3   0.17       700    -9.0
8      1    A  2018Q1   0.18       600   -11.0
9      1    A  2018Q2   0.19       500   -33.0
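For completeness, a sketch of how the same logic could be applied per (Group, item) pair, as mentioned above (the wrapper function is mine, not part of the original answer):

import numpy as np
import pandas as pd

def add_profit(g):
    # g holds the rows for one (Group, item) pair; Quarter is a string like '2017Q3'
    g = g.copy()
    g.index = pd.PeriodIndex(g['Quarter'], freq='Q')
    g = g.sort_index()
    rev = g['price'] * g['quantity']
    prev = rev.reindex(g.index - 4)              # same quarter a year earlier
    yoy = rev.values - prev.values
    fallback = rev.values - rev.shift(1).values  # previous row otherwise
    g['Profit'] = np.where(np.in1d(g.index - 4, g.index), yoy, fallback)
    g['Profit'] = g['Profit'].fillna(0)
    return g.reset_index(drop=True)

out = df.groupby(['Group', 'item'], group_keys=False).apply(add_profit)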