I have a dataframe like this:
Index Latitude Longitude Wave Height Wave Period
0 7.101 101 0.3 4.1
1 7.103 101 0.25 4.2
2 7.105 101 0.5 4.4
3 0 0 0.6 4.6
4 0 0 0.7 4.8
5 7.1 101 0.3 4.1
6 7.1 101 0.3 4.3
7 7.1 101 0.3 4.4
8 0 0 0.6 4.6
9 0 0 0.7 4.8
10 7.1 101 0.3 4.1
I want to change the Wave Height and Wave Period values to zero if Latitude and Longitude equal zero.
Desired output:
Index Latitude Longitude Wave Height Wave Period
0 7.101 101 0.3 4.1
1 7.103 101 0.25 4.2
2 7.105 101 0.5 4.4
3 0 0 0 0
4 0 0 0 0
5 7.1 101 0.3 4.1
6 7.1 101 0.3 4.3
7 7.1 101 0.3 4.4
8 0 0 0 0
9 0 0 0 0
10 7.1 101 0.3 4.1
You could use df.loc:
df.loc[df['Latitude'].eq(0) & df['Longitude'].eq(0),
       ['Wave Height', 'Wave Period']] = 0
Output:
Index Latitude Longitude Wave Height Wave Period
0 7.101 101 0.30 4.1
1 7.103 101 0.25 4.2
2 7.105 101 0.50 4.4
3 0 0 0.00 0.0
4 0 0 0.00 0.0
5 7.100 101 0.30 4.1
6 7.100 101 0.30 4.3
7 7.100 101 0.30 4.4
8 0 0 0.00 0.0
9 0 0 0.00 0.0
10 7.100 101 0.30 4.1
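Note that eq(0) is simply the method form of == 0. If you prefer the comparison operators, wrap each condition in parentheses, because & binds more tightly than ==:

df.loc[(df['Latitude'] == 0) & (df['Longitude'] == 0),
       ['Wave Height', 'Wave Period']] = 0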
Use the NumPy function np.where:
import numpy as np

df["Wave Height"] = np.where((df["Latitude"] == 0) & (df["Longitude"] == 0), 0, df["Wave Height"])
df["Wave Period"] = np.where((df["Latitude"] == 0) & (df["Longitude"] == 0), 0, df["Wave Period"])
Please forgive my English; I hope I can explain clearly.
Assume we have this data:
>>> data = {'Span':[3,3.5], 'Low':[6.2,5.16], 'Medium':[4.93,4.1], 'High':[3.68,3.07], 'VeryHigh':[2.94,2.45], 'ExtraHigh':[2.48,2.06], '0.9':[4.9,3.61], '1.5':[3.23,2.38], '2':[2.51,1.85]}
>>> df = pd.DataFrame(data)
>>> df
Span Low Medium High VeryHigh ExtraHigh 0.9 1.5 2
0 3.0 6.20 4.93 3.68 2.94 2.48 4.90 3.23 2.51
1 3.5 5.16 4.10 3.07 2.45 2.06 3.61 2.38 1.85
I want to get this data:
Span Wind Snow MaxSpacing
0 3.0 Low 0.0 6.20
1 3.0 Medium 0.0 4.93
2 3.0 High 0.0 3.68
3 3.0 VeryHigh 0.0 2.94
4 3.0 ExtraHigh 0.0 2.48
5 3.0 0 0.9 4.90
6 3.0 0 1.5 3.23
7 3.0 0 2.0 2.51
8 3.5 Low 0.0 5.16
9 3.5 Medium 0.0 4.10
10 3.5 High 0.0 3.07
11 3.5 VeryHigh 0.0 2.45
12 3.5 ExtraHigh 0.0 2.06
13 3.5 0 0.9 3.61
14 3.5 0 1.5 2.38
15 3.5 0 2.0 1.85
The principles applied to df:
Span expands by the combinations of Wind and Snow to get MaxSpacing.
Wind and Snow are mutually exclusive: when Wind is one of 'Low', 'Medium', 'High', 'VeryHigh', 'ExtraHigh', Snow is zero; when Snow is one of 0.9, 1.5, 2, Wind is zero.
Please help. Thank you.
Use DataFrame.melt to unpivot, then sort by the original index; create the Snow column with to_numeric and Series.fillna inside DataFrame.insert, and finally set Wind to 0 wherever a numeric Snow value was parsed:
df = (df.melt('Span', ignore_index=False, var_name='Wind', value_name='MaxSpacing')
        .sort_index(ignore_index=True))
s = pd.to_numeric(df['Wind'], errors='coerce')
df.insert(2, 'Snow', s.fillna(0))
df.loc[s.notna(), 'Wind'] = 0
print(df)
Span Wind Snow MaxSpacing
0 3.0 Low 0.0 6.20
1 3.0 Medium 0.0 4.93
2 3.0 High 0.0 3.68
3 3.0 VeryHigh 0.0 2.94
4 3.0 ExtraHigh 0.0 2.48
5 3.0 0 0.9 4.90
6 3.0 0 1.5 3.23
7 3.0 0 2.0 2.51
8 3.5 Low 0.0 5.16
9 3.5 Medium 0.0 4.10
10 3.5 High 0.0 3.07
11 3.5 VeryHigh 0.0 2.45
12 3.5 ExtraHigh 0.0 2.06
13 3.5 0 0.9 3.61
14 3.5 0 1.5 2.38
15 3.5 0 2.0 1.85
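The key step is pd.to_numeric with errors='coerce': the wind labels fail to parse and become NaN, while the snow column names ('0.9', '1.5', '2') parse as floats. A quick illustration:

>>> pd.to_numeric(pd.Series(['Low', '0.9', '2']), errors='coerce')
0    NaN
1    0.9
2    2.0
dtype: float64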
Alternative solution with DataFrame.set_index and DataFrame.stack:
df = (df.set_index('Span')
        .rename_axis('Wind', axis=1)
        .stack()
        .reset_index(name='MaxSpacing'))
s = pd.to_numeric(df['Wind'], errors='coerce')
df.insert(2, 'Snow', s.fillna(0))
df.loc[s.notna(), 'Wind'] = 0
print(df)
The output is identical to the first solution.
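Both approaches give the same frame: melt needs ignore_index=False plus sort_index to interleave the unpivoted rows back into the order of the original rows, while stack produces that ordering naturally because it works row by row.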
Please help me. The problem is below:
Write an expression to extract a new dataframe containing those days where the temperature reached at least 70 degrees, and assign that to the variable at_least_70. (You might need to think about what the different columns in the full dataframe represent to decide how to extract the subset of interest.)
After that, write another expression that computes how many days reached at least 70 degrees, and assign that to the variable num_at_least_70.
This is the original DataFrame
Date Maximum Temperature Minimum Temperature \
0 2018-01-01 5 0
1 2018-01-02 13 1
2 2018-01-03 19 -2
3 2018-01-04 22 1
4 2018-01-05 18 -2
.. ... ... ...
360 2018-12-27 33 23
361 2018-12-28 40 21
362 2018-12-29 50 37
363 2018-12-30 37 24
364 2018-12-31 35 25
Average Temperature Precipitation Snowfall Snow Depth
0 2.5 0.04 1.0 3.0
1 7.0 0.03 0.6 4.0
2 8.5 0.00 0.0 4.0
3 11.5 0.00 0.0 3.0
4 8.0 0.09 1.2 4.0
.. ... ... ... ...
360 28.0 0.00 0.0 1.0
361 30.5 0.07 0.0 0.0
362 43.5 0.04 0.0 0.0
363 30.5 0.02 0.7 1.0
364 30.0 0.00 0.0 0.0
[365 rows x 7 columns]
The code I wrote for the problem is:
at_least_70 = dfc.loc[dfc['Minimum Temperature']>=70,['Date']]
print(at_least_70)
num_at_least_70 = at_least_70.count()
print(num_at_least_70)
The result it shows is:
Date
204 2018-07-24
240 2018-08-29
245 2018-09-03
Date 3
dtype: int64
But when I run the test case it shows:
Incorrect!
You are not correctly extracting the subset.
As suggested by @HenryYik, remove the column selector:
at_least_70 = dfc.loc[dfc['Maximum Temperature'] >= 70,
['Date', 'Maximum Temperature']]
num_at_least_70 = len(at_least_70)
Use boolean indexing, and to count the True values of the mask use sum:
mask = dfc['Maximum Temperature'] >= 70
at_least_70 = dfc[mask]
num_at_least_70 = mask.sum()
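This works because True counts as 1 when summed, so mask.sum() is exactly the number of rows that satisfy the condition; it is equivalent to len(dfc[mask]) without building the intermediate frame.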
I have this dataframe:
value limit_1 limit_2 limit_3 limit_4
10 2 3 7 10
11 5 6 11 13
2 0.3 0.9 2.01 2.99
I want to add another column called class that classifies the value column this way:
if value <= limit_1 then 1
if value > limit_1 and value <= limit_2 then 2
if value > limit_2 and value <= limit_3 then 3
if value > limit_3 then 4
to get this result:
value limit_1 limit_2 limit_3 limit_4 CLASS
10 2 3 7 10 4
11 5 6 11 13 3
2 0.3 0.9 2.01 2.99 3
I know I could get these ifs to work, but my dataframe has 2 million rows and I need the fastest way to perform this classification.
I tried the .cut function, but the result was not what I expected.
Thanks
We can use the rank method over the column axis (axis=1):
df["CLASS"] = df.rank(axis=1, method="first").iloc[:, 0].astype(int)
value limit_1 limit_2 limit_3 limit_4 CLASS
0 10 2.0 3.0 7.00 10.00 4
1 11 5.0 6.0 11.00 13.00 3
2 2 0.3 0.9 2.01 2.99 3
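The rank trick works because, within a row, the rank of value among [value, limit_1, ..., limit_4] is one plus the number of limits strictly below it, which is exactly the class; method="first" breaks ties in favour of value since it is the leftmost column. This assumes value is the first column and the limits are sorted in ascending order.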
We can use np.select:
import numpy as np
conditions = [df["value"] <= df["limit_1"],
              df["value"].between(df["limit_1"], df["limit_2"]),
              df["value"].between(df["limit_2"], df["limit_3"]),
              df["value"] > df["limit_3"]]
df["CLASS"] = np.select(conditions, [1,2,3,4])
>>> df
value limit_1 limit_2 limit_3 limit_4 CLASS
0 10 2.0 3.0 7.00 10.00 4
1 11 5.0 6.0 11.00 13.00 3
2 2 0.3 0.9 2.01 2.99 3
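Since np.select evaluates the conditions in order and takes the first match, the inclusive bounds of between cause no double counting: a value equal to limit_1 is caught by the first condition before the second is ever considered.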
I have a dataframe with different ids and possibly overlapping times, sampled at a time step of 0.4 seconds. I would like to resample the average speed for each id at a time step of 0.8 seconds.
time id speed
0 0.0 1 0
1 0.4 1 3
2 0.8 1 6
3 1.2 1 9
4 0.8 2 12
5 1.2 2 15
6 1.6 2 18
An example can be created with the following code:
import numpy as np
import pandas as pd

x = np.hstack((np.array([1] * 10), np.array([3] * 15)))
a = np.arange(10) * 0.4
b = np.arange(15) * 0.4 + 2
t = np.hstack((a, b))
df = pd.DataFrame({"time": t, "id": x})
df["speed"] = np.arange(25) * 3
The time column is converted to datetime type by:
df["re_time"] = pd.to_datetime(df["time"], unit='s')
Try with groupby:
block_size = int(0.8 // 0.4)  # number of 0.4 s samples per 0.8 s block
blocks = df.groupby('id').cumcount() // block_size  # block number within each id
df.groupby(['id', blocks]).agg({'time': 'first', 'speed': 'mean'})
Output:
time speed
id
1 0 0.0 1.5
1 0.8 7.5
2 1.6 13.5
3 2.4 19.5
4 3.2 25.5
3 0 2.0 31.5
1 2.8 37.5
2 3.6 43.5
3 4.4 49.5
4 5.2 55.5
5 6.0 61.5
6 6.8 67.5
7 7.6 72.0
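If you would rather bin on the timestamps themselves, here is a time-based sketch using the re_time column from the question and pd.Grouper (this buckets by fixed 0.8 s wall-clock windows, so it can differ from the count-based blocks if samples are missing):

out = (df.groupby(['id', pd.Grouper(key='re_time', freq='800ms')])['speed']
         .mean()
         .reset_index())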
I have two dataframes, one for ground-truth and one for predicted trajectories, plus a dataframe that matches ground-truth to predicted tracks at each frame. The ground-truth and predicted tracks are as follows:
df_pred_batch =
CENTER_X CENTER_Y LENGTH SPEED ACCELERATION HEADING
FrameId HId
0 0 -1.870000 -0.41 1.51 1.280 1.670 0.39
1 0 -1.730000 -0.36 1.51 1.440 1.660 0.40
2 0 -1.180000 -1.57 2.05 2.220 0.390 0.61
0 1 -1.540000 -1.83 2.05 2.140 0.390 0.61
1 1 -1.370000 -1.70 2.05 2.180 0.390 0.61
2 1 -1.590000 -0.29 1.51 1.610 1.630 0.41
1 2 -1.910000 -1.12 1.04 0.870 1.440 0.30
2 2 -1.810000 -1.09 1.04 1.010 1.440 0.27
0 3 17.190001 -3.15 1.80 2.178 -0.028 3.36
1 3 15.000000 -3.60 1.80 2.170 -0.020 3.38
df_gt_batch =
CENTER_X CENTER_Y LENGTH SPEED ACCELERATION HEADING
FrameId OId
1 0 -1.91 -1.12 1.040 0.87 1.44 0.30
2 0 -1.81 -1.09 1.040 1.01 1.44 0.27
0 1 -1.87 -0.41 1.510 1.28 1.67 0.39
1 1 -1.73 -0.36 1.510 1.44 1.66 0.40
2 1 -1.59 -0.29 1.510 1.61 1.63 0.41
0 2 -1.54 -1.83 2.056 2.14 0.39 0.61
1 2 -1.37 -1.70 2.050 2.18 0.39 0.61
2 2 -1.18 -1.57 2.050 2.22 0.39 0.61
0 3 1.71 -0.31 1.800 2.17 -0.02 3.36
1 3 1.50 -0.36 1.800 2.17 -0.02 3.38
2 3 1.29 -0.41 1.800 2.17 -0.01 3.40
Also, I know their matching at each timestamp:
matched_gt_pred =
FrameId Type OId HId
0 0 MATCH 1.0 0.0
1 0 MATCH 2.0 1.0
4 1 MATCH 1.0 0.0
5 1 MATCH 2.0 1.0
6 1 MATCH 0.0 2.0
9 2 MATCH 0.0 2.0
I would like to look at each row of matched_gt_pred and get the corresponding CENTER_X from df_pred_batch and df_gt_batch and calculate the error.
For instance, looking at the first row of matched_gt_pred, I know that at FrameId == 0, OId == 1 and HId == 0 are matched. I should get gt_center_x from df_gt_batch at (FrameId == 0, OId == 1) and pred_center_x from df_pred_batch at (FrameId == 0, HId == 0), and compute error = abs(gt_center_x - pred_center_x).
IIUC, I would reshape your df_gt_batch and df_pred_batch and use lookup:
gt_x = df_gt_batch['CENTER_X'].unstack().lookup(matched_gt_pred['FrameId'], matched_gt_pred['OId'])
pred_x = df_pred_batch['CENTER_X'].unstack().lookup(matched_gt_pred['FrameId'], matched_gt_pred['HId'])
matched_gt_pred['X Error'] = np.abs(gt_x - pred_x)
Output:
FrameId Type OId HId X Error
0 0 MATCH 1.0 0.0 0.0
1 0 MATCH 2.0 1.0 0.0
4 1 MATCH 1.0 0.0 0.0
5 1 MATCH 2.0 1.0 0.0
6 1 MATCH 0.0 2.0 0.0
9 2 MATCH 0.0 2.0 0.0
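Here unstack pivots the OId (or HId) index level into columns, so each (FrameId, OId) pair addresses exactly one cell, which lookup then fetches row-wise over the matched pairs.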
Another option is to use reindex with pd.MultiIndex:
matched_gt_pred['X Error'] = np.abs(
    df_pred_batch.reindex(pd.MultiIndex.from_arrays([matched_gt_pred['FrameId'], matched_gt_pred['HId']]))['CENTER_X'].to_numpy()
    - df_gt_batch.reindex(pd.MultiIndex.from_arrays([matched_gt_pred['FrameId'], matched_gt_pred['OId']]))['CENTER_X'].to_numpy())
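Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. On current pandas, a merge-based sketch should give the same result, assuming the frames shown above (the astype casts align the float OId/HId columns with the integer index levels):

m = matched_gt_pred.astype({'OId': 'int64', 'HId': 'int64'})
gt = df_gt_batch['CENTER_X'].rename('gt_x').reset_index()        # columns: FrameId, OId, gt_x
pred = df_pred_batch['CENTER_X'].rename('pred_x').reset_index()  # columns: FrameId, HId, pred_x
out = m.merge(gt, on=['FrameId', 'OId']).merge(pred, on=['FrameId', 'HId'])
out['X Error'] = (out['gt_x'] - out['pred_x']).abs()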