Compare two dataframes and find the highest difference - pandas

I have two dataframes, df1 and df2, both indexed by [i_batch, i_example].
The columns are different RMSE errors. I would like to find the [i_batch, i_example] pairs where df1 is much lower than df2, or the rows where df1 has less error than df2, based on the common [i_batch, i_example].
Note that a given [i_batch, i_example] may occur in only one of df1 or df2, but I need to consider only the [i_batch, i_example] pairs that exist in both df1 and df2.
df1 =
rmse_ACCELERATION rmse_CENTER_X rmse_CENTER_Y rmse_HEADING rmse_LENGTH rmse_TURN_RATE rmse_VELOCITY rmse_WIDTH
i_batch i_example
0 0.0 1.064 1.018 0.995 0.991 1.190 0.967 1.029 1.532
1 0.0 1.199 1.030 1.007 1.048 1.278 0.967 1.156 1.468
1.0 1.101 1.026 1.114 2.762 0.967 0.967 1.083 1.186
2 0.0 1.681 1.113 1.090 1.001 1.670 0.967 1.205 1.160
1.0 1.637 1.122 1.183 0.987 1.521 0.967 1.191 1.278
2.0 1.252 1.035 1.035 2.507 1.108 0.967 1.210 1.595
3 0.0 1.232 1.014 1.019 1.627 1.143 0.967 1.080 1.583
1.0 1.195 1.028 1.019 1.151 1.097 0.967 1.071 1.549
2.0 1.233 1.010 1.004 1.616 1.135 0.967 1.082 1.573
3.0 1.179 1.017 1.014 1.368 1.132 0.967 1.099 1.518
and
df2 =
rmse_ACCELERATION rmse_CENTER_X rmse_CENTER_Y rmse_HEADING rmse_LENGTH rmse_TURN_RATE rmse_VELOCITY rmse_WIDTH
i_batch i_example
1 0.0 0.071 0.034 0.048 0.114 0.006 1.309e-03 0.461 0.004
1.0 0.052 0.055 0.062 2.137 0.023 8.232e-04 0.357 0.011
2 0.0 1.665 0.156 0.178 0.112 0.070 3.751e-03 2.326 0.016
1.0 0.880 0.210 0.088 0.055 0.202 1.449e-03 0.899 0.047
2.0 0.199 0.072 0.078 1.686 0.010 6.240e-04 0.239 0.008
3 0.0 0.332 0.068 0.097 1.211 0.022 5.127e-04 0.167 0.016
1.0 0.252 0.075 0.070 0.368 0.013 5.295e-04 0.136 0.008
2.0 0.268 0.067 0.064 1.026 0.010 5.564e-04 0.175 0.010
3.0 0.171 0.051 0.054 0.473 0.011 4.150e-04 0.220 0.009
5 0.0 0.014 0.099 0.119 0.389 0.123 3.846e-04 0.313 0.037
For instance, how can I get the [i_batch, i_example] where `df1['rmse_ACCELERATION'] < df2['rmse_ACCELERATION']`?

Do a merge and then filter according to your needs:
df_merge = df1.merge(df2,
                     left_index=True,
                     right_index=True,
                     suffixes=('_1', '_2'))
df_merge[
    df_merge['rmse_ACCELERATION_1'] < df_merge['rmse_ACCELERATION_2']
].index
However, I don't see any records with the same [i_batch, i_example] in both dataframes that pass the condition.

Use .sub(), which aligns on the common indices and subtracts the matching rows:
df3 = df1.sub(df2)
df3[(df3 < 0).any(axis=1)]

                   rmse_ACCELERATION  rmse_CENTER_X  rmse_CENTER_Y  rmse_HEADING  rmse_LENGTH  rmse_TURN_RATE  rmse_VELOCITY  rmse_WIDTH
i_batch i_example
2       0.0                    0.016          0.957          0.912         0.889          1.6        0.963249         -1.121       1.144

Or go specific and pull the matching rows out of df1 directly:
df1[(df1.sub(df2) < 0).any(axis=1)]
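A minimal sketch for the "highest difference" part, assuming the gap of interest is df1 - df2 restricted to the shared index (the variable names below are illustrative):
common = df1.index.intersection(df2.index)  # only [i_batch, i_example] present in both
diff = df1.loc[common] - df2.loc[common]
# rows where df1 is lower than df2 in one column, biggest gap first
gaps = diff.sort_values('rmse_ACCELERATION')
print(gaps[gaps['rmse_ACCELERATION'] < 0].index.tolist())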

Pandas shows NaN values after fillna with mode [duplicate]

I read data from a CSV and fill the NaN values with the mode, like this:
df = pd.read_csv(r'C:\Users\PC\Downloads\File.csv')
df.fillna(df.mode(), inplace=True)
It still shows NaN values, like this:
0 0.0 0.0 4.7 0.0 138.0 0.15 0.15
1 0.0 1.0 3.5 0.0 132.0 0.38 0.18
2 0.0 0.0 4.0 0.0 132.0 0.30 0.11
3 0.0 1.0 3.9 0.0 146.0 0.75 0.37
4 0.0 1.0 3.5 0.0 132.0 0.45 0.22
5 0.0 NaN NaN NaN NaN 0.45 0.22
6 0.0 NaN NaN NaN NaN 0.30 0.11
7 0.0 0.0 4.5 0.0 136.0 NaN NaN
8 0.0 NaN NaN NaN NaN 0.30 0.37
9 0.0 NaN NaN NaN NaN 0.38 0.11
If I fillna with the mean there is no problem. How do I fillna with the mode?
Because DataFrame.mode can return multiple rows when several values tie for the maximum count, select the first row:
print (df.mode())
1 2 3 4 5 6 7
0 0.0 0.0 3.5 0.0 132.0 0.3 0.11
1 NaN 1.0 NaN NaN NaN NaN NaN
df.fillna(df.mode().iloc[0], inplace=True)
print (df)
1 2 3 4 5 6 7
0
0 0.0 0.0 4.7 0.0 138.0 0.15 0.15
1 0.0 1.0 3.5 0.0 132.0 0.38 0.18
2 0.0 0.0 4.0 0.0 132.0 0.30 0.11
3 0.0 1.0 3.9 0.0 146.0 0.75 0.37
4 0.0 1.0 3.5 0.0 132.0 0.45 0.22
5 0.0 0.0 3.5 0.0 132.0 0.45 0.22
6 0.0 0.0 3.5 0.0 132.0 0.30 0.11
7 0.0 0.0 4.5 0.0 136.0 0.30 0.11
8 0.0 0.0 3.5 0.0 132.0 0.30 0.37
9 0.0 0.0 3.5 0.0 132.0 0.38 0.11
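For context: fillna with a DataFrame aligns on both index and columns, so df.fillna(df.mode()) only consults rows 0 and 1 of df.mode() when filling rows 0 and 1 of df. Selecting the first row with .iloc[0] yields a Series indexed by column names, which broadcasts down each column. A minimal non-inplace sketch of the same fix:
mode_row = df.mode().iloc[0]  # one modal value per column, as a Series
df = df.fillna(mode_row)      # same effect as the inplace call above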

How to get the difference between a column from two dataframes by looking up their indices in another dataframe?

I have two dataframes, one of ground-truth tracks and one of predicted tracks, plus a third dataframe that records the matching between ground-truth and predicted trajectories at each frame. The track dataframes are as follows:
df_pred_batch =
CENTER_X CENTER_Y LENGTH SPEED ACCELERATION HEADING
FrameId HId
0 0 -1.870000 -0.41 1.51 1.280 1.670 0.39
1 0 -1.730000 -0.36 1.51 1.440 1.660 0.40
2 0 -1.180000 -1.57 2.05 2.220 0.390 0.61
0 1 -1.540000 -1.83 2.05 2.140 0.390 0.61
1 1 -1.370000 -1.70 2.05 2.180 0.390 0.61
2 1 -1.590000 -0.29 1.51 1.610 1.630 0.41
1 2 -1.910000 -1.12 1.04 0.870 1.440 0.30
2 2 -1.810000 -1.09 1.04 1.010 1.440 0.27
0 3 17.190001 -3.15 1.80 2.178 -0.028 3.36
1 3 15.000000 -3.60 1.80 2.170 -0.020 3.38
df_gt_batch =
CENTER_X CENTER_Y LENGTH SPEED ACCELERATION HEADING
FrameId OId
1 0 -1.91 -1.12 1.040 0.87 1.44 0.30
2 0 -1.81 -1.09 1.040 1.01 1.44 0.27
0 1 -1.87 -0.41 1.510 1.28 1.67 0.39
1 1 -1.73 -0.36 1.510 1.44 1.66 0.40
2 1 -1.59 -0.29 1.510 1.61 1.63 0.41
0 2 -1.54 -1.83 2.056 2.14 0.39 0.61
1 2 -1.37 -1.70 2.050 2.18 0.39 0.61
2 2 -1.18 -1.57 2.050 2.22 0.39 0.61
0 3 1.71 -0.31 1.800 2.17 -0.02 3.36
1 3 1.50 -0.36 1.800 2.17 -0.02 3.38
2 3 1.29 -0.41 1.800 2.17 -0.01 3.40
Also, I know their matching at each timestamp:
matched_gt_pred =
FrameId Type OId HId
0 0 MATCH 1.0 0.0
1 0 MATCH 2.0 1.0
4 1 MATCH 1.0 0.0
5 1 MATCH 2.0 1.0
6 1 MATCH 0.0 2.0
9 2 MATCH 0.0 2.0
I would like to look at each row of matched_gt_pred, get the corresponding CENTER_X from df_pred_batch and df_gt_batch, and calculate the error.
For instance, the first row of matched_gt_pred says that at FrameId == 0, OId == 1 and HId == 0 are matched. So I should get gt_center_x = df_gt_batch.loc[(0, 1), 'CENTER_X'] and pred_center_x = df_pred_batch.loc[(0, 0), 'CENTER_X'], and compute error = abs(gt_center_x - pred_center_x).
IIUC, I would reshape your df_gt_batch and df_pred_batch and use lookup (note: DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0):
import numpy as np

gt_x = df_gt_batch['CENTER_X'].unstack().lookup(matched_gt_pred['FrameId'], matched_gt_pred['OId'])
pred_x = df_pred_batch['CENTER_X'].unstack().lookup(matched_gt_pred['FrameId'], matched_gt_pred['HId'])
matched_gt_pred['X Error'] = np.abs(gt_x - pred_x)
Output:
FrameId Type OId HId X Error
0 0 MATCH 1.0 0.0 0.0
1 0 MATCH 2.0 1.0 0.0
4 1 MATCH 1.0 0.0 0.0
5 1 MATCH 2.0 1.0 0.0
6 1 MATCH 0.0 2.0 0.0
9 2 MATCH 0.0 2.0 0.0
Another option is to use reindex with a pd.MultiIndex:
matched_gt_pred['X Error'] = np.abs(
    df_pred_batch.reindex(pd.MultiIndex.from_arrays([matched_gt_pred['FrameId'], matched_gt_pred['HId']]))['CENTER_X'].to_numpy()
    - df_gt_batch.reindex(pd.MultiIndex.from_arrays([matched_gt_pred['FrameId'], matched_gt_pred['OId']]))['CENTER_X'].to_numpy()
)
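Since DataFrame.lookup no longer exists in pandas 2.x, here is a hedged alternative sketch that merges the match table against each track table on its MultiIndex (gt_x and pred_x are illustrative names, and the int/float key dtypes are assumed to align on merge):
gt_x = df_gt_batch['CENTER_X'].rename('gt_x')
pred_x = df_pred_batch['CENTER_X'].rename('pred_x')
out = (matched_gt_pred
       .merge(gt_x, left_on=['FrameId', 'OId'], right_index=True)
       .merge(pred_x, left_on=['FrameId', 'HId'], right_index=True))
out['X Error'] = (out['gt_x'] - out['pred_x']).abs()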

Display a pandas dataframe in an Excel file with split-level columns and merged cells

I have a large dataframe df as:
Col1 Col2 ATC_Dzr ATC_Last ATC_exp Op_Dzr2 Op_Last2
1Loc get1 0.26 3.88 3.73 0.16 3.15
2Loc get2 0.4 -0.85 -0.86 0.1 -0.54
3Loc get3 -0.59 1.47 2.01 -0.53 1.29
I need to dump this to Excel so that ATC and Op appear in merged header cells spanning their sub-columns (Dzr, Last, exp under ATC; Dzr2, Last2 under Op).
I am not sure how to approach this.
You can set the first 2 columns as the index, then split the remaining column names on '_' with expand=True to create a MultiIndex:
df1 = df.set_index(['Col1','Col2'])
df1.columns = df1.columns.str.split('_',expand=True)
print(df1)
ATC Op
Dzr Last exp Dzr2 Last2
Col1 Col2
1Loc get1 0.26 3.88 3.73 0.16 3.15
2Loc get2 0.40 -0.85 -0.86 0.10 -0.54
3Loc get3 -0.59 1.47 2.01 -0.53 1.29
Then export df1 to Excel.
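For example (a minimal sketch; the file name is illustrative, and an Excel engine such as openpyxl must be installed):
df1.to_excel('split_header.xlsx')  # the top level (ATC, Op) is written as merged cells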
As per the comments by @Datanovice, you can also use pd.MultiIndex.from_tuples:
df1 = df.set_index(['Col1','Col2'])
df1.columns = pd.MultiIndex.from_tuples([(col.split('_')[0], col.split('_')[1])
for col in df1.columns])
print(df1)
ATC Op
Dzr Last exp Dzr2 Last2
Col1 Col2
1Loc get1 0.26 3.88 3.73 0.16 3.15
2Loc get2 0.40 -0.85 -0.86 0.10 -0.54
3Loc get3 -0.59 1.47 2.01 -0.53 1.29

Restart cumulative sum in pandas dataframe

I am trying to compute a cumulative sum in a pandas dataframe, restarting it every time the absolute value exceeds 0.009. I could give you an excerpt of my attempts, but I assume they would just distract you; I have tried several things with np.where, but at some point the conditions start to overlap and pick out the wrong rows.
Column b is the desired output.
df = pd.DataFrame({'values':(49.925,49.928,49.945,49.928,49.925,49.935,49.938,49.942,49.931,49.952)})
df['a'] = df['values'].diff()
values a b
0 49.925 NaN 0.000
1 49.928 0.003 0.003
2 49.945 0.017 0.020 (restart cumsum next row)
3 49.928 -0.017 -0.017 (restart cumsum next row)
4 49.925 -0.003 -0.003
5 49.935 0.010 0.007
6 49.938 0.003 0.010 (restart cumsum next row)
7 49.942 0.004 0.004
8 49.931 -0.011 -0.007
9 49.952 0.021 0.014 (restart cumsum next row)
So the objective is to restart the cumulative sum whenever its absolute value exceeds 0.009.
I couldn't solve this in a vectorized manner; however, applying a stateful function appears to work.
import pandas as pd

print(pd.__version__)

df = pd.DataFrame({'values': (49.925, 49.928, 49.945, 49.928, 49.925, 49.935, 49.938, 49.942, 49.931, 49.952)})
df['a'] = df['values'].diff()

accumulator = 0.0
reset = False

def myfunc(x):
    # restart the running total on the row after the threshold was exceeded
    global accumulator, reset
    if reset:
        accumulator = 0.0
        reset = False
    accumulator += x
    if abs(accumulator) > 0.009:
        reset = True
    return accumulator

df['a'] = df['a'].fillna(0)
df['b'] = df['a'].apply(myfunc)
print(df)
Produces
0.24.2
values a b
0 49.925 0.000 0.000
1 49.928 0.003 0.003
2 49.945 0.017 0.020
3 49.928 -0.017 -0.017
4 49.925 -0.003 -0.003
5 49.935 0.010 0.007
6 49.938 0.003 0.010
7 49.942 0.004 0.004
8 49.931 -0.011 -0.007
9 49.952 0.021 0.014
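A loop-free (though still not vectorized) alternative sketch that avoids the global state, using itertools.accumulate on the same df and the same 0.009 threshold:
from itertools import accumulate

def step(acc, x):
    # restart on the row after the running total exceeded the threshold
    return x if abs(acc) > 0.009 else acc + x

df['b'] = list(accumulate(df['a'], step))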

Cutting off values at a threshold in a pandas dataframe

I have a dataframe with 5 columns, all containing numerical values. The columns represent time steps. I have a threshold which, once reached within a row, stops the values from changing. So if the original row values are [0, 1.5, 2, 4, 1] and the threshold is 2, I want the manipulated row values to be [0, 1.5, 2, 2, 2] (values before the threshold is first reached stay as they are).
Is there a way to do this without loops?
A bigger example:
>>> threshold = 0.25
>>> input
Out[75]:
0 1 2 3 4
130 0.10 0.20 0.12 0.25 0.20
143 0.11 0.27 0.12 0.28 0.35
146 0.30 0.20 0.12 0.25 0.20
324 0.06 0.20 0.12 0.15 0.20
>>> output
Out[75]:
0 1 2 3 4
130 0.10 0.20 0.12 0.25 0.25
143 0.11 0.27 0.27 0.27 0.27
146 0.30 0.30 0.30 0.30 0.30
324 0.06 0.20 0.12 0.15 0.20
Use:
df = df.where(df.ge(threshold).cumsum(axis=1).cumsum(axis=1).eq(1)).ffill(axis=1).fillna(df)
print (df)
0 1 2 3 4
130 0.10 0.20 0.12 0.25 0.25
143 0.11 0.27 0.27 0.27 0.27
146 0.30 0.30 0.30 0.30 0.30
324 0.06 0.20 0.12 0.15 0.20
Explanation:
Compare with the threshold using ge (>=):
print (df.ge(threshold))
0 1 2 3 4
130 False False False True False
143 False True False True True
146 True False False True False
324 False False False False False
Create the cumulative sum along each row:
print (df.ge(threshold).cumsum(axis=1))
0 1 2 3 4
130 0 0 0 1 1
143 0 1 1 2 3
146 1 1 1 2 2
324 0 0 0 0 0
Apply cumsum again, so that only the first match in each row equals 1:
print (df.ge(threshold).cumsum(axis=1).cumsum(axis=1))
0 1 2 3 4
130 0 0 0 1 2
143 0 1 2 4 7
146 1 2 3 5 7
324 0 0 0 0 0
Compare with 1:
print (df.ge(threshold).cumsum(axis=1).cumsum(axis=1).eq(1))
0 1 2 3 4
130 False False False True False
143 False True False False False
146 True False False False False
324 False False False False False
Replace the non-matched values with NaN:
print (df.where(df.ge(threshold).cumsum(axis=1).cumsum(axis=1).eq(1)))
0 1 2 3 4
130 NaN NaN NaN 0.25 NaN
143 NaN 0.27 NaN NaN NaN
146 0.3 NaN NaN NaN NaN
324 NaN NaN NaN NaN NaN
Forward fill missing values:
print (df.where(df.ge(threshold).cumsum(axis=1).cumsum(axis=1).eq(1)).ffill(axis=1))
0 1 2 3 4
130 NaN NaN NaN 0.25 0.25
143 NaN 0.27 0.27 0.27 0.27
146 0.3 0.30 0.30 0.30 0.30
324 NaN NaN NaN NaN NaN
Finally, fill the remaining NaNs from the original dataframe (the values before each row's first hit, and rows with no hit at all):
print (df.where(df.ge(threshold).cumsum(axis=1).cumsum(axis=1).eq(1)).ffill(axis=1).fillna(df))
0 1 2 3 4
130 0.10 0.20 0.12 0.25 0.25
143 0.11 0.27 0.27 0.27 0.27
146 0.30 0.30 0.30 0.30 0.30
324 0.06 0.20 0.12 0.15 0.20
A bit more complicated but I like it.
import numpy as np

v = df.values
a = v >= threshold
# NaN out everything from the first threshold hit onward
b = np.where(np.logical_or.accumulate(a, axis=1), np.nan, v)
r = np.arange(len(a))
j = a.argmax(axis=1)
# restore the first hit itself, then carry it forward
b[r, j] = v[r, j]
pd.DataFrame(b, df.index, df.columns).ffill(axis=1)
0 1 2 3 4
130 0.10 0.20 0.12 0.25 0.25
143 0.11 0.27 0.27 0.27 0.27
146 0.30 0.30 0.30 0.30 0.30
324 0.06 0.20 0.12 0.15 0.20
I like this one too:
v = df.values
a = v >= threshold
b = np.logical_or.accumulate(a, axis=1)  # True from the first hit onward
r = np.arange(len(df))
g = a.argmax(1)
fill = pd.Series(v[r, g], df.index)      # each row's value at its first hit
df.mask(b, fill, axis=0)
0 1 2 3 4
130 0.10 0.20 0.12 0.25 0.25
143 0.11 0.27 0.27 0.27 0.27
146 0.30 0.30 0.30 0.30 0.30
324 0.06 0.20 0.12 0.15 0.20
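One more hedged sketch in the same spirit, using cummax instead of the double cumsum to locate each row's first threshold hit (assuming the same df and threshold as above):
hit = df.ge(threshold).cummax(axis=1)                      # True from the first hit onward
first_hit = hit & ~hit.shift(1, axis=1, fill_value=False)  # isolate the hit itself
df.where(first_hit).ffill(axis=1).fillna(df)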