Pandas Cumsum in expanding rows - pandas

Looking to learn how to code this solution in a more elegant way. I need to split a set of rows into smaller pieces, control the utilization, and calculate the balance. My current solution is not generating the balance properly.
import pandas as pd
import numpy as np
box_list = [['Box0', 0.2],
            ['Box1', 1.0],
            ['Box2', 1.8],
            ['Box4', 2.0],
            ['Box8', 4.01]]
sdf = pd.DataFrame(box_list, columns = ['Name', 'Size'])
print(sdf)
   Name  Size
0  Box0  0.20
1  Box1  1.00
2  Box2  1.80
3  Box4  2.00
4  Box8  4.01
df = pd.DataFrame({'Name': np.repeat(sdf['Name'], sdf['Size'].apply(np.ceil)),
                   'Size': np.repeat(sdf['Size'], sdf['Size'].apply(np.ceil))})
df['Max_Units'] = df['Size'].apply(lambda x: np.ceil(x) if x > 1.0 else 1.0)
df = df.reset_index()
df['Utilization'] = df['Size'].apply(lambda x: x - int(x) if x > 1.0 else (x if x < 1.0 else 1.0))
df['Balance'] = df['Max_Units']
g = df.groupby(['index'], as_index=0, group_keys=0)
df['Utilization'] = g.apply(lambda x:
                            pd.Series(np.where((x.Balance.shift(1) >= 1.0),
                                               1.0,
                                               x.Utilization))).values
df.loc[(df.Utilization == 0.0), ['Utilization']] = 1.0
df['Balance'] = g.apply(lambda x:
                        pd.Series(np.where((x.Balance.shift(1) >= 1.0),
                                           x.Max_Units - x.Utilization,
                                           0))).values
print(df)
    index  Name  Size  Max_Units  Utilization  Balance
0       0  Box0  0.20        1.0         0.20      0.0
1       1  Box1  1.00        1.0         1.00      0.0
2       2  Box2  1.80        2.0         0.80      0.0
3       2  Box2  1.80        2.0         1.00      1.0
4       3  Box4  2.00        2.0         1.00      0.0
5       3  Box4  2.00        2.0         1.00      1.0
6       4  Box8  4.01        5.0         0.01      0.0
7       4  Box8  4.01        5.0         1.00      4.0
8       4  Box8  4.01        5.0         1.00      4.0
9       4  Box8  4.01        5.0         1.00      4.0
10      4  Box8  4.01        5.0         1.00      4.0

I'm not sure if I completely understand what all of these values are supposed to be representing.
However, I've reproduced the desired output for your sample set in a more direct way:
import pandas as pd
import numpy as np
box_list = [['Box0', 0.2],
            ['Box1', 1.0],
            ['Box2', 1.8],
            ['Box4', 2.0],
            ['Box8', 4.01]]
df = pd.DataFrame(box_list, columns=['Name', 'Size'])
# Set ceil column to ceil of size since it's used more than once
df['ceil'] = df['Size'].apply(np.ceil)
# Duplicate Rows based on Ceil of Size
df = df.loc[df.index.repeat(df['ceil'])]
# Get Max Units by comparing it to the ceil column
df['Max_Units'] = df.apply(lambda s: max(s['ceil'], 1), axis=1)
# Extract Decimal Portion By Using % 1 (Catch Special Case of x == 1)
df['Utilization'] = df['Size'].apply(lambda x: 1 if x == 1 else x % 1)
# Everywhere Max_Units cumcount is not 0 set Utilization to 1
df.loc[df.groupby(df['Max_Units']).cumcount().ne(0), 'Utilization'] = 1
# Set Balance to index cumcount as float
df['Balance'] = df.groupby(df.index).cumcount().astype(float)
# Drop Unnecessary Column and reset index for output
df = df.drop(columns=['ceil']).reset_index()
# For Display
print(df)
Output:
    index  Name  Size  Max_Units  Utilization  Balance
0       0  Box0  0.20        1.0         0.20      0.0
1       1  Box1  1.00        1.0         1.00      0.0
2       2  Box2  1.80        2.0         0.80      0.0
3       2  Box2  1.80        2.0         1.00      1.0
4       3  Box4  2.00        2.0         1.00      0.0
5       3  Box4  2.00        2.0         1.00      1.0
6       4  Box8  4.01        5.0         0.01      0.0
7       4  Box8  4.01        5.0         1.00      1.0
8       4  Box8  4.01        5.0         1.00      2.0
9       4  Box8  4.01        5.0         1.00      3.0
10      4  Box8  4.01        5.0         1.00      4.0

Related

Obtaining a subset of the correlation matrix of a dataframe having only features that are less correlated

If I have a correlation matrix of features for a given target, like this:
       feat1  feat2  feat3  feat4  feat5
feat1    1    ....
feat2           1
feat3                  1
feat4                         1
feat5  ....                          1
how can I end up with a subset of the original correlation matrix, given only some features that are less correlated? Let's say:
       feat2  feat3  feat5
feat2    1    ....
feat3           1
feat5  ....           1
In order to subset, you just need to loc on both axes, i.e.:
In [105]: df
Out[105]:
     0    1    2     3    4
0  0.4  0.0  0.0  0.00  0.0
1  0.0  1.0  0.0  0.00  0.0
2  0.0  0.0  1.0  0.00  0.0
3  0.0  0.0  0.0  0.45  0.0
4  0.0  0.0  0.0  0.00  1.0
target = [0, 2, 3] # ['featX', 'featY', 'etc']
subset = df.loc[target, target]
Or if you want to filter by some logic, do it in steps:
corr = pd.Series(np.diag(df), index=df.index)
high_corr = corr[corr > 0.7].index
subset = df.loc[high_corr, high_corr]
In [114]: subset
Out[114]:
     1    2    4
1  1.0  0.0  0.0
2  0.0  1.0  0.0
4  0.0  0.0  1.0

Pandas concatenate dataframe with multiindex retaining index names

I have a list of DataFrames as follows where each DataFrame in the list is as follows:
dfList[0]
monthNum     1     2
G1
2.0       0.05 -0.16
3.0       1.17  0.07
4.0       9.06  0.83
dfList[1]
monthNum     1     2
G2
21.0      0.25  0.26
31.0      1.27  0.27
41.0      9.26  0.23
dfList[0].index
Float64Index([2.0, 3.0, 4.0], dtype='float64', name='G1')
dfList[0].columns
Int64Index([1, 2], dtype='int64', name='monthNum')
I am trying to achieve the following in a dataframe Final_Combined_DF:
monthNum     1     2
G1
2.0       0.05 -0.16
3.0       1.17  0.07
4.0       9.06  0.83
G2
21.0      0.25  0.26
31.0      1.27  0.27
41.0      9.26  0.23
I tried doing different combinations of:
pd.concat(dfList, axis=0)
but it has not given me desired output. I am not sure how to go about this.
We can use pd.concat with keys, taking the Index.name from each DataFrame, to add a new outer index level in the final frame:
final_combined_df = pd.concat(
    df_list, keys=map(lambda d: d.index.name, df_list)
)
final_combined_df:
monthNum   0  1
G1 2.0     4  7
   3.0     7  1
   4.0     9  5
G2 21.0    8  1
   31.0    1  8
   41.0    2  6
Setup Used:
import numpy as np
import pandas as pd
np.random.seed(5)
df_list = [
    pd.DataFrame(np.random.randint(1, 10, (3, 2)),
                 columns=pd.Index([0, 1], name='monthNum'),
                 index=pd.Index([2.0, 3.0, 4.0], name='G1')),
    pd.DataFrame(np.random.randint(1, 10, (3, 2)),
                 columns=pd.Index([0, 1], name='monthNum'),
                 index=pd.Index([21.0, 31.0, 41.0], name='G2'))
]
df_list:
[monthNum  0  1
 G1
 2.0       4  7
 3.0       7  1
 4.0       9  5,
 monthNum  0  1
 G2
 21.0      8  1
 31.0      1  8
 41.0      2  6]
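As a follow-up (a sketch, not part of the original answer): pd.concat also accepts a names argument, so the new outer level can be given an explicit name; 'group' below is just a placeholder.
final_combined_df = pd.concat(
    df_list,
    keys=[d.index.name for d in df_list],  # ['G1', 'G2']
    names=['group', None],                 # name the outer level, leave the inner level name unset
)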

how to get the difference between a column from two dataframes by getting their index from another dataframe?

I have two dataframes for groundtruth and predicted trajectories, and one dataframe for the matching between the groundtruth and predicted trajectories at each frame. The dataframes of the groundtruth tracks and predicted tracks are as follows:
df_pred_batch =
CENTER_X CENTER_Y LENGTH SPEED ACCELERATION HEADING
FrameId HId
0 0 -1.870000 -0.41 1.51 1.280 1.670 0.39
1 0 -1.730000 -0.36 1.51 1.440 1.660 0.40
2 0 -1.180000 -1.57 2.05 2.220 0.390 0.61
0 1 -1.540000 -1.83 2.05 2.140 0.390 0.61
1 1 -1.370000 -1.70 2.05 2.180 0.390 0.61
2 1 -1.590000 -0.29 1.51 1.610 1.630 0.41
1 2 -1.910000 -1.12 1.04 0.870 1.440 0.30
2 2 -1.810000 -1.09 1.04 1.010 1.440 0.27
0 3 17.190001 -3.15 1.80 2.178 -0.028 3.36
1 3 15.000000 -3.60 1.80 2.170 -0.020 3.38
df_gt_batch =
CENTER_X CENTER_Y LENGTH SPEED ACCELERATION HEADING
FrameId OId
1 0 -1.91 -1.12 1.040 0.87 1.44 0.30
2 0 -1.81 -1.09 1.040 1.01 1.44 0.27
0 1 -1.87 -0.41 1.510 1.28 1.67 0.39
1 1 -1.73 -0.36 1.510 1.44 1.66 0.40
2 1 -1.59 -0.29 1.510 1.61 1.63 0.41
0 2 -1.54 -1.83 2.056 2.14 0.39 0.61
1 2 -1.37 -1.70 2.050 2.18 0.39 0.61
2 2 -1.18 -1.57 2.050 2.22 0.39 0.61
0 3 1.71 -0.31 1.800 2.17 -0.02 3.36
1 3 1.50 -0.36 1.800 2.17 -0.02 3.38
2 3 1.29 -0.41 1.800 2.17 -0.01 3.40
Also, I know their matching at each timestamp:
matched_gt_pred =
FrameId Type OId HId
0 0 MATCH 1.0 0.0
1 0 MATCH 2.0 1.0
4 1 MATCH 1.0 0.0
5 1 MATCH 2.0 1.0
6 1 MATCH 0.0 2.0
9 2 MATCH 0.0 2.0
I would like to look at each row of matched_gt_pred and get the corresponding CENTER_X from df_pred_batch and df_gt_batch and calculate the error.
For instance, looking at the first row of matched_gt_pred, I know that at FrameId == 0, OId == 1 and HId == 0 are matched. I should get the CENTER_X values as gt_center_x = df_gt_batch["FrameId==0" and "OId == 1"].CENTER_X and pred_center_x = df_pred_batch["FrameId==0" and "HId == 0"].CENTER_X, and compute error = abs(gt_center_x - pred_center_x).
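Spelled out as real indexing on the MultiIndexed frames above, that single lookup would be something like this (just a sketch, assuming the FrameId/OId and FrameId/HId index levels shown):
# Single-match sketch: FrameId == 0, OId == 1 in the groundtruth, HId == 0 in the prediction
gt_center_x = df_gt_batch.loc[(0, 1), 'CENTER_X']
pred_center_x = df_pred_batch.loc[(0, 0), 'CENTER_X']
error = abs(gt_center_x - pred_center_x)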
IIUC, I would reshape your df_gt_batch and df_pred_batch and use lookup:
gt_x = df_gt_batch['CENTER_X'].unstack().lookup(matched_gt_pred['FrameId'], matched_gt_pred['OId'])
pred_x = df_pred_batch['CENTER_X'].unstack().lookup(matched_gt_pred['FrameId'], matched_gt_pred['HId'])
matched_gt_pred['X Error'] = np.abs(gt_x - pred_x)
Output:
FrameId Type OId HId X Error
0 0 MATCH 1.0 0.0 0.0
1 0 MATCH 2.0 1.0 0.0
4 1 MATCH 1.0 0.0 0.0
5 1 MATCH 2.0 1.0 0.0
6 1 MATCH 0.0 2.0 0.0
9 2 MATCH 0.0 2.0 0.0
Another option is to use reindex with pd.MultiIndex:
matched_gt_pred['X Error'] = np.abs(df_pred_batch.reindex(pd.MultiIndex.from_arrays([matched_gt_pred['FrameId'], matched_gt_pred['HId']]))['CENTER_X'].to_numpy() -
                                    df_gt_batch.reindex(pd.MultiIndex.from_arrays([matched_gt_pred['FrameId'], matched_gt_pred['OId']]))['CENTER_X'].to_numpy())

Python: group by with sum special columns and keep the initial rows too

I have a df:
ID Car Jan17 Jun18 Dec18 Apr19
0 Nissan 0.0 1.7 3.7 0.0
1 Porsche 10.0 0.0 2.8 3.5
2 Golf 0.0 1.7 3.0 2.0
3 Tiguan 1.0 0.0 3.0 5.2
4 Touareg 0.0 0.0 3.0 4.2
5 Mercedes 0.0 0.0 0.0 7.2
6 Passat 0.0 3.0 0.0 0.0
I would like to change the values in row #6 (Passat, in the Car column) by adding the values from row #2, row #3 and row #4 (Golf, Tiguan, Touareg), while also keeping the initial values of row #2, row #3 and row #4.
Because Passat includes Golf, Touareg and Tiguan, I need to add the values of the Golf, Touareg and Tiguan rows to the Passat row.
I tried to do it the following code:
car_list = ['Golf', 'Tiguan', 'Touareg']
for car in car_list:
    df['Car'][df['Car'] == car] = 'Passat'
and after I used groupby by Car and sum() function:
df1 = df.groupby(['Car'])['Jan17', 'Jun18', 'Dec18', 'Apr19'].sum().reset_index()
As a result, df1 doesn't keep the initial (Golf, Tiguan, Touareg) rows, so this approach is wrong.
Expected result is df1:
ID Car Jan17 Jun18 Dec18 Apr19
0 Nissan 0.0 1.7 3.7 0.0
1 Porsche 10.0 0.0 2.8 3.5
2 Golf 0.0 1.7 3.0 2.0
3 Tiguan 1.0 0.0 3.0 5.2
4 Touareg 0.0 0.0 3.0 4.2
5 Mercedes 0.0 0.0 0.0 7.2
6 Passat 1.0 4.7 9.0 11.4
I'd appreciate any ideas. Thanks!
First we use .isin to get the correct cars, then we use .filter to get the correct value columns, and finally we sum the values and store them in the variable sums.
Then we select the Passat row and add the values to that row:
sums = df[df['Car'].isin(car_list)].filter(regex=r'\w{3}\d{2}').sum()
df.loc[df['Car'].eq('Passat'), 'Jan17':] += sums
Output
ID Car Jan17 Jun18 Dec18 Apr19
0 0 Nissan 0.0 1.7 3.7 0.0
1 1 Porsche 10.0 0.0 2.8 3.5
2 2 Golf 0.0 1.7 3.0 2.0
3 3 Tiguan 1.0 0.0 3.0 5.2
4 4 Touareg 0.0 0.0 3.0 4.2
5 5 Mercedes 0.0 0.0 0.0 7.2
6 6 Passat 1.0 4.7 9.0 11.4
Another solution, in the form of a function:
car_list = ['Golf', 'Tiguan', 'Touareg', 'Passat']

def updateCarInfoBySum(df, car_list, name, id):
    req = df[df['Car'].isin(car_list)].copy()
    req.set_index(['Car', 'ID'], inplace=True)
    req.loc[('new_value', '000'), :] = req.sum(axis=0)
    req.reset_index(inplace=True)
    req = req[req.Car != name]
    req.loc[req['Car'] == 'new_value', 'Car'] = name
    req.loc[req['ID'] == '000', 'ID'] = id
    req.set_index(['Car', 'ID'], inplace=True)
    df_final = df.copy()
    df_final.set_index(['Car', 'ID'], inplace=True)
    df_final.update(req)
    return df_final
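A hypothetical call against the df shown above (assuming Passat's ID is 6) would look like:
# Hypothetical usage; df and car_list are the objects defined above
df1 = updateCarInfoBySum(df, car_list, 'Passat', 6)
print(df1)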

Pandas Python Moving Rows

I am new to Pandas and I have a csv file where I want to move every 2nd and 3rd row into the value1 and value2 columns. Could someone please help me out? I can't seem to figure it out.
data, value1, value2
1.00
2.00
3.00
4.00
5.00
6.00
7.00
8.00
9.00
The output would turn into this:
one, value1, value2
1.00 2.00 3.00
4.00 5.00 6.00
7.00 8.00 9.00
A more general solution is to create a MultiIndex with MultiIndex.from_arrays, using modulo and floor division of numpy.arange, and then unstack:
print (df)
data
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 6.0
6 7.0
7 8.0
8 9.0
9 10.0
a = np.arange(len(df.index))
print (a)
[0 1 2 3 4 5 6 7 8 9]
df.index = pd.MultiIndex.from_arrays([a % 3, a // 3])
print (df)
data
0 0 1.0
1 0 2.0
2 0 3.0
0 1 4.0
1 1 5.0
2 1 6.0
0 2 7.0
1 2 8.0
2 2 9.0
0 3 10.0
df1 = df['data'].unstack(0)
df1.columns=['data','value1','value2']
print (df1)
data value1 value2
0 1.0 2.0 3.0
1 4.0 5.0 6.0
2 7.0 8.0 9.0
3 10.0 NaN NaN
You can use the numpy reshape method, then convert back to a dataframe with pd.DataFrame and name your columns.
pd.DataFrame(df.values.reshape(3,3), columns=['data','value1','value2'])
Output:
data value1 value2
0 1 2 3
1 4 5 6
2 7 8 9
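If the number of rows isn't hard-coded, -1 lets numpy infer the row count from the data; a small sketch, assuming len(df) is a multiple of 3:
# Sketch: infer the number of output rows instead of hard-coding 3
pd.DataFrame(df.values.reshape(-1, 3), columns=['data', 'value1', 'value2'])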