How to format and calculate the data using awk? - awk

I have the original data array as below:
1 2 1.07
1 1 0.51
1 2 0.54
I want to calculate the values of a and b using the calculation below:
1 2 1.07
1 1 0.51
1 2 0.54
a b
a=1*1.07+1*0.51+1*0.54==2.12
b=2*1.07+1*0.51+2*0.54==3.73
Then regenerate the array in the format below:
1 2 1.07
1 1 0.51
1 2 0.54
2.12 3.73
I can't find a way to merge the two results into one line. Are there any suggestions? Thank you.
$ cat data.test
1 2 1.07
1 1 0.51
1 2 0.54
$ cat data.test|awk '{sum+=($1*$NF)}END{print sum}'
2.12
$ cat data.test|awk '{sum+=($2*$NF)}END{print sum}'
3.73

Related

Convert value counts of multiple columns to pandas dataframe

I have a dataset in this form:
Name Batch DXYR Emp Lateral GDX MMT CN
Joe 2 0 2 2 2 0
Alan 0 1 1 2 0 0
Josh 1 1 2 1 1 2
Max 0 1 0 0 0 2
These columns can have only three distinct values, i.e. 0, 1 and 2.
So, I need the percentage of value counts for each column as a pandas dataframe.
I simply wrote a loop like:
for i in df.columns:
    print((df[i].value_counts()/df[i].count())*100)
I am getting the output like:
0 90.608831
1 0.391169
2 9.6787899
Name: Batch, dtype: float64
0 95.545455
1 2.235422
2 2.6243553
Name: MX, dtype: float64
and so on...
These outputs are correct but I need it in pandas dataframe like this:
Batch DXYR Emp Lateral GDX MMT CN
Count_0_percent 98.32 52.5 22 54.5 44.2 53.4 76.01
Count_1_percent 0.44 34.5 43 43.5 44.5 46.5 22.44
Count_2_percent 1.3 64.3 44 2.87 12.6 1.88 2.567
Can someone please suggest how to get it?
You can melt the data, then use pd.crosstab:
melt = df.melt('Name')
pd.crosstab(melt['value'], melt['variable'], normalize='columns')
Or a bit faster (yet more verbose) with melt and groupby().value_counts():
(df.melt('Name')
   .groupby('variable')['value'].value_counts(normalize=True)
   .unstack('variable', fill_value=0)
)
Output:
variable Batch CN DXYR Emp Lateral GDX MMT
value
0 0.50 0.5 0.25 0.25 0.25 0.50
1 0.25 0.0 0.75 0.25 0.25 0.25
2 0.25 0.5 0.00 0.50 0.50 0.25
Update: apply also works:
df.drop(columns=['Name']).apply(pd.Series.value_counts, normalize=True)
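To get closer to the layout asked for in the question (percentages, with rows labelled Count_0_percent and so on), the apply approach can be extended roughly as follows. This is only a sketch: the sample frame is rebuilt by hand from the question, and it assumes "Emp Lateral" is a single column, as the answer's output suggests.

```python
import pandas as pd

# Sample data rebuilt from the question ("Emp Lateral" assumed to be one column).
df = pd.DataFrame({
    "Name": ["Joe", "Alan", "Josh", "Max"],
    "Batch": [2, 0, 1, 0],
    "DXYR": [0, 1, 1, 1],
    "Emp Lateral": [2, 1, 2, 0],
    "GDX": [2, 2, 1, 0],
    "MMT": [2, 0, 1, 0],
    "CN": [0, 0, 2, 2],
})

percent = (
    df.drop(columns=["Name"])
      .apply(lambda col: col.value_counts(normalize=True))  # share of each value per column
      .reindex([0, 1, 2])   # make sure all three values appear as rows, in order
      .fillna(0)            # a value that never occurs in a column has a 0% share
      .mul(100)             # fractions to percentages
)

# Relabel the rows in the style requested in the question.
percent.index = [f"Count_{v}_percent" for v in percent.index]
print(percent)
```

The reindex step matters when some value never appears in any column; without it that row would be missing entirely.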

Put next months start as previous months end pandas

I have a dataframe in long format (panel data). Each person has a start month along with some variables. It looks something like:
person_id month_start Var1 Var2
1         1           0.4  1.4
1         2           0.3  0.131
1         3           0.34 0.434
2         2           0.49 0.949
2         3           0.53 1.53
2         5           0.38 0.738
3         1           1.12 1.34
3         4           1.89 1.02
3         5           0.83 0.27
and I need it to look like:
person_id month_start month_end Var1 Var2
1         1           2         0.4  1.4
1         2           3         0.3  0.131
1         3           4         0.34 0.434
2         2           3         0.49 0.949
2         3           5         0.53 1.53
2         5           6         0.38 0.738
3         1           4         1.12 1.34
3         4           5         1.89 1.02
3         5           6         0.83 0.27
Where month end is the beginning of the next entry for that person.
I was able to make this:
a = pd.DataFrame({'person_id':[1,1,1,2,2,2,3,3,3], 'var1': [0.4, 0.3, 0.34, 0.49, 0.53, 0.38, 1.12, 1.89, 0.83], 'var2': [1.4, 0.131, 0.434, 0.949, 1.53, 0.738, 1.34, 1.02, 0.27], 'month_start': [1,2,3,2,3,5,1,4,5]})
def add_end_date(df_in, object_id, start_col, end_col):
    df = df_in.copy()
    prev_person_id = -1
    prev_index = -1
    df[end_col] = [-1]*len(df)
    for idx, row in df.iterrows():
        p_id = row[object_id]
        p_idx = idx
        if prev_person_id == p_id:
            df.loc[prev_index, end_col] = int(row[start_col])  # put in start date as last entry's end date
        if row[end_col] == -1:
            df.loc[idx, end_col] = int(row[start_col]+1)
        prev_person_id = p_id
        prev_index = p_idx
    return df
add_end_date(a, 'person_id', 'month_start', 'month_end')
Is there a better/optimized way to accomplish this?
Try groupby.shift:
df['month_end'] = df.groupby('person_id').month_start.shift(-1)\
.fillna(df.month_start + 1).astype(int)
df
person_id month_start Var1 Var2 month_end
0 1 1 0.40 1.400 2
1 1 2 0.30 0.131 3
2 1 3 0.34 0.434 4
3 2 2 0.49 0.949 3
4 2 3 0.53 1.530 5
5 2 5 0.38 0.738 6
6 3 1 1.12 1.340 4
7 3 4 1.89 1.020 5
8 3 5 0.83 0.270 6
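For reference, here is the same groupby.shift idea as a self-contained sketch, with the frame rebuilt from the question. The explicit sort is an addition of mine: it assumes rows might not already be ordered by person and start month, which shift(-1) relies on.

```python
import pandas as pd

# Data rebuilt from the question.
df = pd.DataFrame({
    "person_id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "month_start": [1, 2, 3, 2, 3, 5, 1, 4, 5],
    "Var1": [0.4, 0.3, 0.34, 0.49, 0.53, 0.38, 1.12, 1.89, 0.83],
    "Var2": [1.4, 0.131, 0.434, 0.949, 1.53, 0.738, 1.34, 1.02, 0.27],
})

# shift(-1) looks at the next row within each person, so order must be right.
df = df.sort_values(["person_id", "month_start"]).reset_index(drop=True)

# Next start month of the same person; NaN on each person's last row.
next_start = df.groupby("person_id")["month_start"].shift(-1)

# The last entry of each person ends one month after it starts.
df["month_end"] = next_start.fillna(df["month_start"] + 1).astype(int)
print(df)
```

Because the shift happens inside each group, the last row of one person never picks up the first start month of the next person.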

How to get the difference between a column from two dataframes by getting their index from another dataframe?

I have two dataframes for groundtruth and predicted trajectories and one dataframe for matching between the groundtruth and predicted trajectories at each frame. I have dataframe of the groundtruth tracks and predicted tracks as follows:
df_pred_batch =
CENTER_X CENTER_Y LENGTH SPEED ACCELERATION HEADING
FrameId HId
0 0 -1.870000 -0.41 1.51 1.280 1.670 0.39
1 0 -1.730000 -0.36 1.51 1.440 1.660 0.40
2 0 -1.180000 -1.57 2.05 2.220 0.390 0.61
0 1 -1.540000 -1.83 2.05 2.140 0.390 0.61
1 1 -1.370000 -1.70 2.05 2.180 0.390 0.61
2 1 -1.590000 -0.29 1.51 1.610 1.630 0.41
1 2 -1.910000 -1.12 1.04 0.870 1.440 0.30
2 2 -1.810000 -1.09 1.04 1.010 1.440 0.27
0 3 17.190001 -3.15 1.80 2.178 -0.028 3.36
1 3 15.000000 -3.60 1.80 2.170 -0.020 3.38
df_gt_batch =
CENTER_X CENTER_Y LENGTH SPEED ACCELERATION HEADING
FrameId OId
1 0 -1.91 -1.12 1.040 0.87 1.44 0.30
2 0 -1.81 -1.09 1.040 1.01 1.44 0.27
0 1 -1.87 -0.41 1.510 1.28 1.67 0.39
1 1 -1.73 -0.36 1.510 1.44 1.66 0.40
2 1 -1.59 -0.29 1.510 1.61 1.63 0.41
0 2 -1.54 -1.83 2.056 2.14 0.39 0.61
1 2 -1.37 -1.70 2.050 2.18 0.39 0.61
2 2 -1.18 -1.57 2.050 2.22 0.39 0.61
0 3 1.71 -0.31 1.800 2.17 -0.02 3.36
1 3 1.50 -0.36 1.800 2.17 -0.02 3.38
2 3 1.29 -0.41 1.800 2.17 -0.01 3.40
Also, I know their matching at each timestamp:
matched_gt_pred =
FrameId Type OId HId
0 0 MATCH 1.0 0.0
1 0 MATCH 2.0 1.0
4 1 MATCH 1.0 0.0
5 1 MATCH 2.0 1.0
6 1 MATCH 0.0 2.0
9 2 MATCH 0.0 2.0
I would like to look at each row of matched_gt_pred and get the corresponding CENTER_X from df_pred_batch and df_gt_batch and calculate the error.
For instance, looking at the first row of matched_gt_pred, I know that FrameId == 0, OId == 1 and HId == 0 are matched. I should get gt_center_x = df_gt_batch["FrameId==0" and "OId == 1"].CENTER_X and pred_center_x = df_pred_batch["FrameId==0" and "HId == 0"].CENTER_X, and then compute error = abs(gt_center_x - pred_center_x).
IIUC, I would reshape your df_gt_batch and df_pred_batch and use lookup (note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0, so this needs an older pandas):
gt_x = df_gt_batch['CENTER_X'].unstack().lookup(matched_gt_pred['FrameId'], matched_gt_pred['OId'])
pred_x = df_pred_batch['CENTER_X'].unstack().lookup(matched_gt_pred['FrameId'], matched_gt_pred['HId'])
matched_gt_pred['X Error'] = np.abs(gt_x - pred_x)
Output:
FrameId Type OId HId X Error
0 0 MATCH 1.0 0.0 0.0
1 0 MATCH 2.0 1.0 0.0
4 1 MATCH 1.0 0.0 0.0
5 1 MATCH 2.0 1.0 0.0
6 1 MATCH 0.0 2.0 0.0
9 2 MATCH 0.0 2.0 0.0
Another option is to use reindex with pd.MultiIndex:
matched_gt_pred['X Error'] = np.abs(df_pred_batch.reindex(pd.MultiIndex.from_arrays([matched_gt_pred['FrameId'], matched_gt_pred['HId']]))['CENTER_X'].to_numpy() -
    df_gt_batch.reindex(pd.MultiIndex.from_arrays([matched_gt_pred['FrameId'], matched_gt_pred['OId']]))['CENTER_X'].to_numpy())
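Since DataFrame.lookup is not available in pandas 2.x, a plain merge is another way to do the same matching. The sketch below is not the answer's method but a swapped-in technique. It rebuilds only the CENTER_X values of the first three ground-truth and predicted tracks from the question, and the helper names gt_x, pred_x and out are mine. Note that merge returns a fresh 0..n-1 index rather than the original 0, 1, 4, 5, 6, 9.

```python
import pandas as pd

# CENTER_X only, rebuilt from the question (tracks 0-2 of each frame set).
df_gt_batch = pd.DataFrame(
    {"CENTER_X": [-1.91, -1.81, -1.87, -1.73, -1.59, -1.54, -1.37, -1.18]},
    index=pd.MultiIndex.from_tuples(
        [(1, 0), (2, 0), (0, 1), (1, 1), (2, 1), (0, 2), (1, 2), (2, 2)],
        names=["FrameId", "OId"],
    ),
)
df_pred_batch = pd.DataFrame(
    {"CENTER_X": [-1.87, -1.73, -1.18, -1.54, -1.37, -1.59, -1.91, -1.81]},
    index=pd.MultiIndex.from_tuples(
        [(0, 0), (1, 0), (2, 0), (0, 1), (1, 1), (2, 1), (1, 2), (2, 2)],
        names=["FrameId", "HId"],
    ),
)
matched_gt_pred = pd.DataFrame(
    {
        "FrameId": [0, 0, 1, 1, 1, 2],
        "Type": ["MATCH"] * 6,
        "OId": [1.0, 2.0, 1.0, 2.0, 0.0, 0.0],
        "HId": [0.0, 1.0, 0.0, 1.0, 2.0, 2.0],
    },
    index=[0, 1, 4, 5, 6, 9],
)

# Flatten each track table to columns: FrameId, id, renamed CENTER_X.
gt_x = df_gt_batch["CENTER_X"].rename("gt_x").reset_index()
pred_x = df_pred_batch["CENTER_X"].rename("pred_x").reset_index()

out = (
    matched_gt_pred.astype({"OId": int, "HId": int})  # ids are stored as floats in the match table
    .merge(gt_x, on=["FrameId", "OId"], how="left")
    .merge(pred_x, on=["FrameId", "HId"], how="left")
)
out["X Error"] = (out["gt_x"] - out["pred_x"]).abs()
print(out)
```

With how="left", a match that points at a missing track shows up as NaN instead of being dropped silently.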

How to add a variable instead of a number in a dataframe?

Hi guys, I am trying to select the 2nd value and then add it to the rest of the array, except the 1st value.
This is what I have so far:
Xloc = X.iloc(1) # selecting the second value
X = X[1:-1] + Xloc # this doesn't work, but if I do + 1.25 it works...
the Dataframe
X
0
1.25
2.57
4.5
6.9
7.3
Expected Result
X
0
2.5
3.82
5.75
8.15
8.55
Given that this is your original df
N
0 0.00
1 1.25
2 2.57
3 4.50
4 6.90
5 7.30
you can assign these values and use a simple concat to add in the original value in place
df['M'] = pd.concat([df["N"].iloc[:1], (df["N"].iloc[1:] + df["N"].iloc[1])])
print(df)
N M
0 0.00 0.00
1 1.25 2.50
2 2.57 3.82
3 4.50 5.75
4 6.90 8.15
5 7.30 8.55
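An alternative sketch without concat is to copy the column and add the second value to every row after the first. The column names N and M follow the answer above; the variable name second is mine.

```python
import pandas as pd

df = pd.DataFrame({"N": [0.0, 1.25, 2.57, 4.5, 6.9, 7.3]})

# .iloc uses square brackets; X.iloc(1) in the question does not select a value.
second = df["N"].iloc[1]

# Start from a copy, then add the second value to every row except the first.
df["M"] = df["N"].copy()
df.loc[df.index[1:], "M"] += second
print(df)
```

Note that the slice [1:-1] used in the question would also drop the last row; [1:] keeps it.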

Pandas: 1 dataframe comparing rows to create new column

I have a problem which I cannot seem to get my head round.
df1 is as follows:
Group item Quarter price quantity
1 A 2017Q3 0.10 1000
1 A 2017Q4 0.11 1000
1 A 2018Q1 0.11 1000
1 A 2018Q2 0.12 1000
1 A 2018Q3 0.11 1000
The desired result is a new dataframe, call it df2, with an additional column.
Group item Quarter price quantity savings/lost
1 A 2017Q3 0.10 1000 0.00
1 A 2017Q4 0.11 1000 0.00
1 A 2018Q1 0.11 1000 0.00
1 A 2018Q2 0.12 1000 0.00
1 A 2018Q3 0.11 1000 10.00
1 A 2018Q4 0.13 1000 -20.00
Essentially, I want to go down each row, look at the quarter, find the same quarter of the previous year and do a calculation: (price this quarter - price of that quarter last year) * quantity. If there is no data for the previous year's quarter, just put 0.00 in the last column.
To complete the picture, there are more groups and items in the data, and more quarters such as 2016Q1, 2017Q1, 2018Q1, although I only need to compare with the year before. Quarters are in string format.
Use pandas.DataFrame.shift
The code below assumes that your Quarter column is sorted and that there are no missing quarters. You can try the code below:
# Input dataframe
Group item Quarter price quantity
0 1 A 2017Q3 0.10 1000
1 1 A 2017Q4 0.11 1000
2 1 A 2018Q1 0.11 1000
3 1 A 2018Q2 0.12 1000
4 1 A 2018Q3 0.11 1000
5 1 A 2018Q4 0.13 1000
# Code to generate your new column 'savings/lost'
df['savings/lost'] = df['price'] * df['quantity'] - df['price'].shift(4) * df['quantity'].shift(4)
# Output dataframe
Group item Quarter price quantity savings/lost
0 1 A 2017Q3 0.10 1000 NaN
1 1 A 2017Q4 0.11 1000 NaN
2 1 A 2018Q1 0.11 1000 NaN
3 1 A 2018Q2 0.12 1000 NaN
4 1 A 2018Q3 0.11 1000 10.0
5 1 A 2018Q4 0.13 1000 20.0
Update
I have updated my code to handle two things: first, sorting by Quarter, and second, handling missing quarters. For grouping based on columns you can refer to pandas.DataFrame.groupby and the many groupby-related questions already answered on this site.
#Input dataframe
Group item Quarter price quantity
0 1 A 2014Q3 0.10 100
1 1 A 2017Q2 0.16 800
2 1 A 2017Q3 0.17 700
3 1 A 2015Q4 0.13 400
4 1 A 2016Q1 0.14 500
5 1 A 2014Q4 0.11 200
6 1 A 2015Q2 0.12 300
7 1 A 2016Q4 0.15 600
8 1 A 2018Q1 0.18 600
9 1 A 2018Q2 0.19 500
#Code to do the operations
import numpy as np
import pandas as pd
df.index = pd.PeriodIndex(df.Quarter, freq='Q')
df.sort_index(inplace=True)
df2 = df.reset_index(drop=True)
revenue = df.price * df.quantity
# revenue of the same quarter one year earlier (NaN where that quarter is missing)
last_year = revenue.reindex(df.index - 4)
# use the year-ago quarter when it exists, otherwise fall back to the previous available row;
# .values keeps the assignment positional, since df2 no longer has the PeriodIndex
df2['Profit'] = np.where((df.index - 4).isin(df.index),
    revenue.values - last_year.values, (revenue - revenue.shift(1)).values)
df2['Profit'] = df2['Profit'].fillna(0)
#Output dataframe
Group item Quarter price quantity Profit
0 1 A 2014Q3 0.10 100 0.0
1 1 A 2014Q4 0.11 200 12.0
2 1 A 2015Q2 0.12 300 14.0
3 1 A 2015Q4 0.13 400 30.0
4 1 A 2016Q1 0.14 500 18.0
5 1 A 2016Q4 0.15 600 38.0
6 1 A 2017Q2 0.16 800 38.0
7 1 A 2017Q3 0.17 700 -9.0
8 1 A 2018Q1 0.18 600 -11.0
9 1 A 2018Q2 0.19 500 -33.0
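For the multi-group case mentioned in the question, one possible sketch is a self-merge on the previous year's quarter label. This is a swapped-in technique, not the answer's shift-based one. It assumes each (Group, item, Quarter) combination appears only once and that Quarter strings always look like "2018Q3". The helper columns last_year_quarter and price_last_year are names I made up, and the sign convention (this year's price minus last year's, times this quarter's quantity) is also an assumption, since the question's desired output is not consistent on sign.

```python
import pandas as pd

# Sample rows taken from the first answer's input.
df = pd.DataFrame({
    "Group": [1] * 6,
    "item": ["A"] * 6,
    "Quarter": ["2017Q3", "2017Q4", "2018Q1", "2018Q2", "2018Q3", "2018Q4"],
    "price": [0.10, 0.11, 0.11, 0.12, 0.11, 0.13],
    "quantity": [1000] * 6,
})

# Build the label of the same quarter one year earlier, e.g. "2018Q3" -> "2017Q3".
year = df["Quarter"].str[:4].astype(int)
df["last_year_quarter"] = (year - 1).astype(str) + df["Quarter"].str[4:]

# Lookup table: price of each (Group, item, Quarter), keyed under the name
# that the current-year rows use to refer to it.
lookup = df[["Group", "item", "Quarter", "price"]].rename(
    columns={"Quarter": "last_year_quarter", "price": "price_last_year"}
)

out = df.merge(lookup, on=["Group", "item", "last_year_quarter"], how="left")

# Rows without a year-ago quarter get 0, as requested in the question.
out["savings/lost"] = (
    (out["price"] - out["price_last_year"]) * out["quantity"]
).fillna(0)
print(out[["Group", "item", "Quarter", "price", "quantity", "savings/lost"]])
```

Because the match is on the quarter label rather than on row position, missing quarters and unsorted input do not shift the comparison onto the wrong row.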