How to remove rows so that the values in a column match a sequence - pandas

I'm looking for a more efficient method to deal with the following problem. I have a DataFrame with a column filled with values that randomly range from 1 to 4. I need to remove all the rows of the DataFrame that do not follow the sequence (1-2-3-4-1-2-3-...).
This is what I have:
A B
12/2/2022 0.02 2
14/2/2022 0.01 1
15/2/2022 0.04 4
16/2/2022 -0.02 3
18/2/2022 -0.01 2
20/2/2022 0.04 1
21/2/2022 0.02 3
22/2/2022 -0.01 1
24/2/2022 0.04 4
26/2/2022 -0.02 2
27/2/2022 0.01 3
28/2/2022 0.04 1
01/3/2022 -0.02 3
03/3/2022 -0.01 2
05/3/2022 0.04 1
06/3/2022 0.02 3
08/3/2022 -0.01 1
10/3/2022 0.04 4
12/3/2022 -0.02 2
13/3/2022 0.01 3
15/3/2022 0.04 1
...
This is what I need:
A B
14/2/2022 0.01 1
18/2/2022 -0.01 2
21/2/2022 0.02 3
24/2/2022 0.04 4
28/2/2022 0.04 1
03/3/2022 -0.01 2
06/3/2022 0.02 3
10/3/2022 0.04 4
15/3/2022 0.04 1
...
Since the DataFrame is quite big, I need some sort of NumPy-based operation to accomplish this; the more efficient the better. My solution is very ugly and inefficient: basically, I made 4 loops like the following, one for each part of the sequence (4-1, 1-2, 2-3, 3-4):
df_len = len(df)
df_len2 = 0
while df_len != df_len2:
    df_len = len(df)
    df.loc[(df.B.shift(1) == 4) & (df.B != 1), 'B'] = 0
    df = df[df['B'] != 0]
    df_len2 = len(df)

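You can use itertools.cycle to define a cycled range (the walrus operator requires Python 3.8+):
from itertools import cycle
c_rng = cycle(range(1, 5))  # cycled range 1, 2, 3, 4, 1, 2, ...
start = next(c_rng)  # next expected value, initially 1
# keep a row when it matches the expected value, then advance the expected value
df[[(v == start) and bool(start := next(c_rng)) for v in df.B]]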
A B
14/2/2022 0.01 1
18/2/2022 -0.01 2
21/2/2022 0.02 3
24/2/2022 0.04 4
28/2/2022 0.04 1
03/3/2022 -0.01 2
06/3/2022 0.02 3
10/3/2022 0.04 4
15/3/2022 0.04 1

A simple improvement to speed this up is to not touch the dataframe within the loop, but just iterate over the values of B to construct a Boolean index, like this:
is_in_sequence = []
next_target = 1
for b in df.B:
    if b == next_target:
        is_in_sequence.append(True)
        next_target = next_target % 4 + 1
    else:
        is_in_sequence.append(False)
print(df[is_in_sequence])
A B
14/2/2022 0.01 1
18/2/2022 -0.01 2
21/2/2022 0.02 3
24/2/2022 0.04 4
28/2/2022 0.04 1
03/3/2022 -0.01 2
06/3/2022 0.02 3
10/3/2022 0.04 4
15/3/2022 0.04 1
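
The loop above can also be packaged as a reusable helper. The following is only a sketch: the function name cycle_mask and its start/stop parameters are made up for illustration, and the sample data is the first twelve B values from the question. It iterates over a plain Python list rather than the Series and fills a preallocated NumPy boolean array.

```python
import numpy as np
import pandas as pd


def cycle_mask(values, start=1, stop=4):
    """Boolean mask keeping values that follow start..stop cyclically (hypothetical helper)."""
    mask = np.zeros(len(values), dtype=bool)
    target = start
    for i, v in enumerate(values):
        if v == target:
            mask[i] = True
            # advance the expected value, wrapping from stop back to start
            target = start if target == stop else target + 1
    return mask


# first twelve B values from the question
df = pd.DataFrame({"B": [2, 1, 4, 3, 2, 1, 3, 1, 4, 2, 3, 1]})
kept = df[cycle_mask(df["B"].tolist())]
print(kept)
```

The scan is still sequential, because whether a row is kept depends on which rows were kept before it.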

Related

Convert value counts of multiple columns to pandas dataframe

I have a dataset in this form:
Name Batch DXYR Emp Lateral GDX MMT CN
Joe 2 0 2 2 2 0
Alan 0 1 1 2 0 0
Josh 1 1 2 1 1 2
Max 0 1 0 0 0 2
These columns can have only three distinct values, i.e. 0, 1 and 2.
So I need the percentage of value counts for each column in a pandas DataFrame.
I simply made a loop like:
for i in df.columns:
    (df[i].value_counts()/df[i].count())*100
The output looks like:
0 90.608831
1 0.391169
2 9.6787899
Name: Batch, dtype: float64
0 95.545455
1 2.235422
2 2.6243553
Name: MX, dtype: float64
and so on...
These outputs are correct, but I need them in a pandas DataFrame like this:
Batch DXYR Emp Lateral GDX MMT CN
Count_0_percent 98.32 52.5 22 54.5 44.2 53.4 76.01
Count_1_percent 0.44 34.5 43 43.5 44.5 46.5 22.44
Count_2_percent 1.3 64.3 44 2.87 12.6 1.88 2.567
Can someone please suggest how to get this?
You can melt the data, then use pd.crosstab:
melt = df.melt('Name')
pd.crosstab(melt['value'], melt['variable'], normalize='columns')
Or a bit faster (yet more verbose) with melt and groupby().value_counts():
(df.melt('Name')
.groupby('variable')['value'].value_counts(normalize=True)
.unstack('variable', fill_value=0)
)
Output:
variable Batch CN DXYR Emp Lateral GDX MMT
value
0 0.50 0.5 0.25 0.25 0.25 0.50
1 0.25 0.0 0.75 0.25 0.25 0.25
2 0.25 0.5 0.00 0.50 0.50 0.25
Update: apply also works:
df.drop(columns=['Name']).apply(pd.Series.value_counts, normalize=True)
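
The results above are fractions, while the question asks for percentages with row labels such as Count_0_percent. A possible way to get that shape, sketched here on the four sample rows from the question (the variable name pct is arbitrary), is to scale by 100, fill missing combinations with 0, and relabel the index:

```python
import pandas as pd

# sample rows from the question
df = pd.DataFrame({
    "Name": ["Joe", "Alan", "Josh", "Max"],
    "Batch": [2, 0, 1, 0],
    "DXYR": [0, 1, 1, 1],
    "Emp Lateral": [2, 1, 2, 0],
    "GDX": [2, 2, 1, 0],
    "MMT": [2, 0, 1, 0],
    "CN": [0, 0, 2, 2],
})

pct = (
    df.drop(columns="Name")
      .apply(lambda col: col.value_counts(normalize=True))  # fraction of each value per column
      .reindex([0, 1, 2])   # make sure all three values appear as rows, in order
      .fillna(0)            # a value that never occurs in a column counts as 0
      .mul(100)             # fractions -> percentages
)
pct.index = [f"Count_{v}_percent" for v in pct.index]
print(pct)
```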

Put next months start as previous months end pandas

I have a dataframe in long format (panel data). Each person has a start month along with variables. It looks something like:
Data description
person_id month_start Var1 Var2
1 1 0.4 1.4
1 2 0.3 0.131
1 3 0.34 0.434
2 2 0.49 0.949
2 3 0.53 1.53
2 5 0.38 0.738
3 1 1.12 1.34
3 4 1.89 1.02
3 5 0.83 0.27
and I need it to look like:
person_id month_start month_end Var1 Var2
1 1 2 0.4 1.4
1 2 3 0.3 0.131
1 3 4 0.34 0.434
2 2 3 0.49 0.949
2 3 5 0.53 1.53
2 5 6 0.38 0.738
3 1 4 1.12 1.34
3 4 5 1.89 1.02
3 5 6 0.83 0.27
Where month end is the beginning of the next entry for that person.
I was able to make this:
import pandas as pd

a = pd.DataFrame({'person_id':[1,1,1,2,2,2,3,3,3], 'var1': [0.4, 0.3, 0.34, 0.49, 0.53, 0.38, 1.12, 1.89, 0.83], 'var2': [1.4, 0.131, 0.434, 0.949, 1.53, 0.738, 1.34, 1.02, 0.27], 'month_start': [1,2,3,2,3,5,1,4,5]})

def add_end_date(df_in, object_id, start_col, end_col):
    df = df_in.copy()
    prev_person_id = -1
    prev_index = -1
    df[end_col] = [-1]*len(df)
    for idx, row in df.iterrows():
        p_id = row[object_id]
        p_idx = idx
        if prev_person_id == p_id:
            df.loc[prev_index, end_col] = int(row[start_col])  # use this start date as the previous entry's end date
        if row[end_col] == -1:
            df.loc[idx, end_col] = int(row[start_col]+1)
        prev_person_id = p_id
        prev_index = p_idx
    return df

add_end_date(a, 'person_id', 'month_start', 'month_end')
Is there a better/optimized way to accomplish this?
Try groupby.shift:
df['month_end'] = df.groupby('person_id').month_start.shift(-1)\
.fillna(df.month_start + 1).astype(int)
df
person_id month_start Var1 Var2 month_end
0 1 1 0.40 1.400 2
1 1 2 0.30 0.131 3
2 1 3 0.34 0.434 4
3 2 2 0.49 0.949 3
4 2 3 0.53 1.530 5
5 2 5 0.38 0.738 6
6 3 1 1.12 1.340 4
7 3 4 1.89 1.020 5
8 3 5 0.83 0.270 6
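
To see why the fillna and astype(int) steps are needed, it may help to look at the intermediate result of the shift. This sketch rebuilds only the two relevant columns from the question (next_start is just an illustrative name) and assumes the rows are already sorted by month_start within each person:

```python
import pandas as pd

df = pd.DataFrame({
    "person_id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "month_start": [1, 2, 3, 2, 3, 5, 1, 4, 5],
})

# Next row's month_start within each person; the last row of each group has no
# successor, so it becomes NaN and the column is upcast to float.
next_start = df.groupby("person_id")["month_start"].shift(-1)
print(next_start.tolist())

# Fill those gaps with month_start + 1 and convert back to integers.
df["month_end"] = next_start.fillna(df["month_start"] + 1).astype(int)
print(df)
```

If the data is not sorted, sorting by person_id and month_start first would be needed for the shift to pick up the correct next entry.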

How to get the difference between a column from two dataframes by getting their index from another dataframe?

I have two dataframes for groundtruth and predicted trajectories and one dataframe for matching between the groundtruth and predicted trajectories at each frame. I have dataframe of the groundtruth tracks and predicted tracks as follows:
df_pred_batch =
CENTER_X CENTER_Y LENGTH SPEED ACCELERATION HEADING
FrameId HId
0 0 -1.870000 -0.41 1.51 1.280 1.670 0.39
1 0 -1.730000 -0.36 1.51 1.440 1.660 0.40
2 0 -1.180000 -1.57 2.05 2.220 0.390 0.61
0 1 -1.540000 -1.83 2.05 2.140 0.390 0.61
1 1 -1.370000 -1.70 2.05 2.180 0.390 0.61
2 1 -1.590000 -0.29 1.51 1.610 1.630 0.41
1 2 -1.910000 -1.12 1.04 0.870 1.440 0.30
2 2 -1.810000 -1.09 1.04 1.010 1.440 0.27
0 3 17.190001 -3.15 1.80 2.178 -0.028 3.36
1 3 15.000000 -3.60 1.80 2.170 -0.020 3.38
df_gt_batch =
CENTER_X CENTER_Y LENGTH SPEED ACCELERATION HEADING
FrameId OId
1 0 -1.91 -1.12 1.040 0.87 1.44 0.30
2 0 -1.81 -1.09 1.040 1.01 1.44 0.27
0 1 -1.87 -0.41 1.510 1.28 1.67 0.39
1 1 -1.73 -0.36 1.510 1.44 1.66 0.40
2 1 -1.59 -0.29 1.510 1.61 1.63 0.41
0 2 -1.54 -1.83 2.056 2.14 0.39 0.61
1 2 -1.37 -1.70 2.050 2.18 0.39 0.61
2 2 -1.18 -1.57 2.050 2.22 0.39 0.61
0 3 1.71 -0.31 1.800 2.17 -0.02 3.36
1 3 1.50 -0.36 1.800 2.17 -0.02 3.38
2 3 1.29 -0.41 1.800 2.17 -0.01 3.40
Also, I know their matching at each timestamp:
matched_gt_pred =
FrameId Type OId HId
0 0 MATCH 1.0 0.0
1 0 MATCH 2.0 1.0
4 1 MATCH 1.0 0.0
5 1 MATCH 2.0 1.0
6 1 MATCH 0.0 2.0
9 2 MATCH 0.0 2.0
I would like to look at each row of matched_gt_pred and get the corresponding CENTER_X from df_pred_batch and df_gt_batch and calculate the error.
For instance, the first row of matched_gt_pred tells me that OId == 1 and HId == 0 are matched at FrameId == 0. I should get gt_center_x = df_gt_batch["FrameId==0" and "OId == 1"].CENTER_X and pred_center_x = df_pred_batch["FrameId==0" and "HId == 0"].CENTER_X, and then compute error = abs(gt_center_x - pred_center_x).
IIUC, I would reshape df_gt_batch and df_pred_batch and use lookup:
import numpy as np
gt_x = df_gt_batch['CENTER_X'].unstack().lookup(matched_gt_pred['FrameId'], matched_gt_pred['OId'])
pred_x = df_pred_batch['CENTER_X'].unstack().lookup(matched_gt_pred['FrameId'], matched_gt_pred['HId'])
matched_gt_pred['X Error'] = np.abs(gt_x - pred_x)
Output:
FrameId Type OId HId X Error
0 0 MATCH 1.0 0.0 0.0
1 0 MATCH 2.0 1.0 0.0
4 1 MATCH 1.0 0.0 0.0
5 1 MATCH 2.0 1.0 0.0
6 1 MATCH 0.0 2.0 0.0
9 2 MATCH 0.0 2.0 0.0
Another option is to use reindex with pd.MultiIndex:
pred_idx = pd.MultiIndex.from_arrays([matched_gt_pred['FrameId'], matched_gt_pred['HId']])
gt_idx = pd.MultiIndex.from_arrays([matched_gt_pred['FrameId'], matched_gt_pred['OId']])
matched_gt_pred['X Error'] = np.abs(df_pred_batch.reindex(pred_idx)['CENTER_X'].to_numpy() -
                                    df_gt_batch.reindex(gt_idx)['CENTER_X'].to_numpy())
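
A third possibility, which may be useful on pandas 2.0 and later where DataFrame.lookup is no longer available, is to join with two merges. This is a sketch on a reduced copy of the question's data (frames 0 and 1 only); the names df_gt, df_pred, matched, keys and out are placeholders, and the float ids are cast to int on the assumption that every matched row has both an OId and an HId:

```python
import pandas as pd

# reduced versions of df_gt_batch, df_pred_batch and matched_gt_pred
df_gt = pd.DataFrame({
    "FrameId": [0, 0, 1, 1, 1],
    "OId": [1, 2, 0, 1, 2],
    "CENTER_X": [-1.87, -1.54, -1.91, -1.73, -1.37],
}).set_index(["FrameId", "OId"])
df_pred = pd.DataFrame({
    "FrameId": [0, 0, 1, 1, 1],
    "HId": [0, 1, 0, 1, 2],
    "CENTER_X": [-1.87, -1.54, -1.73, -1.37, -1.91],
}).set_index(["FrameId", "HId"])
matched = pd.DataFrame({
    "FrameId": [0, 0, 1, 1, 1],
    "OId": [1.0, 2.0, 1.0, 2.0, 0.0],
    "HId": [0.0, 1.0, 0.0, 1.0, 2.0],
})

# turn the index levels back into columns so they can serve as merge keys
gt_x = df_gt["CENTER_X"].rename("gt_x").reset_index()
pred_x = df_pred["CENTER_X"].rename("pred_x").reset_index()

keys = matched.astype({"OId": int, "HId": int})  # align key dtypes with the index levels
out = (
    keys.merge(gt_x, on=["FrameId", "OId"], how="left")
        .merge(pred_x, on=["FrameId", "HId"], how="left")
)
out["X Error"] = (out["gt_x"] - out["pred_x"]).abs()
print(out)
```

Note that merge returns a new frame with a fresh 0..n-1 index, so the original row labels of the matching table are not preserved.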

How to add a variable instead of a number in a dataframe?

Hi guys, I am trying to select the 2nd value and then add this value to the rest of the array, except the 1st value.
This is what I have so far:
Xloc = X.iloc(1) # selecting the second value
X = X[1:-1] + Xloc # this doesn't work, but if I do + 1.25 it works...
The DataFrame:
X
0
1.25
2.57
4.5
6.9
7.3
Expected Result
X
0
2.5
3.82
5.75
8.15
8.55
Given that this is your original df
N
0 0.00
1 1.25
2 2.57
3 4.50
4 6.90
5 7.30
You can assign these values and use a simple concat to keep the original first value in place:
df['M'] = pd.concat([df["N"].iloc[:1], (df["N"].iloc[1:] + df["N"].iloc[1])])
print(df)
N M
0 0.00 0.00
1 1.25 2.50
2 2.57 3.82
3 4.50 5.75
4 6.90 8.15
5 7.30 8.55
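
As for why the original attempt fails: most likely because iloc is an indexer that takes square brackets, so X.iloc(1) does not return the second value, and because the slice 1:-1 also drops the last row. A minimal sketch of an alternative without concat, using the same N column as above (offset is just an illustrative name):

```python
import pandas as pd

df = pd.DataFrame({"N": [0.0, 1.25, 2.57, 4.5, 6.9, 7.3]})

offset = df["N"].iloc[1]              # square brackets: second value of the column
df["M"] = df["N"]                     # start from a copy of the original values
df.loc[df.index[1:], "M"] += offset   # add the offset to every row except the first
print(df)
```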

Sum over columns using different length

I have a pd df.
The table looks like:
df
lifetime 0 1 2 3 4 5 .... 30
0 2 0.12 0.14 0.18 0.12 0.13 0.14 .... 0.14
1 3 0.12 0.14 0.18 0.12 0.13 0.14 .... 0.14
2 4 0.12 0.14 0.18 0.12 0.13 0.14 .... 0.14
I want to sum the columns from 0 to 30 based on the value in the "lifetime" column, so that the result looks like:
df
lifetime Total
0 2 sum(0.12+0.14) # sum column 0 and 1
1 3 sum(0.12+0.14+0.18) # sum from column 0 to 2
2 4 sum(0.12+0.14+0.18+0.12) # sum from column 0 to 3
How can I do it? Thank you for your help!
You can use where with broadcasting:
s = df.iloc[:,1:]
s.where(df.lifetime.to_numpy()[:,None] > np.arange(s.shape[1])).sum(1)
Output:
0 0.26
1 0.44
2 0.56
dtype: float64
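
To make the broadcasting step more concrete, here is a small sketch with only the first five value columns from the question. It builds the same kind of mask explicitly in NumPy; the names values and mask are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "lifetime": [2, 3, 4],
    0: [0.12] * 3, 1: [0.14] * 3, 2: [0.18] * 3, 3: [0.12] * 3, 4: [0.13] * 3,
})

values = df.drop(columns="lifetime").to_numpy()  # shape (3, 5)

# Column positions (shape (5,)) compared against lifetimes (shape (3, 1))
# broadcast to a (3, 5) mask that is True for the first `lifetime` columns of each row.
mask = np.arange(values.shape[1]) < df["lifetime"].to_numpy()[:, None]
print(mask)

# Zero out everything outside the mask, then sum each row.
df["Total"] = np.where(mask, values, 0.0).sum(axis=1)
print(df[["lifetime", "Total"]])
```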
Define the following function:
def mySum(row):
    uLim = int(row.lifetime) + 1
    return row.iloc[1:uLim].sum()
Then apply it and join the result with lifetime column:
df = df.lifetime.to_frame().join(df.apply(mySum, axis=1).rename('Total'))
The advantage over the other solution is that my solution creates the target DataFrame, not only the new column.