I have the following df:
import numpy as np
import pandas as pd
import random

i = ['dog', 'cat', 'rabbit', 'elephant'] * 20
df = pd.DataFrame(np.random.randn(len(i), 3), index=i,
                  columns=list('ABC')).rename_axis('animal').reset_index()
df.insert(1, 'type', pd.Series(random.choice(['X', 'Y'])
                               for _ in range(len(df))))
I would like a separate column holding the max of column A if the animal's type is X, else the min of column A.
Applying a lambda to the groupby shows a multi-indexed array with the following code:
g = df.groupby(['animal', 'type'])
g.apply(lambda g: np.where(g.type == 'X', g.A.max(), g.A.min()))
Is there a way to convert this to a Series that can be added to df as a column, say by using transform?
Is this what you want?
>>> df
animal type A B C
0 cat Y 0.96 -0.02 -0.14
1 cat Y -0.80 0.86 1.75
2 dog X 1.13 -0.49 -1.66
3 dog Y 0.84 -0.07 0.15
4 elephant X 0.13 -0.54 0.73
5 elephant Y 0.14 1.77 0.94
6 rabbit X -0.12 -0.39 0.05
7 rabbit X 0.58 -1.17 0.77
>>> def max_min_A(g):
...     animal, type_ = g.name
...     return np.where(type_ == 'X', g.max(), g.min())
...
>>> df['new_col'] = df.groupby(['animal', 'type'])['A'].transform(max_min_A)
>>> df
animal type A B C new_col
0 cat Y 0.96 -0.02 -0.14 -0.80
1 cat Y -0.80 0.86 1.75 -0.80
2 dog X 1.13 -0.49 -1.66 1.13
3 dog Y 0.84 -0.07 0.15 0.84
4 elephant X 0.13 -0.54 0.73 0.13
5 elephant Y 0.14 1.77 0.94 0.14
6 rabbit X -0.12 -0.39 0.05 0.58
7 rabbit X 0.58 -1.17 0.77 0.58
@HarryPlotter: Thanks for the name info. It is wonderful to see that the name of the group propagates as a tuple. In case one does not want to use a function, the following also works:
df.assign(new_col=g.A.transform(
    lambda x: np.where(x.name[1] == 'X', x.max(), x.min())))
# x.name[1] is used to select the second element of the tuple, which is `type`
Performance-wise, it is likely better to build temporary columns than to iterate through the groupby:
grp = df.groupby(['animal', 'type'])['A']
(df
.assign(
mi = grp.transform('min'),
ma = grp.transform('max'),
new_col = lambda df: np.where(df['type'] == 'X', df['ma'], df['mi']))
.drop(columns=['mi','ma'])
)
animal type A B C new_col
0 cat Y 0.96 -0.02 -0.14 -0.80
1 cat Y -0.80 0.86 1.75 -0.80
2 dog X 1.13 -0.49 -1.66 1.13
3 dog Y 0.84 -0.07 0.15 0.84
4 elephant X 0.13 -0.54 0.73 0.13
5 elephant Y 0.14 1.77 0.94 0.14
6 rabbit X -0.12 -0.39 0.05 0.58
7 rabbit X 0.58 -1.17 0.77 0.58
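To back up that claim, here is a rough timing sketch (the helper names are just for illustration; absolute numbers depend on the data size, but the temporary-column version avoids calling a Python lambda once per group):

import timeit

def via_transform_udf():
    # per-group Python lambda, relying on the group name tuple
    return df.groupby(['animal', 'type'])['A'].transform(
        lambda x: np.where(x.name[1] == 'X', x.max(), x.min()))

def via_temp_columns():
    # vectorized min/max transforms plus a single np.where
    grp = df.groupby(['animal', 'type'])['A']
    return np.where(df['type'] == 'X', grp.transform('max'), grp.transform('min'))

print(timeit.timeit(via_transform_udf, number=100))
print(timeit.timeit(via_temp_columns, number=100))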
I'm looking for a more efficient method to deal with the following problem: I have a DataFrame with a column of values that range from 1 to 4, and I need to remove all rows that do not follow the sequence (1-2-3-4-1-2-3-...).
This is what I have:
A B
12/2/2022 0.02 2
14/2/2022 0.01 1
15/2/2022 0.04 4
16/2/2022 -0.02 3
18/2/2022 -0.01 2
20/2/2022 0.04 1
21/2/2022 0.02 3
22/2/2022 -0.01 1
24/2/2022 0.04 4
26/2/2022 -0.02 2
27/2/2022 0.01 3
28/2/2022 0.04 1
01/3/2022 -0.02 3
03/3/2022 -0.01 2
05/3/2022 0.04 1
06/3/2022 0.02 3
08/3/2022 -0.01 1
10/3/2022 0.04 4
12/3/2022 -0.02 2
13/3/2022 0.01 3
15/3/2022 0.04 1
...
This is what I need:
A B
14/2/2022 0.01 1
18/2/2022 -0.01 2
21/2/2022 0.02 3
24/2/2022 0.04 4
28/2/2022 0.04 1
03/3/2022 -0.01 2
06/3/2022 0.02 3
10/3/2022 0.04 4
15/3/2022 0.04 1
...
Since the data frame is quite big, I need some sort of NumPy-based operation to accomplish this; the more efficient the better. My solution is very ugly and inefficient: basically, I made four loops like the following to check every part of the sequence (4-1, 1-2, 2-3, 3-4):
df_len = len(df)
df_len2 = 0
while df_len != df_len2:
    df_len = len(df)
    df.loc[(df.B.shift(1) == 4) & (df.B != 1), 'B'] = 0
    df = df[df['B'] != 0]
    df_len2 = len(df)
Using itertools.cycle to define a cycled range, together with an assignment expression (so this needs Python 3.8+):
from itertools import cycle

c_rng = cycle(range(1, 5))  # cycled range: 1, 2, 3, 4, 1, 2, ...
start = next(c_rng)         # starting point: 1
# keep a row only when B equals the expected value; on a match, advance
# the expected value to the next element of the cycle
df[[(v == start) and bool(start := next(c_rng)) for v in df.B]]
A B
14/2/2022 0.01 1
18/2/2022 -0.01 2
21/2/2022 0.02 3
24/2/2022 0.04 4
28/2/2022 0.04 1
03/3/2022 -0.01 2
06/3/2022 0.02 3
10/3/2022 0.04 4
15/3/2022 0.04 1
A simple improvement to speed this up is to avoid touching the dataframe within the loop and just iterate over the values of B to construct a Boolean mask, like this:
is_in_sequence = []
next_target = 1
for b in df.B:
    if b == next_target:
        is_in_sequence.append(True)
        next_target = next_target % 4 + 1
    else:
        is_in_sequence.append(False)

print(df[is_in_sequence])
A B
14/2/2022 0.01 1
18/2/2022 -0.01 2
21/2/2022 0.02 3
24/2/2022 0.04 4
28/2/2022 0.04 1
03/3/2022 -0.01 2
06/3/2022 0.02 3
10/3/2022 0.04 4
15/3/2022 0.04 1
I have a dataset in this form:
Name Batch DXYR Emp Lateral GDX MMT CN
Joe 2 0 2 2 2 0
Alan 0 1 1 2 0 0
Josh 1 1 2 1 1 2
Max 0 1 0 0 0 2
These columns can have only three distinct values, i.e. 0, 1 and 2.
So, I need the percentage of value counts for each column in the pandas DataFrame.
I have simply made a loop like:
for i in df.columns:
    print((df[i].value_counts() / df[i].count()) * 100)
I am getting the output like:
0 90.608831
1 0.391169
2 9.6787899
Name: Batch, dtype: float64
0 95.545455
1 2.235422
2 2.6243553
Name: MX, dtype: float64
and so on...
These outputs are correct, but I need them in a pandas DataFrame like this:
Batch DXYR Emp Lateral GDX MMT CN
Count_0_percent 98.32 52.5 22 54.5 44.2 53.4 76.01
Count_1_percent 0.44 34.5 43 43.5 44.5 46.5 22.44
Count_2_percent 1.3 64.3 44 2.87 12.6 1.88 2.567
Can someone please suggest how to get this?
You can melt the data, then use pd.crosstab:
melt = df.melt('Name')
pd.crosstab(melt['value'], melt['variable'], normalize='columns')
Or a bit faster (yet more verbose) with melt and groupby().value_counts():
(df.melt('Name')
.groupby('variable')['value'].value_counts(normalize=True)
.unstack('variable', fill_value=0)
)
Output:
variable Batch CN DXYR Emp Lateral GDX MMT
value
0 0.50 0.5 0.25 0.25 0.25 0.50
1 0.25 0.0 0.75 0.25 0.25 0.25
2 0.25 0.5 0.00 0.50 0.50 0.25
Update: apply also works:
df.drop(columns=['Name']).apply(pd.Series.value_counts, normalize=True)
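If you need the exact layout from the question (rows labelled Count_0_percent, etc.), a small follow-up sketch building on the crosstab result above:

out = pd.crosstab(melt['value'], melt['variable'], normalize='columns') * 100
out.index = [f'Count_{v}_percent' for v in out.index]
print(out)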
I have two dataframes, one with groundtruth trajectories and one with predicted trajectories, plus a dataframe that records the matching between the groundtruth and predicted trajectories at each frame. The groundtruth and predicted tracks are as follows:
df_pred_batch =
CENTER_X CENTER_Y LENGTH SPEED ACCELERATION HEADING
FrameId HId
0 0 -1.870000 -0.41 1.51 1.280 1.670 0.39
1 0 -1.730000 -0.36 1.51 1.440 1.660 0.40
2 0 -1.180000 -1.57 2.05 2.220 0.390 0.61
0 1 -1.540000 -1.83 2.05 2.140 0.390 0.61
1 1 -1.370000 -1.70 2.05 2.180 0.390 0.61
2 1 -1.590000 -0.29 1.51 1.610 1.630 0.41
1 2 -1.910000 -1.12 1.04 0.870 1.440 0.30
2 2 -1.810000 -1.09 1.04 1.010 1.440 0.27
0 3 17.190001 -3.15 1.80 2.178 -0.028 3.36
1 3 15.000000 -3.60 1.80 2.170 -0.020 3.38
df_gt_batch =
CENTER_X CENTER_Y LENGTH SPEED ACCELERATION HEADING
FrameId OId
1 0 -1.91 -1.12 1.040 0.87 1.44 0.30
2 0 -1.81 -1.09 1.040 1.01 1.44 0.27
0 1 -1.87 -0.41 1.510 1.28 1.67 0.39
1 1 -1.73 -0.36 1.510 1.44 1.66 0.40
2 1 -1.59 -0.29 1.510 1.61 1.63 0.41
0 2 -1.54 -1.83 2.056 2.14 0.39 0.61
1 2 -1.37 -1.70 2.050 2.18 0.39 0.61
2 2 -1.18 -1.57 2.050 2.22 0.39 0.61
0 3 1.71 -0.31 1.800 2.17 -0.02 3.36
1 3 1.50 -0.36 1.800 2.17 -0.02 3.38
2 3 1.29 -0.41 1.800 2.17 -0.01 3.40
Also, I know their matching at each timestamp:
matched_gt_pred =
FrameId Type OId HId
0 0 MATCH 1.0 0.0
1 0 MATCH 2.0 1.0
4 1 MATCH 1.0 0.0
5 1 MATCH 2.0 1.0
6 1 MATCH 0.0 2.0
9 2 MATCH 0.0 2.0
I would like to look at each row of matched_gt_pred, get the corresponding CENTER_X from df_pred_batch and df_gt_batch, and calculate the error.
For instance, looking at the first row of matched_gt_pred, I know that at FrameId == 0, OId == 1 and HId == 0 are matched. I should get gt_center_x from df_gt_batch (the row with FrameId == 0 and OId == 1), pred_center_x from df_pred_batch (the row with FrameId == 0 and HId == 0), and compute error = abs(gt_center_x - pred_center_x).
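For a single row, the lookup I have in mind would look something like this (just a sketch, using .loc on the (FrameId, OId) / (FrameId, HId) MultiIndex of each batch frame):

# first matched row: FrameId 0, OId 1 (groundtruth) matched with HId 0 (prediction)
gt_center_x = df_gt_batch.loc[(0, 1), 'CENTER_X']
pred_center_x = df_pred_batch.loc[(0, 0), 'CENTER_X']
error = abs(gt_center_x - pred_center_x)

but I need this for every row of matched_gt_pred.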
IIUC, I would reshape your df_gt_batch and df_pred_batch and use lookup (note that DataFrame.lookup is deprecated in recent pandas versions, so the reindex option below may be preferable there):
gt_x = df_gt_batch['CENTER_X'].unstack().lookup(matched_gt_pred['FrameId'], matched_gt_pred['OId'])
pred_x = df_pred_batch['CENTER_X'].unstack().lookup(matched_gt_pred['FrameId'], matched_gt_pred['HId'])
matched_gt_pred['X Error'] = np.abs(gt_x - pred_x)
Output:
FrameId Type OId HId X Error
0 0 MATCH 1.0 0.0 0.0
1 0 MATCH 2.0 1.0 0.0
4 1 MATCH 1.0 0.0 0.0
5 1 MATCH 2.0 1.0 0.0
6 1 MATCH 0.0 2.0 0.0
9 2 MATCH 0.0 2.0 0.0
Another option is to use reindex with pd.MultiIndex:
matched_gt_pred['X Error'] = np.abs(
    df_pred_batch.reindex(pd.MultiIndex.from_arrays([matched_gt_pred['FrameId'], matched_gt_pred['HId']]))['CENTER_X'].to_numpy() -
    df_gt_batch.reindex(pd.MultiIndex.from_arrays([matched_gt_pred['FrameId'], matched_gt_pred['OId']]))['CENTER_X'].to_numpy())
I have a large dataframe df as:
Col1 Col2 ATC_Dzr ATC_Last ATC_exp Op_Dzr2 Op_Last2
1Loc get1 0.26 3.88 3.73 0.16 3.15
2Loc get2 0.4 -0.85 -0.86 0.1 -0.54
3Loc get3 -0.59 1.47 2.01 -0.53 1.29
I need to dump this to Excel so that it looks like the following, where ATC and Op are in merged cells spanning their sub-columns.
I am not sure how to approach this.
You can set the first 2 columns as the index, then split the remaining column names on '_' with expand=True to create a MultiIndex:
df1 = df.set_index(['Col1','Col2'])
df1.columns = df1.columns.str.split('_',expand=True)
print(df1)
ATC Op
Dzr Last exp Dzr2 Last2
Col1 Col2
1Loc get1 0.26 3.88 3.73 0.16 3.15
2Loc get2 0.40 -0.85 -0.86 0.10 -0.54
3Loc get3 -0.59 1.47 2.01 -0.53 1.29
Then export df1 to Excel.
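For example, a minimal export sketch (assuming an Excel writer engine such as openpyxl is installed; the filename is just a placeholder):

# MultiIndex columns are written as two header rows; with merge_cells=True
# (the default) the top-level labels 'ATC' and 'Op' become merged cells.
df1.to_excel('output.xlsx', merge_cells=True)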
As per the comments by @Datanovice, you can also use pd.MultiIndex.from_tuples:
df1 = df.set_index(['Col1','Col2'])
df1.columns = pd.MultiIndex.from_tuples([(col.split('_')[0], col.split('_')[1])
for col in df1.columns])
print(df1)
ATC Op
Dzr Last exp Dzr2 Last2
Col1 Col2
1Loc get1 0.26 3.88 3.73 0.16 3.15
2Loc get2 0.40 -0.85 -0.86 0.10 -0.54
3Loc get3 -0.59 1.47 2.01 -0.53 1.29
I have a text file that looks like this.
A 102
B 456
C 678
H A B C D E F G H I J
1.18 0.20 0.23 0.05 1.89 0.72 0.11 0.49 0.31 1.45
3.23 0.06 2.67 1.96 0.76 0.97 0.84 0.77 0.39 1.08
I need to extract all the lines that start with B or H, plus the two lines after H. How can I do this using awk?
The expected output would be
B 456
H A B C D E F G H I J
1.18 0.20 0.23 0.05 1.89 0.72 0.11 0.49 0.31 1.45
3.23 0.06 2.67 1.96 0.76 0.97 0.84 0.77 0.39 1.08
Any suggestions please.
Ignoring the blank line after B in your output (your problem specifications give no indication as to why that blank line is in the output, so I'm assuming it should not be there):
awk '/^H/{t=3} /^B/ || t-- >0' input.file
will print all lines that start with B and each line that starts with H along with the next two lines.
awk '/^[BH]/ || /^[[:blank:]]*[[:digit:]]/' inputfile
This prints lines starting with B or H, plus the data rows, which are the only lines that start with a digit (possibly after leading blanks).
bash-3.00$ cat t
A 102
B 456
C 678
H A B C D E F G H I J
1.18 0.20 0.23 0.05 1.89 0.72 0.11 0.49 0.31 1.45
3.23 0.06 2.67 1.96 0.76 0.97 0.84 0.77 0.39 1.08
bash-3.00$ awk '{if(( $1 == "B") || ($1 == "H") || ($0 ~ /^ / )) print;}' t
B 456
H A B C D E F G H I J
1.18 0.20 0.23 0.05 1.89 0.72 0.11 0.49 0.31 1.45
3.23 0.06 2.67 1.96 0.76 0.97 0.84 0.77 0.39 1.08
OR in short
awk '{if($0 ~ /^[BH ]/ ) print;}' t
OR even shorter
awk '/^[BH ]/' t
If H and B aren't the only headers that precede tabular data, and you intend to omit those other blocks of data (you don't specify the requirements fully), you have to use a flip-flop to remember whether you're currently in a block you want to keep:
awk '/^[^ 0-9]/ {inblock=0}; /^[BH]/ {inblock=1}; { if (inblock) print }' d.txt
awk '/^H/{n=3} /^B/ || n-- > 0' filename.txt > output.txt