Finding max row after groupby in pandas dataframe

I have a dataframe as follows:
Month Col1 Col2  Val
A     p    a1     31
A     q    a1     78
A     r    b2     13
B     x    a1     54
B     y    b2     56
B     z    b2     65
I want to get the following:
Month  a1  b2
A       q   r
B       x   z
Essentially, for each pair of Month and Col2, I want to find the value in Col1 that has the maximum Val.
I am not sure how to approach this.

Your problem breaks into two steps:
find the row with the maximum Val within each group, which is sort_values plus drop_duplicates, and
reshape the data, which is pivot:
(df.sort_values('Val')
   .drop_duplicates(['Month', 'Col2'], keep='last')
   .pivot(index='Month', columns='Col2', values='Col1')
)
Output:
Col2 a1 b2
Month
A q r
B x z
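For reference, here is a self-contained version of the above; the sample frame is rebuilt from the question, so column names and values are exactly as shown there:

import pandas as pd

df = pd.DataFrame({"Month": ["A", "A", "A", "B", "B", "B"],
                   "Col1":  ["p", "q", "r", "x", "y", "z"],
                   "Col2":  ["a1", "a1", "b2", "a1", "b2", "b2"],
                   "Val":   [31, 78, 13, 54, 56, 65]})

result = (df.sort_values('Val')                               # order rows by Val
            .drop_duplicates(['Month', 'Col2'], keep='last')  # keep the max-Val row per (Month, Col2)
            .pivot(index='Month', columns='Col2', values='Col1'))
print(result)
# Col2  a1 b2
# Month
# A      q  r
# B      x  z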

Related

Base R or tidyverse function to expand rows along time series columns [duplicate]

I have a time series data frame with two ID columns and about 1000 columns for day 1, day 2, etc.
I want to convert from my data frame being in the form
a b t1 t2 ... t1000
_____________________________
a1 b1 # # #
a2 b2 # # #
to being in the form
a b t value
____________________
a1 b1 't1' #
a1 b1 't2' #
a2 b2 't1' #
a2 b2 't2' #
Essentially, I want to do something like this:
dataframe %>%
  select(starts_with("t_")) %>%
  gather(key = "t", value = "value")
so that I have a dataframe looking like this:
t value
__________
't1' #
't2' #
...
't100' #
for each row in the original dataframe. Then, once I have the time columns generated from one row of the original dataframe, I want to prepend the "a" and "b" columns to each row, so that this:
t value
__________
't1' #
't2' #
...
't100' #
turns into this:
a b t value
____________________
a1 b1 't1' #
a1 b1 't2' #
...
a1 b1 't100' #
and then I repeat this process for each row, stacking each newly generated dataframe below (or above) the previous one. At the end I want the dataframe in the code block above generated for each row of the original dataframe, all stacked on top of one another. I could do this with a for loop, or maybe even some sapply magic, but I don't want to use a for loop in R, and I feel there's a better way than sapply-ing a function over each row of the original dataframe.
Can anyone help me? Thanks. Preferably using tidyverse.
We can use pivot_longer from tidyr.
library(tidyr)
data <- matrix(1:1000,nrow=2,ncol=1000)
colnames(data) <- paste0("t",1:1000)
data <- data.frame(a=c("a1","a2"),b=c("b1","b2"),data)
data[1:2,1:10]
# a b t1 t2 t3 t4 t5 t6 t7 t8
#1 a1 b1 1 3 5 7 9 11 13 15
#2 a2 b2 2 4 6 8 10 12 14 16
data %>%
  pivot_longer(cols = starts_with("t"), names_to = "t")
## A tibble: 2,000 x 4
# a b t value
# <fct> <fct> <chr> <int>
# 1 a1 b1 t1 1
# 2 a1 b1 t2 3
# 3 a1 b1 t3 5
# 4 a1 b1 t4 7
# 5 a1 b1 t5 9
# 6 a1 b1 t6 11
# 7 a1 b1 t7 13
# 8 a1 b1 t8 15
# 9 a1 b1 t9 17
#10 a1 b1 t10 19
Try this:
library(dplyr)
library(tidyr)  # gather() comes from tidyr
dataframe %>%
  gather(key = "t", value = "value", 3:ncol(.))
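Since the rest of this collection is pandas-focused, it may help to note the equivalent wide-to-long reshape in pandas, DataFrame.melt. A minimal sketch (the tiny frame below just mirrors the layout of the R example):

import pandas as pd

df = pd.DataFrame({"a": ["a1", "a2"], "b": ["b1", "b2"],
                   "t1": [1, 2], "t2": [3, 4], "t3": [5, 6]})

# id_vars stay as columns; every t* column is stacked into (t, value) pairs.
long_df = df.melt(id_vars=["a", "b"], var_name="t", value_name="value")
print(long_df)
#     a   b   t  value
# 0  a1  b1  t1      1
# 1  a2  b2  t1      2
# 2  a1  b1  t2      3
# ...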

Is it possible to obtain 'groupby-transform-apply' style results with the function returning a Series rather than a scalar?

I want to achieve the following behavior:
res = df.groupby(['dimension'], as_index=False)['metric'].transform(lambda x: foo(x))
where foo(x) returns a Series the same size as its input, which is df['metric'] within each group.
however, this will throw the following error:
ValueError: transform must return a scalar value for each group
I know I can do this with a for loop, but how can I achieve it in a groupby manner?
e.g.
df:
col1 col2 col3
0 A1 B1 1
1 A1 B1 2
2 A2 B2 3
and I want to achieve:
col1 col2 col3
0 A1 B1 1 - (1+2)/2
1 A1 B1 2 - (1+2)/2
2 A2 B2 3 - 3
If you want to return a Series you should use apply instead of transform:
res = df.groupby(['dimension'], as_index=False)['metric'].apply(lambda x: foo(x))
transform, as the error states, must return a scalar value that is then put in every row of the group. But apply will work with a Series returned for each group.
If this doesn't work, provide sample input and expected output so the problem is easier to understand.
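For concreteness, here is a minimal, self-contained sketch of the apply route. The foo below is only a stand-in (a simple per-group demeaning function), since the original foo is not shown, and group_keys=False is passed so the result keeps the original row index:

import pandas as pd

def foo(s):
    # Hypothetical example: return a Series the same length as the group.
    return s - s.mean()

df = pd.DataFrame({"dimension": ["A1", "A1", "A2"],
                   "metric": [1, 2, 3]})

res = df.groupby("dimension", group_keys=False)["metric"].apply(foo)
print(res)
# 0   -0.5
# 1    0.5
# 2    0.0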
You can do this using transform:
df['col3'] = (df.col3 - df.groupby(['col1', 'col2'])['col3'].transform('sum')) / 2
Or using apply (slower):
df['col3'] = df.groupby(['col1', 'col2'])['col3'].apply(lambda x: (x - x.sum()) / 2)
col1 col2 col3
0 A1 B1 -1.0
1 A1 B1 -0.5
2 A2 B2 0.0
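For reference, a self-contained run of the transform one-liner above; the sample frame is rebuilt from the question:

import pandas as pd

df = pd.DataFrame({"col1": ["A1", "A1", "A2"],
                   "col2": ["B1", "B1", "B2"],
                   "col3": [1, 2, 3]})

# Subtract each group's total from every row, then halve, as in the answer above.
df["col3"] = (df.col3 - df.groupby(["col1", "col2"])["col3"].transform("sum")) / 2
print(df)
#   col1 col2  col3
# 0   A1   B1  -1.0
# 1   A1   B1  -0.5
# 2   A2   B2   0.0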

Same values don't change but different ones need to be concatenated when using pandas?

Example:
A B C
0 id1 b1 91
1 id1 b1 350
2 id2 a2 90
3 id2 a4 90
4 id2 a5 90
5 id3 c1 180
The column types:
col A: string
col B: string
col C: string
Expected Output:
A B C
0 id1 b1 '91,350'
1 id2 a2,a4,a5 '90'
2 id3 c1 '180'
I want to group by column A to get the expected output, but I don't know what function to apply after df.groupby('A').
Note: the expected output columns are all strings, and the values are merged with ','.
Convert to str, then use groupby with unique:
s = df.astype(str).groupby('A', as_index=False).agg(lambda x: ','.join(x.unique()))
s
A B C
0 id1 b1 91,350
1 id2 a2,a4,a5 90
2 id3 c1 180
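A self-contained version of the same idea, with the sample frame rebuilt from the question. C is built as integers in this sketch, so astype(str) does the string conversion; if C is already a string, the cast is harmless:

import pandas as pd

df = pd.DataFrame({"A": ["id1", "id1", "id2", "id2", "id2", "id3"],
                   "B": ["b1", "b1", "a2", "a4", "a5", "c1"],
                   "C": [91, 350, 90, 90, 90, 180]})

# Cast every column to str so C can be joined, then join the unique values per group.
s = (df.astype(str)
       .groupby("A", as_index=False)
       .agg(lambda x: ",".join(x.unique())))
print(s)
#      A         B       C
# 0  id1        b1  91,350
# 1  id2  a2,a4,a5      90
# 2  id3        c1     180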

pandas dataframe group by and agg

I am new to IPython and I am trying to do something with dataframe grouping. I have a dataframe like the one below:
import numpy as np
import pandas as pd

df_test = pd.DataFrame({"A": range(4), "B": ["B1", "B2", "B1", "B2"], "C": ["C1", "C1", np.nan, "C2"]})
df_test
A B C
0 0 B1 C1
1 1 B2 C1
2 2 B1 NaN
3 3 B2 C2
I would like to achieve the following:
1) group by B, but creating a multilevel column instead of grouping to rows with B1 and B2 as the index; the B1 and B2 values are basically counts
2) apply agg functions to columns A and C, something like {'C': ['count'], 'A': ['sum']}
       B
   A  B1  B2  C
0  6   2   2  3
How can I do this? Thanks
You are performing separate actions on each column. You can hack this by aggregating A and C, taking the value counts of B separately, and then combining the data back together.
ac = df_test.agg({'A':'sum', 'C':'count'})
b = df_test['B'].value_counts()
pd.concat([ac, b]).sort_index().to_frame().T
A B1 B2 C
0 6 2 2 3
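If the two-level column header sketched in the question is also wanted, one option is to rebuild it by hand on top of the flat result; the tuple layout below is an assumption about the intended output:

import numpy as np
import pandas as pd

df_test = pd.DataFrame({"A": range(4), "B": ["B1", "B2", "B1", "B2"], "C": ["C1", "C1", np.nan, "C2"]})

ac = df_test.agg({'A': 'sum', 'C': 'count'})   # A -> 6, C -> 3
b = df_test['B'].value_counts()                # B1 -> 2, B2 -> 2
out = pd.concat([ac, b]).sort_index().to_frame().T

# Assumed layout: B1/B2 nested under B, with A and C given an empty second level.
out.columns = pd.MultiIndex.from_tuples([('A', ''), ('B', 'B1'), ('B', 'B2'), ('C', '')])
print(out)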

Multiple group-by with one common variable with pandas?

I want to mark duplicate values within an ID group. For example
ID A B
i1 a1 b1
i1 a1 b2
i1 a2 b2
i2 a1 b2
should become
ID A An B Bn
i1 a1 2 b1 1
i1 a1 2 b2 2
i1 a2 1 b2 2
i2 a1 1 b2 1
Basically, An and Bn count multiplicity within each ID group. How can I do this in pandas? I've found groupby, but it was quite messy to put everything together. Also I tried individual groupby for ID, A and ID, B. Maybe there is a way to pre-group by ID first and then do all the other variables? (There are many variables and I have very many rows!)
Also I tried individual groupby for ID, A and ID, B
I think this is a straightforward way to tackle it; as you suggest, you can group by each combination separately and compute the size of the groups, then use transform so you can easily add the results back to the original dataframe:
df['An'] = df.groupby(['ID', 'A'])['A'].transform(np.size)
df['Bn'] = df.groupby(['ID', 'B'])['B'].transform(np.size)
print(df)
ID A B An Bn
0 i1 a1 b1 2 1
1 i1 a1 b2 2 2
2 i1 a2 b2 1 2
3 i2 a1 b2 1 1
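For reference, a self-contained version of the above; the imports and the sample frame are reconstructed from the question:

import numpy as np
import pandas as pd

df = pd.DataFrame({"ID": ["i1", "i1", "i1", "i2"],
                   "A":  ["a1", "a1", "a2", "a1"],
                   "B":  ["b1", "b2", "b2", "b2"]})

# Per-(ID, A) and per-(ID, B) group sizes, broadcast back onto every row.
df['An'] = df.groupby(['ID', 'A'])['A'].transform(np.size)
df['Bn'] = df.groupby(['ID', 'B'])['B'].transform(np.size)
print(df)
#   ID   A   B  An  Bn
# 0  i1  a1  b1   2   1
# 1  i1  a1  b2   2   2
# 2  i1  a2  b2   1   2
# 3  i2  a1  b2   1   1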
Of course, with lots of columns you could do:
for col in ['A', 'B']:
    df[col + 'n'] = df.groupby(['ID', col])[col].transform(np.size)
The duplicated method can also be used to give you something similar, but it will mark observations within a group after the first as duplicates:
for col in ['A', 'B']:
    df[col + 'n'] = df.duplicated(['ID', col])
print(df)
ID A B An Bn
0 i1 a1 b1 False False
1 i1 a1 b2 True False
2 i1 a2 b2 False True
3 i2 a1 b2 False False
EDIT: improving performance for large data. I tried this on a large dataset (4 million rows) and it was significantly faster to avoid transform with something like the following (it is much less elegant):
for col in ['A', 'B']:
    x = df.groupby(['ID', col]).size()
    df.set_index(['ID', col], inplace=True)
    df[col + 'n'] = x
    df.reset_index(inplace=True)