Finding max row after groupby in pandas dataframe

I have a dataframe as follows:
Month Col1 Col2  Val
A     p    a1     31
A     q    a1     78
A     r    b2     13
B     x    a1     54
B     y    b2     56
B     z    b2     65
I want to get the following:
Month  a1  b2
A       q   r
B       x   z
Essentially, for each pair of Month and Col2, I want to find the value in Col1 that has the maximum Val.
I am not sure how to approach this.

Your problem breaks into two steps:
find the row with the maximum Val within each group, which is sort_values plus drop_duplicates, and
reshape the data, which is pivot:
(df.sort_values('Val')
   .drop_duplicates(['Month', 'Col2'], keep='last')
   .pivot(index='Month', columns='Col2', values='Col1')
)
Output:
Col2 a1 b2
Month
A q r
B x z
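For reference, here is a self-contained version of the above; the sample frame is rebuilt from the question, so column names and values are exactly as shown there:

import pandas as pd

df = pd.DataFrame({"Month": ["A", "A", "A", "B", "B", "B"],
                   "Col1":  ["p", "q", "r", "x", "y", "z"],
                   "Col2":  ["a1", "a1", "b2", "a1", "b2", "b2"],
                   "Val":   [31, 78, 13, 54, 56, 65]})

result = (df.sort_values('Val')                               # order rows by Val
            .drop_duplicates(['Month', 'Col2'], keep='last')  # keep the max-Val row per (Month, Col2)
            .pivot(index='Month', columns='Col2', values='Col1'))
print(result)
# Col2  a1 b2
# Month
# A      q  r
# B      x  z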

Related

Base R or tidyverse function to expand rows along time series columns [duplicate]

I have a time series data frame with two ID columns and about 1000 columns for day 1, day 2, etc.
I want to convert from my data frame being in the form
a b t1 t2 ... t1000
_____________________________
a1 b1 # # #
a2 b2 # # #
to being in the form
a b t value
____________________
a1 b1 't1' #
a1 b1 't2' #
a2 b2 't1' #
a2 b2 't2' #
Essentially, I want to do something like this:
dataframe %>%
  select(starts_with("t_")) %>%
  gather(key = "t", value = "value")
so that I have a dataframe looking like this:
t value
__________
't1' #
't2' #
...
't100' #
for each row in the original dataframe. Then, once I have the time columns generated from one row of the original dataframe, I want to prepend the "a" and "b" columns to each row, so that this:
t value
__________
't1' #
't2' #
...
't100' #
turns into this:
a b t value
____________________
a1 b1 't1' #
a1 b1 't2' #
...
a1 b1 't100' #
and then I repeat this process for each row, stacking each newly generated dataframe below (or above) the previous one. At the end I want the dataframe in the code block above generated for each row of the original dataframe, all stacked on top of one another. I could do this with a for loop, or maybe even some sapply magic, but I don't want to use a for loop in R, and I feel there's a better way than sapply-ing a function over each row of the original dataframe.
Can anyone help me? Thanks. Preferably using tidyverse.
We can use pivot_longer from tidyr.
library(tidyr)
data <- matrix(1:1000,nrow=2,ncol=1000)
colnames(data) <- paste0("t",1:1000)
data <- data.frame(a=c("a1","a2"),b=c("b1","b2"),data)
data[1:2,1:10]
# a b t1 t2 t3 t4 t5 t6 t7 t8
#1 a1 b1 1 3 5 7 9 11 13 15
#2 a2 b2 2 4 6 8 10 12 14 16
data %>%
  pivot_longer(cols = starts_with("t"), names_to = "t")
## A tibble: 2,000 x 4
# a b t value
# <fct> <fct> <chr> <int>
# 1 a1 b1 t1 1
# 2 a1 b1 t2 3
# 3 a1 b1 t3 5
# 4 a1 b1 t4 7
# 5 a1 b1 t5 9
# 6 a1 b1 t6 11
# 7 a1 b1 t7 13
# 8 a1 b1 t8 15
# 9 a1 b1 t9 17
#10 a1 b1 t10 19
Try this:
library(dplyr)
library(tidyr)  # gather() comes from tidyr
dataframe %>%
  gather(key = "t", value = "value", 3:ncol(.))
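Since the rest of this collection is pandas-focused, it may help to note the equivalent wide-to-long reshape in pandas, DataFrame.melt. A minimal sketch (the tiny frame below just mirrors the layout of the R example):

import pandas as pd

df = pd.DataFrame({"a": ["a1", "a2"], "b": ["b1", "b2"],
                   "t1": [1, 2], "t2": [3, 4], "t3": [5, 6]})

# id_vars stay as columns; every t* column is stacked into (t, value) pairs.
long_df = df.melt(id_vars=["a", "b"], var_name="t", value_name="value")
print(long_df)
#     a   b   t  value
# 0  a1  b1  t1      1
# 1  a2  b2  t1      2
# 2  a1  b1  t2      3
# ...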

Is it possible to obtain 'groupby-transform-apply' style results with the function returning a Series rather than a scalar?

I want to achieve the following behavior:
res = df.groupby(['dimension'], as_index=False)['metric'].transform(lambda x: foo(x))
where foo(x) returns a Series the same size as its input, which is df['metric'] within each group.
however, this will throw the following error:
ValueError: transform must return a scalar value for each group
I know I can do this with a for loop, but how can I achieve it in a groupby manner?
e.g.
df:
col1 col2 col3
0 A1 B1 1
1 A1 B1 2
2 A2 B2 3
and I want to achieve:
col1 col2 col3
0 A1 B1 1 - (1+2)/2
1 A1 B1 2 - (1+2)/2
2 A2 B2 3 - 3
If you want to return a Series you should use apply instead of transform:
res = df.groupby(['dimension'], as_index=False)['metric'].apply(lambda x: foo(x))
transform, as the error states, must return a scalar value that is then put in every row of the group. But apply will work with a Series returned for each group.
If this doesn't work, provide sample input and expected output so the problem is easier to understand.
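For concreteness, here is a minimal, self-contained sketch of the apply route. The foo below is only a stand-in (a simple per-group demeaning function), since the original foo is not shown, and group_keys=False is passed so the result keeps the original row index:

import pandas as pd

def foo(s):
    # Hypothetical example: return a Series the same length as the group.
    return s - s.mean()

df = pd.DataFrame({"dimension": ["A1", "A1", "A2"],
                   "metric": [1, 2, 3]})

res = df.groupby("dimension", group_keys=False)["metric"].apply(foo)
print(res)
# 0   -0.5
# 1    0.5
# 2    0.0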
You can do this using transform:
df['col3'] = (df.col3 - df.groupby(['col1', 'col2'])['col3'].transform('sum')) / 2
Or using apply (slower):
df['col3'] = df.groupby(['col1', 'col2'])['col3'].apply(lambda x: (x - x.sum()) / 2)
col1 col2 col3
0 A1 B1 -1.0
1 A1 B1 -0.5
2 A2 B2 0.0
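For reference, a self-contained run of the transform one-liner above; the sample frame is rebuilt from the question:

import pandas as pd

df = pd.DataFrame({"col1": ["A1", "A1", "A2"],
                   "col2": ["B1", "B1", "B2"],
                   "col3": [1, 2, 3]})

# Subtract each group's total from every row, then halve, as in the answer above.
df["col3"] = (df.col3 - df.groupby(["col1", "col2"])["col3"].transform("sum")) / 2
print(df)
#   col1 col2  col3
# 0   A1   B1  -1.0
# 1   A1   B1  -0.5
# 2   A2   B2   0.0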

Same values don't change but different ones need to be concatenated when using pandas?

Example:
A B C
0 id1 b1 91
1 id1 b1 350
2 id2 a2 90
3 id2 a4 90
4 id2 a5 90
5 id3 c1 180
The column types:
col A: string
col B: string
col C: string
Expected Output:
A B C
0 id1 b1 '91,350'
1 id2 a2,a4,a5 '90'
2 id3 c1 '180'
I want to group by column A to get the expected output, but I don't know what function to apply after df.groupby('A').
Note: the expected output columns are all strings, and the values are merged with ','.
Convert to str, then use groupby with unique:
s = df.astype(str).groupby('A', as_index=False).agg(lambda x: ','.join(x.unique()))
s
A B C
0 id1 b1 91,350
1 id2 a2,a4,a5 90
2 id3 c1 180
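A self-contained version of the same idea, with the sample frame rebuilt from the question. C is built as integers in this sketch, so astype(str) does the string conversion; if C is already a string, the cast is harmless:

import pandas as pd

df = pd.DataFrame({"A": ["id1", "id1", "id2", "id2", "id2", "id3"],
                   "B": ["b1", "b1", "a2", "a4", "a5", "c1"],
                   "C": [91, 350, 90, 90, 90, 180]})

# Cast every column to str so C can be joined, then join the unique values per group.
s = (df.astype(str)
       .groupby("A", as_index=False)
       .agg(lambda x: ",".join(x.unique())))
print(s)
#      A         B       C
# 0  id1        b1  91,350
# 1  id2  a2,a4,a5      90
# 2  id3        c1     180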

pandas dataframe group by and agg

I am new to IPython and I am trying to do something with dataframe grouping. I have a dataframe like the one below:
import numpy as np
import pandas as pd

df_test = pd.DataFrame({"A": range(4), "B": ["B1", "B2", "B1", "B2"], "C": ["C1", "C1", np.nan, "C2"]})
df_test
A B C
0 0 B1 C1
1 1 B2 C1
2 2 B1 NaN
3 3 B2 C2
I would like to achieve the following:
1) group by B, but creating a multilevel column instead of grouping to rows with B1 and B2 as the index; the B1 and B2 values are basically counts
2) apply agg functions to columns A and C, something like {'C': ['count'], 'A': ['sum']}
       B
   A  B1  B2  C
0  6   2   2  3
How can I do this? Thanks
You are performing separate actions on each column. You can hack this by aggregating A and C, taking the value counts of B separately, and then combining the data back together.
ac = df_test.agg({'A':'sum', 'C':'count'})
b = df_test['B'].value_counts()
pd.concat([ac, b]).sort_index().to_frame().T
A B1 B2 C
0 6 2 2 3
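If the two-level column header sketched in the question is also wanted, one option is to rebuild it by hand on top of the flat result; the tuple layout below is an assumption about the intended output:

import numpy as np
import pandas as pd

df_test = pd.DataFrame({"A": range(4), "B": ["B1", "B2", "B1", "B2"], "C": ["C1", "C1", np.nan, "C2"]})

ac = df_test.agg({'A': 'sum', 'C': 'count'})   # A -> 6, C -> 3
b = df_test['B'].value_counts()                # B1 -> 2, B2 -> 2
out = pd.concat([ac, b]).sort_index().to_frame().T

# Assumed layout: B1/B2 nested under B, with A and C given an empty second level.
out.columns = pd.MultiIndex.from_tuples([('A', ''), ('B', 'B1'), ('B', 'B2'), ('C', '')])
print(out)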

Multiple group-by with one common variable with pandas?

I want to mark duplicate values within an ID group. For example
ID A B
i1 a1 b1
i1 a1 b2
i1 a2 b2
i2 a1 b2
should become
ID A An B Bn
i1 a1 2 b1 1
i1 a1 2 b2 2
i1 a2 1 b2 2
i2 a1 1 b2 1
Basically, An and Bn count multiplicity within each ID group. How can I do this in pandas? I've found groupby, but it was quite messy to put everything together. Also I tried individual groupby for ID, A and ID, B. Maybe there is a way to pre-group by ID first and then do all the other variables? (There are many variables and I have very many rows!)
Also I tried individual groupby for ID, A and ID, B
I think this is a straightforward way to tackle it; as you suggest, you can group by each combination separately and compute the size of the groups, then use transform so you can easily add the results back to the original dataframe:
df['An'] = df.groupby(['ID', 'A'])['A'].transform(np.size)
df['Bn'] = df.groupby(['ID', 'B'])['B'].transform(np.size)
print(df)
ID A B An Bn
0 i1 a1 b1 2 1
1 i1 a1 b2 2 2
2 i1 a2 b2 1 2
3 i2 a1 b2 1 1
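For reference, a self-contained version of the above; the imports and the sample frame are reconstructed from the question:

import numpy as np
import pandas as pd

df = pd.DataFrame({"ID": ["i1", "i1", "i1", "i2"],
                   "A":  ["a1", "a1", "a2", "a1"],
                   "B":  ["b1", "b2", "b2", "b2"]})

# Per-(ID, A) and per-(ID, B) group sizes, broadcast back onto every row.
df['An'] = df.groupby(['ID', 'A'])['A'].transform(np.size)
df['Bn'] = df.groupby(['ID', 'B'])['B'].transform(np.size)
print(df)
#   ID   A   B  An  Bn
# 0  i1  a1  b1   2   1
# 1  i1  a1  b2   2   2
# 2  i1  a2  b2   1   2
# 3  i2  a1  b2   1   1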
Of course, with lots of columns you could do:
for col in ['A', 'B']:
    df[col + 'n'] = df.groupby(['ID', col])[col].transform(np.size)
The duplicated method can also be used to give you something similar, but it will mark observations within a group after the first as duplicates:
for col in ['A', 'B']:
    df[col + 'n'] = df.duplicated(['ID', col])
print(df)
ID A B An Bn
0 i1 a1 b1 False False
1 i1 a1 b2 True False
2 i1 a2 b2 False True
3 i2 a1 b2 False False
EDIT: improving performance for large data. I tried this on a large dataset (4 million rows) and it was significantly faster to avoid transform with something like the following (it is much less elegant):
for col in ['A', 'B']:
    x = df.groupby(['ID', col]).size()
    df.set_index(['ID', col], inplace=True)
    df[col + 'n'] = x
    df.reset_index(inplace=True)