Multiple group-by with one common variable with pandas?

I want to mark duplicate values within an ID group. For example
i1 a1 b1
i1 a1 b2
i1 a2 b2
i2 a1 b2
should become
ID A An B Bn
i1 a1 2 b1 1
i1 a1 2 b2 2
i1 a2 1 b2 2
i2 a1 1 b2 1
Basically An and Bn count multiplicity within each ID group. How can I do this in pandas? I've found groupby, but it was quite messy to put everything together. I also tried individual groupbys on (ID, A) and (ID, B). Maybe there is a way to pre-group by ID first and then handle all the other variables? (There are many variables and I have very many rows!)

Also I tried individual groupby for ID, A and ID, B
I think this is a straightforward way to tackle it. As you suggest, you can group by each pair separately and then compute the size of the groups. Using transform lets you add the results straight back to the original dataframe:
# 'size' broadcasts each group's row count back to every row of that group
df['An'] = df.groupby(['ID','A'])['A'].transform('size')
df['Bn'] = df.groupby(['ID','B'])['B'].transform('size')
print(df)
ID A B An Bn
0 i1 a1 b1 2 1
1 i1 a1 b2 2 2
2 i1 a2 b2 1 2
3 i2 a1 b2 1 1
Of course, with lots of columns you could do:
for col in ['A','B']:
    df[col + 'n'] = df.groupby(['ID',col])[col].transform('size')
The duplicated method can also be used to give you something similar, but it will mark observations within a group after the first as duplicates:
for col in ['A','B']:
    df[col + 'n'] = df.duplicated(['ID',col])
print(df)
ID A B An Bn
0 i1 a1 b1 False False
1 i1 a1 b2 True False
2 i1 a2 b2 False True
3 i2 a1 b2 False False
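If the goal is only to flag every row whose value repeats within its ID group (rather than count it), duplicated also accepts keep=False, which marks all members of a repeated set instead of skipping the first one. A minimal sketch on the sample data from the question; the `_dup` column suffix is made up for illustration:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'ID': ['i1', 'i1', 'i1', 'i2'],
    'A':  ['a1', 'a1', 'a2', 'a1'],
    'B':  ['b1', 'b2', 'b2', 'b2'],
})

for col in ['A', 'B']:
    # keep=False flags every row of a repeated (ID, col) pair, including the first
    df[col + '_dup'] = df.duplicated(['ID', col], keep=False)

print(df)
```

This gives a boolean per row, so it answers "is this value duplicated within the ID?" but not "how many times?".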
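EDIT: increasing performance for large data. I did it on a large dataset (4 million rows) and it was significantly faster if I avoided transform with something like the following (it is much less elegant):
for col in ['A','B']:
    x = df.groupby(['ID',col]).size()
    # x is indexed by (ID, col), so index df the same way for the assignment to align
    df.set_index(['ID',col], inplace=True)
    df[col + 'n'] = x
    df.reset_index(inplace=True)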
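A self-contained way to check a non-transform variant against transform on the sample data. This sketch computes the group sizes once and attaches them with DataFrame.join(on=...); the join is an assumption of this sketch, not necessarily what was timed above, and no speed claim is made for it:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['i1', 'i1', 'i1', 'i2'],
    'A':  ['a1', 'a1', 'a2', 'a1'],
    'B':  ['b1', 'b2', 'b2', 'b2'],
})

# Variant 1: transform broadcasts each group's size back to its rows
via_transform = df.copy()
for col in ['A', 'B']:
    via_transform[col + 'n'] = via_transform.groupby(['ID', col])[col].transform('size')

# Variant 2: compute sizes once per (ID, col) pair, then join on those keys
via_join = df.copy()
for col in ['A', 'B']:
    sizes = via_join.groupby(['ID', col]).size().rename(col + 'n')
    via_join = via_join.join(sizes, on=['ID', col])

print(via_transform)
print(via_join)
```

On real data it is worth timing both variants yourself, since the difference depends on the number of groups and the pandas version.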


replacing first row of selected column from each group with 0

Existing df :
Id status value
A1 clear 23
A1 in-process 50
A1 done 20
B1 start 2
B1 end 30
Expected df :
Id status value
A1 clear 0
A1 in-process 50
A1 done 20
B1 start 0
B1 end 30
looking to replace first value of each group with 0
Use Series.duplicated to flag repeated Id values, invert the mask with ~ so that only the first row of each Id is selected, and assign 0 through DataFrame.loc:
df.loc[~df['Id'].duplicated(), 'value'] = 0
print (df)
Id status value
0 A1 clear 0
1 A1 in-process 50
2 A1 done 20
3 B1 start 0
4 B1 end 30
Another approach could be as follows:
Compare the value in each row of df.Id with the previous row, combining Series.shift with Series.ne. This will return a boolean Series with True for each first row of a new Id value.
Next, use df.loc to select only the rows where that mask is True, in column value, and assign 0.
df.loc[df.Id.ne(df.Id.shift()), 'value'] = 0
print(df)
Id status value
0 A1 clear 0
1 A1 in-process 50
2 A1 done 20
3 B1 start 0
4 B1 end 30
N.B. this approach assumes that the "groups" in Id are sorted (as they indeed seem to be). If this is not the case, you could use df.sort_values('Id', inplace=True) first, but if that is necessary, the duplicated-based answer by @jezrael will surely be faster.
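For data where the Id groups are not contiguous, a mask built from groupby(...).cumcount() also picks the first row of each group without sorting. A short sketch comparing the three masks on the question's data (the variable names are illustrative only):

```python
import pandas as pd

df = pd.DataFrame({
    'Id': ['A1', 'A1', 'A1', 'B1', 'B1'],
    'status': ['clear', 'in-process', 'done', 'start', 'end'],
    'value': [23, 50, 20, 2, 30],
})

# Three ways to build a mask that is True on the first row of each Id
first_by_duplicated = ~df['Id'].duplicated()
first_by_shift = df['Id'].ne(df['Id'].shift())         # needs contiguous groups
first_by_cumcount = df.groupby('Id').cumcount().eq(0)  # works on unsorted data too

df.loc[first_by_cumcount, 'value'] = 0
print(df)
```

On this sorted input all three masks agree; they only diverge when the same Id reappears after a different one, in which case the shift-based mask treats the reappearance as a new group.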

Finding max row after groupby in pandas dataframe

I have a dataframe as follows:
Month Col1 Col2 Val
A p a1 31
A q a1 78
A r b2 13
B x a1 54
B y b2 56
B z b2 65
I want to get the following:
Month a1 b2
A q r
B x z
Essentially, for each pair of Month and Col2, I want to find the value in Col1 that has the maximum Val.
I am not sure how to approach this.
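Your problem is:
Find the row with max Val within each group, which is sort_values and drop_duplicates, and
transform the data, which is pivot:
# After sorting by Val, the last row of each (Month, Col2) pair holds the max
(df.sort_values('Val')
   .drop_duplicates(['Month','Col2'], keep='last')
   .pivot(index='Month', columns='Col2', values='Col1')
)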
Col2 a1 b2
A q r
B x z
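An alternative sketch of the same two steps, using idxmax to pick the row with the largest Val in each (Month, Col2) group instead of sorting. This is a swapped-in technique, not the one from the answer above, and it assumes the dataframe has a unique index:

```python
import pandas as pd

df = pd.DataFrame({
    'Month': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Col1':  ['p', 'q', 'r', 'x', 'y', 'z'],
    'Col2':  ['a1', 'a1', 'b2', 'a1', 'b2', 'b2'],
    'Val':   [31, 78, 13, 54, 56, 65],
})

# Index label of the max-Val row within each (Month, Col2) group
best_rows = df.loc[df.groupby(['Month', 'Col2'])['Val'].idxmax()]

# Reshape: one row per Month, one column per Col2 value
result = best_rows.pivot(index='Month', columns='Col2', values='Col1')
print(result)
```

If several rows tie for the maximum, idxmax keeps the first one it meets, whereas sorting with keep='last' keeps the last, so the two versions can differ on ties.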

append lists of different length to dataframe pandas

Consider I have multiple lists
A = ['acc_num=1', 'A1', 'A2']
B = ['acc_num=2', 'B1', 'B2', 'B3','B4']
C = ['acc_num=3', 'C1']
How do I put them in a dataframe to export to Excel as:
acc_num _1 _2 _3 _4
_1 1 A1 A2
_2 2 B1 B2 B3 B4
_3 3 C1
Hi, here is a solution for you in 3 basic steps:
1) Create a DataFrame just by passing a list of your lists.
2) Clean up the acc_num column by removing the leading string "acc_num=". This is done with a vectorized string method on the column (but that maybe goes too far for now).
3) Rename the column headers as you wish by passing a dictionary {} to df.rename.
The Code:
# Create a DataFrame from your lists
df = pd.DataFrame([A,B,C])
# Change column 0 and remove the initial string
df[0] = df[0].str.replace('acc_num=','')
# Change the name of column 0
df = df.rename(columns={0: 'acc_num'})
Final result:
acc_num 1 2 3 4
0 1 A1 A2 None None
1 2 B1 B2 B3 B4
2 3 C1 None None None
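To get closer to the layout asked for in the question (blank cells instead of None, and `_1`, `_2`, ... labels), the frame can be relabeled and filled before export. A sketch, assuming the `_N` labels are wanted on both axes; the output filename is hypothetical, and the export line is left commented out because to_excel needs an Excel writer such as openpyxl installed:

```python
import pandas as pd

A = ['acc_num=1', 'A1', 'A2']
B = ['acc_num=2', 'B1', 'B2', 'B3', 'B4']
C = ['acc_num=3', 'C1']

# Shorter lists are padded with missing values automatically
df = pd.DataFrame([A, B, C])
df[0] = df[0].str.replace('acc_num=', '', regex=False)

# Label columns acc_num, _1, _2, ... and rows _1, _2, ...
df.columns = ['acc_num'] + [f'_{i}' for i in range(1, df.shape[1])]
df.index = [f'_{i}' for i in range(1, len(df) + 1)]

# Show empty cells instead of None/NaN
df = df.fillna('')
print(df)

# df.to_excel('accounts.xlsx')  # hypothetical filename
```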

pandas dataframe group by and agg

I am new to ipython and I am trying to do something with dataframe grouping. I have a dataframe like the one below:
df_test = pd.DataFrame({"A": range(4), "B": ["B1", "B2", "B1", "B2"], "C": ["C1", "C1", np.nan, "C2"]})
A B C
0 0 B1 C1
1 1 B2 C1
2 2 B1 NaN
3 3 B2 C2
I would like to achieve the following:
1) group by B, but instead of getting rows indexed by B1 and B2, get B1 and B2 as columns that hold the count of each value
2) aggregate columns A and C with something like {'C':['count'],'A':['sum']}
A B1 B2 C
0 6 2 2 3
How can I do this? Thanks
You are applying a different operation to each column. You can hack this by aggregating A and C, taking the value counts of B separately, and then combining the results back together.
ac = df_test.agg({'A':'sum', 'C':'count'})
b = df_test['B'].value_counts()
pd.concat([ac, b]).sort_index().to_frame().T
A B1 B2 C
0 6 2 2 3

Generating binary variables in Pig

I am a newbie to the world of Pig and I need to implement the following scenario.
Input to the Pig script: any arbitrary relation, say the table below
a1 b1 c1
a2 b2 c2
a1 b1 c3
We have to generate binary columns based on B and C, so my output will look something like this.
A B C B.b1 B.b2 C.c1 C.c2 C.c3
a1 b1 c1 1 0 1 0 0
a2 b2 c2 0 1 0 1 0
a1 b1 c3 1 0 0 0 1
Can someone let me know how to achieve this in Pig? I know this can be easily achieved using an R script, but my requirement is to do it in Pig.
Your help will be highly appreciated.
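Can you try this?
input:
a1 b1 c1
a2 b2 c2
a1 b1 c3
Pig script:
X = LOAD 'input' USING PigStorage() AS (A:chararray,B:chararray,C:chararray);
-- Emit one 0/1 column per known value of B and C (Y is an arbitrary relation name)
Y = FOREACH X GENERATE A, B, C,
    ((B=='b1')?1:0) AS Bb1,
    ((B=='b2')?1:0) AS Bb2,
    ((C=='c1')?1:0) AS Cc1,
    ((C=='c2')?1:0) AS Cc2,
    ((C=='c3')?1:0) AS Cc3;
DUMP Y;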