pandas dataframe group by and agg

I am new to IPython and I am trying to do something with dataframe grouping. I have a dataframe like the one below:
import pandas as pd
import numpy as np

df_test = pd.DataFrame({"A": range(4), "B": ["B1", "B2", "B1", "B2"], "C": ["C1", "C1", np.nan, "C2"]})
df_test
   A   B    C
0  0  B1   C1
1  1  B2   C1
2  2  B1  NaN
3  3  B2   C2
I would like to achieve the following:
1) group by B, but create a multilevel column instead of grouping to rows with B1 and B2 as the index; the B1 and B2 columns are basically counts
2) apply aggregation functions to columns A and C, with something like {'C': ['count'], 'A': ['sum']}
       B
   A  B1  B2  C
0  6   2   2  3
How can I do this? Thanks.

You are applying separate actions to each column. You can hack this by aggregating A and C, taking the value counts of B separately, and then combining the data back together.
ac = df_test.agg({'A': 'sum', 'C': 'count'})  # scalar aggregates for A and C
b = df_test['B'].value_counts()               # counts of B1 and B2
pd.concat([ac, b]).sort_index().to_frame().T  # combine into a single row
   A  B1  B2  C
0  6   2   2  3
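If the nested B level from the desired output matters, here is a minimal sketch building on df_test above (not from the original answer; out is just an illustrative name). It builds one-row frames and concats them with keys=, which adds a top column level:
ac = df_test.agg({'A': 'sum', 'C': 'count'}).to_frame().T
b = df_test['B'].value_counts().to_frame().T.reset_index(drop=True)
# the empty keys leave A and C without a group label; 'B' spans B1 and B2
out = pd.concat([ac[['A']], b, ac[['C']]], axis=1, keys=['', 'B', ''])
print(out)
#    A   B      C
#       B1  B2
# 0  6   2   2  3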


Pivoting data without column

Starting from a df imported from Excel like this:
Code  Material  Text       QTY
A1    X222      Model3       1
A2    4027721   Gruoup1      1
A2    4647273   Gruoup1.1    4
A1    573828    Gruoup1.2    1
I want to create a new pivot table like this:
Code  Qty
A1      2
A2      5
I tried the following commands but they do not work:
df.pivot(index='Code', columns='',values='Qty')
df_pivot = df ("Code").Qty([sum, max])
You don't need pivot but groupby:
out = df.groupby('Code', as_index=False)['QTY'].sum()
# Or
out = df.groupby('Code')['QTY'].agg(['sum', 'max']).reset_index()
Output:
>>> out
  Code  sum  max
0   A1    2    1
1   A2    5    4
The equivalent code with pivot_table:
out = (df.pivot_table('QTY', 'Code', aggfunc=['sum', 'max'])
         .droplevel(1, axis=1).reset_index())
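As a side note (not in the original answer), the same flat result can also be spelled with named aggregation, which avoids the column MultiIndex and the droplevel call entirely:
out = df.groupby('Code', as_index=False).agg(sum=('QTY', 'sum'), max=('QTY', 'max'))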

append lists of different length to dataframe pandas

Consider that I have multiple lists:
A = ['acc_num=1', 'A1', 'A2']
B = ['acc_num=2', 'B1', 'B2', 'B3','B4']
C = ['acc_num=3', 'C1']
How do I put them into a dataframe to export to Excel as:
    acc_num  _1  _2  _3  _4
_1        1  A1  A2
_2        2  B1  B2  B3  B4
_3        3  C1
Hi, here is a solution for you in 3 basic steps:
1) Create a DataFrame just by passing a list of your lists
2) Manipulate the acc_num column and remove the starting string "acc_num="; this is done with a vectorized string method on the column (but maybe that goes too far for now)
3) Rename the column headers as you wish by passing a dictionary {} to df.rename
The code:
import pandas as pd

# Create a DataFrame from your lists
df = pd.DataFrame([A, B, C])
# Change column 0 and remove the initial string
df[0] = df[0].str.replace('acc_num=', '')
# Change the name of column 0
df.rename(columns={0: "acc_num"}, inplace=True)
Final result:
Out[26]:
  acc_num   1     2     3     4
0       1  A1    A2  None  None
1       2  B1    B2    B3    B4
2       3  C1  None  None  None
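If the remaining headers should match the _1.._4 style from the question and the result should go to Excel, a small follow-up sketch (the filename is a placeholder, and to_excel needs an engine such as openpyxl installed):
# Prefix the remaining integer column names with an underscore
df.columns = ['acc_num'] + ['_{}'.format(c) for c in df.columns[1:]]
df.to_excel('accounts.xlsx', index=False)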

How to groupby in pandas where I have column values starting with similar letters

Suppose I have a column with values (not column names) L1 xyy, L2 yyy, L3 abc; now I want to group L1, L2 and L3 as L (any other name would also do).
Similarly, I have other values like A1 xxx, A2 xxx to be grouped to form A, and so on for other letters.
How do I achieve this in pandas?
I have L1, A1 and so on all in the same column, not in different columns.
Use indexing by str[0] to return the first letter of the column, and then aggregate with some function, e.g. sum:
df = pd.DataFrame({'col':['L1 xyy','L2 yyy','L3 abc','A1 xxx','A2 xxx'],
                   'val':[2,3,5,1,2]})
print (df)
      col  val
0  L1 xyy    2
1  L2 yyy    3
2  L3 abc    5
3  A1 xxx    1
4  A2 xxx    2
df1 = df.groupby(df['col'].str[0])['val'].sum().reset_index(name='new')
print (df1)
  col  new
0   A    3
1   L   10
If you need a new column from the first letter:
df['new'] = df['col'].str[0]
print (df)
      col  val new
0  L1 xyy    2   L
1  L2 yyy    3   L
2  L3 abc    5   L
3  A1 xxx    1   A
4  A2 xxx    2   A
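If the leading label can be longer than one character (e.g. AB1 xxx), a hedged variant (this assumes the group key is the leading run of letters) is to extract the prefix with a regex instead of str[0]:
key = df['col'].str.extract(r'^([A-Za-z]+)', expand=False).rename('col')
df2 = df.groupby(key)['val'].sum().reset_index(name='new')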

Split a dataframe into multiple dataframes

I have a dataframe; I split it using groupby. I understand this splits the dataframe into multiple dataframes. How can I get back those individual dataframes, based on the groups, and name them accordingly? So if I use df.groupby(['A','B']),
and A has the value A1 and B has values B1-B4, I want to get back those 4 dataframes, called df_A1B1, df_A1B2, ..., df_A1B4.
This can be done with locals, but it is not recommended:
variables = locals()
for i, j in df.groupby(['A', 'B']):
    variables["df_{0[0]}{0[1]}".format(i)] = j
df_01
Out[332]:
   A  B              C
0  0  1  a-1524112-124
Using a dict is the right way:
{"df_{0[0]}{0[1]}".format(i) : j for i,j in df.groupby(['A','B'])}
Offering an alternate solution, using pandas.DataFrame.xs and some exec magic -
df = pd.DataFrame({'A': ['a1', 'a2']*4,
                   'B': ['b1', 'b2', 'b3', 'b4']*2,
                   'val': list(range(8))})
df
#     A   B  val
# 0  a1  b1    0
# 1  a2  b2    1
# 2  a1  b3    2
# 3  a2  b4    3
# 4  a1  b1    4
# 5  a2  b2    5
# 6  a1  b3    6
# 7  a2  b4    7
for i in df.set_index(['A', 'B']).index.unique().tolist():
    exec("df_{}{}".format(i[0], i[1]) + " = df.set_index(['A','B']).xs(i)")
df_a1b1
#        val
# A  B
# a1 b1    0
#    b1    4
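If exec feels too magical, note that get_group (a standard groupby method) fetches a single group directly, without any dynamic variable names:
df.groupby(['A', 'B']).get_group(('a1', 'b1'))
#     A   B  val
# 0  a1  b1    0
# 4  a1  b1    4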

Multiple group-by with one common variable with pandas?

I want to mark duplicate values within an ID group. For example:
ID A B
i1 a1 b1
i1 a1 b2
i1 a2 b2
i2 a1 b2
should become
ID A An B Bn
i1 a1 2 b1 1
i1 a1 2 b2 2
i1 a2 1 b2 2
i2 a1 1 b2 1
Basically, An and Bn count multiplicity within each ID group. How can I do this in pandas? I've found groupby, but it was quite messy to put everything together. Also, I tried individual groupbys for ID, A and ID, B. Maybe there is a way to pre-group by ID first and then do all the other variables? (There are many variables and I have very many rows!)
Also I tried individual groupby for ID, A and ID, B
I think this is a straightforward way to tackle it. As you suggest, you can groupby each separately and then compute the size of the groups, and use transform so you can easily add the results to the original dataframe:
df['An'] = df.groupby(['ID', 'A'])['A'].transform('size')
df['Bn'] = df.groupby(['ID', 'B'])['B'].transform('size')
print(df)
   ID   A   B  An  Bn
0  i1  a1  b1   2   1
1  i1  a1  b2   2   2
2  i1  a2  b2   1   2
3  i2  a1  b2   1   1
Of course, with lots of columns you could do:
for col in ['A', 'B']:
    df[col + 'n'] = df.groupby(['ID', col])[col].transform('size')
The duplicated method can also be used to give you something similar, but it will mark observations within a group after the first as duplicates:
for col in ['A', 'B']:
    df[col + 'n'] = df.duplicated(['ID', col])
print(df)
   ID   A   B     An     Bn
0  i1  a1  b1  False  False
1  i1  a1  b2   True  False
2  i1  a2  b2  False   True
3  i2  a1  b2  False  False
EDIT: increasing performance for large data. I tried it on a large dataset (4 million rows), and it was significantly faster if I avoided transform with something like the following (it is much less elegant):
for col in ['A', 'B']:
    x = df.groupby(['ID', col]).size()
    df.set_index(['ID', col], inplace=True)
    df[col + 'n'] = x
    df.reset_index(inplace=True)
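As a side note (not part of the original answer): on recent pandas versions, transform('size') is typically competitive even at this scale, so it may be worth re-timing both variants before keeping the less elegant one. A self-contained sketch for such a comparison, with purely synthetic data (the sizes and value ranges are just illustrative):
import numpy as np
import pandas as pd

# Synthetic data roughly matching the 4-million-row scenario
n = 4_000_000
rng = np.random.default_rng(0)
df = pd.DataFrame({'ID': rng.integers(0, 100_000, n),
                   'A': rng.integers(0, 10, n),
                   'B': rng.integers(0, 10, n)})

# Variant 1: transform broadcasts group sizes back to the rows
for col in ['A', 'B']:
    df[col + 'n'] = df.groupby(['ID', col])[col].transform('size')

# Variant 2: size() plus index alignment (the EDIT above)
for col in ['A', 'B']:
    x = df.groupby(['ID', col]).size()
    df.set_index(['ID', col], inplace=True)
    df[col + 'm'] = x
    df.reset_index(inplace=True)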