groupby() with multiple categorical columns delivers unexpected rows

I ran into a surprising issue when using groupby() with multiple columns of dtype 'category'. In this scenario, pandas returns unexpected rows, specifically rows that do not appear when the columns have other dtypes. Below is a minimal working example.
import pandas as pd
df = pd.DataFrame(['a','a','b','c'], columns=['C1'], dtype='category')
df['C2'] = pd.Series(['x','y','z','y']).astype('category')
df['V'] = 0
df
gives a basic DataFrame:
C1 C2 V
0 a x 0
1 a y 0
2 b z 0
3 c y 0
Now if we group this DataFrame by both columns:
df.groupby(['C1','C2']).sum()
The result contains unexpected rows (combinations of C1 and C2 that don't exist in the input dataframe):
V
C1 C2
a x 0
y 0
z 0
b x 0
y 0
z 0
c x 0
y 0
z 0
If we convert the categorical columns to strings:
df[['C1','C2']] = df[['C1','C2']].astype(str)
df.groupby(['C1','C2']).sum()
The result contains only expected rows:
V
C1 C2
a x 0
y 0
b z 0
c y 0
Is there any way, other than converting the categorical columns to string, to avoid this behavior?

@wjandrea suggested a working solution: pass the extra parameter observed=True to groupby():
df.groupby(['C1','C2'], observed=True).sum()
The result is delivered as expected:
V
C1 C2
a x 0
y 0
b z 0
c y 0
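For intuition, here is a short sketch (an editorial addition, reusing df from the question): with observed=False, long the default, the group index is built as the full cross-product of each column's declared categories, which is where the extra all-zero rows come from. Newer pandas releases deprecate the observed=False default, so observed=True is the forward-compatible spelling.
import pandas as pd
# The group index comes from the declared categories, not the observed pairs.
full = pd.MultiIndex.from_product(
    [df['C1'].cat.categories, df['C2'].cat.categories],
    names=['C1', 'C2'])
print(len(full))  # 9 category combinations, though only 4 pairs occur in df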

Related

How to unpivot table from boolean form

I have a table like this, where the type (A, B, C) is represented in boolean form:
ID     A  B  C
One    1  0  0
Two    0  0  1
Three  0  1  0
I want the table to look like:
ID     Type
One    A
Two    C
Three  B
You can melt, then select the rows equal to 1 with loc, using pop to remove the intermediate value column:
out = df.melt('ID', var_name='Type').loc[lambda d: d.pop('value').eq(1)]
output:
ID Type
0 One A
5 Three B
7 Two C
You can do:
import numpy as np

x, y = np.where(df.iloc[:, 1:])
out = pd.DataFrame({'ID': df.loc[x, 'ID'], 'Type': df.columns[1:][y]})
Output:
ID Type
0 One A
1 Two C
2 Three B
You can also use the pd.from_dummies constructor, which was added in pandas 1.5.
Note that this also preserves the original order of your ID column.
df['Type'] = pd.from_dummies(df.loc[:, 'A':'C'])
print(df)
ID A B C Type
0 One 1 0 0 A
1 Two 0 0 1 C
2 Three 0 1 0 B
print(df[['ID', 'Type']])
ID Type
0 One A
1 Two C
2 Three B
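One hedged footnote on from_dummies: if the dummy frame can contain a row with no 1 at all, the call raises unless you pass default_category. A tiny sketch with a made-up 'none' label:
import pandas as pd

dummies = pd.DataFrame({'A': [1, 0, 0], 'B': [0, 0, 0], 'C': [0, 1, 0]})
# The all-zero middle row is mapped to the fallback label instead of raising.
print(pd.from_dummies(dummies, default_category='none'))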

df.apply(myfunc, axis=1) results in error but df.groupby(df.index).apply(myfunc) does not

I have a dataframe that looks like this:
a b c
0 x x x
1 y y y
2 z z z
I would like to apply a function to each row of the dataframe. That function creates a new dataframe with multiple rows from each input row and returns it. Here is my_func:
def my_func(df):
    # number of copies to make of this row
    dup_num = int(df.c - df.a)
    if isinstance(df, pd.Series):
        # a single row arrives as a Series: turn it into a one-row DataFrame
        df_expanded = pd.concat([pd.DataFrame(df).transpose()] * dup_num,
                                ignore_index=True)
    else:
        df_expanded = pd.concat([pd.DataFrame(df)] * dup_num,
                                ignore_index=True)
    return df_expanded
The final dataframe will look like something like this:
a b c
0 x x x
1 x x x
2 y y y
3 y y y
4 y y y
5 z z z
6 z z z
So I did:
df_expanded = df.apply(my_func, axis=1)
I inserted breakpoints inside the function, and for each row the dataframe created by my_func is correct. However, when the last row returns, I get an error stating:
ValueError: cannot copy sequence with size XX to array axis with dimension YY
It is as if apply were trying to return a Series rather than the collection of DataFrames that the function created.
So instead of df.apply I did:
df_expanded = df.groupby(df.index).apply(my_func)
which just creates single-row groups and applies the same function. This, on the other hand, works.
Why?
Perhaps we can take advantage of how pd.Series.explode and pd.Series.apply(pd.Series) work to simplify this process.
Given:
a b c
0 1 1 4
1 2 2 4
2 3 3 4
Doing:
new_df = (df.apply(lambda x: [x.tolist()] * (x.c - x.a), axis=1)
            .explode(ignore_index=True)
            .apply(pd.Series))
new_df.columns = df.columns
print(new_df)
Output:
a b c
0 1 1 4
1 1 1 4
2 1 1 4
3 2 2 4
4 2 2 4
5 3 3 4
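As an editorial aside, the usual apply-free idiom for this kind of row duplication (a sketch, assuming the numeric df above) repeats the index directly:
# Repeat each row (c - a) times, then renumber the rows.
out = df.loc[df.index.repeat(df['c'] - df['a'])].reset_index(drop=True)
print(out)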

How to find difference between rows in a pandas multiIndex, by level 1

Suppose we have a DataFrame like this, only with many, many more index A values:
df = pd.DataFrame([[1, 2, 1, 2],
                   [1, 1, 2, 2],
                   [2, 2, 1, 0],
                   [1, 2, 1, 2],
                   [2, 1, 1, 2]], columns=['A', 'B', 'c1', 'c2'])
df.groupby(['A', 'B']).sum()
## result
c1 c2
A B
1 1 2 2
2 2 4
2 1 1 2
2 1 0
How can I get a data frame that consists of the difference between rows, by the second level of the index, level B?
The output here would be
A c1 c2
1 0 -2
2 0 2
Note: in my particular use case, I have a lot of column A values, so I can't write out the values for A explicitly.
Check diff and dropna
g = df.groupby(['A','B'])[['c1','c2']].sum()
g = g.groupby(level=0).diff().dropna()
g
Out[25]:
c1 c2
A B
1 2 0.0 2.0
2 2 0.0 -2.0
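A small editorial follow-up: the result above still carries the leftover B level in its index; since dropna leaves one row per A, you can drop that level to match the shape the asker wants.
g = g.droplevel('B')  # index becomes just A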
Assigning the first grouping to a result variable:
result = df.groupby(['A','B']).sum()
You could use a pipe operation with nth:
result.groupby('A').pipe(lambda df: df.nth(0) - df.nth(-1))
c1 c2
A
1 0 -2
2 0 2
A simpler option, in my opinion, would be to use agg combined with numpy's ufunc reduce, as this covers scenarios where you have more than two rows:
import numpy as np

result.groupby('A').agg(np.subtract.reduce)
c1 c2
A
1 0 -2
2 0 2
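An equivalent editorial sketch, assuming two rows per A group as in the example, spells out the same idea with first and last:
g = result.groupby('A')
print(g.first() - g.last())  # same as nth(0) - nth(-1) for two-row groups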

pandas: create a new dataframe from existing columns values

I have a dataframe like this:
ID code num
333_c_132 x 0
333_c_132 n36 1
998_c_134 x 0
998_c_134 n36 0
997_c_135 x 1
997_c_135 n36 0
From this I have to create a new dataframe like the one below: a new column numX is formed, with one row per unique ID. Note that the numX values are taken from the num column of the corresponding n36 rows.
ID code num numX
333_c_132 x 0 1
998_c_134 x 0 0
997_c_135 x 1 0
How can I do this only using pandas?
You can use a mask, then merge after pivoting:
m = df['code'].eq('n36')
(df[~m].merge(df[m].set_index(['ID', 'code'])['num'].unstack(),
              left_on='ID', right_index=True))
ID code num n36
0 333_c_132 x 0 1
2 998_c_134 x 0 0
4 997_c_135 x 1 0

Conditional frequency of elements within lists in pandas data frame

I have a data frame in pandas like this:
STATUS FEATURES
A [x,y,z]
A [t, y]
B [x,p,t]
B [x,p]
I want to count the frequency of the elements in the lists of features conditional on the status.
The desired output would be:
STATUS FEATURES FREQUENCY
A x 1
A y 2
A z 1
A t 1
B x 2
B t 1
B p 2
Let us do explode, then groupby with size:
s = df.explode('FEATURES').groupby(['STATUS', 'FEATURES']).size().reset_index(name='FREQUENCY')
Use DataFrame.explode and SeriesGroupBy.value_counts:
new_df = (df.explode('FEATURES')
            .groupby('STATUS')['FEATURES']
            .value_counts()
            .reset_index(name='FREQUENCY'))
print(new_df)
Output
STATUS FEATURES FREQUENCY
0 A y 2
1 A t 1
2 A x 1
3 A z 1
4 B p 2
5 B x 2
6 B t 1
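And one more editorial sketch, not from the original answers: pd.crosstab on the exploded frame gives the same counts, keeping only the nonzero cells:
e = df.explode('FEATURES')
out = (pd.crosstab(e['STATUS'], e['FEATURES'])
         .stack()
         .loc[lambda s: s > 0]
         .reset_index(name='FREQUENCY'))
print(out)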