Get and graph groupby result distribution of a column - pandas

I want to graph the distribution of a label column within each group. I managed to do it by creating dummies, building a crosstab for each group, and then looping to assemble a new dataframe.
I am looking for a shorter way, perhaps with more advanced groupby methods?
I also don't know how to create a side-by-side bar chart instead of the stacked bar chart I have here.
To recreate the dataframe:
import pandas as pd
import numpy as np
a = np.random.choice(['region_A', 'region_B', 'region_C', 'region_D', 'region_E'], size=30, p=
[0.1, 0.2, 0.3, 0.30, 0.1])
b = np.random.choice(['1', '0'], size=30, p=[0.5, 0.5])
df = pd.DataFrame({'region': a, 'label': b})
My desired graph:
dummy = pd.get_dummies(df['region'])
region_lst = []
label_0 = []
label_1 = []
for col in dummy.columns:
    # remember the region name so it can be used as the index below
    region_lst.append(col)
    label_0.append(pd.crosstab(dummy[col], df['label']).iloc[1,0])
    label_1.append(pd.crosstab(dummy[col], df['label']).iloc[1,1])
df_labels = pd.DataFrame({'label_0': label_0, 'label_1': label_1}, index=region_lst)

Use crosstab with DataFrame.add_prefix for the same output as your long code:
pd.crosstab(df['region'], df['label']).add_prefix('label_')
df_labels = pd.crosstab(df['region'], df['label']).add_prefix('label_')
print (df_labels)
label label_0 label_1
region_A 2 3
region_B 3 3
region_C 5 4
region_D 3 6
region_E 1 0
If you need to remove the axis names label and region:
df_labels = (pd.crosstab(df['region'], df['label'])
               .add_prefix('label_')
               .rename_axis(index=None, columns=None))
print (df_labels)
label_0 label_1
region_A 2 3
region_B 3 3
region_C 5 4
region_D 3 6
region_E 1 0
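Since the question asks about groupby specifically, here is a minimal sketch of the groupby equivalent of the crosstab above. The small fixed dataframe is made up for illustration (the random one from the question would work the same way); counting with size() and pivoting with unstack() should give the same table as crosstab.

```python
import pandas as pd

# Small fixed sample (hypothetical data, stands in for the random frame)
df = pd.DataFrame({
    "region": ["region_A", "region_A", "region_B", "region_B", "region_B", "region_C"],
    "label": ["0", "1", "0", "0", "1", "1"],
})

# Count rows per (region, label) pair, then move label into the columns
via_groupby = (
    df.groupby(["region", "label"])
      .size()
      .unstack(fill_value=0)   # missing combinations become 0
      .add_prefix("label_")
)

# Same table built with crosstab, for comparison
via_crosstab = pd.crosstab(df["region"], df["label"]).add_prefix("label_")

print(via_groupby)
```

Both routes give one row per region and one column per label, so either can feed the plotting step.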

You can use a crosstab:
pd.crosstab(df['region'], df['label'])
intermediate crosstab:
label 0 1
region_A 2 3
region_B 3 3
region_C 5 4
region_D 3 6
region_E 1 0
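For the side-by-side bar chart part of the question, a minimal sketch, assuming matplotlib is installed: DataFrame.plot.bar draws one bar per column next to each other by default, and stacked=True switches to the stacked layout. The sample data and the Agg backend line are assumptions made only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: no display available)
import pandas as pd

# Small fixed sample (hypothetical data)
df = pd.DataFrame({
    "region": ["region_A", "region_A", "region_B", "region_B", "region_B", "region_C"],
    "label": ["0", "1", "0", "0", "1", "1"],
})
counts = pd.crosstab(df["region"], df["label"]).add_prefix("label_")

# Default: grouped bars, one bar per label side by side for each region
ax = counts.plot.bar(rot=0)
ax.set_ylabel("count")

# For comparison: the stacked variant
ax_stacked = counts.plot.bar(stacked=True, rot=0)
```

In an interactive session, plt.show() (or a notebook) would display both figures.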


Creating a dataframe using roll-forward window on multivariate time series

Based on the simplified sample dataframe
import pandas as pd
import numpy as np
timestamps = pd.date_range(start='2017-01-01', end='2017-01-5', inclusive='left')
values = np.arange(0,len(timestamps))
df = pd.DataFrame({'A': values ,'B' : values*2},
index = timestamps )
A B
2017-01-01 0 0
2017-01-02 1 2
2017-01-03 2 4
2017-01-04 3 6
I want to use a roll-forward window of size 2 with a stride of 1 to create a resulting dataframe like
timestep_1 timestep_2 target
0 A 0 1 2
B 0 2 4
1 A 1 2 3
B 2 4 6
I.e., each window step should create a data item with the two values of A and B in this window and the A and B values immediately to the right of the window as target values.
My first idea was to use pandas
But that seems to only work in combination with aggregate functions such as sum, which is a different use case.
Any ideas on how to implement this rolling-window-based sampling approach?
Here is one way to do it:
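window_size = 3  # two timesteps plus one target column
new_df = pd.concat(
    [
        df.iloc[i : i + window_size, :]
        .T.reset_index()           # columns A/B become rows, their names go to "index"
        .assign(other_index=i)     # window number
        .set_index(["other_index", "index"])
        .set_axis([f"timestep_{j}" for j in range(1, window_size)] + ["target"], axis=1)
        for i in range(df.shape[0] - window_size + 1)
    ]
)
new_df.index.names = ["", ""]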
# Output
timestep_1 timestep_2 target
0 A 0 1 2
B 0 2 4
1 A 1 2 3
B 2 4 6
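An alternative sketch of the same idea without the Python-level concat loop, assuming NumPy 1.20 or newer for sliding_window_view. This is not the method from the answer above, only a vectorized variant; the names windows and n_windows are introduced here for illustration.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

timestamps = pd.date_range(start="2017-01-01", periods=4)
values = np.arange(len(timestamps))
df = pd.DataFrame({"A": values, "B": values * 2}, index=timestamps)

window_size = 3  # two timesteps plus one target column

# Shape is (n_windows, n_columns, window_size): the window axis is appended last
windows = sliding_window_view(df.to_numpy(), window_size, axis=0)
n_windows = windows.shape[0]

new_df = pd.DataFrame(
    windows.reshape(-1, window_size),  # one row per (window, column) pair
    index=pd.MultiIndex.from_product([range(n_windows), df.columns]),
    columns=[f"timestep_{j}" for j in range(1, window_size)] + ["target"],
)
print(new_df)
```

The stride of 1 is implicit here; a larger stride could be obtained by slicing windows (for example windows[::2]) before building the frame.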

pandas finding duplicate rows with different label

I want to sanity check labeled data. I have hundreds of features and want to find points which have the same features but a different label. These clusters of disagreeing labels should then be numbered and put into a new dataframe.
This isn't hard, but I am wondering what the most elegant solution is.
Here is an example:
import pandas as pd
df = pd.DataFrame({
    "feature_1" : [0,0,0,4,4,2],
    "feature_2" : [0,5,5,1,1,3],
    "label" : ["A","A","B","B","D","A"]
})
result_df = pd.DataFrame({
    "cluster_index" : [0,0,1,1],
    "feature_1" : [0,0,4,4],
    "feature_2" : [5,5,1,1],
    "label" : ["A","B","B","D"]
})
In order to get the output you want (both de-duplication and cluster_index), you can use a groupby approach:
g = df.groupby(['feature_1', 'feature_2'])['label']
(df.assign(cluster_index=g.ngroup()) # get group number
   .loc[g.transform('size').gt(1)] # drop rows whose feature combination is unique
   # line below only to have a nice cluster_index range (0,1…)
   .assign(cluster_index= lambda d: d['cluster_index'].factorize()[0])
)
feature_1 feature_2 label cluster_index
1 0 5 A 0
2 0 5 B 0
3 4 1 B 1
4 4 1 D 1
First get all rows that are duplicated on the feature columns, then, if necessary, remove rows duplicated on all columns (not needed for the sample data), and finally add GroupBy.ngroup for the group indices:
df = df[df.duplicated(['feature_1','feature_2'],keep=False)].drop_duplicates()
df['cluster_index'] = df.groupby(['feature_1', 'feature_2'])['label'].ngroup()
print (df)
feature_1 feature_2 label cluster_index
1 0 5 A 0
2 0 5 B 0
3 4 1 B 1
4 4 1 D 1
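Both approaches above select feature combinations that occur more than once. If some combinations can repeat with the same label, a stricter variant is to count distinct labels per group. The following is a minimal sketch under that assumption; the two extra rows (7, 7, "C") are added here only to illustrate the difference and are not part of the question's data.

```python
import pandas as pd

df = pd.DataFrame({
    "feature_1": [0, 0, 0, 4, 4, 2, 7, 7],
    "feature_2": [0, 5, 5, 1, 1, 3, 7, 7],
    # the last two rows repeat the same features AND the same label
    "label": ["A", "A", "B", "B", "D", "A", "C", "C"],
})
feature_cols = ["feature_1", "feature_2"]

# Number of distinct labels within each feature combination, aligned to df's rows
n_labels = df.groupby(feature_cols)["label"].transform("nunique")

# Keep only groups whose labels actually disagree, dropping exact duplicate rows
conflicts = df.loc[n_labels.gt(1)].drop_duplicates().copy()
conflicts["cluster_index"] = conflicts.groupby(feature_cols).ngroup()
print(conflicts)
```

Here the (7, 7) rows are left out because their labels agree, while the (0, 5) and (4, 1) groups are kept and numbered 0 and 1.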

categorical variable panel ols

I would like to include categorical variables in my PanelOLS model.
This is my model:
import statsmodels.api as sm
from linearmodels.panel import PanelOLS
exog_vars = ['x1', 'x2', 'x3']
exog = sm.add_constant(df[exog_vars])
mod = PanelOLS(df.y, exog, entity_effects=True, time_effects=True)
result = mod.fit(cov_type='clustered', cluster_entity=True)
The categorical variable is a number identifying an industry. This number is stored in my dataframe (df['x4']).
Do you know how to include categorical variables? Or do you need more information to answer the question?
My dataframe:
I tried:
df['x4'] = pd.Categorical(gesamt.x4)
mod = PanelOLS(gesamt.CAR, exog, other_effects=df['x4'], entity_effects=True, time_effects=True)
The following error occurred:
raise ValueError('At most two effects supported.')
ValueError: At most two effects supported.
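The error occurs because entity_effects, time_effects and other_effects together add up to three effects, and PanelOLS supports at most two. The simplest way around this is probably to one-hot-encode your column x4.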
If you have
df = pd.DataFrame({'x1': [1,2,3], 'x4': ['bob', 'cat' ,'cat']})
which looks like
x1 x4
0 1 bob
1 2 cat
2 3 cat
pd.get_dummies(df, columns=['x4'], dtype=int)
gives you
x1 x4_bob x4_cat
0 1 1 0
1 2 0 1
2 3 0 1
Alternatively, if you want integer codes instead of dummy columns,
df['x4'] = pd.Categorical(df.x4).codes
will give you
x1 x4
0 1 0
1 2 1
2 3 1
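To connect the one-hot encoding back to the regressors, here is a minimal pandas-only sketch of building the exog frame. The industry codes and the choice of drop_first=True are assumptions for illustration: one dummy is dropped so the remaining ones are not perfectly collinear with a constant. Note also that if an entity's industry never changes over time, entity effects would absorb these dummies.

```python
import pandas as pd

# Hypothetical data: x1 is a regressor, x4 holds industry codes
df = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0],
    "x4": [10, 20, 20, 30],
})

# One dummy column per industry, dropping the first level as the baseline
industry = pd.get_dummies(df["x4"], prefix="x4", drop_first=True, dtype=float)

# Regressors plus industry dummies, ready to be passed on as exog
exog = pd.concat([df[["x1"]], industry], axis=1)
print(exog)
```

The resulting exog could then go through sm.add_constant and into PanelOLS in place of df[exog_vars], without using other_effects.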

Converting a pandas crosstab into a stacked dataframe (a regular table)

Given a pandas crosstab, how do you convert that into a stacked dataframe?
Assume you have a stacked dataframe and convert it into a crosstab. I would now like to revert to the original stacked dataframe. I searched for a question that addresses this requirement, but could not find one that matches exactly. In case I have missed any, please leave a link in the comment section.
I would like to document the best practice here. So, thank you for your support.
I know that pandas.DataFrame.stack() would be the best approach. But one needs to be careful about the "level" that stacking is applied to.
Input: Crosstab:
Label a b c d r
1 0 1 0 0 0
2 1 1 0 1 1
3 1 0 0 0 1
4 1 0 0 1 0
6 1 0 0 0 0
7 0 0 1 0 0
8 1 0 1 0 0
9 0 1 0 0 0
Output: Stacked DataFrame:
ID Label
0 1 b
1 2 a
2 2 b
3 2 d
4 2 r
5 3 a
6 3 r
7 4 a
8 4 d
9 6 a
10 7 c
11 8 a
12 8 c
13 9 b
Step-by-step Explanation:
First, let's make a function that would create our data. Note that it randomly generates the stacked dataframe, and so, the final output may differ from what I have given below.
Helper Function: Make the Stacked And Crosstab DataFrames
import numpy as np
import pandas as pd
# Make stacked dataframe
def _create_df():
    """This dataframe will be used to create a crosstab"""
    B = np.array(list('abracadabra'))
    A = np.arange(len(B))
    AB = list()
    for i in range(20):
        a = np.random.randint(1,10)
        b = np.random.randint(1,10)
        AB += [(a,b)]
    AB = np.unique(np.array(AB), axis=0)
    AB = np.unique(np.array(list(zip(A[AB[:,0]], B[AB[:,1]]))), axis=0)
    AB_df = pd.DataFrame({'ID': AB[:,0], 'Label': AB[:,1]})
    return AB_df

original_stacked_df = _create_df()
# Make crosstab
crosstab_df = pd.crosstab(original_stacked_df['ID'],
                          original_stacked_df['Label'])
What to expect?
You would expect a function that regenerates the stacked dataframe from the crosstab. I will provide my own solution to this in the answer section. If you can suggest something better, that would be great.
Other References:
Closest stackoverflow discussion: pandas stacking a dataframe
Misleading stackoverflow question-topic: change pandas crossstab dataframe into plain table format:
You can just do stack
The following produces the desired outcome.
def crosstab2stacked(crosstab):
    stacked = crosstab.stack().reset_index()
    # keep only the (ID, Label) pairs with a non-zero count
    stacked = stacked[stacked.replace(0,np.nan)[0].notnull()].drop(columns=[0])
    # reset the index so it lines up with the original dataframe
    return stacked.reset_index(drop=True)
# Make original dataframe
original_stacked_df = _create_df()
# Make crosstab dataframe
crosstab_df = pd.crosstab(original_stacked_df['ID'],
                          original_stacked_df['Label'])
# Reconstruct stacked dataframe
recon_stacked_df = crosstab2stacked(crosstab = crosstab_df)
Check if original == reconstructed:
(original_stacked_df == recon_stacked_df).all(axis=None)
Output: True
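For comparison, a minimal sketch of the same round trip using melt instead of stack. The small fixed dataframe and the helper column name n are assumptions chosen for illustration; like the function above, this treats every non-zero cell as a single (ID, Label) pair, so repeated pairs in the original would not be restored.

```python
import pandas as pd

# Small fixed stacked dataframe (hypothetical data)
original = pd.DataFrame({"ID": [1, 2, 2, 3], "Label": ["b", "a", "b", "a"]})
crosstab = pd.crosstab(original["ID"], original["Label"])

recon = (
    crosstab.reset_index()                                  # ID becomes a column
    .melt(id_vars="ID", var_name="Label", value_name="n")   # wide -> long
    .loc[lambda d: d["n"] > 0]                              # keep non-zero cells
    .sort_values(["ID", "Label"])
    .drop(columns="n")
    .reset_index(drop=True)
)
print(recon)
```

Sorting by ID and Label is what makes the row order match the original here, because the sample is already sorted that way.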

Seaborn Violin Plot from Pandas Dataframe, each column its own separate violin plot

I have Pandas Dataframe with structure:
0 1 1
1 2 1
2 3 4
3 3 7
4 6 8
How do I generate a Seaborn Violin plot with each column as its own separate violin plot for side-by-side comparison?
seaborn (at least, version 0.8.1; not sure if this is new) supports what you want without messing around with your dataframe at all:
import pandas as pd
import seaborn as sns
df = pd.DataFrame({'A': [1, 2, 3, 3, 6], 'B': [1, 1, 4, 7, 8]})
sns.violinplot(data=df)
(Note that you do need to set data=df; if you just pass in df as the first argument (equivalent to setting x=df in the function call), it seems like it concatenates the columns together and then makes a violin plot of all of the data)
You can first reshape with melt to turn the columns into groups, and then use seaborn.violinplot:
#old version of pandas
#df = pd.melt(df, var_name='groups', value_name='vals')
df = df.melt(var_name='groups', value_name='vals')
print (df)
groups vals
0 A 1
1 A 2
2 A 3
3 A 3
4 A 6
5 B 1
6 B 1
7 B 4
8 B 7
9 B 8
ax = sns.violinplot(x="groups", y="vals", data=df)