Get and graph groupby result distribution of a column - pandas

I want to graph the distribution of a label column within each group. I managed to do it by creating dummies, building a pivot table for each group, and then looping to assemble a new dataframe.
I am looking for a shorter way, perhaps with more advanced groupby methods.
I also don't know how to create a side-by-side bar chart instead of the stacked bar chart I have here.
To recreate the dataframe:
import pandas as pd
import numpy as np
np.random.seed(1)
a = np.random.choice(['region_A', 'region_B', 'region_C', 'region_D', 'region_E'], size=30, p=
[0.1, 0.2, 0.3, 0.30, 0.1])
b = np.random.choice(['1', '0'], size=30, p=[0.5, 0.5])
df = pd.DataFrame({'region': a, 'label': b})
My desired graph:
dummy = pd.get_dummies(df['region'])
region_lst = []
label_0 = []
label_1 = []
for col in dummy.columns:
    region_lst.append(col)
    label_0.append(pd.crosstab(dummy[col], df['label']).iloc[1,0])
    label_1.append(pd.crosstab(dummy[col], df['label']).iloc[1,1])
df_labels = pd.DataFrame({'label_0': label_0, 'label_1': label_1}, index=region_lst)
df_labels.plot.bar()

Use crosstab with DataFrame.add_prefix for the same output as your longer code:
pd.crosstab(df['region'], df['label']).add_prefix('label_').plot.bar()
Details:
df_labels = pd.crosstab(df['region'], df['label']).add_prefix('label_')
print (df_labels)
label     label_0  label_1
region
region_A        2        3
region_B        3        3
region_C        5        4
region_D        3        6
region_E        1        0
To remove the label and region axis names:
df_labels = (pd.crosstab(df['region'], df['label'])
               .add_prefix('label_')
               .rename_axis(index=None, columns=None))
print (df_labels)
          label_0  label_1
region_A        2        3
region_B        3        3
region_C        5        4
region_D        3        6
region_E        1        0

You can use a crosstab:
pd.crosstab(df['region'], df['label']).plot.bar()
output:
intermediate crosstab:
label     0  1
region
region_A  2  3
region_B  3  3
region_C  5  4
region_D  3  6
region_E  1  0
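Since the question asks about groupby specifically, here is a minimal sketch of a groupby-based equivalent, assuming the same df as in the question. The name counts is introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# Recreate the question's dataframe
np.random.seed(1)
a = np.random.choice(
    ['region_A', 'region_B', 'region_C', 'region_D', 'region_E'],
    size=30, p=[0.1, 0.2, 0.3, 0.3, 0.1],
)
b = np.random.choice(['1', '0'], size=30, p=[0.5, 0.5])
df = pd.DataFrame({'region': a, 'label': b})

# Count labels within each region, then move the label level into columns
counts = (
    df.groupby('region')['label']
      .value_counts()
      .unstack(fill_value=0)   # region/label pairs that never occur become 0
      .add_prefix('label_')
)
print(counts)
```

Calling counts.plot.bar() draws the bars side by side, which is the default; counts.plot.bar(stacked=True) gives the stacked version instead.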

Related

Creating a dataframe using roll-forward window on multivariate time series

Based on the simplified sample dataframe
import pandas as pd
import numpy as np
timestamps = pd.date_range(start='2017-01-01', end='2017-01-5', inclusive='left')
values = np.arange(0,len(timestamps))
df = pd.DataFrame({'A': values ,'B' : values*2},
index = timestamps )
print(df)
            A  B
2017-01-01  0  0
2017-01-02  1  2
2017-01-03  2  4
2017-01-04  3  6
I want to use a roll-forward window of size 2 with a stride of 1 to create a resulting dataframe like
     timestep_1  timestep_2  target
0 A           0           1       2
  B           0           2       4
1 A           1           2       3
  B           2           4       6
I.e., each window step should create a data item with the two values of A and B in this window and the A and B values immediately to the right of the window as target values.
My first idea was to use pandas
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rolling.html
But that seems to only work in combination with aggregate functions such as sum, which is a different use case.
Any ideas on how to implement this rolling-window-based sampling approach?
Here is one way to do it:
window_size = 3
new_df = pd.concat(
    [
        df.iloc[i : i + window_size, :]
        .T.reset_index()
        .assign(other_index=i)
        .set_index(["other_index", "index"])
        .set_axis([f"timestep_{j}" for j in range(1, window_size)] + ["target"], axis=1)
        for i in range(df.shape[0] - window_size + 1)
    ]
)
new_df.index.names = ["", ""]
print(new_df)
# Output
     timestep_1  timestep_2  target
0 A           0           1       2
  B           0           2       4
1 A           1           2       3
  B           2           4       6
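As an alternative to concatenating slices, the same windows can be taken from the underlying array in one step. This is a minimal sketch, assuming NumPy 1.20 or later (for sliding_window_view) and the sample dataframe from the question; the dataframe is rebuilt here with periods=4 so the example is self-contained.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

df = pd.DataFrame(
    {'A': np.arange(4), 'B': np.arange(4) * 2},
    index=pd.date_range('2017-01-01', periods=4),
)

window_size = 3  # two timesteps plus one target value

# Shape (n_windows, n_columns, window_size): one slice per window and column
windows = sliding_window_view(df.to_numpy(), window_size, axis=0)
n_windows, n_cols, _ = windows.shape

new_df = pd.DataFrame(
    windows.reshape(n_windows * n_cols, window_size),
    index=pd.MultiIndex.from_product([range(n_windows), df.columns]),
    columns=[f'timestep_{j}' for j in range(1, window_size)] + ['target'],
)
print(new_df)
```

Because the windows are views on the original array until the reshape, this avoids building one small dataframe per window, which may matter for long series.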

pandas finding duplicate rows with different label

I want to sanity check labeled data. I have hundreds of features and want to find points that have the same features but different labels. The clusters of disagreeing labels found this way should then be numbered and put into a new dataframe.
This isn't hard, but I am wondering what the most elegant solution is.
Here is an example:
import pandas as pd
df = pd.DataFrame({
"feature_1" : [0,0,0,4,4,2],
"feature_2" : [0,5,5,1,1,3],
"label" : ["A","A","B","B","D","A"]
})
result_df = pd.DataFrame({
"cluster_index" : [0,0,1,1],
"feature_1" : [0,0,4,4],
"feature_2" : [5,5,1,1],
"label" : ["A","B","B","D"]
})
In order to get the output you want (both de-duplication and cluster_index), you can use a groupby approach:
g = df.groupby(['feature_1', 'feature_2'])['label']
(df.assign(cluster_index=g.ngroup()) # get group name
.loc[g.transform('size').gt(1)] # filter the non-duplicates
# line below only to have a nice cluster_index range (0,1…)
.assign(cluster_index= lambda d: d['cluster_index'].factorize()[0])
)
output:
   feature_1  feature_2 label  cluster_index
1          0          5     A              0
2          0          5     B              0
3          4          1     B              1
4          4          1     D              1
First get all rows whose feature columns are duplicated, then, if necessary, drop rows that are duplicated across all columns (not needed for the sample data), and finally add GroupBy.ngroup for the group indices:
df = df[df.duplicated(['feature_1','feature_2'],keep=False)].drop_duplicates()
df['cluster_index'] = df.groupby(['feature_1', 'feature_2'])['label'].ngroup()
print (df)
   feature_1  feature_2 label  cluster_index
1          0          5     A              0
2          0          5     B              0
3          4          1     B              1
4          4          1     D              1
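Both approaches above select rows whose features repeat. If the features can also repeat with the same label, a stricter filter may be wanted. The following is a minimal sketch, assuming the original df from the question, that keeps only feature combinations carrying more than one distinct label; the names features, conflicting and result are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "feature_1": [0, 0, 0, 4, 4, 2],
    "feature_2": [0, 5, 5, 1, 1, 3],
    "label": ["A", "A", "B", "B", "D", "A"],
})

features = ["feature_1", "feature_2"]

# True for rows whose feature combination has at least two different labels
conflicting = df.groupby(features)["label"].transform("nunique").gt(1)

result = df.loc[conflicting].copy()
# Number the remaining feature combinations 0, 1, ...
result["cluster_index"] = result.groupby(features).ngroup()
print(result)
```

On the sample data this selects the same four rows as the answers above, because every repeated feature combination there also has conflicting labels.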

categorical variable panel ols

I would like to include categorical variables in my PanelOLS model.
This is my model:
import statsmodels.api as sm
from linearmodels.panel import PanelOLS
exog_vars = ['x1', 'x2', 'x3']
exog = sm.add_constant(df[exog_vars])
mod = PanelOLS(df.y, exog, entity_effects=True, time_effects=True)
result = mod.fit(cov_type='clustered', cluster_entity=True)
The categorical variable is an industry number. This number is stored in my dataframe (df['x4']).
Do you know how to include categorical variables? Or do you need more information to answer the question?
My dataframe:
I tried:
df['x4'] = pd.Categorical(gesamt.x4)
mod = PanelOLS(gesamt.CAR, exog, other_effects=df['x4'], entity_effects=True, time_effects=True)
The following error occurred:
raise ValueError('At most two effects supported.')
ValueError: At most two effects supported.
The simplest way to do this is probably to one-hot-encode your column x4.
If you have
df = pd.DataFrame({'x1': [1,2,3], 'x4': ['bob', 'cat' ,'cat']})
df
which looks like
x1 x4
0 1 bob
1 2 cat
2 3 cat
then
pd.get_dummies(df, columns=['x4'], dtype=int)
gives you
x1 x4_bob x4_cat
0 1 1 0
1 2 0 1
2 3 0 1
Alternatively,
df['x4'] = pd.Categorical(df.x4).codes
df
will give you
x1 x4
0 1 0
1 2 1
2 3 1
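A minimal sketch of how the one-hot columns could be joined to the other regressors before building exog, using made-up data. drop_first=True is an assumption added here: it drops one industry as the reference category so that the dummies are not collinear with the constant added by sm.add_constant. Note also that Categorical(...).codes yields a single numeric column, so a model would treat the industry codes as an ordered quantity rather than as separate categories.

```python
import pandas as pd

# Hypothetical data: x4 holds an industry number
df = pd.DataFrame({
    'x1': [1.0, 2.0, 3.0, 4.0],
    'x4': [10, 20, 20, 30],
})

# One column per industry except the first, which serves as the baseline
industry = pd.get_dummies(df['x4'], prefix='x4', drop_first=True, dtype=float)

# Regressor matrix: the numeric variables plus the industry dummies
exog = pd.concat([df[['x1']], industry], axis=1)
print(exog)
```

The resulting exog can then be passed through sm.add_constant and into PanelOLS in place of the original df[exog_vars].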

Converting a pandas crosstab into a stacked dataframe (a regular table)

Given a pandas crosstab, how do you convert that into a stacked dataframe?
Assume you have a stacked dataframe and convert it into a crosstab. Now I would like to convert it back to the original stacked dataframe. I searched for a question that addresses this requirement but could not find one that matches exactly. If I have missed any, please leave a note in the comment section.
I would like to document the best practice here. So, thank you for your support.
I know that pandas.DataFrame.stack() would be the best approach, but one needs to be careful about the "level" that stacking is applied to.
Input: Crosstab:
Label a b c d r
ID
1 0 1 0 0 0
2 1 1 0 1 1
3 1 0 0 0 1
4 1 0 0 1 0
6 1 0 0 0 0
7 0 0 1 0 0
8 1 0 1 0 0
9 0 1 0 0 0
Output: Stacked DataFrame:
ID Label
0 1 b
1 2 a
2 2 b
3 2 d
4 2 r
5 3 a
6 3 r
7 4 a
8 4 d
9 6 a
10 7 c
11 8 a
12 8 c
13 9 b
Step-by-step Explanation:
First, let's make a function that would create our data. Note that it randomly generates the stacked dataframe, and so, the final output may differ from what I have given below.
Helper Function: Make the Stacked And Crosstab DataFrames
import numpy as np
import pandas as pd
# Make stacked dataframe
def _create_df():
    """
    This dataframe will be used to create a crosstab
    """
    B = np.array(list('abracadabra'))
    A = np.arange(len(B))
    AB = list()
    for i in range(20):
        a = np.random.randint(1,10)
        b = np.random.randint(1,10)
        AB += [(a,b)]
    AB = np.unique(np.array(AB), axis=0)
    AB = np.unique(np.array(list(zip(A[AB[:,0]], B[AB[:,1]]))), axis=0)
    AB_df = pd.DataFrame({'ID': AB[:,0], 'Label': AB[:,1]})
    return AB_df
original_stacked_df = _create_df()
# Make crosstab
crosstab_df = pd.crosstab(original_stacked_df['ID'],
original_stacked_df['Label']).reindex()
What to expect?
You would expect a function that regenerates the stacked dataframe from the crosstab. I will provide my own solution in the answer section. If you can suggest something better, that would be great.
Other References:
Closest stackoverflow discussion: pandas stacking a dataframe
Misleading stackoverflow question-topic: change pandas crossstab dataframe into plain table format:
You can just do stack
df[df.astype(bool)].stack().reset_index().drop(columns=0)
The following produces the desired outcome.
def crosstab2stacked(crosstab):
    stacked = crosstab.stack(dropna=True).reset_index()
    stacked = stacked[stacked.replace(0,np.nan)[0].notnull()].drop(columns=[0])
    return stacked
# Make original dataframe
original_stacked_df = _create_df()
# Make crosstab dataframe
crosstab_df = pd.crosstab(original_stacked_df['ID'],
original_stacked_df['Label']).reindex()
# Reconstruct stacked dataframe
recon_stacked_df = crosstab2stacked(crosstab = crosstab_df)
Check if original == reconstructed:
np.all(original_stacked_df == recon_stacked_df)
Output: True
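The same round trip can also be sketched with melt instead of stack, which sidesteps the question of which level is stacked. This is a minimal sketch on a small hand-written dataframe (not the random one above); the column name count and the variable names are introduced here for illustration.

```python
import pandas as pd

# Small stacked dataframe, already sorted by ID and Label
stacked_in = pd.DataFrame({
    'ID': [1, 2, 2, 3],
    'Label': ['b', 'a', 'b', 'a'],
})
crosstab_df = pd.crosstab(stacked_in['ID'], stacked_in['Label'])

# Turn every (ID, Label) cell of the crosstab into one row
long_df = crosstab_df.reset_index().melt(
    id_vars='ID', var_name='Label', value_name='count'
)

# Keep only the pairs that actually occur and restore the original ordering
recon = (
    long_df[long_df['count'] > 0]
    .drop(columns='count')
    .sort_values(['ID', 'Label'])
    .reset_index(drop=True)
)
print(recon)
```

Resetting the index at the end matters for any later element-wise comparison with the original, since pandas only compares dataframes whose labels match.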

Seaborn Violin Plot from Pandas Dataframe, each column its own separate violin plot

I have Pandas Dataframe with structure:
A B
0 1 1
1 2 1
2 3 4
3 3 7
4 6 8
How do I generate a Seaborn Violin plot with each column as its own separate violin plot for side-by-side comparison?
seaborn (at least, version 0.8.1; not sure if this is new) supports what you want without messing around with your dataframe at all:
import pandas as pd
import seaborn as sns
df = pd.DataFrame({'A': [1, 2, 3, 3, 6], 'B': [1, 1, 4, 7, 8]})
sns.violinplot(data=df)
(Note that you do need to set data=df; if you just pass in df as the first argument (equivalent to setting x=df in the function call), it seems like it concatenates the columns together and then makes a violin plot of all of the data)
You can first reshape with melt so that the column names become a group column, and then call seaborn.violinplot:
#old version of pandas
#df = pd.melt(df, var_name='groups', value_name='vals')
df = df.melt(var_name='groups', value_name='vals')
print (df)
groups vals
0 A 1
1 A 2
2 A 3
3 A 3
4 A 6
5 B 1
6 B 1
7 B 4
8 B 7
9 B 8
ax = sns.violinplot(x="groups", y="vals", data=df)