Two Seaborn plots on one twinx figure become distorted - pandas

I have seen variations of this question asked several times, but I don't think any of the answers I saw fixes it (other than "use matplotlib for combo-plots", and I'd appreciate help understanding why I should do that).
import pandas as pd
import seaborn as sns
df1 = pd.DataFrame({'height': {0: 161, 1: 173, 2: 168, 3: 185, 4: 163},
'year': {0: 2015, 1: 2016, 2: 2017, 3: 2018, 4: 2019}})
df2 = pd.DataFrame({'year': {0: 2015, 1: 2015, 2: 2016, 3: 2016, 4: 2017,
5: 2017, 6: 2018, 7: 2018, 8: 2019, 9: 2019},
'weight': {0: 64, 1: 81, 2: 82, 3: 83, 4: 66,
5: 71, 6: 84, 7: 91, 8: 99, 9: 94},
'sex': {0: 'M', 1: 'F', 2: 'M', 3: 'F', 4: 'M',
5: 'F', 6: 'M', 7: 'F', 8: 'M', 9: 'F'}})
ax = sns.barplot(x='year', y='weight', hue='sex', data=df2)
ax2 = ax.twinx()
sns.lineplot(x='year', y='height', data=df1, ax=ax2)
I expected this to be a textbook example of a comboplot, but the result is:
Why is that? Shouldn't the X axes simply converge and make a nice plot? Of course, each plot renders fine individually.

If you plot them separately and check their xlim, you can see seaborn shifts the bar plot's x values down to 0 (the years are displayed separately via xticklabels):
ax = sns.barplot(x='year', y='weight', hue='sex', data=df2)
print(ax.get_xlim())
print(ax.get_xticklabels())
# (-0.5, 4.5)
# [Text(0, 0, '2015'), Text(1, 0, '2016'), Text(2, 0, '2017'), Text(3, 0, '2018'), Text(4, 0, '2019')]
The line plot does not shift the x values and plots the years in the 2000s range:
ax = sns.lineplot(x='year', y='height', data=df1)
print(ax.get_xlim())
# (2014.8, 2019.2)
One workaround is to use reset_index() on the line plot's data and use x='index' to manually shift its x values to 0 to align with the bar plot:
g = sns.lineplot(x='index', y='height', data=df1.reset_index(), ax=ax2)
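The reset_index() workaround relies on df1 being sorted by year and containing exactly the years shown in the bar plot. A slightly more defensive sketch, assuming the bar plot draws its categories in sorted year order at positions 0, 1, 2, ..., maps each year to its bar position explicitly. The names bar_years, position, xpos and line_data are illustrative, not part of seaborn:

```python
import pandas as pd

# Line-plot data, deliberately out of order to show why an explicit mapping helps
df1 = pd.DataFrame({"year": [2017, 2015, 2019, 2016, 2018],
                    "height": [168, 161, 163, 173, 185]})

# Assumption: the bar plot's categories are the sorted unique years,
# drawn at x = 0, 1, 2, ...
bar_years = [2015, 2016, 2017, 2018, 2019]
position = {year: i for i, year in enumerate(bar_years)}

# Add the categorical x position of each year and sort by it
line_data = df1.assign(xpos=df1["year"].map(position)).sort_values("xpos")
print(line_data)
```

The line plot can then be drawn with x='xpos' and data=line_data on ax2, so each point lands on the bar group for its year.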

How to keep the number and names of columns in training and test dataset equal after one hot encoding?

Shape of the original dataset is 82580×30 with multiple string columns. Example dataset:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer
df = pd.DataFrame({'Nationality': {0: 'DEU', 1: 'PRT', 2: 'PRT', 3: 'PRT', 4: 'FRA', 5: 'DEU', 6: 'CHE', 7: 'DEU', 8: 'GBR', 9: 'AUT', 10: 'PRT', 11: 'FRA', 12: 'OTR', 13: 'GBR', 14: 'ESP', 15: 'PRT', 16: 'OTR', 17: 'PRT', 18: 'ESP', 19: 'AUT'},
'Age': {0: 27.0, 1: 45.46, 2: 45.46, 3: 58.0, 4: 57.0, 5: 27.0, 6: 49.0, 7: 62.0, 8: 44.0, 9: 61.0, 10: 54.0, 11: 53.0, 12: 50.0, 13: 30.0, 14: 51.0, 15: 45.46, 16: 40.0, 17: 49.0, 18: 49.0, 19: 14.0},
'DaysSinceCreation': {0: 370, 1: 213, 2: 206, 3: 1018, 4: 835, 5: 52, 6: 597, 7: 217, 8: 999, 9: 1004, 10: 402, 11: 879, 12: 393, 13: 923, 14: 249, 15: 52, 16: 159, 17: 929, 18: 49, 19: 131},
'BookingsCheckedIn': {0: 1, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1, 6: 1, 7: 2, 8: 1, 9: 1, 10: 1, 11: 1, 12: 1, 13: 1, 14: 1, 15: 0, 16: 0, 17: 1, 18: 1, 19: 0}})
# Encoding Variables
transformer = make_column_transformer((OneHotEncoder(sparse=False), ['Nationality']), remainder='passthrough')
transformed = transformer.fit_transform(df)
transformed_df = pd.DataFrame(transformed, columns=transformer.get_feature_names_out())
# Concat the two tables
transformed_df.reset_index(drop=True, inplace=True)
df.reset_index(drop=True, inplace=True)
df = pd.concat([transformed_df, df], axis=1)
# Remove old columns
df.drop(['Nationality'], axis = 1, inplace = True)
print('The shape after encoding: {}'.format(df.shape))
print(df.columns.unique())
The shape after encoding: (20, 14)
Index(['onehotencoder__Nationality_AUT', 'onehotencoder__Nationality_CHE',
'onehotencoder__Nationality_DEU', 'onehotencoder__Nationality_ESP',
'onehotencoder__Nationality_FRA', 'onehotencoder__Nationality_GBR',
'onehotencoder__Nationality_OTR', 'onehotencoder__Nationality_PRT',
'remainder__Age', 'remainder__DaysSinceCreation',
'remainder__BookingsCheckedIn', 'Age', 'DaysSinceCreation',
'BookingsCheckedIn'],
dtype='object')
After modeling, trying to test on a completely different test set:
df = pd.DataFrame({'Nationality': {0: 'CAN', 1: 'DEU', 2: 'PRT', 3: 'PRT', 4: 'FRA'},
'Age': {0: 27.0, 1: 29.0, 2: 24.0, 3: 24.0, 4: 46.0},
'DaysSinceCreation': {0: 222, 1: 988, 2: 212, 3: 685, 4: 1052},
'BookingsCheckedIn': {0: 0, 1: 1, 2: 1, 3: 1, 4: 0}})
# Encoding Variables
transformer = make_column_transformer(
(OneHotEncoder(sparse=False), ['Nationality']),
remainder='passthrough')
transformed = transformer.fit_transform(df)
transformed_df = pd.DataFrame(transformed, columns=transformer.get_feature_names_out())
# Concat the two tables
transformed_df.reset_index(drop=True, inplace=True)
df.reset_index(drop=True, inplace=True)
df = pd.concat([transformed_df, df], axis=1)
# Remove old columns
df.drop(['Nationality'], axis = 1, inplace = True)
print('The shape after encoding: {}'.format(df.shape))
print(df.columns.unique())
The shape after encoding: (5, 10)
Index(['onehotencoder__Nationality_CAN', 'onehotencoder__Nationality_DEU',
'onehotencoder__Nationality_FRA', 'onehotencoder__Nationality_PRT',
'remainder__Age', 'remainder__DaysSinceCreation',
'remainder__BookingsCheckedIn', 'Age', 'DaysSinceCreation',
'BookingsCheckedIn'],
dtype='object')
As can be seen, the test dataset has some features that were not present in the original training set, and many features of the training set are missing from the test set. If I only use .values of X_train, y_train, X_test, y_test, I can run anything from logistic regression to a neural net with >99% accuracy, but that feels like cheating and it does not work with decision trees. How do we deal with this?
I would like to contribute two points:
(1) The categories in the test set should be a subset of those seen in the training set, so the unseen Nationality 'CAN' is a problem. Either include 'CAN' in the training data, or replace it in the test data (for example with 'GBR').
(2) You should not call fit_transform() separately on the training and test sets. The right way is to fit on the training set only, then transform the training set and transform the test set with the same fitted transformer. To illustrate:
# Encoding Variables
transformer = make_column_transformer((OneHotEncoder(sparse=False), ['Nationality']), remainder='passthrough')
####transformed = transformer.fit_transform(df) #delete this
transformer.fit(df) #use this instead
transformed = transformer.transform(df) #use this instead
transformed_df = pd.DataFrame(transformed, columns=transformer.get_feature_names_out())
# Concat the two tables
<truncated>
print('The shape after encoding: {}'.format(df.shape))
The shape after encoding: (20, 14)
Second part, note that I have replaced 'CAN' with 'GBR'. And only use the previously fitted transformer to transform the test set:
df = pd.DataFrame({'Nationality': {0: 'GBR', 1: 'DEU', 2: 'PRT', 3: 'PRT', 4: 'FRA'},
'Age': {0: 27.0, 1: 29.0, 2: 24.0, 3: 24.0, 4: 46.0},
'DaysSinceCreation': {0: 222, 1: 988, 2: 212, 3: 685, 4: 1052},
'BookingsCheckedIn': {0: 0, 1: 1, 2: 1, 3: 1, 4: 0}})
# Encoding Variables
####transformer = make_column_transformer((OneHotEncoder(sparse=False), ['Nationality']), remainder='passthrough') #do not repeat, use the previous fitted model
####transformed = transformer.fit_transform(df) #delete this, NO fitting on test set
transformed = transformer.transform(df) #only do transform on test set
transformed_df = pd.DataFrame(transformed, columns=transformer.get_feature_names_out())
# Concat the two tables
<truncated>
print('The shape after encoding: {}'.format(df.shape))
The shape after encoding: (5, 14)
So the number of columns (14) is the same for both the training set and the test set.

text showing up in hoverinfo not just displayed

I'm trying to add data labels so you can see the value of each stack when looking at the graph. I added the text option with the column I want displayed, but it only shows up in the hover information instead of being displayed on the graph. How do I change this?
df2 = pd.DataFrame.from_dict({'Country': {0: 'Europe',
1: 'America',
2: 'Asia',
3: 'Europe',
4: 'America',
5: 'Asia',
6: 'Europe',
7: 'America',
8: 'Asia',
9: 'Europe',
10: 'America',
11: 'Asia'},
'Year': {0: 2014,
1: 2014,
2: 2014,
3: 2015,
4: 2015,
5: 2015,
6: 2016,
7: 2016,
8: 2016,
9: 2017,
10: 2017,
11: 2017},
'Amount': {0: 1600,
1: 410,
2: 150,
3: 1300,
4: 300,
5: 170,
6: 1000,
7: 500,
8: 200,
9: 900,
10: 500,
11: 210}})
fig = go.Figure()
x = []
for i in df2['Year'].unique():
    x.append(str(i))
for c in df2['Country'].unique():
    df3 = df2[df2['Country'] == c]
    fig.add_trace(go.Bar(x=x, y=df3['Amount'], name=c, text=df3['Amount']))
fig.update_layout(title="Personnel at Work",
barmode='stack',
title_x=.5,
yaxis={
'showgrid':False,
'visible':False
},
xaxis=dict(
tick0=0,
dtick=1,
),
plot_bgcolor='rgba(0,0,0,0)')
fig.show()
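I had a similar problem and this line of code helped me. I'm not sure it will help in your case, but give it a try (text refers to the values passed via the text argument; swap in another variable such as y if that holds your labels):
fig.update_traces(texttemplate='%{text:.1f}', textposition='outside')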
Go through all the use cases here,
https://plotly.com/python/text-and-annotations/

Match coloring of slices for series of pandas pie charts

I have a pandas dataframe that looks like this :
df = pd.DataFrame( {'Judge': {0: 1, 1: 1, 2: 1, 3: 2, 4: 2, 5: 2, 6: 3, 7: 3, 8: 3}, 'Category': {0: 'A', 1: 'B', 2: 'C', 3: 'A', 4: 'B', 5: 'C', 6: 'A', 7: 'B', 8: 'C'}, 'Rating': {0: 'Excellent', 1: 'Very Good', 2: 'Good', 3: 'Very Good', 4: 'Very Good', 5: 'Very Good', 6: 'Excellent', 7: 'Very Good', 8: 'Excellent'}} )
I'm plotting a pie chart to show the ratings of each judge like this:
grouped = df.groupby('Judge')
for group in grouped:
    group[1].Rating.value_counts().plot(kind='pie', autopct="%1.1f%%")
    plt.legend(group[1].Rating.value_counts().index.values, loc="upper right")
    plt.title('Judge ' + str(group[0]))
    plt.axis('equal')
    plt.ylabel('')
    plt.tight_layout()
    plt.show()
Unfortunately, the colors of the slices are different for each judge. For example, Judge 1's "Excellent" slice is blue where Judge 2's "Very Good" slice is blue.
How can I enforce slice color consistency from plot to plot?
I think you can unstack and plot:
axes = (df.groupby('Judge').Rating.value_counts()
.unstack('Judge')
.plot.pie(subplots=True, figsize=(6,6), layout=(2,2))
)
# do something with the axes
for ax in axes.ravel():
    pass
Output:
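If you would rather keep one figure per judge, another option is to fix a rating-to-color mapping up front and build the colors list for each pie from it. This is a minimal sketch; the color_map values are arbitrary choices and the Agg backend is selected only so the example runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; drop it for interactive use
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "Judge": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "Category": ["A", "B", "C"] * 3,
    "Rating": ["Excellent", "Very Good", "Good",
               "Very Good", "Very Good", "Very Good",
               "Excellent", "Very Good", "Excellent"],
})

# Arbitrary but fixed color for every rating
color_map = {"Excellent": "tab:green", "Very Good": "tab:blue", "Good": "tab:orange"}

slice_colors = {}
for judge, grp in df.groupby("Judge"):
    counts = grp["Rating"].value_counts()
    # Look up each slice's color by its label, not by its position
    slice_colors[judge] = [color_map[rating] for rating in counts.index]
    fig, ax = plt.subplots()
    counts.plot(kind="pie", autopct="%1.1f%%", colors=slice_colors[judge], ax=ax)
    ax.set_title(f"Judge {judge}")
    ax.set_ylabel("")
    plt.close(fig)  # replace with plt.show() when running interactively
```

Because the color is looked up by label, "Very Good" is the same color in every chart even when it is the only slice.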

ggplot/plotnine - adding a legend from geom_text() with specific color

I have this dataframe:
df = pd.DataFrame({'Segment': {0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'A', 5: 'B', 6: 'C', 7: 'D'},
'Average': {0: 55341, 1: 55159, 2: 55394, 3: 56960, 4: 55341, 5: 55159, 6: 55394, 7: 56960},
'Order': {0: 0, 1: 1, 2: 2, 3: 3, 4: 0, 5: 1, 6: 2, 7: 3},
'Variable': {0: 'None', 1: 'None', 2: 'None', 3: 'None', 4: 'One', 5: 'One', 6: 'One', 7: 'One'},
'$': {0: 40.6, 1: 18.2, 2: 78.5, 3: 123.3, 4: 42.4, 5: 24.2, 6: 89.7, 7: 144.1},
'ypos': {0: 96.0, 1: 55.4, 2: 181.2, 3: 280.4, 4: 96.0, 5: 55.4, 6: 181.2, 7: 280.4},
'yticks': {0: 20.3,1: 9.1,2: 39.25,3: 61.65,4: 21.2,5: 12.1,6: 44.85,7: 72.05}})
With it, I plot this:
(ggplot(df, aes(x="Segment", y="$", ymin=0, ymax=300, fill="Variable"))
+ geom_col(position = position_stack(reverse = True), alpha=0.7)
+ geom_text(aes(x = "Segment", y = "ypos", label = "Average"), size=8, format_string="Average: \n ${:,.0f} CLP")
+ geom_text(aes(label = "$"), show_legend=True, position=position_stack(vjust = 0.5), size=8, format_string="%s"%(u"\N{dollar sign}{:,.0f} MM"))
)
I have been looking for a way to add a legend entry for Average, after which I will delete the word 'Average' from the bars and leave just the number. For this to be understandable, the additional legend entry should be the same color as the Average values (it could be yellow, orange, or any other color, but not red or sky blue, as those are already in use).
You can just add color as a variable to geom_text :
import plotnine
from plotnine import ggplot, geom_col, aes, position_stack, geom_text, scale_color_brewer, guides, guide_legend
(ggplot(df, aes(x="Segment", y="$", ymin=0, ymax=300, fill="Variable"))
+ geom_col(position = position_stack(reverse = True), alpha=0.7)
+ geom_text(aes(y = "ypos",color="Segment",label = "Average"), size=8,
show_legend=True,format_string="${:,.0f} CLP")
+ geom_text(aes(label = "$"), show_legend=True, position=position_stack(vjust = 0.5),
size=8, format_string="%s"%(u"\N{dollar sign}{:,.0f} MM"))
+ scale_color_brewer(type='qual', palette=2)
+ guides(color=guide_legend(title="Averages"))
)

Apply np.average in pandas pivot aggfunc

I am trying to calculate weighted average prices using pandas pivot table.
I have tried passing in a dictionary using aggfunc.
This does not work when passed into aggfunc, although it should calculate the correct weighted average.
'Price': lambda x: np.average(x, weights=df['Balance'])
I have also tried using a manual groupby:
df.groupby('Product').agg({
'Balance': sum,
'Price': lambda x : np.average(x, weights='Balance'),
'Value': sum
})
This also yields the error:
TypeError: Axis must be specified when shapes of a and weights differ.
Here is sample data
import pandas as pd
import numpy as np
price_dict = {'Product': {0: 'A',
1: 'A',
2: 'A',
3: 'A',
4: 'A',
5: 'B',
6: 'B',
7: 'B',
8: 'B',
9: 'B',
10: 'C',
11: 'C',
12: 'C',
13: 'C',
14: 'C'},
'Balance': {0: 10,
1: 20,
2: 30,
3: 40,
4: 50,
5: 60,
6: 70,
7: 80,
8: 90,
9: 100,
10: 110,
11: 120,
12: 130,
13: 140,
14: 150},
'Price': {0: 1,
1: 2,
2: 3,
3: 4,
4: 5,
5: 6,
6: 7,
7: 8,
8: 9,
9: 10,
10: 11,
11: 12,
12: 13,
13: 14,
14: 15},
'Value': {0: 10,
1: 40,
2: 90,
3: 160,
4: 250,
5: 360,
6: 490,
7: 640,
8: 810,
9: 1000,
10: 1210,
11: 1440,
12: 1690,
13: 1960,
14: 2250}}
Try to calculate weighted average by passing dict into aggfunc:
df = pd.DataFrame(price_dict)
df.pivot_table(
index='Product',
aggfunc = {
'Balance': sum,
'Price': np.mean,
'Value': sum
}
)
Output:
         Balance  Price  Value
Product
A            150      3    550
B            400      8   3300
C            650     13   8550
The expected outcome should be :
         Balance  Price  Value
Product
A            150   3.66    550
B            400   8.25   3300
C            650  13.15   8550
Here is one way using apply
df.groupby('Product').apply(lambda x : pd.Series(
{'Balance': x['Balance'].sum(),
'Price': np.average(x['Price'], weights=x['Balance']),
'Value': x['Value'].sum()}))
Out[57]:
         Balance      Price   Value
Product
A          150.0   3.666667   550.0
B          400.0   8.250000  3300.0
C          650.0  13.153846  8550.0
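The attempts in the question fail because the weights do not line up with the group: weights='Balance' is a plain string, and df['Balance'] has all 15 rows while each group has only 5. A possible variation, sketched below with the same sample values, selects the matching weights through the group's index and keeps the integer dtypes of the sums. The helper name weighted_price is illustrative, and the cross-check assumes Value always equals Balance * Price, as it does in the sample data:

```python
import numpy as np
import pandas as pd

# Same values as the sample data, built compactly
df = pd.DataFrame({
    "Product": list("AAAAABBBBBCCCCC"),
    "Balance": range(10, 160, 10),
    "Price": range(1, 16),
})
df["Value"] = df["Balance"] * df["Price"]

def weighted_price(prices):
    # prices is one group's Price Series; its index picks that group's Balance rows
    return np.average(prices, weights=df.loc[prices.index, "Balance"])

result = df.groupby("Product").agg(
    Balance=("Balance", "sum"),
    Price=("Price", weighted_price),
    Value=("Value", "sum"),
)

# Cross-check: total value divided by total balance gives the same weighted price
check = result["Value"] / result["Balance"]
print(result)
```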