Seaborn swarmplot marker colors - pandas

I have my data stored in the following pandas dataframe df.
I am using the following command to plot a box and swarmplot superimposed.
hue_plot_params = {'data': df,'x': 'Genotype','y': 'centroid velocity','hue': 'Surface'}
sns.boxplot(**hue_plot_params)
sns.swarmplot(**hue_plot_params)
However, there is one more category 'Fly' which I would also like to plot as different marker colors in my swarmplot. As the data in each column is made up of multiple 'Fly' types, I would like to have multiple colors for the corresponding markers instead of blue and orange.
(ignore the significance bars, those have been made with another command)
Could someone help me with how to do this?

The following approach iterates through the 'Fly' categories. A swarmplot needs to be created in one go, but a stripplot could be drawn one by one on top of each other.
'Surface' is still used as hue to get them plotted as "dodged", but the palette uses the color for the fly category two times.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
sns.set()
np.random.seed(2022)
df = pd.DataFrame({'Genotype': np.repeat(['SN1_TNT', 'SN2_TNT', 'SN3_TNT', 'SN4_TNT'], 100),
'centroid velocity': 20 - np.random.normal(0.01, 1, 400).cumsum(),
'Surface': np.tile(np.repeat(['smooth', 'rough'], 50), 4),
'Fly': np.random.choice(['Fly1', 'Fly2', 'Fly3'], 400)})
hue_plot_params = {'x': 'Genotype', 'y': 'centroid velocity', 'hue': 'Surface', 'hue_order': ['smooth', 'rough']}
ax = sns.boxplot(data=df, **hue_plot_params)
handles, _ = ax.get_legend_handles_labels() # get the legend handles for the boxplots
fly_list = sorted(df['Fly'].unique())
for fly, color in zip(fly_list, sns.color_palette('pastel', len(fly_list))):
sns.stripplot(data=df[df['Fly'] == fly], **hue_plot_params, dodge=True,
palette=[color, color], label=fly, ax=ax)
strip_handles, _ = ax.get_legend_handles_labels()
strip_handles[-1].set_label(fly)
handles.append(strip_handles[-1])
ax.legend(handles=handles)
plt.show()

Related

Equivalent of Hist()'s Layout hyperparameter in Sns.Pairplot?

Am trying to find hist()'s figsize and layout parameter for sns.pairplot().
I have a pairplot that gives me nice scatterplots between the X's and y. However, it is oriented horizontally and there is no equivalent layout parameter to make them vertical to my knowledge. 4 plots per row would be great.
This is my current sns.pairplot():
sns.pairplot(X_train,
x_vars = X_train.select_dtypes(exclude=['object']).columns,
y_vars = ["SalePrice"])
This is what I would like it to look like: Source
num_mask = train_df.dtypes != object
num_cols = train_df.loc[:, num_mask[num_mask == True].keys()]
num_cols.hist(figsize = (30,15), layout = (4,10))
plt.show()
What you want to achieve isn't currently supported by sns.pairplot, but you can use one of the other figure-level functions (sns.displot, sns.catplot, ...). sns.lmplot creates a grid of scatter plots. For this to work, the dataframe needs to be in "long form".
Here is a simple example. sns.lmplot has parameters to leave out the regression line (fit_reg=False), to set the height of the individual subplots (height=...), to set its aspect ratio (aspect=..., where the subplot width will be height times aspect ratio), and many more. If all y ranges are similar, you can use the default sharey=True.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
# create some test data with different y-ranges
np.random.seed(20230209)
X_train = pd.DataFrame({"".join(np.random.choice([*'uvwxyz'], np.random.randint(3, 8))):
np.random.randn(100).cumsum() + np.random.randint(100, 1000) for _ in range(10)})
X_train['SalePrice'] = np.random.randint(10000, 100000, 100)
# convert the dataframe to long form
# 'SalePrice' will get excluded automatically via `melt`
compare_columns = X_train.select_dtypes(exclude=['object']).columns
long_df = X_train.melt(id_vars='SalePrice', value_vars=compare_columns)
# create a grid of scatter plots
g = sns.lmplot(data=long_df, x='SalePrice', y='value', col='variable', col_wrap=4, sharey=False)
g.set(ylabel='')
plt.show()
Here is another example, with histograms of the mpg dataset:
import matplotlib.pyplot as plt
import seaborn as sns
mpg = sns.load_dataset('mpg')
compare_columns = mpg.select_dtypes(exclude=['object']).columns
mpg_long = mpg.melt(value_vars=compare_columns)
g = sns.displot(data=mpg_long, kde=True, x='value', common_bins=False, col='variable', col_wrap=4, color='crimson',
facet_kws={'sharex': False, 'sharey': False})
g.set(xlabel='')
plt.show()

Pandas plot of a stacked and grouped bar chart

I have a CSV-file with a couple of data:
# Comment header
#
MainCategory,SubCategory,DurationM,DurationH,Number
MainCat1,Sub1.1,598,9.97,105
MainCat1,Sub1.2,11,0.18,4
MainCat1,Sub1.3,17,0.28,5
MainCat1,Sub1.4,16,0.27,2
MainCat2,Sub2.1,14161,236.02,102
MainCat2,Sub2.2,834,13.90,17
MainCat3,Sub3.1,4325,72.08,472
MainCat3,Sub3.2,7,0.12,2
MainCat4,Sub4.1,614,10.23,60
MainCat5,Sub5.1,6362,106.03,142
MainCat5,Sub5.2,141,2.35,6
Misc,Misc.1,3033,50.55,53
MainCat4,Sub4.2,339,5.65,4
MainCat4,Sub4.3,925,15.42,11
Misc,Misc.2,2641,44.02,28
MainCat6,Sub6.1,370,6.17,4
MainCat7,Sub7.1,9601,160.02,10
MainCat4,Sub4.4,75,1.25,2
MainCat8,Sub8.1,148,2.47,4
MainCat8,Sub8.2,680,11.35,7
MainCat9,Sub9.1,3997,66.62,1
MainCat8,Sub8.3,105,1.75,2
MainCat4,Sub4.5,997,16.62,1
MainCat10,Sub10.1,12,0.20,3
MainCat4,Sub4.6,10,0.17,1
MainCat10,Sub10.2,13,0.22,1
MainCat4,Sub4.7,561,9.35,4
MainCat10,Sub10.3,1043,17.38,47
What I would like to achieve is a stacked bar plot where
the X-axis values/labels are given by the values/groups given by MainCategory
on the left Y-axis, the DurationH is used
on the right Y-axis the Number is used
DurationH and Number are plotted as bars per MainCategory side-by-side
In each of the bars, the SubCategory is used for stacking
Something like this:
The following code produces stacked plots, but a sequence of them:
import pandas as pd
from pandas import DataFrame
from matplotlib import pyplot as plt
import seaborn as sns
data = pd.read_csv('failureEventStatistic_total_Top10.csv', sep=',', header=2)
data = data.rename(columns={'DurationM':'Duration [min]', 'DurationH':'Duration [h]'})
data.groupby('MainCategorie')[['Duration [h]', 'Number']].plot.bar()
I tried to use unstack(), but this produces an error
You can get the plot data from a crosstab and then make a right aligned and a left aligned bar plot on the same axes:
ax = pd.crosstab(df.MainCategory, df.SubCategory.str.partition('.')[2], df.DurationH, aggfunc=sum).plot.bar(
stacked=True, width=-0.4, align='edge', ylabel='DurationH', ec='w', color=[(0,1,0,x) for x in np.linspace(1, 0.1, 7)], legend=False)
h_durationh, _ = ax.get_legend_handles_labels()
ax = pd.crosstab(df.MainCategory, df.SubCategory.str.partition('.')[2], df.Number, aggfunc=sum).plot.bar(
stacked=True, width=0.4, align='edge', secondary_y=True, ec='w', color=[(0,0,1,x) for x in np.linspace(1, 0.1, 7)], legend=False, ax=ax)
h_number, _ = ax.get_legend_handles_labels()
ax.set_ylabel('Number')
ax.set_xlim(left=ax.get_xlim()[0] - 0.5)
ax.legend([h_durationh[0], h_number[0]], ['DurationH', 'Number'])

How to start Seaborn Logarithmic Barplot at y=1

I have a problem figuring out how to have Seaborn show the right values in a logarithmic barplot. A value of mine should be, in the ideal case, be 1. My dataseries (5,2,1,0.5,0.2) has a set of values that deviate from unity and I want to visualize these in a logarithmic barplot. However, when plotting this in the standard log-barplot it shows the following:
But the values under one are shown to increase from -infinity to their value, whilst the real values ought to look like this:
Strangely enough, I was unable to find a Seaborn, Pandas or Matplotlib attribute to "snap" to a different horizontal axis or "align" or ymin/ymax. I have a feeling I am unable to find it because I can't find the terms to shove down my favorite search engine. Some semi-solutions I found just did not match what I was looking for or did not have either xaxis = 1 or a ylog. A try that uses some jank Matplotlib lines:
If someone knows the right terms or a solution, thank you in advance.
Here are the Jupyter cells I used:
{1}
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
data = {'X': ['A','B','C','D','E'], 'Y': [5,2,1,0.5,0.2]}
df = pd.DataFrame(data)
{2}
%matplotlib widget
g = sns.catplot(data=df, kind="bar", y = "Y", x = "X", log = True)
{3}
%matplotlib widget
plt.vlines(x=data['X'], ymin=1, ymax=data['Y'])
You could let the bars start at 1 instead of at 0. You'll need to use sns.barplot directly.
The example code subtracts 1 of all y-values and sets the bar bottom at 1.
import matplotlib.pyplot as plt
from matplotlib.ticker import NullFormatter
import seaborn as sns
import pandas as pd
import numpy as np
data = {'X': ['A', 'B', 'C', 'D', 'E'], 'Y': [5, 2, 1, 0.5, 0.2]}
df = pd.DataFrame(data)
ax = sns.barplot(y=df["Y"] - 1, x=df["X"], bottom=1, log=True, palette='flare_r')
ax.axhline(y=1, c='k')
# change the y-ticks, as the default shows too few in this case
ax.set_yticks(np.append(np.arange(.2, .8, .1), np.arange(1, 7, 1)), minor=False)
ax.set_yticks(np.arange(.3, 6, .1), minor=True)
ax.yaxis.set_major_formatter(lambda x, pos: f'{x:.0f}' if x >= 1 else f'{x:.1f}')
ax.yaxis.set_minor_formatter(NullFormatter())
ax.bar_label(ax.containers[0], labels=df["Y"])
sns.despine()
plt.show()
PS: With these specific values, the plot might go without logscale:

Align multi-line ticks in Seaborn plot

I have the following heatmap:
I've broken up the category names by each capital letter and then capitalised them. This achieves a centering effect across the labels on my x-axis by default which I'd like to replicate across my y-axis.
yticks = [re.sub("(?<=.{1})(.?)(?=[A-Z]+)", "\\1\n", label, 0, re.DOTALL).upper() for label in corr.index]
xticks = [re.sub("(?<=.{1})(.?)(?=[A-Z]+)", "\\1\n", label, 0, re.DOTALL).upper() for label in corr.columns]
fig, ax = plt.subplots(figsize=(20,15))
sns.heatmap(corr, ax=ax, annot=True, fmt="d",
cmap="Blues", annot_kws=annot_kws,
mask=mask, vmin=0, vmax=5000,
cbar_kws={"shrink": .8}, square=True,
linewidths=5)
for p in ax.texts:
myTrans = p.get_transform()
offset = mpl.transforms.ScaledTranslation(-12, 5, mpl.transforms.IdentityTransform())
p.set_transform(myTrans + offset)
plt.yticks(plt.yticks()[0], labels=yticks, rotation=0, linespacing=0.4)
plt.xticks(plt.xticks()[0], labels=xticks, rotation=0, linespacing=0.4)
where corr represents a pre-defined pandas dataframe.
I couldn't seem to find an align parameter for setting the ticks and was wondering if and how this centering could be achieved in seaborn/matplotlib?
I've adapted the seaborn correlation plot example below.
from string import ascii_letters
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme(style="white")
# Generate a large random dataset
rs = np.random.RandomState(33)
d = pd.DataFrame(data=rs.normal(size=(100, 7)),
columns=['Donald\nDuck','Mickey\nMouse','Han\nSolo',
'Luke\nSkywalker','Yoda','Santa\nClause','Ronald\nMcDonald'])
# Compute the correlation matrix
corr = d.corr()
# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(230, 20, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
square=True, linewidths=.5, cbar_kws={"shrink": .5})
for i in ax.get_yticklabels():
i.set_ha('right')
i.set_rotation(0)
for i in ax.get_xticklabels():
i.set_ha('center')
Note the two for sequences above. These get the label and then set the horizontal alignment (You can also change the vertical alignment (set_va()).
The code above produces this:

"panel barchart" in matplotlib

I would like to produce a figure like this one using matplotlib:
(source: peltiertech.com)
My data are in a pandas DataFrame, and I've gotten as far as a regular stacked barchart, but I can't figure out how to do the part where each category is given its own y-axis baseline.
Ideally I would like the vertical scale to be exactly the same for all the subplots and move the panel labels off to the side so there can be no gaps between the rows.
I haven't exactly replicated what you want but this should get you pretty close.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
#create dummy data
cols = ['col'+str(i) for i in range(10)]
ind = ['ind'+str(i) for i in range(10)]
df = pd.DataFrame(np.random.normal(loc=10, scale=5, size=(10, 10)), index=ind, columns=cols)
#create plot
sns.set_style("whitegrid")
axs = df.plot(kind='bar', subplots=True, sharey=True,
figsize=(6, 5), legend=False, yticks=[],
grid=False, ylim=(0, 14), edgecolor='none',
fontsize=14, color=[sns.xkcd_rgb["brownish red"]])
plt.text(-1, 100, "The y-axis label", fontsize=14, rotation=90) # add a y-label with custom positioning
sns.despine(left=True) # get rid of the axes
for ax in axs: # set the names beside the axes
ax.lines[0].set_visible(False) # remove ugly dashed line
ax.set_title('')
sername = ax.get_legend_handles_labels()[1][0]
ax.text(9.8, 5, sername, fontsize=14)
plt.suptitle("My panel chart", fontsize=18)