Ordering seaborn heatmap xticks given certain values - pandas

I have this given data:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Patch
import seaborn as sns
import pandas as pd
df = pd.DataFrame({'C': np.random.choice([False, False, False, True], 100000),
'D': np.random.choice([False,True], 100000),
'B': np.random.choice([False,True, True], 100000),
'A': np.random.choice([False, False, True], 100000),
'F': np.random.choice([False,True, True, True], 100000)})
Where I plot this:
fig, ax = plt.subplots(figsize=(5, 6))
cmap = sns.mpl_palette("Set2", 2)
sns.heatmap(data=df, cmap=cmap, cbar=False)
plt.xticks(rotation=90, fontsize=10)
plt.yticks(rotation=0, fontsize=10)
legend_handles = [Patch(color=cmap[True], label='Missing Value'), # red
Patch(color=cmap[False], label='Non Missing Value')] # blue-green
plt.legend(handles=legend_handles, ncol=2, bbox_to_anchor=[0.5, 1.02], loc='lower center', fontsize=8, handlelength=.8)
plt.tight_layout()
plt.show()
I have been trying to order the x-axis from higher to lower (left to right) given the count of True values. So, the first position should have the highest amount of True values, the second position the second highest, and so on.
I was able to get the positions and their respective labels with:
x_axis = df.sum().rank(method="dense", ascending=False)
x_pos = x_axis.values.tolist()
x_labels = x_axis.index.tolist()
But I'm struggling to apply this to the plot. I also need to be sure that I'm not just changing the position of the labels but also the position of the variables displayed in the plot (I'm visualizing nearly 100 variables in the real dataframe).

You can extract order then reindex:
orders = df.sum().sort_values(ascending=False).index
# change this:
sns.heatmap(data=df.reindex(orders, axis=1), cmap=cmap, cbar=False)
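As a quick check that reindexing moves the data together with the labels, here is a minimal sketch on a tiny hand-made frame (the values are made up for illustration and are not the data from the question):

```python
import pandas as pd

# Toy frame: 'F' has three True values, 'D' two, 'C' one
df = pd.DataFrame({
    'C': [True, False, False, False],
    'D': [True, True, False, False],
    'F': [True, True, True, False],
})

# Column labels sorted by their count of True values, highest first
orders = df.sum().sort_values(ascending=False).index

# reindex returns a new frame whose columns (labels and data) follow that order
reordered = df.reindex(orders, axis=1)
print(list(reordered.columns))
```

Because the heatmap is drawn from the reordered frame, the tick labels and the plotted columns stay in sync.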

Seaborn: annotate missing values on the heatmap

I am plotting a heatmap in python with the seaborn library. The dataframe contains some missing values (NaN). I wish that the heatmap cells corresponding to these fields are white (by default) and also annotated with a string NA. However, if I see it correctly, annotation does not work with missing values. Is there any hack around it?
My code:
sns.heatmap(
df,
ax=ax[0, 0],
cbar=False,
annot=annot_df,
fmt="",
annot_kws={"size": annot_size, "va": "center_baseline"},
cmap="coolwarm",
linewidth=0.5,
linecolor="black",
vmin=-max_value,
vmax=max_value,
xticklabels=True,
yticklabels=True,
)
An idea is to draw another heatmap, with a transparent color and with only values where the original dataframe is NaN. To control the axis labels, the "real" heatmap should be drawn last. Note that the color for the NaN cells is the background color of the plot.
import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
data = np.where(np.random.rand(7, 10) < 0.2, np.nan, np.random.rand(7, 10) * 2 - 1)
df = pd.DataFrame(data)
annot_df = df.applymap(lambda f: f'{f:.1f}')
fig, ax = plt.subplots(squeeze=False)
sns.heatmap(
np.where(df.isna(), 0, np.nan),
ax=ax[0, 0],
cbar=False,
annot=np.full_like(df, "NA", dtype=object),
fmt="",
annot_kws={"size": 10, "va": "center_baseline", "color": "black"},
cmap=ListedColormap(['none']),
linewidth=0)
sns.heatmap(
df,
ax=ax[0, 0],
cbar=False,
annot=annot_df,
fmt="",
annot_kws={"size": 10, "va": "center_baseline"},
cmap="coolwarm",
linewidth=0.5,
linecolor="black",
vmin=-1,
vmax=1,
xticklabels=True,
yticklabels=True)
plt.show()
PS: To explicitly color the 'NA' cells, e.g. cmap=ListedColormap(['yellow']) could be used.
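To see what the first (overlay) heatmap actually receives, here is a small sketch of the inverted array on a made-up 2x2 frame: cells that are NaN in the original become 0 (so they get drawn and annotated), and all other cells become NaN (so they are skipped).

```python
import numpy as np
import pandas as pd

# Toy data with two missing cells (illustrative values only)
df = pd.DataFrame([[0.5, np.nan],
                   [np.nan, -0.2]])

# 0 where the original is NaN, NaN everywhere else
overlay = np.where(df.isna(), 0, np.nan)
print(overlay)
```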

how to plot lines linking medians of multiple violin distributions in seaborn?

I am struggling to plot a dotted line connecting the median values (and the min and max) of each type across stacked violin distributions.
I tried superimposing a seaborn.lineplot on a violin plot, but it failed. I'm not sure that this approach can draw dotted lines and also link the min and max of distributions of the same type. I also tried to use seaborn.lineplot on its own, but there the challenge is to plot the min and max of the distribution at each x-axis value.
Here is an example dataset and the code for the violin plot in seaborn:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
x=[0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.4,0.4,0.4,0.4,0.4,0.4,0.4,0.4,0.4,0.4,0.4,0.4,0.6,0.6,0.6,0.6,0.6,0.6,0.6,0.6,0.6,0.6,0.6,0.6,0.8,0.8,0.8,0.8,0.8,0.8,0.8,0.8,0.8,0.8,0.8,0.8]
cate=['a','a','a','a','b','b','b','b','c','c','c','c','a','a','a','a','b','b','b','b','c','c','c','c','a','a','a','a','b','b','b','b','c','c','c','c','a','a','a','a','b','b','b','b','c','c','c','c']
y=[1.1,1.12,1.13,1.13,3.1,3.12,3.13,3.13,5.1,5.12,5.13,5.13,2.2,2.22,2.25,2.23,4.2,4.22,4.25,4.23,6.2,6.22,6.25,6.23,2.2,2.22,2.24,2.23,4.2,4.22,4.24,4.23,6.2,6.22,6.24,6.23,1.1,1.13,1.14,1.12,3.1,3.13,3.14,3.12,5.1,5.13,5.14,5.12]
my_pal =['red','green', 'purple']
df = pd.DataFrame({'x': x, 'Type': cate, 'y': y})
ax=sns.catplot(y='y', x='x',data=df, hue='Type', palette=my_pal, kind="violin",dodge =False)
sns.lineplot(y='y', x='x',data=df, hue='Type', palette=my_pal, ci=100,legend=False)
plt.show()
but it draws the lines only in a small region at the left of the plot. Is there a trick to superimpose a lineplot on a violin plot?
For the line plot, 'x' is considered numerical. However, for the violin plot 'x' is considered categorical (positioned at 0, 1, 2, ...).
A solution is to convert 'x' to strings to have both plots consider it as categorical.
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
my_pal = ['red', 'green', 'purple']
N = 40
df = pd.DataFrame({'x': np.random.randint(1, 6, N*3) * 0.2,
'y': np.random.uniform(0, 1, N*3) + np.tile([2, 4, 6], N),
'Type': np.tile(list('abc'), N)})
df['x'] = [f'{x:.1f}' for x in df['x']]
ax = sns.violinplot(y='y', x='x', data=df, hue='Type', palette=my_pal, dodge=False)
ax = sns.lineplot(y='y', x='x', data=df, hue='Type', palette=my_pal, ci=100, legend=False, ax=ax)
ax.margins(0.15) # slightly more padding for x and y axis
ax.legend(bbox_to_anchor=(1.01, 1), loc='upper left')
plt.tight_layout()
plt.show()
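If the median, minimum and maximum per type and x position are needed explicitly (for example to draw them as separate lines), one possible approach is to compute them with a pandas groupby first. This is only a sketch on made-up values; the frame and the name stats are illustrative and not part of the answer above.

```python
import pandas as pd

# Toy data: one type, two x positions (illustrative values only)
df = pd.DataFrame({'x': [0.2, 0.2, 0.4, 0.4],
                   'Type': ['a', 'a', 'a', 'a'],
                   'y': [1.0, 3.0, 2.0, 6.0]})

# Same trick as above: make x categorical by turning it into strings
df['x'] = [f'{x:.1f}' for x in df['x']]

# One row per (Type, x) with the min, median and max of y
stats = df.groupby(['Type', 'x'])['y'].agg(['min', 'median', 'max']).reset_index()
print(stats)
```

Each of the three columns could then be passed to a line plot on the same axes, using the string-valued x so the positions match the violins.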

Seaborn: how to draw a vertical line that matches a specific y value in a cumulative KDE?

I'm using Seaborn to plot a cumulative distribution and its KDE using this code:
sns.distplot(values, bins=20,
hist_kws= {'cumulative': True},
kde_kws= {'cumulative': True} )
This gives me the following chart:
I'd like to plot a vertical line and the corresponding x index where y is 0.8. Something like:
How do I get the x value of a specific y?
You could draw a vertical line at the 80% quantile:
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
values = np.random.normal(1, 20, 1000)
sns.distplot(values, bins=20,
hist_kws= {'cumulative': True},
kde_kws= {'cumulative': True} )
plt.axvline(np.quantile(values, 0.8), color='r')
plt.show()
The answer by @JohanC is probably the best. I went another route, and it's maybe a slightly more generic solution.
The idea is to get the coordinates of the KDE line, then find the index of the point where it crosses the threshold value:
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
values = np.random.normal(size=(100,))
fig = plt.figure()
ax = sns.distplot(values, bins=20,
hist_kws= {'cumulative': True},
kde_kws= {'cumulative': True} )
x,y = ax.lines[0].get_data()
thresh = 0.8
idx = np.where(np.diff(np.sign(y-thresh)))[0]
x_val = x[idx[0]]
ax.axvline(x_val, color='red')
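The sign-change trick can be tried without any plotting. The sketch below uses a logistic curve as a stand-in for the (x, y) data returned by ax.lines[0].get_data(); that stand-in is an assumption made purely for illustration. It also shows np.interp as an optional refinement, which is valid here only because the cumulative curve is increasing.

```python
import numpy as np

# Stand-in for the cumulative KDE line: a monotone curve from 0 to 1
x = np.linspace(-3, 3, 601)
y = 1 / (1 + np.exp(-x))

thresh = 0.8
# Indices where (y - thresh) changes sign, i.e. where the curve crosses thresh
idx = np.where(np.diff(np.sign(y - thresh)))[0]
x_val = x[idx[0]]  # grid point just before the crossing

# Optional: linear interpolation between grid points (requires increasing y)
x_interp = np.interp(thresh, y, x)
print(x_val, x_interp)
```

The first value is limited by the grid spacing of the line data, while the interpolated value falls between the two neighbouring grid points.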

Pandas histogram df.hist() group by

How to plot a histogram with pandas DataFrame.hist() using group by?
I have a data frame with 5 columns: "A", "B", "C", "D" and "Group"
There are two Groups classes: "yes" and "no"
Using:
df.hist()
I get the hist for each of the 4 columns.
Now I would like to get the same 4 graphs but with blue bars (group="yes") and red bars (group = "no").
I tried this without success:
df.hist(by = "group")
Using Seaborn
If you are open to using Seaborn, a plot with multiple subplots and multiple variables within each subplot can easily be made using seaborn.FacetGrid.
import numpy as np; np.random.seed(1)
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.DataFrame(np.random.randn(300,4), columns=list("ABCD"))
df["group"] = np.random.choice(["yes", "no"], p=[0.32,0.68],size=300)
df2 = pd.melt(df, id_vars='group', value_vars=list("ABCD"), value_name='value')
bins=np.linspace(df2.value.min(), df2.value.max(), 10)
g = sns.FacetGrid(df2, col="variable", hue="group", palette="Set1", col_wrap=2)
g.map(plt.hist, 'value', bins=bins, ec="k")
g.axes[-1].legend()
plt.show()
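The pd.melt step is what turns the wide frame into the long form that the grid above facets on. A minimal sketch on a made-up two-row frame shows the resulting layout:

```python
import pandas as pd

# Toy wide frame: two value columns plus the grouping column
df = pd.DataFrame({'A': [1.0, 2.0],
                   'B': [3.0, 4.0],
                   'group': ['yes', 'no']})

# One row per (original row, value column); the column name goes into 'variable'
df2 = pd.melt(df, id_vars='group', value_vars=['A', 'B'], value_name='value')
print(df2)
```

The 'variable' column is what col="variable" uses to create one subplot per original column, and 'group' drives the hue.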
This is not the most flexible workaround but will work for your question specifically.
def sephist(col):
    # split one column into the values belonging to each group
    yes = df[df['group'] == 'yes'][col]
    no = df[df['group'] == 'no'][col]
    return yes, no
# subplot indices start at 1, and the columns in the question are upper case
for num, alpha in enumerate('ABCD', start=1):
    plt.subplot(2, 2, num)
    plt.hist(sephist(alpha)[0], bins=25, alpha=0.5, label='yes', color='b')
    plt.hist(sephist(alpha)[1], bins=25, alpha=0.5, label='no', color='r')
    plt.legend(loc='upper right')
    plt.title(alpha)
plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=1.0)
You could make this more generic by:
adding a df and by parameter to sephist: def sephist(df, by, col)
making the subplots loop more flexible: for num, alpha in enumerate(df.columns)
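One possible shape for the more generic sephist described in the two points above is sketched here. Returning a dict keyed by group value (instead of a yes/no tuple) is my own assumption, chosen so the helper also works with more than two groups.

```python
import pandas as pd

def sephist(df, by, col):
    """Return {group value: Series of `col`} for each unique value in `by`."""
    return {key: group[col] for key, group in df.groupby(by)}

# Toy data (illustrative values only)
df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'group': ['yes', 'no', 'yes', 'no']})

parts = sephist(df, 'group', 'A')
print({key: values.tolist() for key, values in parts.items()})
```

In the plotting loop, each dict entry can then be passed to plt.hist with the key as its label.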
Because the first argument to matplotlib.pyplot.hist can take either a single array or a sequence of arrays which are not required to be of the same length, an alternative would be:
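# subplot indices start at 1, and the columns in the question are upper case
for num, alpha in enumerate('ABCD', start=1):
    plt.subplot(2, 2, num)
    # colors follow the label order: 'yes' in blue, 'no' in red
    plt.hist((sephist(alpha)[0], sephist(alpha)[1]), bins=25, alpha=0.5, label=['yes', 'no'], color=['b', 'r'])
    plt.legend(loc='upper right')
    plt.title(alpha)
plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=1.0)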
I generalized one of the other answers. Hope it helps someone out there. I added a line to ensure the binning (number and range) is preserved for each column, regardless of group. The code should work for both "binary" and "categorical" groupings, i.e. "by" can specify a column containing N unique groups. Plotting also stops if the number of columns to plot exceeds the subplot space.
import numpy as np
import matplotlib.pyplot as plt
def composite_histplot(df, columns, by, nbins=25, alpha=0.5):
    def _sephist(df, col, by):
        unique_vals = df[by].unique()
        df_by = dict()
        for uv in unique_vals:
            df_by[uv] = df[df[by] == uv][col]
        return df_by
    subplt_c = 4
    subplt_r = 5
    fig = plt.figure()
    for num, col in enumerate(columns):
        if num + 1 > subplt_c * subplt_r:
            continue
        plt.subplot(subplt_c, subplt_r, num+1)
        bins = np.linspace(df[col].min(), df[col].max(), nbins)
        for lbl, sepcol in _sephist(df, col, by).items():
            plt.hist(sepcol, bins=bins, alpha=alpha, label=lbl)
        plt.legend(loc='upper right', title=by)
        plt.title(col)
    plt.tight_layout()
    return fig
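The shared-bins line is the key point of the function above: the edges are computed from the whole column, so every group is counted against the same intervals. A small sketch with np.histogram on made-up arrays illustrates this; note that np.linspace returns bin edges, so n edges give n - 1 bins.

```python
import numpy as np

# Toy values for two groups of the same column (illustrative only)
values_yes = np.array([0.0, 1.0, 2.0])
values_no = np.array([2.0, 3.0, 4.0])

# Edges from the combined range: 5 edges -> 4 bins
all_values = np.concatenate([values_yes, values_no])
bins = np.linspace(all_values.min(), all_values.max(), 5)

# Both groups are counted against identical edges
counts_yes, edges_yes = np.histogram(values_yes, bins=bins)
counts_no, edges_no = np.histogram(values_no, bins=bins)
print(bins, counts_yes, counts_no)
```

With nbins=25 in composite_histplot, each histogram therefore has 24 bars' worth of bins rather than 25.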
TLDR oneliner;
It won't create the subplots but will create 4 different plots;
[df.groupby('group')[i].plot(kind='hist',title=i)[0] and plt.legend() and plt.show() for i in 'ABCD']
Full working example below
import numpy as np; np.random.seed(1)
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.DataFrame(np.random.randn(300,4), columns=list("ABCD"))
df["group"] = np.random.choice(["yes", "no"], p=[0.32,0.68],size=300)
[df.groupby('group')[i].plot(kind='hist',title=i)[0] and plt.legend() and plt.show() for i in 'ABCD']

"panel barchart" in matplotlib

I would like to produce a figure like this one using matplotlib:
(source: peltiertech.com)
My data are in a pandas DataFrame, and I've gotten as far as a regular stacked barchart, but I can't figure out how to do the part where each category is given its own y-axis baseline.
Ideally I would like the vertical scale to be exactly the same for all the subplots and move the panel labels off to the side so there can be no gaps between the rows.
I haven't exactly replicated what you want but this should get you pretty close.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
#create dummy data
cols = ['col'+str(i) for i in range(10)]
ind = ['ind'+str(i) for i in range(10)]
df = pd.DataFrame(np.random.normal(loc=10, scale=5, size=(10, 10)), index=ind, columns=cols)
#create plot
sns.set_style("whitegrid")
axs = df.plot(kind='bar', subplots=True, sharey=True,
figsize=(6, 5), legend=False, yticks=[],
grid=False, ylim=(0, 14), edgecolor='none',
fontsize=14, color=[sns.xkcd_rgb["brownish red"]])
plt.text(-1, 100, "The y-axis label", fontsize=14, rotation=90) # add a y-label with custom positioning
sns.despine(left=True) # get rid of the axes
for ax in axs: # set the names beside the axes
ax.lines[0].set_visible(False) # remove ugly dashed line
ax.set_title('')
sername = ax.get_legend_handles_labels()[1][0]
ax.text(9.8, 5, sername, fontsize=14)
plt.suptitle("My panel chart", fontsize=18)
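For the two wishes in the question (identical vertical scale and no gaps between rows), another option is to build the panels directly with plt.subplots. This is only a sketch under my own assumptions: the data are random, the colour and label position are arbitrary, and the Agg backend is selected just so the snippet also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; drop for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Dummy data: 5 categories (index) x 3 series (columns)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.uniform(1, 10, size=(5, 3)),
                  index=[f'ind{i}' for i in range(5)],
                  columns=[f'col{i}' for i in range(3)])

# One row per series; shared axes and hspace=0 remove the gaps between rows
fig, axs = plt.subplots(nrows=len(df.columns), ncols=1,
                        sharex=True, sharey=True,
                        gridspec_kw={'hspace': 0}, figsize=(6, 5))

for ax, col in zip(axs, df.columns):
    ax.bar(df.index, df[col], color='firebrick')
    # panel label to the right of the axes, in axes coordinates
    ax.text(1.02, 0.5, col, transform=ax.transAxes, va='center')

# With sharey=True, setting the limits once applies to every panel
axs[0].set_ylim(0, df.to_numpy().max() * 1.1)
```

Calling plt.show() or fig.savefig(...) afterwards displays or stores the figure.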