I was wondering if it is possible to create a Seaborn count plot, but instead of actual counts on the y-axis, show the relative frequency (percentage) within its group (as specified with the hue parameter).
I sort of fixed this with the following approach, but I can't imagine this is the easiest approach:
# Plot percentage of occupation per income class
grouped = df.groupby(['income'], sort=False)
occupation_counts = grouped['occupation'].value_counts(normalize=True, sort=False)
occupation_data = [
    {'occupation': occupation, 'income': income, 'percentage': percentage*100} for
    (income, occupation), percentage in dict(occupation_counts).items()
]
df_occupation = pd.DataFrame(occupation_data)
p = sns.barplot(x="occupation", y="percentage", hue="income", data=df_occupation)
_ = plt.setp(p.get_xticklabels(), rotation=90) # Rotate labels
Result:
I'm using the well known adult data set from the UCI machine learning repository. The pandas dataframe is created like this:
# Read the adult dataset
df = pd.read_csv(
    "data/adult.data",
    engine='c',
    lineterminator='\n',
    names=['age', 'workclass', 'fnlwgt', 'education', 'education_num',
           'marital_status', 'occupation', 'relationship', 'race', 'sex',
           'capital_gain', 'capital_loss', 'hours_per_week',
           'native_country', 'income'],
    header=None,
    skipinitialspace=True,
    na_values="?"
)
This question is sort of related, but does not make use of the hue parameter. And in my case I cannot just change the labels on the y-axis, because the height of the bar must depend on the group.
With newer versions of seaborn you can do the following:
import numpy as np
import pandas as pd
import seaborn as sns
sns.set(color_codes=True)
df = sns.load_dataset('titanic')
df.head()
x,y = 'class', 'survived'
(df
 .groupby(x)[y]
 .value_counts(normalize=True)
 .mul(100)
 .rename('percent')
 .reset_index()
 .pipe((sns.catplot, 'data'), x=x, y='percent', hue=y, kind='bar'))
output
Update: Also show percentages on top of barplots
If you also want the percentages printed on the bars, you can do the following:
import numpy as np
import pandas as pd
import seaborn as sns
df = sns.load_dataset('titanic')
df.head()
x,y = 'class', 'survived'
df1 = df.groupby(x)[y].value_counts(normalize=True)
df1 = df1.mul(100)
df1 = df1.rename('percent').reset_index()
g = sns.catplot(x=x,y='percent',hue=y,kind='bar',data=df1)
g.ax.set_ylim(0,100)
for p in g.ax.patches:
    txt = str(p.get_height().round(2)) + '%'
    txt_x = p.get_x()
    txt_y = p.get_height()
    g.ax.text(txt_x, txt_y, txt)
I might be confused. The difference between your output and the output of
occupation_counts = (df.groupby(['income'])['occupation']
.value_counts(normalize=True)
.rename('percentage')
.mul(100)
.reset_index()
.sort_values('occupation'))
p = sns.barplot(x="occupation", y="percentage", hue="income", data=occupation_counts)
_ = plt.setp(p.get_xticklabels(), rotation=90) # Rotate labels
is, it seems to me, only the order of the columns.
And you seem to care about that, since you pass sort=False. But in your code the order is then determined purely by chance (the order in which the dictionary is iterated even changes from run to run with Python 3.5).
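To make the ordering point concrete, here is a minimal sketch on a made-up toy frame (the data below is hypothetical and only stands in for the adult dataset). It computes the per-income percentages with the same groupby chain and then sorts explicitly, so the row order no longer depends on dictionary iteration:

```python
import pandas as pd

# Hypothetical toy data standing in for the adult dataset.
df = pd.DataFrame({
    "income": ["<=50K", "<=50K", "<=50K", ">50K", ">50K"],
    "occupation": ["Sales", "Tech", "Sales", "Tech", "Tech"],
})

# Share of each occupation within its income group, in percent.
percentages = (
    df.groupby("income")["occupation"]
    .value_counts(normalize=True)
    .mul(100)
    .rename("percentage")
    .reset_index()
    # Explicit sort: the bar order is now reproducible from run to run.
    .sort_values(["occupation", "income"], ignore_index=True)
)

# Sanity check: percentages add up to 100 within every income group.
group_totals = percentages.groupby("income")["percentage"].sum()

print(percentages)
print(group_totals)
```

The resulting frame can be passed to sns.barplot exactly as in the snippet above.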
You could do this with sns.histplot by setting the following properties:
stat = 'density' (this will make the y-axis the density rather than count)
common_norm = False (this will normalize each density independently)
See the simple example below:
import numpy as np
import pandas as pd
import seaborn as sns
df = sns.load_dataset('titanic')
ax = sns.histplot(x = df['class'], hue=df['survived'], multiple="dodge",
stat = 'density', shrink = 0.8, common_norm=False)
You can use the library Dexplot to do counting as well as normalizing over any variable to get relative frequencies.
Pass the count function the name of the variable you would like to count and it will automatically produce a bar plot of the counts of all unique values. Use split to subdivide the counts by another variable. Notice that Dexplot automatically wraps the x-tick labels.
import dexplot as dxp
dxp.count('occupation', data=df, split='income')
Use the normalize parameter to normalize the counts over any variable (or combination of variables with a list). You can also use True to normalize over the grand total of counts.
dxp.count('occupation', data=df, split='income', normalize='income')
It boggled my mind that Seaborn doesn't provide anything like this out of the box.
Still, it was pretty easy to tweak the source code to get what you wanted.
The following code, with the function "percentageplot(x, hue, data)", works just like sns.countplot but normalizes each bar per group (i.e. divides each green bar's value by the sum of all green bars).
In effect, it turns this (hard to interpret because different N of Apple vs. Android):
sns.countplot
into this (Normed so that bars reflect proportion of total for Apple, vs Android):
Percentageplot
Hope this helps!!
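# Note: this relies on seaborn's private internals as they were when this
# answer was written; the imports may not work in newer seaborn releases.
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from seaborn import utils
from seaborn.algorithms import bootstrap
from seaborn.categorical import _CategoricalPlotter, remove_na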
class _CategoricalStatPlotter(_CategoricalPlotter):

    @property
    def nested_width(self):
        """A float with the width of plot elements when hue nesting is used."""
        return self.width / len(self.hue_names)
    def estimate_statistic(self, estimator, ci, n_boot):
        if self.hue_names is None:
            statistic = []
            confint = []
        else:
            statistic = [[] for _ in self.plot_data]
            confint = [[] for _ in self.plot_data]

        for i, group_data in enumerate(self.plot_data):
            # Option 1: we have a single layer of grouping
            # --------------------------------------------
            if self.plot_hues is None:
                if self.plot_units is None:
                    stat_data = remove_na(group_data)
                    unit_data = None
                else:
                    unit_data = self.plot_units[i]
                    have = pd.notnull(np.c_[group_data, unit_data]).all(axis=1)
                    stat_data = group_data[have]
                    unit_data = unit_data[have]

                # Estimate a statistic from the vector of data
                if not stat_data.size:
                    statistic.append(np.nan)
                else:
                    statistic.append(estimator(stat_data, len(np.concatenate(self.plot_data))))

                # Get a confidence interval for this estimate
                if ci is not None:
                    if stat_data.size < 2:
                        confint.append([np.nan, np.nan])
                        continue
                    boots = bootstrap(stat_data, func=estimator,
                                      n_boot=n_boot,
                                      units=unit_data)
                    confint.append(utils.ci(boots, ci))

            # Option 2: we are grouping by a hue layer
            # ----------------------------------------
            else:
                for j, hue_level in enumerate(self.hue_names):
                    if not self.plot_hues[i].size:
                        statistic[i].append(np.nan)
                        if ci is not None:
                            confint[i].append((np.nan, np.nan))
                        continue

                    hue_mask = self.plot_hues[i] == hue_level
                    group_total_n = (np.concatenate(self.plot_hues) == hue_level).sum()
                    if self.plot_units is None:
                        stat_data = remove_na(group_data[hue_mask])
                        unit_data = None
                    else:
                        group_units = self.plot_units[i]
                        have = pd.notnull(
                            np.c_[group_data, group_units]
                        ).all(axis=1)
                        stat_data = group_data[hue_mask & have]
                        unit_data = group_units[hue_mask & have]

                    # Estimate a statistic from the vector of data
                    if not stat_data.size:
                        statistic[i].append(np.nan)
                    else:
                        statistic[i].append(estimator(stat_data, group_total_n))

                    # Get a confidence interval for this estimate
                    if ci is not None:
                        if stat_data.size < 2:
                            confint[i].append([np.nan, np.nan])
                            continue
                        boots = bootstrap(stat_data, func=estimator,
                                          n_boot=n_boot,
                                          units=unit_data)
                        confint[i].append(utils.ci(boots, ci))

        # Save the resulting values for plotting
        self.statistic = np.array(statistic)
        self.confint = np.array(confint)

        # Rename the value label to reflect the estimation
        if self.value_label is not None:
            self.value_label = "{}({})".format(estimator.__name__,
                                               self.value_label)
    def draw_confints(self, ax, at_group, confint, colors,
                      errwidth=None, capsize=None, **kws):
        if errwidth is not None:
            kws.setdefault("lw", errwidth)
        else:
            kws.setdefault("lw", mpl.rcParams["lines.linewidth"] * 1.8)

        for at, (ci_low, ci_high), color in zip(at_group,
                                                confint,
                                                colors):
            if self.orient == "v":
                ax.plot([at, at], [ci_low, ci_high], color=color, **kws)
                if capsize is not None:
                    ax.plot([at - capsize / 2, at + capsize / 2],
                            [ci_low, ci_low], color=color, **kws)
                    ax.plot([at - capsize / 2, at + capsize / 2],
                            [ci_high, ci_high], color=color, **kws)
            else:
                ax.plot([ci_low, ci_high], [at, at], color=color, **kws)
                if capsize is not None:
                    ax.plot([ci_low, ci_low],
                            [at - capsize / 2, at + capsize / 2],
                            color=color, **kws)
                    ax.plot([ci_high, ci_high],
                            [at - capsize / 2, at + capsize / 2],
                            color=color, **kws)
class _BarPlotter(_CategoricalStatPlotter):
    """Show point estimates and confidence intervals with bars."""

    def __init__(self, x, y, hue, data, order, hue_order,
                 estimator, ci, n_boot, units,
                 orient, color, palette, saturation, errcolor, errwidth=None,
                 capsize=None):
        """Initialize the plotter."""
        self.establish_variables(x, y, hue, data, orient,
                                 order, hue_order, units)
        self.establish_colors(color, palette, saturation)
        self.estimate_statistic(estimator, ci, n_boot)
        self.errcolor = errcolor
        self.errwidth = errwidth
        self.capsize = capsize

    def draw_bars(self, ax, kws):
        """Draw the bars onto `ax`."""
        # Get the right matplotlib function depending on the orientation
        barfunc = ax.bar if self.orient == "v" else ax.barh
        barpos = np.arange(len(self.statistic))

        if self.plot_hues is None:
            # Draw the bars
            barfunc(barpos, self.statistic, self.width,
                    color=self.colors, align="center", **kws)
            # Draw the confidence intervals
            errcolors = [self.errcolor] * len(barpos)
            self.draw_confints(ax,
                               barpos,
                               self.confint,
                               errcolors,
                               self.errwidth,
                               self.capsize)
        else:
            for j, hue_level in enumerate(self.hue_names):
                # Draw the bars
                offpos = barpos + self.hue_offsets[j]
                barfunc(offpos, self.statistic[:, j], self.nested_width,
                        color=self.colors[j], align="center",
                        label=hue_level, **kws)
                # Draw the confidence intervals
                if self.confint.size:
                    confint = self.confint[:, j]
                    errcolors = [self.errcolor] * len(offpos)
                    self.draw_confints(ax,
                                       offpos,
                                       confint,
                                       errcolors,
                                       self.errwidth,
                                       self.capsize)

    def plot(self, ax, bar_kws):
        """Make the plot."""
        self.draw_bars(ax, bar_kws)
        self.annotate_axes(ax)
        if self.orient == "h":
            ax.invert_yaxis()
def percentageplot(x=None, y=None, hue=None, data=None, order=None, hue_order=None,
                   orient=None, color=None, palette=None, saturation=.75,
                   ax=None, **kwargs):
    # Estimator calculates required statistic (proportion)
    estimator = lambda x, y: (float(len(x))/y)*100
    ci = None
    n_boot = 0
    units = None
    errcolor = None

    if x is None and y is not None:
        orient = "h"
        x = y
    elif y is None and x is not None:
        orient = "v"
        y = x
    elif x is not None and y is not None:
        raise TypeError("Cannot pass values for both `x` and `y`")
    else:
        raise TypeError("Must pass values for either `x` or `y`")

    plotter = _BarPlotter(x, y, hue, data, order, hue_order,
                          estimator, ci, n_boot, units,
                          orient, color, palette, saturation,
                          errcolor)
    plotter.value_label = "Percentage"

    if ax is None:
        ax = plt.gca()

    plotter.plot(ax, kwargs)
    return ax
You can get the effect of a seaborn countplot with custom bar heights (along the y axis) by passing the estimator keyword to sns.barplot.
ax = sns.barplot(x="x", y="x", data=df, estimator=lambda x: len(x) / len(df) * 100)
The above code snippet is from https://github.com/mwaskom/seaborn/issues/1027
They have a whole discussion about how to provide percentages in a countplot. This answer is based off the same thread linked above.
In the context of your specific problem, you can probably do something like this:
ax = sns.barplot(x='occupation', y='some_numeric_column', data=raw_data, estimator=lambda x: len(x) / len(raw_data) * 100, hue='income')
ax.set(ylabel="Percent")
The above code worked for me (on a different dataset with different attributes). Note that you need to pass some numeric column for y; otherwise it raises "ValueError: Neither the x nor y variable appears to be numeric."
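One caveat, as far as I can tell: dividing by len(raw_data) makes every bar a share of all rows, not a share of its own hue group, which is what the question asks for. The sketch below uses a made-up toy frame (hypothetical data and column values) and plain pandas to show the difference between the two normalizations:

```python
import pandas as pd

# Hypothetical toy data; the column names follow the question.
df = pd.DataFrame({
    "occupation": ["Sales", "Sales", "Tech", "Tech", "Tech", "Tech"],
    "income": ["<=50K", ">50K", "<=50K", "<=50K", ">50K", ">50K"],
})

# Raw counts: rows are occupations, columns are income groups.
counts = pd.crosstab(df["occupation"], df["income"])

# What len(x) / len(df) * 100 yields: each bar as a share of all rows.
share_of_total = counts / len(df) * 100

# Per-hue normalization: each bar as a share of its own income group.
share_within_income = counts / counts.sum(axis=0) * 100

print(share_of_total)
print(share_within_income)
```

With the first table all bars together add up to 100; with the second, the bars of each income group add up to 100 on their own.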
Building on this answer, I found that using "probability" worked best.
Taken from sns.histplot documentation on the "stat" parameter:
Aggregate statistic to compute in each bin.
count: show the number of observations in each bin
frequency: show the number of observations divided by the bin width
probability or proportion: normalize such that bar heights sum to 1
percent: normalize such that bar heights sum to 100
density: normalize such that the total area of the histogram equals 1
import seaborn as sns
df = sns.load_dataset('titanic')
ax = sns.histplot(
x = df['class'],
hue=df['survived'],
multiple="dodge",
stat = 'probability',
shrink = 0.5,
common_norm=False
)
Background
I am trying to show three variables on a single plot. I have connected the three points using lines of different colours based on some other variables. This is shown here
Problem
What I want to do is to have a different scale on the negative x-axis. This would help me in providing positive x_ticks, different axis label and also clear and uncluttered representation of the lines on left side of the image
Question
How to have a different positive x-axis starting from 0 towards negative direction?
Have xticks based on data plotted in that direction
Have a separate xlabel for this new axis
Additional information
I have checked other questions regarding inclusion of multiple axes e.g. this and this. However, these questions did not serve the purpose.
Code Used
font_size = 20
plt.rcParams.update({'font.size': font_size})
fig = plt.figure()
ax = fig.add_subplot(111)
#read my_data from file or create it
for case in my_data:
    # Iterating over my_data
    if condition1 == True:
        local_linestyle = '-'
        local_color = 'r'
        local_line_alpha = 0.6
    elif condition2 == 1:
        local_linestyle = '-'
        local_color = 'b'
        local_line_alpha = 0.6
    else:
        local_linestyle = '--'
        local_color = 'g'
        local_line_alpha = 0.6
    datapoint = [case[0], case[1], case[2]]
    plt.plot(datapoint[0], 0, color=local_color)
    plt.plot(-datapoint[2], 0, color=local_color)
    plt.plot(0, datapoint[1], color=local_color)
    plt.plot([datapoint[0], 0], [0, datapoint[1]], linestyle=local_linestyle, color=local_color)
    plt.plot([-datapoint[2], 0], [0, datapoint[1]], linestyle=local_linestyle, color=local_color)
plt.show()
exit()
You can define a custom scale, where values below zero are scaled differently than those above zero.
import numpy as np
from matplotlib import scale as mscale
from matplotlib import transforms as mtransforms
from matplotlib.ticker import FuncFormatter
class AsymScale(mscale.ScaleBase):
    name = 'asym'

    def __init__(self, axis, **kwargs):
        # matplotlib >= 3.1 expects the axis to be passed on to ScaleBase
        mscale.ScaleBase.__init__(self, axis)
        self.a = kwargs.get("a", 1)

    def get_transform(self):
        return self.AsymTrans(self.a)

    def set_default_locators_and_formatters(self, axis):
        # possibly, set a different locator and formatter here.
        fmt = lambda x, pos: "{}".format(np.abs(x))
        axis.set_major_formatter(FuncFormatter(fmt))

    class AsymTrans(mtransforms.Transform):
        input_dims = 1
        output_dims = 1
        is_separable = True

        def __init__(self, a):
            mtransforms.Transform.__init__(self)
            self.a = a

        def transform_non_affine(self, x):
            # scale only the negative part of the axis by a
            return (x >= 0)*x + (x < 0)*x*self.a

        def inverted(self):
            return AsymScale.InvertedAsymTrans(self.a)

    class InvertedAsymTrans(AsymTrans):

        def transform_non_affine(self, x):
            return (x >= 0)*x + (x < 0)*x/self.a

        def inverted(self):
            return AsymScale.AsymTrans(self.a)
Using this you would provide a scale parameter a that scales the negative part of the axes.
# Now that the Scale class has been defined, it must be registered so
# that ``matplotlib`` can find it.
mscale.register_scale(AsymScale)
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.plot([-2, 0, 5], [0,1,0])
ax.set_xscale("asym", a=2)
ax.annotate("negative axis", xy=(.25,0), xytext=(0,-30),
xycoords = "axes fraction", textcoords="offset points", ha="center")
ax.annotate("positive axis", xy=(.75,0), xytext=(0,-30),
xycoords = "axes fraction", textcoords="offset points", ha="center")
plt.show()
The question is not very clear about what xticks and labels are desired, so I left that out for now.
Here's how to get what you want. This solution uses two twinned axes objects to get different scaling to the left and right of the origin, and then hides all the evidence:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
from numbers import Number
tickkwargs = {m+k:False for k in ('bottom','top','left','right') for m in ('','label')}
p = np.zeros((10, 3, 2))
p[:,0,0] -= np.arange(10)*.1 + .5
p[:,1,1] += np.repeat(np.arange(5), 2)*.1 + .3
p[:,2,0] += np.arange(10)*.5 + 2
fig = plt.figure(figsize=(8,6))
host = fig.add_subplot(111)
par = host.twiny()
host.set_xlim(-6, 6)
par.set_xlim(-1, 1)
for ps in p:
    # mask the points with negative x values
    ppos = ps[ps[:,0] >= 0].T
    host.plot(*ppos)
    # mask the points with positive x values
    pneg = ps[ps[:,0] <= 0].T
    par.plot(*pneg)
# hide all possible ticks/notation text that could be set by the second x axis
par.tick_params(axis="both", **tickkwargs)
par.xaxis.get_offset_text().set_visible(False)
# fix the x tick labels so they're all positive
host.set_xticklabels(np.abs(host.get_xticks()))
fig.show()
Output:
Here's what the set of points p I used in the code above look like when plotted normally:
fig = plt.figure(figsize=(8,6))
ax = fig.gca()
for ps in p:
    ax.plot(*ps.T)
fig.show()
Output:
Subclassing mscale.ScaleBase as shown in the other answers may be more complicated than you need.
You can pass two scale transform functions to set_xscale or set_yscale, something like the following.
def get_scale(a=1):  # a is the scale of your negative axis
    def forward(x):
        x = (x >= 0) * x + (x < 0) * x * a
        return x
    def inverse(x):
        x = (x >= 0) * x + (x < 0) * x / a
        return x
    return forward, inverse
fig, ax = plt.subplots()
forward, inverse = get_scale(a=3)
ax.set_xscale('function', functions=(forward, inverse)) # this is for setting x axis
# do plotting
More examples can be found in this doc.
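The 'function' scale needs the two callables to be exact inverses of each other. As a quick sanity check, here is a small numpy-only sketch; it uses np.where instead of the boolean arithmetic above, which is my own variation and not part of the answer:

```python
import numpy as np

def get_scale(a=1.0):
    """Return (forward, inverse) functions that stretch negative values by `a`."""
    def forward(x):
        x = np.asarray(x, dtype=float)
        return np.where(x < 0, x * a, x)

    def inverse(x):
        x = np.asarray(x, dtype=float)
        return np.where(x < 0, x / a, x)

    return forward, inverse

forward, inverse = get_scale(a=3)

values = np.array([-2.0, -0.5, 0.0, 1.0, 5.0])
scaled = forward(values)     # negative entries are multiplied by 3
restored = inverse(scaled)   # mapping back recovers the input

print(scaled)
print(restored)
```

Because a is positive, the sign of a value never changes under the forward mapping, so the inverse can branch on the sign of the scaled value as well.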
I'm trying to do a scatterplot and heatmap in the same plot.
The scatterplot is as follows:
examples, targets = zip(*list(blue_data()))
examples2, targets2 = zip(*list(red_data()))
plt.plot(np.array(examples), np.array(targets), 'o', color = 'b')
plt.plot(np.array(examples2), np.array(targets2), 'o', color = 'r')
The above works just fine. I also want to do a heatmap. It works just fine, if I do it in a separate subplot. However, I want to try to do both on the same plot, sharing the same axes and am unsure how to do it. This is the code for the heatmap:
x = np.linspace(lower_x, upper_x, 100)
y = np.linspace(lower_y, upper_y, 100)
X, Y = np.meshgrid(x, y)
Z = np.zeros((x.size,y.size))
for ii in range(len(x)):
    for jj in range(len(y)):
        X_ = X[ii,jj]
        Y_ = Y[ii,jj]
        Z[ii,jj] = some_function(X_,Y_)
cmap = mpl.colors.ListedColormap(['r', 'b'])
bounds = [-100,0,100]
norm = mpl.colors.BoundaryNorm(bounds, cmap.N)
plt.imshow(Z,cmap= cmap, norm = norm, extent = [-lower_x,upper_x,-lower_y,upper_y], origin = 'lower');
How can I combine the two plots so that they appear on the same plot with aligned axes?
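One possible way to approach this, as a sketch only: build the grid so that its extent is given in the same data coordinates as the scatter points, then draw both on a single Axes. Everything named below is an assumption for illustration: some_function is a stand-in for the real one, the limits are made up, and draw_overlay is a hypothetical helper.

```python
import numpy as np

def some_function(x, y):
    # Hypothetical stand-in for the question's some_function.
    return x - y

lower_x, upper_x = -2.0, 2.0   # assumed plot limits
lower_y, upper_y = -1.0, 1.0

x = np.linspace(lower_x, upper_x, 40)
y = np.linspace(lower_y, upper_y, 50)
X, Y = np.meshgrid(x, y)       # both have shape (len(y), len(x))
Z = some_function(X, Y)        # vectorized: Z[row, col] = f(x[col], y[row])

# extent is (left, right, bottom, top) in data coordinates
extent = (lower_x, upper_x, lower_y, upper_y)

def draw_overlay(ax, Z, extent, points_x, points_y):
    """Draw the heatmap first, then the scatter points on the same Axes."""
    # origin="lower" puts row 0 of Z at the bottom, matching y[0] = lower_y
    ax.imshow(Z, extent=extent, origin="lower", aspect="auto")
    ax.plot(points_x, points_y, "o", color="k")
```

With matplotlib imported, something like fig, ax = plt.subplots() followed by draw_overlay(ax, Z, extent, examples, targets) would put both layers on one Axes. Note that the extent here uses lower_x and lower_y without the sign flip that appears in the question; that is my assumption about what the limits mean.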
I'm plotting some functions that have several discontinuities. Each function is given as a list. I want to connect points with lines only where the function is continuous.
Here is a simplified example of what plot is doing.
x=linspace(0,1,100)
y=zeros(100)
y[x<0.5] = x[x<0.5]
y[x>=0.5] = 1 + x[x>=0.5]
plot(x, y, '-o')
There is a discontinuity at x=0.5, but plot connects all points with lines regardless.
My functions are different of course. They typically have several discontinuities in different places. The criterion for the discontinuity is simple. Say, if the function jumps by more than 0.5, I assume it is discontinuous at that point.
Is there an option in plot to tell it to drop the connecting lines between the points where the function is discontinuous? I recall being able to do that easily with gnuplot.
Use NaN to break the line into multiple segments:
import numpy as np
from pylab import *
x=linspace(0,1,100)
y=zeros(100)
y[x<0.5] = x[x<0.5]
y[x>=0.5] = 1 + x[x>=0.5]
pos = np.where(np.abs(np.diff(y)) >= 0.5)[0]
x[pos] = np.nan
y[pos] = np.nan
plot(x, y, '-o')
Edit:
To insert NaN at the discontinuities instead of overwriting existing points:
pos = np.where(np.abs(np.diff(y)) >= 0.5)[0]+1
x = np.insert(x, pos, np.nan)
y = np.insert(y, pos, np.nan)
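The insertion variant can be wrapped in a small helper. This is a minimal sketch; the name break_at_jumps and the threshold argument are my own and not part of the answer:

```python
import numpy as np

def break_at_jumps(x, y, threshold=0.5):
    """Insert NaN where consecutive y values differ by at least `threshold`."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # index of the first point after each jump
    pos = np.where(np.abs(np.diff(y)) >= threshold)[0] + 1
    return np.insert(x, pos, np.nan), np.insert(y, pos, np.nan)

x = np.arange(6.0)
y = np.array([0.0, 0.1, 0.2, 1.5, 1.6, 1.7])   # one jump, between index 2 and 3

xb, yb = break_at_jumps(x, y)
print(yb)
```

All original points are kept; the extra NaN entry is what makes plot lift the pen between the two segments.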
Here is my suggestion for plotting tan(x):
import matplotlib.pyplot as plt
from math import *
x_lim = 3*pi/2
y_lim = 5
n = 1000
X = []
Y = []
Z = []
for i in range(0,2*n):
    x = -x_lim + i*x_lim/n
    y = tan(x)
    if y<y_lim and y>-y_lim:
        X.append(x)
        Y.append(y)
    else:
        if len(X)>0 and len(Y)>0:
            Z.append([X,Y])
        del X,Y
        X = []
        Y = []
for i in range(0, len(Z)):
    plt.plot(Z[i][0],Z[i][1])
plt.grid(True)
plt.show()