Suppress stacked bar chart label if less than n - pandas

I have data with a lot of values. When I plot them as percentages, many of the values come out as 0%, and those labels are still drawn in the plot. I do not want to show labels for values that round to 0% or, more generally, for values below some threshold n%.
This is the code that I use to produce the output:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data = np.random.rand(5,10)
data = 10 + data*10
df = pd.DataFrame(data, columns=list('ABCDEFGHIJ'))
ax = df.plot(kind='bar', stacked=True)
for c in ax.containers:
    ax.bar_label(c, fmt='%.0f%%', label_type='center')
ax.legend(bbox_to_anchor=(1.0, 1.0), loc='upper left')
I know that I can do what I need using this
ax = df.plot(kind='bar', stacked=True)
for c in ax.containers:
    labels = [v if v > 12 else "" for v in c.datavalues]
    ax.bar_label(c, labels=labels, label_type="center")
ax.legend(bbox_to_anchor=(1.0, 1.0), loc='upper left')
This way I can suppress values less than 12, but how can I limit the number of decimals shown in the label, as fmt='%.0f%%' does?
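One possible way to get both behaviours at once, sketched under the assumption that matplotlib 3.4 or newer is installed (where bar_label exists): build the label strings yourself with an f-string, so the threshold and the number format sit in the same expression. The helper name percent_labels and its threshold parameter are hypothetical and not part of any library.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def percent_labels(values, threshold=12):
    """Format values as whole-number percentages; blank out anything at or below threshold."""
    return [f'{v:.0f}%' if v > threshold else '' for v in values]


rng = np.random.default_rng(0)
df = pd.DataFrame(10 + rng.random((5, 10)) * 10, columns=list('ABCDEFGHIJ'))

ax = df.plot(kind='bar', stacked=True)
for c in ax.containers:
    # labels= overrides fmt, so the formatting has to happen in the strings themselves
    ax.bar_label(c, labels=percent_labels(c.datavalues), label_type='center')
ax.legend(bbox_to_anchor=(1.0, 1.0), loc='upper left')
```

Call plt.show() to display the figure. If I remember correctly, matplotlib 3.7 and later also accept a callable for fmt, so ax.bar_label(c, fmt=lambda v: f'{v:.0f}%' if v > 12 else '', label_type='center') should work there without building the list by hand.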

Related

Pandas plot of a stacked and grouped bar chart

I have a CSV-file with a couple of data:
# Comment header
#
MainCategory,SubCategory,DurationM,DurationH,Number
MainCat1,Sub1.1,598,9.97,105
MainCat1,Sub1.2,11,0.18,4
MainCat1,Sub1.3,17,0.28,5
MainCat1,Sub1.4,16,0.27,2
MainCat2,Sub2.1,14161,236.02,102
MainCat2,Sub2.2,834,13.90,17
MainCat3,Sub3.1,4325,72.08,472
MainCat3,Sub3.2,7,0.12,2
MainCat4,Sub4.1,614,10.23,60
MainCat5,Sub5.1,6362,106.03,142
MainCat5,Sub5.2,141,2.35,6
Misc,Misc.1,3033,50.55,53
MainCat4,Sub4.2,339,5.65,4
MainCat4,Sub4.3,925,15.42,11
Misc,Misc.2,2641,44.02,28
MainCat6,Sub6.1,370,6.17,4
MainCat7,Sub7.1,9601,160.02,10
MainCat4,Sub4.4,75,1.25,2
MainCat8,Sub8.1,148,2.47,4
MainCat8,Sub8.2,680,11.35,7
MainCat9,Sub9.1,3997,66.62,1
MainCat8,Sub8.3,105,1.75,2
MainCat4,Sub4.5,997,16.62,1
MainCat10,Sub10.1,12,0.20,3
MainCat4,Sub4.6,10,0.17,1
MainCat10,Sub10.2,13,0.22,1
MainCat4,Sub4.7,561,9.35,4
MainCat10,Sub10.3,1043,17.38,47
What I would like to achieve is a stacked bar plot where
the X-axis labels are the groups given by MainCategory
on the left Y-axis, the DurationH is used
on the right Y-axis the Number is used
DurationH and Number are plotted as bars per MainCategory side-by-side
In each of the bars, the SubCategory is used for stacking
Something like this:
The following code produces stacked plots, but a sequence of them:
import pandas as pd
from pandas import DataFrame
from matplotlib import pyplot as plt
import seaborn as sns
data = pd.read_csv('failureEventStatistic_total_Top10.csv', sep=',', header=2)
data = data.rename(columns={'DurationM':'Duration [min]', 'DurationH':'Duration [h]'})
data.groupby('MainCategory')[['Duration [h]', 'Number']].plot.bar()
I tried to use unstack(), but this produces an error.
You can get the plot data from a crosstab and then make a right aligned and a left aligned bar plot on the same axes:
# Assumes df holds the CSV data from the question and numpy is imported as np
ax = pd.crosstab(df.MainCategory, df.SubCategory.str.partition('.')[2], df.DurationH, aggfunc='sum').plot.bar(
    stacked=True, width=-0.4, align='edge', ylabel='DurationH', ec='w', color=[(0,1,0,x) for x in np.linspace(1, 0.1, 7)], legend=False)
h_durationh, _ = ax.get_legend_handles_labels()
ax = pd.crosstab(df.MainCategory, df.SubCategory.str.partition('.')[2], df.Number, aggfunc='sum').plot.bar(
    stacked=True, width=0.4, align='edge', secondary_y=True, ec='w', color=[(0,0,1,x) for x in np.linspace(1, 0.1, 7)], legend=False, ax=ax)
h_number, _ = ax.get_legend_handles_labels()
ax.set_ylabel('Number')
ax.set_xlim(left=ax.get_xlim()[0] - 0.5)
ax.legend([h_durationh[0], h_number[0]], ['DurationH', 'Number'])
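To see what the crosstab step hands to plot.bar, here is a small sketch using only four rows copied from the CSV in the question. The variable names level and table are just for illustration.

```python
import pandas as pd

# Four rows copied from the CSV in the question
df = pd.DataFrame({
    'MainCategory': ['MainCat1', 'MainCat1', 'MainCat2', 'MainCat2'],
    'SubCategory': ['Sub1.1', 'Sub1.2', 'Sub2.1', 'Sub2.2'],
    'DurationH': [9.97, 0.18, 236.02, 13.90],
    'Number': [105, 4, 102, 17],
})

# The text after the first '.' ("1", "2", ...) becomes the stacking level
level = df.SubCategory.str.partition('.')[2]

# One row per MainCategory, one column per stacking level, summed DurationH as values
table = pd.crosstab(df.MainCategory, level, values=df.DurationH, aggfunc='sum')
print(table)
```

Each row of table becomes one stacked bar and each column one segment of it, which is why the sub-category number (rather than the full name) is used as the column key.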

How to change the order of these plots using zorder?

I'm trying to get a line plot to be over the bar plot. But no matter what I do to change the zorder, it seems like it keeps the bar on top of the line. Nothing I do to try to change zorder seems to work. Sometimes the bar plot just doesn't show up if zorder is <= 0.
import pandas as pd
import matplotlib.pyplot as plt
def tail_plot(tail):
    plt.figure()
    # line plot
    ax1 = incidence[incidence['actual_inc'] != 0].tail(tail).plot(x='date', y=['R_t', 'upper 95% CI', 'lower 95% CI'], color = ['b', '#808080', '#808080'])
    ax1.set_zorder(2)
    ax2 = ax1.twinx()
    inc = incidence[incidence['actual_inc'] != 0]['actual_inc'].tail(tail).values
    dates = incidence[incidence['actual_inc'] != 0]['date'].tail(tail).values
    # bar plot
    ax2.bar(dates, inc, color ='red', zorder=1)
    ax2.set_zorder(1)
Keeps giving me this:
The problem with the approach in the post is that ax1 has a white background which totally occludes the plot of ax2. To solve this, the background color can be set to 'none'.
Note that the plt.figure() in the example code of the post creates an empty plot because the pandas plot creates its own new figure (as no ax is given explicitly).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({f'curve {i}': 20 + np.random.normal(.1, .5, 30).cumsum() for i in range(1, 6)})
# line plot
ax1 = df.plot()
ax1.set_zorder(2)
ax1.set_facecolor('none')
ax2 = ax1.twinx()
# bar plot
x = np.arange(30)
ax2.bar(x, np.random.randint(7 + x, 2 * x + 10), color='red', zorder=1)
ax2.set_zorder(1)
plt.show()
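As a follow-up to the note about plt.figure(), here is a sketch of one way to avoid the extra empty figure: create the figure and axes explicitly and hand the axes to pandas via ax=. The function name line_over_bars and the demo data are made up for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def line_over_bars(df, bar_heights):
    # One figure, created explicitly; pandas draws into ax1 instead of opening its own
    fig, ax1 = plt.subplots()
    df.plot(ax=ax1)
    ax2 = ax1.twinx()
    ax2.bar(np.arange(len(bar_heights)), bar_heights, color='red')
    # Draw ax1 (lines) above ax2 (bars) and make its background transparent
    ax1.set_zorder(ax2.get_zorder() + 1)
    ax1.set_facecolor('none')
    return fig, ax1, ax2


rng = np.random.default_rng(0)
demo = pd.DataFrame({f'curve {i}': 20 + rng.normal(.1, .5, 30).cumsum() for i in range(1, 4)})
fig, ax1, ax2 = line_over_bars(demo, rng.integers(5, 40, size=30))
```

Call plt.show() afterwards to display the single figure.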

Make Y axis values show on subplots in pyplot

I have the following python code to plot a 2x2 set of graphs.
I would like to make the yaxis show its values on both columns (show Duration numbers on the right as well).
I am ok with the X axis being shown only for the lower row.
How can I do that?
import matplotlib.pyplot as plt
builds = ['20191006.1','20191004.1']
totals_10t = [39671486, 39977577]
totals_1t = [9671486, 3977577]
means_10t = [96160,99630]
means_1t = [9160,9630]
fig, axs = plt.subplots(2, 2, sharex=True,sharey=False, squeeze=False)
fig.suptitle('perf results')
axs[0,0].plot(builds, totals_10t)
axs[0,0].set_title('10T Totals')
axs[0,1].plot(builds, totals_1t, 'tab:orange')
axs[0,1].set_title('1T Totals')
axs[0,1].set_ylabel('Duration(ms)')
axs[0,1].yaxis.tick_right()
axs[1,0].plot(builds, means_10t, 'tab:green')
axs[1,0].set_title('10T Means')
axs[1,1].plot(builds, means_1t, 'tab:red')
axs[1,1].set_title('1T Means')
axs[1,1].yaxis.tick_right()
axs[1,1].set_ylabel('Duration(ms)')
for ax in axs.flat:
    ax.set(xlabel='Build',ylabel='Duration(ms)')
for ax in axs.flat:
    ax.label_outer()
plt.show()
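A possible approach, assuming the label_outer() call is what strips labels from the right column (its exact behaviour has varied between matplotlib versions): skip label_outer() entirely, rely on sharex=True to hide the x tick labels of the upper row, and move both the ticks and the axis label of the right column to the right side. The helper perf_grid and the panels dictionary are hypothetical names; the numbers are the ones from the question, and 'tab:blue' is assumed for the first panel since the question leaves it at the default color.

```python
import matplotlib.pyplot as plt


def perf_grid(builds, panels):
    """panels maps (row, col) -> (title, values, color)."""
    fig, axs = plt.subplots(2, 2, sharex=True, sharey=False, squeeze=False)
    fig.suptitle('perf results')
    for (r, c), (title, values, color) in panels.items():
        ax = axs[r, c]
        ax.plot(builds, values, color=color)
        ax.set_title(title)
        ax.set_ylabel('Duration(ms)')
        if c == 1:
            # Right column: put ticks, tick labels and the axis label on the right
            ax.yaxis.tick_right()
            ax.yaxis.set_label_position('right')
    # sharex=True already hides the x tick labels of the upper row,
    # so only the bottom row gets an x label; label_outer() is not called
    for ax in axs[-1, :]:
        ax.set_xlabel('Build')
    return fig, axs


builds = ['20191006.1', '20191004.1']
panels = {
    (0, 0): ('10T Totals', [39671486, 39977577], 'tab:blue'),
    (0, 1): ('1T Totals', [9671486, 3977577], 'tab:orange'),
    (1, 0): ('10T Means', [96160, 99630], 'tab:green'),
    (1, 1): ('1T Means', [9160, 9630], 'tab:red'),
}
fig, axs = perf_grid(builds, panels)
```

Call plt.show() to display the grid; fig.tight_layout() may help if the right-hand labels get clipped.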

Plotting Pandas dataframe subplots with different linestyles

I am plotting a figure with 6 sets of axes, each with a series of 3 lines from one of 2 Pandas dataframes (1 line per column).
I have been using matplotlib .plot:
import pandas as pd
import matplotlib.pyplot as plt
idx = pd.date_range(start='2013-01-01 00:00', periods=24, freq='H')
df1 = pd.DataFrame(index = idx, columns = ['line1','line2','line3'])
df1['line1']= df1.index.hour
df1['line2'] = 24 - df1['line1']
df1['line3'] = df1['line1'].mean()
df2 = df1*2
df3= df1/2
df4= df2+df3
fig, ax = plt.subplots(2,2,squeeze=False,figsize = (10,10))
ax[0,0].plot(df1.index, df1, marker='', linewidth=1, alpha=1)
ax[0,1].plot(df2.index, df2, marker='', linewidth=1, alpha=1)
ax[1,0].plot(df3.index, df3, marker='', linewidth=1, alpha=1)
ax[1,1].plot(df4.index, df4, marker='', linewidth=1, alpha=1)
fig.show()
It's all good, and matplotlib automatically cycles through a different colour for each line, but uses the same colours for each plot, which is what I wanted.
However, now I want to specify more details for the lines: choosing specific colours for each line, and / or changing the linestyle for each line.
This link shows how to pass multiple linestyles to a Pandas plot. e.g. using
ax = df.plot(kind='line', style=['-', '--', '-.'])
So I need to either:
pass lists of styles to my subplot command above, but style is not recognised and it doesn't accept a list for linestyle or color. Is there a way to do this?
or
Use df.plot:
fig, ax = plt.subplots(2,2,squeeze=False,figsize = (10,10))
ax[0,0] = df1.plot(style=['-','--','-.'], marker='', linewidth=1, alpha=1)
ax[0,1] = df2.plot(style=['-','--','-.'],marker='', linewidth=1, alpha=1)
ax[1,0] = df3.plot( style=['-','--','-.'],marker='', linewidth=1, alpha=1)
ax[1,1] = df4.plot(style=['-','--','-.'], marker='', linewidth=1, alpha=1)
fig.show()
...but then each plot is plotted as a separate figure. I can't see how to put multiple Pandas plots on the same figure.
How can I make either of these approaches work?
using matplotlib
Using matplotlib, you may define a cycler for the axes to loop over color and linestyle automatically. (See this answer).
import numpy as np; np.random.seed(1)
import pandas as pd
import matplotlib.pyplot as plt
f = lambda i: pd.DataFrame(np.cumsum(np.random.randn(20,3),0))
dic1= dict(zip(range(3), [f(i) for i in range(3)]))
dic2= dict(zip(range(3), [f(i) for i in range(3)]))
dics = [dic1,dic2]
rows = range(3)
def set_cycler(ax):
    ax.set_prop_cycle(plt.cycler('color', ['limegreen', '#bc15b0', 'indigo'])+
                      plt.cycler('linestyle', ["-","--","-."]))
fig, ax = plt.subplots(3,2,squeeze=False,figsize = (8,5))
for x in rows:
    for i,dic in enumerate(dics):
        set_cycler(ax[x,i])
        ax[x,i].plot(dic[x].index, dic[x], marker='', linewidth=1, alpha=1)
plt.show()
using pandas
Using pandas you can indeed supply a list of possible colors and linestyles to the df.plot() method. Additionally you need to tell it in which axes to plot (df.plot(ax=ax[i,j])).
import numpy as np; np.random.seed(1)
import pandas as pd
import matplotlib.pyplot as plt
f = lambda i: pd.DataFrame(np.cumsum(np.random.randn(20,3),0))
dic1= dict(zip(range(3), [f(i) for i in range(3)]))
dic2= dict(zip(range(3), [f(i) for i in range(3)]))
dics = [dic1,dic2]
rows = range(3)
color = ['limegreen', '#bc15b0', 'indigo']
linestyle = ["-","--","-."]
fig, ax = plt.subplots(3,2,squeeze=False,figsize = (8,5))
for x in rows:
    for i,dic in enumerate(dics):
        dic[x].plot(ax=ax[x,i], style=linestyle, color=color, legend=False)
plt.show()

Pandas histogram df.hist() group by

How to plot a histogram with pandas DataFrame.hist() using group by?
I have a data frame with 5 columns: "A", "B", "C", "D" and "Group"
The Group column has two classes: "yes" and "no"
Using:
df.hist()
I get the hist for each of the 4 columns.
Now I would like to get the same 4 graphs but with blue bars (group="yes") and red bars (group = "no").
I tried this without success:
df.hist(by = "group")
Using Seaborn
If you are open to using Seaborn, a plot with multiple subplots and multiple variables within each subplot can easily be made using seaborn.FacetGrid.
import numpy as np; np.random.seed(1)
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.DataFrame(np.random.randn(300,4), columns=list("ABCD"))
df["group"] = np.random.choice(["yes", "no"], p=[0.32,0.68],size=300)
df2 = pd.melt(df, id_vars='group', value_vars=list("ABCD"), value_name='value')
bins=np.linspace(df2.value.min(), df2.value.max(), 10)
g = sns.FacetGrid(df2, col="variable", hue="group", palette="Set1", col_wrap=2)
g.map(plt.hist, 'value', bins=bins, ec="k")
g.axes[-1].legend()
plt.show()
This is not the most flexible workaround but will work for your question specifically.
# Assumes df has columns 'A'..'D' plus 'group', and matplotlib.pyplot is imported as plt
def sephist(col):
    yes = df[df['group'] == 'yes'][col]
    no = df[df['group'] == 'no'][col]
    return yes, no
for num, alpha in enumerate('ABCD', start=1):  # subplot indices start at 1
    plt.subplot(2, 2, num)
    plt.hist(sephist(alpha)[0], bins=25, alpha=0.5, label='yes', color='b')
    plt.hist(sephist(alpha)[1], bins=25, alpha=0.5, label='no', color='r')
    plt.legend(loc='upper right')
    plt.title(alpha)
plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=1.0)
You could make this more generic by:
adding a df and by parameter to sephist: def sephist(df, by, col)
making the subplots loop more flexible: for num, alpha in enumerate(df.columns)
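The two suggestions above could look roughly like this. It is only a sketch: grouped_hists is a hypothetical helper name, the two-column layout is an arbitrary choice, and the demo data mirrors the random data used in the Seaborn example rather than the asker's real frame.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def sephist(df, by, col):
    """Return {group value: values of `col` for that group}."""
    return {key: grp[col] for key, grp in df.groupby(by)}


def grouped_hists(df, by, bins=25, alpha=0.5):
    cols = [c for c in df.columns if c != by]
    ncols = 2
    nrows = -(-len(cols) // ncols)  # ceiling division
    fig = plt.figure()
    for num, col in enumerate(cols, start=1):  # subplot indices start at 1
        ax = fig.add_subplot(nrows, ncols, num)
        for key, values in sephist(df, by, col).items():
            ax.hist(values, bins=bins, alpha=alpha, label=str(key))
        ax.legend(loc='upper right')
        ax.set_title(col)
    fig.tight_layout()
    return fig


rng = np.random.default_rng(1)
df = pd.DataFrame(rng.standard_normal((300, 4)), columns=list('ABCD'))
df['group'] = rng.choice(['yes', 'no'], p=[0.32, 0.68], size=300)
fig = grouped_hists(df, by='group')
```

Call plt.show() to display the result. Colors are left to matplotlib's default cycle here; pass color= to ax.hist if blue/red per group is required.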
Because the first argument to matplotlib.pyplot.hist can take "either a single array or a sequence of arrays which are not required to be of the same length", an alternative would be:
for num, alpha in enumerate('ABCD', start=1):  # subplot indices start at 1
    plt.subplot(2, 2, num)
    # colors follow the label order: 'yes' in blue, 'no' in red
    plt.hist((sephist(alpha)[0], sephist(alpha)[1]), bins=25, alpha=0.5, label=['yes', 'no'], color=['b', 'r'])
    plt.legend(loc='upper right')
    plt.title(alpha)
plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=1.0)
I generalized one of the other answers' solutions. Hope it helps someone out there. I added a line to ensure the binning (number and range) is the same for each column, regardless of group. The code should work for both "binary" and "categorical" groupings, i.e. "by" can specify a column containing N unique groups. Columns beyond the available subplot space are skipped.
import numpy as np
import matplotlib.pyplot as plt
def composite_histplot(df, columns, by, nbins=25, alpha=0.5):
    def _sephist(df, col, by):
        unique_vals = df[by].unique()
        df_by = dict()
        for uv in unique_vals:
            df_by[uv] = df[df[by] == uv][col]
        return df_by
    subplt_c = 4
    subplt_r = 5
    fig = plt.figure()
    for num, col in enumerate(columns):
        if num + 1 > subplt_c * subplt_r:
            continue
        plt.subplot(subplt_c, subplt_r, num+1)
        bins = np.linspace(df[col].min(), df[col].max(), nbins)
        for lbl, sepcol in _sephist(df, col, by).items():
            plt.hist(sepcol, bins=bins, alpha=alpha, label=lbl)
        plt.legend(loc='upper right', title=by)
        plt.title(col)
    plt.tight_layout()
    return fig
TL;DR one-liner:
It won't create subplots, but it will create 4 separate plots:
[df.groupby('group')[i].plot(kind='hist',title=i).iloc[0] and plt.legend() and plt.show() for i in 'ABCD']
Full working example below
import numpy as np; np.random.seed(1)
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.DataFrame(np.random.randn(300,4), columns=list("ABCD"))
df["group"] = np.random.choice(["yes", "no"], p=[0.32,0.68],size=300)
[df.groupby('group')[i].plot(kind='hist',title=i).iloc[0] and plt.legend() and plt.show() for i in 'ABCD']