Combining Pandas Subplots into a Single Figure

I'm having trouble understanding Pandas subplots - and how to create axes so that all subplots are shown (not over-written by subsequent plot).
For each "Site", I want to make a time-series plot of all columns in the dataframe.
The "Sites" here are 'shark' and 'unicorn', both with 2 variables. The output should be 4 plotted lines: the time-indexed plot for Var1 and Var2 at each site.
Make Time-Indexed Data with Nans:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({
    # some ways to create random data (pd.np is no longer available in recent pandas, so use numpy directly)
    'Var1': np.random.randn(100),
    'Var2': np.random.randn(100),
    'Site': np.random.choice(['unicorn', 'shark'], 100),
    # a date range and set of random dates
    'Date': pd.date_range('1/1/2011', periods=100, freq='D'),
    # 'f': np.random.choice(pd.date_range('1/1/2011', periods=365,
    #                                     freq='D'), 100, replace=False)
})
df.set_index('Date', inplace=True)
df['Var2'] = df.Var2.cumsum()
df.loc['2011-01-31':'2011-04-01', 'Var1'] = np.nan
Make a figure with a sub-plot for each site:
fig, ax = plt.subplots(len(df.Site.unique()), 1)
counter=0
for site in df.Site.unique():
    print(site)
    sitedat=df[df.Site==site]
    sitedat.plot(subplots=True, ax=ax[counter], sharex=True)
    ax[0].title=site #Set title of the plot to the name of the site
    counter=counter+1
plt.show()
However, this is not working as written. The second subplot ends up overwriting the first. In my actual use case, I have a variable number of sites in each dataframe (for example, 14), as well as a variable number of 'Var1, 2, ...'. Thus, I need a solution that does not require creating each axis (ax0, ax1, ...) by hand.
As a bonus, I would love a title of each 'site' above that set of plots.
The current code overwrites the first 'Site' plot with the second. What am I missing with the axes here?

When you use DataFrame.plot(..., subplots=True), you need to provide the correct number of axes, one for each column (and with the right geometry, if using layout=). In your example, you have 2 columns, so plot() needs two axes, but you are only passing one in ax=, so pandas has no choice but to delete all the axes and create the appropriate number of axes itself.
Therefore, you need to pass an array of axes of length corresponding to the number of columns you have in your dataframe.
# the grouper function is from itertools' cookbook
from itertools import zip_longest
def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)
# one axes per (site, variable) pair; the 'Site' column itself is not plotted, hence the -1
fig, axs = plt.subplots(len(df.Site.unique())*(len(df.columns)-1),1, sharex=True)
for (site,sitedat),axList in zip(df.groupby('Site'),grouper(axs,len(df.columns)-1)):
    sitedat.plot(subplots=True, ax=axList)
    axList[0].set_title(site)
plt.tight_layout()
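As a variant of the same idea, here is a minimal sketch that skips the grouper helper and instead reshapes the flat array of axes into one row per site. The names value_cols and axs_by_site are my own, the sample data is regenerated with a seeded generator so the snippet stands alone, and the Agg backend is only there so it runs without a display (drop that line in a notebook).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed here so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Sample data shaped like the question's frame (values are arbitrary).
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {
        "Var1": rng.standard_normal(100),
        "Var2": rng.standard_normal(100).cumsum(),
        "Site": rng.choice(["unicorn", "shark"], 100),
    },
    index=pd.date_range("2011-01-01", periods=100, freq="D"),
)

value_cols = [c for c in df.columns if c != "Site"]  # columns that get their own axes
groups = list(df.groupby("Site"))

# One axes per (site, variable) pair, stacked vertically.
fig, axs = plt.subplots(len(groups) * len(value_cols), 1, sharex=True, figsize=(8, 8))
# Row i of this 2D view holds the axes that belong to site i.
axs_by_site = np.asarray(axs).reshape(len(groups), len(value_cols))

for (site, sitedat), site_axes in zip(groups, axs_by_site):
    # Pass exactly as many axes as there are plotted columns.
    sitedat[value_cols].plot(subplots=True, ax=site_axes)
    site_axes[0].set_title(site)

fig.tight_layout()
```

The only point is that each call to plot(subplots=True) receives as many axes as it has columns to draw; how you slice the axes array is a matter of taste.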

Related

Draw bar-charts with value_counts() for multiple columns in a Pandas DataFrame

I'm trying to draw bar-charts with counts of unique values for all columns in a Pandas DataFrame. Kind of what df.hist() does for numerical columns, but I have categorical columns.
I'd prefer to use the object-oriented approach, because it feels more natural and explicit to me.
I'd like to have multiple Axes (subplots) within a single Figure, in a grid fashion (again like what df.hist() does).
My solution below does exactly what I want, but it feels cumbersome. I doubt whether I really need the direct dependency on Matplotlib (and all the code for creating the Figure, removing the unused Axes etc.). I see that pandas.Series.plot has parameters subplots and layout which seem to point to what I want, but maybe I'm totally off here. I tried looping over the columns in my DataFrame and apply these parameters, but I cannot figure it out.
Does anyone know a more compact way to do what I'm trying to achieve?
import math
import matplotlib.pyplot as plt
# Defining the grid-dimensions of the Axes in the Matplotlib Figure
nr_of_plots = len(ames_train_categorical.columns)
nr_of_plots_per_row = 4
nr_of_rows = math.ceil(nr_of_plots / nr_of_plots_per_row)
# Defining the Matplotlib Figure and Axes
figure, axes = plt.subplots(nrows=nr_of_rows, ncols=nr_of_plots_per_row, figsize=(25, 50))
figure.subplots_adjust(hspace=0.5)
# Plotting on the Axes
i, j = 0, 0
for column_name in ames_train_categorical:
    if ames_train_categorical[column_name].nunique() <= 30:
        axes[i][j].set_title(column_name)
        ames_train_categorical[column_name].value_counts().plot(kind='bar', ax=axes[i][j])
        j += 1
        if j % nr_of_plots_per_row == 0:
            i += 1
            j = 0
# Cleaning up unused Axes
# plt.subplots creates a full rectangular grid of Axes. On the last row, not all Axes will always be used. Unused Axes are removed here.
axes_flattened = axes.flatten()
for ax in axes_flattened:
    if not ax.has_data():
        ax.remove()
Edit: alternative idea
Using the pyplot/state-machine way of working, you could do it like this in very few lines of code. But this has the downside that every graph gets its own figure, so they're not nicely arranged in a grid.
for column_name in ames_train_categorical:
    ames_train_categorical[column_name].value_counts().plot(kind='bar')
    plt.show()
Desired output
With the following toy dataframe:
import pandas as pd
df = pd.DataFrame(
{
"MS Zoning": ["RL", "FV", "RL", "RH", "RL", "RL"],
"Street": ["Pave", "Pave", "Pave", "Grvl", "Pave", "Pave"],
"Alley": ["Grvl", "Grvl", "Grvl", "Grvl", "Pave", "Pave"],
"Utilities": ["AllPub", "NoSewr", "AllPub", "AllPub", "NoSewr", "AllPub"],
"Land Slope": ["Gtl", "Mod", "Sev", "Mod", "Sev", "Sev"],
}
)
Here is a somewhat more idiomatic way to do it:
import math
from matplotlib import pyplot as plt
size = math.ceil(df.shape[1]** (1/2))
fig = plt.figure()
for i, col in enumerate(df.columns):
    fig.add_subplot(size, size, i + 1)
    df[col].value_counts().plot(kind="bar", ax=plt.gca(), title=col, rot=0)
fig.tight_layout()

Create matplotlib subplots without manually counting number of subplots?

When doing ad-hoc analysis in Jupyter Notebook, I often want to view sequences of transformations to some Pandas DataFrame as vertically stacked subplots. My usual quick-and-dirty method is to not use subplots at all, but create a new figure for each plot:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
df = pd.DataFrame({"a": range(100)}) # Some arbitrary DataFrame
df.plot(title="0 to 100")
plt.show()
df = df * -1 # Some transformation
df.plot(title="0 to -100")
plt.show()
df = df * 2 # Some other transformation
df.plot(title="0 to -200")
plt.show()
This method has limitations. The x-axis ticks are unaligned even when identically indexed (because the x-axis width depends on y-axis labels) and the Jupyter cell output contains several separate inline images, not a single one that I can save or copy-and-paste.
As far as I know, the proper solution is to use plt.subplots():
fig, axes = plt.subplots(3, figsize=(20, 9))
df = pd.DataFrame({"a": range(100)}) # Arbitrary DataFrame
df.plot(ax=axes[0], title="0 to 100")
df = df * -1 # Some transformation
df.plot(ax=axes[1], title="0 to -100")
df = df * 2 # Some other transformation
df.plot(ax=axes[2], title="0 to -200")
plt.tight_layout()
plt.show()
This yields exactly the output I'd like. However, it also introduces an annoyance that makes me use the first method by default: I have to manually count the number of subplots I've created and update this count in several different places as the code changes.
In the multi-figure case, adding a fourth plot is as simple as calling df.plot() and plt.show() a fourth time. With subplots, the equivalent change requires updating the subplot count, plus arithmetic to resize the output figure, replacing plt.subplots(3, figsize=(20, 9)) with plt.subplots(4, figsize=(20, 12)). Each newly added subplot needs to know how many other subplots already exist (ax=axes[0], ax=axes[1], ax=axes[2], etc.), so any additions or removals require cascading changes to the plots below.
This seems like it should be trivial to automate — it's just counting and multiplication — but I'm finding it impossible to implement with the matplotlib/pyplot API. The closest I can get is the following partial solution, which is terse enough but still requires explicit counting:
n_subplots = 3 # Must still be updated manually as code changes
fig, axes = plt.subplots(n_subplots, figsize=(20, 3 * n_subplots))
i = 0 # Counts how many subplots have been added so far
df = pd.DataFrame({"a": range(100)}) # Arbitrary DataFrame
df.plot(ax=axes[i], title="0 to 100")
i += 1
df = df * -1 # Arbitrary transformation
df.plot(ax=axes[i], title="0 to -100")
i += 1
df = df * 2 # Arbitrary transformation
df.plot(ax=axes[i], title="0 to -200")
i += 1
plt.tight_layout()
plt.show()
The root problem is that any time df.plot() is called, there must exist an axes list of known size. I considered delaying the execution of df.plot() somehow, e.g. by appending to a list of lambda functions that can be counted before they're called in sequence, but this seems like an extreme amount of ceremony just to avoid updating an integer by hand.
Is there a more convenient way to do this? Specifically, is there a way to create a figure with an "expandable" number of subplots, suitable for ad-hoc/interactive contexts where the count is not known in advance?
(Note: This question may appear to be a duplicate of either this question or this one, but the accepted answers to both questions contain exactly the problem I'm trying to solve — that the nrows= parameter of plt.subplots() must be declared before adding subplots.)
First create an empty figure and then add subplots using add_subplot. Update the subplotspecs of the existing subplots in the figure using a new GridSpec for the new geometry (the figure keyword is only needed if you're using constrained layout instead of tight layout).
import matplotlib.pyplot as plt
import matplotlib as mpl
import pandas as pd
def append_axes(fig, as_cols=False):
    """Append new Axes to Figure."""
    n = len(fig.axes) + 1
    nrows, ncols = (1, n) if as_cols else (n, 1)
    gs = mpl.gridspec.GridSpec(nrows, ncols, figure=fig)
    for i,ax in enumerate(fig.axes):
        ax.set_subplotspec(mpl.gridspec.SubplotSpec(gs, i))
    return fig.add_subplot(nrows, ncols, n)
fig = plt.figure(layout='tight')
df = pd.DataFrame({"a": range(100)}) # Arbitrary DataFrame
df.plot(ax=append_axes(fig), title="0 to 100")
df = df * -1 # Some transformation
df.plot(ax=append_axes(fig), title="0 to -100")
df = df * 2 # Some other transformation
df.plot(ax=append_axes(fig), title="0 to -200")
Example for adding the new subplots as columns (and using constrained layout for a change):
fig = plt.figure(layout='constrained')
df = pd.DataFrame({"a": range(100)}) # Arbitrary DataFrame
df.plot(ax=append_axes(fig, True), title="0 to 100")
df = df + 10 # Some transformation
df.plot(ax=append_axes(fig, True), title="10 to 110")
IIUC you need some container for your transformations to achieve this - a list for example. Something like:
arbitrary_trx = [
    lambda x: x, # No transformation
    lambda x: x * -1, # Arbitrary transformation
    lambda x: x * 2] # Arbitrary transformation
fig, axes = plt.subplots(nrows=len(arbitrary_trx))
for ax, f in zip(axes, arbitrary_trx):
    df = df.apply(f)
    df.plot(ax=ax)
You can create an object that stores the data and only creates the figure once you tell it to do so.
import pandas as pd
import matplotlib.pyplot as plt
class AxesStacker():
    def __init__(self):
        self.data = []
        self.titles = []
    def append(self, data, title=""):
        self.data.append(data)
        self.titles.append(title)
    def create(self):
        nrows = len(self.data)
        self.fig, self.axs = plt.subplots(nrows=nrows)
        for d, t, ax in zip(self.data, self.titles, self.axs.flat):
            d.plot(ax=ax, title=t)
stacker = AxesStacker()
df = pd.DataFrame({"a": range(100)}) # Some arbitrary DataFrame
stacker.append(df, title="0 to 100")
df = df * -1 # Some transformation
stacker.append(df, title="0 to -100")
df = df * 2 # Some other transformation
stacker.append(df, title="0 to -200")
stacker.create()
plt.show()

pandas subplot, split into rows [duplicate]

I have a few Pandas DataFrames sharing the same value scale, but having different columns and indices. When invoking df.plot(), I get separate plot images. What I really want is to have them all in the same plot as subplots, but I'm unfortunately failing to come up with a solution and would highly appreciate some help.
You can manually create the subplots with matplotlib, and then plot the dataframes on a specific subplot using the ax keyword. For example for 4 subplots (2x2):
import matplotlib.pyplot as plt
fig, axes = plt.subplots(nrows=2, ncols=2)
df1.plot(ax=axes[0,0])
df2.plot(ax=axes[0,1])
...
Here axes is an array which holds the different subplot axes, and you can access one just by indexing axes.
If you want a shared x-axis, then you can provide sharex=True to plt.subplots.
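To make the sharex=True remark concrete, here is a minimal sketch. The dict frames and its random contents are made up for illustration, and the Agg backend is only set so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed here so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Four hypothetical frames with the same index but independent random values.
rng = np.random.default_rng(1)
frames = {
    f"df{i}": pd.DataFrame(rng.random((10, 2)), columns=["A", "B"])
    for i in range(1, 5)
}

# sharex=True links the x-axes of all four subplots.
fig, axes = plt.subplots(nrows=2, ncols=2, sharex=True)

# axes is a 2x2 array; axes.flat walks it row by row.
for ax, (name, frame) in zip(axes.flat, frames.items()):
    frame.plot(ax=ax, title=name)

fig.tight_layout()
```

With the shared x-axis, zooming or changing the x-limits on one subplot is applied to the others as well.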
You can see examples in the documentation demonstrating joris's answer. Also from the documentation, you could set subplots=True and layout=(rows, cols) within the pandas plot function:
df.plot(subplots=True, layout=(1,2))
You could also use fig.add_subplot() which takes subplot grid parameters such as 221, 222, 223, 224, etc. as described in the post here. Nice examples of plot on pandas data frame, including subplots, can be seen in this ipython notebook.
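A minimal sketch of the fig.add_subplot() route with three-digit grid codes (221 means a 2x2 grid, first cell). The frame and its column names are made up for illustration, and the Agg backend is only set so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed here so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical frame with three columns to spread over a 2x2 grid.
rng = np.random.default_rng(2)
df = pd.DataFrame(rng.random((20, 3)), columns=["A", "B", "C"])

fig = plt.figure()
# 221, 222, 223 -> 2 rows, 2 columns, cells 1 to 3; the fourth cell stays empty.
for position, column in zip((221, 222, 223), df.columns):
    ax = fig.add_subplot(position)
    df[column].plot(ax=ax, title=column)

fig.tight_layout()
```

Because axes are added one at a time, no unused axes need to be removed afterwards.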
You can plot multiple subplots of multiple pandas data frames using matplotlib with a simple trick: make a list of all the data frames, then use a for loop to plot the subplots.
Working code:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
# dataframe sample data
df1 = pd.DataFrame(np.random.rand(10,2)*100, columns=['A', 'B'])
df2 = pd.DataFrame(np.random.rand(10,2)*100, columns=['A', 'B'])
df3 = pd.DataFrame(np.random.rand(10,2)*100, columns=['A', 'B'])
df4 = pd.DataFrame(np.random.rand(10,2)*100, columns=['A', 'B'])
df5 = pd.DataFrame(np.random.rand(10,2)*100, columns=['A', 'B'])
df6 = pd.DataFrame(np.random.rand(10,2)*100, columns=['A', 'B'])
#define number of rows and columns for subplots
nrow=3
ncol=2
# make a list of all dataframes
df_list = [df1 ,df2, df3, df4, df5, df6]
fig, axes = plt.subplots(nrow, ncol)
# plot counter
count=0
for r in range(nrow):
    for c in range(ncol):
        df_list[count].plot(ax=axes[r,c])
        count+=1
Using this code you can plot subplots in any configuration. You need to define the number of rows nrow and the number of columns ncol. You also need to make a list of the data frames, df_list, that you want to plot.
You can use the familiar Matplotlib style calling a figure and subplot, but you simply need to specify the current axis using plt.gca(). An example:
plt.figure(1)
plt.subplot(2,2,1)
df.A.plot() #no need to specify for first axis
plt.subplot(2,2,2)
df.B.plot(ax=plt.gca())
plt.subplot(2,2,3)
df.C.plot(ax=plt.gca())
etc...
You can use this:
fig = plt.figure()
ax = fig.add_subplot(221)
plt.plot(x,y)
ax = fig.add_subplot(222)
plt.plot(x,z)
...
plt.show()
You may not need to use Pandas at all. Here's a matplotlib plot of cat frequencies:
x = np.linspace(0, 2*np.pi, 400)
y = np.sin(x**2)
f, axes = plt.subplots(2, 1)
for c, i in enumerate(axes):
    axes[c].plot(x, y)
    axes[c].set_title('cats')
plt.tight_layout()
Option 1: Create subplots from a dictionary of dataframes with long (tidy) data
Assumptions:
There is a dictionary of multiple dataframes of tidy data that are either:
Created by reading in from files
Created by separating a single dataframe into multiple dataframes
The categories, cat, may be overlapping, but all dataframes don't necessarily contain all values of cat
hue='cat'
This example uses a dict of dataframes, but a list of dataframes would be similar.
If the dataframes are wide, use pandas.DataFrame.melt to convert them to long form.
Because dataframes are being iterated through, there's no guarantee that colors will be mapped the same for each plot
A custom color map needs to be created from the unique 'cat' values for all the dataframes
Since the colors will be the same, place one legend to the side of the plots, instead of a legend in every plot
Tested in python 3.10, pandas 1.4.3, matplotlib 3.5.1, seaborn 0.11.2
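For the wide-to-long step mentioned in the list above, a minimal sketch with pandas.DataFrame.melt could look like this. The frame wide and its column names are made up; the target column names cat, x and y follow the ones used in this answer.

```python
import pandas as pd

# Hypothetical wide frame: one x column plus one value column per category.
wide = pd.DataFrame(
    {
        "x": [0.1, 0.2, 0.3],
        "A": [1.0, 2.0, 3.0],
        "B": [4.0, 5.0, 6.0],
    }
)

# Keep x as the identifier; the former column names become the 'cat' values
# and their contents go into a single 'y' column.
long = wide.melt(id_vars="x", var_name="cat", value_name="y")
```

The resulting frame has one row per (x, cat) pair, which is the shape the plotting code below expects.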
Imports and Test Data
import pandas as pd
import numpy as np # used for random data
import matplotlib.pyplot as plt
from matplotlib.patches import Patch # for custom legend - square patches
from matplotlib.lines import Line2D # for custom legend - round markers
import seaborn as sns
import math # determine correct number of subplots
# synthetic data
df_dict = dict()
for i in range(1, 7):
    np.random.seed(i) # for repeatable sample data
    data_length = 100
    data = {'cat': np.random.choice(['A', 'B', 'C'], size=data_length),
            'x': np.random.rand(data_length), 'y': np.random.rand(data_length)}
    df_dict[i] = pd.DataFrame(data)
# display(df_dict[1].head())
cat x y
0 B 0.944595 0.606329
1 A 0.586555 0.568851
2 A 0.903402 0.317362
3 B 0.137475 0.988616
4 B 0.139276 0.579745
# display(df_dict[6].tail())
cat x y
95 B 0.881222 0.263168
96 A 0.193668 0.636758
97 A 0.824001 0.638832
98 C 0.323998 0.505060
99 C 0.693124 0.737582
Create color mappings and plot
# create color mapping based on all unique values of cat
unique_cat = {cat for v in df_dict.values() for cat in v.cat.unique()} # get unique cats
colors = sns.color_palette('tab10', n_colors=len(unique_cat)) # get a number of colors
cmap = dict(zip(unique_cat, colors)) # zip values to colors
col_nums = 3 # how many plots per row
row_nums = math.ceil(len(df_dict) / col_nums) # how many rows of plots
# create the figure and axes
fig, axes = plt.subplots(row_nums, col_nums, figsize=(9, 6), sharex=True, sharey=True)
# convert to 1D array for easy iteration
axes = axes.flat
# iterate through dictionary and plot
for ax, (k, v) in zip(axes, df_dict.items()):
    sns.scatterplot(data=v, x='x', y='y', hue='cat', palette=cmap, ax=ax)
    sns.despine(top=True, right=True)
    ax.legend_.remove() # remove the individual plot legends
    ax.set_title(f'dataset = {k}', fontsize=11)
fig.tight_layout()
# create legend from cmap
# patches = [Patch(color=v, label=k) for k, v in cmap.items()] # square patches
patches = [Line2D([0], [0], marker='o', color='w', markerfacecolor=v, label=k, markersize=8) for k, v in cmap.items()] # round markers
# place legend outside of plot; change the right bbox value to move the legend up or down
plt.legend(title='cat', handles=patches, bbox_to_anchor=(1.06, 1.2), loc='center left', borderaxespad=0, frameon=False)
plt.show()
Option 2: Create subplots from a single dataframe with multiple separate datasets
The dataframes must be in a long form with the same column names.
This option uses pd.concat to combine multiple dataframes into a single dataframe, and .assign to add a new column.
See Import multiple csv files into pandas and concatenate into one DataFrame for creating a single dataframe from a list of files.
This option is easier because it doesn't require manually mapping colors to 'cat'
Combine DataFrames
# using df_dict, with dataframes as values, from the top
# combine all the dataframes in df_dict to a single dataframe with an identifier column
df = pd.concat((v.assign(dataset=k) for k, v in df_dict.items()), ignore_index=True)
# display(df.head())
cat x y dataset
0 B 0.944595 0.606329 1
1 A 0.586555 0.568851 1
2 A 0.903402 0.317362 1
3 B 0.137475 0.988616 1
4 B 0.139276 0.579745 1
# display(df.tail())
cat x y dataset
595 B 0.881222 0.263168 6
596 A 0.193668 0.636758 6
597 A 0.824001 0.638832 6
598 C 0.323998 0.505060 6
599 C 0.693124 0.737582 6
Plot a FacetGrid with seaborn.relplot
sns.relplot(kind='scatter', data=df, x='x', y='y', hue='cat', col='dataset', col_wrap=3, height=3)
Both options create the same result; however, it's less complicated to combine all the dataframes and plot a figure-level plot with sns.relplot.
Building on @joris's response above, if you have already established a reference to the subplot, you can use the reference as well. For example,
ax1 = plt.subplot2grid((50,100), (0, 0), colspan=20, rowspan=10)
...
df.plot.barh(ax=ax1, stacked=True)
Here is a working pandas subplot example, where modes is the column names of the dataframe.
dpi=200
figure_size=(20, 10)
fig, ax = plt.subplots(len(modes), 1, sharex="all", sharey="all", dpi=dpi)
for i in range(len(modes)):
    ax[i] = pivot_df.loc[:, modes[i]].plot.bar(figsize=(figure_size[0], figure_size[1]*len(modes)),
                                                ax=ax[i], title=modes[i], color=my_colors[i])
    ax[i].legend()
fig.suptitle(name)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
fig, axes = plt.subplots(2,2)
df = pd.DataFrame({'A':np.random.randint(1,100,10),
                   'B': np.random.randint(100,1000,10),
                   'C':np.random.randint(100,200,10)})
# plot the same frame on every subplot of the 2x2 grid
for ax in axes.flatten():
    df.plot(ax=ax)

colormap with pandas dataframe plot function

I have data from multiple sites that record a sharp change in the monitored parameter. How could I plot the data for all these sites using value-dependent colors to enhance the visualization?
import numpy as np
import pandas as pd
import string
# site names
cols = string.ascii_uppercase
# number of days
ndays = 3
# index
index = pd.date_range('2018-05-01', periods=3*24*60, freq='T')
# simulated daily data
d1 = np.random.randn(len(index)//ndays, len(cols))
d2 = np.random.randn(len(index)//ndays, len(cols))+2
d3 = np.random.randn(len(index)//ndays, len(cols))-2
data=np.concatenate([d1, d2, d3])
df = pd.DataFrame(data=data, index=index, columns=list(cols))
df.plot(legend=False)
Each site (column) gets assigned one color in the above code. Is there a way to represent the parameter values to different colors?
I guess one alternative is using colormaps option from scatter plot function: How to use colormaps to color plots of Pandas DataFrames
fig, ax = plt.subplots(figsize=(12,6))
collection = [plt.scatter(range(len(df)), df[col], c=df[col], s=25, cmap=cmap, edgecolor='None') for col in df.columns]
However, if I plot over time (i.e., x=df.index) things appear not to work as expected.
Is there any other alternative? or suggestion how to better visualize the sudden change in the time series?
In what follows I will use only 3 columns and hourly data in order to make the plots look less messy. The examples work as well with the original data.
cols = string.ascii_uppercase[:3]
ndays = 3
index = pd.date_range('2018-05-01', periods=3*24, freq='H')
# simulated daily data
d1 = np.random.randn(len(index)//ndays, len(cols))
d2 = np.random.randn(len(index)//ndays, len(cols))+2
d3 = np.random.randn(len(index)//ndays, len(cols))-2
data=np.concatenate([d1, d2, d3])
df = pd.DataFrame(data=data, index=index, columns=list(cols))
df.plot(legend=False)
The pandas way
You are out of luck: DataFrame.plot.scatter does not work with datetime-like data due to a long-standing bug.
The matplotlib way
Matplotlib's scatter can handle datetime-like data but the x-axis does not scale as expected.
for col in df.columns:
    plt.scatter(df.index, df[col], c=df[col])
plt.gcf().autofmt_xdate()
This looks like a bug to me but I could not find any reports. You can work around this by manually adjusting the x-limits.
for col in df.columns:
    plt.scatter(df.index, df[col], c=df[col])
start, end = df.index[[0, -1]]
xmargin = (end - start) * plt.gca().margins()[0]
plt.xlim(start - xmargin, end + xmargin)
plt.gcf().autofmt_xdate()
Unfortunately the x-axis formatter is not as nice as the pandas one.
The pandas way, revisited
I discovered this trick by chance and I do not understand why it works. If you plot a pandas series indexed by the same datetime data before calling matplotlib's scatter, the autoscaling issue disappears and you get the nice pandas formatting.
So I made an invisible plot of the first column and then the scatter plot.
df.iloc[:, 0].plot(lw=0) # invisible plot
for col in df.columns:
    plt.scatter(df.index, df[col], c=df[col])

Pandas bar plot changes date format

I have a simple stacked line plot that has exactly the date format I want magically set when using the following code.
df_ts = df.resample("W").max()
df_ts.plot(figsize=(12,8), stacked=True)
However, the dates mysteriously transform themselves to an ugly and unreadable format when plotting the same data as a bar plot.
df_ts = df.resample("W").max()
df_ts.plot(kind='bar', figsize=(12,8), stacked=True)
The original data was transformed a bit to have the weekly max. Why is this radical change in automatically set dates happening? How can I have the nicely formatted dates as above?
Here is some dummy data
start = pd.to_datetime("1-1-2012")
idx = pd.date_range(start, periods= 365).tolist()
df=pd.DataFrame({'A':np.random.random(365), 'B':np.random.random(365)})
df.index = idx
df_ts = df.resample('W').max()
df_ts.plot(kind='bar', stacked=True)
The plotting code assumes that each bar in a bar plot deserves its own label.
You could override this assumption by specifying your own formatter:
ax.xaxis.set_major_formatter(formatter)
The pandas.tseries.converter.TimeSeries_DateFormatter that Pandas uses to format the dates in the "good" plot works well with line plots when the x-values are dates. However, with a bar plot the x-values (at least those received by TimeSeries_DateFormatter.__call__) are merely integers starting at zero. If you try to use TimeSeries_DateFormatter with a bar plot, all the labels thus start at the Epoch, 1970-1-1 UTC, since this is the date which corresponds to zero. So the formatter used for line plots is unfortunately useless for bar plots (at least as far as I can see).
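To see the integer x-values for yourself, here is a minimal sketch that inspects the tick positions of a pandas bar plot. The sample data is arbitrary, and the Agg backend is only set so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed here so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Ten weeks of arbitrary daily data, resampled to the weekly maximum.
idx = pd.date_range("2012-01-01", periods=70, freq="D")
df = pd.DataFrame({"A": np.random.random(70), "B": np.random.random(70)}, index=idx)
df_ts = df.resample("W").max()

fig, ax = plt.subplots()
df_ts.plot(kind="bar", stacked=True, ax=ax)

# Tick positions are plain bar positions 0, 1, 2, ..., not dates.
bar_ticks = ax.get_xticks()
# The dates only survive as text labels, one per bar.
bar_labels = [tick.get_text() for tick in ax.get_xticklabels()]
print(bar_ticks)
print(bar_labels[:3])
```

Since a formatter only ever sees those positions, any date formatting has to be done by looking the position up in df_ts.index.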
The easiest way I see to produce the desired formatting is to generate and set the labels explicitly:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import matplotlib.ticker as ticker
start = pd.to_datetime("5-1-2012")
idx = pd.date_range(start, periods=365)
df = pd.DataFrame({'A': np.random.random(365), 'B': np.random.random(365)})
df.index = idx
df_ts = df.resample('W').max()
ax = df_ts.plot(kind='bar', stacked=True)
# Make most of the ticklabels empty so the labels don't get too crowded
ticklabels = ['']*len(df_ts.index)
# Every 4th ticklabel shows the month and day
ticklabels[::4] = [item.strftime('%b %d') for item in df_ts.index[::4]]
# Every 12th ticklabel includes the year
ticklabels[::12] = [item.strftime('%b %d\n%Y') for item in df_ts.index[::12]]
ax.xaxis.set_major_formatter(ticker.FixedFormatter(ticklabels))
plt.gcf().autofmt_xdate()
plt.show()
For those looking for a simple example of a bar plot with dates:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
dates = pd.date_range('2012-1-1', '2017-1-1', freq='M')
df = pd.DataFrame({'A':np.random.random(len(dates)), 'Date':dates})
fig, ax = plt.subplots()
df.plot.bar(x='Date', y='A', ax=ax)
ticklabels = ['']*len(df)
skip = len(df)//12
ticklabels[::skip] = df['Date'].iloc[::skip].dt.strftime('%Y-%m-%d')
ax.xaxis.set_major_formatter(mticker.FixedFormatter(ticklabels))
fig.autofmt_xdate()
# fixes the tracker
# https://matplotlib.org/users/recipes.html
def fmt(x, pos=0, max_i=len(ticklabels)-1):
    i = int(x)
    i = 0 if i < 0 else max_i if i > max_i else i
    return dates[i]
ax.fmt_xdata = fmt
plt.show()
I've struggled with this problem too, and after reading several posts came up with the following solution, which seems to me slightly clearer than the matplotlib.dates approach.
Labels without modification:
# Use DatetimeIndex instead of date_range for pandas earlier than 1.0.0 version
timeline = pd.date_range(start='2018, November', freq='M', periods=15)
df = pd.DataFrame({'date': timeline, 'value': np.random.randn(15)})
df.set_index('date', inplace=True)
df.plot(kind='bar', figsize=(12, 8), color='#2ecc71')
Labels with modification:
def line_format(label):
    """
    Convert time label to the format of pandas line plot
    """
    month = label.month_name()[:3]
    if month == 'Jan':
        month += f'\n{label.year}'
    return month
# Note that we specify rot here
ax = df.plot(kind='bar', figsize=(12, 8), color='#2ecc71', rot=0)
ax.set_xticklabels(map(line_format, df.index))
This approach adds the year to the label only if the month is January.
Here's an easy approach with pandas plot() and without using matplotlib dates:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# generate sample data
start = pd.to_datetime("1-1-2012")
index = pd.date_range(start, periods= 365)
df = pd.DataFrame({'A' : np.random.random(365), 'B' : np.random.random(365)}, index=index)
# resample to any timeframe you need, e.g. months
df_months = df.resample("M").sum()
# plot
fig, ax = plt.subplots()
df_months.plot(kind="bar", figsize=(16,5), stacked=True, ax=ax)
# format xtick-labels with list comprehension
ax.set_xticklabels([x.strftime("%Y-%m") for x in df_months.index], rotation=45)
plt.show()
How to get nicely formatted dates like the pandas line plot
The issue is that the pandas bar plot processes the date variable as a categorical variable where each date is considered to be a unique category, so the x-axis units are set to integers starting at 0 (like the default DataFrame index when none is assigned) and the full string of each date is shown without any automatic formatting.
Here are two solutions to format the date tick labels of a pandas (stacked) bar chart of a time series:
The first is a variation of the answer by unutbu and is made to better fit the data shown in the question;
The second is a generalized solution that lets you use matplotlib date tick locators and formatters which produces appropriate date labels for time series of any type of frequency.
But first, let's see what the nicely formatted tick labels look like when the sample data is plotted with a pandas line plot.
Default pandas line plot date formatting
import numpy as np # v 1.19.2
import pandas as pd # v 1.1.3
import matplotlib.dates as mdates # v 3.3.2
# Create sample dataset with a daily frequency and resample it to a weekly frequency
rng = np.random.default_rng(seed=123) # random number generator
idx = pd.date_range(start='2012-01-01', end='2013-12-31', freq='D')
df_raw = pd.DataFrame(rng.random(size=(idx.size, 3)),
index=idx, columns=list('ABC'))
df = df_raw.resample('W').sum() # default is 'W-SUN'
# Create pandas stacked line plot
ax = df.plot(stacked=True, figsize=(10,5))
Because the data is grouped by week with timestamps for Sundays (frequency W-SUN), the monthly tick labels are not necessarily placed on the first day of the month and there can be 3 or 4 weeks between each first week of the month so the minor ticks are unevenly spaced (noticeable if you look closely). Here are the exact dates of the major ticks:
# Convert major x ticks to date labels
np.array([mdates.num2date(tick*7-4).strftime('%Y-%b-%d') for tick in ax.get_xticks()])
"""
array(['2012-Jan-01', '2012-Apr-01', '2012-Jul-01', '2012-Oct-07',
'2013-Jan-06', '2013-Apr-07', '2013-Jul-07', '2013-Oct-06',
'2014-Jan-05'], dtype='<U11')
"""
The challenge lies in selecting the ticks for each first week of the month seeing as they are unequally spaced. Other answers have provided simple solutions based on a fixed tick frequency which produces oddly spaced labels in terms of dates where the months can be sometimes repeated (for example the month of July in unutbu's answer). Or they have provided solutions based on a monthly time series instead of a weekly time series, which is simpler to format seeing as there are always 12 months per year. So here is a solution that gives nicely formatted tick labels like in the pandas line plot and that works for any frequency of data.
Solution 1: pandas bar plot with tick labels based on the DatetimeIndex
# Create pandas stacked bar chart
ax = df.plot.bar(stacked=True, figsize=(10,5))
# Create list of monthly timestamps by selecting the first weekly timestamp of each
# month (in this example, the first Sunday of each month)
monthly_timestamps = [timestamp for idx, timestamp in enumerate(df.index)
                      if (timestamp.month != df.index[idx-1].month) | (idx == 0)]
# Automatically select appropriate number of timestamps so that x-axis does
# not get overcrowded with tick labels
step = 1
while len(monthly_timestamps[::step]) > 10: # increase number if time range >3 years
    step += 1
timestamps = monthly_timestamps[::step]
# Create tick labels from timestamps
labels = [ts.strftime('%b\n%Y') if ts.year != timestamps[idx-1].year
          else ts.strftime('%b') for idx, ts in enumerate(timestamps)]
# Set major ticks and labels
ax.set_xticks([df.index.get_loc(ts) for ts in timestamps])
ax.set_xticklabels(labels)
# Set minor ticks without labels
ax.set_xticks([df.index.get_loc(ts) for ts in monthly_timestamps], minor=True)
# Rotate and center labels
ax.figure.autofmt_xdate(rotation=0, ha='center')
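If you need this for several plots, the tick selection can be wrapped in a function. This is only a sketch of the same logic: the name `month_start_ticks` and the `max_labels` parameter are mine, and unlike the code above it always shows the year on the first label, even when the whole index lies within one year.

```python
import pandas as pd

def month_start_ticks(index, max_labels=10):
    """Return (positions, labels) for the first timestamp of each month in `index`."""
    # First timestamp of each month
    firsts = [ts for i, ts in enumerate(index)
              if i == 0 or ts.month != index[i - 1].month]
    # Thin out the ticks until at most `max_labels` remain
    step = 1
    while len(firsts[::step]) > max_labels:
        step += 1
    selected = firsts[::step]
    # Show the year on the first label and whenever the year changes
    labels = [ts.strftime('%b\n%Y') if i == 0 or ts.year != selected[i - 1].year
              else ts.strftime('%b')
              for i, ts in enumerate(selected)]
    # Bar plots use integer positions, so translate timestamps to locations
    positions = [index.get_loc(ts) for ts in selected]
    return positions, labels

# Hypothetical weekly index covering 18 months
weekly_index = pd.date_range('2012-01-01', '2013-06-30', freq='W-SUN')
positions, labels = month_start_ticks(weekly_index)
print(positions)
print(labels)
```

The result can then be passed to `ax.set_xticks(positions)` and `ax.set_xticklabels(labels)` on a pandas bar plot built from a DataFrame with that index.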
To my knowledge, there is no way of getting this exact label formatting with the matplotlib.dates (mdates) tick locators and formatters. Nevertheless, combining mdates functionalities with a pandas stacked bar plot can come in handy if you prefer using tick locators/formatters or if you want to have dynamic ticks when using the interactive interface of matplotlib (to pan/zoom in and out).
At this point, it may be useful to consider creating the stacked bar plot in matplotlib directly, where you need to loop through the variables to create the stacked bar. The pandas-based solution shown below works by looping through the patches of the bars to relocate them according to matplotlib date units. So it is basically one loop instead of another, up to you to see which is more convenient.
Solution 2: pandas bar plot with matplotlib tick locators and formatters
This generalized solution uses the mdates AutoDateLocator, which places ticks at the beginning of months/years. If you generate data and timestamps with pd.date_range in pandas (like in this example), keep in mind that the commonly used 'M' and 'Y' frequencies (spelled 'ME' and 'YE' since pandas 2.2) produce timestamps for the end date of the periods. The code given in the following example aligns monthly/yearly tick marks with 'MS' and 'YS' frequencies.
If you import a dataset using end-of-period dates (or some other type of pandas frequency not aligned with AutoDateLocator ticks), I am not aware of any convenient way to shift the AutoDateLocator accordingly so that the labels become correctly aligned with the bars. I see two options: i) resample the data using df.resample('MS').sum() if that does not cause any issue regarding the meaning of the underlying data; ii) or else use another date locator.
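As a small illustration of option i), the sketch below builds a month-end index and resamples it to month-start timestamps. The values are made up, and the month-end index is created with the `pd.offsets.MonthEnd()` offset object rather than a string alias, because the alias differs between pandas versions.

```python
import pandas as pd

# Month-end timestamps: 2012-01-31, 2012-02-29, 2012-03-31
month_end_index = pd.date_range('2012-01-01', periods=3, freq=pd.offsets.MonthEnd())
df_month_end = pd.DataFrame({'value': [1.0, 2.0, 3.0]}, index=month_end_index)

# Resampling with 'MS' relabels each month with its first day, which is where
# AutoDateLocator places its monthly ticks
df_month_start = df_month_end.resample('MS').sum()

print(df_month_end.index.day.tolist())
print(df_month_start.index.day.tolist())
```

Whether relabelling the data this way is acceptable depends on what the timestamps mean in your dataset.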
This issue causes no problem in the following example seeing as the data has a week end frequency 'W-SUN' so the monthly/yearly labels placed at a month/year start frequency are fine.
# Create pandas stacked bar chart with the default bar width = 0.5
ax = df.plot.bar(stacked=True, figsize=(10,5))
# Compute width of bars in matplotlib date units, 'md' (in days) and adjust it if
# the bar width in df.plot.bar has been set to something else than the default 0.5
bar_width_md_default, = np.diff(mdates.date2num(df.index[:2]))/2
bar_width = ax.patches[0].get_width()
bar_width_md = bar_width*bar_width_md_default/0.5
# Compute new x values in matplotlib date units for the patches (rectangles) that
# make up the stacked bars, adjusting the positions according to the bar width:
# if the frequency is in months (or years), the bars may not always be perfectly
# centered over the tick marks depending on the number of days difference between
# the months (or years) given by df.index[0] and [1] used to compute the bar
# width, this should not be noticeable if the bars are wide enough.
x_bars_md = mdates.date2num(df.index) - bar_width_md/2
nvar = len(ax.get_legend_handles_labels()[1])
x_patches_md = np.ravel(nvar*[x_bars_md])
# Set bars to new x positions and adjust width: this loop works fine with NaN
# values as well because in bar plot NaNs are drawn with a rectangle of 0 height
# located at the foot of the bar, you can verify this with patch.get_bbox()
for patch, x_md in zip(ax.patches, x_patches_md):
    patch.set_x(x_md)
    patch.set_width(bar_width_md)
# Set major ticks
maj_loc = mdates.AutoDateLocator()
ax.xaxis.set_major_locator(maj_loc)
# Show minor tick under each bar (instead of each month) to highlight
# discrepancy between major tick locator and bar positions seeing as no tick
# locator is available for first-week-of-the-month frequency
ax.set_xticks(x_bars_md + bar_width_md/2, minor=True)
# Set major tick formatter
zfmts = ['', '%b\n%Y', '%b', '%b-%d', '%H:%M', '%H:%M']
fmt = mdates.ConciseDateFormatter(maj_loc, zero_formats=zfmts, show_offset=False)
ax.xaxis.set_major_formatter(fmt)
# Shift the plot frame to where the bars are now located
xmin = min(x_bars_md) - bar_width_md
xmax = max(x_bars_md) + 2*bar_width_md
ax.set_xlim(xmin, xmax)
# Adjust tick label format last, else it may sometimes not be applied correctly
ax.figure.autofmt_xdate(rotation=0, ha='center')
Minor ticks are displayed under each bar to highlight the fact that the timestamps of the bars often do not coincide with a month/year start marked by the labels of the AutoDateLocator ticks. I am not aware of any date locator that can be used to select ticks for the first week of each month and reproduce exactly the result shown in solution 1.
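The reason the patches have to be moved at all is that a pandas bar plot positions the bars at integers 0, 1, 2, ... rather than at dates. The reduced sketch below checks this on a tiny made-up DataFrame and then applies the same relocation as in solution 2. It uses the non-interactive Agg backend so it can run without a display; `df_demo` and the other names are for illustration only.

```python
import matplotlib
matplotlib.use('Agg')  # headless backend, only needed to run this without a display
import matplotlib.dates as mdates
import numpy as np
import pandas as pd

index = pd.date_range('2012-01-01', periods=4, freq='W-SUN')
df_demo = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [4, 3, 2, 1]}, index=index)
ax = df_demo.plot.bar(stacked=True)

# Before: bar centers sit at integer positions, once per column
centers_before = [p.get_x() + p.get_width() / 2 for p in ax.patches]

# Default pandas bar width (0.5) expressed in matplotlib date units (days)
bar_width_md = np.diff(mdates.date2num(index[:2]))[0] / 2
x_bars_md = mdates.date2num(index) - bar_width_md / 2

# Patches are ordered column by column, so repeat the x values once per column
n_columns = df_demo.shape[1]
for patch, x_md in zip(ax.patches, np.tile(x_bars_md, n_columns)):
    patch.set_x(x_md)
    patch.set_width(bar_width_md)

# After: bar centers coincide with the dates in matplotlib date units
centers_after = [p.get_x() + p.get_width() / 2 for p in ax.patches]
print(centers_before)
print(centers_after)
```

Once the bars sit at date positions, the mdates locators and formatters can be applied to the x-axis as in the code above.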
Documentation: date format codes, mdates.ConciseDateFormatter
Here's a possibly easier approach using mdates, though it requires you to loop over your columns, calling the bar plot from matplotlib. Here's an example where I plot just one column and use mdates for customized ticks and labels (EDIT: added a looping function to plot all columns stacked):
import datetime
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
def format_x_date_month_day(ax):
    # Standard date x-axis formatting block, labels each month and ticks each day
    days = mdates.DayLocator()
    months = mdates.MonthLocator() # every month
    dayFmt = mdates.DateFormatter('%D')
    monthFmt = mdates.DateFormatter('%Y-%m')
    ax.figure.autofmt_xdate()
    ax.xaxis.set_major_locator(months)
    ax.xaxis.set_major_formatter(monthFmt)
    ax.xaxis.set_minor_locator(days)
def df_stacked_bar_formattable(df, ax, **kwargs):
    P = []
    lastBar = None
    for col in df.columns:
        X = df.index
        Y = df[col]
        if lastBar is not None:
            P.append(ax.bar(X, Y, bottom=lastBar, **kwargs))
            lastBar = lastBar + Y # running total so that 3 or more columns stack correctly
        else:
            P.append(ax.bar(X, Y, **kwargs))
            lastBar = Y
    plt.legend([p[0] for p in P], df.columns)
span_days = 90
start = pd.to_datetime("1-1-2012")
idx = pd.date_range(start, periods=span_days).tolist()
df=pd.DataFrame(index=idx, data={'A':np.random.random(span_days), 'B':np.random.random(span_days)})
plt.close('all')
fig, ax = plt.subplots(1)
df_stacked_bar_formattable(df, ax)
format_x_date_month_day(ax)
plt.show()
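(Referencing matplotlib.org for an example of looping to create a stacked bar plot.) This gives a stacked bar chart with a major tick labelled '%Y-%m' for each month and a minor tick for each day.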
Another approach that should work and be much easier is to use df.plot.bar(ax=ax, stacked=True), however it does not admit date axis formatting with mdates and is the subject of my question.
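When looping over more than two columns, each layer has to start at the running total of all layers below it, not just at the previous column. Here is a minimal sketch of that idea, assuming numeric columns without NaNs; the function name `stacked_bar_cumulative` and the demo data are mine, and the Agg backend is only there so it runs without a display.

```python
import matplotlib
matplotlib.use('Agg')  # headless backend, only needed to run this without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

def stacked_bar_cumulative(df, ax, **kwargs):
    """Draw one ax.bar call per column, stacking on the running total."""
    bottom = np.zeros(len(df))
    containers = []
    for col in df.columns:
        heights = df[col].to_numpy()
        containers.append(ax.bar(df.index, heights, bottom=bottom, **kwargs))
        bottom = bottom + heights  # next layer starts on top of everything so far
    ax.legend([c[0] for c in containers], list(df.columns))
    return containers

idx = pd.date_range('2012-01-01', periods=5)
df_demo = pd.DataFrame({'A': [1.0, 2.0, 3.0, 4.0, 5.0],
                        'B': [5.0, 4.0, 3.0, 2.0, 1.0],
                        'C': [1.0, 1.0, 1.0, 1.0, 1.0]}, index=idx)

fig, ax = plt.subplots()
containers = stacked_bar_cumulative(df_demo, ax)

# The top edge of the last layer should equal the row sums
tops = [rect.get_y() + rect.get_height() for rect in containers[-1]]
print(tops)
```

Because the bars are drawn by matplotlib with datetime x values, the mdates locators and formatters from `format_x_date_month_day` can be applied to the same axes.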
Maybe not the most elegant, but hopefully easy way:
fig = plt.figure()
ax = fig.add_subplot(111)
df_ts.plot(kind='bar', figsize=(12,8), stacked=True,ax=ax)
ax.set_xticklabels([''] * len(df_ts.index)) # blank out the bar plot's tick labels
df_ts.plot(linewidth=0, ax=ax) # This sets the nice x_ticks automatically
[EDIT]: ax=ax is needed in df_ts.plot()