pandas subplot, split into rows [duplicate] - pandas

I have a few Pandas DataFrames sharing the same value scale, but having different columns and indices. When invoking df.plot(), I get separate plot images. What I really want is to have them all in the same plot as subplots, but I can't work out how to do it and would highly appreciate some help.

You can manually create the subplots with matplotlib, and then plot the dataframes on a specific subplot using the ax keyword. For example for 4 subplots (2x2):
import matplotlib.pyplot as plt
fig, axes = plt.subplots(nrows=2, ncols=2)
df1.plot(ax=axes[0,0])
df2.plot(ax=axes[0,1])
...
Here axes is an array which holds the different subplot axes, and you can access one just by indexing axes.
If you want a shared x-axis, then you can provide sharex=True to plt.subplots.
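As a runnable illustration of the approach above, here is a minimal sketch. The four sample frames and their column names are made up for the example; sharey=True is added on the assumption that the frames share a value scale, as the question states.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Four small sample frames with different columns but a common value scale
frames = [pd.DataFrame(rng.random((10, 2)) * 100, columns=cols)
          for cols in (["A", "B"], ["C", "D"], ["E", "F"], ["G", "H"])]

# 2x2 grid; shared axes keep tick ranges identical across subplots
fig, axes = plt.subplots(nrows=2, ncols=2, sharex=True, sharey=True, figsize=(8, 6))
for ax, frame in zip(axes.flat, frames):
    frame.plot(ax=ax)  # draw each DataFrame on its own subplot
fig.tight_layout()
```

Iterating over axes.flat avoids writing out axes[0,0], axes[0,1], and so on by hand.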

You can see examples in the documentation demonstrating joris's answer. Also from the documentation: you can set subplots=True and layout=(rows, cols) within the pandas plot function:
df.plot(subplots=True, layout=(1,2))
You could also use fig.add_subplot(), which takes subplot grid parameters such as 221, 222, 223, 224, etc., as described in the post here. Nice examples of plotting pandas data frames, including subplots, can be seen in this ipython notebook.
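Note that subplots=True splits the columns of a single DataFrame across subplots; it does not combine several DataFrames. A minimal sketch with made-up sample data:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
df = pd.DataFrame(rng.random((10, 4)), columns=list("ABCD"))

# One subplot per column, arranged on a 2x2 grid.
# pandas creates the figure itself and returns the Axes objects.
axes = df.plot(subplots=True, layout=(2, 2), figsize=(8, 6), sharex=True)
```

The layout must provide at least as many cells as there are columns to plot.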

You can plot multiple subplots of multiple pandas data frames using matplotlib with a simple trick: make a list of all the data frames, then use a for loop to plot each one on its own subplot.
Working code:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
# dataframe sample data
df1 = pd.DataFrame(np.random.rand(10,2)*100, columns=['A', 'B'])
df2 = pd.DataFrame(np.random.rand(10,2)*100, columns=['A', 'B'])
df3 = pd.DataFrame(np.random.rand(10,2)*100, columns=['A', 'B'])
df4 = pd.DataFrame(np.random.rand(10,2)*100, columns=['A', 'B'])
df5 = pd.DataFrame(np.random.rand(10,2)*100, columns=['A', 'B'])
df6 = pd.DataFrame(np.random.rand(10,2)*100, columns=['A', 'B'])
#define number of rows and columns for subplots
nrow=3
ncol=2
# make a list of all dataframes
df_list = [df1 ,df2, df3, df4, df5, df6]
fig, axes = plt.subplots(nrow, ncol)
# plot counter
count=0
for r in range(nrow):
    for c in range(ncol):
        df_list[count].plot(ax=axes[r,c])
        count+=1
Using this code you can plot subplots in any configuration. You need to define the number of rows nrow and the number of columns ncol, and make a list df_list of the data frames you want to plot. Note that nrow * ncol must not exceed len(df_list), or the loop will run past the end of the list.
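When the number of data frames does not fill the grid exactly, one option is to derive the row count from the list length and hide the leftover cells. This is a sketch under that assumption, with five made-up frames on a two-column grid; it is a variation on the loop above, not the original author's code.

```python
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
df_list = [pd.DataFrame(rng.random((10, 2)) * 100, columns=["A", "B"])
           for _ in range(5)]

ncol = 2
nrow = math.ceil(len(df_list) / ncol)  # enough rows for every frame
# squeeze=False always returns a 2D array, even for a single row or column
fig, axes = plt.subplots(nrow, ncol, squeeze=False)
flat_axes = axes.ravel()

# zip stops at the shorter sequence, so no manual counter is needed
for ax, frame in zip(flat_axes, df_list):
    frame.plot(ax=ax)

# Hide grid cells that did not receive a DataFrame
for ax in flat_axes[len(df_list):]:
    ax.set_visible(False)
```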

You can use the familiar Matplotlib style calling a figure and subplot, but you simply need to specify the current axis using plt.gca(). An example:
plt.figure(1)
plt.subplot(2,2,1)
df.A.plot() #no need to specify for first axis
plt.subplot(2,2,2)
df.B.plot(ax=plt.gca())
plt.subplot(2,2,3)
df.C.plot(ax=plt.gca())
etc...

You can use this:
fig = plt.figure()
ax = fig.add_subplot(221)
plt.plot(x,y)
ax = fig.add_subplot(222)
plt.plot(x,z)
...
plt.show()
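The snippet above does not define x, y, or z. A self-contained sketch, assuming simple sample arrays, and calling plot on the returned Axes so the target subplot is explicit rather than relying on the current axes:

```python
import numpy as np
import matplotlib.pyplot as plt

# Sample data (assumed for illustration)
x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x)
z = np.cos(x)

fig = plt.figure()
ax1 = fig.add_subplot(2, 2, 1)  # equivalent to add_subplot(221)
ax1.plot(x, y)
ax2 = fig.add_subplot(2, 2, 2)  # equivalent to add_subplot(222)
ax2.plot(x, z)
```

To plot a pandas object instead, pass the same Axes via the ax keyword, for example df.plot(ax=ax1).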

You may not need to use Pandas at all. Here's a matplotlib plot of cat frequencies:
x = np.linspace(0, 2*np.pi, 400)
y = np.sin(x**2)
f, axes = plt.subplots(2, 1)
for c, i in enumerate(axes):
    axes[c].plot(x, y)
    axes[c].set_title('cats')
plt.tight_layout()

Option 1: Create subplots from a dictionary of dataframes with long (tidy) data
Assumptions:
There is a dictionary of multiple dataframes of tidy data that are either:
Created by reading in from files
Created by separating a single dataframe into multiple dataframes
The categories, cat, may be overlapping, but all dataframes don't necessarily contain all values of cat
hue='cat'
This example uses a dict of dataframes, but a list of dataframes would be similar.
If the dataframes are wide, use pandas.DataFrame.melt to convert them to long form.
Because dataframes are being iterated through, there's no guarantee that colors will be mapped the same for each plot
A custom color map needs to be created from the unique 'cat' values for all the dataframes
Since the colors will be the same, place one legend to the side of the plots, instead of a legend in every plot
Tested in python 3.10, pandas 1.4.3, matplotlib 3.5.1, seaborn 0.11.2
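For the wide-to-long conversion mentioned in the assumptions, a minimal sketch of pandas.DataFrame.melt. The frame wide and its values are made up; the output column names cat and y are chosen to match the names used in this answer.

```python
import pandas as pd

# Wide form: one column per category (sample values)
wide = pd.DataFrame({"x": [0.1, 0.2, 0.3],
                     "A": [1.0, 2.0, 3.0],
                     "B": [4.0, 5.0, 6.0]})

# Long (tidy) form: one row per (x, cat) pair
long = wide.melt(id_vars="x", var_name="cat", value_name="y")
print(long)
```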
Imports and Test Data
import pandas as pd
import numpy as np # used for random data
import matplotlib.pyplot as plt
from matplotlib.patches import Patch # for custom legend - square patches
from matplotlib.lines import Line2D # for custom legend - round markers
import seaborn as sns
import math # math.ceil determines the correct number of subplot rows
# synthetic data
df_dict = dict()
for i in range(1, 7):
    np.random.seed(i) # for repeatable sample data
    data_length = 100
    data = {'cat': np.random.choice(['A', 'B', 'C'], size=data_length),
            'x': np.random.rand(data_length), 'y': np.random.rand(data_length)}
    df_dict[i] = pd.DataFrame(data)
# display(df_dict[1].head())
cat x y
0 B 0.944595 0.606329
1 A 0.586555 0.568851
2 A 0.903402 0.317362
3 B 0.137475 0.988616
4 B 0.139276 0.579745
# display(df_dict[6].tail())
cat x y
95 B 0.881222 0.263168
96 A 0.193668 0.636758
97 A 0.824001 0.638832
98 C 0.323998 0.505060
99 C 0.693124 0.737582
Create color mappings and plot
# create color mapping based on all unique values of cat
unique_cat = {cat for v in df_dict.values() for cat in v.cat.unique()} # get unique cats
colors = sns.color_palette('tab10', n_colors=len(unique_cat)) # get a number of colors
cmap = dict(zip(unique_cat, colors)) # zip values to colors
col_nums = 3 # how many plots per row
row_nums = math.ceil(len(df_dict) / col_nums) # how many rows of plots
# create the figure and axes
fig, axes = plt.subplots(row_nums, col_nums, figsize=(9, 6), sharex=True, sharey=True)
# convert to 1D array for easy iteration
axes = axes.flat
# iterate through dictionary and plot
for ax, (k, v) in zip(axes, df_dict.items()):
    sns.scatterplot(data=v, x='x', y='y', hue='cat', palette=cmap, ax=ax)
    sns.despine(top=True, right=True)
    ax.legend_.remove() # remove the individual plot legends
    ax.set_title(f'dataset = {k}', fontsize=11)
fig.tight_layout()
# create legend from cmap
# patches = [Patch(color=v, label=k) for k, v in cmap.items()] # square patches
patches = [Line2D([0], [0], marker='o', color='w', markerfacecolor=v, label=k, markersize=8) for k, v in cmap.items()] # round markers
# place legend outside of plot; change the right bbox value to move the legend up or down
plt.legend(title='cat', handles=patches, bbox_to_anchor=(1.06, 1.2), loc='center left', borderaxespad=0, frameon=False)
plt.show()
Option 2: Create subplots from a single dataframe with multiple separate datasets
The dataframes must be in a long form with the same column names.
This option uses pd.concat to combine multiple dataframes into a single dataframe, and .assign to add a new column.
See Import multiple csv files into pandas and concatenate into one DataFrame for creating a single dataframe from a list of files.
This option is easier because it doesn't require manually mapping colors to 'cat'
Combine DataFrames
# using df_dict, with dataframes as values, from the top
# combine all the dataframes in df_dict to a single dataframe with an identifier column
df = pd.concat((v.assign(dataset=k) for k, v in df_dict.items()), ignore_index=True)
# display(df.head())
cat x y dataset
0 B 0.944595 0.606329 1
1 A 0.586555 0.568851 1
2 A 0.903402 0.317362 1
3 B 0.137475 0.988616 1
4 B 0.139276 0.579745 1
# display(df.tail())
cat x y dataset
595 B 0.881222 0.263168 6
596 A 0.193668 0.636758 6
597 A 0.824001 0.638832 6
598 C 0.323998 0.505060 6
599 C 0.693124 0.737582 6
Plot a FacetGrid with seaborn.relplot
sns.relplot(kind='scatter', data=df, x='x', y='y', hue='cat', col='dataset', col_wrap=3, height=3)
Both options create the same result. However, it's less complicated to combine all the dataframes and plot a figure-level plot with sns.relplot.

Building on @joris's response above, if you have already established a reference to the subplot, you can use that reference as well. For example,
ax1 = plt.subplot2grid((50,100), (0, 0), colspan=20, rowspan=10)
...
df.plot.barh(ax=ax1, stacked=True)
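A fuller sketch of the same idea, assuming a small made-up DataFrame and a 2x3 grid in which one Axes spans two columns and two rows. The grid shape and the names ax_wide, ax_top, and ax_bottom are illustrative only.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"a": [1, 2, 3], "b": [3, 2, 1]}, index=["x", "y", "z"])

fig = plt.figure(figsize=(8, 4))
# Large Axes on the left, two small ones stacked on the right
ax_wide = plt.subplot2grid((2, 3), (0, 0), colspan=2, rowspan=2)
ax_top = plt.subplot2grid((2, 3), (0, 2))
ax_bottom = plt.subplot2grid((2, 3), (1, 2))

# Any pandas plot method accepts the stored reference via ax=
df.plot.barh(ax=ax_wide, stacked=True)
df["a"].plot(ax=ax_top, title="a")
df["b"].plot(ax=ax_bottom, title="b")
fig.tight_layout()
```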

Here is a working pandas subplot example, where modes is the list of column names of the dataframe.
dpi=200
figure_size=(20, 10)
fig, ax = plt.subplots(len(modes), 1, sharex="all", sharey="all", dpi=dpi)
for i in range(len(modes)):
    ax[i] = pivot_df.loc[:, modes[i]].plot.bar(figsize=(figure_size[0], figure_size[1]*len(modes)),
                                               ax=ax[i], title=modes[i], color=my_colors[i])
    ax[i].legend()
fig.suptitle(name)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
fig, axes = plt.subplots(2,2)
df = pd.DataFrame({'A':np.random.randint(1,100,10),
                   'B': np.random.randint(100,1000,10),
                   'C':np.random.randint(100,200,10)})
# plot the same dataframe on every subplot
for ax in axes.flatten():
    df.plot(ax=ax)

Related

Pandas histogram plot with Y axis or colorbar

In Pandas, I am trying to generate a Ridgeline plot for which the density values are shown (either as Y axis or color-ramp). I am using the Joyplot but any other alternative ways are fine.
So, first I created the Ridge plot to show the different distribution plot for each condition (you can reproduce it using this code):
import numpy as np
import pandas as pd
import joypy
import matplotlib
import matplotlib.pyplot as plt
df1 = pd.DataFrame({'Category1':np.random.choice(['C1','C2','C3'],1000),'Category2':np.random.choice(['B1','B2','B3','B4','B5'],1000),
'year':np.arange(start=1900, stop=2900, step=1),
'Data':np.random.uniform(0,1,1000),"Period":np.random.choice(['AA','CC','BB','DD'],1000)})
data_pivot=df1.pivot_table('Data', ['Category1', 'Category2','year'], 'Period')
fig, axes = joypy.joyplot(data_pivot, column=['AA', 'BB', 'CC', 'DD'], by="Category1", ylim='own', figsize=(14,10), legend=True, alpha=0.4)
This generates the figure, but without my desired Y axis. Based on this post, I could add a color ramp, but it neither makes sense nor shows the differences between the distribution plots of the different categories on each line :) ...
ar=df1['Data'].plot.kde().get_lines()[0].get_ydata() ## a workaround to get the probability values to set the colorramp max and min
norm = plt.Normalize(ar.min(), ar.max())
original_cmap = plt.cm.viridis
cmap = matplotlib.colors.ListedColormap(original_cmap(norm(ar)))
sm = matplotlib.cm.ScalarMappable(cmap=original_cmap, norm=norm)
sm.set_array([])
# plotting ....
fig, axes = joypy.joyplot(data_pivot,colormap = cmap , column=['AA', 'BB', 'CC', 'DD'], by="Category1", ylim='own', figsize=(14,10), legend=True, alpha=0.4)
fig.colorbar(sm, ax=axes, label="density")
But what I want is something like either of these figures (preferably with a color ramp):

In matplotlib, is there a method to fix or arrange the order of x-values of a mixed type with a character and digits?

There are several Q&As about x-values in matplotlib, and they show that when the x-values are int or float, matplotlib plots the figure in the right order of x. For example, as strings, the plot shows the x-values in the order
1 15 17 2 21 7 etc
but once converted to int, the order becomes
1 2 7 15 17 21 etc
in human order.
If the x values are mixed with character and digits such as
NN8 NN10 NN15 NN20 NN22 etc
the plot will show in the order of
NN10 NN15 NN20 NN22 NN8 etc
Is there a way to fix the order of the x-values in human order, or in the existing order of the x list, without removing 'NN' from the x-values?
In more detail, the xvalues are directory names and using grep sort inside linux function, the results are displayed in linux terminal as follows, which can be saved in text file.
joonho@login:~/NDataNpowN$ get_TEFrmse NN 2 | sort -n -t N -k 3
NN7 0.3311
NN8 0.3221
NN9 0.2457
NN10 0.2462
NN12 0.2607
NN14 0.2635
Without sort, the linux shell also displays in the machine order such as
NN10 0.2462
NN12 0.2607
NN14 0.2635
NN7 0.3311
NN8 0.3221
NN9 0.2457
As I said, pandas would make this task easier than dealing with base Python lists and such:
import matplotlib.pyplot as plt
import pandas as pd
#imports the text file assuming that your data are separated by space, as in your example above
df = pd.read_csv("test.txt", sep=r"\s+", names=["X", "Y"])
#extracting the number in a separate column, assuming you do not have terms like NN1B3X5
df["N"] = df.X.str.replace(r"\D", "", regex=True).astype(int)
#this step is only necessary, if your file is not pre-sorted by Linux
df = df.sort_values(by="N")
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 6))
#categorical plotting
df.plot(x="X", y="Y", ax=ax1)
ax1.set_title("Evenly spaced")
#numerical plotting
df.plot(x="N", y="Y", ax=ax2)
ax2.set_xticks(df.N)
ax2.set_xticklabels(df.X)
ax2.set_title("Numerical spacing")
plt.show()
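Sample output: (figure not reproduced here; two side-by-side panels titled "Evenly spaced" and "Numerical spacing")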
Since you asked if there is a non-pandas solution: of course. Pandas just makes some things more convenient. In this case, I would revert to numpy. Numpy is a matplotlib dependency, so, in contrast to pandas, it is already installed if you use matplotlib:
import matplotlib.pyplot as plt
import numpy as np
import re
#read file as strings
arr = np.genfromtxt("test.txt", dtype="U15")
#strip all non-digit characters to get the numeric part of each label
Xnums = np.asarray([re.sub(r"\D", "", i) for i in arr[:, 0]], dtype=int)
#sort array
arr = arr[np.argsort(Xnums)]
#extract x-values as strings...
Xstr = arr[:, 0]
#...and y-values as float
Yvals = arr[:, 1].astype(float)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 6))
#categorical plotting
ax1.plot(Xstr, Yvals)
ax1.set_title("Evenly spaced")
#numerical plotting
ax2.plot(np.sort(Xnums), Yvals)
ax2.set_xticks(np.sort(Xnums))
ax2.set_xticklabels(Xstr)
ax2.set_title("Numerical spacing")
plt.show()
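Sample output: (figure not reproduced here; two side-by-side panels titled "Evenly spaced" and "Numerical spacing")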
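The ordering itself can also be done in plain Python with a natural-sort key. This is a minimal sketch using the labels and values from the question; natural_key is a hypothetical helper name, not part of any library. Matplotlib places string categories on the axis in order of first appearance, so passing the sorted lists keeps the human order.

```python
import re
import matplotlib.pyplot as plt

def natural_key(label):
    """Split a label into alternating text and integer chunks for sorting."""
    # re.split with a capturing group keeps the digit runs in the result
    return [int(chunk) if chunk.isdigit() else chunk.lower()
            for chunk in re.split(r"(\d+)", label)]

# Unsorted data as shown in the question
labels = ["NN10", "NN12", "NN14", "NN7", "NN8", "NN9"]
values = [0.2462, 0.2607, 0.2635, 0.3311, 0.3221, 0.2457]

# Sort label/value pairs together so they stay aligned
pairs = sorted(zip(labels, values), key=lambda pair: natural_key(pair[0]))
sorted_labels = [label for label, _ in pairs]
sorted_values = [value for _, value in pairs]

fig, ax = plt.subplots()
ax.plot(sorted_labels, sorted_values)  # categories appear in the given order
```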

Making a Scatter Plot from a DataFrame in Pandas

I have a DataFrame and need to make a scatter-plot from it.
I need to use 2 columns as the x-axis and y-axis and only need to plot 2 rows from the entire dataset. Any suggestions?
For example, my dataframe is below (50 states x 4 columns). I need to plot 'rgdp_change' on the x-axis vs 'diff_unemp' on the y-axis, and only need to plot for the states, "Michigan" and "Wisconsin".
So from the dataframe, you'll need to select the rows from a list of the states you want: ['Michigan', 'Wisconsin']
I also figured you would probably want a legend or some way to differentiate one point from the other. To do this, we create a colormap assigning a different color to each state. This way the code is generalizable for more than those two states.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib.colors as colors
# generate a random df with the relevant rows, columns to your actual df
df = pd.DataFrame({'State':['Alabama', 'Alaska', 'Michigan', 'Wisconsin'], 'real_gdp':[1.75*10**5, 4.81*10**4, 2.59*10**5, 1.04*10**5],
'rgdp_change': [-0.4, 0.5, 0.4, -0.5], 'diff_unemp': [-1.3, 0.4, 0.5, -11]})
fig, ax = plt.subplots()
states = ['Michigan', 'Wisconsin']
colormap = cm.viridis
colorlist = [colors.rgb2hex(colormap(i)) for i in np.linspace(0, 0.9, len(states))]
for i,c in enumerate(colorlist):
    x = df.loc[df["State"].isin(['Michigan', 'Wisconsin'])].rgdp_change.values[i]
    y = df.loc[df["State"].isin(['Michigan', 'Wisconsin'])].diff_unemp.values[i]
    legend_label = states[i]
    ax.scatter(x, y, label=legend_label, s=50, linewidth=0.1, c=c)
ax.legend()
plt.show()
Use the dataframe plot method, but first filter the states you need using the index's isin method (this assumes the state names are the index):
states = ["Michigan", "Wisconsin"]
df[df.index.isin(states)].plot(kind='scatter', x='rgdp_change', y='diff_unemp')
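If the state names live in a regular column rather than the index, the same filter can be applied to that column. A minimal sketch, assuming a column named State and reusing the sample values from the previous answer; the annotations are an optional addition to tell the two points apart.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Sample data (values taken from the answer above, not real statistics)
df = pd.DataFrame({
    "State": ["Alabama", "Alaska", "Michigan", "Wisconsin"],
    "rgdp_change": [-0.4, 0.5, 0.4, -0.5],
    "diff_unemp": [-1.3, 0.4, 0.5, -11],
})

states = ["Michigan", "Wisconsin"]
subset = df[df["State"].isin(states)]  # keep only the requested rows

ax = subset.plot(kind="scatter", x="rgdp_change", y="diff_unemp")
# Label each point with its state name
for _, row in subset.iterrows():
    ax.annotate(row["State"], (row["rgdp_change"], row["diff_unemp"]),
                xytext=(5, 5), textcoords="offset points")
```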

Create matplotlib subplots without manually counting number of subplots?

When doing ad-hoc analysis in Jupyter Notebook, I often want to view sequences of transformations to some Pandas DataFrame as vertically stacked subplots. My usual quick-and-dirty method is to not use subplots at all, but create a new figure for each plot:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
df = pd.DataFrame({"a": range(100)}) # Some arbitrary DataFrame
df.plot(title="0 to 100")
plt.show()
df = df * -1 # Some transformation
df.plot(title="0 to -100")
plt.show()
df = df * 2 # Some other transformation
df.plot(title="0 to -200")
plt.show()
This method has limitations. The x-axis ticks are unaligned even when identically indexed (because the x-axis width depends on y-axis labels) and the Jupyter cell output contains several separate inline images, not a single one that I can save or copy-and-paste.
As far as I know, the proper solution is to use plt.subplots():
fig, axes = plt.subplots(3, figsize=(20, 9))
df = pd.DataFrame({"a": range(100)}) # Arbitrary DataFrame
df.plot(ax=axes[0], title="0 to 100")
df = df * -1 # Some transformation
df.plot(ax=axes[1], title="0 to -100")
df = df * 2 # Some other transformation
df.plot(ax=axes[2], title="0 to -200")
plt.tight_layout()
plt.show()
This yields exactly the output I'd like. However, it also introduces an annoyance that makes me use the first method by default: I have to manually count the number of subplots I've created and update this count in several different places as the code changes.
In the multi-figure case, adding a fourth plot is as simple as calling df.plot() and plt.show() a fourth time. With subplots, the equivalent change requires updating the subplot count, plus arithmetic to resize the output figure, replacing plt.subplots(3, figsize=(20, 9)) with plt.subplots(4, figsize=(20, 12)). Each newly added subplot needs to know how many other subplots already exist (ax=axes[0], ax=axes[1], ax=axes[2], etc.), so any additions or removals require cascading changes to the plots below.
This seems like it should be trivial to automate — it's just counting and multiplication — but I'm finding it impossible to implement with the matplotlib/pyplot API. The closest I can get is the following partial solution, which is terse enough but still requires explicit counting:
n_subplots = 3 # Must still be updated manually as code changes
fig, axes = plt.subplots(n_subplots, figsize=(20, 3 * n_subplots))
i = 0 # Counts how many subplots have been added so far
df = pd.DataFrame({"a": range(100)}) # Arbitrary DataFrame
df.plot(ax=axes[i], title="0 to 100")
i += 1
df = df * -1 # Arbitrary transformation
df.plot(ax=axes[i], title="0 to -100")
i += 1
df = df * 2 # Arbitrary transformation
df.plot(ax=axes[i], title="0 to -200")
i += 1
plt.tight_layout()
plt.show()
The root problem is that any time df.plot() is called, there must exist an axes list of known size. I considered delaying the execution of df.plot() somehow, e.g. by appending to a list of lambda functions that can be counted before they're called in sequence, but this seems like an extreme amount of ceremony just to avoid updating an integer by hand.
Is there a more convenient way to do this? Specifically, is there a way to create a figure with an "expandable" number of subplots, suitable for ad-hoc/interactive contexts where the count is not known in advance?
(Note: This question may appear to be a duplicate of either this question or this one, but the accepted answers to both questions contain exactly the problem I'm trying to solve — that the nrows= parameter of plt.subplots() must be declared before adding subplots.)
First create an empty figure and then add subplots using add_subplot. Update the subplotspecs of the existing subplots in the figure using a new GridSpec for the new geometry (the figure keyword is only needed if you're using constrained layout instead of tight layout).
import matplotlib.pyplot as plt
import matplotlib as mpl
import pandas as pd
def append_axes(fig, as_cols=False):
    """Append new Axes to Figure."""
    n = len(fig.axes) + 1
    nrows, ncols = (1, n) if as_cols else (n, 1)
    gs = mpl.gridspec.GridSpec(nrows, ncols, figure=fig)
    for i,ax in enumerate(fig.axes):
        ax.set_subplotspec(mpl.gridspec.SubplotSpec(gs, i))
    return fig.add_subplot(nrows, ncols, n)
fig = plt.figure(layout='tight')
df = pd.DataFrame({"a": range(100)}) # Arbitrary DataFrame
df.plot(ax=append_axes(fig), title="0 to 100")
df = df * -1 # Some transformation
df.plot(ax=append_axes(fig), title="0 to -100")
df = df * 2 # Some other transformation
df.plot(ax=append_axes(fig), title="0 to -200")
Example for adding the new subplots as columns (and using constrained layout for a change):
fig = plt.figure(layout='constrained')
df = pd.DataFrame({"a": range(100)}) # Arbitrary DataFrame
df.plot(ax=append_axes(fig, True), title="0 to 100")
df = df + 10 # Some transformation
df.plot(ax=append_axes(fig, True), title="10 to 110")
IIUC you need some container for your transformations to achieve this - a list for example. Something like:
arbitrary_trx = [
    lambda x: x, # No transformation
    lambda x: x * -1, # Arbitrary transformation
    lambda x: x * 2] # Arbitrary transformation
fig, axes = plt.subplots(nrows=len(arbitrary_trx))
for ax, f in zip(axes, arbitrary_trx):
    df = df.apply(f)
    df.plot(ax=ax)
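A self-contained variant of this idea that also scales the figure height with the number of plots, as the question asks. It is a sketch: the steps list, the titles, and the 3 inches per subplot are assumptions for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# (title, transformation) pairs; add or remove entries freely
steps = [
    ("0 to 99", lambda d: d),
    ("0 to -99", lambda d: d * -1),
    ("0 to -198", lambda d: d * 2),
]

df = pd.DataFrame({"a": range(100)})  # arbitrary DataFrame
n = len(steps)  # the only place the count is computed
# squeeze=False keeps axes a 2D array even when n == 1
fig, axes = plt.subplots(nrows=n, figsize=(20, 3 * n), squeeze=False)
for ax, (title, transform) in zip(axes.flat, steps):
    df = transform(df)  # transformations are applied cumulatively
    df.plot(ax=ax, title=title)
fig.tight_layout()
```

Adding a fourth plot then means appending one tuple to steps; the subplot count and figure size follow from len(steps).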
You can create an object that stores the data and only creates the figure once you tell it to do so.
import pandas as pd
import matplotlib.pyplot as plt
class AxesStacker():
    def __init__(self):
        self.data = []
        self.titles = []
    def append(self, data, title=""):
        self.data.append(data)
        self.titles.append(title)
    def create(self):
        nrows = len(self.data)
        # squeeze=False always returns an array, so .flat also works for one plot
        self.fig, self.axs = plt.subplots(nrows=nrows, squeeze=False)
        for d, t, ax in zip(self.data, self.titles, self.axs.flat):
            d.plot(ax=ax, title=t)
stacker = AxesStacker()
df = pd.DataFrame({"a": range(100)}) # Some arbitrary DataFrame
stacker.append(df, title="0 to 100")
df = df * -1 # Some transformation
stacker.append(df, title="0 to -100")
df = df * 2 # Some other transformation
stacker.append(df, title="0 to -200")
stacker.create()
plt.show()

Combine two dataframe boxplots in a twinx figure

I want to display two Pandas dataframes within one figure as boxplots.
As each of the two dataframes has different value range, I would like to have them combined in a twinx figure.
Reduced to the minimum, I have tried the following:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df1 = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
df2 = pd.DataFrame(np.random.randint(100,200,size=(100, 2)), columns=list('EF'))
fig, ax1 = plt.subplots()
ax2 = ax1.twinx()
df1.boxplot(ax=ax1)
df2.boxplot(ax=ax2)
plt.show()
The result is expectedly not what it should look like (there should be 6 boxes on the plot, actually!)
How can I manage to have the boxplots next to each other?
I tried to set some dummy scatter points on ax1 and ax2, but this did not really help.
The best solution is to concatenate the data frames for plotting and to use a mask. To create the mask, we use (dfs == dfs) | dfs.isnull() to build a full matrix of True values, and then we select all column names that are not 'E' or 'F'. This gives a 2D matrix that allows you to plot only the first four boxes, as the last two are masked (so their ticks do appear at the bottom). With the inverse mask ~mask you plot the last two on their own axis and mask the first four.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df1 = pd.DataFrame(np.random.randint( 0,100,size=(100, 4)), columns=list('ABCD'))
df2 = pd.DataFrame(np.random.randint(100,200,size=(100, 2)), columns=list('EF' ))
dfs = pd.concat([df1, df2])
mask = ((dfs == dfs) | dfs.isnull()) & (dfs.columns != 'E') & (dfs.columns != 'F')
fig, ax1 = plt.subplots()
dfs[mask].boxplot()
ax2 = ax1.twinx()
dfs[~mask].boxplot()
plt.show()
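An alternative to masking is to call matplotlib's boxplot directly and give each group explicit x positions. This is a different technique from the answer above, sketched with the same sample shapes; the axis labels and position arithmetic are illustrative assumptions.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
df1 = pd.DataFrame(rng.integers(0, 100, size=(100, 4)), columns=list("ABCD"))
df2 = pd.DataFrame(rng.integers(100, 200, size=(100, 2)), columns=list("EF"))

fig, ax1 = plt.subplots()
ax2 = ax1.twinx()  # second y-axis sharing the same x-axis

pos1 = np.arange(1, len(df1.columns) + 1)              # 1..4
pos2 = np.arange(1, len(df2.columns) + 1) + len(pos1)  # 5..6

# A 2D array yields one box per column; manage_ticks=False leaves
# tick placement to the explicit calls below
bp1 = ax1.boxplot(df1.to_numpy(), positions=pos1, manage_ticks=False)
bp2 = ax2.boxplot(df2.to_numpy(), positions=pos2, manage_ticks=False)

ax1.set_xticks(np.concatenate([pos1, pos2]))
ax1.set_xticklabels(list(df1.columns) + list(df2.columns))
ax1.set_xlim(0.5, pos2[-1] + 0.5)
ax1.set_ylabel("df1 scale")
ax2.set_ylabel("df2 scale")
```

Boxes A to D are read against the left axis and E and F against the right one.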