Overlaying actual data on a boxplot from a pandas dataframe - matplotlib

I am using Seaborn to make boxplots from pandas dataframes. Seaborn boxplots seem to essentially read the dataframes the same way as the pandas boxplot functionality (so I hope the solution is the same for both -- but I can just use the dataframe.boxplot function as well). My dataframe has 12 columns and the following code generates a single plot with one boxplot for each column (just like the dataframe.boxplot() function would).
fig, ax = plt.subplots()
sns.set_style("darkgrid", {"axes.facecolor":"darkgrey"})
pal = sns.color_palette("husl",12)
sns.boxplot(data=dataframe, palette=pal)
Can anyone suggest a simple way of overlaying all the values (by columns) while making a boxplot from dataframes?
I will appreciate any help with this.

This hasn't been added to the seaborn.boxplot function yet, but there's something similar in the seaborn.violinplot function, which has other advantages:
x = np.random.randn(30, 6)
sns.violinplot(x, inner="points")
sns.despine(trim=True)

A general solution for boxplots of an entire dataframe, which should work for both seaborn and pandas since they are both matplotlib-based under the hood. I will use the pandas plot as the example, assuming import matplotlib.pyplot as plt is already in place. As you already have the ax, it would make more sense to use ax.text(...) instead of plt.text(...).
In [35]:
print(df)
V1 V2 V3 V4 V5
0 0.895739 0.850580 0.307908 0.917853 0.047017
1 0.931968 0.284934 0.335696 0.153758 0.898149
2 0.405657 0.472525 0.958116 0.859716 0.067340
3 0.843003 0.224331 0.301219 0.000170 0.229840
4 0.634489 0.905062 0.857495 0.246697 0.983037
5 0.573692 0.951600 0.023633 0.292816 0.243963
[6 rows x 5 columns]
In [34]:
df.boxplot()
# ravel(order='F') flattens column by column, matching the repeated x positions
for x, y in zip(np.repeat(np.arange(df.shape[1])+1, df.shape[0]),
                df.values.ravel(order='F')):
    plt.text(x, y, '%.3f' % y, ha='center', va='center')
For a single series in the dataframe, a few small changes are necessary:
In [35]:
sub_df=df.V1
pd.DataFrame(sub_df).boxplot()
for x, y in zip(np.repeat(1, df.shape[0]), sub_df.values):
    plt.text(x, y, '%.3f' % y, ha='center', va='center')
Making scatter plots is also similar:
#for the whole thing
df.boxplot()
plt.scatter(np.repeat(np.arange(df.shape[1])+1, df.shape[0]), df.values.ravel(order='F'), marker='+', alpha=0.5)
#for just one column
sub_df=df.V1
pd.DataFrame(sub_df).boxplot()
plt.scatter(np.repeat(1, df.shape[0]), sub_df.values, marker='+', alpha=0.5)
To overlay anything on a boxplot, we first need to know where each box is drawn along the x-axis. They appear to be at 1, 2, 3, 4, ... Therefore, we want the values in the first column plotted at x=1, the second column at x=2, and so on.
An efficient way of doing this is to use np.repeat to repeat 1, 2, 3, 4, ..., each n times, where n is the number of observations. Then we can make a plot using those numbers as x coordinates. Since the x coordinates are one-dimensional, the y coordinates need a flattened view of the data in the same column-by-column order, provided by df.values.ravel(order='F').
For overlaying the text strings, we need another step (a loop), as we can only plot one x value, one y value and one text string at a time.
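The positioning logic described above can be sketched as follows. This is a minimal illustration with made-up random data; the names xs and ys are just local variables, and the only assumption about pandas is that DataFrame.boxplot draws its boxes at x = 1, 2, ..., as observed above.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data with the same shape as the example: 6 rows, 5 columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((6, 5)), columns=["V1", "V2", "V3", "V4", "V5"])

n_rows, n_cols = df.shape
# Box positions 1..n_cols, each repeated once per observation:
# [1, 1, 1, 1, 1, 1, 2, 2, ...]
xs = np.repeat(np.arange(1, n_cols + 1), n_rows)
# Flatten column by column (Fortran order) so ys[k] belongs to the box at xs[k].
ys = df.to_numpy().ravel(order="F")

fig, ax = plt.subplots()
df.boxplot(ax=ax)
ax.scatter(xs, ys, marker="+", alpha=0.5)
```

The default ravel() flattens row by row, which would pair values with the wrong boxes; order="F" (or ravelling the transpose) keeps each column's values together.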

I have the following trick:
data = np.random.randn(6,5)
df = pd.DataFrame(data,columns = list('ABCDE'))
Now assign a dummy column to df:
df['Group'] = 'A'
print(df)
A B C D E Group
0 0.590600 0.226287 1.552091 -1.722084 0.459262 A
1 0.369391 -0.037151 0.136172 -0.772484 1.143328 A
2 1.147314 -0.883715 -0.444182 -1.294227 1.503786 A
3 -0.721351 0.358747 0.323395 0.165267 -1.412939 A
4 -1.757362 -0.271141 0.881554 1.229962 2.526487 A
5 -0.006882 1.503691 0.587047 0.142334 0.516781 A
Then call boxplot() on the groupby object and you are done:
df.groupby('Group').boxplot()

Related

Draw bar-charts with value_counts() for multiple columns in a Pandas DataFrame

I'm trying to draw bar-charts with counts of unique values for all columns in a Pandas DataFrame. Kind of what df.hist() does for numerical columns, but I have categorical columns.
I'd prefer to use the object-oriented approach, because it feels more natural and explicit to me.
I'd like to have multiple Axes (subplots) within a single Figure, in a grid fashion (again like what df.hist() does).
My solution below does exactly what I want, but it feels cumbersome. I doubt whether I really need the direct dependency on Matplotlib (and all the code for creating the Figure, removing the unused Axes etc.). I see that pandas.Series.plot has parameters subplots and layout which seem to point to what I want, but maybe I'm totally off here. I tried looping over the columns in my DataFrame and apply these parameters, but I cannot figure it out.
Does anyone know a more compact way to do what I'm trying to achieve?
# Defining the grid-dimensions of the Axes in the Matplotlib Figure
nr_of_plots = len(ames_train_categorical.columns)
nr_of_plots_per_row = 4
nr_of_rows = math.ceil(nr_of_plots / nr_of_plots_per_row)
# Defining the Matplotlib Figure and Axes
figure, axes = plt.subplots(nrows=nr_of_rows, ncols=nr_of_plots_per_row, figsize=(25, 50))
figure.subplots_adjust(hspace=0.5)
# Plotting on the Axes
i, j = 0, 0
for column_name in ames_train_categorical:
    if ames_train_categorical[column_name].nunique() <= 30:
        axes[i][j].set_title(column_name)
        ames_train_categorical[column_name].value_counts().plot(kind='bar', ax=axes[i][j])
        j += 1
        if j % nr_of_plots_per_row == 0:
            i += 1
            j = 0
# Cleaning up unused Axes
# plt.subplots creates a rectangular grid of Axes. On the last row, not all Axes will always be used. Unused Axes are removed here.
axes_flattened = axes.flatten()
for ax in axes_flattened:
    if not ax.has_data():
        ax.remove()
Edit: alternative idea
Using the pyplot/state-machine way of working, you could do it like this in very few lines of code. But this has the downside that every graph gets its own figure, so they're not nicely arranged in a grid.
for column_name in ames_train_categorical:
    ames_train_categorical[column_name].value_counts().plot(kind='bar')
    plt.show()
Desired output
With the following toy dataframe:
import pandas as pd
df = pd.DataFrame(
    {
        "MS Zoning": ["RL", "FV", "RL", "RH", "RL", "RL"],
        "Street": ["Pave", "Pave", "Pave", "Grvl", "Pave", "Pave"],
        "Alley": ["Grvl", "Grvl", "Grvl", "Grvl", "Pave", "Pave"],
        "Utilities": ["AllPub", "NoSewr", "AllPub", "AllPub", "NoSewr", "AllPub"],
        "Land Slope": ["Gtl", "Mod", "Sev", "Mod", "Sev", "Sev"],
    }
)
Here is a slightly more idiomatic way to do it:
import math
from matplotlib import pyplot as plt
size = math.ceil(df.shape[1]** (1/2))
fig = plt.figure()
for i, col in enumerate(df.columns):
    fig.add_subplot(size, size, i + 1)
    df[col].value_counts().plot(kind="bar", ax=plt.gca(), title=col, rot=0)
fig.tight_layout()
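For a grid with a fixed number of plots per row (as in the question) in the object-oriented style, something like the following might work. It is only a sketch: plot_value_counts and its ncols parameter are hypothetical names introduced here, not part of pandas or Matplotlib.

```python
import math

import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd


def plot_value_counts(frame, ncols=4):
    """Hypothetical helper: one value_counts bar chart per column, in a grid."""
    n_plots = frame.shape[1]
    nrows = math.ceil(n_plots / ncols)
    # squeeze=False always returns a 2D array of Axes, even for one row.
    fig, axes = plt.subplots(nrows, ncols, squeeze=False,
                             figsize=(4 * ncols, 3 * nrows))
    flat = axes.ravel()
    for ax, col in zip(flat, frame.columns):
        frame[col].value_counts().plot(kind="bar", ax=ax, title=col, rot=0)
    # Drop the Axes left over on the last row.
    for ax in flat[n_plots:]:
        ax.remove()
    fig.tight_layout()
    return fig


toy = pd.DataFrame({
    "MS Zoning": ["RL", "FV", "RL", "RH", "RL", "RL"],
    "Street": ["Pave", "Pave", "Pave", "Grvl", "Pave", "Pave"],
    "Alley": ["Grvl", "Grvl", "Grvl", "Grvl", "Pave", "Pave"],
})
fig = plot_value_counts(toy, ncols=2)
```

Iterating over the flattened Axes array avoids tracking row and column counters by hand.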

How to plot coordinates from single pandas series

I have a pandas series called df1['geometry.coordinates'] of coordinate values in the following format:
geometry.coordinates
0 [150.792711, -34.210868]
1 [151.551228, -33.023339]
2 [148.92149870748742, -34.767207772932835]
3 [151.033742, -33.919998]
4 [150.953963043732, -32.3935017885229]
... ...
432 [114.8927165, -28.902492300000002]
433 [115.34601918477634, -30.041742290803096]
434 [115.4632611, -30.8581035]
435 [121.42151909999998, -30.7804027]
436 [115.69424934340425, -30.680970908597665]
I want to plot each point on a graph, probably through using a scatter plot.
I tried: df1['geometry.coordinates'].plot.scatter() but it gets confused because it only reads it as one list value rather than two and therefore I always get the following error:
TypeError: scatter() missing 2 required positional arguments: 'x' and 'y'
Anyone know how I can solve this?
You need to separate the column containing the list so that you can specify x and y in the plot call.
You can split a column containing a list by constructing a data frame from a list.
pd.DataFrame(df1["geometry.coordinates"].to_list(), columns=['x', 'y']).plot.scatter(x="x", y="y")
Step 1: Split array into multiple columns
df1[['x','y']] = pd.DataFrame(df1['geometry.coordinates'].tolist(), index= df1.index)
Step 2: Plot
df1.plot.scatter(x = 'x', y = 'y', s = 30) #s is size of dots
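The two steps can be tried end to end on a few of the sample coordinates. This is a minimal sketch; the names coords and xy are placeholders chosen here, and only three of the points from the question are used.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line for interactive use
import pandas as pd

# A small Series of [x, y] lists, shaped like df1['geometry.coordinates'].
coords = pd.Series(
    [[150.792711, -34.210868],
     [151.551228, -33.023339],
     [114.8927165, -28.902492300000002]],
    name="geometry.coordinates",
)

# Step 1: expand each two-element list into its own pair of columns.
xy = pd.DataFrame(coords.tolist(), index=coords.index, columns=["x", "y"])

# Step 2: scatter needs explicit x and y column names.
ax = xy.plot.scatter(x="x", y="y", s=30)
```

Keeping index=coords.index means the new columns can be assigned back to the original dataframe without misaligning rows.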
You are not passing any arguments to scatter(), so the error is expected. Something along the lines of df.plot.scatter(x=0, y=1) should work, where df holds the two coordinates in columns 0 and 1.
Also, if your coordinates end up as rows rather than columns, transpose the data first: df.T.plot.scatter(x=0, y=1)
I did it this way.
import matplotlib.pyplot as plt
import pandas as pd
geometry = pd.Series([
    [150.792711, -34.210868],
    [151.551228, -33.023339],
    [148.92149870748742, -34.767207772932835],
    [151.033742, -33.919998],
    [150.953963043732, -32.3935017885229]])
df = pd.DataFrame(geometry.to_list(), columns = ['x','y'])
plt.scatter(x = df['x'], y = df['y'],
            edgecolor ='black')
plt.grid(alpha=.15)
You can try:
import pandas as pd
geometry_coordinates=[[150.792711, -34.210868],
                      [151.551228, -33.023339],
                      [148.92149870748742, -34.767207772932835],
                      [151.033742, -33.919998],
                      [150.953963043732, -32.3935017885229],
                      [114.8927165, -28.902492300000002],
                      [115.34601918477634, -30.041742290803096],
                      [115.4632611, -30.8581035],
                      [121.42151909999998, -30.7804027],
                      [115.69424934340425, -30.680970908597665]]
# each pair is [longitude, latitude]: the first value exceeds 90, so it cannot be a latitude
geometry_coordinates=pd.DataFrame(geometry_coordinates,columns=['long','lat'])
geometry_coordinates.plot.scatter(x='long',y='lat')

Making multiple pie charts out of a pandas dataframe (one for each column)

My question is similar to Making multiple pie charts out of a pandas dataframe (one for each row).
However, instead of each row, I am looking for each column in my case.
I can make a pie chart for each column; however, as I have 12 columns, the pie charts are too close to each other.
I have used this code:
fig, axes = plt.subplots(4, 3, figsize=(10, 6))
for i, (idx, row) in enumerate(df.iterrows()):
    ax = axes[i // 3, i % 3]
    row = row[row.gt(row.sum() * .01)]
    ax.pie(row, labels=row.index, startangle=30)
    ax.set_title(idx)
fig.subplots_adjust(wspace=.2)
and I have the following result
But what I want is the other way around. I need 12 pie charts (because I have 12 columns), and each pie chart should have 4 sections (leg, car, walk, and bike).
and if I write this code
fig, axes = plt.subplots(4,3)
for i, col in enumerate(df.columns):
    ax = axes[i // 3, i % 3]
    plt.plot(df[col])
then I have the following results:
and if I use :
plot = df.plot.pie(subplots=True, figsize=(17, 8),labels=['pt','car','walk','bike'])
then I have the following results:
Which is quite close to what I am looking for, but it is not possible to read the pie charts. A clearer output would be better.
As in your linked post, I would use matplotlib.pyplot for this. The accepted answer uses plt.subplots(2, 3), and I would suggest doing the same to create two rows with three plots each.
Like this:
fig, axes = plt.subplots(2,3)
for i, col in enumerate(df.columns):
    ax = axes[i // 3, i % 3]
    ax.plot(df[col])
Finally, I understood that if I swap rows and columns
df_sw = df.T
Then I can use the code in the examples:
Making multiple pie charts out of a pandas dataframe (one for each row)
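For the 12-column case in the question, a per-column loop without transposing might look like the sketch below. The dataframe here is invented for illustration: the random counts and the column names zone_1 to zone_12 are assumptions, and the four index labels are taken from the labels used in the question's df.plot.pie call.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data: 4 travel modes (rows) by 12 columns.
modes = ["pt", "car", "walk", "bike"]
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.integers(1, 100, size=(4, 12)),
    index=modes,
    columns=[f"zone_{k}" for k in range(1, 13)],
)

# A 4 x 3 grid gives one Axes per column; a tall figure keeps the pies apart.
fig, axes = plt.subplots(4, 3, figsize=(9, 12))
for ax, col in zip(axes.ravel(), df.columns):
    ax.pie(df[col], labels=list(df.index), startangle=30)
    ax.set_title(col)
fig.tight_layout()
```

Increasing figsize and calling tight_layout is usually what makes the labels readable when many pies share one figure.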

Combining Pandas Subplots into a Single Figure

I'm having trouble understanding Pandas subplots - and how to create axes so that all subplots are shown (not over-written by subsequent plot).
For each "Site", I want to make a time-series plot of all columns in the dataframe.
The "Sites" here are 'shark' and 'unicorn', both with 2 variables. The output should be 4 plotted lines: the time-indexed plot for Var1 and Var2 at each site.
Make time-indexed data with NaNs:
import numpy as np
import pandas as pd
# pd.np was removed from pandas, so NumPy is imported directly
df = pd.DataFrame({
    # some ways to create random data
    'Var1':np.random.randn(100),
    'Var2':np.random.randn(100),
    'Site':np.random.choice( ['unicorn','shark'], 100),
    # a date range and set of random dates
    'Date':pd.date_range('1/1/2011', periods=100, freq='D'),
    # 'f':np.random.choice( pd.date_range('1/1/2011', periods=365,
    #     freq='D'), 100, replace=False)
})
df.set_index('Date', inplace=True)
df['Var2']=df.Var2.cumsum()
df.loc['2011-01-31' :'2011-04-01', 'Var1']=np.nan
Make a figure with a sub-plot for each site:
fig, ax = plt.subplots(len(df.Site.unique()), 1)
counter=0
for site in df.Site.unique():
    print(site)
    sitedat=df[df.Site==site]
    sitedat.plot(subplots=True, ax=ax[counter], sharex=True)
    ax[0].title=site #Set title of the plot to the name of the site
    counter=counter+1
plt.show()
However, this is not working as written. The second sub-plot ends up overwriting the first. In my actual use case, I have a variable number of sites in each dataframe (14, for example), as well as a variable number of 'Var1, 2, ...'. Thus, I need a solution that does not require creating each axis (ax0, ax1, ...) by hand.
As a bonus, I would love a title of each 'site' above that set of plots.
The current code over-writes the first 'Site' plot with the second. What am I missing with the axes here?!
When you are using DataFrame.plot(..., subplots=True) you need to provide the correct number of axes that will be used for each column (and with the right geometry, if using layout=). In your example, you have 2 columns, so plot() needs two axes, but you are only passing one in ax=, therefore pandas has no choice but to delete all the axes and create the appropriate number of axes itself.
Therefore, you need to pass an array of axes of length corresponding to the number of columns you have in your dataframe.
# the grouper function is from itertools' cookbook
from itertools import zip_longest
def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)
fig, axs = plt.subplots(len(df.Site.unique())*(len(df.columns)-1),1, sharex=True)
for (site,sitedat),axList in zip(df.groupby('Site'),grouper(axs,len(df.columns)-1)):
    sitedat.plot(subplots=True, ax=axList)
    axList[0].set_title(site)
plt.tight_layout()
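A variant of the same idea, sketched here under the assumption that a two-dimensional grid is acceptable, uses one row of axes per site and one column per variable, so each call to plot() receives exactly as many axes as it has columns. The data below is made up (sites are assigned in two contiguous blocks to keep the example deterministic), and value_cols is just a local name.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data: 100 daily observations, first 50 'unicorn', last 50 'shark'.
rng = np.random.default_rng(0)
dates = pd.date_range("2011-01-01", periods=100, freq="D")
df = pd.DataFrame(
    {
        "Var1": rng.standard_normal(100),
        "Var2": rng.standard_normal(100).cumsum(),
        "Site": np.repeat(["unicorn", "shark"], 50),
    },
    index=dates,
)

value_cols = [c for c in df.columns if c != "Site"]
groups = list(df.groupby("Site"))  # groupby sorts keys: shark, then unicorn

# One row of Axes per site, one column per variable.
fig, axs = plt.subplots(len(groups), len(value_cols), squeeze=False,
                        figsize=(10, 6))
for row_axes, (site, sitedat) in zip(axs, groups):
    # The number of Axes passed matches the number of plotted columns.
    sitedat[value_cols].plot(subplots=True, ax=row_axes)
    row_axes[0].set_title(site)
fig.tight_layout()
```

Because the grid is sized from the data, nothing needs to be created by hand when the number of sites or variables changes.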

colormap with pandas dataframe plot function

I have data from multiple sites that record a sharp change in the monitored parameter. How could I plot the data for all these sites using value-dependent colors to enhance the visualization?
import numpy as np
import pandas as pd
import string
# site names
cols = string.ascii_uppercase
# number of days
ndays = 3
# index
index = pd.date_range('2018-05-01', periods=3*24*60, freq='min')
# simulated daily data
d1 = np.random.randn(len(index)//ndays, len(cols))
d2 = np.random.randn(len(index)//ndays, len(cols))+2
d3 = np.random.randn(len(index)//ndays, len(cols))-2
data=np.concatenate([d1, d2, d3])
df = pd.DataFrame(data=data, index=index, columns=list(cols))
df.plot(legend=False)
Each site (column) gets assigned one color in the above code. Is there a way to map the parameter values to different colors instead?
I guess one alternative is using colormaps option from scatter plot function: How to use colormaps to color plots of Pandas DataFrames
fig, ax = plt.subplots(figsize=(12,6))
collection = [plt.scatter(range(len(df)), df[col], c=df[col], s=25, cmap=cmap, edgecolor='None') for col in df.columns]
However, if I plot over time (i.e., x=df.index) things appear not to work as expected.
Is there any other alternative? or suggestion how to better visualize the sudden change in the time series?
In what follows I will use only 3 columns and hourly data in order to make the plots look less messy. The examples work as well with the original data.
cols = string.ascii_uppercase[:3]
ndays = 3
index = pd.date_range('2018-05-01', periods=3*24, freq='H')
# simulated daily data
d1 = np.random.randn(len(index)//ndays, len(cols))
d2 = np.random.randn(len(index)//ndays, len(cols))+2
d3 = np.random.randn(len(index)//ndays, len(cols))-2
data=np.concatenate([d1, d2, d3])
df = pd.DataFrame(data=data, index=index, columns=list(cols))
df.plot(legend=False)
The pandas way
You are out of luck: DataFrame.plot.scatter does not work with datetime-like data due to a long-standing bug.
The matplotlib way
Matplotlib's scatter can handle datetime-like data but the x-axis does not scale as expected.
for col in df.columns:
    plt.scatter(df.index, df[col], c=df[col])
plt.gcf().autofmt_xdate()
This looks like a bug to me but I could not find any reports. You can work around this by manually adjusting the x-limits.
for col in df.columns:
    plt.scatter(df.index, df[col], c=df[col])
start, end = df.index[[0, -1]]
xmargin = (end - start) * plt.gca().margins()[0]
plt.xlim(start - xmargin, end + xmargin)
plt.gcf().autofmt_xdate()
Unfortunately the x-axis formatter is not as nice as the pandas one.
The pandas way, revisited
I discovered this trick by chance and I do not understand why it works. If you plot a pandas series indexed by the same datetime data before calling matplotlib's scatter, the autoscaling issue disappears and you get the nice pandas formatting.
So I made an invisible plot of the first column and then the scatter plot.
df.iloc[:, 0].plot(lw=0) # invisible plot
for col in df.columns:
    plt.scatter(df.index, df[col], c=df[col])
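One detail the loops above leave open is that each scatter call scales its colors to its own column. A possible refinement, sketched here with made-up data, is to pass a shared vmin and vmax so the same value gets the same color in every column, and to add a colorbar. The "coolwarm" colormap and the variable names are choices made for this example only.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data: three sites, hourly samples over three days,
# with a level shift each day as in the question.
rng = np.random.default_rng(0)
index = pd.date_range("2018-05-01", periods=3 * 24, freq="60min")
n = len(index) // 3
data = np.concatenate([
    rng.standard_normal((n, 3)),
    rng.standard_normal((n, 3)) + 2,
    rng.standard_normal((n, 3)) - 2,
])
df = pd.DataFrame(data, index=index, columns=list("ABC"))

# Shared color limits so colors are comparable across columns.
vmin, vmax = data.min(), data.max()

fig, ax = plt.subplots(figsize=(12, 6))
for col in df.columns:
    points = ax.scatter(df.index, df[col], c=df[col], cmap="coolwarm",
                        vmin=vmin, vmax=vmax, s=25)
fig.colorbar(points, ax=ax, label="value")
fig.autofmt_xdate()
```

Since all collections share the same limits, the colorbar built from the last one is valid for all of them.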