Plotting box plot with 3 features - pandas

For example, I have a data frame like this
x1    x2    class
0.1   0.2   1
0.3   0.4   2
...   ...   ...
How can I use boxplot to create a chart like this?
To achieve a scatter plot, I separate the dataframe into two based on class and plot the parts separately on the same plot. But how can I achieve something like the image above with a boxplot?

A box plot uses boxes and lines to depict the distributions of one or more groups of numeric data. This means that one axis must be categorical and the other numerical.
Refer to this link for more info: https://chartio.com/learn/charts/box-plot-complete-guide/
If you still want to achieve this, you can bin one of the columns so that it becomes categorical.
See: https://pandas.pydata.org/docs/reference/api/pandas.cut.html
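As a rough sketch of the binning idea (the random data, the choice of 4 bins, and the x1_bin column name are assumptions for illustration, not part of the question), x1 can be cut into intervals and x2 drawn as one box per (bin, class) pair:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data with the same columns as in the question.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x1": rng.uniform(0, 1, 200),
    "x2": rng.uniform(0, 1, 200),
    "class": rng.integers(1, 3, 200),  # classes 1 and 2
})

# Bin x1 into 4 equal-width intervals so it can act as the categorical axis.
df["x1_bin"] = pd.cut(df["x1"], bins=4)

# Collect the x2 values of every (x1 bin, class) combination that occurs.
groups = df.groupby(["x1_bin", "class"], observed=True)["x2"]
labels = [f"{interval}, class {cls}" for (interval, cls), _ in groups]
data = [values.to_numpy() for _, values in groups]

# One box per group; boxes sit at positions 1..len(data).
fig, ax = plt.subplots(figsize=(10, 4))
ax.boxplot(data)
ax.set_xticks(range(1, len(data) + 1))
ax.set_xticklabels(labels, rotation=90)
ax.set_xlabel("x1 bin, class")
ax.set_ylabel("x2")
fig.tight_layout()
```

The number of bins is a judgment call: fewer bins give more points per box, more bins follow x1 more closely.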

Related

Matplotlib 3D Scatter Plot

I would like to ask a question regarding Matplotlib 3D scatter plots.
I have a data frame consisting of 3 columns and 55 rows.
I converted it to a numpy array with
dataframe.to_numpy()
I also have some subsets of my dataframe which should be coloured differently on the plot.
For example:
subset1
subset2
I would like to plot my data frame as a 3D scatter plot while coloring the points of the different subsets differently.
I have tried a bunch of methods but I always get an error because of the shape of the subsets. Is there a more efficient way to do it?
I would appreciate your suggestions.

Plotting large datasets as kind=bar ineffective

I am working with a semi-large data set of approx 100,000 records. When I plot a df column as a line with the code below the plot takes approx 2 seconds.
with plt.style.context('ggplot'):
    plt.figure(3, figsize=(16, 12))
    plt.subplot(411)
    df_pca_std['PC1_resid'].plot(title="PC1 Residual", color='r')
# If I change the plot to a bar (no other change)
df_X_std['PC1_resid'].plot(kind='bar', title="PC1 Residual", color='r')
it takes 112 seconds and the render changes like this (jumbled x axis):
I have suppressed the axis and changed the style but neither helped. Does anyone have ideas on how to render this better and in less time? The data being plotted is being checked for mean reversion and is better displayed as a bar plot.
Not the best charts visually but at least it renders. Plotted 2.1 million bars in 14.2 secs.
import pygal
bar_chart = pygal.Bar()
bar_chart.add('PC1_residuals',df_X_std['PC1_resid'])
bar_chart.render_to_file('bar_chart.svg')
One possible solution: I do not actually need to plot bars but can use the very fast line plot and the fill_between function to color the area between zero and the line. The effect is similar to plotting all the bars, in a fraction of the time.
Use the to_pydatetime method of DatetimeIndex to convert Date (the df index) to an array of datetime.datetime objects that matplotlib can use, then change the plot.
import matplotlib.dates as mdates
plotDates = mdates.date2num(df.index.to_pydatetime())
plt.fill_between(plotDates, 0, df_pca_std['PC1_resid'], alpha=0.5)
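A self-contained sketch of the fill_between approach, in case it helps. The random series below is only a stand-in for df_pca_std['PC1_resid'], and the minute-frequency index is an assumption chosen to keep 100,000 timestamps in range:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical residual series of about the same size as in the question.
index = pd.date_range("2000-01-01", periods=100_000, freq="min")
resid = pd.Series(np.random.default_rng(0).standard_normal(len(index)), index=index)

# Convert the DatetimeIndex to matplotlib's float date representation.
plot_dates = mdates.date2num(resid.index.to_pydatetime())

fig, ax = plt.subplots(figsize=(16, 4))
# Shade the area between zero and the series instead of drawing 100,000 bars.
ax.fill_between(plot_dates, 0, resid.to_numpy(), color="r", alpha=0.5)
ax.xaxis_date()  # tell the x axis that its values are dates
ax.set_title("PC1 Residual")
```

This draws a single filled polygon rather than one rectangle per record, which is the likely reason it renders so much faster than kind='bar'.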

colorbars for grid of line (not contour) plots in matplotlib

I'm having trouble giving colorbars to a grid of line plots in Matplotlib.
I have a grid of plots, each of which shows 64 lines. The lines depict the penalty value vs time when optimizing the same system under 64 different values of a certain hyperparameter h.
Since there are so many lines, instead of using a standard legend, I'd like to use a colorbar, and color the lines by the value of h. In other words, I'd like something that looks like this:
The above was done by adding a new axis to hold the colorbar, by calling figure.add_axes([0.95, 0.2, 0.02, 0.6]), passing in the axis position explicitly as parameters to that method. The colorbar was then created as in the example code here, by instantiating a ColorbarBase(). That's fine for single plots, but I'd like to make a grid of plots like the one above.
To do this, I tried doubling the number of subplots, and using every other subplot axis for the colorbar. Unfortunately, this led to the colorbars having the same size/shape as the plots:
Is there a way to shrink just the colorbar subplots in a grid of subplots like the 1x2 grid above?
Ideally, it'd be great if the colorbar just shared the same axis as the line plot it describes. I saw that the colorbar.colorbar() function has an ax parameter:
ax
parent axes object from which space for a new colorbar axes will be stolen.
That sounds great, except that colorbar.colorbar() requires you to pass in an imshow image or a ContourSet, but my plot is neither an image nor a contour plot. Can I achieve the same (axis-sharing) effect using ColorbarBase?
It turns out you can have different-shaped subplots, so long as all the plots in a given row have the same height, and all the plots in a given column have the same width.
You can do this using gridspec.GridSpec, as described in this answer.
So I set the columns with line plots to be 20x wider than the columns with color bars. The code looks like:
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib import cm, colorbar, gridspec

grid_spec = gridspec.GridSpec(num_rows,
                              num_columns * 2,
                              width_ratios=[20, 1] * num_columns)
colormap_type = cm.cool
for (x_vec_list,
     y_vec_list,
     color_hyperparam_vec,
     plot_index) in zip(x_vec_lists,
                        y_vec_lists,
                        color_hyperparam_vecs,
                        range(len(x_vec_lists))):
    # Even grid cells hold the line plots, odd cells hold their colorbars.
    line_axis = plt.subplot(grid_spec[plot_index * 2])
    colorbar_axis = plt.subplot(grid_spec[plot_index * 2 + 1])
    colormap_normalizer = mpl.colors.Normalize(vmin=color_hyperparam_vec.min(),
                                               vmax=color_hyperparam_vec.max())
    scalar_to_color_map = mpl.cm.ScalarMappable(norm=colormap_normalizer,
                                                cmap=colormap_type)
    colorbar.ColorbarBase(colorbar_axis,
                          cmap=colormap_type,
                          norm=colormap_normalizer)
    # Color each line by its hyperparameter value.
    for (line_index,
         x_vec,
         y_vec) in zip(range(len(x_vec_list)),
                       x_vec_list,
                       y_vec_list):
        hyperparam = color_hyperparam_vec[line_index]
        line_color = scalar_to_color_map.to_rgba(hyperparam)
        line_axis.plot(x_vec, y_vec, color=line_color, alpha=0.5)
For num_rows=1 and num_columns=1, this looks like:
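For anyone who wants to try the layout without the original data, here is a minimal self-contained sketch of the same GridSpec idea. The exponential curves and the h_values array are made up for illustration, and it uses fig.colorbar with a ScalarMappable (which I believe needs matplotlib 3.1 or newer) in place of ColorbarBase:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from matplotlib import cm, colors, gridspec

num_rows, num_columns = 1, 2
h_values = np.linspace(0.1, 1.0, 64)  # hypothetical hyperparameter values
t = np.linspace(0, 5, 50)             # hypothetical time axis

fig = plt.figure(figsize=(12, 4))
# Each plot column is paired with a colorbar column 20x narrower.
grid_spec = gridspec.GridSpec(num_rows, num_columns * 2,
                              width_ratios=[20, 1] * num_columns, figure=fig)

norm = colors.Normalize(vmin=h_values.min(), vmax=h_values.max())
mappable = cm.ScalarMappable(norm=norm, cmap="cool")

for plot_index in range(num_rows * num_columns):
    line_axis = fig.add_subplot(grid_spec[plot_index * 2])
    colorbar_axis = fig.add_subplot(grid_spec[plot_index * 2 + 1])
    for h in h_values:
        # Stand-in for "penalty vs time"; the color encodes h.
        line_axis.plot(t, np.exp(-h * t), color=mappable.to_rgba(h), alpha=0.5)
    # Draw the colorbar into the narrow axis reserved for it.
    fig.colorbar(mappable, cax=colorbar_axis, label="h")
```

Passing ax=line_axis instead of cax=colorbar_axis should steal space from the line plot, which is the axis-sharing behaviour asked about, and then the extra GridSpec columns are not needed.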

matplotlib precision for a specific column

I am plotting a scatter plot with the following part:
import matplotlib.pyplot as plt

cm = plt.get_cmap('RdYlBu')
a = plt.axes([.65, .6, .2, .2], facecolor='none')  # axisbg was renamed to facecolor
sc = plt.scatter(x, y, c=z, s=500, marker='s', cmap=cm)
plt.colorbar(sc)
plt.title('Density')
The z data sometimes happens to be very close to homogeneous, e.g. 1.99, 2.01, 2.03, 1.98 etc. This leads to a color plot in which these nearly identical z values get different colors. How can I set them all to 2, for example, so that the colormap will be flat?
Here is the same issue that I had with gnuplot, where I got a hint that solved it:
set gnuplot precision for a specific column
Thanks!

Matplotlib / Pandas histogram incorrect alignment

# A histogram
n = np.random.randn(100000)
fig, axes = plt.subplots(1, 2, figsize=(12,4))
axes[0].hist(n)
axes[0].set_title("Default histogram")
axes[0].set_xlim((min(n), max(n)))
axes[1].hist(n, cumulative=True, bins=50)
axes[1].set_title("Cumulative detailed histogram")
axes[1].set_xlim((min(n), max(n)));
This is from an ipython notebook here In[41]
It seems that the histogram bars don't correctly align with the grids (see first subplot). That is the same problem I face in my own plots.
Can someone explain why?
Look at the align option in matplotlib hist. You can align 'left', 'mid', or 'right'; the default is 'mid', which centers each bar between its bin edges. The bars therefore follow the bin edges, not the grid, and the two only coincide if the bin edges happen to fall on the tick positions. This is spelled out in the matplotlib hist docs: http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.hist
What if you have a Gaussian that spreads from -2647 to +1324? Do you expect to have 3971 bins? Maybe that is too many. 39? Then you are off by 0.71. What about 40? Off by 0.29.
The way histogram works is that you can set the bins= parameter (number of bins, default 10). On the right graph, the scale seems to go from around -4.5 to +4.5, which makes a span of 9; divided by 10 bins, that gives 0.9 per bin.
Also, when you make a histogram, it is not obvious how you want to bin things and represent them.
If you have a bin from 0 to 1, is it 0 < x <= 1 or 0 <= x < 1? If you have only integer values, I suspect you would also prefer bins to be centered around integer values, right?
So histogram is a quick method that gives you insight into the data, but it does not prevent you from setting its parameters to represent the data the way you like.
This blog post has a nice demo of the effect of the parameters in histogram plotting and explains some alternate methods of plotting.
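To make the bin arithmetic concrete, here is a small sketch using the same kind of random data as in the question. The 0.5 bin width and the tick placement are arbitrary choices for illustration: with the default bins=10 the edges are simply (max - min) / 10 apart, while explicit edges can be put wherever the grid lines are.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

n = np.random.default_rng(0).standard_normal(100_000)

# Default binning: 10 equal-width bins between min(n) and max(n),
# so the edges are rarely round numbers.
counts, edges = np.histogram(n)
default_width = np.diff(edges)[0]  # equals (n.max() - n.min()) / 10

# Explicit edges at multiples of 0.5 covering the whole data range.
aligned_edges = np.arange(np.floor(n.min()), np.ceil(n.max()) + 0.5, 0.5)

fig, ax = plt.subplots()
ax.hist(n, bins=aligned_edges)
# Put ticks (and therefore grid lines) on every other edge, i.e. on the integers.
ax.set_xticks(aligned_edges[::2])
ax.grid(True)
```

Because bins accepts a sequence of edges as well as a count, choosing the edges yourself is usually the simplest way to make bars and grid lines coincide.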