Control the number of rows within a legend - matplotlib

I am trying to plot a large amount of data on a single plot, and I have structured the representation using repeated colors and symbols. However, the legend in the final result looks slightly off because I cannot control the number of rows in it. Instead of 5 green entries, then 5 red, 5 blue, and 2 others, I get columns of 5 - 4 - 4 - 4 (where I would have preferred 5 - 5 - 5 - 2).
You can clearly see this in the attached image.
Right now I use these options for the legend:
axp.legend(loc="lower right",ncol=4)

I have also run into this problem a couple of times. My workaround is to add dummy items to the legend to fill the last column. If there are more elegant methods available, I would also be very interested to hear about them.
import numpy as np
import matplotlib.pylab as pl
pl.figure()
pl.plot(np.arange(10), np.random.random([10,5]), color='r', label='red')
pl.plot(np.arange(10), np.random.random([10,5]), color='g', label='green')
pl.plot(np.arange(10), np.random.random([10,5]), color='b', label='blue')
pl.plot(np.arange(10), np.random.random([10,2]), color='k', label='black')
# Add empty dummy legend items
pl.plot(np.zeros(1), np.zeros([1,3]), color='w', alpha=0, label=' ')
pl.legend(ncol=4)
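The number of dummy items does not have to be hard-coded. Below is a minimal sketch of how it could be computed, assuming matplotlib splits the entries across the columns as evenly as possible and gives the extra entries to the first columns (which matches the 5 - 4 - 4 - 4 layout from the question). The helper name legend_padding is hypothetical, not part of matplotlib.

```python
def legend_padding(n_entries, ncol):
    """Return how many blank entries are needed so that all columns
    of an ncol-column legend hold the same number of rows.

    Hypothetical helper, not a matplotlib function.
    """
    if ncol < 1:
        raise ValueError("ncol must be at least 1")
    # Distance from n_entries up to the next multiple of ncol.
    return (-n_entries) % ncol


# Example from the question: 5 + 5 + 5 + 2 = 17 entries in 4 columns.
n_dummy = legend_padding(17, 4)
rows_per_column = (17 + n_dummy) // 4
print(n_dummy, rows_per_column)
```

The result could replace the literal 3 in np.zeros([1,3]) above. Note that padding only lines the groups up with the columns when each full group has exactly as many entries as there are rows per column.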

Related

Matplotlib/Seaborn: Boxplot collapses on x axis

I am creating a series of boxplots in order to compare different cancer types with each other (based on 5 categories). For plotting I use seaborn/matplotlib. It works fine for most of the cancer types (see image right); however, for some the x axis collapses slightly (see image left) or strongly (see image middle).
https://i.imgur.com/dxLR4B4.png
Looking at how seaborn plots a box/violin plot (https://github.com/mwaskom/seaborn/blob/36964d7ffba3683de2117d25f224f8ebef015298/seaborn/categorical.py, line 961):
violin_data = remove_na(group_data[hue_mask])
I realized that this happens when there are too many NaNs.
Is there any way to prevent this collapsing through code only?
I do not want to modify my dataframe (e.g. replace the NaNs with zero).
Below is my code:
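import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# pf_in / pf_out are the input and output file paths
boxp_df = pd.read_csv(pf_in, sep="\t", skip_blank_lines=False)
fig, ax = plt.subplots(figsize=(10, 10))
sns.violinplot(data=boxp_df, ax=ax)
plt.xticks(rotation=-45)
plt.ylabel("label")
plt.tight_layout()
plt.savefig(pf_out)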
The output is a differently sized plot per cancer type
(depending on whether any category is completely NaN).
I expect each plot to have the same width.
Update
Trying to use the order parameter as suggested leads to the following output:
https://i.imgur.com/uSm13Qw.png
Maybe this toy example helps?
| Cat1 | Cat2 | Cat3 | Cat4 | Cat5  |
| 3.93 |      | 0.52 |      | 6.01  |
| 3.34 |      | 0.89 |      | 2.89  |
| 3.39 |      | 1.96 |      | 4.63  |
| 1.59 |      | 3.66 |      | 3.75  |
| 2.73 |      | 0.39 |      | 2.87  |
| 0.08 |      | 1.25 |      | -0.27 |
Update
Apparently, the problem is not the data but the length of the title:
https://github.com/matplotlib/matplotlib/issues/4413
Therefore I would close the question.
@Diziet, should I delete it, or might my issue help others?
Sorry for not including the line below in the code example:
ax.set_title("VERY LONG TITLE", fontsize=20)
It's hard to be sure without data to test it with, but I think you can pass the names of your categories/cancers to the order= parameter. This forces seaborn to use/display those, even if they are empty.
For instance:
tips = sns.load_dataset("tips")
ax = sns.violinplot(x="day", y="total_bill", data=tips, order=['Thur','Fri','Sat','Freedom Day','Sun','Durin\'s Day'])
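The same idea, reserving an x position for every category even when it has no data, can also be sketched in plain matplotlib. This is a swapped-in technique rather than the seaborn call above, and it is only a rough illustration: the dictionary below is hypothetical data modelled on the toy example from the question, with Cat2 and Cat4 entirely NaN.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical toy data: Cat2 and Cat4 contain only NaN.
data = {
    "Cat1": np.array([3.93, 3.34, 3.39, 1.59, 2.73, 0.08]),
    "Cat2": np.full(6, np.nan),
    "Cat3": np.array([0.52, 0.89, 1.96, 3.66, 0.39, 1.25]),
    "Cat4": np.full(6, np.nan),
    "Cat5": np.array([6.01, 2.89, 4.63, 3.75, 2.87, -0.27]),
}
names = list(data)

fig, ax = plt.subplots(figsize=(10, 10))
for pos, name in enumerate(names):
    values = data[name][~np.isnan(data[name])]  # drop NaN per category
    if values.size:  # an empty group cannot be drawn, so skip it
        ax.violinplot(values, positions=[pos])

# Fix ticks and limits for ALL categories, drawn or not,
# so every figure gets the same x axis.
ax.set_xticks(range(len(names)))
ax.set_xticklabels(names, rotation=-45)
ax.set_xlim(-0.5, len(names) - 0.5)
```

Because the ticks and limits are set from the full category list, the axis keeps five slots regardless of how many categories actually contain values.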

Plotting large datasets as kind=bar ineffective

I am working with a semi-large data set of approx 100,000 records. When I plot a df column as a line with the code below the plot takes approx 2 seconds.
with plt.style.context('ggplot'):
    plt.figure(3, figsize=(16, 12))
    plt.subplot(411)
    df_pca_std['PC1_resid'].plot(title="PC1 Residual", color='r')
    # If I change the plot to a bar (no other change)
    df_X_std['PC1_resid'].plot(kind='bar', title="PC1 Residual", color='r')
it takes 112 seconds and the render changes like this (jumbled x axis):
I have suppressed the axis and changed the style, but neither helped. Does anyone have ideas on how to render this better and in less time? The data being plotted is being checked for mean reversion and is better displayed as a bar plot.
Not the best charts visually but at least it renders. Plotted 2.1 million bars in 14.2 secs.
import pygal
bar_chart = pygal.Bar()
bar_chart.add('PC1_residuals',df_X_std['PC1_resid'])
bar_chart.render_to_file('bar_chart.svg')
One possible solution: I do not actually need to plot bars. I can use the very fast line plot together with fill_between to color the area between zero and the line. The effect is similar to plotting all the bars, in a fraction of the time.
Use the to_pydatetime method of DatetimeIndex to convert the dates (the df index) to an array of datetime.datetime objects that matplotlib can use, then change the plot:
import matplotlib.dates as mdates
plotDates = mdates.date2num(df.index.to_pydatetime())
plt.fill_between(plotDates,0,df_pca_std['PC1_resid'], alpha=0.5)
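A self-contained sketch of the fill_between approach is given below. The data is randomly generated as a stand-in for the PC1 residuals (an assumption, since the original dataframe is not available), and the hourly timestamps are hypothetical.

```python
import datetime as dt

import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np

# Stand-in data: 100,000 hourly timestamps with random residuals.
rng = np.random.default_rng(0)
n = 100_000
start = dt.datetime(2000, 1, 1)
dates = [start + dt.timedelta(hours=i) for i in range(n)]
resid = rng.standard_normal(n)

# Convert datetimes to matplotlib's floating-point date numbers.
x = mdates.date2num(dates)

fig, ax = plt.subplots(figsize=(16, 4))
# One filled polygon instead of 100,000 separate bar patches.
ax.fill_between(x, 0, resid, color="r", alpha=0.5)
ax.xaxis_date()  # show the x axis as dates again
ax.set_title("PC1 Residual")
```

The fill is drawn as a single collection, which is probably why it renders so much faster than one patch per bar. fill_between interpolates linearly between points; passing step="mid" should give a blockier, more bar-like outline if that matters.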

Efficiently Plotting Many Lines in VisPy

From all example code/demos I have seen in the VisPy library, I only see one way that people plot many lines, for example:
for i in range(N):
    pos = pos.copy()
    pos[:, 1] = np.random.normal(scale=5, loc=(i+1)*30, size=N)
    line = scene.visuals.Line(pos=pos, color=color, parent=canvas.scene)
    lines.append(line)
canvas.show()
My issue is that I have many lines to plot (each with several hundred thousand points). Matplotlib proved too slow because the total number of points plotted was in the millions, hence I switched to VisPy. But VisPy is even slower when you plot thousands of lines, each with thousands of points (the speed-up comes when you have millions of points).
The root cause is the way lines are drawn. When you create a plot widget and then plot a line, each line is rendered to the canvas. In matplotlib you can explicitly tell it not to show the canvas until all lines have been drawn in memory, but there doesn't appear to be equivalent functionality in VisPy, which makes it useless for this purpose.
Is there any way around this? I need to plot multiple lines so that I can change properties interactively, so flattening all the data points into one plot call won't work.
(I am using PyQt4 to embed the plot in a GUI. I have also considered pyqtgraph.)
You should pass an array to the "connect" parameter of the Line() function.
xy = np.random.rand(5,2) # 2D positions
# Create an array of point connections :
toconnect = np.array([[0,1], [0,2], [1,4], [2,3], [2,4]])
# Point 0 in your xy will be connected with 1 and 2, point
# 1 with 4 and point 2 with 3 and 4.
line = scene.visuals.Line(pos=xy, connect=toconnect)
You only add one object to your canvas, but the control per line is more limited.
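For many separate polylines, the connect array can be generated rather than typed by hand. The following is a minimal NumPy-only sketch: stack_lines is a hypothetical helper (not part of VisPy) that concatenates all lines into one position array and links consecutive points within each line, leaving out the segment that would join one line to the next.

```python
import numpy as np


def stack_lines(lines):
    """Merge several polylines into one (pos, connect) pair.

    lines: list of (n_i, 2) arrays. Hypothetical helper, not a VisPy API.
    Returns pos with shape (sum(n_i), 2) and an integer array of
    point-index pairs with one row per segment.
    """
    pos = np.concatenate(lines)
    lengths = np.array([len(line) for line in lines])
    starts = np.concatenate(([0], np.cumsum(lengths)[:-1]))

    # Candidate segments: every point i joined to point i + 1 ...
    idx = np.arange(len(pos) - 1)
    keep = np.ones(len(pos) - 1, dtype=bool)
    # ... except where i is the last point of a line.
    keep[starts[1:] - 1] = False

    connect = np.column_stack((idx[keep], idx[keep] + 1))
    return pos, connect


# Two short lines: one with 3 points, one with 2 points.
a = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
b = np.array([[0.0, 1.0], [1.0, 1.0]])
pos, connect = stack_lines([a, b])
print(connect.tolist())

# With VisPy installed, the result would be passed as in the answer above:
# line = scene.visuals.Line(pos=pos, connect=connect, parent=canvas.scene)
```

Since start offsets are known, a single line's points can later be updated by slicing pos, although per-line properties such as width remain shared across the one visual.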

matplotlib: preventing a few very large (or small) values to affect my contour

When plotting data, there are sometimes a few very large (or very small) numbers which, if not taken care of, distort the contour plot. One solution is to exclude the highest and lowest 10% of the data from the contour color grading and treat them as "more than" and "less than" categories. The following figure shows the idea:
The two arrow shapes at the top and bottom of the colorbar support this idea: any value above 14 is shown in white and any value below -2 is shown in black. How is this possible in matplotlib?
How can I define:
- that the 5% highest and 5% lowest values fall into two categories shown in the triangular parts at both ends of the bar? (Should I define this in the contour operation, or are there other ways?)
- specific values instead of percentages? For instance, put any value above 14 in the white triangle and any value below -2 in the black areas?
Thank you so much for your help.
Taken from http://matplotlib.org/examples/api/colorbar_only.html. You can play with it and you will see if it could solve your problem.
import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np
x = np.linspace(-1,1,100)
X,Y = np.meshgrid(x,x)
Z = np.exp(-X**2-Y**2)
vmin = 0.3 #Lower value
vmax = 0.9 #Upper value
bounds = np.linspace(vmin,vmax,4)
cmap = mpl.colors.ListedColormap([(0,0,0),(0.5,0.5,0.5),(0,1,0),(1,1,1)])
norm = mpl.colors.BoundaryNorm(bounds, cmap.N)
plt.imshow(Z,cmap=cmap,interpolation='nearest',vmin=vmin,vmax=vmax)
ax = plt.colorbar().ax
cb = mpl.colorbar.ColorbarBase(ax, norm=norm,
                               extend='both',
                               cmap=cmap)
cmap.set_over([0,0,1])
cmap.set_under([1,0,0])
plt.show()
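For the percentile part of the question, one possible approach is sketched below. It is not taken from the linked example: it assumes a reasonably recent matplotlib (Colormap.with_extremes is needed), derives the contour levels from the 5th and 95th percentiles with np.percentile, and lets contourf's extend='both' option produce the triangular ends. The data and the choice of the viridis colormap are placeholders.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Placeholder data, same shape of problem as in the answer above.
x = np.linspace(-1, 1, 100)
X, Y = np.meshgrid(x, x)
Z = np.exp(-X**2 - Y**2)

# Color-grade only the central 90% of the values.
lo, hi = np.percentile(Z, [5, 95])
levels = np.linspace(lo, hi, 9)

# Values below lo are drawn black, values above hi white.
cmap = plt.get_cmap("viridis").with_extremes(under="black", over="white")

fig, ax = plt.subplots()
cs = ax.contourf(X, Y, Z, levels=levels, cmap=cmap, extend="both")
fig.colorbar(cs, ax=ax)  # triangular ends come from extend="both"
```

For fixed thresholds instead of percentiles, the levels could be built directly, for example np.linspace(-2, 14, 9) for the values mentioned in the question.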

geometry of colorbars in matplotlib

Plotting a figure with a colorbar, like for example the ellipse collection of the matplotlib gallery, I'm trying to understand the geometry of the figure. If I add the following code in the source code (instead of plt.show()):
cc=plt.gcf().get_children()
print(cc[1].get_geometry())
print(cc[2].get_geometry())
I get
(1, 2, 1)
(3, 1, 2)
I understand the first one - 1 row, two columns, plot first (and presumably the second is the colorbar), but I don't understand the second one, which I would expect to be (1,2,2). What do these values correspond to?
Edit: It seems that the elements in cc do not share the same grid, which would explain the discrepancies. Still, I'm confused by the geometries that are reported.
What happens is that when you call colorbar, use_gridspec defaults to True, which results in a call to matplotlib.colorbar.make_axes_gridspec. That function creates a 1 by 2 grid to hold the plot and colorbar axes. The colorbar axes itself is then placed in a 3 by 1 grid whose aspect ratio is adjusted.
The key line in matplotlib.colorbar.make_axes_gridspec that makes this happen is
gs2 = gs_from_sp_spec(3, 1, subplot_spec=gs[1], hspace=0.,
                      height_ratios=wh_ratios)
By default wh_ratios == [0.0, 1.0, 0.0], so the two subplots above and below are 0 times the size of the middle one.
I've put what I did to figure this out into an IPython notebook
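The nesting described above can be reproduced in isolation. This is only a sketch of the structure, built by hand with matplotlib.gridspec rather than by calling colorbar, and the variable names are made up for illustration.

```python
from matplotlib.gridspec import GridSpec, GridSpecFromSubplotSpec

# Outer 1 x 2 grid: cell 0 for the plot, cell 1 for the colorbar.
outer = GridSpec(1, 2)

# Inside the colorbar cell: a 3 x 1 grid whose top and bottom rows
# have zero height, mirroring wh_ratios == [0.0, 1.0, 0.0].
inner = GridSpecFromSubplotSpec(
    3, 1, subplot_spec=outer[1], hspace=0.0, height_ratios=[0.0, 1.0, 0.0]
)

main_spec = outer[0]   # where the plot axes would live
cbar_spec = inner[1]   # middle row, where the colorbar axes would live

# SubplotSpec.get_geometry() returns (n_rows, n_cols, start, stop).
print(main_spec.get_geometry())
print(cbar_spec.get_geometry())
```

The second geometry reports a 3 by 1 grid because it refers to the inner grid, not the outer one. The indices here are zero-based, so the middle row is index 1, which as far as I can tell corresponds to the one-based 2 in the (3, 1, 2) output from the question.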