Plotting large datasets as kind=bar ineffective

Plotting large datasets as kind=bar ineffective - pandas

I am working with a semi-large data set of approx 100,000 records. When I plot a df column as a line with the code below the plot takes approx 2 seconds.
with plt.style.context('ggplot'):
plt.figure(3,figsize=(16,12))
plt.subplot(411)
df_pca_std['PC1_resid'].plot(title ="PC1 Residual", color='r')
#If I change the plot to a bar (no other change)
df_X_std['PC1_resid'].plot(**kind='bar'**, title ="PC1 Residual", color='r')
it takes 112 seconds and the render changes like this (jumbled x axis):
I have suppressed the axis and changed the style but neither helped. Anyone have ideas how to better render and take less time? The data being plotted is being checked for mean reversion and is better displayed as bar plot.

Not the best charts visually but at least it renders. Plotted 2.1 million bars in 14.2 secs.
import pygal
bar_chart = pygal.Bar()
bar_chart.add('PC1_residuals',df_X_std['PC1_resid'])
bar_chart.render_to_file('bar_chart.svg')

One possible solution: I do not actually need to plot bars but can use the very fast line plot and the 'fill_between' attribute to color the plot from zero to the line. The effect is similar to plotting all the bars in a fraction of the time.
Use pydatetime method of DatetimeIndex to convert Date (the df index) to an array of datetime.datetime's that can be used by matplotlib then change the plot.
plotDates = mpl.date2num(df.index.to_pydatetime())
plt.fill_between(plotDates,0,df_pca_std['PC1_resid'], alpha=0.5)

Related

Matplotlib/Seaborn: Boxplot collapses on x axis

I am creating a series of boxplots in order to compare different cancer types with each other (based on 5 categories). For plotting I use seaborn/matplotlib. It works fine for most of the cancer types (see image right) however in some the x axis collapses slightly (see image left) or strongly (see image middle)
https://i.imgur.com/dxLR4B4.png
Looking into the code how seaborn plots a box/violin plot https://github.com/mwaskom/seaborn/blob/36964d7ffba3683de2117d25f224f8ebef015298/seaborn/categorical.py (line 961)
violin_data = remove_na(group_data[hue_mask])
I realized that this happens when there are too many nans
Is there any possibility to prevent this collapsing by code only
I do not want to modify my dataframe (replace the nans by zero)
Below you find my code:
boxp_df=pd.read_csv(pf_in,sep="\t",skip_blank_lines=False)
fig, ax = plt.subplots(figsize=(10, 10))
sns.violinplot(data=boxp_df, ax=ax)
plt.xticks(rotation=-45)
plt.ylabel("label")
plt.tight_layout()
plt.savefig(pf_out)
The output is a per cancer type differently sized plot
(depending on if there is any category completely nan)
I am expecting each plot to be in the same width.
Update
trying to use the order parameter as suggested leads to the following output:
https://i.imgur.com/uSm13Qw.png
Maybe this toy example helps ?
|Cat1|Cat2|Cat3|Cat4|Cat5
|3.93| |0.52| |6.01
|3.34| |0.89| |2.89
|3.39| |1.96| |4.63
|1.59| |3.66| |3.75
|2.73| |0.39| |2.87
|0.08| |1.25| |-0.27
Update
Apparently, the problem is not the data but the length of the title
https://github.com/matplotlib/matplotlib/issues/4413
Therefore I would close the question
#Diziet should I delete it or does my issue might help other ones?
Sorry for not including the line below in the code example:
ax.set_title("VERY LONG TITLE", fontsize=20)

It's hard to be sure without data to test it with, but I think you can pass the names of your categories/cancers to the order= parameter. This forces seaborn to use/display those, even if they are empty.
for instance:
tips = sns.load_dataset("tips")
ax = sns.violinplot(x="day", y="total_bill", data=tips, order=['Thur','Fri','Sat','Freedom Day','Sun','Durin\'s Day'])

Heatmap colorbars accumulating in Matplotlib/Seaborn figures

I have a list of data frames, and I want to make heatmaps of every data frame in the list. The first heatmap comes out perfectly, but the second one has two colorbars, one much larger than the other, which distorts the figure. The third has THREE colorbars, the last one being even larger, and this continues for as many heatmaps as I make.
This seems like a bug to me, as I have no idea why it's happening. Each heatmap should be stored as a separate element in the list of heatmaps, and even if I plot them individually, instead of using a loop or list comprehension, I get the same problem.
Here is my code:
# Set the seaborn font size.
sns.set(font_scale=0.5)
# Ensure that labels are not cut off.
plt.gcf().subplots_adjust(bottom=0.18)
plt.gcf().subplots_adjust(right=.3)
black_yellow = sns.dark_palette("yellow",10)
heatmap_list = [sns.heatmap(df, cmap=black_yellow, xticklabels=True, yticklabels=True) for df in df_list]
[heatmap_list[x].figure.savefig(file_names_list[x]+'.pdf', format='pdf') for x in range(0,len(heatmap_list))]

sns.heatmap() creates a problem while we are working in loop. To resolve this issue, the first iteration will be done individually and rest of the loop remains the same but we will add a parameter cbar=False to stop this recursion of colorbar in the loop portion.
# Set the seaborn font size.
sns.set(font_scale=0.5)
# Ensure that labels are not cut off.
plt.gcf().subplots_adjust(bottom=0.18)
plt.gcf().subplots_adjust(right=.3)
black_yellow = sns.dark_palette("yellow", 10)
hm = sns.heatmap(df_list[0], cmap=black_yellow, xticklabels=True, yticklabels=True)
hm.figure.savefig(file_names_list[0]+'.pdf', format='pdf')
heatmap_list = [sns.heatmap(df_list[i], cmap=black_yellow, xticklabels=True, yticklabels=True, cbar=False) for i in range(1, len(df_list))]
[heatmap_list[x].figure.savefig(file_names_list[x+1]+'.pdf', format='pdf') for x in range(0, len(heatmap_list))]

Scatter plot without x-axis

I am trying to visualize some data and have built a scatter plot with this code -
sns.regplot(y="Calls", x="clientid", data=Drop)
This is the output -
I don't want it to consider the x-axis. I just want to see how the data lie w.r.t y-axis. Is there a way to do that?

As #iayork suggested, you can see the distribution of your points with a striplot or a swarmplot (you could also combine them with a violinplot). If you need to move the points closer to the y-axis, you can simply adjust the size of the figure so that the width is small compared to the height (here i'm doing 2 subplots on a 4x5 in figure, which means that each plot is roughly 2x5 in).
fig, (ax1,ax2) = plt.subplots(1,2, figsize=(4,5))
sns.stripplot(d, orient='vert', ax=ax1)
sns.swarmplot(d, orient='vert', ax=ax2)
plt.tight_layout()
However, I'm going to suggest that maybe you want to use distplot instead. This function is specifically created to show the distribution of you data. Here i'm plotting the KDE of the data, as well as the "rugplot", which shows the position of the points along the y-axis:
fig = plt.figure()
sns.distplot(d, kde=True, vertical=True, rug=True, hist=False, kde_kws=dict(shade=True), rug_kws=dict(lw=2, color='orange'))

colorbars for grid of line (not contour) plots in matplotlib

I'm having trouble giving colorbars to a grid of line plots in Matplotlib.
I have a grid of plots, which each shows 64 lines. The lines depict the penalty value vs time when optimizing the same system under 64 different values of a certain hyperparameter h.
Since there are so many lines, instead of using a standard legend, I'd like to use a colorbar, and color the lines by the value of h. In other words, I'd like something that looks like this:
The above was done by adding a new axis to hold the colorbar, by calling figure.add_axes([0.95, 0.2, 0.02, 0.6]), passing in the axis position explicitly as parameters to that method. The colorbar was then created as in the example code here, by instantiating a ColorbarBase(). That's fine for single plots, but I'd like to make a grid of plots like the one above.
To do this, I tried doubling the number of subplots, and using every other subplot axis for the colorbar. Unfortunately, this led to the colorbars having the same size/shape as the plots:
Is there a way to shrink just the colorbar subplots in a grid of subplots like the 1x2 grid above?
Ideally, it'd be great if the colorbar just shared the same axis as the line plot it describes. I saw that the colorbar.colorbar() function has an ax parameter:
ax
parent axes object from which space for a new colorbar axes will be stolen.
That sounds great, except that colorbar.colorbar() requires you to pass in a imshow image, or a ContourSet, but my plot is neither an image nor a contour plot. Can I achieve the same (axis-sharing) effect using ColorbarBase?

It turns out you can have different-shaped subplots, so long as all the plots in a given row have the same height, and all the plots in a given column have the same width.
You can do this using gridspec.GridSpec, as described in this answer.
So I set the columns with line plots to be 20x wider than the columns with color bars. The code looks like:
grid_spec = gridspec.GridSpec(num_rows,
num_columns * 2,
width_ratios=[20, 1] * num_columns)
colormap_type = cm.cool
for (x_vec_list,
y_vec_list,
color_hyperparam_vec,
plot_index) in izip(x_vec_lists,
y_vec_lists,
color_hyperparam_vecs,
range(len(x_vecs))):
line_axis = plt.subplot(grid_spec[grid_index * 2])
colorbar_axis = plt.subplot(grid_spec[grid_index * 2 + 1])
colormap_normalizer = mpl.colors.Normalize(vmin=color_hyperparam_vec.min(),
vmax=color_hyperparam_vec.max())
scalar_to_color_map = mpl.cm.ScalarMappable(norm=colormap_normalizer,
cmap=colormap_type)
colorbar.ColorbarBase(colorbar_axis,
cmap=colormap_type,
norm=colormap_normalizer)
for (line_index,
x_vec,
y_vec) in zip(range(len(x_vec_list)),
x_vec_list,
y_vec_list):
hyperparam = color_hyperparam_vec[line_index]
line_color = scalar_to_color_map.to_rgba(hyperparam)
line_axis.plot(x_vec, y_vec, color=line_color, alpha=0.5)
For num_rows=1 and num_columns=1, this looks like:

matplotlib/pyplot: print only ticks once in scatter plot?

I am looking for a way to clean-up the ticks in my pyplot scatter plot.
To create a scatter plot from a Pandas dataset column with strings as elements, I followed the example in [2] - and got me a nice scatter plot:
input are 10k data points where the X axis has only ~200 unique 'names', that got matched to scalars for plotting. Obviously, plotting all the 10k ticks on the x axis is a bit clocked. So, I am looking for a way, to print each unique tick only once and not for each data point?
My code looks like:
fig2 = plt.figure()
WNsUniques, WNs = numpy.unique(taskDataFrame['modificationhost'], return_inverse=True)
scatterWNs = fig2.add_subplot(111)
scatterWNs.scatter(WNs, taskDataFrame['cpuconsumptiontime'])
scatterWNs.set(xticks=range(len(WNsUniques)), xticklabels=WNsUniques)
plt.xticks(rotation='vertical')
plt.savefig("%s_WNs-CPUTime_scatter.%s" % (dfName,"pdf"))
actually, I was hoping that setting the plot x ticks to the unique names should be sufficient - but apparently not? Probably it is something easy, but how do I reduce the ticks for my subplot to unique once (should they not already be uniqueified as returned by numpy.unique?)?
Maybe someone has an idea for me?
Cheers ans thanks,
Thomas

You can use the set_xticks method to accomplish this. Note that 200 axis ticks with labels are still quite a lot to force on a small plot like this, and this is what you might already be seeing with the above code. Without complete code to play with, I can't say for sure.
Additionally, what is the size of WNsUniques? That can easily be used to check if your call to unique is doing what you think.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas