Histogram in Bokeh charts takes a looong time - matplotlib

I am trying to move from matplotlib to bokeh. However, I am finding some annoying features. The latest one I encountered was that it took several minutes to make a histogram of about 1.5M entries, which would have taken a fraction of a second with Matplotlib. Is that normal? And if so, what's the reason?
from bokeh.charts import Histogram, output_notebook, show
import pandas as pd
output_notebook()
jd1 = pd.read_csv("somefile.csv")
p = Histogram(jd1['QTY'], bins=50)
show(p)

I'm not sure offhand what might be going on with Histogram in your case. Without the data file it's impossible to try to reproduce or debug. But in any case bokeh.charts does not really have a maintainer at the moment, so I would actually just recommend using bokeh.plotting to create your histogram. The bokeh.plotting API is stable (for several years now) and extensively documented. It's a few more lines of code but not many:
import numpy as np
from bokeh.plotting import figure, show, output_notebook
output_notebook()
# synthesize example data (10 million points)
measured = np.random.normal(0, 0.5, 10000000)
hist, edges = np.histogram(measured, density=True, bins=50)
p = figure(title="Normal Distribution (μ=0, σ=0.5)")
p.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:], line_color=None)
show(p)
On my laptop that takes about half a second for a 10 million point histogram, including generating the synthetic data and binning it.
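To make the reason for the speed concrete, here is a minimal sketch (NumPy only, with a made-up sample size of one million and a fixed seed as assumptions) showing that the binning happens in NumPy, so only 50 bar heights and 51 edges ever need to be handed to the plotting layer:

```python
import numpy as np

# Synthetic data; the size and seed are illustrative assumptions
rng = np.random.default_rng(0)
measured = rng.normal(0, 0.5, 1_000_000)

# Bin once in NumPy: 50 heights and 51 edges, regardless of input size
hist, edges = np.histogram(measured, density=True, bins=50)

# With density=True the bar areas sum to 1
widths = np.diff(edges)
area = float(np.sum(hist * widths))
print(hist.shape, edges.shape, area)
```

The arrays hist, edges[:-1] and edges[1:] are exactly what p.quad receives in the answer above, so the browser only draws 50 rectangles.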

Related

interactive large plot with vaex

I am using Python 3.8 on Windows 10 and trying to make a plot with about 700M points in it (sound wave analysis). In the question "Interactive large plot with ~20 million sample points and gigabytes of data", Vaex was highly recommended.
I am trying to use examples from the Vaex tutorial, but the graph does not appear. I could not find a good example on the Internet.
import vaex
import numpy as np
df = vaex.example()
df.plot1d(df.x, limits='99.7%');
The Vaex documents don't mention that pyplot.show() should be used to display. Plot1d plots a histogram. How to plot just connected points?
I am pretty sure that the vaex documentation explains that the (now deprecated) method .plot1d(...) is a wrapper around matplotlib plotting routines.
If you would like to create custom plots using the binned data, you can take this approach (I also found it in their docs):
import vaex
import numpy as np
import pylab as plt
# Load example data
df = vaex.example()
# Do the binning yourself
counts = df.count(binby=df.x, shape=64, limits='99.7%')
# Take care of the x-axis
limits = df.limits_percentage(df.x, percentage=99.7)
xvals = np.linspace(limits[0], limits[1], num=64)
# Create your custom plot via matplotlib, plotly or your favorite tool
plt.plot(xvals, counts, marker='o', ms=5)
plt.show()
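The "bin first, then plot connected points" idea can also be sketched without vaex. The block below is only an illustration under stated assumptions: np.histogram stands in for df.count, the data are synthetic, and the percentile call only roughly mimics limits='99.7%'. It also uses bin centers rather than evenly spaced limits for the x values:

```python
import numpy as np

# Synthetic stand-in for df.x (assumption, not the vaex example data)
rng = np.random.default_rng(0)
x = rng.normal(size=100_000)

shape = 64
# Roughly the central 99.7% of the data
lo, hi = np.percentile(x, [0.15, 99.85])

# Do the binning yourself
counts, edges = np.histogram(x, bins=shape, range=(lo, hi))

# One x value per bin: the midpoint of each pair of edges
centers = 0.5 * (edges[:-1] + edges[1:])

# centers and counts can now go to plt.plot(centers, counts, marker='o')
```

Because only 64 points reach the plotting library, the cost of drawing no longer depends on the number of samples.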

Fast image sequences / animation in Jupyter Notebook with matplotlib

I can't seem to find a simple and fast way of plotting image sequences with plain matplotlib in a Jupyter Notebook. I've tried FuncAnimation, fig.canvas.draw(), blitting, as well as just the standard imshow-pause combo; without success or with very slow refresh rate. I don't need the images to be interactive - they just need to be shown sequentially and can't pop up a new figure window for each image. I've seen many solutions here, with none seeming to work the way I want.
My general pipeline does significant processing, with each image generated and plotted within a while or for loop. FuncAnimation is not desirable since it requires passing a function handle and my use case involves many arguments and state variables that make it difficult to use.
The best I've got is the working example below using fig.canvas.draw() - showing that drawing time increases linearly per iteration, where I need it to remain constant!
import numpy as np
import matplotlib.pyplot as plt
from timeit import default_timer as timer
%matplotlib notebook
num_iters = 50
im = np.arange(60).reshape((15,4))
fig, ax = plt.subplots(1,1)
fig.show()
fig.canvas.draw()
iter_times = np.zeros(num_iters)
for i in range(num_iters):
    im = np.roll( a=im, shift=1, axis=0 )
    t0 = timer()
    ax.imshow(im.T, vmin=im.min(), vmax=im.max())
    ax.set_title('Iter # {}/{}'.format(i+1, num_iters))
    fig.canvas.draw()
    iter_times[i] = timer()-t0
plt.figure(figsize=(6,3))
plt.plot(np.arange(num_iters)+1, iter_times)
plt.title('Imshow/drawing time per iteration')
plt.xlabel('Iteration number')
plt.ylabel('Time (seconds)')
plt.tight_layout()
plt.show()
I think the problem is that the images are 'building up' on the axes, so every previous image is redrawn on every iteration. If you add ax.clear() right before the imshow() call, the draw time per iteration should stay roughly constant.
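A minimal sketch of that fix is below. It assumes the headless Agg backend so it can run outside a notebook (in the notebook you would keep %matplotlib notebook instead), and it uses a shortened loop of 5 iterations:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, so this runs outside Jupyter
import matplotlib.pyplot as plt
import numpy as np

num_iters = 5
im = np.arange(60).reshape((15, 4))
fig, ax = plt.subplots(1, 1)

for i in range(num_iters):
    im = np.roll(im, shift=1, axis=0)
    ax.clear()  # remove the previous image so artists do not accumulate
    ax.imshow(im.T, vmin=im.min(), vmax=im.max())
    ax.set_title('Iter # {}/{}'.format(i + 1, num_iters))
    fig.canvas.draw()

# Only the most recent image is still attached to the axes
n_images = len(ax.images)
print(n_images)
```

Without the ax.clear() call, ax.images would hold one image per iteration, which is why each draw gets slower.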

Using multiple sliders to manipulate curves in a single graph

I created the following Jupyter Notebook. Here three functions are shifted using three sliders. In the future I would like to generalise it to an arbitrary number of curves (i.e. n curves). However, right now, the graph updating procedure is very slow and the graph itself doesn't seem to stay fixed in the corresponding cell. I didn't receive any error message but I'm pretty sure that there is a mistake in the update function.
Here is the code:
from ipywidgets import interact
import ipywidgets as widgets
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display
x = np.linspace(0, 2*np.pi, 2000)
y1=np.exp(0.3*x)*np.sin(5*x)
y2=5*np.exp(-x**2)*np.sin(20*x)
y3=np.sin(2*x)
m=[y1,y2,y3]
num_curve=3
def shift(v_X):
    v_T=v_X
    vector=np.transpose(m)
    print(' ')
    print(v_T)
    print(' ')
    curve=vector+v_T
    return curve
controls=[]
o='vertical'
for i in range(num_curve):
    title="x%i" % (i%num_curve+1)
    sl=widgets.FloatSlider(description=title,min=-2.0, max=2.0, step=0.1,orientation=o)
    controls.append(sl)
Dict = {}
for c in controls:
    Dict[c.description] = c
uif = widgets.HBox(tuple(controls))
def update_N(**xvalor):
    xvalor=[]
    for i in range(num_curve):
        xvalor.append(controls[i].value)
    curve=shift(xvalor)
    new_curve=pd.DataFrame(curve)
    new_curve.plot()
    plt.show()
outf = widgets.interactive_output(update_N,Dict)
display(uif, outf)
Your function runs for every intermediate value the slider moves through, which is probably what causes the long run times you are seeing. You can change this by adding continuous_update=False to your FloatSlider call:
sl=widgets.FloatSlider(description=title,
                       min=-2.0,
                       max=2.0,
                       step=0.1,
                       orientation=o,
                       continuous_update=False)
This got me much better performance, and the chart doesn't flicker as much as there are vastly fewer redraws. Does this help?

Basic axis malfunction in matplotlib

When plotting using matplotlib, I ran into an interesting issue where the y axis is scaled by a very inconvenient quantity. Here's a MWE that demonstrates the problem:
import numpy as np
import matplotlib.pyplot as plt
l = np.linspace(0.5,2,2**10)
a = (0.696*l**2)/(l**2 - 9896.2e-9**2)
plt.plot(l,a)
plt.show()
When I run this, I get a figure that looks like this picture
The y-axis clearly is scaled by a silly quantity even though the y data are all between 1 and 2.
This is similar to the question:
Axis numerical offset in matplotlib
I'm not satisfied with the answer to this question in that it makes no sense to me why I need to go through the convoluted process of changing axis settings when the data are between 1 and 2 (EDIT: between 0 and 1). Why does this happen? Why does matplotlib use such a bizarre scaling?
The data in the plot are all between 0.696000000017 and 0.696000000273. For such cases it makes sense to use some kind of offset.
If you don't want that, you can use your own formatter:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker
l = np.linspace(0.5,2,2**10)
a = (0.696*l**2)/(l**2 - 9896.2e-9**2)
plt.plot(l,a)
fmt = matplotlib.ticker.StrMethodFormatter("{x:.12f}")
plt.gca().yaxis.set_major_formatter(fmt)
plt.show()
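To check the claim about the data range, here is a small NumPy-only sketch. It recomputes the same array as in the question and formats the extremes with the same "{:.12f}" pattern that the formatter above applies to each tick value (the variable names span, lo_label and hi_label are just for illustration):

```python
import numpy as np

l = np.linspace(0.5, 2, 2**10)
a = (0.696 * l**2) / (l**2 - 9896.2e-9**2)

# The whole curve spans only about 2.6e-10 around 0.696
span = a.max() - a.min()

# Same format spec as StrMethodFormatter("{x:.12f}")
lo_label = "{:.12f}".format(a.min())
hi_label = "{:.12f}".format(a.max())
print(lo_label, hi_label, span)
# → 0.696000000017 0.696000000273 (span is about 2.6e-10)
```

Since the values differ only from the tenth decimal place on, plain tick labels would all read 0.696, which is why matplotlib falls back to an offset by default.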

Cutting up the x-axis to produce multiple graphs with seaborn?

The following code when graphed looks really messy at the moment. The reason is I have too many values for 'fare'. 'Fare' ranges from [0-500] with most of the values within the first 100.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
titanic = sns.load_dataset("titanic")
y =titanic.groupby([titanic.fare//1,'sex']).survived.mean().reset_index()
sns.set(style="whitegrid")
g = sns.factorplot(x='fare', y='survived', col='sex', kind='bar', data=y,
                   size=4, aspect=2.5, palette="muted")
g.despine(left=True)
g.set_ylabels("Survival Probability")
g.set_xlabels('Fare')
plt.show()
I would like to try slicing up the 'fare' of the plots into subsets but would like to see all the graphs at the same time on one screen. I was wondering if this is possible without having to resort to groupby.
I will have to play around with the values of 'fare' to see what I would want each graph to represent, but as a sample let's break up the graph into these 'fare' ranges:
[0-18]
[18-35]
[35-70]
[70-300]
[300-500]
So the total would be 10 graphs on one page, because of the juxtaposition with the opposite sex.
Is it possible with Seaborn? Do I need to do a lot of configuring with matplotlib? Thanks.
Actually I wrote a little blog post about this a while ago. If you are plotting histograms you can use the by keyword:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set() #rescue matplotlib's styles from the early '90s
data = sns.load_dataset('titanic')
data.hist(by='class', column = 'fare')
plt.show()
Otherwise if you're just plotting value-counts, you have to roll your own grid:
def categorical_hist(self,column,by,layout=None,legend=None,**params):
    from math import sqrt, ceil
    # default to a square grid large enough for every distinct value
    if layout is None:
        s = ceil(sqrt(self[column].unique().size))
        layout = (s,s)
    return self.groupby(by)[column]\
        .value_counts()\
        .sort_index()\
        .unstack()\
        .plot.bar(subplots=True,layout=layout,legend=legend,**params)
categorical_hist(data, by='class', column='embark_town')
Edit: If you want survival rate by fare range, you could do something like this (assuming pandas is imported as pd):
data.groupby(pd.cut(data.fare,10)).apply(lambda x: x.survived.sum()/len(x))
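As a hedged variant of the same idea, the sketch below uses the fare ranges listed in the question instead of 10 equal-width bins, and splits by sex as well. The DataFrame is a tiny made-up sample standing in for the titanic data (not real records), so that the block runs without downloading anything:

```python
import pandas as pd

# Made-up rows for illustration only
data = pd.DataFrame({
    "fare":     [5.0, 10.0, 20.0, 30.0, 50.0, 60.0, 100.0, 400.0],
    "sex":      ["male", "female", "male", "female",
                 "male", "female", "male", "female"],
    "survived": [0, 1, 0, 1, 1, 1, 0, 1],
})

# Bin edges taken from the ranges in the question
bins = [0, 18, 35, 70, 300, 500]
fare_range = pd.cut(data["fare"], bins=bins, include_lowest=True)

# Mean of the 0/1 'survived' column is the survival rate per group
rate = (data.groupby([fare_range, "sex"], observed=True)["survived"]
            .mean()
            .unstack())
print(rate)
```

The result has one row per fare range and one column per sex, so rate.plot.bar(subplots=True) or a seaborn bar plot on the stacked form would give the side-by-side panels the question asks for.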