interactive large plot with vaex - vaex

I am using python 3.8 on Windows 10; trying to make a plot with about 700M points in it, sound wave analysis. Here: Interactive large plot with ~20 million sample points and gigabytes of data
Vaex was highly recommended. I am trying to use examples from the Vaex tutorial but the graph does not appear. I could not find a good example on Internet.
import vaex
import numpy as np
df = vaex.example()
df.plot1d(df.x, limits='99.7%');
The Vaex documents don't mention that pyplot.show() should be used to display. Plot1d plots a histogram. How to plot just connected points?

I am pretty sure that the vaex documentation explains that the (now deprecated) method .plot1d(...) is a wrapper around matplotlib plotting routines.
If you would like to create custom plots using the binned data, you can take this approach (I also found it in their docs)
import vaex
import numpy as np
import pylab as plt
# Load example data
df = vaex.example()
# Do the binning yourself
counts = df.count(binby=df.x, shape=64, limits='99.7%')
# Take care of the x-axis
limits = df.limits_percentage(df.x, percentage=99.7)
xvals = np.linspace(limits[0], limits[1], num=64)
# Create your custom plot via matplotlib, plotly or your favorite tool
p.plot(xvals, counts, marker='o', ms=5);

Related

Histogram in Bokeh charts takes a looong time

I am trying to move from matplotlib to bokeh. However, I am finding some annoying features. Last I encountered was that it took several minutes to make an histogram of about 1.5M entries - it would have taken a fraction of a second with Matplotlib. Is that normal? And if so, what's the reason?
from bokeh.charts import Histogram, output_file, show
import pandas as pd
output_notebook()
jd1 = pd.read_csv("somefile.csv")
p = Histogram(jd1['QTY'], bins=50)
show(p)
I'm not sure offhand what might be going on with Histogram in your case. Without the data file it's impossible to try and reproduce or debug. But in any case bokeh.charts does not really have a maintainer at the moment, so I would actually just recommend using bokeh.plotting to create your historgam. The bokeh.plotting API is stable (for several years now) and extensively documented. It's a few more lines of code but not many:
import numpy as np
from bokeh.plotting import figure, show, output_notebook
output_notebook()
# synthesize example data
measured = np.random.normal(0, 0.5, 1000)
hist, edges = np.histogram(measured, density=True, bins=50)
p = figure(title="Normal Distribution (μ=0, σ=0.5)")
p.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:], line_color=None)
show(p)
As you can see that takes (on my laptop) ~half a second for a 10 million point histogram, including generating synthetic data and binning it.

Cutting up the x-axis to produce multiple graphs with seaborn?

The following code when graphed looks really messy at the moment. The reason is I have too many values for 'fare'. 'Fare' ranges from [0-500] with most of the values within the first 100.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
titanic = sns.load_dataset("titanic")
y =titanic.groupby([titanic.fare//1,'sex']).survived.mean().reset_index()
sns.set(style="whitegrid")
g = sns.factorplot(x='fare', y= 'survived', col = 'sex', kind ='bar' ,data= y,
size=4, aspect =2.5 , palette="muted")
g.despine(left=True)
g.set_ylabels("Survival Probability")
g.set_xlabels('Fare')
plt.show()
I would like to try slicing up the 'fare' of the plots into subsets but would like to see all the graphs at the same time on one screen. I was wondering it this is possible without having to resort to groupby.
I will have to play around with the values of 'fare' to see what I would want each graph to represent, but for a sample let's use break up the graph into these 'fare' values.
[0-18]
[18-35]
[35-70]
[70-300]
[300-500]
So the total would be 10 graphs on one page, because of the juxtaposition with the opposite sex.
Is it possible with Seaborn? Do I need to do a lot of configuring with matplotlib? Thanks.
Actually I wrote a little blog post about this a while ago. If you are plotting histograms you can use the by keyword:
import matplotlib.pyplot as plt
import seaborn.apionly as sns
sns.set() #rescue matplotlib's styles from the early '90s
data = sns.load_dataset('titanic')
data.hist(by='class', column = 'fare')
plt.show()
Otherwise if you're just plotting value-counts, you have to roll your own grid:
def categorical_hist(self,column,by,layout=None,legend=None,**params):
from math import sqrt, ceil
if layout==None:
s = ceil(sqrt(self[column].unique().size))
layout = (s,s)
return self.groupby(by)[column]\
.value_counts()\
.sort_index()\
.unstack()\
.plot.bar(subplots=True,layout=layout,legend=None,**params)
categorical_hist(data, by='class', column='embark_town')
Edit If you want survival rate by fare range, you could do something like this
data.groupby(pd.cut(data.fare,10)).apply(lambda x.survived.sum(): x./len(x))

Change y-axis scaling fontsize in pandas dataframe.plot()

I am changing the font-sizes in my python pandas dataframe plot. The only part that I could not change is the scaling of y-axis values (see the figure below).
Could you please help me with that?
Added:
Here is the simplest code to reproduce my problem:
import pandas as pd
start = 10**12
finish = 1.1*10**12
y = np.linspace(start , finish)
pd.DataFrame(y).plot()
plt.tick_params(axis='x', labelsize=17)
plt.tick_params(axis='y', labelsize=17)
You will see that this result in the graph similar to above. No change in the scaling of the y-axis.
Ma
There are just so many features that you can control with the plotting capabilities of pandas, which leverages matplotlib. I found that seaborn is a lot easier to produce pretty charts, and you have a lot more control over the parameters of your plots.
This is not the most elegant solution, but it works; however, it has a seborn dependency:
%pylab inline
import pandas as pd
import seaborn as sns
import numpy as np
sns.set(style="darkgrid")
sns.set(font_scale=1.5)
start = 10**12
finish = 1.1*10**12
y = np.linspace(start , finish)
pd.DataFrame(y).plot()
plt.tick_params(axis='x', labelsize=17)
plt.tick_params(axis='y', labelsize=17)
I use Jupyter Notebook an that's why I use %pylab inline. The key element here is the use of
font_scale=1.5
Which you can set to whatver you want that produces your desired result. This is what I get:

X-axis labels on Seaborn Plots in Bokeh

I'm attempting to follow the violin plot example in bokeh, but am unable to add x-axis labels to my violins. According to the Seaborn documentation it looks like I should be able to add x-axis labels via the "names" argument, however, the following code does not add x-axis labels:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from bokeh import mpl
from bokeh.plotting import show
# generate some random data
data = 1 + np.random.randn(20, 6)
# Use Seaborn and Matplotlib normally
sns.violinplot(data, color="Set3", names=["kirk","spock","bones","scotty","uhura","sulu"])
plt.title("Seaborn violin plot in Bokeh")
# Convert to interactive Bokeh plot with one command
show(mpl.to_bokeh(name="violin"))
I believe that the issue is that I'm converting a figure from seaborn to matplotlib to bokeh, but I'm not sure at what level the x-axis labels go in.
I've confirmed that the labels are showing up in matplotlib before conversion to bokeh. I've also tried adding the labels to bokeh after conversion, but this results in a weird plot. I've created an issue for this problem with the bokeh developers here.
Since Bokeh 12.5 (April 2017), support for Matplotlib has been deprecated, so mpl.to_bokeh() is no longer available.

Plotting points in 3d from text file using Matplotlib or Octave

Hi I have a text file containing three columns of numbers; one column each for the x,y,z coordinates of a bunch of points. All numbers are between 0 ad 1.
I want to plot all these points in the unit cube [0,1]x[0,1]x[0,1].
Please let me know how I can do this in Octave or MatPlot lib, whichever prduces a better quality image.
If I understand your question correctly, this is how it looks in Matplotlib:
This is the code to produce this plot:
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import numpy as np
np.random.seed(101)
x,y,z = np.random.rand(3,20)
fig = plt.figure()
# version 1.0.x syntax:
#ax = fig.add_subplot(111, projection='3d')
# version 0.99.x syntax: (accepted by 1.0.x as well)
ax = Axes3D(fig)
ax.scatter(x,y,z)
fig.savefig('scatter3d.png')
As the code suggests, there are slight differences in how matplotlib version 0.99.1.1 and version 1.0.1 behave, as noted in this SO question/answer. I am using 0.99.1.1, and I had trouble using all the options available to 2D scatter plots, which should be the same for 3D plots as well. The full list of scatter features are listed here.
The above code resulted from looking at the matplotlib tutorial on 3D plotting.