Matplotlib graph line smoothing - matplotlib

bullets' trajectory comparison
I'm a new Python user. I'm using this powerful language to do scientific research and data analysis.
I'm writing my physics thesis, in which I try to describe and analyze the external ballistics behind a bullet's flight.
I'm using matplotlib to draw graphs of the bullet's parabolic path and the related cross points. Given that, I'd like to know whether there is a way to smooth the lines drawn from the real experimental data, so the graph doesn't end up as a series of straight segments.
Thanks a lot to all of you!
Francesco

All right Francesco, thanks for uploading the image. Now, let's have some fun with coding.
First, I suggest using the NumPy function np.polyfit() to fit a polynomial of a chosen degree to a set of values. Be aware of the degree you set, as the results can change widely. For more information, please take a look at the documentation: https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.polyfit.html
Then, to smooth your curve, build a denser set of x values with np.linspace() and evaluate the fitted polynomial on it via np.poly1d() (it builds a polynomial from the coefficients returned by polyfit, which you can then call to get the y coordinates).
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# experimental data
x = [0, 50, 100, 150, 200, 250]
y = [-1, 0.8, 1.9, 1.6, 0, -3]
# fit a degree-2 polynomial and build a callable polynomial from its coefficients
z = np.polyfit(x, y, 2)
p = np.poly1d(z)
# dense x grid so the fitted curve looks smooth
xp = np.linspace(-2, 255)
plt.plot(x, y, '.', xp, p(xp), '-')
plt.show()
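If instead you want the smooth curve to pass exactly through the measured points rather than just approximate them, spline interpolation is another option. Here is a minimal sketch (not part of the original answer), assuming SciPy is available; make_interp_spline and the dense grid xs are illustrative choices:
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import make_interp_spline
# experimental data (same as above)
x = [0, 50, 100, 150, 200, 250]
y = [-1, 0.8, 1.9, 1.6, 0, -3]
# cubic spline through the data points, evaluated on a dense grid
spline = make_interp_spline(x, y, k=3)
xs = np.linspace(min(x), max(x), 300)
plt.plot(x, y, '.', xs, spline(xs), '-')
plt.show()
Unlike the polynomial fit, the spline follows every measured point, so it will also follow any noise in the data.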

Related

Ticks position in heatmap with categorical data (seaborn)

I am trying to plot a confusion matrix of my predictions. My data is multi-class (13 different labels), so I'm using a heatmap.
As you can see below, my heat map looks generally okay, but the labels are a bit out of position: the y ticks should be a little lower and the x ticks should be a bit more to the right. I want to move both sets of axis ticks a bit so they are aligned with the center of each square.
My code:
sns.set()
my_mask = np.zeros((con_matrix.shape[0], con_matrix.shape[0]), dtype=int)
for i in range(con_matrix.shape[0]):
    for j in range(con_matrix.shape[0]):
        my_mask[i][j] = con_matrix[i][j] == 0
fig_dims = (10, 10)
plt.subplots(figsize=fig_dims)
ax = sns.heatmap(con_matrix, annot=True, fmt="d", linewidths=.5, cmap="Pastel1", cbar=False, mask=my_mask, vmax=15)
plt.xticks(range(len(party_names)), party_names, rotation=45)
plt.yticks(range(len(party_names)), party_names, rotation='horizontal')
plt.show()
And for reproduction purposes, here are con_matrix and party_names hard-coded:
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
con_matrix = np.array([[55, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2],
                       [0, 199, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 1],
                       [0, 0, 52, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
                       [0, 0, 0, 39, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                       [0, 0, 0, 0, 90, 0, 0, 0, 0, 0, 0, 4, 3],
                       [0, 0, 0, 1, 0, 35, 0, 0, 0, 0, 0, 0, 0],
                       [0, 0, 0, 0, 5, 0, 26, 0, 0, 1, 0, 1, 0],
                       [0, 5, 0, 0, 0, 1, 0, 44, 0, 0, 3, 0, 1],
                       [0, 1, 0, 0, 0, 0, 0, 0, 52, 0, 0, 0, 0],
                       [0, 1, 0, 0, 2, 0, 0, 0, 0, 235, 0, 1, 1],
                       [1, 2, 0, 0, 0, 0, 0, 3, 0, 0, 34, 0, 3],
                       [0, 0, 0, 0, 5, 0, 0, 0, 0, 1, 0, 40, 0],
                       [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 46]])
party_names = ['Blues', 'Browns', 'Greens', 'Greys', 'Khakis', 'Oranges', 'Pinks', 'Purples', 'Reds', 'Turquoises', 'Violets', 'Whites', 'Yellows']
I have already tried working with the position argument of the different axes, but it did not turn out well. I could not find an exact answer on this site either (at least not a solution that works for categorical data).
I'm new to visualization with seaborn; any improvement with explanations would be appreciated (not only for my problem, but for my code & visualization in general as well).
You can shift both sets of tick labels by a 0.5 offset to get the desired alignment. To do so, I have used NumPy's arange, which allows adding 0.5 to the whole array in a vectorized way.
plt.xticks(np.arange(len(party_names))+0.5, party_names, rotation=45)
plt.yticks(np.arange(len(party_names))+0.5, party_names, rotation='horizontal')
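As a side note (not part of the original answer), seaborn.heatmap can also place the labels at the cell centers for you if you pass them via its xticklabels/yticklabels arguments; a minimal sketch, assuming the con_matrix, my_mask, and party_names defined above:
ax = sns.heatmap(con_matrix, annot=True, fmt="d", linewidths=.5, cmap="Pastel1",
                 cbar=False, mask=my_mask, vmax=15,
                 xticklabels=party_names, yticklabels=party_names)
# rotate the labels afterwards, as in the original code
ax.set_xticklabels(party_names, rotation=45)
ax.set_yticklabels(party_names, rotation='horizontal')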

Estimation of t-distribution by mean of samples does not work

I am trying to create a t-distribution by taking the mean of many samples from a normal distribution (and then estimating the shape with kernel density estimation).
For some reason, I am getting pretty different results when I compare what I get with a proper t-distribution. I don't understand what is going wrong, so I think I am confused about something.
Here is the code:
import numpy as np
from scipy.stats import gaussian_kde
import matplotlib.pyplot as plt
import seaborn
inner_sample_size = 10
X = np.arange(-3, 3, 0.01)
results = [
    np.mean(np.random.normal(size=inner_sample_size))
    for _ in range(10000)
]
estimation = gaussian_kde(results)
plt.plot(X, estimation.evaluate(X))
t_samples = np.random.standard_t(inner_sample_size, 10000)
t_estimator = gaussian_kde(t_samples)
plt.plot(X, t_estimator.evaluate(X))
plt.ylabel("Probability density")
plt.show()
Here is the plot I get: the orange line is NumPy's own t-distribution, and the blue line is the one estimated by sampling.
Your assumption that the mean of standard normals has a t-distribution is incorrect. In fact, the mean of standard normals has a normal distribution, which explains the shape of your blue graph. To generate one random variable T from a t-distribution with k degrees of freedom, you first generate k+1 independent standard normals Z_i, i = 0, ..., k. You then compute
T = Z_0 / sqrt( (Z_1^2 + ... + Z_k^2) / k ).
The sum of squared standard normals, Z_1^2 + ... + Z_k^2, has a chi-squared distribution with k degrees of freedom, so if there is a pre-canned method to generate it, you should use it, since it's likely more efficient.
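A minimal sketch of this construction (not from the original answer; k and n are illustrative), comparing the from-scratch version with NumPy's pre-canned chi-squared generator:
import numpy as np
k = 10      # degrees of freedom
n = 10000   # number of T samples to draw
# from first principles: one extra standard normal divided by the
# square root of a scaled sum of k squared standard normals
z0 = np.random.normal(size=n)
chi2 = np.sum(np.random.normal(size=(n, k)) ** 2, axis=1)
t_manual = z0 / np.sqrt(chi2 / k)
# the same thing, using the pre-canned chi-squared generator
t_precanned = np.random.normal(size=n) / np.sqrt(np.random.chisquare(k, size=n) / k)
Both samples should follow the same distribution as np.random.standard_t(k, n).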

Annotate a data point with a graph

For lack of a better term, is there a way to annotate a data point with a graph? I include an example of what I am after below.
A big black data point with a graph corresponding to it. Note that the graph is rotated so that its "x" axis (not shown) is perpendicular to the "y" axis of the scatter plot.
annotation_box (http://matplotlib.org/examples/pylab_examples/demo_annotation_box.html) is the closest thing I can find at the moment, but even knowing the proper term for what I want to do would make my life easier.
If I understood the problem correctly, what you need are floating axes that you can place as annotations over your plot. Unfortunately, this is not easily possible in matplotlib, as far as I know.
An easy solution would be to just plot the points and graphs in the same axis, with the graphs scaled down and shifted close to the points.
import numpy as np
import scipy.stats as sps
import matplotlib.pyplot as plt
xp = [5, 1, 3]
yp = [2, 1, 4]
# just generate some curves
curves_x = np.array([np.linspace(0, 10, 100)] * 3)
curves_y = sps.gamma.pdf(curves_x[0], [[2], [5], [7]], 1)
plt.scatter(xp, yp, s=50)
for x, y, cx, cy in zip(xp, yp, curves_x, curves_y):
    plt.plot(x + cy / np.max(cy) + 0.1, y + cx / np.max(cx) - 0.5)
plt.show()
This is a very simplistic example. The numbers will have to be tuned to look nice with the varying scale of the data.

Figures with lots of data points in matplotlib

I generated the attached image using matplotlib (PNG format). I would like to use EPS or PDF, but I find that with all the data points, the figure is really slow to render on screen. Other than just plotting less of the data, is there any way to optimize it so that it loads faster?
I think you have three options:
As you mentioned yourself, you can plot fewer points. For the plot you showed in your question I think it would be fine to only plot every other point.
As @tcaswell stated in his comment, you can use a line instead of points, which will be rendered more efficiently.
You could rasterize the blue dots. Matplotlib allows you to selectively rasterize single artists, so if you pass rasterized=True to the plotting command, you will get a bitmapped version of the points in the output file. This will be much faster to load, at the price of limited zooming due to the resolution of the bitmap. (Note that the axes and all the other elements of the plot remain vector graphics and font elements.)
First, if you want to show a "trend" in your plot, and considering that the x, y arrays you are plotting are "huge", you could apply random sub-sampling to your x, y arrays, keeping only a fraction of your data:
import numpy as np
import matplotlib.pyplot as plt
fraction = 0.50
x_resampled = []
y_resampled = []
# keep each point with probability `fraction` (x, y are the original data arrays)
for k in range(0, len(x)):
    if np.random.rand() < fraction:
        x_resampled.append(x[k])
        y_resampled.append(y[k])
plt.scatter(x_resampled,y_resampled , s=6)
plt.show()
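The same sub-sampling can be done without an explicit loop; a minimal sketch (not from the original answer), assuming x and y are the original NumPy arrays:
import numpy as np
fraction = 0.50
# boolean mask that keeps each point with probability `fraction`
# (x and y are assumed to be the original, large NumPy arrays)
keep = np.random.rand(len(x)) < fraction
x_resampled, y_resampled = x[keep], y[keep]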
Second, have you considered using a log scale on the x-axis to increase visibility?
In this example, only the plotting area is rasterized; the axes are still in vector format:
import numpy as np
import matplotlib.pyplot as plt
x = np.random.uniform(size=400000)
y = np.random.uniform(size=400000)
plt.scatter(x, y, marker='x', rasterized=True)
plt.savefig("norm.pdf", format='pdf')

Exponential decay curve fitting in numpy and scipy

I'm having a bit of trouble with fitting a curve to some data, but can't work out where I am going wrong.
In the past I have done this with numpy.linalg.lstsq for exponential functions and scipy.optimize.curve_fit for sigmoid functions. This time I wished to create a script that would let me specify various functions, determine parameters and test their fit against the data. While doing this I noticed that Scipy leastsq and Numpy lstsq seem to provide different answers for the same set of data and the same function. The function is simply y = e^(l*x) and is constrained such that y=1 at x=0.
The Excel trend line agrees with the NumPy lstsq result, but since SciPy's leastsq can take any function, it would be good to work out what the problem is.
import scipy.optimize as optimize
import numpy as np
import matplotlib.pyplot as plt
## Sampled data
x = np.array([0, 14, 37, 975, 2013, 2095, 2147])
y = np.array([1.0, 0.764317544, 0.647136491, 0.070803763, 0.003630962, 0.001485394, 0.000495131])
# function
fp = lambda p, x: np.exp(p*x)
# error function
e = lambda p, x, y: (fp(p, x) - y)
# using scipy least squares
l1, s = optimize.leastsq(e, -0.004, args=(x, y))
print(l1)
# [-0.0132281]
# using numpy least squares
l2 = np.linalg.lstsq(np.vstack([x, np.zeros(len(x))]).T, np.log(y))[0][0]
print(l2)
# -0.00313461628963 (same answer as Excel trend line)
# smooth x for plotting
x_ = np.arange(0, x[-1], 0.2)
plt.figure()
plt.plot(x, y, 'rx', x_, fp(l1, x_), 'b-', x_, fp(l2, x_), 'g-')
plt.show()
Edit - additional information
The MWE above includes a small sample of the dataset. When fitting the actual data, the scipy.optimize.curve_fit curve gives an R^2 of 0.82, while the numpy.linalg.lstsq curve, which is the same as that calculated by Excel, gives an R^2 of 0.41.
You are minimizing different error functions.
When you use numpy.linalg.lstsq, the error function being minimized is
np.sum((np.log(y) - p * x)**2)
while scipy.optimize.leastsq minimizes the function
np.sum((y - np.exp(p * x))**2)
The first case requires a linear dependency between the dependent and independent variables, but the solution is known analytically, while the second can handle any dependency but relies on an iterative method.
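To make the difference concrete, here is a minimal sketch (not from the original answers) that fits the sample data above both ways, using scipy.optimize.curve_fit for the nonlinear case:
import numpy as np
from scipy.optimize import curve_fit
x = np.array([0, 14, 37, 975, 2013, 2095, 2147])
y = np.array([1.0, 0.764317544, 0.647136491, 0.070803763,
              0.003630962, 0.001485394, 0.000495131])
# linear fit in log-space: minimizes sum((log(y) - p*x)**2)
p_log = np.linalg.lstsq(x[:, None].astype(float), np.log(y), rcond=None)[0][0]
# nonlinear fit in the original space: minimizes sum((y - exp(p*x))**2)
p_exp, _ = curve_fit(lambda x, p: np.exp(p * x), x, y, p0=[-0.004])
print(p_log)     # close to the Excel / lstsq value (~ -0.0031)
print(p_exp[0])  # close to the leastsq value (~ -0.0132)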
On a separate note, I cannot test it right now, but when using numpy.linalg.lstsq, you don't need to vstack a row of zeros; the following works as well:
l2 = np.linalg.lstsq(x[:, None], np.log(y))[0][0]
To expound a bit on Jaime's point, any non-linear transformation of the data will lead to a different error function and hence to different solutions. These will also lead to different confidence intervals for the fitted parameters. So you have three possible criteria to use to make a decision: which error you want to minimize, which parameters you want more confidence in, and finally, if you are using the fit to predict some value, which method yields less error in that predicted value. Playing around a bit analytically and in Excel suggests that different kinds of noise in the data (e.g. whether the noise scales the amplitude, affects the time constant, or is additive) lead to different choices of solution.
I'll also add that while this trick "works" for exponential decay to 0, it can't be used in the more general (and common) case of damped exponentials (rising or falling) to values that cannot be assumed to be 0.