Cubic spline interpolation drops out halfway - matplotlib

I am trying to make a cubic spline interpolation and, for some reason, the interpolation drops off in the middle. It's very mysterious and I can't find any mention of similar occurrences anywhere online.
This is for my dissertation, so I have intentionally excluded some labels to keep it vague, but all the relevant code is as follows. For context, this is an astronomy-related plot.
from scipy.interpolate import CubicSpline
import numpy as np
import matplotlib.pyplot as plt
W = np.array([0.435,0.606,0.814,1.05,1.25,1.40,1.60])
# sum435 ... sum160 and sumc435 ... sumc160 are computed earlier;
# their values are listed at the end of the question
sum_all = np.array([sum435,sum606,sum814,sum105,sum125,sum140,sum160])
sum_can = np.array([sumc435,sumc606,sumc814,sumc105,sumc125,sumc140,sumc160])
fall = CubicSpline(W,sum_all)
newallx=np.arange(0.435,1.6,0.001)
newally=fall(newallx)
fcan = CubicSpline(W,sum_can)
newcanx=np.arange(0.435,1.6,0.001)
newcany=fcan(newcanx)
#----plot
plt.plot(newallx,newally)
plt.plot(newcanx,newcany)
plt.plot(W,sum_all,marker='o',color='r',linestyle='')
plt.plot(W,sum_can,marker='o',color='b',linestyle='')
plt.yscale("log")
plt.ylabel("Flux S$_v$ [erg s$^-$$^1$ cm$^-$$^2$ Hz$^-$$^1$]")
plt.xlabel("Wavelength [n$\lambda$]")
plt.show()
The plot that I get from that comes out like this, with a clear gap in the interpolation:
And in case you are wondering, these are the values in the sum_all and sum_can arrays (I assume it doesn't matter, but just in case you want the numbers to plot it yourself):
sum_all:
[ 3.87282732e+32 8.79993191e+32 1.74866333e+33 1.59946687e+33
9.08556547e+33 6.70458731e+33 9.84832359e+33]
sum_can:
[ 2.98381061e+28 1.26194810e+28 3.30328780e+28 2.90254609e+29
3.65117723e+29 3.46256846e+29 3.64483736e+29]
The gap happens between [0.606, 1.26194810e+28] and [0.814, 3.30328780e+28]. If I change the interval from 0.001 to something larger, it becomes obvious that the plot doesn't actually break off but merely dips below 0 on the y-axis (the curve itself is continuous). So why does it do that? Surely that's not a correct interpolation? Just looking by eye, that's clearly not a well-interpolated connection between those two points.
Any tips or comments would be extremely appreciated. Thank you so much in advance!

The reason for the breakdown is easier to see on a linear scale: the spline actually passes below 0, which is undefined on a log scale.
So I would suggest first taking the logarithm of the data, performing the spline interpolation on the log-scaled data, and then transforming back by raising 10 to the interpolated values.
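A quick sanity check on the question's original spline confirms this without any plot (a minimal sketch, reusing fall and newallx from the question):
# The spline overshoots downwards between the knots at 0.606 and 0.814,
# so its minimum is negative, and negative values cannot be drawn on a log axis.
print(fall(newallx).min())  # prints a negative number
Here is the full script, with the interpolation done in log space: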
from scipy.interpolate import CubicSpline
import numpy as np
import matplotlib.pyplot as plt
W = np.array([0.435,0.606,0.814,1.05,1.25,1.40,1.60])
sum_all = np.array([ 3.87282732e+32, 8.79993191e+32, 1.74866333e+33, 1.59946687e+33,
9.08556547e+33, 6.70458731e+33, 9.84832359e+33])
sum_can = np.array([ 2.98381061e+28, 1.26194810e+28, 3.30328780e+28, 2.90254609e+29,
3.65117723e+29, 3.46256846e+29, 3.64483736e+29])
fall = CubicSpline(W, np.log10(sum_all))   # interpolate in log space
newallx = np.arange(0.435, 1.6, 0.001)
newally = fall(newallx)
fcan = CubicSpline(W, np.log10(sum_can))
newcanx = np.arange(0.435, 1.6, 0.001)
newcany = fcan(newcanx)
plt.plot(newallx, 10**newally)             # transform back for plotting
plt.plot(newcanx, 10**newcany)
plt.plot(W, sum_all, marker='o', color='r', linestyle='')
plt.plot(W, sum_can, marker='o', color='b', linestyle='')
plt.yscale("log")
plt.ylabel("Flux S$_v$ [erg s$^{-1}$ cm$^{-2}$ Hz$^{-1}$]")
plt.xlabel("Wavelength [n$\lambda$]")
plt.show()
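As an aside, if you would rather not interpolate in log space, a monotone interpolant such as scipy's PchipInterpolator avoids this kind of overshoot by construction. A minimal sketch, reusing W, sum_can and newcanx from above:
from scipy.interpolate import PchipInterpolator
# PCHIP is shape-preserving: it does not overshoot the data,
# so the curve stays positive wherever the data are positive.
fcan_pchip = PchipInterpolator(W, sum_can)
plt.plot(newcanx, fcan_pchip(newcanx))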

Related

Matplotlib streamplot with streamlines that don't break or end

I'd like to make a streamplot with lines that don't stop when they get too close together. I'd rather each streamline be calculated in both directions until it hits the edge of the window. The result is there'd be some areas where they'd all jumble up. But that's what I want.
Is there any way to do this in matplotlib? If not, is there another tool I can use for this that could interface with Python/numpy?
import numpy as np
import matplotlib.pyplot as plt
Y,X = np.mgrid[-10:10:.01, -10:10:.01]
U, V = Y**2, X**2
plt.streamplot(X,Y, U,V, density=1)
plt.show(block=False)
OK, I've figured out that I can get mostly what I want by turning up the density a lot and using custom start points. I'm still interested in whether there is a better or alternative way to do this.
Here's my solution. Doesn't it look so much better?
import numpy as np
import matplotlib.pyplot as plt
Y,X = np.mgrid[-10:10:.01, -10:10:.01]
y,x = Y[:,0], X[0,:]
U, V = Y**2, X**2
stream_points = np.array(list(zip(np.arange(-9,9,.5), -np.arange(-9,9,.5))))  # list() needed on Python 3
plt.streamplot(x,y, U,V, start_points=stream_points, density=35)
plt.show(block=False)
Edit: By the way, there seems to be some bug in streamplot such that the start_points keyword only works if you use 1d arrays for the grid data. See Python Matplotlib Streamplot providing start points.
As of Matplotlib version 3.6.0, an optional parameter broken_streamlines has been added for disabling streamline breaks.
Adding it to your snippet produces the following result:
import numpy as np
import matplotlib.pyplot as plt
Y,X = np.mgrid[-10:10:.01, -10:10:.01]
U, V = Y**2, X**2
plt.streamplot(X,Y, U,V, density=1, broken_streamlines=False)
plt.show(block=False)
Note
This parameter just extends the streamlines which were originally drawn (as in the question). This means that the streamlines in the modified plot above are much more uneven than the result obtained in the other answer, with custom start_points. The density of streamlines on any stream plot does not represent the magnitude of U or V at that point, only their direction. See the documentation for the density parameter of matplotlib.pyplot.streamplot for more details on how streamline start points are chosen by default, when they aren't specified by the optional start_points parameter.
For accurate streamline density, consider using matplotlib.pyplot.contour, but be aware that contour does not show arrows.
Choosing start points automatically
It may not always be easy to choose a set of good starting points automatically. However, if you know the streamfunction corresponding to the flow you wish to plot you can use matplotlib.pyplot.contour to produce a contour plot (which can be hidden from the output), and then extract a suitable starting point from each of the plotted contours.
In the following example, psi_expression is the streamfunction corresponding to the flow. When modifying this example for your own needs, make sure to update both the line defining psi_expression and the one defining U and V, so that they correspond to the same flow.
The density of the streamlines can be altered by changing contour_levels. Here, the contours are uniformly distributed.
import numpy as np
import matplotlib.pyplot as plt
import sympy as sy
x, y = sy.symbols("x y")
psi_expression = x**3 - y**3
psi_function = sy.lambdify((x, y), psi_expression)
Y, X = np.mgrid[-10:10:0.01, -10:10:0.01]
psi_evaluated = psi_function(X, Y)
U, V = Y**2, X**2
contour_levels = np.linspace(np.amin(psi_evaluated), np.amax(psi_evaluated), 30)
# Draw a temporary contour plot.
temp_figure = plt.figure()
contour_plot = plt.contour(X, Y, psi_evaluated, contour_levels)
plt.close(temp_figure)
points_list = []
# Iterate over each contour level.
for collection in contour_plot.collections:
    # Iterate over each segment in this contour level.
    for path in collection.get_paths():
        middle_point = path.vertices[len(path.vertices) // 2]
        points_list.append(middle_point)
# Reshape the Python list into a NumPy array of (x, y) coordinates.
stream_points = np.reshape(np.array(points_list), (-1, 2))
plt.streamplot(X, Y, U, V, density=1, start_points=stream_points, broken_streamlines=False)
plt.show(block=False)

Basic axis malfunction in matplotlib

When plotting using matplotlib, I ran into an interesting issue where the y axis is scaled by a very inconvenient quantity. Here's a MWE that demonstrates the problem:
import numpy as np
import matplotlib.pyplot as plt
l = np.linspace(0.5,2,2**10)
a = (0.696*l**2)/(l**2 - 9896.2e-9**2)
plt.plot(l,a)
plt.show()
When I run this, I get a figure that looks like this picture
The y-axis is clearly scaled by a silly quantity even though the y data are all between 1 and 2.
This is similar to the question:
Axis numerical offset in matplotlib
I'm not satisfied with the answer to this question, in that it makes no sense to me why I need to go through the convoluted process of changing axis settings when the data are between 1 and 2 (EDIT: between 0 and 1). Why does this happen? Why does matplotlib use such a bizarre scaling?
The data in the plot are all between 0.696000000017 and 0.696000000273. For such cases it makes sense to use some kind of offset.
If you don't want that, you can use your own formatter:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker
l = np.linspace(0.5,2,2**10)
a = (0.696*l**2)/(l**2 - 9896.2e-9**2)
plt.plot(l,a)
fmt = matplotlib.ticker.StrMethodFormatter("{x:.12f}")
plt.gca().yaxis.set_major_formatter(fmt)
plt.show()
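If you only want to suppress the offset and keep the default formatting, there is also a one-line alternative; note that the tick labels will then all collapse to 0.696, since the values differ only far beyond the default number of decimals:
plt.plot(l, a)
plt.ticklabel_format(axis='y', useOffset=False)  # disable the offset notation
plt.show()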

plt.imshow(Z, norm=LogNorm()) gives grey outline when Z=0

Sorry for no pictures, but this code reproduces the problem:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm

x = np.random.randn(1000)
y = np.random.randn(1000)
h, _, _ = np.histogram2d(x, y)
plt.imshow(h, norm=LogNorm(), cmap=plt.cm.Greys)
I would expect a smooth white transition from very small values to 0 values, but there seems to be a blurred border I'd like to get rid of. Is there any way to do this?
This is to be expected, because values less than or equal to zero are masked and then the positive values are normalized. That might mean that LogNorm is not the best option for you, but if you insist on using it, you can try adding the minimum positive value to the histogram. In your case it would be 1, but let's do it more generally for, say, normed histograms.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm
x = np.random.randn(1000)
y = np.random.randn(1000)
h, _, _ = np.histogram2d(x, y)
im = plt.imshow(h, norm=LogNorm(), cmap=plt.cm.Greys,
                interpolation='bilinear')
plt.colorbar(im)
# Second figure: shift the histogram by its smallest positive value.
plt.figure()
im = plt.imshow(h + np.min(h[h > 0]), norm=LogNorm(), cmap=plt.cm.Greys,
                interpolation='bilinear')
plt.colorbar(im)
Note that this change won't affect bilinear interpolation but might affect other interpolation algorithms. To ensure that interpolation is not affected you would have to create a custom subclass of Normalize.
The above figures were made using matplotlib 2.0.0rc1 which applies color mapping after interpolation. If you use a previous version you will see even more artifacts in the first figure.
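For completeness, here is a minimal sketch of such a Normalize subclass (FlooredLogNorm is a hypothetical name, not part of matplotlib): it floors non-positive values inside the norm, so the array handed to imshow stays untouched:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm

class FlooredLogNorm(LogNorm):
    # Hypothetical helper: map non-positive values to the smallest
    # positive value before the usual log normalization.
    def __call__(self, value, clip=None):
        value = np.asarray(value, dtype=float)
        floor = value[value > 0].min()
        return super().__call__(np.maximum(value, floor), clip=clip)

# Usage with the histogram from above:
# plt.imshow(h, norm=FlooredLogNorm(), cmap=plt.cm.Greys, interpolation='bilinear')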

Cutting up the x-axis to produce multiple graphs with seaborn?

The following code produces a really messy graph at the moment. The reason is that I have too many values for 'fare': it ranges over [0-500], with most of the values within the first 100.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
titanic = sns.load_dataset("titanic")
y = titanic.groupby([titanic.fare//1, 'sex']).survived.mean().reset_index()
sns.set(style="whitegrid")
g = sns.factorplot(x='fare', y='survived', col='sex', kind='bar', data=y,
                   size=4, aspect=2.5, palette="muted")
g.despine(left=True)
g.set_ylabels("Survival Probability")
g.set_xlabels('Fare')
plt.show()
I would like to try slicing the 'fare' values into subsets, but I would like to see all the graphs at the same time on one screen. I was wondering if this is possible without having to resort to groupby.
I will have to play around with the values of 'fare' to see what I would want each graph to represent, but as a sample, let's break the graph up into these 'fare' ranges:
[0-18]
[18-35]
[35-70]
[70-300]
[300-500]
So the total would be 10 graphs on one page, because of the juxtaposition with the opposite sex.
Is it possible with Seaborn? Do I need to do a lot of configuring with matplotlib? Thanks.
Actually I wrote a little blog post about this a while ago. If you are plotting histograms you can use the by keyword:
import matplotlib.pyplot as plt
import seaborn.apionly as sns
sns.set() #rescue matplotlib's styles from the early '90s
data = sns.load_dataset('titanic')
data.hist(by='class', column = 'fare')
plt.show()
Otherwise if you're just plotting value-counts, you have to roll your own grid:
def categorical_hist(self, column, by, layout=None, legend=None, **params):
    from math import sqrt, ceil
    if layout is None:
        s = ceil(sqrt(self[column].unique().size))
        layout = (s, s)
    return self.groupby(by)[column]\
               .value_counts()\
               .sort_index()\
               .unstack()\
               .plot.bar(subplots=True, layout=layout, legend=None, **params)

categorical_hist(data, by='class', column='embark_town')
Edit: If you want the survival rate by fare range, you could do something like this:
data.groupby(pd.cut(data.fare, 10)).apply(lambda x: x.survived.sum() / len(x))
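And for the specific ranges from the question, pd.cut also accepts explicit bin edges, so the survival rate per fare range and sex would be (a sketch, using the data frame loaded above):
bins = [0, 18, 35, 70, 300, 500]
data.groupby([pd.cut(data.fare, bins), 'sex']).survived.mean()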

Figures with lots of data points in matplotlib

I generated the attached image using matplotlib (png format). I would like to use eps or pdf, but I find that, with all the data points, the figure is really slow to render on the screen. Other than just plotting less of the data, is there any way to optimize it so that it loads faster?
I think you have three options:
As you mentioned yourself, you can plot fewer points. For the plot you showed in your question I think it would be fine to only plot every other point.
As @tcaswell stated in his comment, you can use a line instead of points, which will be rendered more efficiently.
You could rasterize the blue dots. Matplotlib allows you to selectively rasterize single artists, so if you pass rasterized=True to the plotting command you will get a bitmapped version of the points in the output file. This will be way faster to load at the price of limited zooming due to the resolution of the bitmap. (Note that the axes and all the other elements of the plot will remain as vector graphics and font elements).
First, if you want to show a "trend" in your plot, and considering that the x, y arrays you are plotting are "huge", you could apply random sub-sampling to keep only a fraction of your data:
import numpy as np
import matplotlib.pyplot as plt
fraction = 0.50  # fraction of the points to keep
# x and y are assumed to be the original (large) data arrays
x_resampled = []
y_resampled = []
for k in range(0, len(x)):
    if np.random.rand() < fraction:
        x_resampled.append(x[k])
        y_resampled.append(y[k])
plt.scatter(x_resampled, y_resampled, s=6)
plt.show()
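The same sub-sampling can also be done in one vectorized step with a boolean mask (equivalent, assuming x and y are NumPy arrays):
mask = np.random.rand(len(x)) < fraction  # keep each point with probability `fraction`
plt.scatter(x[mask], y[mask], s=6)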
Second, have you considered using a log scale on the x-axis to increase visibility?
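In matplotlib that is a single call on the current axes:
plt.xscale("log")  # logarithmic x-axis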
In this example, only the plotting area is rasterized, the axis are still in vector format:
import numpy as np
import matplotlib.pyplot as plt
x = np.random.uniform(size=400000)
y = np.random.uniform(size=400000)
plt.scatter(x, y, marker='x', rasterized=True)  # rasterize only this artist
plt.savefig("norm.pdf", format='pdf')