cumulative simpson integration with scipy - numpy

I have some code which uses scipy.integrate.cumtrapz to compute the antiderivative of a sampled signal. I would like to use Simpson's rule instead of the trapezoidal rule. However, scipy.integrate.simps does not seem to have a cumulative counterpart. Am I missing something? Is there a simple way to get a cumulative integration with scipy.integrate.simps?

You can always write your own:
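import numpy as np

def cumsimp(func, a, b, num):
    # Cumulatively integrate func from a to b using num Simpson intervals.
    num *= 2  # each interval is split into two half-steps of width h
    a = float(a)
    b = float(b)
    h = (b - a) / num
    # Midpoints of the intervals carry weight 4
    output = 4 * func(a + h * np.arange(1, num, 2))
    # Interior edges are shared by the two neighbouring intervals
    tmp = func(a + h * np.arange(2, num - 1, 2))
    output[1:] += tmp
    output[:-1] += tmp
    # End points of the whole range
    output[0] += func(a)
    output[-1] += func(b)
    return np.cumsum(output * h / 3)

def integ1(x):
    return x

def integ2(x):
    return x**2

def integ0(x):
    return np.ones(np.asarray(x).shape) * 5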
First look at the sum and derivative of a constant function.
print(cumsimp(integ0,0,10,5))
[ 10. 20. 30. 40. 50.]
print(np.diff(cumsimp(integ0,0,10,5)))
[ 10. 10. 10. 10.]
Now check for a few trivial examples:
print(cumsimp(integ1,0,10,5))
[ 2. 8. 18. 32. 50.]
print(cumsimp(integ2,0,10,5))
[ 2.66666667 21.33333333 72. 170.66666667 333.33333333]
Writing your integrand explicitly is much easier here than reproducing SciPy's Simpson's rule function in this context. Picking intervals is difficult when you are given only a single array of samples. Do you either:
Use every other value as the edges of the Simpson intervals and the remaining values as centers?
Use the array as edges and interpolate the values at the centers?
There are also a few options for how the intervals are summed. These complications could be why it's not coded in SciPy.
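To make the first option concrete, here is a minimal sketch for an already-sampled signal. It assumes equally spaced samples and an odd number of points, treats every other sample as an interval edge, and returns the running integral only at those edges. The name cumsimp_samples is made up for this illustration and is not a SciPy function.

```python
import numpy as np

def cumsimp_samples(y, dx):
    """Cumulative Simpson integral of equally spaced samples.

    Even-indexed samples are used as interval edges and odd-indexed
    samples as interval centers, so the result has (len(y) - 1) // 2
    entries: the integral from y[0] up to y[2], y[4], ...
    """
    y = np.asarray(y, dtype=float)
    if y.size < 3 or y.size % 2 == 0:
        raise ValueError("need an odd number of samples (at least 3)")
    # Simpson's rule on each pair of adjacent steps: (f0 + 4*f1 + f2) * dx / 3
    panels = (y[:-2:2] + 4.0 * y[1:-1:2] + y[2::2]) * dx / 3.0
    return np.cumsum(panels)

# Same test case as above: x**2 on [0, 10], sampled at 11 points
x = np.linspace(0.0, 10.0, 11)
result = cumsimp_samples(x**2, dx=x[1] - x[0])
print(result)
```

The price of this choice is that the antiderivative is only available at every second sample; getting a value at every sample requires one of the border treatments mentioned above.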

Your question was answered a long time ago, but I came across the same problem recently. I wrote some functions to compute such cumulative integrals for equally spaced points; the code can be found on GitHub. The order of the interpolating polynomials ranges from 1 (trapezoidal rule) to 7. As Daniel pointed out in the previous answer, some choices have to be made on how the intervals are summed, especially at the borders; results may thus be slightly different depending on the package you use. Be also aware that the numerical integration may suffer from Runge's phenomenon (unexpected oscillations) for high orders of polynomials.
Here is an example:
import numpy as np
from scipy import integrate as sp_integrate
from gradiompy import integrate as gp_integrate
# Definition of the function (polynomial of degree 7)
x = np.linspace(-3,3,num=15)
dx = x[1]-x[0]
y = 8*x + 3*x**2 + x**3 - 2*x**5 + x**6 - 1/5*x**7
y_int = 4*x**2 + x**3 + 1/4*x**4 - 1/3*x**6 + 1/7*x**7 - 1/40*x**8
# Cumulative integral using scipy
y_int_trapz = y_int[0] + sp_integrate.cumulative_trapezoid(y,dx=dx,initial=0)
print('Integration error using scipy.integrate:')
print(' trapezoid = %9.5f' % np.linalg.norm(y_int_trapz-y_int))
# Cumulative integral using gradiompy
y_int_trapz = gp_integrate.cumulative_trapezoid(y,dx=dx,initial=y_int[0])
y_int_simps = gp_integrate.cumulative_simpson(y,dx=dx,initial=y_int[0])
print('\nIntegration error using gradiompy.integrate:')
print(' trapezoid = %9.5f' % np.linalg.norm(y_int_trapz-y_int))
print(' simpson = %9.5f' % np.linalg.norm(y_int_simps-y_int))
# Higher order cumulative integrals
for order in range(5,8,2):
    y_int_composite = gp_integrate.cumulative_composite(y,dx,order=order,initial=y_int[0])
    print(' order %i = %9.5f' % (order,np.linalg.norm(y_int_composite-y_int)))
# Display the values of the cumulative integral
print('\nCumulative integral (with initial offset):\n',y_int_composite)
You should get the following result:
'''
Integration error using scipy.integrate:
trapezoid = 176.10502
Integration error using gradiompy.integrate:
trapezoid = 176.10502
simpson = 2.52551
order 5 = 0.48758
order 7 = 0.00000
Cumulative integral (with initial offset):
[-6.90203571e+02 -2.29979407e+02 -5.92267425e+01 -7.66415188e+00
2.64794452e+00 2.25594840e+00 6.61937372e-01 1.14797061e-13
8.20130517e-01 3.61254267e+00 8.55804341e+00 1.48428883e+01
1.97293221e+01 1.64257877e+01 -1.13464286e+01]
'''
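The warning about Runge's phenomenon can be illustrated without any integration package. The sketch below is not part of gradiompy; it interpolates the classic test function 1/(1 + 25 x**2) on equally spaced nodes with a single polynomial and compares the worst-case error for a low and a high degree. The helper names are made up for this example.

```python
import numpy as np

def runge(x):
    # Classic test function for Runge's phenomenon
    return 1.0 / (1.0 + 25.0 * x**2)

def max_interp_error(n_nodes):
    # Interpolate through n_nodes equally spaced points on [-1, 1]
    nodes = np.linspace(-1.0, 1.0, n_nodes)
    coeffs = np.polyfit(nodes, runge(nodes), deg=n_nodes - 1)
    # Measure the largest deviation on a much finer grid
    fine = np.linspace(-1.0, 1.0, 2001)
    return np.max(np.abs(np.polyval(coeffs, fine) - runge(fine)))

err_low = max_interp_error(5)    # degree 4
err_high = max_interp_error(11)  # degree 10
print(err_low, err_high)
```

The higher degree gives the larger error near the ends of the interval, which is why composite rules use low-order polynomials on short sub-intervals instead of one high-order polynomial over the whole range.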

I would go with Daniel's solution. But you need to be careful if the function that you are integrating is itself subject to fluctuations. Simpson's rule requires the function to be well-behaved (meaning, in this case, one that is continuous).
There are techniques for making a moderately badly behaved function look better behaved than it really is (these are really forms of approximation of your function), but in that case you have to be sure that the approximation "adequately" represents your function. You might also make the intervals non-uniform to handle the problem.
An example might be in considering the flow of a field that, over longer time scales, is approximated by a well-behaved function but which over shorter periods is subject to limited random fluctuations in its density.

Related

Parameters for numpy.random.lognormal function

I need to create a fictitious log-normal distribution of household income in a particular area. The data I have are: Average: 13,600 and Standard Deviation 7,900.
What should be the parameters in the function numpy.random.lognormal?
When I set the mean and the standard deviation as they are, most of the values in the distribution are "inf", and the values also don't make sense when I set the parameters to the log of the mean and standard deviation.
If someone can help me to figure out what the parameters are it would be great.
Thanks!
This is indeed a nontrivial task, as the moment equations of the log-normal distribution have to be solved for the unknown parameters. According to Wikipedia, the mean and variance of the log-normal distribution are exp(mu + sigma**2/2) and [exp(sigma**2)-1]*exp(2*mu+sigma**2), respectively.
The choice of mu and sigma should solve exp(mu + sigma**2/2) = 13600 and [exp(sigma**2)-1]*exp(2*mu+sigma**2) = 7900**2. This can be solved analytically because the first equation squared gives exactly exp(2*mu+sigma**2), which eliminates the variable mu from the second equation: sigma**2 = log(1 + (7900/13600)**2) and mu = log(13600) - sigma**2/2.
A sample code is provided below. I took a large sample size to explicitly show that the mean and standard deviation of the simulated data are close to the desired numbers.
import numpy as np
# Input characteristics
DataAverage = 13600
DataStdDev = 7900
# Sample size
SampleSize = 100000
# Parameters mu and sigma of the underlying normal distribution
SigmaLogNormal = np.sqrt( np.log(1+(DataStdDev/DataAverage)**2))
MeanLogNormal = np.log( DataAverage ) - SigmaLogNormal**2/2
print(MeanLogNormal, SigmaLogNormal)
# Obtain draw from log-normal distribution
Draw = np.random.lognormal(mean=MeanLogNormal, sigma=SigmaLogNormal, size=SampleSize)
# Check
print( np.mean(Draw), np.std(Draw))
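As a check that does not depend on random sampling, the closed-form moments can be evaluated directly from the fitted parameters. This is only a sketch; the helper name lognormal_params is made up for this illustration.

```python
import numpy as np

def lognormal_params(mean, std):
    """Return (mu, sigma) of the underlying normal for a target mean and std."""
    sigma2 = np.log(1.0 + (std / mean) ** 2)
    mu = np.log(mean) - sigma2 / 2.0
    return mu, np.sqrt(sigma2)

mu, sigma = lognormal_params(13600.0, 7900.0)

# Plug the parameters back into the moment formulas of the log-normal
mean_back = np.exp(mu + sigma**2 / 2.0)
std_back = np.sqrt((np.exp(sigma**2) - 1.0) * np.exp(2.0 * mu + sigma**2))
print(mu, sigma)
print(mean_back, std_back)
```

The recovered mean and standard deviation should match the inputs up to rounding, whereas the sampled values above only match approximately.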

scipy-optimize-minimize does not perform the optimization - CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL

I am trying to minimize a function defined as follows:
utility(decision) = decision * (risk - cost)
where variables take the following form:
decision = binary array
risk = array of floats
cost = constant
I know the solution will take the form of:
decision = 1 if (risk >= threshold)
decision = 0 otherwise
Therefore, in order to minimize this function I can assume that I transform the function utility to depend only on this threshold. My direct translation to scipy is the following:
import numpy as np
from scipy.optimize import minimize

def utility(threshold,risk,cost):
    selection_list = [float(risk[i]) >= threshold for i in range(len(risk))]
    v = np.array(risk.astype(float)) - cost
    total_utility = np.dot(v, selection_list)
    return -1.0*total_utility

result = minimize(fun=utility, x0=0.2, args=(r,c),bounds=[(0,1)], options={"disp":True} )
This gives me the following result:
fun: array([-17750.44298655])
hess_inv: <1x1 LbfgsInvHessProduct with dtype=float64>
jac: array([0.])
message: b'CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL'
nfev: 2
nit: 0
status: 0
success: True
x: array([0.2])
However, I know the result is wrong because in this case the threshold must be equal to cost. On top of that, no matter what x0 I use, the optimizer always returns x0 itself as the result. Looking at the output, I see that the Jacobian is 0 and that not even one iteration is performed.
Looking more closely at the function, I plotted it and observed that it is not convex near the limits of the bounds, but the minimum at 0.1 is clearly visible. However, no matter how much I restrict the bounds to the convex part only, the result is still the same.
What could I do to minimize this function?
The message tells you that the gradient was at some point too small and thus numerically the same as zero. This is likely due to the thresholding that you do when you calculate your selection_list. There you use float(risk[i]) >= threshold, which has derivative 0 almost everywhere. Hence, almost every starting value will give you the message you receive.
A solution could be to apply some smoothing to the thresholding operation. So instead of float(risk[i]) >= threshold, you would use a continuous function:
def g(x):
    return 1./(1+np.exp(-x))
With this function, you can express the thresholding operation as
g(a*(risk[i] - threshold)), with a parameter a. The larger a, the closer this smoothed step (a logistic function) is to what you are doing so far. At something like a=20 or so, you would probably have pretty much the same as what you have at the moment. You would therefore derive a sequence of solutions, where you start with a=1 and then take that solution as a starting value for the same problem with a=2, take that solution as a starting value for the problem with a=4, and so on. At some point, you will notice that changing a no longer changes the solution and you're done.
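A minimal sketch of this continuation idea follows. The risk values and the cost are made up for illustration, the sharpness parameter a multiplies the difference (so a larger a means a sharper step), and scipy.special.expit is used as a numerically stable version of g. It is one way to set this up, not a tuned implementation.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # stable 1 / (1 + exp(-x))

def smooth_utility(threshold, risk, cost, a):
    # minimize passes a length-1 array; take its single element
    t = np.atleast_1d(threshold)[0]
    # Smooth stand-in for (risk >= t); approaches a hard step as a grows
    weights = expit(a * (risk - t))
    return -np.sum((risk - cost) * weights)

# Made-up data: risks uniform in [0, 1] and a constant cost
rng = np.random.default_rng(0)
risk = rng.uniform(0.0, 1.0, size=2000)
cost = 0.1

# Solve a sequence of problems, sharpening the step each time and
# starting each solve from the previous solution
x0 = 0.2
for a in [1, 2, 4, 8, 16, 32, 64, 128, 256]:
    res = minimize(smooth_utility, x0=x0, args=(risk, cost, a), bounds=[(0, 1)])
    x0 = res.x[0]
    print(a, x0)
threshold = x0
```

For small a the smoothed problem may sit at a bound; as a grows, the solution should settle close to the cost, which is the threshold the question expects.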

How to calculate a very large correlation matrix

I have an np.array of observations z where z.shape is (100000, 60). I want to efficiently calculate the 100000x100000 correlation matrix and then write to disk the coordinates and values of just those elements > 0.95 (this is a very small fraction of the total).
My brute-force version of this looks like the following but is, not surprisingly, very slow:
for i1 in range(z.shape[0]):
    for i2 in range(i1+1):
        r = np.corrcoef(z[i1,:],z[i2,:])[0,1]
        if r > 0.95:
            file.write("%6d %6d %.3f\n" % (i1,i2,r))
I realize that the correlation matrix itself could be calculated much more efficiently in one operation using np.corrcoef(z), but the memory requirement is then huge. I'm also aware that one could break up the data set into blocks and calculate bite-size subportions of the correlation matrix at one time, but programming that and keeping track of the indices seems unnecessarily complicated.
Is there another way (e.g., using memmap or pytables) that is both simple to code and doesn't put excessive demands on physical memory?
After experimenting with the memmap solution proposed by others, I found that while it was faster than my original approach (which took about 4 days on my Macbook), it still took a very long time (at least a day) -- presumably due to inefficient element-by-element writes to the output file. That wasn't acceptable given my need to run the calculation numerous times.
In the end, the best solution (for me) was to sign in to Amazon Web Services EC2 portal, create a virtual machine instance (starting with an Anaconda Python-equipped image) with 120+ GiB of RAM, upload the input data file, and do the calculation (using the matrix multiplication method) entirely in core memory. It completed in about two minutes!
For reference, the code I used was basically this:
import numpy as np
import pickle
import h5py
# read nparray, dimensions (102000, 60)
with open(r'file.dat', 'rb') as infile:
    x = pickle.load(infile)
# z-normalize the data -- first compute means and standard deviations
xave = np.average(x,axis=1)
xstd = np.std(x,axis=1)
# transpose so that the per-row statistics broadcast along the last axis
ztrans = x.T - xave
ztrans /= xstd
# transpose back
z = ztrans.T
# compute correlation matrix - shape = (102000, 102000)
arr = np.matmul(z, z.T)
# divide by the number of observations per row (60), not by the number of rows
arr /= z.shape[1]
# output to HDF5 file
with h5py.File('correlation_matrix.h5', 'w') as hf:
    hf.create_dataset("correlation", data=arr)
From my rough calculations, you want a correlation matrix that has 100,000^2 elements. That takes up around 40 GB of memory, assuming 4-byte (single-precision) floats, or about 80 GB with NumPy's default float64.
That probably won't fit in computer memory, otherwise you could just use corrcoef.
There's a fancy approach based on eigenvectors that I can't find right now, and that gets into the (necessarily) complicated category...
Instead, rely on the fact that for zero mean data the covariance can be found using a dot product.
z0 = z - np.mean(z, 1)[:, None]
cov = np.dot(z0, z0.T)
cov /= z.shape[-1]
And this can be turned into the correlation by normalizing by the variances
sigma = np.std(z, 1)
corr = cov
corr /= sigma
corr /= sigma[:, None]
Of course memory usage is still an issue.
You can work around this with memory mapped arrays (make sure it's opened for reading and writing) and the out parameter of dot (For another example see Optimizing my large data code with little RAM)
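N = z.shape[0]
# the out array must have the dtype that np.dot would return, so cast first
z0 = z0.astype('float32')
arr = np.memmap('corr_memmap.dat', dtype='float32', mode='w+', shape=(N,N))
np.dot(z0, z0.T, out=arr)
arr /= z.shape[-1]
arr /= sigma
arr /= sigma[:, None]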
Then you can loop through the resulting array and find the indices with a large correlation coefficient. (You may be able to find them directly with where(arr > 0.95), but the comparison will create a very large boolean array which may or may not fit in memory).
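One way to do that loop without ever building the full boolean array is to scan a limited number of rows at a time. The sketch below is illustrative only: scan_rows is a made-up helper name, and the tiny 4x4 memory-mapped array in a temporary directory stands in for the real correlation file.

```python
import os
import tempfile
import numpy as np

def scan_rows(arr, threshold=0.95, rows_per_chunk=1024):
    """Yield (row, col, value) for entries above threshold, one row chunk at a time."""
    for start in range(0, arr.shape[0], rows_per_chunk):
        # Only this slice of rows is compared, so the boolean mask stays small
        chunk = np.asarray(arr[start:start + rows_per_chunk])
        rows, cols = np.nonzero(chunk > threshold)
        for i, j in zip(rows, cols):
            yield int(i) + start, int(j), float(chunk[i, j])

# Tiny demonstration on a 4x4 memory-mapped array
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'demo_memmap.dat')
    demo = np.memmap(path, dtype='float32', mode='w+', shape=(4, 4))
    demo[:] = np.eye(4)
    demo[2, 0] = demo[0, 2] = 0.97
    hits = list(scan_rows(demo, threshold=0.95, rows_per_chunk=3))
    del demo  # release the mapping before the directory is removed
print(hits)
```

With rows_per_chunk chosen so that one chunk fits comfortably in RAM, the peak memory use is set by the chunk size rather than by N**2.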
You can use scipy.spatial.distance.pdist with metric='correlation' to get all the correlation distances (1 - r) without the symmetric terms. Unfortunately this will still leave you with about 5e9 terms that will probably overflow your memory.
You could try reformulating a KDTree (which can theoretically handle cosine distance, and therefore correlation distance) to filter for higher correlations, but with 60 dimensions it's unlikely that would give you much speedup. The curse of dimensionality sucks.
Your best bet is probably brute forcing blocks of data using scipy.spatial.distance.cdist(..., metric='correlation'), and then keeping only the high correlations in each block. Once you know how big a block your memory can handle without slowing down due to your computer's memory architecture, it should be much faster than doing one pair at a time.
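The index bookkeeping for a blockwise scan is lighter than it may sound. Here is a minimal sketch; note that it uses the dot product of z-normalized rows (as in the other answer) instead of cdist, and the function name high_correlations and the block size are made up for this example. It assumes no row is constant, since a constant row has zero norm.

```python
import numpy as np

def high_correlations(z, threshold=0.95, block=1000):
    """Yield (i, j, r) with j <= i and Pearson r > threshold, block by block."""
    z = np.asarray(z, dtype=float)
    # Center each row and scale it to unit length; the dot product of two
    # such rows is their Pearson correlation coefficient
    zn = z - z.mean(axis=1, keepdims=True)
    zn /= np.linalg.norm(zn, axis=1, keepdims=True)
    n = zn.shape[0]
    for start in range(0, n, block):
        stop = min(start + block, n)
        # Correlations of rows [start, stop) with all rows before stop
        corr = zn[start:stop] @ zn[:stop].T
        rows, cols = np.nonzero(corr > threshold)
        for i, j in zip(rows, cols):
            if j <= i + start:  # lower triangle, as in the brute-force loop
                yield int(i) + start, int(j), float(corr[i, j])

# Small made-up data set just to exercise the function
rng = np.random.default_rng(0)
z_demo = rng.normal(size=(30, 60))
z_demo[7] = z_demo[3] + 0.01 * rng.normal(size=60)  # nearly duplicate row
pairs = [(i, j) for i, j, r in high_correlations(z_demo, block=8) if i != j]
print(pairs)
```

Each block needs only block x N floats at a time, so the block size can be tuned to the available memory, and each yielded triple can be written to the output file directly.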
Please check out the deepgraph package.
https://deepgraph.readthedocs.io/en/latest/tutorials/pairwise_correlations.html
I tried it on z.shape = (2500, 60) with pearsonr for the 2500 x 2500 pairs. It was extremely fast.
I'm not sure about 100000 x 100000, but it is worth trying.

Confused by random.randn()

I am a bit confused by the numpy function random.randn() which returns random values from the standard normal distribution in an array in the size of your choosing.
My question is that I have no idea when this would ever be useful in applied practices.
For reference about me I am a complete programming noob but studied math (mostly stats related courses) as an undergraduate.
NumPy's randn function is incredibly useful for adding a random noise element to a dataset that you create for initial testing of a machine learning model. Say, for example, that you want to create a million-point dataset that is roughly linear for testing a regression algorithm. You create a million data points using
x_data = np.linspace(0.0,10.0,1000000)
You generate a million random noise values using randn
noise = np.random.randn(len(x_data))
To create your linear data set you follow the formula
y = mx + b + noise_levels with the following code (setting b = 5, m = 0.5 in this example)
y_data = (0.5 * x_data ) + 5 + noise
Finally the dataset is created with
import pandas as pd
my_data = pd.DataFrame({'X Data': x_data, 'Y': y_data})
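To close the loop on "testing a regression algorithm", the sketch below fits a straight line to such a noisy dataset and checks that the slope and intercept come back close to the values used to generate it. It uses np.polyfit as a stand-in for whatever regression model you want to test, a smaller dataset, and a fixed seed; all of these are choices made for this illustration.

```python
import numpy as np

np.random.seed(0)  # fixed seed so the run is repeatable
x_data = np.linspace(0.0, 10.0, 10000)
noise = np.random.randn(len(x_data))
y_data = (0.5 * x_data) + 5 + noise

# Least-squares straight-line fit; should recover m = 0.5 and b = 5 approximately
slope, intercept = np.polyfit(x_data, y_data, deg=1)
print(slope, intercept)
```

Because the true parameters are known, any fitting routine can be judged by how closely it reproduces them.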
This could be used in 3D programming to generate non-overlapping random values. This would be useful for optimization of graphical effects.
Another possible use in statistical applications would be applying a formula in order to test against spatial factors affecting a given constant. For example, you might be measuring a span of time with some formula and need to know how effective it would be over various spans of time. This would return a statistic showing, for example, that your formula is more effective for shorter intervals or for longer intervals.
np.random.randn(d0, d1, ..., dn) returns a sample (or samples) from the “standard normal” distribution (mu=0, stdev=1).
For random samples from N(mu, sigma**2), use:
sigma * np.random.randn(...) + mu
This is because if Z is a standard normal deviate, then sigma*Z + mu will have a normal distribution with expected value mu and standard deviation sigma.
https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.random.randn.html
https://en.wikipedia.org/wiki/Normal_distribution
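A quick numerical check of the scaling rule, with arbitrary example values for mu and sigma and a fixed seed:

```python
import numpy as np

mu, sigma = 3.0, 2.0  # arbitrary example values
np.random.seed(0)     # fixed seed so the run is repeatable
samples = sigma * np.random.randn(100000) + mu

# The sample mean and standard deviation should be close to mu and sigma
print(samples.mean(), samples.std())
```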

Creating grid and interpolating (x,y,z) for contour plot sagemath

I have values in the form of (x,y,z). By creating a list_plot3d plot I can clearly see that they are not quite evenly spaced. They usually form little "blobs" of 3 to 5 points on the xy plane. So for the interpolation and the final "contour" plot to be better, or should I say smoother(?), do I have to create a rectangular grid (like the squares on a chess board) so that the blobs of data are somehow "smoothed"? I understand that this might be trivial to some people but I am trying this for the first time and I am struggling a bit. I have been looking at the scipy packages like scipy.interpolate.interp2d but the graphs produced at the end are really bad. Maybe a brief tutorial on 2d interpolation in sagemath for an amateur like me? Some advice? Thank you.
EDIT:
https://docs.google.com/file/d/0Bxv8ab9PeMQVUFhBYWlldU9ib0E/edit?pli=1
This is mostly the kind of graphs it produces along with this message:
Warning: No more knots can be added because the number of B-spline
coefficients
already exceeds the number of data points m. Probably causes:
either
s or m too small. (fp>s)
kx,ky=3,3 nx,ny=17,20 m=200 fp=4696.972223 s=0.000000
To get this graph i just run this command:
f_interpolation = scipy.interpolate.interp2d(*zip(*matrix(C)),kind='cubic')
plot_interpolation = contour_plot(lambda x,y:
f_interpolation(x,y)[0], (22.419,22.439),(37.06,37.08) ,cmap='jet', contours=numpy.arange(0,1400,100), colorbar=True)
plot_all = plot_interpolation
plot_all.show(axes_labels=["m", "m"])
Where matrix(C) can be a huge matrix like 10000 x 3 or even a lot more, like 1000000 x 3. The problem of bad graphs persists even with fewer data, like the picture I attached, where matrix(C) was only 200 x 3. That's why I am beginning to think that, apart from a possible glitch in the program, my approach to using this command might be totally wrong, hence my request for advice about using a grid and not just "throwing" my data into a command.
I've had a similar problem using the scipy.interpolate.interp2d function. My understanding is that the issue arises because the interp1d/interp2d and related functions use an older wrapping of FITPACK for the underlying calculations. I was able to get a problem similar to yours to work using the spline functions, which rely on a newer wrapping of FITPACK. The spline functions can be identified because they seem to all have capital letters in their names here http://docs.scipy.org/doc/scipy/reference/interpolate.html. Within the scipy installation, these newer functions appear to be located in scipy/interpolate/fitpack2.py, while the functions using the older wrappings are in fitpack.py.
For your purposes, RectBivariateSpline is what I believe you want. Here is some sample code for implementing RectBivariateSpline:
import numpy as np
from scipy import interpolate
# Generate unevenly spaced x/y data for axes
npoints = 25
maxaxis = 100
x = (np.random.rand(npoints)*maxaxis) - maxaxis/2.
y = (np.random.rand(npoints)*maxaxis) - maxaxis/2.
xsort = np.sort(x)
ysort = np.sort(y)
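# Generate the z-data, which first requires converting
# x/y data into grids. indexing='ij' gives z the shape
# (len(xsort), len(ysort)) that RectBivariateSpline expects
xg, yg = np.meshgrid(xsort,ysort,indexing='ij')
z = xg**2 - yg**2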
# Generate the interpolated, evenly spaced data
# Note that the min/max of x/y isn't necessarily -50 and 50 since
# randomly chosen points were used. If we want to avoid extrapolation,
# the explicit min/max must be found
interppoints = 100
xinterp = np.linspace(xsort[0],xsort[-1],interppoints)
yinterp = np.linspace(ysort[0],ysort[-1],interppoints)
# Generate the kernel that will be used for interpolation
# Note that the default version uses cubic splines in both
# directions (kx=3, ky=3). A different order can be used by
# setting kx and ky (between 1 and 5), i.e.
# interpolate.RectBivariateSpline(xsort,ysort,z,kx=5,ky=5)
kernel = interpolate.RectBivariateSpline(xsort,ysort,z)
# Now calculate the interpolated data on the evenly spaced grid
zinterp = kernel(xinterp, yinterp)
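RectBivariateSpline needs data that already lie on a rectangular grid. For scattered (x,y,z) points like those in the question, one alternative worth trying is scipy.interpolate.griddata, which triangulates the points and interpolates onto a grid of your choice. The sketch below uses made-up scattered data and a simple linear surface as placeholder values; it is a starting point, not a recipe tuned to the data in the question.

```python
import numpy as np
from scipy.interpolate import griddata

rng = np.random.default_rng(0)
# Made-up scattered sample positions in the unit square; the four corners
# are added so that the target grid lies inside the convex hull of the data
xy = np.vstack([rng.uniform(0.0, 1.0, size=(200, 2)),
                [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]])
z = xy[:, 0] + 2.0 * xy[:, 1]  # placeholder values at the scattered points

# Regular grid on which the contour plot will be drawn
xi, yi = np.meshgrid(np.linspace(0.1, 0.9, 50), np.linspace(0.1, 0.9, 50))

# Piecewise-linear interpolation on the triangulation of the points
zi = griddata(xy, z, (xi, yi), method='linear')
print(zi.shape)
```

Grid points outside the convex hull of the data come back as NaN with method='linear', so it helps to keep the plotting range inside the region covered by the measurements. The resulting zi array can then be handed to a contour plotting routine that accepts gridded values.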