Linear regression slope error in numpy - numpy

I use numpy.polyfit to get a linear regression: coeffs = np.polyfit(x, y, 1).
What is the best way to calculate the error of the fit's slope using numpy?

As already mentioned by @ebarr in the comments, you can have np.polyfit return the residuals by passing the keyword argument full=True.
Example:
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.0, 0.8, 0.9, 0.1, -0.8, -1.0])
z, residuals, rank, singular_values, rcond = np.polyfit(x, y, 3, full=True)
residuals is then the sum of squared residuals of the least-squares fit.
Alternatively, you can use the keyword argument cov=True to get the covariance matrix.
Example:
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.0, 0.8, 0.9, 0.1, -0.8, -1.0])
z, cov = np.polyfit(x, y, 3, cov=True)
Then the diagonal elements of cov are the variances of the coefficients in z, i.e. np.sqrt(np.diag(cov)) gives you the standard deviations of the coefficients. You can use the standard deviations to estimate the probability that the absolute error exceeds a certain value, e.g. by inserting the standard deviations into an uncertainty propagation calculation. If you use e.g. 3 standard deviations in the uncertainty propagation, you calculate the error that will not be exceeded in 99.7% of cases, assuming the errors are normally distributed.
One last hint: you have to choose between full=True and cov=True. cov=True only takes effect when full=False (the default); with full=True the covariance matrix is not returned.
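For the linear case asked about in the question, the same idea can be sketched as follows. The data here are made up for illustration (a hypothetical noisy line with slope 2 and intercept 1); with deg=1 the first coefficient is the slope, so the first diagonal element of cov is its variance.

```python
import numpy as np

# Hypothetical noisy straight-line data, for illustration only.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)

# Degree 1 returns [slope, intercept]; cov=True also returns
# the covariance matrix of those two coefficients.
coeffs, cov = np.polyfit(x, y, 1, cov=True)
slope, intercept = coeffs

# Standard deviations of the coefficients are the square roots
# of the diagonal of the covariance matrix.
slope_err, intercept_err = np.sqrt(np.diag(cov))

print(f"slope     = {slope:.3f} +/- {slope_err:.3f}")
print(f"intercept = {intercept:.3f} +/- {intercept_err:.3f}")
```

Note that the covariance is scaled by the residual variance of the fit, so the error estimate is only meaningful when there are more data points than coefficients.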

Related

How to permutation index view for weights and input nodes for a layer? TensorFlow/numpy

Let's say I have a permutation index:
pa = [2,0,4,3,1,5] # node permutation index
pw = [0,3,4,1,2,5] # weight permutation index
a = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])
w = np.array([0.7, 0.6, 0.5, 0.4, 0.3, 0.2])
I want semantics that resemble this:
ap = a[pa] # [0.3, 0.1, 0.5, 0.4, 0.2, 0.6]
except that instead of a copy, I want a view. That is, I want:
ap[0] = 0.123
assert(a[2] == 0.123)
I don't think numpy has this concept. But wondering if there is a way to make this happen in TensorFlow.
I want this concept in TensorFlow for arbitrary weight sharing. I need the weight sharing to be pre-defined by an arbitrary index because each target node will have a different ordering of the same set of weights, and apply these weights to an arbitrary subset of the input layer. The weight must be referenced through the permutation so that back-propagation will modify all instances of the same weight.
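The NumPy side of this can be checked directly. The sketch below only illustrates the behaviour described in the question: indexing with a list (fancy indexing) produces a copy, so writing to ap does not reach a, whereas writing through the index itself does. It is not a TensorFlow solution, and the variable name copy_reached_a is made up for the example.

```python
import numpy as np

pa = [2, 0, 4, 3, 1, 5]  # node permutation index
a = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])

# Fancy indexing returns a new array that does not share memory with a.
ap = a[pa]
ap[0] = 0.123
copy_reached_a = bool(a[2] == 0.123)  # the write stayed in the copy

# Writing through the permutation index modifies a itself.
a[pa[0]] = 0.123

print(np.shares_memory(a, ap), copy_reached_a, a[2])
```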

Linear interpolation of two 2D arrays

In a previous question (fastest way to use numpy.interp on a 2-D array) someone asked for the fastest way to implement the following:
np.array([np.interp(X[i], x, Y[i]) for i in range(len(X))])
Assume X and Y are matrices with many rows, so the for loop is costly. There is a nice solution in this case that avoids the for loop (see the linked answer above).
I am faced with a very similar problem, but I am unclear on whether the for loop can be avoided in this case:
np.array([np.interp(x, X[i], Y[i]) for i in range(len(X))])
In other words, I want to use linear interpolation to upsample a large number of signals stored in the rows of two matrices X and Y.
I was hoping to find a function in numpy or scipy (scipy.interpolate.interp1d) that supported this operation via broadcasting semantics but I so far can't seem to find one.
Other points:
If it helps, the rows X[i] and x are pre-sorted in my application. Also, in my case len(x) is quite a bit larger than len(X[i]).
The function scipy.signal.resample almost does what I want, but it doesn't use linear interpolation...
This is a vectorized approach that directly implements linear interpolation. First, for each x value and each pair i, j, compute the weight w expressing how much of the interval (X[i, j], X[i, j+1]) lies to the left of x.
If the entire interval is to the left of x, the weight of that interval is 1.
If none of the interval is to the left, the weight is 0.
Otherwise, the weight is a number between 0 and 1, expressing the proportion of that interval to the left of x.
Then the value of the piecewise linear (PL) interpolant is computed as Y[i, 0] plus the sum of the differences dY[i, j], each multiplied by the corresponding weight. The idea is to track how much the interpolant changes from interval to interval. The differences dY = np.diff(Y, axis=1) show how much it changes over each entire interval; multiplying by the weight prorates that change accordingly.
Setup, with some small data arrays
import numpy as np
X = np.array([[0, 2, 5, 6, 9], [1, 3, 4, 7, 8]])
Y = np.array([[3, 5, 2, 4, 1], [8, 6, 9, 5, 4]])
x = np.linspace(1, 8, 20)
The computation
dX = np.diff(X, axis=1)
dY = np.diff(Y, axis=1)
w = np.clip((x - X[:, :-1, None])/dX[:, :, None], 0, 1)
y = Y[:, [0]] + np.sum(w*dY[:, :, None], axis=1)
Demonstration
This is only to show that the interpolation is correct. Blue points: original data, red ones are computed.
import matplotlib.pyplot as plt
plt.plot(x, y[0], 'ro')
plt.plot(X[0], Y[0], 'bo')
plt.plot(x, y[1], 'rd')
plt.plot(X[1], Y[1], 'bd')
plt.show()
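As a numerical check in addition to the plot, the vectorized result can be compared against the original row-by-row loop from the question. This is a small self-contained sketch that repeats the setup above; the names y_loop and matches are introduced only for the comparison.

```python
import numpy as np

X = np.array([[0, 2, 5, 6, 9], [1, 3, 4, 7, 8]])
Y = np.array([[3, 5, 2, 4, 1], [8, 6, 9, 5, 4]])
x = np.linspace(1, 8, 20)

# Vectorized interpolation: weights have shape (rows, intervals, len(x)).
dX = np.diff(X, axis=1)
dY = np.diff(Y, axis=1)
w = np.clip((x - X[:, :-1, None]) / dX[:, :, None], 0, 1)
y = Y[:, [0]] + np.sum(w * dY[:, :, None], axis=1)

# Reference: one np.interp call per row.
y_loop = np.array([np.interp(x, X[i], Y[i]) for i in range(len(X))])

matches = np.allclose(y, y_loop)
print(matches)
```

The intermediate array w holds one weight per row, interval and query point, so memory use grows with the product of those three sizes.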

How does tf.nn.moments calculate variance?

Look at the test example:
import tensorflow as tf
x = tf.constant([[1,2],[3,4],[5,6]])
mean, variance = tf.nn.moments(x, [0])
with tf.Session() as sess:
    m, v = sess.run([mean, variance])
    print(m, v)
The output is:
[3 4]
[2 2]
We want to calculate the variance along axis 0. The first column is [1,3,5], so mean = (1+3+5)/3 = 3, which is right. The variance should be [(1-3)^2+(3-3)^2+(5-3)^2]/3 = 2.6666, but the output is 2. Can anyone tell me how tf.nn.moments calculates the variance?
By the way, looking at the API doc, what does shift do?
The problem is that x is an integer tensor and TensorFlow, instead of forcing a conversion, performs the computation as well as it can without changing the type (so the outputs are also integers). You can pass float numbers in the construction of x or specify the dtype parameter of tf.constant:
x = tf.constant([[1,2],[3,4],[5,6]], dtype=tf.float32)
Then you get the expected result:
import tensorflow as tf
x = tf.constant([[1,2],[3,4],[5,6]], dtype=tf.float32)
mean, variance = tf.nn.moments(x, [0])
with tf.Session() as sess:
    m, v = sess.run([mean, variance])
    print(m, v)
>>> [ 3. 4.] [ 2.66666675 2.66666675]
About the shift parameter, it seems to allow you to specify a value to, well, "shift" the input. By shift they mean subtract, so if your input is [1., 2., 4.] and you give a shift of, say, 2.5, TensorFlow would first subtract that amount and compute the moments from [-1.5, -0.5, 1.5]. In general, it seems safe to just leave it as None, which will perform a shift by the mean of the input, but I suppose there may be cases where giving a predetermined shift value (e.g. if you know or have an approximate idea of the mean of the input) may yield better numerical stability.
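The reason a shift cannot change the variance can be shown with plain NumPy. This is only a sketch of the arithmetic described above, assuming the shift is subtracted before the moments are accumulated; it does not call TensorFlow.

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0])
shift = 2.5

# Subtract the shift first, as described above.
shifted = x - shift  # [-1.5, -0.5, 1.5]

# Variance from the shifted values: E[(x - s)^2] - (E[x - s])^2.
var_shifted = np.mean(shifted ** 2) - np.mean(shifted) ** 2

# Variance computed directly (population variance, ddof=0).
var_direct = np.var(x)

# The mean is recovered by adding the shift back.
mean_from_shifted = np.mean(shifted) + shift

print(var_shifted, var_direct, mean_from_shifted)
```

Both variances agree, so the choice of shift only affects rounding behaviour, not the mathematical result.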
# Replace the following line with correct data dtype
x = tf.constant([[1,2],[3,4],[5,6]])
# If you don't want TensorFlow to truncate the decimals, use a float dtype.
x = tf.constant([[1,2],[3,4],[5,6]], dtype=tf.float32)
Results: array([ 2.66666675, 2.66666675], dtype=float32)
Note: in the original implementation, shift is not used.
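The integer result from the question can be reproduced by hand with NumPy. The sketch below mimics truncating integer arithmetic with floor division; it is an illustration of why the output is 2, not TensorFlow's actual implementation.

```python
import numpy as np

x_int = np.array([[1, 2], [3, 4], [5, 6]])
n = x_int.shape[0]

# Integer mean along axis 0: sums are [9, 12], so the mean is [3, 4].
mean_int = x_int.sum(axis=0) // n

# Integer variance: squared deviations sum to [8, 8], and 8 // 3 == 2.
var_int = ((x_int - mean_int) ** 2).sum(axis=0) // n

# With a float dtype nothing is truncated: 8 / 3 = 2.666...
x_float = x_int.astype(np.float32)
var_float = x_float.var(axis=0)

print(mean_int, var_int, var_float)
```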

Equivalent of R's cor.test in Python

Is there a way I can find the r confidence interval in Python?
In R I could do something like:
cor.test(m, h)
Pearson's product-moment correlation
data: m and h
t = 0.8974, df = 4, p-value = 0.4202
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.6022868 0.9164582
sample estimates:
cor
0.4093729
In Python I can calculate r (cor) using:
r,p = scipy.stats.pearsonr(df.age, df.pets)
But that doesn't return the r confidence interval.
Here's one way to calculate the confidence interval.
First get the correlation value (Pearson's):
In [85]: from scipy import stats
In [86]: corr = stats.pearsonr(df['col1'], df['col2'])
In [87]: corr
Out[87]: (0.551178607008175, 0.0)
Use the Fisher transformation to get z
In [88]: z = np.arctanh(corr[0])
In [89]: z
Out[89]: 0.62007264620685021
And the sigma value, i.e. the standard error:
In [90]: sigma = (1/((len(df.index)-3)**0.5))
In [91]: sigma
Out[91]: 0.013840913308956662
Get the two-sided 95% interval on the z scale, using the quantile function (ppf) of the standard normal distribution:
In [92]: cint = z + np.array([-1, 1]) * sigma * stats.norm.ppf((1+0.95)/2)
Finally take hyperbolic tangent to get interval values for 95%
In [93]: np.tanh(cint)
Out[93]: array([ 0.53201034, 0.56978224])
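The steps above can be collected into one small function. This is a sketch rather than a drop-in replacement for cor.test: the name pearson_ci is made up, and it uses the standard library's statistics.NormalDist in place of scipy.stats.norm.ppf so that it has no third-party dependencies. As a check it is applied to the R output in the question, where r = 0.4093729 and df = 4, which assumes n = 6 observations (df = n - 2).

```python
import math
from statistics import NormalDist


def pearson_ci(r, n, confidence=0.95):
    """Confidence interval for Pearson's r via the Fisher transformation."""
    z = math.atanh(r)                 # Fisher z-transform of r
    sigma = 1.0 / math.sqrt(n - 3)    # standard error on the z scale
    q = NormalDist().inv_cdf((1.0 + confidence) / 2.0)  # two-sided quantile
    # Transform the interval endpoints back to the correlation scale.
    return math.tanh(z - q * sigma), math.tanh(z + q * sigma)


# Values taken from the cor.test output in the question (n = df + 2 = 6).
lo, hi = pearson_ci(0.4093729, 6)
print(lo, hi)
```

The resulting interval agrees with the one R reports to about four decimal places.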

numpy.fft.fft not computing dft at frequencies given by numpy.fft.fftfreq?

This is a mathematical question, but it is tied to the numpy implementation, so I decided to ask it at SO. Perhaps I'm hugely misunderstanding something, but if so I would like to be put straight.
numpy.fft.fft computes the DFT according to the equation:
A_k = sum_{m=0}^{n-1} a_m exp(-2πi m k / n), k = 0, ..., n-1
numpy.fft.fftfreq is supposed to return the frequencies at which the DFT was computed.
Say we have:
x = [0, 0, 1, 0, 0]
X = np.fft.fft(x)
freq = np.fft.fftfreq(5)
Then for signal x, its DFT transformation is X, and frequencies at which X is computed are given by freq. For example X[0] is DFT of x at frequency freq[0], X[1] is DFT of x at frequency freq[1], and so on.
But when I compute DFT of a simple signal by hand with the formula quoted above, my results indicate that X[1] is DFT of x at frequency 1, not at freq[1], X[2] is DFT of x at frequency 2, etc, not at freq[2], etc.
As an example:
In [32]: x
Out[32]: [0, 0, 1, 0, 0]
In [33]: X
Out[33]:
array([
1.00000000+0.j,
-0.80901699-0.58778525j,
0.30901699+0.95105652j, 0.30901699-0.95105652j,
-0.80901699+0.58778525j])
In [34]: freq
Out[34]: array([ 0. , 0.2, 0.4, -0.4, -0.2])
If I compute DFT of above signal for k = 0.2 (or freq[1]), I get
X at freq = 0.2: 0.876 - 0.482j, which isn't X[1].
If however I compute for k = 1 I get the same results as are in X[1] or -0.809 - 0.588j.
So what am I misunderstanding? If numpy.fft.fft(x)[n] is the DFT of x at frequency n, not at frequency numpy.fft.fftfreq(len(x))[n], what is the purpose of numpy.fft.fftfreq?
I think the confusion arises because the values in the array returned by numpy.fft.fftfreq are equal to (k/n) times the sampling frequency, not to the index k itself.
The frequencies of the DFT result equal k/n divided by the time spacing. You can regard the digital signal as the analog signal multiplied by a periodic sampling function. Multiplication in the time domain corresponds to convolution in the frequency domain, so the time spacing of the input data sets the frequency spacing of the DFT result. When the time spacing is 1, the frequency spacing is 1/n. For any other time spacing, the frequency spacing becomes 1/n divided by the time spacing, which equals 1/n multiplied by the sampling frequency.
To calculate that, numpy.fft.fftfreq takes two arguments: the length of the input, n, and the time spacing d, which is the inverse of the sampling rate. The result is k/n divided by d (the default is d = 1).
I tried k = 2, and the result equals X[2] in your example. In this case, (k/n)*1 is equal to freq[2].
The DFT is a dimensionless basis transform or matrix multiplication. The output or result of a DFT has nothing to do with frequencies unless you know the sampling rate represented by the input vector (samples per second, per meter, per radian, etc.)
You can compute a Goertzel filter of the same length N with k=0.2, but that result isn't contained in a DFT or FFT result of length N. A DFT only contains complex Goertzel filter results for integer k values. And to get from k to the frequency represented by X[k], you need to know the sample rate.
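The point about integer k can be verified by evaluating the DFT sum directly. The helper dft_at below is a hypothetical name for a direct evaluation of the sum X(k) = sum over m of x[m] exp(-2πi m k / n); it is not part of NumPy. Evaluated at integer k it reproduces np.fft.fft, while k = 0.2 gives the value quoted in the question, which is not one of the FFT outputs.

```python
import numpy as np


def dft_at(x, k):
    """Evaluate the DFT sum of x at a (possibly non-integer) index k."""
    x = np.asarray(x, dtype=complex)
    m = np.arange(len(x))
    return np.sum(x * np.exp(-2j * np.pi * m * k / len(x)))


x = [0, 0, 1, 0, 0]
X = np.fft.fft(x)

# Integer indices k = 0 .. n-1 reproduce the FFT output.
by_hand = np.array([dft_at(x, k) for k in range(len(x))])

# A non-integer index is not one of the FFT bins.
off_bin = dft_at(x, 0.2)

print(np.allclose(by_hand, X))
print(off_bin)
```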
Yours is not a SO question
You wrote
If I compute DFT of above signal for k = 0.2 .
and I reply "You shouldn't"... the DFT can be meaningfully computed only for integer values of k.
The relationship between an index k and a frequency is given by f_k = k Δf or, if you prefer circular frequencies, ω_k = k Δω where Δf = 1/T and Δω = 2πΔf, T being the period of the signal.
The arguments of fftfreq are a bit misleading... the required one is the number of samples n and the optional argument is the sampling interval, by default d=1.0, but at any rate T=n*d and Δf = 1/(n*d)
>>> fftfreq(5) # d=1
array([ 0. , 0.2, 0.4, -0.4, -0.2])
>>> fftfreq(5,2)
array([ 0. , 0.1, 0.2, -0.2, -0.1])
>>> fftfreq(5,10)
array([ 0. , 0.02, 0.04, -0.04, -0.02])
and the different T are 5, 10, 50 and the respective df are 0.2, 0.1, 0.02, as (I) expected.
Why doesn't fftfreq simply require the signal's period? Because it is mainly intended as a helper in untangling the Nyquist frequency issue.
As you know, the DFT is periodic, for a signal x of length N you have that
DFT(x,k) is equal to DFT(x,k+mN) where m is an integer.
This implies that there are only N/2 positive and N/2 negative distinct frequencies and that, when N/2<k<N, the frequency that must be associated with k in the most meaningful way is not k df but (k-N) df.
To perform this, fftfreq needs more information than the period T, hence the choice of requiring n and computing df from an assumption on the sampling interval.
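The mapping from index to frequency described above can be written out and compared with fftfreq. The function index_to_frequency is a hypothetical helper written for this comparison; the split point (n + 1) // 2 is chosen so that it matches the ordering fftfreq returns for both odd and even n.

```python
import numpy as np


def index_to_frequency(k, n, d=1.0):
    """Frequency associated with DFT index k for n samples spaced d apart."""
    # Indices in the upper half of the spectrum wrap to negative frequencies.
    if k >= (n + 1) // 2:
        k -= n
    return k / (n * d)  # k * df, with df = 1 / (n * d)


n, d = 5, 2.0
manual = np.array([index_to_frequency(k, n, d) for k in range(n)])

print(manual)
print(np.fft.fftfreq(n, d))
```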