np.poly1d: how to calculate R^2 - numpy

I am fitting a linear regression to my data, but I want to know how to calculate the R2 value. The following is the code I have so far:
import numpy as np
import pandas as pd
from sympy import Symbol, expand

total_csv = pd.read_csv('IgG1_sigma_biospin_neg.csv', header=0)
x_values = (19, 20, 21, 22)
y_values = total_csv.loc[0, ['19-', '20-', '21-', '22-']].tolist()
my_fitting = np.polyfit(x_values, y_values, 1)
my_lin_fitting = np.poly1d(my_fitting)
my_x = Symbol('x')
print('my_equation:', expand(my_lin_fitting(my_x)))
I get the equation of the linear fit of my data: 35.6499591999999*x + 6018.6395529.
In [95]:y_values
Out[95]: [6698.0902240000005, 6733.253559000001, 6757.754712999999, 6808.75637]
Do you know how to calculate R2 values?

To the best of my knowledge, np.polyfit does not provide a coefficient of determination (R2).
The residual that Richard mentioned in his answer is something different, namely the Sum of Squares Error (SSE). More info about it here:
https://365datascience.com/tutorials/statistics-tutorials/sum-squares/
The good news is that you can easily calculate R2 from the SSE. First you calculate the Sum of Squares Total (SST); then R2 is simply R2 = 1 - SSE / SST. (See the link above for further explanation.)
import numpy as np
# generate pseudo-data so the code can be run standalone (nicer for a mwe)
x_values = np.arange(100)
y_values = 3 * x_values + 2 + np.random.random(100)-0.5
my_fitting = np.polyfit(x_values, y_values, 1, full=True)
coeff = my_fitting[0]
### Residual or Sum of Square Error (SSE)
SSE = my_fitting[1][0]
### Determining the Sum of Square Total (SST)
## the squared differences between the observed dependent variable and its mean
diff = y_values - y_values.mean()
square_diff = diff ** 2
SST = square_diff.sum()
### Now getting the coefficient of determination (R2)
R2 = 1 - SSE/SST
print(R2)
Another approach is to use the function already implemented in scikit-learn:
## Alternative using sklearn
from sklearn.metrics import r2_score
predict = np.poly1d(coeff)
R2 = r2_score(y_values, predict(x_values))
print(R2)
Both methods give me the very same answer.

First thing: you should be using the np.polynomial.polynomial classes/methods instead of np.polyfit (see the docs on np.polyfit, which point people to the newer code).
You can then use the polyfit function there. By default it will only return the coefficients. If you also want the residual, specify full=True; polyfit will then also return a list of diagnostics whose first element is the residual (the SSE discussed above, not R2 itself). See here.
The modification to your code above would be:
import numpy.polynomial.polynomial as poly
my_fitting, stats = poly.polyfit(x_values, y_values, 1, full=True)
SSE = stats[0][0]  # the residual returned here is the sum of squared errors, not R2
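Since that residual is the SSE, you can turn it into R2 exactly as in the answer above; a minimal sketch, reusing the pseudo-data from that answer:

import numpy as np
import numpy.polynomial.polynomial as poly

# pseudo-data, as in the earlier answer
x_values = np.arange(100)
y_values = 3 * x_values + 2 + np.random.random(100) - 0.5

coefs, stats = poly.polyfit(x_values, y_values, 1, full=True)
SSE = stats[0][0]                                # residual sum of squares
SST = ((y_values - y_values.mean()) ** 2).sum()  # total sum of squares
R2 = 1 - SSE / SST
print(R2)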

Calculating cross-correlation between 2 signals using FFT without considering lags

I'm trying to calculate the cross-correlation between 2 signals without considering a lag. Essentially, I want to recreate the cross-correlation of 2 signals at zero lag, to check whether my understanding of how cross-correlation is calculated is correct.
The following is my code:
import numpy as np

x1 = np.linspace(0,2*np.pi,1000)
y1 = np.sin(x1)
#second signal is with a phase shift of pi/4
y2 = np.sin(x1 - np.pi/4)
#do FFT on each signal
y1_fft = np.fft.fft(y1)
y2_fft = np.fft.fft(y2)
#complex conjugate of y2_fft
y2_conj = np.conjugate(y2_fft)
#take inner product of the fft and conjugate of fft
np.inner(y1_fft,y2_conj)
The result is -353199.837 - 2.59E-11i, which is wrong.
In comparison, when I use scipy.signal.correlate, I get the following result:
import scipy.signal as sg
import matplotlib.pyplot as plt

corr = sg.correlate(y1,y2,method='fft')
lags = sg.correlation_lags(y1.shape[0],y2.shape[0])
fig,ax = plt.subplots(1,1,figsize = (10,10))
ax.plot(lags,corr)
As seen, the cross-correlation at zero lag is around 475; however, my result is very different.
Where am I going wrong?
The unnormalized circular correlation is calculated as follows:
# your code
import numpy as np
x1 = np.linspace(0,2*np.pi,1000)
y1 = np.sin(x1)
y2 = np.sin(x1 - np.pi/4)
y1_fft = np.fft.fft(y1)
y2_fft = np.fft.fft(y2)
y2_conj = np.conjugate(y2_fft)
# calculate the correlation (without padding)
corr = np.fft.ifft(y1_fft * y2_conj)
np.sum(y1*y2), corr[0]
By circular correlation I mean that the signal is rolled instead of shifted. Rolling and shifting are equivalent at lag 0, but for e.g. lag=3 you would get something like
np.sum(y1*np.roll(y2, 3)), corr[3]
not the same as
np.sum(y1[3:] * y2[:-3])
If you only want the correlation at lag 0, to be honest I think it is better to compute it directly from the definition: np.inner(y1, y2).
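As a quick consistency check, here is a minimal sketch (reusing the signals from the question) comparing the direct definition, a zero-padded FFT correlation, and scipy at lag 0; the zero-padding to 2*n is what turns the circular correlation into a linear one:

import numpy as np
import scipy.signal as sg

x1 = np.linspace(0, 2*np.pi, 1000)
y1 = np.sin(x1)
y2 = np.sin(x1 - np.pi/4)

n = len(y1)
# zero-pad to 2*n so the circular correlation becomes a linear one
nfft = 2 * n
corr_fft = np.fft.ifft(np.fft.fft(y1, nfft) * np.conjugate(np.fft.fft(y2, nfft)))

corr_sg = sg.correlate(y1, y2, method='fft')
lags = sg.correlation_lags(n, n)

print(np.inner(y1, y2))        # direct definition at lag 0
print(corr_fft[0].real)        # FFT-based correlation at lag 0
print(corr_sg[lags == 0][0])   # scipy's result at lag 0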

Difficulty with numpy broadcasting

I have two 2D point clouds (oldPts and newPts) which I wish to combine. They are m x 2 and n x 2 numpy integer arrays, with m and n of order 2000. newPts contains many duplicates or near-duplicates of oldPts, and I need to remove these before combining.
So far I have used the histogram2d function to produce a 2D representation of oldPts (H). I then compare each new point to an NxN area of H, and if that area is empty I accept the point. This last part I am currently doing with a Python loop, which I would like to remove. Can anybody show me how to do this with broadcasting, or perhaps suggest a completely different approach to the problem? The working code is below:
import numpy as np

npzfile = np.load(path+datasetNo+'\\temp.npz')
arrs = npzfile.files
oldPts = npzfile[arrs[0]]
newPts = npzfile[arrs[1]]
# remove all the negative values
oldPts = oldPts[oldPts.min(axis=1)>=0,:]
newPts = newPts[newPts.min(axis=1)>=0,:]
# round to integers
oldPts = np.around(oldPts).astype(int)
newPts = newPts.astype(int)
# put the oldPts into a 2d array
H, xedg, yedg = np.histogram2d(oldPts[:,0], oldPts[:,1],
                               bins=[xMax, yMax],
                               range=[[0, xMax], [0, yMax]])
finalNewList = []
N = 5
for pt in newPts:
    if not H[max(0, pt[0]-N):min(xMax, pt[0]+N),
             max(0, pt[1]-N):min(yMax, pt[1]+N)].any():
        finalNewList.append(pt)
finalNew = np.array(finalNewList)
The right way to do this is to compute the distance between each pair of 2-long vectors and then accept only the new points that are "different enough" from every old point, using scipy.spatial.distance.cdist:
import numpy as np
oldPts = np.random.randn(1000,2)
newPts = np.random.randn(2000,2)
from scipy.spatial.distance import cdist
dist = cdist(oldPts, newPts)
print(dist.shape) # (1000, 2000)
okIndex = np.min(dist, axis=0) > 5  # True for new points farther than 5 from every old point
print(np.sum(okIndex))              # number of accepted new points
finalNew = newPts[okIndex,:]
print(finalNew.shape)               # (number of accepted points, 2)
Above I use a Euclidean distance of 5 as the threshold for "too close": any point in newPts that is farther than 5 from all points in oldPts is accepted into finalNew. You will have to look at the range of values in dist to find a good threshold, but your histogram can guide you in picking one.
(One good way to visualize dist is to use matplotlib.pyplot.imshow(dist).)
This is a more refined version of what you were doing with the histogram. In fact, you ought to be able to get essentially the same answer as the histogram by passing metric='chebyshev' to cdist (your NxN square window corresponds to the Chebyshev, i.e. max-coordinate, distance), assuming your histogram bin widths are the same in both dimensions, and using 5 again as the threshold.
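A rough sketch of that equivalence with the Chebyshev metric (boundary and off-by-one effects aside), using made-up stand-in data since I don't have your temp.npz; N = 5 matches the half-width of the histogram window in the question:

import numpy as np
from scipy.spatial.distance import cdist

# stand-in data with the same shape as the question's arrays
rng = np.random.default_rng(0)
oldPts = rng.integers(0, 500, size=(2000, 2))
newPts = rng.integers(0, 500, size=(2000, 2))

N = 5  # same half-width as the histogram window in the question
# Chebyshev distance max(|dx|, |dy|) corresponds to a square neighbourhood
dist = cdist(oldPts, newPts, metric='chebyshev')
okIndex = np.min(dist, axis=0) >= N  # no old point inside the square window
finalNew = newPts[okIndex, :]
print(finalNew.shape)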
(PS. If you're interested in another useful function in scipy.spatial.distance, check out my answer that uses pdist to find unique rows/columns in an array.)

how is pandas kurtosis defined?

I am trying to get kurtosis using pandas. By doing some exploration, I have
test_series = pd.Series(np.random.randn(5000))
test_series.kurtosis()
however, the output is:
-0.006755982906479385
But I think the kurtosis (https://en.wikipedia.org/wiki/Kurtosis) should be close to the following (maybe normalized over N-1 instead of N, but this does not matter here):
(test_series - test_series.mean()).pow(4).mean()/np.power(test_series.std(),4)
which is
2.9908543104146026
The pandas documentation says the following
Return unbiased kurtosis over requested axis using Fisher’s definition of kurtosis (kurtosis of normal == 0.0)
This is probably the excess kurtosis, defined as kurtosis - 3.
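One quick way to see which convention pandas follows is to compare it against scipy.stats.kurtosis, which exposes both; a minimal sketch (fisher=True returns the excess kurtosis, bias=False applies the sample-size correction):

import numpy as np
import pandas as pd
from scipy.stats import kurtosis

np.random.seed(11234)
test_series = pd.Series(np.random.randn(5000))

print(test_series.kurtosis())                          # pandas
print(kurtosis(test_series, fisher=True, bias=False))  # excess kurtosis, bias-corrected
print(kurtosis(test_series, fisher=False, bias=True))  # "plain" biased kurtosis, close to 3 for normal data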
Pandas is calculating the UNBIASED estimator of the excess kurtosis. Kurtosis is the normalized 4th central moment. To find unbiased estimators of the cumulants you need the k-statistics.
So the unbiased estimator of the excess kurtosis is k4/k2**2.
To illustrate this:
import pandas as pd
import numpy as np
np.random.seed(11234)
test_series = pd.Series(np.random.randn(5000))
test_series.kurtosis()
#-0.0411811269445872
Now we can calculate this explicitly using the k-statistics:
n = len(test_series)
S1 = test_series.pow(1).sum()
S2 = test_series.pow(2).sum()
S3 = test_series.pow(3).sum()
S4 = test_series.pow(4).sum()
# Eq (7) and (5) from the k-statistics link
k4 = (-6*S1**4 + 12*n*S1**2*S2 - 3*n*(n-1)*S2**2 -4*n*(n+1)*S1*S3 + n**2*(n+1)*S4)/(n*(n-1)*(n-2)*(n-3))
k2 = (n*S2-S1**2)/(n*(n-1))
# k2 is the same as the N-1 variance: test_series.std(ddof=1)**2
k4/k2**2
#-0.04118112694458816
If you want agreement to more decimal places, you'll need to be careful with the sums, as they get rather large. But the two results are identical to 12 places.

tensorflow logistic regression matrix

Hello, I'm new to tensorflow and I'm getting a feel for it. I was given a task to multiply these 4 matrices, and I was able to do that. But now I'm being asked to take the (16,4) output from the multiplication of the (16,8) and (8,4) matrices and apply a logistic function to all of its outputs. Then multiply this new (16,4) matrix by the (4,2) matrix, take the (16,2) outputs and apply a logistic function to them, and finally multiply this new (16,2) matrix by the (2,1) matrix. I'm supposed to be able to do all of this with matrix manipulation. I'm confused about how to go about it because I only sort of understand linear regression; I know they are similar, but I wouldn't know how to apply it. Any tips please? No, I'm not asking for someone to finish it for me; I would just like a better example than what I was given, because I can't figure out how to apply a logistic function to a matrix. This is what I have so far:
import tensorflow as ts
import numpy as np
import os

# AWESOME SAUCE WARNING MESSAGE WAS GETTING ANNOYING
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'  # to avoid warnings about compilation

# the four different matrices I was asked to multiply;
# use random numbers in each matrix
m1 = np.random.rand(16, 8)
m2 = np.random.rand(8, 4)
m3 = np.random.rand(4, 2)
m4 = np.random.rand(2, 1)

# using matmul to multiply; could use @ or dot() but using tensorflow
c = ts.matmul(m1, m2)
d = ts.matmul(c, m3)
e = ts.matmul(d, m4)

# attempting to create the logistic regression
arf = ts.Variable(m1, name="ARF")

with ts.Session() as s:
    r1 = s.run(c)
    print("M1 * M2: \n", r1)
    r2 = s.run(d)
    print("Result of C * M3: \n ", r2)
    r3 = s.run(e)
    print("Result of D * M4: \n", r3)
    # learned I can't reshape just that easily
    # r4 = ts.reshape(m1, (16,4))
    # print("Result of New M1: \n", r4)
I think you have the right idea. The logistic function is just 1 / (1 + exp(-z)) where z is the matrix you want to apply it to. With that in mind you can simply do:
logistic = 1 / (1 + ts.exp(-c))
This will apply the formula element-wise to the input. The result is that this:
lg = s.run(logistic)
print("Result of logistic function \n ", lg)
…will print a matrix the same size as c (16,4), where all values are between 0 and 1. You can then go on to the rest of the multiplications the assignment is asking for.
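Putting it all together, here is a minimal sketch of the full chain the assignment describes, using the same element-wise formula and the TF1-style Session from the question's code:

import numpy as np
import tensorflow as ts  # same alias as in the question

m1 = np.random.rand(16, 8)
m2 = np.random.rand(8, 4)
m3 = np.random.rand(4, 2)
m4 = np.random.rand(2, 1)

c = ts.matmul(m1, m2)          # (16, 8) x (8, 4) -> (16, 4)
c_log = 1 / (1 + ts.exp(-c))   # logistic on the (16, 4) output
d = ts.matmul(c_log, m3)       # (16, 4) x (4, 2) -> (16, 2)
d_log = 1 / (1 + ts.exp(-d))   # logistic on the (16, 2) output
e = ts.matmul(d_log, m4)       # (16, 2) x (2, 1) -> (16, 1)

with ts.Session() as s:
    print("Final (16, 1) result:\n", s.run(e))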

Categorical features correlation

I have some categorical features in my data along with continuous ones. Is it a good idea, or an absolutely bad one, to one-hot encode the categorical features in order to find their correlation to the labels, along with the other continuous features?
There is a way to calculate the correlation coefficient without one-hot encoding the categorical variable. Cramér's V statistic is one method for calculating the correlation of categorical variables, and it can be calculated as follows. This link is helpful: Using pandas, calculate Cramér's coefficient matrix. For the other, continuous variables, you can bin them into categories using pandas' cut.
import numpy as np
import pandas as pd
import scipy.stats as ss
import seaborn as sns

print('Pandas version:', pd.__version__)
# Pandas version: 1.3.0

tips = sns.load_dataset("tips")
tips["total_bill_cut"] = pd.cut(tips["total_bill"],
                                np.arange(0, 55, 5),
                                include_lowest=True,
                                right=False)

def cramers_v(confusion_matrix):
    """ calculate Cramers V statistic for categorial-categorial association.
        uses correction from Bergsma and Wicher,
        Journal of the Korean Statistical Society 42 (2013): 323-328
    """
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum()
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
    rcorr = r - ((r-1)**2)/(n-1)
    kcorr = k - ((k-1)**2)/(n-1)
    return np.sqrt(phi2corr / min((kcorr-1), (rcorr-1)))

confusion_matrix = pd.crosstab(tips["day"], tips["time"])
cramers_v(confusion_matrix.values)
# Out[2]: 0.9386619340722221

confusion_matrix = pd.crosstab(tips["total_bill_cut"], tips["time"])
cramers_v(confusion_matrix.values)
# Out[3]: 0.1649870749498837
Please note that .as_matrix() has been deprecated in pandas since version 0.23.0; use .values instead.
I found the phik library quite useful for calculating correlation between categorical and interval features. It is also useful for binning numerical features. Try it: phik documentation
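A minimal sketch of how that might look, assuming the phik package is installed and registers its phik_matrix() accessor on DataFrames as described in its documentation (the interval_cols parameter name is taken from that documentation and may differ between versions):

import pandas as pd
import seaborn as sns
import phik  # noqa: F401 -- registers the .phik_matrix() DataFrame accessor

# reuse the seaborn tips dataset from the answer above
tips = sns.load_dataset("tips")

# correlation matrix across categorical and interval columns;
# interval_cols tells phik which columns to treat as numeric
corr = tips.phik_matrix(interval_cols=["total_bill", "tip"])
print(corr)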
I was looking to do the same thing in BigQuery.
For numeric features you can use the built-in CORR(x, y) function.
For categorical features, you can calculate it as:
cardinality(cat1 x cat2) / max(cardinality(cat1), cardinality(cat2)).
This translates to the following SQL:
SELECT
  COUNT(DISTINCT(CONCAT(cat1, cat2))) / GREATEST(COUNT(DISTINCT(cat1)), COUNT(DISTINCT(cat2))) as cat1_2,
  COUNT(DISTINCT(CONCAT(cat1, cat3))) / GREATEST(COUNT(DISTINCT(cat1)), COUNT(DISTINCT(cat3))) as cat1_3,
  ....
FROM ...
Higher number means lower correlation.
I used the following Python script to generate the SQL:
import itertools

arr = range(1, 10)
query = ',\n'.join('COUNT(DISTINCT(CONCAT({a}, {b}))) / GREATEST(COUNT(DISTINCT({a})), COUNT(DISTINCT({b}))) as cat{a}_{b}'.format(a=a, b=b)
                   for (a, b) in itertools.combinations(arr, 2))
query = 'SELECT \n ' + query + '\n FROM `...`;'
print(query)
It should be straightforward to do the same thing in numpy.
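For example, a minimal sketch of the same cardinality ratio computed outside BigQuery (written with pandas here for convenience; the cat1/cat2/cat3 columns are made-up example data):

import pandas as pd

def cardinality_ratio(s1: pd.Series, s2: pd.Series) -> float:
    # distinct (cat1, cat2) pairs divided by the larger of the two
    # individual cardinalities; 1.0 means strong association,
    # larger values mean weaker association
    pairs = pd.Series(list(zip(s1, s2))).nunique()
    return pairs / max(s1.nunique(), s2.nunique())

df = pd.DataFrame({
    "cat1": ["a", "a", "b", "b", "c", "c"],
    "cat2": ["x", "x", "y", "y", "z", "z"],  # perfectly aligned with cat1
    "cat3": ["x", "y", "x", "y", "x", "y"],  # unrelated to cat1
})
print(cardinality_ratio(df["cat1"], df["cat2"]))  # 1.0 (high association)
print(cardinality_ratio(df["cat1"], df["cat3"]))  # 2.0 (low association)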