How to compute the p-value in hypothesis testing (linear regression) - testing

Currently I'm working on an awk script to do some statistical analysis on measurement data. I'm using linear regression to get parameter estimates, standard errors etc. and would also like to compute the p-value for a null-hypothesis test (t-test).
This is my script so far, any idea how to compute the p-value?
BEGIN {
ybar = 0.0
xbar = 0.0
n = 0
a0 = 0.0
b0 = 0.0
qtinf0975 = 1.960 # 5% n = inf
}
{ # y_i is in $1, x_i has to be counted
n = n + 1
yi[n] = $1*1.0
xi[n] = n*1.0
}
END {
for ( i = 1; i <= n ; i++ ) {
ybar = ybar + yi[i]
xbar = xbar + xi[i]
}
ybar = ybar/(n*1.0)
xbar = xbar/(n*1.0)
bhat = 0.0
ssqx = 0.0
for ( i = 1; i <= n; i++ ) {
bhat = bhat + (yi[i] - ybar)*(xi[i] - xbar)
ssqx = ssqx + (xi[i] - xbar)*(xi[i] - xbar)
}
bhat = bhat/ssqx
ahat = ybar - bhat*xbar
print "n: ", n
print "alpha-hat: ", ahat
print "beta-hat: ", bhat
sigmahat2 = 0.0
for ( i = 1; i <= n; i++ ) {
ri[i] = yi[i] - (ahat + bhat*xi[i])
sigmahat2 = sigmahat2 + ri[i]*ri[i]
}
sigmahat2 = sigmahat2 / ( n*1.0 - 2.0 )
print "sigma-hat square: ", sigmahat2
seb = sqrt(sigmahat2/ssqx)
print "se(b): ", seb
sigmahat = sqrt((seb*seb)*ssqx)
print "sigma-hat: ", sigmahat
sea = sqrt(sigmahat*sigmahat * ( 1 /(n*1.0) + xbar*xbar/ssqx))
print "se(a): ", sea
# Tests
print "q(inf)(97.5%): ", qtinf0975
Tb = (bhat - b0) / seb
absTb = (Tb < 0) ? -Tb : Tb # two-sided test: compare |T(b)| with the quantile
if ( qtinf0975 > absTb )
print "|T(b)| plausible: ", absTb, " < ", qtinf0975
else
print "|T(b)| NOT plausible: ", absTb, " > ", qtinf0975
print "confidence(b): [", bhat - seb * qtinf0975,", ", bhat + seb * qtinf0975 ,"]"
Ta = (ahat - a0) / sea
absTa = (Ta < 0) ? -Ta : Ta # two-sided test: compare |T(a)| with the quantile
if ( qtinf0975 > absTa )
print "|T(a)| plausible: ", absTa, " < ", qtinf0975
else
print "|T(a)| NOT plausible: ", absTa, " > ", qtinf0975
print "confidence(a): [", ahat - sea * qtinf0975,", ", ahat + sea * qtinf0975 ,"]"
}

You're probably trying to do a paired t-test under the assumption of variance equality. I suggest you have a look at the corresponding entry in the excellent MathWorld website.

OK, I've found a JavaScript implementation and ported it to awk. These are the functions used to compute the p-value:
function statcom ( mq, mi, mj, mb )
{
zz = 1
mz = zz
mk = mi
while ( mk <= mj ) {
zz = zz * mq * mk / ( mk - mb)
mz = mz + zz
mk = mk + 2
}
return mz
}
function studpval ( mt , mn )
{
PI = 3.1415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679 # thank you wikipedia
if ( mt < 0 )
mt = -mt
mw = mt / sqrt(mn)
th = atan2(mw, 1)
if ( mn == 1 )
return 1.0 - th / (PI/2.0)
sth = sin(th)
cth = cos(th)
if ( mn % 2 == 1 )
return 1.0 - (th+sth*cth*statcom(cth*cth, 2, mn-3, -1))/(PI/2.0)
else
return 1.0 - sth * statcom(cth*cth, 1, mn-3, -1)
}
I've integrated them like this:
pvalb = studpval(Tb, n - 2) # n - 2 degrees of freedom: two parameters (a and b) are estimated
if ( pvalb > 0.05 )
print "p-value(b) plausible: ", pvalb, " > 0.05"
else
print "p-value(b) NOT plausible: ", pvalb, " < 0.05"
pvala = studpval(Ta, n - 2) # n - 2 degrees of freedom: two parameters (a and b) are estimated
if ( pvala > 0.05 )
print "p-value(a) plausible: ", pvala, " > 0.05"
else
print "p-value(a) NOT plausible: ", pvala, " < 0.05"
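For cross-checking the awk port, here is a minimal Python sketch of the same series-based calculation. The function names are my own, and it assumes an integer number of degrees of freedom (n - 2 for a simple linear regression with intercept and slope). It returns the two-sided p-value, so it can be compared directly against a t-table.

```python
import math


def _statcom(q, i, j, b):
    """Partial sum used by the closed-form Student t tail expressions."""
    zz = 1.0
    z = zz
    k = i
    while k <= j:
        zz = zz * q * k / (k - b)
        z += zz
        k += 2
    return z


def student_p_value(t, df):
    """Two-sided p-value for a t statistic with integer df >= 1."""
    t = abs(t)
    th = math.atan2(t / math.sqrt(df), 1.0)
    half_pi = math.pi / 2.0
    if df == 1:
        return 1.0 - th / half_pi
    sth, cth = math.sin(th), math.cos(th)
    if df % 2 == 1:
        # odd degrees of freedom
        return 1.0 - (th + sth * cth * _statcom(cth * cth, 2, df - 3, -1)) / half_pi
    # even degrees of freedom
    return 1.0 - sth * _statcom(cth * cth, 1, df - 3, -1)
```

Feeding in tabulated critical values, for example t = 2.228 with 10 degrees of freedom, should give a p-value close to 0.05.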

Related

Conversion ECEF XYZ to LLH (LAT/LONG/HEIGHT) and translation back - not accurate / possible error in IronPython script

I've modeled a 3D earth with grid points.
The points are represented in 3D space as XYZ coordinates.
I then convert XYZ to Lat/Long/Height (elevation) using a script I took from a JSFiddle.
For some reason I got really strange results when trying to find XYZ of LLH not from my set, so I tried to verify the initial script by converting XYZ to LLH and then the same LLH back to XYZ to see if I get the same coordinate.
Instead, the resulting coordinate is some XYZ on earth, unrelated to the original XYZ position.
XYZ to LLH script:
Source: JSFiddle
import math

def xyzllh(x,y,z):
    """ xyz vector to lat,lon,height
    output:
    llhvec[3] with components
    flat geodetic latitude in deg
    flon longitude in deg
    altkm altitude in km
    """
    dtr = math.pi/180.0
    rrnrm = [0.0] * 3
    llhvec = [0.0] * 3
    geodGBL()
    esq = EARTH_Esq
    rp = math.sqrt( x*x + y*y + z*z )
    flatgc = math.asin( z / rp )/dtr
    testval= abs(x) + abs(y)
    if ( testval < 1.0e-10):
        flon = 0.0
    else:
        flon = math.atan2( y,x )/dtr
    if (flon < 0.0 ):
        flon = flon + 360.0
    p = math.sqrt( x*x + y*y )
    # on pole special case
    if ( p < 1.0e-10 ):
        flat = 90.0
        if ( z < 0.0 ):
            flat = -90.0
        altkm = rp - rearth(flat)
        llhvec[0] = flat
        llhvec[1] = flon
        llhvec[2] = altkm
        return llhvec
    # first iteration, use flatgc to get altitude
    # and alt needed to convert gc to gd lat.
    rnow = rearth(flatgc)
    altkm = rp - rnow
    flat = gc2gd(flatgc,altkm)
    rrnrm = radcur(flat)
    rn = rrnrm[1]
    # loop counter renamed from x so it no longer shadows the x argument
    for _ in range(5):
        slat = math.sin(dtr*flat)
        tangd = ( z + rn*esq*slat ) / p
        flatn = math.atan(tangd)/dtr
        dlat = flatn - flat
        flat = flatn
        clat = math.cos( dtr*flat )
        rrnrm = radcur(flat)
        rn = rrnrm[1]
        altkm = (p/clat) - rn
        if ( abs(dlat) < 1.0e-12 ):
            break
    llhvec[0] = flat
    llhvec[1] = flon
    llhvec[2] = altkm
    return llhvec
# globals
EARTH_A = 0
EARTH_B = 0
EARTH_F = 0
EARTH_Ecc = 0
EARTH_Esq = 0
# starting function do_llhxyz()
CallCount = 0
llh = [0.0] * 3
dtr = math.pi/180
CallCount = CallCount + 1
sans = " \n"
llh = xyzllh(x,y,z)
latitude = llh[0]
longitude= llh[1]
hkm = llh[2]
height = 1000.0 * hkm
latitude = fformat(latitude,5)
longitude = fformat(longitude,5)
height = fformat(height,1)
sans = sans +"Latitude,Longitude, Height (ellipsoidal) from ECEF\n"
sans = sans + "\n"
sans = sans +"Latitude : " + str(latitude) + " deg N\n"
sans = sans +"Longitude : " + str(longitude - 180) + " deg E\n"
sans = sans +"Height : " + str(height) + " m\n"
lats = []
longs = []
heights = []
lats.append(str(latitude))
longs.append(str(longitude - 180))
heights.append(str(height))
And this is the LLH to XYZ script:
Source: www.mathworks.com
a = 6378137
t = 8.1819190842622e-2
# (prime vertical radius of curvature)
N = a / math.sqrt(1 - (t*t) * (math.sin(lat)*math.sin(lat)))
x = []
y = []
z = []
# results:
x.append( ((N+height) * math.cos(lat) * math.cos(long))/1000 )
y.append( ((N+height) * math.cos(lat) * math.sin(long))/1000 )
z.append( (((1-t*t) * N + height) * math.sin(lat))/1000 )
Anyone know what I'm doing wrong here?
Thanks!
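One thing worth checking, judging only from the snippets above: xyzllh returns latitude and longitude in degrees, while the LLH to XYZ formulas pass lat and long straight into math.sin and math.cos, which expect radians. The longitude is also shifted by 180 before it is stored. Below is a minimal sketch of the forward conversion that makes the units explicit. The function name is hypothetical, the constants are the ones from the question, and it assumes degrees and metres as inputs and returns metres.

```python
import math

WGS84_A = 6378137.0           # semi-major axis in metres (value from the question)
WGS84_E = 8.1819190842622e-2  # first eccentricity (value from the question)


def llh_to_ecef(lat_deg, lon_deg, height_m):
    """Geodetic latitude/longitude in degrees and height in metres to ECEF metres."""
    lat = math.radians(lat_deg)  # trig functions need radians, not degrees
    lon = math.radians(lon_deg)
    # prime vertical radius of curvature
    n = WGS84_A / math.sqrt(1.0 - (WGS84_E * math.sin(lat)) ** 2)
    x = (n + height_m) * math.cos(lat) * math.cos(lon)
    y = (n + height_m) * math.cos(lat) * math.sin(lon)
    z = ((1.0 - WGS84_E ** 2) * n + height_m) * math.sin(lat)
    return x, y, z
```

For a round trip against xyzllh, which works in kilometres, divide the result by 1000 and pass the same longitude convention in both directions (without the extra 180 degree shift).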

Find missing number in sorted array

What's wrong with this code? I'm not able to find the missing number in a consecutive array using binary search.
a = [1,2,3,4,5,7,8]
lent = len(a)
beg =0
end = lent-1
while beg < end:
mid = (beg + end) / 2
if (a[mid]-a[beg])==(mid - beg):
beg = mid + 1
else:
end = mid -1
if(beg == end):
mid = (beg + end) / 2
print "missing"
print a[0]+ beg
Update #1: Yes, there was one more mistake. You're right. Here's an updated version.
Try this variant:
a = [1,2,3,4,5,7,8]
lent = len(a)
beg = 0
end = lent - 1
while beg < end:
    mid = (beg + end) // 2
    if (a[mid] - a[beg]) == (mid - beg):
        beg = mid  # no gap in a[beg..mid], so the gap is to the right
    else:
        end = mid  # the gap is inside a[beg..mid]
    if end - beg <= 1:
        # the gap sits between a[beg] and a[end]
        print("missing: %s" % (a[beg] + 1,))
        break
Result:
missing: 6
Also, try using functions, so you can easily test and debug your code on different lists:
def find_missing(a):
    # assumes a is sorted and exactly one value is missing from the run
    beg = 0
    end = len(a) - 1
    while beg < end:
        mid = (beg + end) // 2
        if (a[mid] - a[beg]) == (mid - beg):
            beg = mid
        else:
            end = mid
        if end - beg <= 1:
            return a[beg] + 1
a = [1,2,3,4,5,7,8]
print(find_missing(a))
a = [1,3,4,5,6]
print(find_missing(a))
a = [1,2,3,4,5,7,8,9,10]
print(find_missing(a))
Result:
6
2
6
//Implementation of the algorithm using java.
static int missingNumber(int [] nums) {
int i=0;
while(i < nums.length) {
int correct = nums[i];
if (nums[i] < nums.length && nums[i] != nums[correct]) {
swap(nums, i, correct);
} else {
i++;
}
}
for( int index=0; index<nums.length; index++){
if(index != nums[index]) {
return index;
}
}
return nums.length;
}
static void swap(int[] nums, int first, int second) {
int temp = nums[first];
nums[first] = nums[second];
nums[second] = temp;
}
int findMiss(int arr[], int low, int high)
{
if(low>high)
return -1;
if(arr[low]-1 != low)
return arr[low]-1;
int mid = (low + high) / 2;
if(arr[mid]-1 != mid)
return findMiss(arr,low,mid);
else
return findMiss(arr,mid+1,high);
}
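The recursive findMiss above relies on each value sitting at index value - 1. The same idea can be written iteratively. This is a minimal Python sketch, with a function name of my own, that assumes a sorted list of distinct integers with at most one value missing from the run starting at a[0]:

```python
def first_missing(a):
    """Return the missing value, or None if the run has no gap."""
    lo, hi = 0, len(a) - 1
    # Invariant: every index left of lo satisfies a[i] - a[0] == i.
    while lo <= hi:
        mid = (lo + hi) // 2
        if a[mid] - a[0] == mid:
            lo = mid + 1  # prefix up to mid is intact, look right
        else:
            hi = mid - 1  # the gap is at or before mid, look left
    if lo == len(a):
        return None       # no index was displaced, so nothing is missing
    return a[0] + lo
```

Comparing against a[0] instead of a fixed 1 means the list does not have to start at 1.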

How to remove scientific notation for Rplot chart

I developed this R-script to drive a decision flow Rplot chart, but I can't get it to show numeric values instead of scientific notation. I spent half of the work day yesterday trying to make it numeric by following examples I found on stackoverflow, but so far no luck. See code and screenshot for details.
#automatically convert columns with few unique values to factors
convertCol2factors<-function(data, minCount = 3)
{
for (c in 1:ncol(data))
if(is.logical(data[, c])){
data[, c] = as.factor(data[, c])
}else{
uc<-unique(data[, c])
if(length(uc) <= minCount)
data[, c] = as.factor(data[, c])
}
return(data)
}
#compute root node error
rootNodeError<-function(labels)
{
ul<-unique(labels)
g<-NULL
for (u in ul) g = c(g, sum(labels == u))
return(1-max(g)/length(labels))
}
# this function is almost identical to fancyRpartPlot{rattle}
# it is duplicated here because the call for library(rattle) may trigger GTK load,
# which may be missing on user's machine
replaceFancyRpartPlot<-function (model, main = "", sub = "", palettes, ...)
{
num.classes <- length(attr(model, "ylevels"))
default.palettes <- c("Greens", "Blues", "Oranges", "Purples",
"Reds", "Greys")
if (missing(palettes))
palettes <- default.palettes
missed <- setdiff(1:6, seq(length(palettes)))
palettes <- c(palettes, default.palettes[missed])
numpals <- 6
palsize <- 5
pals <- c(RColorBrewer::brewer.pal(9, palettes[1])[1:5],
RColorBrewer::brewer.pal(9, palettes[2])[1:5], RColorBrewer::brewer.pal(9,
palettes[3])[1:5], RColorBrewer::brewer.pal(9, palettes[4])[1:5],
RColorBrewer::brewer.pal(9, palettes[5])[1:5], RColorBrewer::brewer.pal(9,
palettes[6])[1:5])
if (model$method == "class") {
yval2per <- -(1:num.classes) - 1
per <- apply(model$frame$yval2[, yval2per], 1, function(x) x[1 +
x[1]])
}
else {
per <- model$frame$yval/max(model$frame$yval)
}
per <- as.numeric(per)
if (model$method == "class")
col.index <- ((palsize * (model$frame$yval - 1) + trunc(pmin(1 +
(per * palsize), palsize)))%%(numpals * palsize))
else col.index <- round(per * (palsize - 1)) + 1
col.index <- abs(col.index)
if (model$method == "class")
extra <- 104
else extra <- 101
rpart.plot::prp(model, type = 2, extra = extra, box.col = pals[col.index],
nn = TRUE, varlen = 0, faclen = 0, shadow.col = "grey",
fallen.leaves = TRUE, branch.lty = 3, ...)
title(main = main, sub = sub)
}
###############Upfront input correctness validations (where possible)#################
pbiWarning<-""
pbiInfo<-""
dataset <- dataset[complete.cases(dataset[, 1]), ] #remove rows with corrupted labels
dataset = convertCol2factors(dataset)
nr <- nrow( dataset )
nc <- ncol( dataset )
nl <- length( unique(dataset[, 1]))
goodDim <- (nr >=minRows && nc >= 2 && nl >= 2)
##############Main Visualization script###########
set.seed(randSeed)
opt = NULL
dtree = NULL
if(autoXval)
xval<-autoXvalFunc(nr)
dNames <- names(dataset)
X <- as.vector(dNames[-1])
form <- as.formula(paste('`', dNames[1], '`', "~ .", sep = ""))
# Run the model
if(goodDim)
{
for(a in 1:maxNumAttempts)
{
dtree <- rpart(form, dataset, control = rpart.control(minbucket = minBucket, cp = complexity, maxdepth = maxDepth, xval = xval)) #large tree
rooNodeErr <- rootNodeError(dataset[, 1])
opt <- optimalCPbyXError(as.data.frame(dtree$cptable))
dtree<-prune(dtree, cp = opt$CP)
if(opt$ind > 1)
break;
}
}
#info for classifier
if( showInfo && !is.null(dtree) && dtree$method == 'class')
pbiInfo <- paste("Rel error = ", d2form(opt$relErr * rooNodeErr),
"; CVal error = ", d2form(opt$xerror * rooNodeErr),
"; Root error = ", d2form(rooNodeErr),
";cp = ", d2form(opt$CP, 3), sep = "")
if(goodDim && opt$ind>1)
{
#fancyRpartPlot(dtree, sub = pbiInfo)
replaceFancyRpartPlot(dtree, sub = pbiInfo)
}else{
if( showWarnings )
pbiWarning <- ifelse(goodDim, paste("The tree depth is zero. Root error = ", d2form(rooNodeErr), sep = ""),
"Wrong data dimensionality" )
plot.new()
title( main = NULL, sub = pbiWarning, outer = FALSE, col.sub = "gray40" )
}
remove("dataset")
Also, how can I tell what "n" means from the photo below? (I copied this code from a project).
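Try adding digits = -2 to the rpart.plot::prp call in replaceFancyRpartPlot. A negative digits value makes prp format numbers with R's standard format() function instead of its own compact notation.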

Using libSVM programmatically

I have started using libSVM (java: https://github.com/cjlin1/libsvm) programmatically. I wrote the following code to test it:
svm_parameter param = new svm_parameter();
// default values
param.svm_type = svm_parameter.C_SVC;
param.kernel_type = svm_parameter.RBF;
param.degree = 3;
param.gamma = 0;
param.coef0 = 0;
param.nu = 0.5;
param.cache_size = 40;
param.C = 1;
param.eps = 1e-3;
param.p = 0.1;
param.shrinking = 1;
param.probability = 0;
param.nr_weight = 0;
param.weight_label = new int[0];
param.weight = new double[0];
svm_problem prob = new svm_problem();
prob.l = 4;
prob.y = new double[prob.l];
prob.x = new svm_node[prob.l][2];
for(int i = 0; i < prob.l; i++)
{
prob.x[i][0] = new svm_node();
prob.x[i][1] = new svm_node();
prob.x[i][0].index = 1;
prob.x[i][1].index = 2;
prob.x[i][0].value = (i%2!=0)?-1:1;
prob.x[i][1].value = (i/2%2==0)?-1:1;
prob.y[i] = (prob.x[i][0].value == 1 && prob.x[i][1].value == 1)?1:-1;
System.out.println("X = [ " + prob.x[i][0].value + ", " + prob.x[i][1].value + " ] \t -> " + prob.y[i] );
}
svm_model model = svm.svm_train(prob, param);
int test_length = 4;
for( int i = 0; i < test_length; i++)
{
svm_node[] x_test = new svm_node[2];
x_test[0] = new svm_node();
x_test[1] = new svm_node();
x_test[0].index = 1;
x_test[0].value = (i%2!=0)?-1:1;
x_test[1].index = 2;
x_test[1].value = (i/2%2==0)?-1:1;
double d = svm.svm_predict(model, x_test);
System.out.println("X[0] = " + x_test[0].value + " X[1] = " + x_test[1].value + "\t\t\t Y = "
+ ((x_test[0].value == 1 && x_test[1].value == 1)?1:-1) + "\t\t\t The predicton = " + d);
}
Since I am testing on the same data I trained on, I'd expect 100% accuracy, but this is the output I get:
X = [ 1.0, -1.0 ] -> -1.0
X = [ -1.0, -1.0 ] -> -1.0
X = [ 1.0, 1.0 ] -> 1.0
X = [ -1.0, 1.0 ] -> -1.0
*
optimization finished, #iter = 1
nu = 0.5
obj = -20000.0, rho = 1.0
nSV = 2, nBSV = 2
Total nSV = 2
X[0] = 1.0 X[1] = -1.0 Y = -1 The predicton = -1.0
X[0] = -1.0 X[1] = -1.0 Y = -1 The predicton = -1.0
X[0] = 1.0 X[1] = 1.0 Y = 1 The predicton = -1.0
X[0] = -1.0 X[1] = 1.0 Y = -1 The predicton = -1.0
We can see that the following prediction is erroneous:
X[0] = 1.0 X[1] = 1.0 Y = 1 The predicton = -1.0
Does anyone know what the mistake in my code is?
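You're using the radial basis function kernel (param.kernel_type = svm_parameter.RBF), which depends on gamma. With param.gamma = 0 the kernel exp(-gamma * |u - v|^2) evaluates to 1 for every pair of points, so the model cannot tell them apart. Setting 'param.gamma = 1' should yield 100% accuracy.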
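To see why gamma = 0 is a problem, here is a small standalone sketch (plain Python, not libSVM code) that evaluates the RBF kernel exp(-gamma * |u - v|^2) on the four training points from the question. The helper names are made up for illustration.

```python
import math


def rbf_kernel(u, v, gamma):
    """RBF kernel value exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)


# the four training points printed in the question
points = [(1, -1), (-1, -1), (1, 1), (-1, 1)]


def gram(gamma):
    """Kernel matrix of the training points for a given gamma."""
    return [[rbf_kernel(u, v, gamma) for v in points] for u in points]
```

With gamma = 0 every entry of gram(0) is 1, so all four points are indistinguishable to the classifier. With gamma = 1 the off-diagonal entries drop below 1 and the points become separable.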

missing value where TRUE/FALSE needed

I'm working on pairs trading data, and the following function should return total.profit for a given value of k.
optimal.k = function (k) {
u = m + k * s
l = m - k * s
profit = 0
total.profit = 0
i = 1
p = 0.001
while ( i <= length(r) ) {
if ( r[i] >= u ) {
buy.unit = 1/East$Close[i]
sell.unit = 1/South$Close[i]
if ( i == length(r) ) {
buy.price = buy.unit * East$Close[i]
sell.price = sell.unit * South$Close[i]
profit = sell.price - buy.price
costs = (sell.price + buy.price) * p
total.profit = total.profit + profit - costs
break
}
while ( r[i] > m ) { #################################### here
i = i + 1
}
buy.price = buy.unit * East$Close[i]
sell.price = sell.unit * South$Close[i]
profit = sell.price - buy.price
costs = (sell.price + buy.price) * p
total.profit = total.profit + profit - costs
}
if ( r[i] <= l ) {
buy.unit = 1/South$Close[i]
sell.unit = 1/East$Close[i]
if ( i == length(r) ) {
buy.price = buy.unit * South$Close[i]
sell.price = sell.unit * East$Close[i]
profit = sell.price - buy.price
costs = (sell.price + buy.price) * p
total.profit = total.profit + profit - costs
break
}
while ( r[i] < m ) {
i = i + 1
}
buy.price = buy.unit * East$Close[i]
sell.price = sell.unit * South$Close[i]
profit = sell.price - buy.price
costs = (sell.price + buy.price) * p
total.profit = total.profit + profit - costs
}
if ( i == length(r) ) break # a bare 'stop' without parentheses does nothing
i = i + 1
}
print(total.profit)
}
If I run the function, I get this error message.
optimal.k(1)
Error in while (r[i] > m) { : missing value where TRUE/FALSE needed
I don't understand why (r[i] > m) is NA.
Does anyone know why it occurs?
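One likely cause, assuming r itself contains no NA values: the inner while loops advance i without checking it against length(r). If the series never crosses back over m, i runs past the last element, and in R indexing past the end of a vector yields NA, which is what the while condition then receives. The sketch below shows the guarded scan in Python, where the same mistake would raise an IndexError instead. The function name is hypothetical.

```python
def scan_until_reverts(r, start, m):
    """Return the first index >= start where r[i] <= m,
    or the last valid index if the series never reverts."""
    i = start
    # the bounds check comes first, so r[i] is never read past the end
    while i < len(r) - 1 and r[i] > m:
        i += 1
    return i
```

In R the equivalent guard would be while ( i < length(r) && r[i] > m ), and likewise with r[i] < m for the lower band.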