Linear regression confidence intervals in SQL

I'm using some fairly straightforward SQL code to calculate the coefficients of a least-squares regression (intercept and slope) for some (x, y) data points. This gives me a nice best-fit line through the data. However, we would like to be able to see the 95% and 5% confidence intervals for the line of best fit (the curves below).
[Image: a best-fit line with upper and lower confidence curves; source: curvefit.com]
What these mean is that the true line has a 95% probability of being below the upper curve and a 95% probability of being above the lower curve. How can I calculate these curves? I have already read Wikipedia and done some Googling, but I haven't found mathematical equations I understand well enough to calculate this.
Edit: here is the essence of what I have right now.
--sample data
create table #lr (x real not null, y real not null)
insert into #lr values (0,1)
insert into #lr values (4,9)
insert into #lr values (2,5)
insert into #lr values (3,7)

declare @slope real
declare @intercept real

--calculate slope and intercept
select
    @slope = ((count(*) * sum(x*y)) - (sum(x) * sum(y))) /
             ((count(*) * sum(power(x,2))) - power(sum(x),2)),
    @intercept = avg(y) - ((count(*) * sum(x*y)) - (sum(x) * sum(y))) /
                 ((count(*) * sum(power(x,2))) - power(sum(x),2)) * avg(x)
from #lr
Thank you in advance.

An equation for the confidence interval width as a function of x is given here under "Confidence Interval on Fitted Values":
http://www.weibull.com/DOEWeb/confidence_intervals_in_simple_linear_regression.htm
The page walks you through an example calculation too.
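For reference, the formula there is the standard one for simple linear regression: the band around the fitted value at a point x0 is

y_hat(x0) ± t(alpha/2, n-2) * s * sqrt( 1/n + (x0 - x_mean)^2 / Sxx )

where y_hat(x0) = intercept + slope * x0, Sxx = sum((x - x_mean)^2), and s^2 = SSE / (n - 2) is the residual variance. A rough T-SQL sketch continuing from the question's snippet might look like this; since T-SQL has no built-in Student-t quantile, @t_crit is hard-coded (about 4.303 for a two-sided 95% interval with n - 2 = 2 degrees of freedom), and @slope / @intercept are assumed to be the variables computed above:

declare @n int
declare @mean_x real
declare @sxx real
declare @s real
declare @t_crit real
set @t_crit = 4.303 --t(0.025, 2); look up the value for your own n

--pieces of the band formula
select @n = count(*), @mean_x = avg(x),
       @sxx = sum(power(x,2)) - count(*) * power(avg(x),2)
from #lr

--s = sqrt( SSE / (n-2) ), from the fitted line's residuals
select @s = sqrt(sum(power(y - (@intercept + @slope * x), 2)) / (@n - 2))
from #lr

--upper and lower curves evaluated at each data point's x
select x,
       (@intercept + @slope * x)
           + @t_crit * @s * sqrt(1.0/@n + power(x - @mean_x, 2) / @sxx) as upper_curve,
       (@intercept + @slope * x)
           - @t_crit * @s * sqrt(1.0/@n + power(x - @mean_x, 2) / @sxx) as lower_curve
from #lr

(With the perfectly linear sample data above, s comes out as 0 and the band collapses onto the line; real data will give a nonzero width.)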

Try this site and scroll down to the middle. For each point of your best-fit line, you know your Z, your sample size, and your standard deviation.
http://www.stat.yale.edu/Courses/1997-98/101/confint.htm

@PowerUser: He needs to use the equations for two-variable setups, not for one-variable setups.
Matt: If I had my old statistics textbook with me, I'd be able to tell you what you want; unfortunately, I don't have it with me, nor do I have my notes from my high school statistics course. On the other hand, from what I remember, it may only have covered the confidence interval for the regression line's slope...
Anyway, this page will hopefully be of some help: http://www.stat.yale.edu/Courses/1997-98/101/linregin.htm.
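If it's the slope interval that page gives, it takes the standard form b1 ± t(alpha/2, n-2) * s / sqrt(Sxx), with s and Sxx as in the fitted-value formula above; that widens the slope estimate but doesn't by itself give the curves the question asks for.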

Related

Plotting extremely small Y-axis values in ggplot

Some of the values in my data are extremely small, as small as 1e-19. When I plot the data, those small values are clustered around zero.
Suppose I have 21 data points; 20 of them range from 1e-05 to 1e-02 and 1 of them has a value of 1.
my.df <- data.frame(group=state.name[1:21],col1 = c(runif(20,min=1e-05,max=1e-02),1))
p <- ggplot(my.df, aes(x=group,y=col1)) +
geom_point()
I want more precise y-axis values so that those small values spread out rather than being stuck around zero. Preferably, I would like the y values to increase on a geometric scale, such as 1e-05, 1e-04, 1e-03, 1e-02, 1e-01, 1.
Thank you for your support in advance!
PS: I carefully checked whether there is a duplicate. I hope I haven't missed an already existing one.

Unable to get the odds of a user - Bayes theorem

I'm trying to solve what is quite a basic question using a confusion matrix, but my solution doesn't match the given solution.
Q: Let's say we have a drug test that can accurately identify users of a drug 99% of the time, and accurately has a negative result for 99% of non-users. But only 0.3% of the overall population uses this drug.
What are the odds of someone being an actual user of the drug given that they tested positive?
Also, is TP / (TP + FN) the same as P(A) P(B|A) / P(B)?
My Approach:
            Tested positive   Tested negative    Total
Users                  29.7               0.3       30
Non-users              99.7            9870.3     9970
Total                 129.4            9870.6    10000
From the above data, I got 29.7 / 129.4 = 0.2295208655, around 22.95%.
But the solution states 22.8%. I'm confused. What is the right way to do this?
I got it:
The given approach was something like this: P(B) is 1.3% (0.99 * 0.003 + 0.01 * 0.997), so P(A|B) = P(A) P(B|A) / P(B) = 0.003 * 0.99 / 0.013 = 0.228, i.e. '22.8%'.
But they rounded the number to 1.3% instead of 1.294%, and that's why the value is different!
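Written out without the rounding, the full calculation is:

P(user | positive) = P(positive | user) * P(user) / P(positive)
P(positive) = P(positive | user) * P(user) + P(positive | non-user) * P(non-user)
            = 0.99 * 0.003 + 0.01 * 0.997 = 0.00297 + 0.00997 = 0.01294
P(user | positive) = 0.00297 / 0.01294 ≈ 0.2295

which matches the 29.7 / 129.4 from the table. Note that this posterior corresponds to TP / (TP + FP) (true positives over everyone who tested positive), not TP / (TP + FN), which is the test's sensitivity.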

GIS buffer value in degrees to meters with SpatiaLite

I am new to SpatiaLite. I have the following query:
select A.*
from linka as A, pointa as B
where Contains(Buffer(B.Geometry, 100), A.Geometry)
I actually want to create a 100-meter buffer and find out which links are contained by it.
I found that the inserted '100' is actually treated as a value in degrees, and the output I get covers that range.
I could put a degree value in my query instead, but the conversion from degrees to meters/kilometers is not the same all around the world.
I have gone through many sites and learned that 1 degree is approximately 110 km,
but from GIS experts and some reference sites I also learned that it differs as you approach the poles.
For instance, at Alta, Norway, the difference between metrical x and y for a planar approximation is such that 34 km in the x direction equals 111 km in the y direction. The buffer looks similar to this when using geographic coordinates:
http://extremelysatisfactorytotalitarianism.com/blog/wp-content/uploads/2010/08/tissot_indicatrix_equirectangular_proj.png
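The underlying relationship: one degree of latitude is roughly 111 km everywhere, while one degree of longitude spans roughly 111 * cos(latitude) km, which is why a buffer expressed in degrees stretches out near the poles.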
I am building software which converts geographical data into geometrical (X, Y coordinate) data and performs the transformation into a form SpatiaLite can understand.
I have also been trying to read about SRIDs but cannot work out how to use them in my query.
Temporarily transform your geometry to a metric projection (e.g. UTM).
If I assume your current projection is WGS84, try the following statement:
Transform(Buffer(Transform(B.Geometry, @projection), @dist), 4326)
- in @projection: your new projection, e.g. 32631 for WGS 84 / UTM zone 31N (choose the projection that fits your zone)
- in @dist: distance in meters
(4326 is the SRID for WGS84)
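Applied to the question's query, a sketch might look like this (assuming the data is stored in WGS84 / SRID 4326 and lies within UTM zone 31N; substitute the UTM zone that covers your data):

SELECT A.*
FROM linka AS A, pointa AS B
WHERE Contains(
        Transform(Buffer(Transform(B.Geometry, 32631), 100), 4326),
        A.Geometry)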
If you are using SQL Server 2008 or later, you should be able to use spatial types.
Let's assume linka contains a geography column named geo, and that it contains points.
Don't forget to create a spatial index!
Try this:
DECLARE @buffer geography = geography::Point( 1.234, 5.678, 4326 );
DECLARE @distance float = 100.0;
SELECT * FROM linka
WHERE linka.geo.STDistance(@buffer) < @distance

Efficiently finding the distance between 2 lat/longs in SQL

I'm working with billions of rows of data, and each row has an associated start latitude/longitude and end latitude/longitude. I need to calculate the distance between each start/end point, but it is taking an extremely long time.
I really need to make what I'm doing more efficient.
Currently I use a function (below) to calculate the hypotenuse between points. Is there some way to make this more efficient?
I should say that I have already tried casting the lat/longs as spatial geographies and using SQL's built-in STDistance() function (not indexed), but this was even slower.
Any help would be much appreciated. I'm hoping there is some way to speed up the function, even if it degrades accuracy a little (nearest 100m is probably ok).
Thanks in advance!
DECLARE @l_distance_m FLOAT
      , @l_long_start FLOAT
      , @l_long_end FLOAT
      , @l_lat_start FLOAT
      , @l_lat_end FLOAT
      , @l_x_diff FLOAT
      , @l_y_diff FLOAT

SET @l_lat_start = @lat_start
SET @l_long_start = @long_start
SET @l_lat_end = @lat_end
SET @l_long_end = @long_end

-- NOTE: 2 x PI() x (radius of earth) / 360 = 111 km per degree
SET @l_y_diff = 111 * (@l_lat_end - @l_lat_start)
SET @l_x_diff = 111 * (@l_long_end - @l_long_start) * COS(RADIANS((@l_lat_end + @l_lat_start) / 2))
SET @l_distance_m = 1000 * SQRT(@l_x_diff * @l_x_diff + @l_y_diff * @l_y_diff)

RETURN @l_distance_m
I haven't done any SQL programming since around 1994; however, I'd make the following observations:
The formula that you're using works as long as the distances between your coordinates don't get too big. It'll have big errors for working out the distance between e.g. New York and Singapore, but for working out the distance between New York and Boston it should be fine to within 100m.
I don't think there's any approximation formula that would be faster; however, I can see some minor implementation improvements that might speed it up:
(1) Why do you bother to assign @l_lat_start from @lat_start? Can't you just use @lat_start directly (and the same for @long_start, @lat_end and @long_end)?
(2) Instead of having 111 in the formulas for @l_y_diff and @l_x_diff, you could get rid of it there, saving a multiplication, and instead of 1000 in the formula for @l_distance_m you could have 111000.
(3) Using COS(RADIANS(@lat_end)) or COS(RADIANS(@lat_start)) won't degrade the accuracy as long as the points aren't too far apart; or, if the points are all within the same city, you could just use the cosine of any point in that city.
Apart from that, I think you'd need to look at other ideas, such as creating a table with the results and, whenever points are added to or deleted from the table, updating the results table at that time.
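Putting (1), (2) and (3) together, the body of the function might reduce to something like this (an untested sketch; @lat_start, @long_start, @lat_end and @long_end are assumed to be the function's parameters):

DECLARE @l_x_diff FLOAT
DECLARE @l_y_diff FLOAT

-- degree differences; the ~111 km per degree and the km-to-metres factor
-- are folded into the single 111000 constant below
SET @l_y_diff = @lat_end - @lat_start
SET @l_x_diff = (@long_end - @long_start) * COS(RADIANS(@lat_start))

RETURN 111000 * SQRT(@l_x_diff * @l_x_diff + @l_y_diff * @l_y_diff)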

Is there an iterative way to calculate radii along a scanline?

I am processing a series of points which all have the same Y value, but different X values. I go through the points by incrementing X by one. For example, I might have Y = 50 and X is the integers from -30 to 30. Part of my algorithm involves finding the distance to the origin from each point and then doing further processing.
After profiling, I've found that the sqrt call in the distance calculation is taking a significant amount of my time. Is there an iterative way to calculate the distance?
In other words:
I want to efficiently calculate r[n] = sqrt(x[n]*x[n] + y*y). I can save information from the previous iteration. Each iteration changes by incrementing x, so x[n] = x[n-1] + 1. I cannot use sqrt or trig functions because they are too slow, except at the beginning of each scanline.
I can use approximations as long as they are good enough (less than 0.1% error) and the errors introduced are smooth (I can't bin to a pre-calculated table of approximations).
Additional information:
x and y are always integers between -150 and 150
I'm going to try a couple ideas out tomorrow and mark the best answer based on which is fastest.
Results
I did some timings:
Distance formula: 16 ms / iteration
Pete's interpolating solution: 8 ms / iteration
wrang-wrang's pre-calculation solution: 8 ms / iteration
I was hoping the test would decide between the two, because I like both answers. I'm going to go with Pete's because it uses less memory.
Just to get a feel for it: for your range, y = 50, x = 0 gives r = 50, and y = 50, x = +/- 30 gives r ≈ 58.3. You want an approximation good to +/- 0.1%, or +/- 0.05 absolute. That's a lot lower accuracy than most library sqrts give.
Two approximate approaches: calculate r by interpolating from the previous value, or use a few terms of a suitable series.
Interpolating from the previous r
r = (x^2 + y^2)^(1/2)
dr/dx = (1/2) * 2x * (x^2 + y^2)^(-1/2) = x/r
double r = 50;
for ( int x = 0; x <= 30; ++x ) {
    double r_true = Math.sqrt( 50*50 + x*x );
    System.out.printf( "x: %d r_true: %f r_approx: %f error: %f%%\n",
            x, r_true, r, 100 * Math.abs( r_true - r ) / r );
    // step r by the derivative dr/dx = x/r, evaluated at the midpoint x + 0.5
    r = r + ( x + 0.5 ) / r;
}
Gives:
x: 0 r_true: 50.000000 r_approx: 50.000000 error: 0.000000%
x: 1 r_true: 50.009999 r_approx: 50.010000 error: 0.000002%
....
x: 29 r_true: 57.801384 r_approx: 57.825065 error: 0.040953%
x: 30 r_true: 58.309519 r_approx: 58.335225 error: 0.044065%
which seems to meet the requirement of 0.1% error, so I didn't bother coding the next one, as it would require quite a few more calculation steps.
Truncated Series
The Taylor series for sqrt(1 + x) for x near zero is
sqrt(1 + x) = 1 + (1/2)x - (1/8)x^2 + ... + (-1/2)^(n+1) x^n + ...
Using r = y * sqrt(1 + (x/y)^2), you're looking for the first term t = (-1/2)^(n+1) * 0.36^n with magnitude less than 0.001: log(0.002) > n log(0.18), so n > 3.6, and taking terms up to x^4 should be OK.
Y = 10000
Y2 = Y*Y
for x = 0..Y do
    D[x] = sqrt( Y2 + x*x )

norm(x, y) =
    if (y == 0) x
    else if (x > y) norm(y, x)
    else {
        s = Y/y
        D[round(x*s)] / s
    }
If your coordinates are smooth, then the idea can be extended with linear interpolation. For more precision, increase Y.
The idea is that s*(x,y) is on the line y=Y, which you've precomputed distances for. Get the distance, then divide it by s.
I assume you really do need the distance and not its square.
You may also be able to find a general sqrt implementation that sacrifices some accuracy for speed, but I have a hard time imagining that beating what the FPU can do.
By linear interpolation, I mean to change D[round(x)] to:
f=floor(x)
a=x-f
D[f]*(1-a)+D[f+1]*a
This doesn't really answer your question, but may help...
The first questions I would ask would be:
"do I need the sqrt at all?".
"If not, how can I reduce the number of sqrts?"
then yours: "Can I replace the remaining sqrts with a clever calculation?"
So I'd start with:
Do you need the exact radius, or would radius-squared be acceptable? There are fast approximations to sqrt, but they're probably not accurate enough for your spec.
Can you process the image using mirrored quadrants or eighths? By processing all pixels at the same radius value in a batch, you can reduce the number of calculations by 8x.
Can you precalculate the radius values? You only need a table that is a quarter (or possibly an eighth) of the size of the image you are processing, and the table would only need to be precalculated once and then re-used for many runs of the algorithm.
So clever maths may not be the fastest solution.
Well, there's always trying to optimize your sqrt; the fastest one I've seen is the old Carmack Quake 3 sqrt:
http://betterexplained.com/articles/understanding-quakes-fast-inverse-square-root/
That said, since sqrt is non-linear, you're not going to be able to do simple linear interpolation along your line to get your result. The best idea is to use a table lookup, since that will give you blazingly fast access to the data. And since you appear to be iterating by whole integers, a table lookup should be exceedingly accurate.
Well, you can mirror around x = 0 to start with (you need only compute n >= 0, and then dupe those results to the corresponding n < 0). After that, I'd take a look at using the derivative of sqrt(a^2 + b^2) (or the corresponding sin) to take advantage of the constant dx.
If that's not accurate enough, may I point out that this is a pretty good job for SIMD, which will provide you with a reciprocal square root op on both SSE and VMX (and shader model 2).
This is sort of related to a HAKMEM item:
ITEM 149 (Minsky): CIRCLE ALGORITHM
Here is an elegant way to draw almost circles on a point-plotting display:
NEW X = OLD X - epsilon * OLD Y
NEW Y = OLD Y + epsilon * NEW(!) X
This makes a very round ellipse centered at the origin, with its size determined by the initial point. epsilon determines the angular velocity of the circulating point, and slightly affects the eccentricity. If epsilon is a power of 2, then we don't even need multiplication, let alone square roots, sines, and cosines! The "circle" will be perfectly stable because the points soon become periodic.
The circle algorithm was invented by mistake when I tried to save one register in a display hack! Ben Gurley had an amazing display hack using only about six or seven instructions, and it was a great wonder. But it was basically line-oriented. It occurred to me that it would be exciting to have curves, and I was trying to get a curve display hack with minimal instructions.