I am trying to write a function that will output some address information for a CIDR-formatted IP (output is underneath the code):
create function dbo.ConvertIpToInt (@Ip as varchar(15))
returns bigint
as
begin
return (convert(bigint, parsename(@Ip, 1)) +
convert(bigint, parsename(@Ip, 2)) * 256 +
convert(bigint, parsename(@Ip, 3)) * 256 * 256 +
convert(bigint, parsename(@Ip, 4)) * 256 * 256 * 256)
end
go
create function dbo.ConvertIntToIp (@Int bigint)
returns varchar(15)
as
begin
declare
@IpHex varchar(8)
,@IpDotted varchar(15)
select
@IpHex = substring(convert(varchar(30), master.dbo.fn_varbintohexstr(@Int)), 11, 8)
select
@IpDotted = convert(varchar(3), convert(int, (convert(varbinary, substring(@IpHex, 1, 2), 2)))) + '.' +
convert(varchar(3), convert(int, (convert(varbinary, substring(@IpHex, 3, 2), 2)))) + '.' +
convert(varchar(3), convert(int, (convert(varbinary, substring(@IpHex, 5, 2), 2)))) + '.' +
convert(varchar(3), convert(int, (convert(varbinary, substring(@IpHex, 7, 2), 2))))
return @IpDotted
end
go
create function dbo.GetCidrIpRange (@CidrIp varchar(15))
returns @result table
(
CidrIp varchar(15) not null,
Mask int not null,
LowRange varchar(15) not null,
LowIp varchar(15) not null,
HighRange varchar(15) not null,
HighIp varchar(15) not null,
AddressQty bigint not null
)
as
begin
declare @Base bigint = cast(4294967295 as bigint)
declare @Mask int = cast(substring(@CidrIp, patindex('%/%', @CidrIp) + 1, 2) as int)
declare @Power bigint = power(2.0, 32.0 - @Mask) - 1
declare @LowRange bigint = dbo.ConvertIpToInt(left(@CidrIp, patindex('%/%', @CidrIp) - 1)) & (@Base ^ @Power)
declare @HighRange bigint = @LowRange + @Power
insert @result
select
CidrIp = @CidrIp
, Mask = @Mask
, LowRange = @LowRange
, LowIp = dbo.ConvertIntToIp(@LowRange)
, HighRange = @HighRange
, HighIp = dbo.ConvertIntToIp(@HighRange)
, AddressQty = convert(bigint, power(2.0, (32.0 - @Mask)))
return
end
go
select * from dbo.GetCidrIpRange('195.65.254.11/2');
This outputs the following:
CidrIp Mask LowRange LowIp HighRange HighIp AddressQty
--------------------------------------------------------------------------------------
195.65.254.11/2 2 3221225472 192.0.0.0 4294967295 255.255.255.255 1073741824
I have been browsing SO and Google for some hours now, and I am quite convinced that ConvertIpToInt and ConvertIntToIp are correct.
However, I was expecting the following output:
CidrIp Mask LowRange LowIp HighRange HighIp AddressQty
--------------------------------------------------------------------------------------
195.65.254.11/2 2 3275881985 195.65.254.1 3275882238 195.65.254.254 254
Can someone please point out where the mistake in my code is? I have been staring at it for so long that I no longer see it (or I am misunderstanding how to do this).
According to both http://www.ipaddressguide.com/cidr and http://jodies.de/ipcalc?host=195.65.254.11&mask1=2&mask2=, your calculations are correct. The only disagreement between those two sites is that the jodies.de/ipcalc page removes the lowest and highest (broadcast) IP addresses from the range.
I tested with both 195.65.254.11/2 and 195.65.254.11/24. In order to get your code working, I needed to change the input parameter specification on dbo.GetCidrIpRange to be VARCHAR(20) (as mentioned by @Damien_The_Unbeliever in a comment on the question).
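For reference, the bit arithmetic the functions perform can be cross-checked outside SQL. Here is a minimal Python sketch (the function name is illustrative, not part of any library); the masking mirrors the Base ^ Power expression in GetCidrIpRange:

```python
def cidr_range(cidr):
    """Network (low) and broadcast (high) addresses of a CIDR block."""
    ip_str, mask_str = cidr.split("/")
    mask = int(mask_str)
    o = [int(part) for part in ip_str.split(".")]
    ip = (o[0] << 24) | (o[1] << 16) | (o[2] << 8) | o[3]
    host_bits = (1 << (32 - mask)) - 1       # 2^(32-mask) - 1, the Power value
    low = ip & (0xFFFFFFFF ^ host_bits)      # clear host bits -> network address
    high = low + host_bits                   # set host bits -> broadcast address
    return low, high, host_bits + 1

low, high, qty = cidr_range("195.65.254.11/2")
print(low, high, qty)  # 3221225472 4294967295 1073741824
```

This agrees with the 192.0.0.0 / 255.255.255.255 output above: a /2 keeps only the top two bits of the address, so the question's expected output is what is actually mistaken.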
Two notes regarding performance:
For the ConvertIpToInt and ConvertIntToIp Scalar UDFs you might be better off using the INET_AddressToNumber and INET_NumberToAddress functions, respectively, that are included in the Free version of the SQL# SQLCLR library (which I wrote, but hey, Free :). The reason for this recommendation is that unlike T-SQL UDFs, deterministic SQLCLR UDFs (and these two are) do not prevent parallel plans.
If you don't want to go the SQLCLR route, then you should, at the very least, keep the ConvertIntToIp function as purely mathematical. There is no reason to do all of those conversions and substrings.
CREATE FUNCTION dbo.IPNumberToAddress(@IPNumber BIGINT)
RETURNS VARCHAR(15)
WITH SCHEMABINDING
AS
BEGIN
DECLARE @Oct1 BIGINT,
        @Oct2 INT,
        @Oct3 INT;
SET @Oct1 = @IPNumber / (256 * 256 * 256);
SET @IPNumber -= (@Oct1 * (256 * 256 * 256));
SET @Oct2 = @IPNumber / (256 * 256);
SET @IPNumber -= (@Oct2 * (256 * 256));
SET @Oct3 = @IPNumber / 256;
SET @IPNumber -= (@Oct3 * 256);
RETURN CONCAT(@Oct1, '.', @Oct2, '.', @Oct3, '.', @IPNumber);
END;
GO
And then:
SELECT dbo.IPNumberToAddress(3275881995);
-- 195.65.254.11
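The same purely mathematical conversion, sketched in Python as a cross-check (illustrative only; bit shifts are equivalent to the repeated div/mod by 256 above):

```python
def ip_number_to_address(n):
    """Dotted-quad string from a 32-bit integer, via shifts (same as div/mod by 256)."""
    return ".".join(str((n >> shift) & 0xFF) for shift in (24, 16, 8, 0))

print(ip_number_to_address(3275881995))  # 195.65.254.11
```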
For the GetCidrIpRange TVF, you would be better off converting that to be an Inline TVF. You can accomplish the multi-step calculations via CTEs in the following manner (you will just need to clean it up a little / finish it):
WITH cte1 AS
(
SELECT 2 AS [Mask] -- replace with real formula
), cte2 AS
(
SELECT 999 AS [Base], -- replace with real formula
POWER(2.0, 32.0 - cte1.[Mask]) - 1 AS [Power],
cte1.[Mask]
FROM cte1
), cte3 AS
(
SELECT SQL#.INET_AddressToNumber(LEFT(@CidrIp, PATINDEX('%/%', @CidrIp) - 1))
& (cte2.[Base] ^ cte2.[Power]) AS [LowRange],
cte2.[Power],
cte2.[Mask]
FROM cte2
)
SELECT @CidrIp AS [CidrIp],
cte3.[Mask],
cte3.[LowRange],
SQL#.INET_NumberToAddress(cte3.[LowRange]) AS [LowIp],
(cte3.[LowRange] + cte3.[Power]) AS [HighRange],
SQL#.INET_NumberToAddress(cte3.[LowRange] + cte3.[Power]) AS [HighIp],
CONVERT(BIGINT, POWER(2.0, (32.0 - cte3.[Mask]))) AS [AddressQty]
FROM cte3;
Below is an example of my data, table RR_Linest:
Portfolio   Month_number   Collections
A           1              $100
A           2              $90
A           3              $80
A           4              $70
B           1              $100
B           2              $90
B           3              $80
I was able to figure out how to get the slope, intercept, and RSquare for one portfolio by removing the portfolio column and selecting only the month_number (x) and collections (y) data for one selected portfolio (I removed the data for portfolio B) and running the code below.
I have been trying to change the function so that when I run it, it gives me the slope, intercept, and R-square by portfolio. Does someone know how to do that? I have tried many ways and I just can't figure it out.
First I created the function:
declare @RegressionInput_A [dbo].[RegressionInput_A]
insert into @RegressionInput_A (x,y)
select
([model month]),log([collection $])
from [dbo].[RR_Linest]
select * from [dbo].LinearRegression_A
GO
drop function dbo.LinearRegression_A
CREATE FUNCTION dbo.LinearRegression_A
(
    @RegressionInputs_A AS dbo.RegressionInput_A READONLY
)
RETURNS @RegressionOutput_A TABLE
(
    Slope DECIMAL(18, 6),
    Intercept DECIMAL(18, 6),
    RSquare DECIMAL(18, 6)
)
AS
BEGIN
DECLARE @Xaverage AS DECIMAL(18, 6)
DECLARE @Yaverage AS DECIMAL(18, 6)
DECLARE @slope AS DECIMAL(18, 6)
DECLARE @intercept AS DECIMAL(18, 6)
DECLARE @rSquare AS DECIMAL(18, 6)
SELECT
    @Xaverage = AVG(x),
    @Yaverage = AVG(y)
FROM
    @RegressionInputs_A
SELECT
    @slope = SUM((x - @Xaverage) * (y - @Yaverage))/SUM(POWER(x - @Xaverage, 2))
FROM
    @RegressionInputs_A
SELECT
    @intercept = @Yaverage - (@slope * @Xaverage)
SELECT @rSquare = 1 - (SUM(POWER(y - (@intercept + @slope * x), 2))/(SUM(POWER(y - (@intercept + @slope * x), 2)) + SUM(POWER(((@intercept + @slope * x) - @Yaverage), 2))))
FROM
    @RegressionInputs_A
INSERT INTO
    @RegressionOutput_A
(
    Slope,
    Intercept,
    RSquare
)
SELECT
    @slope,
    @intercept,
    @rSquare
RETURN
END
GO
Then I run the function
declare @RegressionInput_A [dbo].[RegressionInput_A]
insert into @RegressionInput_A (x,y)
select
([model month]),log([collection $])
from [dbo].[RR_Linest]
select * from [dbo].[LinearRegression_A](@RegressionInput_A)
Wow, this is a really cool example of how to use nested CTEs in an Inline Table-Valued Function. You want to use an ITVF since they are fast. See Wayne Sheffield's blog article that attests to this fact.
I always start with a sample database/table if it is really complicated to make sure I give the user a correct solution.
Let's create a database named [Test] based on model.
--
-- Create a simple db
--
-- use master
use master;
go
-- delete existing databases
IF EXISTS (SELECT name FROM sys.databases WHERE name = N'Test')
DROP DATABASE Test
GO
-- simple db based on model
create database Test;
go
-- switch to new db
use [Test];
go
Let's create a table type named [InputToLinearReg].
--
-- Create table type to pass data
--
-- Delete the existing table type
IF EXISTS (SELECT * FROM sys.systypes WHERE name = 'InputToLinearReg')
DROP TYPE dbo.InputToLinearReg
GO
-- Create the table type
CREATE TYPE InputToLinearReg AS TABLE
(
portfolio_cd char(1),
month_num int,
collections_amt money
);
go
Okay, here is the multi-layered SELECT statement that uses CTEs. The query analyzer treats this as a single SQL statement which can be executed in parallel, versus a regular multi-statement function that can't. See the black box section of Wayne's article.
--
-- Create in line table value function (fast)
--
-- Remove if it exists
IF OBJECT_ID('CalculateLinearReg') > 0
DROP FUNCTION CalculateLinearReg
GO
-- Create the function
CREATE FUNCTION CalculateLinearReg
(
@ParmInTable AS dbo.InputToLinearReg READONLY
)
RETURNS TABLE
AS
RETURN
(
WITH cteRawData as
(
SELECT
T.portfolio_cd,
CAST(T.month_num as decimal(18, 6)) as x,
LOG(CAST(T.collections_amt as decimal(18, 6))) as y
FROM
@ParmInTable as T
),
cteAvgByPortfolio as
(
SELECT
portfolio_cd,
AVG(x) as xavg,
AVG(y) as yavg
FROM
cteRawData
GROUP BY
portfolio_cd
),
cteSlopeByPortfolio as
(
SELECT
R.portfolio_cd,
SUM((R.x - A.xavg) * (R.y - A.yavg)) / SUM(POWER(R.x - A.xavg, 2)) as slope
FROM
cteRawData as R
INNER JOIN
cteAvgByPortfolio A
ON
R.portfolio_cd = A.portfolio_cd
GROUP BY
R.portfolio_cd
),
cteInterceptByPortfolio as
(
SELECT
A.portfolio_cd,
(A.yavg - (S.slope * A.xavg)) as intercept
FROM
cteAvgByPortfolio as A
INNER JOIN
cteSlopeByPortfolio S
ON
A.portfolio_cd = S.portfolio_cd
)
SELECT
A.portfolio_cd,
A.xavg,
A.yavg,
S.slope,
I.intercept,
1 - (SUM(POWER(R.y - (I.intercept + S.slope * R.x), 2)) /
(SUM(POWER(R.y - (I.intercept + S.slope * R.x), 2)) +
SUM(POWER(((I.intercept + S.slope * R.x) - A.yavg), 2)))) as rsquared
FROM
cteRawData as R
INNER JOIN
cteAvgByPortfolio as A ON R.portfolio_cd = A.portfolio_cd
INNER JOIN
cteSlopeByPortfolio S ON A.portfolio_cd = S.portfolio_cd
INNER JOIN
cteInterceptByPortfolio I ON S.portfolio_cd = I.portfolio_cd
GROUP BY
A.portfolio_cd,
A.xavg,
A.yavg,
S.slope,
I.intercept
);
Last but not least, set up a Table Variable and get the answers. Unlike your solution above, it groups by portfolio id.
-- Load data into variable
DECLARE @InTable AS InputToLinearReg;
-- insert data
insert into @InTable
values
('A', 1, 100.00),
('A', 2, 90.00),
('A', 3, 80.00),
('A', 4, 70.00),
('B', 1, 100.00),
('B', 2, 90.00),
('B', 3, 80.00);
-- show data
select * from CalculateLinearReg(@InTable)
go
The results using your data come back with one row per portfolio.
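If it helps to see the per-portfolio arithmetic outside SQL, here is an illustrative Python sketch of the same slope/intercept/R-squared calculation (not part of the SQL solution; the log transform mirrors cteRawData):

```python
import math

data = [("A", 1, 100.0), ("A", 2, 90.0), ("A", 3, 80.0), ("A", 4, 70.0),
        ("B", 1, 100.0), ("B", 2, 90.0), ("B", 3, 80.0)]

def regress(points):
    """Least-squares fit of y = intercept + slope*x, plus R-squared."""
    xs = [x for x, _ in points]
    ys = [math.log(y) for _, y in points]        # same LOG() as cteRawData
    n = len(xs)
    xavg, yavg = sum(xs) / n, sum(ys) / n
    slope = (sum((x - xavg) * (y - yavg) for x, y in zip(xs, ys))
             / sum((x - xavg) ** 2 for x in xs))
    intercept = yavg - slope * xavg
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ssr = sum(((intercept + slope * x) - yavg) ** 2 for x in xs)
    return slope, intercept, 1 - sse / (sse + ssr)

groups = {}
for pid, x, y in data:
    groups.setdefault(pid, []).append((x, y))
for pid in sorted(groups):
    print(pid, regress(groups[pid]))
```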
CREATE FUNCTION dbo.LinearRegression
(
@RegressionInputs AS dbo.RegressionInput READONLY
)
RETURNS TABLE AS
RETURN
(
WITH
t1 AS ( --calculate averages
SELECT portfolio, x, y,
AVG(x) OVER(PARTITION BY portfolio) Xaverage,
AVG(y) OVER(PARTITION BY portfolio) Yaverage
FROM @RegressionInputs
),
t2 AS ( --calculate slopes
SELECT portfolio, Xaverage, Yaverage,
SUM((x - Xaverage) * (y - Yaverage))/SUM(POWER(x - Xaverage, 2)) slope
FROM t1
GROUP BY portfolio, Xaverage, Yaverage
),
t3 AS ( --calculate intercepts
SELECT portfolio, slope,
(Yaverage - (slope * Xaverage) ) AS intercept
FROM t2
),
t4 AS ( --calculate rSquare
SELECT t1.portfolio, slope, intercept,
1 - (SUM(POWER(y - (intercept + slope * x), 2))/(SUM(POWER(y - (intercept + slope * x), 2)) + SUM(POWER(((intercept + slope * x) - Yaverage), 2)))) AS rSquare
FROM t1
INNER JOIN t3 ON (t1.portfolio = t3.portfolio)
GROUP BY t1.portfolio, slope, intercept
)
SELECT portfolio, slope, intercept, rSquare FROM t4
)
I am grabbing a postcode from a form. I can then convert this postcode to lng,lat coordinates as I have these stored in a table.
SELECT lng, lat from postcodeLngLat WHERE postcode = 'CV1'
I have another table which stores the lng,lat of a selection of venues.
SELECT v.lat, v.lng, v.name, p.lat, p.lng, p.postcode, 'HAVERSINE' AS distance FROM venuepostcodes v, postcodeLngLat p WHERE p.outcode = 'CB6' ORDER BY distance
What I am trying to do is create a datagrid which shows the distance of each venue from the postcode (CV1 in this case). I know that the Haversine formula should do what I am trying to achieve but I'm lost as to where I should start incorporating it into my query. I think the formula needs to go where I've put 'HAVERSINE' in the query above.
Any ideas?
EDIT
SELECT o.outcode AS lead_postcode, v.venue_name, 6371.0E * ( 2.0E *asin(case when 1.0E < (sqrt(square(sin(((RADIANS(CAST(o.lat AS FLOAT)))-(RADIANS(CAST(v.lat AS FLOAT))))/2.0E)) + (cos(RADIANS(CAST(v.lat AS FLOAT))) * cos(RADIANS(CAST(o.lat AS FLOAT))) * square(sin(((RADIANS(CAST(o.lng AS FLOAT)))-(RADIANS(CAST(v.lng AS FLOAT))))/2.0E))))) then 1.0E else (sqrt(square(sin(((RADIANS(CAST(o.lat AS FLOAT)))-(RADIANS(CAST(v.lat AS FLOAT))))/2.0E)) + (cos(RADIANS(CAST(v.lat AS FLOAT))) * cos(RADIANS(CAST(o.lat AS FLOAT))) * square(sin(((RADIANS(CAST(o.lng AS FLOAT)))-(RADIANS(CAST(v.lng AS FLOAT))))/2.0E))))) end )) AS distance FROM venuepostcodes v, outcodepostcodes o WHERE o.outcode = 'CB6' ORDER BY distance
I think you'd do best putting it in a UDF and using that in your query:
SELECT v.lat, v.lng, v.name, p.lat, p.lng, p.postcode, dbo.udf_Haversine(v.lat, v.lng, p.lat, p.lng) AS distance FROM venuepostcodes v, postcodeLngLat p WHERE p.outcode = 'CB6' ORDER BY distance
create function dbo.udf_Haversine(@lat1 float, @long1 float, @lat2 float, @long2 float)
returns float
begin
declare @dlon float, @dlat float, @rlat1 float, @rlat2 float, @rlong1 float, @rlong2 float, @a float, @c float, @R float, @d float, @DtoR float
select @DtoR = 0.017453293
select @R = 3937 --3976
select
@rlat1 = @lat1 * @DtoR,
@rlong1 = @long1 * @DtoR,
@rlat2 = @lat2 * @DtoR,
@rlong2 = @long2 * @DtoR
select
@dlon = @rlong1 - @rlong2,
@dlat = @rlat1 - @rlat2
select @a = power(sin(@dlat/2), 2) + cos(@rlat1) * cos(@rlat2) * power(sin(@dlon/2), 2)
select @c = 2 * atn2(sqrt(@a), sqrt(1-@a))
select @d = @R * @c
return @d
end
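A quick way to sanity-check the UDF is to run the same formula in another language. An illustrative Python version, using the same 3937-mile radius constant as the UDF above (the coordinates in the example are illustrative, roughly London and Cambridge):

```python
import math

def haversine(lat1, lon1, lat2, lon2, r=3937.0):
    """Great-circle distance; r matches the radius constant in the UDF above."""
    rlat1, rlat2 = math.radians(lat1), math.radians(lat2)
    dlat = rlat1 - rlat2
    dlon = math.radians(lon1) - math.radians(lon2)
    a = math.sin(dlat / 2) ** 2 + math.cos(rlat1) * math.cos(rlat2) * math.sin(dlon / 2) ** 2
    return r * 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))

# roughly London -> Cambridge, expected on the order of 50 miles
print(round(haversine(51.5074, -0.1278, 52.2053, 0.1218), 1))
```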
Alternatively, you could also use the SQL Server 2008 geography datatype. If you currently store the longitude/latitude as varchar() in the DB, you would have to store them as the geography datatype and then use a function like STDistance() to get the distance.
Are there any linear regression functions in SQL Server 2005/2008, similar to the linear regression functions in Oracle?
To the best of my knowledge, there is none. Writing one is pretty straightforward, though. The following gives you the constant alpha and slope beta for y = Alpha + Beta * x + epsilon:
-- test data (GroupIDs 1, 2 normal regressions, 3, 4 = no variance)
WITH some_table(GroupID, x, y) AS
( SELECT 1, 1, 1 UNION SELECT 1, 2, 2 UNION SELECT 1, 3, 1.3
UNION SELECT 1, 4, 3.75 UNION SELECT 1, 5, 2.25 UNION SELECT 2, 95, 85
UNION SELECT 2, 85, 95 UNION SELECT 2, 80, 70 UNION SELECT 2, 70, 65
UNION SELECT 2, 60, 70 UNION SELECT 3, 1, 2 UNION SELECT 3, 1, 3
UNION SELECT 4, 1, 2 UNION SELECT 4, 2, 2),
-- linear regression query
/*WITH*/ mean_estimates AS
( SELECT GroupID
,AVG(x * 1.) AS xmean
,AVG(y * 1.) AS ymean
FROM some_table
GROUP BY GroupID
),
stdev_estimates AS
( SELECT pd.GroupID
-- T-SQL STDEV() implementation is not numerically stable
,CASE SUM(SQUARE(x - xmean)) WHEN 0 THEN 1
ELSE SQRT(SUM(SQUARE(x - xmean)) / (COUNT(*) - 1)) END AS xstdev
, SQRT(SUM(SQUARE(y - ymean)) / (COUNT(*) - 1)) AS ystdev
FROM some_table pd
INNER JOIN mean_estimates pm ON pm.GroupID = pd.GroupID
GROUP BY pd.GroupID, pm.xmean, pm.ymean
),
standardized_data AS -- increases numerical stability
( SELECT pd.GroupID
,(x - xmean) / xstdev AS xstd
,CASE ystdev WHEN 0 THEN 0 ELSE (y - ymean) / ystdev END AS ystd
FROM some_table pd
INNER JOIN stdev_estimates ps ON ps.GroupID = pd.GroupID
INNER JOIN mean_estimates pm ON pm.GroupID = pd.GroupID
),
standardized_beta_estimates AS
( SELECT GroupID
,CASE WHEN SUM(xstd * xstd) = 0 THEN 0
ELSE SUM(xstd * ystd) / (COUNT(*) - 1) END AS betastd
FROM standardized_data pd
GROUP BY GroupID
)
SELECT pb.GroupID
,ymean - xmean * betastd * ystdev / xstdev AS Alpha
,betastd * ystdev / xstdev AS Beta
FROM standardized_beta_estimates pb
INNER JOIN stdev_estimates ps ON ps.GroupID = pb.GroupID
INNER JOIN mean_estimates pm ON pm.GroupID = pb.GroupID
Here GroupID is used to show how to group by some value in your source data table. If you just want the statistics across all data in the table (not specific sub-groups), you can drop it and the joins. I have used the WITH statement for the sake of clarity. As an alternative, you can use sub-queries instead. Please be mindful of the precision of the data type used in your tables, as numerical stability can deteriorate quickly if the precision is not high enough relative to your data.
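To see why the standardization leaves the answer unchanged, note that Beta = betastd * ystdev / xstdev collapses back to cov(x,y)/var(x). A small Python sketch of the same pipeline (illustrative only, mirroring the CTE steps above, including the zero-variance guards):

```python
import math

def alpha_beta(xs, ys):
    """Alpha/Beta via standardized data, mirroring the CTE pipeline above."""
    n = len(xs)
    xmean, ymean = sum(xs) / n, sum(ys) / n
    xstdev = math.sqrt(sum((x - xmean) ** 2 for x in xs) / (n - 1)) or 1.0  # 0 -> 1 guard
    ystdev = math.sqrt(sum((y - ymean) ** 2 for y in ys) / (n - 1))
    xstd = [(x - xmean) / xstdev for x in xs]
    ystd = [0.0 if ystdev == 0 else (y - ymean) / ystdev for y in ys]
    betastd = sum(a * b for a, b in zip(xstd, ystd)) / (n - 1)
    beta = betastd * ystdev / xstdev
    return ymean - xmean * beta, beta

# GroupID 1 from the test data above
print(alpha_beta([1, 2, 3, 4, 5], [1, 2, 1.3, 3.75, 2.25]))
```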
EDIT: (in answer to Peter's question for additional statistics like R2 in the comments)
You can easily calculate additional statistics using the same technique. Here is a version with R2, correlation, and sample covariance:
-- test data (GroupIDs 1, 2 normal regressions, 3, 4 = no variance)
WITH some_table(GroupID, x, y) AS
( SELECT 1, 1, 1 UNION SELECT 1, 2, 2 UNION SELECT 1, 3, 1.3
UNION SELECT 1, 4, 3.75 UNION SELECT 1, 5, 2.25 UNION SELECT 2, 95, 85
UNION SELECT 2, 85, 95 UNION SELECT 2, 80, 70 UNION SELECT 2, 70, 65
UNION SELECT 2, 60, 70 UNION SELECT 3, 1, 2 UNION SELECT 3, 1, 3
UNION SELECT 4, 1, 2 UNION SELECT 4, 2, 2),
-- linear regression query
/*WITH*/ mean_estimates AS
( SELECT GroupID
,AVG(x * 1.) AS xmean
,AVG(y * 1.) AS ymean
FROM some_table pd
GROUP BY GroupID
),
stdev_estimates AS
( SELECT pd.GroupID
-- T-SQL STDEV() implementation is not numerically stable
,CASE SUM(SQUARE(x - xmean)) WHEN 0 THEN 1
ELSE SQRT(SUM(SQUARE(x - xmean)) / (COUNT(*) - 1)) END AS xstdev
, SQRT(SUM(SQUARE(y - ymean)) / (COUNT(*) - 1)) AS ystdev
FROM some_table pd
INNER JOIN mean_estimates pm ON pm.GroupID = pd.GroupID
GROUP BY pd.GroupID, pm.xmean, pm.ymean
),
standardized_data AS -- increases numerical stability
( SELECT pd.GroupID
,(x - xmean) / xstdev AS xstd
,CASE ystdev WHEN 0 THEN 0 ELSE (y - ymean) / ystdev END AS ystd
FROM some_table pd
INNER JOIN stdev_estimates ps ON ps.GroupID = pd.GroupID
INNER JOIN mean_estimates pm ON pm.GroupID = pd.GroupID
),
standardized_beta_estimates AS
( SELECT GroupID
,CASE WHEN SUM(xstd * xstd) = 0 THEN 0
ELSE SUM(xstd * ystd) / (COUNT(*) - 1) END AS betastd
FROM standardized_data
GROUP BY GroupID
)
SELECT pb.GroupID
,ymean - xmean * betastd * ystdev / xstdev AS Alpha
,betastd * ystdev / xstdev AS Beta
,CASE ystdev WHEN 0 THEN 1 ELSE betastd * betastd END AS R2
,betastd AS Correl
,betastd * xstdev * ystdev AS Covar
FROM standardized_beta_estimates pb
INNER JOIN stdev_estimates ps ON ps.GroupID = pb.GroupID
INNER JOIN mean_estimates pm ON pm.GroupID = pb.GroupID
EDIT 2 improves numerical stability by standardizing data (instead of only centering) and by replacing STDEV because of numerical stability issues. To me, the current implementation seems to be the best trade-off between stability and complexity. I could improve stability by replacing my standard deviation with a numerically stable online algorithm, but this would complicate the implementation substantially (and slow it down). Similarly, implementations using e.g. Kahan(-Babuška-Neumaier) compensation for SUM and AVG seem to perform modestly better in limited tests, but make the query much more complex. And as long as I do not know how T-SQL implements SUM and AVG (e.g. it might already be using pairwise summation), I cannot guarantee that such modifications always improve accuracy.
This is an alternate method, based on a blog post on Linear Regression in T-SQL; the equations it uses appear in the code comment below.
The SQL suggestion in the blog uses cursors though. Here's a prettified version of a forum answer that I used:
table
-----
X (numeric)
Y (numeric)
/**
* m = (nSxy - SxSy) / (nSxx - SxSx)
* b = Ay - (Ax * m)
* N.B. S = Sum, A = Mean
*/
DECLARE @n INT
SELECT @n = COUNT(*) FROM table
SELECT (@n * SUM(X*Y) - SUM(X) * SUM(Y)) / (@n * SUM(X*X) - SUM(X) * SUM(X)) AS M,
AVG(Y) - AVG(X) *
(@n * SUM(X*Y) - SUM(X) * SUM(Y)) / (@n * SUM(X*X) - SUM(X) * SUM(X)) AS B
FROM table
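The m and b formulas in the comment can be verified on a known line. A hedged Python sketch of the same closed form (illustrative names):

```python
def slope_intercept(xs, ys):
    """m = (nSxy - SxSy) / (nSxx - SxSx); b = Ay - Ax*m  (S = sum, A = mean)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    m = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = sy / n - (sx / n) * m
    return m, b

print(slope_intercept([1, 2, 3, 4], [3, 5, 7, 9]))  # exact line y = 2x + 1 -> (2.0, 1.0)
```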
I've actually written an SQL routine using Gram-Schmidt orthogonalization. It, as well as other machine learning and forecasting routines, is available at sqldatamine.blogspot.com
At the suggestion of Brad Larson I've added the code here rather than just directing users to my blog. This produces the same results as the LINEST function in Excel. My primary source is The Elements of Statistical Learning (2008) by Hastie, Tibshirani and Friedman.
--Create a table of data
create table #rawdata (id int,area float, rooms float, odd float, price float)
insert into #rawdata select 1, 2201,3,1,400
insert into #rawdata select 2, 1600,3,0,330
insert into #rawdata select 3, 2400,3,1,369
insert into #rawdata select 4, 1416,2,1,232
insert into #rawdata select 5, 3000,4,0,540
--Insert the data into x & y vectors
select id xid, 0 xn,1 xv into #x from #rawdata
union all
select id, 1,rooms from #rawdata
union all
select id, 2,area from #rawdata
union all
select id, 3,odd from #rawdata
select id yid, 0 yn, price yv into #y from #rawdata
--create a residuals table and insert the intercept (1)
create table #z (zid int, zn int, zv float)
insert into #z select id , 0 zn,1 zv from #rawdata
--create a table for the orthogonal (#c) & regression (#b) parameters
create table #c(cxn int, czn int, cv float)
create table #b(bn int, bv float)
--@p is the number of independent variables including the intercept (@p = 0)
declare @p int
set @p = 1
--Loop through each independent variable and estimate the orthogonal parameter (#c)
-- then estimate the residuals and insert into the residuals table (#z)
while @p <= (select max(xn) from #x)
begin
insert into #c
select xn cxn, zn czn, sum(xv*zv)/sum(zv*zv) cv
from #x join #z on xid = zid where zn = @p-1 and xn>zn group by xn, zn
insert into #z
select zid, xn,xv- sum(cv*zv)
from #x join #z on xid = zid join #c on czn = zn and cxn = xn where xn = @p and zn<xn group by zid, xn,xv
set @p = @p + 1
end
--Loop through each independent variable and estimate the regression parameter by regressing the orthogonal
-- residuals on the dependent variable y
while @p >= 0
begin
insert into #b
select zn, sum(yv*zv)/ sum(zv*zv)
from #z join
(select yid, yv-isnull(sum(bv*xv),0) yv from #x join #y on xid = yid left join #b on xn=bn group by yid, yv) y
on zid = yid where zn = @p group by zn
set @p = @p - 1
end
--The regression parameters
select * from #b
--Actual vs. fit with error
select yid, yv, fit, yv-fit err from #y join
(select xid, sum(xv*bv) fit from #x join #b on xn = bn group by xid) f
on yid = xid
--R Squared
select 1-sum(power(err,2))/sum(power(yv,2)) from
(select yid, yv, fit, yv-fit err from #y join
(select xid, sum(xv*bv) fit from #x join #b on xn = bn group by xid) f
on yid = xid) d
There are no linear regression functions in SQL Server. But to calculate a Simple Linear Regression (Y' = bX + A) between pairs of data points x,y - including the calculation of the Correlation Coefficient, Coefficient of Determination (R^2) and Standard Estimate of Error (Standard Deviation), do the following:
For a table regression_data with numeric columns x and y:
declare @total_points int
declare @intercept DECIMAL(38, 10)
declare @slope DECIMAL(38, 10)
declare @r_squared DECIMAL(38, 10)
declare @standard_estimate_error DECIMAL(38, 10)
declare @correlation_coefficient DECIMAL(38, 10)
declare @average_x DECIMAL(38, 10)
declare @average_y DECIMAL(38, 10)
declare @sumX DECIMAL(38, 10)
declare @sumY DECIMAL(38, 10)
declare @sumXX DECIMAL(38, 10)
declare @sumYY DECIMAL(38, 10)
declare @sumXY DECIMAL(38, 10)
declare @Sxx DECIMAL(38, 10)
declare @Syy DECIMAL(38, 10)
declare @Sxy DECIMAL(38, 10)
Select
@total_points = count(*),
@average_x = avg(x),
@average_y = avg(y),
@sumX = sum(x),
@sumY = sum(y),
@sumXX = sum(x*x),
@sumYY = sum(y*y),
@sumXY = sum(x*y)
from regression_data
set @Sxx = @sumXX - (@sumX * @sumX) / @total_points
set @Syy = @sumYY - (@sumY * @sumY) / @total_points
set @Sxy = @sumXY - (@sumX * @sumY) / @total_points
set @correlation_coefficient = @Sxy / SQRT(@Sxx * @Syy)
set @slope = (@total_points * @sumXY - @sumX * @sumY) / (@total_points * @sumXX - power(@sumX,2))
set @intercept = @average_y - (@total_points * @sumXY - @sumX * @sumY) / (@total_points * @sumXX - power(@sumX,2)) * @average_x
set @r_squared = (@intercept * @sumY + @slope * @sumXY - power(@sumY,2) / @total_points) / (@sumYY - power(@sumY,2) / @total_points)
-- calculate standard_estimate_error (standard deviation)
Select
@standard_estimate_error = sqrt(sum(power(y - (@slope * x + @intercept),2)) / @total_points)
From regression_data
Here it is as a function that takes a table-type parameter of the form table (Y float, X float), which is
called XYDoubleType, and assumes our linear function is of the form AX + B. It returns A, B, and R-square as a table, just in case you want to have it in a join or something
CREATE FUNCTION FN_GetABForData(
@XYData as XYDoubleType READONLY
) RETURNS @ABData TABLE(
A FLOAT,
B FLOAT,
Rsquare FLOAT )
AS
BEGIN
DECLARE @sx FLOAT, @sy FLOAT
DECLARE @sxx FLOAT, @syy FLOAT, @sxy FLOAT, @sxsy FLOAT, @sxsx FLOAT, @sysy FLOAT
DECLARE @n FLOAT, @A FLOAT, @B FLOAT, @Rsq FLOAT
SELECT @sx = SUM(D.X), @sy = SUM(D.Y), @sxx = SUM(D.X*D.X), @syy = SUM(D.Y*D.Y),
@sxy = SUM(D.X*D.Y), @n = COUNT(*)
From @XYData D
SET @sxsx = @sx*@sx
SET @sxsy = @sx*@sy
SET @sysy = @sy*@sy
SET @A = (@n*@sxy - @sxsy)/(@n*@sxx - @sxsx)
SET @B = @sy/@n - @A*@sx/@n
SET @Rsq = POWER((@n*@sxy - @sxsy),2)/((@n*@sxx - @sxsx)*(@n*@syy - @sysy))
INSERT INTO @ABData (A,B,Rsquare) VALUES(@A,@B,@Rsq)
RETURN
END
To add to @icc97's answer, I have included the weighted versions for the slope and the intercept. If the values are all constant, the slope will be NULL (with the appropriate settings SET ARITHABORT OFF; SET ANSI_WARNINGS OFF;) and will need to be substituted with 0 via coalesce().
Here is a solution written in SQL:
with d as (select segment,w,x,y from somedatasource)
select segment,
avg(y) - avg(x) *
((count(*) * sum(x*y)) - (sum(x)*sum(y)))/
((count(*) * sum(x*x)) - (Sum(x)*Sum(x))) as intercept,
((count(*) * sum(x*y)) - (sum(x)*sum(y)))/
((count(*) * sum(x*x)) - (sum(x)*sum(x))) AS slope,
avg(y) - ((avg(x*y) - avg(x)*avg(y))/var_samp(X)) * avg(x) as interceptUnstable,
(avg(x*y) - avg(x)*avg(y))/var_samp(X) as slopeUnstable,
(Avg(x * y) - Avg(x) * Avg(y)) / (stddev_pop(x) * stddev_pop(y)) as correlationUnstable,
(sum(y*w)/sum(w)) - (sum(w*x)/sum(w)) *
((sum(w)*sum(x*y*w)) - (sum(x*w)*sum(y*w)))/
((sum(w)*sum(x*x*w)) - (sum(x*w)*sum(x*w))) as wIntercept,
((sum(w)*sum(x*y*w)) - (sum(x*w)*sum(y*w)))/
((sum(w)*sum(x*x*w)) - (sum(x*w)*sum(x*w))) as wSlope,
(count(*) * sum(x * y) - sum(x) * sum(y)) / (sqrt(count(*) * sum(x * x) - sum(x) * sum(x))
* sqrt(count(*) * sum(y * y) - sum(y) * sum(y))) as correlation,
(sum(w) * sum(x*y*w) - sum(x*w) * sum(y*w)) /
(sqrt(sum(w) * sum(x*x*w) - sum(x*w) * sum(x*w)) * sqrt(sum(w) * sum(y*y*w)
- sum(y*w) * sum(y*w))) as wCorrelation,
count(*) as n
from d where x is not null and y is not null group by segment
Where w is the weight. I double checked this against R to confirm the results.
One may need to cast the data from somedatasource to floating point.
I included the unstable versions to warn you against those. (Special thanks goes to Stephan in another answer.)
Update: added weighted correlation
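The weighted expressions reduce to the unweighted ones when every w is equal, which gives an easy sanity check. An illustrative Python sketch of the wSlope/wIntercept algebra above:

```python
def weighted_fit(xs, ys, ws):
    """Weighted least squares, mirroring the wSlope/wIntercept expressions above."""
    sw = sum(ws)
    swx = sum(w * x for w, x in zip(ws, xs))
    swy = sum(w * y for w, y in zip(ws, ys))
    swxy = sum(w * x * y for w, x, y in zip(ws, xs, ys))
    swxx = sum(w * x * x for w, x in zip(ws, xs))
    slope = (sw * swxy - swx * swy) / (sw * swxx - swx * swx)
    intercept = swy / sw - (swx / sw) * slope
    return slope, intercept

# equal weights reduce to the ordinary fit of y = 2x
print(weighted_fit([1, 2, 3], [2, 4, 6], [1.0, 1.0, 1.0]))  # (2.0, 0.0)
```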
I have translated the linear regression calculation used by the FORECAST function in Excel, and created an SQL function that returns a, b, and the forecast.
You can see the complete theoretical explanation in the Excel help for the FORECAST function.
First of all you will need to create the table type XYFloatType:
CREATE TYPE [dbo].[XYFloatType]
AS TABLE(
[X] FLOAT,
[Y] FLOAT)
Then write the following function:
/*
-- =============================================
-- Author: Me :)
-- Create date: Today :)
-- Description: (Copied Excel help):
--Calculates, or predicts, a future value by using existing values.
The predicted value is a y-value for a given x-value.
The known values are existing x-values and y-values, and the new value is predicted by using linear regression.
You can use this function to predict future sales, inventory requirements, or consumer trends.
-- =============================================
*/
CREATE FUNCTION dbo.FN_GetLinearRegressionForcast
(@PtXYData as XYFloatType READONLY, @PnFuturePoint int)
RETURNS @ABDData TABLE( a FLOAT, b FLOAT, Forecast FLOAT)
AS
BEGIN
DECLARE @LnAvX Float
,@LnAvY Float
,@LnB Float
,@LnA Float
,@LnForeCast Float
Select @LnAvX = AVG([X])
,@LnAvY = AVG([Y])
FROM @PtXYData;
SELECT @LnB = SUM ( ([X]-@LnAvX)*([Y]-@LnAvY) ) / SUM (POWER([X]-@LnAvX,2))
FROM @PtXYData;
SET @LnA = @LnAvY - @LnB * @LnAvX;
SET @LnForeCast = @LnA + @LnB * @PnFuturePoint;
INSERT INTO @ABDData (a, b, Forecast) VALUES (@LnA, @LnB, @LnForeCast)
RETURN
END
/*
your tests:
(I used the same values that are in the Excel help)
DECLARE @t XYFloatType
INSERT @t VALUES(20,6),(28,7),(31,9),(38,15),(40,21) -- x and y values
SELECT *, a+b*30 AS [Test] FROM dbo.FN_GetLinearRegressionForcast(@t, 30);
*/
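The FORECAST math can be reproduced outside SQL with the same x/y values as the test block above. An illustrative Python sketch:

```python
def forecast(x_new, xs, ys):
    """Excel-style FORECAST: least-squares fit of y = a + b*x, evaluated at x_new."""
    n = len(xs)
    xavg, yavg = sum(xs) / n, sum(ys) / n
    b = (sum((x - xavg) * (y - yavg) for x, y in zip(xs, ys))
         / sum((x - xavg) ** 2 for x in xs))
    a = yavg - b * xavg
    return a + b * x_new

# same values as the Excel help example used above
print(round(forecast(30, [20, 28, 31, 38, 40], [6, 7, 9, 15, 21]), 6))  # 10.607253
```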
I hope the following answer helps one understand where some of the solutions come from. I am going to illustrate it with a simple example, but the generalization to many variables is theoretically straightforward as long as you know how to use index notation or matrices. For implementing the solution for anything beyond 3 variables you'll need Gram-Schmidt orthogonalization (see Colin Campbell's answer above) or another matrix inversion algorithm.
Since all the functions we need (variance, covariance, average, sum, etc.) are aggregation functions in SQL, one can easily implement the solution. I've done so in HIVE to do linear calibration of the scores of a Logistic model; amongst many advantages, one is that you can work entirely within HIVE without going out and back in from some scripting language.
The model for your data (x_1, x_2, y) where your data points are indexed by i, is
y(x_1, x_2) = m_1*x_1 + m_2*x_2 + c
The model appears "linear", but needn't be. For example, x_2 can be any non-linear function of x_1, as long as it has no free parameters in it, e.g. x_2 = Sinh(3*(x_1)^2 + 42). Even if x_2 is "just" x_2 and the model is linear, the regression problem isn't. Only when you decide that the problem is to find the parameters m_1, m_2, c such that they minimize the L2 error do you have a Linear Regression problem.
The L2 error is sum_i( (y[i] - f(x_1[i], x_2[i]))^2 ). Minimizing this w.r.t. the 3 parameters (set the partial derivatives w.r.t. each parameter = 0) yields 3 linear equations for 3 unknowns. These equations are LINEAR in the parameters (this is what makes it Linear Regression) and can be solved analytically. Doing this for a simple model (1 variable, linear model, hence two parameters) is straightforward and instructive. The generalization to a non-Euclidean metric norm on the error vector space is straightforward, the diagonal special case amounts to using "weights".
Back to our model in two variables:
y = m_1*x_1 + m_2*x_2 + c
Take the expectation value of both sides:

E[y] = m_1*E[x_1] + m_2*E[x_2] + c    (0)
Now take the covariance w.r.t. x_1 and x_2, and use cov(x,x) = var(x):
cov(y, x_1) = m_1*var(x_1) + m_2*covar(x_2, x_1) (1)
cov(y, x_2) = m_1*covar(x_1, x_2) + m_2*var(x_2) (2)
These are two equations in two unknowns, which you can solve by inverting the 2X2 matrix.
In matrix form:

[cov(y, x_1)]   [var(x_1)         covar(x_1, x_2)] [m_1]
[cov(y, x_2)] = [covar(x_1, x_2)  var(x_2)       ] [m_2]

which can be inverted to yield

[m_1]             [ var(x_2)         -covar(x_1, x_2)] [cov(y, x_1)]
[m_2] = (1/det) * [-covar(x_1, x_2)   var(x_1)       ] [cov(y, x_2)]

where

det = var(x_1)*var(x_2) - covar(x_1, x_2)^2
In any case, now that you have m1 and m2 in closed form, you can solve (0) for c.
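Putting the pieces together for the two-variable case, here is an illustrative Python sketch that solves equations (1) and (2) for m_1 and m_2 by inverting the 2x2 covariance matrix, then uses (0) for c (data and names are made up for the example):

```python
def fit_two_vars(x1, x2, y):
    """Solve the covariance equations (1),(2) for m_1, m_2, then (0) for c."""
    n = len(y)
    def mean(v):
        return sum(v) / n
    def cov(a, b):
        ma, mb = mean(a), mean(b)
        return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / n
    v1, v2, c12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)
    cy1, cy2 = cov(y, x1), cov(y, x2)
    det = v1 * v2 - c12 * c12          # as derived above
    m1 = (cy1 * v2 - cy2 * c12) / det
    m2 = (cy2 * v1 - cy1 * c12) / det
    c = mean(y) - m1 * mean(x1) - m2 * mean(x2)
    return m1, m2, c

# a noiseless plane y = 2*x1 - 3*x2 + 5 is recovered (up to float rounding)
x1 = [0, 1, 2, 3, 1]
x2 = [1, 0, 1, 2, 2]
y = [2 * a - 3 * b + 5 for a, b in zip(x1, x2)]
print(fit_two_vars(x1, x2, y))
```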
I checked the analytical solution above against Excel's Solver for a quadratic with Gaussian noise, and the residual errors agree to 6 significant digits.
Contact me if you want to do Discrete Fourier Transform in SQL in about 20 lines.