Linear Regression analysis for Date column in SQL Server - sql

I have the following block of code that calculates the formula for a trend line using linear regression (the method of least squares). It finds the R-squared value and the regression coefficients (Alpha and Beta) for the X and Y values.
It calculates the correct values when X and Y are int or float.
CREATE FUNCTION [dbo].[LinearRegression] (@Data AS XML)
RETURNS TABLE AS RETURN (
WITH Array AS (
SELECT x = n.value('@x', 'float'),
y = n.value('@y', 'float')
FROM @Data.nodes('/r/n') v(n)
),
Medians AS (
SELECT xbar = AVG(x), ybar = AVG(y)
FROM Array ),
BetaCalc AS (
SELECT Beta = SUM(xdelta * (y - ybar)) / NULLIF(SUM(xdelta * xdelta), 0)
FROM Array
CROSS JOIN Medians
CROSS APPLY ( SELECT xdelta = (x - xbar) ) xd ),
AlphaCalc AS (
SELECT Alpha = ybar - xbar * beta
FROM Medians
CROSS JOIN BetaCalc),
SSCalc AS (
SELECT SS_tot = SUM((y - ybar) * (y - ybar)),
SS_err = SUM((y - (Alpha + Beta * x)) * (y - (Alpha + Beta * x)))
FROM Array
CROSS JOIN Medians
CROSS JOIN AlphaCalc
CROSS JOIN BetaCalc )
SELECT r_squared = CASE WHEN SS_tot = 0 THEN 1.0
ELSE 1.0 - ( SS_err / SS_tot ) END,
Alpha, Beta
FROM AlphaCalc
CROSS JOIN BetaCalc
CROSS JOIN SSCalc
)
Usage:
DECLARE @DataTable TABLE (
SourceID INT,
x FLOAT,
y FLOAT
) ;
INSERT INTO @DataTable ( SourceID, x, y )
SELECT ID = 0, x = 1.2, y = 1.0
UNION ALL SELECT 1, 1.6, 1
UNION ALL SELECT 2, 2.0, 1.5
UNION ALL SELECT 3, 2.0, 1.75
UNION ALL SELECT 4, 2.1, 1.85
UNION ALL SELECT 5, 2.1, 2
UNION ALL SELECT 6, 2.2, 3
UNION ALL SELECT 7, 2.2, 3
UNION ALL SELECT 8, 2.3, 3.5
UNION ALL SELECT 9, 2.4, 4
UNION ALL SELECT 10, 2.5, 4
UNION ALL SELECT 11, 3, 4.5 ;
-- Create and view XML data array
DECLARE @DataXML XML ;
SET @DataXML = (
SELECT -- FLOAT values are formatted in XML like "1.000000000000000e+000", increasing the character count
-- Converting them to VARCHAR first keeps the XML small without sacrificing precision
-- They are unpacked as FLOAT in the function either way
[@x] = CAST(x AS VARCHAR(20)),
[@y] = CAST(y AS VARCHAR(20))
FROM @DataTable
FOR XML PATH('n'), ROOT('r') ) ;
SELECT @DataXML ;
-- Get the results
SELECT * FROM dbo.LinearRegression (@DataXML) ;
In my case the X axis may also be a Date column. How can I calculate the same regression analysis with date columns?

The short answer is: calculating a trend line for dates is pretty much the same as calculating a trend line for floats.
For dates you can choose some starting date and use the number of days between the starting date and each of your dates as X.
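For example, a minimal illustration (the starting date is arbitrary; 2001-01-01 is used here only to match the function further down):
-- 23 days between the starting date and the sample date, cast to float so the regression arithmetic stays floating-point
SELECT CAST(DATEDIFF(day, '2001-01-01', '2001-01-24') AS float) AS x;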
I didn't check your function itself, and I assume that the formulas there are correct.
Also, I don't understand why you generate XML out of the table and then parse it back into a table inside the function. It is rather inefficient. You can simply pass the table.
I used your function to make two variants: one for processing floats and one for processing dates.
I'm using SQL Server 2008 for this example.
First, create a user-defined table type so that we can pass a table into the function:
CREATE TYPE [dbo].[FloatRegressionDataTableType] AS TABLE(
[x] [float] NOT NULL,
[y] [float] NOT NULL
)
GO
Then create the function that accepts such a table:
CREATE FUNCTION [dbo].[LinearRegressionFloat] (@ParamData dbo.FloatRegressionDataTableType READONLY)
RETURNS TABLE AS RETURN (
WITH Array AS (
SELECT x,
y
FROM @ParamData
),
Medians AS (
SELECT xbar = AVG(x), ybar = AVG(y)
FROM Array ),
BetaCalc AS (
SELECT Beta = SUM(xdelta * (y - ybar)) / NULLIF(SUM(xdelta * xdelta), 0)
FROM Array
CROSS JOIN Medians
CROSS APPLY ( SELECT xdelta = (x - xbar) ) xd ),
AlphaCalc AS (
SELECT Alpha = ybar - xbar * beta
FROM Medians
CROSS JOIN BetaCalc),
SSCalc AS (
SELECT SS_tot = SUM((y - ybar) * (y - ybar)),
SS_err = SUM((y - (Alpha + Beta * x)) * (y - (Alpha + Beta * x)))
FROM Array
CROSS JOIN Medians
CROSS JOIN AlphaCalc
CROSS JOIN BetaCalc )
SELECT r_squared = CASE WHEN SS_tot = 0 THEN 1.0
ELSE 1.0 - ( SS_err / SS_tot ) END,
Alpha, Beta
FROM AlphaCalc
CROSS JOIN BetaCalc
CROSS JOIN SSCalc
)
GO
Very similarly, create a type for a table with dates:
CREATE TYPE [dbo].[DateRegressionDataTableType] AS TABLE(
[x] [date] NOT NULL,
[y] [float] NOT NULL
)
GO
And create a function that accepts such a table. For each given date it calculates the number of days between 2001-01-01 and the given date x using DATEDIFF, and then casts the result to float to make sure that the rest of the calculations stays correct. If you remove the cast to float you'll see a different result. You can choose any other starting date; it doesn't have to be 2001-01-01.
CREATE FUNCTION [dbo].[LinearRegressionDate] (@ParamData dbo.DateRegressionDataTableType READONLY)
RETURNS TABLE AS RETURN (
WITH Array AS (
SELECT CAST(DATEDIFF(day, '2001-01-01', x) AS float) AS x,
y
FROM @ParamData
),
Medians AS (
SELECT xbar = AVG(x), ybar = AVG(y)
FROM Array ),
BetaCalc AS (
SELECT Beta = SUM(xdelta * (y - ybar)) / NULLIF(SUM(xdelta * xdelta), 0)
FROM Array
CROSS JOIN Medians
CROSS APPLY ( SELECT xdelta = (x - xbar) ) xd ),
AlphaCalc AS (
SELECT Alpha = ybar - xbar * beta
FROM Medians
CROSS JOIN BetaCalc),
SSCalc AS (
SELECT SS_tot = SUM((y - ybar) * (y - ybar)),
SS_err = SUM((y - (Alpha + Beta * x)) * (y - (Alpha + Beta * x)))
FROM Array
CROSS JOIN Medians
CROSS JOIN AlphaCalc
CROSS JOIN BetaCalc )
SELECT r_squared = CASE WHEN SS_tot = 0 THEN 1.0
ELSE 1.0 - ( SS_err / SS_tot ) END,
Alpha, Beta
FROM AlphaCalc
CROSS JOIN BetaCalc
CROSS JOIN SSCalc
)
GO
This is how to test the functions:
-- test float data
DECLARE @FloatDataTable [dbo].[FloatRegressionDataTableType];
INSERT INTO @FloatDataTable (x, y)
VALUES
(1.2, 1.0)
,(1.6, 1)
,(2.0, 1.5)
,(2.0, 1.75)
,(2.1, 1.85)
,(2.1, 2)
,(2.2, 3)
,(2.2, 3)
,(2.3, 3.5)
,(2.4, 4)
,(2.5, 4)
,(3, 4.5);
SELECT * FROM dbo.LinearRegressionFloat(@FloatDataTable);
-- test date data
DECLARE @DateDataTable [dbo].[DateRegressionDataTableType];
INSERT INTO @DateDataTable (x, y)
VALUES
('2001-01-13', 1.0)
,('2001-01-17', 1)
,('2001-01-21', 1.5)
,('2001-01-21', 1.75)
,('2001-01-22', 1.85)
,('2001-01-22', 2)
,('2001-01-23', 3)
,('2001-01-23', 3)
,('2001-01-24', 3.5)
,('2001-01-25', 4)
,('2001-01-26', 4)
,('2001-01-31', 4.5);
SELECT * FROM dbo.LinearRegressionDate(@DateDataTable);
Here are the two result sets:
r_squared Alpha Beta
----------------------------------------------------------
0.798224907472009 -2.66524390243902 2.46417682926829
r_squared Alpha Beta
----------------------------------------------------------
0.79822490747201 -2.66524390243902 0.246417682926829
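As a quick sanity check, here is a hand-rolled sketch that plugs a date back into the fitted line, using the Alpha and Beta returned by the date variant above (the starting date must match the one hard-coded in the function):
DECLARE @Alpha FLOAT = -2.66524390243902;
DECLARE @Beta FLOAT = 0.246417682926829;
-- predicted y for 2001-01-24 = Alpha + Beta * (days since the starting date); roughly 3.0 here
SELECT @Alpha + @Beta * CAST(DATEDIFF(day, '2001-01-01', '2001-01-24') AS FLOAT) AS predicted_y;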

Related

Some numbers are getting truncated. I would like to pull all numbers

I'm trying to parse only numbers from a string. My code must be pretty close, but something is off here, because several numbers in the last string are being truncated, although the first two strings seem fine.
Here is my code.
Drop Table SampleData
Create table SampleData
(id int, factor varchar(100))
insert into SampleData values (1 ,'AAA 1.058 (Protection Class)')
insert into SampleData values (2, 'BBB0.565 (Construction) ')
insert into SampleData values ( 3, 'CCCCC 1.04890616 (Building Limit Rel')
Select *
From SampleData
;with processTable as (
select id, factor, num
from SampleData
cross apply (
select (select C + ''
from (select N, substring(factor, N, 1) C from (values(1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12)) Num(N) where N<=datalength(factor)) t
where PATINDEX('%[0-9.]%',C)> 0
order by N
for xml path(''))
) p0 (num)
)
SELECT id, factor, num
FROM processTable
This is the result that I get.
In the num column, instead of 1.04, I would like to see the full precision, so: 1.04890616
I would think something like this:
select s.*, v2.numstr
from sampledata s cross apply
(values (stuff(factor, 1, patindex('%[0-9]%', factor) - 1, ''))) v(str) cross apply
(values (left(v.str, patindex('%[^0-9.]%', v.str + 'x') - 1))) v2(numstr);
Here is a SQL Fiddle.
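For reference, with the three sample rows above the query should return the full numeric substrings (hand-traced from the PATINDEX logic, not an actual run):
id  factor                                 numstr
1   AAA 1.058 (Protection Class)           1.058
2   BBB0.565 (Construction)                0.565
3   CCCCC 1.04890616 (Building Limit Rel   1.04890616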

Setting multiple variables' value in one IIF() of a SELECT statement

I have a CROSS JOIN query that I am using to see which combination of item quantities yields the best output.
DECLARE @last_found DECIMAL(10, 2) = 0
DECLARE @calculated DECIMAL(10, 2)
DECLARE @n_count INT
DECLARE @tbl1n INT
DECLARE @tbl2n INT
DECLARE @tbl3n INT
DROP TABLE IF EXISTS #tbl1
DROP TABLE IF EXISTS #tbl2
DROP TABLE IF EXISTS #tbl3
;WITH numbers AS (
SELECT ROW_NUMBER() OVER (ORDER BY [value]) AS n
FROM string_split('1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20', ',')
)
SELECT n, (n * 10000 * (1 + IIF(n > 1, (0.50/19.00) * (n - 1), 0))) AS price
INTO #tbl1 FROM numbers
;WITH numbers AS (
SELECT ROW_NUMBER() OVER (ORDER BY [value]) AS n
FROM string_split('1,2,3,4,5,6,7,8,9,10,11,12', ',')
)
SELECT n, (n * 15000 * (1 + IIF(n > 1, (0.50/11.00) * (n - 1), 0))) AS price
INTO #tbl2 FROM numbers
;WITH numbers AS (
SELECT ROW_NUMBER() OVER (ORDER BY [value]) AS n
FROM string_split('1,2,3,4,5,6', ',')
)
SELECT n, (n * 20000 * (1 + IIF(n > 1, (0.50/5.00) * (n - 1), 0))) AS price
INTO #tbl3 FROM numbers
SELECT
@n_count = (tbl1.n + tbl2.n + tbl3.n),
@calculated = IIF(@n_count = 10, (tbl1.price + tbl2.price + tbl3.price), 0),
@tbl1n = IIF(@calculated > @last_found, tbl1.n, @tbl1n),
@tbl2n = IIF(@calculated > @last_found, tbl2.n, @tbl2n),
@tbl3n = IIF(@calculated > @last_found, tbl3.n, @tbl3n),
@last_found = IIF(@calculated > @last_found, @calculated, @last_found)
FROM #tbl1 tbl1
CROSS JOIN #tbl2 tbl2
CROSS JOIN #tbl3 tbl3
SELECT @last_found AS highest_value, @tbl1n AS tbl1n, @tbl2n AS tbl2n, @tbl3n AS tbl3n,
t1.price AS tbl1_price, t2.price AS tbl2_price, t3.price AS tbl3_price
FROM #tbl1 t1
INNER JOIN #tbl2 t2 ON t1.n = @tbl1n AND t2.n = @tbl2n
INNER JOIN #tbl3 t3 ON t3.n = @tbl3n
As can be seen, if the query finds a value higher than the previous highest, it stores the combination using multiple instances of @itemN = IIF(@calculated > @last_found, tbl.n, @itemN).
Is it possible to assign all the @tblXn variables in one go? I could use CONCAT, but I think it may slow down the query, as it is a string operation.
FYI - 'n' is a value between 0 and 20.
You can use APPLY:
SELECT n_count, calculated, last_found,
IIF(flag = 1, tbl1.n, @tbl1n) AS tbl1n,
IIF(flag = 1, tbl2.n, @tbl2n) AS tbl2n,
IIF(flag = 1, tbl3.n, @tbl3n) AS tbl3n
FROM #tbl1 tbl1 CROSS JOIN
#tbl2 tbl2 CROSS JOIN
#tbl3 tbl3 CROSS APPLY
( VALUES (tbl1.n + tbl2.n + tbl3.n)
) t(n_count) CROSS APPLY
( VALUES (IIF(n_count = 10, tbl1.price + tbl2.price + tbl3.price, 0))
) tt(calculated) CROSS APPLY
( VALUES (IIF(calculated > @last_found, calculated, @last_found))
) lst(last_found) CROSS APPLY
( VALUES (IIF(calculated > @last_found, 1, 0))
) cc(flag)
Note: you can then assign these values to variables.
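For example, a hedged sketch of that last step (it still relies on per-row variable assignment, which the next answer warns about):
SELECT @tbl1n = IIF(flag = 1, tbl1.n, @tbl1n),
@tbl2n = IIF(flag = 1, tbl2.n, @tbl2n),
@tbl3n = IIF(flag = 1, tbl3.n, @tbl3n),
@last_found = IIF(flag = 1, calculated, @last_found)
FROM #tbl1 tbl1
CROSS JOIN #tbl2 tbl2
CROSS JOIN #tbl3 tbl3
CROSS APPLY ( VALUES (tbl1.n + tbl2.n + tbl3.n) ) t(n_count)
CROSS APPLY ( VALUES (IIF(n_count = 10, tbl1.price + tbl2.price + tbl3.price, 0)) ) tt(calculated)
CROSS APPLY ( VALUES (IIF(calculated > @last_found, 1, 0)) ) cc(flag);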
You shouldn't be using variables for this at all.
This would be much simpler written as below (without relying on the undocumented/unguaranteed behaviour of assigning to variables across multiple rows):
SELECT TOP 1 CAST(combined_price AS DECIMAL(10, 2)) AS highest_value,
tbl1.n AS tbl1n,
tbl2.n AS tbl2n,
tbl3.n AS tbl3n,
tbl1.price AS tbl1_price,
tbl2.price AS tbl2_price,
tbl3.price AS tbl3_price
FROM #tbl1 tbl1
CROSS JOIN #tbl2 tbl2
CROSS JOIN #tbl3 tbl3
CROSS APPLY (VALUES (tbl1.price + tbl2.price + tbl3.price,
tbl1.n + tbl2.n + tbl3.n)) CA(combined_price, combined_n)
WHERE combined_n = 10
ORDER BY combined_price DESC

Convert Recursive CTE to Recursive Subquery

How would I convert the following CTE into a recursive subquery? It's an implementation of Newton's Method.
Reasons:
1) I have no permissions to create functions or stored procs in the DB
2) I must do everything in TSQL
3) Not using Oracle
TESTDATA Table
PMT t V
6918.26 6 410000
3636.51 14 460000
3077.98 22 630000
1645.14 18 340000
8591.67 13 850000
Desired Output
PMT t V Newton
6918.26 6 410000 0.066340421
3636.51 14 460000 0.042449138
3077.98 22 630000 0.024132674
1645.14 18 340000 0.004921588
8591.67 13 850000 0.075982984
_
DECLARE @PMT AS FLOAT
DECLARE @t AS FLOAT
DECLARE @V AS FLOAT
--These will be only for 1 example.
SET @PMT = 6918.26740930922
SET @t = 6
SET @V = 410000
;With Newton (n, i,Fi,dFi) AS (
--base
SELECT
1,
CAST(0.1 AS FLOAT)
,@PMT * (1 - POWER((1 + CAST(0.1 AS FLOAT) / 12), (-@t * 12))) - @V * CAST(0.1 AS FLOAT) / 12
,@PMT * @t * 12 * POWER((1 + CAST(0.1 AS FLOAT) / 12), (-@t * 12 - 1)) - @V
UNION ALL
--recursion
SELECT
n + 1
,i - Fi/dFi
,@PMT * (1 - POWER((1 + i / 12), (-@t * 12))) - @V * i / 12
,@PMT * @t * 12 * POWER((1 + i / 12), (-@t * 12 - 1)) - @V
FROM Newton WHERE n < 500)
--to get the desired value for params above
SELECT [x].i
FROM (
SELECT n, i, Fi, dFi
FROM Newton
WHERE n = 500
) [x]
OPTION (MAXRECURSION 500)
_
I want Newton to be evaluated for every record of TESTDATA as a standalone column.
Any thoughts?
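One hedged sketch of a per-row version (untested; it seeds the recursion from the TESTDATA table, casting the inputs to float just as the original variables were, and carries PMT, t and V through every iteration so each record converges independently):
;With Newton (PMT, t, V, n, i, Fi, dFi) AS (
--base: one starting guess of 0.1 per TESTDATA row
SELECT CAST(PMT AS FLOAT), CAST(t AS FLOAT), CAST(V AS FLOAT),
1,
CAST(0.1 AS FLOAT)
,PMT * (1 - POWER((1 + CAST(0.1 AS FLOAT) / 12), (-t * 12))) - V * CAST(0.1 AS FLOAT) / 12
,PMT * t * 12 * POWER((1 + CAST(0.1 AS FLOAT) / 12), (-t * 12 - 1)) - V
FROM TESTDATA
UNION ALL
--recursion: same formulas as before, keyed by the carried-along PMT, t, V
SELECT PMT, t, V,
n + 1
,i - Fi/dFi
,PMT * (1 - POWER((1 + i / 12), (-t * 12))) - V * i / 12
,PMT * t * 12 * POWER((1 + i / 12), (-t * 12 - 1)) - V
FROM Newton WHERE n < 500)
SELECT PMT, t, V, i AS Newton
FROM Newton
WHERE n = 500
OPTION (MAXRECURSION 500)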

Flat table in SQL query

I have a query that returns two rows:
X Y
20 0.148698
30 0.576208
I also have a function with following signature:
ALTER FUNCTION [dbo].[SomeFunc]
(
@x1 float,
@y1 float,
@x2 float,
@y2 float
)
What is the easiest way to pass the params from this query into this function? Right now I have a query that declares four local variables, then runs four queries to fill all these variables, and only then passes them into my function. But it seems that there should be a better solution. For example, I'm looking for something like:
WITH CTE AS (
SELECT X1 = ..., Y1 = ..., X2 = ..., Y2 = ...
)
SELECT TOP 1 SomeFunc(X1, Y1, X2, Y2)
FROM CTE
That is why I called this question "Flat table".
The entire query is:
DECLARE @value float = 24;
WITH CTE AS (
SELECT X = CAST([name] AS float),
Y = [rank]
FROM [issdss].[dbo].[crit_scale]
WHERE criteria_id = 128
),
CTE2 as (
SELECT CTE.*, LeftDiff = IIF(X <= @value, @value - X, NULL), RightDiff = IIF(X >= @value, X - @value, NULL)
FROM CTE
),
CTE3 as (
SELECT X, Y
FROM CTE2
WHERE LeftDiff = (SELECT MIN(LeftDiff) FROM CTE2)
OR RightDiff = (SELECT MIN(RightDiff) FROM CTE2)
),
-- Some magic here to get X1,Y1,X2,Y2
If there are always 2 rows, you can do something like this with ROW_NUMBER and MAX:
select
max(case when RN = 1 then X end) as X1,
max(case when RN = 2 then X end) as X2,
max(case when RN = 1 then Y end) as Y1,
max(case when RN = 2 then Y end) as Y2
from (
select row_number () over (order by (select null)) RN, *
from (
select 20 as X, 0.148698 as Y
union all
select 30, 0.576208
) X
) Y
Example in SQL Fiddle
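If CTE3 in the original query really does always end up with exactly those two neighbouring rows, a hedged sketch of wiring the same trick into it (continuing the WITH list after CTE3, and assuming SomeFunc is a scalar function) might look like:
CTE4 AS (
SELECT MAX(CASE WHEN RN = 1 THEN X END) AS X1,
MAX(CASE WHEN RN = 2 THEN X END) AS X2,
MAX(CASE WHEN RN = 1 THEN Y END) AS Y1,
MAX(CASE WHEN RN = 2 THEN Y END) AS Y2
FROM (SELECT ROW_NUMBER() OVER (ORDER BY X) AS RN, X, Y FROM CTE3) T
)
SELECT dbo.SomeFunc(X1, Y1, X2, Y2)
FROM CTE4;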

SQL View for top left and top right cell

==> Referring to this Thread!
Referring to the output shown as the best solution there, how can I get the boundary cells? That is, min(StartX), min(StartY), max(EndX) and max(EndY), OR in certain cases max(EndX+1) or max(EndY+1) if a column or row is missed out, as in the case of 3,10 in the image below (the green-bordered cells are my bounding cells).
X Y PieceCells Boundary
1 1 (1,1)(2,1)(2,2)(3,2) (1,1)(3,2)
8 1 (10,1)(8,1)(8,2)(9,1)(9,2)(9,3) (8,1)(10,1)
Well, I want it like this:
BoundaryStartX BoundaryStartY BoundaryEndX BoundaryEndY
1              1              3            2
8              1              10           3
I was able to do this pretty simply with the geometry data type.
declare @g geometry;
set @g = geometry::STGeomFromText(
'POLYGON( (1 -1, 1 -2, 2 -2, 2 -3, 4 -3, 4 -2, 3 -2, 3 -1, 1 -1) )'
, 0);
select @g, @g.STEnvelope();
Geometry is available starting in SQL2008. Also note that I converted your coordinate system to standard Cartesian (positive x axis to the right of the origin, negative y axis below); you'd do well to consider doing the same.
use tempdb;
if exists (select 1 from sys.tables where name = 'grid')
drop table grid;
if not exists (select 1 from sys.tables where name = 'tally')
begin
create table tally (i int not null);
with
a as (select 1 as [i] union select 0),
b as (select 1 as [i] from a as [a1] cross join a as [a2]),
c as (select 1 as [i] from b as [a1] cross join b as [a2]),
d as (select 1 as [i] from c as [a1] cross join c as [a2]),
e as (select 1 as [i] from d as [a1] cross join d as [a2])
insert into tally
select row_number() over (order by i) from e
create unique clustered index [CI_Tally] on tally (i)
end
create table grid (
x tinyint,
y tinyint,
cell as geometry::STGeomFromText(
'POLYGON( (' +
cast(x as varchar) + ' ' + cast(-1*y as varchar) + ', ' +
cast(x+1 as varchar) + ' ' + cast(-1*y as varchar) + ', ' +
cast(x+1 as varchar) + ' ' + cast(-1*(y+1) as varchar) + ', ' +
cast(x as varchar) + ' ' + cast(-1*(y+1) as varchar) + ', ' +
cast(x as varchar) + ' ' + cast(-1*y as varchar) +
') )'
, 0)
);
insert into grid (x, y)
values
(1,1),
(2,1),
(2,2),
(3,2),
(8,1),
(9,1),
(8,2),
(9,2),
(9,3),
(10,1);
with cte as (
select cell, row_number() over (order by x, y) as [rn]
from grid
),
cte2 as (
select cell, [rn]
from cte
where [rn] = 1
union all
select a.cell.STUnion(b.cell) as [cell], b.rn
from cte2 as a
inner join cte as b
on a.rn + 1 = b.[rn]
), cte3 as (
select cell
from cte2
where [rn] = (select count(*) from grid)
), clusters as (
select i, cell.STGeometryN(t.i) as c
from cte3 as [a]
cross join tally as [t]
where t.i <= cell.STNumGeometries()
)
select *, c.STEnvelope() from clusters
This solution solves both your original problem and this one. I like this because you can still use whatever weird coordinate system you want and it'll do what you want. All you'd have to do is modify the computed column on the grid table accordingly. I'm going to leave the computation of the corners of the envelope as an exercise to the reader. :)
By way of explanation, the computed column makes a 1x1 geometry instance out of the given x and y coordinates. From there, I essentially union all of those together which will yield a multipolygon. From there, I iterate through the individual polygons in the multipolygon to get the individual clusters. The envelope comes along for free. From here, you should be able to wrap that final select (or something very like it) in a view if you so choose.
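As for that exercise, one hedged way to read the corners off each envelope (this would replace the final SELECT in the query above):
select i,
-- points 1 and 3 of the rectangular exterior ring are diagonally opposite corners;
-- negate the y values to get back to the original coordinate system, and subtract 1
-- from the far corner if you want cell indices rather than the envelope edge
c.STEnvelope().STExteriorRing().STPointN(1).STX as corner1_x,
c.STEnvelope().STExteriorRing().STPointN(1).STY as corner1_y,
c.STEnvelope().STExteriorRing().STPointN(3).STX as corner2_x,
c.STEnvelope().STExteriorRing().STPointN(3).STY as corner2_y
from clusters;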