Need SQL Server Query to solve 3rd Order Polynomial Regression

Can anyone help with some SQL query code to provide estimates of the coefficients for a 3rd order polynomial regression?
Please assume that I have a table of X and Y data values and want to estimate a, b and c in:
Y(X) = aX + bX^2 + cX^3 + E

An APPROXIMATE but fast solution would be to sample 4 representative points from the data and solve the polynomial equation for those points.
As for the sampling, you can split the data into equal sectors and compute the average of X and Y for each sector - the split can be done using quartiles of X-values, averages of X-values, min(x)+(max(x)-min(x))/4, or whatever you think is most appropriate.
As for the solving, I used numberempire.com to solve these* equations for the variables k, a, b, c:
k + a*X1 + b*X1^2 + c*X1^3 - Y1 = 0,
k + a*X2 + b*X2^2 + c*X2^3 - Y2 = 0,
k + a*X3 + b*X3^2 + c*X3^3 - Y3 = 0,
k + a*X4 + b*X4^2 + c*X4^3 - Y4 = 0
*Since Y(X) = 0 + aX + bX^2 + cX^3 + ε implicitly includes the point [0, 0] as one of the sample points, it would create bad approximations for data sets that don't include [0, 0]. I took the liberty of solving Y(X) = k + aX + bX^2 + cX^3 + ε instead.
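In matrix form, these four equations are a standard Vandermonde linear system (a textbook restatement, not part of the original answer):

\begin{pmatrix} 1 & X_1 & X_1^2 & X_1^3 \\ 1 & X_2 & X_2^2 & X_2^3 \\ 1 & X_3 & X_3^2 & X_3^3 \\ 1 & X_4 & X_4^2 & X_4^3 \end{pmatrix} \begin{pmatrix} k \\ a \\ b \\ c \end{pmatrix} = \begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \\ Y_4 \end{pmatrix}

It has a unique solution whenever the four sampled X values are distinct, which is what the closed-form expressions below encode.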
The actual SQL would go like this:
select
-- returns 1 row with columns labeled K, A, B and C = coefficients in 3rd order polynomial equation for the 4 sample points
-(X1*(X2p2*(X3p3*Y4-X4p3*Y3)+X2p3*(X4p2*Y3-X3p2*Y4)+(X3p2*X4p3-X3p3*X4p2)*Y2)+X1p2*(X2*(X4p3*Y3-X3p3*Y4)+X2p3*(X3*Y4-X4*Y3)+(X3p3*X4-X3*X4p3)*Y2)+X1p3*(X2*(X3p2*Y4-X4p2*Y3)+X2p2*(X4*Y3-X3*Y4)+(X3*X4p2-X3p2*X4)*Y2)+(X2*(X3p3*X4p2-X3p2*X4p3)+X2p2*(X3*X4p3-X3p3*X4)+X2p3*(X3p2*X4-X3*X4p2))*Y1)/(X1*(X2p2*(X4p3-X3p3)-X3p2*X4p3+X3p3*X4p2+X2p3*(X3p2-X4p2))+X2*(X3p2*X4p3-X3p3*X4p2)+X1p2*(X3*X4p3+X2*(X3p3-X4p3)+X2p3*(X4-X3)-X3p3*X4)+X2p2*(X3p3*X4-X3*X4p3)+X1p3*(X2*(X4p2-X3p2)-X3*X4p2+X3p2*X4+X2p2*(X3-X4))+X2p3*(X3*X4p2-X3p2*X4)) as k,
(X1p2*(X2p3*(Y4-Y3)-X3p3*Y4+X4p3*Y3+(X3p3-X4p3)*Y2)+X2p2*(X3p3*Y4-X4p3*Y3)+X1p3*(X3p2*Y4+X2p2*(Y3-Y4)-X4p2*Y3+(X4p2-X3p2)*Y2)+X2p3*(X4p2*Y3-X3p2*Y4)+(X3p2*X4p3-X3p3*X4p2)*Y2+(X2p2*(X4p3-X3p3)-X3p2*X4p3+X3p3*X4p2+X2p3*(X3p2-X4p2))*Y1)/(X1*(X2p2*(X4p3-X3p3)-X3p2*X4p3+X3p3*X4p2+X2p3*(X3p2-X4p2))+X2*(X3p2*X4p3-X3p3*X4p2)+X1p2*(X3*X4p3+X2*(X3p3-X4p3)+X2p3*(X4-X3)-X3p3*X4)+X2p2*(X3p3*X4-X3*X4p3)+X1p3*(X2*(X4p2-X3p2)-X3*X4p2+X3p2*X4+X2p2*(X3-X4))+X2p3*(X3*X4p2-X3p2*X4)) as a,
-(X1*(X2p3*(Y4-Y3)-X3p3*Y4+X4p3*Y3+(X3p3-X4p3)*Y2)+X2*(X3p3*Y4-X4p3*Y3)+X1p3*(X3*Y4+X2*(Y3-Y4)-X4*Y3+(X4-X3)*Y2)+X2p3*(X4*Y3-X3*Y4)+(X3*X4p3-X3p3*X4)*Y2+(X2*(X4p3-X3p3)-X3*X4p3+X3p3*X4+X2p3*(X3-X4))*Y1)/(X1*(X2p2*(X4p3-X3p3)-X3p2*X4p3+X3p3*X4p2+X2p3*(X3p2-X4p2))+X2*(X3p2*X4p3-X3p3*X4p2)+X1p2*(X3*X4p3+X2*(X3p3-X4p3)+X2p3*(X4-X3)-X3p3*X4)+X2p2*(X3p3*X4-X3*X4p3)+X1p3*(X2*(X4p2-X3p2)-X3*X4p2+X3p2*X4+X2p2*(X3-X4))+X2p3*(X3*X4p2-X3p2*X4)) as b,
(X1*(X2p2*(Y4-Y3)-X3p2*Y4+X4p2*Y3+(X3p2-X4p2)*Y2)+X2*(X3p2*Y4-X4p2*Y3)+X1p2*(X3*Y4+X2*(Y3-Y4)-X4*Y3+(X4-X3)*Y2)+X2p2*(X4*Y3-X3*Y4)+(X3*X4p2-X3p2*X4)*Y2+(X2*(X4p2-X3p2)-X3*X4p2+X3p2*X4+X2p2*(X3-X4))*Y1)/(X1*(X2p2*(X4p3-X3p3)-X3p2*X4p3+X3p3*X4p2+X2p3*(X3p2-X4p2))+X2*(X3p2*X4p3-X3p3*X4p2)+X1p2*(X3*X4p3+X2*(X3p3-X4p3)+X2p3*(X4-X3)-X3p3*X4)+X2p2*(X3p3*X4-X3*X4p3)+X1p3*(X2*(X4p2-X3p2)-X3*X4p2+X3p2*X4+X2p2*(X3-X4))+X2p3*(X3*X4p2-X3p2*X4)) as c
from (select
samples.*,
-- precomputing the powers should give better performance (at least I hope it would)
power(X1,2) X1p2, power(X2,2) X2p2, power(X3,2) X3p2, power(X4,2) X4p2,
power(Y1,3) Y1p3, power(Y2,3) Y2p3, power(Y3,3) Y3p3, power(Y4,3) Y4p3
from (select
avg(case when sector = 1 then x end) X1,
avg(case when sector = 2 then x end) X2,
avg(case when sector = 3 then x end) X3,
avg(case when sector = 4 then x end) X4,
avg(case when sector = 1 then y end) Y1,
avg(case when sector = 2 then y end) Y2,
avg(case when sector = 3 then y end) Y3,
avg(case when sector = 4 then y end) Y4
from (select x, y,
-- splitting to sectors 1 - 4 by row number (SQL Server version)
ceiling(row_number() OVER (ORDER BY x asc) / count(*) * 4) sector
from original_data
) sectored
) samples
) t
According to developer.mimer.com, the query relies on these optional SQL standard features, which your database needs to support:
T611, "Elementary OLAP operations"
F591, "Derived tables"

SQL Server has a built-in ranking function, NTILE(n), which creates your sectors more easily. I replaced:
ceiling(row_number() OVER (ORDER BY x asc) / count(*) * 4) sector
with:
NTILE(4) OVER(ORDER BY x ASC) [sector]
I also needed to add several "precomputed powers" so that every column referenced by the outer select actually exists. The full list appears below:
POWER(samples.X1, 2) AS [X1p2],
POWER(samples.X1, 3) AS [X1p3],
POWER(samples.X2, 2) AS [X2p2],
POWER(samples.X2, 3) AS [X2p3],
POWER(samples.X3, 2) AS [X3p2],
POWER(samples.X3, 3) AS [X3p3],
POWER(samples.X4, 2) AS [X4p2],
POWER(samples.X4, 3) AS [X4p3],
POWER(samples.Y1, 3) AS [Y1p3],
POWER(samples.Y2, 3) AS [Y2p3],
POWER(samples.Y3, 3) AS [Y3p3],
POWER(samples.Y4, 3) AS [Y4p3]
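Putting both fixes together, the sampling-and-powers subquery looks roughly like this (a sketch against the question's original_data(x, y) table; note that the coefficient expressions above only ever reference the X powers and the plain Y1..Y4 values, so this sketch omits the Y cubes):
select
samples.*,
POWER(samples.X1, 2) AS [X1p2], POWER(samples.X1, 3) AS [X1p3],
POWER(samples.X2, 2) AS [X2p2], POWER(samples.X2, 3) AS [X2p3],
POWER(samples.X3, 2) AS [X3p2], POWER(samples.X3, 3) AS [X3p3],
POWER(samples.X4, 2) AS [X4p2], POWER(samples.X4, 3) AS [X4p3]
from (select
avg(case when sector = 1 then x end) X1,
avg(case when sector = 2 then x end) X2,
avg(case when sector = 3 then x end) X3,
avg(case when sector = 4 then x end) X4,
avg(case when sector = 1 then y end) Y1,
avg(case when sector = 2 then y end) Y2,
avg(case when sector = 3 then y end) Y3,
avg(case when sector = 4 then y end) Y4
from (select x, y,
NTILE(4) OVER (ORDER BY x ASC) [sector]
from original_data) sectored
) samples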
Overall, great answer by @Aprillion! Well explained, and the numberempire.com h/t was very helpful.

IBM DB2: SQL subtraction doesn't work in denominator

I am working with manufacturing costs and the value of the fabrication still left to go (the "Net Work In Process"). The SQL is straight arithmetic but the query doesn't result in a value if there is a minus sign (subtraction) in the denominator. The database columns:
A = Material issued cost
B = Miscellaneous cost adds
C = Labor
D = Overhead
E = Setup cost
F = Scrap cost
G = Received cost (cost of assemblies completed already)
H = Original Quantity ordered
I = Quantity deviation
J = Quantity split from order
K = Quantity received (number of assemblies completed already)
The Net WIP cost is nothing more than the total cost remaining divided by the total quantity remaining. So in short, I'm simply trying to do this:
select (A + B + C + D + E - F - G) / (H + I - J - K) from MyTable
The subtractions work fine in the numerator but as soon as I subtract in the denominator, the query simply returns no value (blank). I've tried stuff like this:
select (A + B + C + D + E - F - G) / (H + I - (J + K)) from MyTable
select (A + B + C + D + E - F - G) / (H + I + (-J) + (-K)) from MyTable
select (A + B + C + D + E - F - G) / (H + I + (J * -1) + (K * -1)) from MyTable
None of these work. Just curious if anyone has come across this on IBM's DB2 database?
Thanks.
If you are getting a "blank" from a numeric calculation, then you have a NULL value somewhere. Try using coalesce(), plus nullif() to guard against a zero denominator:
nullif(coalesce(H, 0) + coalesce(I, 0) - coalesce(J, 0) - coalesce(K, 0), 0)
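Dropped into the original query, the guarded denominator looks like this (NULLIF turns a zero denominator into NULL, so the division returns NULL instead of failing):
select (A + B + C + D + E - F - G) /
nullif(coalesce(H, 0) + coalesce(I, 0) - coalesce(J, 0) - coalesce(K, 0), 0)
from MyTable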
You have nulls in one of the columns H, I, J, or K. Search for the offending rows using:
select H, I, J, K
from MyTable
where H is null
or I is null
or J is null
or K is null;
Then you can treat those special cases according to your own logic. Typically you'll replace those nulls with zeroes or other values using COALESCE().
Thanks all for your comments. I did scour the columns for NULLs and there aren't any. There are, however, plenty of cases where, let's say, the factory sets the order to 10 and completes (receives) all 10 with no splits and no deviation. In that case:
H + I - J - K = (+10 +0 -0 -10) = 0
and I can't divide by zero. So I have a different workaround for that, and thanks for everyone's help.

Are there any trend line functions in PL/SQL?

I need a function to calculate a trend line. I have a query (part of the function):
select round(sum(nvl(vl_indice, vl_meta))/12, 2) from (
SELECT
SUM (vl_indice) vl_indice, SUM (vl_meta) vl_meta
FROM
(SELECT cd_mes, vl_indice, NULL vl_meta, dt.id_tempo,
fi.id_multi_empresa, fi.id_setor, fi.id_indice
FROM dbadw.fa_indice fi , dbadw.di_tempo dt ,
dbadw.di_multi_empresa dme , dbaportal.organizacao o ,
dbadw.di_indice di
WHERE fi.id_tempo = dt.id_tempo
AND DT.CD_MES BETWEEN TO_NUMBER(TO_CHAR(ADD_MONTHS(TO_DATE(TO_CHAR(PCD_MES),'YYYYMM'),- 11),'YYYYMM'))
AND PCD_MES
AND DT.ANO = TO_NUMBER(TO_CHAR(TO_DATE(TO_CHAR(PCD_MES),'YYYYMM'),'YYYY'))
AND fi.id_multi_empresa = dme.id_multi_empresa
AND dme.cd_multi_empresa = NVL(o.cd_multi_empresa_mv2000, o.cd_organizacao)
AND o.cd_organizacao = PCD_ORG
AND fi.id_setor IS NULL
AND fi.id_indice = di.id_indice
AND di.cd_indice = PCD_IVM
UNION ALL
SELECT cd_mes, NULL vl_indice, vl_meta, dt.id_tempo,
fm.id_multi_empresa, fm.id_setor, fm.id_indice
FROM dbadw.fa_meta_indice fm , dbadw.di_tempo dt ,
dbadw.di_multi_empresa dme , dbaportal.organizacao o ,
dbadw.di_indice di
WHERE fm.id_tempo = dt.id_tempo
AND DT.ANO = TO_NUMBER(TO_CHAR(TO_DATE(TO_CHAR(PCD_MES),'YYYYMM'),'YYYY'))
AND fm.id_multi_empresa = dme.id_multi_empresa
AND dme.cd_multi_empresa = NVL(o.cd_multi_empresa_mv2000, o.cd_organizacao)
AND o.cd_organizacao = PCD_ORG
AND fm.id_setor IS NULL
AND fm.id_indice = di.id_indice
AND di.cd_indice = PCD_IVM
)
GROUP BY cd_mes, id_tempo, id_multi_empresa, id_setor, id_indice
ORDER BY cd_mes);
I tried to calculate the trend line on the first line, but it is not correct. Please, can anybody help me?
It's very difficult to work out from a query what you want to fit a "trend line" to - by which I assume you mean using least-squares linear regression to find a best fit to the data.
So an example with test data:
Oracle Setup:
CREATE TABLE data ( x, y ) AS
SELECT LEVEL,
230 + DBMS_RANDOM.VALUE(-5,5) - 3.14159 * DBMS_RANDOM.VALUE( 0.95, 1.05 ) * LEVEL
FROM DUAL
CONNECT BY LEVEL <= 1000;
As you can see, the data is random but it is approximately y = -3.14159x + 230
Query - Get the Least Square Regression y-intercept and gradient:
SELECT REGR_INTERCEPT( y, x ) AS best_fit_y_intercept,
REGR_SLOPE( y, x ) AS best_fit_gradient
FROM data
This will get something like:
best_fit_y_intercept best_fit_gradient
-------------------- -----------------
230.531799878168 -3.143190435415
Query - Get the y co-ordinate of the line of best fit:
SELECT x,
y,
REGR_INTERCEPT( y, x ) OVER () + x * REGR_SLOPE( y, x ) OVER () AS best_fit_y
FROM data
The solution is:
SELECT valor, mes,
       ((mes * SLOPE) + INTERCEPT) TENDENCIA, SLOPE, INTERCEPT
FROM (SELECT valor, mes,
             ROUND(REGR_SLOPE(valor, mes) OVER (PARTITION BY id_multi_empresa), 4) SLOPE,
             ROUND(REGR_INTERCEPT(valor, mes) OVER (PARTITION BY id_multi_empresa), 4) INTERCEPT
      FROM ( -- the initial select

create a histogram with a dynamic number of partitions in sqlite

I have a column x with integers in the range 0 < x <= maxX.
To create a histogram with five partitions of equal size, I am using the following statement in sqlite:
select case
when x > 0 and x <= 1*((maxX+4)/5) then 1
when x > 1*((maxX+4)/5) and x <= 2*((maxX+4)/5) then 2
when x > 2*((maxX+4)/5) and x <= 3*((maxX+4)/5) then 3
when x > 3*((maxX+4)/5) and x <= 4*((maxX+4)/5) then 4
else 5 end as category, count(*) as count
from A,B group by category
Is there a way to make a "dynamic" query for this in the way that I can create a histogram of n partitions without writing n conditions in the case-statement?
You can use arithmetic to divide the values. Here is one method. It essentially takes the ceiling value of maxX / 5 and uses that to define the partitions:
select (case when cast(maxX / params.n as int) = maxX / params.n
             then (x - 1) / (maxX / params.n)
             else (x - 1) / cast(1 + maxX / params.n as int)
        end) as category, count(*)
from (select 5 as n) params cross join
     A
group by category;
The -1 is because your numbers start at one rather than zero.
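An alternative sketch that avoids the exact-division case split entirely: compute maxX inline and map each x in 1..maxX straight onto a 0-based bucket (the table name A and single column x are assumptions carried over from the question):
select ((x - 1) * params.n) / stats.maxX as category, count(*) as count
from A
cross join (select 5 as n) params
cross join (select max(x) as maxX from A) stats
group by category;
Because all operands are integers, sqlite's / already truncates, so (x - 1) * n / maxX lands in 0 .. n-1.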

Pearson Correlation SQL Server

I have two tables:
Table 1:
ID,YRMO,Counts
1,Dec 2013,4
1,Jan 2014,6
1,Feb 2014,7
2,Jan 2014,6
2,Feb 2014,8
Table 2:
ID,YRMO,Counts
1,Dec 2013,10
1,Jan 2014,8
1,March 2014,12
2,Jan 2014,6
2,Feb 2014,10
I want to find the Pearson correlation coefficient for each ID. There are more than 200 different IDs.
Pearson correlation is a measure of the linear correlation (dependence) between two variables X and Y, giving a value between +1 and -1 inclusive.
More can be found in the "calculating correlation" section here: http://oreilly.com/catalog/transqlcook/chapter/ch08.html
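For reference, the three steps below implement the population form of the coefficient (the standard definition, not specific to the linked chapter):

r = \frac{\overline{xy} - \bar{x}\,\bar{y}}{\sigma_x \sigma_y}, \qquad \sigma_x = \sqrt{\overline{x^2} - \bar{x}^2}

where the bars denote means over the paired observations.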
To calculate the Pearson correlation coefficient, you need to first calculate the mean, then the standard deviation, and then the correlation coefficient, as outlined below.
1. Calculate Mean
insert into tab2 (tab1_id, mean)
select ID, sum([counts]) /
(select count(*) from tab1) as mean
from tab1
group by ID;
2. Calculate standard deviation
update tab2
set stddev = (
select sqrt(
sum([counts] * [counts]) /
(select count(*) from tab1)
- mean * mean
) stddev
from tab1
where tab1.ID = tab2.tab1_id
group by tab1.ID);
3. Finally Pearson Correlation Coefficient
select ID,
((sf.sum1 / (select count(*) from tab1)
- stats1.mean * stats2.mean
)
/ (stats1.stddev * stats2.stddev)) as PCC
from (
select r1.ID,
sum(r1.[counts] * r2.[counts]) as sum1
from tab1 r1
join tab1 r2
on r1.ID = r2.ID
group by r1.ID
) sf
join tab2 stats1
on stats1.tab1_id = sf.ID
join tab2 stats2
on stats2.tab1_id = sf.ID
Which, on your posted data, results in the values shown in this demo fiddle: http://sqlfiddle.com/#!3/0da20/5
EDIT:
I refined it a bit into the function below, which you can use to get the PCC. I am not getting exactly the same result as yours, but rather 0.999996000000000 for ID = 1.
This could be a great entry point for you. You can refine the calculation further from here.
create function calculate_PCC(@id int)
returns decimal(16,15)
as
begin
    declare @mean numeric(16,5);
    declare @stddev numeric(16,5);
    declare @count numeric(16,5);
    declare @pcc numeric(16,12);
    declare @store numeric(16,7);
    select @count = CONVERT(numeric(16,5), count(case when Id = @id then 1 end)) from tab1;
    select @mean = convert(numeric(16,5), sum([Counts])) / @count
    from tab1 WHERE ID = @id;
    select @store = (sum(counts * counts) / @count) from tab1 WHERE ID = @id;
    set @stddev = sqrt(@store - (@mean * @mean));
    set @pcc = ((@store - (@mean * @mean)) / (@stddev * @stddev));
    return @pcc;
end
Call the function like
select db_name.dbo.calculate_PCC(1)
A Single-Pass Solution:
There are two flavors of the Pearson correlation coefficient, one for a Sample and one for an entire Population. These are simple, single-pass, and I believe, correct formulas for both:
-- Methods for calculating the two Pearson correlation coefficients
SELECT
-- For Population
(avg(x * y) - avg(x) * avg(y)) /
(sqrt(avg(x * x) - avg(x) * avg(x)) * sqrt(avg(y * y) - avg(y) * avg(y)))
AS correlation_coefficient_population,
-- For Sample
(count(*) * sum(x * y) - sum(x) * sum(y)) /
(sqrt(count(*) * sum(x * x) - sum(x) * sum(x)) * sqrt(count(*) * sum(y * y) - sum(y) * sum(y)))
AS correlation_coefficient_sample
FROM (
-- The following generates a table of sample data containing two columns with a luke-warm and tweakable correlation
-- y = x for 0 thru 99, y = x - 100 for 100 thru 199, etc. Execute it as a stand-alone to see for yourself
-- x and y are CAST as DECIMAL to avoid integer math, you should definitely do the same
-- Try TOP 100 or less for full correlation (y = x for all cases), TOP 200 for a PCC of 0.5, TOP 300 for one near 0.33, etc.
-- The superfluous "+ 0" is where you could apply various offsets to see that they have no effect on the results
SELECT TOP 200
CAST(ROW_NUMBER() OVER (ORDER BY [object_id]) - 1 + 0 AS DECIMAL) AS x,
CAST((ROW_NUMBER() OVER (ORDER BY [object_id]) - 1) % 100 AS DECIMAL) AS y
FROM sys.all_objects
) AS a
As I noted in the comments, you can try the example with TOP 100 or less for full correlation (y = x for all cases); TOP 200 yields correlations very near 0.5; TOP 300, around 0.33; etc. There is a place ("+ 0") to add an offset if you like; spoiler alert, it has no effect. Make sure you CAST your values as DECIMAL - integer math can significantly impact these calcs.
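To get one coefficient per ID, as the question asks, the same sample formula can simply be grouped. A sketch, assuming the two posted tables are named tab1 and tab2 (the names and join keys are assumptions) and that observations pair up on ID and YRMO:
SELECT pairs.ID,
       (COUNT(*) * SUM(x * y) - SUM(x) * SUM(y)) /
       (SQRT(COUNT(*) * SUM(x * x) - SUM(x) * SUM(x)) *
        SQRT(COUNT(*) * SUM(y * y) - SUM(y) * SUM(y))) AS pcc_sample
FROM (
    -- pair the two counts per (ID, YRMO); CAST to DECIMAL to avoid integer math
    SELECT t1.ID,
           CAST(t1.Counts AS DECIMAL) AS x,
           CAST(t2.Counts AS DECIMAL) AS y
    FROM tab1 t1
    JOIN tab2 t2
      ON t2.ID = t1.ID
     AND t2.YRMO = t1.YRMO
) AS pairs
GROUP BY pairs.ID;
Note the inner join silently drops months that appear in only one table (e.g. Feb 2014 vs March 2014 for ID 1); how to handle those gaps is a modeling decision.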

Handling multiple sparse matrices with SQL

I got a model like this:
matrices (
    matricesID integer,
    x integer,
    y integer,
    value float
)
This table stores the data for many matrices. Now I need to get the average value along each edge of a matrix, i.e. if one matrix is 20 * 30 and has values at (5,3), (5,7), (5,15), (12,4), (17,5), (17,10), I need four groups of data: one for all values with x=5, one for all values with x=17, one for all values with y=3, and one for all values with y=15, because these are the min/max for x and y.
Is there any way to perform this with easy SQL?
Any idea will be appreciated.
BR
Edward
This is a guess as I don't have much experience in the problem domain:
select matricesID
, (select avg(value) from matrices where matricesID = a.matricesID and x = a.minx) as avgofminx
, (select avg(value) from matrices where matricesID = a.matricesID and x = a.maxx) as avgofmaxx
, (select avg(value) from matrices where matricesID = a.matricesID and y = a.miny) as avgofminy
, (select avg(value) from matrices where matricesID = a.matricesID and y = a.maxy) as avgofmaxy
from (
select matricesID
, min(x) as minx
, max(x) as maxx
, min(y) as miny
, max(y) as maxy
from matrices
group by matricesID
) as a
This is running on SQL Server, but the syntax is simple enough that it hopefully runs in whatever DBMS you are using.
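For what it's worth, the same result can be had in a single pass over the table using window functions plus conditional aggregation - a sketch, assuming a DBMS that supports them (SQL Server 2005+ does):
select matricesID,
       avg(case when x = minx then value end) as avgofminx,
       avg(case when x = maxx then value end) as avgofmaxx,
       avg(case when y = miny then value end) as avgofminy,
       avg(case when y = maxy then value end) as avgofmaxy
from (
    -- tag every cell with its matrix's extreme coordinates
    select matricesID, x, y, value,
           min(x) over (partition by matricesID) as minx,
           max(x) over (partition by matricesID) as maxx,
           min(y) over (partition by matricesID) as miny,
           max(y) over (partition by matricesID) as maxy
    from matrices
) as t
group by matricesID;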