Handling multiple sparse matrices with SQL

I got a model like this:
matrices (
    matricesID integer,
    x integer,
    y integer,
    value float
)
Many matrices' data will be stored in that table. Now I need to get the average value along each edge of each matrix, i.e. if one 20 * 30 matrix has values at (5,3), (5,7), (5,15), (12,4), (17,5) and (17,10), I need four groups of data: one for all values with x = 5, one for all values with x = 17, one for all values with y = 3 and one for all values with y = 15, because those are the min/max occupied x and y.
Is there any way to perform this with easy SQL?
Any idea will be appreciated.
BR
Edward

This is a guess as I don't have much experience in the problem domain:
select matricesID
     , (select avg(value) from matrices where matricesID = a.matricesID and x = a.minx) as avgofminx
     , (select avg(value) from matrices where matricesID = a.matricesID and x = a.maxx) as avgofmaxx
     , (select avg(value) from matrices where matricesID = a.matricesID and y = a.miny) as avgofminy
     , (select avg(value) from matrices where matricesID = a.matricesID and y = a.maxy) as avgofmaxy
from (
    select matricesID
         , min(x) as minx
         , max(x) as maxx
         , min(y) as miny
         , max(y) as maxy
    from matrices
    group by matricesID
) as a
This is written for SQL Server, but the syntax is simple enough that it should hopefully run on whatever DBMS you are using.
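If it helps to verify, here is a small, hypothetical test setup using the coordinates from the question (the values 1.0 through 6.0 are made up, since the question only lists positions); running the query above against it should return one row for matrix 1 with the averages of the x = 5 cells, the x = 17 cells, the y = 3 cell and the y = 15 cell:
create table matrices (
    matricesID integer,
    x integer,
    y integer,
    value float
);
-- made-up values; only the coordinates come from the question
insert into matrices (matricesID, x, y, value) values
(1,  5,  3, 1.0),
(1,  5,  7, 2.0),
(1,  5, 15, 3.0),
(1, 12,  4, 4.0),
(1, 17,  5, 5.0),
(1, 17, 10, 6.0);
-- expected: avgofminx = 2.0 (x = 5), avgofmaxx = 5.5 (x = 17),
--           avgofminy = 1.0 (y = 3), avgofmaxy = 3.0 (y = 15)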

Related

Snowflake table and generator functions do not give the expected result

I tried to create a simple SQL query to track query_history usage, but ran into trouble when creating my timeslots using the table and generator functions (the CTE named x below).
I got no results at all when limiting query_history by my timeslots, so after a while I hardcoded an SQL that should give the same result (the CTE named y below), and this works fine.
Why does x not work? As far as I can see, x and y produce identical results.
To test the example, first run the code as it is; this produces no result.
Then comment out the line x as timeslots and un-comment the line y as timeslots; this gives the desired result.
with
x as (
select
dateadd('min',seq4()*10,dateadd('min',-60,current_timestamp())) f,
dateadd('min',(seq4()+1)*10,dateadd('min',-60,current_timestamp())) t
from table(generator(rowcount => 6))
),
y as (
select
dateadd('min',n*10,dateadd('min',-60,current_timestamp())) f,
dateadd('min',(n+1)*10,dateadd('min',-60,current_timestamp())) t
from (select 0 n union all select 1 n union all select 2 union all select 3
union all select 4 union all select 5)
)
--select * from x;
--select * from y;
select distinct
user_name,
timeslots.f
from snowflake.account_usage.query_history,
x as timeslots
--y as timeslots
where start_time >= timeslots.f
and start_time < timeslots.t
order by timeslots.f desc;
(I know the code is not optimal, this is only meant to illustrate the problem)
From the Snowflake documentation for SEQ4:
Returns a sequence of monotonically increasing integers, with wrap-around. Wrap-around occurs after the largest representable integer of the integer width (1, 2, 4, or 8 byte).
If a fully ordered, gap-free sequence is required, consider using the ROW_NUMBER window function.
Because seq4() does not guarantee a gap-free sequence, the generated timeslots are not necessarily the six consecutive 10-minute windows you expect, which is why x and y can behave differently. So this:
with x as (
select
dateadd('min',seq4()*10,dateadd('min',-60,current_timestamp())) f,
dateadd('min',(seq4()+1)*10,dateadd('min',-60,current_timestamp())) t
from table(generator(rowcount => 6))
)
SELECT * FROM x;
Should be:
with x as (
select
(ROW_NUMBER() OVER(ORDER BY seq4())) - 1 AS n,
dateadd('min',n*10,dateadd('min',-60,current_timestamp())) f,
dateadd('min',(n+1)*10,dateadd('min',-60,current_timestamp())) t
from table(generator(rowcount => 6))
)
SELECT * FROM x;
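Putting the corrected CTE back into the original query (only the x CTE changes; the rest is exactly as posted) would look roughly like this, and should now return rows, assuming there is any activity in query_history during the last hour:
with x as (
    select
        (row_number() over (order by seq4())) - 1 as n,
        dateadd('min', n*10, dateadd('min', -60, current_timestamp())) f,
        dateadd('min', (n+1)*10, dateadd('min', -60, current_timestamp())) t
    from table(generator(rowcount => 6))
)
select distinct
    user_name,
    timeslots.f
from snowflake.account_usage.query_history,
    x as timeslots
where start_time >= timeslots.f
  and start_time < timeslots.t
order by timeslots.f desc;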

Is there a trend line function in PL/SQL?

I need a function to calculate a trend line. I have a query (part of the function):
select round(sum(nvl(vl_indice, vl_meta))/12, 2) from (
SELECT
SUM (vl_indice) vl_indice, SUM (vl_meta) vl_meta
FROM
(SELECT cd_mes, vl_indice, NULL vl_meta, dt.id_tempo,
fi.id_multi_empresa, fi.id_setor, fi.id_indice
FROM dbadw.fa_indice fi , dbadw.di_tempo dt ,
dbadw.di_multi_empresa dme , dbaportal.organizacao o ,
dbadw.di_indice di
WHERE fi.id_tempo = dt.id_tempo
AND DT.CD_MES BETWEEN TO_NUMBER(TO_CHAR(ADD_MONTHS(TO_DATE(TO_CHAR(PCD_MES),'YYYYMM'),- 11),'YYYYMM'))
AND PCD_MES
AND DT.ANO = TO_NUMBER(TO_CHAR(TO_DATE(TO_CHAR(PCD_MES),'YYYYMM'),'YYYY'))
AND fi.id_multi_empresa = dme.id_multi_empresa
AND dme.cd_multi_empresa = NVL(o.cd_multi_empresa_mv2000, o.cd_organizacao)
AND o.cd_organizacao = PCD_ORG
AND fi.id_setor IS NULL
AND fi.id_indice = di.id_indice
AND di.cd_indice = PCD_IVM
UNION ALL
SELECT cd_mes, NULL vl_indice, vl_meta, dt.id_tempo,
fm.id_multi_empresa, fm.id_setor, fm.id_indice
FROM dbadw.fa_meta_indice fm , dbadw.di_tempo dt ,
dbadw.di_multi_empresa dme , dbaportal.organizacao o ,
dbadw.di_indice di
WHERE fm.id_tempo = dt.id_tempo
AND DT.ANO = TO_NUMBER(TO_CHAR(TO_DATE(TO_CHAR(PCD_MES),'YYYYMM'),'YYYY'))
AND fm.id_multi_empresa = dme.id_multi_empresa
AND dme.cd_multi_empresa = NVL(o.cd_multi_empresa_mv2000, o.cd_organizacao)
AND o.cd_organizacao = PCD_ORG
AND fm.id_setor IS NULL
AND fm.id_indice = di.id_indice
AND di.cd_indice = PCD_IVM
)
GROUP BY cd_mes, id_tempo, id_multi_empresa, id_setor, id_indice
ORDER BY cd_mes);
I tried to calculate the trend line on the first line, but it is not correct. Can anybody help me?
It's very difficult to work out from a query alone what you want to fit a "trend line" to - by which I assume you mean using least squares linear regression to find a best fit to the data.
So an example with test data:
Oracle Setup:
CREATE TABLE data ( x, y ) AS
SELECT LEVEL,
230 + DBMS_RANDOM.VALUE(-5,5) - 3.14159 * DBMS_RANDOM.VALUE( 0.95, 1.05 ) * LEVEL
FROM DUAL
CONNECT BY LEVEL <= 1000;
As you can see, the data is random but approximately follows y = -3.14159x + 230.
Query - Get the Least Square Regression y-intercept and gradient:
SELECT REGR_INTERCEPT( y, x ) AS best_fit_y_intercept,
REGR_SLOPE( y, x ) AS best_fit_gradient
FROM data
This will get something like:
best_fit_y_intercept best_fit_gradient
-------------------- -----------------
230.531799878168 -3.143190435415
Query - Get the y co-ordinate of the line of best fit:
SELECT x,
y,
REGR_INTERCEPT( y, x ) OVER () + x * REGR_SLOPE( y, x ) OVER () AS best_fit_y
FROM data
The solution is:
SELECT valor, mes,
((mes * SLOPE) + INTERCEPT) TENDENCIA, SLOPE, INTERCEPT from
( select valor, mes, ROUND(REGR_SLOPE(valor,mes) over (partition by id_multi_empresa),4)SLOPE,
ROUND(REGR_INTERCEPT(valor,mes) over (PARTITION by id_multi_empresa),4) INTERCEPT from( --the initial select
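Since the snippet above is cut off, here is a minimal, self-contained sketch of the same pattern on a hypothetical table monthly_values(id_multi_empresa, mes, valor); the REGR_SLOPE and REGR_INTERCEPT analytic functions are standard Oracle, but the table and column names are only placeholders for the initial select:
SELECT valor,
       mes,
       (mes * slope) + intercept AS tendencia,  -- fitted trend value for each month
       slope,
       intercept
FROM (SELECT valor,
             mes,
             ROUND(REGR_SLOPE(valor, mes) OVER (PARTITION BY id_multi_empresa), 4) AS slope,
             ROUND(REGR_INTERCEPT(valor, mes) OVER (PARTITION BY id_multi_empresa), 4) AS intercept
      FROM monthly_values);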

Pearson Correlation SQL Server

I have two tables:
Table 1:
ID,YRMO,Counts
1,Dec 2013,4
1,Jan 2014,6
1,Feb 2014,7
2,Jan 2014,6
2,Feb 2014,8
Table 2:
ID,YRMO,Counts
1,Dec 2013,10
1,Jan 2014,8
1,March 2014,12
2,Jan 2014,6
2,Feb 2014,10
I want to find the Pearson correlation coefficient for each ID. There are more than 200 different IDs.
Pearson correlation is a measure of the linear correlation (dependence) between two variables X and Y, giving a value between +1 and −1 inclusive.
More can be found here: http://oreilly.com/catalog/transqlcook/chapter/ch08.html in the calculating correlation section.
To calculate the Pearson correlation coefficient, you first need to calculate the mean, then the standard deviation, and then the correlation coefficient, as outlined below.
1. Calculate Mean
insert into tab2 (tab1_id, mean)
select ID, sum([counts]) /
(select count(*) from tab1) as mean
from tab1
group by ID;
2. Calculate standard deviation
update tab2
set stddev = (
select sqrt(
sum([counts] * [counts]) /
(select count(*) from tab1)
- mean * mean
) stddev
from tab1
where tab1.ID = tab2.tab1_id
group by tab1.ID);
3. Finally Pearson Correlation Coefficient
select ID,
((sf.sum1 / (select count(*) from tab1)
- stats1.mean * stats2.mean
)
/ (stats1.stddev * stats2.stddev)) as PCC
from (
select r1.ID,
sum(r1.[counts] * r2.[counts]) as sum1
from tab1 r1
join tab1 r2
on r1.ID = r2.ID
group by r1.ID
) sf
join tab2 stats1
on stats1.tab1_id = sf.ID
join tab2 stats2
on stats2.tab1_id = sf.ID
Running this on your posted data gives the results shown in the demo fiddle here: http://sqlfiddle.com/#!3/0da20/5
EDIT:
Refined a bit. You can use the function below to get the PCC, but I am not getting exactly the same result as yours; instead I get 0.999996000000000 for ID = 1.
This could be a good entry point for you, and you can refine the calculation further from here.
create function calculate_PCC(@id int)
returns decimal(16,15)
as
begin
    declare @mean numeric(16,5);
    declare @stddev numeric(16,5);
    declare @count numeric(16,5);
    declare @pcc numeric(16,12);
    declare @store numeric(16,7);
    select @count = CONVERT(numeric(16,5), count(case when Id = @id then 1 end)) from tab1;
    select @mean = convert(numeric(16,5), sum([Counts])) / @count
    from tab1 where ID = @id;
    select @store = (sum(counts * counts) / @count) from tab1 where ID = @id;
    set @stddev = sqrt(@store - (@mean * @mean));
    set @pcc = ((@store - (@mean * @mean)) / (@stddev * @stddev));
    return @pcc;
end
Call the function like
select db_name.dbo.calculate_PCC(1)
A Single-Pass Solution:
There are two flavors of the Pearson correlation coefficient, one for a Sample and one for an entire Population. These are simple, single-pass, and I believe, correct formulas for both:
-- Methods for calculating the two Pearson correlation coefficients
SELECT
-- For Population
(avg(x * y) - avg(x) * avg(y)) /
(sqrt(avg(x * x) - avg(x) * avg(x)) * sqrt(avg(y * y) - avg(y) * avg(y)))
AS correlation_coefficient_population,
-- For Sample
(count(*) * sum(x * y) - sum(x) * sum(y)) /
(sqrt(count(*) * sum(x * x) - sum(x) * sum(x)) * sqrt(count(*) * sum(y * y) - sum(y) * sum(y)))
AS correlation_coefficient_sample
FROM (
-- The following generates a table of sample data containing two columns with a luke-warm and tweakable correlation
-- y = x for 0 thru 99, y = x - 100 for 100 thru 199, etc. Execute it as a stand-alone to see for yourself
-- x and y are CAST as DECIMAL to avoid integer math, you should definitely do the same
-- Try TOP 100 or less for full correlation (y = x for all cases), TOP 200 for a PCC of 0.5, TOP 300 for one near 0.33, etc.
-- The superfluous "+ 0" is where you could apply various offsets to see that they have no effect on the results
SELECT TOP 200
CAST(ROW_NUMBER() OVER (ORDER BY [object_id]) - 1 + 0 AS DECIMAL) AS x,
CAST((ROW_NUMBER() OVER (ORDER BY [object_id]) - 1) % 100 AS DECIMAL) AS y
FROM sys.all_objects
) AS a
As I noted in the comments, you can try the example with TOP 100 or less for full correlation (y = x for all cases); TOP 200 yields correlations very near 0.5; TOP 300, around 0.33; etc. There is a place ("+ 0") to add an offset if you like; spoiler alert, it has no effect. Make sure you CAST your values as DECIMAL - integer math can significantly impact these calcs.
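To apply the same single-pass (sample) formula per ID to the two posted tables, a sketch along these lines should work; tab1 and tab2 are assumed to hold the first and second table respectively, with x and y paired by joining on ID and YRMO so only months present in both tables contribute:
SELECT x.ID,
       (COUNT(*) * SUM(x.Counts * y.Counts) - SUM(x.Counts) * SUM(y.Counts)) /
       (SQRT(COUNT(*) * SUM(x.Counts * x.Counts) - SUM(x.Counts) * SUM(x.Counts)) *
        SQRT(COUNT(*) * SUM(y.Counts * y.Counts) - SUM(y.Counts) * SUM(y.Counts))) AS pcc
FROM (SELECT ID, YRMO, CAST(Counts AS DECIMAL(18, 6)) AS Counts FROM tab1) AS x
JOIN (SELECT ID, YRMO, CAST(Counts AS DECIMAL(18, 6)) AS Counts FROM tab2) AS y
  ON y.ID = x.ID AND y.YRMO = x.YRMO
GROUP BY x.ID;
An ID whose counts have zero variance in either table would make the denominator zero, so such groups may need to be filtered out or handled separately.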

Need SQL Server Query to solve 3rd Order Polynomial Regression

Can anyone help with some SQL query code to estimate the coefficients for a third-order polynomial regression?
Please assume that I have a table of X and Y data values and want to estimate a, b and c in:
Y(X) = aX + bX^2 + cX^3 + E
An APPROXIMATE but fast solution would be to sample 4 representative points from the data and solve the polynomial equation for those points.
As for the sampling, you can split the data into equal sectors and compute the average of X and Y for each sector - the split can be done using quartiles of the X-values, averages of the X-values, min(x)+(max(x)-min(x))/4, or whatever you think is most appropriate.
The sampling by quartiles (i.e. by row numbers) is illustrated in the sector computation of the query below.
As for the solving, I used numberempire.com to solve these* equations for the variables k, a, b, c:
k + a*X1 + b*X1^2 + c*X1^3 - Y1 = 0,
k + a*X2 + b*X2^2 + c*X2^3 - Y2 = 0,
k + a*X3 + b*X3^2 + c*X3^3 - Y3 = 0,
k + a*X4 + b*X4^2 + c*X4^3 - Y4 = 0
*Since Y(X) = 0 + ax + bx^2 + cx^3 + ϵ implicitly includes the point [0, 0] as one of the sample points, it would create bad approximations for data sets that don't include [0, 0]. I took the liberty of solving Y(X) = k + ax + bx^2 + cx^3 + ϵ instead.
The actual SQL would go like this:
select
-- returns 1 row with columns labeled K, A, B and C = coefficients in 3rd order polynomial equation for the 4 sample points
-(X1*(X2p2*(X3p3*Y4-X4p3*Y3)+X2p3*(X4p2*Y3-X3p2*Y4)+(X3p2*X4p3-X3p3*X4p2)*Y2)+X1p2*(X2*(X4p3*Y3-X3p3*Y4)+X2p3*(X3*Y4-X4*Y3)+(X3p3*X4-X3*X4p3)*Y2)+X1p3*(X2*(X3p2*Y4-X4p2*Y3)+X2p2*(X4*Y3-X3*Y4)+(X3*X4p2-X3p2*X4)*Y2)+(X2*(X3p3*X4p2-X3p2*X4p3)+X2p2*(X3*X4p3-X3p3*X4)+X2p3*(X3p2*X4-X3*X4p2))*Y1)/(X1*(X2p2*(X4p3-X3p3)-X3p2*X4p3+X3p3*X4p2+X2p3*(X3p2-X4p2))+X2*(X3p2*X4p3-X3p3*X4p2)+X1p2*(X3*X4p3+X2*(X3p3-X4p3)+X2p3*(X4-X3)-X3p3*X4)+X2p2*(X3p3*X4-X3*X4p3)+X1p3*(X2*(X4p2-X3p2)-X3*X4p2+X3p2*X4+X2p2*(X3-X4))+X2p3*(X3*X4p2-X3p2*X4)) as k,
(X1p2*(X2p3*(Y4-Y3)-X3p3*Y4+X4p3*Y3+(X3p3-X4p3)*Y2)+X2p2*(X3p3*Y4-X4p3*Y3)+X1p3*(X3p2*Y4+X2p2*(Y3-Y4)-X4p2*Y3+(X4p2-X3p2)*Y2)+X2p3*(X4p2*Y3-X3p2*Y4)+(X3p2*X4p3-X3p3*X4p2)*Y2+(X2p2*(X4p3-X3p3)-X3p2*X4p3+X3p3*X4p2+X2p3*(X3p2-X4p2))*Y1)/(X1*(X2p2*(X4p3-X3p3)-X3p2*X4p3+X3p3*X4p2+X2p3*(X3p2-X4p2))+X2*(X3p2*X4p3-X3p3*X4p2)+X1p2*(X3*X4p3+X2*(X3p3-X4p3)+X2p3*(X4-X3)-X3p3*X4)+X2p2*(X3p3*X4-X3*X4p3)+X1p3*(X2*(X4p2-X3p2)-X3*X4p2+X3p2*X4+X2p2*(X3-X4))+X2p3*(X3*X4p2-X3p2*X4)) as a,
-(X1*(X2p3*(Y4-Y3)-X3p3*Y4+X4p3*Y3+(X3p3-X4p3)*Y2)+X2*(X3p3*Y4-X4p3*Y3)+X1p3*(X3*Y4+X2*(Y3-Y4)-X4*Y3+(X4-X3)*Y2)+X2p3*(X4*Y3-X3*Y4)+(X3*X4p3-X3p3*X4)*Y2+(X2*(X4p3-X3p3)-X3*X4p3+X3p3*X4+X2p3*(X3-X4))*Y1)/(X1*(X2p2*(X4p3-X3p3)-X3p2*X4p3+X3p3*X4p2+X2p3*(X3p2-X4p2))+X2*(X3p2*X4p3-X3p3*X4p2)+X1p2*(X3*X4p3+X2*(X3p3-X4p3)+X2p3*(X4-X3)-X3p3*X4)+X2p2*(X3p3*X4-X3*X4p3)+X1p3*(X2*(X4p2-X3p2)-X3*X4p2+X3p2*X4+X2p2*(X3-X4))+X2p3*(X3*X4p2-X3p2*X4)) as b,
(X1*(X2p2*(Y4-Y3)-X3p2*Y4+X4p2*Y3+(X3p2-X4p2)*Y2)+X2*(X3p2*Y4-X4p2*Y3)+X1p2*(X3*Y4+X2*(Y3-Y4)-X4*Y3+(X4-X3)*Y2)+X2p2*(X4*Y3-X3*Y4)+(X3*X4p2-X3p2*X4)*Y2+(X2*(X4p2-X3p2)-X3*X4p2+X3p2*X4+X2p2*(X3-X4))*Y1)/(X1*(X2p2*(X4p3-X3p3)-X3p2*X4p3+X3p3*X4p2+X2p3*(X3p2-X4p2))+X2*(X3p2*X4p3-X3p3*X4p2)+X1p2*(X3*X4p3+X2*(X3p3-X4p3)+X2p3*(X4-X3)-X3p3*X4)+X2p2*(X3p3*X4-X3*X4p3)+X1p3*(X2*(X4p2-X3p2)-X3*X4p2+X3p2*X4+X2p2*(X3-X4))+X2p3*(X3*X4p2-X3p2*X4)) as c
from (select
samples.*,
-- precomputing the powers should give better performance (at least I hope it would)
power(X1,2) X1p2, power(X2,2) X2p2, power(X3,2) X3p2, power(X4,2) X4p2,
power(Y1,3) Y1p3, power(Y2,3) Y2p3, power(Y3,3) Y3p3, power(Y4,3) Y4p3
from (select
avg(case when sector = 1 then x end) X1,
avg(case when sector = 2 then x end) X2,
avg(case when sector = 3 then x end) X3,
avg(case when sector = 4 then x end) X4,
avg(case when sector = 1 then y end) Y1,
avg(case when sector = 2 then y end) Y2,
avg(case when sector = 3 then y end) Y3,
avg(case when sector = 4 then y end) Y4
from (select x, y,
-- splitting to sectors 1 - 4 by row number (SQL Server version)
ceiling(row_number() OVER (ORDER BY x asc) / count(*) * 4) sector
from original_data
)
) samples
)
According to developer.mimer.com, these optional features need to be enabled in SQL Server:
T611, "Elementary OLAP operations"
F591, "Derived tables"
SQL Server has a built-in ranking function NTILE(n) which will more easily create your sectors. I replaced:
ceiling(row_number() OVER (ORDER BY x asc) / count(*) * 4) sector
with:
NTILE(4) OVER(ORDER BY x ASC) [sector]
I also needed to add several "precomputed powers" to allow for the full column range as selected. The full list appears below, followed by a combined sketch of the sampling subquery:
POWER(samples.X1, 2) AS [X1p2],
POWER(samples.X1, 3) AS [X1p3],
POWER(samples.X2, 2) AS [X2p2],
POWER(samples.X2, 3) AS [X2p3],
POWER(samples.X3, 2) AS [X3p2],
POWER(samples.X3, 3) AS [X3p3],
POWER(samples.X4, 2) AS [X4p2],
POWER(samples.X4, 3) AS [X4p3],
POWER(samples.Y1, 3) AS [Y1p3],
POWER(samples.Y2, 3) AS [Y2p3],
POWER(samples.Y3, 3) AS [Y3p3],
POWER(samples.Y4, 3) AS [Y4p3]
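Putting the two changes together, the sampling half of the query might look like the sketch below (only a sketch: original_data(x, y) is the hypothetical source table, ideally with DECIMAL columns as the earlier answer recommends, and the big coefficient expressions from the outer SELECT would sit on top of it unchanged):
select samples.*,
       POWER(samples.X1, 2) AS X1p2, POWER(samples.X1, 3) AS X1p3,
       POWER(samples.X2, 2) AS X2p2, POWER(samples.X2, 3) AS X2p3,
       POWER(samples.X3, 2) AS X3p2, POWER(samples.X3, 3) AS X3p3,
       POWER(samples.X4, 2) AS X4p2, POWER(samples.X4, 3) AS X4p3
from (select
          avg(case when sector = 1 then x end) AS X1,
          avg(case when sector = 2 then x end) AS X2,
          avg(case when sector = 3 then x end) AS X3,
          avg(case when sector = 4 then x end) AS X4,
          avg(case when sector = 1 then y end) AS Y1,
          avg(case when sector = 2 then y end) AS Y2,
          avg(case when sector = 3 then y end) AS Y3,
          avg(case when sector = 4 then y end) AS Y4
      from (select x, y,
                   -- NTILE(4) assigns each row to one of four equal-sized sectors by ascending x
                   NTILE(4) OVER (ORDER BY x ASC) AS sector
            from original_data) AS quartiles
     ) AS samples;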
Overall, great answer by @Aprillion! Well explained, and the numberempire.com h/t was very helpful.

Early (or re-ordered) re-use of derived columns in a query - is this valid ANSI SQL?

Is this valid ANSI SQL?:
SELECT 1 AS X
,2 * X AS Y
,3 * Y AS Z
Because Teradata (12) can do this, as well as this (yes, crazy isn't it):
SELECT 3 * Y AS Z
,2 * X AS Y
,1 AS X
But SQL Server 2005 requires something like this:
SELECT X
,Y
,3 * Y AS Z
FROM (
SELECT X
,2 * X AS Y
FROM (
SELECT 1 AS X
) AS X
) AS Y
No, it's not valid ANSI. ANSI assumes that all SELECT clause items are evaluated at once.
And I would have written it in SQL Server 2005 as:
SELECT *
FROM (SELECT 1 AS X) X
CROSS APPLY (SELECT 2 * X AS Y) Y
CROSS APPLY (SELECT 3 * Y AS Z) Z
;
It doesn't need to be that ugly in SQL Server 2005+. That's why Microsoft introduced CTEs:
WITH T1 AS (SELECT 1 AS X),
T2 AS (SELECT X, 2 * X AS Y FROM T1)
SELECT X, Y, 3 * Y AS Z FROM T2
Or you could use CROSS APPLY as Rob demonstrates - that may or may not work for you depending on the specifics of the query.
I admit that it's not as clean as Teradata's, but it's not nearly as bad as the subquery version, and the original Teradata example in your question is definitely not part of the SQL-92 standard.
I'd also add that in your original example, the X, Y and Z columns are not, technically, derived columns as you call them. At least as far as Microsoft and ANSI are concerned, they are just aliases, and an alias can't refer to another alias until it actually becomes a column (i.e. through a subquery or CTE).