Is there a trend line function in PL/SQL?

I need a function to calculate a trend line. I have a query (part of the function):
select round(sum(nvl(vl_indice, vl_meta))/12, 2) from (
SELECT
SUM (vl_indice) vl_indice, SUM (vl_meta) vl_meta
FROM
(SELECT cd_mes, vl_indice, NULL vl_meta, dt.id_tempo,
fi.id_multi_empresa, fi.id_setor, fi.id_indice
FROM dbadw.fa_indice fi , dbadw.di_tempo dt ,
dbadw.di_multi_empresa dme , dbaportal.organizacao o ,
dbadw.di_indice di
WHERE fi.id_tempo = dt.id_tempo
AND DT.CD_MES BETWEEN TO_NUMBER(TO_CHAR(ADD_MONTHS(TO_DATE(TO_CHAR(PCD_MES),'YYYYMM'),- 11),'YYYYMM'))
AND PCD_MES
AND DT.ANO = TO_NUMBER(TO_CHAR(TO_DATE(TO_CHAR(PCD_MES),'YYYYMM'),'YYYY'))
AND fi.id_multi_empresa = dme.id_multi_empresa
AND dme.cd_multi_empresa = NVL(o.cd_multi_empresa_mv2000, o.cd_organizacao)
AND o.cd_organizacao = PCD_ORG
AND fi.id_setor IS NULL
AND fi.id_indice = di.id_indice
AND di.cd_indice = PCD_IVM
UNION ALL
SELECT cd_mes, NULL vl_indice, vl_meta, dt.id_tempo,
fm.id_multi_empresa, fm.id_setor, fm.id_indice
FROM dbadw.fa_meta_indice fm , dbadw.di_tempo dt ,
dbadw.di_multi_empresa dme , dbaportal.organizacao o ,
dbadw.di_indice di
WHERE fm.id_tempo = dt.id_tempo
AND DT.ANO = TO_NUMBER(TO_CHAR(TO_DATE(TO_CHAR(PCD_MES),'YYYYMM'),'YYYY'))
AND fm.id_multi_empresa = dme.id_multi_empresa
AND dme.cd_multi_empresa = NVL(o.cd_multi_empresa_mv2000, o.cd_organizacao)
AND o.cd_organizacao = PCD_ORG
AND fm.id_setor IS NULL
AND fm.id_indice = di.id_indice
AND di.cd_indice = PCD_IVM
)
GROUP BY cd_mes, id_tempo, id_multi_empresa, id_setor, id_indice
ORDER BY cd_mes);
I tried to calculate the trend line on the first line, but it is not correct. Can anybody help me, please?

It's very difficult to work out from a query what you want to fit a "trend line" to; by that I assume you mean using least-squares linear regression to find a best fit to the data.
So an example with test data:
Oracle Setup:
CREATE TABLE data ( x, y ) AS
SELECT LEVEL,
230 + DBMS_RANDOM.VALUE(-5,5) - 3.14159 * DBMS_RANDOM.VALUE( 0.95, 1.05 ) * LEVEL
FROM DUAL
CONNECT BY LEVEL <= 1000;
As you can see, the data is random, but it is approximately y = -3.14159x + 230.
Query - Get the Least Square Regression y-intercept and gradient:
SELECT REGR_INTERCEPT( y, x ) AS best_fit_y_intercept,
REGR_SLOPE( y, x ) AS best_fit_gradient
FROM data
This will get something like:
best_fit_y_intercept best_fit_gradient
-------------------- -----------------
230.531799878168 -3.143190435415
Query - Get the y co-ordinate of the line of best fit:
SELECT x,
y,
REGR_INTERCEPT( y, x ) OVER () + x * REGR_SLOPE( y, x ) OVER () AS best_fit_y
FROM data
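Not part of the original answer, but if the point of the trend line is to project beyond the observed data, the same two fitted coefficients can be applied to future x values; a minimal sketch against the same data table (the range 1001..1012 is just an illustrative choice):
SELECT f.x,
       c.best_fit_y_intercept + f.x * c.best_fit_gradient AS predicted_y
FROM ( SELECT REGR_INTERCEPT( y, x ) AS best_fit_y_intercept,
              REGR_SLOPE( y, x )     AS best_fit_gradient
       FROM   data ) c
CROSS JOIN
     ( SELECT 1000 + LEVEL AS x   -- x values just past the sample range
       FROM   DUAL
       CONNECT BY LEVEL <= 12 ) f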

The solution is:
SELECT valor, mes,
       ((mes * SLOPE) + INTERCEPT) TENDENCIA, SLOPE, INTERCEPT
FROM ( SELECT valor, mes,
              ROUND( REGR_SLOPE( valor, mes ) OVER ( PARTITION BY id_multi_empresa ), 4 ) SLOPE,
              ROUND( REGR_INTERCEPT( valor, mes ) OVER ( PARTITION BY id_multi_empresa ), 4 ) INTERCEPT
       FROM ( -- the initial select from the question goes here
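For reference, a self-contained sketch of the same pattern, with an illustrative table monthly_values ( id_multi_empresa, mes, valor ) standing in for the initial select (those names are assumptions for the example, not the real dbadw schema):
-- Sketch: per-company trend line over a monthly series
SELECT valor, mes,
       ( mes * slope ) + intercept AS tendencia, slope, intercept
FROM ( SELECT valor, mes,
              ROUND( REGR_SLOPE( valor, mes )
                     OVER ( PARTITION BY id_multi_empresa ), 4 ) AS slope,
              ROUND( REGR_INTERCEPT( valor, mes )
                     OVER ( PARTITION BY id_multi_empresa ), 4 ) AS intercept
       FROM   monthly_values )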

Related

Snowflake table and generator functions do not give expected result

I tried to create a simple SQL query to track query_history usage, but ran into trouble when creating my timeslots using the table and generator functions (the CTE named x below).
I got no results at all when limiting the query_history using my timeslots, so after a while I hardcoded an SQL query that gives the same result (the CTE named y below), and this works fine.
Why does x not work? As far as I can see, x and y produce identical results.
To test the example, first run the code as it is; this produces no result.
Then comment out the line "x as timeslots" and un-comment the line "y as timeslots"; this will give the desired result.
with
x as (
select
dateadd('min',seq4()*10,dateadd('min',-60,current_timestamp())) f,
dateadd('min',(seq4()+1)*10,dateadd('min',-60,current_timestamp())) t
from table(generator(rowcount => 6))
),
y as (
select
dateadd('min',n*10,dateadd('min',-60,current_timestamp())) f,
dateadd('min',(n+1)*10,dateadd('min',-60,current_timestamp())) t
from (select 0 n union all select 1 n union all select 2 union all select 3
union all select 4 union all select 5)
)
--select * from x;
--select * from y;
select distinct
user_name,
timeslots.f
from snowflake.account_usage.query_history,
x as timeslots
--y as timeslots
where start_time >= timeslots.f
and start_time < timeslots.t
order by timeslots.f desc;
(I know the code is not optimal, this is only meant to illustrate the problem)
From the Snowflake documentation for SEQ4:
Returns a sequence of monotonically increasing integers, with wrap-around. Wrap-around occurs after the largest representable integer of the integer width (1, 2, 4, or 8 byte).
If a fully ordered, gap-free sequence is required, consider using the ROW_NUMBER window function.
For:
with x as (
select
dateadd('min',seq4()*10,dateadd('min',-60,current_timestamp())) f,
dateadd('min',(seq4()+1)*10,dateadd('min',-60,current_timestamp())) t
from table(generator(rowcount => 6))
)
SELECT * FROM x;
Should be:
with x as (
select
(ROW_NUMBER() OVER(ORDER BY seq4())) - 1 AS n,
dateadd('min',n*10,dateadd('min',-60,current_timestamp())) f,
dateadd('min',(n+1)*10,dateadd('min',-60,current_timestamp())) t
from table(generator(rowcount => 6))
)
SELECT * FROM x;
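To see the difference side by side, you can select both the raw seq4() value and its ROW_NUMBER() replacement from the same generator (a sketch; per the documentation quoted above, the seq4() column may contain gaps, which is what broke the timeslot join):
select seq4()                                  as raw_seq,   -- may contain gaps
       row_number() over (order by seq4()) - 1 as gap_free_n -- never does
from table(generator(rowcount => 6));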

Recursive Formula based on previous Row's Result

Let's consider the following query:
with
init as (
select 0.1 as y0
),
cte as (
select 1 as i, 1 as x -- x_1
union all
select 2 as i, 10 as x -- x_2
union all
select 3 as i, 100 as x -- x_3
order by i asc
)
select cte.x, init.y0 -- <- ?
from cte
join init
on true
There is a CTE init specifying an initial value y_0 and a CTE cte specifying rows with a value x and an index i.
My question is whether I can write a select which realizes the following simple, recursive formula.
y_{n+1} = y_n + x_{n+1}
So, the result should be 3 rows with values: 1.1, 11.1, 111.1 (for y_1, y_2, y_3).
Would that be possible?
write a select which realizes the following simple, recursive formula.
y_{n+1} = y_n + x_{n+1}
Consider below
select x, y0 + sum(x) over(order by i) as y
from cte, init
Applied to the sample data in your question, this produces a running total of x offset by y0: the recursion unrolls to y_n = y_0 + x_1 + ... + x_n, which is exactly what the window SUM computes.
Note: the expected result you show in your question does not match the formula you provided, so obviously the above output is different from the one in your question :o)
You need to use the OVER clause; see the documentation for more on the window function syntax.
with
init as (
select 0.1 as y0
),
cte as (
select 1 as ts, 1 as i, 1 as x -- x_1
union all
select 2, 2, 10 as x -- x_2
union all
select 3, 3, 100 as x -- x_3
union all
select 4, 4, 109 as x -- x_4
union all
select 5, 5, 149 as x -- x_5
order by i asc
)
SELECT *, init.y0 + SUM(x) OVER(
ORDER BY ts
) AS res
FROM cte join init
on true
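With the extended sample above this yields res = 1.1, 11.1, 111.1, 220.1 and 369.1: each row adds its x to the previous row's total, starting from y0 = 0.1.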

SQL trend line by departments

I'm using the example in the link below to create an SQL trend line on a report:
https://www.mssqltips.com/sqlservertip/3432/add-a-linear-trendline-to-a-graph-in-sql-server-reporting-services/
I've got it all up and running, but I want to work out the trend by department as well. However, it's just merging all the data into one final value. I think it's the below section of code that needs altering to calculate the sums for each of the departments I add in, but how best do I do this?
-- calculate sample size and the different sums
SELECT
@sample_size = COUNT(*)
,@sumX = SUM(ID)
,@sumY = SUM([OrderQuantity])
,@sumXX = SUM(ID*ID)
,@sumYY = SUM([OrderQuantity]*[OrderQuantity])
,@sumXY = SUM(ID*[OrderQuantity])
FROM #Temp_Regression;
-- output results
SELECT
SampleSize = @sample_size
,SumRID = @sumX
,SumOrderQty = @sumY
,SumXX = @sumXX
,SumYY = @sumYY
,SumXY = @sumXY;
These variables are then used to work out the trend line:
-- calculate the slope and intercept
SET @slope = CASE WHEN @sample_size = 1
THEN 0 -- avoid divide by zero error
ELSE (@sample_size * @sumXY - @sumX * @sumY) / (@sample_size * @sumXX - POWER(@sumX,2))
END;
SET @intercept = (@sumY - (@slope*@sumX)) / @sample_size;
You need to add the departments column to the SELECT and GROUP BY:
SELECT departments,
SampleSize = Count(*),
SumRID = Sum(ID),
SumOrderQty = Sum([OrderQuantity]),
SumXX = Sum(ID * ID),
SumYY = Sum([OrderQuantity] * [OrderQuantity]),
SumXY = Sum(ID * [OrderQuantity])
FROM #Temp_Regression
GROUP BY departments
Here is an easier way to calculate the slope & intercept for all departments:
;WITH cte
AS (SELECT departments,
sample_size = Count(*),
sumX = Sum(ID),
sumY = Sum([OrderQuantity]),
sumXX = Sum(ID * ID),
sumYY = Sum([OrderQuantity] * [OrderQuantity]),
sumXY = Sum(ID * [OrderQuantity])
FROM #Temp_Regression
GROUP BY departments),
slope
AS (SELECT departments,
Sample_Size,
sumX,
sumY,
slope = CASE
WHEN sample_size = 1 THEN 0 -- avoid divide by zero error
ELSE ( sample_size * sumXY - sumX * sumY ) / ( sample_size * sumXX - Power(sumX, 2) )
END
FROM cte)
SELECT departments,
slope,
intercept = ( sumY - ( slope * sumX ) ) / sample_size
FROM slope
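If you also want the fitted trend value on every row (for example, to plot the line per department), the per-department coefficients can be joined back onto the raw data. A sketch along the same lines, assuming #Temp_Regression carries the departments, ID and [OrderQuantity] columns used above (sums are cast to FLOAT to avoid integer division):
;WITH cte
AS (SELECT departments,
           sample_size = Count(*),
           sumX = Cast(Sum(ID) AS FLOAT),
           sumY = Cast(Sum([OrderQuantity]) AS FLOAT),
           sumXX = Cast(Sum(ID * ID) AS FLOAT),
           sumXY = Cast(Sum(ID * [OrderQuantity]) AS FLOAT)
    FROM #Temp_Regression
    GROUP BY departments),
coeff
AS (SELECT departments,
           sample_size, sumX, sumY,
           slope = CASE
                     WHEN sample_size = 1 THEN 0 -- avoid divide by zero error
                     ELSE ( sample_size * sumXY - sumX * sumY ) / ( sample_size * sumXX - Power(sumX, 2) )
                   END
    FROM cte)
SELECT t.departments,
       t.ID,
       t.[OrderQuantity],
       -- y = intercept + slope * x, evaluated per department
       ( c.sumY - ( c.slope * c.sumX ) ) / c.sample_size + c.slope * t.ID AS TrendValue
FROM #Temp_Regression t
JOIN coeff c
  ON c.departments = t.departments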

Pearson Correlation SQL Server

I have two tables:
ID,YRMO,Counts
1,Dec 2013,4
1,Jan 2014,6
1,Feb 2014,7
2,Jan 2014,6
2,Feb 2014,8
ID,YRMO,Counts
1,Dec 2013,10
1,Jan 2014,8
1,March 2014,12
2,Jan 2014,6
2,Feb 2014,10
I want to find the Pearson correlation coefficient for each ID across the two tables. There are more than 200 different IDs.
Pearson correlation is a measure of the linear correlation (dependence) between two variables X and Y, giving a value between +1 and −1 inclusive.
More can be found here: http://oreilly.com/catalog/transqlcook/chapter/ch08.html
in the calculating-correlation section.
To calculate the Pearson correlation coefficient, you first need to calculate the mean, then the standard deviation, and then the correlation coefficient, as outlined below.
1. Calculate Mean
insert into tab2 (tab1_id, mean)
select ID, sum([counts]) /
(select count(*) from tab1) as mean
from tab1
group by ID;
2. Calculate standard deviation
update tab2
set stddev = (
select sqrt(
sum([counts] * [counts]) /
(select count(*) from tab1)
- mean * mean
) stddev
from tab1
where tab1.ID = tab2.tab1_id
group by tab1.ID);
3. Finally Pearson Correlation Coefficient
select ID,
((sf.sum1 / (select count(*) from tab1)
- stats1.mean * stats2.mean
)
/ (stats1.stddev * stats2.stddev)) as PCC
from (
select r1.ID,
sum(r1.[counts] * r2.[counts]) as sum1
from tab1 r1
join tab1 r2
on r1.ID = r2.ID
group by r1.ID
) sf
join tab2 stats1
on stats1.tab1_id = sf.ID
join tab2 stats2
on stats2.tab1_id = sf.ID
Running this on your posted data gives the results shown in the demo fiddle here: http://sqlfiddle.com/#!3/0da20/5
EDIT:
Refined a bit. You can use the function below to get the PCC, but I am not getting exactly the same result as yours; rather, I get 0.999996000000000 for ID = 1.
This could be a great entry point for you. You can refine the calculation further from here.
create function calculate_PCC(@id int)
returns decimal(16,15)
as
begin
declare @mean numeric(16,5);
declare @stddev numeric(16,5);
declare @count numeric(16,5);
declare @pcc numeric(16,12);
declare @store numeric(16,7);
select @count = CONVERT(numeric(16,5), count(case when Id=@id then 1 end)) from tab1;
select @mean = convert(numeric(16,5),sum([Counts])) / @count
from tab1 WHERE ID = @id;
select @store = (sum(counts * counts) / @count) from tab1 WHERE ID = @id;
set @stddev = sqrt(@store - (@mean * @mean));
set @pcc = ((@store - (@mean * @mean)) / (@stddev * @stddev));
return @pcc;
end
Call the function like
select db_name.dbo.calculate_PCC(1)
A Single-Pass Solution:
There are two flavors of the Pearson correlation coefficient: one for a Sample and one for an entire Population. These are simple, single-pass and, I believe, correct formulas for both:
-- Methods for calculating the two Pearson correlation coefficients
SELECT
-- For Population
(avg(x * y) - avg(x) * avg(y)) /
(sqrt(avg(x * x) - avg(x) * avg(x)) * sqrt(avg(y * y) - avg(y) * avg(y)))
AS correlation_coefficient_population,
-- For Sample
(count(*) * sum(x * y) - sum(x) * sum(y)) /
(sqrt(count(*) * sum(x * x) - sum(x) * sum(x)) * sqrt(count(*) * sum(y * y) - sum(y) * sum(y)))
AS correlation_coefficient_sample
FROM (
-- The following generates a table of sample data containing two columns with a luke-warm and tweakable correlation
-- y = x for 0 thru 99, y = x - 100 for 100 thru 199, etc. Execute it as a stand-alone to see for yourself
-- x and y are CAST as DECIMAL to avoid integer math, you should definitely do the same
-- Try TOP 100 or less for full correlation (y = x for all cases), TOP 200 for a PCC of 0.5, TOP 300 for one near 0.33, etc.
-- The superfluous "+ 0" is where you could apply various offsets to see that they have no effect on the results
SELECT TOP 200
CAST(ROW_NUMBER() OVER (ORDER BY [object_id]) - 1 + 0 AS DECIMAL) AS x,
CAST((ROW_NUMBER() OVER (ORDER BY [object_id]) - 1) % 100 AS DECIMAL) AS y
FROM sys.all_objects
) AS a
As I noted in the comments, you can try the example with TOP 100 or less for full correlation (y = x for all cases); TOP 200 yields correlations very near 0.5; TOP 300, around 0.33; etc. There is a place ("+ 0") to add an offset if you like; spoiler alert, it has no effect. Make sure you CAST your values as DECIMAL - integer math can significantly impact these calcs.
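Not in the original answer, but since the question asks for one coefficient per ID across two tables: the same Sample formula can simply be grouped, assuming the earlier tab1/tab2 naming and that rows pair up on ID and YRMO (months missing from either table just drop out of the join):
SELECT ID,
       (count(*) * sum(x * y) - sum(x) * sum(y)) /
       (sqrt(count(*) * sum(x * x) - sum(x) * sum(x)) *
        sqrt(count(*) * sum(y * y) - sum(y) * sum(y)))
       AS correlation_coefficient_sample -- a zero-variance series would divide by zero here
FROM (
    -- pair up the two tables' Counts; CAST to DECIMAL to avoid integer math
    SELECT t1.ID,
           CAST(t1.Counts AS DECIMAL(18,6)) AS x,
           CAST(t2.Counts AS DECIMAL(18,6)) AS y
    FROM tab1 t1
    JOIN tab2 t2
      ON t2.ID = t1.ID
     AND t2.YRMO = t1.YRMO
) AS p
GROUP BY ID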

Handling multiple sparse matrices with SQL

I have a model like this:
matrices (
matricesID integer;
x integer;
y integer;
value float;
)
There will be many matrices stored in that table. Now I need to get the average value for each matrix, edge by edge: i.e. if one matrix is 20 * 30 and has values at (5,3), (5,7), (5,15), (12,4), (17,5), (17,10), I need four groups of data: one for all values where x=5, one for all values where x=17, one for all values where y=3, and one for all values where y=15, because those are the min/max populated x and y.
Is there any way to perform this with simple SQL?
Any idea will be appreciated.
BR
Edward
This is a guess as I don't have much experience in the problem domain:
select matricesID
, (select avg(value) from matrices where matricesID = a.matricesID and x = a.minx) as avgofminx
, (select avg(value) from matrices where matricesID = a.matricesID and x = a.maxx) as avgofmaxx
, (select avg(value) from matrices where matricesID = a.matricesID and y = a.miny) as avgofminy
, (select avg(value) from matrices where matricesID = a.matricesID and y = a.maxy) as avgofmaxy
from (
select matricesID
, min(x) as minx
, max(x) as maxx
, min(y) as miny
, max(y) as maxy
from matrices
group by matricesID
) as a
This is running in SQL Server, but the syntax is simple enough that it hopefully runs in whatever DBMS you are using.
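For what it's worth, the same result can be computed in a single scan with window functions instead of four correlated subqueries; a sketch that should work in SQL Server 2005+ and most modern DBMSs (AVG ignores the NULLs produced by the CASE expressions):
select matricesID
     , avg(case when x = minx then value end) as avgofminx
     , avg(case when x = maxx then value end) as avgofmaxx
     , avg(case when y = miny then value end) as avgofminy
     , avg(case when y = maxy then value end) as avgofmaxy
from (
    select m.*
         , min(x) over (partition by matricesID) as minx
         , max(x) over (partition by matricesID) as maxx
         , min(y) over (partition by matricesID) as miny
         , max(y) over (partition by matricesID) as maxy
    from matrices m
) as e
group by matricesID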