Is there a trend line function in PL/SQL?

I need a function to calculate a trend line. I have a query (part of the function):
select round(sum(nvl(vl_indice, vl_meta))/12, 2) from (
SELECT
SUM (vl_indice) vl_indice, SUM (vl_meta) vl_meta
FROM
(SELECT cd_mes, vl_indice, NULL vl_meta, dt.id_tempo,
fi.id_multi_empresa, fi.id_setor, fi.id_indice
FROM dbadw.fa_indice fi , dbadw.di_tempo dt ,
dbadw.di_multi_empresa dme , dbaportal.organizacao o ,
dbadw.di_indice di
WHERE fi.id_tempo = dt.id_tempo
AND DT.CD_MES BETWEEN TO_NUMBER(TO_CHAR(ADD_MONTHS(TO_DATE(TO_CHAR(PCD_MES),'YYYYMM'),- 11),'YYYYMM'))
AND PCD_MES
AND DT.ANO = TO_NUMBER(TO_CHAR(TO_DATE(TO_CHAR(PCD_MES),'YYYYMM'),'YYYY'))
AND fi.id_multi_empresa = dme.id_multi_empresa
AND dme.cd_multi_empresa = NVL(o.cd_multi_empresa_mv2000, o.cd_organizacao)
AND o.cd_organizacao = PCD_ORG
AND fi.id_setor IS NULL
AND fi.id_indice = di.id_indice
AND di.cd_indice = PCD_IVM
UNION ALL
SELECT cd_mes, NULL vl_indice, vl_meta, dt.id_tempo,
fm.id_multi_empresa, fm.id_setor, fm.id_indice
FROM dbadw.fa_meta_indice fm , dbadw.di_tempo dt ,
dbadw.di_multi_empresa dme , dbaportal.organizacao o ,
dbadw.di_indice di
WHERE fm.id_tempo = dt.id_tempo
AND DT.ANO = TO_NUMBER(TO_CHAR(TO_DATE(TO_CHAR(PCD_MES),'YYYYMM'),'YYYY'))
AND fm.id_multi_empresa = dme.id_multi_empresa
AND dme.cd_multi_empresa = NVL(o.cd_multi_empresa_mv2000, o.cd_organizacao)
AND o.cd_organizacao = PCD_ORG
AND fm.id_setor IS NULL
AND fm.id_indice = di.id_indice
AND di.cd_indice = PCD_IVM
)
GROUP BY cd_mes, id_tempo, id_multi_empresa, id_setor, id_indice
ORDER BY cd_mes);
I tried to calculate the trend line on the first line, but it is not correct. Can anybody help me, please?

It's very difficult to work out from a query what you want to fit a "trend line" to; by that I assume you mean using least-squares linear regression to find a best fit to the data.
So an example with test data:
Oracle Setup:
CREATE TABLE data ( x, y ) AS
SELECT LEVEL,
230 + DBMS_RANDOM.VALUE(-5,5) - 3.14159 * DBMS_RANDOM.VALUE( 0.95, 1.05 ) * LEVEL
FROM DUAL
CONNECT BY LEVEL <= 1000;
As you can see, the data is random, but it is approximately y = -3.14159x + 230.
Query - Get the Least Square Regression y-intercept and gradient:
SELECT REGR_INTERCEPT( y, x ) AS best_fit_y_intercept,
REGR_SLOPE( y, x ) AS best_fit_gradient
FROM data
This will get something like:
best_fit_y_intercept best_fit_gradient
-------------------- -----------------
230.531799878168 -3.143190435415
Query - Get the y co-ordinate of the line of best fit:
SELECT x,
y,
REGR_INTERCEPT( y, x ) OVER () + x * REGR_SLOPE( y, x ) OVER () AS best_fit_y
FROM data
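Not part of the original answer, but if the point of the trend line is to project beyond the observed data, the same two fitted coefficients can be applied to future x values; a minimal sketch against the same data table (the range 1001..1012 is just an illustrative choice):
SELECT f.x,
       c.best_fit_y_intercept + f.x * c.best_fit_gradient AS predicted_y
FROM ( SELECT REGR_INTERCEPT( y, x ) AS best_fit_y_intercept,
              REGR_SLOPE( y, x )     AS best_fit_gradient
       FROM   data ) c
CROSS JOIN
     ( SELECT 1000 + LEVEL AS x   -- x values just past the sample range
       FROM   DUAL
       CONNECT BY LEVEL <= 12 ) f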

The solution is:
SELECT valor, mes,
       ((mes * SLOPE) + INTERCEPT) TENDENCIA, SLOPE, INTERCEPT
FROM ( SELECT valor, mes,
              ROUND( REGR_SLOPE( valor, mes ) OVER ( PARTITION BY id_multi_empresa ), 4 ) SLOPE,
              ROUND( REGR_INTERCEPT( valor, mes ) OVER ( PARTITION BY id_multi_empresa ), 4 ) INTERCEPT
       FROM ( -- the initial select from the question goes here
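For reference, a self-contained sketch of the same pattern, with an illustrative table monthly_values ( id_multi_empresa, mes, valor ) standing in for the initial select (those names are assumptions for the example, not the real dbadw schema):
-- Sketch: per-company trend line over a monthly series
SELECT valor, mes,
       ( mes * slope ) + intercept AS tendencia, slope, intercept
FROM ( SELECT valor, mes,
              ROUND( REGR_SLOPE( valor, mes )
                     OVER ( PARTITION BY id_multi_empresa ), 4 ) AS slope,
              ROUND( REGR_INTERCEPT( valor, mes )
                     OVER ( PARTITION BY id_multi_empresa ), 4 ) AS intercept
       FROM   monthly_values )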

Related

Snowflake table and generator functions do not give expected result

I tried to create a simple SQL query to track query_history usage, but ran into trouble when creating my timeslots using the table and generator functions (the CTE named x below).
I got no results at all when limiting the query_history using my timeslots, so after a while I hardcoded an SQL query that gives the same result (the CTE named y below), and this works fine.
Why does x not work? As far as I can see, x and y produce identical results.
To test the example, first run the code as it is; this produces no result.
Then comment out the line "x as timeslots" and un-comment the line "y as timeslots"; this will give the desired result.
with
x as (
select
dateadd('min',seq4()*10,dateadd('min',-60,current_timestamp())) f,
dateadd('min',(seq4()+1)*10,dateadd('min',-60,current_timestamp())) t
from table(generator(rowcount => 6))
),
y as (
select
dateadd('min',n*10,dateadd('min',-60,current_timestamp())) f,
dateadd('min',(n+1)*10,dateadd('min',-60,current_timestamp())) t
from (select 0 n union all select 1 n union all select 2 union all select 3
union all select 4 union all select 5)
)
--select * from x;
--select * from y;
select distinct
user_name,
timeslots.f
from snowflake.account_usage.query_history,
x as timeslots
--y as timeslots
where start_time >= timeslots.f
and start_time < timeslots.t
order by timeslots.f desc;
(I know the code is not optimal, this is only meant to illustrate the problem)
From the Snowflake documentation for SEQ4:
Returns a sequence of monotonically increasing integers, with wrap-around. Wrap-around occurs after the largest representable integer of the integer width (1, 2, 4, or 8 byte).
If a fully ordered, gap-free sequence is required, consider using the ROW_NUMBER window function.
For:
with x as (
select
dateadd('min',seq4()*10,dateadd('min',-60,current_timestamp())) f,
dateadd('min',(seq4()+1)*10,dateadd('min',-60,current_timestamp())) t
from table(generator(rowcount => 6))
)
SELECT * FROM x;
Should be:
with x as (
select
(ROW_NUMBER() OVER(ORDER BY seq4())) - 1 AS n,
dateadd('min',n*10,dateadd('min',-60,current_timestamp())) f,
dateadd('min',(n+1)*10,dateadd('min',-60,current_timestamp())) t
from table(generator(rowcount => 6))
)
SELECT * FROM x;
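To see the difference side by side, you can select both the raw seq4() value and its ROW_NUMBER() replacement from the same generator (a sketch; per the documentation quoted above, the seq4() column may contain gaps, which is what broke the timeslot join):
select seq4()                                  as raw_seq,   -- may contain gaps
       row_number() over (order by seq4()) - 1 as gap_free_n -- never does
from table(generator(rowcount => 6));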

Recursive Formula based on previous Row's Result

Let's consider the following query:
with
init as (
select 0.1 as y0
),
cte as (
select 1 as i, 1 as x -- x_1
union all
select 2 as i, 10 as x -- x_2
union all
select 3 as i, 100 as x -- x_3
order by i asc
)
select cte.x, init.y0 -- <- ?
from cte
join init
on true
There is a CTE init specifying an initial value y_0 and a CTE cte specifying rows with a value x and an index i.
My question is whether I can write a select which realizes the following simple, recursive formula.
y_{n+1} = y_n + x_{n+1}
So, the result should be 3 rows with values: 1.1, 11.1, 111.1 (for y_1, y_2, y_3).
Would that be possible?
write a select which realizes the following simple, recursive formula.
y_{n+1} = y_n + x_{n+1}
Consider below
select x, y0 + sum(x) over(order by i) as y
from cte, init
Applied to the sample data in your question, this produces a running total of x offset by y0: the recursion unrolls to y_n = y_0 + x_1 + ... + x_n, which is exactly what the window SUM computes.
Note: the expected result you show in your question does not match the formula you provided, so obviously the above output is different from the one in your question :o)
You need to use the OVER clause; see the documentation for more on the window function syntax.
with
init as (
select 0.1 as y0
),
cte as (
select 1 as ts, 1 as i, 1 as x -- x_1
union all
select 2, 2, 10 as x -- x_2
union all
select 3, 3, 100 as x -- x_3
union all
select 4, 4, 109 as x -- x_4
union all
select 5, 5, 149 as x -- x_5
order by i asc
)
SELECT *, init.y0 + SUM(x) OVER(
ORDER BY ts
) AS res
FROM cte join init
on true
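With the extended sample above this yields res = 1.1, 11.1, 111.1, 220.1 and 369.1: each row adds its x to the previous row's total, starting from y0 = 0.1.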

SQL trend line by departments

I'm using the example in the link below to create an SQL trend line on a report:
https://www.mssqltips.com/sqlservertip/3432/add-a-linear-trendline-to-a-graph-in-sql-server-reporting-services/
I've got it all up and running, but I want to work out the trend by department as well. However, it's just merging all the data into one final value. I think it's the below section of code that needs altering to calculate the sums for each of the departments I add in, but how best do I do this?
-- calculate sample size and the different sums
SELECT
@sample_size = COUNT(*)
,@sumX = SUM(ID)
,@sumY = SUM([OrderQuantity])
,@sumXX = SUM(ID*ID)
,@sumYY = SUM([OrderQuantity]*[OrderQuantity])
,@sumXY = SUM(ID*[OrderQuantity])
FROM #Temp_Regression;
-- output results
SELECT
SampleSize = @sample_size
,SumRID = @sumX
,SumOrderQty = @sumY
,SumXX = @sumXX
,SumYY = @sumYY
,SumXY = @sumXY;
These variables are then used to work out the trend line:
-- calculate the slope and intercept
SET @slope = CASE WHEN @sample_size = 1
THEN 0 -- avoid divide by zero error
ELSE (@sample_size * @sumXY - @sumX * @sumY) / (@sample_size * @sumXX - POWER(@sumX,2))
END;
SET @intercept = (@sumY - (@slope*@sumX)) / @sample_size;
You need to add the departments column to the SELECT and GROUP BY:
SELECT departments,
SampleSize = Count(*),
SumRID = Sum(ID),
SumOrderQty = Sum([OrderQuantity]),
SumXX = Sum(ID * ID),
SumYY = Sum([OrderQuantity] * [OrderQuantity]),
SumXY = Sum(ID * [OrderQuantity])
FROM #Temp_Regression
GROUP BY departments
Here is an easier way to calculate the slope & intercept for all departments:
;WITH cte
AS (SELECT departments,
sample_size = Count(*),
sumX = Sum(ID),
sumY = Sum([OrderQuantity]),
sumXX = Sum(ID * ID),
sumYY = Sum([OrderQuantity] * [OrderQuantity]),
sumXY = Sum(ID * [OrderQuantity])
FROM #Temp_Regression
GROUP BY departments),
slope
AS (SELECT departments,
Sample_Size,
sumX,
sumY,
slope = CASE
WHEN sample_size = 1 THEN 0 -- avoid divide by zero error
ELSE ( sample_size * sumXY - sumX * sumY ) / ( sample_size * sumXX - Power(sumX, 2) )
END
FROM cte)
SELECT departments,
slope,
intercept = ( sumY - ( slope * sumX ) ) / sample_size
FROM slope
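If you also want the fitted trend value on every row (for example, to plot the line per department), the per-department coefficients can be joined back onto the raw data. A sketch along the same lines, assuming #Temp_Regression carries the departments, ID and [OrderQuantity] columns used above (sums are cast to FLOAT to avoid integer division):
;WITH cte
AS (SELECT departments,
           sample_size = Count(*),
           sumX = Cast(Sum(ID) AS FLOAT),
           sumY = Cast(Sum([OrderQuantity]) AS FLOAT),
           sumXX = Cast(Sum(ID * ID) AS FLOAT),
           sumXY = Cast(Sum(ID * [OrderQuantity]) AS FLOAT)
    FROM #Temp_Regression
    GROUP BY departments),
coeff
AS (SELECT departments,
           sample_size, sumX, sumY,
           slope = CASE
                     WHEN sample_size = 1 THEN 0 -- avoid divide by zero error
                     ELSE ( sample_size * sumXY - sumX * sumY ) / ( sample_size * sumXX - Power(sumX, 2) )
                   END
    FROM cte)
SELECT t.departments,
       t.ID,
       t.[OrderQuantity],
       -- y = intercept + slope * x, evaluated per department
       ( c.sumY - ( c.slope * c.sumX ) ) / c.sample_size + c.slope * t.ID AS TrendValue
FROM #Temp_Regression t
JOIN coeff c
  ON c.departments = t.departments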

Pearson Correlation SQL Server

I have two tables:
ID,YRMO,Counts
1,Dec 2013,4
1,Jan 2014,6
1,Feb 2014,7
2,Jan 2014,6
2,Feb 2014,8
ID,YRMO,Counts
1,Dec 2013,10
1,Jan 2014,8
1,March 2014,12
2,Jan 2014,6
2,Feb 2014,10
I want to find the Pearson correlation coefficient for each ID across the two tables. There are more than 200 different IDs.
Pearson correlation is a measure of the linear correlation (dependence) between two variables X and Y, giving a value between +1 and −1 inclusive.
More can be found here: http://oreilly.com/catalog/transqlcook/chapter/ch08.html
in the calculating-correlation section.
To calculate the Pearson correlation coefficient, you first need to calculate the mean, then the standard deviation, and then the correlation coefficient, as outlined below.
1. Calculate Mean
insert into tab2 (tab1_id, mean)
select ID, sum([counts]) /
(select count(*) from tab1) as mean
from tab1
group by ID;
2. Calculate standard deviation
update tab2
set stddev = (
select sqrt(
sum([counts] * [counts]) /
(select count(*) from tab1)
- mean * mean
) stddev
from tab1
where tab1.ID = tab2.tab1_id
group by tab1.ID);
3. Finally Pearson Correlation Coefficient
select ID,
((sf.sum1 / (select count(*) from tab1)
- stats1.mean * stats2.mean
)
/ (stats1.stddev * stats2.stddev)) as PCC
from (
select r1.ID,
sum(r1.[counts] * r2.[counts]) as sum1
from tab1 r1
join tab1 r2
on r1.ID = r2.ID
group by r1.ID
) sf
join tab2 stats1
on stats1.tab1_id = sf.ID
join tab2 stats2
on stats2.tab1_id = sf.ID
Running this on your posted data gives the results shown in the demo fiddle here: http://sqlfiddle.com/#!3/0da20/5
EDIT:
Refined a bit. You can use the function below to get the PCC, but I am not getting exactly the same result as yours; rather, I get 0.999996000000000 for ID = 1.
This could be a great entry point for you. You can refine the calculation further from here.
create function calculate_PCC(@id int)
returns decimal(16,15)
as
begin
declare @mean numeric(16,5);
declare @stddev numeric(16,5);
declare @count numeric(16,5);
declare @pcc numeric(16,12);
declare @store numeric(16,7);
select @count = CONVERT(numeric(16,5), count(case when Id=@id then 1 end)) from tab1;
select @mean = convert(numeric(16,5),sum([Counts])) / @count
from tab1 WHERE ID = @id;
select @store = (sum(counts * counts) / @count) from tab1 WHERE ID = @id;
set @stddev = sqrt(@store - (@mean * @mean));
set @pcc = ((@store - (@mean * @mean)) / (@stddev * @stddev));
return @pcc;
end
Call the function like
select db_name.dbo.calculate_PCC(1)
A Single-Pass Solution:
There are two flavors of the Pearson correlation coefficient: one for a Sample and one for an entire Population. These are simple, single-pass and, I believe, correct formulas for both:
-- Methods for calculating the two Pearson correlation coefficients
SELECT
-- For Population
(avg(x * y) - avg(x) * avg(y)) /
(sqrt(avg(x * x) - avg(x) * avg(x)) * sqrt(avg(y * y) - avg(y) * avg(y)))
AS correlation_coefficient_population,
-- For Sample
(count(*) * sum(x * y) - sum(x) * sum(y)) /
(sqrt(count(*) * sum(x * x) - sum(x) * sum(x)) * sqrt(count(*) * sum(y * y) - sum(y) * sum(y)))
AS correlation_coefficient_sample
FROM (
-- The following generates a table of sample data containing two columns with a luke-warm and tweakable correlation
-- y = x for 0 thru 99, y = x - 100 for 100 thru 199, etc. Execute it as a stand-alone to see for yourself
-- x and y are CAST as DECIMAL to avoid integer math, you should definitely do the same
-- Try TOP 100 or less for full correlation (y = x for all cases), TOP 200 for a PCC of 0.5, TOP 300 for one near 0.33, etc.
-- The superfluous "+ 0" is where you could apply various offsets to see that they have no effect on the results
SELECT TOP 200
CAST(ROW_NUMBER() OVER (ORDER BY [object_id]) - 1 + 0 AS DECIMAL) AS x,
CAST((ROW_NUMBER() OVER (ORDER BY [object_id]) - 1) % 100 AS DECIMAL) AS y
FROM sys.all_objects
) AS a
As I noted in the comments, you can try the example with TOP 100 or less for full correlation (y = x for all cases); TOP 200 yields correlations very near 0.5; TOP 300, around 0.33; etc. There is a place ("+ 0") to add an offset if you like; spoiler alert, it has no effect. Make sure you CAST your values as DECIMAL - integer math can significantly impact these calcs.
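Not in the original answer, but since the question asks for one coefficient per ID across two tables: the same Sample formula can simply be grouped, assuming the earlier tab1/tab2 naming and that rows pair up on ID and YRMO (months missing from either table just drop out of the join):
SELECT ID,
       (count(*) * sum(x * y) - sum(x) * sum(y)) /
       (sqrt(count(*) * sum(x * x) - sum(x) * sum(x)) *
        sqrt(count(*) * sum(y * y) - sum(y) * sum(y)))
       AS correlation_coefficient_sample -- a zero-variance series would divide by zero here
FROM (
    -- pair up the two tables' Counts; CAST to DECIMAL to avoid integer math
    SELECT t1.ID,
           CAST(t1.Counts AS DECIMAL(18,6)) AS x,
           CAST(t2.Counts AS DECIMAL(18,6)) AS y
    FROM tab1 t1
    JOIN tab2 t2
      ON t2.ID = t1.ID
     AND t2.YRMO = t1.YRMO
) AS p
GROUP BY ID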

Handling multiple sparse matrices with SQL

I have a model like this:
matrices (
matricesID integer;
x integer;
y integer;
value float;
)
There will be many matrices stored in that table. Now I need to get the average value for each matrix, edge by edge: i.e. if one matrix is 20 * 30 and has values at (5,3), (5,7), (5,15), (12,4), (17,5), (17,10), I need four groups of data: one for all values where x=5, one for all values where x=17, one for all values where y=3, and one for all values where y=15, because those are the min/max populated x and y.
Is there any way to perform this with simple SQL?
Any idea will be appreciated.
BR
Edward
This is a guess as I don't have much experience in the problem domain:
select matricesID
, (select avg(value) from matrices where matricesID = a.matricesID and x = a.minx) as avgofminx
, (select avg(value) from matrices where matricesID = a.matricesID and x = a.maxx) as avgofmaxx
, (select avg(value) from matrices where matricesID = a.matricesID and y = a.miny) as avgofminy
, (select avg(value) from matrices where matricesID = a.matricesID and y = a.maxy) as avgofmaxy
from (
select matricesID
, min(x) as minx
, max(x) as maxx
, min(y) as miny
, max(y) as maxy
from matrices
group by matricesID
) as a
This is running in SQL Server, but the syntax is simple enough that it hopefully runs in whatever DBMS you are using.
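For what it's worth, the same result can be computed in a single scan with window functions instead of four correlated subqueries; a sketch that should work in SQL Server 2005+ and most modern DBMSs (AVG ignores the NULLs produced by the CASE expressions):
select matricesID
     , avg(case when x = minx then value end) as avgofminx
     , avg(case when x = maxx then value end) as avgofmaxx
     , avg(case when y = miny then value end) as avgofminy
     , avg(case when y = maxy then value end) as avgofmaxy
from (
    select m.*
         , min(x) over (partition by matricesID) as minx
         , max(x) over (partition by matricesID) as maxx
         , min(y) over (partition by matricesID) as miny
         , max(y) over (partition by matricesID) as maxy
    from matrices m
) as e
group by matricesID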