Pearson Correlation SQL Server

I have two tables:
ID,YRMO,Counts
1,Dec 2013,4
1,Jan 2014,6
1,Feb 2014,7
2,Jan 2014,6
2,Feb 2014,8

ID,YRMO,Counts
1,Dec 2013,10
1,Jan 2014,8
1,March 2014,12
2,Jan 2014,6
2,Feb 2014,10
I want to find the Pearson correlation coefficient for each ID. There are more than 200 different IDs.
Pearson correlation is a measure of the linear correlation (dependence) between two variables X and Y, giving a value between +1 and −1 inclusive.
More can be found here: http://oreilly.com/catalog/transqlcook/chapter/ch08.html in the "Calculating correlation" section.

To calculate the Pearson correlation coefficient, you first calculate the mean, then the standard deviation, and then the correlation coefficient, as outlined below.
1. Calculate the mean
insert into tab2 (tab1_id, mean)
select ID,
       -- divide by the per-ID row count (not the whole table) and
       -- multiply by 1.0 to avoid integer division
       sum([counts]) * 1.0 / count(*) as mean
from tab1
group by ID;
2. Calculate the standard deviation
update tab2
set stddev = (
    select sqrt(
        -- E[X^2] - E[X]^2, computed per ID
        sum([counts] * [counts]) * 1.0 / count(*)
        - mean * mean
    )
    from tab1
    where tab1.ID = tab2.tab1_id
    group by tab1.ID
);
3. Finally, the Pearson correlation coefficient
select ID,
       ((sf.sum1 / (select count(*) from tab1)
         - stats1.mean * stats2.mean)
        / (stats1.stddev * stats2.stddev)) as PCC
from (
    select r1.ID,
           sum(r1.[counts] * r2.[counts]) as sum1
    from tab1 r1
    join tab1 r2
      on r1.ID = r2.ID
    group by r1.ID
) sf
join tab2 stats1
  on stats1.tab1_id = sf.ID
join tab2 stats2
  on stats2.tab1_id = sf.ID
You can see the results on your posted data in a demo fiddle here: http://sqlfiddle.com/#!3/0da20/5
EDIT:
Well, I refined it a bit. You can use the function below to get the PCC, but I am not getting exactly the same result as yours; I get 0.999996000000000 for ID = 1.
This could be a good starting point for you; you can refine the calculation further from here.
create function calculate_PCC(@id int)
returns decimal(16,15)
as
begin
    declare @mean   numeric(16,5);
    declare @stddev numeric(16,5);
    declare @count  numeric(16,5);
    declare @pcc    numeric(16,12);
    declare @store  numeric(16,7);

    -- number of rows for this ID
    select @count = convert(numeric(16,5), count(case when ID = @id then 1 end)) from tab1;
    -- mean of Counts for this ID
    select @mean = convert(numeric(16,5), sum([Counts])) / @count
    from tab1 where ID = @id;
    -- mean of the squared Counts, E[X^2]
    select @store = (sum([Counts] * [Counts]) / @count) from tab1 where ID = @id;

    set @stddev = sqrt(@store - (@mean * @mean));
    set @pcc = ((@store - (@mean * @mean)) / (@stddev * @stddev));
    return @pcc;
end
Call the function like
select db_name.dbo.calculate_PCC(1)
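If you need the coefficient for every ID at once, you can call the function per ID; a minimal sketch, reusing tab1 from above (db_name remains a placeholder for your database name):
-- Hypothetical: evaluate the PCC function for each distinct ID in tab1
select ID, db_name.dbo.calculate_PCC(ID) as PCC
from tab1
group by ID;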

A Single-Pass Solution:
There are two flavors of the Pearson correlation coefficient, one for a Sample and one for an entire Population. These are simple, single-pass, and I believe, correct formulas for both:
-- Methods for calculating the two Pearson correlation coefficients
SELECT
    -- For Population
    (avg(x * y) - avg(x) * avg(y)) /
    (sqrt(avg(x * x) - avg(x) * avg(x)) * sqrt(avg(y * y) - avg(y) * avg(y)))
        AS correlation_coefficient_population,
    -- For Sample
    (count(*) * sum(x * y) - sum(x) * sum(y)) /
    (sqrt(count(*) * sum(x * x) - sum(x) * sum(x)) * sqrt(count(*) * sum(y * y) - sum(y) * sum(y)))
        AS correlation_coefficient_sample
FROM (
    -- The following generates a table of sample data containing two columns with a luke-warm and tweakable correlation
    -- y = x for 0 thru 99, y = x - 100 for 100 thru 199, etc. Execute it as a stand-alone to see for yourself
    -- x and y are CAST as DECIMAL to avoid integer math; you should definitely do the same
    -- Try TOP 100 or less for full correlation (y = x for all cases), TOP 200 for a PCC of 0.5, TOP 300 for one near 0.33, etc.
    -- The superfluous "+ 0" is where you could apply various offsets to see that they have no effect on the results
    SELECT TOP 200
        CAST(ROW_NUMBER() OVER (ORDER BY [object_id]) - 1 + 0 AS DECIMAL) AS x,
        CAST((ROW_NUMBER() OVER (ORDER BY [object_id]) - 1) % 100 AS DECIMAL) AS y
    FROM sys.all_objects
) AS a
As noted in the comments, you can try the example with TOP 100 or less for full correlation (y = x for all cases); TOP 200 yields a correlation very near 0.5; TOP 300, one around 0.33; etc. There is a place ("+ 0") to add an offset if you like; spoiler alert: it has no effect. Make sure you CAST your values as DECIMAL; integer math can significantly distort these calculations.
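Applied to the original question, the same population formula can be grouped per ID. A sketch, assuming the two posted tables are named tab1 and tab2 and that their rows pair up on (ID, YRMO):
-- Per-ID Pearson correlation between the Counts columns of the two tables;
-- only months present in both tables contribute to a given coefficient
SELECT a.ID,
       (avg(a.x * b.y) - avg(a.x) * avg(b.y)) /
       (sqrt(avg(a.x * a.x) - avg(a.x) * avg(a.x)) *
        sqrt(avg(b.y * b.y) - avg(b.y) * avg(b.y))) AS PCC
FROM (SELECT ID, YRMO, CAST(Counts AS DECIMAL(16,5)) AS x FROM tab1) AS a
JOIN (SELECT ID, YRMO, CAST(Counts AS DECIMAL(16,5)) AS y FROM tab2) AS b
  ON a.ID = b.ID AND a.YRMO = b.YRMO
GROUP BY a.ID;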

Related

Oracle SQL : Calculating weighted probability

I'm struggling to retrieve a "weighted probability" from a database table in my SQL statement.
What do I need to do:
I have tabular information of probable financial values like:
Table my_table:

ID | P [%] | Value [$]
---|-------|----------
 1 |    50 |       200
 2 |    50 |       200
 3 |    60 |       100
I need to calculate the weighted probability of reasonable worst case financial value to occur.
The formula is:
P_weighted = 1 - (1 - P_1 * Value_1 / Max(Value_1..n)) * (1 - P_2 * Value_2 / Max(Value_1..n)) * ...
i.e.
P_weighted = 1 - Product(1 - P_i * Value_i / Max(Value_1..n))
For the table above:
P_weighted = 1 - (1 - 50% * 200 / 200) * (1 - 50% * 200 / 200) * (1 - 60% * 100 / 200) = 82.5%
I know there is no product function in (Oracle) SQL, but it can be emulated with EXP(SUM(LN(x))), provided x is always positive.
Hence, if I only had to calculate the combined probability (regardless of the values) I could do:
SELECT EXP(SUM(LN(1 - t.P))) FROM my_table t WHERE condition
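For reference, a minimal standalone sketch of that product trick, with arbitrary literal values:
-- PRODUCT(2, 3, 4) emulated as EXP(SUM(LN(x))); requires x > 0
SELECT EXP(SUM(LN(x))) AS product  -- ~24, up to floating-point rounding
FROM (SELECT 2 AS x FROM dual UNION ALL
      SELECT 3 FROM dual UNION ALL
      SELECT 4 FROM dual);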
When I need to include Max(t.Value), I run into the following problem:
A SELECT list cannot include both a group function, such as AVG, COUNT, MAX, MIN, SUM, STDDEV, or VARIANCE, and an individual column expression, unless the individual column expression is included in a GROUP BY clause.
So I tried the following:
SELECT ROUND(1-EXP(SUM(LN(1 - t.P*t.Value/max(t.Value)))),1) FROM my_table t WHERE condition GROUP BY t.P, t.Value
But this obviously groups the output by probability rather than multiplying across rows, and it just returns 0.5 (50%) instead of the product-based result, which should be 0.825 (82.5%).
How do I get the weighted probability from the table above using (Oracle) SQL?
Does this do it?
with da as (select .50 as p, 200 as v from dual
            union all select .50, 200 from dual
            union all select .60, 100 from dual),
     mx as (select max(v) mx from da)
select exp(sum(ln(1 - da.p * da.v / mx))) from da, mx;

EXP(SUM(LN(1-DA.P*DA.V/MX)))
----------------------------
                        .175

Note that .175 is the product term; subtracting it from 1 gives the weighted probability: 1 - .175 = .825.
with
test1 as (
    select max(value) v_max from my_table
),
test2 as (
    select 1 - (my.p / 100 * value / t1.v_max) rez
    from my_table my, test1 t1
)
select to_char(round((1 - (exp(sum(ln(rez))))) * 100, 2)) || '%' "Weighted probability"
from test2
RESULT:
Weighted probability
--------------------
82,5%
If you want the calculation per-row then you can use an analytic SUM:
SELECT id,
ROUND(1 - EXP(SUM(LN(1 - wp)) OVER (ORDER BY id)), 3) AS cwp
FROM (
SELECT id,
p * value / MAX(value) OVER () AS wp
FROM table_name
)
Which, for the sample data:
CREATE TABLE table_name (ID, P, Value) AS
SELECT 1, .50, 200 FROM DUAL UNION ALL
SELECT 2, .50, 200 FROM DUAL UNION ALL
SELECT 3, .60, 100 FROM DUAL;
Outputs the cumulative weighted probabilities:
ID | CWP
---|------
 1 | .5
 2 | .75
 3 | .825
If you just want the total weighted probability then:
SELECT ROUND(1 - EXP(SUM(LN(1 - wp))), 3) AS twp
FROM (
SELECT id,
p * value / MAX(value) OVER () AS wp
FROM table_name
)
Which, for the sample data, outputs:
TWP
.825
db<>fiddle here

Snowflake table and generator functions do not give the expected result

I tried to create a simple SQL to track query_history usage, but got into trouble when creating my timeslots using the table and generator functions (the CTE named x below).
I got no results at all when limiting the query_history using my timeslots, so after a while I hardcoded an SQL to give the same result (the CTE named y below) and this works fine.
Why does x not work? As far as I can see, x and y produce identical results.
To test the example, first run the code as it is; this produces no result.
Then comment out the line x as timeslots and un-comment the line y as timeslots; this gives the desired result.
with
x as (
select
dateadd('min',seq4()*10,dateadd('min',-60,current_timestamp())) f,
dateadd('min',(seq4()+1)*10,dateadd('min',-60,current_timestamp())) t
from table(generator(rowcount => 6))
),
y as (
select
dateadd('min',n*10,dateadd('min',-60,current_timestamp())) f,
dateadd('min',(n+1)*10,dateadd('min',-60,current_timestamp())) t
from (select 0 n union all select 1 n union all select 2 union all select 3
union all select 4 union all select 5)
)
--select * from x;
--select * from y;
select distinct
user_name,
timeslots.f
from snowflake.account_usage.query_history,
x as timeslots
--y as timeslots
where start_time >= timeslots.f
and start_time < timeslots.t
order by timeslots.f desc;
(I know the code is not optimal, this is only meant to illustrate the problem)
From the Snowflake documentation for SEQ4:
Returns a sequence of monotonically increasing integers, with wrap-around. Wrap-around occurs after the largest representable integer of the integer width (1, 2, 4, or 8 byte).
If a fully ordered, gap-free sequence is required, consider using the ROW_NUMBER window function.
For:
with x as (
select
dateadd('min',seq4()*10,dateadd('min',-60,current_timestamp())) f,
dateadd('min',(seq4()+1)*10,dateadd('min',-60,current_timestamp())) t
from table(generator(rowcount => 6))
)
SELECT * FROM x;
Should be:
with x as (
select
(ROW_NUMBER() OVER(ORDER BY seq4())) - 1 AS n,
dateadd('min',n*10,dateadd('min',-60,current_timestamp())) f,
dateadd('min',(n+1)*10,dateadd('min',-60,current_timestamp())) t
from table(generator(rowcount => 6))
)
SELECT * FROM x;

Out of range integer: infinity

So I'm trying to work through a problem that's a bit hard to explain, and I can't expose any of the data I'm working with, but what I'm trying to get my head around is the error below when running the query that follows. I've renamed some of the tables/columns for sensitivity reasons, but the structure is the same.
"Error from Query Engine - Out of range for integer: Infinity"
WITH accounts AS (
SELECT t.user_id
FROM table_a t
WHERE t.type like '%Something%'
),
CTE AS (
SELECT
st.x_user_id,
ad.name as client_name,
sum(case when st.score_type = 'Agility' then st.score_value else 0 end) as score,
st.obs_date,
ROW_NUMBER() OVER (PARTITION BY st.x_user_id,ad.name ORDER BY st.obs_date) AS rn
FROM client_scores st
LEFT JOIN account_details ad on ad.client_id = st.x_user_id
INNER JOIN accounts on st.x_user_id = accounts.user_id
--WHERE st.x_user_id IN (101011115,101012219)
WHERE st.obs_date >= '2020-05-18'
group by 1,2,4
)
SELECT
c1.x_user_id,
c1.client_name,
c1.score,
c1.obs_date,
CAST(COALESCE (((c1.score - c2.score) * 1.0 / c2.score) * 100, 0) AS INT) AS score_diff
FROM CTE c1
LEFT JOIN CTE c2 on c1.x_user_id = c2.x_user_id and c1.client_name = c2.client_name and c1.rn = c2.rn +2
I know the query works for sure, because when I get rid of the first CTE and hard-code two IDs into the commented-out WHERE clause, it returns the data I want. But I also need it to run based on the first CTE, which has ~5k unique IDs.
The sample output for two IDs has three rows per ID, so I would expect the full run to return about 5,000 * 3 = 15,000 rows.
What could be causing the out of range for integer error?
This line is likely your problem:
CAST(COALESCE (((c1.score - c2.score) * 1.0 / c2.score) * 100, 0) AS INT) AS score_diff
When the value of c2.score is 0, 1.0 / c2.score will be infinity, which does not fit into the integer type you are trying to cast it to.
The reason it’s working for the two users in your example is that they don’t have a 0 value for c2.score.
You might be able to fix this by changing to:
CAST(COALESCE (((c1.score - c2.score) * 1.0 / NULLIF(c2.score, 0)) * 100, 0) AS INT) AS score_diff
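A minimal illustration of the guard, with hypothetical literal values:
-- With a zero divisor, NULLIF(0, 0) returns NULL, the division yields NULL
-- rather than infinity, and COALESCE maps the final result to 0
SELECT CAST(COALESCE(((5 - 0) * 1.0 / NULLIF(0, 0)) * 100, 0) AS INT) AS score_diff;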

SQL trend line by departments

I'm using the example in the link below to create a SQL trend line on a report.
https://www.mssqltips.com/sqlservertip/3432/add-a-linear-trendline-to-a-graph-in-sql-server-reporting-services/
I've got it all up and running, but I want to work out the trend by department as well; currently it just merges all the data into one final value. I think the section of code below needs altering to calculate the sums per department, but how best do I do this?
-- calculate sample size and the different sums
SELECT
    @sample_size = COUNT(*)
    ,@sumX  = SUM(ID)
    ,@sumY  = SUM([OrderQuantity])
    ,@sumXX = SUM(ID*ID)
    ,@sumYY = SUM([OrderQuantity]*[OrderQuantity])
    ,@sumXY = SUM(ID*[OrderQuantity])
FROM #Temp_Regression;

-- output results
SELECT
    SampleSize = @sample_size
    ,SumRID = @sumX
    ,SumOrderQty = @sumY
    ,SumXX = @sumXX
    ,SumYY = @sumYY
    ,SumXY = @sumXY;
These variables are then used to work out the trend line:
-- calculate the slope and intercept
SET @slope = CASE WHEN @sample_size = 1
                  THEN 0 -- avoid divide by zero error
                  ELSE (@sample_size * @sumXY - @sumX * @sumY) / (@sample_size * @sumXX - POWER(@sumX,2))
             END;
SET @intercept = (@sumY - (@slope * @sumX)) / @sample_size;
You need to add the departments column to the SELECT and GROUP BY:
SELECT departments,
SampleSize = Count(*),
SumRID = Sum(ID),
SumOrderQty = Sum([OrderQuantity]),
SumXX = Sum(ID * ID),
SumYY = Sum([OrderQuantity] * [OrderQuantity]),
SumXY = Sum(ID * [OrderQuantity])
FROM #Temp_Regression
GROUP BY departments
Here is an easier way to calculate the slope and intercept for all departments:
;WITH cte
AS (SELECT departments,
sample_size = Count(*),
sumX = Sum(ID),
sumY = Sum([OrderQuantity]),
sumXX = Sum(ID * ID),
sumYY = Sum([OrderQuantity] * [OrderQuantity]),
sumXY = Sum(ID * [OrderQuantity])
FROM #Temp_Regression
GROUP BY departments),
slope
AS (SELECT departments,
Sample_Size,
sumX,
sumY,
slope = CASE
WHEN sample_size = 1 THEN 0 -- avoid divide by zero error
ELSE ( sample_size * sumXY - sumX * sumY ) / ( sample_size * sumXX - Power(sumX, 2) )
END
FROM cte)
SELECT departments,
slope,
intercept = ( sumY - ( slope * sumX ) ) / sample_size
FROM slope
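If you then need the fitted trend value at each sample point (e.g. for charting), a sketch under the same assumptions about #Temp_Regression as the tip linked above:
;WITH cte
AS (SELECT departments,
           sample_size = Count(*),
           sumX = Sum(ID),
           sumY = Sum([OrderQuantity]),
           sumXX = Sum(ID * ID),
           sumXY = Sum(ID * [OrderQuantity])
    FROM #Temp_Regression
    GROUP BY departments),
line
AS (SELECT departments,
           -- the 1.0 factor avoids integer division on the int sums
           slope = CASE
                     WHEN sample_size = 1 THEN 0 -- avoid divide by zero error
                     ELSE ( 1.0 * sample_size * sumXY - sumX * sumY ) / ( 1.0 * sample_size * sumXX - Power(sumX, 2) )
                   END,
           sumX, sumY, sample_size
    FROM cte)
SELECT t.departments,
       t.ID,
       t.[OrderQuantity],
       -- y = intercept + slope * x, with the intercept expanded inline
       ( ( l.sumY - ( l.slope * l.sumX ) ) / l.sample_size ) + l.slope * t.ID AS trend_value
FROM #Temp_Regression t
JOIN line l ON l.departments = t.departments;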

Median depending on other variable in SQL

I want to calculate a median conditionally, for example separately for men and women based on a sex variable. I would like to use a modification of one of the following two alternative queries from http://sqlperformance.com/2012/08/t-sql-queries/median, given there as the fastest methods for calculating the median.
Alternative 1
DECLARE @c BIGINT = (SELECT COUNT(*) FROM dbo.EvenRows);

SELECT AVG(1.0 * val)
FROM (
    SELECT val FROM dbo.EvenRows
    ORDER BY val
    OFFSET (@c - 1) / 2 ROWS
    FETCH NEXT 1 + (1 - @c % 2) ROWS ONLY
) AS x;
Alternative 2
SELECT @Median = AVG(1.0 * val)
FROM
(
    SELECT o.val, rn = ROW_NUMBER() OVER (ORDER BY o.val), c.c
    FROM dbo.EvenRows AS o
    CROSS JOIN (SELECT c = COUNT(*) FROM dbo.EvenRows) AS c
) AS x
WHERE rn IN ((c + 1)/2, (c + 2)/2);
Where should the GROUP BY go? Can someone advise me how to make a function with parameters like MEDIAN(ColumnToCalculateMedian, ColumnContainingSex)?
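Alternative 2 adapts naturally per group with PARTITION BY. A sketch, assuming dbo.EvenRows has a sex column:
-- Median of val per sex: number and count the rows within each sex partition,
-- then average the one or two middle values
SELECT sex, AVG(1.0 * val) AS median
FROM (
    SELECT sex, val,
           rn = ROW_NUMBER() OVER (PARTITION BY sex ORDER BY val),
           c  = COUNT(*)    OVER (PARTITION BY sex)
    FROM dbo.EvenRows
) AS x
WHERE rn IN ((c + 1) / 2, (c + 2) / 2)
GROUP BY sex;
On SQL Server 2012 and later, PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY val) OVER (PARTITION BY sex) is the built-in alternative; the article linked above compares the performance of several such approaches.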