Finding Covariance using SQL


dt          indx_nm1  indx_val1  indx_nm2  indx_val2
2009-06-08  ABQI      1001.2     ACNACTR   300.05
2009-06-09  ABQI      1002.12    ACNACTR   341.19
2009-06-10  ABQI      1011.4     ACNACTR   382.93
2009-06-11  ABQI      1015.43    ACNACTR   362.63
I have a table that looks like the one above (but with hundreds of rows, with dates from 2009 to 2013). Is there a way to calculate the covariance, that is, the sum over every row of the table of (indx_val1 - avg(indx_val1)) * (indx_val2 - avg(indx_val2)), divided by the total number of rows, and return a single value for cov(ABQI, ACNACTR)?

Since you have aggregates operating over two different groups, you will need two different queries. The main one groups by dt to get your row values per date. The other query has to perform AVG() and COUNT() aggregates across the whole rowset.
To use them both at the same time, you need to JOIN them together. But since there's no actual relation between the two queries, it is a cartesian product and we'll use a CROSS JOIN. Effectively, that joins every row of the main query with the single row retrieved by the aggregate query. You can then perform the arithmetic in the SELECT list, using values from both:
So, building on the query from your earlier question (each row's cv is that date's term of the covariance; sum the cv column to get the single value):
SELECT
indxs.*,
((indx_val2 - indx_val2_avg) * (indx_val1 - indx_val1_avg)) / total_rows AS cv
FROM (
SELECT
dt,
MAX(CASE WHEN indx_nm = 'ABQI' THEN indx_nm ELSE NULL END) AS indx_nm1,
MAX(CASE WHEN indx_nm = 'ABQI' THEN indx_val ELSE NULL END) AS indx_val1,
MAX(CASE WHEN indx_nm = 'ACNACTR' THEN indx_nm ELSE NULL END) AS indx_nm2,
MAX(CASE WHEN indx_nm = 'ACNACTR' THEN indx_val ELSE NULL END) AS indx_val2
FROM table1 a
GROUP BY dt
) indxs
CROSS JOIN (
/* Join against a query returning the AVG() and COUNT() across all rows */
SELECT
'ABQI' AS indx_nm1_aname,
AVG(CASE WHEN indx_nm = 'ABQI' THEN indx_val ELSE NULL END) AS indx_val1_avg,
'ACNACTR' AS indx_nm2_aname,
AVG(CASE WHEN indx_nm = 'ACNACTR' THEN indx_val ELSE NULL END) AS indx_val2_avg,
COUNT(DISTINCT dt) AS total_rows /* one per date; COUNT(*) would count each date once per index */
FROM table1 b
WHERE indx_nm IN ('ABQI','ACNACTR')
/* And it is a cartesian product */
) aggs
WHERE
indx_nm1 IS NOT NULL
AND indx_nm2 IS NOT NULL
ORDER BY dt
Here's a demo, building on your earlier one: http://sqlfiddle.com/#!6/2ec65/14
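To return the single covariance value the question asks for, the per-date terms can be summed in the same query. The following is a minimal sketch using Python's built-in sqlite3 module rather than SQL Server; it assumes the long-format table1 (dt, indx_nm, indx_val) used in the answer above, loads the four sample rows from the question, and computes the population covariance (dividing by n).

```python
import sqlite3

# Assumed long-format layout, mirroring table1 in the answer above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (dt TEXT, indx_nm TEXT, indx_val REAL)")
sample = [
    ("2009-06-08", "ABQI", 1001.2), ("2009-06-08", "ACNACTR", 300.05),
    ("2009-06-09", "ABQI", 1002.12), ("2009-06-09", "ACNACTR", 341.19),
    ("2009-06-10", "ABQI", 1011.4), ("2009-06-10", "ACNACTR", 382.93),
    ("2009-06-11", "ABQI", 1015.43), ("2009-06-11", "ACNACTR", 362.63),
]
conn.executemany("INSERT INTO table1 VALUES (?, ?, ?)", sample)

query = """
WITH pivoted AS (
    -- One row per date, with the two index values side by side
    SELECT dt,
           MAX(CASE WHEN indx_nm = 'ABQI' THEN indx_val END) AS x,
           MAX(CASE WHEN indx_nm = 'ACNACTR' THEN indx_val END) AS y
    FROM table1
    GROUP BY dt
),
paired AS (
    -- Keep only dates where both indexes have a value
    SELECT x, y FROM pivoted WHERE x IS NOT NULL AND y IS NOT NULL
)
SELECT SUM((p.x - a.x_avg) * (p.y - a.y_avg)) / MAX(a.n) AS cov_xy
FROM paired AS p
CROSS JOIN (
    -- Single-row aggregate joined to every paired row
    SELECT AVG(x) AS x_avg, AVG(y) AS y_avg, COUNT(*) AS n FROM paired
) AS a
"""
cov_xy = conn.execute(query).fetchone()[0]
print(cov_xy)
```

Because the averages and the row count are taken from the paired rows, dates that are missing one of the two indexes do not distort the result.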

Here is a scalar-valued function that performs a covariance calculation on any two-column table formatted as XML.
To test it, compile the function and then execute the alpha test in the comment block.
CREATE Function [dbo].[Covariance](@XmlTwoValueSeries xml)
returns float
as
Begin
/*
-- -----------
-- ALPHA TEST
-- -----------
IF object_id('tempdb..#_201610101706') is not null DROP TABLE #_201610101706
select *
into #_201610101706
from
(
select *
from
(
SELECT '2016-01' Period, 1.24 col0, 2.20 col1
union
SELECT '2016-02' Period, 1.6 col0, 3.20 col1
union
SELECT '2016-03' Period, 1.0 col0, 2.77 col1
union
SELECT '2016-04' Period, 1.9 col0, 2.98 col1
) A
) A
DECLARE @XmlTwoValueSeries xml
SET @XmlTwoValueSeries = (
SELECT col0,col1 FROM #_201610101706
FOR
XML PATH('Output')
)
SELECT dbo.Covariance(@XmlTwoValueSeries) Covariance
*/
declare @returnvalue numeric(20,10)
set @returnvalue =
(
SELECT SUM((x - xAvg) *(y - yAvg)) / MAX(n) AS [COVAR(x,y)]
from
(
SELECT 1E * x x,
AVG(1E * x) OVER (PARTITION BY (SELECT NULL)) xAvg,
1E * y y,
AVG(1E * y) OVER (PARTITION BY (SELECT NULL)) yAvg,
COUNT(*) OVER (PARTITION BY (SELECT NULL)) n
FROM
(
SELECT
e.c.value('(col0/text())[1]', 'float' ) x,
e.c.value('(col1/text())[1]', 'FLOAT' ) y
FROM @XmlTwoValueSeries.nodes('Output') e(c)
) A
) A
)
return @returnvalue
end
GO

Related

Oracle Pivot rows to columns pattern matching

I want to rearrange the rows to columns (in tbl2 below) to count the number of occurrences of EXEN for the EXEN col, and any code starting with MPA for the MPACODE column.
SELECT *
FROM (select code from tbl2 where pidm='4062161')
PIVOT (count(*) FOR (code) IN ('EXEN' AS EXEN, 'MPA%' AS MPACODE));

You must first perform an intermediate step that maps every code matching MPA% to MPA (see subquery dt2):
with dt as (
select 'EXEN' code from dual union all
select 'MPA'||rownum from dual connect by level <= 10),
dt2 as (
select
case when code like 'MPA%' then 'MPA' else code end as code
from dt)
select *
from dt2
pivot (
count(*) for
(code) IN ('EXEN' AS EXEN, 'MPA' AS MPACODE));
EXEN MPACODE
---------- ----------
1 10
PIVOT performs an equality comparison (not LIKE), so 'MPA%' AS MPACODE only matches the literal string 'MPA%' and never codes such as MPA1. That is why the original query fails.

For example, with conditional aggregation instead of PIVOT:
select
count(case when code='EXEN' then 1 end) exen,
count(case when code like 'MPA%' then 1 end) mpacode
from tbl2 where pidm='4062161';
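The conditional-aggregation form above can be tried without an Oracle instance. This is a small sketch using Python's built-in sqlite3 module; the rows in tbl2 are made-up sample data (the original table contents were not preserved), and only the pidm and code columns are assumed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl2 (pidm TEXT, code TEXT)")
# Hypothetical sample rows: one EXEN and three MPA codes for the pidm of interest
conn.executemany(
    "INSERT INTO tbl2 VALUES (?, ?)",
    [("4062161", "EXEN"), ("4062161", "MPA1"), ("4062161", "MPA2"),
     ("4062161", "MPA3"), ("9999999", "MPA1")],
)

# COUNT ignores NULLs, so each CASE counts only the rows that match its condition
exen, mpacode = conn.execute("""
    SELECT COUNT(CASE WHEN code = 'EXEN' THEN 1 END) AS exen,
           COUNT(CASE WHEN code LIKE 'MPA%' THEN 1 END) AS mpacode
    FROM tbl2
    WHERE pidm = '4062161'
""").fetchone()
print(exen, mpacode)
```

Unlike PIVOT, each CASE expression can use any predicate, including LIKE.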

How to replace nulls with zeros in pivot query sql for fact table in Databricks

I see lots of solutions for how to do this where there is a column being queried, including the following:
how to Replace null with zero in pivot SQL query
Oracle 11g SQL - Replacing NULLS with zero where query has PIVOT
Replacing null values in dynamic pivot sql query
and so on.
But how do you replace the nulls in a pivot query when you are creating a fact table for the existence of a condition?
For example, in Databricks, how do I replace the nulls in the following?
Setup
drop table if exists patient_dx;
create table patient_dx (patient_id string, dx string);
insert into patient_dx values
('Bob', 'cough'),
('Donna', 'cough'),
('Jerry', 'cough'),
('Bob', 'feaver'),
('Donna', 'head ache')
;
Query:
select * from (
select
patient_id,
dx,
cast (1 as int) cnt
from
patient_dx
)
pivot (
max(cnt)
for dx in ('cough','feaver','head ache')
)
;
The result contains nulls wherever a patient does not have the diagnosis.
I've tried several permutations of:
cast(0 + cast(coalesce(sum(coalesce(cnt,0)),0) as int) as int) as cnt
to no avail.

You have to use coalesce, or an IS NOT NULL check, in the outer select list to substitute the null values.
Try this:
spark.sql("""
select
patient_id,
CASE
when cough is NOT NULL THEN cough
else 0
END as cough,
CASE
when feaver is NOT NULL THEN feaver
else 0
END as feaver,
CASE
when `head ache` is NOT NULL THEN `head ache`
else 0
END as `head ache`
from (
select * from patient_dx
)
PIVOT(
Count(dx)
for dx in ('cough','feaver','head ache')
)
;
""").show()
The output will be:
patient_id  cough  feaver  head ache
Donna       1      0       1
Jerry       1      0       0
Bob         1      1       0
If you want it to be dynamic:
dist=spark.sql("select collect_set(dx) from patient_dx;").toPandas()
val=spark.sql("""
select
patient_id,
coalesce(cough,0) as `cough`,
coalesce(feaver,0) as `feaver`,
coalesce(`head ache`,0) as `head ache`
from (
select * from patient_dx
)
PIVOT(
Count(dx)
for dx in """
+
str(tuple(map(tuple, *dist.values))[0])
+
"""
)
;
""")
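The dynamic version above still hardcodes the coalesce columns in the select list. One way to generate both the select list and the IN list from the same set of values is sketched below in plain Python. The helper name build_pivot_query and its parameters are hypothetical, and the sketch assumes the values contain no quotes or backticks, since it does no escaping.

```python
def build_pivot_query(table, key_col, pivot_col, values):
    """Build a Spark SQL pivot query whose pivoted columns are wrapped in coalesce(..., 0).

    Hypothetical helper: assumes values contain no quotes or backticks.
    """
    # One coalesce per pivoted column; backticks allow names with spaces
    select_list = ",\n  ".join(f"coalesce(`{v}`, 0) AS `{v}`" for v in values)
    # The same values, as string literals, for the PIVOT ... IN (...) list
    in_list = ", ".join(f"'{v}'" for v in values)
    return (
        f"SELECT\n  {key_col},\n  {select_list}\n"
        f"FROM (SELECT {key_col}, {pivot_col} FROM {table})\n"
        f"PIVOT (count({pivot_col}) FOR {pivot_col} IN ({in_list}))"
    )


query = build_pivot_query(
    "patient_dx", "patient_id", "dx", ["cough", "feaver", "head ache"]
)
print(query)
```

In a Databricks notebook the list of values could be collected first, for example from a select distinct dx query, and the resulting string passed to spark.sql.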

Create table with with clause in Hana

I want to create a table using a WITH clause.
For example:
with cte as
(SELECT B.STEUC "HSN",
CASE WHEN A.FORMAT_CD='520' THEN '2' ELSE A.FORMAT_CD END FORMAT_CD,
CASE WHEN A.FORMAT_CD='520' THEN 'DIGITAL' ELSE A.FORMAT_DESC END FORMAT_DESC,
A.ARTICLE,
A.REGION "STATE",
SUM(CASE WHEN BWART IN ('702','704','708','711','713','715','717','551','553','555','903','909','951','Z09') THEN DMBTR ELSE 0 END) -
SUM(CASE WHEN BWART IN ('701','703','707','712','714','716','718','552','554',' 556','904','910','952','Z10') THEN DMBTR ELSE 0 END) "LOSS_VALUE"
FROM "_SYS_BIC"."RRA.DnL/CV_STOCK_MOVEMENT" A INNER JOIN "P22"."MARA" B ON A.ARTICLE=B.MATNR
WHERE posting_date BETWEEN '20181101' AND '20181130' AND
BWART IN ('702','704',' 708',' 711','713','715','717','701','703','707','712','714','716','718','551',
'552','553','554','555','556','903','904','909','910','951','952','Z09','Z10') AND
A.COMPANY_CODE='9008' AND
A.LEVEL2 NOT IN ('10','99') AND
A.LEVEL5 NOT IN ('140601010') AND
A.FORMAT_CD NOT IN ('51','56','62','509')
GROUP BY B.STEUC,A.ARTICLE,A.REGION,A.COMPANY_CODE,A.FORMAT_CD,
A.FORMAT_DESC)
SELECT A.HSN,
A.STATE,
A.FORMAT_CD,
A.FORMAT_DESC,
A.ARTICLE,
A.LOSS_ART,
B.LOSS
FROM (
SELECT A.HSN,
A.STATE,
A.FORMAT_CD,
A.FORMAT_DESC,
A.ARTICLE,
A.LOSS LOSS_ART,
SUM(A.LOSS) OVER (PARTITION BY A.HSN,A.STATE,A.FORMAT_CD ORDER BY LOSS DESC) LOSS
FROM (SELECT A.HSN,A.STATE,A.FORMAT_CD,A.FORMAT_DESC,A.ARTICLE,SUM(LOSS_VALUE) LOSS FROM
--"RR_ANALYST"."REETIKA_LOSS_DATA_1"
cte A
INNER JOIN P22.MARA B ON A.ARTICLE=B.MATNR
WHERE B.ATTYP<>'11'
GROUP BY A.HSN,A.STATE,A.FORMAT_CD,A.FORMAT_DESC,A.ARTICLE
HAVING SUM(LOSS_VALUE)>0 ) A
) A ,
(SELECT A.HSN,A.STATE,A.FORMAT_CD,SUM(LOSS_VALUE) LOSS FROM
cte A
group by A.HSN,A.STATE,A.FORMAT_CD HAVING SUM(LOSS_VALUE)>0) B
WHERE A.HSN=B.HSN AND
A.STATE=B.STATE AND
A.FORMAT_CD=B.FORMAT_CD AND
A.LOSS<=B.LOSS*1
I guess this is not supported in HANA.
What would be an alternative?
Does HANA support spooling like Oracle?
I know I can export the result set and then create a table accordingly.
But is there any way to create the table dynamically?

An alternative is the old-fashioned way: an inline view.
CREATE column TABLE t AS
SELECT *
FROM (SELECT 1 as some_value --> this is your WITH factoring clause
FROM dummy
UNION ALL
SELECT 2
FROM dummy
);

Get every combination of sort order and value of a csv

If I have a string with numbers separated by commas, like this:
Declare @string varchar(20) = '123,456,789'
and would like to return every possible combination and sort order of the values by doing this:
Select Combination FROM dbo.GetAllCombinations(@string)
which would return this:
123
456
789
123,456
456,123
123,789
789,123
456,789
789,456
123,456,789
123,789,456
456,789,123
456,123,789
789,456,123
789,123,456
As you can see, not only is every combination returned, but every sort order of each combination as well. The example shows only 3 comma-separated values, but the solution should handle any number of values (recursively).
The logic needed would be somewhere in the realm of WITH CUBE, but the problem with WITH CUBE (on a table structure instead of a CSV, of course) is that it won't shuffle the order of the values (123,456 versus 456,123, and so on). It only provides each combination, which is only half of the battle.
Currently I have no idea what to try. If someone can provide some assistance, it would be appreciated.

I use a user-defined table-valued function called split_delimiter that takes two arguments: @delimited_string and @delimiter_type.
CREATE FUNCTION [dbo].[split_delimiter](@delimited_string VARCHAR(8000), @delimiter_type CHAR(1))
RETURNS TABLE AS
RETURN
WITH cte10(num) AS
(
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
)
,cte100(num) AS
(
SELECT 1
FROM cte10 t1, cte10 t2
)
,cte10000(num) AS
(
SELECT 1
FROM cte100 t1, cte100 t2
)
,cte1(num) AS
(
SELECT TOP (ISNULL(DATALENGTH(@delimited_string),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM cte10000
)
,cte2(num) AS
(
SELECT 1
UNION ALL
SELECT t.num+1
FROM cte1 t
WHERE SUBSTRING(@delimited_string,t.num,1) = @delimiter_type
)
,cte3(num,[len]) AS
(
SELECT t.num
,ISNULL(NULLIF(CHARINDEX(@delimiter_type,@delimited_string,t.num),0)-t.num,8000)
FROM cte2 t
)
SELECT delimited_item_num = ROW_NUMBER() OVER(ORDER BY t.num)
,delimited_value = SUBSTRING(@delimited_string, t.num, t.[len])
FROM cte3 t;
Using that I was able to parse the CSV to a table and join it back to itself multiple times and use WITH ROLLUP to get the permutations you are looking for.
WITH Numbers as
(
SELECT delimited_value
FROM dbo.split_delimiter('123,456,789',',')
)
SELECT CAST(Nums1.delimited_value AS VARCHAR)
,ISNULL(CAST(Nums2.delimited_value AS VARCHAR),'')
,ISNULL(CAST(Nums3.delimited_value AS VARCHAR),'')
,CAST(Nums4.delimited_value AS VARCHAR)
FROM Numbers as Nums1
LEFT JOIN Numbers as Nums2
ON Nums2.delimited_value not in (Nums1.delimited_value)
LEFT JOIN Numbers as Nums3
ON Nums3.delimited_value not in (Nums1.delimited_value, Nums2.delimited_value)
LEFT JOIN Numbers as Nums4
ON Nums4.delimited_value not in (Nums1.delimited_value, Nums2.delimited_value, Nums3.delimited_value)
GROUP BY CAST(Nums1.delimited_value AS VARCHAR)
,ISNULL(CAST(Nums2.delimited_value AS VARCHAR),'')
,ISNULL(CAST(Nums3.delimited_value AS VARCHAR),'')
,CAST(Nums4.delimited_value AS VARCHAR) WITH ROLLUP
If you may have more than three or four values, you will need to add a join (and the matching SELECT and GROUP BY columns) for each additional value.
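For an arbitrary number of values, a recursive CTE avoids adding one join per value. The sketch below is not the approach from the answer above: it uses Python's built-in sqlite3 module instead of SQL Server, splits the CSV in Python into a hypothetical items table, and assumes the values are distinct. Each recursion step appends one value that is not yet part of the combination.

```python
import sqlite3

csv = "123,456,789"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (val TEXT)")  # hypothetical helper table
conn.executemany("INSERT INTO items VALUES (?)", [(v,) for v in csv.split(",")])

query = """
WITH RECURSIVE perms(combo) AS (
    -- Anchor: every single value is a combination of length 1
    SELECT val FROM items
    UNION ALL
    -- Step: append any value not already present in the combination
    SELECT p.combo || ',' || i.val
    FROM perms AS p
    JOIN items AS i
      ON instr(',' || p.combo || ',', ',' || i.val || ',') = 0
)
SELECT combo FROM perms
"""
combos = [row[0] for row in conn.execute(query)]
print(len(combos))
for combo in combos:
    print(combo)
```

The number of rows grows factorially with the number of values, so this is only practical for short lists.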

Subquery within SubQuery in SQL - DB2

I am having an issue making the subquery shown below dynamic, so that it filters on one of the values returned by the outer query. Can someone please tell me what I am doing wrong? The same kind of reference worked in the first subquery.
( SELECT
MAX( MAX_DATE - MIN_DATE ) AS NUM_CONS_DAYS
FROM
(
SELECT
MIN(TMP.D_DAT_INDEX_DATE) AS MIN_DATE,
MAX(TMP.D_DAT_INDEX_DATE) AS MAX_DATE,
SUM(INDEX_COUNT) AS SUM_INDEX
FROM
(
SELECT
D_DAT_INDEX_DATE,
INDEX_COUNT,
D_DAT_INDEX_DATE - (DENSE_RANK() OVER(ORDER BY D_DAT_INDEX_DATE)) DAYS AS G
FROM
DWH.MQT_SUMMARY_WATER_READINGS
WHERE
N_COD_METER_CNTX_KEY = 79094
) AS TMP
GROUP BY
TMP.G
ORDER BY
1
) ) AS MAX_NUM_CONS_DAYS
Above is the subquery in which I am trying to replace the hard-coded key (79094) with CTXKEY or CTXT.N_COD_METER_CNTX_KEY from the outer query. Below is the full code. Please note that this worked in the subquery just before MAX_NUM_CONS_DAYS (DAYSRECEIVED3). However, that subquery is nested only one level deep.
SELECT
N_COD_WM_DWH_KEY,
V_COD_WM_SN_2,
N_COD_SP_ID,
CTXKEY,
V_COD_MIU_SN,
N_COD_POD,
MIU_CAT,
V_COD_SITR_ASSOCIATED,
WO_INST_DATE,
WO_MIU_CAT,
DAYSRECEIVED3,
MAX_NUM_CONS_DAYS,
( CASE WHEN ( DAYSRECEIVED3 = 3 ) THEN 'Y' ELSE 'N' END ) AS GREEN,
( CASE WHEN ( DAYSRECEIVED3 < 3 AND DAYSRECEIVED3 > 0 ) THEN 'Y' ELSE 'N' END ) AS BLUE,
( CASE WHEN ( DAYSRECEIVED3 = 0 AND MAX_NUM_CONS_DAYS >= 5 ) THEN 'Y' ELSE 'N' END ) AS ORANGE,
( CASE WHEN ( DAYSRECEIVED3 = 0 AND MAX_NUM_CONS_DAYS BETWEEN 1 and 4 ) THEN 'Y' ELSE 'N' END ) AS RED
FROM
(
SELECT
WMETER.N_COD_WM_DWH_KEY,
WMETER.V_COD_WM_SN_2,
WMETER.N_COD_SP_ID,
CTXT.N_COD_METER_CNTX_KEY AS CTXKEY,
CTXT.V_COD_MIU_SN,
CTXT.N_COD_POD,
MIU.N_COD_MIU_CATEGORY AS MIU_CAT,
CTXT.V_COD_SITR_ASSOCIATED,
T1.D_DAT_PLAN_INST AS WO_INST_DATE,
T1.N_COD_MIU_CATEGORY AS WO_MIU_CAT,
( SELECT COUNT( DISTINCT D_DAT_INDEX_DATE ) FROM DWH.MQT_SUMMARY_WATER_READINGS WHERE ( N_COD_METER_CNTX_KEY = CTXT.N_COD_METER_CNTX_KEY ) AND D_DAT_INDEX_DATE BETWEEN ( '2013-07-10' ) AND ( '2013-07-12' ) ) AS DAYSRECEIVED3,
( SELECT
MAX( MAX_DATE - MIN_DATE ) AS NUM_CONS_DAYS
FROM
(
SELECT
MIN(TMP.D_DAT_INDEX_DATE) AS MIN_DATE,
MAX(TMP.D_DAT_INDEX_DATE) AS MAX_DATE,
SUM(INDEX_COUNT) AS SUM_INDEX
FROM
(
SELECT
D_DAT_INDEX_DATE,
INDEX_COUNT,
D_DAT_INDEX_DATE - (DENSE_RANK() OVER(ORDER BY D_DAT_INDEX_DATE)) DAYS AS G
FROM
DWH.MQT_SUMMARY_WATER_READINGS
WHERE
N_COD_METER_CNTX_KEY = 79094
) AS TMP
GROUP BY
TMP.G
ORDER BY
1
) ) AS MAX_NUM_CONS_DAYS
FROM DWH.DWH_WATER_METER AS WMETER
LEFT JOIN DWH.DWH_WMETER_CONTEXT AS CTXT
ON WMETER.N_COD_WM_DWH_KEY = CTXT.N_COD_WM_DWH_KEY
LEFT JOIN DWH.DWH_MIU AS MIU
ON CTXT.V_COD_MIU_SN = MIU.V_COD_MIU_SN
LEFT JOIN
( SELECT V_COD_CORR_WAT_METER_SN, D_DAT_PLAN_INST, N_COD_MIU_CATEGORY
FROM DWH.DWH_ORDER_MANAGEMENT_FACT
JOIN DWH.DWH_MIU
ON DWH.DWH_ORDER_MANAGEMENT_FACT.V_COD_MIU_SN = DWH.DWH_MIU.V_COD_MIU_SN
) AS T1
ON WMETER.V_COD_WM_SN_2 = T1.V_COD_CORR_WAT_METER_SN
WHERE
( V_COD_SITR_ASSOCIATED = 'X' )
AND ( ( MIU.N_COD_MIU_CATEGORY <> 4 ) OR ( ( MIU.N_COD_MIU_CATEGORY IS NULL ) AND ( ( T1.N_COD_MIU_CATEGORY <> 4 ) OR ( T1.N_COD_MIU_CATEGORY IS NULL ) ) ) )
)
Error I am getting is:
Error Code: -204, SQL State: 42704

I would say that a good option here would be to use a CTE, or Common Table Expression. You can do something similar to the following:
WITH CTE_X AS(
SELECT VAL_A
,VAL_B
FROM TABLE_A)
,CTE_Y AS(
SELECT VAL_C
,VAL_B
FROM TABLE_B)
SELECT VAL_A
,VAL_B
FROM CTE_X X
JOIN CTE_Y Y
ON X.VAL_A = Y.VAL_C;
While this isn't specific to your example, it does show that CTEs create a sort of temporary "in memory" table that you can access in a subsequent query. This should allow you to issue your inner two subselects as a CTE, and then use the CTE in the "SELECT MAX( MAX_DATE - MIN_DATE ) AS NUM_CONS_DAYS" query.

You cannot reference columns of the outer select from a subselect that is nested more than one level deep. If I understand correctly what you're doing, you'll probably need to join DWH.MQT_SUMMARY_WATER_READINGS and DWH.DWH_WMETER_CONTEXT in the outer select.
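One way to avoid the deeply nested correlated reference is to compute the longest run of consecutive days for every meter key at once, and then join that result to the outer query on the key. The sketch below uses Python's built-in sqlite3 module rather than DB2 and assumes an SQLite build with window-function support (3.25 or later). The readings table, its columns, and the sample rows are hypothetical stand-ins for DWH.MQT_SUMMARY_WATER_READINGS. As in the original query, the result is MAX(MAX_DATE - MIN_DATE), so a run of three consecutive days yields 2.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (meter_key INTEGER, reading_date TEXT)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [(1, "2013-07-01"), (1, "2013-07-02"), (1, "2013-07-03"), (1, "2013-07-05"),
     (2, "2013-07-10"), (2, "2013-07-12")],
)

query = """
WITH ranked AS (
    -- Day number minus the per-key dense rank is constant within a consecutive run
    SELECT meter_key,
           CAST(julianday(reading_date) AS INTEGER) AS day_no,
           CAST(julianday(reading_date) AS INTEGER)
             - DENSE_RANK() OVER (PARTITION BY meter_key ORDER BY reading_date) AS grp
    FROM readings
),
islands AS (
    -- One row per run of consecutive days, per key
    SELECT meter_key, MAX(day_no) - MIN(day_no) AS span_days
    FROM ranked
    GROUP BY meter_key, grp
)
SELECT meter_key, MAX(span_days) AS max_num_cons_days
FROM islands
GROUP BY meter_key
ORDER BY meter_key
"""
result = dict(conn.execute(query).fetchall())
print(result)
```

The key change from the original subquery is PARTITION BY meter_key in the DENSE_RANK, which removes the need for a per-key WHERE filter inside the innermost subselect.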
