How to replace nulls with zeros in a SQL pivot query for a fact table in Databricks - apache-spark-sql

I see lots of solutions for how to do this where there is a column being queried including the following...
how to Replace null with zero in pivot SQL query
Oracle 11g SQL - Replacing NULLS with zero where query has PIVOT
Replacing null values in dynamic pivot sql query
etc., etc., etc.
But how do you replace the nulls in a pivot query when you are creating a fact table for the existence of a condition?
For example, in Databricks:
How do I replace the nulls for the following
Setup
drop table if exists patient_dx;
create table patient_dx (patient_id string, dx string);
insert into patient_dx values
('Bob', 'cough'),
('Donna', 'cough'),
('Jerry', 'cough'),
('Bob', 'feaver'),
('Donna', 'head ache')
;
Query:
select * from (
  select
    patient_id,
    dx,
    cast(1 as int) cnt
  from
    patient_dx
)
pivot (
  max(cnt)
  for dx in ('cough','feaver','head ache')
)
;
Result
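(Reconstructed from the setup and query above; nulls appear wherever a patient has no row for that dx.)
patient_id  cough  feaver  head ache
Bob         1      1       null
Donna       1      null    1
Jerry       1      null    null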
I've tried several permutations of:
cast(0 + cast(coalesce(sum(coalesce(cnt,0)),0) as int) as int) as cnt
To no avail

You have to use coalesce, or a CASE with an IS NOT NULL check, in the outer select query to substitute for the null values.
Try this:
spark.sql("""
select
patient_id,
CASE
when cough is NOT NULL THEN cough
else 0
END as cough,
CASE
when feaver is NOT NULL THEN feaver
else 0
END as feaver,
CASE
when `head ache` is NOT NULL THEN `head ache`
else 0
END as `head ache`
from (
select * from patient
)
PIVOT(
Count(dx)
for dx in ('cough','feaver','head ache')
)
;
""").show()
The output will be:
patient_id  cough  feaver  head ache
Donna       1      0       1
Jerry       1      0       0
Bob         1      1       0
If you want it to be dynamic:
dist = spark.sql("select collect_set(dx) from patient_dx").toPandas()
val = spark.sql("""
select
  patient_id,
  coalesce(cough, 0) as `cough`,
  coalesce(feaver, 0) as `feaver`,
  coalesce(`head ache`, 0) as `head ache`
from (
  select * from patient_dx
)
PIVOT (
  count(dx)
  for dx in """
  + str(tuple(map(tuple, *dist.values))[0]) +
  """
)
""")

Related

Case statement not supporting horizontal search with column name in query

I am new to ORACLE SQL and I am trying to learn it quickly.
I have following table definition:
Create table Sales_Biodata
(
Saler_Id INTEGER NOT NULL UNIQUE,
Jan_Sales INTEGER NOT NULL,
Feb_Sales INTEGER NOT NULL,
March_Sales INTEGER NOT NULL
);
Insert into Sales_Biodata (SALER_ID,JAN_SALES,Feb_Sales,March_Sales)
values ('101',22,525,255);
Insert into Sales_Biodata (SALER_ID,JAN_SALES,Feb_Sales,March_Sales)
values ('102',22,55,25);
Insert into Sales_Biodata (SALER_ID,JAN_SALES,Feb_Sales,March_Sales)
values ('103',45545,5125,2865);
My objective is the following:
1- Searching the highest sales and second highest sales against each saler_id.
For example in our above case:
For saler_id = 101 the highest sales is 525 and the second highest is 255;
similarly, for saler_id = 102 the highest sales is 55 and the second highest is 25.
For my above approach I am using the following query:
Select Saler_Id,
(
CASE
WHEN JAN_SALES>FEB_SALES AND JAN_SALES>MARCH_SALES THEN JAN_SALES
WHEN FEB_SALES>JAN_SALES AND FEB_SALES>MARCH_SALES THEN FEB_SALES
WHEN MARCH_SALES>JAN_SALES AND MARCH_SALES>FEB_SALES THEN MARCH_SALES
WHEN JAN_SALES=FEB_SALES AND JAN_SALES=MARCH_SALES THEN JAN_SALES
WHEN JAN_SALES=FEB_SALES AND JAN_SALES>MARCH_SALES THEN JAN_SALES
WHEN JAN_SALES=MARCH_SALES AND JAN_SALES>FEB_SALES THEN JAN_SALES
WHEN FEB_SALES=JAN_SALES AND FEB_SALES>MARCH_SALES THEN FEB_SALES
WHEN FEB_SALES=MARCH_SALES AND FEB_SALES>JAN_SALES THEN FEB_SALES
WHEN MARCH_SALES=JAN_SALES AND MARCH_SALES>FEB_SALES THEN MARCH_SALES
WHEN MARCH_SALES=FEB_SALES AND MARCH_SALES>JAN_SALES THEN MARCH_SALES
ELSE 'NEW_CASE_FOUND'
END
) FIRST_HIGHEST,
(
CASE
WHEN JAN_SALES>FEB_SALES AND FEB_SALES>MARCH_SALES THEN FEB_SALES
WHEN FEB_SALES>JAN_SALES AND JAN_SALES>MARCH_SALES THEN JAN_SALES
WHEN JAN_SALES>MARCH_SALES AND MARCH_SALES>FEB_SALES THEN MARCH_SALES
ELSE 'NEW_CASE_FOUND'
END
) SECOND_HIGHEST
from
Sales_Biodata;
but I am getting the following errors:
ORA-00932: inconsistent datatypes: expected NUMBER got CHAR
00932. 00000 - "inconsistent datatypes: expected %s got %s"
*Cause:
*Action:
Error at Line: 60 Column: 6
Please guide me on the following:
1- How to search the data horizontally for maximum and second maximum.
2- Please guide me on alternate approaches for searching data for a row horizontally.
Getting the maximum value is simply:
select greatest(jan_sales, feb_sales, march_sales)
If you want the second value:
select (case when jan_sales = greatest(jan_sales, feb_sales, march_sales)
             then greatest(feb_sales, march_sales)
             when feb_sales = greatest(jan_sales, feb_sales, march_sales)
             then greatest(jan_sales, march_sales)
             else greatest(jan_sales, feb_sales)
        end)
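Put together against the table from the question (a sketch; the column name march_sales follows the DDL above), this becomes:
select saler_id,
       greatest(jan_sales, feb_sales, march_sales) as first_highest,
       (case when jan_sales = greatest(jan_sales, feb_sales, march_sales)
             then greatest(feb_sales, march_sales)
             when feb_sales = greatest(jan_sales, feb_sales, march_sales)
             then greatest(jan_sales, march_sales)
             else greatest(jan_sales, feb_sales)
        end) as second_highest
from Sales_Biodata;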
However, this is the wrong approach to the whole problem. The main issue is that you have the wrong data structure. Store values in rows, not columns. So, you need to unpivot your data and re-aggregate, such as:
select saler_id,
max(case when seqnum = 1 then sales end) as sales_1,
max(case when seqnum = 2 then sales end) as sales_2,
max(case when seqnum = 3 then sales end) as sales_3
from (select s.*, dense_rank() over (partition by saler_id order by sales desc) as seqnum
from (select saler_id, jan_sales as sales from Sales_Biodata union all
      select saler_id, feb_sales from Sales_Biodata union all
      select saler_id, march_sales from Sales_Biodata
) s
) s
group by saler_id;
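For the sample data in the question, this unpivot-and-reaggregate query should return something like:
SALER_ID    SALES_1    SALES_2    SALES_3
---------- ---------- ---------- ----------
       101        525        255         22
       102         55         25         22
       103      45545       5125       2865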
Your data model is wrong.
The first thing I would do is to unpivot the data using this query:
select * from sales_biodata
unpivot (
val for mon in ( JAN_SALES,FEB_SALES,MARCH_SALES )
)
;
and after this, getting the top two values is relatively easy:
SELECT *
FROM (
SELECT t.*,
dense_rank() over (partition by saler_id order by val desc ) x
FROM (
select * from sales_biodata
unpivot (
val for mon in ( JAN_SALES,FEB_SALES,MARCH_SALES )
)
) t
)
WHERE x <= 2
the above query will give a result in this format:
SALER_ID MON VAL X
---------- ----------- ---------- ----------
101 FEB_SALES 525 1
101 MARCH_SALES 255 2
102 FEB_SALES 55 1
102 MARCH_SALES 25 2
103 JAN_SALES 45545 1
103 FEB_SALES 5125 2
If you have more than 3 months, you can easily extend this query by changing this part:
val for mon in ( JAN_SALES,FEB_SALES,MARCH_SALES, April_sales, MAY_SALES, JUNE_SALES, JULY_SALES, ...... NOVEMBER_SALES, DECEMBER_SALES )
If you want both values in one row, you need to pivot the data back:
WITH src_data AS(
SELECT saler_id, val, x
FROM (
SELECT t.*,
dense_rank() over (partition by saler_id order by val desc ) x
FROM (
select * from sales_biodata
unpivot (
val for mon in ( JAN_SALES,FEB_SALES,MARCH_SALES )
)
) t
)
WHERE x <= 2
)
SELECT *
FROM src_data
PIVOT(
max(val) FOR x IN ( 1 As "First value", 2 As "Second value" )
);
This gives a result in this form:
SALER_ID First value Second value
---------- ----------- ------------
101 525 255
102 55 25
103 45545 5125
EDIT - why MAX is used in the PIVOT query
The short answer is: because the syntax requires an aggregate function here.
See this link for the syntax: http://docs.oracle.com/cd/E11882_01/server.112/e41084/statements_10002.htm#CHDCEJJE
A broader answer:
The PIVOT clause is only syntactic sugar that simplifies a general "classic" pivot query using an aggregate function and a GROUP BY clause, like this:
SELECT id,
max( CASE WHEN some_column = 'X' THEN value END ) As x,
max( CASE WHEN some_column = 'Y' THEN value END ) As y,
max( CASE WHEN some_column = 'Z' THEN value END ) As z
FROM table11
GROUP BY id
You can find more on PIVOT queries on the net; there are a lot of excellent explanations of how pivot queries work.
The above pivot query, written in "standard" SQL, is equivalent to this Oracle query:
SELECT *
FROM table11
PIVOT (
max(value) FOR some_column IN ( 'X', 'Y', 'Z' )
)
These PIVOT queries transform records like this:
ID SOME_COLUMN VALUE
---------- ----------- ----------
1 X 10
1 X 15
1 Y 20
1 Z 30
into one record (for each id) like this:
ID 'X' 'Y' 'Z'
---------- ---------- ---------- ----------
1 15 20 30
Please note that the source table contains two values for id=1 and some_column='X': 10 and 15. PIVOT queries use an aggregate function to support that "general" case, where there can be many source records for one record in the output. In this example the MAX function is used to pick the greater value, 15.
However, PIVOT queries also support your specific case, where there is only one source record for each value in the result.
You are getting the error because the string 'NEW_CASE_FOUND' is returned in the ELSE branch while the rest of the CASE expression returns numbers; the data types of the WHEN and ELSE branches must match.
As for alternate approaches, you can unpivot to get the monthly sales data into rows and then use analytic functions to get the first or second highest.
As others have said, the problem is that the WHEN clauses in your CASE statement are returning INTEGER values, but the ELSE is returning a character string. I completely agree with the comments regarding normalization but if you really just want to make this query work you'll need to convert the results of each WHEN clause to character, as in:
Select Saler_Id,
(
CASE
WHEN JAN_SALES>FEB_SALES AND JAN_SALES>MARCH_SALES THEN TO_CHAR(JAN_SALES)
WHEN FEB_SALES>JAN_SALES AND FEB_SALES>MARCH_SALES THEN TO_CHAR(FEB_SALES)
WHEN MARCH_SALES>JAN_SALES AND MARCH_SALES>FEB_SALES THEN TO_CHAR(MARCH_SALES)
WHEN JAN_SALES=FEB_SALES AND JAN_SALES=MARCH_SALES THEN TO_CHAR(JAN_SALES)
WHEN JAN_SALES=FEB_SALES AND JAN_SALES>MARCH_SALES THEN TO_CHAR(JAN_SALES)
WHEN JAN_SALES=MARCH_SALES AND JAN_SALES>FEB_SALES THEN TO_CHAR(JAN_SALES)
WHEN FEB_SALES=JAN_SALES AND FEB_SALES>MARCH_SALES THEN TO_CHAR(FEB_SALES)
WHEN FEB_SALES=MARCH_SALES AND FEB_SALES>JAN_SALES THEN TO_CHAR(FEB_SALES)
WHEN MARCH_SALES=JAN_SALES AND MARCH_SALES>FEB_SALES THEN TO_CHAR(MARCH_SALES)
WHEN MARCH_SALES=FEB_SALES AND MARCH_SALES>JAN_SALES THEN TO_CHAR(MARCH_SALES)
ELSE 'NEW_CASE_FOUND'
END
) FIRST_HIGHEST,
(
CASE
WHEN JAN_SALES>FEB_SALES AND FEB_SALES>MARCH_SALES THEN TO_CHAR(FEB_SALES)
WHEN FEB_SALES>JAN_SALES AND JAN_SALES>MARCH_SALES THEN TO_CHAR(JAN_SALES)
WHEN JAN_SALES>MARCH_SALES AND MARCH_SALES>FEB_SALES THEN TO_CHAR(MARCH_SALES)
ELSE 'NEW_CASE_FOUND'
END
) SECOND_HIGHEST
from
Sales_Biodata;
Best of luck.

SQL: I want a row to be returned with NULL even if there is no match in my IN clause

I would like my SQL query to return a row even if there is no row matching in my IN clause.
For example, this query:
SELECT id, foo
FROM table
WHERE id IN (0, 1, 2, 3)
would return:
id|foo
0|bar
1|bar
2|bar
3|null
But instead I get this (because there is no row with id 3):
id|foo
0|bar
1|bar
2|bar
I have been able to find this trick:
SELECT tmpTable.id, table.bar
FROM (
SELECT 0 as id
UNION SELECT 1
UNION SELECT 2
UNION SELECT 3
) tmpTable
LEFT JOIN
(
SELECT table.foo, table.id
FROM table
WHERE table.id IN (0, 1, 2, 3)
) table
on table.id = tmpTable.id
Is there a better way?
Bonus: How to make it work with myBatis's list variable?
overslacked is right. Most SQL developers use an auxiliary table that stores integers (and one that stores dates). This is outlined in an entire chapter of Joe Celko's "SQL for Smarties".
Example:
CREATE TABLE numeri ( numero INTEGER PRIMARY KEY )
DECLARE @x INTEGER
SET @x = 0
WHILE @x < 1000
BEGIN
INSERT INTO numeri ( numero ) VALUES ( @x )
SET @x = @x + 1
END
SELECT
numero AS id,
foo
FROM
numeri
LEFT OUTER JOIN my_table
ON my_table.id = numero
WHERE
numero BETWEEN 0 AND 3
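For small, fixed lists you can sketch the same left-join idea with a VALUES derived table instead of a permanent numbers table (assuming a dialect that supports row constructors in FROM, e.g. SQL Server or PostgreSQL; your_table stands in for the table from the question):
SELECT ids.id, t.foo
FROM (VALUES (0), (1), (2), (3)) AS ids(id)
LEFT JOIN your_table t ON t.id = ids.id;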
The main goal of programming is minimal code and high performance; you don't need all of this, just remove id 3 from the IN clause.
What about just saying:
SELECT id, foo
FROM table
WHERE id >= 0 AND id <= 3

Select records where column has n character occurrences

I was wondering if this is possible in sqlite.
SELECT * FROM tbl WHERE substr_count(f, '*') = 5
It should return records that have 5 asterisks in the "f" column, like
a*b**c**
****a*
and so on
SELECT * FROM tbl WHERE length(f) - length(replace(f, '*', '')) = 5
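A quick sanity check of the length/replace trick against one of the sample values from the question:
SELECT 'a*b**c**' AS f,
       length('a*b**c**') - length(replace('a*b**c**', '*', '')) AS stars;  -- 8 - 3 = 5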
This solution is easy if you have a tally or numbers table, which simply contains a sequential list of integers. This would be a table you populate once, but it has many uses. With that you have:
Create Table Tally ( N int );
Insert Tally( N )
...
Select Z.<PrimaryKeyCol>, Sum( Z.Val )
From (
Select <PrimaryKeyCol>, 1 As Val
From tbl
Cross Join Tally As T
Where substr( tbl.f, T.N, 1 ) = '*'
) As Z
Group By Z.<PrimaryKeyCol>
Having Sum( Z.Val ) = 5
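As a concrete sketch in SQLite, using the implicit rowid in place of <PrimaryKeyCol> (assuming Tally holds 1 through at least the longest f; the HAVING COUNT(*) form is equivalent to the Sum(Val) form above):
SELECT t.rowid, t.f
FROM tbl t
JOIN Tally ta ON ta.N <= length(t.f)
WHERE substr(t.f, ta.N, 1) = '*'
GROUP BY t.rowid, t.f
HAVING COUNT(*) = 5;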

Finding Covariance using SQL

dt           indx_nm1   indx_val1   indx_nm2   indx_val2
2009-06-08   ABQI       1001.2      ACNACTR    300.05
2009-06-09   ABQI       1002.12     ACNACTR    341.19
2009-06-10   ABQI       1011.4      ACNACTR    382.93
2009-06-11   ABQI       1015.43     ACNACTR    362.63
I have a table that looks like the above (but with hundreds of rows, with dates from 2009 to 2013). Is there a way to calculate the covariance, i.e. (indx_val1 - avg(indx_val1)) * (indx_val2 - avg(indx_val2)) summed across the whole table and divided by the total number of rows, and return just a single value for cov(ABQI, ACNACTR)?
Since you have aggregates operating over two different groups, you will need two different queries. The main one groups by dt to get your row values per date. The other query has to perform AVG() and COUNT() aggregates across the whole rowset.
To use them both at the same time, you need to JOIN them together. But since there's no actual relation between the two queries, it is a cartesian product and we'll use a CROSS JOIN. Effectively, that joins every row of the main query with the single row retrieved by the aggregate query. You can then perform the arithmetic in the SELECT list, using values from both:
So, building on the query from your earlier question:
SELECT
indxs.*,
((indx_val2 - indx_val2_avg) * (indx_val1 - indx_val1_avg)) / total_rows AS cv
FROM (
SELECT
dt,
MAX(CASE WHEN indx_nm = 'ABQI' THEN indx_nm ELSE NULL END) AS indx_nm1,
MAX(CASE WHEN indx_nm = 'ABQI' THEN indx_val ELSE NULL END) AS indx_val1,
MAX(CASE WHEN indx_nm = 'ACNACTR' THEN indx_nm ELSE NULL END) AS indx_nm2,
MAX(CASE WHEN indx_nm = 'ACNACTR' THEN indx_val ELSE NULL END) AS indx_val2
FROM table1 a
GROUP BY dt
) indxs
CROSS JOIN (
/* Join against a query returning the AVG() and COUNT() across all rows */
SELECT
'ABQI' AS indx_nm1_aname,
AVG(CASE WHEN indx_nm = 'ABQI' THEN indx_val ELSE NULL END) AS indx_val1_avg,
'ACNACTR' AS indx_nm2_aname,
AVG(CASE WHEN indx_nm = 'ACNACTR' THEN indx_val ELSE NULL END) AS indx_val2_avg,
COUNT(*) AS total_rows
FROM table1 b
WHERE indx_nm IN ('ABQI','ACNACTR')
/* And it is a cartesian product */
) aggs
WHERE
indx_nm1 IS NOT NULL
AND indx_nm2 IS NOT NULL
ORDER BY dt
Here's a demo, building on your earlier one: http://sqlfiddle.com/#!6/2ec65/14
Here is a scalar-valued function to perform a covariance calculation on any two-column table formatted as XML.
To test: compile the function, then execute the alpha test in the comment block.
CREATE Function [dbo].[Covariance](@XmlTwoValueSeries xml)
returns float
as
Begin
/*
-- -----------
-- ALPHA TEST
-- -----------
IF object_id('tempdb..#_201610101706') is not null DROP TABLE #_201610101706
select *
into #_201610101706
from
(
select *
from
(
SELECT '2016-01' Period, 1.24 col0, 2.20 col1
union
SELECT '2016-02' Period, 1.6 col0, 3.20 col1
union
SELECT '2016-03' Period, 1.0 col0, 2.77 col1
union
SELECT '2016-04' Period, 1.9 col0, 2.98 col1
) A
) A
DECLARE @XmlTwoValueSeries xml
SET @XmlTwoValueSeries = (
SELECT col0,col1 FROM #_201610101706
FOR
XML PATH('Output')
)
SELECT dbo.Covariance(@XmlTwoValueSeries) Covariance
*/
declare @returnvalue numeric(20,10)
set @returnvalue =
(
SELECT SUM((x - xAvg) *(y - yAvg)) / MAX(n) AS [COVAR(x,y)]
from
(
SELECT 1E * x x,
AVG(1E * x) OVER (PARTITION BY (SELECT NULL)) xAvg,
1E * y y,
AVG(1E * y) OVER (PARTITION BY (SELECT NULL)) yAvg,
COUNT(*) OVER (PARTITION BY (SELECT NULL)) n
FROM
(
SELECT
e.c.value('(col0/text())[1]', 'float' ) x,
e.c.value('(col1/text())[1]', 'FLOAT' ) y
FROM @XmlTwoValueSeries.nodes('Output') e(c)
) A
) A
)
return @returnvalue
end
GO
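A shorter sketch of the same population-covariance calculation without the XML detour, assuming a derived table or view (here called pivoted_indx, a hypothetical name) that already holds one row per dt with indx_val1 and indx_val2 columns, and SQL Server 2005+ window functions:
SELECT SUM((indx_val1 - avg1) * (indx_val2 - avg2)) / MAX(n) AS cov_abqi_acnactr
FROM (
    SELECT indx_val1,
           indx_val2,
           AVG(1.0 * indx_val1) OVER () AS avg1,
           AVG(1.0 * indx_val2) OVER () AS avg2,
           COUNT(*)             OVER () AS n
    FROM pivoted_indx   -- hypothetical: e.g. the pivoted subquery from the first answer
) t;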

How can I use PIVOT to simultaneously show average and count in its cells?

Looking at the syntax, I get the strong impression that PIVOT doesn't support anything beyond a single aggregate function being calculated per cell.
From a statistical view, showing just some averages without giving the number of cases each average refers to is very unsatisfying (that is the polite version).
Is there some nice pattern to evaluate pivots based on AVG and pivots based on COUNT and mix them together to give a nice result?
Yes, you need to use the old-style cross tab for this. PIVOT is just syntactic sugar that resolves to pretty much the same approach.
SELECT AVG(CASE WHEN col='foo' THEN col END) AS AvgFoo,
COUNT(CASE WHEN col='foo' THEN col END) AS CountFoo,...
If you have many aggregates you could always use a CTE
WITH cte As
(
SELECT CASE WHEN col='foo' THEN col END AS Foo...
)
SELECT MAX(Foo),MIN(Foo), COUNT(Foo), STDEV(Foo)
FROM cte
"Simultaneously... in its cells" - so you mean within the same cell, therefore as a varchar?
You could calculate the avg and count values in an aggregate query before using the pivot, and concatenate them together as text.
The role of the PIVOT operator here would only be to transform rows to columns, and some aggregate function (e.g. MAX/MIN) would be used only because it is required by the syntax - your pre-calculated aggregate query would only have one value per pivoted column.
EDIT
Following bernd_k's oracle/mssql solution, I would like to point out another way to do this in SQL Server. It requires streamlining the multiple columns into a single column.
SELECT MODULE,
modus + '_' + case which when 1 then 'AVG' else 'COUNT' end AS modus,
case which when 1 then AVG(duration) else COUNT(duration) end AS value
FROM test_data, (select 1 as which union all select 2) x
GROUP BY MODULE, modus, which
SELECT *
FROM (
SELECT MODULE,
modus + '_' + case which when 1 then 'AVG' else 'COUNT' end AS modus,
case which when 1 then CAST(AVG(1.0*duration) AS NUMERIC(10,2)) else COUNT(duration) end AS value
FROM test_data, (select 1 as which union all select 2) x
GROUP BY MODULE, modus, which
) P
PIVOT (MAX(value) FOR modus in ([A_AVG], [A_COUNT], [B_AVG], [B_COUNT])
) AS pvt
ORDER BY pvt.MODULE
In the example above, AVG and COUNT are compatible (count - int => numeric). If they are not, convert both explicitly to a compatible type.
Note - The first query shows AVG for M2/A as 2, due to integer averaging. The 2nd (pivoted) query shows the actual average taking into account decimals.
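For instance, a quick sketch against the test_data rows used in this thread (M2/A has durations 1 and 4):
SELECT AVG(duration)       AS int_avg,   -- 2   (integer averaging)
       AVG(1.0 * duration) AS dec_avg    -- 2.5
FROM test_data
WHERE MODULE = 'M2' AND modus = 'A';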
Solution for Oracle 11g + :
create table test_data (
module varchar2(30),
modus varchar2(30),
duration Number(10)
);
insert into test_data values ('M1', 'A', 5);
insert into test_data values ('M1', 'A', 5);
insert into test_data values ('M1', 'B', 3);
insert into test_data values ('M2', 'A', 1);
insert into test_data values ('M2', 'A', 4);
select *
FROM (
select *
from test_data
)
PIVOT (
AVG(duration) avg , count(duration) count
FOR modus in ( 'A', 'B')
) pvt
ORDER BY pvt.module;
I do not like the column names containing apostrophes, but the result contains what I want:
MODULE 'A'_AVG 'A'_COUNT 'B'_AVG 'B'_COUNT
------------------------------ ---------- ---------- ---------- ----------
M1 5 2 3 1
M2 2.5 2 0
I really wonder what the Microsoft guys were thinking when they allowed only one aggregate function within PIVOT. I call evaluating avgs without accompanying counts statistical lies.
SQL-Server 2005 + (based on Cyberwiki):
CREATE TABLE test_data (
MODULE VARCHAR(30),
modus VARCHAR(30),
duration INTEGER
);
INSERT INTO test_data VALUES ('M1', 'A', 5);
INSERT INTO test_data VALUES ('M1', 'A', 5);
INSERT INTO test_data VALUES ('M1', 'B', 3);
INSERT INTO test_data VALUES ('M2', 'A', 1);
INSERT INTO test_data VALUES ('M2', 'A', 4);
SELECT MODULE, modus, ISNULL(LTRIM(STR(AVG(duration))), '') + '|' + ISNULL(LTRIM(STR(COUNT(duration))), '') RESULT
FROM test_data
GROUP BY MODULE, modus;
SELECT *
FROM (
SELECT MODULE, modus, ISNULL(LTRIM(STR(AVG(duration))), '') + '|' + ISNULL(LTRIM(STR(COUNT(duration))), '') RESULT
FROM test_data
GROUP BY MODULE, modus
) T
PIVOT (
MAX(RESULT)
FOR modus in ( [A], [B])
) AS pvt
ORDER BY pvt.MODULE
result:
MODULE A B
------------------------------ --------------------- ---------------------
M1 5|2 3|1
M2 2|2 NULL