Teradata SQL: OR conditions vs UNION ALL

My understanding was that most OR conditions can be replaced, where it benefits performance, with UNION ALL. But for these two queries the COUNT('1') is not the same. Why is that? Am I missing something here? Should the counts not be the same? Can someone explain the disparity?
SQL #1
sel
    D1.COL_1_CD, D1.COL_1_DESC,
    D2.COL_2_CD, D2.COL_2_DESC,
    D3.COL_3_CD, D3.COL_3_DESC,
    D4.COL_4_CD, D4.COL_4_DESC,
    D5.COL_5_CD, D5.COL_5_DESC,
    D1.COL_1_CD_SYS_ID,
    D2.COL_2_CD_SYS_ID,
    D3.COL_3_CD_SYS_ID,
    D4.COL_4_CD_SYS_ID,
    D5.COL_5_CD_SYS_ID
from D1, D2, D3, D4, D5
where D1.COL_1_CD1 = D2.COL_2_CD1
  and D2.COL_2_CD1 = D3.COL_3_CD1
  and D4.COL_4_CD1 = D5.COL_5_CD1
  and (D1.COL_1_CD in ('707')
       or D2.COL_2_CD in ('707')
       or D3.COL_3_CD in ('707')
       or D4.COL_4_CD in ('707')
       or D5.COL_5_CD in ('707'))
group by 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
SQL #2
sel
    D1.COL_1_CD, D1.COL_1_DESC,
    D2.COL_2_CD, D2.COL_2_DESC,
    D3.COL_3_CD, D3.COL_3_DESC,
    D4.COL_4_CD, D4.COL_4_DESC,
    D5.COL_5_CD, D5.COL_5_DESC,
    D1.COL_1_CD_SYS_ID,
    D2.COL_2_CD_SYS_ID,
    D3.COL_3_CD_SYS_ID,
    D4.COL_4_CD_SYS_ID,
    D5.COL_5_CD_SYS_ID
from D1, D2, D3, D4, D5
where D1.COL_1_CD1 = D2.COL_2_CD1
  and D2.COL_2_CD1 = D3.COL_3_CD1
  and D4.COL_4_CD1 = D5.COL_5_CD1
  and D1.COL_1_CD in ('707')
UNION ALL
sel
    D1.COL_1_CD, D1.COL_1_DESC,
    D2.COL_2_CD, D2.COL_2_DESC,
    D3.COL_3_CD, D3.COL_3_DESC,
    D4.COL_4_CD, D4.COL_4_DESC,
    D5.COL_5_CD, D5.COL_5_DESC,
    D1.COL_1_CD_SYS_ID,
    D2.COL_2_CD_SYS_ID,
    D3.COL_3_CD_SYS_ID,
    D4.COL_4_CD_SYS_ID,
    D5.COL_5_CD_SYS_ID
from D1, D2, D3, D4, D5
where D1.COL_1_CD1 = D2.COL_2_CD1
  and D2.COL_2_CD1 = D3.COL_3_CD1
  and D4.COL_4_CD1 = D5.COL_5_CD1
  and D2.COL_2_CD in ('707')
UNION ALL
..... <same query>
D3.COL_3_CD in ('707')
UNION ALL
..... <same query>
D4.COL_4_CD in ('707')
UNION ALL
..... <same query>
D5.COL_5_CD in ('707')
The row counts are not the same. What kind of OR logic can be converted into an equivalent UNION ALL?
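For what it's worth, the two queries are not logically equivalent. SQL #1's GROUP BY lists every selected column, so it acts as a DISTINCT and collapses duplicates once across the whole OR, while SQL #2 has no GROUP BY in its branches and also returns a row once per branch it satisfies, so any row where two or more of the CD columns equal '707' is counted several times. A hedged sketch of an equivalent rewrite keeps the GROUP BY in each branch and makes the branches mutually exclusive (<select list> and <join conditions> abbreviate the same text as above):
sel <select list>
from D1, D2, D3, D4, D5
where <join conditions>
  and D1.COL_1_CD in ('707')
group by 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
UNION ALL
sel <select list>
from D1, D2, D3, D4, D5
where <join conditions>
  and D2.COL_2_CD in ('707')
  -- exclude rows branch 1 already returned; add an IS NULL
  -- check here too if COL_1_CD is nullable
  and D1.COL_1_CD not in ('707')
group by 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
UNION ALL
..... <same pattern for D3, D4, D5, each excluding all earlier branches>
Alternatively, keep the overlapping branches exactly as in SQL #2 but combine them with UNION instead of UNION ALL: since the GROUP BY is effectively a DISTINCT, letting UNION deduplicate at the end gives the same result.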

Related

How to pivot Oracle SQL query result?

I have an ORACLE SQL query that is as follows :
SELECT B1 "Categories",
B2 "Category1",
B3 "Category2",
B4 "Category3",
B5 "Category4",
B6 "Total"
FROM (
SELECT C_NAME B1,
SUM(BUCKET_1) B2,
SUM(BUCKET_2) B3,
SUM(BUCKET_3) B4,
SUM(BUCKET_4) + SUM(BUCKET_5) + SUM(BUCKET_6) B5,
SUM(BUCKET) B6
FROM BUCKETTABLE
WHERE ID = bucketID
GROUP BY C_NAME
ORDER BY C_NAME
);
The result I get from the query and the result I want to obtain were shown as screenshots (not reproduced here).
How do I get the desired result?
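Without the screenshots the exact target layout is unknown, but if the goal is to transpose that result so each C_NAME becomes its own column, a hedged sketch combines UNPIVOT and PIVOT; the C_NAME values 'NAME1' and 'NAME2' are placeholders for the actual category names, and bucketID is kept from the question:
SELECT *
FROM (
    SELECT MEASURE, C_NAME, VAL
    FROM (
        SELECT C_NAME,
               SUM(BUCKET_1) B2,
               SUM(BUCKET_2) B3,
               SUM(BUCKET_3) B4,
               SUM(BUCKET_4) + SUM(BUCKET_5) + SUM(BUCKET_6) B5,
               SUM(BUCKET) B6
        FROM BUCKETTABLE
        WHERE ID = bucketID
        GROUP BY C_NAME
    )
    -- turn the measure columns into rows, labelled as in the outer query
    UNPIVOT (VAL FOR MEASURE IN (B2 AS 'Category1', B3 AS 'Category2',
                                 B4 AS 'Category3', B5 AS 'Category4',
                                 B6 AS 'Total'))
)
-- then turn each C_NAME value into a column
PIVOT (MAX(VAL) FOR C_NAME IN ('NAME1' AS "NAME1", 'NAME2' AS "NAME2"))
ORDER BY MEASURE;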

How to UNPIVOT all columns in a table and aggregate into Data Quality/Validation Metrics? SQL SNOWFLAKE

I have a table with 60+ columns in it that I would like to UNPIVOT so that each column becomes a row and then find the fill rate, min value and max value of each entry.
For example:
ID    START_DATE    END_DATE    EVENT_ID    PROVIDER_CODE
01    01/23/21      03/14/21    0023401     0012323
02    06/04/21      09/20/21    0025906     0023454
03    07/20/21      12/02/21    0027093     0034983
And I want the output to look like:
Column_Name    Fill_Rate    Min         Max
ID             0.7934       01          03
Start_Date     0.6990       01/23/21    07/20/21
End_Date       0.9089       03/14/21    12/02/21
Event_ID       1.0000       0023401     0027093
I'm struggling to get the desired output, especially because of the different data types in the different columns.
I tried the following, but UNPIVOT doesn't allow aggregate functions inside it:
select *
from "DSVC_MERCKPAN_PROD"."COHORTS_LATEST"."MEDICAL_HEADERS"
UNPIVOT (
max(code) as max_value,
min(code) as min_value,
avg(code) as fill_rate,
code as column_name
)
For the fill rate, I was trying to use this logic: ID is always populated, so it gives the total number of rows, while the other columns can be null:
COUNT_IF(start_date is not null) / COUNT_IF(ID is not null) as FILL_RATE,
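Written out by hand for a couple of columns, that idea works as a plain projection (a sketch; note that COUNT of a column already skips NULLs, so COUNT(*) can stand in for the COUNT_IF on ID):
select
    -- fraction of rows where each column is populated
    count(start_date) / count(*) as start_date_fill_rate,
    count(end_date)   / count(*) as end_date_fill_rate
from "DSVC_MERCKPAN_PROD"."COHORTS_LATEST"."MEDICAL_HEADERS";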
I have 2 ideas to implement the report.
The first way is casting all values to VARCHAR and then using UNPIVOT:
-- Generate dummy data
create or replace table t1 (c1 int, c2 int, c3 int, c4 int, c5 int, c6 int, c7 int, c8 int, c9 int, c10 int) as
select
iff(random()%2=0, random(), null), iff(random()%2=0, random(), null),
iff(random()%2=0, random(), null), iff(random()%2=0, random(), null),
iff(random()%2=0, random(), null), iff(random()%2=0, random(), null),
iff(random()%2=0, random(), null), iff(random()%2=0, random(), null),
iff(random()%2=0, random(), null), iff(random()%2=0, random(), null)
from table(generator(rowcount => 1000000000))
;
-- Query
with
cols as (
select column_name, ordinal_position
from information_schema.columns
where table_catalog = current_database()
and table_schema = current_schema()
and table_name = 'T1'
),
stringified as (
select
c1::varchar c1, c2::varchar c2, c3::varchar c3, c4::varchar c4, c5::varchar c5,
c6::varchar c6, c7::varchar c7, c8::varchar c8, c9::varchar c9, c10::varchar c10
from t1
),
data as (
select column_name, column_value
from stringified
unpivot(column_value for column_name in (c1, c2, c3, c4, c5, c6, c7, c8, c9, c10))
)
select
c.column_name,
count(d.column_value)/(select count(*) from t1) fill_rate,
min(d.column_value) min,
max(d.column_value) max
from cols c
left join data d using (column_name)
group by c.column_name, c.ordinal_position
order by c.ordinal_position
;
/*
COLUMN_NAME FILL_RATE MIN MAX
C1 0.500000 -1000000069270747870 999999972962694409
C2 0.499980 -1000000027928146782 999999946877079818
C3 0.499996 -1000000012155323098 999999942281548701
C4 0.500017 -1000000056353213091 999999946421698482
C5 0.500015 -1000000015608859996 999999993977648967
C6 0.500003 -1000000007081089270 999999998851014730
C7 0.499987 -100000008605944993 999999968272328033
C8 0.499992 -1000000042470913027 999999977402822725
C9 0.500011 -1000000058928465662 999999969060696774
C10 0.500029 -1000000011306371004 99999996061390938
*/
It's a straightforward way, but it still needs all the column names listed twice, which gets tough when the number of columns is massive (though I believe it's still much better than a huge UNION ALL query).
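For comparison, that huge UNION ALL alternative would need one branch per column, along the lines of this sketch (shown for two of the ten columns):
select 'C1' as column_name,
       count(c1) / count(*) as fill_rate,
       min(c1)::varchar as min,
       max(c1)::varchar as max
from t1
union all
select 'C2', count(c2) / count(*), min(c2)::varchar, max(c2)::varchar
from t1;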
Another solution is a bit tricky, but you can unpivot a table by using OBJECT_CONSTRUCT(*) aggregation if the row length doesn't exceed a VARIANT value limit (16 MiB):
with
cols as (
select column_name, ordinal_position
from information_schema.columns
where table_catalog = current_database()
and table_schema = current_schema()
and table_name = 'T1'
),
data as (
select f.key column_name, f.value::varchar column_value
from (select object_construct(*) rec from t1) up,
lateral flatten(up.rec) f
)
select
c.column_name,
count(d.column_value)/(select count(*) from t1) fill_rate,
min(d.column_value) min,
max(d.column_value) max
from cols c
left join data d using (column_name)
group by c.column_name, c.ordinal_position
order by c.ordinal_position
;
/*
COLUMN_NAME FILL_RATE MIN MAX
C1 0.500000 -1000000069270747870 999999972962694409
C2 0.499980 -1000000027928146782 999999946877079818
C3 0.499996 -1000000012155323098 999999942281548701
C4 0.500017 -1000000056353213091 999999946421698482
C5 0.500015 -1000000015608859996 999999993977648967
C6 0.500003 -1000000007081089270 999999998851014730
C7 0.499987 -100000008605944993 999999968272328033
C8 0.499992 -1000000042470913027 999999977402822725
C9 0.500011 -1000000058928465662 999999969060696774
C10 0.500029 -1000000011306371004 99999996061390938
*/
OBJECT_CONSTRUCT(*) aggregation is a special usage of the OBJECT_CONSTRUCT function that extracts column names as a key of each JSON object. As far as I know, this is the only way to extract column names from a table along with values in a programmatic way.
Since OBJECT_CONSTRUCT is a relatively heavy operation, it usually takes longer than the first solution, but you don't need to write out all the column names with this trick.
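To see what the trick produces, a minimal illustration (column names come back uppercased as the JSON keys, and NULL values are simply omitted from the object):
select object_construct(*) as rec
from (select 1 as c1, null as c2, 'x' as c3);
-- {"C1": 1, "C3": "x"}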

Select all rows (and columns) where one value in a column is the highest

I have a table in the following form:
index, ingestion_id, a, b, c, d
0, '2020-04-22-1600', 0a, 0b, 0c, 0d
1, '2020-04-22-1700', 0a, 0b, 0c, 0d
2, '2020-04-22-1600', 1a, 1b, 1c, 1d
3, '2020-04-22-1700', 1a, 1b, 1c, 1d
4, '2020-04-22-1800', 1a, 1b, 1c, 1d
...
I would like to extract all the rows and columns where the ingestion_id is the highest. It should therefore return the rows at index 1 and index 4, with all their columns.
I found some examples, but they require that we pre-define the columns that we want to select. I don't know the columns in advance, but I do know that the table will have a column named ingestion_id. Here is an example:
SELECT *
FROM (
SELECT MAX(ingestion_id) as ingestion_id, a, b, c, d
FROM table as t
GROUP BY a, b, c, d
ORDER BY a
)
How can I select all columns where the ingestion_id is the highest and group by all columns except for the ingestion_id?
BONUS
Imagine the table now having the form:
index, ingestion_id, a, b, c, d
0, '2020-04-22-1600', 0a, 0b, 0c, 0d
1, '2020-04-22-1700', 0a, 0b, 0c, 0d
2, '2020-04-22-1600', 1a, 1b, 1c, 1d
3, '2020-04-22-1700', 1a, 1b, 1c, 1d
4, '2020-04-26-1800', 2a, 2b, 2c, 2d
5, '2020-04-26-1900', 2a, 2b, 2c, 2d
...
The answer provided by Gordon Linoff (as of 2020/04/26) will in this case only return row 5, as it has the highest ingestion_id. However, we also need row 1 and row 3, since their values in the columns other than ingestion_id form distinct groups.
This answers the original version of the question.
I would like extract all the rows and columns where the ingestion_id is the highest.
If I understand correctly, you can use window functions:
select t.* except (seqnum)
from (select t.*, rank() over (order by ingestion_id desc) as seqnum
from `t` t
) t
where seqnum = 1;
You can select all corresponding rows as:
select t.* except (seqnum, grpid, min_grpid_seqnum)
from (select t.*,
             min(seqnum) over (partition by grpid) as min_grpid_seqnum
      from (select t.*,
                   rank() over (order by ingestion_id desc) as seqnum,
                   dense_rank() over (order by a, b, c, d) as grpid
            from `t` t
           ) t
     ) t
where min_grpid_seqnum = 1;
How can I select all columns where the ingestion_id is the highest and group by all columns except for the ingestion_id?
Each source has a different set of columns with different names
Below is for BigQuery Standard SQL and has no dependency on the naming for the rest of columns at all
#standardSQL
SELECT ARRAY_AGG(t ORDER BY ingestion_id DESC LIMIT 1)[OFFSET(0)].*
FROM `project.dataset.table` t
GROUP BY TO_JSON_STRING((SELECT AS STRUCT * EXCEPT(ingestion_id) FROM UNNEST([t])))
Applying this to the sample data from your question, as in the example below:
#standardSQL
WITH `project.dataset.table` AS (
SELECT '2020-04-22-1600' ingestion_id, '0a' a, '0b' b, '0c'c, '0d' d UNION ALL
SELECT '2020-04-22-1700', '0a', '0b', '0c', '0d' UNION ALL
SELECT '2020-04-22-1600', '1a', '1b', '1c', '1d' UNION ALL
SELECT '2020-04-22-1700', '1a', '1b', '1c', '1d' UNION ALL
SELECT '2020-04-22-1800', '1a', '1b', '1c', '1d'
)
SELECT ARRAY_AGG(t ORDER BY ingestion_id DESC LIMIT 1)[OFFSET(0)].*
FROM `project.dataset.table` t
GROUP BY TO_JSON_STRING((SELECT AS STRUCT * EXCEPT(ingestion_id) FROM UNNEST([t])))
output is
Row ingestion_id a b c d
1 2020-04-22-1700 0a 0b 0c 0d
2 2020-04-22-1800 1a 1b 1c 1d
Below is for BigQuery Standard SQL
#standardSQL
WITH `project.dataset.table` AS (
SELECT 0 index, '2020-04-22-1600' ingestion_id, '0a' a, '0b' b, '0c'c, '0d' d UNION ALL
SELECT 1, '2020-04-22-1700', '0a', '0b', '0c', '0d' UNION ALL
SELECT 2, '2020-04-22-1600', '1a', '1b', '1c', '1d' UNION ALL
SELECT 3, '2020-04-22-1700', '1a', '1b', '1c', '1d' UNION ALL
SELECT 4, '2020-04-26-1800', '2a', '2b', '2c', '2d' UNION ALL
SELECT 5, '2020-04-26-1900', '2a', '2b', '2c', '2d'
)
SELECT ARRAY_AGG(t ORDER BY ingestion_id DESC LIMIT 1)[OFFSET(0)].*
FROM `project.dataset.table` t
GROUP BY TO_JSON_STRING((SELECT AS STRUCT * EXCEPT(index, ingestion_id) FROM UNNEST([t])))
with output
Row index ingestion_id a b c d
1 1 2020-04-22-1700 0a 0b 0c 0d
2 3 2020-04-22-1700 1a 1b 1c 1d
3 5 2020-04-26-1900 2a 2b 2c 2d
You've asked for "all the rows with the highest ingestion_id".
According to your sample data, you only have one row with the highest value for ingestion_id.
So, to present the data with the highest value, you can use MAX() within a subquery and simply SELECT *, since you don't know all the columns that may exist. In its simplest form, this looks something like:
SELECT * FROM table
WHERE IngestionID = (SELECT MAX(IngestionID) FROM table);
Bonus Answer
DECLARE @columns NVARCHAR(MAX)
DECLARE @result NVARCHAR(MAX)
SELECT @columns = STUFF(
    (
        SELECT ',' + z.COLUMN_NAME
        FROM information_schema.columns z
        WHERE z.table_name = 'datatable'
          AND z.COLUMN_NAME NOT IN ('Index_ID','Ingestion_ID')
        FOR xml path('')
    )
    , 1
    , 1
    , '')
SET @result = 'SELECT MAX(Ingestion_ID) [Ingestion ID],' + (SELECT @columns) + ' FROM datatable GROUP BY ' + (SELECT @columns);
EXEC(@result)
Note: I've changed the table name to datatable to avoid SQL reserved keywords
(same for index -> Index_ID)
Outputs
Ingestion ID a b c d
2020-04-22-1700 0a 0b 0c 0d
2020-04-22-1700 1a 1b 1c 1d
2020-04-26-1900 2a 2b 2c 2d
I suggest not including the index, because it is always unique and would cause every row to be returned. Looking at your question and your original script, you aren't trying to include it, so I believe this script will do exactly what you need.
Tested against the following:
Column Name DataType
Index_ID int
Ingestion_ID varchar(15)
a varchar(2)
b varchar(2)
c varchar(2)
d varchar(2)
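For reference, with that schema the string assembled into @result expands to the following, which is what actually executes:
SELECT MAX(Ingestion_ID) [Ingestion ID],a,b,c,d FROM datatable GROUP BY a,b,c,d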
This can be done in standard SQL as follows.
I am assuming your data resides in a temp table.
WITH temp AS (
SELECT 0 index, '2020-04-22-1600' ingestion_id, '0a' a, '0b' b, '0c'c, '0d' d UNION ALL
SELECT 1, '2020-04-22-1700', '0a', '0b', '0c', '0d' UNION ALL
SELECT 2, '2020-04-22-1600', '1a', '1b', '1c', '1d' UNION ALL
SELECT 3, '2020-04-22-1700', '1a', '1b', '1c', '1d' UNION ALL
SELECT 4, '2020-04-26-1800', '2a', '2b', '2c', '2d' UNION ALL
SELECT 5, '2020-04-26-1900', '2a', '2b', '2c', '2d'
)
select index, ingestion_id, a, b, c, d
from (
    select index, ingestion_id, a, b, c, d,
           row_number() over (partition by a, b, c, d order by ingestion_id desc) as top
    from temp
) t
where top = 1
It will produce the following output:
index ingestion_id a b c d
1 2020-04-22-1700 0a 0b 0c 0d
3 2020-04-22-1700 1a 1b 1c 1d
5 2020-04-26-1900 2a 2b 2c 2d

oracle select multiple records to 1 on conditions

Table : CODES_TABLE
Serial    Code    DateTime
A123      B2      01/01/17:14:00
A124      B2      01/01/17:14:00
A123      B3      01/01/17:14:05
A123      B4      01/01/17:14:08
A124      B3      01/01/17:14:00
A128      B2      03/01/17:14:00
A129      B2      03/01/17:14:00
A129      B4      02/01/17:14:00
What I'm trying to get is a list of all Serials which have generated codes B2, B3 and B4, and have generated them in a given order, i.e. B2 first, then B3, then B4. So in this example, only Serial A123.
Assuming, from your input data, that every code may only occur once per serial, this could be a way:
/* test case */
with testTable(Serial,Code, DateTime) as (
select 'A123', 'B2', to_date('01/01/17:14:00', 'dd/mm/yy:hh24:mi') from dual union all
select 'A124', 'B2', to_date('01/01/17:14:00', 'dd/mm/yy:hh24:mi') from dual union all
select 'A123', 'B3', to_date('01/01/17:14:05', 'dd/mm/yy:hh24:mi') from dual union all
select 'A123', 'B4', to_date('01/01/17:14:08', 'dd/mm/yy:hh24:mi') from dual union all
select 'A124', 'B3', to_date('01/01/17:14:00', 'dd/mm/yy:hh24:mi') from dual union all
select 'A128', 'B2', to_date('03/01/17:14:00', 'dd/mm/yy:hh24:mi') from dual union all
select 'A129', 'B2', to_date('03/01/17:14:00', 'dd/mm/yy:hh24:mi') from dual union all
select 'A129', 'B4', to_date('02/01/17:14:00', 'dd/mm/yy:hh24:mi') from dual
)
/* the query */
select serial
from testTable
group by serial
having listagg( case when code in ('B2', 'B3', 'B4') then code end) within group ( order by dateTime) like '%B2B3B4%'
The idea here is to aggregate by serial, building for each serial a string that contains the codes, ordered by dateTime.
Assuming that every code can only appear once for a serial the only serials that match your condition will have strings containing 'B2B3B4'.
The CASE is used to handle situations where you need to check whether a serial has, say, B2, B3 and B5, while other codes such as B4 may also occur: only the codes of interest end up in the aggregated string.
This should better explain how it works:
select serial, listagg( case when code in ('B2', 'B3', 'B4') then code end) within group ( order by dateTime) as string
from testTable
group by serial;
SERI STRING
---- ---------------
A123 B2B3B4
A124 B2B3
A128 B2
A129 B4B2
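If the same code can occur more than once per serial, a hedged alternative is to compare the first occurrence of each code instead; a serial missing any of the three codes drops out automatically, because a comparison with NULL is never true:
select serial
from testTable
group by serial
-- first B2 must come before first B3, which must come before first B4
having min(case when code = 'B2' then DateTime end)
     < min(case when code = 'B3' then DateTime end)
   and min(case when code = 'B3' then DateTime end)
     < min(case when code = 'B4' then DateTime end);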

Prolog - Formatting crossword

I've written a predicate called solve_crossword that looks like this:
solve_crossword(X,C):-
C= [A1,A2,A3,A4,A5,
B1,' ',B3, ' ', B5,
C1, C2,C3,C4,C5,
D1,' ',D3,' ', D5,
E1,E2,E3,E4,E5],
member([A1, A2, A3, A4, A5], X),
member([C1, C2, C3, C4, C5], X),
member([E1, E2, E3, E4, E5], X),
member([A1, B1, C1, D1, E1], X),
member([A3, B3, C3, D3, E3], X),
member([A5, B5, C5, D5, E5], X).
Now, I want to write a predicate called write_crossword that formats the crossword. If I have a list of words I want it to look like this:
| ?- words(X), solve_crossword(X, C), write_crossword(C).
DITCH
O U O
DITTO
G O E
EARLY
C = [[68,73,84,67,72],[79,32,85,32,79],...
X = [[68,73,83,84,82],[68,73,84,67,72],...
With
words([
"DISTR",
"DITCH",
"DITTO",
"DITTY",
"DODGE",
"EARED",
"EARLY",
"EARTH",
"EASEL",
"HONOR",
"HOOEY",
"HORDE",
"TUQUE",
"TURPS",
"TUTOR",
"TWAIN"
]).
Rows 1, 3, 5 and columns 1, 3, 5 are supposed to be words.
You can try something like this (note I corrected your code for solve_crossword):
solve_crossword(X,C):-
C= [[A1,A2,A3,A4,A5],
[B1,Space,B3, Space,B5],
[C1,C2,C3,C4,C5],
[D1,Space,D3,Space,D5],
[E1,E2,E3,E4,E5]],
atom_codes(' ', [Space]),
member([A1, A2, A3, A4, A5], X),
member([C1, C2, C3, C4, C5], X),
member([E1, E2, E3, E4, E5], X),
member([A1, B1, C1, D1, E1], X),
member([A3, B3, C3, D3, E3], X),
member([A5, B5, C5, D5, E5], X).
write_crossword([]).
write_crossword([Line|Lines]):-
atom_codes(SLine, Line),
write(SLine),
nl,
write_crossword(Lines).
atom_codes/2 converts between an atom and a list of character codes.
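For example, at the top level (a quick sketch of the conversion in both directions):
?- atom_codes(hello, Cs).
Cs = [104, 101, 108, 108, 111].

?- atom_codes(A, [104, 105]).
A = hi.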