How can I do a distinct sum? - sql

I am trying to create a "score" statistic which is derived from the value of a certain column, calculated as the sum of a case expression. Unfortunately, the query structure needs to be a full outer join (this is simplified from the actual query, and the join structure survives from the original code), and thus the sum is incorrect, since each row may occur many times. I could group by the unique key; however, that breaks other aggregate functions that are in the same query.
What I really want to do is sum (case when ... distinct claim_id) which of course does not exist; is there an approach that will do what I need? Or does this have to be two queries?
This is on Redshift, in case it matters.
create table t1 (id int, proc_date date, claim_id int, proc_code char(1));
create table t2 (id int, diag_date date, claim_id int);
insert into t1 (id, proc_date, claim_id, proc_code)
values (1, '2012-01-01', 0, 'a'),
(2, '2009-02-01', 1, 'b'),
(2, '2019-02-01', 2, 'c'),
(2, '2029-02-01', 3, 'd'),
(3, '2016-04-02', 4, 'e'),
(4, '2005-01-03', 5, 'f'),
(5, '2008-02-03', 6, 'g');
insert into t2 (id, diag_date, claim_id)
values (4, '2004-01-01', 20),
(5, '2010-02-01', 21),
(6, '2007-04-02', 22),
(5, '2011-02-01', 23),
(6, '2008-04-02', 24),
(5, '2012-02-01', 25),
(6, '2009-04-02', 26),
(7, '2002-01-03', 27),
(8, '2001-02-03', 28);
select id,
       sum(case when proc_code = 'a' then 5
                when proc_code = 'b' then 10
                when proc_code = 'c' then 15
                when proc_code = 'd' then 20
                when proc_code = 'e' then 25
                when proc_code = 'f' then 30
                when proc_code = 'g' then 35 end) as score,
       count(distinct t1.claim_id) as proc_count,
       min(proc_date) as min_proc_date
from t1
full outer join t2 using (id)
group by id
order by id;

You can separate your conditional aggregates into a CTE or subquery and use OVER (PARTITION BY id) to get an id-level aggregate without grouping, something like this:
with cte as (
    select *,
           sum(case when proc_code = 'a' then 5
                    when proc_code = 'b' then 10
                    when proc_code = 'c' then 15
                    when proc_code = 'd' then 20
                    when proc_code = 'e' then 25
                    when proc_code = 'f' then 30
                    when proc_code = 'g' then 35 end)
               over (partition by id) as Some_Sum,
           min(proc_date) over (partition by id) as min_proc_date
    from t1
)
select id,
       Some_Sum,
       count(distinct cte.claim_id) as proc_count,
       min_proc_date
from cte
full outer join t2 using (id)
group by id, Some_Sum, min_proc_date
order by id;
Note that you'll have to add these window results (Some_Sum, min_proc_date) to the GROUP BY in the outer query. The fields in your PARTITION BY should match the t1 fields you previously used in the GROUP BY (in this case just id); if your full query had other t1 fields in the GROUP BY, be sure to add them to the PARTITION BY as well.

You can use a subquery (grouped by id and claim_id) and then regroup:
with base as (
    select id,
           avg(case when proc_code = 'a' then 5
                    when proc_code = 'b' then 10
                    when proc_code = 'c' then 15
                    when proc_code = 'd' then 20
                    when proc_code = 'e' then 25
                    when proc_code = 'f' then 30
                    when proc_code = 'g' then 35 end) as value_proc,
           t1.claim_id,
           min(proc_date) as min_proc_date
    from t1
    full outer join t2 using (id)
    group by id, t1.claim_id
)
select id,
       sum(value_proc) as score,
       count(distinct claim_id) as proc_count,
       min(min_proc_date) as min_proc_date
from base
group by id
order by id;
Note that I suggest avg for the inner subquery; if you are sure the same claim_id always has the same letter, you can use max or min instead, and the per-claim value stays an integer. If you are not sure, avg is the safer choice.
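If every claim_id really does map to a single proc_code, the max variant looks like this (a sketch against the same t1/t2 tables; only the inner aggregate changes, the outer query is untouched):

```sql
with base as (
    select id,
           max(case when proc_code = 'a' then 5
                    when proc_code = 'b' then 10
                    when proc_code = 'c' then 15
                    when proc_code = 'd' then 20
                    when proc_code = 'e' then 25
                    when proc_code = 'f' then 30
                    when proc_code = 'g' then 35 end) as value_proc,
           t1.claim_id,
           min(proc_date) as min_proc_date
    from t1
    full outer join t2 using (id)
    group by id, t1.claim_id
)
select id,
       sum(value_proc) as score,
       count(distinct claim_id) as proc_count,
       min(min_proc_date) as min_proc_date
from base
group by id
order by id;
```

If a claim_id could carry two different letters, max silently picks the larger score while avg splits the difference, so the choice is a data-quality decision, not just a typing one.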

Related

How to filter a table based on queried ids from another table in Snowflake

I'm trying to filter a table based on the queried result from another table.
create temporary table test_table (id number, col_a varchar);
insert into test_table values
(1, 'a'),
(2, 'b'),
(3, 'aa'),
(4, 'a'),
(6, 'bb'),
(7, 'a'),
(8, 'c');
create temporary table test_table_2 (id number, col varchar);
insert into test_table_2 values
(1, 'aa'),
(2, 'bb'),
(3, 'cc'),
(4, 'dd'),
(6, 'ee'),
(7, 'ff'),
(8, 'gg');
Here I want to find all the ids in test_table with value "a" in col_a, and then filter test_table_2 for rows with one of those ids. I've tried the way below, but got an error: SQL compilation error: syntax error line 6 at position 39 unexpected 'cte'.
with cte as
(
select id from test_table
where col_a = 'a'
)
select * from test_table_2 where id in cte;
This approach below does work, but with large tables it tends to be very slow. Is there a better, more efficient way that scales to very large tables?
with cte as
(
select id from test_table
where col_a = 'a'
)
select t2.* from test_table_2 t2 join cte on t2.id=cte.id;
I would express this using exists logic:
SELECT id
FROM test_table_2 t2
WHERE EXISTS (
SELECT 1
FROM test_table t1
WHERE t2.id = t1.id AND
t1.col_a = 'a'
);
This has one advantage over a join in that Snowflake can stop probing test_table for a given test_table_2 row as soon as it finds a match, and duplicate matches cannot multiply rows in the result.
Your first error can be fixed as below: IN expects a subquery (or a value list), not a bare CTE name. Joins are usually better suited for lookups than EXISTS or IN clauses if you have a large table.
with cte as
(
select id from test_table
where col_a = 'a'
)
select * from test_table_2 where id in (select distinct id from cte);
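If you do go the join route from the question, one small tweak worth trying (a sketch, using the same temporary tables as above): deduplicate the lookup ids before joining, so that duplicate "a" rows in test_table cannot multiply rows of test_table_2:

```sql
select t2.*
from test_table_2 t2
join (
    select distinct id
    from test_table
    where col_a = 'a'
) ids on t2.id = ids.id;
```

This keeps the join a strict one-row-per-match lookup regardless of how many "a" rows share an id.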

How to filter DISTINCT records and ordering them using LISTAGG function

SELECT
s_id
,CASE WHEN LISTAGG(X.item_id, ',') WITHIN GROUP (ORDER BY TRY_TO_NUMBER(Z.item_pg_nbr))= '' THEN NULL
ELSE LISTAGG (X.item_id, ',') WITHIN GROUP (ORDER BY TRY_TO_NUMBER(Z.item_pg_nbr))
END AS item_id_txt
FROM table_1 X
JOIN table_2 Z
ON Z.cmn_id = X.cmn_id
WHERE s_id IN('38301','40228')
GROUP BY s_id;
When I run the above query, I'm getting the same values repeated for ITEM_ID_TXT column. I want to display only the DISTINCT values.
S_ID ITEM_ID_TXT
38301 618444,618444,618444,618444,618444,618444,36184
40228 616162,616162,616162,616162,616162,616162,616162
I also want the concatenated values to be ordered by item_pg_nbr
I can use DISTINCT in the LISTAGG function but that won't give the result ordered by item_pg_nbr.
Need your inputs on this.
Since you cannot use different columns for the DISTINCT and the ORDER BY WITHIN GROUP, one approach would be:
1. Deduplicate while grabbing the minimum item_pg_nbr.
2. LISTAGG and order by the minimum item_pg_nbr.
create or replace table T1(S_ID int, ITEM_ID int, ITEM_PG_NBR int);
insert into T1 (S_ID, ITEM_ID, ITEM_PG_NBR) values
(1, 1, 3),
(1, 2, 9), -- Adding a non-distinct ITEM_ID within group
(1, 2, 2),
(1, 3, 1),
(2, 1, 1),
(2, 2, 2),
(2, 3, 3);
with X as
(
select S_ID, ITEM_ID, min(ITEM_PG_NBR) MIN_PG_NBR
from T1 group by S_ID, ITEM_ID
)
select S_ID, listagg(ITEM_ID, ',') within group (order by MIN_PG_NBR)
from X group by S_ID
;
I guess the question then becomes what happens when you have duplicates within group? It would seem logical that the minimum item_pg_nbr should be used for the order by, but you could just as easily use the max or some other value.
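For instance, the max-based variant is the same query with only the aggregate in the CTE swapped (same T1 sample table as above):

```sql
with X as
(
    select S_ID, ITEM_ID, max(ITEM_PG_NBR) MAX_PG_NBR
    from T1 group by S_ID, ITEM_ID
)
select S_ID, listagg(ITEM_ID, ',') within group (order by MAX_PG_NBR) as ITEM_ID_TXT
from X group by S_ID;
```

Which one is right depends on whether the first or the last occurrence of a duplicated item should drive its position in the list.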

highest consecutive values in a column SQL

I have below table
create table test (Id int, Name char);
insert into test values
(1, 'A'),
(2, 'A'),
(3, 'B'),
(4, 'B'),
(5, 'B'),
(6, 'B'),
(7, 'C'),
(8, 'B'),
(9, 'B');
I want to print the Name that appears consecutively at least four times.
Expected Output:
Name
B
I have tried different ways, similar to the SQL below (which resulted in two values, B and C), but nothing worked.
My SQL attempt:
select Name from
(select t.*, row_number() over (order by Id asc) as grpcnt,
row_number() over (partition by Name order by Id) as grpcnt1 from test t) x
where (grpcnt-grpcnt1)>=3
group by Name,(grpcnt-grpcnt1);
Try removing the WHERE clause and applying your filter in a HAVING clause based on the counts. Moreover, since you are interested in at least four occurrences, your filter should be >= 4. For example, using your modified query:
select
Name
from (
select
*,
row_number() over (order by Id asc) as grpcnt,
row_number() over (partition by Name order by Id) as grpcnt1
from test
) t
group by Name,(grpcnt-grpcnt1)
HAVING COUNT(Name)>=4;
If your id is a gapless counter, you can do this:
select distinct t.Name
from test t
where exists (
    select count(*)
    from test
    where Name = t.Name
      and id <= t.id and id > (t.id - 4)
    having count(*) = 4
);

Selecting minimal dates, or nulls in SQL

This is grossly oversimplified, but:
I have a table, something like the following:
CREATE TABLE Table1
([ID] int, [USER] varchar(5), [DATE] date)
;
INSERT INTO Table1
([ID], [USER], [DATE])
VALUES
(1, 'A', '2018-10-01'),
(2, 'A', '2018-09-01'),
(3, 'A', NULL),
(4, 'B', '2018-05-03'),
(5, 'B', '2017-04-01'),
(6, 'C', NULL)
;
And for each user, I wish to retrieve the whole row of details where the DATE variable is minimal.
SELECT T.USER FROM TABLE1 T
WHERE T.DATE = (SELECT MIN(DATE) FROM TABLE1 T1 WHERE T1.USER = T.USER)
Works great; however, when there is no row with a populated DATE field for a user, there will be a row with a NULL (like the final row of my table above), which I also wish to select.
So my ideal output in this case is:
(2, 'A', '2018-09-01'),
(5, 'B', '2017-04-01'),
(6, 'C', NULL)
SQL fiddle: http://www.sqlfiddle.com/#!9/df42b5/6
I think something could be done using an EXCEPT statement, but it gets complex very quickly.
You may try with row_number():
select *
from (
    select *,
           row_number() over (
               partition by [user]
               order by case when [date] is null then 1 else 0 end,
                        [date]
           ) as rn
    from Table1
) x
where rn = 1;
Use UNION and a correlated subquery with the min() function:
CREATE TABLE Table1 (ID int, usr varchar(50), DATE1 date)
;
INSERT INTO Table1 VALUES
(1, 'A', '2018-10-01'),
(2, 'A', '2018-09-01'),
(3, 'A', NULL),
(4, 'B', '2018-05-03'),
(5, 'B', '2017-04-01'),
(6, 'C', NULL)
;
select * from Table1 t
where DATE1 = (select min(date1) from Table1 t1 where t1.usr = t.usr)
  and date1 is not null
union
select * from Table1 t
where date1 is null
  and t.usr not in (select usr from Table1 where date1 is not null)
Output:
ID usr DATE1
2 A 01/09/2018 00:00:00
5 B 01/04/2017 00:00:00
6 C
You can use GROUP BY and JOIN to output the desired results.
select t.Id
, x.[User]
, x.[MinDate] as [Date]
from
(select [User]
, min([Date]) as MinDate
from table1
group by [User]) x
inner join table1 t on t.[User] = x.[User] and (t.[Date] = x.[MinDate] or x.[MinDate] is null)
You can use a Common Table Expression:
;WITH chronology AS (
SELECT
*,
ROW_NUMBER() OVER (
PARTITION BY [USER]
ORDER BY ISNULL([DATE], '2900-01-01') ASC
) Idx
FROM TABLE1
)
SELECT ID, [USER], [DATE]
FROM chronology
WHERE Idx=1;
Using a CTE in this solution simplifies the query improving its readability, maintainability and extensibility. Furthermore, I expect this approach to be optimal in terms of performance.

SQL: get all pairs and triples from single column and count their frequences over another column

A simple table of user_id, item_id (both text data) on input.
The question is: what is the way to extract all pair and triple combinations from the item_id column and count their frequencies over user_id (i.e. "1% of all users have the (1, 2) item_id pair")?
I've tried some barbarism:
select FirstID, SecondID, count(user_id)
from
(
SELECT
t1.item_id as FirstID,
t2.item_id as SecondID
FROM
(
SELECT item_id, ROW_NUMBER()OVER(ORDER BY item_id) as Inc
FROM t1
) t1
LEFT JOIN
(
SELECT item_id, ROW_NUMBER()OVER(ORDER BY item_id)-1 as Inc
FROM t1
) t2 ON t2.Inc = t1.Inc
) t3 join upg_log on t3.FirstID = upg_log.item_id and t3.SecondID = upg_log.item_id
group by FirstID, SecondID
but got nothing
This particular task belongs to the type that is easier to write than to execute:
declare #t table (
UserId int not null,
ItemId int not null
);
insert into #t
values
(1, 1),
(1, 2),
(1, 3),
(2, 1),
(2, 2),
(3, 2),
(3, 3),
(4, 1),
(4, 4),
(5, 4);
-- Pairs
select t1.ItemId as [Item1], t2.ItemId as [Item2], count(*) as [UserCount]
from #t t1
inner join #t t2 on t1.UserId = t2.UserId and t1.ItemId < t2.ItemId
group by t1.ItemId, t2.ItemId
order by UserCount desc, t1.ItemId, t2.ItemId;
As you can see, there is a semi-Cartesian (triangular) self-join here, which means that performance will drop quickly as the number of records grows. And, of course, proper indices will be crucial for this kind of query.
In theory, you can easily extend this approach to identify triples, but it might prove unfeasible on your actual data. Ideally, such things should be calculated with a per-row approach, and the results cached.
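In case it helps, the triples version is the same pattern with one more self-join (a sketch against the same #t sample table; the ItemId inequalities keep each combination counted once):

```sql
-- Triples: each (Item1, Item2, Item3) with Item1 < Item2 < Item3,
-- counted once per user who has all three items
select t1.ItemId as [Item1], t2.ItemId as [Item2], t3.ItemId as [Item3],
       count(*) as [UserCount]
from #t t1
inner join #t t2 on t2.UserId = t1.UserId and t2.ItemId > t1.ItemId
inner join #t t3 on t3.UserId = t2.UserId and t3.ItemId > t2.ItemId
group by t1.ItemId, t2.ItemId, t3.ItemId
order by UserCount desc, t1.ItemId, t2.ItemId, t3.ItemId;
```

To turn the counts into the percentage the question mentions, divide by the total number of distinct users, e.g. `count(*) * 100.0 / (select count(distinct UserId) from #t)`. Note that, as with the pairs query, duplicate (UserId, ItemId) rows would inflate the counts, so deduplicate first if they can occur.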