Case statement and aggregation within a GROUP BY clause in PROC SQL - sql

I am having some trouble aggregating and using CASE within a group.
The objective is to check Indicator for each transaction key. If any row has Indicator = '1', select the max(Change_Date); if all are zero, select the min(Change_Date). Along with that, the Initial_key associated with that Change_Date has to be populated as Final_key.
The output looks like this

You can get the last two columns using aggregation. If I understand correctly:
select trxn_key,
coalesce(max(case when indicator = 1 then change_date end),
min(change_date)
) as final_date,
coalesce(max(case when indicator = 1 then initial_key end),
min(initial_key)
) as final_key
from t
group by trxn_key;
Then join this in:
proc sql;
select t.*, tt.final_date, tt.final_key
from t join
(select trxn_key,
coalesce(max(case when indicator = 1 then change_date end),
min(change_date)
) as final_date,
coalesce(max(case when indicator = 1 then initial_key end),
min(initial_key)
) as final_key
from t
group by trxn_key
) tt
on tt.trxn_key = t.trxn_key;
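Not part of the original answer: a quick way to sanity-check the COALESCE/conditional-aggregation pattern is to run it in SQLite via Python's sqlite3. The table and column names come from the question's sample data; storing dates as ISO strings (an adaptation, so MIN/MAX compare correctly) is an assumption.

```python
import sqlite3

# Sample data from the question: trxn_key 1 has all-zero indicators,
# keys 2-4 have at least one indicator = 1.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (trxn_key INT, initial_key INT, change_date TEXT, indicator INT);
INSERT INTO t VALUES
 (1, 1, '2020-02-17', 0), (1, 2, '2020-02-21', 0), (1, 3, '2020-02-25', 0),
 (2, 1, '2020-02-17', 1), (2, 2, '2020-02-21', 0), (2, 3, '2020-02-25', 0),
 (3, 1, '2020-02-17', 1), (3, 2, '2020-02-21', 1), (3, 3, '2020-02-25', 0),
 (4, 1, '2020-02-17', 1), (4, 2, '2020-02-21', 1), (4, 3, '2020-02-25', 1);
""")
# The MAX(CASE ...) picks values only from indicator = 1 rows; COALESCE
# falls back to the plain MIN when no such row exists in the group.
rows = conn.execute("""
SELECT trxn_key,
       COALESCE(MAX(CASE WHEN indicator = 1 THEN change_date END),
                MIN(change_date)) AS final_date,
       COALESCE(MAX(CASE WHEN indicator = 1 THEN initial_key END),
                MIN(initial_key)) AS final_key
FROM t
GROUP BY trxn_key
ORDER BY trxn_key
""").fetchall()
print(rows)
# -> [(1, '2020-02-17', 1), (2, '2020-02-17', 1),
#     (3, '2020-02-21', 2), (4, '2020-02-25', 3)]
```

Note this relies on the max initial_key coinciding with the max change_date among indicator = 1 rows, which happens to hold in this data.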

Could you try the query below? Observing the test data you provided:
First, we find the max(indicator) within each group of name and trxn_key.
Second, based on that value, we decide whether to take min(change_date) and min(initial_key), or max(change_date) and max(initial_key).
Because you don't want an aggregated result, we use analytic (window) functions, which do not reduce the number of output rows.
SELECT t1.name
,t1.initial_key
,t1.change_date
,t1.indicator
,t1.trxn_key
,t1.trxn_date
,CASE
WHEN max_ind = 1
THEN
MAX(CASE WHEN indicator = 1 THEN change_date END) OVER (PARTITION BY NAME,trxn_key)
WHEN max_ind = 0
THEN
MIN(CASE WHEN indicator = 0 THEN change_date END) OVER (PARTITION BY NAME,trxn_key)
END final_date
,CASE
WHEN max_ind = 1
THEN
MAX(CASE WHEN indicator = 1 THEN initial_key END) OVER (PARTITION BY NAME,trxn_key)
WHEN max_ind = 0
THEN
MIN(CASE WHEN indicator = 0 THEN initial_key END) OVER (PARTITION BY NAME,trxn_key)
END final_key
FROM
(
SELECT NAME
,initial_key
,change_date
,indicator
,trxn_key
,trxn_date
,MAX(indicator) OVER (PARTITION BY NAME,trxn_key) max_ind
FROM table1
) t1
ORDER BY trxn_key,trxn_date,initial_key,change_date;
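Not part of the original answer: the analytic version can also be checked in SQLite (3.25+ supports window functions). Table and column names follow the answer; ISO date strings and a DISTINCT to collapse the per-row output down to one row per group are adaptations for the check.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (name TEXT, initial_key INT, change_date TEXT,
                     indicator INT, trxn_key INT);
INSERT INTO table1 VALUES
 ('ABC',1,'2020-02-17',0,1),('ABC',2,'2020-02-21',0,1),('ABC',3,'2020-02-25',0,1),
 ('ABC',1,'2020-02-17',1,2),('ABC',2,'2020-02-21',0,2),('ABC',3,'2020-02-25',0,2),
 ('ABC',1,'2020-02-17',1,3),('ABC',2,'2020-02-21',1,3),('ABC',3,'2020-02-25',0,3),
 ('ABC',1,'2020-02-17',1,4),('ABC',2,'2020-02-21',1,4),('ABC',3,'2020-02-25',1,4);
""")
# max_ind tells us, per (name, trxn_key) group, whether any indicator = 1
# exists; the outer CASE then picks the MAX over indicator = 1 rows or the
# MIN over indicator = 0 rows without collapsing the row count.
rows = conn.execute("""
SELECT DISTINCT trxn_key,
       CASE WHEN max_ind = 1
            THEN MAX(CASE WHEN indicator = 1 THEN change_date END)
                   OVER (PARTITION BY name, trxn_key)
            ELSE MIN(CASE WHEN indicator = 0 THEN change_date END)
                   OVER (PARTITION BY name, trxn_key)
       END AS final_date,
       CASE WHEN max_ind = 1
            THEN MAX(CASE WHEN indicator = 1 THEN initial_key END)
                   OVER (PARTITION BY name, trxn_key)
            ELSE MIN(CASE WHEN indicator = 0 THEN initial_key END)
                   OVER (PARTITION BY name, trxn_key)
       END AS final_key
FROM (SELECT t.*, MAX(indicator) OVER (PARTITION BY name, trxn_key) AS max_ind
      FROM table1 t) t1
ORDER BY trxn_key
""").fetchall()
print(rows)
```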

You can process groups with DOW loops (DO UNTIL loops with SET and BY statements inside).
A DATA step with serial DOW loops (two in one step) can have the first loop process the group, measuring it in almost any way desired, and the second loop output records with the values computed in the first loop.
Example:
data have;
input name $ initial_key change_date indicator trxn_key trxn_date;
attrib change_date trxn_date informat=date9. format=date9.;
datalines;
ABC 1 17feb20 0 1 16feb20
ABC 2 21feb20 0 1 16feb20
ABC 3 25feb20 0 1 16feb20
ABC 1 17feb20 1 2 20feb20
ABC 2 21feb20 0 2 20feb20
ABC 3 25feb20 0 2 20feb20
ABC 1 17feb20 1 3 22feb20
ABC 2 21feb20 1 3 22feb20
ABC 3 25feb20 0 3 22feb20
ABC 1 17feb20 1 4 26feb20
ABC 2 21feb20 1 4 26feb20
ABC 3 25feb20 1 4 26feb20
;
data want;
* first dow loop, compute min and max_ associated values;
do until (last.trxn_key);
set have;
by name trxn_key;
if missing(min_date) or change_date < min_date then do;
min_date = change_date;
min_key = initial_key;
end;
if missing(max_date) or change_date > max_date then
if indicator then do;
max_date = change_date;
max_key = initial_key;
max_flag = 1;
end;
end;
* compute final values per business rules;
if max_flag then do;
final_date = max_date;
final_key = max_key;
end;
else do;
final_date = min_date;
final_key = min_key;
end;
* second dow loop, output with final values;
do until (last.trxn_key);
set have;
by name trxn_key;
OUTPUT;
end;
format final_date min_date max_date date9.;
drop min_: max_:;
run;
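The DATA step above only runs under SAS, but the business rule its first loop implements can be sketched in plain Python for illustration (this is an equivalent of the logic, not the author's code; row tuples mirror the `have` dataset):

```python
from itertools import groupby

# (trxn_key, initial_key, change_date, indicator) -- two groups from `have`,
# dates as ISO strings so max/min compare correctly.
rows = [
    (1, 1, '2020-02-17', 0), (1, 2, '2020-02-21', 0), (1, 3, '2020-02-25', 0),
    (2, 1, '2020-02-17', 1), (2, 2, '2020-02-21', 0), (2, 3, '2020-02-25', 0),
]

final = {}
for key, grp in groupby(rows, key=lambda r: r[0]):   # rows already BY-sorted
    grp = list(grp)
    ones = [r for r in grp if r[3] == 1]
    if ones:   # any indicator = 1: take the max change_date among those rows
        best = max(ones, key=lambda r: r[2])
    else:      # all zeros: take the min change_date over the whole group
        best = min(grp, key=lambda r: r[2])
    final[key] = (best[2], best[1])                  # (final_date, final_key)

print(final)  # {1: ('2020-02-17', 1), 2: ('2020-02-17', 1)}
```

The second DOW loop in the SAS step then re-reads the same group and outputs each row with these computed values attached.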

Related

Select the greatest occurrence from a column, based on date if frequencies are the same

I have the following dataset with let's say ID = {1,[...],5} and Col1 = {a,b,c,Null} :
ID  Col1  Date
1   a     01/10/2022
1   a     02/10/2022
1   a     03/10/2022
2   b     01/10/2022
2   c     02/10/2022
2   c     03/10/2022
3   a     01/10/2022
3   b     02/10/2022
3   Null  03/10/2022
4   c     01/10/2022
5   b     01/10/2022
5   Null  02/10/2022
5   Null  03/10/2022
I would like to group my rows by ID, compute new columns showing the number of occurrences, and compute a new column containing a string that depends on the most frequent value of Col1: most a = Hi, most b = Hello, most c = Welcome, most Null = Unknown. If multiple values other than Null have the same frequency, the most recent one based on Date wins.
Here is the dataset I need :
ID  nb_a  nb_b  nb_c  nb_Null  greatest
1   3     0     0     0        Hi
2   0     1     2     0        Welcome
3   1     1     0     1        Hello
4   0     0     1     0        Welcome
5   0     1     0     2        Unknown
I have to do this in a compute recipe in Dataiku. The GROUP BY is handled by the group-by section of the recipe, while the rest of the query needs to go in the "custom aggregations" section. I'm having trouble with the "if frequencies are equal, take the most recent" part.
My SQL code looks like this :
CASE WHEN SUM(CASE WHEN Col1 = 'a' THEN 1 ELSE 0 END) >
          SUM(CASE WHEN Col1 = 'b' THEN 1 ELSE 0 END)
      AND SUM(CASE WHEN Col1 = 'a' THEN 1 ELSE 0 END) >
          SUM(CASE WHEN Col1 = 'c' THEN 1 ELSE 0 END)
     THEN 'Hi'
     WHEN SUM(CASE WHEN Col1 = 'b' THEN 1 ELSE 0 END) >
          SUM(CASE WHEN Col1 = 'a' THEN 1 ELSE 0 END)
      AND SUM(CASE WHEN Col1 = 'b' THEN 1 ELSE 0 END) >
          SUM(CASE WHEN Col1 = 'c' THEN 1 ELSE 0 END)
     THEN 'Hello'
     WHEN SUM(CASE WHEN Col1 = 'c' THEN 1 ELSE 0 END) >
          SUM(CASE WHEN Col1 = 'a' THEN 1 ELSE 0 END)
      AND SUM(CASE WHEN Col1 = 'c' THEN 1 ELSE 0 END) >
          SUM(CASE WHEN Col1 = 'b' THEN 1 ELSE 0 END)
     THEN 'Welcome'
Etc, etc, repeat for other cases.
But surely there must be a better way to do this right? And I have no idea how to include the most recent one when frequencies are the same.
Thank you for your help and sorry if my message isn't clear.
I tried to reproduce this in Azure Synapse using a SQL script. Below is the approach.
The sample table is created as follows.
Create table tab1 (id int, col1 varchar(50), date_column date)
Insert into tab1 values(1,'a','2021-10-01')
Insert into tab1 values(1,'a','2021-10-02')
Insert into tab1 values(1,'a','2021-10-03')
Insert into tab1 values(2,'b','2021-10-01')
Insert into tab1 values(2,'c','2021-10-02')
Insert into tab1 values(2,'c','2021-10-03')
Insert into tab1 values(3,'a','2021-10-01')
Insert into tab1 values(3,'b','2021-10-02')
Insert into tab1 values(3,'Null','2021-10-03')
Insert into tab1 values(4,'c','2021-10-01')
Insert into tab1 values(5,'b','2021-10-01')
Insert into tab1 values(5,'Null','2021-10-02')
Insert into tab1 values(5,'Null','2021-10-03')
Step:1
A query is written to find the count of values within each (id, col1) group and the maximum date value within each combination of id and col1.
select
distinct id,col1,
count(*) over (partition by id,col1) as count,
case when col1='Null' then null else max(date_column) over (partition by id,col1) end as max_date
from tab1
Step:2
A row number is calculated within each id group, in decreasing order of the count and max_date columns. This handles the case where two or more values have the same frequency: the value with the latest date wins.
select *, row_number() over (partition by id order by count desc, max_date desc) as row_num from
(select
distinct id,col1,
count(*) over (partition by id,col1) as count,
case when col1='Null' then null else max(date_column) over (partition by id,col1) end as max_date
from tab1)q1
Step:3
Rows with row_num=1 are filtered, and the greatest column is assigned with the logic:
most a = Hi, most b = Hello, most c = Welcome, most Null = Unknown.
Full Query
select id,
[greatest]=case when col1='a' then 'Hi'
when col1='b' then 'Hello'
when col1='c' then 'Welcome'
else 'Unknown'
end
from
(select *, row_number() over (partition by id order by count desc, max_date desc) as row_num from
(select
distinct id,col1,
count(*) over (partition by id,col1) as count,
case when col1='Null' then null else max(date_column) over (partition by id,col1) end as max_date
from tab1)q1
)q2 where row_num=1
Output
With this approach, even when the frequencies are the same, the required value is chosen based on the most recent date.
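Not part of the original answer: the full query can be checked in SQLite via Python's sqlite3, with two small adaptations (the T-SQL `[greatest]=` syntax becomes `AS greatest`, and the `count` alias is renamed `cnt` to avoid the keyword):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tab1 (id INT, col1 TEXT, date_column TEXT);
INSERT INTO tab1 VALUES
 (1,'a','2021-10-01'),(1,'a','2021-10-02'),(1,'a','2021-10-03'),
 (2,'b','2021-10-01'),(2,'c','2021-10-02'),(2,'c','2021-10-03'),
 (3,'a','2021-10-01'),(3,'b','2021-10-02'),(3,'Null','2021-10-03'),
 (4,'c','2021-10-01'),
 (5,'b','2021-10-01'),(5,'Null','2021-10-02'),(5,'Null','2021-10-03');
""")
# Per (id, col1): frequency and latest date (NULL for the 'Null' value so it
# loses date ties); then rank within each id and keep the top row.
rows = conn.execute("""
SELECT id,
       CASE WHEN col1='a' THEN 'Hi'
            WHEN col1='b' THEN 'Hello'
            WHEN col1='c' THEN 'Welcome'
            ELSE 'Unknown' END AS greatest
FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY id
                                   ORDER BY cnt DESC, max_date DESC) AS row_num
      FROM (SELECT DISTINCT id, col1,
                   COUNT(*) OVER (PARTITION BY id, col1) AS cnt,
                   CASE WHEN col1='Null' THEN NULL
                        ELSE MAX(date_column) OVER (PARTITION BY id, col1)
                   END AS max_date
            FROM tab1) q1) q2
WHERE row_num = 1
ORDER BY id
""").fetchall()
print(rows)
# -> [(1, 'Hi'), (2, 'Welcome'), (3, 'Hello'), (4, 'Welcome'), (5, 'Unknown')]
```

Note the id = 3 tie (a, b, Null each occur once) resolves to 'Hello' because b has the latest non-null date, matching the desired output.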

Flag=1/0 based on multiple criteria on same column

I have a temp table that is being created, we will say that column 1 is YearMonth, column2 as user_id, Column 3 is Type.
YearMonth User_id Type
200101 1 x
200101 2 y
200101 2 z
200102 1 x
200103 2 x
200103 2 p
200103 2 q
I want to count user_ids based on a flag derived from Type. I am trying to set the flag to 1 or 0, but it always results in 0.
For example, when the types for a user within a YearMonth contain x, y, or z AND also contain p or q, then flag=1.
I am trying something like
SELECT count (distinct t1.user_id) as count,
t1.YearMonth,
case when t1.type in ('x','y','z')
and
t1.type in ('p','q') then 1 else 0 end as flag
FROM table t1
group by 2,3;
I would like to know why it doesn't give output as below:
count YearMonth Flag
0 200001 1
2 200001 0
1 200002 1
1 200002 0
What am I missing here? Thanks
If I follow you correctly, you can use two levels of aggregation:
select yearmonth, flag, count(*) cnt
from (
select yearmonth, id,
case when max(case when type in ('x', 'y', 'z') then 1 else 0 end) = 1
and max(case when type in ('p', 'q') then 1 else 0 end) = 1
then 1
else 0
end as flag
from mytable
group by yearmonth, id
) t
group by yearmonth, flag
This first flags users for each month, using conditional aggregation, then aggregates by flag and month.
If you also want to display 0 for flags that do not appear in a given month, you can generate the combinations with a cross join first, then bring in the above result set with a left join:
select y.yearmonth, f.flag, count(t.id) cnt
from (select distinct yearmonth from mytable) y
cross join (values (0), (1)) f(flag)
left join (
select yearmonth, id,
case when max(case when type in ('x', 'y', 'z') then 1 else 0 end) = 1
and max(case when type in ('p', 'q') then 1 else 0 end) = 1
then 1
else 0
end as flag
from mytable
group by yearmonth, id
) t on t.yearmonth = y.yearmonth and t.flag = f.flag
group by y.yearmonth, f.flag
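Not part of the original answer: the two-level aggregation can be verified in SQLite via Python's sqlite3, using the sample rows from the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mytable (yearmonth INT, id INT, type TEXT);
INSERT INTO mytable VALUES
 (200101,1,'x'),(200101,2,'y'),(200101,2,'z'),
 (200102,1,'x'),
 (200103,2,'x'),(200103,2,'p'),(200103,2,'q');
""")
# Inner query: one row per (yearmonth, id) with flag = 1 only when the user
# has a type from BOTH sets that month; outer query: count users per flag.
rows = conn.execute("""
SELECT yearmonth, flag, COUNT(*) AS cnt
FROM (SELECT yearmonth, id,
             CASE WHEN MAX(CASE WHEN type IN ('x','y','z') THEN 1 ELSE 0 END) = 1
                   AND MAX(CASE WHEN type IN ('p','q') THEN 1 ELSE 0 END) = 1
                  THEN 1 ELSE 0 END AS flag
      FROM mytable
      GROUP BY yearmonth, id) t
GROUP BY yearmonth, flag
ORDER BY yearmonth, flag
""").fetchall()
print(rows)
# -> [(200101, 0, 2), (200102, 0, 1), (200103, 1, 1)]
```

Only user 2 in 200103 has types from both sets, so only that (month, user) gets flag = 1.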
I had a very similar idea to GMB's; however, like him, I don't get the expected results. Likely we are both assuming the expected results are wrong:
SELECT COUNT(DISTINCT UserID) AS [Count],
YearMonth,
CASE WHEN COUNT(CASE WHEN [Type] IN ('x','y','z') THEN 1 END) > 0
AND COUNT(CASE WHEN [Type] IN ('p','q') THEN 1 END) > 0 THEN 1 ELSE 0
END AS Flag
FROM (VALUES(200101,1,'x'),
(200101,2,'y'),
(200101,2,'z'),
(200102,1,'x'),
(200103,2,'x'),
(200103,2,'p'),
(200103,2,'q')) V(YearMonth,UserID,[Type])
GROUP BY YearMonth;

Splitting GROUP BY results into different columns

I have a column containing date ranges and the number of days passed, associated with a specific ID (one-to-many). Based on the number of records associated with each ID, I want those results split into columns instead of individual rows, so from this:
id_hr dd beg end
----------------------------------------
1 10 05/01/2019 15/01/2019
1 5 03/02/2019 08/02/2019
2 8 07/03/2019 15/03/2019
Could become this:
id_hr dd beg end dd beg end
--------------------------------- ---------------------
1 10 05/01/2019 15/01/2019 5 03/02/2019 08/02/2019
2 8 07/03/2019 15/03/2019
I did the same in a worksheet (pivot table), but the table became extremely slow, so I'm looking for a friendlier approach in SQL. I wrote a CTE that numbers the associated rows, then selects each one and displays them in new columns.
;WITH CTE AS(
SELECT PER_PRO, ID_HR, NOM_INC, rut_dv, dias_dur, INI, FIN,
ROW_NUMBER()OVER(PARTITION BY ID_HR ORDER BY SUBIDO) AS RN
FROM dbo.inf_vac WHERE PER_PRO = 201902
)
SELECT ID_HR, NOM_INC, rut_dv,
(case when rn = 1 then DIAS_DUR end) as DIAS_DUR1,
(case when rn = 1 then INI end) as INI1,
(case when rn = 1 then FIN end) as FIN1,
(case when rn = 2 then DIAS_DUR end) as DIAS_DUR2,
(case when rn = 2 then INI end) as INI2,
(case when rn = 2 then FIN end) as FIN2,
(case when rn = 3 then DIAS_DUR end) as DIAS_DUR3,
(case when rn = 3 then INI end) as INI3,
(case when rn = 3 then FIN end) as FIN3
FROM CTE
Which gets each column where it should be, but not grouped. Using GROUP BY raises an error on the CTE select.
rn id_hr dd beg end dd beg end
----------------------------------- ------------------------
1 1 10 05/01/2019 15/01/2019 NULL NULL NULL
2 1 NULL NULL NULL 5 03/02/2019 08/02/2019
1 2 8 07/03/2019 15/03/2019 NULL NULL NULL
Is there any way to group them on the second select?
You have additional columns in the result set that are not in the query. However, this should work:
SELECT ID_HR,
max(case when rn = 1 then DIAS_DUR end) as DIAS_DUR1,
max(case when rn = 1 then INI end) as INI1,
max(case when rn = 1 then FIN end) as FIN1,
max(case when rn = 2 then DIAS_DUR end) as DIAS_DUR2,
max(case when rn = 2 then INI end) as INI2,
max(case when rn = 2 then FIN end) as FIN2,
max(case when rn = 3 then DIAS_DUR end) as DIAS_DUR3,
max(case when rn = 3 then INI end) as INI3,
max(case when rn = 3 then FIN end) as FIN3
FROM CTE
GROUP BY ID_HR;
Yes, you can GROUP BY all the non-CASE columns, and apply MAX to each of the CASE-expression columns.
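Not part of the original answer: the MAX(CASE WHEN rn = ...) pivot can be checked in SQLite via Python's sqlite3. The sample uses only the pivoted columns (an assumption, since the question's extra columns like NOM_INC aren't shown in its data), with ISO dates so the row numbering by start date is deterministic:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE inf_vac (id_hr INT, dias_dur INT, ini TEXT, fin TEXT);
INSERT INTO inf_vac VALUES
 (1, 10, '2019-01-05', '2019-01-15'),
 (1,  5, '2019-02-03', '2019-02-08'),
 (2,  8, '2019-03-07', '2019-03-15');
""")
# Number each ID's ranges, then collapse to one row per ID: MAX ignores the
# NULLs that the CASE produces on non-matching rows.
rows = conn.execute("""
WITH cte AS (
  SELECT id_hr, dias_dur, ini, fin,
         ROW_NUMBER() OVER (PARTITION BY id_hr ORDER BY ini) AS rn
  FROM inf_vac
)
SELECT id_hr,
       MAX(CASE WHEN rn = 1 THEN dias_dur END) AS dias_dur1,
       MAX(CASE WHEN rn = 1 THEN ini END)      AS ini1,
       MAX(CASE WHEN rn = 1 THEN fin END)      AS fin1,
       MAX(CASE WHEN rn = 2 THEN dias_dur END) AS dias_dur2,
       MAX(CASE WHEN rn = 2 THEN ini END)      AS ini2,
       MAX(CASE WHEN rn = 2 THEN fin END)      AS fin2
FROM cte
GROUP BY id_hr
ORDER BY id_hr
""").fetchall()
print(rows)
```

ID 2 has only one range, so its rn = 2 columns come back NULL, matching the blank cells in the desired layout.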

calculating percent change over time

I've created the structure and sample data here. I'm not sure how to go about calculating the change over time.
My desired result set is:
a | % growth
abc | 4.16
def | 0.83
hig | -0.2
The % change being (last value - first value) / days:
a | % growth
abc | (30-5) / 6
def | (6-1) / 6
hig | (4-5) / 5
I'm trying:
SELECT a.*,
b.val,
c.val
FROM (SELECT a,
Min(dt) AS lowerDt,
Max(dt) AS upperDt
FROM tt
GROUP BY a) a
LEFT JOIN tt b
ON b.dt = a.lowerdt
AND b.a = a.a
LEFT JOIN tt c
ON c.dt = a.upperdt
AND b.a = a.a
If possible, I'd like to avoid a CTE.
You don't want min and max; you really want first and last.
One way to do that is to use ROW_NUMBER() to get each row's position from the beginning or from the end, then use MAX(CASE WHEN pos = 1 THEN x ELSE NULL END) to pick out the values I want.
SELECT
a,
MAX(CASE WHEN pos_from_first = 1 THEN dt ELSE NULL END) AS first_date,
MAX(CASE WHEN pos_from_final = 1 THEN dt ELSE NULL END) AS final_date,
MAX(CASE WHEN pos_from_first = 1 THEN val ELSE NULL END) AS first_value,
MAX(CASE WHEN pos_from_final = 1 THEN val ELSE NULL END) AS final_value,
100
*
CAST(MAX(CASE WHEN pos_from_final = 1 THEN val ELSE NULL END) AS DECIMAL(9,6))
/
CAST(MAX(CASE WHEN pos_from_first = 1 THEN val ELSE NULL END) AS DECIMAL(9,6))
-
100 AS perc_change
FROM
(
SELECT
ROW_NUMBER() OVER (PARTITION BY a ORDER BY dt ASC) AS pos_from_first,
ROW_NUMBER() OVER (PARTITION BY a ORDER BY dt DESC) AS pos_from_final,
*
FROM
tt
)
AS ordered
GROUP BY
a
http://sqlfiddle.com/#!6/ad95d/11
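Not part of the original answer: since the fiddle data isn't reproduced here, the sketch below runs the same first/last pattern in SQLite via Python's sqlite3 on made-up rows (the `tt(a, dt, val)` shape is taken from the question; the specific values are illustrative only). Note this answer computes percent change as 100 * last / first - 100, not the question's (last - first) / days:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tt (a TEXT, dt TEXT, val REAL);
INSERT INTO tt VALUES
 ('abc','2019-01-01', 5), ('abc','2019-01-04', 12), ('abc','2019-01-07', 30),
 ('def','2019-01-01', 1), ('def','2019-01-07', 6);
""")
# Two row numbers per group, one from each end; MAX(CASE ...) then plucks
# the first and last val without any self-joins.
rows = conn.execute("""
SELECT a,
       MAX(CASE WHEN pos_from_first = 1 THEN val END) AS first_value,
       MAX(CASE WHEN pos_from_final = 1 THEN val END) AS final_value,
       100.0 * MAX(CASE WHEN pos_from_final = 1 THEN val END)
             / MAX(CASE WHEN pos_from_first = 1 THEN val END) - 100 AS perc_change
FROM (SELECT *,
             ROW_NUMBER() OVER (PARTITION BY a ORDER BY dt ASC)  AS pos_from_first,
             ROW_NUMBER() OVER (PARTITION BY a ORDER BY dt DESC) AS pos_from_final
      FROM tt) ordered
GROUP BY a
ORDER BY a
""").fetchall()
print(rows)
```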

SQL help on group by query with inline query

I have these two tables. I want to count, per con_id, the number of rows that have remark '1' continuously for the latest period(s).
Example: 2 for A1, 1 for A3, but 0 for A2 and B1, as they don't have '1' continuously for the latest result(s) in the following table.
t_conmast
con_id [pk]
off_code
con_id off_code
A1 1
A2 1
B1 2
A3 1
t_readbak
con_id [fk]
counter
remark
timestamp [not shown in the table; auto inserted by system]
con_id counter remark timestamp
A1 1 0
A1 3 1
A1 6 1
B1 1 1
B1 2 0
A2 1 0
A2 2 1
A2 3 0
A3 1 1
What I tried, and where I failed (I added the off_code just to restrict the result to a single office):
select con_id,
count(con_id)
from t_readbak
where remark=1 and timestamp > (select max(timestamp)
from t_readbak
where remark=0
group by con_id)
and con_id in (select con_id from t_conmast where off_code=1)
Expected output
con_id count(con_id)
A1 2
A2 0
A3 1
B1 0
This is the approach I took. First, calculate a cumulative sum of remark going backwards for each con_id. Then, the first time you hit a row where remark = 0, use the cumulative value on that row; you can find that row using row_number().
The complication is when a con_id has no remarks with a value of 0. In that case, you just take the total number.
The following query combines this logic into SQL:
select rb.con_id,
(case when NumZeros = 0 then numRemarks else cumsum end) as count1
from (select rb.*,
SUM(remark) over (partition by con_id order by counter desc) as cumsum,
ROW_NUMBER() over (partition by con_id, remark order by counter desc) as remark_counter,
SUM(case when remark = 0 then 1 else 0 end) over (partition by con_id) as NumZeros,
SUM(remark) over (partition by con_id) as numRemarks
from t_readbak rb
) rb
where (remark_counter = 1 and remark = 0) or
(NumZeros = 0 and remark_counter = 1)
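Not part of the original answer: the query above can be checked in SQLite via Python's sqlite3, using the question's rows with counter standing in for the timestamp ordering. (As written, `NumZeros` and `numRemarks` need `OVER (PARTITION BY con_id)` clauses to be valid window sums; the sketch includes them.)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t_readbak (con_id TEXT, counter INT, remark INT);
INSERT INTO t_readbak VALUES
 ('A1',1,0),('A1',3,1),('A1',6,1),
 ('B1',1,1),('B1',2,0),
 ('A2',1,0),('A2',2,1),('A2',3,0),
 ('A3',1,1);
""")
# cumsum at the latest remark = 0 row equals the number of trailing 1s;
# when a con_id has no zeros at all, fall back to the total remark count.
rows = conn.execute("""
SELECT con_id,
       CASE WHEN NumZeros = 0 THEN numRemarks ELSE cumsum END AS count1
FROM (SELECT rb.*,
             SUM(remark) OVER (PARTITION BY con_id ORDER BY counter DESC) AS cumsum,
             ROW_NUMBER() OVER (PARTITION BY con_id, remark
                                ORDER BY counter DESC) AS remark_counter,
             SUM(CASE WHEN remark = 0 THEN 1 ELSE 0 END)
                 OVER (PARTITION BY con_id) AS NumZeros,
             SUM(remark) OVER (PARTITION BY con_id) AS numRemarks
      FROM t_readbak rb) rb
WHERE (remark_counter = 1 AND remark = 0)
   OR (NumZeros = 0 AND remark_counter = 1)
ORDER BY con_id
""").fetchall()
print(rows)
# -> [('A1', 2), ('A2', 0), ('A3', 1), ('B1', 0)]
```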
A left self join might work. Something like this:
select con_id, count(*) records
from t_readbak t1 left join t_readbak t2 using (con_id, remark)
where remark = 1
and t1.counter < t2.counter
group by con_id
If you mean that you only want to include con_id counts if every remark in the period is 1, you can do something like this:
SELECT
con_id,
COUNT(CASE WHEN remark = 1 THEN 1 END) AS Remark1Count,
COUNT(CASE WHEN remark <> 1 THEN 1 END) AS RemarkNot1Count
FROM t_conmast
INNER JOIN t_readbak ON t_conmast.con_id = t_readbak.con_id
WHERE your-timestamp-condition
GROUP BY con_id
HAVING COUNT(CASE WHEN remark <> 1 THEN 1 END) = 0
The HAVING will filter out any con_id that has a remark <> 1.
Get the maximum timestamp for each con_id where remark is 0.
Thereafter, again for each con_id, count the items with younger timestamps; remark is 1 in these records by construction:
select con_id
, count(*)
from t_readbak master
inner join t_conmast office on ( office.off_code = 1
and office.con_id = master.con_id )
inner join (
select con_id con_id
, max(timestamp) ts
from (
select con_id
, remark
, timestamp
from t_readbak
where remark = 0
) noremark
group by con_id
) cutoff
on ( master.con_id = cutoff.con_id )
where master.timestamp > cutoff.ts
group by master.con_id
;
Replace timestamp (max(timestamp)) with counter (min(counter)) and change the comparison operator if you can't trust your timestamp ordering.