Calc aggregates across continuous groups of values in Redshift

This would probably be easy to solve in procedural code, but it is harder to accomplish in straight SQL. I may have to give up and write a routine that scans through the table.
I have a table of user status values with start and end dates, like this:
create table #t (userid int4, status varchar(15), start_time date, end_time date);
insert into #t values
(1, 'Active', '2019-08-15', '2019-08-20'),
(1, 'Active', '2019-08-20', '2019-08-22'),
(1, 'Active', '2019-08-22', '2019-09-22'),
(1, 'Inactive', '2019-09-22', '2019-10-22'),
(1, 'At Risk', '2019-10-22', '2019-11-22'),
(1, 'Lapsed', '2019-11-22', '2019-12-08'),
(1, 'Active', '2019-12-08', '2019-12-18'),
(1, 'Active', '2019-12-18', '2020-01-11'),
(1, 'Active', '2020-01-11', '2020-01-15'),
(1, 'Active', '2020-01-15', '2020-02-15'),
(1, 'Inactive', '2020-02-15', '2020-03-15')
;
I'm trying to summarize this to the min/max dates for each continuous group of status values (when sorted by start_time), so that, for example, the first three Active rows collapse into a single row running from 2019-08-15 to 2019-09-22.
I've been trying to get there with window functions in Redshift, but I cannot simply partition by status: that groups all rows with the same status together, and I end up with "Active" from 2019-08-15 to 2020-02-15.

This is a so-called gaps-and-islands problem. I wrote this on my phone, so it is untested, but you should be able to search SO for that key phrase.
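WITH
  sorted AS
(
  SELECT
    *,
    -- position of the row within the user's full history
    ROW_NUMBER()
      OVER (
        PARTITION BY userid
            ORDER BY start_time
      )
        AS row_userid_start,
    -- position of the row among the user's rows with the same status
    ROW_NUMBER()
      OVER (
        PARTITION BY userid, status
            ORDER BY start_time
      )
        AS row_userid_status_start
  FROM
    #t
)
SELECT
  userid,
  status,
  MIN(start_time)  AS start_time,
  MAX(end_time)    AS end_time
FROM
  sorted
GROUP BY
  userid,
  status,
  -- this difference is constant within one continuous run of a status
  row_userid_status_start - row_userid_start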
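To see why grouping on the difference of the two row numbers works, here is a minimal Python sketch that mimics the same idea on the question's sample rows. It is only an illustration, not the Redshift query itself; the function name `collapse_islands` is made up for this example.

```python
from collections import defaultdict
from datetime import date

# Sample rows copied from the question: (userid, status, start_time, end_time).
rows = [
    (1, "Active", date(2019, 8, 15), date(2019, 8, 20)),
    (1, "Active", date(2019, 8, 20), date(2019, 8, 22)),
    (1, "Active", date(2019, 8, 22), date(2019, 9, 22)),
    (1, "Inactive", date(2019, 9, 22), date(2019, 10, 22)),
    (1, "At Risk", date(2019, 10, 22), date(2019, 11, 22)),
    (1, "Lapsed", date(2019, 11, 22), date(2019, 12, 8)),
    (1, "Active", date(2019, 12, 8), date(2019, 12, 18)),
    (1, "Active", date(2019, 12, 18), date(2020, 1, 11)),
    (1, "Active", date(2020, 1, 11), date(2020, 1, 15)),
    (1, "Active", date(2020, 1, 15), date(2020, 2, 15)),
    (1, "Inactive", date(2020, 2, 15), date(2020, 3, 15)),
]


def collapse_islands(rows):
    """Hypothetical helper: group rows by (userid, status, row-number difference)."""
    per_user = defaultdict(int)         # like ROW_NUMBER() OVER (PARTITION BY userid ...)
    per_user_status = defaultdict(int)  # like ROW_NUMBER() OVER (PARTITION BY userid, status ...)
    islands = {}
    for userid, status, start, end in sorted(rows, key=lambda r: (r[0], r[2])):
        per_user[userid] += 1
        per_user_status[(userid, status)] += 1
        # The difference stays constant while the same status repeats without interruption.
        key = (userid, status, per_user_status[(userid, status)] - per_user[userid])
        lo, hi = islands.get(key, (start, end))
        islands[key] = (min(lo, start), max(hi, end))
    return sorted(
        ((u, s, lo, hi) for (u, s, _), (lo, hi) in islands.items()),
        key=lambda r: (r[0], r[2]),
    )


for island in collapse_islands(rows):
    print(island)
```

The status has to be part of the grouping key: two different statuses can share the same difference value, so the difference alone does not identify a run.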

Related

I want to find the date intervals at which the employee comes on a regular basis

Imagine an employee who is on a contract with a company to work on a specific task; he starts and leaves on the contract's start and end dates respectively. I want to get the intervals during which the employee came to the office without any absence.
Example Data:
CREATE TABLE #TimeClock (PunchID INT IDENTITY, EmployeeID INT, PunchInDate DATE)
INSERT INTO #TimeClock (EmployeeID, PunchInDate) VALUES
(1, '2020-01-01'), (1, '2020-01-02'), (1, '2020-01-03'), (1, '2020-01-04'),
(1, '2020-01-05'), (1, '2020-01-06'), (1, '2020-01-07'), (1, '2020-01-08'),
(1, '2020-01-09'), (1, '2020-01-10'), (1, '2020-01-11'), (1, '2020-01-12'),
(1, '2020-01-13'), (1, '2020-01-14'), (1, '2020-01-16'),
(1, '2020-01-17'), (1, '2020-01-18'), (1, '2020-01-19'), (1, '2020-01-20'),
(1, '2020-01-21'), (1, '2020-01-22'), (1, '2020-01-23'), (1, '2020-01-24'),
(1, '2020-01-25'), (1, '2020-01-26'), (1, '2020-01-27'), (1, '2020-01-28'),
(1, '2020-01-29'), (1, '2020-01-30'), (1, '2020-01-31'),
(1, '2020-02-01'), (1, '2020-02-02'), (1, '2020-02-03'), (1, '2020-02-04'),
(1, '2020-02-05'), (1, '2020-02-06'), (1, '2020-02-07'), (1, '2020-02-08'),
(1, '2020-02-09'), (1, '2020-02-10'), (1, '2020-02-12'),
(1, '2020-02-13'), (1, '2020-02-14'), (1, '2020-02-15'), (1, '2020-02-16');
--the output shall look like this '2020-01-01 to 2020-02-10' as this is the interval at which the employee comes without any leave
SELECT 1 AS ID, FORMAT( getdate(), '2020-01-01') as START_DATE, FORMAT( getdate(), '2020-01-10') as END_DATE union all
SELECT 1 AS ID, FORMAT( getdate(), '2020-01-11') as START_DATE, FORMAT( getdate(), '2020-01-15') as END_DATE union all
SELECT 1 AS ID, FORMAT( getdate(), '2020-01-21') as START_DATE, FORMAT( getdate(), '2020-01-31') as END_DATE union all
SELECT 1 AS ID, FORMAT( getdate(), '2020-02-01') as START_DATE, FORMAT( getdate(), '2020-02-10') as END_DATE
--the output shall look like this '2020-01-01 to 2020-01-15' and '2020-01-21 to 2020-02-10', as these are the intervals at which the employee comes without any leave

Using the example data provided we can query the table like this:
;WITH iterate AS (
    SELECT *, DATEADD(DAY,1,PunchinDate) AS NextDate
    FROM #TimeClock
), base AS (
    SELECT *
    FROM (
        SELECT *,
               CASE WHEN DATEADD(DAY,-1,PunchInDate) = LAG(PunchinDate,1) OVER (PARTITION BY EmployeeID ORDER BY PunchinDate)
                    THEN PunchInDate END AS s
        FROM iterate
    ) a
    WHERE s IS NULL
), rCTE AS (
    SELECT EmployeeID, PunchInDate AS StartDate, PunchInDate AS EndDate, NextDate
    FROM base
    UNION ALL
    SELECT a.EmployeeID, a.StartDate, r.PunchInDate, r.NextDate
    FROM rCTE a
    INNER JOIN iterate r
        ON a.NextDate = r.PunchinDate
        AND a.EmployeeID = r.EmployeeID
)
SELECT EmployeeID, StartDate, MAX(EndDate) AS EndDate, DATEDIFF(DAY,StartDate,MAX(EndDate)) AS Streak
FROM rCTE
GROUP BY rCTE.EmployeeID, rCTE.StartDate
This uses a recursive common table expression, which lets us walk from each row to the row for the following day. We're looking for rows that form a streak, and we want to restart the streak any time we encounter a break. The windowed function LAG looks back one row to the previous value so we can compare it to the current one; if the previous punch wasn't yesterday, we start a new streak.
EmployeeID  StartDate   EndDate     Streak
------------------------------------------
1           2020-01-01  2020-01-14  13
1           2020-01-16  2020-02-10  25
1           2020-02-12  2020-02-16  4
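The same "start a new streak when the previous punch wasn't yesterday" rule can be sketched outside SQL. The Python below is only an illustration of that rule, assuming one employee and distinct punch dates; the `streaks` helper and the way the sample dates are generated are inventions of this sketch (the question's data has no punch on 2020-01-15 or 2020-02-11).

```python
from datetime import date, timedelta


def streaks(dates):
    """Hypothetical helper: return (start, end, days_between) per run of consecutive dates."""
    runs = []
    for d in sorted(dates):
        if runs and d - runs[-1][1] == timedelta(days=1):
            runs[-1][1] = d        # previous punch was yesterday: extend the streak
        else:
            runs.append([d, d])    # break found: start a new streak
    # days_between mirrors DATEDIFF(DAY, StartDate, EndDate)
    return [(start, end, (end - start).days) for start, end in runs]


# 2020-01-01 .. 2020-02-16 with offsets 14 (2020-01-15) and 41 (2020-02-11) left out,
# matching the punch dates listed in the question.
punches = [date(2020, 1, 1) + timedelta(days=i) for i in range(47) if i not in (14, 41)]

for run in streaks(punches):
    print(run)
```

A single ordered pass like this avoids recursion entirely; in SQL the equivalent non-recursive form is the usual gaps-and-islands grouping.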

SQL Pivot Half of table

I have a table that consists of time information. It's basically:
Employee, Date, Seq, Time In, Time Out.
They can clock out multiple times a day, so I'm trying to get all of the clock outs in a day on one row. My result would be something like:
Employee, Date, TimeIn1, TimeOut1, TimeIn2, TimeOut2, TimeIn3, TimeOut3....
Where the 1, 2, and 3 are the sequence numbers. I know I could just do a bunch of left joins to the table itself based on employee=employee, date=date, and seq=seq+1, but is there a way to do it in a pivot? I don't want to pivot the employee and date fields, just the time in and time out.

The short answer is: Yes, it's possible.
The exact code will be updated if/when you provide sample data to clarify some points, but you can absolutely pivot the times out while leaving the employee/work date alone.
Sorry for the wall of code; none of the fiddle sites are working from my current computer
create table #test (
pk int,
workdate date,
seq int,
tIN time,
tOUT time
)
insert into #test values
(1, '2020-11-25', 1, '08:00', null),
(1, '2020-11-25', 2, null, '11:00'),
(1, '2020-11-25', 3, '11:32', null),
(1, '2020-11-25', 4, null, '17:00'),
(2, '2020-11-25', 5, '08:00', null),
(2, '2020-11-25', 6, null, '09:00'),
(2, '2020-11-25', 7, '09:15', null),
-- new date
(1, '2020-11-27', 8, '08:00', null),
(1, '2020-11-27', 9, null, '08:22'),
(1, '2020-11-27', 10, '09:14', null),
(1, '2020-11-27', 11, null, '12:08'),
(1, '2020-11-27', 12, '01:08', null),
(1, '2020-11-27', 13, null, '14:40'),
(1, '2020-11-27', 14, '14:55', null),
(1, '2020-11-27', 15, null, '17:00')
select *
from (
    /* this just sets the column header names and condenses their values */
    select
        pk,
        workdate,
        colName = case when tin is not null then 'TimeIn' + cast(empDaySEQ as varchar) else 'TimeOut' + cast(empDaySEQ as varchar) end,
        colValue = coalesce(tin, tout)
    from (
        /* main query */
        select
            pk,
            workdate,
            /* grab what pair # this clock in or out is; reset by employee & date */
            empDaySEQ = (row_number() over (partition by pk, workdate order by seq) / 2) + (row_number() over (partition by pk, workdate order by seq) % 2),
            tin,
            tout
        from #test
    ) i
) a
PIVOT (
    max(colValue)
    for colName
    IN ( /* replace w/ dynamic if you don't know upper boundary of max in/out pairs */
        [TimeIn1],
        [TimeOut1],
        [TimeIn2],
        [TimeOut2],
        [TimeIn3],
        [TimeOut3],
        [TimeIn4],
        [TimeOut4]
    )
) mypivotTable
generates these results.
(I would provide a fiddle demo but they're not working for me today)
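The key step in the query is the pair number: integer-dividing the row number by 2 and adding the remainder maps rows 1, 2, 3, 4 to pairs 1, 1, 2, 2. As a rough illustration only, the Python sketch below applies the same arithmetic to a subset of the sample rows and builds the pivoted columns in a dictionary. The names `pair_number` and `pivot_punches` are made up for this sketch.

```python
from collections import defaultdict

# (pk, workdate, seq, time_in, time_out): the 2020-11-25 rows from the sample data.
rows = [
    (1, "2020-11-25", 1, "08:00", None),
    (1, "2020-11-25", 2, None, "11:00"),
    (1, "2020-11-25", 3, "11:32", None),
    (1, "2020-11-25", 4, None, "17:00"),
    (2, "2020-11-25", 5, "08:00", None),
    (2, "2020-11-25", 6, None, "09:00"),
    (2, "2020-11-25", 7, "09:15", None),
]


def pair_number(rn):
    """Same arithmetic as empDaySEQ: row numbers 1,2,3,4,... become 1,1,2,2,..."""
    return rn // 2 + rn % 2


def pivot_punches(rows):
    """Hypothetical helper: one dict of TimeInN/TimeOutN columns per (pk, workdate)."""
    pivoted = defaultdict(dict)
    row_numbers = defaultdict(int)  # row_number() partitioned by pk, workdate
    for pk, workdate, seq, t_in, t_out in sorted(rows, key=lambda r: (r[0], r[1], r[2])):
        row_numbers[(pk, workdate)] += 1
        n = pair_number(row_numbers[(pk, workdate)])
        name = f"TimeIn{n}" if t_in is not None else f"TimeOut{n}"
        pivoted[(pk, workdate)][name] = t_in if t_in is not None else t_out
    return dict(pivoted)


for key, columns in pivot_punches(rows).items():
    print(key, columns)
```

Like the SQL, this assumes punches strictly alternate between in and out within a day; a missed punch would shift every later pair.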

Split Table into Windows with Recurring Attributes

My title is awful, because I am not sure how to describe the challenge. I would love an edit if someone can think of a more descriptive title. Hopefully my input/desired output will help explain. Here is some sample input data:
create table #input (
num varchar(10),
code varchar(10),
event_date date
)
insert into #input (num, code, event_date)
values('123456', 'Active', '2007-09-10'),
('123456', 'Active', '2010-09-15'),
('123456', 'Active', '2010-09-24'),
('123456', 'Inactive', '2018-09-17'),
('123456', 'Inactive', '2019-01-01'),
('123456', 'Active', '2019-02-08')
select *
from #input
order by event_date
I want to tag each record for each group of num + code with the same number. However, I want the time periods to stay separate. Here is the desired result:
create table #result (
num varchar(10),
code varchar(10),
event_date date,
tag int
)
insert into #result (num, code, event_date, tag)
values('123456', 'Active', '2007-09-10', 1),
('123456', 'Active', '2010-09-15', 1),
('123456', 'Active', '2010-09-24', 1),
('123456', 'Inactive', '2018-09-17', 2),
('123456', 'Inactive', '2019-01-01', 2),
('123456', 'Active', '2019-02-08', 3)
select *
from #result
order by event_date
Obviously normal window partitions like this...
select *, row_number() over(partition by num, code order by event_date) rn
from #input
order by event_date
...don't work, because there is no field on which to partition that would split the two "Active" groups (two groups, because they happen during two time frames). How would I reach my desired result? I have a hunch that a series of lag() and lead() functions might work, but I couldn't get anywhere meaningful.
Alternatively, how would I achieve the results so the categories overlap by one?
create table #result_new (
num varchar(10),
code varchar(10),
event_date date,
tag int
)
insert into #result_new (num, code, event_date, tag)
values('123456', 'Active', '2007-09-10', 1),
('123456', 'Active', '2010-09-15', 1),
('123456', 'Active', '2010-09-24', 1),
('123456', 'Inactive', '2018-09-17', 1),
('123456', 'Inactive', '2019-01-01', 2),
('123456', 'Active', '2019-02-08', 2)
select *
from #result_new
order by event_date

LAG gets you halfway there, but not the whole way. You can use LAG to check the value of the previous row and create (what I have called) a switch: 1 when the code changes, 0 when it stays the same. You can then use a SUM window function with a ROWS BETWEEN clause to turn the running total of the switch into the value for tag:
WITH CTE AS(
SELECT num,
code,
event_date,
CASE WHEN code = LAG(code) OVER (PARTITION BY num ORDER BY event_date) THEN 0 ELSE 1 END AS Switch
FROM #input)
SELECT num,
code,
event_date,
SUM(Switch) OVER (PARTITION BY num ORDER BY event_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS tag
FROM CTE;
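As a rough cross-check of the switch-plus-running-sum idea, the Python sketch below applies the same two steps to the question's sample rows. It is an illustration only; `tag_runs` is a name invented here, and it relies on the ISO-formatted date strings sorting chronologically.

```python
# (num, code, event_date) rows from the question's #input table.
events = [
    ("123456", "Active", "2007-09-10"),
    ("123456", "Active", "2010-09-15"),
    ("123456", "Active", "2010-09-24"),
    ("123456", "Inactive", "2018-09-17"),
    ("123456", "Inactive", "2019-01-01"),
    ("123456", "Active", "2019-02-08"),
]


def tag_runs(events):
    """Hypothetical helper: tag = running sum of 'code changed' switches per num."""
    tagged = []
    prev_num = prev_code = None
    tag = 0
    for num, code, event_date in sorted(events, key=lambda e: (e[0], e[2])):
        if num != prev_num:
            tag = 0            # new partition: the running sum restarts
            prev_code = None   # no previous row, so the switch will be 1
        switch = 0 if code == prev_code else 1
        tag += switch          # running total up to and including the current row
        tagged.append((num, code, event_date, tag))
        prev_num, prev_code = num, code
    return tagged


for row in tag_runs(events):
    print(row)
```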

Contiguous Date Functionality: SAP HANA

We are trying to find ranges of contiguous dates in a table.
The expected output is attached in the image below:
Expected Output
create column table "PS_CMP_TIME_ANALYTICS"."Temp2" (
"ID" integer,
"Period" date);
INSERT INTO "PS_CMP_TIME_ANALYTICS"."Temp2" VALUES (4, '2010-04-03');
INSERT INTO "PS_CMP_TIME_ANALYTICS"."Temp2" VALUES (5, '2010-04-07');
INSERT INTO "PS_CMP_TIME_ANALYTICS"."Temp2" VALUES (2, '2010-04-10');
INSERT INTO "PS_CMP_TIME_ANALYTICS"."Temp2" VALUES (3, '2010-04-15');
INSERT INTO "PS_CMP_TIME_ANALYTICS"."Temp2" VALUES (6, '2010-04-16');
INSERT INTO "PS_CMP_TIME_ANALYTICS"."Temp2" VALUES (7, '2010-04-17');
INSERT INTO "PS_CMP_TIME_ANALYTICS"."Temp2" VALUES (3, '2010-04-22');
INSERT INTO "PS_CMP_TIME_ANALYTICS"."Temp2" VALUES (4, '2010-04-24');
INSERT INTO "PS_CMP_TIME_ANALYTICS"."Temp2" VALUES (7, '2010-04-30');
INSERT INTO "PS_CMP_TIME_ANALYTICS"."Temp2" VALUES (2, '2010-05-01');
INSERT INTO "PS_CMP_TIME_ANALYTICS"."Temp2" VALUES (5, '2010-05-02');
INSERT INTO "PS_CMP_TIME_ANALYTICS"."Temp2" VALUES (3, '2010-05-03');
The query we tried in SAP HANA and output we are getting is mentioned below:
SELECT MIN("Period") AS BeginRange,
MAX("Period") AS EndRange
FROM (
SELECT "Period",
--DATEDIFF(D, ROW_NUMBER() OVER(ORDER BY "Period"), "Period") AS DtRange
cast(ROW_NUMBER() OVER(ORDER BY "Period") as date) as xyz,
days_between(to_date(cast(ROW_NUMBER() OVER(ORDER BY "Period") as
Date),'YYYY-MM-DD'), "Period") AS DtRange
FROM "PS_CMP_TIME_ANALYTICS"."Temp2") AS dt
GROUP BY DtRange;
However, we are not getting the expected output; the output we got from SAP HANA SQL is attached below. The end dates should be different.
Our Output
How can we achieve this in SAP HANA SQL?

You can use the code below in SAP HANA:
SELECT MIN("Period") as "Start_Date",MAX("Period") as "End_Date",
(days_between(Min("Period"),Max("Period")) + 1) as "Days"
FROM
(select "Period",add_days("Period" ,- ROW_NUMBER() OVER(ORDER BY "Period"))rn
from "SchemaName"."Temp2")
GROUP BY rn
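The trick here is that subtracting a row number from each date yields the same value for every date in a contiguous run, so that value can serve as the group key. The Python sketch below illustrates this on the dates from the question; it is not HANA code, the `date_ranges` helper is hypothetical, and it assumes the dates are distinct (duplicates would break the row-number arithmetic in the same way).

```python
from datetime import date, timedelta
from itertools import groupby

# The "Period" values from the question's Temp2 table.
periods = [
    date(2010, 4, 3), date(2010, 4, 7), date(2010, 4, 10), date(2010, 4, 15),
    date(2010, 4, 16), date(2010, 4, 17), date(2010, 4, 22), date(2010, 4, 24),
    date(2010, 4, 30), date(2010, 5, 1), date(2010, 5, 2), date(2010, 5, 3),
]


def date_ranges(periods):
    """Hypothetical helper: (start, end, days) for each run of consecutive dates."""
    ordered = sorted(periods)
    # date minus row number: constant across a run of consecutive dates
    keyed = [(d - timedelta(days=i), d) for i, d in enumerate(ordered, start=1)]
    ranges = []
    for _, group in groupby(keyed, key=lambda pair: pair[0]):
        dates = [d for _, d in group]
        ranges.append((dates[0], dates[-1], (dates[-1] - dates[0]).days + 1))
    return ranges


for r in date_ranges(periods):
    print(r)
```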

You are actually looking for the upper bounds of the data islands in your ordered column, which is similar to finding the lower limits of the gaps.
Another solution is the following SQLScript, where I used the SQL LEAD() function to fetch the next value of the date field:
with cte as (
select
"ID", "Period", LEAD("Period", 1) over (order by "Period") NextDate
from "Temp2"
)
select "ID", "Period"
from cte
where IFNULL(DAYS_BETWEEN("Period",NextDate),9) > 1
I also used SQL CTE expressions, which might be new syntax for many HANA developers coming from an ABAP programming background.
I use them a lot, though, and they are especially powerful when several CTEs are combined in a single SELECT statement.
The output of the above SQLScript query is as follows:

How to retrieve WTD,YTD,MTD users from a user traffic table in the same query?

In a user traffic table like the one below, I would like to compute the week-to-date (WTD), month-to-date (MTD), and year-to-date (YTD) new and returned user counts.
Test data :
create table user_traffic (session_id number(6), session_day date,
user_id number(6), product_id number(6));
insert into user_traffic values ( 1, date '2016-09-07', 101, 1);
insert into user_traffic values ( 2, date '2016-09-07', 101, 4);
insert into user_traffic values ( 3, date '2016-09-07', 102, 1);
insert into user_traffic values ( 4, date '2016-09-08', 101, 2);
insert into user_traffic values ( 5, date '2016-09-08', 101, 4);
insert into user_traffic values ( 6, date '2016-09-09', 102, 1);
insert into user_traffic values ( 7, date '2016-09-10', 102, 1);
insert into user_traffic values ( 8, date '2016-09-10', 103, 3);
insert into user_traffic values ( 9, date '2016-09-25', 104, 3);
insert into user_traffic values ( 10, date '2016-10-01', 103, 1);
insert into user_traffic values ( 11, date '2016-10-02', 104, 3);
Expected Output :-
Week_Start_Day, WTD_new_cnt, WTD_returned_cnt
Month_Start_Day, MTD_new_cnt, MTD_returned_cnt
Year_Start_Day, YTD_new_cnt, YTD_returned_cnt
Comments :-
For example: in the user traffic table above, userid=104 visited on Oct 2nd, and the WTD, MTD, and YTD new/returned counts would be as below.
WTD,new,return
2016-09-26(Mon)(Week start day ), 1,0 ( For userid = 104 )
MTD,new,return
2016-09,1,1
2016-10,0,1
YTD,new,return
2016,0,1
What I have tried?
select session_day,
COUNT( distinct user_id ) AS user_cnt,
count(distinct user_id) - lag(count(distinct user_id))
over (order by session_day) gain,
count(newu) AS newu, count(returnu) AS returnu
from
(
select session_id,
session_day,
user_id,
CASE WHEN
count(*) over ( partition by user_id ORDER BY
session_day,session_id ROWS
BETWEEN UNBOUNDED PRECEDING AND
CURRENT ROW
)
= 1
THEN 1
END
AS newu,
CASE WHEN
lag( session_day,1 ) over ( partition by user_id ORDER
BY session_day,session_id
)
<>
lag( session_day,1 ) over ( order by
session_day,session_id
)
THEN 1
END AS returnu
from user_traffic u
)
group by session_day
order by session_day;
I built this SQL to compute the new/returned users from the user traffic table at the session_day level.
