Interested in a challenging SQL problem? Read on:
For the data set below, I'm trying to find logic that identifies the commencement date of each new project for each employee.
Data Set
The logic to identify the commencement date of a new project is:
An employee will not have any date record prior to the present one in a 14 day time frame.
Project windows only last 14 days after the commencement. The first record falling outside such a window will be counted as the start of the next project.
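To make the two rules concrete, here is a minimal Python sketch of the window logic for a single employee. It is not a SQL solution; the function name `project_starts` is hypothetical, and it assumes that a record dated 14 or more days after a commencement date falls outside that window.

```python
from datetime import date, timedelta

def project_starts(dates, window_days=14):
    """Return the dates that open a new project window for one employee."""
    starts = []
    window_start = None
    for d in sorted(set(dates)):
        # Assumption: a record 14 or more days after the current
        # commencement date is outside the window and opens a new project.
        if window_start is None or d >= window_start + timedelta(days=window_days):
            window_start = d
            starts.append(d)
    return starts

# Hypothetical sample: records on days 1, 3, 14, 16, 29 and 30 of January 2018.
sample = [date(2018, 1, day) for day in (1, 3, 14, 16, 29, 30)]
starts = project_starts(sample)
```

Each commencement date depends on the previous one, which is why the logic is naturally iterative and hard to express with a single window function.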
What is needed
Both Redshift and Postgres solutions are accepted.
Please note that Redshift doesn't support recursive CTEs or the RANGE keyword in window frames.
Thanks for reading.
For PostgreSQL, here is a solution. The DataSet CTE supplies the sample data:
WITH RECURSIVE TimeLine(Employee, ProjectID, ProjectStartDate, Date, DateRank) AS (
SELECT Employee, 1, Date, Date, DateRank
FROM DataSetWithRank
WHERE DateRank = 1
UNION ALL
SELECT T.Employee,
T.ProjectID + CASE When D.Date >= T.ProjectStartDate+14 THEN 1 Else 0 END,
CASE When D.Date >= T.ProjectStartDate+14 THEN D.Date Else T.ProjectStartDate END,
D.Date, D.DateRank
FROM TimeLine T
JOIN DataSetWithRank D ON D.Employee = T.Employee AND D.DateRank = T.DateRank + 1
), DataSet(Employee,Date) AS (
SELECT UNNEST(ARRAY['Employee1','Employee1','Employee1','Employee1','Employee1','Employee1','Employee1','Employee1','Employee1','Employee1','Employee1','Employee1','Employee1','Employee1','Employee1']),
UNNEST(ARRAY['2018-01-01','2018-01-03','2018-01-05','2018-01-08','2018-01-11','2018-01-13','2018-01-14','2018-01-16','2018-01-18','2018-01-21','2018-01-22','2018-01-24','2018-01-25','2018-01-27','2018-01-29']::date[])
UNION
SELECT UNNEST(ARRAY['Employee2','Employee2','Employee2','Employee2','Employee2','Employee2','Employee2','Employee2','Employee2','Employee2','Employee2','Employee2','Employee2','Employee2','Employee2']),
UNNEST(ARRAY['2018-01-03','2018-01-05','2018-01-07','2018-01-10','2018-01-13','2018-01-15','2018-01-16','2018-01-18','2018-01-20','2018-01-23','2018-01-24','2018-01-26','2018-01-27','2018-01-29','2018-01-31']::date[])
), DataSetWithRank AS (
SELECT *, DENSE_RANK() OVER (PARTITION BY Employee ORDER BY Date) AS DateRank
FROM DataSet
)
SELECT Employee,
'Project ' || ProjectID AS "Project #",
Date,
DENSE_RANK() OVER (PARTITION BY Employee, ProjectID ORDER BY Date) AS Rank,
CASE WHEN Date = ProjectStartDate THEN 'Y' ELSE NULL END AS Is_New
FROM TimeLine
It does not matter whether you offer me a solution in Oracle, SQL Server, MySQL, or PostgreSQL.
All I need is a different approach, another way of thinking. Thank you.
The start date and end date of each project are required along with the number of days - how long each project lasted.
Projects do not overlap.
I tried to solve it using LAG and LEAD, but I didn't succeed.
I usually look for several solutions to a problem, but unfortunately I found only one solution to this problem on Google.
I'm interested in other approaches, because that is how I learn.
This is the solution I found on Google:
WITH STARTDATES AS (
SELECT startdate
FROM project
WHERE startdate NOT IN (SELECT enddate FROM project) ),
-- get end dates not present in start date column (these are “true” project end dates)
ENDDATES AS (
SELECT enddate
FROM project
WHERE enddate NOT IN (SELECT startdate FROM project) ),
-- filter to plausible start-end pairs (start < end), then find correct end date for each start date (the minimum end date, since there are no overlapping projects)
t3 AS (
SELECT startdate, min(enddate) AS enddate
FROM STARTDATES, ENDDATES
WHERE startdate < enddate
GROUP BY startdate )
SELECT startdate, enddate, enddate - startdate AS project_duration
FROM t3
ORDER BY 3,1;
Thank you in advance
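You can use the LAG window function to get the ENDDATE of the previous row and subtract the current row's STARTDATE from it. A difference of zero means the row continues the previous project.
Then use a conditional running SUM to build a group number (grp), and aggregate by that group.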
SELECT MIN(STARTDATE) STARTDATE,
MAX(ENDDATE) ENDDATE,
COUNT(*) DURATION
FROM (
SELECT t1.*,SUM(CASE WHEN t1.daydiff = 0 THEN 0 ELSE 1 END) OVER(ORDER BY STARTDATE) grp
FROM (
SELECT T.*,LAG(ENDDATE) OVER(ORDER BY STARTDATE) - STARTDATE daydiff
FROM T
) t1
) t1
GROUP BY grp;
sqlfiddle
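The same grouping idea can be sketched outside the database. This is a minimal Python illustration, not part of the answer above: the function name `merge_projects` and the sample rows are hypothetical. A row continues the current project when its start date equals the previous row's end date, which is the `daydiff = 0` case in the query.

```python
from datetime import date

def merge_projects(tasks):
    """Merge (start, end) rows where each row starts on the previous row's end date."""
    projects = []
    for start, end in sorted(tasks):
        if projects and projects[-1][1] == start:
            # Same project: extend its end date.
            projects[-1][1] = end
        else:
            # Gap between rows: a new project begins.
            projects.append([start, end])
    # Duration is taken as end minus start, in days.
    return [(s, e, (e - s).days) for s, e in projects]

# Hypothetical sample: five one-day tasks forming two projects.
tasks = [
    (date(2015, 10, 1), date(2015, 10, 2)),
    (date(2015, 10, 2), date(2015, 10, 3)),
    (date(2015, 10, 3), date(2015, 10, 4)),
    (date(2015, 10, 13), date(2015, 10, 14)),
    (date(2015, 10, 14), date(2015, 10, 15)),
]
projects = merge_projects(tasks)
```

Note that the query uses COUNT(*) as the duration, which equals the number of days only when every row spans exactly one day.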
select min(start_date) as start_date,
       max(end_date) as end_date,
       count(*) as Project_Duration
from (
    select *,
           sum(case when t1.date_diff = 0 then 0 else 1 end) over (order by start_date) as days
    from (
        select *,
               lag(end_date) over (order by end_date) - start_date as date_diff
        from dataset
    ) as t1
) as t2
group by days;
I am trying to build a customer journey showing the periods during which each customer was active. The base data is unordered and looks like this:
I want to find each date where Status=Active, then the next date where Status=Inactive, pull in that period, and repeat for each subsequent occurrence. The output I am looking for is a table that looks like this:
Any pointers on how to do this in Teradata SQL would be very helpful.
As long as there's always a matching 'Inactive' row for every 'Active' row and only the final 'Inactive' might be missing:
select customerid, dt,
-- next 'Inactive' row, "until changed" for last row
lead(case when status = 'Inactive' then dt end, 1, date '9999-12-31')
over (partition by customerid
order by dt)
from cust
-- only return the 'Active' rows
qualify status = 'Active';
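For readers unfamiliar with LEAD and QUALIFY, here is a rough Python sketch of what the query computes for one customer. The function name `active_periods` and the sample rows are hypothetical. As in the query, the default end date applies only when there is no following row; if the next row is another 'Active' row, the end date is left empty.

```python
from datetime import date

OPEN_END = date(9999, 12, 31)  # same "until changed" marker as in the query

def active_periods(rows):
    """rows: (dt, status) tuples for one customer, in any order."""
    ordered = sorted(rows)
    periods = []
    for i, (dt, status) in enumerate(ordered):
        if status != "Active":
            continue  # like QUALIFY: only 'Active' rows are returned
        if i + 1 == len(ordered):
            end = OPEN_END  # no following row: the LEAD default applies
        else:
            next_dt, next_status = ordered[i + 1]
            # The CASE yields NULL when the next row is not 'Inactive'.
            end = next_dt if next_status == "Inactive" else None
        periods.append((dt, end))
    return periods

# Hypothetical, unordered sample for one customer.
rows = [
    (date(2020, 3, 1), "Inactive"),
    (date(2020, 1, 1), "Active"),
    (date(2020, 5, 1), "Active"),
]
periods = active_periods(rows)
```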
Here is the SQL (dbfiddle link):
WITH active_status
AS (SELECT customerid,
Row_number()
OVER (
partition BY customerid
ORDER BY start_date) id,
start_date
FROM (SELECT customerid,
CASE
WHEN status = 'Active' THEN dt
END "start_date"
FROM cust) x
WHERE start_date IS NOT NULL),
inactive_status
AS (SELECT customerid,
Row_number()
OVER (
partition BY customerid
ORDER BY end_date) id,
end_date
FROM (SELECT customerid,
CASE
WHEN status = 'Inactive' THEN dt
END "end_date"
FROM cust) x
WHERE end_date IS NOT NULL)
SELECT acta.customerid,
acta.start_date,
COALESCE(inacta.end_date, '2099-12-31') end_date
FROM active_status acta
LEFT OUTER JOIN inactive_status inacta
ON acta.customerid = inacta.customerid
AND acta.id = inacta.id
ORDER BY start_date;
dbfiddle does not support Teradata, so I used PostgreSQL for the dbfiddle link, since its SQL dialect is similar.
The only significant pieces are the row_number() analytical function and coalesce(), and both are available in Teradata.
I am using Redshift SQL and would like to combine each user's overlapping voucher periods into a single row, showing the minimum start date and the maximum end date.
For example, if I have these records:
I would like to achieve this result using Redshift:
The explanation is that since row 1 and row 2 have overlapping dates, I would like to combine them and get the min(Start_date) and max(End_Date).
I do not really know where to start. Tried using row_number to partition them but does not seem to work well. This is what I tried.
select
id,
start_date,
end_date,
lag(end_date, 1) over (partition by id order by start_date) as prev_end_date,
row_number() over (partition by id, (case when prev_end_date >= start_date then 1 else 0) order by start_date) as rn
from users
Are there any suggestions out there? Thank you kind sirs.
This is a type of gaps-and-islands problem. Because the dates are arbitrary, let me suggest the following approach:
Use a cumulative max to get the maximum end_date among the rows before the current row.
Use logic to determine when there is no overlap (i.e. a new period starts).
A cumulative sum of the starts provides an identifier for the group.
Then aggregate.
As SQL:
select id, min(start_date), max(end_date)
from (select u.*,
sum(case when prev_end_date >= start_date then 0 else 1
end) over (partition by id
order by start_date, voucher_code
rows between unbounded preceding and current row
) as grp
from (select u.*,
max(end_date) over (partition by id
order by start_date, voucher_code
rows between unbounded preceding and 1 preceding
) as prev_end_date
from users u
) u
) u
group by id, grp;
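The steps can also be traced in a short Python sketch for a single id. This is only an illustration of the logic, with a hypothetical function name `merge_overlaps` and hypothetical sample data. The end of the last merged period plays the role of prev_end_date, the running maximum of earlier end dates.

```python
from datetime import date

def merge_overlaps(periods):
    """Merge overlapping (start, end) periods for one id."""
    merged = []
    for start, end in sorted(periods):
        if merged and merged[-1][1] >= start:
            # Overlap with the running maximum end date: same group.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            # No overlap: a new group starts here.
            merged.append([start, end])
    return [tuple(p) for p in merged]

# Hypothetical sample: a long period containing a short one, then a separate period.
periods = [
    (date(2020, 1, 1), date(2020, 1, 31)),
    (date(2020, 1, 2), date(2020, 1, 3)),
    (date(2020, 1, 5), date(2020, 2, 10)),
    (date(2020, 3, 1), date(2020, 3, 5)),
]
merged = merge_overlaps(periods)
```

Comparing against the running maximum, rather than only the previous row's end date, is what keeps the third period in the first group even though it does not overlap the second one.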
Another approach would be using recursive CTE:
Divide all rows into numbered partitions grouped by id and ordered by start_date and end_date
Iterate over them calculating group_start_date for each row (rows which have to be merged in final result would have the same group_start_date)
Finally you need to group the CTE by id and group_start_date taking max end_date from each group.
Here is corresponding sqlfiddle: http://sqlfiddle.com/#!18/7059b/2
And the SQL, just in case:
WITH cteSequencing AS (
-- Get Values Order
SELECT *, start_date AS group_start_date,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY start_date, end_date) AS iSequence
FROM users),
Recursion AS (
-- Anchor - the first value in groups
SELECT *
FROM cteSequencing
WHERE iSequence = 1
UNION ALL
-- Remaining items
SELECT b.id, b.start_date, b.end_date,
CASE WHEN a.end_date > b.start_date THEN a.group_start_date
ELSE b.start_date
END
AS groupStartDate,
b.iSequence
FROM Recursion AS a
INNER JOIN cteSequencing AS b ON a.iSequence + 1 = b.iSequence AND a.id = b.id)
SELECT id, group_start_date as start_date, MAX(end_date) as end_date FROM Recursion group by id, group_start_date ORDER BY id, group_start_date
It is hard to summarize the issue in the title or in text, so see the image for a visual explanation.
I've got an issue joining two tables. Each date from the first table (startdatetable) should get the next/closest date in the other table (enddatetable). In 99% of cases this is easily done with rank, because most rows have a startdate that can find the next enddate, and a new startdate appears before another enddate is available.
However, if there are two startdates and only one enddate available, both startdates join to the same enddate.
What I'm trying to achieve is that if a date has been used in the previous row, it is not used again in the next row.
The rows I want are highlighted.
The SQL I started out with looked like
select *
from (
select rank() over (partition by tid order by enddate asc) as rnk
, id, startdate, enddate
from startdateTable
inner join enddateTable on startdateTable.ID = enddateTable.id
and enddateTable.enddate > startdateTable.startdate
) q
where q.rnk = 1
This gets me the following result. The last row should instead get the 2100 date, since the 2020-06 date has been used in the previous row.
If you have two tables and you want to align them, then you can use row_number():
select s.id, s.startdate, e.enddate
from (select s.*, row_number() over (partition by id order by startdate) as seqnum
from startdateTable s
) s join
(select e.*, row_number() over (partition by id order by enddate) as seqnum
from enddateTable e
) e
on e.id = s.id and e.seqnum = s.seqnum
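The positional pairing can be pictured with a small Python sketch. The function name `align_by_position` and the sample dates are hypothetical. The k-th start date per id is matched with the k-th end date per id, so no end date is used twice.

```python
from collections import defaultdict
from datetime import date

def align_by_position(starts, ends):
    """starts, ends: lists of (id, date). Pair the k-th start with the k-th end per id."""
    starts_by_id = defaultdict(list)
    ends_by_id = defaultdict(list)
    for key, d in starts:
        starts_by_id[key].append(d)
    for key, d in ends:
        ends_by_id[key].append(d)
    pairs = []
    for key in sorted(starts_by_id):
        # zip stops at the shorter list, like the inner join on seqnum.
        for s, e in zip(sorted(starts_by_id[key]), sorted(ends_by_id[key])):
            pairs.append((key, s, e))
    return pairs

# Hypothetical sample: two start dates fall before the same end date.
starts = [(1, date(2020, 1, 1)), (1, date(2020, 5, 1)), (1, date(2020, 5, 15))]
ends = [(1, date(2020, 3, 1)), (1, date(2020, 6, 1)), (1, date(2100, 1, 1))]
pairs = align_by_position(starts, ends)
```

This pairing does not check that each end date is later than its start date, so it relies on the two tables lining up one to one.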
I have in the past written queries that give me counts by date (hires, terminations, etc...) as follows:
SELECT per.date_start AS "Date",
COUNT(peo.EMPLOYEE_NUMBER) AS "Hires"
FROM hr.per_all_people_f peo,
hr.per_periods_of_service per
WHERE per.date_start BETWEEN peo.effective_start_date AND peo.EFFECTIVE_END_DATE
AND per.date_start BETWEEN :PerStart AND :PerEnd
AND per.person_id = peo.person_id
GROUP BY per.date_start
I now want to create a count of active employees by date, but I am not sure how to key the query by date, since I use a date range to determine who is active:
SELECT COUNT(peo.EMPLOYEE_NUMBER) AS "CT"
FROM hr.per_all_people_f peo
WHERE peo.current_employee_flag = 'Y'
and TRUNC(sysdate) BETWEEN peo.effective_start_date AND peo.EFFECTIVE_END_DATE
Here is a simple way to get started. This works for all the effective and end dates in your data:
select thedate,
SUM(num) over (order by thedate) as numActives
from ((select effective_start_date as thedate, 1 as num from hr.per_periods_of_service) union all
(select effective_end_date as thedate, -1 as num from hr.per_periods_of_service)
) dates
It works by adding one person for each start and subtracting one for each end (via num) and doing a cumulative sum. This might produce duplicate dates, so you might also do an aggregation to eliminate those duplicates:
select thedate, max(numActives)
from (select thedate,
SUM(num) over (order by thedate) as numActives
from ((select effective_start_date as thedate, 1 as num from hr.per_periods_of_service) union all
(select effective_end_date as thedate, -1 as num from hr.per_periods_of_service)
) dates
) t
group by thedate;
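The +1/-1 technique can be checked by hand with a short Python sketch. The function name `active_counts` and the sample periods are hypothetical, and skipping rows with no end date is an assumption of this sketch rather than something the queries above do.

```python
from collections import Counter
from datetime import date
from itertools import accumulate

def active_counts(periods):
    """periods: (start, end) tuples; end may be None for open-ended rows."""
    delta = Counter()
    for start, end in periods:
        delta[start] += 1      # one more active person from the start date
        if end is not None:    # assumption: open-ended rows never subtract
            delta[end] -= 1    # one fewer active person from the end date
    days = sorted(delta)
    running = accumulate(delta[d] for d in days)
    return list(zip(days, running))

# Hypothetical sample: three periods, one of them open-ended.
periods = [
    (date(2012, 4, 1), date(2012, 4, 10)),
    (date(2012, 4, 5), None),
    (date(2012, 4, 5), date(2012, 4, 20)),
]
counts = active_counts(periods)
```

Summing the changes per date before accumulating yields one row per date, so no separate de-duplication step is needed here.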
If you really want all dates, then it is best to start with a calendar table, and use a simple variation on your original query:
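-- count a column from the outer-joined table so that dates with no match return 0, not 1
select c.thedate, count(pos.effective_start_date) as NumActives
from calendar c left outer join
hr.per_periods_of_service pos
on c.thedate between pos.effective_start_date and pos.effective_end_date
group by c.thedate;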
If you want to count all employees who were active during the entire input date range:
SELECT COUNT(peo.EMPLOYEE_NUMBER) AS "CT"
FROM hr.per_all_people_f peo
WHERE peo.EFFECTIVE_START_DATE <= :StartDate
AND (peo.EFFECTIVE_END_DATE IS NULL OR peo.EFFECTIVE_END_DATE >= :EndDate)
Here is my example, based on Gordon Linoff's answer,
with a small modification: in the subtracting subquery, every record appeared with -1 in num, even when its end date was NULL.
use AdventureWorksDW2012 -- selects the database to work with in MS SSMS;
-- may not work on other platforms
select
t.thedate
,max(t.numActives) AS "Total Active Employees"
from (
select
dates.thedate
,SUM(dates.num) over (order by dates.thedate) as numActives
from
(
(
select
StartDate as thedate
,1 as num
from DimEmployee
)
union all
(
select
EndDate as thedate
,-1 as num
from DimEmployee
where EndDate IS NOT NULL
)
) AS dates
) AS t
group by thedate
ORDER BY thedate
This worked for me; I hope it helps somebody.
I was able to get the results I was looking for with the following:
--Active Team Members by Date
SELECT "a_date",
COUNT(peo.EMPLOYEE_NUMBER) AS "CT"
FROM hr.per_all_people_f peo,
(SELECT DATE '2012-04-01'-1 + LEVEL AS "a_date"
FROM dual
CONNECT BY LEVEL <= DATE '2012-04-30'+2 - DATE '2012-04-01'-1
)
WHERE peo.current_employee_flag = 'Y'
AND "a_date" BETWEEN peo.effective_start_date AND peo.EFFECTIVE_END_DATE
GROUP BY "a_date"
ORDER BY "a_date"