I have a table in the following format
Id StartDate EndDate Type
1 2012-02-18 2012-03-18 1
1 2012-03-17 2012-06-29 1
1 2012-06-27 2012-09-27 1
1 2014-08-23 2014-09-24 3
1 2014-09-23 2014-10-24 3
1 2014-10-23 2014-11-24 3
2 2015-07-04 2015-08-06 1
2 2015-08-04 2015-09-06 1
3 2013-11-01 2013-12-01 0
3 2018-01-09 2018-02-09 0
I found similar questions here, but nothing that helped me solve my problem. I want to merge rows that have the same Id, the same Type and overlapping date periods.
The result from the above table should be
Id StartDate EndDate Type
1 2012-02-18 2012-09-27 1
1 2014-08-23 2014-11-24 3
2 2015-07-04 2015-09-06 1
3 2013-11-01 2013-12-01 0
3 2018-01-09 2018-02-09 0
On another server I was able to do it with the query below, under the following conditions:
I didn't care about the Type column, only the Id
I had a newer version of SQL Server (2012); now I have 2008, which the code is not compatible with
SELECT Id
, MIN(StartDate) AS StartDate
, MAX(EndDate) AS EndDate
FROM (
SELECT *
, SUM(CASE WHEN a.EndDate = a.StartDate THEN 0
ELSE 1
END
) OVER (ORDER BY Id, StartDate) sm
FROM (
SELECT Id
, StartDate
, EndDate
, LAG(EndDate, 1, NULL) OVER (PARTITION BY Id ORDER BY Id, EndDate) EndDate
FROM #temptable
) a
) b
GROUP BY Id, sm
Any advice on how I can:
Include Type in the process
Make it work on SQL Server 2008
This approach uses an additional temp table to identify the groups of overlapping dates, and then performs a quick aggregate based on the groupings.
SELECT *, ROW_NUMBER() OVER (ORDER BY Id, Type) AS UID,
ROW_NUMBER() OVER (ORDER BY Id, Type) AS GroupId INTO #G FROM #TempTable
WHILE @@ROWCOUNT <> 0 BEGIN
UPDATE T1 SET
GroupId = T2.GroupId
FROM #G T1
INNER JOIN (
SELECT T1.UID, CASE WHEN T1.GroupId < T2.GroupId THEN T1.GroupId ELSE T2.GroupId END
FROM #G T1
LEFT OUTER JOIN #G T2
ON T1.Id = T2.Id AND T1.Type = T2.Type AND T1.GroupId <> T2.GroupId
AND T1.StartDate <= T2.EndDate AND T2.StartDate <= T1.EndDate
) T2 (UID, GroupId)
ON T1.UID = T2.UID
WHERE T1.GroupId <> T2.GroupId
END
SELECT Id, MIN(StartDate) AS StartDate, MAX(EndDate) AS EndDate, Type
FROM #G G GROUP BY GroupId, Id, Type
This returns the expected values
Id StartDate EndDate Type
----------- ---------- ---------- -----------
1 2012-02-18 2012-09-27 1
1 2014-08-23 2014-11-24 3
2 2015-07-04 2015-09-06 1
3 2013-11-01 2013-12-01 0
3 2018-01-09 2018-02-09 0
This is 2008 compatible. A CTE really is the best way to link up all overlapping records in my opinion. The date overlap logic came from this thread: SO Date Overlap
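The overlap test itself can be checked in isolation. This is only a minimal sketch, assuming closed intervals (both end dates inclusive); the variable names are made up for illustration and the values are the first two rows of the question's data.

```sql
-- Two closed intervals [s1, e1] and [s2, e2] overlap exactly when
-- each one starts on or before the day the other one ends.
DECLARE @s1 date = '2012-02-18', @e1 date = '2012-03-18';
DECLARE @s2 date = '2012-03-17', @e2 date = '2012-06-29';

SELECT CASE WHEN @s1 <= @e2 AND @s2 <= @e1
            THEN 'overlap'
            ELSE 'no overlap'
       END AS Result;   -- 'overlap' for these two rows
```

The recursive CTE below applies the same pair of comparisons between the range built so far and each candidate row.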
I added extra data that's more complex to make sure that it's working as expected.
DECLARE @Data table (Id INT, StartDate DATE, EndDate DATE, Type INT)
INSERT INTO @Data
SELECT 1,'2/18/2012' ,'3/18/2012', 1 UNION ALL
select 1,'3/17/2012','6/29/2012',1 UNION ALL
select 1,'6/27/2012','9/27/2012',1 UNION ALL
select 1,'8/23/2014','9/24/2014',3 UNION ALL
select 1,'9/23/2014','10/24/2014',3 UNION ALL
select 1,'10/23/2014','11/24/2014',3 UNION ALL
select 2,'7/4/2015','8/6/2015',1 UNION ALL
select 2,'8/4/2015','9/6/2015',1 UNION ALL
select 3,'11/1/2013','12/1/2013',0 UNION ALL
select 3,'1/9/2018','2/9/2018',0 UNION ALL
select 4,'1/1/2018','1/2/2018',0 UNION ALL --many non overlapping dates
select 4,'1/4/2018','1/5/2018',0 UNION ALL
select 4,'1/7/2018','1/9/2018',0 UNION ALL
select 4,'1/11/2018','1/13/2018',0 UNION ALL
select 4,'2/7/2018','2/8/2018',0 UNION ALL --many overlapping dates
select 4,'2/8/2018','2/9/2018',0 UNION ALL
select 4,'2/9/2018','2/10/2018',0 UNION all
select 4,'2/10/2018','2/11/2018',0 UNION all
select 4,'2/11/2018','2/12/2018',0 UNION all
select 4,'2/12/2018','2/13/2018',0 UNION all
select 4,'3/7/2018','3/8/2018',0 UNION ALL --many overlapping dates, second instance of id 4, type 0
select 4,'3/8/2018','3/9/2018',0 UNION ALL
select 4,'3/9/2018','3/10/2018',0 UNION all
select 4,'3/10/2018','3/11/2018',0 UNION all
select 4,'3/11/2018','3/12/2018',0 UNION all
select 4,'3/12/2018','3/13/2018',0
;
WITH cdata
AS (SELECT Id,
d.Type,
d.StartDate,
d.EndDate,
CurrentStart = d.StartDate
FROM @Data d
WHERE
NOT EXISTS (
SELECT * FROM @Data x WHERE x.StartDate < d.StartDate AND d.StartDate <= x.EndDate AND d.EndDate >= x.StartDate AND d.Id = x.Id AND d.Type = x.Type --get first records for overlapping ranges
)
UNION ALL
SELECT d.Id,
d.Type,
StartDate = CASE WHEN d2.StartDate < d.StartDate THEN d2.StartDate ELSE d.StartDate END,
EndDate = CASE WHEN d2.EndDate > d.EndDate THEN d2.EndDate ELSE d.EndDate END,
CurrentStart = d2.StartDate
FROM cdata d
INNER JOIN @Data d2
ON (
d.StartDate <= d2.EndDate
AND d.EndDate >= d2.StartDate
)
AND d2.Id = d.Id
AND d2.Type = d.Type
AND d2.StartDate > d.CurrentStart)
SELECT cdata.Id, cdata.Type, cdata.StartDate, EndDate = MAX(cdata.EndDate)
FROM cdata
GROUP BY cdata.Id, cdata.Type, cdata.StartDate
This looks like a Packing Intervals problem. See the post by Itzik Ben-Gan for all the details and for the indexes he recommends to make it work efficiently. He presents a solution without a recursive CTE.
Two notes.
The query below assumes that intervals are half-open, [closed; open), i.e. StartDate is inclusive and EndDate is exclusive. This representation is often the most convenient, in the same way that zero-based arrays are usually more convenient than 1-based ones in programming languages.
I added a RowID column to have unambiguous sorting.
Sample data
DECLARE @T TABLE
(
RowID int IDENTITY,
id int,
StartDate date,
EndDate date,
tp int
);
INSERT INTO @T(Id, StartDate, EndDate, tp) VALUES
(1, '2012-02-18', '2012-03-18', 1),
(1, '2012-03-17', '2012-06-29', 1),
(1, '2012-06-27', '2012-09-27', 1),
(1, '2014-08-23', '2014-09-24', 3),
(1, '2014-09-23', '2014-10-24', 3),
(1, '2014-10-23', '2014-11-24', 3),
(2, '2015-07-04', '2015-08-06', 1),
(2, '2015-08-04', '2015-09-06', 1),
(3, '2013-11-01', '2013-12-01', 0),
(3, '2018-01-09', '2018-02-09', 0);
-- Make EndDate the open end of the interval, i.e. exclusive
-- [Start; End)
UPDATE @T
SET EndDate = DATEADD(day, 1, EndDate)
;
Recommended indexes
-- indexes to support solutions
-- (these apply to a permanent table named T; CREATE INDEX cannot target a table variable)
CREATE UNIQUE INDEX idx_start_id ON T(id, tp, StartDate, RowID);
CREATE UNIQUE INDEX idx_end_id ON T(id, tp, EndDate, RowID);
Query
Read Itzik's post to understand what is going on; he has nice illustrations there. In short, each timestamp (start or end) is treated as an event. Each event has a + or - type. Each time we encounter a + event (some interval starts) we increase the running counter. Each time we encounter a - event (some interval ends) we decrease the running counter. When the running counter is 0 it means that the streak of overlapping intervals is over.
I took Itzik's query as is and simply changed the column names to match your names.
WITH C1 AS
-- let e = end ordinals, let s = start ordinals
(
SELECT
RowID, id, tp, StartDate AS ts, +1 AS EventType,
NULL AS e,
ROW_NUMBER() OVER(PARTITION BY id, tp ORDER BY StartDate, RowID) AS s
FROM @T
UNION ALL
SELECT
RowID, id, tp, EndDate AS ts, -1 AS EventType,
ROW_NUMBER() OVER(PARTITION BY id, tp ORDER BY EndDate, RowID) AS e,
NULL AS s
FROM @T
),
C2 AS
-- let se = start or end ordinal, namely, how many events (start or end) happened so far
(
SELECT C1.*,
ROW_NUMBER() OVER(PARTITION BY id, tp ORDER BY ts, EventType DESC, RowID) AS se
FROM C1
),
C3 AS
-- For start events, the expression s - (se - s) - 1 represents how many sessions were active
-- just before the current (hence - 1)
--
-- For end events, the expression (se - e) - e represents how many sessions are active
-- right after this one
--
-- The above two expressions are 0 exactly when a group of packed intervals
-- either starts or ends, respectively
--
-- After filtering only events when a group of packed intervals either starts or ends,
-- group each pair of adjacent start/end events
(
SELECT id, tp, ts,
((ROW_NUMBER() OVER(PARTITION BY id, tp ORDER BY ts) - 1) / 2 + 1)
AS grpnum
FROM C2
WHERE COALESCE(s - (se - s) - 1, (se - e) - e) = 0
)
SELECT id, tp, MIN(ts) AS StartDate, DATEADD(day, -1, MAX(ts)) AS EndDate
FROM C3
GROUP BY id, tp, grpnum
ORDER BY id, tp, StartDate;
Result
+----+----+------------+------------+
| id | tp | StartDate | EndDate |
+----+----+------------+------------+
| 1 | 1 | 2012-02-18 | 2012-09-27 |
| 1 | 3 | 2014-08-23 | 2014-11-24 |
| 2 | 1 | 2015-07-04 | 2015-09-06 |
| 3 | 0 | 2013-11-01 | 2013-12-01 |
| 3 | 0 | 2018-01-09 | 2018-02-09 |
+----+----+------------+------------+
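The running-counter idea can also be seen on its own with a tiny example. This is only a sketch and not Itzik's query: the table variable @E and its columns are hypothetical, the events are the three Id 1 / Type 1 intervals with exclusive end dates, and the counter is computed with a correlated subquery so that it runs on SQL Server 2008.

```sql
-- Hypothetical event list: +1 when an interval starts, -1 when it ends
-- (end dates already shifted by one day, i.e. exclusive).
DECLARE @E TABLE (ts date, delta int);
INSERT INTO @E (ts, delta) VALUES
('2012-02-18', 1), ('2012-03-19', -1),
('2012-03-17', 1), ('2012-06-30', -1),
('2012-06-27', 1), ('2012-09-28', -1);

-- Running counter: sum of all events ordered at or before the current one.
-- Start events sort before end events that share the same date.
SELECT e.ts, e.delta,
       (SELECT SUM(x.delta)
        FROM @E AS x
        WHERE x.ts < e.ts
           OR (x.ts = e.ts AND x.delta >= e.delta)) AS Running
FROM @E AS e
ORDER BY e.ts, e.delta DESC;
-- Running is 1, 2, 1, 2, 1, 0: it reaches 0 only at the last event,
-- so the three intervals pack into a single one.
```

The correlated subquery is quadratic in the number of events, which is why the ordinal arithmetic in the query above is preferable on real data.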
create table #table
(Id int,StartDate date, EndDate date, Type int)
insert into #table
values
('1','2012-02-18','2012-03-18','1'),('1','2012-03-19','2012-06-19','1'),
('1','2012-06-27','2012-09-27','1'),('1','2014-08-23','2014-09-24','3'),
('1','2014-09-23','2014-10-24','3'),('1','2014-10-23','2014-11-24','3'),
('2','2015-07-04','2015-08-06','1'),('2','2015-08-04','2015-09-06','1'),
('3','2013-11-01','2013-12-01','0'),('3','2018-01-09','2018-02-09','0')
select ID,MIN(startdate)sd,MAX(EndDate)ed,type from #table
group by ID,TYPE,YEAR(startdate),YEAR(EndDate)
This can be achieved easily with window functions and CTEs. Here is the solution:
DECLARE @table TABLE
(id INT,
StartDate DATE,
EndDate DATE,
[Type] INT
);
INSERT INTO @table(Id, StartDate, EndDate, [Type]) VALUES
(1, '2012-02-18', '2012-03-18', 1),
(1, '2012-03-17', '2012-06-29', 1),
(1, '2012-06-27', '2012-09-27', 1),
(1, '2014-08-23', '2014-09-24', 3),
(1, '2014-09-23', '2014-10-24', 3),
(1, '2014-10-23', '2014-11-24', 3),
(2, '2015-07-04', '2015-08-06', 1),
(2, '2015-08-04', '2015-09-06', 1),
(3, '2013-11-01', '2013-12-01', 0),
(3, '2018-01-09', '2018-02-09', 0);
WITH C1 AS
(
SELECT *,
MAX(EndDate) OVER(PARTITION BY Id, [Type]
ORDER BY StartDate, EndDate
ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS PrevEnd
FROM @table
),
C2 AS
(
SELECT *,
SUM(StartFlag) OVER(PARTITION BY Id, [Type]
ORDER BY StartDate, EndDate
ROWS UNBOUNDED PRECEDING) AS GroupID
FROM C1
CROSS APPLY ( VALUES(CASE WHEN StartDate <= PrevEnd THEN NULL ELSE 1 END) ) AS A(StartFlag)
)
SELECT Id, [Type], MIN(StartDate) AS StartDate, MAX(EndDate) AS EndDate
FROM C2
GROUP BY Id, [Type], GroupID;
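One caveat: ordered, framed window aggregates (the ROWS ... PRECEDING clauses) were introduced in SQL Server 2012, so this query does not run on 2008 as written. As a rough sketch of how the PrevEnd column alone might be emulated on 2008, an OUTER APPLY can look up the largest EndDate among the earlier rows of the same Id and Type. The table variable @t and its three rows are made up for illustration, and rows with an identical StartDate and EndDate would not see each other here.

```sql
DECLARE @t TABLE (Id int, StartDate date, EndDate date, [Type] int);
INSERT INTO @t (Id, StartDate, EndDate, [Type]) VALUES
(1, '2012-02-18', '2012-03-18', 1),
(1, '2012-03-17', '2012-06-29', 1),
(1, '2014-08-23', '2014-09-24', 3);

-- PrevEnd = latest EndDate among rows that sort before the current row
-- (same Id and Type, ordered by StartDate, then EndDate).
SELECT t.Id, t.[Type], t.StartDate, t.EndDate, p.PrevEnd
FROM @t AS t
OUTER APPLY (
    SELECT MAX(x.EndDate) AS PrevEnd
    FROM @t AS x
    WHERE x.Id = t.Id
      AND x.[Type] = t.[Type]
      AND (x.StartDate < t.StartDate
           OR (x.StartDate = t.StartDate AND x.EndDate < t.EndDate))
) AS p
ORDER BY t.Id, t.[Type], t.StartDate, t.EndDate;
-- PrevEnd is NULL for the first row of each (Id, Type) and
-- 2012-03-18 for the second Type 1 row.
```

The running SUM that builds GroupID would need a similar rewrite, so on 2008 one of the other answers is likely the simpler route.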
I have a table like this. For each employee I want to get their current Designation (the row whose EffectiveTo is NULL) and the date when they FIRST joined as a Trainee (MIN(EffectiveFrom) where Designation = 'Trainee').
+----+-------------------+---------------+-------------+
| ID | Designation | EffectiveFrom | EffectiveTo |
+----+-------------------+---------------+-------------+
| 1 | Trainee | 01/01/2000 | 31/12/2000 |
| 1 | Assistant Manager | 01/01/2001 | 31/12/2004 |
| 1 | Suspended | 01/01/2005 | 01/02/2005 |
| 1 | Trainee | 02/03/2005 | 31/03/2005 |
| 1 | Manager | 01/04/2005 | NULL |
| 2 | Trainee | 01/01/2014 | 31/12/2014 |
| 2 | Developer | 01/01/2015 | 31/12/2016 |
| 2 | Architect | 01/01/2017 | NULL |
+----+-------------------+---------------+-------------+
How can I get a result like this?
+----+---------------------+---------------------+
| ID | Current Designation | Date First Employed |
+----+---------------------+---------------------+
| 1 | Manager | 01/01/2000 |
| 2 | Architect | 01/01/2014 |
+----+---------------------+---------------------+
The date of first employment can be located using CROSS APPLY and SELECT TOP(1):
CREATE TABLE #table1(
ID int,
Designation varchar(17),
EffectiveFrom datetime,
EffectiveTo varchar(10));
INSERT INTO #table1
(ID, Designation, EffectiveFrom, EffectiveTo)
VALUES
(1, 'Trainee', '2000-01-01 01:00:00', '31/12/2000'),
(1, 'Assistant Manager', '2001-01-01 01:00:00', '31/12/2004'),
(1, 'Suspended', '2005-01-01 01:00:00', '01/02/2005'),
(1, 'Trainee', '2005-02-03 01:00:00', '31/03/2005'),
(1, 'Manager', '2005-01-04 01:00:00', NULL),
(2, 'Trainee', '2014-01-01 01:00:00', '31/12/2014'),
(2, 'Developer', '2015-01-01 01:00:00', '31/12/2016'),
(2, 'Architect', '2017-01-01 01:00:00', NULL);
select t.ID, t.Designation [Current Designation],
ef.EffectiveFrom [Date First Employed]
from #table1 t
cross apply (select top(1) cast(tt.EffectiveFrom as date) EffectiveFrom
from #table1 tt
where t.ID=tt.ID
and Designation='Trainee'
order by tt.EffectiveFrom) ef
where t.EffectiveTo is null;
ID Current Designation Date First Employed
1 Manager 2000-01-01
2 Architect 2014-01-01
One method is conditional aggregation. It is a bit unclear how you define "current", but assuming this is associated with EffectiveTo being NULL:
select id,
max(case when EffectiveTo is null then designation end) as current_designation,
min(effectivefrom) as start_date
from t
group by id;
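If the start date has to come from Trainee rows only, as the question describes, the same conditional-aggregation pattern could be applied to the date column too. A minimal sketch, assuming the table is called t as above and that the column alias date_first_employed is free to choose:

```sql
SELECT id,
       -- designation of the open-ended (current) row
       MAX(CASE WHEN EffectiveTo IS NULL THEN Designation END) AS current_designation,
       -- earliest EffectiveFrom among Trainee rows only
       MIN(CASE WHEN Designation = 'Trainee' THEN EffectiveFrom END) AS date_first_employed
FROM t
GROUP BY id;
```

An employee who was never a Trainee would get NULL in the date column with this variant.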
You can try the query below:
select id,max(current_designation) current_designation,min(date_first_employee) date_first_employee from
(select id,
max(case when EffectiveTo is null then designation end) over (partition by id) as current_designation,
(case when Designation='Trainee' then EffectiveFrom end) Date_First_Employee
from desig) t
group by id
This is another possible solution:
SQL Fiddle
MySQL 5.6 Schema Setup:
CREATE TABLE table1
(`ID` int, `Designation` varchar(17), `EffectiveFrom` datetime, `EffectiveTo` varchar(10))
;
INSERT INTO table1
(`ID`, `Designation`, `EffectiveFrom`, `EffectiveTo`)
VALUES
(1, 'Trainee', '2000-01-01 01:00:00', '31/12/2000'),
(1, 'Assistant Manager', '2001-01-01 01:00:00', '31/12/2004'),
(1, 'Suspended', '2005-01-01 01:00:00', '01/02/2005'),
(1, 'Trainee', '2005-02-03 01:00:00', '31/03/2005'),
(1, 'Manager', '2005-01-04 01:00:00', NULL),
(2, 'Trainee', '2014-01-01 01:00:00', '31/12/2014'),
(2, 'Developer', '2015-01-01 01:00:00', '31/12/2016'),
(2, 'Architect', '2017-01-01 01:00:00', NULL)
;
Query 1:
SELECT
ID,
(SELECT
`Designation`
FROM
table1
WHERE
`EffectiveFrom` = (SELECT
MAX(`EffectiveFrom`)
FROM
table1
WHERE
ID = t1.ID)) AS `Current Designation`
,DATE(MIN(`EffectiveFrom`)) AS `Date First Employed`
FROM
table1 t1
GROUP BY ID
Results:
| ID | Current Designation | Date First Employed |
|----|---------------------|---------------------|
| 1 | Trainee | 2000-01-01 |
| 2 | Architect | 2014-01-01 |
It's actually a rather simple self-join, assuming that the EffectiveFrom and EffectiveTo columns are always filled in appropriately (i.e. there is exactly one NULL value for EffectiveTo per ID). Since it's possible for someone to be a Trainee twice, you also need a window function like ROW_NUMBER() to keep only the earliest Trainee EffectiveFrom date:
WITH CTE_Designations AS
(
SELECT T1.ID, T1.Designation AS CurrentDesignation, ISNULL(T2.EffectiveFrom, T1.EffectiveFrom) AS DateFirstEmployed -- If the join fails below then that means the earliest Designation is in T1 (e.g. that is the 'Trainee' record)
FROM DesignationsTable AS T1
LEFT JOIN DesignationsTable AS T2
ON T1.ID = T2.ID
AND T1.Designation <> T2.Designation
AND T2.Designation = 'Trainee'
WHERE T1.EffectiveTo IS NULL
),
CTE_Designations_FirstEmployedOnly AS
(
SELECT ID, CurrentDesignation, DateFirstEmployed, ROW_NUMBER() OVER (PARTITION BY ID ORDER BY DateFirstEmployed) AS SortId -- Generates a unique ID per DateFirstEmployed row for each Designation.ID sorted by DateFirstEmployed
FROM CTE_Designations
)
SELECT ID, CurrentDesignation, DateFirstEmployed
FROM CTE_Designations_FirstEmployedOnly
WHERE SortId = 1
This returns the result shown for the given data
Solves for "current Designation(whose effectiveto is null) and the date where they FIRST joined". Also handles terminated employees where EffectiveTo is not NULL as well as new employees with a single row. Replace "#t" with your table name.
SELECT a.Id, d.Designation, a.EffectiveFrom
FROM
(SELECT *, ROW_NUMBER() OVER ( PARTITION BY Id ORDER BY EffectiveFrom ASC ) r FROM #t) a
INNER JOIN
(SELECT *, ROW_NUMBER() OVER ( PARTITION BY Id ORDER BY EffectiveFrom DESC ) r FROM #t) d
ON a.Id = d.Id
WHERE a.r = 1 AND d.r = 1
Result:
Id Designation EffectiveFrom
1 Manager 2000-01-01
2 Architect 2014-01-01
I have data organised like this
CREATE TABLE sandbox.tab_1 (id serial, started timestamp, ended timestamp);
INSERT INTO sandbox.tab_1 (id, started, ended) VALUES
(1, '2020-01-03'::timestamp, NULL),
(2, '2020-01-05'::timestamp, '2020-01-06'),
(3, '2020-01-07'::timestamp, NULL),
(4, '2020-01-08'::timestamp, NULL);
For each day in a generated time series that runs from min(started) to max(started), I need to count the number of rows that started on or before that day and have not yet ended. This would give me, for each day, the stock of started and not ended ids at that time.
Thank you for your help
You can LEFT JOIN the table to the series of timestamps on the start being less than or equal to the timestamp and the end being greater than the timestamp or being NULL. Then GROUP BY the timestamps and take the count().
SELECT gs.ts,
count(t1.started)
FROM generate_series('2020-01-03'::timestamp, '2020-01-08'::timestamp, '1 day'::interval) gs (ts)
LEFT JOIN tab_1 t1
ON t1.started <= gs.ts
AND (t1.ended IS NULL
OR t1.ended > gs.ts)
GROUP BY gs.ts
ORDER BY gs.ts;
db<>fiddle
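The series bounds are hard-coded in the query above. As a sketch of one way to derive them from the table instead, the min and max of started can be computed once and fed to generate_series() through a lateral join. This assumes PostgreSQL 9.3 or later for LATERAL; the aliases b, lo and hi are made up for illustration.

```sql
SELECT gs.ts,
       count(t1.id) AS stock
FROM (SELECT min(started) AS lo, max(started) AS hi FROM tab_1) AS b
-- one row per day between the earliest and latest start
CROSS JOIN LATERAL generate_series(b.lo, b.hi, interval '1 day') AS gs (ts)
LEFT JOIN tab_1 t1
       ON t1.started <= gs.ts
      AND (t1.ended IS NULL
           OR t1.ended > gs.ts)
GROUP BY gs.ts
ORDER BY gs.ts;
```

The join condition is unchanged, so a row still stops counting on the day it ends.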
Here is one option with generate_series() and union all:
select ts, sum(sum(cnt)) over(order by ts) stock
from (
select generate_series(min(started), max(started), interval '1' day) ts, 0 cnt from tab_1
union all select started, 1 from tab_1
union all select ended, -1 from tab_1 where ended is not null
) t
group by ts
order by ts
Demo on DB Fiddle:
ts | stock
:------------------ | ----:
2020-01-03 00:00:00 | 1
2020-01-04 00:00:00 | 1
2020-01-05 00:00:00 | 2
2020-01-06 00:00:00 | 1
2020-01-07 00:00:00 | 2
2020-01-08 00:00:00 | 3
I am new to writing queries in Postgres and would like to understand how to count the number of unique first-time users per day.
The table has only two columns: user_id and start_time, a timestamp that indicates the time of use. If a user was already active on a previous day, that user_id should not be counted.
Why does the following query not work? Shouldn't it be possible to select distinct on two variables at once?
SELECT COUNT (DISTINCT min(start_time::date), user_id),
start_time::date as date
FROM mytable
GROUP BY date
produces
ERROR: function count(date, integer) does not exist
The output would look like this
date count
1 2017-11-22 56
2 2017-11-23 73
3 2017-11-24 13
4 2017-11-25 91
5 2017-11-26 107
6 2017-11-27 33...
Any suggestions about how to count distinct min Date and user_id and then group by date in psql would be appreciated.
Try this
select start_time,count(*) as count from
(
select user_id,min(start_time::date) as start_time
from mytable
group by user_id
)distinctRecords
group by start_time;
This will count each user only once for min date.
You may try this logic:
First, find the first login time of each user_id: MIN(start_time).
Then join these results with the main table and increment the count for a user only if the user has not logged in before. COUNT does not add 1 for a record when its argument is NULL.
SQL Fiddle
PostgreSQL 9.6 Schema Setup:
CREATE TABLE yourtable
(user_id int, start_time varchar(19))
;
INSERT INTO yourtable
(user_id, start_time)
VALUES
(1, '2018-03-19 08:05:01'),
(2, '2018-03-19 08:05:01'),
(1, '2018-03-19 08:05:04'),
(3, '2018-03-19 08:05:01'),
(1, '2018-03-20 08:05:04'),
(2, '2018-03-20 08:05:04'),
(4, '2018-03-20 08:05:04'),
(3, '2018-03-20 08:05:06'),
(3, '2018-03-20 08:05:04'),
(3, '2018-03-20 08:05:05'),
(1, '2018-03-21 08:05:06'),
(3, '2018-03-21 08:05:05'),
(6, '2018-03-21 08:05:06'),
(3, '2018-03-22 08:05:05'),
(4, '2018-03-22 08:05:05'),
(5, '2018-03-23 08:05:05')
;
Query 1:
WITH f
AS ( SELECT user_id, MIN (start_time) first_start_time
FROM yourtable
GROUP BY user_id)
SELECT t.start_time::DATE
,count( CASE WHEN t.start_time > f.first_start_time
THEN NULL ELSE 1 END )
FROM yourtable t JOIN f ON t.user_id = f.user_id
GROUP BY start_time::DATE
ORDER BY 1
Results:
| start_time | count |
|------------|-------|
| 2018-03-19 | 3 |
| 2018-03-20 | 1 |
| 2018-03-21 | 1 |
| 2018-03-22 | 0 |
| 2018-03-23 | 1 |
You can use the following query:
select count(user_id) total_user, start_time
from (
SELECT min(date(start_time)) start_time, user_id
FROM mytable
GROUP BY user_id) tmp
group by start_time