Alternative to Datediff over()? - sql

I have data of this form:
user_id event started ended date
1 started 1 0 3/1/2018
1 ended 0 1 3/2/2018
2 started 1 0 3/5/2018
2 ended 0 1 3/22/2018
3 started 1 0 3/25/2018
There are other events and columns for 0/1 but they are irrelevant.
I am trying to get how long it takes each user to get from started to ended.
I tried datediff(day, case when started=1 then date end, case when ended=1 then date end), but since the two dates are on different rows it doesn't work. Something along the lines of datediff over() could work, but that is obviously not a valid function.
Thanks in advance!

Assuming that you can't end before you started, you simply need MIN & MAX as Windowed Aggregates:
select user_id,
datediff(day,
min(date) over (partition by user_id),
max(date) over (partition by user_id))
from myTable
where event in ('started', 'ended')
With this approach you can select any additional columns as well.
If one result row per user is enough, a simple aggregation works:
select user_id,
min(date) as started,
max(date) as ended,
datediff(day,
min(date),
max(date)) as duration
from myTable
where event in ('started', 'ended')
group by user_id
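To sanity-check what the aggregation returns for the sample rows in the question, here is a minimal Python sketch of the same min/max logic. The function name durations and the tuple layout are illustrative assumptions, not part of the original answer.

```python
from datetime import date

# Sample rows from the question: (user_id, event, date).
rows = [
    (1, "started", date(2018, 3, 1)),
    (1, "ended", date(2018, 3, 2)),
    (2, "started", date(2018, 3, 5)),
    (2, "ended", date(2018, 3, 22)),
    (3, "started", date(2018, 3, 25)),
]

def durations(rows):
    """Days between the earliest and latest started/ended date per user."""
    first, last = {}, {}
    for user_id, event, d in rows:
        if event not in ("started", "ended"):
            continue  # mirrors: where event in ('started', 'ended')
        first[user_id] = min(first.get(user_id, d), d)
        last[user_id] = max(last.get(user_id, d), d)
    return {u: (last[u] - first[u]).days for u in first}

result = durations(rows)
print(result)
```

Note that user 3, who has started but not ended, comes out as 0 days rather than as a missing value; the SQL behaves the same way, so filter such users out if that matters.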

You could inner join the table on itself using the user_id column:
SELECT a.[user_id]
, a.[date] AS StartDate
, b.EndDate
, DATEDIFF(DAY, a.[date], b.EndDate) AS DateDifference
FROM dbo.TableNameHere AS a
INNER JOIN
(
SELECT [user_id]
, [date] AS EndDate
FROM dbo.TableNameHere
WHERE [ended] = 1
) AS b
ON a.[user_id] = b.[user_id]
WHERE a.[started] = 1
In my example above, you don't really need any of the columns in the first SELECT besides the DateDifference, I just had them for visibility in my testing.

Related

GROUP BY with a condition on WHERE clause

I have the following query:
SELECT
Group as [Grupo],
COUNT(*) as [Total]
FROM
Table
WHERE
Status NOT IN ('Closed', 'Cancelled', 'Resolved') AND
DATEDIFF(day,Submit_Date,GETDATE()) > 30
GROUP BY
Group,
DATEDIFF(day,Submit_Date,GETDATE())
The objective is to get tickets with aging above 30 days. The output is:
Group Total
Group A 4
Group A 1
Group A 2
Group A 2
Group B 1
Group B 1
What I'm hoping to see:
Group Total
Group A 9
Group B 2
I might be missing something dumb here... Can someone help me with this one? Thanks
It looks like you just need to group by [Group] only:
SELECT
[Group] as [Grupo],
COUNT(*) as [Total]
FROM
[Table]
WHERE
Status NOT IN ('Closed', 'Cancelled', 'Resolved') AND
DATEDIFF(day,Submit_Date,GETDATE()) > 30
GROUP BY
[Group]
You need to fix the GROUP BY. These keys define each row and apparently you want one row per group.
I would also suggest fixing the date logic:
SELECT [Group] as [Grupo], COUNT(*) as [Total]
FROM [Table]
WHERE Status NOT IN ('Closed', 'Cancelled', 'Resolved') AND
Submit_Date < DATEADD(DAY, -30, CONVERT(DATE, GETDATE()))
GROUP BY [Group];
Avoiding the function call on Submit_Date should help the optimizer produce the best execution plan.
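As a rough check that comparing Submit_Date against a cutoff selects the same rows as the DATEDIFF version, here is a small Python sketch. It assumes DATEDIFF(day, ...) counts the date boundaries crossed (time of day is ignored); the helper names are hypothetical.

```python
from datetime import datetime, timedelta

def datediff_days(start, end):
    """Mimic DATEDIFF(day, start, end): number of date boundaries crossed."""
    return (end.date() - start.date()).days

def older_than_30_original(submit, now):
    # DATEDIFF(day, Submit_Date, GETDATE()) > 30
    return datediff_days(submit, now) > 30

def older_than_30_rewritten(submit, now):
    # Submit_Date < DATEADD(DAY, -30, CONVERT(DATE, GETDATE()))
    cutoff = datetime.combine(now.date() - timedelta(days=30), datetime.min.time())
    return submit < cutoff

now = datetime(2020, 5, 15, 14, 30)
# One sample per hour going back 40 days, so both sides of the cutoff are covered.
samples = [now - timedelta(hours=h) for h in range(24 * 40)]
agree = all(
    older_than_30_original(s, now) == older_than_30_rewritten(s, now)
    for s in samples
)
print(agree)
```

The two predicates agree on every sample, but only the second leaves Submit_Date bare on one side of the comparison.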

Date filtering in SQL

Table below consists of 2 columns: a unique identifier and date. I am trying to build a new column of episodes, where a new episode would be triggered when >= 3 months between dates. This process should occur for each unique EMID. In the table attached, EMID ending in 98 would only have 1 episode, there are no intervals >2 months between each row in the date column. However, EMID ending in 03 would have 2 episodes, as there is almost a 3 year gap between rows 12 and 13. I have tried the following code, which doesn't work.
Table:
SELECT TOP (1000) [EMID],[Date]
CASE
WHEN DATEDIFF(month, Date, LEAD Date) <3
THEN "1"
ELSE IF DATEDIFF(month, Date, LEAD Date) BETWEEN 3 AND 5
THEN "2"
ELSE "3"
END episode
FROM [res_treatment_escalation].[dbo].[cspine42920a]
EDIT: Using Microsoft SQL Server Management Studio.
EDIT 2: I have made some progress but the output is not exactly what I am looking for. Here is the query I used:
SELECT TOP (1000) [EMID],[visit_date_01],
CASE
WHEN DATEDIFF(DAY, visit_date_01, LAG(visit_date_01,1,getdate()) OVER (partition by EMID order by EMID)) <= 90 THEN '1'
WHEN DATEDIFF(DAY, visit_date_01, LAG(visit_date_01,1,getdate()) OVER (PARTITION BY EMID ORDER BY EMID)) BETWEEN 90 AND 179 THEN '2'
WHEN DATEDIFF(DAY, visit_date_01, LAG(visit_date_01,1,getdate()) OVER (PARTITION BY EMID order by EMID)) > 180 THEN '3'
END AS EPISODE
FROM [res_treatment_escalation].[dbo].['c-spine_full_dataset_4#29#20_wi$']
Here is the actual vs. expected output.
The partition by EMID does not seem to be working correctly. Every time there is a new EMID a new episode is triggered. I am using day instead of month as the unit in DATEDIFF, but this does not seem to recognize new episodes within the same EMID.
Hmmm: Use LAG() to get the previous date. Use a date comparison to assign a flag and then a cumulative sum:
select c.*,
sum(case when prev_date > dateadd(month, -3, date) then 0 else 1 end) over
(partition by emid order by date) as episode_number
from (select c.*, lag(date) over (partition by emid order by date) as prev_date
from res_treatment_escalation.dbo.cspine42920a c
) c;
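The flag-then-running-sum idea can be sketched outside SQL as well. The Python below is only an illustration under assumptions: the visit data is made up, and add_months is a hypothetical helper that clamps the day to the end of the month the way DATEADD(month, ...) does.

```python
import calendar
from datetime import date

def add_months(d, months):
    """Shift a date by whole months, clamping the day to the month's end."""
    index = d.year * 12 + (d.month - 1) + months
    year, month0 = divmod(index, 12)
    month = month0 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)

def episode_numbers(visits):
    """visits: (emid, date) pairs. Returns (emid, date, episode) tuples."""
    out = []
    prev = {}     # emid -> previous visit date (the LAG value)
    episode = {}  # emid -> running sum of the new-episode flag
    for emid, d in sorted(visits):
        p = prev.get(emid)
        # Flag is 1 on the first visit, or when the previous visit is
        # three or more months before the current one.
        if p is None or not (p > add_months(d, -3)):
            episode[emid] = episode.get(emid, 0) + 1
        prev[emid] = d
        out.append((emid, d, episode[emid]))
    return out

# Made-up visits: "B" has a gap of almost three years.
visits = [
    ("A", date(2015, 1, 10)), ("A", date(2015, 2, 20)), ("A", date(2015, 4, 1)),
    ("B", date(2014, 6, 1)), ("B", date(2014, 7, 15)),
    ("B", date(2017, 5, 3)), ("B", date(2017, 6, 1)),
]
episodes = [e for _, _, e in episode_numbers(visits)]
print(episodes)
```

Ordering by the visit date within each EMID is what makes the previous-row comparison meaningful; ordering by EMID alone, as in the question's second attempt, leaves the row order within a partition undefined.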

How to calculate number of days working on tasks if we have many tasks and the date range of each tasks could have overlap

I ran into a problem at work and would really appreciate it if anyone could give me some ideas.
We have a table that keeps track of the tasks each employee has finished. The table structure is as follows:
EmployeeNum | TaskID |Start Date of task | End Date of task
I want to calculate how many days each employee has invested in each task using this table. At first my code looks like this:
Select
EmployeeNum,TaskID,DateDiff(day,StartDate,EndDate)+1 as PureDay
from
TaskTable
Group by
EmployeeNum,TaskID
But then I found a problem that there are overlaps in the date range for each task.
For example, we have TaskA, TaskB, TaskC for one employee.
TaskA is from 2018-10-01 to 2018-10-05
TaskB from 2018-10-02 to 2018-10-07
TaskC from 2018-10-09 to 2018-10-10
In this way, the actual working days of this employee should be from 2018-10-01 to 2018-10-07, and then 2018-10-09 to 2018-10-10 which is 9 days. If I calculate date range of each task then add them together then actual working days become 5+6+2=13 days instead of 9.
I'm wondering if there is a good way to solve this overlapping problem? Thank you very much for any ideas!
The following query counts how many working days each employee spent on each task:
SELECT
EmployeeNum,
TaskID,
(DATEDIFF(dd, StartDate, EndDate) + 1)
-(DATEDIFF(wk, StartDate, EndDate) * 2)
-(CASE WHEN DATENAME(dw, StartDate) = 'Sunday' THEN 1 ELSE 0 END)
-(CASE WHEN DATENAME(dw, EndDate) = 'Saturday' THEN 1 ELSE 0 END) as PureDay
FROM
TaskTable
See this link for an explanation of how this computation works.
Once you know the date when a task starts, you can use a cumulative sum to assign a group to each record and then simply aggregate by that group (and other information).
The following query should do what you want:
with starts as (
select sm.*,
(case when exists (select 1
from tb_TaskMaster sm2
where sm2.EmpID = sm.EmpID and
sm2.StartDate < sm.StartDate and
sm2.EndDate >= sm.StartDate
)
then 0 else 1
end) as isstart
from tb_TaskMaster sm
)
select EmpID, count(TaskId) as cnt_TaskID, min(StartDate) as StartDate, max(EndDate) as EndDate,
datediff(Day, min(StartDate), max(EndDate)) + 1 as PureDay
from (select s.*, sum(isstart) over (partition by EmpID order by StartDate) as grp
from starts s
) s
group by EmpID, grp
order by EmpID
This db<>fiddle contains the DDL and DML for my example data and shows the query working.
You can try this.
I'm not sure it will work all the way, but you can give it a try :)
declare @table table (empid int,taskid nvarchar(50),startdate date, enddate date)
insert into @table
values
(1,'TaskA','2018-10-01','2018-10-05'),
(1,'TaskB','2018-10-02','2018-10-07'),
(1,'TaskC','2018-10-09','2018-10-10')
select *,case when comparedate > startdate then datediff(dd,comparedate,enddate) else datediff(dd,startdate,enddate)+1 end as countofworkingdays from (
Select empid,taskid,startdate,enddate,lag(enddate,1,'1900-01-01') over(partition by empid order by startdate) as CompareDate from @table
)x
This eliminates overlapping ranges by adjusting the start date based on all previous end dates:
with maxEndDates as
( -- find the maximum previous end date
Select empid,taskid,startdate,enddate,
max(EndDate)
over (partition by EmpID
order by StartDate, EndDate desc
rows between unbounded preceding and 1 preceding) as maxEndDate
from TaskTable
),
daysPerTask as
( -- calculate the difference based on the adjusted start date to eliminate overlapping days
select *,
case when maxEndDate >= enddate then 0 -- range already fully covered
when maxEndDate >= startdate then datediff(dd, maxEndDate, enddate) -- range partially overlapping (including a start on the previous end date)
else datediff(dd, startdate, enddate)+1 -- new range
end as dayCount
from maxEndDates
)
-- get the final count
select EmpID, sum(dayCount)
from daysPerTask
group by EmpID;
See db<>fiddle
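For readers who want to check the arithmetic on the example from the question, here is a minimal Python sketch of the same idea: walk the ranges in start order and only count days beyond the latest end date seen so far. The function name distinct_days is an assumption for illustration, and a range that starts exactly on the covered end date is treated as overlapping.

```python
from datetime import date

def distinct_days(ranges):
    """Count calendar days covered by inclusive (start, end) ranges."""
    total = 0
    covered_until = None  # latest end date seen so far
    for start, end in sorted(ranges):
        if covered_until is None or start > covered_until:
            total += (end - start).days + 1      # new, non-overlapping range
        elif end > covered_until:
            total += (end - covered_until).days  # only the uncovered tail
        # otherwise the range is already fully covered and adds nothing
        if covered_until is None or end > covered_until:
            covered_until = end
    return total

# Tasks A, B and C of one employee, as given in the question.
tasks = [
    (date(2018, 10, 1), date(2018, 10, 5)),
    (date(2018, 10, 2), date(2018, 10, 7)),
    (date(2018, 10, 9), date(2018, 10, 10)),
]
days = distinct_days(tasks)
print(days)
```

This gives 9 for the sample data, matching the expected count in the question, whereas summing the individual ranges gives 13.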
Thank you all very much for your responses and help. I found a solution while searching Stack Overflow; here is the question:
T-SQL date range in a table split and add the individual date to the table
The tally table suggested by Felix in that question is a great way to solve my problem, since I have millions of records and the real situation is really complicated.
Thank you all again for your help!

DB2 SQL Pairing Dates

I am trying to pair up dates that I am getting from my SQL. The output at the moment looks something like this:
start_date end_date
2015-02-02 2015-02-02
2015-02-02 2015-02-03
2015-02-03 2015-02-03
2015-04-12 2015-02-12
I would like the output to be paired up so that the smallest and the biggest date of each date group are chosen, so that the output would look like this:
start_date end_date
2015-02-02 2015-02-03
2015-04-12 2015-02-12
Using the first response I get something like the query below. I believe I have formatted it wrong: it does run, but I am getting the same date pairs as before.
select min(date), max(date)
from (select date,
sum(case when sum(inc) = 0 then 1 else 0 end) over (order by date desc) as grp
from (select t1.datev as date, 1 as inc
from table2 t1,
table3 c,
table4 cr
where t1.datev between date(c.e_start_date) and date(c.e_end_date)
and t1.datev not in (select date(temp.datev) from mdmins11.temp temp where temp.number < 4000 and temp.organisation_id = 11111)
and c.tp_cd in (1,6)
and cr.from_id = c.id
and cr.organisation_id = 11111
union all
select t.datev as date, -1 as inc
from table1 t,
table3 c,
table4 cr
where t.datev between date(c.e_start_date) and date(c.e_end_date)
and t.datev not in (select date(temp.datev) from mdmins11.temp temp where temp.number < 4000 and temp.organisation_id = 11111)
and c.tp_cd in (1,6)
and cr.from_id = c.id
and cr.organisation_id = 11111
) t
group by date
) t
group by grp;
One method is to determine where groups of non-overlapping dates start. For this, you can use not exists. Then count up this flag over all records. This uses window functions. However, this poses problems because you have multiple starts on the same date.
Another method is to keep track of starts and stops and note where the sum is zero. These represent boundaries between groups. The following should work on your data:
select min(date), max(date)
from (select date,
sum(case when sum(inc) = 0 then 1 else 0 end) over (order by date desc) as grp
from (select start_date as date, 1 as inc
from table
union all
select end_date as date, -1 as inc
from table
) t
group by date
) t
group by grp;
This type of problem is made more complicated when duplicate values are allowed on a given date. Given only the dates, this is challenging. With a separate unique id for each row, then there are more robust solutions.
EDIT:
A more robust solution:
select min(start_date), max(end_date)
from (select t.*, sum(StartGroupFlag) over (order by start_date) as grp
from (select t.*,
(case when not exists (select 1
from table t2
where t2.start_date < t.start_date and
t2.end_date >= t.start_date
)
then 1 else 0
end) as StartGroupFlag
from table t
) t
) t
group by grp;
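The start-flag approach can be traced by hand with a short Python sketch. This is an illustration under assumptions: the first three rows come from the question, the fourth row (2015-04-12 to 2015-04-12) is made up, and rows sharing a start_date receive the same running sum, as peers do in the window function's default frame.

```python
from datetime import date
from itertools import groupby

def pair_ranges(rows):
    """Merge (start_date, end_date) rows into one pair per overlapping group."""
    flagged = []
    for s, e in rows:
        # Flag is 0 when some row starts earlier and still covers this start.
        overlaps_earlier = any(s2 < s and e2 >= s for s2, e2 in rows)
        flagged.append((s, e, 0 if overlaps_earlier else 1))
    flagged.sort(key=lambda r: r[0])

    groups = {}
    running = 0  # cumulative sum of the flag, ordered by start_date
    for _, peers in groupby(flagged, key=lambda r: r[0]):
        peers = list(peers)
        running += sum(flag for _, _, flag in peers)
        for s, e, _ in peers:
            lo, hi = groups.get(running, (s, e))
            groups[running] = (min(lo, s), max(hi, e))
    return sorted(groups.values())

rows = [
    (date(2015, 2, 2), date(2015, 2, 2)),
    (date(2015, 2, 2), date(2015, 2, 3)),
    (date(2015, 2, 3), date(2015, 2, 3)),
    (date(2015, 4, 12), date(2015, 4, 12)),  # hypothetical row
]
pairs = pair_ranges(rows)
print(pairs)
```

The two rows starting on 2015-02-02 are both flagged as starts, but because they share a running sum they still land in the same group.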

query to display additional column based on aggregate value

I've been mulling over this problem for a couple of hours now with no luck, so I thought people on SO might be able to help :)
I have a table with data regarding processing volumes at stores. The first three columns shown below can be queried from that table. What I'm trying to do is add a 4th column that's basically a flag indicating whether a store has processed >=$150 and, if so, displays the corresponding date. The first instance where the store's cumulative volume reaches $150 is the date that gets displayed. Subsequent processing volumes don't count once the activated date has been hit. For example, for store 4, there's just one instance of the activated date.
store_id sales_volume date activated_date
----------------------------------------------------
2 5 03/14/2012
2 125 05/21/2012
2 30 11/01/2012 11/01/2012
3 100 02/06/2012
3 140 12/22/2012 12/22/2012
4 300 10/15/2012 10/15/2012
4 450 11/25/2012
5 100 12/03/2012
Any insights as to how to build out this fourth column? Thanks in advance!
The solution starts by calculating the cumulative sales. Then, you want the activation date only when the cumulative sales first pass through the $150 level. This happens when adding the current sales amount pushes the cumulative amount over the threshold. The following case expression handles this.
select t.store_id, t.sales_volume, t.date,
(case when 150 > cumesales - t.sales_volume and 150 <= cumesales
then date
end) as ActivationDate
from (select t.*,
sum(sales_volume) over (partition by store_id order by date) as cumesales
from t
) t
If you have an older version of Postgres that does not support cumulative sum, you can get the cumulative sales with a subquery like:
(select sum(sales_volume) from t t2 where t2.store_id = t.store_id and t2.date <= t.date) as cumesales
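To check the threshold-crossing condition against the sample data in the question, here is a minimal Python sketch. The function name activation_dates is an assumption for illustration, and the sketch assumes each store has at most one row per date.

```python
from datetime import date

def activation_dates(rows, threshold=150):
    """rows: (store_id, sales_volume, date). Appends the activation date or None."""
    out = []
    running = {}  # store_id -> cumulative sales so far
    for store_id, volume, d in sorted(rows, key=lambda r: (r[0], r[2])):
        before = running.get(store_id, 0)
        after = before + volume
        running[store_id] = after
        # Same test as the case expression: below the threshold before
        # this row, at or above it after adding this row.
        crossed = before < threshold <= after
        out.append((store_id, volume, d, d if crossed else None))
    return out

rows = [
    (2, 5, date(2012, 3, 14)), (2, 125, date(2012, 5, 21)), (2, 30, date(2012, 11, 1)),
    (3, 100, date(2012, 2, 6)), (3, 140, date(2012, 12, 22)),
    (4, 300, date(2012, 10, 15)), (4, 450, date(2012, 11, 25)),
    (5, 100, date(2012, 12, 3)),
]
activated = [(s, a) for s, _, _, a in activation_dates(rows) if a is not None]
print(activated)
```

Stores 2, 3 and 4 each get exactly one activation date and store 5 gets none, which matches the expected fourth column in the question.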
Variant 1
You can LEFT JOIN to a table that calculates the first date surpassing the 150 $ limit per store:
SELECT t.*, b.activated_date
FROM tbl t
LEFT JOIN (
SELECT store_id, min(thedate) AS activated_date
FROM (
SELECT store_id, thedate
,sum(sales_volume) OVER (PARTITION BY store_id
ORDER BY thedate) AS running_sum
FROM tbl
) a
WHERE running_sum >= 150
GROUP BY 1
) b ON t.store_id = b.store_id AND t.thedate = b.activated_date
ORDER BY t.store_id, t.thedate;
The calculation of the first day has to be done in two steps, since the window function accumulating the running sum has to be applied in a separate SELECT.
Variant 2
Another window function instead of the LEFT JOIN. May or may not be faster. Test with EXPLAIN ANALYZE.
SELECT *
,CASE WHEN running_sum >= 150 AND thedate = first_value(thedate)
OVER (PARTITION BY store_id, running_sum >= 150 ORDER BY thedate)
THEN thedate END AS activated_date
FROM (
SELECT *
,sum(sales_volume)
OVER (PARTITION BY store_id ORDER BY thedate) AS running_sum
FROM tbl
) b
ORDER BY store_id, thedate;
->sqlfiddle demonstrating both.