How do I calculate day difference using more than one date? - sql

I have the following table:
Table A:
ID  Transaction_Date         Cancel_Flag
1   2014-02-18 00:00:00.000  No
1   2014-02-18 00:00:00.000  No
1   2014-02-19 00:00:00.000  Yes
1   2014-05-20 00:00:00.000  No
1   2014-05-21 00:00:00.000  No
1   2014-05-22 00:00:00.000  Yes
1   2014-05-23 00:00:00.000  No
I want to calculate the day difference between the transaction_date (where cancel_flag = No) and the transaction_date (where cancel_flag = Yes).
If there is more than one row with cancel_flag = Yes, the day difference used should be the smallest one.
I want an output that looks like this:
ID  Transaction_Date         Cancel_Flag  Days_Since_Cancel
1   2014-02-18 00:00:00.000  No           -1
1   2014-02-18 00:00:00.000  No           -1
1   2014-02-19 00:00:00.000  Yes          0
1   2014-05-20 00:00:00.000  No           1
1   2014-05-21 00:00:00.000  No           -1
1   2014-05-22 00:00:00.000  Yes          0
1   2014-05-22 00:00:00.000  No           +1
1   2014-05-23 00:00:00.000  No           +2
Thanks in advance,

For each record, the only 'cancel' rows you are interested in are the one just before or the one just after the current row when the data set is sorted by transaction_date. Because of this, solutions involving window functions seem quite appropriate here.
For any given row, you can get the date of the prior cancel transaction by
max(Case When Cancel_Flag='Yes' Then transaction_date End)
Over (Partition By ID Order By Transaction_Date Rows Between Unbounded Preceding And Current Row)
, and the date of the following cancel transaction with
min(Case When Cancel_Flag='Yes' Then transaction_date End)
Over (Partition By ID Order By Transaction_Date Rows Between Current Row And Unbounded Following)
Just use each in a datediff() with the current row's transaction date, and you've got two possible results that you can select from to get the final result.
Select ID,Transaction_Date,Cancel_Flag,
  Case When prior_cancel is null or abs(next_cancel)<prior_cancel
       Then next_cancel Else prior_cancel End as Days_Since_Cancel
From (
  Select A.*,
    -- days since the most recent cancel at or before this row (zero or positive)
    datediff(day,
      max(Case When Cancel_Flag='Yes' Then transaction_date End)
        Over (Partition By ID Order By Transaction_Date Rows Between Unbounded Preceding And Current Row),
      Transaction_Date) as prior_cancel,
    -- days since the next cancel at or after this row (zero or negative)
    datediff(day,
      min(Case When Cancel_Flag='Yes' Then transaction_date End)
        Over (Partition By ID Order By Transaction_Date Rows Between Current Row And Unbounded Following),
      Transaction_Date) as next_cancel
  From Table_A A
) T
Order By ID,Transaction_Date
EDIT ADDITION
Note that, in place of min(...) you can use first_value(... Ignore Nulls) and in place of max(...) you can use last_value(... Ignore Nulls). These might be a tiny bit more efficient because while you cannot determine min & max without examining the entire window frame, in theory first and last can be determined without examining every element. These are always functionally equivalent when the Order By column and the min/max(column) are the same, in this case Transaction_Date.
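To try the nearest-cancel idea end to end, here is a minimal sketch that runs it against the seven sample rows in an in-memory SQLite database. It is not the answer's original dialect: SQLite has no datediff(), so julianday() subtraction stands in for it, and the helper name days_since_cancel is made up for this example. The sign convention assumed here is the one in the question's expected output (negative before a cancel, positive after).

```python
import sqlite3

# Rows from the question: (ID, Transaction_Date, Cancel_Flag).
ROWS = [
    (1, "2014-02-18", "No"),
    (1, "2014-02-18", "No"),
    (1, "2014-02-19", "Yes"),
    (1, "2014-05-20", "No"),
    (1, "2014-05-21", "No"),
    (1, "2014-05-22", "Yes"),
    (1, "2014-05-23", "No"),
]

QUERY = """
SELECT ID, Transaction_Date, Cancel_Flag,
       CASE WHEN prior_cancel IS NULL OR abs(next_cancel) < prior_cancel
            THEN next_cancel ELSE prior_cancel END AS Days_Since_Cancel
FROM (
    SELECT A.*,
           -- Days since the latest cancel at or before this row (>= 0).
           CAST(julianday(Transaction_Date) - julianday(
                MAX(CASE WHEN Cancel_Flag = 'Yes' THEN Transaction_Date END)
                OVER (PARTITION BY ID ORDER BY Transaction_Date
                      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
           ) AS INTEGER) AS prior_cancel,
           -- Days since the next cancel at or after this row (<= 0).
           CAST(julianday(Transaction_Date) - julianday(
                MIN(CASE WHEN Cancel_Flag = 'Yes' THEN Transaction_Date END)
                OVER (PARTITION BY ID ORDER BY Transaction_Date
                      ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
           ) AS INTEGER) AS next_cancel
    FROM Table_A A
) AS T
ORDER BY ID, Transaction_Date
"""


def days_since_cancel(rows):
    """Load the rows into SQLite and return the query result as a list of tuples."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE Table_A (ID INTEGER, Transaction_Date TEXT, Cancel_Flag TEXT)"
    )
    con.executemany("INSERT INTO Table_A VALUES (?, ?, ?)", rows)
    result = con.execute(QUERY).fetchall()
    con.close()
    return result


for row in days_since_cancel(ROWS):
    print(row)
```

For the seven input rows this yields -1, -1, 0, -2, -1, 0, 1 in the last column. A row that falls exactly halfway between two cancels takes the positive (prior) value, because the comparison uses a strict less-than.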

Related

Counting Sick days over the weekend

I'm trying to solve a problem in the following (simplified) dataset:
Name     Date        Workday  Calenderday  Leave
PersonA  2023-01-01  0        1            NULL
PersonA  2023-01-07  0        1            NULL
PersonA  2023-01-08  0        1            NULL
PersonA  2023-01-13  1        1            Sick
PersonA  2023-01-14  0        1            NULL
PersonA  2023-01-15  0        1            NULL
PersonA  2023-01-16  1        1            Sick
PersonA  2023-01-20  1        1            Holiday
PersonA  2023-01-21  0        1            NULL
PersonA  2023-01-22  0        1            NULL
PersonA  2023-01-23  1        1            Holiday
PersonB  2023-01-01  0        1            NULL
PersonB  2023-01-02  1        1            Sick
PersonB  2023-01-03  1        1            Sick
The rows with NULL in [Leave] are weekend days.
What I want is a result looking like this:
Name     Leave    PeriodStartDate  PeriodEndDate  Workdays  Weekdays
PersonA  Sick     2023-01-13       2023-01-16     2         4
PersonA  Holiday  2023-01-20       2023-01-23     2         4
PersonB  Sick     2023-01-02       2023-01-03     2         2
where the difference between [Workdays] and [Weekdays] is that weekdays also counts the weekend.
What I have been trying is to first make a row number (in two different ways):
ROW_NUMBER() OVER (PARTITION BY [Name] ORDER BY [Date]) as RowNo1
ROW_NUMBER() OVER (PARTITION BY [Name], [Leave] ORDER BY [Date]) as RowNo2
and after that to make a period base date:
DATEADD(DAY, 0 - [RowNo1], [Date]) as PeriodBaseDate1
,DATEADD(DAY, 0 - [RowNo2], [Date]) as PeriodBaseDate2
and after that do something like this:
MIN([Date]) as PeriodStartDate
,MAX([Date]) as PeriodEndDate
,SUM([Calenderday]) as Weekdays
,SUM([Workday]) as Workdays
GROUP BY [PeriodBaseDate (1 or 2?)], [Leave], [Name]
But whatever I do I can't seem to get it to count the weekends in the periods.
It doesn't have to include my try with the RowNo, PeriodBaseDate etc.
As we don't have your full attempt, I've provided a complete working solution. I first use LAST_VALUE so that every row carries the most recent non-NULL Leave value (provided there was one previously).
Once that is done, you have a gaps-and-islands problem and can aggregate based on that.
As no version details are given, I assume you are using SQL Server 2022, the latest version of SQL Server at the time of writing, and thus have access to the IGNORE NULLS syntax.
SELECT *
INTO dbo.YourTable
FROM (VALUES('PersonA',CONVERT(date,'2023-01-01'),0,1,NULL),
('PersonA',CONVERT(date,'2023-01-07'),0,1,NULL),
('PersonA',CONVERT(date,'2023-01-08'),0,1,NULL),
('PersonA',CONVERT(date,'2023-01-13'),1,1,'Sick'),
('PersonA',CONVERT(date,'2023-01-14'),0,1,NULL),
('PersonA',CONVERT(date,'2023-01-15'),0,1,NULL),
('PersonA',CONVERT(date,'2023-01-16'),1,1,'Sick'),
('PersonA',CONVERT(date,'2023-01-20'),1,1,'Holiday'),
('PersonA',CONVERT(date,'2023-01-21'),0,1,NULL),
('PersonA',CONVERT(date,'2023-01-22'),0,1,NULL),
('PersonA',CONVERT(date,'2023-01-23'),1,1,'Holiday'),
('PersonB',CONVERT(date,'2023-01-01'),0,1,NULL),
('PersonB',CONVERT(date,'2023-01-02'),1,1,'Sick'),
('PersonB',CONVERT(date,'2023-01-03'),1,1,'Sick'))V(Name,Date,Workday,Calenderday,Leave);
GO
WITH Leaves AS(
SELECT Name,
[Date],
Workday,
Calenderday, --It's spelt Calendar; you should correct this typographical error, as objects with typos lead to further problems.
--Leave,
LAST_VALUE(Leave) IGNORE NULLS OVER (PARTITION BY Name ORDER BY Date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS Leave
FROM dbo.YourTable YT),
LeaveGroups AS(
SELECT Name,
[Date],
Workday,
CalenderDay,
Leave,
ROW_NUMBER() OVER (PARTITION BY Name ORDER BY Date) -
ROW_NUMBER() OVER (PARTITION BY Name, Leave ORDER BY Date) AS Grp
FROM Leaves)
SELECT Name,
Leave,
MIN([Date]) AS PeriodStartDate,
MAX([Date]) AS PeriodEndDate,
SUM(WorkDay) AS WorkDays, --Assumes Workday is not a bit; if it is, CAST or CONVERT it to an int
DATEDIFF(DAY,MIN([Date]), MAX([Date]))+1 AS Weekdays
--SUM(CASE WHEN (DATEPART(WEEKDAY,[Date]) + @@DATEFIRST + 5) % 7 BETWEEN 0 AND 4 THEN 1 END) AS Weekdays --This method is language agnostic
FROM LeaveGroups
WHERE Leave IS NOT NULL
GROUP BY Name,
Leave,
Grp
ORDER BY Name,
PeriodStartDate;
GO
DROP TABLE dbo.YourTable;
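For readers without SQL Server 2022, the following is a rough sketch of a variation that runs on SQLite, which has no IGNORE NULLS. It skips the forward-fill step and applies the row-number difference only to the rows that have a Leave value, then derives Weekdays from the span between the first and last date of each period. This assumes, as in the sample data, that the only rows between two leave days of the same period are weekend rows. The helper name leave_periods is invented for the example.

```python
import sqlite3

# Sample data from the question: (Name, Date, Workday, Calenderday, Leave).
ROWS = [
    ("PersonA", "2023-01-01", 0, 1, None),
    ("PersonA", "2023-01-07", 0, 1, None),
    ("PersonA", "2023-01-08", 0, 1, None),
    ("PersonA", "2023-01-13", 1, 1, "Sick"),
    ("PersonA", "2023-01-14", 0, 1, None),
    ("PersonA", "2023-01-15", 0, 1, None),
    ("PersonA", "2023-01-16", 1, 1, "Sick"),
    ("PersonA", "2023-01-20", 1, 1, "Holiday"),
    ("PersonA", "2023-01-21", 0, 1, None),
    ("PersonA", "2023-01-22", 0, 1, None),
    ("PersonA", "2023-01-23", 1, 1, "Holiday"),
    ("PersonB", "2023-01-01", 0, 1, None),
    ("PersonB", "2023-01-02", 1, 1, "Sick"),
    ("PersonB", "2023-01-03", 1, 1, "Sick"),
]

QUERY = """
WITH LeaveRows AS (
    -- Consecutive leave rows of the same type share the same Grp value.
    SELECT Name, Date, Workday, Leave,
           ROW_NUMBER() OVER (PARTITION BY Name ORDER BY Date)
         - ROW_NUMBER() OVER (PARTITION BY Name, Leave ORDER BY Date) AS Grp
    FROM YourTable
    WHERE Leave IS NOT NULL
)
SELECT Name, Leave,
       MIN(Date) AS PeriodStartDate,
       MAX(Date) AS PeriodEndDate,
       SUM(Workday) AS Workdays,
       -- Calendar days from the first to the last day of the period, inclusive.
       CAST(julianday(MAX(Date)) - julianday(MIN(Date)) AS INTEGER) + 1 AS Weekdays
FROM LeaveRows
GROUP BY Name, Leave, Grp
ORDER BY Name, PeriodStartDate
"""


def leave_periods(rows):
    """Return one tuple per leave period for the given rows."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE YourTable "
        "(Name TEXT, Date TEXT, Workday INTEGER, Calenderday INTEGER, Leave TEXT)"
    )
    con.executemany("INSERT INTO YourTable VALUES (?, ?, ?, ?, ?)", rows)
    result = con.execute(QUERY).fetchall()
    con.close()
    return result


for period in leave_periods(ROWS):
    print(period)
```

On the sample data this produces the three periods from the question: PersonA Sick (4 calendar days), PersonA Holiday (4) and PersonB Sick (2).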
I am not sure what you are trying to do. Based on what I understood, below script gives the expected output.
SELECT Name, Leave, Min(Date) PeriodStartDate,Max(Date) PeriodEndDate, SUM(Workday) Workdays, DATEDIFF(DAY,Min(Date),Max(Date))+ 1 Weekdays from YourTable
WHERE Leave IS NOT NULL
GROUP BY Name, Leave

redshift cumulative count records via SQL

I've been struggling to find an answer to this question. I think this question is similar to what I'm looking for, but when I tried it, it didn't work.
Because there's no new unique user_id added between 02-20 and 02-27, the cumulative count will be the same. Then for 02-27, there is a unique user_id which hasn't appeared on any previous dates (6)
Here's my input
date user_id
2020-02-20 1
2020-02-20 2
2020-02-20 3
2020-02-20 4
2020-02-20 4
2020-02-20 5
2020-02-21 1
2020-02-22 2
2020-02-23 3
2020-02-24 4
2020-02-25 4
2020-02-27 6
Output table:
date daily_cumulative_count
2020-02-20 5
2020-02-21 5
2020-02-22 5
2020-02-23 5
2020-02-24 5
2020-02-25 5
2020-02-27 6
This is what I tried, and the result is not quite what I want:
select
stat_date,count(DISTINCT user_id),
sum(count(DISTINCT user_id)) over (order by stat_date rows unbounded preceding) as cumulative_signups
from data_engineer_interview
group by stat_date
order by stat_date
It returns this instead:
date,count,cumulative_sum
2022-02-20,5,5
2022-02-21,1,6
2022-02-22,1,7
2022-02-23,1,8
2022-02-24,1,9
2022-02-25,1,10
2022-02-27,1,11
The problem with this task is that it could be done by comparing each row uniquely with all previous rows to see if there is a match in user_id. Since you are using Redshift I'll assume that your data table could be very large so attacking the problem this way will bog down in some form of a loop join.
You want to think about the problem differently to avoid this looping issue. If you derive a dataset with id and first_date_of_id you can then just do a cumulative sum sorted by date. Like this
select user_id, min("date") as first_date,
count(user_id) over (order by first_date rows unbounded preceding) as date_out
from data_engineer_interview
group by user_id
order by date_out;
This is untested and won't produce the full list of dates that you have in your example output but rather only the dates where new ids show up. If this is an issue it is simple to add in the additional dates with no count change.
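As a hedged illustration of the first-date idea, including the extra step of keeping dates on which no new id appears, the sketch below runs against SQLite instead of Redshift. The CTE names first_seen and new_per_day and the helper cumulative_users are invented for the example; the table and column names follow the question's input.

```python
import sqlite3

# Input from the question: (date, user_id).
ROWS = [
    ("2020-02-20", 1), ("2020-02-20", 2), ("2020-02-20", 3),
    ("2020-02-20", 4), ("2020-02-20", 4), ("2020-02-20", 5),
    ("2020-02-21", 1), ("2020-02-22", 2), ("2020-02-23", 3),
    ("2020-02-24", 4), ("2020-02-25", 4), ("2020-02-27", 6),
]

QUERY = """
WITH first_seen AS (
    -- One row per user: the first date the user appears.
    SELECT user_id, MIN(date) AS first_date
    FROM data_engineer_interview
    GROUP BY user_id
),
new_per_day AS (
    -- Every date in the data, with the number of users first seen that day.
    SELECT d.date, COUNT(f.user_id) AS new_users
    FROM (SELECT DISTINCT date FROM data_engineer_interview) d
    LEFT JOIN first_seen f ON f.first_date = d.date
    GROUP BY d.date
)
SELECT date,
       SUM(new_users) OVER (ORDER BY date
                            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
           AS daily_cumulative_count
FROM new_per_day
ORDER BY date
"""


def cumulative_users(rows):
    """Return (date, cumulative distinct user count) for each date in rows."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE data_engineer_interview (date TEXT, user_id INTEGER)")
    con.executemany("INSERT INTO data_engineer_interview VALUES (?, ?)", rows)
    result = con.execute(QUERY).fetchall()
    con.close()
    return result


for row in cumulative_users(ROWS):
    print(row)
```

Each user is counted once, on the first date it appears, so the running sum stays at 5 from 2020-02-20 through 2020-02-25 and becomes 6 on 2020-02-27.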
We can do this via a correlated subquery followed by aggregation:
WITH cte AS (
SELECT
date,
CASE WHEN EXISTS (
SELECT 1
FROM data_engineer_interview d2
WHERE d2.date < d1.date AND
d2.user_id = d1.user_id
) THEN 0 ELSE 1 END AS flag
FROM (SELECT DISTINCT date, user_id FROM data_engineer_interview) d1
)
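SELECT date,
SUM(SUM(flag)) OVER (ORDER BY date ROWS UNBOUNDED PRECEDING) AS daily_cumulative_count -- running total of each day's new users
FROM cte
GROUP BY date
ORDER BY date;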

SQL Troubleshooting Help on Table Structure

I'm attempting to calculate average number of days between a customer's 1st and 3rd purchase, but struggling to get the data ordered in a way that will allow me to calculate.
I currently have the below data table. (Note: Order sequence number refers to the number order for that customer.)
Order Date  Customer Number  Order Sequence Number
2020-09-20  1                1
2021-01-20  1                2
2021-01-21  1                3
2020-10-01  2                1
2020-08-06  3                1
2020-09-06  3                2
2020-09-09  3                3
I've been trying to get the data to look like the following table. [To then be able to calculate datediff on the last two columns.]
Customer Number  Order Count  First Order Date  Third Order Date
1                3            2020-09-20        2021-01-21
2                1            2020-10-01        Null
3                3            2020-08-06        2020-09-09
I've completely messed up the code, but here's what I've been trying.
CREATE TABLE X2 as
SELECT
customer_number,
max(order_sequence_number) as order_count,
CASE
WHEN order_sequence_number = 1 then order_date
ELSE null
END as first_order_date,
CASE
WHEN order_sequence_number = 3 then order_date
ELSE null
END as third_order_date
FROM X1
GROUP BY customer_number;
Can someone please tell me what I'm missing? Thanks in advance!
You are on the right track but you need aggregation functions:
SELECT customer_number,
max(order_sequence_number) as order_count,
MAX(CASE WHEN order_sequence_number = 1 THEN order_date END) as first_order_date,
MAX(CASE WHEN order_sequence_number = 3 THEN order_date END) as third_order_date
FROM X1
GROUP BY customer_number;
To get the difference in days, you would just subtract the two expressions using whatever date arithmetic is supported in your database.
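As one possible way to finish the calculation, the sketch below runs the conditional-aggregation query on the sample data in SQLite and then averages the day difference. The date arithmetic via julianday() is specific to SQLite (other databases have their own functions), and the helper name avg_days_first_to_third is made up for the example.

```python
import sqlite3

# Sample data from the question: (order_date, customer_number, order_sequence_number).
ROWS = [
    ("2020-09-20", 1, 1),
    ("2021-01-20", 1, 2),
    ("2021-01-21", 1, 3),
    ("2020-10-01", 2, 1),
    ("2020-08-06", 3, 1),
    ("2020-09-06", 3, 2),
    ("2020-09-09", 3, 3),
]

PIVOT = """
SELECT customer_number,
       MAX(order_sequence_number) AS order_count,
       MAX(CASE WHEN order_sequence_number = 1 THEN order_date END) AS first_order_date,
       MAX(CASE WHEN order_sequence_number = 3 THEN order_date END) AS third_order_date
FROM X1
GROUP BY customer_number
"""

# AVG ignores NULLs, so customers without a third order drop out of the average.
AVERAGE = f"""
SELECT AVG(julianday(third_order_date) - julianday(first_order_date))
FROM ({PIVOT}) AS per_customer
"""


def avg_days_first_to_third(rows):
    """Return (per-customer rows, average days between 1st and 3rd order)."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE X1 (order_date TEXT, customer_number INTEGER, "
        "order_sequence_number INTEGER)"
    )
    con.executemany("INSERT INTO X1 VALUES (?, ?, ?)", rows)
    per_customer = con.execute(PIVOT + " ORDER BY customer_number").fetchall()
    average = con.execute(AVERAGE).fetchone()[0]
    con.close()
    return per_customer, average


per_customer, average = avg_days_first_to_third(ROWS)
for row in per_customer:
    print(row)
print("average days:", average)
```

With the sample data, customer 1 takes 123 days and customer 3 takes 34 days, so the average is 78.5. Customer 2 has no third order and does not contribute.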

How to get rolling MIN number for all rest rows(include current rows) BY category

I have a data table as below, which is sorted by date, route_number and sequence.
Delivery Date Order_ID Route_Number Stop # Sequence Min Stop# Formula
12/11/2017 Z11 100201 2 1 1 MIN(D2:$D$6)
12/11/2017 Z12 100201 1 2 1 MIN(D3:$D$6)
12/11/2017 Z13 100201 3 3 3 MIN(D4:$D$6)
12/11/2017 Z14 100201 5 4 4 MIN(D5:$D$6)
12/11/2017 Z15 100201 4 5 4 MIN(D6:$D$6)
What I am trying to do: in my SQL query, how can I get the column Min Stop# the way I can in Excel?
The logic is: give me the min stop# from the current row through all remaining rows in the same route_number and delivery date. I am thinking of something like Partition by delivery_date, route_number.
Does anyone have any ideas?
Thanks
Use min window function.
select t.*,min(stop) over(partition by route_number,delivery_date
order by sequence rows between current row
and unbounded following) as min_stop
from tbl t
min(stop) over (partition by route_number, delivery_date
order by sequence rows between current row and unbounded following)
or
min(stop) over (partition by route_number, delivery_date
order by sequence desc rows between unbounded preceding and current row)
which can be simplified to
min(stop) over (partition by route_number, delivery_date
order by sequence desc) m2
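because when the over clause has an order by and no explicit frame, the default frame is range between unbounded preceding and current row, which returns the same result as the rows form as long as sequence is unique within each route_number and delivery_date.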
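A quick way to check that the forward-looking frame and the descending-order form agree is to run both side by side. The sketch below does that in SQLite on the five sample rows; the table name tbl and column names follow the answers, the dates are rewritten in ISO form, and the helper rolling_min is invented for the example.

```python
import sqlite3

# Sample rows from the question:
# (delivery_date, order_id, route_number, stop, sequence).
ROWS = [
    ("2017-12-11", "Z11", 100201, 2, 1),
    ("2017-12-11", "Z12", 100201, 1, 2),
    ("2017-12-11", "Z13", 100201, 3, 3),
    ("2017-12-11", "Z14", 100201, 5, 4),
    ("2017-12-11", "Z15", 100201, 4, 5),
]

QUERY = """
SELECT order_id,
       -- Explicit frame: current row through the end of the partition.
       MIN(stop) OVER (PARTITION BY route_number, delivery_date
                       ORDER BY sequence
                       ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS min_stop,
       -- Reversed order with the default frame.
       MIN(stop) OVER (PARTITION BY route_number, delivery_date
                       ORDER BY sequence DESC) AS m2
FROM tbl
ORDER BY sequence
"""


def rolling_min(rows):
    """Return (order_id, min_stop, m2) for each row, ordered by sequence."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE tbl (delivery_date TEXT, order_id TEXT, "
        "route_number INTEGER, stop INTEGER, sequence INTEGER)"
    )
    con.executemany("INSERT INTO tbl VALUES (?, ?, ?, ?, ?)", rows)
    result = con.execute(QUERY).fetchall()
    con.close()
    return result


for row in rolling_min(ROWS):
    print(row)
```

Both columns come out as 1, 1, 3, 4, 4, which matches the Min Stop# column in the question.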

Need to get the minimum start date and maximum end date, when there is no break in months

I have 7 rows as shown below:
Column1 Start_date end_date Row_number
1 2014-02-01 2014-02-28 1
1 2014-03-01 2014-03-31 2
1 2014-04-01 2014-04-30 3
1 2014-05-01 2014-05-31 4
1 2014-07-01 2014-07-31 5
1 2015-02-01 2015-02-28 6
1 2015-03-01 2015-03-31 7
I need a result like the one below:
Column1 Start_date end_date
1 2014-02-01 2014-05-31
1 2014-07-01 2014-07-31
1 2015-02-01 2015-03-31
So when the end_date of one row is one day before the start_date of the next row, I need to group all such continuous rows and get the result shown above. I need to do this using SQL only. Please let me know if anyone has an idea how to solve this.
In the input, the first 4 rows are continuous, the 5th row is on its own, and the 6th and 7th rows are continuous.
Thanks in advance.
The trick here is that you need to first filter out only entries that are the ends of an interval, and then merge them together, rather than trying to keep a running count in one go.
I don't know what flavour of SQL you're running, and I have no idea what Column1 is meant to signify, but this should do the trick (written in SQL Server flavour; the only functions you would need to adjust are DATEADD and ISNULL).
SELECT DISTINCT
CASE WHEN Q1.IsStart = 1
THEN Q1.start_date
ELSE LAG(start_date) OVER(ORDER BY Q1.Row_number) END AS start_date,
CASE WHEN Q1.IsEnding = 1
THEN Q1.end_date
ELSE LEAD(end_date) OVER(ORDER BY Q1.Row_number) END AS end_date
FROM
(SELECT
start_date,
end_date,
Row_number,
CASE WHEN DATEADD(day,1,end_date) =
ISNULL(LEAD(start_date) OVER(ORDER BY Row_number),
end_date)
THEN 0
ELSE 1 END AS IsEnding,
CASE WHEN DATEADD(day,-1,start_date) =
ISNULL(LAG(end_date) OVER(ORDER BY Row_number),
start_date)
THEN 0
ELSE 1 END AS IsStart
FROM table1) Q1
WHERE Q1.IsEnding = 1 OR Q1.IsStart = 1
For ANSI SQL/For those of you without LAG or LEAD:
SELECT
StartDates.start_date,
MIN(EndDates.end_date)
FROM
(SELECT
MainEntry.start_date,
MainEntry.row_number
FROM
mytable MainEntry
LEFT OUTER JOIN mytable PrevEntry ON PrevEntry.row_number + 1 = MainEntry.row_number -- the row just before MainEntry
WHERE
PrevEntry.end_date IS NULL OR
EXTRACT(day FROM (MainEntry.start_date - PrevEntry.end_date)) > 1) StartDates
INNER JOIN
(SELECT
MainEntry.end_date,
MainEntry.row_number
FROM
mytable MainEntry
LEFT OUTER JOIN mytable NextEntry ON NextEntry.row_number - 1 = MainEntry.row_number -- the row just after MainEntry
WHERE
NextEntry.start_date IS NULL OR
EXTRACT(day FROM (NextEntry.start_date - MainEntry.end_date)) > 1) EndDates
ON StartDates.row_number <= EndDates.row_number
GROUP BY
StartDates.start_date
Note that the GROUP BY could contain StartDates.row_number if that takes advantage of an index. Also note that this ANSI solution initially missed the edge cases of rows without any pairs (had INNER JOINs inside the subqueries).
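For comparison, here is a minimal sketch of a different, commonly used gaps-and-islands formulation, not the method of the answer above: flag each row that starts a new run with LAG, turn the flags into a group number with a running SUM, then aggregate per group. It runs on SQLite, where date(x, '+1 day') replaces DATEADD. The names flagged, grouped, is_start, grp and merge_ranges are invented for the example.

```python
import sqlite3

# Sample rows from the question: (column1, start_date, end_date).
ROWS = [
    (1, "2014-02-01", "2014-02-28"),
    (1, "2014-03-01", "2014-03-31"),
    (1, "2014-04-01", "2014-04-30"),
    (1, "2014-05-01", "2014-05-31"),
    (1, "2014-07-01", "2014-07-31"),
    (1, "2015-02-01", "2015-02-28"),
    (1, "2015-03-01", "2015-03-31"),
]

QUERY = """
WITH flagged AS (
    -- is_start = 1 when this row does not begin the day after the previous row ends.
    SELECT column1, start_date, end_date,
           CASE WHEN date(LAG(end_date) OVER (PARTITION BY column1
                                              ORDER BY start_date), '+1 day') = start_date
                THEN 0 ELSE 1 END AS is_start
    FROM table1
),
grouped AS (
    -- A running total of the start flags numbers the continuous runs.
    SELECT column1, start_date, end_date,
           SUM(is_start) OVER (PARTITION BY column1 ORDER BY start_date
                               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS grp
    FROM flagged
)
SELECT column1, MIN(start_date) AS start_date, MAX(end_date) AS end_date
FROM grouped
GROUP BY column1, grp
ORDER BY column1, start_date
"""


def merge_ranges(rows):
    """Return one (column1, start_date, end_date) tuple per continuous run."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE table1 (column1 INTEGER, start_date TEXT, end_date TEXT)")
    con.executemany("INSERT INTO table1 VALUES (?, ?, ?)", rows)
    result = con.execute(QUERY).fetchall()
    con.close()
    return result


for row in merge_ranges(ROWS):
    print(row)
```

On the sample rows this returns the three ranges the question asks for. Partitioning by column1 is an assumption here, since the question does not say what that column means.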