How to clean a SQL table based on startdate, enddate and effective date - SQL

I have a really dirty table in which I have a mix between the start date and a value's change effective date.
The table looks like this:
id  value  startdate   enddate     effective date
--------------------------------------------------
1   0.3    2020-10-07  2021-02-28  2020-07-01
1   1      2020-10-07  2021-02-28  2020-10-07
2   0.46   2021-01-01              2021-01-01
2   1      2021-01-01  2020-10-07  2021-05-01
3   1      2021-08-01              2021-08-01
4   1      2019-03-01              2019-03-01
4   0.5    2019-03-01              2020-08-01
4   0.7    2019-03-01              2021-05-01
When the enddate is empty it means that there is no change planned, and when the startdate is later than the effective date it means they deleted an older record and created a new one with other values.
My goal is to clean the table and get it sorted into something like this:
id  value  startdate_valid  enddate_valid
------------------------------------------
1   0.3    2020-07-01       2020-10-07
1   1      2020-10-07       2021-02-28
2   0.46   2021-01-01       2021-05-01
2   1      2021-05-01
3   1      2021-08-01
4   1      2019-03-01       2020-08-01
4   0.5    2020-08-01       2021-05-01
4   0.7    2021-05-01
Any idea of how I can achieve this?
EDIT:
I think I was able to get the startdate_valid value by using
MAX([effective date]) OVER(PARTITION BY id, YEAR([effective date]), MONTH([effective date]) ORDER BY [effective date])
This makes sense as I have the startdate included in the effective date, but I am still stuck on how to get the enddate_valid.

I have found a solution to my problem. I needed to do it in two steps, so if someone has a better solution, please share it and I will set it as correct.
SELECT
    *,
    COALESCE(
        LEAD(sub.StartDate_value) OVER (PARTITION BY sub.id ORDER BY sub.StartDate_value),
        sub.startdate) AS [EndDate_value]
FROM (
    SELECT
        id, value, startdate,
        COALESCE(
            MAX([effective date]) OVER (PARTITION BY id, YEAR([effective date]), MONTH([effective date]) ORDER BY [effective date]),
            startdate) AS StartDate_value
    FROM [table]
) sub
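For reference, the same two-step logic can be folded into a single statement with a CTE. This is only a sketch of the idea above (it assumes SQL Server syntax and a placeholder table name [dirty_table], and it falls back to the row's own enddate for the last period per id, a slight variation on the query above):
WITH starts AS (
    SELECT
        id,
        value,
        enddate,
        COALESCE(
            MAX([effective date]) OVER (PARTITION BY id, YEAR([effective date]), MONTH([effective date]) ORDER BY [effective date]),
            startdate) AS startdate_valid
    FROM [dirty_table]  -- placeholder table name, not from the original post
)
SELECT
    id,
    value,
    startdate_valid,
    -- a period ends where the next period for the same id begins;
    -- the last period per id keeps its own enddate (which may be NULL)
    COALESCE(
        LEAD(startdate_valid) OVER (PARTITION BY id ORDER BY startdate_valid),
        enddate) AS enddate_valid
FROM starts
ORDER BY id, startdate_valid;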


SQL: Calculate duration based on dates and parameters (Change Log)

I have a dataset that is like a ticketing system change log. I am trying to calculate something like an SLA time across the records: how long did a specific ticket sit with Group 2 before it was either resolved or moved to another group to resolve?
The data looks like so:
ID  field             value             start                end
------------------------------------------------------------------------------
1   assignment_group  Group 1           2022-03-21 08:00:00  2022-03-21 08:05:00
1   incident_state    Work in Progress  2022-03-21 08:05:00  2022-03-21 08:30:00
1   assignment_group  Group 2           2022-03-21 08:35:00  2022-03-21 08:50:00
1   assigned_to       User 1            2022-03-21 08:50:00  2022-03-21 08:51:00
1   incident_state    Work in Progress  2022-03-21 09:00:00  2022-03-21 09:30:00
1   incident_state    Resolved          2022-03-21 09:30:00  2022-03-21 09:31:00
2   assignment_group  Group 2           2022-01-21 11:30:00  2022-01-21 11:35:00
2   assigned_to       User 1            2022-01-21 11:35:00  2022-01-21 11:37:00
2   incident_state    Work in Progress  2022-01-21 11:40:00  2022-01-21 11:55:00
2   assignment_group  Group 3           2022-01-21 11:58:00  2022-01-21 12:00:00
2   assigned_to       User 2            2022-01-21 12:05:00  2022-01-21 12:06:00
2   incident_state    Resolved          2022-01-21 12:10:00  2022-01-21 12:07:00
The issue I am having is calculating the duration based on the start time the ticket was assigned to a specific group and the end time of when the ticket was either resolved by that group or moved to another group to resolve. For example, I am only interested in Group 2: for ticket 1, the ticket sat with Group 2 until it was resolved by Group 2, from 2022-03-21 08:35:00 to 2022-03-21 09:31:00, so the duration is 1 hour and 1 minute. But ticket 2 sat with Group 2 from 2022-01-21 11:30:00 until it was transferred to another group to resolve at 2022-01-21 11:58:00.
My code looks like this at the moment. I join two tables to pull in the ticket information and then the ticket state changes (every time an action is taken on that ticket), which leaves me with the table above. I guess I need to use a LEAD function, but I can't figure out how to get the correct end time for the correct record (when incident_state = Resolved OR assignment_group = another group):
WITH incidents AS (
    SELECT number, sys_id AS SYS_ID_INCIDENT
    FROM tables.ServiceIncidents
),
changes AS (
    SELECT id, start, field, field_value, value, `end`
    FROM tables.IncidentInstances
),
incident_changes AS (
    SELECT *,
           TIMESTAMP_DIFF(changes.`end`, changes.start, MINUTE) AS Duration,
           ROW_NUMBER() OVER (PARTITION BY number ORDER BY start) AS RN
    FROM incidents
    LEFT JOIN changes
           ON incidents.SYS_ID_INCIDENT = changes.id
),
IAMtickets AS (
    SELECT i.number, i.SYS_ID_INCIDENT, start, field, value, `end`, Duration, RN
    FROM incident_changes i
    INNER JOIN (SELECT DISTINCT number FROM incident_changes WHERE value = 'Group 2') r
            ON i.number = r.number
),
cte AS (
    SELECT *, LEAD(value) OVER (PARTITION BY number ORDER BY start) AS next_value
    FROM IAMtickets
)
SELECT * FROM cte
I want the output to be something like this:
ID  Assigned to  Duration         Outcome
-------------------------------------------
1   User 1       1 hour 1 minute  Resolved
2   User 1       28 minutes       Transferred
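No accepted answer is recorded here, but one possible direction, as a rough sketch only (it assumes BigQuery and a hypothetical table name change_log with the columns shown above; none of these names come from the original post), is to anchor on the first Group 2 assignment per ticket and then take the first later event that either resolves the ticket or reassigns it:
WITH group2_start AS (
  -- first time each ticket was assigned to Group 2
  SELECT ID, MIN(start) AS g2_start
  FROM change_log
  WHERE field = 'assignment_group' AND value = 'Group 2'
  GROUP BY ID
),
first_outcome AS (
  -- first later event that resolves the ticket or moves it to another group
  SELECT c.ID, c.start, c.`end`,
         IF(c.field = 'incident_state', 'Resolved', 'Transferred') AS outcome,
         ROW_NUMBER() OVER (PARTITION BY c.ID ORDER BY c.start) AS rn
  FROM change_log c
  JOIN group2_start g ON c.ID = g.ID AND c.start > g.g2_start
  WHERE (c.field = 'assignment_group' AND c.value != 'Group 2')
     OR (c.field = 'incident_state' AND c.value = 'Resolved')
)
SELECT f.ID,
       -- resolved: measure to the end of the resolving change;
       -- transferred: measure to the start of the reassignment
       TIMESTAMP_DIFF(IF(f.outcome = 'Resolved', f.`end`, f.start), g.g2_start, MINUTE) AS duration_minutes,
       f.outcome
FROM first_outcome f
JOIN group2_start g ON f.ID = g.ID
WHERE f.rn = 1;
Whether you measure to the start or the end of the closing event shifts the duration by a few minutes, so the boundaries may need adjusting to match the expected output exactly, and the Assigned to column would need a similar first-event lookup.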

CASE in WHERE Clause in Snowflake

I am trying to do a CASE statement within the WHERE clause in Snowflake, but I'm not quite sure how I should go about doing it.
What I'm trying to do is: if the current month is Jan, the WHERE clause for the date is between the start of the previous year and today. If not, the WHERE clause for the date is between the start of the current year and today.
WHERE
CASE MONTH(CURRENT_DATE()) = 1 THEN DATE BETWEEN DATE_TRUNC('YEAR', DATEADD(YEAR, -1, CURRENT_DATE())) AND CURRENT_DATE()
CASE MONTH(CURRENT_DATE()) != 1 THEN DATE BETWEEN DATE_TRUNC('YEAR', CURRENT_DATE()) AND CURRENT_DATE()
END
Appreciate any help on this!
Use a CASE expression that returns -1 if the current month is January or 0 for any other month, so that DATEADD() gives you a date in either the previous or the current year to use in DATE_TRUNC():
WHERE DATE BETWEEN
DATE_TRUNC('YEAR', DATEADD(YEAR, CASE WHEN MONTH(CURRENT_DATE()) = 1 THEN -1 ELSE 0 END, CURRENT_DATE()))
AND
CURRENT_DATE()
I suspect that you don't even need to use CASE here:
WHERE
(MONTH(CURRENT_DATE()) = 1 AND
DATE BETWEEN DATE_TRUNC('YEAR', DATEADD(YEAR, -1, CURRENT_DATE())) AND
CURRENT_DATE()) OR
(MONTH(CURRENT_DATE()) != 1 AND
DATE BETWEEN DATE_TRUNC('YEAR', CURRENT_DATE()) AND CURRENT_DATE())
So the other answers are quite good, but... the answer can be even simpler.
Making a little table to break down what is happening:
select
row_number() over (order by null) - 1 as rn,
dateadd('day', rn * 5, date_trunc('year',current_date())) as pretend_current_date,
DATEADD(YEAR, -1, pretend_current_date) as pcd_sub1,
month(pretend_current_date) as pcd_month,
DATE_TRUNC(year, iff(pcd_month = 1, pcd_sub1, pretend_current_date)) as _from,
pretend_current_date as _to
from table(generator(ROWCOUNT => 30))
order by rn;
this shows:
RN  PRETEND_CURRENT_DATE  PCD_SUB1    PCD_MONTH  _FROM       _TO
------------------------------------------------------------------
0   2022-01-01            2021-01-01  1          2021-01-01  2022-01-01
1   2022-01-06            2021-01-06  1          2021-01-01  2022-01-06
2   2022-01-11            2021-01-11  1          2021-01-01  2022-01-11
3   2022-01-16            2021-01-16  1          2021-01-01  2022-01-16
4   2022-01-21            2021-01-21  1          2021-01-01  2022-01-21
5   2022-01-26            2021-01-26  1          2021-01-01  2022-01-26
6   2022-01-31            2021-01-31  1          2021-01-01  2022-01-31
7   2022-02-05            2021-02-05  2          2022-01-01  2022-02-05
8   2022-02-10            2021-02-10  2          2022-01-01  2022-02-10
9   2022-02-15            2021-02-15  2          2022-01-01  2022-02-15
10  2022-02-20            2021-02-20  2          2022-01-01  2022-02-20
11  2022-02-25            2021-02-25  2          2022-01-01  2022-02-25
12  2022-03-02            2021-03-02  3          2022-01-01  2022-03-02
13  2022-03-07            2021-03-07  3          2022-01-01  2022-03-07
14  2022-03-12            2021-03-12  3          2022-01-01  2022-03-12
15  2022-03-17            2021-03-17  3          2022-01-01  2022-03-17
16  2022-03-22            2021-03-22  3          2022-01-01  2022-03-22
17  2022-03-27            2021-03-27  3          2022-01-01  2022-03-27
18  2022-04-01            2021-04-01  4          2022-01-01  2022-04-01
19  2022-04-06            2021-04-06  4          2022-01-01  2022-04-06
20  2022-04-11            2021-04-11  4          2022-01-01  2022-04-11
21  2022-04-16            2021-04-16  4          2022-01-01  2022-04-16
22  2022-04-21            2021-04-21  4          2022-01-01  2022-04-21
23  2022-04-26            2021-04-26  4          2022-01-01  2022-04-26
24  2022-05-01            2021-05-01  5          2022-01-01  2022-05-01
25  2022-05-06            2021-05-06  5          2022-01-01  2022-05-06
26  2022-05-11            2021-05-11  5          2022-01-01  2022-05-11
27  2022-05-16            2021-05-16  5          2022-01-01  2022-05-16
28  2022-05-21            2021-05-21  5          2022-01-01  2022-05-21
29  2022-05-26            2021-05-26  5          2022-01-01  2022-05-26
Your logic is asking "is the current date in the month of January?"; if so, take the prior year and truncate it to the year, otherwise take the current date and truncate it to the year, and use that as the start of a BETWEEN test.
This is the same as taking the current date, subtracting one month, and truncating that to the year.
Thus there is no need for any IFF or CASE
WHERE date BETWEEN DATE_TRUNC(year, DATEADD(month,-1, CURRENT_DATE())) AND CURRENT_DATE()
and if you would like to drop some parentheses, CURRENT_DATE can be written without them, so it can be even smaller:
WHERE date BETWEEN DATE_TRUNC(year, DATEADD(month,-1, CURRENT_DATE)) AND CURRENT_DATE

Rolling Sum Calculation Based on 2 Date Fields

Giving up after a few hours of failed attempts.
My data is in the following format - create_date can never be later than event_date.
I need to calculate, on a rolling n-day basis (let's say 3), the sum of units where the create_date and event_date fall within the same 3-day window. The data is illustrative, but each event_date can have over 500 different create_dates associated with it and the number isn't constant. There is also a possibility of event_dates missing.
So let's say for 2022-02-03, I only want to sum units where both the event_date and create_date values were between 2022-02-01 and 2022-02-03.
event_date  create_date  rowid  units
---------------------------------------
2022-02-01  2022-01-20   1      100
2022-02-01  2022-02-01   2      100
2022-02-02  2022-01-21   3      100
2022-02-02  2022-01-23   4      100
2022-02-02  2022-01-31   5      100
2022-02-02  2022-02-02   6      100
2022-02-03  2022-01-30   7      100
2022-02-03  2022-02-01   8      100
2022-02-03  2022-02-03   9      100
2022-02-05  2022-02-01   10     100
2022-02-05  2022-02-03   11     100
The output I'd need to get to is below (in brackets I've added the rows that need to be included in the calculation for each date, but my result only needs the numerical sum). I tried calculating using either date on its own, but neither returned the results I needed.
date        units
-------------------------------
2022-02-01  100 (Row 2)
2022-02-02  300 (Row 2,5,6)
2022-02-03  300 (Row 2,6,8,9)
2022-02-04  200 (Row 6,9)
2022-02-05  200 (Row 9,11)
In Python I solved this with a function that looped through and filtered a dataframe for each date, but I am struggling to do the same in SQL.
Thank you!
Consider the below approach:
with events_dates as (
select date from (
select min(event_date) min_date, max(event_date) max_date
from your_table
), unnest(generate_date_array(min_date, max_date)) date
)
select date, sum(units) as units, string_agg('' || rowid) rows_included
from events_dates
left join your_table
on create_date between date - 2 and date
and event_date between date - 2 and date
group by date
If applied to the sample data in your question, the output is one row per date with the summed units and the list of rowids that were included.

How do I use SQL window to sum rows with a condition

Assume this is my table:
id start_date event_date sales
------------------------------------
1 2020-09-09 2020-08-30 27.9
1 2020-09-09 2020-09-01 15
1 2020-09-09 2020-09-05 25
1 2020-09-09 2020-09-06 20.75
2 2020-09-09 2020-01-30 5
2 2020-09-09 2020-08-01 12
I'm trying to use a window function where I want to sum sales by event_date for the 7 days prior to the start date for each id, so the output I'm trying to reach looks like this...
id start_date event_date sales sales_7_days
-------------------------------------------------
1 2020-09-09 2020-08-30 27.9 0
1 2020-09-09 2020-09-01 15 0 <---- this is not within 7 days of start_date
1 2020-09-09 2020-09-05 25 25 <---- this is within 7 days of start_date
1 2020-09-09 2020-09-06 20.75 40.75 <---- this is within 7 days of start_date
2 2020-09-09 2020-01-30 5 0
2 2020-09-09 2020-09-03 12 12
This is what I've tried so far, but the problem is it seems to start summing from 7 days previous to event_date rather than start_date.
SELECT
id,
start_date,
event_date,
sales,
CASE WHEN event_date >= DATE_ADD(start_date, -7)
     THEN SUM(sales) OVER(PARTITION BY id ORDER BY event_date RANGE BETWEEN INTERVAL 7 DAYS PRECEDING AND CURRENT ROW)
     ELSE 0 END AS sales_7_days
FROM
sample_df
ORDER BY
id,
start_date,
event_date
So the query above is producing the below (which I don't want, because the window sum starts from event_date rather than start_date)
id start_date event_date sales sales_7_days
-------------------------------------------------
1 2020-09-09 2020-08-30 27.9 0
1 2020-09-09 2020-09-01 15 0
1 2020-09-09 2020-09-05 25 67.9
1 2020-09-09 2020-09-06 20.75 60.75
2 2020-09-09 2020-01-30 5 0
2 2020-09-09 2020-09-03 12 17
Does anybody have any tips here?
where I want to sum sales in event_date for 7 days prior to the start date for each id
Because the start date is constant for each id, this is a constant. You can calculate it as:
select s.*,
sum(case when event_date <= start_date and event_date >= start_date - interval 7 day
then sales
end) over (partition by id)
from sample_df s;
Your results suggest, though, that you really want a cumulative sum based on the event_date. That's fine, but a different question. The answer for that is to tweak the SQL:
select s.*,
sum(case when event_date <= start_date and event_date >= start_date - interval 7 day
then sales
end) over (partition by id order by event_date)
from sample_df s;
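If you need 0 instead of NULL for the rows that fall outside the 7-day window (as in the sample output), a small tweak to the second query should do it. This is a sketch only, keeping the same dialect as the answer above:
select s.*,
       sum(case when event_date <= start_date
                 and event_date >= start_date - interval 7 day
                then sales
                else 0   -- 0 rather than NULL outside the window
           end) over (partition by id order by event_date) as sales_7_days
from sample_df s;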

SQL Select up to a certain sum

I have been trying to figure out a way to write a SQL script that selects rows up to a given sum, and would appreciate any ideas on how to do so.
I am trying to do a stock valuation based on the dates of goods received. At month-end closing, the value of my stocks remaining in the warehouse would be a specified sum of the last received goods.
The below query is done by a couple of unions but reduces to:
SELECT DATE, W1 FROM Table
ORDER BY DATE DESC
Query result:
Row DATE W1
1 2019-02-28 00:00:00 13250
2 2019-02-28 00:00:00 42610
3 2019-02-28 00:00:00 41170
4 2019-02-28 00:00:00 13180
5 2019-02-28 00:00:00 20860
6 2019-02-28 00:00:00 19870
7 2019-02-28 00:00:00 37780
8 2019-02-28 00:00:00 47210
9 2019-02-28 00:00:00 32000
10 2019-02-28 00:00:00 41930
I have thought about solving this issue by calculating a cumulative sum as follows:
Row DATE W1 Cumulative Sum
1 2019-02-28 00:00:00 13250 13250
2 2019-02-28 00:00:00 42610 55860
3 2019-02-28 00:00:00 41170 97030
4 2019-02-28 00:00:00 13180 110210
5 2019-02-28 00:00:00 20860 131070
6 2019-02-28 00:00:00 19870 150940
7 2019-02-28 00:00:00 37780 188720
8 2019-02-28 00:00:00 47210 235930
9 2019-02-28 00:00:00 32000 267930
10 2019-02-28 00:00:00 41930 309860
However, I am stuck when figuring out a way to use a parameter to return only the rows of interest.
For example, if a parameter was specified as '120000', it would return the rows needed to reach a total of exactly 120000, with the final row's value reduced so the total matches.
Row DATE W1 Cumulative Sum W1_Select
1 2019-02-28 00:00:00 13250 13250 13250
2 2019-02-28 00:00:00 42610 55860 42610
3 2019-02-28 00:00:00 41170 97030 41170
4 2019-02-28 00:00:00 13180 110210 13180
5 2019-02-28 00:00:00 20860 131070 9790
----------
Total 120000
This just requires some arithmetic:
select t.*,
       (case when running_sum < #threshold then w1
             else #threshold - (running_sum - w1)
        end) as w1_select
from (select date, w1, sum(w1) over (order by date) as running_sum
      from t
     ) t
where running_sum - w1 < #threshold;
Actually, in your case, the dates are all the same. That is a bit counter-intuitive, but you need to order by the row column for this to work:
select t.*,
       (case when running_sum < #threshold then w1
             else #threshold - (running_sum - w1)
        end) as w1_select
from (select date, w1, sum(w1) over (order by row) as running_sum
      from t
     ) t
where running_sum - w1 < #threshold;
Here is a db<>fiddle.
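As a quick sanity check against the sample data with #threshold = 120000: rows 1 through 5 satisfy running_sum - w1 < 120000 (their prior running sums are 0, 13250, 55860, 97030 and 110210), row 5 is the first whose running sum of 131070 reaches the threshold, so it is capped at 120000 - 110210 = 9790, and the selected values 13250 + 42610 + 41170 + 13180 + 9790 add up to exactly 120000, matching the desired output.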