SQL: Matching many-to-many dates for ID field

Edit: Fixed Start Date for User 2
I have a list of user IDs, each having many start dates and many end dates.
A start date can be recorded many times after the "actual" start date of an "event", and the same goes for the end date.
The result should be the first start date and the first end date for each user "event".
I hope that makes sense; see the example below.
Thanks!
Assuming the Following tables are given:
Start Table:
+--------+------------+
| UserID | Start      |
+--------+------------+
| 1      | 2019-01-01 |
| 1      | 2019-01-02 |
| 1      | 2019-01-03 |
| 1      | 2019-04-01 |
| 1      | 2019-04-02 |
| 1      | 2019-04-03 |
| 2      | 2019-06-01 |
| 2      | 2019-06-02 |
| 2      | 2019-10-01 |
| 2      | 2019-10-02 |
+--------+------------+
End Table:
+--------+------------+
| UserID | End        |
+--------+------------+
| 1      | 2019-03-01 |
| 1      | 2019-03-02 |
| 1      | 2019-03-03 |
| 1      | 2019-05-01 |
| 1      | 2019-05-02 |
| 1      | 2019-05-03 |
| 2      | 2019-08-01 |
| 2      | 2019-08-02 |
| 2      | 2019-12-01 |
| 2      | 2019-12-02 |
+--------+------------+
Result:
+--------+------------+------------+
| UserID | Start      | End        |
+--------+------------+------------+
| 1      | 2019-01-01 | 2019-03-01 |
| 1      | 2019-04-01 | 2019-05-01 |
| 2      | 2019-06-01 | 2019-08-01 |
| 2      | 2019-10-01 | 2019-12-01 |
+--------+------------+------------+

Not sure I agree with your 2019-10-02. Here is one solution:
Select UserID
      ,[Start] = min([Start])
      ,[End]
 From  (
        -- for each start date, find the first end date on or after it
        Select A.*
              ,[End] = (Select min([End])
                        From  EndTable
                        Where UserID = A.UserID
                          and [End] >= A.Start)
        From  StartTable A
       ) A
 Group By UserID, [End]   -- then keep the earliest start per (user, end)
Returns

UserID  Start       End
1       2019-01-01  2019-03-01
1       2019-04-01  2019-05-01
2       2019-06-01  2019-08-01
2       2019-10-01  2019-12-01
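To try it out, here is a minimal, hypothetical repro of the sample data (T-SQL syntax assumed, matching the bracketed identifiers in the query):

-- Hypothetical setup matching the sample tables above (T-SQL assumed)
Create Table StartTable (UserID int, [Start] date);
Create Table EndTable (UserID int, [End] date);

Insert Into StartTable Values
    (1, '2019-01-01'), (1, '2019-01-02'), (1, '2019-01-03'),
    (1, '2019-04-01'), (1, '2019-04-02'), (1, '2019-04-03'),
    (2, '2019-06-01'), (2, '2019-06-02'),
    (2, '2019-10-01'), (2, '2019-10-02');

Insert Into EndTable Values
    (1, '2019-03-01'), (1, '2019-03-02'), (1, '2019-03-03'),
    (1, '2019-05-01'), (1, '2019-05-02'), (1, '2019-05-03'),
    (2, '2019-08-01'), (2, '2019-08-02'),
    (2, '2019-12-01'), (2, '2019-12-02');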

Related

Redshift SQL - Count Sequences of Repeating Values Within Groups

I have a table that looks like this:
| id | date_start          | gap_7_days |
| -- | ------------------- | ---------- |
| 1  | 2021-06-10 00:00:00 | 0          |
| 1  | 2021-06-13 00:00:00 | 0          |
| 1  | 2021-06-19 00:00:00 | 0          |
| 1  | 2021-06-27 00:00:00 | 0          |
| 2  | 2021-07-04 00:00:00 | 1          |
| 2  | 2021-07-11 00:00:00 | 1          |
| 2  | 2021-07-18 00:00:00 | 1          |
| 2  | 2021-07-25 00:00:00 | 1          |
| 2  | 2021-08-01 00:00:00 | 1          |
| 2  | 2021-08-08 00:00:00 | 1          |
| 2  | 2021-08-09 00:00:00 | 0          |
| 2  | 2021-08-16 00:00:00 | 1          |
| 2  | 2021-08-23 00:00:00 | 1          |
| 2  | 2021-08-30 00:00:00 | 1          |
| 2  | 2021-08-31 00:00:00 | 0          |
| 2  | 2021-09-01 00:00:00 | 0          |
| 2  | 2021-08-08 00:00:00 | 1          |
| 2  | 2021-08-15 00:00:00 | 1          |
| 2  | 2021-08-22 00:00:00 | 1          |
| 2  | 2021-08-23 00:00:00 | 1          |
For each ID, I check whether consecutive date_start values are 7 days apart, and put a 1 or 0 in gap_7_days accordingly.
I want to do the following (using Redshift SQL only):
Get the length of each sequence of consecutive 1s in gap_7_days for each ID
Expected output:
| id | date_start          | gap_7_days | sequence_length |
| -- | ------------------- | ---------- | --------------- |
| 1  | 2021-06-10 00:00:00 | 0          |                 |
| 1  | 2021-06-13 00:00:00 | 0          |                 |
| 1  | 2021-06-19 00:00:00 | 0          |                 |
| 1  | 2021-06-27 00:00:00 | 0          |                 |
| 2  | 2021-07-04 00:00:00 | 1          | 6               |
| 2  | 2021-07-11 00:00:00 | 1          | 6               |
| 2  | 2021-07-18 00:00:00 | 1          | 6               |
| 2  | 2021-07-25 00:00:00 | 1          | 6               |
| 2  | 2021-08-01 00:00:00 | 1          | 6               |
| 2  | 2021-08-08 00:00:00 | 1          | 6               |
| 2  | 2021-08-09 00:00:00 | 0          |                 |
| 2  | 2021-08-16 00:00:00 | 1          | 3               |
| 2  | 2021-08-23 00:00:00 | 1          | 3               |
| 2  | 2021-08-30 00:00:00 | 1          | 3               |
| 2  | 2021-08-31 00:00:00 | 0          |                 |
| 2  | 2021-09-01 00:00:00 | 0          |                 |
| 2  | 2021-08-08 00:00:00 | 1          | 4               |
| 2  | 2021-08-15 00:00:00 | 1          | 4               |
| 2  | 2021-08-22 00:00:00 | 1          | 4               |
| 2  | 2021-08-23 00:00:00 | 1          | 4               |
Get the number of sequences for each ID
Expected output:
| id | num_sequences |
| -- | ------------- |
| 1  | 0             |
| 2  | 3             |
How can I achieve this?
If you want the number of sequences, just look at the previous value. When the current value is "1" and the previous is NULL or 0, then you have a new sequence.
So:
select id,
       sum( (gap_7_days = 1 and coalesce(prev_gap_7_days, 0) = 0)::int ) as num_sequences
from (select t.*,
             lag(gap_7_days) over (partition by id order by date_start) as prev_gap_7_days
      from t
     ) t
group by id;
If you actually want the lengths of the sequences, as in the intermediate results, then ask a new question. That information is not needed for this question.
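That said, for completeness, here is one hedged sketch of how those lengths could be derived with the same lag/gaps-and-islands idea (assuming the table is t, Redshift syntax, and that date_start orders the rows reliably, which the repeated 2021-08-08 rows for id 2 suggest may need an extra tie-breaker column):

select id, date_start, gap_7_days,
       case when gap_7_days = 1
            then sum(gap_7_days) over (partition by id, grp)
       end as sequence_length
from (select t.*,
             -- every 0 closes an island, so a running count of 0s
             -- numbers the islands of consecutive 1s within each id
             sum(case when gap_7_days = 0 then 1 else 0 end)
                 over (partition by id order by date_start
                       rows unbounded preceding) as grp
      from t
     ) t;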

Join on minimum date between two dates - Spark SQL

I have a table of daily data and a table of monthly data. I'm trying to retrieve one daily record corresponding to each monthly record. The wrinkles are that some days are missing from the daily data and the field I care about, new_status, is sometimes null on the month_end_date.
month_df
| ID | month_end_date |
| -- | -------------- |
| 1  | 2019-07-31     |
| 1  | 2019-06-30     |
| 2  | 2019-10-31     |
daily_df
| ID | daily_date | new_status |
| -- | ---------- | ---------- |
| 1  | 2019-07-29 | 1          |
| 1  | 2019-07-30 | 1          |
| 1  | 2019-08-01 | 2          |
| 1  | 2019-08-02 | 2          |
| 1  | 2019-08-03 | 2          |
| 1  | 2019-06-29 | 0          |
| 1  | 2019-06-30 | 0          |
| 2  | 2019-10-30 | 5          |
| 2  | 2019-10-31 | NULL       |
| 2  | 2019-11-01 | 6          |
| 2  | 2019-11-02 | 6          |
I want to fuzzy join daily_df to month_df where daily_date is >= month_end_date and less than some buffer afterwards (say, 5 days), keeping only the record with the minimum daily date and a non-null new_status.
This post solves the issue using an OUTER APPLY in SQL Server, but that seems not to be an option in Spark SQL. I'm wondering if there's a method that is similarly computationally efficient that works in Spark.
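One route that usually works in Spark SQL is a windowed row number over the buffered join; a hedged sketch, assuming the table and column names above and a 5-day buffer:

SELECT ID, month_end_date, daily_date, new_status
FROM (SELECT m.ID, m.month_end_date, d.daily_date, d.new_status,
             ROW_NUMBER() OVER (PARTITION BY m.ID, m.month_end_date
                                ORDER BY d.daily_date) AS rn
      FROM month_df m
      JOIN daily_df d
        ON d.ID = m.ID
       AND d.daily_date >= m.month_end_date
       AND d.daily_date < DATE_ADD(m.month_end_date, 5)  -- assumed buffer
       AND d.new_status IS NOT NULL                      -- drop null statuses
     ) x
WHERE rn = 1

Because the range predicate caps each month-end at a handful of candidate daily rows, the join stays small before the window function picks the earliest match.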

Merge historical periods of a dimension entity into one

I have a Slowly Changing Dimension type 2 with rows that are identical (besides the start and end date). How do I write a pretty SQL query to merge rows that are identical and have connected time periods?
Current data
+-------------+---------------------+--------------+------------+
| DimensionID | DimensionAttribute  | RowStartDate | RowEndDate |
+-------------+---------------------+--------------+------------+
| 1           | SomeValue           | 2019-01-01   | 2019-01-31 |
| 1           | SomeValue           | 2019-02-01   | 2019-02-28 |
| 1           | AnotherValue        | 2019-03-01   | 2019-03-31 |
| 1           | SomeValue           | 2019-04-01   | 2019-04-30 |
| 1           | SomeValue           | 2019-05-01   | 2019-05-31 |
| 2           | SomethingElse       | 2019-01-01   | 2019-01-31 |
| 2           | SomethingElse       | 2019-02-01   | 2019-02-28 |
| 2           | SomethingElse       | 2019-03-01   | 2019-03-31 |
| 2           | CompletelyDifferent | 2019-04-01   | 2019-04-30 |
| 2           | SomethingElse       | 2019-05-01   | 2019-05-31 |
+-------------+---------------------+--------------+------------+
Result
+-------------+---------------------+--------------+------------+
| DimensionID | DimensionAttribute  | RowStartDate | RowEndDate |
+-------------+---------------------+--------------+------------+
| 1           | SomeValue           | 2019-01-01   | 2019-02-28 |
| 1           | AnotherValue        | 2019-03-01   | 2019-03-31 |
| 1           | SomeValue           | 2019-04-01   | 2019-05-31 |
| 2           | SomethingElse       | 2019-01-01   | 2019-03-31 |
| 2           | CompletelyDifferent | 2019-04-01   | 2019-04-30 |
| 2           | SomethingElse       | 2019-05-01   | 2019-05-31 |
+-------------+---------------------+--------------+------------+
For this version of the problem, I would use lag() to determine where the groups start, then a cumulative sum and aggregation:
select dimensionid, DimensionAttribute,
       min(row_start_date), max(row_end_date)
from (select t.*,
             sum(case when prev_red = dateadd(day, -1, row_start_date)
                      then 0 else 1
                 end) over (partition by dimensionid, DimensionAttribute
                            order by row_start_date) as grp
      from (select t.*,
                   lag(row_end_date) over (partition by dimensionid, DimensionAttribute
                                           order by row_start_date) as prev_red
            from t
           ) t
     ) t
group by dimensionid, DimensionAttribute, grp;
In particular, this will recognize gaps in the rows. It will only combine rows when they exactly fit together -- the previous end date is one day before the start date. This can be tweaked, of course, to allow a gap of 1 or 2 days or to allow overlaps.
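For example, a hedged sketch of the gap-tolerant variant: only the case condition changes, so that a previous end date within 2 days of the current start (or overlapping it) continues the group:

select dimensionid, DimensionAttribute,
       min(row_start_date), max(row_end_date)
from (select t.*,
             -- start a new group only when the previous end falls more than
             -- 2 days before the current start; overlaps also stay merged
             sum(case when prev_red >= dateadd(day, -3, row_start_date)
                      then 0 else 1
                 end) over (partition by dimensionid, DimensionAttribute
                            order by row_start_date) as grp
      from (select t.*,
                   lag(row_end_date) over (partition by dimensionid, DimensionAttribute
                                           order by row_start_date) as prev_red
            from t
           ) t
     ) t
group by dimensionid, DimensionAttribute, grp;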

Set a flag based on the value of another flag in the past hour

I have a table with the following design:
+------+-------------------------+-------------+
| Shop | Date                    | SafetyEvent |
+------+-------------------------+-------------+
| 1    | 2018-06-25 10:00:00.000 | 0           |
| 1    | 2018-06-25 10:30:00.000 | 1           |
| 1    | 2018-06-25 10:45:00.000 | 0           |
| 2    | 2018-06-25 11:00:00.000 | 0           |
| 2    | 2018-06-25 11:30:00.000 | 0           |
| 2    | 2018-06-25 11:45:00.000 | 0           |
| 3    | 2018-06-25 12:00:00.000 | 1           |
| 3    | 2018-06-25 12:30:00.000 | 0           |
| 3    | 2018-06-25 12:45:00.000 | 0           |
+------+-------------------------+-------------+
Basically at each shop, we track the date/time of a repair and flag if a safety event occurred. I want to add an additional column that tracks if a safety event has occurred in the last 8 hours at each shop. The end result will be like this:
+------+-------------------------+-------------+-------------------+
| Shop | Date                    | SafetyEvent | SafetyEvent8Hours |
+------+-------------------------+-------------+-------------------+
| 1    | 2018-06-25 10:00:00.000 | 0           | 0                 |
| 1    | 2018-06-25 10:30:00.000 | 1           | 1                 |
| 1    | 2018-06-25 10:45:00.000 | 0           | 1                 |
| 2    | 2018-06-25 11:00:00.000 | 0           | 0                 |
| 2    | 2018-06-25 11:30:00.000 | 0           | 0                 |
| 2    | 2018-06-25 11:45:00.000 | 0           | 0                 |
| 3    | 2018-06-25 12:00:00.000 | 1           | 1                 |
| 3    | 2018-06-25 12:30:00.000 | 0           | 1                 |
| 3    | 2018-06-25 12:45:00.000 | 0           | 1                 |
+------+-------------------------+-------------+-------------------+
I was trying to use DATEDIFF but couldn't figure out how to have it occur for each row.
This isn't particularly efficient, but you can use apply or a correlated subquery:
select t.*, t8.SafetyEvent8Hours
from t cross apply
     (select max(SafetyEvent) as SafetyEvent8Hours
      from t t2
      where t2.shop = t.shop and
            t2.date <= t.date and
            t2.date > dateadd(hour, -8, t.date)
     ) t8;
If you can rely on events being logged every 15 minutes, then a more efficient method is to use window functions:
select t.*,
       -- 8 hours of 15-minute readings = the current row plus the 31 preceding rows
       max(SafetyEvent) over (partition by shop
                              order by date
                              rows between 31 preceding and current row) as SafetyEvent8Hours
from t;

Get last value with delta from previous row

I have data
| account | type | position | created_date        |
|---------|------|----------|---------------------|
| 1       | 1    | 1        | 2016-08-01 00:00:00 |
| 2       | 1    | 2        | 2016-08-01 00:00:00 |
| 1       | 2    | 2        | 2016-08-01 00:00:00 |
| 2       | 2    | 1        | 2016-08-01 00:00:00 |
| 1       | 1    | 2        | 2016-08-02 00:00:00 |
| 2       | 1    | 1        | 2016-08-02 00:00:00 |
| 1       | 2    | 1        | 2016-08-03 00:00:00 |
| 2       | 2    | 2        | 2016-08-03 00:00:00 |
| 1       | 1    | 2        | 2016-08-04 00:00:00 |
| 2       | 1    | 1        | 2016-08-04 00:00:00 |
| 1       | 2    | 2        | 2016-08-07 00:00:00 |
| 2       | 2    | 1        | 2016-08-07 00:00:00 |
I need to get the last positions (account, type, position) and the delta from the previous position. I'm trying to use window functions, but I only get all rows and can't group them to get the last one.
SELECT
    account,
    type,
    FIRST_VALUE(position) OVER w AS position,
    FIRST_VALUE(position) OVER w - LEAD(position, 1, 0) OVER w AS delta,
    created_date
FROM table
WINDOW w AS (PARTITION BY account ORDER BY created_date DESC)
I get this result:
| account | type | position | delta | created_date        |
|---------|------|----------|-------|---------------------|
| 1       | 1    | 1        | 1     | 2016-08-01 00:00:00 |
| 1       | 1    | 2        | 1     | 2016-08-02 00:00:00 |
| 1       | 1    | 2        | 0     | 2016-08-04 00:00:00 |
| 1       | 2    | 2        | 2     | 2016-08-01 00:00:00 |
| 1       | 2    | 1        | -1    | 2016-08-03 00:00:00 |
| 1       | 2    | 2        | 1     | 2016-08-07 00:00:00 |
| 2       | 1    | 2        | 2     | 2016-08-01 00:00:00 |
| 2       | 2    | 1        | 1     | 2016-08-01 00:00:00 |
| and so on |
but I need only the last record for each account/type pair:
| account | type | position | delta | created_date        |
|---------|------|----------|-------|---------------------|
| 1       | 1    | 2        | 0     | 2016-08-04 00:00:00 |
| 1       | 2    | 2        | 1     | 2016-08-07 00:00:00 |
| 2       | 1    | 1        | 0     | 2016-08-04 00:00:00 |
| and so on |
Sorry for my bad language, and thanks for any help.
My "best" try:
WITH cte_delta AS (
    SELECT
        account,
        type,
        FIRST_VALUE(position) OVER w AS position,
        FIRST_VALUE(position) OVER w - LEAD(position, 1, 0) OVER w AS delta,
        created_date
    FROM table
    WINDOW w AS (PARTITION BY account ORDER BY created_date DESC)
),
cte_date AS (
    SELECT
        account,
        type,
        MAX(created_date) AS created_date
    FROM cte_delta
    GROUP BY account, type
)
SELECT cd.*
FROM cte_delta cd,
     cte_date ct
WHERE cd.account = ct.account
  AND cd.type = ct.type
  AND cd.created_date = ct.created_date
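That CTE approach can work, but here is a hedged, more direct sketch (assuming PostgreSQL, which the WINDOW clause suggests): partition by both account and type so each pair's history stays separate, compute the delta with LAG in ascending date order, and let DISTINCT ON keep the newest row per pair:

SELECT DISTINCT ON (account, type)
    account,
    type,
    position,
    position - LAG(position, 1, 0) OVER w AS delta,
    created_date
FROM table  -- placeholder name from the question; a real table name may need quoting
WINDOW w AS (PARTITION BY account, type ORDER BY created_date)
ORDER BY account, type, created_date DESC;

DISTINCT ON evaluates after the window functions, so each surviving row already carries the delta between the last position and the one before it.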