Merge historical periods of a dimension entity into one - SQL

I have a Slowly Changing Dimension type 2 with rows that are identical (besides the start and end dates). How do I write a clean SQL query to merge rows that are identical and have connected time periods?
Current data
+-------------+---------------------+--------------+------------+
| DimensionID | DimensionAttribute  | RowStartDate | RowEndDate |
+-------------+---------------------+--------------+------------+
| 1           | SomeValue           | 2019-01-01   | 2019-01-31 |
| 1           | SomeValue           | 2019-02-01   | 2019-02-28 |
| 1           | AnotherValue        | 2019-03-01   | 2019-03-31 |
| 1           | SomeValue           | 2019-04-01   | 2019-04-30 |
| 1           | SomeValue           | 2019-05-01   | 2019-05-31 |
| 2           | SomethingElse       | 2019-01-01   | 2019-01-31 |
| 2           | SomethingElse       | 2019-02-01   | 2019-02-28 |
| 2           | SomethingElse       | 2019-03-01   | 2019-03-31 |
| 2           | CompletelyDifferent | 2019-04-01   | 2019-04-30 |
| 2           | SomethingElse       | 2019-05-01   | 2019-05-31 |
+-------------+---------------------+--------------+------------+
Result
+-------------+---------------------+--------------+------------+
| DimensionID | DimensionAttribute  | RowStartDate | RowEndDate |
+-------------+---------------------+--------------+------------+
| 1           | SomeValue           | 2019-01-01   | 2019-02-28 |
| 1           | AnotherValue        | 2019-03-01   | 2019-03-31 |
| 1           | SomeValue           | 2019-04-01   | 2019-05-31 |
| 2           | SomethingElse       | 2019-01-01   | 2019-03-31 |
| 2           | CompletelyDifferent | 2019-04-01   | 2019-04-30 |
| 2           | SomethingElse       | 2019-05-01   | 2019-05-31 |
+-------------+---------------------+--------------+------------+

For this version of the problem, I would use lag() to determine where the groups start, then a cumulative sum and aggregation:
select DimensionID, DimensionAttribute,
       min(RowStartDate) as RowStartDate, max(RowEndDate) as RowEndDate
from (select t.*,
             sum(case when prev_red = dateadd(day, -1, RowStartDate)
                      then 0 else 1
                 end) over (partition by DimensionID, DimensionAttribute
                            order by RowStartDate) as grp
      from (select t.*,
                   lag(RowEndDate) over (partition by DimensionID, DimensionAttribute
                                         order by RowStartDate) as prev_red
            from t
           ) t
     ) t
group by DimensionID, DimensionAttribute, grp;
In particular, this will recognize gaps in the rows. It will only combine rows when they exactly fit together -- the previous end date is one day before the start date. This can be tweaked, of course, to allow a gap of 1 or 2 days or to allow overlaps.
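As an illustration of that tweak, here is a sketch (against the same placeholder table t) that treats rows as connected when the uncovered gap between the previous end date and the next start date is at most 2 days; overlaps also count as connected, and only the CASE condition changes:
-- Sketch: rows are connected if the previous end date is within 3 days
-- before the start date, i.e. the uncovered gap is at most 2 days;
-- otherwise a new group starts.
select DimensionID, DimensionAttribute,
       min(RowStartDate) as RowStartDate, max(RowEndDate) as RowEndDate
from (select t.*,
             sum(case when prev_red >= dateadd(day, -3, RowStartDate)
                      then 0 else 1
                 end) over (partition by DimensionID, DimensionAttribute
                            order by RowStartDate) as grp
      from (select t.*,
                   lag(RowEndDate) over (partition by DimensionID, DimensionAttribute
                                         order by RowStartDate) as prev_red
            from t
           ) t
     ) t
group by DimensionID, DimensionAttribute, grp;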

Related

Redshift SQL - Count Sequences of Repeating Values Within Groups

I have a table that looks like this:
| id | date_start | gap_7_days |
| -- | ------------------- | --------------- |
| 1 | 2021-06-10 00:00:00 | 0 |
| 1 | 2021-06-13 00:00:00 | 0 |
| 1 | 2021-06-19 00:00:00 | 0 |
| 1 | 2021-06-27 00:00:00 | 0 |
| 2 | 2021-07-04 00:00:00 | 1 |
| 2 | 2021-07-11 00:00:00 | 1 |
| 2 | 2021-07-18 00:00:00 | 1 |
| 2 | 2021-07-25 00:00:00 | 1 |
| 2 | 2021-08-01 00:00:00 | 1 |
| 2 | 2021-08-08 00:00:00 | 1 |
| 2 | 2021-08-09 00:00:00 | 0 |
| 2 | 2021-08-16 00:00:00 | 1 |
| 2 | 2021-08-23 00:00:00 | 1 |
| 2 | 2021-08-30 00:00:00 | 1 |
| 2 | 2021-08-31 00:00:00 | 0 |
| 2 | 2021-09-01 00:00:00 | 0 |
| 2 | 2021-08-08 00:00:00 | 1 |
| 2 | 2021-08-15 00:00:00 | 1 |
| 2 | 2021-08-22 00:00:00 | 1 |
| 2 | 2021-08-23 00:00:00 | 1 |
For each ID, I check whether consecutive date_start values are 7 days apart, and put a 1 or 0 in gap_7_days accordingly.
I want to do the following (using Redshift SQL only):
Get the length of each sequence of consecutive 1s in gap_7_days for each ID
Expected output:
| id | date_start | gap_7_days | sequence_length |
| -- | ------------------- | --------------- | --------------- |
| 1 | 2021-06-10 00:00:00 | 0 | |
| 1 | 2021-06-13 00:00:00 | 0 | |
| 1 | 2021-06-19 00:00:00 | 0 | |
| 1 | 2021-06-27 00:00:00 | 0 | |
| 2 | 2021-07-04 00:00:00 | 1 | 6 |
| 2 | 2021-07-11 00:00:00 | 1 | 6 |
| 2 | 2021-07-18 00:00:00 | 1 | 6 |
| 2 | 2021-07-25 00:00:00 | 1 | 6 |
| 2 | 2021-08-01 00:00:00 | 1 | 6 |
| 2 | 2021-08-08 00:00:00 | 1 | 6 |
| 2 | 2021-08-09 00:00:00 | 0 | |
| 2 | 2021-08-16 00:00:00 | 1 | 3 |
| 2 | 2021-08-23 00:00:00 | 1 | 3 |
| 2 | 2021-08-30 00:00:00 | 1 | 3 |
| 2 | 2021-08-31 00:00:00 | 0 | |
| 2 | 2021-09-01 00:00:00 | 0 | |
| 2 | 2021-08-08 00:00:00 | 1 | 4 |
| 2 | 2021-08-15 00:00:00 | 1 | 4 |
| 2 | 2021-08-22 00:00:00 | 1 | 4 |
| 2 | 2021-08-23 00:00:00 | 1 | 4 |
Get the number of sequences for each ID
Expected output:
| id | num_sequences |
| -- | ------------------- |
| 1 | 0 |
| 2 | 3 |
How can I achieve this?
If you want the number of sequences, just look at the previous value. When the current value is "1" and the previous is NULL or 0, then you have a new sequence.
So:
select id,
       sum(case when gap_7_days = 1 and coalesce(prev_gap_7_days, 0) = 0
                then 1 else 0
           end) as num_sequences
from (select t.*,
             lag(gap_7_days) over (partition by id order by date_start) as prev_gap_7_days
      from t
     ) t
group by id;
If you actually want the lengths of the sequences, as in the intermediate results, then ask a new question. That information is not needed for this question.
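That said, a rough sketch of how the same lag-plus-running-sum idea could also produce the lengths (assuming the placeholder table name t and that date_start uniquely orders the rows within each id):
-- Sketch only: number each run of 1s with a running sum of sequence starts,
-- then count the 1-rows within each run. Intervening 0 rows inherit the
-- previous grp value but are excluded by the conditional count and the CASE.
select id, date_start, gap_7_days,
       case when gap_7_days = 1
            then count(case when gap_7_days = 1 then 1 end)
                     over (partition by id, grp)
       end as sequence_length
from (select t.*,
             sum(case when gap_7_days = 1 and coalesce(prev_gap_7_days, 0) = 0
                      then 1 else 0
                 end) over (partition by id order by date_start
                            rows unbounded preceding) as grp
      from (select t.*,
                   lag(gap_7_days) over (partition by id order by date_start) as prev_gap_7_days
            from t
           ) t
     ) t;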

Join on minimum date between two dates - Spark SQL

I have a table of daily data and a table of monthly data. I'm trying to retrieve one daily record corresponding to each monthly record. The wrinkles are that some days are missing from the daily data and the field I care about, new_status, is sometimes null on the month_end_date.
month_df
| ID | month_end_date |
| -- | -------------- |
| 1 | 2019-07-31 |
| 1 | 2019-06-30 |
| 2 | 2019-10-31 |
daily_df
| ID | daily_date | new_status |
| -- | ---------- | ---------- |
| 1 | 2019-07-29 | 1 |
| 1 | 2019-07-30 | 1 |
| 1 | 2019-08-01 | 2 |
| 1 | 2019-08-02 | 2 |
| 1 | 2019-08-03 | 2 |
| 1 | 2019-06-29 | 0 |
| 1 | 2019-06-30 | 0 |
| 2 | 2019-10-30 | 5 |
| 2 | 2019-10-31 | NULL |
| 2 | 2019-11-01 | 6 |
| 2 | 2019-11-02 | 6 |
I want to fuzzy join daily_df to month_df where daily_date is >= month_end_date and less than some buffer afterwards (say, 5 days). I want to keep only the record with the minimum daily_date and a non-null new_status.
This post solves the issue using an OUTER APPLY in SQL Server, but that seems not to be an option in Spark SQL. I'm wondering if there's a method that is similarly computationally efficient that works in Spark.
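One possible Spark SQL approach (just a sketch, assuming the two tables are registered as views named month_df and daily_df and a 5-day buffer): join within the buffer window, drop null statuses, and keep the earliest daily row per month with row_number().
-- Sketch (Spark SQL): month_df and daily_df are assumed view names
SELECT ID, month_end_date, daily_date, new_status
FROM (SELECT m.ID, m.month_end_date, d.daily_date, d.new_status,
             ROW_NUMBER() OVER (PARTITION BY m.ID, m.month_end_date
                                ORDER BY d.daily_date) AS rn
      FROM month_df m
      JOIN daily_df d
        ON d.ID = m.ID
       AND d.daily_date >= m.month_end_date
       AND d.daily_date < DATE_ADD(m.month_end_date, 5)
       AND d.new_status IS NOT NULL
     ) t
WHERE rn = 1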

SQL Matching many-to-many dates for ID field

Edit: Fixed Start Date for User 2
I have a list of user ids, each having many start dates and many end dates.
A start date can be recorded many times after the "actual" start date of an "event", same goes for the end date.
The result should be the first start date and the first end date for each user "event".
I hope that makes sense, see the example below.
Thanks!
Assuming the Following tables are given:
Start Table:
+--------+------------+
| UserID | Start      |
+--------+------------+
| 1      | 2019-01-01 |
| 1      | 2019-01-02 |
| 1      | 2019-01-03 |
| 1      | 2019-04-01 |
| 1      | 2019-04-02 |
| 1      | 2019-04-03 |
| 2      | 2019-06-01 |
| 2      | 2019-06-02 |
| 2      | 2019-10-01 |
| 2      | 2019-10-02 |
+--------+------------+
End Table:
+--------+------------+
| UserID | End        |
+--------+------------+
| 1      | 2019-03-01 |
| 1      | 2019-03-02 |
| 1      | 2019-03-03 |
| 1      | 2019-05-01 |
| 1      | 2019-05-02 |
| 1      | 2019-05-03 |
| 2      | 2019-08-01 |
| 2      | 2019-08-02 |
| 2      | 2019-12-01 |
| 2      | 2019-12-02 |
+--------+------------+
Result:
+--------+------------+------------+
| UserID | Start      | End        |
+--------+------------+------------+
| 1      | 2019-01-01 | 2019-03-01 |
| 1      | 2019-04-01 | 2019-05-01 |
| 2      | 2019-06-01 | 2019-08-01 |
| 2      | 2019-10-01 | 2019-12-01 |
+--------+------------+------------+
Not sure I agree with your 2019-10-02
Here is one solution
Example
Select UserID
      ,[Start] = min([Start])
      ,[End]
 From  (
        Select A.*
              ,[End] = (Select min([End]) From EndTable Where UserID = A.UserID and [End] >= A.Start)
          From StartTable A
       ) A
 Group By UserID, [End]
Returns
UserID  Start       End
1       2019-01-01  2019-03-01
1       2019-04-01  2019-05-01
2       2019-06-01  2019-08-01
2       2019-10-01  2019-12-01

update last record of each cluster

I have a table in which I have already created the lead value for the next date within each product cluster. In addition, I created a delta_days value that shows the difference between Date and LeadDate.
+---------+------------+------------+------------+
| Product | Date       | LeadDate   | delta_days |
+---------+------------+------------+------------+
| A       | 2018-01-15 | 2018-01-23 | 8          |
| A       | 2018-01-23 | 2018-02-19 | 27         |
| A       | 2018-02-19 | 2017-05-25 | -270       |
| B       | 2017-05-25 | 2017-05-30 | 5          |
| B       | 2017-05-30 | 2016-01-01 | -515       |
| C       | 2016-01-01 | 2016-01-02 | 1          |
| C       | 2016-01-02 | 2016-01-03 | 1          |
| C       | 2016-01-03 | NULL       | NULL       |
+---------+------------+------------+------------+
I need to update the last record of each product cluster and set LeadDate and delta_days to NULL. How can I do this?
This is my goal:
+---------+------------+------------+------------+
| Product | Date       | LeadDate   | delta_days |
+---------+------------+------------+------------+
| A       | 2018-01-15 | 2018-01-23 | 8          |
| A       | 2018-01-23 | 2018-02-19 | 27         |
| A       | 2018-02-19 | NULL       | NULL       |
| B       | 2017-05-25 | 2017-05-30 | 5          |
| B       | 2017-05-30 | NULL       | NULL       |
| C       | 2016-01-01 | 2016-01-02 | 1          |
| C       | 2016-01-02 | 2016-01-03 | 1          |
| C       | 2016-01-03 | NULL       | NULL       |
+---------+------------+------------+------------+
LAG/LEAD take a default value that is used when there is no next/previous row:
LAG (scalar_expression [,offset] [,default])
OVER ( [ partition_by_clause ] order_by_clause )
Just specify NULL as the [default] when you produce your lead column.
In your code (a guess, since we don't have it):
SELECT [Date],
       LEAD([Date], 1, NULL) OVER (PARTITION BY Product ORDER BY [Date]) AS your_new_col
IMO, this is better than running an actual UPDATE, since it stays correct dynamically if you later insert a record that changes the existing order of your rows.
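A slightly fuller sketch of that idea, assuming SQL Server and a hypothetical table name dbo.ProductDates(Product, Date), which reproduces the goal table without updating anything:
-- Sketch only: dbo.ProductDates is a hypothetical table name
SELECT Product,
       [Date],
       LEAD([Date], 1, NULL) OVER (PARTITION BY Product ORDER BY [Date]) AS LeadDate,
       DATEDIFF(day, [Date],
                LEAD([Date], 1, NULL) OVER (PARTITION BY Product ORDER BY [Date])) AS delta_days
FROM dbo.ProductDates;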
You can use an updatable CTE with the last_value() window function:
with updatable as (
      select *,
             -- the explicit frame makes last_value() see the whole partition,
             -- not just the rows up to the current one
             last_value(Date) over (partition by Product order by Date
                                    rows between unbounded preceding and unbounded following) as last_val
      from table
)
update updatable
   set LeadDate = null, delta_days = null
 where Date = last_val;

Get last value with delta from previous row

I have data
| account | type | position | created_date |
|---------|------|----------|------|
| 1 | 1 | 1 | 2016-08-01 00:00:00 |
| 2 | 1 | 2 | 2016-08-01 00:00:00 |
| 1 | 2 | 2 | 2016-08-01 00:00:00 |
| 2 | 2 | 1 | 2016-08-01 00:00:00 |
| 1 | 1 | 2 | 2016-08-02 00:00:00 |
| 2 | 1 | 1 | 2016-08-02 00:00:00 |
| 1 | 2 | 1 | 2016-08-03 00:00:00 |
| 2 | 2 | 2 | 2016-08-03 00:00:00 |
| 1 | 1 | 2 | 2016-08-04 00:00:00 |
| 2 | 1 | 1 | 2016-08-04 00:00:00 |
| 1 | 2 | 2 | 2016-08-07 00:00:00 |
| 2 | 2 | 1 | 2016-08-07 00:00:00 |
I need to get the last positions (account, type, position) and the delta from the previous position. I'm trying to use window functions, but I only get all rows and can't group them / get the last one.
SELECT account,
       type,
       FIRST_VALUE(position) OVER w AS position,
       FIRST_VALUE(position) OVER w - LEAD(position, 1, 0) OVER w AS delta,
       created_date
FROM table
WINDOW w AS (PARTITION BY account ORDER BY created_date DESC)
I get this result:
| account | type | position | delta | created_date |
|---------|------|----------|-------|--------------|
| 1 | 1 | 1 | 1 | 2016-08-01 00:00:00 |
| 1 | 1 | 2 | 1 | 2016-08-02 00:00:00 |
| 1 | 1 | 2 | 0 | 2016-08-04 00:00:00 |
| 1 | 2 | 2 | 2 | 2016-08-01 00:00:00 |
| 1 | 2 | 1 | -1 | 2016-08-03 00:00:00 |
| 1 | 2 | 2 | 1 | 2016-08-07 00:00:00 |
| 2 | 1 | 2 | 2 | 2016-08-01 00:00:00 |
| 2 | 2 | 1 | 1 | 2016-08-01 00:00:00 |
| and so on |
but I need only the last record for each account/type pair:
| account | type | position | delta | created_date |
|---------|------|----------|-------|--------------|
| 1 | 1 | 2 | 0 | 2016-08-04 00:00:00 |
| 1 | 2 | 2 | 1 | 2016-08-07 00:00:00 |
| 2 | 1 | 1 | 0 | 2016-08-04 00:00:00 |
| and so on |
Sorry for my bad language, and thanks for any help.
My "best" try:
WITH cte_delta AS (
    SELECT account,
           type,
           FIRST_VALUE(position) OVER w AS position,
           FIRST_VALUE(position) OVER w - LEAD(position, 1, 0) OVER w AS delta,
           created_date
    FROM table
    WINDOW w AS (PARTITION BY account ORDER BY created_date DESC)
),
cte_date AS (
    SELECT account,
           type,
           MAX(created_date) AS created_date
    FROM cte_delta
    GROUP BY account, type
)
SELECT cd.*
FROM cte_delta cd,
     cte_date ct
WHERE cd.account = ct.account
  AND cd.type = ct.type
  AND cd.created_date = ct.created_date
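For what it's worth, one possible way to keep only the newest row per account/type pair is to rank the rows with row_number() and take rank 1; this is only a sketch against the placeholder table name used above, with the partition switched to account, type:
-- Sketch: keep the newest row per (account, type) and compute the delta with LAG()
SELECT account, type, position, delta, created_date
FROM (SELECT account,
             type,
             position,
             position - LAG(position, 1, 0) OVER w AS delta,
             created_date,
             ROW_NUMBER() OVER w2 AS rn
      FROM table
      WINDOW w  AS (PARTITION BY account, type ORDER BY created_date),
             w2 AS (PARTITION BY account, type ORDER BY created_date DESC)
     ) t
WHERE rn = 1;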