Next value per group in SQL

I am trying to fix a data quality issue and I have the following table origin:
WITH origin AS (
SELECT 1 AS item_id, 'cake' as item_group, DATE '2020-04-01' AS start_date, DATE '2020-12-07' AS end_date, 1 as group_rank UNION ALL
SELECT 2, 'cake', DATE '2020-12-07',DATE '2020-12-31', 2 as group_rank UNION ALL
SELECT 3, 'cake', DATE '2020-12-07',DATE '2020-12-31', 2 as group_rank UNION ALL
SELECT 4, 'cake', DATE '2020-12-07',DATE '2020-12-31', 2 as group_rank UNION ALL
SELECT 5, 'cake', DATE '2020-12-07',DATE '2020-12-31', 2 as group_rank UNION ALL
SELECT 6, 'cake', DATE '2020-12-31',DATE '2021-12-07', 3 as group_rank UNION ALL
SELECT 7, 'cake', DATE '2020-12-31',DATE '2021-12-07', 3 as group_rank UNION ALL
SELECT 8, 'pie', DATE '2020-12-07',DATE '2020-12-31', 1 as group_rank UNION ALL
SELECT 9, 'pie', DATE '2020-12-31',DATE '2021-12-07', 2 as group_rank UNION ALL
SELECT 10, 'pie', DATE '2020-12-31',DATE '2021-12-07', 2 as group_rank
)
select *
from origin
item_id  item_group  start_date  end_date    group_rank
1        cake        2020-04-01  2020-12-07  1
2        cake        2020-12-07  2020-12-31  2
3        cake        2020-12-07  2020-12-31  2
4        cake        2020-12-07  2020-12-31  2
5        cake        2020-12-07  2020-12-31  2
6        cake        2020-12-31  2021-12-07  3
7        cake        2020-12-31  2021-12-07  3
8        pie         2020-12-07  2020-12-31  1
9        pie         2020-12-31  2021-12-07  2
10       pie         2020-12-31  2021-12-07  2
Every row is a unique item belonging to a certain item_group: pie or cake. Items within a group are ranked by start_date. The problem with the table is that when I join it with a calendar table, some items have overlapping dates (one item's end_date falls on the same day as the next item's start_date), so I end up with duplicates. What I want is to fix the end_dates of the old items by subtracting one day.
For that I need to detect whether two items overlap by one day. I thought I'd use the rank to find the next value within the group: take the current rank, find the next higher one, and take the start_date of that higher rank. But I couldn't figure out how to get this right.
So my ideal table is the following:
WITH final_result AS (
SELECT 1 AS item_id, 'cake' as item_group, DATE '2020-04-01' AS start_date, DATE '2020-12-07' AS end_date, 1 as group_rank, DATE '2020-12-07' as next_group_start_date, 1 as end_date_equals_next_group_start_date, DATE '2020-12-06' as new_end_date UNION ALL
SELECT 2, 'cake', DATE '2020-12-07',DATE '2020-12-31', 2 as group_rank, DATE '2020-12-31', 1, DATE '2020-12-30' UNION ALL
SELECT 3, 'cake', DATE '2020-12-07',DATE '2020-12-31', 2 as group_rank, DATE '2020-12-31', 1, DATE '2020-12-30' UNION ALL
SELECT 4, 'cake', DATE '2020-12-07',DATE '2020-12-31', 2 as group_rank, DATE '2020-12-31', 1, DATE '2020-12-30' UNION ALL
SELECT 5, 'cake', DATE '2020-12-07',DATE '2020-12-31', 2 as group_rank, DATE '2020-12-31', 1, DATE '2020-12-30' UNION ALL
SELECT 6, 'cake', DATE '2020-12-31',DATE '2021-12-07', 3 as group_rank, NULL, 0, DATE '2020-12-07' UNION ALL
SELECT 7, 'cake', DATE '2020-12-31',DATE '2021-12-07', 3 as group_rank, NULL, 0, DATE '2020-12-07' UNION ALL
SELECT 8, 'pie', DATE '2020-12-07',DATE '2020-12-31', 1 as group_rank, DATE '2020-12-31', 1, DATE '2020-12-30' UNION ALL
SELECT 9, 'pie', DATE '2020-12-31',DATE '2021-12-07', 2 as group_rank, NULL, 0, DATE '2020-12-06' UNION ALL
SELECT 10, 'pie', DATE '2020-12-31',DATE '2021-12-07', 2 as group_rank, NULL, 0, DATE '2020-12-06'
)
select *
from final_result
item_id  item_group  start_date  end_date    group_rank  next_group_start_date  end_date_equals_next_group_start_date  new_end_date
1        cake        2020-04-01  2020-12-07  1           2020-12-07             1                                      2020-12-06
2        cake        2020-12-07  2020-12-31  2           2020-12-31             1                                      2020-12-30
3        cake        2020-12-07  2020-12-31  2           2020-12-31             1                                      2020-12-30
4        cake        2020-12-07  2020-12-31  2           2020-12-31             1                                      2020-12-30
5        cake        2020-12-07  2020-12-31  2           2020-12-31             1                                      2020-12-30
6        cake        2020-12-31  2021-12-07  3           NULL                   0                                      2020-12-07
7        cake        2020-12-31  2021-12-07  3           NULL                   0                                      2020-12-07
8        pie         2020-12-07  2020-12-31  1           2020-12-31             1                                      2020-12-30
9        pie         2020-12-31  2021-12-07  2           NULL                   0                                      2020-12-06
10       pie         2020-12-31  2021-12-07  2           NULL                   0                                      2020-12-06
By identifying next_group_start_date I can tell whether there is a one-day overlap. end_date_equals_next_group_start_date shows whether end_date = next_group_start_date, i.e. whether there is an overlap. If so, I can create a new_end_date, which is end_date - 1.

What you want to do is use the LEAD() window function.
SELECT
*,
LEAD(start_date, 1) OVER(PARTITION BY item_group ORDER BY group_rank) AS next_group_start_date
FROM origin
This works but doesn't give the exact result you were expecting. To get the expected result, join the origin table with the result of applying the LEAD() window function to the distinct (item_group, group_rank, start_date) rows.
SELECT
    *,
    end_date - end_date_equals_next_group_start_date AS new_end_date
FROM (
    SELECT
        origin.*,
        b.next_group_start_date,
        CASE
            WHEN origin.end_date = b.next_group_start_date
                THEN 1
            ELSE 0
        END AS end_date_equals_next_group_start_date
    FROM origin
    JOIN (
        SELECT
            item_group,
            group_rank,
            LEAD(start_date, 1) OVER (PARTITION BY item_group ORDER BY group_rank) AS next_group_start_date
        FROM (
            SELECT DISTINCT item_group, group_rank, start_date
            FROM origin
        ) a
    ) b ON origin.item_group = b.item_group AND origin.group_rank = b.group_rank
) c
Here's a dbfiddle of the query
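The same LEAD()-then-join shape can be sketched end to end in a portable, runnable form. This is an illustrative version using SQLite via Python with a trimmed-down origin table; the names mirror the question, and SQLite's date(..., '-1 day') stands in for end_date - 1:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE origin (item_id INT, item_group TEXT, start_date TEXT,
                     end_date TEXT, group_rank INT);
INSERT INTO origin VALUES
  (1, 'cake', '2020-04-01', '2020-12-07', 1),
  (2, 'cake', '2020-12-07', '2020-12-31', 2),
  (6, 'cake', '2020-12-31', '2021-12-07', 3),
  (8, 'pie',  '2020-12-07', '2020-12-31', 1),
  (9, 'pie',  '2020-12-31', '2021-12-07', 2);
""")

rows = conn.execute("""
SELECT o.item_id,
       b.next_group_start_date,
       CASE WHEN o.end_date = b.next_group_start_date THEN 1 ELSE 0 END
         AS end_date_equals_next_group_start_date,
       CASE WHEN o.end_date = b.next_group_start_date
            THEN date(o.end_date, '-1 day')   -- shift the old item back one day
            ELSE o.end_date END AS new_end_date
FROM origin o
JOIN (
    -- LEAD over the deduplicated (item_group, group_rank, start_date) rows,
    -- so every item of a rank sees the next rank's start_date
    SELECT item_group, group_rank,
           LEAD(start_date) OVER (PARTITION BY item_group
                                  ORDER BY group_rank) AS next_group_start_date
    FROM (SELECT DISTINCT item_group, group_rank, start_date FROM origin)
) b ON o.item_group = b.item_group AND o.group_rank = b.group_rank
ORDER BY o.item_id
""").fetchall()
```

The deduplication step matters: applying LEAD() directly to origin would make items of the same rank see each other's start_date instead of the next rank's.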


Analytic function/logic to get min and max record date in Oracle

I have a requirement to fetch values based on eff_date and end_date; sample data is given below.
Database: Oracle 11g
Example data:
id  val  eff_date   end_date
10  100  01-Jan-21  04-Jan-21
10  105  05-Jan-21  07-Jan-21
10  100  08-Jan-21  10-Jan-21
10  100  11-Jan-21  17-Jan-21
10  100  18-Jan-21  21-Jan-21
10  110  22-Jan-21  null
Output:
id  val  eff_date   end_date
10  100  01-Jan-21  04-Jan-21
10  105  05-Jan-21  07-Jan-21
10  100  08-Jan-21  21-Jan-21
10  110  22-Jan-21  null
You can use the ROW_NUMBER analytic function and then aggregate:
SELECT id,
val,
MIN(eff_date) AS eff_date,
MAX(end_date) AS end_date
FROM (
SELECT t.*,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY eff_date)
- ROW_NUMBER() OVER (PARTITION BY id, val ORDER BY eff_date) AS grp
FROM table_name t
)
GROUP BY id, val, grp
ORDER BY id, eff_date;
Which, for the sample data:
CREATE TABLE table_name (id, val, eff_date, end_date) AS
SELECT 10, 100, DATE '2021-01-01', DATE '2021-01-04' FROM DUAL UNION ALL
SELECT 10, 105, DATE '2021-01-05', DATE '2021-01-07' FROM DUAL UNION ALL
SELECT 10, 100, DATE '2021-01-08', DATE '2021-01-10' FROM DUAL UNION ALL
SELECT 10, 100, DATE '2021-01-11', DATE '2021-01-17' FROM DUAL UNION ALL
SELECT 10, 100, DATE '2021-01-18', DATE '2021-01-21' FROM DUAL UNION ALL
SELECT 10, 110, DATE '2021-01-22', null FROM DUAL;
Outputs:
ID  VAL  EFF_DATE             END_DATE
10  100  2021-01-01 00:00:00  2021-01-04 00:00:00
10  105  2021-01-05 00:00:00  2021-01-07 00:00:00
10  100  2021-01-08 00:00:00  2021-01-21 00:00:00
10  110  2021-01-22 00:00:00  null
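The ROW_NUMBER-difference ("gaps and islands") trick is portable across databases. Here is a hedged, runnable sketch of the same query against SQLite via Python, using ISO date strings instead of Oracle DATEs:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table_name (id INT, val INT, eff_date TEXT, end_date TEXT);
INSERT INTO table_name VALUES
  (10, 100, '2021-01-01', '2021-01-04'),
  (10, 105, '2021-01-05', '2021-01-07'),
  (10, 100, '2021-01-08', '2021-01-10'),
  (10, 100, '2021-01-11', '2021-01-17'),
  (10, 100, '2021-01-18', '2021-01-21'),
  (10, 110, '2021-01-22', NULL);
""")

rows = conn.execute("""
SELECT id, val, MIN(eff_date) AS eff_date, MAX(end_date) AS end_date
FROM (
    -- rows of the same val in an unbroken run share the same rn difference
    SELECT t.*,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY eff_date)
         - ROW_NUMBER() OVER (PARTITION BY id, val ORDER BY eff_date) AS grp
    FROM table_name t
)
GROUP BY id, val, grp
ORDER BY id, eff_date
""").fetchall()
```

The consecutive val = 100 runs collapse into single rows while the later return to 100 stays separate, because the difference of the two row numbers changes whenever another val interrupts the run.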
From Oracle 12, you can use MATCH_RECOGNIZE to perform row-by-row processing:
SELECT *
FROM table_name t
MATCH_RECOGNIZE(
PARTITION BY id
ORDER BY eff_date
MEASURES
FIRST(val) AS val,
FIRST(eff_date) AS eff_date,
LAST(end_date) AS end_date
PATTERN (same_val+)
DEFINE same_val AS FIRST(val) = val
)
Which has the same output and is likely to be more efficient.
fiddle

Last 12 months of data for each item/row from the selected date in oracle sql

I need to identify the last 12 months of data for each item in each row. For example, if item A has DatePeriod 1/12/2021, the output should be Qty: 23, which covers 11/2021 back to 12/2020.
How can I write SQL in Oracle to achieve the result below?
Four columns: Item, Qty, DatePeriod and 12 Months Qty Value
Item Qty DatePeriod 12 Months Qty Value
A 2 1/1/2020
A 3 1/2/2020
A 4 1/3/2020
A 1 1/4/2020
A 2 1/5/2020
A 2 1/6/2020
A 1 1/7/2020
A 2 1/8/2020
A 1 1/9/2020
A 2 1/10/2020
A 2 1/11/2020
A 2 1/12/2020
A 2 1/1/2021
A 3 1/2/2021
A 4 1/3/2021
A 1 1/4/2021
A 2 1/5/2021
A 2 1/6/2021
A 1 1/7/2021
A 2 1/8/2021
A 1 1/9/2021
A 2 1/10/2021 9/2021 to 10/2020 qty: 24
A 1 1/11/2021 10/ 2021 to 11/2020 Qty: 24
A 1 1/12/2021 11/2021 to 12/2020 Qty: 23
B 2 1/1/2020
B 2 1/2/2020
B 2 1/3/2020
B 5 1/4/2020
B 6 1/5/2020
B 2 1/6/2020
B 1 1/7/2020
B 2 1/8/2020
B 1 1/9/2020
B 2 1/10/2020
B 2 1/11/2020
B 2 1/12/2020
B 2 1/1/2021
B 1 1/2/2021
B 1 1/3/2021
B 1 1/4/2021
B 1 1/5/2021
B 2 1/6/2021
B 1 1/7/2021
B 2 1/8/2021
B 2 1/9/2021
B 3 1/10/2021 9/2021 to 10/2020 qty: 19
B 2 1/11/2021 10/ 2021 to 11/2020 Qty: 20
B 2 1/12/2021 11/2021 to 12/2020 Qty: 20
To find the last 12 months data for each item (which may have different times for the latest items) then, from Oracle 12, you can use MATCH_RECOGNIZE to perform row-by-row processing:
SELECT *
FROM table_name
MATCH_RECOGNIZE (
PARTITION BY item
ORDER BY DatePeriod DESC
MEASURES
LAST(dateperiod) AS from_date,
FIRST(dateperiod) AS to_date,
SUM(qty) AS total
PATTERN (^ year+)
DEFINE year AS dateperiod > ADD_MONTHS(FIRST(datePeriod), -12)
);
In earlier versions, you can use:
SELECT item,
from_date,
dateperiod AS to_date,
total
FROM (
SELECT t.*,
SUM(qty) OVER (
PARTITION BY item
ORDER BY dateperiod
RANGE BETWEEN INTERVAL '11' MONTH PRECEDING
AND INTERVAL '0' MONTH FOLLOWING
) AS total,
MIN(dateperiod) OVER (
PARTITION BY item
ORDER BY dateperiod
RANGE BETWEEN INTERVAL '11' MONTH PRECEDING
AND INTERVAL '0' MONTH FOLLOWING
) AS from_date,
ROW_NUMBER() OVER (PARTITION BY item ORDER BY dateperiod DESC) AS rn
FROM table_name t
)
WHERE rn = 1;
Which, for your sample data:
CREATE TABLE table_name (Item, Qty, DatePeriod) AS
SELECT 'A', 2, DATE '2020-01-01' FROM DUAL UNION ALL
SELECT 'A', 3, DATE '2020-02-01' FROM DUAL UNION ALL
SELECT 'A', 4, DATE '2020-03-01' FROM DUAL UNION ALL
SELECT 'A', 1, DATE '2020-04-01' FROM DUAL UNION ALL
SELECT 'A', 2, DATE '2020-05-01' FROM DUAL UNION ALL
SELECT 'A', 2, DATE '2020-06-01' FROM DUAL UNION ALL
SELECT 'A', 1, DATE '2020-07-01' FROM DUAL UNION ALL
SELECT 'A', 2, DATE '2020-08-01' FROM DUAL UNION ALL
SELECT 'A', 1, DATE '2020-09-01' FROM DUAL UNION ALL
SELECT 'A', 2, DATE '2020-10-01' FROM DUAL UNION ALL
SELECT 'A', 2, DATE '2020-11-01' FROM DUAL UNION ALL
SELECT 'A', 2, DATE '2020-12-01' FROM DUAL UNION ALL
SELECT 'A', 2, DATE '2021-01-01' FROM DUAL UNION ALL
SELECT 'A', 3, DATE '2021-02-01' FROM DUAL UNION ALL
SELECT 'A', 4, DATE '2021-03-01' FROM DUAL UNION ALL
SELECT 'A', 1, DATE '2021-04-01' FROM DUAL UNION ALL
SELECT 'A', 2, DATE '2021-05-01' FROM DUAL UNION ALL
SELECT 'A', 2, DATE '2021-06-01' FROM DUAL UNION ALL
SELECT 'A', 1, DATE '2021-07-01' FROM DUAL UNION ALL
SELECT 'A', 2, DATE '2021-08-01' FROM DUAL UNION ALL
SELECT 'A', 1, DATE '2021-09-01' FROM DUAL UNION ALL
SELECT 'A', 2, DATE '2021-10-01' FROM DUAL UNION ALL
SELECT 'A', 1, DATE '2021-11-01' FROM DUAL UNION ALL
SELECT 'A', 1, DATE '2021-12-01' FROM DUAL UNION ALL
SELECT 'B', 2, DATE '2020-01-01' FROM DUAL UNION ALL
SELECT 'B', 2, DATE '2020-02-01' FROM DUAL UNION ALL
SELECT 'B', 2, DATE '2020-03-01' FROM DUAL UNION ALL
SELECT 'B', 5, DATE '2020-04-01' FROM DUAL UNION ALL
SELECT 'B', 6, DATE '2020-05-01' FROM DUAL UNION ALL
SELECT 'B', 2, DATE '2020-06-01' FROM DUAL UNION ALL
SELECT 'B', 1, DATE '2020-07-01' FROM DUAL UNION ALL
SELECT 'B', 2, DATE '2020-08-01' FROM DUAL UNION ALL
SELECT 'B', 1, DATE '2020-09-01' FROM DUAL UNION ALL
SELECT 'B', 2, DATE '2020-10-01' FROM DUAL UNION ALL
SELECT 'B', 2, DATE '2020-11-01' FROM DUAL UNION ALL
SELECT 'B', 2, DATE '2020-12-01' FROM DUAL UNION ALL
SELECT 'B', 2, DATE '2021-01-01' FROM DUAL UNION ALL
SELECT 'B', 1, DATE '2021-02-01' FROM DUAL UNION ALL
SELECT 'B', 1, DATE '2021-03-01' FROM DUAL UNION ALL
SELECT 'B', 1, DATE '2021-04-01' FROM DUAL UNION ALL
SELECT 'B', 1, DATE '2021-05-01' FROM DUAL UNION ALL
SELECT 'B', 2, DATE '2021-06-01' FROM DUAL UNION ALL
SELECT 'B', 1, DATE '2021-07-01' FROM DUAL UNION ALL
SELECT 'B', 2, DATE '2021-08-01' FROM DUAL UNION ALL
SELECT 'B', 2, DATE '2021-09-01' FROM DUAL UNION ALL
SELECT 'B', 3, DATE '2021-10-01' FROM DUAL UNION ALL
SELECT 'B', 2, DATE '2021-11-01' FROM DUAL UNION ALL
SELECT 'B', 2, DATE '2021-12-01' FROM DUAL;
Both output:
ITEM  FROM_DATE            TO_DATE              TOTAL
A     2021-01-01 00:00:00  2021-12-01 00:00:00  22
B     2021-01-01 00:00:00  2021-12-01 00:00:00  20
If you want to get the running totals for each row then you can use the SUM analytic function with a range window:
SELECT t.*,
SUM(qty) OVER (
PARTITION BY item
ORDER BY dateperiod
RANGE BETWEEN INTERVAL '11' MONTH PRECEDING
AND INTERVAL '0' MONTH FOLLOWING
) AS last_year_total
FROM table_name t
Which outputs:
ITEM  QTY  DATEPERIOD           LAST_YEAR_TOTAL
A     2    2020-01-01 00:00:00  2
A     3    2020-02-01 00:00:00  5
A     4    2020-03-01 00:00:00  9
A     1    2020-04-01 00:00:00  10
A     2    2020-05-01 00:00:00  12
A     2    2020-06-01 00:00:00  14
A     1    2020-07-01 00:00:00  15
A     2    2020-08-01 00:00:00  17
A     1    2020-09-01 00:00:00  18
A     2    2020-10-01 00:00:00  20
A     2    2020-11-01 00:00:00  22
A     2    2020-12-01 00:00:00  24
A     2    2021-01-01 00:00:00  24
A     3    2021-02-01 00:00:00  24
A     4    2021-03-01 00:00:00  24
A     1    2021-04-01 00:00:00  24
A     2    2021-05-01 00:00:00  24
A     2    2021-06-01 00:00:00  24
A     1    2021-07-01 00:00:00  24
A     2    2021-08-01 00:00:00  24
A     1    2021-09-01 00:00:00  24
A     2    2021-10-01 00:00:00  24
A     1    2021-11-01 00:00:00  23
A     1    2021-12-01 00:00:00  22
B     2    2020-01-01 00:00:00  2
B     2    2020-02-01 00:00:00  4
B     2    2020-03-01 00:00:00  6
B     5    2020-04-01 00:00:00  11
B     6    2020-05-01 00:00:00  17
B     2    2020-06-01 00:00:00  19
B     1    2020-07-01 00:00:00  20
B     2    2020-08-01 00:00:00  22
B     1    2020-09-01 00:00:00  23
B     2    2020-10-01 00:00:00  25
B     2    2020-11-01 00:00:00  27
B     2    2020-12-01 00:00:00  29
B     2    2021-01-01 00:00:00  29
B     1    2021-02-01 00:00:00  28
B     1    2021-03-01 00:00:00  27
B     1    2021-04-01 00:00:00  23
B     1    2021-05-01 00:00:00  18
B     2    2021-06-01 00:00:00  18
B     1    2021-07-01 00:00:00  18
B     2    2021-08-01 00:00:00  18
B     2    2021-09-01 00:00:00  19
B     3    2021-10-01 00:00:00  20
B     2    2021-11-01 00:00:00  20
B     2    2021-12-01 00:00:00  20
db<>fiddle here
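SQLite has no interval-based RANGE frames, so if you want to check the rolling 12-month logic outside Oracle, a correlated subquery over date(dateperiod, '-11 months') gives the same per-row total. A minimal, runnable sketch (Python + SQLite, with qty = 1 every month so the totals are easy to eyeball):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_name (item TEXT, qty INT, dateperiod TEXT)")

# 13 consecutive months of qty = 1 for a single item
dates = [f"2020-{m:02d}-01" for m in range(1, 13)] + ["2021-01-01"]
conn.executemany("INSERT INTO table_name VALUES ('A', 1, ?)",
                 [(d,) for d in dates])

rows = conn.execute("""
SELECT item, qty, dateperiod,
       -- sum of the 12-month window ending at this row's month
       (SELECT SUM(qty) FROM table_name t2
        WHERE t2.item = t.item
          AND t2.dateperiod BETWEEN date(t.dateperiod, '-11 months')
                                AND t.dateperiod) AS last_year_total
FROM table_name t
ORDER BY dateperiod
""").fetchall()
```

This reproduces the behaviour of the RANGE window above: the total grows while fewer than 12 months exist, then holds at the 12-month sum as the window slides.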

How can I partition by group that falls within a time range?

I have the following table showing when customers bought a certain product. The data I have is CustomerID, Amount, Dat. I am trying to create the column ProductsIn30Days, which represents how many products a customer bought in the range Dat-30 days inclusive the current day.
For example, ProductsIn30Days for CustomerID 1 on Dat 25.3.2020 is 7, since the customer bought 2 products on 25.3.2020 and 5 more products on 24.3.2020, which falls within 30 days before 25.3.2020.
CustomerID  Amount  Dat        ProductsIn30Days
1           1       23.3.2018  1
1           2       24.3.2020  2
1           3       24.3.2020  5
1           2       25.3.2020  7
1           2       24.5.2020  2
1           1       15.6.2020  3
2           7       24.3.2017  7
2           2       24.3.2020  2
I tried something like this with no success, since the partition only works on a single date rather than on a range like I would need:
select CustomerID, Amount, Dat,
sum(Amount) over (partition by CustomerID, Dat-30)
from table
Thank you for help.
You can use an analytic SUM function with a range window:
SELECT t.*,
SUM(Amount) OVER (
PARTITION BY CustomerID
ORDER BY Dat
RANGE BETWEEN INTERVAL '30' DAY PRECEDING AND CURRENT ROW
) AS ProductsIn30Days
FROM table_name t;
Which, for the sample data:
CREATE TABLE table_name (CustomerID, Amount, Dat) AS
SELECT 1, 1, DATE '2018-03-23' FROM DUAL UNION ALL
SELECT 1, 2, DATE '2020-03-24' FROM DUAL UNION ALL
SELECT 1, 3, DATE '2020-03-24' FROM DUAL UNION ALL
SELECT 1, 2, DATE '2020-03-25' FROM DUAL UNION ALL
SELECT 1, 2, DATE '2020-05-24' FROM DUAL UNION ALL
SELECT 1, 1, DATE '2020-06-15' FROM DUAL UNION ALL
SELECT 2, 7, DATE '2017-03-24' FROM DUAL UNION ALL
SELECT 2, 2, DATE '2020-03-24' FROM DUAL;
Outputs:
CUSTOMERID  AMOUNT  DAT                  PRODUCTSIN30DAYS
1           1       2018-03-23 00:00:00  1
1           2       2020-03-24 00:00:00  5
1           3       2020-03-24 00:00:00  5
1           2       2020-03-25 00:00:00  7
1           2       2020-05-24 00:00:00  2
1           1       2020-06-15 00:00:00  3
2           7       2017-03-24 00:00:00  7
2           2       2020-03-24 00:00:00  2
Note: If you have values on the same date then they will be tied in the order and always aggregated together (i.e. rows 2 & 3). If you want them to be aggregated separately then you need to order by something else to break the ties but that would not work with a RANGE window.
db<>fiddle here
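The same RANGE frame can be reproduced outside Oracle. SQLite requires a numeric ORDER BY key for RANGE frames, so one portable sketch orders by julianday(Dat) and uses RANGE BETWEEN 30 PRECEDING AND CURRENT ROW (runnable via Python; assumes SQLite 3.28+ for RANGE with a numeric offset):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table_name (CustomerID INT, Amount INT, Dat TEXT);
INSERT INTO table_name VALUES
  (1, 1, '2018-03-23'), (1, 2, '2020-03-24'), (1, 3, '2020-03-24'),
  (1, 2, '2020-03-25'), (1, 2, '2020-05-24'), (1, 1, '2020-06-15'),
  (2, 7, '2017-03-24'), (2, 2, '2020-03-24');
""")

rows = conn.execute("""
SELECT CustomerID, Amount, Dat,
       SUM(Amount) OVER (
           PARTITION BY CustomerID
           ORDER BY julianday(Dat)   -- numeric key, so a 30-day RANGE frame works
           RANGE BETWEEN 30 PRECEDING AND CURRENT ROW
       ) AS ProductsIn30Days
FROM table_name
ORDER BY CustomerID, Dat
""").fetchall()
```

As in the Oracle output, the two rows on 2020-03-24 tie in the ordering and are aggregated together.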

Valid_from Valid_to from a full loaded table

There is a source table which is loaded in full every month. The table looks like the example below.
Source table:
pk  code_paym  code_terms  etl_id
1   2          3           2020-08-01
1   2          3           2020-09-01
1   2          4           2020-10-01
1   2          4           2020-11-01
1   2          4           2020-12-01
1   2          4           2021-01-01
1   2          3           2021-02-01
1   2          3           2021-03-01
1   2          3           2021-04-01
1   2          3           2021-05-01
I would like to create valid_from and valid_to columns from the source table, as in the example below.
Desired Output:
pk  code_paym  code_terms  valid_from  valid_to
1   2          3           2020-08-01  2020-09-01
1   2          4           2020-10-01  2021-01-01
1   2          3           2021-02-01  2021-05-01
As can be seen, attributes can return to earlier values over time.
How can I produce this output in SQL?
Thank you very much,
Regards
Use the CONDITIONAL_TRUE_EVENT window function (Snowflake) to identify contiguous subgroups:
CREATE OR REPLACE TABLE t( pk INT, code_paym INT, code_terms INT, etl_id DATE)
AS
SELECT 1, 2, 3, '2020-08-01'
UNION ALL SELECT 1, 2, 3, '2020-09-01'
UNION ALL SELECT 1, 2, 4, '2020-10-01'
UNION ALL SELECT 1, 2, 4, '2020-11-01'
UNION ALL SELECT 1, 2, 4, '2020-12-01'
UNION ALL SELECT 1, 2, 4, '2021-01-01'
UNION ALL SELECT 1, 2, 3, '2021-02-01'
UNION ALL SELECT 1, 2, 3, '2021-03-01'
UNION ALL SELECT 1, 2, 3, '2021-04-01'
UNION ALL SELECT 1, 2, 3, '2021-05-01';
Query:
WITH cte AS (
SELECT t.*,
CONDITIONAL_TRUE_EVENT(CODE_TERMS != LAG(CODE_TERMS,1,CODE_TERMS)
OVER(PARTITION BY PK, CODE_PAYM ORDER BY ETL_ID))
OVER(PARTITION BY PK, CODE_PAYM ORDER BY ETL_ID) AS grp
FROM t
)
SELECT PK, CODE_PAYM, grp, MIN(ETL_ID) AS valid_from, MAX(ETL_ID) AS valid_to
FROM cte
GROUP BY PK, CODE_PAYM, grp;
Output:
PK  CODE_PAYM  GRP  VALID_FROM  VALID_TO
1   2          0    2020-08-01  2020-09-01
1   2          1    2020-10-01  2021-01-01
1   2          2    2021-02-01  2021-05-01
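CONDITIONAL_TRUE_EVENT is Snowflake-specific. The same "new group each time the value changes" counter can be built portably from a LAG comparison plus a running SUM; here is a hedged, runnable sketch of that equivalent (Python + SQLite):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (pk INT, code_paym INT, code_terms INT, etl_id TEXT);
INSERT INTO t VALUES
  (1,2,3,'2020-08-01'), (1,2,3,'2020-09-01'),
  (1,2,4,'2020-10-01'), (1,2,4,'2020-11-01'),
  (1,2,4,'2020-12-01'), (1,2,4,'2021-01-01'),
  (1,2,3,'2021-02-01'), (1,2,3,'2021-03-01'),
  (1,2,3,'2021-04-01'), (1,2,3,'2021-05-01');
""")

rows = conn.execute("""
WITH flagged AS (
    -- 1 whenever code_terms differs from the previous month's value
    SELECT t.*,
           CASE WHEN code_terms <> LAG(code_terms, 1, code_terms)
                     OVER (PARTITION BY pk, code_paym ORDER BY etl_id)
                THEN 1 ELSE 0 END AS chg
    FROM t
),
grouped AS (
    -- running sum of the change flags = CONDITIONAL_TRUE_EVENT's counter
    SELECT *, SUM(chg) OVER (PARTITION BY pk, code_paym
                             ORDER BY etl_id) AS grp
    FROM flagged
)
SELECT pk, code_paym, code_terms,
       MIN(etl_id) AS valid_from, MAX(etl_id) AS valid_to
FROM grouped
GROUP BY pk, code_paym, code_terms, grp
ORDER BY valid_from
""").fetchall()
```

Because grp is included in the GROUP BY, the later return to code_terms = 3 forms its own validity interval instead of merging with the first one.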

How to use date_diff for two adjacent sessions using BigQuery?

I'm trying to calculate average hours between two adjacent sessions using the data from the following table:
user_id  event_timestamp              session_num
A        2021-04-16 10:00:00.000 UTC  1
A        2021-04-16 11:00:00.000 UTC  2
A        2021-04-16 13:00:00.000 UTC  3
A        2021-04-16 16:00:00.000 UTC  4
B        2021-04-16 12:00:00.000 UTC  1
B        2021-04-16 14:00:00.000 UTC  2
B        2021-04-16 19:00:00.000 UTC  3
C        2021-04-16 10:00:00.000 UTC  1
C        2021-04-16 17:00:00.000 UTC  2
C        2021-04-16 18:00:00.000 UTC  3
So, for user A we have
1 hour between session_num = 2 and session_num = 1,
2 hours between session_num = 3 and session_num = 2,
3 hours between session_num = 4 and session_num = 3.
Same for the other users:
2 and 5 hours for user B;
7 and 1 hours for user C.
The result I expect to get is the arithmetic average of these date_diff(HOUR) values.
So avg(1, 2, 3, 2, 5, 7, 1) = 3 hours is the average time between two adjacent sessions.
Does anyone have an idea what query can be used so that the date_diff function is applied only to adjacent sessions?
The average hours between sessions for a given user is most simply calculated as:
select user_id,
timestamp_diff(max(event_timestamp), min(event_timestamp), hour) * 1.0 / nullif(count(*) - 1, 0)
from t
group by user_id;
That is, the average time between sessions for a user is the maximum timestamp minus the minimum timestamp divided by one less than the number of sessions.
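The reason this works is that the adjacent differences telescope: for a sorted series, the gaps sum to max - min, so their average is (max - min) / (n - 1). A quick check of that identity in plain Python, using user A's session hours from the question:

```python
# event hours for user A: sessions at 10:00, 11:00, 13:00, 16:00
ts = [10, 11, 13, 16]

# adjacent gaps: 1, 2, 3 hours
diffs = [b - a for a, b in zip(ts, ts[1:])]

avg_of_gaps = sum(diffs) / len(diffs)             # average of 1, 2, 3
telescoped = (max(ts) - min(ts)) / (len(ts) - 1)  # (16 - 10) / 3
assert avg_of_gaps == telescoped
```

Note that this gives a per-user average; averaging those per-user figures is not the same as averaging all gaps across users, so if you need the single global figure (3 hours in the question), average the individual gaps directly as in the LAG-based answer below it.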
Try this one:
with mytable as (
select 'A' as user_id, timestamp '2021-04-16 10:00:00.000' as event_timestamp, 1 as session_num union all
select 'A', '2021-04-16 11:00:00.000', 2 as session_num union all
select 'A', '2021-04-16 13:00:00.000', 3 as session_num union all
select 'A', '2021-04-16 16:00:00.000', 4 as session_num union all
select 'B', '2021-04-16 12:00:00.000', 1 as session_num union all
select 'B', '2021-04-16 14:00:00.000', 2 as session_num union all
select 'B', '2021-04-16 19:00:00.000', 3 as session_num union all
select 'C', '2021-04-16 10:00:00.000', 1 as session_num union all
select 'C', '2021-04-16 17:00:00.000', 2 as session_num union all
select 'C', '2021-04-16 18:00:00.000', 3 as session_num
)
select avg(diff) as average
from (
select
user_id,
timestamp_diff(event_timestamp, lag(event_timestamp) OVER (partition by user_id order by event_timestamp), hour) as diff
from mytable
)
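The LAG approach above can be checked end to end. A runnable sketch using SQLite via Python, where julianday differences multiplied by 24 stand in for BigQuery's timestamp_diff(..., hour):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sessions (user_id TEXT, event_timestamp TEXT, session_num INT);
INSERT INTO sessions VALUES
  ('A', '2021-04-16 10:00:00', 1), ('A', '2021-04-16 11:00:00', 2),
  ('A', '2021-04-16 13:00:00', 3), ('A', '2021-04-16 16:00:00', 4),
  ('B', '2021-04-16 12:00:00', 1), ('B', '2021-04-16 14:00:00', 2),
  ('B', '2021-04-16 19:00:00', 3),
  ('C', '2021-04-16 10:00:00', 1), ('C', '2021-04-16 17:00:00', 2),
  ('C', '2021-04-16 18:00:00', 3);
""")

(average,) = conn.execute("""
SELECT AVG(diff) FROM (
    -- hours since the previous session of the same user (NULL for the first)
    SELECT (julianday(event_timestamp)
          - julianday(LAG(event_timestamp) OVER (
                PARTITION BY user_id ORDER BY event_timestamp))) * 24 AS diff
    FROM sessions
)
WHERE diff IS NOT NULL
""").fetchone()

print(average)  # average of the 7 adjacent gaps: 1, 2, 3, 2, 5, 7, 1 hours
```

Each user's first session produces a NULL gap, which is excluded so only adjacent same-user pairs contribute to the average.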