Get a rolling order count into session data - google-bigquery

I have the following table.
One client has two purchases in one session.
My goal is to assign an order counter to each row of the table.
To reach this goal I am using the LAG function to fetch the previous order_id and the previous order timestamp:
SELECT
  LAG(event_timestamp) OVER (PARTITION BY session_id ORDER BY ecom_data.order_id) AS prev_order_timestamp,
  LAG(ecom_data.order_id) OVER (PARTITION BY session_id ORDER BY event_timestamp) AS prev_order_number
FROM table
My desired output is this:
The problem: I do not get the previous order time. Instead I get the event_timestamp from the previous event.
My second challenge is that I do not know how to assign an order_count. My desired output is this:
Ideally, this order count should be rolling, as in the real data set I don't know how many orders in total each session had. There can be zero to arbitrarily many orders per session.
Can you help?
Thank you!

### create sample table (it helps to include these in your questions)
WITH
base AS (
SELECT
'A' AS client_id,
1 AS session_id,
'pv' AS event_name,
NULL AS order_id,
DATETIME("2022-05-12 10:17:41") AS event_timestamp
UNION ALL
SELECT
'A' AS client_id,
1 AS session_id,
'ts' AS event_name,
NULL AS order_id,
DATETIME("2022-05-12 10:17:42") AS event_timestamp
UNION ALL
SELECT
'A' AS client_id,
1 AS session_id,
'pv' AS event_name,
NULL AS order_id,
DATETIME("2022-05-12 10:27:14") AS event_timestamp
UNION ALL
SELECT
'A' AS client_id,
1 AS session_id,
'atc' AS event_name,
NULL AS order_id,
DATETIME("2022-05-12 10:27:15") AS event_timestamp
UNION ALL
SELECT
'A' AS client_id,
1 AS session_id,
'p' AS event_name,
123 AS order_id,
DATETIME("2022-05-12 10:30:47") AS event_timestamp
UNION ALL
SELECT
'A' AS client_id,
1 AS session_id,
'pv' AS event_name,
NULL AS order_id,
DATETIME("2022-05-12 10:30:50") AS event_timestamp
UNION ALL
SELECT
'A' AS client_id,
1 AS session_id,
'pv' AS event_name,
NULL AS order_id,
DATETIME("2022-05-12 10:31:01") AS event_timestamp
UNION ALL
SELECT
'A' AS client_id,
1 AS session_id,
'atc' AS event_name,
NULL AS order_id,
DATETIME("2022-05-12 10:31:20") AS event_timestamp
UNION ALL
SELECT
'A' AS client_id,
1 AS session_id,
'ts' AS event_name,
NULL AS order_id,
DATETIME("2022-05-12 10:31:22") AS event_timestamp
UNION ALL
SELECT
'A' AS client_id,
1 AS session_id,
'rv' AS event_name,
NULL AS order_id,
DATETIME("2022-05-12 10:31:32") AS event_timestamp
UNION ALL
SELECT
'A' AS client_id,
1 AS session_id,
'pv' AS event_name,
NULL AS order_id,
DATETIME("2022-05-12 10:31:35") AS event_timestamp
UNION ALL
SELECT
'A' AS client_id,
1 AS session_id,
'pv' AS event_name,
NULL AS order_id,
DATETIME("2022-05-12 10:32:49") AS event_timestamp
UNION ALL
SELECT
'A' AS client_id,
1 AS session_id,
'p' AS event_name,
456 AS order_id,
DATETIME("2022-05-12 10:33:35") AS event_timestamp
UNION ALL
SELECT
'A' AS client_id,
1 AS session_id,
'pv' AS event_name,
NULL AS order_id,
DATETIME("2022-05-12 10:33:48") AS event_timestamp
UNION ALL
SELECT
'A' AS client_id,
1 AS session_id,
'tv' AS event_name,
NULL AS order_id,
DATETIME("2022-05-12 10:33:50") AS event_timestamp
UNION ALL
SELECT
'B' AS client_id,
1 AS session_id,
'pv' AS event_name,
NULL AS order_id,
DATETIME("2022-05-12 10:31:50") AS event_timestamp
UNION ALL
SELECT
'B' AS client_id,
1 AS session_id,
'p' AS event_name,
123 AS order_id,
DATETIME("2022-05-12 10:33:50") AS event_timestamp
UNION ALL
SELECT
'C' AS client_id,
1 AS session_id,
'pv' AS event_name,
NULL AS order_id,
DATETIME("2022-05-12 11:13:50") AS event_timestamp
UNION ALL
SELECT
'C' AS client_id,
1 AS session_id,
'pv' AS event_name,
NULL AS order_id,
DATETIME("2022-05-12 11:33:50") AS event_timestamp),
prev_order1 AS (
SELECT
*,
LAG(order_id) OVER (PARTITION BY client_id ORDER BY event_timestamp) AS prev_order_number1
FROM
base),
### filling in order number using your requested output
prev_order2 AS (
SELECT
*,
MAX(prev_order_number1) OVER(partition by client_id ORDER BY event_timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS prev_order_number2
FROM
prev_order1
ORDER BY
event_timestamp )
### inserting order_counter logic
SELECT
*,
DENSE_RANK() OVER(partition by client_id ORDER BY prev_order_number2) - 1 AS order_counter
FROM
prev_order2
Think about edge cases, and consider whether you want to partition by other dimensions such as client_id versus the whole table (as you have it now). I included client_id = 'B' as an example.
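If all you need is the rolling counter itself, a shorter sketch is possible with BigQuery's COUNTIF, assuming (as above) that an order is any row with a non-NULL order_id:

```sql
-- Count how many orders have occurred before the current row, so the
-- counter only increases after each purchase event. Drop the
-- "1 PRECEDING" frame bound (use CURRENT ROW) if the purchase row
-- itself should already carry the incremented count.
SELECT
  *,
  COUNTIF(order_id IS NOT NULL) OVER (
    PARTITION BY client_id, session_id
    ORDER BY event_timestamp
    ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
  ) AS order_counter
FROM base
```

This avoids the intermediate MAX/DENSE_RANK steps, at the cost of being BigQuery-specific (COUNTIF); the portable equivalent is COUNT(order_id) with the same window frame.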

Related

create time range with 2 columns date_time

The problem I am facing is how to find distinct time periods from multiple overlapping time periods in Teradata ANSI SQL.
For example, the attached tables contain multiple overlapping time periods. How can I combine those time periods into 3 unique time periods in Teradata SQL?
I think I could do it in Python with a loop, but I am not sure how to do it in SQL.
ID   Start Date  End Date
001  2005-01-01  2006-01-01
001  2005-01-01  2007-01-01
001  2008-01-01  2008-06-01
001  2008-04-01  2008-12-01
001  2010-01-01  2010-05-01
001  2010-04-01  2010-12-01
001  2010-11-01  2012-01-01
My expected result is:
ID   start_date  end_date
001  2005-01-01  2007-01-01
001  2008-01-01  2008-12-01
001  2010-01-01  2012-01-01
From Oracle 12, you can use MATCH_RECOGNIZE to perform a row-by-row comparison:
SELECT *
FROM table_name
MATCH_RECOGNIZE(
PARTITION BY id
ORDER BY start_date
MEASURES
FIRST(start_date) AS start_date,
MAX(end_date) AS end_date
ONE ROW PER MATCH
PATTERN (overlapping_ranges* last_range)
DEFINE overlapping_ranges AS NEXT(start_date) <= MAX(end_date)
)
Which, for the sample data:
CREATE TABLE table_name (ID, Start_Date, End_Date) AS
SELECT '001', DATE '2005-01-01', DATE '2006-01-01' FROM DUAL UNION ALL
SELECT '001', DATE '2005-01-01', DATE '2007-01-01' FROM DUAL UNION ALL
SELECT '001', DATE '2008-01-01', DATE '2008-06-01' FROM DUAL UNION ALL
SELECT '001', DATE '2008-04-01', DATE '2008-12-01' FROM DUAL UNION ALL
SELECT '001', DATE '2010-01-01', DATE '2010-05-01' FROM DUAL UNION ALL
SELECT '001', DATE '2010-04-01', DATE '2010-12-01' FROM DUAL UNION ALL
SELECT '001', DATE '2010-11-01', DATE '2012-01-01' FROM DUAL;
Outputs:
ID   START_DATE           END_DATE
001  2005-01-01 00:00:00  2007-01-01 00:00:00
001  2008-01-01 00:00:00  2008-12-01 00:00:00
001  2010-01-01 00:00:00  2012-01-01 00:00:00
db<>fiddle here
Update: Alternative query
SELECT id,
start_date,
end_date
FROM (
SELECT id,
dt,
SUM(cnt) OVER (PARTITION BY id ORDER BY dt) AS grp,
cnt
FROM (
SELECT ID,
dt,
SUM(type) OVER (PARTITION BY id ORDER BY dt, ROWNUM) * type AS cnt
FROM table_name
UNPIVOT (dt FOR type IN (start_date AS 1, end_date AS -1))
)
WHERE cnt IN (1,0)
)
PIVOT (MAX(dt) FOR cnt IN (1 AS start_date, 0 AS end_date))
Or, an equivalent that does not use UNPIVOT, PIVOT or ROWNUM and works in both Oracle and PostgreSQL:
SELECT id,
MAX(CASE cnt WHEN 1 THEN dt END) AS start_date,
MAX(CASE cnt WHEN 0 THEN dt END) AS end_date
FROM (
SELECT id,
dt,
SUM(cnt) OVER (PARTITION BY id ORDER BY dt) AS grp,
cnt
FROM (
SELECT ID,
dt,
SUM(type) OVER (PARTITION BY id ORDER BY dt, rn) * type AS cnt
FROM (
SELECT r.*,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY dt ASC, type DESC) AS rn
FROM (
SELECT id, 1 AS type, start_date AS dt FROM table_name
UNION ALL
SELECT id, -1 AS type, end_date AS dt FROM table_name
) r
) p
) s
WHERE cnt IN (1,0)
) t
GROUP BY id, grp
Update 2: Another Alternative
SELECT id,
MIN(start_date) AS start_date,
MAX(end_Date) AS end_date
FROM (
SELECT t.*,
SUM(CASE WHEN start_date <= prev_max THEN 0 ELSE 1 END)
OVER (PARTITION BY id ORDER BY start_date) AS grp
FROM (
SELECT t.*,
MAX(end_date) OVER (
PARTITION BY id ORDER BY start_date
ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
) AS prev_max
FROM table_name t
) t
) t
GROUP BY id, grp
db<>fiddle Oracle PostgreSQL
This is a gaps and islands problem. Try this:
with u as
(select ID, start_date, end_date,
case
when start_date <= lag(end_date) over(partition by ID order by start_date, end_date) then 0
else 1 end as grp
from table_name),
v as
(select ID, start_date, end_date,
sum(grp) over(partition by ID order by start_date, end_date) as island
from u)
select ID, min(start_date) as start_Date, max(end_date) as end_date
from v
group by ID, island;
Fiddle
Basically you can identify "islands" by comparing start_date of current row to end_date of previous row (ordered by start_date, end_date), if it precedes it then it's the same island. Then you can do a rolling sum() to get the island numbers. Finally select min(start_date) and max(end_date) from each island to get the desired output.
This may work, with a little change to the function; I tried it in DBeaver:
select ID,Start_Date,End_Date
from
(
select t.*,
dense_rank () over(partition by extract (year from Start_Date) order BY End_Date desc) drnk
from testing_123 t
) temp
where temp.drnk = 1
ORDER BY Start_Date;
Try this
WITH a as (
SELECT
ID,
LEFT(Start_Date, 4) as Year,
MIN(Start_Date) as New_Start_Date
FROM
TAB1
GROUP BY
ID,
LEFT(Start_Date, 4)
), b as (
SELECT
a.ID,
Year,
New_Start_Date,
End_Date
FROM
a
LEFT JOIN
TAB1
ON LEFT(a.New_Start_Date, 4) = LEFT(TAB1.Start_Date, 4)
)
select
ID,
New_Start_Date as Start_Date,
MAX(End_Date)
from
b
GROUP BY
ID,
New_Start_Date;
Example: https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=97f91b68c635aebfb752538cdd752ace

LAG with condition

I want to get a value from the previous row that matches a certain condition.
For example: here I want for each row to get the timestamp from the last event = 1.
I feel I can do it without joins with LAG and PARTITION BY with CASE but I am not able to crack it.
Please help.
Here is one approach using analytic functions:
WITH cte AS (
SELECT *, COUNT(CASE WHEN event = 1 THEN 1 END) OVER
(PARTITION BY customer_id ORDER BY ts) cnt
FROM yourTable
)
SELECT ts, customer_id, event,
MAX(CASE WHEN event = 1 THEN ts END) OVER
(PARTITION BY customer_id, cnt) AS desired_result
FROM cte
ORDER BY customer_id, ts;
Demo
We can articulate your problem by saying that you want the desired_result column to contain the most recent timestamp value when the event was 1. The count (cnt) in the CTE above computes a pseudo group of records for each time the event is 1. Then we simply do a conditional aggregation over customer and pseudo group to find the timestamp value.
One more approach with "one query":
with data as
(
select sysdate - 0.29 ts, 111 customer_id, 1 event from dual union all
select sysdate - 0.28 ts, 111 customer_id, 2 event from dual union all
select sysdate - 0.27 ts, 111 customer_id, 3 event from dual union all
select sysdate - 0.26 ts, 111 customer_id, 1 event from dual union all
select sysdate - 0.25 ts, 111 customer_id, 1 event from dual union all
select sysdate - 0.24 ts, 111 customer_id, 2 event from dual union all
select sysdate - 0.23 ts, 111 customer_id, 1 event from dual union all
select sysdate - 0.22 ts, 111 customer_id, 1 event from dual
)
select
ts, event,
last_value(case when event=1 then ts end) ignore nulls
over (partition by customer_id order by ts) desired_result,
max(case when event=1 then ts end)
over (partition by customer_id order by ts) desired_result_2
from data
order by ts
Edit: As suggested by MatBailie, the max(case...) works as well and is a more general approach. The "last_value ... ignore nulls" is Oracle-specific.

Stopping leading observations once certain threshold met in Oracle

Hopefully this will make sense. In short, what I have is a patient with multiple encounters. The first encounter is always included, and each following encounter is included if it falls within 4 hours of the previous one.
If the next encounter does not meet the criteria, then all later observations should be excluded from the output.
The code below shows the problem. It outputs rows 1, 2, and 4. I want rows 1 & 2 but not 4.
Any tips appreciated on this
TIA
With Base as
(select 123 as ID, 12345 as enc_id, TO_DATE('2019-07-01 13:27:18', 'YYYY-MM-DD HH24:MI:SS') as dt from dual union
select 123 as ID, 12346 as enc_id, TO_DATE('2019-07-01 16:27:18', 'YYYY-MM-DD HH24:MI:SS') as dt from dual union
select 123 as ID, 12347 as enc_id, TO_DATE('2019-07-02 16:27:18', 'YYYY-MM-DD HH24:MI:SS') as dt from dual union
select 123 as ID, 12348 as enc_id, TO_DATE('2019-07-02 18:27:18', 'YYYY-MM-DD HH24:MI:SS') as dt from dual)
select * from (select ID,ENC_ID,dt,row_number() over (partition by ID order by DT) RK,
lag(dt) over (partition by ID order by dt) prev_dt,
(DT-lag(dt) over (partition by ID order by dt))*24 as time_dif_hrs from base) where RK=1 or TIME_DIF_HRS<4
You can use another analytic function, SUM, as follows:
select ID, ENC_ID, dt from
(select ID, ENC_ID, dt, rk,
sum(case when prev_dt is null or (dt - prev_dt) * 24 < 4 then 0 else 1 end)
over (partition by ID order by dt) as cond_failed_running
from (select ID, ENC_ID, dt,
row_number() over (partition by ID order by dt) RK,
lag(dt) over (partition by ID order by dt) prev_dt
from base)
) where rk = 1 or cond_failed_running = 0

Finding the most recent thing prior to a specific event

I'm doing some timestamp problem solving but am stuck with some join logic.
I have a table of data like so:
id, event_time, event_type, location
1001, 2018-06-04 18:23:48.526895 UTC, I, d
1001, 2018-06-04 19:26:44.359296 UTC, I, h
1001, 2018-06-05 06:07:03.658263 UTC, I, w
1001, 2018-06-07 00:47:44.651841 UTC, I, d
1001, 2018-06-07 00:48:17.857729 UTC, C, d
1001, 2018-06-08 00:04:53.086240 UTC, I, a
1001, 2018-06-12 21:23:03.071829 UTC, I, d
...
And I'm trying to find the timestamp difference between when a user has an event_type of C and the most recent event type of I up to event_type C for a given location value.
Ultimately the schema I'm after is:
id, location, timestamp_diff
1001, d, 33
1001, z, 21
1002, a, 55
...
I tried the following, which works for a single id value, but doesn't seem to work for multiple ids. I might be over-complicating the issue. With one id it gives about 5 rows, which is right. However, when I open it up to two ids, I get upwards of 200 rows when I should get something like 7 (5 for the first id and 2 for the second):
with c as (
select
id
,event_time as c_time
,location
from data
where event_type = 'C'
and id = '1001'
)
,i as (
select
id
,event_time as i_time
,location
from data
where event_type = 'I'
)
,check1 as (
select
c.*
,i.i_time
from c
left join i on (c.id = i.id and c.location = i.location)
group by 1,2,3,4
having i_time <= c_time
)
,check2 as (
select
id
,c_time
,location
,max(i_time) as i_time
from check1
group by 1,2,3
)
select
id
,location
,timestamp_diff(c_time, i_time, second) as timestamp_diff
from check2
#standardSQL
SELECT id, location, TIMESTAMP_DIFF(event_time, i_event_time, SECOND) AS diff
FROM (
SELECT *, MAX(IF(event_type = 'I', event_time, NULL)) OVER(win2) AS i_event_time
FROM (
SELECT *, COUNTIF(event_type = 'C') OVER(win1) grp
FROM `project.dataset.table`
WINDOW win1 AS (PARTITION BY id, location ORDER BY event_time ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
)
WINDOW win2 AS (PARTITION BY id, location, grp ORDER BY event_time ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
)
WHERE event_type = 'C'
AND NOT i_event_time IS NULL
This version addresses some edge cases, for example when there are consecutive 'C' events with "missing" 'I' events, as in the example below:
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1001 id, TIMESTAMP '2018-06-04 18:23:48.526895 UTC' event_time, 'I' event_type, 'd' location UNION ALL
SELECT 1001, '2018-06-04 19:26:44.359296 UTC', 'I', 'h' UNION ALL
SELECT 1001, '2018-06-05 06:07:03.658263 UTC', 'I', 'w' UNION ALL
SELECT 1001, '2018-06-07 00:47:44.651841 UTC', 'I', 'd' UNION ALL
SELECT 1001, '2018-06-07 00:48:17.857729 UTC', 'C', 'd' UNION ALL
SELECT 1001, '2018-06-08 00:04:53.086240 UTC', 'C', 'd' UNION ALL
SELECT 1001, '2018-06-12 21:23:03.071829 UTC', 'I', 'd'
)
SELECT id, location, TIMESTAMP_DIFF(event_time, i_event_time, SECOND) AS diff
FROM (
SELECT *, MAX(IF(event_type = 'I', event_time, NULL)) OVER(win2) AS i_event_time
FROM (
SELECT *, COUNTIF(event_type = 'C') OVER(win1) grp
FROM `project.dataset.table`
WINDOW win1 AS (PARTITION BY id, location ORDER BY event_time ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
)
WINDOW win2 AS (PARTITION BY id, location, grp ORDER BY event_time ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
)
WHERE event_type = 'C'
AND NOT i_event_time IS NULL
The result is:
Row  id    location  diff
1    1001  d         33
while, without addressing that edge case, it would be:
Row  id    location  diff
1    1001  d         33
2    1001  d         83795
You can use a cumulative max() to get the most recent 'I' time before every event.
Then just filter on the 'C' events:
select id, location,
timestamp_diff(event_time, i_event_time, second) as diff
from (select t.*,
max(case when event_type = 'I' then event_time end) over (partition by id, location order by event_time) as i_event_time
from t
) t
where event_type = 'C';

SQL: Query for getting date-wise increment

I need to count the number of users added every day, given a date range (from date and to date). An example is shown below:
select
'2017-06-01' as myDate
, count(distinct user_id)
from tbl_stats
where date(dateTime)<='2017-06-01'
union all
select
'2017-06-02' as myDate
, count(distinct user_id)
from tbl_stats
where date(dateTime)<='2017-06-02'
The output would be like:
reportDate | count
------------+-------
2017-06-01 | 2467
2017-06-02 | 2470
So, I will just have fromDate and toDate, and I would need a date-wise distinct user count from the table. I will not be using any procedures or loops.
SELECT DATE(ts.dateTime) AS reportDate
, COUNT(distinct ts.user_id) AS userCount
FROM tbl_stats AS ts
WHERE ts.dateTime >= #lowerBoundDate
AND ts.dateTime < TIMESTAMPADD('DAY', 1, #upperBoundDate)
GROUP BY DATE(ts.dateTime)
To get a cumulative (distinct) user count per day, use the following; replace the custom dates given in the example with your start and end dates.
WITH test_data AS (
SELECT '2017-01-01'::date as event_date, 1::int as user_id
UNION
SELECT '2017-01-01'::date as event_date, 2::int as user_id
UNION
SELECT '2017-01-02'::date as event_date, 1::int as user_id
UNION
SELECT '2017-01-02'::date as event_date, 2::int as user_id
UNION
SELECT '2017-01-02'::date as event_date, 3::int as user_id
UNION
SELECT '2017-01-03'::date as event_date, 4::int as user_id
UNION
SELECT '2017-01-03'::date as event_date, 5::int as user_id
UNION
SELECT '2017-01-04'::date as event_date, 1::int as user_id
UNION
SELECT '2017-01-04'::date as event_date, 2::int as user_id
UNION
SELECT '2017-01-04'::date as event_date, 3::int as user_id
UNION
SELECT '2017-01-04'::date as event_date, 4::int as user_id
UNION
SELECT '2017-01-04'::date as event_date, 5::int as user_id
UNION
SELECT '2017-01-04'::date as event_date, 6::int as user_id
UNION
SELECT '2017-01-05'::date as event_date, 3::int as user_id
UNION
SELECT '2017-01-05'::date as event_date, 4::int as user_id
UNION
SELECT '2017-01-05'::date as event_date, 5::int as user_id
UNION
SELECT '2017-01-05'::date as event_date, 6::int as user_id
UNION
SELECT '2017-01-05'::date as event_date, 7::int as user_id
UNION
SELECT '2017-01-05'::date as event_date, 8::int as user_id
UNION
SELECT '2017-01-06'::date as event_date, 7::int as user_id
UNION
SELECT '2017-01-06'::date as event_date, 9::int as user_id
)
SELECT event_date,
COUNT(distinct user_id) AS distinct_user_per_day,
SUM(COUNT(distinct user_id)) OVER (ORDER BY event_date) AS cumulative_user_count
FROM test_data
WHERE event_date >= '2017-01-01'
AND
event_date <= '2017-01-06'
GROUP BY
event_date
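Note that a running SUM over per-day distinct counts can count the same user on several days. If each user should be counted only once, from their first appearance onward (as in the question's desired output), one portable sketch, reusing the question's tbl_stats / dateTime / user_id names, is:

```sql
-- Find each user's first day, then take a running total of first
-- appearances: on any report date this equals the number of distinct
-- users seen on or before that date.
SELECT first_day AS reportDate,
       SUM(COUNT(*)) OVER (ORDER BY first_day) AS userCount
FROM (
  SELECT user_id, MIN(DATE(dateTime)) AS first_day
  FROM tbl_stats
  GROUP BY user_id
) AS firsts
GROUP BY first_day
```

Days on which no new user appears will not produce a row; join against a calendar/date series if every date between fromDate and toDate must be output.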