Numbering rows based on changes in multiple fields (including an "invisible" one) in PostgreSQL - sql

I had a look at the previous topics, but I cannot achieve what I want.
I have a table like this:
id | status    | update_date
:- | :-------- | :----------
A  | PENDING   | 2020-11-01
A  | PENDING   | 2020-11-02
A  | CONFIRMED | 2020-11-03
A  | CONFIRMED | 2020-11-04
A  | CONFIRMED | 2020-11-05
A  | PENDING   | 2020-11-06
A  | PAID      | 2020-11-07
B  | CONFIRMED | 2020-11-02
etc.
and I want to get this:
id | status    | rank
:- | :-------- | ---:
A  | PENDING   | 1
A  | CONFIRMED | 2
A  | PENDING   | 3
A  | PAID      | 4
B  | CONFIRMED | 1
etc.
That is, the update_date (together with the status changes) should be used to sort and number the rows, but the date itself should NOT appear in the final result.
PS: as you can see, an id can go back and forth between statuses (PENDING -> CONFIRMED -> PENDING -> etc.) multiple times.
Thanks a lot!

You can address this as a gaps-and-islands problem. The difference between two row numbers identifies the group each record belongs to, which you can then use to aggregate:
select id, status,
row_number() over(partition by id order by min(update_date)) as rn
from (
select t.*,
row_number() over(partition by id order by update_date) rn1,
row_number() over(partition by id, status order by update_date) rn2
from mytable t
) t
group by id, status, rn1 - rn2
order by id, min(update_date)
Demo on DB Fiddle:
id | status | rn
:- | :-------- | -:
A | PENDING | 1
A | CONFIRMED | 2
A | PENDING | 3
A | PAID | 4
B | CONFIRMED | 1
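For a quick local check, the same row_number-difference trick can be run through Python's sqlite3 (SQLite >= 3.25 also supports these window functions). The two-step CTE below is an equivalent rephrasing of the query above, a sketch with table and data taken from the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mytable (id TEXT, status TEXT, update_date TEXT);
INSERT INTO mytable VALUES
  ('A','PENDING','2020-11-01'),   ('A','PENDING','2020-11-02'),
  ('A','CONFIRMED','2020-11-03'), ('A','CONFIRMED','2020-11-04'),
  ('A','CONFIRMED','2020-11-05'), ('A','PENDING','2020-11-06'),
  ('A','PAID','2020-11-07'),      ('B','CONFIRMED','2020-11-02');
""")

rows = conn.execute("""
WITH grp AS (
  -- collapse each island (consecutive run of the same status) to one row
  SELECT id, status, MIN(update_date) AS first_date
  FROM (
    SELECT t.*,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY update_date)
         - ROW_NUMBER() OVER (PARTITION BY id, status ORDER BY update_date) AS g
    FROM mytable t
  ) sub
  GROUP BY id, status, g
)
SELECT id, status,
       ROW_NUMBER() OVER (PARTITION BY id ORDER BY first_date) AS rn
FROM grp
ORDER BY id, rn
""").fetchall()
```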

Step-by-step demo: db<>fiddle
SELECT
id,
status,
row_number() OVER (PARTITION BY id ORDER BY update_date) -- 3
FROM (
SELECT
*,
lead(status) OVER (PARTITION BY id ORDER BY update_date) AS next -- 1
FROM
mytable
) s
WHERE status != next OR next is null -- 2
1. The lead() window function copies the next status value into the current record.
2. Remove all records where the current and the next status are equal (no change of status).
3. Add a row number with the row_number() window function.
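The same steps can be reproduced locally via Python's sqlite3 (a sketch; the question's table is rebuilt in memory):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mytable (id TEXT, status TEXT, update_date TEXT);
INSERT INTO mytable VALUES
  ('A','PENDING','2020-11-01'),   ('A','PENDING','2020-11-02'),
  ('A','CONFIRMED','2020-11-03'), ('A','CONFIRMED','2020-11-04'),
  ('A','CONFIRMED','2020-11-05'), ('A','PENDING','2020-11-06'),
  ('A','PAID','2020-11-07'),      ('B','CONFIRMED','2020-11-02');
""")

rows = conn.execute("""
SELECT id, status,
       ROW_NUMBER() OVER (PARTITION BY id ORDER BY update_date) AS rn  -- 3
FROM (
  SELECT *,
         LEAD(status) OVER (PARTITION BY id ORDER BY update_date) AS next  -- 1
  FROM mytable
) s
WHERE status <> next OR next IS NULL  -- 2
ORDER BY id, rn
""").fetchall()
```

Note that the window function in the outer SELECT is evaluated after the WHERE filter, so the row numbers are gapless.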

Related

How to combine Cross Join and String Agg in Bigquery with date time difference

I am trying to go from the following table
| user_id | touch      | Date       | Purchase Amount |
|---------|------------|------------|-----------------|
| 1       | Impression | 2020-09-12 | 0               |
| 1       | Impression | 2020-10-12 | 0               |
| 1       | Purchase   | 2020-10-13 | 125$            |
| 1       | Email      | 2020-10-14 | 0               |
| 1       | Impression | 2020-10-15 | 0               |
| 1       | Purchase   | 2020-10-30 | 122             |
| 2       | Impression | 2020-10-15 | 0               |
| 2       | Impression | 2020-10-16 | 0               |
| 2       | Email      | 2020-10-17 | 0               |
to
| user_id | path                             | Number of days between First Touch and Purchase   | Purchase Amount |
|---------|----------------------------------|---------------------------------------------------|-----------------|
| 1       | Impression, Impression, Purchase | 2020-10-13 (Purchase) - 2020-09-12 (Impression)   | 125$            |
| 1       | Email, Impression, Purchase      | 2020-10-30 (Purchase) - 2020-10-14 (Email)        | 122$            |
| 2       | Impression, Impression, Email    | 2020-12-31 (Fixed date) - 2020-10-15 (Impression) | 0$              |
In essence, I am trying to create a new row for each unique user every time a 'Purchase' is encountered, with the preceding touches collected into a comma-separated string.
I also want the difference in days between the first touch and the first purchase for each unique user; when a new row is created, we do the same again for that user, as shown in the example above.
From the little I have gathered, I need a mixture of cross join and string_agg, but I tried using a case statement within string_agg and was not able to get the required result.
Is there a better way to do this in SQL (BigQuery)?
Thank you
Below is for BigQuery Standard SQL
#standardSQL
select user_id,
string_agg(touch order by date) path,
date_diff(max(date), min(date), day) days,
sum(amount) amount
from (
select user_id, touch, date, amount,
countif(touch = 'Purchase') over win grp
from `project.dataset.table`
window win as (partition by user_id order by date rows between unbounded preceding and 1 preceding)
)
group by user_id, grp
Applied to the sample data from your question, the output is:
Another change: in case there is no Purchase among the touches, we calculate the number of days up to a fixed date we have set. How can I add this to the query above?
select user_id,
string_agg(touch order by date) path,
date_diff(if(countif(touch = 'Purchase') = 0, date '2020-12-31', max(date)), min(date), day) days,
sum(amount) amount
from (
select user_id, touch, date, amount,
countif(touch = 'Purchase') over win grp
from `project.dataset.table`
window win as (partition by user_id order by date rows between unbounded preceding and 1 preceding)
)
group by user_id, grp
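countif(), string_agg() and date_diff() are BigQuery-specific; as a runnable sketch, the same grouping idea translates to SQLite (run via Python's sqlite3) with sum(case ...), group_concat() and julianday(). The fixed-date fallback is omitted here for brevity:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (user_id INT, touch TEXT, date TEXT, amount INT);
INSERT INTO events VALUES
  (1, 'Impression', '2020-09-12', 0),   (1, 'Impression', '2020-10-12', 0),
  (1, 'Purchase',   '2020-10-13', 125), (1, 'Email',      '2020-10-14', 0),
  (1, 'Impression', '2020-10-15', 0),   (1, 'Purchase',   '2020-10-30', 122),
  (2, 'Impression', '2020-10-15', 0),   (2, 'Impression', '2020-10-16', 0),
  (2, 'Email',      '2020-10-17', 0);
""")

rows = conn.execute("""
SELECT user_id,
       GROUP_CONCAT(touch) AS path,  -- NB: aggregate ordering is only guaranteed with ORDER BY in SQLite >= 3.44
       CAST(julianday(MAX(date)) - julianday(MIN(date)) AS INTEGER) AS days,
       SUM(amount) AS amount
FROM (
  SELECT user_id, touch, date, amount,
         -- running count of purchases BEFORE the current row = group id;
         -- COALESCE covers the empty frame on each user's first row
         COALESCE(SUM(CASE WHEN touch = 'Purchase' THEN 1 ELSE 0 END)
                  OVER (PARTITION BY user_id ORDER BY date
                        ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING), 0) AS grp
  FROM events
)
GROUP BY user_id, grp
ORDER BY user_id, grp
""").fetchall()
```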
with output
This means you need a solution that starts a new group whenever a 'Purchase' touch occurs.
Use the following query:
select user_id,
       <aggregation function according to your requirement>,
       sum(purchase_amount)
from (
  select t.*,
         sum(case when touch = 'Purchase' then 1 else 0 end) over (partition by user_id order by date) as sm
  from t
) t
group by user_id, sm
We could approach this as a gaps-and-island problem, where every island ends with a purchase. How do we define the groups? By counting how many purchases we have ahead (current row included) - so with a descending sort in the query.
select user_id, string_agg(touch order by date),
min(date) as first_date, max(date) as max_date,
date_diff(max(date), min(date), day) as cnt_days
from (
select t.*,
countif(touch = 'Purchase') over(partition by user_id order by date desc) as grp
from mytable t
) t
group by user_id, grp
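A runnable sketch of this descending-count idea in SQLite via Python's sqlite3 (the BigQuery-only countif is replaced by the portable sum(case ...), and dates are compared with julianday since SQLite has no date_diff):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (user_id INT, touch TEXT, date TEXT, amount INT);
INSERT INTO events VALUES
  (1, 'Impression', '2020-09-12', 0),   (1, 'Impression', '2020-10-12', 0),
  (1, 'Purchase',   '2020-10-13', 125), (1, 'Email',      '2020-10-14', 0),
  (1, 'Impression', '2020-10-15', 0),   (1, 'Purchase',   '2020-10-30', 122),
  (2, 'Impression', '2020-10-15', 0),   (2, 'Impression', '2020-10-16', 0),
  (2, 'Email',      '2020-10-17', 0);
""")

rows = conn.execute("""
SELECT user_id,
       MIN(date) AS first_date, MAX(date) AS last_date,
       CAST(julianday(MAX(date)) - julianday(MIN(date)) AS INTEGER) AS cnt_days
FROM (
  SELECT t.*,
         -- count purchases at or after the current row (descending sort):
         -- every island shares one value and ends with a Purchase
         SUM(CASE WHEN touch = 'Purchase' THEN 1 ELSE 0 END)
           OVER (PARTITION BY user_id ORDER BY date DESC) AS grp
  FROM events t
)
GROUP BY user_id, grp
ORDER BY user_id, first_date
""").fetchall()
```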
You can create a value for each row that corresponds to the number of earlier rows where touch = 'Purchase', which can then be used to group on:
with r as (select row_number() over(order by t1.user_id) rid, t1.* from table t1)
select t3.user_id, string_agg(t3.touch), sum(t3.amount), date_diff(max(t3.date), min(t3.date), day)
from (select
(select countif(r1.touch = 'Purchase' and r1.rid < r2.rid) from r r1) c1, r2.* from r r2
) t3
group by t3.user_id, t3.c1;

Query with conditional lag statement

I'm trying to find the previous value of a column where the row meets some criteria. Consider the table:
| user_id | session_id | time | referrer |
|---------|------------|------------|------------|
| 1 | 1 | 2018-01-01 | [NULL] |
| 1 | 2 | 2018-02-01 | google.com |
| 1 | 3 | 2018-03-01 | google.com |
I want to find, for each session, the previous value of session_id where the referrer is NULL. So, for the second AND third rows, the value of parent_session_id should be 1.
However, by just using lag(session_id) over (partition by user_id order by time), I will get parent_session_id=2 for the 3rd row.
I suspect it can be done using a combination of window functions, but I just can't figure it out.
I'd use last_value() in combination with if():
WITH t AS (SELECT * FROM UNNEST([
struct<user_id int64, session_id int64, time date, referrer string>(1, 1, date('2018-01-01'), NULL),
(1,2,date('2018-02-01'), 'google.com'),
(1,3,date('2018-03-01'), 'google.com')
]) )
SELECT
*,
last_value(IF(referrer is null, session_id, NULL) ignore nulls)
over (partition by user_id order by time rows between unbounded preceding and 1 preceding) lastNullrefSession
FROM t
You could even do this via a correlated subquery:
SELECT
session_id,
(SELECT MAX(t2.session_id) FROM yourTable t2
WHERE t2.referrer IS NULL AND t2.session_id < t1.session_id) prev_session_id
FROM yourTable t1
ORDER BY
session_id;
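A runnable sketch of the correlated-subquery version via Python's sqlite3 (with several users you would also correlate on t2.user_id = t1.user_id):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sessions (user_id INT, session_id INT, time TEXT, referrer TEXT);
INSERT INTO sessions VALUES
  (1, 1, '2018-01-01', NULL),
  (1, 2, '2018-02-01', 'google.com'),
  (1, 3, '2018-03-01', 'google.com');
""")

rows = conn.execute("""
SELECT session_id,
       -- latest earlier session whose referrer is NULL
       (SELECT MAX(t2.session_id) FROM sessions t2
        WHERE t2.referrer IS NULL AND t2.session_id < t1.session_id) AS prev_session_id
FROM sessions t1
ORDER BY session_id
""").fetchall()
```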
Here is an approach using analytic functions which might work:
WITH cte AS (
SELECT *,
SUM(CASE WHEN referrer IS NULL THEN 1 ELSE 0 END)
OVER (ORDER BY session_id) cnt
FROM yourTable
)
SELECT
session_id,
CASE WHEN cnt = 0
THEN NULL
ELSE MIN(session_id) OVER (PARTITION BY cnt) END prev_session_id
FROM cte
ORDER BY
session_id;

Count values checking if consecutive

This is my table:
Event     | Order           | Timestamp
--------- | --------------- | ------------------------
delFailed | 281475031393706 | 2018-07-24T15:48:08.000Z
reopen    | 281475031393706 | 2018-07-24T15:54:36.000Z
reopen    | 281475031393706 | 2018-07-24T15:54:51.000Z
I need to count the number of event 'delFailed' and 'reopen' to calculate #delFailed - #reopen.
The difficulty is that two identical consecutive events must be counted only once, so in this case the result should be 0, not -1.
This is what I have achieved so far (which is wrong: it gives me -1 instead of 0, because there are two consecutive "reopen" events):
with
events as (
select
event as events,
orders,
"timestamp"
from main_source_execevent
where orders = '281475031393706'
and event in ('reopen', 'delFailed')
order by "timestamp"
),
count_events as (
select
count(events) as CEvents,
events,
orders
from events
group by orders, events
)
select (
(select cevents from count_events where events = 'delFailed') - (select cevents from count_events where events = 'reopen')
) as nAttempts,
orders
from count_events
group by orders
How can I count only once when there are two identical consecutive events?
This is a gaps-and-islands problem: you can build two row numbers to check whether consecutive rows carry the same event.
Explanation:
one row number is computed over all rows,
another row number is computed per Event value.
SELECT *
FROM (
SELECT *
,ROW_NUMBER() OVER(ORDER BY Timestamp) grp
,ROW_NUMBER() OVER(PARTITION BY Event ORDER BY Timestamp) rn
FROM T
) t1
| event | Order | timestamp | grp | rn |
|-----------|-----------------|----------------------|-----|----|
| delFailed | 281475031393706 | 2018-07-24T15:48:08Z | 1 | 1 |
| reopen | 281475031393706 | 2018-07-24T15:54:36Z | 2 | 1 |
| reopen | 281475031393706 | 2018-07-24T15:54:51Z | 3 | 2 |
Once those two row numbers are in place, as in the result above, grp - rn tells you whether consecutive rows carry the same event:
SELECT *,grp-rn
FROM (
SELECT *
,ROW_NUMBER() OVER(ORDER BY Timestamp) grp
,ROW_NUMBER() OVER(PARTITION BY Event ORDER BY Timestamp) rn
FROM T
) t1
| event | Order | timestamp | grp | rn | grp-rn |
|-----------|-----------------|----------------------|-----|----|----------|
| delFailed | 281475031393706 | 2018-07-24T15:48:08Z | 1 | 1 | 0 |
| reopen | 281475031393706 | 2018-07-24T15:54:36Z | 2 | 1 | 1 |
| reopen | 281475031393706 | 2018-07-24T15:54:51Z | 3 | 2 | 1 |
You can see that for two identical consecutive events the grp-rn value is the same, so we can group by grp - rn (and Event) and count.
Final query:
CREATE TABLE T(
Event VARCHAR(50),
"Order" VARCHAR(50),
Timestamp Timestamp
);
INSERT INTO T VALUES ('delFailed',281475031393706,'2018-07-24T15:48:08.000Z');
INSERT INTO T VALUES ('reopen',281475031393706,'2018-07-24T15:54:36.000Z');
INSERT INTO T VALUES ('reopen',281475031393706,'2018-07-24T15:54:51.000Z');
Query 1:
SELECT
SUM(CASE WHEN event = 'delFailed' THEN 1 END) -
SUM(CASE WHEN event = 'reopen' THEN 1 END) result
FROM (
SELECT Event,COUNT(distinct Event)
FROM (
SELECT *
,ROW_NUMBER() OVER(ORDER BY Timestamp) grp
,ROW_NUMBER() OVER(PARTITION BY Event ORDER BY Timestamp) rn
FROM T
) t1
group by grp - rn,Event
)t1
Results:
| result |
|--------|
| 0 |
I would just use lag() to get the first event in any sequence of similar values. Then do the calculation:
select sum( (event = 'reopen')::int ) as num_reopens,
sum( (event = 'delFailed')::int ) as num_delFailed
from (select mse.*,
lag(event) over (partition by orders order by "timestamp") as prev_event
from main_source_execevent mse
where orders = '281475031393706' and
event in ('reopen', 'delFailed')
) e
where prev_event <> event or prev_event is null;
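The lag() approach is easy to verify locally; here is a sketch via Python's sqlite3, with the Postgres-only ::int casts replaced by portable CASE expressions and the "timestamp" column renamed to ts to avoid quoting:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE main_source_execevent (event TEXT, orders TEXT, ts TEXT);
INSERT INTO main_source_execevent VALUES
  ('delFailed', '281475031393706', '2018-07-24T15:48:08.000Z'),
  ('reopen',    '281475031393706', '2018-07-24T15:54:36.000Z'),
  ('reopen',    '281475031393706', '2018-07-24T15:54:51.000Z');
""")

num_reopens, num_delfailed = conn.execute("""
SELECT SUM(CASE WHEN event = 'reopen'    THEN 1 ELSE 0 END),
       SUM(CASE WHEN event = 'delFailed' THEN 1 ELSE 0 END)
FROM (
  SELECT mse.*,
         LAG(event) OVER (PARTITION BY orders ORDER BY ts) AS prev_event
  FROM main_source_execevent mse
  WHERE orders = '281475031393706'
    AND event IN ('reopen', 'delFailed')
)
-- keep only the first event of each run of identical events
WHERE prev_event <> event OR prev_event IS NULL
""").fetchone()

print(num_delfailed - num_reopens)  # 1 - 1 = 0, as required
```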

Rank function for date in Oracle SQL

I have the following code for example:
SELECT id, order_day, purchase_id FROM d
purchase_id values are unique. Each customer id can have multiple purchase_ids. Assume everyone has made at least 5 orders.
Now, I just want to pull the first 5 purchase IDs of each customers ID (this depends on the earliest dates of purchases). I want the result to look like this:
id | purchase_id | rank
-------------------------
A | WERFEW43 | 1
A | ERTGDSFV | 3
A | FDGRT45 | 2
A | BRTE4TEW | 4
A | DFGDV | 5
B | DSFSF | 1
B | CF345 | 2
B | SDFSDFSDFS | 4
I thought of ranking by order_day, but my knowledge is not good enough to pull this off.
select id, purchase_id, rank() over (partition by id order by order_day)
from d
You can also try dense_rank() over (partition by id order by order_day) and row_number() over (partition by id order by order_day), and choose whichever is more suitable for you.
select *
from
( SELECT
id
,order_day
,purchase_id
,row_number() -- ranking
over (partition by id -- each customer
order by order_day) as rn -- based on oldest dates
FROM d
) as dt
where rn <= 5
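A runnable sketch via Python's sqlite3. The order_day values (and the sixth purchase 'LATER1') are invented here, since the question does not show dates, but they are chosen to reproduce the ranks from the expected output:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE d (id TEXT, order_day TEXT, purchase_id TEXT);
INSERT INTO d VALUES
  ('A', '2020-01-01', 'WERFEW43'), ('A', '2020-01-03', 'FDGRT45'),
  ('A', '2020-01-05', 'ERTGDSFV'), ('A', '2020-01-07', 'BRTE4TEW'),
  ('A', '2020-01-09', 'DFGDV'),    ('A', '2020-01-11', 'LATER1'),
  ('B', '2020-02-01', 'DSFSF'),    ('B', '2020-02-02', 'CF345');
""")

rows = conn.execute("""
SELECT id, purchase_id, rn
FROM (
  SELECT id, order_day, purchase_id,
         ROW_NUMBER() OVER (PARTITION BY id          -- each customer
                            ORDER BY order_day) AS rn -- based on oldest dates
  FROM d
) dt
WHERE rn <= 5  -- drops A's sixth purchase 'LATER1'
ORDER BY id, rn
""").fetchall()
```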

Group by groups of 3 by column

Say I have a table that looks like this:
| id | category_id | created_at |
|----|-------------|------------|
| 1  | 3           | date...    |
| 2  | 4           | date...    |
| 3  | 1           | date...    |
| 4  | 2           | date...    |
| 5  | 5           | date...    |
| 6  | 6           | date...    |
And imagine there are a lot more entries. I'd like to grab these in a way that they are fresh, so ordering them by created_at DESC - but I'd also like to group them by category, in groups of 3!
So in pseudocode it looks something like this:
Go to category 1
-> Pick last 3
Go to category 2
-> Pick last 3
Go to category 3
-> Pick last 3
And so forth, starting over from category_id 1 when there's no other category to grab from. This will then be paginated as well so I need to make it work with offset & limit as well somehow.
I'm not at all sure where to start or what the keywords to google are. I'd be happy with some nudges in the right direction so I can find the answer myself, or with a full answer.
Another case for the window function row_number().
Just the latest 3 rows per category
SELECT id, category_id, created_at
FROM (
SELECT id, category_id, created_at
, row_number() OVER (PARTITION BY category_id
ORDER BY created_at DESC) AS rn
FROM tbl
) sub
WHERE rn < 4
ORDER BY category_id, rn;
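A runnable sketch of the "latest 3 rows per category" query via Python's sqlite3, with a small invented data set (two categories, four and two rows):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tbl (id INT, category_id INT, created_at TEXT);
INSERT INTO tbl VALUES
  (1, 1, '2020-01-01'), (2, 1, '2020-01-02'),
  (3, 1, '2020-01-03'), (4, 1, '2020-01-04'),
  (5, 2, '2020-01-01'), (6, 2, '2020-01-02');
""")

rows = conn.execute("""
SELECT id, category_id
FROM (
  SELECT id, category_id, created_at,
         ROW_NUMBER() OVER (PARTITION BY category_id
                            ORDER BY created_at DESC) AS rn
  FROM tbl
) sub
WHERE rn < 4  -- keeps the 3 freshest rows per category; id 1 is dropped
ORDER BY category_id, rn
""").fetchall()
```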
The latest 3 rows per category, plus later rows
If you want to append the rest of the rows (your question is fuzzy about whether and how):
SELECT *
FROM (
SELECT id, category_id, created_at
, row_number() OVER (PARTITION BY category_id
ORDER BY created_at DESC) AS rn
FROM tbl
) sub
ORDER BY (rn > 3), category_id, rn;
One can sort by the outcome of a boolean expression (rn > 3):
FALSE (0)
TRUE (1)
NULL (because default is NULLS LAST - not applicable here)
This way, the latest 3 rows per category come first and all the rest later.
Or use a CTE and UNION ALL:
WITH cte AS (
SELECT id, category_id, created_at
, row_number() OVER (PARTITION BY category_id
ORDER BY created_at DESC) AS rn
FROM tbl
)
(
SELECT id, category_id, created_at
FROM cte
WHERE rn < 4
ORDER BY category_id, rn
)
UNION ALL
(
SELECT id, category_id, created_at
FROM cte
WHERE rn >= 4
ORDER BY category_id, rn
);
Same result.
All the parentheses are required to attach an ORDER BY to the individual legs of a UNION query.