Calculate users' visit streaks, capped at 7 - sql

I am trying to calculate the consecutive visits a user makes on an app. I used the RANK function to determine the streaks maintained by each user. However, my requirement is that a streak should not exceed 7 days.
For instance, if a user visits the app on 9 consecutive days, they will have 2 separate streaks: one with a count of 7 and the other with a count of 2.
I am using MaxCompute, which is similar to MySQL.
I have the following table named visitors_data:
user_id visit_date
murtaza 01-01-2021
john 01-01-2021
murtaza 02-01-2021
murtaza 03-01-2021
murtaza 04-01-2021
john 01-01-2021
murtaza 05-01-2021
murtaza 06-01-2021
john 02-01-2021
john 03-01-2021
murtaza 07-01-2021
murtaza 08-01-2021
murtaza 09-01-2021
john 20-01-2021
john 21-01-2021
Output should look like this:
user_id streak
murtaza 7
murtaza 2
john 3
john 2
I was able to get the streaks by the following query, but I could not limit the streaks to 7.
WITH groups AS (
SELECT user_id,
RANK() OVER (ORDER BY user_id, visit_date) AS RANK,
visit_date,
DATEADD(visit_date, -RANK() OVER (ORDER BY user_id, visit_date), 'dd') AS date_group
FROM visitors_data
ORDER BY user_id, visit_date)
SELECT
user_id,
COUNT(*) AS streak
FROM groups
GROUP BY
user_id,
date_group
HAVING COUNT(*)>1
ORDER BY COUNT(*);

My thinking ran along similar lines to forpas':
SELECT user_id, COUNT(*) streak
FROM
(
  SELECT
    user_id, streak,
    FLOOR((ROW_NUMBER() OVER (PARTITION BY user_id, streak ORDER BY visit_date)-1)/7) substreak
  FROM
  (
    SELECT
      user_id, visit_date,
      SUM(runtot) OVER (PARTITION BY user_id ORDER BY visit_date) streak
    FROM (
      SELECT
        user_id, visit_date,
        CASE WHEN DATE_ADD(visit_date, INTERVAL -1 DAY) = LAG(visit_date) OVER (PARTITION BY user_id ORDER BY visit_date) THEN 0 ELSE 1 END as runtot
      FROM visitors_data
      GROUP BY user_id, visit_date
    ) x
  ) y
) z
GROUP BY user_id, streak, substreak
As an explanation of how this works: a usual trick for counting runs of successive records is to use LAG to examine the previous record. If the gap is exactly one day, put a 0; otherwise put a 1. The first record of a consecutive run is then 1 and the rest are 0, so the column ends up looking like 1,0,0,0,1,0... SUM OVER ORDER BY sums this in a "running total" fashion, forming a counter that ticks up every time the start of a run is encountered. A run of 4 days followed by a gap and then a run of 3 days looks like 1,1,1,1,2,2,2, which serves as a "streak ID number".
If this is then fed into a row numbering that partitions by the streak ID number, it establishes an incrementing counter that restarts every time the streak ID changes. If we subtract 1 from this so it runs from 0 instead of 1, we can divide it by 7 and take the floor to get a "sub streak ID", which for our 9-long streak is 0,0,0,0,0,0,0,1,1 (and so on: a streak of 25 would have 7 zeroes, 7 ones, 7 twos, and 4 threes).
All that remains is to group by the user, the streak ID and the sub streak ID, and count the rows.
Inspecting the rows just before the final group and count should give some idea of how it all works.
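The intermediate rows described above can be reproduced with a small sketch. It is not the answer's exact query: it assumes SQLite 3.25+ (driven from Python's sqlite3 module) in place of MaxCompute/MySQL, so dates are ISO strings, date(x, '-1 day') replaces DATE_ADD, integer division replaces FLOOR, and SELECT DISTINCT replaces the de-duplicating GROUP BY. The sample rows are a shortened, hypothetical version of the question's data.

```python
import sqlite3

# Assumption: SQLite 3.25+ (window functions) stands in for MaxCompute/MySQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visitors_data (user_id TEXT, visit_date TEXT)")
visits = [("murtaza", f"2021-01-{d:02d}") for d in range(1, 10)]        # 9 days in a row
visits += [("john", f"2021-01-{d:02d}") for d in (1, 1, 2, 3, 20, 21)]  # duplicate + gap
conn.executemany("INSERT INTO visitors_data VALUES (?, ?)", visits)

INTERMEDIATE_SQL = """
SELECT user_id, visit_date, streak,
       -- integer division: row numbers 1..7 -> 0, 8..14 -> 1, ...
       (ROW_NUMBER() OVER (PARTITION BY user_id, streak ORDER BY visit_date) - 1) / 7
           AS substreak
FROM (
    SELECT user_id, visit_date,
           -- running total of run starts = streak ID
           SUM(runtot) OVER (PARTITION BY user_id ORDER BY visit_date) AS streak
    FROM (
        SELECT user_id, visit_date,
               -- 0 when the previous visit was exactly one day earlier, else 1
               CASE WHEN date(visit_date, '-1 day')
                         = LAG(visit_date) OVER (PARTITION BY user_id ORDER BY visit_date)
                    THEN 0 ELSE 1 END AS runtot
        FROM (SELECT DISTINCT user_id, visit_date FROM visitors_data) d
    ) x
) y
ORDER BY user_id, visit_date
"""

rows = conn.execute(INTERMEDIATE_SQL).fetchall()
for user_id, visit_date, streak, substreak in rows:
    print(user_id, visit_date, streak, substreak)
```

For murtaza the substreak column reads 0 seven times and then 1 twice; for john the streak column reads 1,1,1,2,2. Grouping by user_id, streak and substreak and counting then gives the capped streak lengths.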

With a mix of window functions and aggregation:
SELECT user_id, COALESCE(NULLIF(MAX(counter) % 7, 0), 7) streak
FROM (
  SELECT *, COUNT(*) OVER (PARTITION BY user_id, grp ORDER BY visit_date) counter
  FROM (
    SELECT *, SUM(flag) OVER (PARTITION BY user_id ORDER BY visit_date) grp
    FROM (
      SELECT *, COALESCE(DATE_ADD(visit_date, INTERVAL -1 DAY) <>
                         LAG(visit_date) OVER (PARTITION BY user_id ORDER BY visit_date), 1) flag
      FROM (SELECT DISTINCT * FROM visitors_data) t
    ) t
  ) t
) t
GROUP BY user_id, grp, FLOOR((counter - 1) / 7)

You could break them up after the fact, reusing the groups CTE from the question. For instance, if you never have a streak longer than 28 days (four chunks of 7):
SELECT user_id, LEAST(streak - (n - 1) * 7, 7) AS streak_part
FROM (SELECT user_id, COUNT(*) AS streak
      FROM groups
      GROUP BY user_id, date_group
      HAVING COUNT(*) > 1
     ) gu JOIN
     (SELECT 1 AS n UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4
     ) n
     ON streak > (n - 1) * 7  -- chunk n exists only if the streak extends past the previous chunks
ORDER BY streak_part;
If the longest streak has no known upper bound, you can do something similar with a recursive CTE.
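A minimal sketch of the recursive-CTE idea, under stated assumptions: it runs on SQLite through Python's sqlite3 module rather than MaxCompute, the streaks table and its rows are hypothetical stand-ins for the already-grouped streak lengths, and SQLite's two-argument MIN plays the role of LEAST.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table holding one row per uncapped streak.
conn.execute("CREATE TABLE streaks (user_id TEXT, streak INTEGER)")
conn.executemany(
    "INSERT INTO streaks VALUES (?, ?)",
    [("murtaza", 9), ("john", 3), ("ann", 25)],
)

SPLIT_SQL = """
WITH RECURSIVE chunks(user_id, remaining) AS (
    SELECT user_id, streak FROM streaks
    UNION ALL
    -- peel off 7 days at a time while more than one chunk is left
    SELECT user_id, remaining - 7 FROM chunks WHERE remaining > 7
)
SELECT user_id, MIN(remaining, 7) AS streak   -- scalar MIN acts like LEAST here
FROM chunks
ORDER BY user_id, streak DESC
"""

result = conn.execute(SPLIT_SQL).fetchall()
for row in result:
    print(row)
```

Each recursion step emits one more row per streak, so a streak of 25 becomes 7, 7, 7 and 4 without a fixed list of multipliers.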

Related

How to fill table with missed dates in PostgreSQL with previous data

I have a table:
date       user_id state
8/12/2021  1       visit
9/12/2021  1       registered
12/12/2021 1       order
This table only records changes in a user's state, so I can't see the state on an arbitrary date. How can I add rows for the missing dates and fill them with the previous value, so that the table becomes:
date       user_id state
8/12/2021  1       visit
9/12/2021  1       registered
10/12/2021 1       registered
11/12/2021 1       registered
12/12/2021 1       order
Here's one attempt. The CTE user_dates gets the min and max dates for each user, which are then fed to generate_series, i.e. each user is associated with all dates between their first and last date.
In the inner select we create a group for each non-null state and the consecutive null states that follow it.
In the outer select we pick the first_value for each such grp.
with user_dates(f, t, user_id) as (
select min(T.dt), max(T.dt), user_id
from T
group by user_id
)
select user_id, dt, grp, first_value(state) over (partition by user_id, grp order by dt)
from (
select ud.user_id
, cal.dt::date
, state
, count(T.state) over (partition by user_id
order by cal.dt) as grp
from user_dates ud
cross join generate_series(ud.f::timestamp, ud.t::timestamp , interval '1 day') cal (dt)
left join T
using (dt, user_id)
) as tmp
order by user_id, dt
;
user_id dt grp first_value
1 2021-12-08 1 visit
1 2021-12-09 2 registered
1 2021-12-10 2 registered
1 2021-12-11 2 registered
1 2021-12-12 3 order
You can remove grp from the select, it's merely there for informative purposes.
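For engines without generate_series, the same count-then-first_value idea can be sketched with a recursive calendar. This is an illustration under assumptions, not the answer's query: it runs on SQLite 3.25+ via Python's sqlite3 module, stores dates as ISO strings, and the cal and grouped CTE names are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T (dt TEXT, user_id INTEGER, state TEXT)")
conn.executemany("INSERT INTO T VALUES (?, ?, ?)", [
    ("2021-12-08", 1, "visit"),
    ("2021-12-09", 1, "registered"),
    ("2021-12-12", 1, "order"),
])

FILL_SQL = """
WITH RECURSIVE cal(user_id, dt, last_dt) AS (
    -- one starting row per user, then one row per following day
    SELECT user_id, MIN(dt), MAX(dt) FROM T GROUP BY user_id
    UNION ALL
    SELECT user_id, date(dt, '+1 day'), last_dt FROM cal WHERE dt < last_dt
),
grouped AS (
    SELECT cal.user_id, cal.dt, T.state,
           -- COUNT ignores NULLs, so grp only increases on days that have a state
           COUNT(T.state) OVER (PARTITION BY cal.user_id ORDER BY cal.dt) AS grp
    FROM cal
    LEFT JOIN T ON T.user_id = cal.user_id AND T.dt = cal.dt
)
SELECT user_id, dt,
       FIRST_VALUE(state) OVER (PARTITION BY user_id, grp ORDER BY dt) AS state
FROM grouped
ORDER BY user_id, dt
"""

filled = conn.execute(FILL_SQL).fetchall()
for row in filled:
    print(row)
```

The first row of every grp is the day that actually carries a state, which is why first_value fills the following gap days with it.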

Find repeating values of a certain value

I have a table similar to:
Date       Person Distance
2022/01/01 John   15
2022/01/02 John   0
2022/01/03 John   0
2022/01/04 John   0
2022/01/05 John   19
2022/01/01 Pete   25
2022/01/02 Pete   12
2022/01/03 Pete   0
2022/01/04 Pete   0
2022/01/05 Pete   1
I want to find all persons who have a distance of 0 for 3 or more consecutive days.
So in the above, it must return John and the count of the days with a zero distance.
I.e.
Person Consecutive Days with Zero
John   3
I'm looking at something like this, but I think this might be way off:
Select Person, count(*),
(row_number() over (partition by Person, Date order by Person, Date))
from mytable
Provided I understand your requirement you could, for your sample data, just calculate the difference in days of a windowed min/max date:
select distinct Person, Consecutive from (
select *, DateDiff(day,
Min(date) over(partition by person),
Max(date) over(partition by person)
) + 1 Consecutive
from t
where distance = 0
)t
where Consecutive >= 3;
If you can have gaps in the dates you could try the following that only considers rows with 1 day between each date (and could probably be simplified):
with c as (
select *, Row_Number() over (partition by person order by date) rn,
DateDiff(day, Lag(date) over(partition by person order by date), date) c
from t
where distance = 0
), g as (
select Person, rn - Row_Number() over(partition by person, c order by date) grp
from c
where c = 1 -- keep only rows exactly one day after the previous zero-distance row
)
select person, Count(*) + 1 consecutive
from g
group by person, grp
having Count(*) >= 2;
One option is to:
- transform your "Distance" values into a flag, where a distance of 0 becomes 1 and any other value becomes 0
- compute a sum over the flags in a sliding window of three rows, using a frame specification clause
- keep any "Person" value which has at least one sum of 3.
WITH cte AS (
SELECT *, SUM(CASE WHEN Distance = 0 THEN 1 ELSE 0 END) OVER(
PARTITION BY Person
ORDER BY Date_
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
) AS window_of_3
FROM tab
)
SELECT DISTINCT Person
FROM cte
WHERE window_of_3 = 3
Note: This solution requires your table to have no missing dates. If dates can be missing, you first need to add rows for the dates not found for each "Person" value for this solution to work.
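If padding the table is not an option, a different technique (date minus row number, not the windowed sum above) tolerates missing dates and also returns the count. The sketch below is an assumption-laden illustration: it uses SQLite 3.25+ through Python's sqlite3 module, ISO date strings, julianday() for the date arithmetic, and a hypothetical extra person "Ann" whose 2022-01-03 row is missing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tab (Date_ TEXT, Person TEXT, Distance INTEGER)")
conn.executemany("INSERT INTO tab VALUES (?, ?, ?)", [
    ("2022-01-01", "John", 15), ("2022-01-02", "John", 0), ("2022-01-03", "John", 0),
    ("2022-01-04", "John", 0), ("2022-01-05", "John", 19),
    ("2022-01-01", "Pete", 25), ("2022-01-02", "Pete", 12), ("2022-01-03", "Pete", 0),
    ("2022-01-04", "Pete", 0), ("2022-01-05", "Pete", 1),
    # Hypothetical person with a missing date: zeros on the 1st, 2nd and 4th only.
    ("2022-01-01", "Ann", 0), ("2022-01-02", "Ann", 0), ("2022-01-04", "Ann", 0),
])

ZERO_RUNS_SQL = """
SELECT Person, COUNT(*) AS consecutive_zero_days
FROM (
    SELECT Person,
           -- consecutive dates minus consecutive row numbers give a constant per run
           julianday(Date_) - ROW_NUMBER() OVER (PARTITION BY Person ORDER BY Date_)
               AS island
    FROM tab
    WHERE Distance = 0
) z
GROUP BY Person, island
HAVING COUNT(*) >= 3
ORDER BY Person
"""

zero_runs = conn.execute(ZERO_RUNS_SQL).fetchall()
print(zero_runs)
```

Ann's three zero days are split into runs of 2 and 1 by the missing date, so only John qualifies.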

How to find current_streaks in BQ SQL

Looking for the best way to find all streaks that are current as of today in BigQuery (so essentially the answer must be row_number() based, but otherwise any flavor of SQL should do).
created_at | user_id
-------------+---------
2022-02-10 | 1
2022-02-09 | 1
2022-02-08 | 1
2022-02-10 | 2
2022-01-20 | 3
Desired result, showing only the user_id of each streaker and the number of days streaked:
user_id | streak
----------+---------
1 | 3
2 | 1
UserID 3 is ignored because its streak did not make it to today.
You can add a condition outside the streak-identification code, which validates the existence of current_date() in the streak set and only display the valid streaks (i.e. ones which connect to today's date):
select user_id, array_length(array_agg(distinct created_at)) as streak from (
select
user_id,
created_at,
date_sub(created_at, interval rnk day) as grp from (
select
user_id,
date(created_at) as created_at,
dense_rank() over (partition by user_id order by created_at) as rnk
from table
)
)
group by user_id, grp
having current_date() in unnest( array_agg(distinct created_at))
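The same filter can be tried out locally with a sketch like the one below. It is not BigQuery code: it assumes SQLite 3.25+ via Python's sqlite3 module, a hypothetical events table, julianday() in place of date_sub, a :today parameter in place of current_date(), and no future-dated rows (so "the streak contains today" reduces to "the streak's latest date is today"). Two extra rows for user 1 are added to show an older, broken streak being dropped.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (created_at TEXT, user_id INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)", [
    ("2022-02-10", 1), ("2022-02-09", 1), ("2022-02-08", 1),
    ("2022-02-06", 1), ("2022-02-05", 1),   # hypothetical older streak for user 1
    ("2022-02-10", 2),
    ("2022-01-20", 3),
])

CURRENT_STREAKS_SQL = """
SELECT user_id, COUNT(DISTINCT created_at) AS streak
FROM (
    SELECT user_id, created_at,
           -- date minus dense rank is constant within a run of consecutive days
           julianday(created_at)
               - DENSE_RANK() OVER (PARTITION BY user_id ORDER BY created_at) AS grp
    FROM events
) g
GROUP BY user_id, grp
HAVING MAX(created_at) = :today   -- keep only streaks that reach today
ORDER BY user_id
"""

current = conn.execute(CURRENT_STREAKS_SQL, {"today": "2022-02-10"}).fetchall()
print(current)
```

Passing the date as a parameter keeps the example reproducible; in a live query the engine's current-date function would take its place.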

Category Entry and Exit Dates per ID AND Category

I have the following table, where ID is the unique identifier. An ID can move from category to category, both up and down. My table records each day an ID stays in a given category. I am trying to identify the start date and the end date of an ID in a given category. The problem is that an ID can move up a category, and move back down to its original category after a certain number of days. Here is my table as an example with only 1 ID:
ID Category Date
1 1 2021-01-01
1 1 2021-01-02
...
1 1 2021-01-24
1 2 2021-01-25
...
1 2 2021-02-15
1 1 2021-02-16
...
1 1 2021-04-20
1 2 2021-04-21
When I try to get the MIN(DATE) and MAX(DATE) and group by the category and ID, it shows me that the account was in Category 1 from 2021-01-01 to 2021-04-20, and in Category 2 from 2021-01-25 to 2021-04-21. I am trying to track the movements of the account through each category step by step; in my ideal result, the movements would be tracked as:
ID Category StartDate EndDate
1 1 2021-01-01 2021-01-24
1 2 2021-01-25 2021-02-15
1 1 2021-02-16 2021-04-20
1 2 2021-04-21 NULL (or GETDATE())
How can I achieve this result? Any help would be appreciated. I tried using the RANK() function but because the table records every single day, it seems useless.
This is a type of gaps-and-islands problem that is most easily solved using the difference of row numbers:
select id, category, min(date), max(date)
from (select t.*,
row_number() over (partition by id order by date) as seqnum,
row_number() over (partition by id, category order by date) as seqnum_2
from t
) t
group by id, category, (seqnum - seqnum_2);
Actually, the difference of row numbers is only simplest because you have not specified the database. You can just subtract a sequence of numbers from the date to get a constant that defines each group. That looks like:
select id, category, min(date), max(date)
from (select t.*,
row_number() over (partition by id, category order by date) as seqnum
from t
) t
group by id, category, date - seqnum * interval '1 day';
However, the date arithmetic varies by database.
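To see the difference-of-row-numbers grouping at work, here is a small sketch on a shortened, hypothetical version of the sample (eight days instead of several months). It assumes SQLite 3.25+ through Python's sqlite3 module, and the date column is renamed dt for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, category INTEGER, dt TEXT)")
# Category per day for ID 1: three days in 1, two in 2, two in 1, one in 2.
categories = [1, 1, 1, 2, 2, 1, 1, 2]
conn.executemany(
    "INSERT INTO t VALUES (1, ?, ?)",
    [(cat, f"2021-01-{day:02d}") for day, cat in enumerate(categories, start=1)],
)

ISLANDS_SQL = """
SELECT id, category, MIN(dt) AS start_date, MAX(dt) AS end_date
FROM (SELECT t.*,
             ROW_NUMBER() OVER (PARTITION BY id ORDER BY dt) AS seqnum,
             ROW_NUMBER() OVER (PARTITION BY id, category ORDER BY dt) AS seqnum_2
      FROM t
     ) s
-- seqnum - seqnum_2 stays constant while the category does not change
GROUP BY id, category, (seqnum - seqnum_2)
ORDER BY start_date
"""

islands = conn.execute(ISLANDS_SQL).fetchall()
for row in islands:
    print(row)
```

The second stay in category 1 gets a different seqnum - seqnum_2 value than the first, so the two stays are reported separately. The last period's end date is simply its latest recorded day; turning it into NULL for a still-open period would need an extra step.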

Teradata SQL - Operation with max-min dates

Suppose I have the following table in Teradata SQL.
How can I get the variation between the price on the latest date and the price on the earliest date, at user level? Regards
Initial table
user date price
1 1-1 10
1 2-1 20
1 3-1 30
2 1-1 12
2 2-1 22
2 3-1 32
3 1-1 13
3 2-1 23
3 3-1 33
Final table
user var_price
1 30/10-1
2 32/12-1
3 33/13-1
Try this-
SELECT B.[user],
CAST(SUM(B.max_price) AS VARCHAR)+'/'+CAST(SUM(B.min_price) AS VARCHAR)+ '-1' var_price,
SUM(B.max_price)/SUM(B.min_price) -1 calculated_var_price
FROM
(
SELECT * FROM
(
SELECT [user],0 max_price,price min_price,ROW_NUMBER() OVER (PARTITION BY [user] ORDER BY DATE) RN
FROM your_table
)A WHERE RN = 1
UNION ALL
SELECT * FROM
(
SELECT [user],price max_price,0 min_price, ROW_NUMBER() OVER (PARTITION BY [user] ORDER BY DATE DESC) RN
FROM your_table
)A WHERE RN = 1
)B
GROUP BY B.[user]
Output is-
user var_price calculated_var_price
1 30/10-1 2
2 32/12-1 1
3 33/13-1 1
Is this what you want?
select user, max(price) / min(price) - 1
from t
group by user;
Your values are monotonically increasing, so max() and min() seem like the simplest solution.
EDIT:
You can use window functions:
select user, max(last_price) / max(first_price) - 1
from (select t.*,
first_value(price) over (partition by user order by date rows between unbounded preceding and current row) as first_price,
first_value(price) over (partition by user order by date desc rows between unbounded preceding and current row) as last_price
from t
) t
group by user;
select user
,price as first_price
,last_value(price)
over (partition by user
order by date
rows between unbounded preceding and unbounded following) as last_price
from mytab
qualify
row_number() -- lowest date only
over (partition by user
order by date) = 1
This returns the row with the lowest date and adds the price of the latest date.
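The difference between the first/last-by-date approach and max()/min() only shows up when prices are not monotonic. The sketch below illustrates that under assumptions: it runs on SQLite 3.25+ via Python's sqlite3 module rather than Teradata (so no QUALIFY), the prices table and its ISO dates are hypothetical, and user 4 is an invented non-monotonic case.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (user_id INTEGER, dt TEXT, price INTEGER)")
conn.executemany("INSERT INTO prices VALUES (?, ?, ?)", [
    (1, "2021-01-01", 10), (1, "2021-01-02", 20), (1, "2021-01-03", 30),
    # Hypothetical user whose highest price is not on the latest date.
    (4, "2021-01-01", 40), (4, "2021-01-02", 10), (4, "2021-01-03", 20),
])

VARIATION_SQL = """
SELECT DISTINCT user_id,
       -- 1.0 * forces floating-point division
       1.0 * LAST_VALUE(price) OVER w / FIRST_VALUE(price) OVER w - 1 AS var_price
FROM prices
WINDOW w AS (PARTITION BY user_id ORDER BY dt
             ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
ORDER BY user_id
"""

variation = conn.execute(VARIATION_SQL).fetchall()
print(variation)
```

For user 4 this gives 20/40 - 1 = -0.5, whereas max(price) / min(price) - 1 would give 3, so the two approaches agree only when prices rise with the date.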