How to find current streaks in BigQuery SQL

Looking for the best way to find all streaks that are current as of today in BigQuery (so essentially the answer will be row_number()-based, but otherwise any flavor of SQL should do).
created_at | user_id
-------------+---------
2022-02-10 | 1
2022-02-09 | 1
2022-02-08 | 1
2022-02-10 | 2
2022-01-20 | 3
Desired result, showing only the user_id of each streaker and their number of days streaked:
user_id | streak
----------+---------
1 | 3
2 | 1
user_id 3 is ignored because its streak did not make it to today

You can add a condition outside the streak-identification code that validates the presence of current_date() in the streak's set of dates, and only displays the valid streaks (i.e. the ones that connect to today's date):
select user_id, array_length(array_agg(distinct created_at)) as streak
from (
    select
        user_id,
        created_at,
        date_sub(created_at, interval rnk day) as grp
    from (
        select
            user_id,
            date(created_at) as created_at,
            dense_rank() over (partition by user_id order by created_at) as rnk
        from `table`  -- "table" is a reserved word in BigQuery, so backtick your actual table name
    )
)
group by user_id, grp
having current_date() in unnest(array_agg(distinct created_at))
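To see it end to end, here is a minimal, self-contained sketch with the question's sample rows inlined (the expected output assumes the query runs on 2022-02-10, the "today" of the example):
with sample as (
    select date '2022-02-10' as created_at, 1 as user_id union all
    select date '2022-02-09', 1 union all
    select date '2022-02-08', 1 union all
    select date '2022-02-10', 2 union all
    select date '2022-01-20', 3
)
select user_id, array_length(array_agg(distinct created_at)) as streak
from (
    select
        user_id,
        created_at,
        date_sub(created_at, interval rnk day) as grp
    from (
        select
            user_id,
            created_at,
            dense_rank() over (partition by user_id order by created_at) as rnk
        from sample
    )
)
group by user_id, grp
having current_date() in unnest(array_agg(distinct created_at))
-- run on 2022-02-10, this returns user 1 -> 3 and user 2 -> 1; user 3 drops out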

Related

Calculate the streaks of visits of users, limited to 7

I am trying to calculate the consecutive visits a user makes on an app. I used the rank function to determine the streaks maintained by each user. However, my requirement is that a streak should not exceed 7.
For instance, if a user visits the app for 9 consecutive days, he will have 2 different streaks: one with count 7 and the other with count 2.
Using MaxCompute. It's similar to MySQL.
I have the following table named visitors_data:
user_id visit_date
murtaza 01-01-2021
john 01-01-2021
murtaza 02-01-2021
murtaza 03-01-2021
murtaza 04-01-2021
john 01-01-2021
murtaza 05-01-2021
murtaza 06-01-2021
john 02-01-2021
john 03-01-2021
murtaza 07-01-2021
murtaza 08-01-2021
murtaza 09-01-2021
john 20-01-2021
john 21-01-2021
Output should look like this:
user_id streak
murtaza 7
murtaza 2
john 3
john 2
I was able to get the streaks by the following query, but I could not limit the streaks to 7.
WITH groups AS (
    SELECT user_id,
           RANK() OVER (ORDER BY user_id, visit_date) AS RANK,
           visit_date,
           DATEADD(visit_date, -RANK() OVER (ORDER BY user_id, visit_date), 'dd') AS date_group
    FROM visitors_data
    ORDER BY user_id, visit_date
)
SELECT user_id,
       COUNT(*) AS streak
FROM groups
GROUP BY user_id,
         date_group
HAVING COUNT(*) > 1
ORDER BY COUNT(*);
My thinking ran along similar lines to forpas':
SELECT user_id, COUNT(*) AS streak
FROM (
    SELECT user_id, streak,
           FLOOR((ROW_NUMBER() OVER (PARTITION BY user_id, streak ORDER BY visit_date) - 1) / 7) AS substreak
    FROM (
        SELECT user_id, visit_date,
               SUM(runtot) OVER (PARTITION BY user_id ORDER BY visit_date) AS streak
        FROM (
            SELECT user_id, visit_date,
                   CASE WHEN DATE_ADD(visit_date, INTERVAL -1 DAY) =
                             LAG(visit_date) OVER (PARTITION BY user_id ORDER BY visit_date)
                        THEN 0 ELSE 1 END AS runtot
            FROM visitors_data
            GROUP BY user_id, visit_date
        ) x
    ) y
) z
GROUP BY user_id, streak, substreak
As an explanation of how this works: a usual trick for counting runs of successive records is to use LAG to examine the previous record, and if there is only e.g. one day's difference put a 0, otherwise put a 1. This means the first record of a consecutive run is 1 and the rest are 0, so the column ends up looking like 1,0,0,0,1,0... SUM OVER ORDER BY then sums this in a "running total" fashion, effectively forming a counter that ticks up every time the start of a run is encountered. A run of 4 days followed by a gap and then a run of 3 days looks like 1,1,1,1,2,2,2, so it forms a "streak ID number".
If this is then fed into a row numbering that partitions by the streak ID number, it establishes an incrementing counter that restarts every time the streak ID changes. If we subtract 1 so it runs from 0 instead of 1, we can divide it by 7 to get a "sub-streak ID"; for our 9-day streak that is 0,0,0,0,0,0,0,1,1 (and so on: a streak of 25 would have 7 zeroes, 7 ones, 7 twos, and 4 threes).
All that remains is to group by the user, the streak ID, and the sub-streak ID, and count the result.
Before the final group and count, the data looks like this (reconstructed from the sample data above), which should give some idea of how it all works:
user_id   visit_date   streak   substreak
murtaza   01-01-2021   1        0
murtaza   02-01-2021   1        0
murtaza   03-01-2021   1        0
murtaza   04-01-2021   1        0
murtaza   05-01-2021   1        0
murtaza   06-01-2021   1        0
murtaza   07-01-2021   1        0
murtaza   08-01-2021   1        1
murtaza   09-01-2021   1        1
john      01-01-2021   1        0
john      02-01-2021   1        0
john      03-01-2021   1        0
john      20-01-2021   2        0
john      21-01-2021   2        0
With a mix of window functions and aggregation:
SELECT user_id, COALESCE(NULLIF(MAX(counter) % 7, 0), 7) AS streak
FROM (
    SELECT *, COUNT(*) OVER (PARTITION BY user_id, grp ORDER BY visit_date) AS counter
    FROM (
        SELECT *, SUM(flag) OVER (PARTITION BY user_id ORDER BY visit_date) AS grp
        FROM (
            SELECT *, COALESCE(DATE_ADD(visit_date, INTERVAL -1 DAY) <>
                               LAG(visit_date) OVER (PARTITION BY user_id ORDER BY visit_date), 1) AS flag
            FROM (SELECT DISTINCT * FROM visitors_data) t
        ) t
    ) t
) t
GROUP BY user_id, grp, FLOOR((counter - 1) / 7)
You could break them up after the fact. For instance, if you never have a streak longer than 28:
SELECT user_id, LEAST(streak - (n - 1) * 7, 7) AS streak
FROM (SELECT user_id, COUNT(*) AS streak
      FROM groups
      GROUP BY user_id, date_group
      HAVING COUNT(*) > 1
     ) gu JOIN
     (SELECT 1 AS n UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4
     ) n
     ON streak > (n - 1) * 7
ORDER BY user_id, n;
If you have an indeterminate range for the longest streak, you can do something similar with a recursive CTE; a minimal sketch follows.
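Purely as an illustration (standard SQL; gu stands for the per-user streak counts computed in the derived table above, and recursive CTE support and syntax vary by engine):
WITH RECURSIVE chunks AS (
    -- start from each user's full streak length
    SELECT user_id, streak AS remaining
    FROM gu
    UNION ALL
    -- peel off one 7-day chunk per iteration
    SELECT user_id, remaining - 7
    FROM chunks
    WHERE remaining > 7
)
SELECT user_id, LEAST(remaining, 7) AS streak
FROM chunks;
-- a 9-day streak yields rows 7 and 2; a 25-day streak yields 7, 7, 7 and 4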

SQL AWS Athena Group by Without a Column

I have this dataset
patient_id   doctor_id   status   created_at
1            1           A        2020-10-01 10:00:00
1            1           P        2020-10-01 10:30:00
1            1           U        2020-10-01 10:35:00
1            2           A        2020-10-01 10:40:00
...
I want to group it by patient_id and doctor_id but without grouping on status, keeping only the latest status per group, so the result will be like this:
patient_id   doctor_id   status   created_at
1            1           U        2020-10-01 10:35:00
1            2           A        2020-10-01 10:40:00
...
Athena normally requires every selected column to be grouped or aggregated, but I need the last status.
In Athena/Presto you can do this with the max_by function:
SELECT
    patient_id,
    doctor_id,
    MAX_BY(status, created_at) AS last_status
FROM the_table
GROUP BY 1, 2
The max_by(x, y) function returns the value of column x from the row with the maximum value of column y within the group.
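For reference, a self-contained sketch with the question's sample rows inlined (Athena/Presto syntax; the_table and the column names come from the question):
SELECT patient_id, doctor_id, MAX_BY(status, created_at) AS last_status
FROM (
    VALUES
        (1, 1, 'A', TIMESTAMP '2020-10-01 10:00:00'),
        (1, 1, 'P', TIMESTAMP '2020-10-01 10:30:00'),
        (1, 1, 'U', TIMESTAMP '2020-10-01 10:35:00'),
        (1, 2, 'A', TIMESTAMP '2020-10-01 10:40:00')
) AS t (patient_id, doctor_id, status, created_at)
GROUP BY 1, 2
-- returns (1, 1, 'U') and (1, 2, 'A'), matching the desired output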
ROW_NUMBER provides one option here:
WITH cte AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY patient_id, doctor_id ORDER BY created_at DESC) AS rn
    FROM yourTable
)
SELECT patient_id, doctor_id, status, created_at
FROM cte
WHERE rn = 1
ORDER BY patient_id, doctor_id;

time difference between transaction of user

Table: txn
customer_id | time_stamp
-------------------------
1 | 00:01:03
1 | 00:02:04
2 | 00:03:05
2 | 00:04:06
Looking to query the time difference between each transaction and the customer's next transaction, per customer_id.
Results:
Customer ID | Time Diff
1 | 61
select customer_ID, ...
from txn
You want lead() ... but date/time functions are notoriously database-specific. In SQL Server:
select txn.*,
       datediff(second,
                time_stamp,
                lead(time_stamp) over (partition by customer_id order by time_stamp)
       ) as diff_seconds
from txn;
In BigQuery:
select txn.*,
       timestamp_diff(lead(time_stamp) over (partition by customer_id order by time_stamp),
                      time_stamp,
                      second) as diff_seconds
from txn;
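One caveat: the sample values look like TIME-of-day values rather than timestamps, and BigQuery's timestamp_diff expects TIMESTAMPs. If time_stamp is actually a TIME column (an assumption about the column type), the analogous function would be time_diff:
select txn.*,
       time_diff(lead(time_stamp) over (partition by customer_id order by time_stamp),
                 time_stamp,
                 second) as diff_seconds
from txn;
-- customer 1: 00:02:04 - 00:01:03 = 61 seconds, matching the expected result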

Get last known record per month in BigQuery

An account balance collection that shows the account balance of a customer on a given day:
+---------------+---------+------------+
| customer_id | value | timestamp |
+---------------+---------+------------+
| 1 | -500 | 2019-10-12 |
| 1 | -300 | 2019-10-11 |
| 1 | -200 | 2019-10-10 |
| 1 | 0 | 2019-10-09 |
| 2 | 200 | 2019-09-10 |
| 1 | 600 | 2019-09-02 |
+---------------+---------+------------+
Notice, that customer #2 had no updates to his account balance in October.
I want to get the last account balance per customer per month. If there has been no account balance update for a customer in a given month, the last known account balance should be transferred to the current month. The result should look like that:
+---------------+---------+------------+
| customer_id | value | timestamp |
+---------------+---------+------------+
| 1 | -500 | 2019-10-12 |
| 2 | 200 | 2019-10-10 |
| 2 | 200 | 2019-09-10 |
| 1 | 600 | 2019-09-02 |
+---------------+---------+------------+
Since the account balance of customer #2 was not updated in October but in September, we create a copy of the row from September changing the date to October. Any ideas how to achieve this in BigQuery?
Below is for BigQuery Standard SQL
#standardSQL
WITH customers AS (
  SELECT DISTINCT customer_id FROM `project.dataset.table`
), months AS (
  SELECT month FROM (
    SELECT DATE_TRUNC(MIN(timestamp), MONTH) min_month, DATE_TRUNC(MAX(timestamp), MONTH) max_month
    FROM `project.dataset.table`
  ), UNNEST(GENERATE_DATE_ARRAY(min_month, max_month, INTERVAL 1 MONTH)) month
)
SELECT customer_id,
       IFNULL(value, LEAD(value) OVER(win)) value,
       IFNULL(timestamp, DATE_ADD(LEAD(timestamp) OVER(win), INTERVAL DATE_DIFF(month, LEAD(month) OVER(win), MONTH) MONTH)) timestamp
FROM months, customers
LEFT JOIN (
  SELECT DATE_TRUNC(timestamp, MONTH) month, customer_id,
         ARRAY_AGG(STRUCT(value, timestamp) ORDER BY timestamp DESC LIMIT 1)[OFFSET(0)].*
  FROM `project.dataset.table`
  GROUP BY month, customer_id
) USING(month, customer_id)
WINDOW win AS (PARTITION BY customer_id ORDER BY month DESC)
If applied to the sample data from your question, as in the example below:
#standardSQL
WITH `project.dataset.table` AS (
  SELECT 1 customer_id, -500 value, DATE '2019-10-12' timestamp UNION ALL
  SELECT 1, -300, '2019-10-11' UNION ALL
  SELECT 1, -200, '2019-10-10' UNION ALL
  SELECT 2, 200, '2019-09-10' UNION ALL
  SELECT 2, 100, '2019-08-11' UNION ALL
  SELECT 2, 50, '2019-07-12' UNION ALL
  SELECT 1, 600, '2019-09-02'
), customers AS (
  SELECT DISTINCT customer_id FROM `project.dataset.table`
), months AS (
  SELECT month FROM (
    SELECT DATE_TRUNC(MIN(timestamp), MONTH) min_month, DATE_TRUNC(MAX(timestamp), MONTH) max_month
    FROM `project.dataset.table`
  ), UNNEST(GENERATE_DATE_ARRAY(min_month, max_month, INTERVAL 1 MONTH)) month
)
SELECT customer_id,
       IFNULL(value, LEAD(value) OVER(win)) value,
       IFNULL(timestamp, DATE_ADD(LEAD(timestamp) OVER(win), INTERVAL DATE_DIFF(month, LEAD(month) OVER(win), MONTH) MONTH)) timestamp
FROM months, customers
LEFT JOIN (
  SELECT DATE_TRUNC(timestamp, MONTH) month, customer_id,
         ARRAY_AGG(STRUCT(value, timestamp) ORDER BY timestamp DESC LIMIT 1)[OFFSET(0)].*
  FROM `project.dataset.table`
  GROUP BY month, customer_id
) USING(month, customer_id)
WINDOW win AS (PARTITION BY customer_id ORDER BY month DESC)
-- ORDER BY month DESC, customer_id
The result is:
Row   customer_id   value   timestamp
1     1             -500    2019-10-12
2     2             200     2019-10-10
3     1             600     2019-09-02
4     2             200     2019-09-10
5     1             null    null
6     2             100     2019-08-11
7     1             null    null
8     2             50      2019-07-12
The following query should mostly answer your question by creating a 'month-end' record for each customer for every month and getting the most recent balance:
with
-- Generate a set of months
month_begins as (
    select dt
    from unnest(generate_date_array('2019-01-01', '2019-12-01', interval 1 month)) dt
),
-- Get the month ends
month_ends as (
    select date_sub(date_add(dt, interval 1 month), interval 1 day) as month_end_date
    from month_begins
),
-- Cross join and group so we get one customer record for every month, to account for
-- situations where a customer doesn't change balance in a month
user_month_ends as (
    select customer_id, month_end_date
    from `project.dataset.table`
    cross join month_ends
    group by 1, 2
),
-- Fan out so for each month end you get all balances prior to month end for each customer
values_prior_to_month_end as (
    select customer_id, value, timestamp, month_end_date
    from `project.dataset.table`
    inner join user_month_ends using(customer_id)
    where timestamp <= month_end_date
),
-- Order by most recent balance before month end, even if it was more than a month ago
ordered as (
    select *,
           row_number() over (partition by customer_id, month_end_date order by timestamp desc) as my_row
    from values_prior_to_month_end
),
-- Finally, select only the most recent record for each customer per month
final as (
    select * except(my_row)
    from ordered
    where my_row = 1
)
select * from final
order by customer_id, month_end_date desc
A few caveats:
I did not order the results to match your desired result set, and I kept the month-end date to illustrate the concept. You can easily change the ordering and exclude unneeded fields.
In the month_begins CTE, I set a range of months extending into the future, so your result set will contain the most recent balance for 'future' months. To make this a bit prettier, consider changing '2019-12-01' to current_date() so the query always runs through the end of the current month (see the snippet after this list).
Your timestamp field looks to contain dates, so I used date logic, but you should be able to apply the same principles with timestamp logic if the underlying fields are actual timestamps.
In your result set, I'm not sure why your 2nd row (customer 2) has a timestamp of '2019-10-10'; that seems arbitrary, as customer 2 has no 2nd balance record.
I purposely split the logic into several CTEs so I could comment on each step more easily; you could certainly perform several steps in the same code block for a more condensed query.
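For the current_date() change mentioned above, only the month_begins CTE needs to move (current_date() is a built-in BigQuery function; the rest of the query is unchanged):
-- Generate months from a fixed start through the current month
month_begins as (
    select dt
    from unnest(generate_date_array('2019-01-01', current_date(), interval 1 month)) dt
),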

Select distinct users group by time range

I have a table with the following info
|date | user_id | week_beg | month_beg|
SQL to create table with test values:
CREATE TABLE uniques
(
    date      DATE,
    user_id   INT,
    week_beg  DATE,
    month_beg DATE
);
INSERT INTO uniques VALUES ('2013-01-01', 1, '2012-12-30', '2013-01-01');
INSERT INTO uniques VALUES ('2013-01-03', 3, '2012-12-30', '2013-01-01');
INSERT INTO uniques VALUES ('2013-01-06', 4, '2013-01-06', '2013-01-01');
INSERT INTO uniques VALUES ('2013-01-07', 4, '2013-01-06', '2013-01-01');
INPUT TABLE:
| date | user_id | week_beg | month_beg |
| 2013-01-01 | 1 | 2012-12-30 | 2013-01-01 |
| 2013-01-03 | 3 | 2012-12-30 | 2013-01-01 |
| 2013-01-06 | 4 | 2013-01-06 | 2013-01-01 |
| 2013-01-07 | 4 | 2013-01-06 | 2013-01-01 |
OUTPUT TABLE:
| date | time_series | cnt |
| 2013-01-01 | D | 1 |
| 2013-01-01 | W | 1 |
| 2013-01-01 | M | 1 |
| 2013-01-03 | D | 1 |
| 2013-01-03 | W | 2 |
| 2013-01-03 | M | 2 |
| 2013-01-06 | D | 1 |
| 2013-01-06 | W | 1 |
| 2013-01-06 | M | 3 |
| 2013-01-07 | D | 1 |
| 2013-01-07 | W | 1 |
| 2013-01-07 | M | 3 |
I want to calculate the number of distinct user_id's for a date:
1. For that date
2. For that week up to that date (Week to date)
3. For the month up to that date (Month to date)
1 is easy to calculate.
For 2 and 3 I am trying to use such queries:
SELECT
    date,
    'W' AS "time_series",
    COUNT(DISTINCT user_id) OVER (PARTITION BY week_beg) AS "cnt"
FROM user_subtitles

SELECT
    date,
    'M' AS "time_series",
    COUNT(DISTINCT user_id) OVER (PARTITION BY month_beg) AS "cnt"
FROM user_subtitles
Postgres does not allow DISTINCT inside window functions, so this approach does not work.
I have also tried a GROUP BY approach, but it does not work, as it gives me numbers for whole weeks/months.
What's the best way to approach this problem?
Count all rows
SELECT date, '1_D' AS time_series, count(DISTINCT user_id) AS cnt
FROM uniques
GROUP BY 1
UNION ALL
SELECT DISTINCT ON (1)
date, '2_W', count(*) OVER (PARTITION BY week_beg ORDER BY date)
FROM uniques
UNION ALL
SELECT DISTINCT ON (1)
date, '3_M', count(*) OVER (PARTITION BY month_beg ORDER BY date)
FROM uniques
ORDER BY 1, time_series
Your columns week_beg and month_beg are 100% redundant and can easily be replaced by
date_trunc('week', date + 1) - 1 and date_trunc('month', date) respectively.
Your week seems to start on Sunday (off by one), therefore the + 1 .. - 1.
The default frame of a window function with ORDER BY in the OVER clause is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. That's exactly what you need.
Use UNION ALL, not UNION.
Your unfortunate choice for time_series (D, W, M) does not sort well; I renamed the values to make the final ORDER BY easier.
This query can deal with multiple rows per day. Counts include all peers for a day.
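As a quick sanity check of those replacement expressions (a standalone sketch; generate_series returns timestamps here, hence the casts):
SELECT d::date                                     AS day
     , (date_trunc('week', d::date + 1)::date - 1) AS week_beg   -- Sunday-based week start
     , date_trunc('month', d)::date                AS month_beg
FROM generate_series(date '2013-01-01', date '2013-01-07', interval '1 day') d;
-- 2013-01-01 yields week_beg 2012-12-30 and month_beg 2013-01-01, matching the sample data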
More about DISTINCT ON:
Select first row in each GROUP BY group?
DISTINCT users per day
To count every user only once per day, use a CTE with DISTINCT ON:
WITH x AS (SELECT DISTINCT ON (1, 2) date, user_id FROM uniques)
SELECT date, '1_D' AS time_series, count(user_id) AS cnt
FROM x
GROUP BY 1
UNION ALL
SELECT DISTINCT ON (1)
       date, '2_W',
       count(*) OVER (PARTITION BY (date_trunc('week', date + 1)::date - 1)
                      ORDER BY date)
FROM x
UNION ALL
SELECT DISTINCT ON (1)
       date, '3_M',
       count(*) OVER (PARTITION BY date_trunc('month', date) ORDER BY date)
FROM x
ORDER BY 1, 2
DISTINCT users over dynamic period of time
You can always resort to correlated subqueries. They tend to be slow with big tables!
Building on the previous queries:
WITH du AS (SELECT date, user_id FROM uniques GROUP BY 1, 2)
, d AS (
    SELECT date,
           (date_trunc('week', date + 1)::date - 1) AS week_beg,
           date_trunc('month', date)::date AS month_beg
    FROM uniques
    GROUP BY 1
)
SELECT date, '1_D' AS time_series, count(user_id) AS cnt
FROM du
GROUP BY 1
UNION ALL
SELECT date, '2_W', (SELECT count(DISTINCT user_id) FROM du
                     WHERE du.date BETWEEN d.week_beg AND d.date)
FROM d
GROUP BY date, week_beg
UNION ALL
SELECT date, '3_M', (SELECT count(DISTINCT user_id) FROM du
                     WHERE du.date BETWEEN d.month_beg AND d.date)
FROM d
GROUP BY date, month_beg
ORDER BY 1, 2;
Faster with dense_rank()
@Clodoaldo came up with a major improvement: use the window function dense_rank(). Here is another idea for an optimized version. It should be even faster to exclude daily duplicates right away. The performance gain grows with the number of rows per day.
Building on a simplified and sanitized data model
- without the redundant columns
- day as column name instead of date
date is a reserved word in standard SQL and a basic type name in PostgreSQL, and shouldn't be used as an identifier.
CREATE TABLE uniques(
day date -- instead of "date"
,user_id int
);
Improved query:
WITH du AS (
    SELECT DISTINCT ON (1, 2)
           day, user_id,
           date_trunc('week', day + 1)::date - 1 AS week_beg,
           date_trunc('month', day)::date AS month_beg
    FROM uniques
)
SELECT day, count(user_id) AS d, max(w) AS w, max(m) AS m
FROM (
    SELECT user_id, day,
           dense_rank() OVER (PARTITION BY week_beg ORDER BY user_id) AS w,
           dense_rank() OVER (PARTITION BY month_beg ORDER BY user_id) AS m
    FROM du
) s
GROUP BY day
ORDER BY day;
Which of the four faster variants is fastest for you depends on your data distribution.
All of them are about 10x as fast as the correlated-subqueries version (which isn't bad for correlated subqueries).
Without correlated subqueries:
with u as (
    select
        "date", user_id,
        date_trunc('week', "date" + 1)::date - 1 week_beg,
        date_trunc('month', "date")::date month_beg
    from uniques
)
select
    "date", count(distinct user_id) d,
    max(week_dr) w, max(month_dr) m
from (
    select
        user_id, "date",
        dense_rank() over(partition by week_beg order by user_id) week_dr,
        dense_rank() over(partition by month_beg order by user_id) month_dr
    from u
) s
group by "date"
order by "date"
Try
SELECT *
FROM (
    SELECT dates, count(user_id), 'D' AS time_series
    FROM users_data
    GROUP BY dates
    UNION
    SELECT max(dates), count(user_id), 'W'
    FROM users_data
    GROUP BY date_part('year', dates) + date_part('week', dates)
    UNION
    SELECT max(dates), count(user_id), 'M'
    FROM users_data
    GROUP BY date_part('year', dates) + date_part('month', dates)
) temp
ORDER BY dates, time_series
Try queries like this:
SELECT count(distinct user_id), date_format(date, '%Y-%m-%d') as date_period
FROM uniques
GROUP BY date_period
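Note that date_format is MySQL syntax; for the PostgreSQL table in this question, the equivalent formatting function would be to_char (an adaptation, not part of the original answer):
SELECT count(distinct user_id), to_char(date, 'YYYY-MM-DD') as date_period
FROM uniques
GROUP BY date_period;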