Count in time window for each present day in Clickhouse - sql

I have a table with logs of users who used a certain service. Something like the table below: each row is a timestamp of activity and a user id.
user_id | timestamp
--------|--------------------
831     | 2022-06-22 04:37:10
789     | 2022-06-22 12:38:57
831     | 2022-06-22 16:40:10
I want to calculate the number of unique users on each day, but not just within that day: the window should also include the week prior. Basically, a moving-window unique count: for day "x" the count should be over the window "x-7 days" through "x".
As I see in the docs,
INTERVAL syntax for DateTime RANGE OFFSET frame: not supported, specify the number of seconds instead (RANGE works with any numeric type).
the easy way of using an interval, passing something like RANGE INTERVAL 7 DAY PRECEDING, is not supported, and they suggest using RANGE with seconds instead; but I don't really have experience with RANGE in SQL, so I don't really get how you pass the seconds there. My current code:
with cleaned_table as (
select
user_id,
date_trunc('day', timestamp) as day
from
table
)
SELECT
day,
uniqExact(user_id) OVER (
PARTITION by day ORDER BY day range ???
)
FROM
cleaned_table
Also, ideally, I have a feeling that I should add a GROUP BY somewhere, since I need only one row per day rather than a row for every row of the initial table; without grouping I'm doing a recalculation(?) for each row instead of calculating once per day.
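For reference, the seconds-based RANGE the docs point to might look something like this sketch (604800 = 7 * 24 * 3600; note it computes a rolling count per row rather than per day, so it is not a full solution on its own):
SELECT
timestamp,
uniqExact(user_id) OVER (
ORDER BY timestamp ASC RANGE BETWEEN 604800 PRECEDING AND CURRENT ROW
) AS users_7d
FROM table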

create table t(user_id Int64, timestamp DateTime) Engine = Memory as select * from values((831, '2022-06-22 04:37:10'), (789,'2022-06-22 12:38:57'), (831,'2022-06-22 16:40:10'), (1,'2022-06-21 12:38:57'), (2,'2022-06-20 16:40:10'));
SELECT
day,
finalizeAggregation(u) AS uniqByDay,
uniqMerge(u) OVER (ORDER BY day ASC RANGE BETWEEN 6 PRECEDING AND CURRENT ROW) AS uniqBy6Days
FROM
(
SELECT
toDate(timestamp) AS day,
uniqState(user_id) AS u
FROM t
GROUP BY day
)
ORDER BY day ASC
┌────────day─┬─uniqByDay─┬─uniqBy6Days─┐
│ 2022-06-20 │         1 │           1 │
│ 2022-06-21 │         1 │           2 │
│ 2022-06-22 │         2 │           4 │
└────────────┴───────────┴─────────────┘
The uniqState / uniqMerge pair does the expensive unique-counting work once per day (inside the GROUP BY) and then only merges seven small states per window row. See also: How to obtain p95 of a day and p95 of the last 7 days of that day from Clickhouse through an SQL query?

I'll mark this as the answer, but I'd be happy if anyone knows how to optimize this solution by incorporating GROUP BY or other methods, so that the window function is not recalculated for each row but only computed once per day.
Anyway, RANGE BETWEEN 6 PRECEDING AND CURRENT ROW is what I was searching for; it worked just fine. I also added ::date to cast the timestamp to the Date type, and SELECT DISTINCT directly picks one row per day instead of running another GROUP BY with any() on top.
with cleaned_table as (
select
user_id,
date_trunc('day', timestamp)::date as day
from
table
)
SELECT
DISTINCT day,
uniqExact(user_id) OVER (
ORDER BY
day ASC RANGE BETWEEN 6 PRECEDING
and current row
) as users
FROM
cleaned_table

Related

adjust recursive sql query to exclude holidays and weekends

I have a dataset like this called data_per_day
instructional_day | points
------------------|-------
2023-01-24        | 2
2023-01-23        | 2
2023-01-20        | 1
2023-01-19        | 0
And so on. The table shows weekdays (days minus holidays and weekends) and the number of points someone has earned: 1 marks the start of a streak, 0 marks the end of a streak, and 2 is max points after a streak has started.
I need to find how long the latest streak is; in this case the result should be 3.
I created a recursive CTE, but the query returns 2 as the streak count because the lag mechanism steps back one calendar day at a time. Instead I need to adjust it so that instructional days are used rather than all calendar dates.
WITH RECURSIVE cte AS (
SELECT
student_unique_id,
instructional_day,
points,
1 AS cnt
FROM
`data_per_day`
WHERE
instructional_day = DATE_ADD(CURRENT_DATE('America/Chicago'), INTERVAL -1 DAY)
UNION ALL
SELECT
a.student_unique_id,
a.instructional_day,
a.points,
c.cnt+1
FROM (
SELECT
*
FROM
`data_per_day`
WHERE
points > 0 ) a
INNER JOIN
cte c
ON
a.student_unique_id = c.student_unique_id
AND a.instructional_day = c.instructional_day - INTERVAL '1' day )
SELECT
student_unique_id,
MAX(cnt) AS streak
FROM
cte
WHERE
student_unique_id = "419"
GROUP BY
student_unique_id
How do I adjust the query?
This is not a trivial coding exercise, so I won't actually write the code and provide it.
What you have here is a gaps and islands question. You want to identify the largest "island" of days with points within a date range. Depending upon what dates are contained in your data, you may need to generate a list of sequential dates that meet your criteria.
One problem I see is that you are trying to combine the steps to generate the date range (the recursive CTE) with the points. You'll need to separate those steps.
1. Define the date range.
2. Generate the dates within the range.
3. Filter the dates with isweekday = 'no' and isholiday = 'no'. You will probably want to add a row number during this step.
4. [left] join the dates to your data, including coalesce(points, 0).
5. Filter the data to points > 0.
6. Identify the islands.
7. Identify the largest island per student.
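For what it's worth, a rough sketch of those steps in BigQuery SQL might look like the following; the generated calendar, the `holidays` table, and its column name are illustrative assumptions, and the per-student handling is omitted for brevity:
WITH instructional_days AS (
  -- steps 1-3: generate the date range, keep weekdays, drop holidays, add a row number
  SELECT day, ROW_NUMBER() OVER (ORDER BY day) AS rn
  FROM UNNEST(GENERATE_DATE_ARRAY('2023-01-01', CURRENT_DATE('America/Chicago'))) AS day
  WHERE EXTRACT(DAYOFWEEK FROM day) BETWEEN 2 AND 6  -- Monday..Friday
    AND day NOT IN (SELECT holiday_date FROM `holidays`)  -- hypothetical holiday table
),
joined AS (
  -- step 4: left join the dates to the data, treating missing days as 0 points
  SELECT d.day, d.rn, COALESCE(p.points, 0) AS points
  FROM instructional_days d
  LEFT JOIN `data_per_day` p ON p.instructional_day = d.day
),
islands AS (
  -- steps 5-6: after filtering to points > 0, rn minus a fresh row number is
  -- constant within each run of consecutive instructional days
  SELECT day, rn - ROW_NUMBER() OVER (ORDER BY day) AS grp
  FROM joined
  WHERE points > 0
)
-- step 7: the size of the most recent island is the latest streak
SELECT COUNT(*) AS streak
FROM islands
WHERE grp = (SELECT grp FROM islands ORDER BY day DESC LIMIT 1)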

Count rows with equal values in a window function

I have a time series in a SQLite Database and want to analyze it.
The important part of the time series consists of a column with different but not unique string values.
I want to do something like this:
Value | concat | countValue
------|--------|-----------
A     | A      | 1
A     | A,A    | 2
B     | A,A,B  | 1
B     | A,B,B  | 2
B     | B,B,B  | 3
C     | B,B,C  | 1
B     | B,C,B  | 2
I don't know how to get the countValue column. It should count all values in the partition equal to the current row's Value.
I tried this, but it just counts all values in the partition, not the values equal to the current row's Value (the Value Like Value predicate compares the column with itself, so it is always true):
SELECT
Value,
group_concat(Value) OVER wind AS concat,
Sum(Case When Value Like Value Then 1 Else 0 End) OVER wind AS countValue
FROM TimeSeries
WINDOW
wind AS (ORDER BY date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
ORDER BY
date
;
The query is also limited by these factors:
The query should work with any amount of unique Values
The query should work with any Partition Size (ROWS BETWEEN n PRECEDING AND CURRENT ROW)
Is this even possible using only SQL?
Here is an approach using string functions (note that it counts substring occurrences, so it assumes no value is a substring of another value):
select
value,
group_concat(value) over wind as concat,
(
length(group_concat(value) over wind) - length(replace(group_concat(value) over wind, value, ''))
) / length(value) cnt_value
from timeseries
window wind as (order by date rows between 2 preceding and current row)
order by date;
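If one value can be a substring of another, the string-length trick over-counts; a safer sketch (same TimeSeries table assumed) is a correlated count over row numbers:
with numbered as (
  select value, date, row_number() over (order by date) as rn
  from timeseries
)
select n.value,
       -- count equal values inside the same 2-preceding frame
       (select count(*)
        from numbered n2
        where n2.value = n.value
          and n2.rn between n.rn - 2 and n.rn) as countvalue
from numbered n
order by n.rn;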

Rolling 31 day average including previous 31 days from BigQuery

I've created a query that counts the number of rows (records) for the last 31 days (based on a timestamp field) and also includes the previous 31 days before that period, i.e. a query that returns both. I now have the following query:
SELECT
COUNT(*) OVER(ORDER BY datetime DESC RANGE BETWEEN 2678400000 PRECEDING AND CURRENT ROW) AS rolling_avg_31_days,
COUNT(*) OVER(ORDER BY datetime DESC RANGE BETWEEN 5356800000 PRECEDING AND CURRENT ROW) AS rolling_avg_62_days
FROM `p`
ORDER BY rolling_avg_31_days DESC LIMIT 1
And it returns some data, but not really the data I was hoping for:
rolling_avg_31_days | rolling_avg_62_days
8,422,783 | 9,790,304
If I query the same table with (rolling 62 days):
SELECT COUNT(*) FROM `p`
WHERE datetime > UNIX_MILLIS(CURRENT_TIMESTAMP)-5356800000 AND datetime < UNIX_MILLIS(CURRENT_TIMESTAMP)-2678400000
I get a value of 6,192,920
I'm not sure what I'm doing wrong. Any help is much appreciated!
So, the first query is correct and gives you rolling counts (31 and 62 days) based on the timestamp field. Also, because of the ORDER BY ... DESC and LIMIT 1, you are getting the row with the biggest rolling_avg_31_days, which is not necessarily the row for the most recent datetime.
The second query just produces a count of the rows between 62 and 31 days ago based on the current timestamp, which, as explained above, is not what the first query produces; thus the discrepancy.
To troubleshoot further, or to try to understand the difference, change ORDER BY rolling_avg_31_days DESC LIMIT 1 to ORDER BY datetime DESC LIMIT 1, and also add datetime to the SELECT statement so you can see whether the row belongs to the current date or close to it; then the results are comparable.
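Concretely, the suggested troubleshooting version might look like this sketch (same window frames as the first query; only the outer ORDER BY changes and datetime is projected):
SELECT
datetime,
COUNT(*) OVER(ORDER BY datetime DESC RANGE BETWEEN 2678400000 PRECEDING AND CURRENT ROW) AS rolling_avg_31_days,
COUNT(*) OVER(ORDER BY datetime DESC RANGE BETWEEN 5356800000 PRECEDING AND CURRENT ROW) AS rolling_avg_62_days
FROM `p`
ORDER BY datetime DESC LIMIT 1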
Instead of going with the above, I've decided to change the query to be a bit simpler:
SELECT
(SELECT COUNT(DISTINCT(wasabi_user_id)) FROM `p` WHERE datetime > UNIX_MILLIS(CURRENT_TIMESTAMP)-5356800000 AND datetime < UNIX_MILLIS(CURRENT_TIMESTAMP)-2678400000) as _62days,
(SELECT COUNT(DISTINCT(wasabi_user_id)) FROM `p` WHERE datetime > UNIX_MILLIS(CURRENT_TIMESTAMP)-2678400000) AS _31days
FROM `mycujoo_kafka_public.v_web_event_pageviews` LIMIT 1
Thanks @Mikhail for the help though!

SQL Query to find out Sequence in next or subsequent rows of a Table based on a condition

I have an SQL Table with following structure
Timestamp(DATETIME)|AuditEvent
---------|----------
T1|Login
T2|LogOff
T3|Login
T4|Execute
T5|LogOff
T6|Login
T7|Login
T8|Report
T9|LogOff
I want the T-SQL way to find out how long the user was logged into the system, i.e. the time between the Login and LogOff events for each session on a given day.
Day (Date)|UserTime(In Hours) (Logoff Time - LogIn Time)
--------- | -------
Jun 12 | 2
Jun 12 | 3
Jun 13 | 5
I tried using two temporary tables and row numbers, but could not get it to work, since the comparison is on time, i.e. finding the next LogOff event whose timestamp is greater than the current row's Login event.
You need to group the records. I would suggest counting logins or logoffs. Here is one approach to get the time for each "session":
select min(case when auditevent = 'login' then timestamp end) as login_time,
max(timestamp) as logoff_time
from (select t.*,
sum(case when auditevent = 'logoff' then 1 else 0 end) over (order by timestamp desc) as grp
from t
) t
group by grp;
You then have to do whatever you want to get the numbers per day. It is unclear what those counts are.
The subquery does a reverse count. It counts the number of "logoff" records that come on or after each record. For records in the same "session", this count is the same, and suitable for grouping.
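As a rough sketch of that per-day step, assuming the desired output is hours per session grouped by day, you could wrap the query above and diff the endpoints (DATEDIFF in minutes divided by 60.0 keeps fractional hours):
select cast(login_time as date) as day,
       datediff(minute, login_time, logoff_time) / 60.0 as user_hours
from (select min(case when auditevent = 'login' then timestamp end) as login_time,
             max(timestamp) as logoff_time
      from (select t.*,
                   sum(case when auditevent = 'logoff' then 1 else 0 end) over (order by timestamp desc) as grp
            from t
           ) t
      group by grp
     ) s;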

BigQuery SQL for 28-day sliding window aggregate (without writing 28 lines of SQL)

I'm trying to compute a 28 day moving sum in BigQuery using the LAG function.
The top answer to this question
Bigquery SQL for sliding window aggregate
from Felipe Hoffa indicates that you can use the LAG function. An example of this would be:
SELECT
spend + spend_lagged_1day + spend_lagged_2day + spend_lagged_3day + ... + spend_lagged_27day as spend_28_day_sum,
user,
date
FROM (
SELECT spend,
LAG(spend, 1) OVER (PARTITION BY user ORDER BY date) spend_lagged_1day,
LAG(spend, 2) OVER (PARTITION BY user ORDER BY date) spend_lagged_2day,
LAG(spend, 3) OVER (PARTITION BY user ORDER BY date) spend_lagged_3day,
...
LAG(spend, 27) OVER (PARTITION BY user ORDER BY date) spend_lagged_27day,
user,
date
FROM user_spend
)
Is there a way to do this without having to write out 28 lines of SQL!
The BigQuery documentation doesn't do a good job of explaining the complexity of window functions that the tool supports because it doesn't specify what expressions can appear after ROWS or RANGE. It actually supports the SQL 2003 standard for window functions, which you can find documented other places on the web, such as here.
That means you can get the effect you want with a single window function. The range is 27 because it's how many rows before the current one to include in the sum.
SELECT spend,
SUM(spend) OVER (PARTITION BY user ORDER BY date ROWS BETWEEN 27 PRECEDING AND CURRENT ROW),
user,
date
FROM user_spend;
A RANGE bound can also be extremely useful. If your table was missing dates for some user, then 27 PRECEDING rows would go back more than 27 days, but RANGE will produce a window based on the date values themselves. In the following query, the date field is a BigQuery TIMESTAMP and the range is specified in microseconds. I'd advise that whenever you do date math like this in BigQuery, you test it thoroughly to make sure it's giving you the expected answer.
SELECT spend,
SUM(spend) OVER (PARTITION BY user ORDER BY date RANGE BETWEEN 27 * 24 * 60 * 60 * 1000000 PRECEDING AND CURRENT ROW),
user,
date
FROM user_spend;
Bigquery: How to get a rolling time range in a window clause
This is an old post, but I spent a long time searching for a solution and this post came up, so maybe this will help someone.
If the partition of your window clause does not have a record for every day, you need to use the RANGE clause to accurately get a rolling time range (ROWS would count a number of records, which would go too far back, since you don't have a record for every day in your PARTITION BY). The problem is that in BigQuery the RANGE clause does not support dates.
From BigQuery's documentation:
numeric_expression must have numeric type. DATE and TIMESTAMP are not currently supported. In addition, the numeric_expression must be a constant, non-negative integer or a parameter.
The workaround I found was to use the UNIX_DATE(date_expression) in the ORDER BY clause along with a RANGE clause:
SUM(value) OVER (PARTITION BY Column1 ORDER BY UNIX_DATE(Date) RANGE BETWEEN 5 PRECEDING AND CURRENT ROW)
Here is an alternate take that I found to be flexible and effective:
WITH users AS
(SELECT 'Isabella' as user, 1 as spend, DATE(2020, 03, 28) as date
UNION ALL SELECT 'Isabella', 2, DATE(2020, 03, 29)
UNION ALL SELECT 'Daniel', 3, DATE(2020, 03, 24)
UNION ALL SELECT 'Andrew', 4, DATE(2020, 03, 23)
UNION ALL SELECT 'Daniel', 5, DATE(2020, 03, 11)
UNION ALL SELECT 'Jose', 6, DATE(2020, 03, 17))
SELECT
user,
max(sum(case date_diff(date(2020,04,15), date, day) between 0 and 28
when true then spend else 0 end)) over(partition by user) as spend_28_day_sum
FROM users
group by user
+-----------------------------+
| user     | spend_28_day_sum |
+-----------------------------+
| Andrew   | 4                |
| Daniel   | 3                |
| Isabella | 3                |
| Jose     | 0                |
+-----------------------------+
You could change the specified date for the "window function" to current_date() or cross join with a generated date array to see how users change over time.
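For instance, the cross-join variant might be sketched like this, reusing a cut-down version of the users sample above (the date bounds and the 7-day step are arbitrary):
WITH users AS
(SELECT 'Isabella' as user, 1 as spend, DATE(2020, 03, 28) as date
UNION ALL SELECT 'Daniel', 3, DATE(2020, 03, 24))
SELECT
as_of,
user,
sum(case date_diff(as_of, date, day) between 0 and 28
when true then spend else 0 end) as spend_28_day_sum
FROM users
CROSS JOIN UNNEST(GENERATE_DATE_ARRAY(DATE(2020, 03, 20), DATE(2020, 04, 15), INTERVAL 7 DAY)) AS as_of
GROUP BY as_of, user
ORDER BY as_of, user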
I found a clean and elegant way to do this, even if you have missing data on some of the days.
SELECT spend,
SUM(spend) OVER (PARTITION BY user ORDER BY UNIX_DATE(date) RANGE BETWEEN 27 PRECEDING AND CURRENT ROW),
user,
date
FROM user_spend;
UNIX_DATE() returns the number of days since 1970-01-01, so we can easily control how many days back to go by combining it with a RANGE frame.