I have a dataset like this called data_per_day
instructional_day
points
2023-01-24
2
2023-01-23
2
2023-01-20
1
2023-01-19
0
and so on. the table shows weekdays (days minus holidays and weekends) and the number of points someone has earned. 1 is the start of a streak and 0 is the end of a streak. 2 is max points after a streak has started.
I need to find how long is the latest streak. so in this case the result should be 3
I created a recursive cte but the query returns 2 as the streak count because i'm using lag mechanism with days. instead I need to adjust so that the instructional days are used rather than all dates.
RECURSIVE cte AS (
SELECT
student_unique_id,
instructional_day,
points,
1 AS cnt
FROM
`data_per_day`
WHERE
instructional_day = DATE_ADD(CURRENT_DATE('America/Chicago'), INTERVAL -1 DAY)
UNION ALL
SELECT
a.student_unique_id,
a.instructional_day,
a.points,
c.cnt+1
FROM (
SELECT
*
FROM
`data_per_day`
WHERE
points > 0 ) a
INNER JOIN
cte c
ON
a.student_unique_id = c.student_unique_id
AND a.instructional_day = c.instructional_day - INTERVAL '1' day )
SELECT
student_unique_id,
MAX(cnt) AS streak
FROM
cte --
WHERE
student_unique_id = "419"
GROUP BY
student_unique_id
How do I adjust the query?
This is not a trivial coding exercise, so I won't actually write the code and provide it.
What you have here is a gaps and islands question. You want to identify the largest "island" of days with points within a date range. Depending upon what dates are contained in your data, you may need to generate a list of sequential dates that meet your criteria.
One problem I see is that you are trying to combine the steps to generate the date range (the recursive CTE) with the points. You'll need to separate those steps.
Define the date range.
Generate the dates within the range.
Filter the dates with isweekday = 'no' and isholiday = 'no'. You will probably want to add a row number during this step.
[left] join the dates to your data, including coalesce(points, 0)
Filter the data to points > 0.
Identify the islands.
Identify the largest island per student.
I have a time series in a SQLite Database and want to analyze it.
The important part of the time series consists of a column with different but not unique string values.
I want to do something like this:
Value concat countValue
A A 1
A A,A 1
B A,A,B 1
B A,B,B 2
B B,B,B 3
C B,B,C 1
B B,C,B 2
I don't know how to get the countValue column. It should count all Values of the partition equal to the current rows Value.
I tried this but it just counts all Values in the partition and not the Values equal to this rows Value.
SELECT
Value,
group_concat(Value) OVER wind AS concat,
Sum(Case When Value Like Value Then 1 Else 0 End) OVER wind AS countValue
FROM TimeSeries
WINDOW
wind AS (ORDER BY date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
ORDER BY
date
;
The query is also limited by these factors:
The query should work with any amount of unique Values
The query should work with any Partition Size (ROWS BETWEEN n PRECEDING AND CURRENT ROW)
Is this even possible using only SQL?
Here is an approach using string functions:
select
value,
group_concat(value) over wind as concat,
(
length(group_concat(value) over wind) - length(replace(group_concat(value) over wind, value, ''))
) / length(value) cnt_value
from timeseries
window wind as (order by date rows between 2 preceding and current row)
order by date;
I've created a query that returns a counts the number of rows (records) for the last 31 days (based on a timestamp field) and include the previous 31 days before that period as well, eg. produce a query that returns both. I now have the following query:
SELECT
COUNT(*) OVER(ORDER BY datetime DESC RANGE BETWEEN 2678400000 PRECEDING AND CURRENT ROW) AS rolling_avg_31_days,
COUNT(*) OVER(ORDER BY datetime DESC RANGE BETWEEN 5356800000 PRECEDING AND CURRENT ROW) AS rolling_avg_62_days
FROM `p`
ORDER BY rolling_avg_31_days DESC LIMIT 1
And it returns some data, but not really the data I was hoping for:
rolling_avg_31_days | rolling_avg_62_days
8,422,783 | 9,790,304
If I query the same table with (rolling 62 days):
SELECT COUNT(*) FROM `p`
WHERE datetime > UNIX_MILLIS(CURRENT_TIMESTAMP)-5356800000 AND datetime < UNIX_MILLIS(CURRENT_TIMESTAMP)-2678400000'
I get a value of 6,192,920
I'm not sure what I'm doing wrong. Any help is much appreciated!
So, the first query is correct and gives you rolling counts (31 and 62 days) based on the timestamp field - also, because of order by .. desc and limit 1 you are getting the most row that has biggest rolling_avg_31_days which is not necessarily row for the most recent () datetime
The second query just produces count of rows between 62 and 31 days based on the current timestamp - which is as explain above is not what first query produces - thus the discrepancy
To further troubleshoot or to try to understand difference - change ORDER BY rolling_avg_31_days DESC LIMIT 1 to ORDER BY datetime DESC LIMIT 1 and also add datetime to select statement so you can see if it belong to current date or close to current statement so results are comparable
Instead of going with the above, I've decided to change the query to be a bit simpler:
SELECT
(SELECT COUNT(DISTINCT(wasabi_user_id)) FROM `p` WHERE datetime > UNIX_MILLIS(CURRENT_TIMESTAMP)-5356800000 AND datetime < UNIX_MILLIS(CURRENT_TIMESTAMP)-2678400000) as _62days,
(SELECT COUNT(DISTINCT(wasabi_user_id)) FROM `p` WHERE datetime > UNIX_MILLIS(CURRENT_TIMESTAMP)-2678400000) AS _31days
FROM `mycujoo_kafka_public.v_web_event_pageviews` LIMIT 1
Thanks #Mikhail for the help though!
I have an SQL Table with following structure
Timestamp(DATETIME)|AuditEvent
---------|----------
T1|Login
T2|LogOff
T3|Login
T4|Execute
T5|LogOff
T6|Login
T7|Login
T8|Report
T9|LogOff
Want the T-SQL way to find out What is the time that the user has logged into the system i.e. Time inbetween Login Time and Logoff Time for each session in a given day.
Day (Date)|UserTime(In Hours) (Logoff Time - LogIn Time)
--------- | -------
Jun 12 | 2
Jun 12 | 3
Jun 13 | 5
I tried using two temporary tables and Row Numbers but could not get it since the comparison was a time i.e. finding out the next Logout event with timestamp is greater than the current row's Login Event.
You need to group the records. I would suggest counting logins or logoffs. Here is one approach to get the time for each "session":
select min(case when auditevent = 'login' then timestamp end) as login_time,
max(timestamp) as logoff_time
from (select t.*,
sum(case when auditevent = 'logoff' then 1 else 0 end) over (order by timestamp desc) as grp
from t
) t
group by grp;
You then have to do whatever you want to get the numbers per day. It is unclear what those counts are.
The subquery does a reverse count. It counts the number of "logoff" records that come on or after each record. For records in the same "session", this count is the same, and suitable for grouping.
I'm trying to compute a 28 day moving sum in BigQuery using the LAG function.
The top answer to this question
Bigquery SQL for sliding window aggregate
from Felipe Hoffa indicates that that you can use the LAG function. An example of this would be:
SELECT
spend + spend_lagged_1day + spend_lagged_2day + spend_lagged_3day + ... + spend_lagged_27day as spend_28_day_sum,
user,
date
FROM (
SELECT spend,
LAG(spend, 1) OVER (PARTITION BY user ORDER BY date) spend_lagged_1day,
LAG(spend, 2) OVER (PARTITION BY user ORDER BY date) spend_lagged_2day,
LAG(spend, 3) OVER (PARTITION BY user ORDER BY date) spend_lagged_3day,
...
LAG(spend, 28) OVER (PARTITION BY user ORDER BY date) spend_lagged_day,
user,
date
FROM user_spend
)
Is there a way to do this without having to write out 28 lines of SQL!
The BigQuery documentation doesn't do a good job of explaining the complexity of window functions that the tool supports because it doesn't specify what expressions can appear after ROWS or RANGE. It actually supports the SQL 2003 standard for window functions, which you can find documented other places on the web, such as here.
That means you can get the effect you want with a single window function. The range is 27 because it's how many rows before the current one to include in the sum.
SELECT spend,
SUM(spend) OVER (PARTITION BY user ORDER BY date ROWS BETWEEN 27 PRECEDING AND CURRENT ROW),
user,
date
FROM user_spend;
A RANGE bound can also be extremely useful. If your table was missing dates for some user, then 27 PRECEDING rows would go back more than 27 days, but RANGE will produce a window based on the date values themselves. In the following query, the date field is a BigQuery TIMESTAMP and the range is specified in microseconds. I'd advise that whenever you do date math like this in BigQuery, you test it thoroughly to make sure it's giving you the expected answer.
SELECT spend,
SUM(spend) OVER (PARTITION BY user ORDER BY date RANGE BETWEEN 27 * 24 * 60 * 60 * 1000000 PRECEDING AND CURRENT ROW),
user,
date
FROM user_spend;
Bigquery: How to get a rolling time range in a window clause.....
This is an old post, but I spend a long time searching for a solution, and this post came up so maybe this will help someone.
IF your partition of your window clause does not have a record for every day, you need to use the RANGE clause to accurately get a rolling time range, (ROWS would search the number records, which would to go too far back since you don't have a record for every day in your PARTITION BY). The problem is that in Bigquery RANGE clause does not support Dates.
From BigQuery's documentation:
numeric_expression must have numeric type. DATE and TIMESTAMP are not currently supported. In addition, the numeric_expression must be a constant, non-negative integer or a parameter.
The workaround I found was to use the UNIX_DATE(date_expression) in the ORDER BY clause along with a RANGE clause:
SUM(value) OVER (PARTITION BY Column1 ORDER BY UNIX_DATE(Date) RANGE BETWEEN 5 PRECEDING AND CURRENT ROW
Here is an alternate take that I found to be flexible and effective:
WITH users AS
(SELECT 'Isabella' as user, 1 as spend, DATE(2020, 03, 28) as date
UNION ALL SELECT 'Isabella', 2, DATE(2020, 03, 29)
UNION ALL SELECT 'Daniel', 3, DATE(2020, 03, 24)
UNION ALL SELECT 'Andrew', 4, DATE(2020, 03, 23)
UNION ALL SELECT 'Daniel', 5, DATE(2020, 03, 11)
UNION ALL SELECT 'Jose', 6, DATE(2020, 03, 17))
SELECT
user,
max(sum(case date_diff(date(2020,04,15), date, day) between 0 and 28
when true then spend else 0 end)) over(partition by user) as spend_28_day_sum
FROM users
group by user
+------------------------------+
| user | spend_28_day_sum |
+------------------------------+
| Andrew | 4 |
| Daniel | 3 |
| Isabella | 3 |
| Jose | 0 |
+------------------------------+
You could change the specified date for the "window function" to current_date() or cross join with a generated date array to see how users change over time.
I found a clean and elegant way to do this, even if you have missing data in the last days.
SELECT spend,
SUM(spend) OVER (PARTITION BY user ORDER BY UNIX_DATE(date) RANGE BETWEEN 27 PRECEDING AND CURRENT ROW),
user,
date
FROM user_spend;
The UNIX_DATE() returns the number of days since 1970-01-01, so we can easily compute how many days back to go by using it combined with the RANGE() function.