BigQuery SQL for sliding window aggregate - sql

Hi I have a table that looks like this
Date Customer Pageviews
2014/03/01 abc 5
2014/03/02 xyz 8
2014/03/03 abc 6
I want to get page view aggregates grouped by week but showing aggregates for past 30 days - (sliding window aggregates with window-size of 30 days for every week)
I am using Google BigQuery.
EDIT: Gordon - re your comment about "Customer": what I actually need is slightly more complicated, which is why I included Customer in the table above. I am looking for the number of customers who had >n pageviews in a 30-day window, computed every week. Something like this:
Date Customers>10 pageviews in 30day window
2014/02/01 10
2014/02/08 5
2014/02/15 6
2014/02/22 15
However, to keep it simple, I can work my way from there if I could just get a sliding window aggregate of pageviews, ignoring customers altogether. Something like this:
Date count of pageviews in 30day window
2014/02/01 50
2014/02/08 55
2014/02/15 65
2014/02/22 75

How about this:
SELECT changes + changes1 + changes2 + changes3 changes28days, login, USEC_TO_TIMESTAMP(week)
FROM (
SELECT changes,
LAG(changes, 1) OVER (PARTITION BY login ORDER BY week) changes1,
LAG(changes, 2) OVER (PARTITION BY login ORDER BY week) changes2,
LAG(changes, 3) OVER (PARTITION BY login ORDER BY week) changes3,
login,
week
FROM (
SELECT SUM(payload_pull_request_changed_files) changes,
UTC_USEC_TO_WEEK(created_at, 1) week,
actor_attributes_login login,
FROM [publicdata:samples.github_timeline]
WHERE payload_pull_request_changed_files > 0
GROUP BY week, login
))
HAVING changes28days > 0
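For each user it counts how many changes they have submitted per week. Then with LAG() we can peek into the previous rows, to see how many changes they submitted 1, 2, and 3 weeks earlier. Then we just add those 4 weeks to see how many changes were submitted in the last 28 days. Note that LAG() steps back by rows, not by calendar weeks, so this assumes each user has a row for every week; for the first three weeks of each user the sum is NULL and the row is filtered out.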
Now you can wrap everything in a new query to filter users with changes>X, and count them.
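The wrapping step can be sketched as follows. This is only an illustration run against SQLite through Python rather than BigQuery; the table name weekly, the integer week index and the sample rows are all hypothetical, and the weekly sums are assumed to be pre-aggregated with one row per login and week.

```python
import sqlite3

# Hypothetical pre-aggregated data: (login, week_index, changes).
# Every login has a row for every week, because LAG() steps back by rows.
ROWS = [
    ("abc", 1, 5), ("abc", 2, 6), ("abc", 3, 0), ("abc", 4, 2), ("abc", 5, 1),
    ("xyz", 1, 1), ("xyz", 2, 1), ("xyz", 3, 1), ("xyz", 4, 1), ("xyz", 5, 20),
]

QUERY = """
SELECT week, COUNT(*) AS users_over_threshold
FROM (
  SELECT login, week,
         changes
         + LAG(changes, 1) OVER w
         + LAG(changes, 2) OVER w
         + LAG(changes, 3) OVER w AS changes28days
  FROM weekly
  WINDOW w AS (PARTITION BY login ORDER BY week)
)
WHERE changes28days > ?   -- NULL sums (fewer than 4 weeks of history) drop out here
GROUP BY week
ORDER BY week
"""


def users_over_threshold(rows, threshold):
    """Count, per week, the logins whose trailing 4-week total exceeds threshold."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE weekly (login TEXT, week INTEGER, changes INTEGER)")
    conn.executemany("INSERT INTO weekly VALUES (?, ?, ?)", rows)
    result = conn.execute(QUERY, (threshold,)).fetchall()
    conn.close()
    return result


print(users_over_threshold(ROWS, 8))
```

With these sample rows and a threshold of 8, week 4 has one qualifying login and week 5 has two.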

I have created the following "Times" table:
Table Details: Dim_Periods
Schema
Date TIMESTAMP
Year INTEGER
Month INTEGER
day INTEGER
QUARTER INTEGER
DAYOFWEEK INTEGER
MonthStart TIMESTAMP
MonthEnd TIMESTAMP
WeekStart TIMESTAMP
WeekEnd TIMESTAMP
Back30Days TIMESTAMP -- the date 30 days before "Date"
Back7Days TIMESTAMP -- the date 7 days before "Date"
and I use such query to handle "running sums"
SELECT Date,Count(*) as MovingCNT
FROM
(SELECT Date,
Back7Days
FROM DWH.Dim_Periods
where Date < timestamp(current_date()) AND
Date >= (DATE_ADD (CURRENT_TIMESTAMP(), -5, 'month'))
)P
CROSS JOIN EACH
(SELECT repository_url,repository_created_at
FROM [publicdata:samples.github_timeline]
) L
WHERE timestamp(repository_created_at)>= Back7Days
AND timestamp(repository_created_at)<= Date
GROUP EACH BY Date
Note that it can be used for "Month to Date", "Week to Date", "30 days back", etc. aggregations as well.
However, performance is not the best and the query can take a while on larger data sets due to the Cartesian join.
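A minimal sketch of the same calendar-table join, run against SQLite through Python instead of BigQuery. The table names periods and events, the column names and the sample rows are hypothetical stand-ins for Dim_Periods and the timeline table.

```python
import sqlite3

# Hypothetical calendar rows: (day, back7days), both as ISO date strings.
PERIODS = [("2014-03-08", "2014-03-01"), ("2014-03-09", "2014-03-02")]
# Hypothetical event dates.
EVENTS = ["2014-03-01", "2014-03-01", "2014-03-05", "2014-03-09"]


def running_counts(periods, events):
    """For each calendar day, count events between back7days and day (inclusive)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE periods (day TEXT, back7days TEXT)")
    conn.execute("CREATE TABLE events (created_at TEXT)")
    conn.executemany("INSERT INTO periods VALUES (?, ?)", periods)
    conn.executemany("INSERT INTO events VALUES (?)", [(e,) for e in events])
    rows = conn.execute(
        """
        SELECT p.day, COUNT(*) AS moving_cnt
        FROM periods p
        JOIN events e
          ON e.created_at >= p.back7days
         AND e.created_at <= p.day
        GROUP BY p.day
        ORDER BY p.day
        """
    ).fetchall()
    conn.close()
    return rows


print(running_counts(PERIODS, EVENTS))
```

Because this is an inner join, calendar days with no matching events produce no row at all; a LEFT JOIN with COUNT(e.created_at) would keep them with a count of 0.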
Hope this helps

Related

Write a SQL Query in Google Big Query which pulls all values from last week, all values from 2 weeks ago, and calculate percent change between them

I'm trying to query a table comparing order numbers from last week (Sunday to Saturday) vs 2 weeks ago, and calculate percent change between the two. My thought process so far has been to group my date column by week, then use a lag function to pull last week and the previous week in to the same row. From there use basic arithmetic functions to calculate percent change. In practice, I haven't been able to get a working query, but I picture the table to look as follows:
Week | Orders | Orders - Previous Week | % Change
2023-02-05 | 5 | 10 | -0.5
2023-01-29 | 10 | 2 | +5.0
2023-01-29 | 2 | |
Important to note that the days in last week should not change regardless of what day it is today (i.e not use today -7 days to calculate last week, and -14 days to calculate 2 weeks ago)
My query so far:
SELECT
min(date) as date,
orders,
coalesce(lag(order) over (order by (date), 0)) as Orders - Previous Week
FROM `table`
WHERE date BETWEEN '2023-01-01' AND current_date()
group by date_trunc(date, WEEK)
ORDER BY date desc
I realize I'm not using coalesce and my lag function correctly, but a bit lost on how to correct it
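To calculate the percent change, aggregate to one row per week first, then apply lag to the weekly totals:
WITH weekly AS (
SELECT
date_trunc(date, WEEK) AS week_start,
sum(orders) AS orders
FROM `table`
WHERE date BETWEEN '2023-01-01' AND current_date()
GROUP BY week_start
)
SELECT
week_start,
orders,
lag(orders) OVER (ORDER BY week_start) AS orders_previous_week,
safe_divide(orders - lag(orders) OVER (ORDER BY week_start), lag(orders) OVER (ORDER BY week_start)) AS pct_change
FROM weekly
ORDER BY week_start DESC
In this query, sum aggregates the orders by week, and date_trunc(date, WEEK) pins each row to the Sunday that starts its week, so the week boundaries do not move with today's date. lag then reads the previous week's total; it has to be applied to the weekly totals, because a window function cannot be nested inside sum. The aliases are plain identifiers because double-quoted text is a string literal in BigQuery. For the earliest week there is no previous row, so orders_previous_week and pct_change are NULL (wrap them in coalesce if you prefer 0), and safe_divide returns NULL instead of failing when the previous week's total is 0.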
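The week-over-week comparison can be tried on a few rows as below. This is a sketch against SQLite through Python, not BigQuery: the table daily_orders and its sample rows are hypothetical, and date(day, '-6 days', 'weekday 0') is SQLite's way of finding the Sunday on or before a date.

```python
import sqlite3

# Hypothetical daily rows: (day, orders). 2023-01-22, 01-29 and 02-05 are Sundays.
DAILY_ORDERS = [
    ("2023-01-22", 2),
    ("2023-01-29", 4), ("2023-01-31", 6),
    ("2023-02-05", 3), ("2023-02-08", 2),
]

QUERY = """
WITH weekly AS (
  SELECT date(day, '-6 days', 'weekday 0') AS week_start,  -- Sunday on or before day
         SUM(orders) AS total
  FROM daily_orders
  GROUP BY week_start
),
with_prev AS (
  SELECT week_start, total,
         LAG(total) OVER (ORDER BY week_start) AS prev_total
  FROM weekly
)
SELECT week_start, total, prev_total,
       (total - prev_total) * 1.0 / NULLIF(prev_total, 0) AS pct_change
FROM with_prev
ORDER BY week_start DESC
"""


def week_over_week(rows):
    """Return (week_start, total, previous week's total, fractional change) per week."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE daily_orders (day TEXT, orders INTEGER)")
    conn.executemany("INSERT INTO daily_orders VALUES (?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


for row in week_over_week(DAILY_ORDERS):
    print(row)
```

The earliest week has no predecessor, so its last two columns come back as None; NULLIF plays the role of a safe division when the previous total is 0.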

Average ticket response time by week with SQL query

In my Spiceworks database there is a table, tickets, with two columns I am concerned with, first_response_secs and created_at.
I have been tasked with finding the average response time of tickets for every week.
So if I run the following query:
select AVG(first_response_secs) from (
select first_response_secs,created_at
from tickets
where created_at BETWEEN '2017-03-19' and '2017-03-25'
)
I will get back the average first response seconds for that week. But that's as far as my limited SQL gets me. I need 6 months worth of data and I don't want to manually edit the date range and rerun the query 24 times.
I would like to write a query that will return output similar to the following:
WEEK AVERAGE RESPONSE TIME(secs)
-----------------------------------------------------------
2017-02-26 - 2017-03-04 21447
2017-03-05 - 2017-03-11 20564
2017-03-12 - 2017-03-18 25883
2017-03-19 - 2017-03-25 12244
Or something like that, back 6 months.
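Weeks are tricky. How about bucketing each ticket by the number of whole weeks between its date and a known week end (here weekstart is the earliest ticket date found in each bucket):
select min(date(created_at)) as weekstart, avg(first_response_secs) as avg_response_secs
from tickets
where date(created_at) <= '2017-03-25'
group by cast((julianday('2017-03-25') - julianday(date(created_at))) / 7 as integer)
order by weekstart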
One dirty way is to use case to define week boundaries:
select week, avg(first_response_secs)
from (
select case
when created_at between '2017-02-26' and '2017-03-04' then '2017-02-26 - 2017-03-04'
when created_at between '2017-03-05' and '2017-03-11' then '2017-03-05 - 2017-03-11'
when created_at between '2017-03-12' and '2017-03-18' then '2017-03-12 - 2017-03-18'
when created_at between '2017-03-19' and '2017-03-25' then '2017-03-19 - 2017-03-25'
end as week,
first_response_secs
from tickets
) t
group by week;
Note that this method is a general purpose one and can be modified to change the boundaries as you wish.
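If the boundaries are always Sunday to Saturday, the week label can also be derived instead of listed by hand. The following is a sketch using SQLite through Python; the sample tickets are hypothetical, and it assumes created_at is stored as an ISO date-time string that SQLite's date() can parse.

```python
import sqlite3

# Hypothetical tickets: (created_at, first_response_secs).
TICKETS = [
    ("2017-03-12 09:00:00", 100),
    ("2017-03-18 23:00:00", 300),
    ("2017-03-19 00:30:00", 50),
    ("2017-03-25 12:00:00", 150),
]

QUERY = """
SELECT week_start || ' - ' || date(week_start, '+6 days') AS week,
       AVG(first_response_secs) AS avg_secs
FROM (
  -- Sunday on or before the ticket's creation date
  SELECT date(created_at, '-6 days', 'weekday 0') AS week_start,
         first_response_secs
  FROM tickets
) t
GROUP BY week_start
ORDER BY week_start
"""


def weekly_average_response(tickets):
    """Average first_response_secs per Sunday-to-Saturday week."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tickets (created_at TEXT, first_response_secs INTEGER)")
    conn.executemany("INSERT INTO tickets VALUES (?, ?)", tickets)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


for week, avg_secs in weekly_average_response(TICKETS):
    print(week, avg_secs)
```

A WHERE clause on created_at would limit this to the last six months; weeks without any tickets simply do not appear in the output.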

Sum of shifting range in SQL Query

I am trying to write an efficient query to get the sum of the previous 7 days worth of values from a relational DB table, and record each total against the final date in the 7 day period (e.g. the 'WeeklyTotals Table' in the example below). For example, in my WeeklyTotals query, I would like the value for February 15th to be 333, since that is the total sum of users from Feb 9th - Feb 15th, and so on:
I have a base query which gets me my previous weeks users for today's date (simplified for the sake of the example):
SELECT Date, Sum("Total Users")
FROM "UserRecords"
WHERE dateadd(hour, -8, "UserRecords"."Date") BETWEEN
dateadd(hour, -8, sysdate) - INTERVAL '7 DAY' AND dateadd(hour, -8, sysdate);
The problem is, this only gets me the total for today's date. I need a query which will get me this information for each of the previous seven days.
I know I can make a view for each date (since I only need the previous seven entries) and join them all together, but that seems really inefficient (I'll have to create/update 7 views, and then do all the inner join operations). I am wondering if there's a more efficient way to achieve this.
Provided there are no gaps, you can use a running total with SUM OVER including the six previous rows. Use ROW_NUMBER to exclude the first six records, as their totals don't represent complete weeks.
select log_date, week_total
from
(
select
log_date,
sum(total_users) over (order by log_date rows 6 preceding) as week_total,
row_number() over (order by log_date) as rn
from mytable
where log_date > 0
) t
where rn >= 7
order by log_date;
UPDATE: In case there are gaps, it should be
sum(total_users) over (order by log_date range interval '6' day preceding)
PostgreSQL supports this kind of RANGE frame from version 11 onward; on older versions it is not available. (Moreover the ROW_NUMBER exclusion wouldn't work then and would have to be replaced by something else.)
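The difference between counting rows and counting days only shows up when dates are missing. The sketch below, using SQLite through Python with a hypothetical user_records table, computes the 7-day total both ways: once with a ROWS frame and once with a date-range self join that works without RANGE support.

```python
import sqlite3

# Hypothetical data with a gap between 2014-02-11 and 2014-02-20.
USER_RECORDS = [
    ("2014-02-09", 10),
    ("2014-02-10", 20),
    ("2014-02-11", 30),
    ("2014-02-20", 5),
]

ROWS_QUERY = """
SELECT log_date,
       SUM(total_users) OVER (
         ORDER BY log_date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
       ) AS week_total
FROM user_records
ORDER BY log_date
"""

DATE_RANGE_QUERY = """
SELECT u1.log_date, SUM(u2.total_users) AS week_total
FROM user_records u1
JOIN user_records u2
  ON julianday(u1.log_date) - julianday(u2.log_date) BETWEEN 0 AND 6
GROUP BY u1.log_date
ORDER BY u1.log_date
"""


def rolling_sums(records):
    """Return (row-based totals, date-based totals) for a 7-day trailing window."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE user_records (log_date TEXT, total_users INTEGER)")
    conn.executemany("INSERT INTO user_records VALUES (?, ?)", records)
    by_rows = conn.execute(ROWS_QUERY).fetchall()
    by_dates = conn.execute(DATE_RANGE_QUERY).fetchall()
    conn.close()
    return by_rows, by_dates


by_rows, by_dates = rolling_sums(USER_RECORDS)
print(by_rows)
print(by_dates)
```

For 2014-02-20 the row-based frame still reaches back to the three February 9 to 11 rows, while the date-based version only sees that single day.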
Here's a query that self joins to the previous 6 days and sums the value to get the weekly totals:
select u1.date, sum(u2.total_users) as weekly_users
from UserRecords u1
join UserRecords u2
on u1.date - u2.date < 7
and u1.date >= u2.date
group by u1.date
order by u1.date
You can also use SUM as a window function, with the week date part as the window expression.
Self joins are generally much slower than window functions.

use of week of year & subsequent in bigquery

I need to show distinct users per week. I have a date-visit column, and a user id, it is a big table with 1 billion rows.
I can change the date column from the CSVs into year, month and day columns, but how do I deduce the week from that in the query?
I could calculate the week in the CSV, but that is a big processing step.
I also need to show how many distinct users visit day after day, and I am looking for a workaround as there is no date type.
Any ideas?
To get the week of year number:
SELECT STRFTIME_UTC_USEC(TIMESTAMP('2015-5-19'), '%W')
20
If you have your date as a timestamp (i.e microseconds since the epoch) you can use the UTC_USEC_TO_DAY/UTC_USEC_TO_WEEK functions. Alternately, if you have an iso-formatted date string (e.g. "2012/03/13 19:00:06 -0700") you can call PARSE_UTC_USEC to turn the string into a timestamp and then use that to get the week or day.
To see an example, try:
SELECT LEFT((format_utc_usec(day)),10) as day, cnt
FROM (
SELECT day, count(*) as cnt
FROM (
SELECT UTC_USEC_TO_DAY(PARSE_UTC_USEC(created_at)) as day
FROM [publicdata:samples.github_timeline])
GROUP BY day
ORDER BY cnt DESC)
To show week, just change UTC_USEC_TO_DAY(...) to UTC_USEC_TO_WEEK(..., 0) (the 0 at the end is to indicate the week starts on Sunday). See the documentation for the above functions at https://developers.google.com/bigquery/docs/query-reference for more information.
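For the case where the date has already been split into year, month and day columns, the idea of rebuilding a date and then asking for its week number can be sketched outside BigQuery. The example uses SQLite through Python; the visits table and its rows are hypothetical. SQLite's '%W' follows the same convention as the STRFTIME_UTC_USEC example above (weeks start on Monday), so 2015-05-19 falls in week 20.

```python
import sqlite3

# Hypothetical visits: (year, month, day, user_id).
VISITS = [
    (2015, 5, 18, "a"),
    (2015, 5, 19, "a"),
    (2015, 5, 19, "b"),
    (2015, 5, 25, "a"),
]

QUERY = """
SELECT strftime('%Y-%W', printf('%04d-%02d-%02d', year, month, day)) AS year_week,
       COUNT(DISTINCT user_id) AS distinct_users
FROM visits
GROUP BY year_week
ORDER BY year_week
"""


def distinct_users_per_week(visits):
    """Rebuild an ISO date from the three columns, then count distinct users per week."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE visits (year INTEGER, month INTEGER, day INTEGER, user_id TEXT)"
    )
    conn.executemany("INSERT INTO visits VALUES (?, ?, ?, ?)", visits)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


print(distinct_users_per_week(VISITS))
```

User "a" visits twice in week 20 but is counted once there, which is the point of COUNT(DISTINCT ...).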

PostgreSQL - Getting statistical data

I need to collect some statistical information in my application.
I have a table of users (tb_user)
Every time a new user accesses the application, it adds a new record to this table, i.e. one row for each user. The main fields are id and date_time (timestamp of the first time the user accessed the application).
tb_user
id (bigint) | date_time (timestamp with time zone)
1 | 2012-01-29 11:29:50.359-03
2 | 2012-01-31 14:27:10.359-03
I need to get:
the average number of users by day, week and month
Example:
by day: 55.45
by week : XX.XX
month: XX.XX
EDIT:
My best solution was:
WITH daily_count AS (SELECT COUNT(id) AS user_count FROM tb_user)
SELECT user_count, tbaux2.days, (user_count/tbaux2.days) FROM daily_count,
(SELECT EXTRACT(DAY FROM (t2.diff) ) + 1 AS days
FROM
(with tbaux AS(SELECT min(date_time) AS min FROM tb_user)
SELECT (now() - min) AS diff
FROM tbaux) AS t2) AS tbaux2
GROUP BY user_count, tbaux2.days
But this solution only worked with EXTRACT(DAY ...); it did not work with weeks and months.
Any help is welcome.
Alternatively:
SELECT user_count, tbaux2.days, (user_count/tbaux2.days) AS userPerDay, ((user_count/tbaux2.days) * 7) AS userPerWeek, ((user_count/tbaux2.days) * 30) AS userPerMonth
EDIT 2:
Based on responses from @Bruno, there are some considerations:
When I asked the question, what I really requested was a way to select data by day, month and year. I believe the query that I posted and @Bruno refined should be interpreted as an average "per day, per 7 days and per 30 days", not per calendar day, week and month. Interpreted this way, the kind of problem quoted in the example (the 10% drop) does not arise. This "every N days" approach is the answer I need at the moment, so I will accept this answer.
I suggest as improvements to the post:
Consider only closed days in the result (do not collect users of the current day, and do not count the current day in the division).
Round the result to two decimal digits.
A new query that really considers data per calendar week and per calendar month.
Thanks.
You should look into aggregate functions (min, max, count, avg), which go hand in hand with GROUP BY. For date-based aggregations, date_trunc is also useful.
For example, this will return the number of rows per day:
SELECT date_trunc('day', date_time) AS day_start,
COUNT(id) AS user_count FROM tb_user
GROUP BY date_trunc('day', date_time);
You can then do the daily average using something like this (with a CTE):
WITH daily_count AS (SELECT date_trunc('day', date_time) AS day_start,
COUNT(id) AS user_count FROM tb_user
GROUP BY date_trunc('day', date_time))
SELECT AVG(user_count) FROM daily_count;
Use 'week' instead of day for the weekly counts, and so on (see date_trunc documentation).
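A small sketch of the daily-count CTE followed by AVG, run on SQLite through Python instead of PostgreSQL (date(date_time) stands in for date_trunc('day', ...)). The three sample rows are hypothetical and use timestamps without a time zone.

```python
import sqlite3

# Hypothetical first-access timestamps: one user on Jan 29, two on Jan 31.
USERS = [
    (1, "2012-01-29 11:29:50"),
    (2, "2012-01-31 14:27:10"),
    (3, "2012-01-31 16:00:00"),
]

QUERY = """
WITH daily_count AS (
  SELECT date(date_time) AS day_start, COUNT(id) AS user_count
  FROM tb_user
  GROUP BY date(date_time)
)
SELECT AVG(user_count) FROM daily_count
"""


def average_per_active_day(users):
    """Average of the per-day counts, over days that have at least one row."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tb_user (id INTEGER, date_time TEXT)")
    conn.executemany("INSERT INTO tb_user VALUES (?, ?)", users)
    (avg,) = conn.execute(QUERY).fetchone()
    conn.close()
    return avg


print(average_per_active_day(USERS))
```

Note what this measures: January 30 has no rows, so it is not a group and does not pull the average down. The result is the mean over active days (1.5 here), not over calendar days.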
EDIT: (Following comment: average up to and including 5/1/2012, i.e. before the 6th.)
WITH daily_count AS (SELECT date_trunc('day', date_time) AS day_start,
COUNT(id) AS user_count
FROM tb_user
WHERE date_time >= DATE('2012-01-01') AND date_time < DATE('2012-01-06')
GROUP BY date_trunc('day', date_time))
SELECT SUM(user_count)/(DATE('2012-01-06') - DATE('2012-01-01')) FROM daily_count;
What's above is over-complicated, in this case. This should give you the same result (the cast to numeric avoids integer division):
SELECT COUNT(id)::numeric/(DATE('2012-01-06') - DATE('2012-01-01'))
FROM tb_user
WHERE date_time >= DATE('2012-01-01') AND date_time < DATE('2012-01-06');
EDIT 2: After your edit, I guess what you're after is just a single global average for the entire period of existence of your database, rather than groups by month/week/day.
This should give you the average number of rows per day:
WITH total_min_max AS (SELECT
COUNT(id) AS total_visits,
MIN(date_time) AS first_date_time,
MAX(date_time) AS last_date_time
FROM tb_user)
SELECT total_visits::numeric/((last_date_time::date-first_date_time::date)+1) AS users_per_day
FROM total_min_max
(I would replace last_date_time with NOW() to make the average over the time until now, rather than until the last visit, if there's no recent visit.)
Then, for daily, weekly, and "monthly":
WITH daily_avg AS (
WITH total_min_max AS (SELECT
COUNT(id) AS total_visits,
MIN(date_time) AS first_date_time,
MAX(date_time) AS last_date_time
FROM tb_user)
SELECT total_visits::numeric/((last_date_time::date-first_date_time::date)+1) AS users_per_day
FROM total_min_max)
SELECT
users_per_day,
(users_per_day * 7) AS users_per_week,
(users_per_day * 30) AS users_per_month
FROM daily_avg
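The global average over the whole span can be checked on a few rows. This sketch runs on SQLite through Python rather than PostgreSQL, so the date subtraction is done with julianday(); the sample rows are hypothetical.

```python
import sqlite3

# Hypothetical first-access timestamps spanning 3 calendar days (Jan 29 to Jan 31).
USERS = [
    (1, "2012-01-29 11:29:50"),
    (2, "2012-01-31 14:27:10"),
    (3, "2012-01-31 16:00:00"),
]

QUERY = """
WITH daily_avg AS (
  SELECT COUNT(id) * 1.0
         / (julianday(date(MAX(date_time))) - julianday(date(MIN(date_time))) + 1)
         AS users_per_day
  FROM tb_user
)
SELECT users_per_day,
       users_per_day * 7  AS users_per_week,
       users_per_day * 30 AS users_per_month
FROM daily_avg
"""


def global_averages(users):
    """Users per day over the whole span, scaled to 7 and 30 days."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tb_user (id INTEGER, date_time TEXT)")
    conn.executemany("INSERT INTO tb_user VALUES (?, ?)", users)
    row = conn.execute(QUERY).fetchone()
    conn.close()
    return row


print(global_averages(USERS))
```

Here the divisor is the number of calendar days between the first and last access, inclusive, so days without any new users do count; multiplying by 1.0 keeps the division from being truncated to an integer.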
This being said, conclusions you draw from such statistics might not be great, especially if you want to see how it changes.
I would also normalise the data per day rather than assuming 30 days in a month (if not per hour, because not all days have 24 hours). Say you have 10 visits per day in Jan 2011 and 10 visits per day in Feb 2011. That gives you 310 visits in Jan and 280 visits in Feb. If you don't pay attention, you could think you've had almost a 10% drop in the number of visitors, so something went wrong in Feb, when really, this isn't the case.
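The arithmetic behind that example can be written out as a short sketch in plain Python; the function name visits_per_day is made up for illustration, and calendar.monthrange supplies the number of days in each month.

```python
import calendar


def visits_per_day(year, month, total_visits):
    """Normalise a monthly total by the actual number of days in that month."""
    days_in_month = calendar.monthrange(year, month)[1]
    return total_visits / days_in_month


jan_total, feb_total = 310, 280

# Comparing raw monthly totals suggests a drop of almost 10%.
raw_change = (feb_total - jan_total) / jan_total
print(f"raw change: {raw_change:.1%}")

# Comparing per-day rates shows no change at all.
print(visits_per_day(2011, 1, jan_total), visits_per_day(2011, 2, feb_total))
```

The raw totals differ only because January 2011 has 31 days and February 2011 has 28.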