Calculate 7, 14 and 30 day moving average in BigQuery

I am playing around with BigQuery. I have IoT uptime recordings as input:
+-----------+-----------+--------+------------+
| device_id | reference | uptime | timestamp  |
+-----------+-----------+--------+------------+
| 1         | 1000-5    | 0.7    | 2019-02-12 |
| 2         | 1000-6    | 0.9    | 2019-02-12 |
| 1         | 1000-5    | 0.8    | 2019-02-11 |
| 2         | 1000-6    | 0.95   | 2019-02-11 |
+-----------+-----------+--------+------------+
I want to calculate the 7, 14 and 30 day moving average of the uptime grouped by device. The output should look as follows:
+-----------+-----------+-------+--------+--------+
| device_id | reference | avg_7 | avg_14 | avg_30 |
+-----------+-----------+-------+--------+--------+
| 1         | 1000-5    | 0.7   | ..     | ..     |
| 2         | 1000-6    | 0.9   | ..     | ..     |
+-----------+-----------+-------+--------+--------+
What I have tried:
SELECT
  device_id,
  AVG(uptime) OVER (ORDER BY day RANGE BETWEEN 6 PRECEDING AND CURRENT ROW) AS avg_7d
FROM (
  SELECT device_id, uptime, UNIX_DATE(DATE(timestamp)) AS day
  FROM `uptime_recordings`
)
GROUP BY device_id, uptime, day
I have recordings for 1000 distinct devices and 200k readings. The grouping does not work and the query returns 200k records instead of 1000. Any ideas what's wrong?

Instead of GROUP BY device_id, uptime, day, do GROUP BY device_id, day. Grouping by uptime as well keeps one group per distinct reading, which is why you get 200k rows back instead of one per device and day.
A full working query:
WITH data AS (
  SELECT title device_id, views uptime, datehour timestamp
  FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
  WHERE DATE(datehour) BETWEEN '2019-01-01' AND '2019-01-09'
    AND wiki = 'br'
    AND title = 'Chile'
)
SELECT device_id, day
  , AVG(uptime) OVER (PARTITION BY device_id ORDER BY UNIX_DATE(day) RANGE BETWEEN 6 PRECEDING AND CURRENT ROW) AS avg_7d
FROM (
  SELECT device_id, AVG(uptime) uptime, DATE(timestamp) AS day
  FROM `data`
  GROUP BY device_id, day
)
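The question also asks for 14- and 30-day averages; those follow the same pattern by widening the RANGE frame. A sketch against the original `uptime_recordings` table (a named WINDOW keeps the three frames from repeating the partition):
SELECT device_id, day
  , AVG(uptime) OVER (w RANGE BETWEEN 6 PRECEDING AND CURRENT ROW) AS avg_7d
  , AVG(uptime) OVER (w RANGE BETWEEN 13 PRECEDING AND CURRENT ROW) AS avg_14d
  , AVG(uptime) OVER (w RANGE BETWEEN 29 PRECEDING AND CURRENT ROW) AS avg_30d
FROM (
  SELECT device_id, AVG(uptime) AS uptime, DATE(timestamp) AS day  -- one row per device per day
  FROM `uptime_recordings`
  GROUP BY device_id, day
)
WINDOW w AS (PARTITION BY device_id ORDER BY UNIX_DATE(day))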
Edit: as requested in the comments (though I'm not sure what the goal of averaging all of the 7-day averages is):
WITH data AS (
  SELECT title device_id, views uptime, datehour timestamp
  FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
  WHERE DATE(datehour) BETWEEN '2019-01-01' AND '2019-01-09'
    AND wiki = 'br'
    AND title IN ('Chile', 'Saozneg')
)
SELECT device_id, AVG(avg_7d) avg_avg_7d
FROM (
  SELECT device_id, day
    , AVG(uptime) OVER (PARTITION BY device_id ORDER BY UNIX_DATE(day) RANGE BETWEEN 6 PRECEDING AND CURRENT ROW) AS avg_7d
  FROM (
    SELECT device_id, AVG(uptime) uptime, DATE(timestamp) AS day
    FROM `data`
    GROUP BY device_id, day
  )
)
GROUP BY device_id

Related

Percentage per month in BigQuery

I am working in BigQuery and I need the percentages for each result for each month. I have the following query, but the percentage is calculated with respect to the overall total; I have tried to add a PARTITION BY in the OVER clause, but it does not work.
SELECT CAST(TIMESTAMP_TRUNC(CAST(created_at AS TIMESTAMP), MONTH) AS DATE) AS `month`,
  result,
  COUNT(*) * 100.0 / SUM(COUNT(1)) OVER () AS percentage
FROM table_name
GROUP BY 1, 2
ORDER BY 1
+---------+--------+------------+
| month   | result | percentage |
+---------+--------+------------+
| 2021-01 | 0001   | 50         |
| 2021-01 | 0000   | 50         |
| 2021-02 | 00001  | 33.33      |
| 2021-02 | 0000   | 33.33      |
| 2021-02 | 0002   | 33.33      |
+---------+--------+------------+
Using the data that you shared, I used a subquery to deal with the month field, then used that field in the PARTITION BY, grouping by month and result:
WITH data AS (
  SELECT "2021-01-01" AS created_at, "0001" AS result UNION ALL
  SELECT "2021-01-01", "0000" UNION ALL
  SELECT "2021-02-01", "00001" UNION ALL
  SELECT "2021-02-01", "0000" UNION ALL
  SELECT "2021-02-01", "0002"
),
d AS (
  SELECT CAST(TIMESTAMP_TRUNC(CAST(created_at AS TIMESTAMP), MONTH) AS DATE) AS month,
    result, created_at
  FROM data
)
SELECT d.month,
  d.result,
  COUNT(*) * 100.0 / SUM(COUNT(1)) OVER (PARTITION BY month) AS percentage
FROM d
GROUP BY 1, 2
ORDER BY 1
The output matches the expected table above: 50/50 for 2021-01 and 33.33 for each result in 2021-02.
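Equivalently, the subquery can be skipped by repeating the grouping expression inside the OVER clause of the original query; a sketch of that one-line change (same table_name as in the question):
SELECT CAST(TIMESTAMP_TRUNC(CAST(created_at AS TIMESTAMP), MONTH) AS DATE) AS month,
  result,
  COUNT(*) * 100.0 / SUM(COUNT(1)) OVER (PARTITION BY CAST(TIMESTAMP_TRUNC(CAST(created_at AS TIMESTAMP), MONTH) AS DATE)) AS percentage
FROM table_name
GROUP BY 1, 2
ORDER BY 1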
This example is SQL Server code on db<>fiddle, but according to the documentation BigQuery supports the same COUNT( ~ ) OVER ( PARTITION BY ~ ) construct.
CREATE TABLE table_name (month CHAR(7), result INT);
INSERT INTO table_name VALUES
  ('2021-01', 50),
  ('2021-01', 30),
  ('2021-01', 20),
  ('2021-02', 70),
  ('2021-02', 80);

SELECT
  month,
  result,
  SUM(result) OVER (PARTITION BY month) AS month_total,
  100 * result / SUM(result) OVER (PARTITION BY month) AS per_cent
FROM table_name
ORDER BY month, result;
month   | result | month_total | per_cent
:------ | -----: | ----------: | -------:
2021-01 |     20 |         100 |       20
2021-01 |     30 |         100 |       30
2021-01 |     50 |         100 |       50
2021-02 |     70 |         150 |       46
2021-02 |     80 |         150 |       53
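The same query should run on BigQuery nearly verbatim; the one behavioral difference is that BigQuery's / returns FLOAT64, so the February rows come out as 46.67 and 53.33 rather than the truncated integers above. A sketch with the sample rows inlined:
WITH table_name AS (
  SELECT '2021-01' AS month, 50 AS result UNION ALL
  SELECT '2021-01', 30 UNION ALL
  SELECT '2021-01', 20 UNION ALL
  SELECT '2021-02', 70 UNION ALL
  SELECT '2021-02', 80
)
SELECT
  month,
  result,
  SUM(result) OVER (PARTITION BY month) AS month_total,
  100 * result / SUM(result) OVER (PARTITION BY month) AS per_cent
FROM table_name
ORDER BY month, result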

Select most popular hour per country based on number of sales

I want to get the most popular hour for each country, based on the max value of count(id), which tells how many purchases were made.
I've tried getting the max value of purchases and converting the timestamp into hours, but it always returns every hour for each country, when I want only a single hour (the one with the most purchases) per country.
The table is like:
id | country | time
1 | AE | 19:20:00.00000
1 | AE | 20:13:00.00000
3 | GB | 23:17:00.00000
4 | IN | 10:23:00.00000
6 | IN | 02:01:00.00000
7 | RU | 05:54:00.00000
2 | RU | 16:34:00.00000
SELECT MAX(purchases), country, tss
FROM (
  SELECT TIME_TRUNC(time, HOUR) AS tss,
    COUNT(id) AS purchases,
    country
  FROM spending
  WHERE dt > DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
  GROUP BY tss, country
)
GROUP BY tss, country
Expected output:
amount of purchases | Country | Most popular Hour
                 34 | GB      | 16:00
                445 | US      | 21:00
You can use window functions along with GROUP BY. Notice that it uses the RANK function, so if one particular country has the same number of sales at, say, 11AM and 2PM, it will return both hours for that country.
WITH cte AS (
  SELECT country
    , TIME_TRUNC(time, HOUR) AS hourofday
    , COUNT(id) AS purchases
    , RANK() OVER (PARTITION BY country ORDER BY COUNT(id) DESC) AS rnk
  FROM t
  GROUP BY country, TIME_TRUNC(time, HOUR)
)
SELECT *
FROM cte
WHERE rnk = 1
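BigQuery's QUALIFY clause can express the same filter without a CTE; a sketch against the same assumed table t:
SELECT country
  , TIME_TRUNC(time, HOUR) AS hourofday
  , COUNT(id) AS purchases
FROM t
GROUP BY country, hourofday
QUALIFY RANK() OVER (PARTITION BY country ORDER BY COUNT(id) DESC) = 1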

Start SUM aggregation at a certain threshold in BigQuery

The energy usage of a device is logged hourly:
+--------------+-----------+---------------------+
| energy_usage | device_id | timestamp           |
+--------------+-----------+---------------------+
| 10           | 1         | 2019-02-12T01:00:00 |
| 16           | 2         | 2019-02-12T01:00:00 |
| 26           | 1         | 2019-03-12T02:00:00 |
| 24           | 2         | 2019-03-12T02:00:00 |
+--------------+-----------+---------------------+
My goal is:
Create two columns, one for energy_usage_day (8am-8pm) and another for energy_usage_night (8pm-8am)
Create a monthly aggregate, group by device_id and sum up the energy usage
So the result might look like this:
+--------------+------------------+--------------------+-----------+-------+------+
| energy_usage | energy_usage_day | energy_usage_night | device_id | month | year |
+--------------+------------------+--------------------+-----------+-------+------+
| 80           | 30               | 50                 | 1         | 2     | 2019 |
| 130          | 60               | 70                 | 2         | 3     | 2019 |
+--------------+------------------+--------------------+-----------+-------+------+
The following query produces such results:
SELECT SUM(energy_usage) energy_usage
  , SUM(IF(EXTRACT(HOUR FROM timestamp) BETWEEN 8 AND 19, energy_usage, 0)) energy_usage_day
  , SUM(IF(EXTRACT(HOUR FROM timestamp) NOT BETWEEN 8 AND 19, energy_usage, 0)) energy_usage_night
  , device_id
  , EXTRACT(MONTH FROM timestamp) month
  , EXTRACT(YEAR FROM timestamp) year
FROM `data`
GROUP BY device_id, month, year
Say I am only interested in energy usage aggregates above a certain threshold, e.g. 50. I want to start the SUM at a total energy usage of 50. The result should look like this:
+--------------+------------------+--------------------+-----------+-------+------+
| energy_usage | energy_usage_day | energy_usage_night | device_id | month | year |
+--------------+------------------+--------------------+-----------+-------+------+
| 30           | 10               | 20                 | 1         | 2     | 2019 |
| 80           | 50               | 30                 | 2         | 3     | 2019 |
+--------------+------------------+--------------------+-----------+-------+------+
In other words: the query should start summing up energy_usage, energy_usage_day and energy_usage_night only when energy_usage reaches the threshold of 50.
Is this possible in BigQuery?
Below is for BigQuery Standard SQL; the logic is that it starts aggregating usage only after the running total reaches 50 (per device, per month):
#standardSQL
WITH temp AS (
  SELECT *, SUM(energy_usage) OVER (win) > 50 qualified,
    EXTRACT(HOUR FROM `timestamp`) BETWEEN 8 AND 19 day_hour,  -- hours 8-19 cover the question's 8am-8pm day window
    EXTRACT(MONTH FROM `timestamp`) month,
    EXTRACT(YEAR FROM `timestamp`) year
  FROM `project.dataset.table`
  WINDOW win AS (PARTITION BY device_id, TIMESTAMP_TRUNC(`timestamp`, MONTH) ORDER BY `timestamp`)
)
SELECT SUM(energy_usage) energy_usage,
  SUM(IF(day_hour, energy_usage, 0)) energy_usage_day,
  SUM(IF(NOT day_hour, energy_usage, 0)) energy_usage_night,
  device_id,
  month,
  year
FROM temp
WHERE qualified
GROUP BY device_id, month, year
Say the current SUM of usage is 49 and the next usage entry has a value of 2. The SUM will be 51, so the full 2 will be added, when only the part above the threshold (1) should've been. Can we solve such a problem in BigQuery SQL?
#standardSQL
WITH temp AS (
  SELECT *, SUM(energy_usage) OVER (win) > 50 qualified,
    SUM(energy_usage) OVER (win) - 50 rolling_sum,
    EXTRACT(HOUR FROM `timestamp`) BETWEEN 8 AND 19 day_hour,  -- 8am-8pm, as above
    EXTRACT(MONTH FROM `timestamp`) month,
    EXTRACT(YEAR FROM `timestamp`) year
  FROM `project.dataset.table`
  WINDOW win AS (PARTITION BY device_id, TIMESTAMP_TRUNC(`timestamp`, MONTH) ORDER BY `timestamp`)
), temp_with_adjustments AS (
  SELECT *,
    IF(
      ROW_NUMBER() OVER (PARTITION BY device_id, month, year ORDER BY `timestamp`) = 1,
      rolling_sum,  -- first qualified row: count only the part above the threshold
      energy_usage
    ) AS adjusted_energy_usage
  FROM temp
  WHERE qualified
)
SELECT SUM(adjusted_energy_usage) energy_usage,
  SUM(IF(day_hour, adjusted_energy_usage, 0)) energy_usage_day,
  SUM(IF(NOT day_hour, adjusted_energy_usage, 0)) energy_usage_night,
  device_id,
  month,
  year
FROM temp_with_adjustments
GROUP BY device_id, month, year
As you can see, I've just added the logic for temp_with_adjustments (and rolling_sum in temp to support it); the rest is the same.
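To sanity-check the adjustment on the 49-then-2 scenario, here is a minimal self-contained harness (hypothetical rows, with the day/night split dropped for brevity); the first qualified row contributes only its rolling_sum of 1:
#standardSQL
WITH `project.dataset.table` AS (
  SELECT 49 energy_usage, 1 device_id, TIMESTAMP '2019-02-12 01:00:00' AS `timestamp` UNION ALL
  SELECT 2, 1, TIMESTAMP '2019-02-12 02:00:00' UNION ALL
  SELECT 7, 1, TIMESTAMP '2019-02-12 09:00:00'
), temp AS (
  SELECT *, SUM(energy_usage) OVER (win) > 50 qualified,
    SUM(energy_usage) OVER (win) - 50 rolling_sum
  FROM `project.dataset.table`
  WINDOW win AS (PARTITION BY device_id, TIMESTAMP_TRUNC(`timestamp`, MONTH) ORDER BY `timestamp`)
)
SELECT device_id, SUM(adjusted) AS adjusted_energy_usage
FROM (
  SELECT device_id,
    IF(ROW_NUMBER() OVER (PARTITION BY device_id ORDER BY `timestamp`) = 1,
       rolling_sum, energy_usage) AS adjusted  -- running sums: 49 (not qualified), 51, 58
  FROM temp
  WHERE qualified
)
GROUP BY device_id
-- expected: adjusted_energy_usage = 1 + 7 = 8 (of a raw total of 58)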

Get sum over last entries per day per article

Let's say there is a table structured like this:
ID | article_id | article_count | created_at
---|------------|---------------|-----------------------------
 1 | 1          | 10            | 2019-03-20T18:20:03.685059Z
 2 | 1          | 22            | 2019-03-20T19:20:03.685059Z
 3 | 2          | 32            | 2019-03-20T18:20:03.685059Z
 4 | 2          | 20            | 2019-03-20T19:20:03.685059Z
 5 | 1          | 3             | 2019-03-21T18:20:03.685059Z
 6 | 1          | 15            | 2019-03-21T19:20:03.685059Z
 7 | 2          | 3             | 2019-03-21T18:20:03.685059Z
 8 | 2          | 30            | 2019-03-21T19:20:03.685059Z
The goal now is to sum over all article_count of all article_ids for the last entries per day and give back this total count per day. So in the case above I'd like to get a result showing:
total | date
------|------------
   42 | 2019-03-20
   45 | 2019-03-21
So far, I tried something like:
SELECT SUM(article_count), DATE_TRUNC('day', created_at)
FROM myTable
WHERE created_at IN (
  SELECT DISTINCT ON (a.created_at::date, article_id::int) created_at
  FROM myTable a
  ORDER BY created_at::date DESC, article_id, created_at DESC
)
GROUP BY DATE_TRUNC('day', created_at)
In the DISTINCT query I tried to pull only the latest entries per day per article_id, and then match on created_at to sum up all the article_count values.
This does not work: it still outputs the sum of the whole day instead of summing over only the last entries.
Besides that, I am quite sure there is a more elegant way than the WHERE condition.
Thanks in advance (as well for any explanation).
I think you just want to filter down to the last entry per day for each article:
SELECT DATE_TRUNC('day', created_at), SUM(article_count)
FROM (
  SELECT DISTINCT ON (a.created_at::date, article_id::int) a.*
  FROM myTable a
  ORDER BY article_id, created_at::date DESC, created_at DESC
) a
GROUP BY DATE_TRUNC('day', created_at);
You are looking for a window ranking function:
WITH cte AS (
  SELECT article_id,
    article_count,
    DATE_TRUNC('day', created_at) AS some_date,
    ROW_NUMBER() OVER (
      PARTITION BY article_id, DATE_TRUNC('day', created_at)
      ORDER BY created_at DESC) AS n
  FROM mytable
)
SELECT SUM(article_count) AS total,
  some_date
FROM cte
WHERE n = 1
GROUP BY some_date
It just adds up the first (latest) entry of each day per article.
Check it at https://rextester.com/INODNS67085
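Since the rest of this thread is about BigQuery: BigQuery has no DISTINCT ON, but the same ROW_NUMBER pattern carries over directly. A sketch, assuming created_at is a TIMESTAMP column in myTable:
SELECT DATE(created_at) AS some_date, SUM(article_count) AS total
FROM (
  SELECT article_count, created_at,
    ROW_NUMBER() OVER (PARTITION BY article_id, DATE(created_at)
                       ORDER BY created_at DESC) AS n
  FROM myTable
)
WHERE n = 1
GROUP BY some_date
ORDER BY some_date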

Query to find records that were created one after another in BigQuery

I am playing around with BigQuery. The following input is given:
+----------+-------+-------+------+--------------------+
| customer | agent | value | city | timestamp          |
+----------+-------+-------+------+--------------------+
| 1        | 1     | 106   | LA   | 2019-02-12 03:05pm |
| 1        | 1     | 251   | LA   | 2019-02-12 03:06pm |
| 3        | 2     | 309   | NY   | 2019-02-12 06:41pm |
| 1        | 1     | 654   | LA   | 2019-02-12 05:12pm |
+----------+-------+-------+------+--------------------+
I want to find transactions that were issued one after another (say within 5 minutes) by one and the same agent. So the output for the above table should look like:
+----------+-------+-------+------+--------------------+
| customer | agent | value | city | timestamp          |
+----------+-------+-------+------+--------------------+
| 1        | 1     | 106   | LA   | 2019-02-12 03:05pm |
| 1        | 1     | 251   | LA   | 2019-02-12 03:06pm |
+----------+-------+-------+------+--------------------+
The query should somehow group by agent and find such transactions. However, the result is not really grouped, as you can see from the output. My first thought was using the LEAD function, but I am not sure. Do you have any ideas?
Ideas for a query:
sort by agent and timestamp DESC
start with the first row, look at the following row (using LEAD?)
check if the timestamp difference is less than 5 minutes
if so, these two rows should be in the output
continue with the next (2nd) row
When the 2nd and 3rd rows match the criteria too, the 2nd row will get into the output, which would cause duplicate rows. I am not sure how to avoid that yet.
There must be an easier way but does this achieve what you are after?
WITH CTE AS (
  SELECT * FROM `project.dataset.yourtable`  -- assumed source table, with timestamp already parsed as a TIMESTAMP
),
CTE2 AS (
  SELECT customer, agent, value, city, timestamp,
    LEAD(timestamp, 1) OVER (PARTITION BY agent ORDER BY timestamp) timestamp_lead,
    LEAD(customer, 1) OVER (PARTITION BY agent ORDER BY timestamp) customer_lead,
    LEAD(value, 1) OVER (PARTITION BY agent ORDER BY timestamp) value_lead,
    LEAD(city, 1) OVER (PARTITION BY agent ORDER BY timestamp) city_lead,
    LAG(timestamp, 1) OVER (PARTITION BY agent ORDER BY timestamp) timestamp_lag
  FROM CTE
)
SELECT agent,
  IF(TIMESTAMP_DIFF(timestamp_lead, timestamp, MINUTE) < 5, CONCAT(CAST(customer AS STRING), ', ', CAST(customer_lead AS STRING)), CAST(customer AS STRING)) customer,
  IF(TIMESTAMP_DIFF(timestamp_lead, timestamp, MINUTE) < 5, CONCAT(CAST(value AS STRING), ', ', CAST(value_lead AS STRING)), CAST(value AS STRING)) value,
  IF(TIMESTAMP_DIFF(timestamp_lead, timestamp, MINUTE) < 5, CONCAT(CAST(city AS STRING), ', ', CAST(city_lead AS STRING)), CAST(city AS STRING)) cities,
  IF(TIMESTAMP_DIFF(timestamp_lead, timestamp, MINUTE) < 5, CONCAT(CAST(timestamp AS STRING), ', ', CAST(timestamp_lead AS STRING)), CAST(timestamp AS STRING)) timestamps
FROM CTE2
WHERE (TIMESTAMP_DIFF(timestamp_lead, timestamp, MINUTE) < 5 OR NOT TIMESTAMP_DIFF(timestamp, timestamp_lag, MINUTE) < 5)
Below is for BigQuery Standard SQL
#standardSQL
SELECT * FROM (
  SELECT *,
    IF(TIMESTAMP_DIFF(LEAD(ts) OVER (PARTITION BY agent ORDER BY ts), ts, MINUTE) < 5,
      LEAD(STRUCT(customer AS next_customer, value AS next_value)) OVER (PARTITION BY agent ORDER BY ts),
      NULL).*
  FROM `project.dataset.yourtable`
)
WHERE NOT next_customer IS NULL
You can test and play with the above using the sample data from your question, as in the example below:
#standardSQL
WITH `project.dataset.table` AS (
  SELECT 1 customer, 1 agent, 106 value, 'LA' city, '2019-02-12 03:05pm' ts UNION ALL
  SELECT 1, 1, 251, 'LA', '2019-02-12 03:06pm' UNION ALL
  SELECT 3, 2, 309, 'NY', '2019-02-12 06:41pm' UNION ALL
  SELECT 1, 1, 654, 'LA', '2019-02-12 05:12pm'
), temp AS (
  SELECT customer, agent, value, city, PARSE_TIMESTAMP('%Y-%m-%d %I:%M%p', ts) ts
  FROM `project.dataset.table`
)
SELECT * FROM (
  SELECT *,
    IF(TIMESTAMP_DIFF(LEAD(ts) OVER (PARTITION BY agent ORDER BY ts), ts, MINUTE) < 5,
      LEAD(STRUCT(customer AS next_customer, value AS next_value)) OVER (PARTITION BY agent ORDER BY ts),
      NULL).*
  FROM temp
)
WHERE NOT next_customer IS NULL
-- ORDER BY ts
with result:
Row | customer | agent | value | city | ts                      | next_customer | next_value
1   | 1        | 1     | 106   | LA   | 2019-02-12 15:05:00 UTC | 1             | 251