Start SUM aggregation at a certain threshold in bigquery

The energy usage of a device is logged hourly:
+--------------+-----------+---------------------+
| energy_usage | device_id | timestamp           |
+--------------+-----------+---------------------+
| 10           | 1         | 2019-02-12T01:00:00 |
| 16           | 2         | 2019-02-12T01:00:00 |
| 26           | 1         | 2019-03-12T02:00:00 |
| 24           | 2         | 2019-03-12T02:00:00 |
+--------------+-----------+---------------------+
My goal is:
Create two columns, one for energy_usage_day (8am-8pm) and another for energy_usage_night (8pm-8am)
Create a monthly aggregate, group by device_id and sum up the energy usage
So the result might look like this:
+--------------+------------------+--------------------+-----------+-------+------+
| energy_usage | energy_usage_day | energy_usage_night | device_id | month | year |
+--------------+------------------+--------------------+-----------+-------+------+
| 80           | 30               | 50                 | 1         | 2     | 2019 |
| 130          | 60               | 70                 | 2         | 3     | 2019 |
+--------------+------------------+--------------------+-----------+-------+------+
The following query produces these results:
SELECT SUM(energy_usage) energy_usage
, SUM(IF(EXTRACT(HOUR FROM timestamp) BETWEEN 8 AND 19, energy_usage, 0)) energy_usage_day
, SUM(IF(EXTRACT(HOUR FROM timestamp) NOT BETWEEN 8 AND 19, energy_usage, 0)) energy_usage_night
, device_id
, EXTRACT(MONTH FROM timestamp) month, EXTRACT(YEAR FROM timestamp) year
FROM `data`
GROUP BY device_id, month, year
Say I am only interested in energy usage aggregates above a certain threshold, e.g. 50. I want to start the SUM at a total energy usage of 50. The result should look like this:
+--------------+------------------+--------------------+-----------+-------+------+
| energy_usage | energy_usage_day | energy_usage_night | device_id | month | year |
+--------------+------------------+--------------------+-----------+-------+------+
| 30           | 10               | 20                 | 1         | 2     | 2019 |
| 80           | 50               | 30                 | 2         | 3     | 2019 |
+--------------+------------------+--------------------+-----------+-------+------+
In other words: the query should start summing up energy_usage, energy_usage_day and energy_usage_night only when energy_usage reaches the threshold of 50.
Is this possible in bigquery?

Below is for BigQuery Standard SQL; the logic is that it starts aggregating usage only after it reaches 50 (per device, per month).
#standardSQL
WITH temp AS (
  SELECT *,
    SUM(energy_usage) OVER(win) > 50 qualified,
    EXTRACT(HOUR FROM `timestamp`) BETWEEN 8 AND 20 day_hour,
    EXTRACT(MONTH FROM `timestamp`) month,
    EXTRACT(YEAR FROM `timestamp`) year
  FROM `project.dataset.table`
  WINDOW win AS (PARTITION BY device_id, TIMESTAMP_TRUNC(`timestamp`, MONTH) ORDER BY `timestamp`)
)
SELECT SUM(energy_usage) energy_usage,
  SUM(IF(day_hour, energy_usage, 0)) energy_usage_day,
  SUM(IF(NOT day_hour, energy_usage, 0)) energy_usage_night,
  device_id,
  month,
  year
FROM temp
WHERE qualified
GROUP BY device_id, month, year
Say the current SUM of usage is 49 and the next usage entry has a value of 2, bringing the SUM to 51. With the query above, the full 2 is added to the aggregate, but only 1 (the part above the threshold of 50) should have been counted. Can we solve such a problem in BigQuery SQL?
#standardSQL
WITH temp AS (
  SELECT *,
    SUM(energy_usage) OVER(win) > 50 qualified,
    SUM(energy_usage) OVER(win) - 50 rolling_sum,
    EXTRACT(HOUR FROM `timestamp`) BETWEEN 8 AND 20 day_hour,
    EXTRACT(MONTH FROM `timestamp`) month,
    EXTRACT(YEAR FROM `timestamp`) year
  FROM `project.dataset.table`
  WINDOW win AS (PARTITION BY device_id, TIMESTAMP_TRUNC(`timestamp`, MONTH) ORDER BY `timestamp`)
), temp_with_adjustments AS (
  SELECT *,
    IF(
      ROW_NUMBER() OVER(PARTITION BY device_id, month, year ORDER BY `timestamp`) = 1,
      rolling_sum,
      energy_usage
    ) AS adjusted_energy_usage
  FROM temp
  WHERE qualified
)
SELECT SUM(adjusted_energy_usage) energy_usage,
  SUM(IF(day_hour, adjusted_energy_usage, 0)) energy_usage_day,
  SUM(IF(NOT day_hour, adjusted_energy_usage, 0)) energy_usage_night,
  device_id,
  month,
  year
FROM temp_with_adjustments
GROUP BY device_id, month, year
As you can see, I've just added the temp_with_adjustments logic (and rolling_sum in temp to support it); the rest is the same.
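To see the adjustment in action, here is a minimal, self-contained sketch with made-up sample rows (one device, one month, cumulative usage 30, 49, 51, 61). The column names follow the question's schema, but the data and the energy_usage_above_50 alias are purely illustrative:
#standardSQL
WITH sample AS (
  SELECT 1 AS device_id, TIMESTAMP '2019-02-12 01:00:00' AS `timestamp`, 30 AS energy_usage UNION ALL
  SELECT 1, TIMESTAMP '2019-02-12 02:00:00', 19 UNION ALL
  SELECT 1, TIMESTAMP '2019-02-12 03:00:00', 2 UNION ALL
  SELECT 1, TIMESTAMP '2019-02-12 04:00:00', 10
), temp AS (
  SELECT *,
    SUM(energy_usage) OVER(win) > 50 AS qualified,
    SUM(energy_usage) OVER(win) - 50 AS rolling_sum
  FROM sample
  WINDOW win AS (PARTITION BY device_id, TIMESTAMP_TRUNC(`timestamp`, MONTH) ORDER BY `timestamp`)
)
SELECT device_id, SUM(adjusted_energy_usage) AS energy_usage_above_50
FROM (
  -- the first qualified row contributes only the part above 50 (rolling_sum = 1),
  -- every later row contributes its full usage
  SELECT device_id,
    IF(ROW_NUMBER() OVER(PARTITION BY device_id ORDER BY `timestamp`) = 1, rolling_sum, energy_usage) AS adjusted_energy_usage
  FROM temp
  WHERE qualified
)
GROUP BY device_id
-- expected result: 1 (from the 49 -> 51 row) + 10 = 11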

Related

percentage per month Bigquery

I am working in BigQuery and I need the percentage for each result for each month. I have the following query, but the percentage is calculated with respect to the overall total; I have tried adding a PARTITION BY in the OVER clause, but it does not work.
SELECT CAST(TIMESTAMP_TRUNC(CAST((created_at) AS TIMESTAMP), MONTH) AS DATE) AS `month`,
result,
count(*) * 100.0 / sum(count(1)) over() as percentage
FROM table_name
GROUP BY 1,2
ORDER BY 1
| month   | result | percentage |
| ------- | ------ | ---------- |
| 2021-01 | 0001   | 50         |
| 2021-01 | 0000   | 50         |
| 2021-02 | 00001  | 33.33      |
| 2021-02 | 0000   | 33.33      |
| 2021-02 | 0002   | 33.33      |
Using the data that you shared as:
WITH data AS (
  SELECT "2021-01-01" as created_at, "0001" as result UNION ALL
  SELECT "2021-01-01", "0000" UNION ALL
  SELECT "2021-02-01", "00001" UNION ALL
  SELECT "2021-02-01", "0000" UNION ALL
  SELECT "2021-02-01", "0002"
)
I added a second CTE, d, on top of the data CTE above to derive the month field, then used that field to partition by, and grouped by month and result:
, d AS (
  SELECT CAST(TIMESTAMP_TRUNC(CAST((created_at) AS TIMESTAMP), MONTH) AS DATE) AS month,
         result, created_at
  FROM data
)
SELECT d.month,
       d.result,
       count(*) * 100.0 / sum(count(1)) over(partition by month) as percentage
FROM d
GROUP BY 1, 2
ORDER BY 1
The output matches the expected result shown above (50/50 for 2021-01 and 33.33 for each result in 2021-02).
This example is coded on db<>fiddle for SQL Server, but according to the documentation, google-bigquery supports the same COUNT( ~ ) OVER ( PARTITION BY ~ ) construct.
create table table_name(month char(7), result int)
insert into table_name values
('2021-01',50),
('2021-01',30),
('2021-01',20),
('2021-02',70),
('2021-02',80);
select
month,
result,
sum(result) over (partition by month) month_total,
100 * result / sum(result) over (partition by month) per_cent
from table_name
order by month, result;
month | result | month_total | per_cent
:------ | -----: | ----------: | -------:
2021-01 | 20 | 100 | 20
2021-01 | 30 | 100 | 30
2021-01 | 50 | 100 | 50
2021-02 | 70 | 150 | 46
2021-02 | 80 | 150 | 53
db<>fiddle here
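For reference, a rough BigQuery Standard SQL equivalent of the fiddle above (a sketch only; the table_name CTE and column names mirror the fiddle, not the asker's table):
WITH table_name AS (
  SELECT '2021-01' AS month, 50 AS result UNION ALL
  SELECT '2021-01', 30 UNION ALL
  SELECT '2021-01', 20 UNION ALL
  SELECT '2021-02', 70 UNION ALL
  SELECT '2021-02', 80
)
SELECT month,
       result,
       SUM(result) OVER (PARTITION BY month) AS month_total,
       ROUND(100 * result / SUM(result) OVER (PARTITION BY month), 2) AS per_cent
FROM table_name
ORDER BY month, result
Unlike the SQL Server fiddle, the division here is floating-point, so the February rows come out as 46.67 and 53.33 rather than the truncated 46 and 53.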

How to calculate occurrence depending on months/years

My table looks like this:
ID | Start | End
1 | 2010-01-02 | 2010-01-04
1 | 2010-01-22 | 2010-01-24
1 | 2011-01-31 | 2011-02-02
2 | 2012-05-02 | 2012-05-08
3 | 2013-01-02 | 2013-01-03
4 | 2010-09-15 | 2010-09-20
4 | 2010-09-30 | 2010-10-05
I'm looking for a way to count the number of occurrences for each ID per year and month.
Importantly, if a record's End date falls in the month following its Start date (within the same year), the occurrence should be counted for both months [e.g. ID 1 in the 3rd row: the occurrence for this ID should be +1 for January and +1 for February].
So I'd like to have it in this way:
Year | Month | Id | Occurrence
2010 | 01 | 1 | 2
2010 | 09 | 4 | 2
2010 | 10 | 4 | 1
2011 | 01 | 1 | 1
2011 | 02 | 1 | 1
2012 | 05 | 2 | 1
2013 | 01 | 3 | 1
I created only this for now...
CREATE TABLE IF NOT EXISTS counts AS
(SELECT
   id,
   YEAR(CAST(Start AS DATE)) AS Year_St,
   MONTH(CAST(Start AS DATE)) AS Month_St,
   YEAR(CAST(End AS DATE)) AS Year_End,
   MONTH(CAST(End AS DATE)) AS Month_End
 FROM source)
And I don't know how to move with that further. I'd appreciate your help.
I'm using Spark SQL.
Try the following strategy to achieve this:
Note:
I have created a few intermediate tables. If you wish, you can use sub-queries or CTEs instead, depending on the permissions.
I have taken care of the 2 scenarios you mentioned (whether to count 1 occurrence or 2 occurrences), as you explained.
Query:
Firstly, create a table with a flag that records whether the start and end dates fall in the same year and month (1 means YES, 2 means NO):
/* Creating a table with flags whether to count the occurrences once or twice */
CREATE TABLE flagged as
(
  SELECT *,
    CASE
      WHEN Year_st = Year_end and Month_st = Month_end then 1
      WHEN Year_st = Year_end and Month_st <> Month_end then 2
      Else 0
    end as flag
  FROM
  (
    SELECT
      id,
      YEAR(CAST(Start AS DATE)) AS Year_St,
      MONTH(CAST(Start AS DATE)) AS Month_St,
      YEAR(CAST(End AS DATE)) AS Year_End,
      MONTH(CAST(End AS DATE)) AS Month_End
    FROM source
  ) as calc
)
The flag in the above table will be 1 if the year and month are the same for start and end, and 2 if the month differs. You can add more flag categories if you have more scenarios.
Secondly, count the occurrences for flag 1. Since year and month are the same for flag 1, we can take either; I have taken the start:
/* Counting occurrences only for flag 1 */
CREATE TABLE flg1 as (
  SELECT distinct id, year_st, month_st, count(*) as occurrence
  FROM flagged
  where flag=1
  GROUP BY id, year_st, month_st
)
Similarly, count the occurrences for flag 2. Since the month differs between the two dates, we UNION them before counting to get both dates into the same columns:
/* Counting occurrences only for flag 2 */
CREATE TABLE flg2 as
(
  SELECT distinct id, year_dt, month_dt, count(*) as occurrence
  FROM
  (
    select ID, year_st as year_dt, month_st as month_dt FROM flagged where flag=2
    UNION
    SELECT ID, year_end as year_dt, month_end as month_dt FROM flagged where flag=2
  ) as unioned
  GROUP BY id, year_dt, month_dt
)
Finally, we just have to SUM the occurrences from both flags. Note that we use UNION ALL here to combine the two tables; this is important because we need to keep duplicates as well:
/* UNIONing both final tables and summing the occurrences */
SELECT distinct year, month, id, SUM(occurrence) as occurrence
FROM
(
  SELECT distinct id, year_st as year, month_st as month, occurrence
  FROM flg1
  UNION ALL
  SELECT distinct id, year_dt as year, month_dt as month, occurrence
  FROM flg2
) as fin_unioned
GROUP BY id, year, month
ORDER BY year, month, id, occurrence desc
The output of the above query will be your expected output. I know this is not an optimized approach, yet it works perfectly. I will update if I come across an optimized strategy. Comment if you have questions.
db<>fiddle link here
Not sure if this works in Spark SQL.
But if the ranges aren't bigger than 1 month, then just add the extras to the count via a UNION ALL.
The extras are those rows whose End falls in a later month than their Start.
SELECT YearOcc, MonthOcc, Id
     , COUNT(*) as Occurrence
FROM
(
  SELECT Id
       , YEAR(CAST(Start AS DATE)) as YearOcc
       , MONTH(CAST(Start AS DATE)) as MonthOcc
  FROM source
  UNION ALL
  SELECT Id
       , YEAR(CAST(End AS DATE)) as YearOcc
       , MONTH(CAST(End AS DATE)) as MonthOcc
  FROM source
  WHERE MONTH(CAST(Start AS DATE)) < MONTH(CAST(End AS DATE))
) q
GROUP BY YearOcc, MonthOcc, Id
ORDER BY YearOcc, MonthOcc, Id
YearOcc | MonthOcc | Id | Occurrence
------: | -------: | -: | ---------:
2010 | 1 | 1 | 2
2010 | 9 | 4 | 2
2010 | 10 | 4 | 1
2011 | 1 | 1 | 1
2011 | 2 | 1 | 1
2012 | 5 | 2 | 1
2013 | 1 | 3 | 1
db<>fiddle here
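One caveat with the month-only comparison above: a range that crosses a year boundary (e.g. 2010-12-31 to 2011-01-02) has MONTH(End) < MONTH(Start) and would be missed. A hedged variant of the same UNION ALL idea that compares year and month together (a sketch, not part of the original answers):
SELECT YearOcc, MonthOcc, Id
     , COUNT(*) as Occurrence
FROM
(
  SELECT Id
       , YEAR(CAST(Start AS DATE)) as YearOcc
       , MONTH(CAST(Start AS DATE)) as MonthOcc
  FROM source
  UNION ALL
  SELECT Id
       , YEAR(CAST(End AS DATE)) as YearOcc
       , MONTH(CAST(End AS DATE)) as MonthOcc
  FROM source
  -- count the End month as well whenever it differs from the Start month;
  -- comparing year*12 + month keeps Dec -> Jan ranges from being skipped
  WHERE YEAR(CAST(Start AS DATE)) * 12 + MONTH(CAST(Start AS DATE))
      < YEAR(CAST(End AS DATE)) * 12 + MONTH(CAST(End AS DATE))
) q
GROUP BY YearOcc, MonthOcc, Id
ORDER BY YearOcc, MonthOcc, Id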

BigQuery SQL - Concatenate two columns if they are on consecutive days

I am looking for a way to adjust this SQL query running in BigQuery so that it returns a single combined count for Sent EventTypes that happen two or even three days in a row.
SELECT date(EventDate) as EventDate, EventType, count(*) as count FROM `Database.Table`
where date(EventDate) > DATE_SUB (CURRENT_DATE, INTERVAL 100 DAY)
Group by 1,2
ORDER by 1,2
Response from above Query:
| Row | EventDate | EventType | count |
| ------ | --------- |-----------|-------|
| 1 | 2019-02-06| Sent | 4 |
| 2 | 2019-02-07| Sent | 5 |
| 3 | 2019-02-12| NotSent | 7 |
| 4 | 2019-02-13| Bounces | 22 |
| 5 | 2019-02-14| Bounces | 22 |
| 6 | 2019-03-06| Sent | 2 |
| 7 | 2019-03-07| Sent | 4 |
| 8 | 2019-03-07| NotSent | 5 |
| 9 | 2019-03-12| Bounces | 7 |
| 10 | 2019-03-13| Sent | 22 |
| 11 | 2019-04-05| Sent | 2 |
Response I would like to get to:
| Row | EventDate | EventType | count |
| ------ | --------- |-----------|-------|
| 1 | 2019-02-06| Sent | 9 |
| 2 | 2019-02-12| NotSent | 7 |
| 3 | 2019-02-13| Bounces | 22 |
| 4 | 2019-02-14| Bounces | 22 |
| 5 | 2019-03-06| Sent | 6 |
| 6 | 2019-03-07| NotSent | 5 |
| 7 | 2019-03-12| Bounces | 7 |
| 8 | 2019-03-13| Sent | 22 |
| 9 | 2019-04-05| Sent | 2 |
Something along those lines, so I am able to combine the counts of the 'Sent' EventType for consecutive days, and show other EventTypes, such as Bounces and NotSent, without combining them.
I wrote a query that merges all consecutive 2 days in the table.
It gives the exact same output you want.
I think you meant '2019-03-06' in the 5th row, so I fixed it in my dummy data section.
WITH
  data AS (
    SELECT CAST('2019-02-06' as date) as EventDate, 4 as count union all
    SELECT CAST('2019-02-07' as date) as EventDate, 5 as count union all
    SELECT CAST('2019-02-12' as date) as EventDate, 7 as count union all
    SELECT CAST('2019-02-13' as date) as EventDate, 22 as count union all
    SELECT CAST('2019-03-06' as date) as EventDate, 2 as count
  ),
  data_with_steps AS (
    SELECT *,
      IF(DATE_DIFF(EventDate, LAG(EventDate) OVER (ORDER BY EventDate), day) > 2, 1, 0) as new_step
    FROM data
  ),
  data_grouped AS (
    SELECT *,
      SUM(new_step) OVER (ORDER BY EventDate) as step_group
    FROM data_with_steps
  )
SELECT MIN(EventDate) as EventDate, sum(count) as count
FROM data_grouped
GROUP BY step_group
So, how does it work?
First, I calculate the date difference to the previous row's date. If it's more than 2 days, I set the new column new_step to 1, otherwise 0.
Then, I calculate the cumulative sum of the new_step column and name it step_group, so every run of nearby dates shares the same step_group value.
At the final step, I group the table by step_group, take the minimum date as the event date, and sum the counts to obtain the group count.
Edit:
To include the other event types without grouping them, I added a new version. I think the most intuitive and easiest way is to use UNION ALL for that, so the updated query below includes the other events without grouping:
WITH
  data AS (
    SELECT CAST('2019-02-06' as date) as EventDate, 'Sent' as EventType, 4 as count union all
    SELECT CAST('2019-02-07' as date) as EventDate, 'Sent' as EventType, 5 as count union all
    SELECT CAST('2019-02-12' as date) as EventDate, 'Sent' as EventType, 7 as count union all
    SELECT CAST('2019-02-13' as date) as EventDate, 'Sent' as EventType, 22 as count union all
    SELECT CAST('2019-03-06' as date) as EventDate, 'Sent' as EventType, 2 as count union all
    SELECT CAST('2019-02-12' as date) as EventDate, 'NotSent' as EventType, 7 as count union all
    SELECT CAST('2019-03-07' as date) as EventDate, 'NotSent' as EventType, 5 as count union all
    SELECT CAST('2019-02-13' as date) as EventDate, 'Bounces' as EventType, 22 as count union all
    SELECT CAST('2019-02-14' as date) as EventDate, 'Bounces' as EventType, 22 as count union all
    SELECT CAST('2019-03-12' as date) as EventDate, 'Bounces' as EventType, 7 as count
  ),
  data_with_steps AS (
    SELECT *,
      IF(DATE_DIFF(EventDate, LAG(EventDate) OVER (ORDER BY EventDate), day) > 2, 1, 0) as new_step
    FROM data
    WHERE EventType = 'Sent'
  ),
  data_grouped AS (
    SELECT *,
      SUM(new_step) OVER (ORDER BY EventDate) as step_group
    FROM data_with_steps
  )
SELECT EventType, MIN(EventDate) as EventDate, sum(count) as count
FROM data_grouped
GROUP BY EventType, step_group
UNION ALL
SELECT EventType, EventDate, count
FROM data
WHERE EventType != 'Sent'
This is a gaps-and-islands problem. The simplest method is to use row_number() and subtraction to identify the "islands". And then aggregate:
select eventType, min(eventDate) as eventDate, sum(count) as count
from (select t.*,
             row_number() over (partition by eventType order by eventDate) as seqnum
      from t
     ) t
group by eventType, date_sub(eventDate, interval seqnum day)

SQL Server : processing by group

I have a table with the following data:
Id Date Value
---------------------------
1 Dec-01-2019 10
1 Dec-03-2019 5
1 Dec-05-2019 8
1 Jan-03-2020 6
1 Jan-07-2020 3
1 Jan-08-2020 9
2 Dec-01-2019 4
2 Dec-03-2019 7
2 Dec-31-2019 9
2 Jan-04-2020 4
2 Jan-09-2020 6
I need to group it into the following format: 1 record per month per id. If the month is closed, the date should be the last day of that month; if not, the last day available. Max and average are calculated using all data up to that date.
Id Date Max_Value Average_Value
-----------------------------------------------
1 Dec-31-2019 10 7,6
1 Jan-08-2020 10 6,8
2 Dec-31-2019 9 6,6
2 Jan-09-2020 9 6,0
Any easy SQL to obtain this analysis?
Regards,
Hmmm . . . You want to aggregate by month and then just take the maximum date in the month:
select id, max(date), max(value), avg(value * 1.0)
from t
group by id, eomonth(date)
order by id, max(date);
If by closed month you mean a month that is not the id's last month, then:
select id,
       case
         when year(Date) = year(maxDate) and month(Date) = month(maxDate) then maxDate
         else eomonth(Date)
       end Date,
       max(maxValue) Max_Value,
       round(avg(1.0 * Value), 1) Average_Value
from (
  select *,
         max(Date) over (partition by Id) maxDate,
         max(Value) over (partition by Id) maxValue
  from tablename
) t
group by id,
         case
           when year(Date) = year(maxDate) and month(Date) = month(maxDate) then maxDate
           else eomonth(Date)
         end
order by id, Date
See the demo.
Results:
> id | Date | Max_Value | Average_Value
> -: | :--------- | --------: | :------------
> 1 | 2019-12-31 | 10 | 7.7
> 1 | 2020-01-08 | 10 | 6.0
> 2 | 2019-12-31 | 9 | 6.7
> 2 | 2020-01-09 | 9 | 5.0
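The expected output ("Max and average are calculated using all data until that date", e.g. 6,8 for id 1 in January) suggests a running calculation across months rather than a per-month one. A hedged sketch of that reading in SQL Server syntax, keeping one row per id and month (the tablename table follows the answers above; EOMONTH/IIF require SQL Server 2012+):
SELECT id,
       IIF(rn_overall = 1, Date, EOMONTH(Date)) AS Date,   -- open (latest) month keeps its last real date
       Max_Value,
       ROUND(Average_Value, 1) AS Average_Value
FROM (
    SELECT id, Date,
           MAX(Value)       OVER (PARTITION BY id ORDER BY Date) AS Max_Value,      -- running max up to this date
           AVG(1.0 * Value) OVER (PARTITION BY id ORDER BY Date) AS Average_Value,  -- running average up to this date
           ROW_NUMBER() OVER (PARTITION BY id, EOMONTH(Date) ORDER BY Date DESC) AS rn_month,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY Date DESC) AS rn_overall
    FROM tablename
) t
WHERE rn_month = 1      -- keep only the last available row of each month
ORDER BY id, Date;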

Calculate 7, 14 and 30 day moving average in bigquery

I am playing around with bigquery. I have IoT uptime recordings as input:
+---------------+-------------+----------+------------+
| device_id | reference | uptime | timestamp |
+---------------+-------------+----------+------------+
| 1 | 1000-5 | 0.7 | 2019-02-12 |
| 2 | 1000-6 | 0.9 | 2019-02-12 |
| 1 | 1000-5 | 0.8 | 2019-02-11 |
| 2 | 1000-6 | 0.95 | 2019-02-11 |
+---------------+-------------+----------+------------+
I want to calculate the 7, 14 and 30 day moving average of the uptime grouped by device. The output should look as follows:
+-----------+-----------+-------+--------+--------+
| device_id | reference | avg_7 | avg_14 | avg_30 |
+-----------+-----------+-------+--------+--------+
| 1         | 1000-5    | 0.7   | ..     | ..     |
| 2         | 1000-6    | 0.9   | ..     | ..     |
+-----------+-----------+-------+--------+--------+
What I have tried:
SELECT
device_id,
AVG(uptime) OVER (ORDER BY day RANGE BETWEEN 6 PRECEDING AND CURRENT ROW) AS avg_7d
FROM (
SELECT device_id, uptime, UNIX_DATE(DATE(timestamp)) as day FROM `uptime_recordings`
)
GROUP BY device_id, uptime, day
I have recordings for 1000 distinct devices and 200k readings. The grouping does not work and the query returns 200k records instead of 1000. Any ideas whats wrong?
Instead of GROUP BY device_id, uptime, day do GROUP BY device_id, day.
A full working query:
WITH data AS (
  SELECT title device_id, views uptime, datehour timestamp
  FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
  WHERE DATE(datehour) BETWEEN '2019-01-01' AND '2019-01-09'
    AND wiki='br'
    AND title='Chile'
)
SELECT device_id, day
  , AVG(uptime) OVER (PARTITION BY device_id ORDER BY UNIX_DATE(day) RANGE BETWEEN 6 PRECEDING AND CURRENT ROW) AS avg_7d
FROM (
  SELECT device_id, AVG(uptime) uptime, DATE(timestamp) AS day
  FROM `data`
  GROUP BY device_id, day
)
Edit: As requested in the comments (though I'm not sure what the goal of averaging all of the 7-day averages is):
WITH data AS (
  SELECT title device_id, views uptime, datehour timestamp
  FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
  WHERE DATE(datehour) BETWEEN '2019-01-01' AND '2019-01-09'
    AND wiki='br'
    AND title IN ('Chile', 'Saozneg')
)
SELECT device_id, AVG(avg_7d) avg_avg_7d
FROM (
  SELECT device_id, day
    , AVG(uptime) OVER (PARTITION BY device_id ORDER BY UNIX_DATE(day) RANGE BETWEEN 6 PRECEDING AND CURRENT ROW) AS avg_7d
  FROM (
    SELECT device_id, AVG(uptime) uptime, DATE(timestamp) AS day
    FROM `data`
    GROUP BY device_id, day
  )
)
GROUP BY device_id
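For the 14- and 30-day averages the question also asks for, the same pattern just needs wider RANGE frames; a minimal sketch extending the inner query above (same assumed data CTE):
SELECT device_id, day
  , AVG(uptime) OVER (PARTITION BY device_id ORDER BY UNIX_DATE(day) RANGE BETWEEN 6 PRECEDING AND CURRENT ROW) AS avg_7d
  , AVG(uptime) OVER (PARTITION BY device_id ORDER BY UNIX_DATE(day) RANGE BETWEEN 13 PRECEDING AND CURRENT ROW) AS avg_14d
  , AVG(uptime) OVER (PARTITION BY device_id ORDER BY UNIX_DATE(day) RANGE BETWEEN 29 PRECEDING AND CURRENT ROW) AS avg_30d
FROM (
  SELECT device_id, AVG(uptime) uptime, DATE(timestamp) AS day
  FROM `data`
  GROUP BY device_id, day
)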