Percentage per month in BigQuery - sql

I am working in BigQuery and need the percentage of each result within each month. The query below calculates the percentage with respect to the overall total instead. I have tried adding a PARTITION BY to the OVER clause, but it does not work.
SELECT CAST(TIMESTAMP_TRUNC(CAST((created_at) AS TIMESTAMP), MONTH) AS DATE) AS `month`,
result,
count(*) * 100.0 / sum(count(1)) over() as percentage
FROM table_name
GROUP BY 1,2
ORDER BY 1
month | result | percentage
--------|--------|-----------
2021-01 | 0001 | 50
2021-01 | 0000 | 50
2021-02 | 00001 | 33.33
2021-02 | 0000 | 33.33
2021-02 | 0002 | 33.33

Using the data that you shared as:
WITH data as(
SELECT "2021-01-01" as created_at,"0001" as result UNION ALL
SELECT "2021-01-01","0000" UNION ALL
SELECT "2021-02-01","00001"UNION ALL
SELECT "2021-02-01","0000"UNION ALL
SELECT "2021-02-01","0002"
)
I added a second CTE that computes the month field once. That field is then used both in PARTITION BY and, together with result, in GROUP BY.
, d AS (SELECT CAST(TIMESTAMP_TRUNC(CAST(created_at AS TIMESTAMP), MONTH) AS DATE) AS month,
result, created_at
FROM data
)
SELECT d.month,
d.result,
count(*) * 100.0 / sum(count(1)) over(partition by month) as percentage
FROM d
GROUP BY 1, 2
ORDER BY 1
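With the sample data above, the query returns 50 for each of the two results in 2021-01 and about 33.33 for each of the three results in 2021-02.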
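To double-check the arithmetic outside BigQuery, here is a minimal Python sketch of the same idea: count each (month, result) pair, then divide by the total of that month only. The function name `percentage_per_month` and the `sample` list are made up for illustration; the rows mirror the CTE data above.

```python
from collections import Counter

def percentage_per_month(rows):
    """rows: iterable of (month, result) pairs. Returns {(month, result): pct}."""
    rows = list(rows)
    # Equivalent of COUNT(*) ... GROUP BY month, result
    pair_counts = Counter(rows)
    # Equivalent of SUM(COUNT(*)) OVER (PARTITION BY month)
    month_totals = Counter(month for month, _ in rows)
    return {
        (month, result): 100.0 * n / month_totals[month]
        for (month, result), n in pair_counts.items()
    }

# Hypothetical sample, mirroring the rows of the CTE above
sample = [
    ("2021-01", "0001"), ("2021-01", "0000"),
    ("2021-02", "00001"), ("2021-02", "0000"), ("2021-02", "0002"),
]

for key, pct in sorted(percentage_per_month(sample).items()):
    print(key, round(pct, 2))
```

Dropping the per-month denominator (dividing by `len(rows)` instead) reproduces the behaviour of the empty `over()` in the question.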

This example was written on db<>fiddle for SQL Server, but according to the documentation, BigQuery also supports aggregate window functions of the form COUNT( ~ ) OVER ( PARTITION BY ~ ).
create table table_name(month char(7), result int)
insert into table_name values
('2021-01',50),
('2021-01',30),
('2021-01',20),
('2021-02',70),
('2021-02',80);
select
month,
result,
sum(result) over (partition by month) month_total,
100 * result / sum(result) over (partition by month) per_cent -- integer division; use 100.0 to keep decimals
from table_name
order by month, result;
month | result | month_total | per_cent
:------ | -----: | ----------: | -------:
2021-01 | 20 | 100 | 20
2021-01 | 30 | 100 | 30
2021-01 | 50 | 100 | 50
2021-02 | 70 | 150 | 46
2021-02 | 80 | 150 | 53

Related

Start SUM aggregation at a certain threshold in bigquery

The energy usage of a device is logged hourly:
+--------------+-----------+-----------------------+
| energy_usage | device_id | timestamp |
+--------------+-----------+-----------------------+
| 10 | 1 | 2019-02-12T01:00:00 |
| 16 | 2 | 2019-02-12T01:00:00 |
| 26 | 1 | 2019-03-12T02:00:00 |
| 24 | 2 | 2019-03-12T02:00:00 |
+--------------+-----------+-----------------------+
My goal is:
1. Create two columns, one for energy_usage_day (8am-8pm) and another for energy_usage_night (8pm-8am)
2. Create a monthly aggregate, group by device_id and sum up the energy usage
So the result might look like this:
+--------------+------------------+--------------------+-----------+---------+------+
| energy_usage | energy_usage_day | energy_usage_night | device_id | month | year |
+--------------+------------------+--------------------+-----------+---------+------+
| 80 | 30 | 50 | 1 | 2 | 2019 |
| 130 | 60 | 70 | 2 | 3 | 2019 |
+--------------+------------------+--------------------+-----------+---------+------+
The following query produces such results:
SELECT SUM(energy_usage) energy_usage
, SUM(IF(EXTRACT(HOUR FROM timestamp) BETWEEN 8 AND 19, energy_usage, 0)) energy_usage_day
, SUM(IF(EXTRACT(HOUR FROM timestamp) NOT BETWEEN 8 AND 19, energy_usage, 0)) energy_usage_night
, device_id
, EXTRACT(MONTH FROM timestamp) month, EXTRACT(YEAR FROM timestamp) year
FROM `data`
GROUP BY device_id, month, year
Say I am only interested in energy usage aggregates above a certain threshold, e.g. 50. I want to start the SUM at a total energy usage of 50. The result should look like this:
+--------------+------------------+--------------------+-----------+---------+------+
| energy_usage | energy_usage_day | energy_usage_night | device_id | month | year |
+--------------+------------------+--------------------+-----------+---------+------+
| 30 | 10 | 20 | 1 | 2 | 2019 |
| 80 | 50 | 30 | 2 | 3 | 2019 |
+--------------+------------------+--------------------+-----------+---------+------+
In other words: the query should start summing up energy_usage, energy_usage_day and energy_usage_night only when energy_usage reaches the threshold of 50.
Is this possible in bigquery?
The query below is for BigQuery Standard SQL. The logic is that it starts aggregating usage only after the running total exceeds 50 (per device, per month):
#standardSQL
WITH temp AS (
SELECT *, SUM(energy_usage) OVER(win) > 50 qualified,
EXTRACT(HOUR FROM `timestamp`) BETWEEN 8 AND 19 day_hour, -- 8am-8pm, as in the question's query
EXTRACT(MONTH FROM `timestamp`) month,
EXTRACT(YEAR FROM `timestamp`) year
FROM `project.dataset.table`
WINDOW win AS (PARTITION BY device_id, TIMESTAMP_TRUNC(`timestamp`, MONTH) ORDER BY `timestamp`)
)
SELECT SUM(energy_usage) energy_usage,
SUM(IF(day_hour, energy_usage, 0)) energy_usage_day,
SUM(IF(NOT day_hour, energy_usage, 0)) energy_usage_night,
device_id,
month,
year
FROM temp
WHERE qualified
GROUP BY device_id, month, year
Say the current SUM of usage is 49 and the next usage entry has a value of 2, so the SUM becomes 51. As a result, the full usage of 2 is added to the SUM, whereas only 1 (the part above the threshold) should have been added. Can we solve this problem in BigQuery SQL?
#standardSQL
WITH temp AS (
SELECT *, SUM(energy_usage) OVER(win) > 50 qualified,
SUM(energy_usage) OVER(win) - 50 rolling_sum,
EXTRACT(HOUR FROM `timestamp`) BETWEEN 8 AND 19 day_hour, -- 8am-8pm, as in the question's query
EXTRACT(MONTH FROM `timestamp`) month,
EXTRACT(YEAR FROM `timestamp`) year
FROM `project.dataset.table`
WINDOW win AS (PARTITION BY device_id, TIMESTAMP_TRUNC(`timestamp`, MONTH) ORDER BY `timestamp`)
), temp_with_adjustments AS (
SELECT *,
IF(
ROW_NUMBER() OVER(PARTITION BY device_id, month, year ORDER BY `timestamp`) = 1,
rolling_sum,
energy_usage
) AS adjusted_energy_usage
FROM temp
WHERE qualified
)
SELECT SUM(adjusted_energy_usage) energy_usage,
SUM(IF(day_hour, adjusted_energy_usage, 0)) energy_usage_day,
SUM(IF(NOT day_hour, adjusted_energy_usage, 0)) energy_usage_night,
device_id,
month,
year
FROM temp_with_adjustments
GROUP BY device_id, month, year
As you can see, I've only added the temp_with_adjustments step (and rolling_sum in temp to support it); the rest is the same.
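The adjustment can also be read as a simple rule: for each reading, count only the part of it that lies above the threshold. Below is a rough Python sketch of that rule for a single device and month. The function name `usage_above_threshold` and the `(hour, usage)` input shape are assumptions made for this illustration, and readings are assumed to be non-negative and already sorted by time.

```python
def usage_above_threshold(readings, threshold=50):
    """readings: list of (hour, energy_usage) for ONE device and month,
    in time order. Returns (total, day, night), counting only the usage
    that lies beyond the threshold."""
    running = 0
    total = day = night = 0
    for hour, usage in readings:
        before = running
        running += usage
        # Part of this reading above the threshold:
        # 0 while below it, a partial amount on the crossing row,
        # and the full reading afterwards.
        counted = max(0, running - max(before, threshold))
        total += counted
        if 8 <= hour <= 19:          # day: 8am-8pm
            day += counted
        else:                        # night: 8pm-8am
            night += counted
    return total, day, night

# Running total goes 49 -> 51 -> 61: only 1 of the second reading counts
print(usage_above_threshold([(9, 49), (21, 2), (10, 10)]))
```

This is not how the SQL is structured (it uses a running SUM plus ROW_NUMBER), but it should give the same numbers under the stated assumptions.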

Postgres - How to use the AVG of the last number of rows and multiply it with another column?

I have the following table:
date | ratio | revenue
---------|-------|-----------
03-30-18 | 1.2 | 918264
03-31-18 | 0.94 | 981247
04-01-18 | 1.1 | 957353
04-02-18 | 0.99 | 926274
04-03-18 | 1.05 |
04-04-18 | 0.97 |
04-05-18 | 1.23 |
As you can see, 04-03-18 and beyond haven't happened yet so there is no revenue input for those days. But I have a ratio for those future days. I want to use the AVG revenue of the last 4 days that I do have and multiply it by the ratio to make future revenue predictions.
As a result, I would like to get the following table:
date | ratio | revenue
---------|-------|-----------
03-30-18 | 1.2 | 918264
03-31-18 | 0.94 | 981247
04-01-18 | 1.1 | 957353
04-02-18 | 0.99 | 926274
04-03-18 | 1.05 | 993073.73
04-04-18 | 0.97 | 917410.97
04-05-18 | 1.23 | 1163314.94
I don't see a need for window functions, so I would phrase this as:
select t.date, t.ratio,
coalesce(t.revenue, a.avg4 * ratio) as revenue
from t cross join
(select avg(revenue) as avg4
from (select t.*
from t
where t.revenue is not null
order by date desc
limit 4
) t
) a
order by date;
You should calculate the average in an initial query and use the value for rows with nulls in revenue:
with the_avg as (
select avg
from (
select
date,
revenue,
avg(revenue) over (order by date rows between 4 preceding and current row)
from my_table
) s
where revenue is null
order by date
limit 1
)
select
date,
ratio,
case when revenue is not null then revenue
else round(avg * ratio, 2) end as revenue
from my_table
cross join the_avg
order by date;
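As a sanity check on the expected numbers, here is a small Python sketch of the same calculation: average the last 4 known revenues, then multiply by the ratio wherever revenue is missing. The function name `fill_forecast` is hypothetical, dates are written as ISO strings for simplicity, and `Decimal` with half-up rounding is used so the two-decimal results do not depend on binary floating point.

```python
from decimal import Decimal, ROUND_HALF_UP

def fill_forecast(rows, window=4):
    """rows: list of (date, ratio, revenue_or_None), already sorted by date.

    Missing revenue is replaced by ratio * average of the last `window`
    known revenues, rounded to 2 decimals.
    """
    known = [revenue for _, _, revenue in rows if revenue is not None]
    recent = known[-window:]
    if not recent:
        raise ValueError("no known revenue to average")
    avg = sum(recent) / len(recent)
    cent = Decimal("0.01")
    filled = []
    for day, ratio, revenue in rows:
        if revenue is None:
            revenue = (avg * ratio).quantize(cent, rounding=ROUND_HALF_UP)
        filled.append((day, ratio, revenue))
    return filled

# The table from the question
table = [
    ("2018-03-30", Decimal("1.2"), Decimal("918264")),
    ("2018-03-31", Decimal("0.94"), Decimal("981247")),
    ("2018-04-01", Decimal("1.1"), Decimal("957353")),
    ("2018-04-02", Decimal("0.99"), Decimal("926274")),
    ("2018-04-03", Decimal("1.05"), None),
    ("2018-04-04", Decimal("0.97"), None),
    ("2018-04-05", Decimal("1.23"), None),
]

for row in fill_forecast(table):
    print(row)
```

With the question's data the average is 945784.5, which gives the three predicted values shown in the desired table.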

SQL Server : Group by two columns and Sum a third column with the bifurcation of two groups

I am creating an invoice system where I need the sum of total_amount grouped by two filters, i.e. month and category_of_service.
So far I am able to use the GROUP BY clause with both filters, but my SUM is calculated over the whole table, not per group.
I have looked at various questions but have been unable to find a solution, for example:
MySQL: Group by two columns and sum
month | category_of_service | total_amount
------|---------------------|-------------
12 | EB | 1000
12 | EB | 1200
12 | DG | 1500
12 | DG | 2000
What I am able to do is
month | category_of_service | total_amount
------|---------------------|-------------
12 | EB | 5700
12 | DG | 5700
What I actually want is
month | category_of_service | total_amount
------|---------------------|-------------
12 | EB | 2200
12 | DG | 3500
Note: There are multiple months and category_of_services
The query I'm using is:
SELECT
month, category_of_service, SUM(total_amount) AS TotalAmount
FROM
dbo.report
GROUP BY
month, category_of_service
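;WITH Cte([month], category_of_service, total_amount)
AS
(
SELECT 12, 'EB', 1000 UNION ALL
SELECT 12, 'EB', 1200 UNION ALL
SELECT 12, 'DG', 1500 UNION ALL
SELECT 12, 'DG', 2000
)
, Result
AS
(
-- Number the rows inside each (month, category) group so that one row per group can be kept.
-- Partitioning by the group columns (not by total_amount) keeps groups with equal totals apart.
SELECT *, ROW_NUMBER() OVER (PARTITION BY [month], category_of_service ORDER BY [month]) AS Seq
FROM
(
-- Windowed SUM: every row carries the total of its (month, category) group
SELECT [month], category_of_service,
SUM(total_amount) OVER (PARTITION BY [month], category_of_service) AS total_amount
FROM Cte
) dt
)
SELECT [month], category_of_service, total_amount FROM Result WHERE Seq = 1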
Output:
month | category_of_service | total_amount
------|---------------------|-------------
12 | EB | 2200
12 | DG | 3500
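For comparison, a plain grouped sum keyed on both columns gives the same per-group totals, which is also what GROUP BY month, category_of_service is expected to return. A minimal Python sketch of that grouping, with a hypothetical function name and the sample rows from the question:

```python
from collections import defaultdict

def sum_by_month_and_category(rows):
    """rows: iterable of (month, category_of_service, total_amount)."""
    totals = defaultdict(int)
    for month, category, amount in rows:
        # The pair (month, category) is the grouping key
        totals[(month, category)] += amount
    return dict(totals)

report = [
    (12, "EB", 1000),
    (12, "EB", 1200),
    (12, "DG", 1500),
    (12, "DG", 2000),
]

print(sum_by_month_and_category(report))
```

If a grouped query still shows the grand total (5700) on every row, the sum is probably being computed somewhere other than in that GROUP BY.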

SQL order by two column, omit if second column doesn't meet the order

Let's say we have the following data:
id | date | price
------------------------
1 | 10-09-2016 | 200
2 | 11-09-2016 | 190
3 | 12-09-2016 | 210
4 | 13-09-2016 | 220
5 | 14-09-2016 | 200
6 | 15-09-2016 | 200
7 | 16-09-2016 | 230
8 | 17-09-2016 | 240
We have to order by date first and price second, but the prices must also appear in increasing order. If the current price is less than the previous one, the row should be omitted, so the result will be:
id | date | price
------------------------
1 | 10-09-2016 | 200
3 | 12-09-2016 | 210
4 | 13-09-2016 | 220
7 | 16-09-2016 | 230
8 | 17-09-2016 | 240
Is it possible without join?
Use the LAG window function:
SELECT *
FROM (SELECT *,
Lag(price)OVER( ORDER BY date) AS prev_price
FROM Yourtable) a
WHERE price > prev_price
OR prev_price IS NULL -- to get the first record
If "previous" is supposed to mean the previous row in the output, then keep track of a running maximum. Postgres solution with a window function in a subquery:
SELECT id, date, price
FROM (
SELECT *, price >= max(price) OVER (ORDER BY date, price) AS ok
FROM tbl
) sub
WHERE ok;
If Postgres:
select id, date, price
from
(select
t.*,
price - lag(price) over (order by id) diff
from
your_table) t
where diff > 0 or diff is null; -- diff is null only for the first row, which is kept
If MySQL:
select id, date, price from
(
select t.*,
price - @lastprice diff,
@lastprice := price
from
(select *
from your_table
order by id) t
cross join (select @lastprice := 0) t2
) t where t.diff > 0;
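The answers above follow two different readings of "previous": the immediately preceding input row (the LAG and variable versions) or the last row that was kept (the running maximum version). A small Python sketch, with hypothetical function names, shows that both agree on the question's data but can differ on other inputs:

```python
def keep_vs_previous_row(rows):
    """LAG-style: keep a row if its price is greater than the price of the
    immediately preceding input row. The first row is always kept."""
    kept = []
    prev = None
    for id_, price in rows:
        if prev is None or price > prev:
            kept.append(id_)
        prev = price
    return kept

def keep_vs_running_max(rows):
    """Running-max style: keep a row if its price is at least as high as
    every earlier price (mirrors price >= max(price) OVER (...))."""
    kept = []
    best = None
    for id_, price in rows:
        if best is None or price >= best:
            kept.append(id_)
            best = price
    return kept

# (id, price) pairs from the question, already ordered by date
data = [(1, 200), (2, 190), (3, 210), (4, 220),
        (5, 200), (6, 200), (7, 230), (8, 240)]

print(keep_vs_previous_row(data))
print(keep_vs_running_max(data))
# A case where the two readings differ: 195 beats 190 but not 200
print(keep_vs_previous_row([(1, 200), (2, 190), (3, 195)]))
print(keep_vs_running_max([(1, 200), (2, 190), (3, 195)]))
```

Which one is right depends on whether an omitted row should still count as the "previous" price for the next comparison.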

Querying DAU/MAU over time (daily)

I have a daily sessions table with columns user_id and date. I'd like to graph out DAU/MAU (daily active users / monthly active users) on a daily basis. For example:
Date MAU DAU DAU/MAU
2014-06-01 20,000 5,000 20%
2014-06-02 21,000 4,000 19%
2014-06-03 20,050 3,050 17%
... ... ... ...
Calculating daily active users is straightforward, but calculating monthly active users, i.e. the number of users who logged in within the last 30 days up to and including today, is causing problems. How is this achieved without a left join for each day?
Edit: I'm using Postgres.
Assuming you have values for each day, you can get the total counts using a CTE and a ROWS BETWEEN window frame:
with dau as (
select date, count(user_id) as dau
from dailysessions ds
group by date
)
select date, dau,
sum(dau) over (order by date rows between 29 preceding and current row) as mau
from dau;
Unfortunately, I think you want distinct users rather than just user counts. That makes the problem much more difficult, especially because Postgres doesn't support count(distinct) as a window function.
I think you have to do some sort of self join for this. Here is one method:
with dau as (
select date, count(distinct user_id) as dau
from dailysessions ds
group by date
)
select dau.date, dau.dau,
(select count(distinct ds.user_id)
from dailysessions ds
where ds.date between dau.date - 29 * interval '1 day' and dau.date
) as mau
from dau;
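To make the distinct-user requirement concrete, here is a small Python sketch of a rolling 30-day MAU: for each day it unions the user sets of that day and the 29 days before it. The function name `dau_mau` and the sample sessions are invented for this illustration, and the nested loop is deliberately naive (quadratic in the number of days).

```python
from datetime import date, timedelta

def dau_mau(sessions, window_days=30):
    """sessions: iterable of (day, user_id).
    Returns {day: (dau, mau, dau / mau)} with distinct users in both counts."""
    users_by_day = {}
    for day, user_id in sessions:
        users_by_day.setdefault(day, set()).add(user_id)

    out = {}
    for day in sorted(users_by_day):
        start = day - timedelta(days=window_days - 1)
        monthly = set()
        # Union of all users seen in [day - 29 days, day]
        for other_day, users in users_by_day.items():
            if start <= other_day <= day:
                monthly |= users
        dau = len(users_by_day[day])
        mau = len(monthly)
        out[day] = (dau, mau, dau / mau)
    return out

# Hypothetical sessions, not taken from the question
sample_sessions = [
    (date(2014, 6, 1), 1), (date(2014, 6, 1), 2),
    (date(2014, 6, 2), 1), (date(2014, 6, 2), 3),
    (date(2014, 7, 15), 1),
]

for day, stats in dau_mau(sample_sessions).items():
    print(day, stats)
```

Summing daily counts instead of taking the union would count user 1 twice in the window ending 2014-06-02, which is exactly the problem with the windowed SUM version.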
This one uses COUNT DISTINCT to get the rolling 30 days DAU/MAU:
(calculating reddit's user engagement in BigQuery - but the SQL is standard enough to be used on other databases)
SELECT day, dau, mau, INTEGER(100*dau/mau) daumau
FROM (
SELECT day, EXACT_COUNT_DISTINCT(author) dau, FIRST(mau) mau
FROM (
SELECT DATE(SEC_TO_TIMESTAMP(created_utc)) day, author
FROM [fh-bigquery:reddit_comments.2015_09]
WHERE subreddit='AskReddit') a
JOIN (
SELECT stopday, EXACT_COUNT_DISTINCT(author) mau
FROM (SELECT created_utc, subreddit, author FROM [fh-bigquery:reddit_comments.2015_09], [fh-bigquery:reddit_comments.2015_08]) a
CROSS JOIN (
SELECT DATE(SEC_TO_TIMESTAMP(created_utc)) stopday
FROM [fh-bigquery:reddit_comments.2015_09]
GROUP BY 1
) b
WHERE subreddit='AskReddit'
AND SEC_TO_TIMESTAMP(created_utc) BETWEEN DATE_ADD(stopday, -30, 'day') AND TIMESTAMP(stopday)
GROUP BY 1
) b
ON a.day=b.stopday
GROUP BY 1
)
ORDER BY 1
I went further at How to calculate DAU/MAU with BigQuery (engagement)
I've written about this on my blog.
The DAU is easy, as you noticed. You can solve the MAU by first creating a view with boolean values for when a user activates and de-activates, like so:
CREATE OR REPLACE VIEW "vw_login" AS
SELECT *
, LEAST (LEAD("date") OVER w, "date" + 30) AS "activeExpiry"
, CASE WHEN LAG("date") OVER w IS NULL THEN true ELSE false END AS "activated"
, CASE
WHEN LEAD("date") OVER w IS NULL THEN true
WHEN LEAD("date") OVER w - "date" > 30 THEN true
ELSE false
END AS "churned"
, CASE
WHEN LAG("date") OVER w IS NULL THEN false
WHEN "date" - LAG("date") OVER w <= 30 THEN false
WHEN row_number() OVER w > 1 THEN true
ELSE false
END AS "resurrected"
FROM "login"
WINDOW w AS (PARTITION BY "user_id" ORDER BY "date")
This creates boolean values per user per day when they become active, when they churn and when they re-activate.
Then do a daily aggregate of the same:
CREATE OR REPLACE VIEW "vw_activity" AS
SELECT
SUM("activated"::int) "activated"
, SUM("churned"::int) "churned"
, SUM("resurrected"::int) "resurrected"
, "date"
FROM "vw_login"
GROUP BY "date"
;
And finally calculate running totals of active MAUs by calculating the cumulative sums over the columns. You need to join the vw_activity twice, since the second one is joined to the day when the user becomes inactive (i.e. 30 days since their last login).
I've included a date series in order to ensure that all days are present in your dataset. You can do without it too, but you might skip days in your dataset.
SELECT
d."date"
, SUM(COALESCE(a.activated::int,0)
- COALESCE(a2.churned::int,0)
+ COALESCE(a.resurrected::int,0)) OVER w AS "mau"
, a."activated", a2."churned", a."resurrected" FROM
generate_series('2010-01-01'::date, CURRENT_DATE, '1 day'::interval) AS d("date")
LEFT OUTER JOIN vw_activity a ON d."date" = a."date"
LEFT OUTER JOIN vw_activity a2 ON d."date" = (a2."date" + INTERVAL '30 days')::date
WINDOW w AS (ORDER BY d."date") ORDER BY d."date";
You can of course do this in a single query, but this helps understand the structure better.
You didn't show us your complete table definition, but maybe something like this:
select date,
count(*) over (partition by date_trunc('day', date) order by date) as dau,
count(*) over (partition by date_trunc('month', date) order by date) as mau
from sessions
order by date;
To get the percentage without repeating the window functions, just wrap this in a derived table:
select date,
dau,
mau,
dau::numeric / (case when mau = 0 then null else mau end) as pct
from (
select date,
count(*) over (partition by date_trunc('day', date) order by date) as dau,
count(*) over (partition by date_trunc('month', date) order by date) as mau
from sessions
) t
order by date;
Here is an example output:
postgres=> select * from sessions;
session_date | user_id
--------------+---------
2014-05-01 | 1
2014-05-01 | 2
2014-05-01 | 3
2014-05-02 | 1
2014-05-02 | 2
2014-05-02 | 3
2014-05-02 | 4
2014-05-02 | 5
2014-06-01 | 1
2014-06-01 | 2
2014-06-01 | 3
2014-06-02 | 1
2014-06-02 | 2
2014-06-02 | 3
2014-06-02 | 4
2014-06-03 | 1
2014-06-03 | 2
2014-06-03 | 3
2014-06-03 | 4
2014-06-03 | 5
(20 rows)
postgres=> select session_date,
postgres-> dau,
postgres-> mau,
postgres-> round(dau::numeric / (case when mau = 0 then null else mau end),2) as pct
postgres-> from (
postgres(> select session_date,
postgres(> count(*) over (partition by date_trunc('day', session_date) order by session_date) as dau,
postgres(> count(*) over (partition by date_trunc('month', session_date) order by session_date) as mau
postgres(> from sessions
postgres(> ) t
postgres-> order by session_date;
session_date | dau | mau | pct
--------------+-----+-----+------
2014-05-01 | 3 | 3 | 1.00
2014-05-01 | 3 | 3 | 1.00
2014-05-01 | 3 | 3 | 1.00
2014-05-02 | 5 | 8 | 0.63
2014-05-02 | 5 | 8 | 0.63
2014-05-02 | 5 | 8 | 0.63
2014-05-02 | 5 | 8 | 0.63
2014-05-02 | 5 | 8 | 0.63
2014-06-01 | 3 | 3 | 1.00
2014-06-01 | 3 | 3 | 1.00
2014-06-01 | 3 | 3 | 1.00
2014-06-02 | 4 | 7 | 0.57
2014-06-02 | 4 | 7 | 0.57
2014-06-02 | 4 | 7 | 0.57
2014-06-02 | 4 | 7 | 0.57
2014-06-03 | 5 | 12 | 0.42
2014-06-03 | 5 | 12 | 0.42
2014-06-03 | 5 | 12 | 0.42
2014-06-03 | 5 | 12 | 0.42
2014-06-03 | 5 | 12 | 0.42
(20 rows)
postgres=>