SQL Server: processing by group

I have a table with the following data:
Id Date Value
---------------------------
1 Dec-01-2019 10
1 Dec-03-2019 5
1 Dec-05-2019 8
1 Jan-03-2020 6
1 Jan-07-2020 3
1 Jan-08-2020 9
2 Dec-01-2019 4
2 Dec-03-2019 7
2 Dec-31-2019 9
2 Jan-04-2020 4
2 Jan-09-2020 6
I need to group it into the following format: one record per month per id. If the month is closed, the date should be the last day of that month; if not, the last date available. Max and average are calculated using all data up to that date.
Id Date Max_Value Average_Value
-----------------------------------------------
1 Dec-31-2019 10 7,6
1 Jan-08-2020 10 6,8
2 Dec-31-2019 9 6,6
2 Jan-09-2020 9 6,0
Any easy SQL to obtain this analysis?
Regards,

Hmmm . . . You want to aggregate by month and then just take the maximum date in the month:
select id, max(date), max(value), avg(value * 1.0)
from t
group by id, eomonth(date)
order by id, max(date);

If by closed month you mean that it's not the last month of the id then:
select id,
case
when year(Date) = year(maxDate) and month(Date) = month(maxDate) then maxDate
else eomonth(Date)
end Date,
max(maxValue) Max_Value,
round(avg(1.0 * Value), 1) Average_Value
from (
select *,
max(Date) over (partition by Id) maxDate,
max(Value) over (partition by Id) maxValue
from tablename
) t
group by id,
case
when year(Date) = year(maxDate) and month(Date) = month(maxDate) then maxDate
else eomonth(Date)
end
order by id, Date
See the demo.
Results:
> id | Date | Max_Value | Average_Value
> -: | :--------- | --------: | :------------
> 1 | 2019-12-31 | 10 | 7.7
> 1 | 2020-01-08 | 10 | 6.0
> 2 | 2019-12-31 | 9 | 6.7
> 2 | 2020-01-09 | 9 | 5.0
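Note that the Average_Value column above is a per-month average, whereas the expected output in the question appears to average every row up to the report date. As a cross-check of that reading, here is a minimal Python sketch (not SQL Server code; the function name monthly_running_stats is made up for illustration) that walks the sample rows and keeps a running max and running average per id:

```python
from calendar import monthrange
from datetime import date

# Sample rows from the question: (id, date, value).
ROWS = [
    (1, date(2019, 12, 1), 10), (1, date(2019, 12, 3), 5), (1, date(2019, 12, 5), 8),
    (1, date(2020, 1, 3), 6), (1, date(2020, 1, 7), 3), (1, date(2020, 1, 8), 9),
    (2, date(2019, 12, 1), 4), (2, date(2019, 12, 3), 7), (2, date(2019, 12, 31), 9),
    (2, date(2020, 1, 4), 4), (2, date(2020, 1, 9), 6),
]


def monthly_running_stats(rows):
    """Return one (id, report_date, running_max, running_avg) per id and month."""
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))
    result = []
    for i, (id_, day, value) in enumerate(ordered):
        if i == 0 or ordered[i - 1][0] != id_:
            count, total, highest = 0, 0, value  # new id: reset the running state
        count, total, highest = count + 1, total + value, max(highest, value)

        nxt = ordered[i + 1] if i + 1 < len(ordered) else None
        same_id_next = nxt is not None and nxt[0] == id_
        if same_id_next and (nxt[1].year, nxt[1].month) == (day.year, day.month):
            continue  # not the last row of this month yet
        if same_id_next:
            # A later month exists for this id, so this month is closed.
            report = date(day.year, day.month, monthrange(day.year, day.month)[1])
        else:
            report = day  # open month: use the last date available
        result.append((id_, report, highest, round(total / count, 2)))
    return result


for row in monthly_running_stats(ROWS):
    print(row)
```

On the sample data this gives averages of 7.67, 6.83, 6.67 and 6.0, which agree with the question's 7,6 / 6,8 / 6,6 / 6,0 if those were truncated to one decimal.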

Related

BigQuery SQL - Concatenate two columns if they are on consecutive days

I am looking for a way to adjust this sql query running in BigQuery to return single count total for Sent EventTypes that happen two or even three days in a row.
SELECT date(EventDate) as EventDate, EventType, count(*) as count FROM `Database.Table`
where date(EventDate) > DATE_SUB (CURRENT_DATE, INTERVAL 100 DAY)
Group by 1,2
ORDER by 1,2
Response from above Query:
| Row | EventDate | EventType | count |
| ------ | --------- |-----------|-------|
| 1 | 2019-02-06| Sent | 4 |
| 2 | 2019-02-07| Sent | 5 |
| 3 | 2019-02-12| NotSent | 7 |
| 4 | 2019-02-13| Bounces | 22 |
| 5 | 2019-02-14| Bounces | 22 |
| 6 | 2019-03-06| Sent | 2 |
| 7 | 2019-03-07| Sent | 4 |
| 8 | 2019-03-07| NotSent | 5 |
| 9 | 2019-03-12| Bounces | 7 |
| 10 | 2019-03-13| Sent | 22 |
| 11 | 2019-04-05| Sent | 2 |
Response I would like to get to:
| Row | EventDate | EventType | count |
| ------ | --------- |-----------|-------|
| 1 | 2019-02-06| Sent | 9 |
| 2 | 2019-02-12| NotSent | 7 |
| 3 | 2019-02-13| Bounces | 22 |
| 4 | 2019-02-14| Bounces | 22 |
| 5 | 2019-03-06| Sent | 6 |
| 6 | 2019-03-07| NotSent | 5 |
| 7 | 2019-03-12| Bounces | 7 |
| 8 | 2019-03-13| Sent | 22 |
| 9 | 2019-04-05| Sent | 2 |
Something along those lines, so that I can combine the counts for EventType 'Sent' on consecutive days and show the other EventTypes, such as Bounces and NotSent, without combining them.
I wrote a query that merges all consecutive 2 days in the table.
It gives the exact same output you want.
I think you meant '2019-03-06' in the 5th row, so I fixed it in my dummy data section.
WITH
data AS (
SELECT CAST('2019-02-06' as date) as EventDate, 4 as count union all
SELECT CAST('2019-02-07' as date) as EventDate, 5 as count union all
SELECT CAST('2019-02-12' as date) as EventDate, 7 as count union all
SELECT CAST('2019-02-13' as date) as EventDate, 22 as count union all
SELECT CAST('2019-03-06' as date) as EventDate, 2 as count
),
data_with_steps AS (
SELECT *,
IF(DATE_DIFF(EventDate, LAG(EventDate) OVER (ORDER BY EventDate), day) > 2, 1, 0) as new_step
FROM data
),
data_grouped AS (
SELECT *,
SUM(new_step) OVER (ORDER BY EventDate) as step_group
FROM data_with_steps
)
SELECT MIN(EventDate) as EventDate, sum(count) as count
FROM data_grouped
GROUP BY step_group
So, how does it work?
First, I calculate the date difference to previous day. If it's more than 2 days, I set value 1, otherwise 0 for the new column new_step.
Then, I calculate the cumulative sum of new_step column and name it as step_group.
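The output of the first two steps on the dummy data is:
EventDate count new_step step_group
---------------------------------------
2019-02-06 4 0 0
2019-02-07 5 0 0
2019-02-12 7 1 1
2019-02-13 22 0 1
2019-03-06 2 1 2
In the final step, I group the table by step_group, take the minimum date as the event date, and sum the counts to obtain the group count.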
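If it helps to see the same idea outside SQL, here is a rough Python sketch of the new_step / step_group logic. The function name merge_close_dates and the max_gap_days parameter are mine, not part of the query; max_gap_days=2 mirrors the "> 2" test above, and 1 would merge strictly consecutive days only.

```python
from datetime import date


def merge_close_dates(rows, max_gap_days=2):
    """rows: (event_date, count) pairs. Returns [(first_date, total_count)] per group."""
    groups = []
    previous = None
    for day, count in sorted(rows):
        # new_step is true when the gap to the previous date exceeds max_gap_days.
        new_step = previous is not None and (day - previous).days > max_gap_days
        if previous is None or new_step:
            groups.append([day, 0])  # a new step_group starts at this date
        groups[-1][1] += count
        previous = day
    return [tuple(group) for group in groups]


# Same dummy data as in the query above.
data = [
    (date(2019, 2, 6), 4),
    (date(2019, 2, 7), 5),
    (date(2019, 2, 12), 7),
    (date(2019, 2, 13), 22),
    (date(2019, 3, 6), 2),
]
print(merge_close_dates(data))
```

The first element of each tuple plays the role of MIN(EventDate) and the second of SUM(count).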
Edit:
To include the other event types without merging them, I added a new version.
I think the most intuitive and easiest way to do that is UNION ALL.
The updated query below merges only the 'Sent' events and appends the other events unchanged.
WITH
data AS (
SELECT CAST('2019-02-06' as date) as EventDate, 'Sent' as EventType, 4 as count union all
SELECT CAST('2019-02-07' as date) as EventDate, 'Sent' as EventType, 5 as count union all
SELECT CAST('2019-02-12' as date) as EventDate, 'Sent' as EventType, 7 as count union all
SELECT CAST('2019-02-13' as date) as EventDate, 'Sent' as EventType, 22 as count union all
SELECT CAST('2019-03-06' as date) as EventDate, 'Sent' as EventType, 2 as count union all
SELECT CAST('2019-02-12' as date) as EventDate, 'NotSent' as EventType, 7 as count union all
SELECT CAST('2019-03-07' as date) as EventDate, 'NotSent' as EventType, 5 as count union all
SELECT CAST('2019-02-13' as date) as EventDate, 'Bounces' as EventType, 22 as count union all
SELECT CAST('2019-02-14' as date) as EventDate, 'Bounces' as EventType, 22 as count union all
SELECT CAST('2019-03-12' as date) as EventDate, 'Bounces' as EventType, 7 as count
),
data_with_steps AS (
SELECT *,
IF(DATE_DIFF(EventDate, LAG(EventDate) OVER (ORDER BY EventDate), day) > 2, 1, 0) as new_step
FROM data
WHERE EventType = 'Sent'
),
data_grouped AS (
SELECT *,
SUM(new_step) OVER (ORDER BY EventDate) as step_group
FROM data_with_steps
)
SELECT EventType, MIN(EventDate) as EventDate, sum(count) as count
FROM data_grouped
GROUP BY EventType, step_group
UNION ALL
SELECT EventType, EventDate, count
FROM data
WHERE EventType != 'Sent'
This is a gaps-and-islands problem. The simplest method is to use row_number() and subtraction to identify the "islands". And then aggregate:
select eventType, min(eventDate) as eventDate, sum(count) as count
from (select t.*,
row_number() over (partition by eventType order by eventDate) as seqnum
from t
) t
group by eventType, date_sub(eventDate, interval seqnum day)
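To see why the subtraction works: within one event type, consecutive dates minus their row number all land on the same anchor date, so the anchor identifies the island. Below is a small Python sketch of that trick, using the rows from the question. The function name merge_islands and the merge_types parameter are hypothetical additions for illustration; merge_types limits the merging to chosen types, which the SQL above does not do.

```python
from datetime import date, timedelta


def merge_islands(rows, merge_types=None):
    """rows: (event_type, event_date, count). Returns sorted (start_date, event_type, total)."""
    seqnum = {}   # per-type counter, like row_number() partitioned by event type
    islands = {}  # (event_type, anchor_date) -> [start_date, total]
    for event_type, day, count in sorted(rows):
        seqnum[event_type] = seqnum.get(event_type, 0) + 1
        if merge_types is None or event_type in merge_types:
            # Consecutive days share the same (date - row number) anchor.
            anchor = day - timedelta(days=seqnum[event_type])
        else:
            anchor = day  # keep every date of this type as its own group
        entry = islands.setdefault((event_type, anchor), [day, 0])
        entry[1] += count
    return sorted((start, etype, total) for (etype, _), (start, total) in islands.items())


rows = [
    ("Sent", date(2019, 2, 6), 4), ("Sent", date(2019, 2, 7), 5),
    ("NotSent", date(2019, 2, 12), 7), ("Bounces", date(2019, 2, 13), 22),
    ("Bounces", date(2019, 2, 14), 22), ("Sent", date(2019, 3, 6), 2),
    ("Sent", date(2019, 3, 7), 4), ("NotSent", date(2019, 3, 7), 5),
    ("Bounces", date(2019, 3, 12), 7), ("Sent", date(2019, 3, 13), 22),
    ("Sent", date(2019, 4, 5), 2),
]
for island in merge_islands(rows, merge_types={"Sent"}):
    print(island)
```

With merge_types={"Sent"} the result matches the desired table in the question. With merge_types=None, which is closer to the query above, the two Bounces rows on 2019-02-13 and 2019-02-14 are merged as well.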

Start SUM aggregation at a certain threshold in bigquery

The energy usage of a device is logged hourly:
+--------------+-----------+-----------------------+
| energy_usage | device_id | timestamp |
+--------------+-----------+-----------------------+
| 10 | 1 | 2019-02-12T01:00:00 |
| 16 | 2 | 2019-02-12T01:00:00 |
| 26 | 1 | 2019-03-12T02:00:00 |
| 24 | 2 | 2019-03-12T02:00:00 |
+--------------+-----------+-----------------------+
My goal is:
Create two columns, one for energy_usage_day (8am-8pm) and another for energy_usage_night (8pm-8am)
Create a monthly aggregate, group by device_id and sum up the energy usage
So the result might look like this:
+--------------+------------------+--------------------+-----------+---------+------+
| energy_usage | energy_usage_day | energy_usage_night | device_id | month | year |
+--------------+------------------+--------------------+-----------+---------+------+
| 80 | 30 | 50 | 1 | 2 | 2019 |
| 130 | 60 | 70 | 2 | 3 | 2019 |
+--------------+------------------+--------------------+-----------+---------+------+
Following query produces such results:
SELECT SUM(energy_usage) energy_usage
, SUM(IF(EXTRACT(HOUR FROM timestamp) BETWEEN 8 AND 19, energy_usage, 0)) energy_usage_day
, SUM(IF(EXTRACT(HOUR FROM timestamp) NOT BETWEEN 8 AND 19, energy_usage, 0)) energy_usage_night
, device_id
, EXTRACT(MONTH FROM timestamp) month, EXTRACT(YEAR FROM timestamp) year
FROM `data`
GROUP BY device_id, month, year
Say I am only interested in energy usage aggregates above a certain threshold, e.g. 50. I want to start the SUM at a total energy usage of 50. The result should look like this:
+--------------+------------------+--------------------+-----------+---------+------+
| energy_usage | energy_usage_day | energy_usage_night | device_id | month | year |
+--------------+------------------+--------------------+-----------+---------+------+
| 30 | 10 | 20 | 1 | 2 | 2019 |
| 80 | 50 | 30 | 2 | 3 | 2019 |
+--------------+------------------+--------------------+-----------+---------+------+
In other words: the query should start summing up energy_usage, energy_usage_day and energy_usage_night only when energy_usage reaches the threshold of 50.
Is this possible in bigquery?
The query below is for BigQuery Standard SQL. It starts aggregating usage only after the running total exceeds 50 (per device per month).
#standardSQL
WITH temp AS (
SELECT *, SUM(energy_usage) OVER(win) > 50 qualified,
EXTRACT(HOUR FROM `timestamp`) BETWEEN 8 AND 19 day_hour,
EXTRACT(MONTH FROM `timestamp`) month,
EXTRACT(YEAR FROM `timestamp`) year
FROM `project.dataset.table`
WINDOW win AS (PARTITION BY device_id, TIMESTAMP_TRUNC(`timestamp`, MONTH) ORDER BY `timestamp`)
)
SELECT SUM(energy_usage) energy_usage,
SUM(IF(day_hour, energy_usage, 0)) energy_usage_day,
SUM(IF(NOT day_hour, energy_usage, 0)) energy_usage_night,
device_id,
month,
year
FROM temp
WHERE qualified
GROUP BY device_id, month, year
Say the current SUM of usage is 49 and the next usage entry has a value of 2. The SUM will be 51, so the whole usage of 2 is added to the SUM. Instead, only 1 (the part above the threshold) should have been added. Can we solve this problem in BigQuery SQL?
#standardSQL
WITH temp AS (
SELECT *, SUM(energy_usage) OVER(win) > 50 qualified,
SUM(energy_usage) OVER(win) - 50 rolling_sum,
EXTRACT(HOUR FROM `timestamp`) BETWEEN 8 AND 19 day_hour,
EXTRACT(MONTH FROM `timestamp`) month,
EXTRACT(YEAR FROM `timestamp`) year
FROM `project.dataset.table`
WINDOW win AS (PARTITION BY device_id, TIMESTAMP_TRUNC(`timestamp`, MONTH) ORDER BY `timestamp`)
), temp_with_adjustments AS (
SELECT *,
IF(
ROW_NUMBER() OVER(PARTITION BY device_id, month, year ORDER BY `timestamp`) = 1,
rolling_sum,
energy_usage
) AS adjusted_energy_usage
FROM temp
WHERE qualified
)
SELECT SUM(adjusted_energy_usage) energy_usage,
SUM(IF(day_hour, adjusted_energy_usage, 0)) energy_usage_day,
SUM(IF(NOT day_hour, adjusted_energy_usage, 0)) energy_usage_night,
device_id,
month,
year
FROM temp_with_adjustments
GROUP BY device_id, month, year
As you can see, I've just added logic for temp_with_adjustments (and rolling_sum in the temp to support this) - the rest is the same
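For anyone who wants to check the arithmetic by hand, here is a rough Python sketch of the same rule for a single device and month. It is only an illustration, not BigQuery code: the function name usage_above_threshold is made up, the readings are invented, usage is assumed to be non-negative, and day is taken as hours 8 through 19 as in the question's own query.

```python
def usage_above_threshold(readings, threshold=50):
    """readings: (hour, usage) pairs in time order for one device and one month."""
    running = 0
    day = night = 0
    for hour, usage in readings:
        before = running
        running += usage
        # Count only the part of this reading that lies above the threshold:
        # nothing while the running total is at or below it, the excess on the
        # row that crosses it, and the full usage afterwards.
        counted = max(0, running - max(before, threshold))
        if 8 <= hour <= 19:
            day += counted
        else:
            night += counted
    return {
        "energy_usage": day + night,
        "energy_usage_day": day,
        "energy_usage_night": night,
    }


# Invented readings: the running total goes 20, 49, 51, 61.
print(usage_above_threshold([(7, 20), (9, 29), (10, 2), (21, 10)]))
```

Here the reading of 2 at hour 10 takes the total from 49 to 51, so only 1 is counted (as day usage), and the later reading of 10 at hour 21 is counted in full (as night usage).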

Is it possible to do projection in Google Big Query?

I have a query (due to restrictions, it is using Legacy SQL) that produces a column that is the rolling average of last 3 days of sale (excluding today)
SELECT
id, date, sales, AVG(sales) OVER (PARTITION BY id ORDER BY date RANGE BETWEEN 4 PRECEDING AND 1 PRECEDING) AS projected_sale
FROM tableA
tableA
+-------+---------+---------+
| id | date | sales |
+-------+---------+---------+
| 1 | 01-01-17| 5 |
| 1 | 01-02-17| 6 |
| 1 | 01-03-17| 7 |
| 1 | 01-04-17| 10 |
+-------+---------+---------+
The query produces
+-------+---------+---------+--------------+
| id | date | sales |projected_sale|
+-------+---------+---------+--------------+
| 1 | 01-01-17| 5 | . |
| 1 | 01-02-17| 6 | . |
| 1 | 01-03-17| 7 | . |
| 1 | 01-04-17| 10 | 6 |
+-------+---------+---------+--------------+
Since the average excludes the current row, theoretically I can project the sale for 01-05-17 using the sales from 01-02 to 01-04. However, since tableA doesn't actually have an entry with date 01-05-17, my query stops at 01-04-17 as the last row.
Is what I am trying to do possible in Big Query?
Thank you
First, I think using RANGE is incorrect here; it should be ROWS instead.
Anyway, below is an example for BigQuery Legacy SQL that demonstrates how to achieve the result you need.
#legacySQL
SELECT
id, dt, sales,
AVG(sales) OVER (
PARTITION BY id ORDER BY dt
ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING
) AS projected_sale
FROM tableA, (SELECT 1 id, '01-05-17' dt, 0 sales)
As you can see, you simply add the missing day (a comma between tables means UNION ALL in Legacy SQL). Of course, you can extend this so that it adds such a missing row for all ids.
Nevertheless, I hope this is a good starting point for you.
You can test and play with it using dummy data as in your question:
#legacySQL
SELECT
id, dt, sales,
AVG(sales) OVER (
PARTITION BY id ORDER BY dt
ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING
) AS projected_sale
FROM (
SELECT * FROM
(SELECT 1 id, '01-01-17' dt, 5 sales),
(SELECT 1 id, '01-02-17' dt, 6 sales),
(SELECT 1 id, '01-03-17' dt, 7 sales),
(SELECT 1 id, '01-04-17' dt, 10 sales)
) tableA, (SELECT 1 id, '01-05-17' dt, 0 sales)
with result as
Row id dt sales projected_sale
1 1 01-01-17 5 null
2 1 01-02-17 6 5.0
3 1 01-03-17 7 5.5
4 1 01-04-17 10 6.0
5 1 01-05-17 0 7.0
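The same calculation can be checked outside BigQuery. This is a small Python sketch (the function name projected_sales is just for illustration) that appends a placeholder row for the projected day and averages up to four preceding values, like ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING:

```python
def projected_sales(sales, window=4):
    """Average of up to `window` preceding values per row, plus one projected row."""
    padded = list(sales) + [0]  # placeholder row for the day being projected
    projected = []
    for i in range(len(padded)):
        previous = padded[max(0, i - window):i]  # the current row is excluded
        projected.append(sum(previous) / len(previous) if previous else None)
    return projected


print(projected_sales([5, 6, 7, 10]))  # [None, 5.0, 5.5, 6.0, 7.0]
```

The placeholder value of 0 never enters an average, because each row only looks at the rows before it.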

Querying DAU/MAU over time (daily)

I have a daily sessions table with columns user_id and date. I'd like to graph out DAU/MAU (daily active users / monthly active users) on a daily basis. For example:
Date MAU DAU DAU/MAU
2014-06-01 20,000 5,000 20%
2014-06-02 21,000 4,000 19%
2014-06-03 20,050 3,050 17%
... ... ... ...
Calculating daily active users is straightforward but calculating the monthly active users e.g. the number of users that logged in today minus 30 days, is causing problems. How is this achieved without a left join for each day?
Edit: I'm using Postgres.
Assuming you have values for each day, you can get the total counts using a CTE and a window frame (rows between):
with dau as (
select date, count(user_id) as dau
from dailysessions ds
group by date
)
select date, dau,
sum(dau) over (order by date rows between 29 preceding and current row) as mau
from dau;
Unfortunately, I think you want distinct users rather than just user counts. That makes the problem much more difficult, especially because Postgres doesn't support count(distinct) as a window function.
I think you have to do some sort of self join for this. Here is one method:
with dau as (
select date, count(distinct user_id) as dau
from dailysessions ds
group by date
)
select d.date, d.dau,
(select count(distinct user_id)
from dailysessions ds
where ds.date between d.date - 29 * interval '1 day' and d.date
) as mau
from dau d;
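To make the distinct-user point concrete, here is a minimal Python sketch of the same rolling 30-day definition (the day itself plus the 29 days before it). The function name dau_mau and the session rows are invented for illustration; it is a brute-force check, not something tuned for large tables.

```python
from datetime import date, timedelta


def dau_mau(sessions, window_days=30):
    """sessions: (date, user_id) pairs. Returns [(date, dau, mau)] for days with activity."""
    users_by_day = {}
    for day, user_id in sessions:
        users_by_day.setdefault(day, set()).add(user_id)

    result = []
    for day in sorted(users_by_day):
        start = day - timedelta(days=window_days - 1)
        window_users = set()
        for other_day, users in users_by_day.items():
            if start <= other_day <= day:
                window_users |= users  # set union keeps each user once
        result.append((day, len(users_by_day[day]), len(window_users)))
    return result


# Invented sample: user 1 and 2 on May 1, user 1 on May 15, user 3 on May 31, user 2 on June 10.
sessions = [
    (date(2014, 5, 1), 1), (date(2014, 5, 1), 2),
    (date(2014, 5, 15), 1), (date(2014, 5, 31), 3),
    (date(2014, 6, 10), 2),
]
for row in dau_mau(sessions):
    print(row)
```

Summing the daily counts over the window would count user 1 twice on May 15, whereas the set union counts each user once, which is why the correlated subquery with count(distinct) is needed.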
This one uses COUNT DISTINCT to get the rolling 30 days DAU/MAU:
(calculating reddit's user engagement in BigQuery - but the SQL is standard enough to be used on other databases)
SELECT day, dau, mau, INTEGER(100*dau/mau) daumau
FROM (
SELECT day, EXACT_COUNT_DISTINCT(author) dau, FIRST(mau) mau
FROM (
SELECT DATE(SEC_TO_TIMESTAMP(created_utc)) day, author
FROM [fh-bigquery:reddit_comments.2015_09]
WHERE subreddit='AskReddit') a
JOIN (
SELECT stopday, EXACT_COUNT_DISTINCT(author) mau
FROM (SELECT created_utc, subreddit, author FROM [fh-bigquery:reddit_comments.2015_09], [fh-bigquery:reddit_comments.2015_08]) a
CROSS JOIN (
SELECT DATE(SEC_TO_TIMESTAMP(created_utc)) stopday
FROM [fh-bigquery:reddit_comments.2015_09]
GROUP BY 1
) b
WHERE subreddit='AskReddit'
AND SEC_TO_TIMESTAMP(created_utc) BETWEEN DATE_ADD(stopday, -30, 'day') AND TIMESTAMP(stopday)
GROUP BY 1
) b
ON a.day=b.stopday
GROUP BY 1
)
ORDER BY 1
I went further at How to calculate DAU/MAU with BigQuery (engagement)
I've written about this on my blog.
The DAU is easy, as you noticed. You can solve the MAU by first creating a view with boolean values for when a user activates and de-activates, like so:
CREATE OR REPLACE VIEW "vw_login" AS
SELECT *
, LEAST (LEAD("date") OVER w, "date" + 30) AS "activeExpiry"
, CASE WHEN LAG("date") OVER w IS NULL THEN true ELSE false END AS "activated"
, CASE
WHEN LEAD("date") OVER w IS NULL THEN true
WHEN LEAD("date") OVER w - "date" > 30 THEN true
ELSE false
END AS "churned"
, CASE
WHEN LAG("date") OVER w IS NULL THEN false
WHEN "date" - LAG("date") OVER w <= 30 THEN false
WHEN row_number() OVER w > 1 THEN true
ELSE false
END AS "resurrected"
FROM "login"
WINDOW w AS (PARTITION BY "user_id" ORDER BY "date")
This creates boolean values per user per day when they become active, when they churn and when they re-activate.
Then do a daily aggregate of the same:
CREATE OR REPLACE VIEW "vw_activity" AS
SELECT
SUM("activated"::int) "activated"
, SUM("churned"::int) "churned"
, SUM("resurrected"::int) "resurrected"
, "date"
FROM "vw_login"
GROUP BY "date"
;
And finally calculate running totals of active MAUs by calculating the cumulative sums over the columns. You need to join the vw_activity twice, since the second one is joined to the day when the user becomes inactive (i.e. 30 days since their last login).
I've included a date series in order to ensure that all days are present in your dataset. You can do without it too, but you might skip days in your dataset.
SELECT
d."date"
, SUM(COALESCE(a.activated::int,0)
- COALESCE(a2.churned::int,0)
+ COALESCE(a.resurrected::int,0)) OVER w AS "mau"
, a."activated", a2."churned", a."resurrected" FROM
generate_series('2010-01-01'::date, CURRENT_DATE, '1 day'::interval) AS d("date")
LEFT OUTER JOIN vw_activity a ON d."date" = a."date"
LEFT OUTER JOIN vw_activity a2 ON d."date" = (a2."date" + INTERVAL '30 days')::date
WINDOW w AS (ORDER BY d."date") ORDER BY d."date";
You can of course do this in a single query, but this helps understand the structure better.
You didn't show us your complete table definition, but maybe something like this:
select date,
count(*) over (partition by date_trunc('day', date) order by date) as dau,
count(*) over (partition by date_trunc('month', date) order by date) as mau
from sessions
order by date;
To get the percentage without repeating the window functions, just wrap this in a derived table:
select date,
dau,
mau,
dau::numeric / (case when mau = 0 then null else mau end) as pct
from (
select date,
count(*) over (partition by date_trunc('day', date) order by date) as dau,
count(*) over (partition by date_trunc('month', date) order by date) as mau
from sessions
) t
order by date;
Here is an example output:
postgres=> select * from sessions;
session_date | user_id
--------------+---------
2014-05-01 | 1
2014-05-01 | 2
2014-05-01 | 3
2014-05-02 | 1
2014-05-02 | 2
2014-05-02 | 3
2014-05-02 | 4
2014-05-02 | 5
2014-06-01 | 1
2014-06-01 | 2
2014-06-01 | 3
2014-06-02 | 1
2014-06-02 | 2
2014-06-02 | 3
2014-06-02 | 4
2014-06-03 | 1
2014-06-03 | 2
2014-06-03 | 3
2014-06-03 | 4
2014-06-03 | 5
(20 rows)
postgres=> select session_date,
postgres-> dau,
postgres-> mau,
postgres-> round(dau::numeric / (case when mau = 0 then null else mau end),2) as pct
postgres-> from (
postgres(> select session_date,
postgres(> count(*) over (partition by date_trunc('day', session_date) order by session_date) as dau,
postgres(> count(*) over (partition by date_trunc('month', session_date) order by session_date) as mau
postgres(> from sessions
postgres(> ) t
postgres-> order by session_date;
session_date | dau | mau | pct
--------------+-----+-----+------
2014-05-01 | 3 | 3 | 1.00
2014-05-01 | 3 | 3 | 1.00
2014-05-01 | 3 | 3 | 1.00
2014-05-02 | 5 | 8 | 0.63
2014-05-02 | 5 | 8 | 0.63
2014-05-02 | 5 | 8 | 0.63
2014-05-02 | 5 | 8 | 0.63
2014-05-02 | 5 | 8 | 0.63
2014-06-01 | 3 | 3 | 1.00
2014-06-01 | 3 | 3 | 1.00
2014-06-01 | 3 | 3 | 1.00
2014-06-02 | 4 | 7 | 0.57
2014-06-02 | 4 | 7 | 0.57
2014-06-02 | 4 | 7 | 0.57
2014-06-02 | 4 | 7 | 0.57
2014-06-03 | 5 | 12 | 0.42
2014-06-03 | 5 | 12 | 0.42
2014-06-03 | 5 | 12 | 0.42
2014-06-03 | 5 | 12 | 0.42
2014-06-03 | 5 | 12 | 0.42
(20 rows)
postgres=>

How to calculate the value of a previous row from the count of another column

I want to create an additional column which calculates the value of a row from count column with its predecessor row from the sum column. Below is the query. I tried using ROLLUP but it does not serve the purpose.
select to_char(register_date,'YYYY-MM') as "registered_in_month"
,count(*) as Total_count
from CMSS.USERS_PROFILE a
where a.pcms_db != '*'
group by (to_char(register_date,'YYYY-MM'))
order by to_char(register_date,'YYYY-MM')
This is what I get
registered_in_month TOTAL_COUNT
-------------------------------------
2005-01 1
2005-02 3
2005-04 8
2005-06 4
But what I would like to display is below, including the months which have count as 0
registered_in_month TOTAL_COUNT SUM
------------------------------------------
2005-01 1 1
2005-02 3 4
2005-03 0 4
2005-04 8 12
2005-05 0 12
2005-06 4 16
To include missing months in your result, you first need a complete list of months. To do that, find the earliest and latest months and then use a hierarchical query to generate the complete list.
SQL Fiddle
with x(min_date, max_date) as (
select min(trunc(register_date,'month')),
max(trunc(register_date,'month'))
from users_profile
)
select add_months(min_date,level-1)
from x
connect by add_months(min_date,level-1) <= max_date;
Once you have all the months, you can outer join the list to your table. To get the cumulative sum, simply add up the counts using SUM as an analytic function.
with x(min_date, max_date) as (
select min(trunc(register_date,'month')),
max(trunc(register_date,'month'))
from users_profile
),
y(all_months) as (
select add_months(min_date,level-1)
from x
connect by add_months(min_date,level-1) <= max_date
)
select to_char(a.all_months,'yyyy-mm') registered_in_month,
count(b.register_date) total_count,
sum(count(b.register_date)) over (order by a.all_months) "sum"
from y a left outer join users_profile b
on a.all_months = trunc(b.register_date,'month')
group by a.all_months
order by a.all_months;
Output:
| REGISTERED_IN_MONTH | TOTAL_COUNT | SUM |
|---------------------|-------------|-----|
| 2005-01 | 1 | 1 |
| 2005-02 | 3 | 4 |
| 2005-03 | 0 | 4 |
| 2005-04 | 8 | 12 |
| 2005-05 | 0 | 12 |
| 2005-06 | 4 | 16 |
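As a quick sanity check of the two steps (fill the missing months, then keep a running total), here is a short Python sketch using the counts from the question. The function name fill_months_with_running_total is made up for illustration, and it assumes the keys are 'YYYY-MM' strings, which sort chronologically.

```python
def fill_months_with_running_total(counts):
    """counts: {'YYYY-MM': n}. Returns [(month, count, running_sum)] with gaps filled."""
    if not counts:
        return []
    first, last = min(counts), max(counts)  # 'YYYY-MM' strings sort chronologically
    year, month = map(int, first.split("-"))
    rows, running = [], 0
    while True:
        key = f"{year:04d}-{month:02d}"
        count = counts.get(key, 0)  # months with no registrations count as 0
        running += count
        rows.append((key, count, running))
        if key == last:
            break
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return rows


monthly = {"2005-01": 1, "2005-02": 3, "2005-04": 8, "2005-06": 4}
for row in fill_months_with_running_total(monthly):
    print(row)
```

The generated month list corresponds to the CONNECT BY step, counts.get(key, 0) to the outer join, and the running variable to SUM(...) OVER (ORDER BY ...).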