BigQuery SQL - Concatenate two columns if they are on consecutive days

I am looking for a way to adjust this SQL query, running in BigQuery, so that it returns a single combined count for 'Sent' EventTypes that occur two or even three days in a row.
SELECT DATE(EventDate) AS EventDate, EventType, COUNT(*) AS count
FROM `Database.Table`
WHERE DATE(EventDate) > DATE_SUB(CURRENT_DATE, INTERVAL 100 DAY)
GROUP BY 1, 2
ORDER BY 1, 2
Response from the above query:
| Row | EventDate | EventType | count |
| ------ | --------- |-----------|-------|
| 1 | 2019-02-06| Sent | 4 |
| 2 | 2019-02-07| Sent | 5 |
| 3 | 2019-02-12| NotSent | 7 |
| 4 | 2019-02-13| Bounces | 22 |
| 5 | 2019-02-14| Bounces | 22 |
| 6 | 2019-03-06| Sent | 2 |
| 7 | 2019-03-07| Sent | 4 |
| 8 | 2019-03-07| NotSent | 5 |
| 9 | 2019-03-12| Bounces | 7 |
| 10 | 2019-03-13| Sent | 22 |
| 11 | 2019-04-05| Sent | 2 |
Response I would like to get to:
| Row | EventDate | EventType | count |
| ------ | --------- |-----------|-------|
| 1 | 2019-02-06| Sent | 9 |
| 2 | 2019-02-12| NotSent | 7 |
| 3 | 2019-02-13| Bounces | 22 |
| 4 | 2019-02-14| Bounces | 22 |
| 5 | 2019-03-06| Sent | 6 |
| 6 | 2019-03-07| NotSent | 5 |
| 7 | 2019-03-12| Bounces | 7 |
| 8 | 2019-03-13| Sent | 22 |
| 9 | 2019-04-05| Sent | 2 |
Something along those lines, so that I can combine the counts for the 'Sent' EventType on consecutive days into one row, while showing other EventTypes, such as Bounces and NotSent, without combining them.

I wrote a query that merges the counts of consecutive days in the table.
It gives exactly the output you want.
I think you meant '2019-03-06' in the 5th row, so I fixed it in my dummy-data section.
WITH
data AS (
SELECT CAST('2019-02-06' as date) as EventDate, 4 as count union all
SELECT CAST('2019-02-07' as date) as EventDate, 5 as count union all
SELECT CAST('2019-02-12' as date) as EventDate, 7 as count union all
SELECT CAST('2019-02-13' as date) as EventDate, 22 as count union all
SELECT CAST('2019-03-06' as date) as EventDate, 2 as count
),
data_with_steps AS (
SELECT *,
IF(DATE_DIFF(EventDate, LAG(EventDate) OVER (ORDER BY EventDate), day) > 2, 1, 0) as new_step
FROM data
),
data_grouped AS (
SELECT *,
SUM(new_step) OVER (ORDER BY EventDate) as step_group
FROM data_with_steps
)
SELECT MIN(EventDate) as EventDate, sum(count) as count
FROM data_grouped
GROUP BY step_group
So, how does it work?
First, I calculate the date difference from the previous row's date. If it's more than 2 days, I set the new column new_step to 1 (the start of a new group), otherwise 0.
Then I calculate the cumulative sum of the new_step column and name it step_group.
The output of the first two steps is:
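| EventDate  | count | new_step | step_group |
| ---------- | ----- | -------- | ---------- |
| 2019-02-06 | 4     | 0        | 0          |
| 2019-02-07 | 5     | 0        | 0          |
| 2019-02-12 | 7     | 1        | 1          |
| 2019-02-13 | 22    | 0        | 1          |
| 2019-03-06 | 2     | 1        | 2          |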
At the final step, I group the table by step_group, take the minimum date as the event date, and sum the counts to get the group total:
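| EventDate  | count |
| ---------- | ----- |
| 2019-02-06 | 9     |
| 2019-02-12 | 29    |
| 2019-03-06 | 2     |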
Edit:
To include the other event types without grouping them, I added a new version.
I think the most intuitive and easiest way is to handle them with UNION ALL.
The updated query below groups only 'Sent' events and appends the other event types as they are.
WITH
data AS (
SELECT CAST('2019-02-06' as date) as EventDate, 'Sent' as EventType, 4 as count union all
SELECT CAST('2019-02-07' as date) as EventDate, 'Sent' as EventType, 5 as count union all
SELECT CAST('2019-03-06' as date) as EventDate, 'Sent' as EventType, 2 as count union all
SELECT CAST('2019-03-07' as date) as EventDate, 'Sent' as EventType, 4 as count union all
SELECT CAST('2019-03-13' as date) as EventDate, 'Sent' as EventType, 22 as count union all
SELECT CAST('2019-04-05' as date) as EventDate, 'Sent' as EventType, 2 as count union all
SELECT CAST('2019-02-12' as date) as EventDate, 'NotSent' as EventType, 7 as count union all
SELECT CAST('2019-03-07' as date) as EventDate, 'NotSent' as EventType, 5 as count union all
SELECT CAST('2019-02-13' as date) as EventDate, 'Bounces' as EventType, 22 as count union all
SELECT CAST('2019-02-14' as date) as EventDate, 'Bounces' as EventType, 22 as count union all
SELECT CAST('2019-03-12' as date) as EventDate, 'Bounces' as EventType, 7 as count
),
data_with_steps AS (
SELECT *,
IF(DATE_DIFF(EventDate, LAG(EventDate) OVER (ORDER BY EventDate), day) > 2, 1, 0) as new_step
FROM data
WHERE EventType = 'Sent'
),
data_grouped AS (
SELECT *,
SUM(new_step) OVER (ORDER BY EventDate) as step_group
FROM data_with_steps
)
SELECT EventType, MIN(EventDate) as EventDate, sum(count) as count
FROM data_grouped
GROUP BY EventType, step_group
UNION ALL
SELECT EventType, EventDate, count
FROM data
WHERE EventType != 'Sent'
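For the dummy data above (which mirrors the question's rows), this returns the desired result; without an ORDER BY the row order may vary:

| EventType | EventDate  | count |
| --------- | ---------- | ----- |
| Sent      | 2019-02-06 | 9     |
| Sent      | 2019-03-06 | 6     |
| Sent      | 2019-03-13 | 22    |
| Sent      | 2019-04-05 | 2     |
| NotSent   | 2019-02-12 | 7     |
| NotSent   | 2019-03-07 | 5     |
| Bounces   | 2019-02-13 | 22    |
| Bounces   | 2019-02-14 | 22    |
| Bounces   | 2019-03-12 | 7     |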

This is a gaps-and-islands problem. The simplest method is to use row_number() and subtraction to identify the "islands". And then aggregate:
select eventType, min(eventDate) as eventDate, sum(count) as count
from (select t.*,
             row_number() over (partition by eventType order by eventDate) as seqnum
      from t
     ) t
group by eventType, date_sub(eventDate, interval seqnum day)
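For instance, here is a runnable Standard SQL sketch of this approach over just the question's 'Sent' rows (the dummy table t below is an assumption built from the question's data). Note that applied to all rows it would also merge the consecutive Bounces on 2019-02-13/14 into one row, so to keep other EventTypes ungrouped, filter to 'Sent' and UNION ALL the rest, as in the answer above.
#standardSQL
WITH t AS (
  SELECT DATE '2019-02-06' AS eventDate, 'Sent' AS eventType, 4 AS count UNION ALL
  SELECT DATE '2019-02-07', 'Sent', 5 UNION ALL
  SELECT DATE '2019-03-06', 'Sent', 2 UNION ALL
  SELECT DATE '2019-03-07', 'Sent', 4 UNION ALL
  SELECT DATE '2019-03-13', 'Sent', 22 UNION ALL
  SELECT DATE '2019-04-05', 'Sent', 2
)
SELECT eventType, MIN(eventDate) AS eventDate, SUM(count) AS count
FROM (
  SELECT t.*,
         ROW_NUMBER() OVER (PARTITION BY eventType ORDER BY eventDate) AS seqnum
  FROM t
)
GROUP BY eventType, DATE_SUB(eventDate, INTERVAL seqnum DAY)
ORDER BY eventDate
-- returns the four Sent rows of the desired output: 9, 6, 22 and 2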

Related

SQL Server: processing by group

I have a table with the following data:
Id Date Value
---------------------------
1 Dec-01-2019 10
1 Dec-03-2019 5
1 Dec-05-2019 8
1 Jan-03-2020 6
1 Jan-07-2020 3
1 Jan-08-2020 9
2 Dec-01-2019 4
2 Dec-03-2019 7
2 Dec-31-2019 9
2 Jan-04-2020 4
2 Jan-09-2020 6
I need to group it into the following format: one record per month per id. If the month is closed, the date should be the last day of that month; if not, the last day available. Max and average are calculated over all data up to that date.
Id Date Max_Value Average_Value
-----------------------------------------------
1 Dec-31-2019 10 7,6
1 Jan-08-2020 10 6,8
2 Dec-31-2019 9 6,6
2 Jan-09-2020 9 6,0
Any easy SQL to obtain this analysis?
Regards,
Hmmm . . . You want to aggregate by month and then just take the maximum date in the month:
select id, max(date), max(value), avg(value * 1.0)
from t
group by id, eomonth(date)
order by id, max(date);
If by closed month you mean that it's not the last month of the id then:
select id,
case
when year(Date) = year(maxDate) and month(Date) = month(maxDate) then maxDate
else eomonth(Date)
end Date,
max(maxValue) Max_Value,
round(avg(1.0 * Value), 1) Average_Value
from (
select *,
max(Date) over (partition by Id) maxDate,
max(Value) over (partition by Id) maxValue
from tablename
) t
group by id,
case
when year(Date) = year(maxDate) and month(Date) = month(maxDate) then maxDate
else eomonth(Date)
end
order by id, Date
See the demo.
Results:
| id | Date       | Max_Value | Average_Value |
| -- | ---------- | --------- | ------------- |
| 1  | 2019-12-31 | 10        | 7.7           |
| 1  | 2020-01-08 | 10        | 6.0           |
| 2  | 2019-12-31 | 9         | 6.7           |
| 2  | 2020-01-09 | 9         | 5.0           |
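If the expected Average_Value is meant to be cumulative over all data up to that date (the question's 6,8 for Jan-08-2020 is the average of all six values for id 1, not just January's), then a sketch with cumulative window aggregates could look like this (T-SQL, SQL Server 2012+, same tablename as above; decimals differ slightly from the question's by rounding):
select id,
       case
           when eomonth(Date) = eomonth(maxDate) and Date = maxDate then Date
           else eomonth(Date)
       end Date,
       maxValue Max_Value,
       round(avgValue, 1) Average_Value
from (
    select *,
           max(Date) over (partition by Id) maxDate,
           -- running max / average over all rows up to the current date
           max(Value) over (partition by Id order by Date) maxValue,
           avg(1.0 * Value) over (partition by Id order by Date) avgValue,
           -- rn = 1 marks the last available row of each month
           row_number() over (partition by Id, eomonth(Date) order by Date desc) rn
    from tablename
) t
where rn = 1
order by id, Date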

Start SUM aggregation at a certain threshold in bigquery

The energy usage of a device is logged hourly:
+--------------+-----------+-----------------------+
| energy_usage | device_id | timestamp |
+--------------+-----------+-----------------------+
| 10 | 1 | 2019-02-12T01:00:00 |
| 16 | 2 | 2019-02-12T01:00:00 |
| 26 | 1 | 2019-03-12T02:00:00 |
| 24 | 2 | 2019-03-12T02:00:00 |
+--------------+-----------+-----------------------+
My goal is:
Create two columns, one for energy_usage_day (8am-8pm) and another for energy_usage_night (8pm-8am)
Create a monthly aggregate, group by device_id and sum up the energy usage
So the result might look like this:
+--------------+------------------+--------------------+-----------+---------+------+
| energy_usage | energy_usage_day | energy_usage_night | device_id | month | year |
+--------------+------------------+--------------------+-----------+---------+------+
| 80 | 30 | 50 | 1 | 2 | 2019 |
| 130 | 60 | 70 | 2 | 3 | 2019 |
+--------------+------------------+--------------------+-----------+---------+------+
The following query produces such results:
SELECT SUM(energy_usage) energy_usage
, SUM(IF(EXTRACT(HOUR FROM timestamp) BETWEEN 8 AND 19, energy_usage, 0)) energy_usage_day
, SUM(IF(EXTRACT(HOUR FROM timestamp) NOT BETWEEN 8 AND 19, energy_usage, 0)) energy_usage_night
, device_id
, EXTRACT(MONTH FROM timestamp) month, EXTRACT(YEAR FROM timestamp) year
FROM `data`
GROUP BY device_id, month, year
Say I am only interested in energy usage aggregates above a certain threshold, e.g. 50. I want to start the SUM at a total energy usage of 50. The result should look like this:
+--------------+------------------+--------------------+-----------+---------+------+
| energy_usage | energy_usage_day | energy_usage_night | device_id | month | year |
+--------------+------------------+--------------------+-----------+---------+------+
| 30 | 10 | 20 | 1 | 2 | 2019 |
| 80 | 50 | 30 | 2 | 3 | 2019 |
+--------------+------------------+--------------------+-----------+---------+------+
In other words: the query should start summing up energy_usage, energy_usage_day and energy_usage_night only when energy_usage reaches the threshold of 50.
Is this possible in BigQuery?
Below is for BigQuery Standard SQL; the logic is that it starts aggregating usage ONLY after the running total reaches 50 (per device, per month).
#standardSQL
WITH temp AS (
SELECT *, SUM(energy_usage) OVER(win) > 50 qualified,
EXTRACT(HOUR FROM `timestamp`) BETWEEN 8 AND 20 day_hour,
EXTRACT(MONTH FROM `timestamp`) month,
EXTRACT(YEAR FROM `timestamp`) year
FROM `project.dataset.table`
WINDOW win AS (PARTITION BY device_id, TIMESTAMP_TRUNC(`timestamp`, MONTH) ORDER BY `timestamp`)
)
SELECT SUM(energy_usage) energy_usage,
SUM(IF(day_hour, energy_usage, 0)) energy_usage_day,
SUM(IF(NOT day_hour, energy_usage, 0)) energy_usage_night,
device_id,
month,
year
FROM temp
WHERE qualified
GROUP BY device_id, month, year
Say the current SUM of usage is 49 and the next usage entry has a value of 2. The SUM will be 51, so the full usage of 2 is added. Instead, only the 1 that exceeds the 50 threshold should've been added. Can we solve such a problem in BigQuery SQL?
#standardSQL
WITH temp AS (
SELECT *, SUM(energy_usage) OVER(win) > 50 qualified,
SUM(energy_usage) OVER(win) - 50 rolling_sum,
EXTRACT(HOUR FROM `timestamp`) BETWEEN 8 AND 20 day_hour,
EXTRACT(MONTH FROM `timestamp`) month,
EXTRACT(YEAR FROM `timestamp`) year
FROM `project.dataset.table`
WINDOW win AS (PARTITION BY device_id, TIMESTAMP_TRUNC(`timestamp`, MONTH) ORDER BY `timestamp`)
), temp_with_adjustments AS (
SELECT *,
IF(
ROW_NUMBER() OVER(PARTITION BY device_id, month, year ORDER BY `timestamp`) = 1,
rolling_sum,
energy_usage
) AS adjusted_energy_usage
FROM temp
WHERE qualified
)
SELECT SUM(adjusted_energy_usage) energy_usage,
SUM(IF(day_hour, adjusted_energy_usage, 0)) energy_usage_day,
SUM(IF(NOT day_hour, adjusted_energy_usage, 0)) energy_usage_night,
device_id,
month,
year
FROM temp_with_adjustments
GROUP BY device_id, month, year
As you can see, I've just added the logic for temp_with_adjustments (and rolling_sum in temp to support it) - the rest is the same.
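For example, if a device's rows within a month are 49 and then 2, the running total reaches 51 on the second row; that row is therefore the first qualified row, and its adjusted_energy_usage is rolling_sum = 51 - 50 = 1 instead of 2:

| energy_usage | running SUM | qualified | rolling_sum | adjusted_energy_usage |
| ------------ | ----------- | --------- | ----------- | --------------------- |
| 49           | 49          | false     | -1          | (filtered out)        |
| 2            | 51          | true      | 1           | 1                     |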

Query to find records that were created one after another in BigQuery

I am playing around with BigQuery. The following input is given:
+---------------+---------+---------+--------+----------------------+
| customer | agent | value | city | timestamp |
+---------------+---------+---------+--------+----------------------+
| 1 | 1 | 106 | LA | 2019-02-12 03:05pm |
| 1 | 1 | 251 | LA | 2019-02-12 03:06pm |
| 3 | 2 | 309 | NY | 2019-02-12 06:41pm |
| 1 | 1 | 654 | LA | 2019-02-12 05:12pm |
+---------------+---------+---------+--------+----------------------+
I want to find transactions that were issued one after another (say within 5 minutes) by one and the same agent. So the output for the above table should look like:
+---------------+---------+---------+--------+----------------------+
| customer | agent | value | city | timestamp |
+---------------+---------+---------+--------+----------------------+
| 1 | 1 | 106 | LA | 2019-02-12 03:05pm |
| 1 | 1 | 251 | LA | 2019-02-12 03:06pm |
+---------------+---------+---------+--------+----------------------+
The query should somehow group by agent and find such transactions. However the result is not really grouped as you can see from the output. My first thought was using the LEAD function, but I am not sure. Do you have any ideas?
Ideas for a query:
sort by agent and timestamp DESC
start with the first row, look at the following row (using LEAD?)
check if timestamp difference is less than 5 minutes
if so, this two rows should be in the output
continue with next (2nd) row
When the 2nd and 3rd row match the criteria, too, the 2nd row will get into the output, which would cause duplicate rows. I am not sure how to avoid that, yet.
There must be an easier way but does this achieve what you are after?
WITH CTE AS (
-- assumed source: replace with your table; timestamp should already be a TIMESTAMP
SELECT customer, agent, value, city, timestamp
FROM `project.dataset.yourtable`
),
CTE2 AS (
SELECT customer, agent, value, city, timestamp,
lead(timestamp,1) OVER (PARTITION BY agent ORDER BY timestamp) timestamp_lead,
lead(customer,1) OVER (PARTITION BY agent ORDER BY timestamp) customer_lead,
lead(value,1) OVER (PARTITION BY agent ORDER BY timestamp) value_lead,
lead(city,1) OVER (PARTITION BY agent ORDER BY timestamp) city_lead,
lag(timestamp,1) OVER (PARTITION BY agent ORDER BY timestamp) timestamp_lag
FROM CTE
)
SELECT agent,
if(timestamp_diff(timestamp_lead,timestamp,MINUTE)<5, concat(cast(customer as string),', ',cast(customer_lead as string)),cast(customer as string)) customer,
if(timestamp_diff(timestamp_lead,timestamp,MINUTE)<5, concat(cast(value as string),', ',cast(value_lead as string)),cast(value as string)) value,
if(timestamp_diff(timestamp_lead,timestamp,MINUTE)<5, concat(cast(city as string),', ',cast(city_lead as string)),cast(city as string)) cities,
if(timestamp_diff(timestamp_lead,timestamp,MINUTE)<5, concat(cast(timestamp as string),', ',cast(timestamp_lead as string)),cast(timestamp as string)) timestamps
FROM CTE2
WHERE (timestamp_diff(timestamp_lead,timestamp,MINUTE)<5 OR NOT timestamp_diff(timestamp,timestamp_lag,MINUTE)<5)
Below is for BigQuery Standard SQL
#standardSQL
SELECT * FROM (
SELECT *,
IF(TIMESTAMP_DIFF(LEAD(ts) OVER(PARTITION BY agent ORDER BY ts), ts, MINUTE) < 5,
LEAD(STRUCT(customer AS next_customer, value AS next_value)) OVER(PARTITION BY agent ORDER BY ts),
NULL).*
FROM `project.dataset.yourtable`
)
WHERE NOT next_customer IS NULL
You can test and play with the above using the sample data from your question, as in the example below:
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 customer, 1 agent, 106 value,'LA' city, '2019-02-12 03:05pm' ts UNION ALL
SELECT 1, 1, 251,'LA', '2019-02-12 03:06pm' UNION ALL
SELECT 3, 2, 309,'NY', '2019-02-12 06:41pm' UNION ALL
SELECT 1, 1, 654,'LA', '2019-02-12 05:12pm'
), temp AS (
SELECT customer, agent, value, city, PARSE_TIMESTAMP('%Y-%m-%d %I:%M%p', ts) ts
FROM `project.dataset.table`
)
SELECT * FROM (
SELECT *,
IF(TIMESTAMP_DIFF(LEAD(ts) OVER(PARTITION BY agent ORDER BY ts), ts, MINUTE) < 5,
LEAD(STRUCT(customer AS next_customer, value AS next_value)) OVER(PARTITION BY agent ORDER BY ts),
NULL).*
FROM temp
)
WHERE NOT next_customer IS NULL
-- ORDER BY ts
with result
Row customer agent value city ts next_customer next_value
1 1 1 106 LA 2019-02-12 15:05:00 UTC 1 251

Is it possible to do projection in Google Big Query?

I have a query (due to restrictions, it is using Legacy SQL) that produces a column that is the rolling average of the last 3 days of sales (excluding today):
SELECT
id, date, sales, AVG(sales) OVER (PARTITION BY id ORDER BY date RANGE BETWEEN 4 PRECEDING AND 1 PRECEDING) AS projected_sale
FROM tableA
tableA
+-------+---------+---------+
| id | date | sales |
+-------+---------+---------+
| 1 | 01-01-17| 5 |
| 1 | 01-02-17| 6 |
| 1 | 01-03-17| 7 |
| 1 | 01-04-17| 10 |
+-------+---------+---------+
The query produces
+-------+---------+---------+--------------+
| id | date | sales |projected_sale|
+-------+---------+---------+--------------+
| 1 | 01-01-17| 5 | . |
| 1 | 01-02-17| 6 | . |
| 1 | 01-03-17| 7 | . |
| 1 | 01-04-17| 10 | 6 |
+-------+---------+---------+--------------+
Since the average is excluding the current row, theoretically I can project the sale for 01-05-17 using the sales from (01-02 to 01-04). However since tableA doesn't actually have a entry with date 01-05-17, my query stops at 01-04-17 as the last row.
Is what I am trying to do possible in Big Query?
Thank you
First, I think using RANGE is incorrect here - it should be ROWS instead
Anyway, below is an example for BigQuery Legacy SQL that demonstrates how to achieve the result you need.
#legacySQL
SELECT
id, dt, sales,
AVG(sales) OVER (
PARTITION BY id ORDER BY dt
ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING
) AS projected_sale
FROM tableA, (SELECT 1 id, '01-05-17' dt, 0 sales)
As you can see, you simply add that missing day via UNION ALL (the comma in Legacy SQL). Of course, you can transform this so that it adds such a missing row for all ids, as in the sketch below.
Nevertheless, I hope this is a good starting point for you.
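For example, a quick sketch of that transformation (assuming tableA as above): the inner GROUP BY produces one synthetic '01-05-17' row per id, and the comma again acts as UNION ALL.
#legacySQL
SELECT
  id, dt, sales,
  AVG(sales) OVER (
    PARTITION BY id ORDER BY dt
    ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING
  ) AS projected_sale
FROM
  tableA,
  (SELECT id, '01-05-17' AS dt, 0 AS sales
   FROM (SELECT id FROM tableA GROUP BY id))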
You can test / play with it using dummy data as in your question
#legacySQL
SELECT
id, dt, sales,
AVG(sales) OVER (
PARTITION BY id ORDER BY dt
ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING
) AS projected_sale
FROM (
SELECT * FROM
(SELECT 1 id, '01-01-17' dt, 5 sales),
(SELECT 1 id, '01-02-17' dt, 6 sales),
(SELECT 1 id, '01-03-17' dt, 7 sales),
(SELECT 1 id, '01-04-17' dt, 10 sales)
) tableA, (SELECT 1 id, '01-05-17' dt, 0 sales)
with result as
Row id dt sales projected_sale
1 1 01-01-17 5 null
2 1 01-02-17 6 5.0
3 1 01-03-17 7 5.5
4 1 01-04-17 10 6.0
5 1 01-05-17 0 7.0

SQL order by two columns, omit if second column doesn't meet the order

Let's say we have the following data:
id | date | price
------------------------
1 | 10-09-2016 | 200
2 | 11-09-2016 | 190
3 | 12-09-2016 | 210
4 | 13-09-2016 | 220
5 | 14-09-2016 | 200
6 | 15-09-2016 | 200
7 | 16-09-2016 | 230
8 | 17-09-2016 | 240
and we have to order by date first and price second; however, the price must be in order. If the current price is less than the previous one, we should omit that row, and the result will be:
id | date | price
------------------------
1 | 10-09-2016 | 200
3 | 12-09-2016 | 210
4 | 13-09-2016 | 220
7 | 16-09-2016 | 230
8 | 17-09-2016 | 240
Is it possible without a join?
Use the LAG window function:
SELECT *
FROM (SELECT *,
             LAG(price) OVER (ORDER BY date) AS prev_price
      FROM Yourtable) a
WHERE price > prev_price
   OR prev_price IS NULL -- to get the first record
If "previous" is supposed to mean the previous row in the output, then keep track of a running maximum. Postgres solution with a window function in a subquery:
SELECT id, date, price
FROM (
SELECT *, price >= max(price) OVER (ORDER BY date, price) AS ok
FROM tbl
) sub
WHERE ok;
If Postgres:
select id, date, price
from
(select
*,
price - lag(price, 1, 0) over (order by id) diff -- default 0 keeps the first row
from
your_table) t
where diff > 0;
If MySQL:
select id, date, price from
(
select t.*,
price - #lastprice diff,
#lastprice := price
from
(select *
from your_table
order by id) t
cross join (select #lastprice := 0) t2
) t where t.diff > 0;