SQL join on timestamp differences from the same table

I'm not sure how to write this SQL query in BigQuery. I have a table of events with names and timestamps. Let's say I have only two events in the table: A and B. What I want to do is query the table to get all instances of event A, and get the next closest occurrence of B and create a new column with the time difference. B will always happen after A.
For example if I had a table that looks like:
A1 | 1:00 pm
B5 | 2:00 pm
A3 | 3:00 pm
B9 | 5:00 pm
My resultant table would be:
A1 | 1 hour
A3 | 2 hours
The query I came up with is the following:
SELECT
CAST(TIMESTAMP_DIFF((SELECT MIN(sub.time)
FROM table sub
WHERE sub.time > main.time), main.time, SECOND) AS INT64) duration
FROM table main
This works fine for getting the table I wanted above, but I would also like to include an additional column from the subquery. Something that looks like:
A1 | 1 hour | B5Column
A3 | 2 hours | B9Column
I tried the query below:
SELECT
(SELECT
sub.SubQueryColumn
FROM table sub
WHERE sub.time > main.time
ORDER BY sub.time asc
LIMIT 1) SubColumn,
CAST(TIMESTAMP_DIFF((SELECT MIN(sub.time)
FROM table sub
WHERE sub.time > main.time), main.time, SECOND) AS INT64) duration
FROM table main
but it did not work. The error I get is
Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN.
Could I get some help with this?

Here is one method:
select m.*,
timestamp_diff(next_b_time, time, second) as duration
from (select m.*,
min(case when event like 'B%' then time end) over (order by time desc) as next_b_time
from main m
) m
where event like 'A%';
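To see why the descending window works, here is a minimal Python sketch of the same idea (an illustration only, not BigQuery code). It walks the rows from latest to earliest, remembers the earliest B seen so far, and attaches it to each A. The function name next_b_for_each_a and the (event, time) tuple layout are assumptions made for this example.

```python
from datetime import datetime

def next_b_for_each_a(rows):
    """rows: list of (event, time). Returns (event, seconds_to_next_b) per A row."""
    next_b_time = None
    out = []
    # Scan from the latest row to the earliest, like OVER (ORDER BY time DESC).
    for event, time in sorted(rows, key=lambda r: r[1], reverse=True):
        if event.startswith("B"):
            # In a descending scan every new B is earlier than the ones seen
            # before, so overwriting behaves like the running MIN.
            next_b_time = time
        elif event.startswith("A"):
            duration = None
            if next_b_time is not None:
                duration = int((next_b_time - time).total_seconds())
            out.append((event, duration))
    out.reverse()  # back to ascending time order
    return out

rows = [
    ("A1", datetime(2018, 1, 1, 13, 0)),
    ("B5", datetime(2018, 1, 1, 14, 0)),
    ("A3", datetime(2018, 1, 1, 15, 0)),
    ("B9", datetime(2018, 1, 1, 17, 0)),
]
result = next_b_for_each_a(rows)
print(result)  # [('A1', 3600), ('A3', 7200)]
```

Note the argument order in the subtraction: the later B time comes first, which is what keeps the duration positive.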

Below is for BigQuery Standard SQL
#standardSQL
SELECT event, TIMESTAMP_DIFF(b_time, time, SECOND) duration, b_event
FROM (
SELECT event, time,
LEAD(time) OVER(PARTITION BY grp ORDER BY time) b_time,
LEAD(event) OVER(PARTITION BY grp ORDER BY time) b_event
FROM (
SELECT *,
COUNTIF(STARTS_WITH(event, 'A')) OVER(ORDER BY time) grp
FROM `project.dataset.your_table` t
)
)
WHERE STARTS_WITH(event, 'A')
-- ORDER BY time
You can test / play with it using dummy data from your question as below
#standardSQL
WITH `project.dataset.your_table` AS (
SELECT 'A1' event, TIMESTAMP '2018-01-01 1:00:00' time UNION ALL
SELECT 'B5', TIMESTAMP '2018-01-01 2:00:00' UNION ALL
SELECT 'A3', TIMESTAMP '2018-01-01 3:00:00' UNION ALL
SELECT 'B9', TIMESTAMP '2018-01-01 5:00:00'
)
SELECT event, TIMESTAMP_DIFF(b_time, time, SECOND) duration, b_event
FROM (
SELECT event, time,
LEAD(time) OVER(PARTITION BY grp ORDER BY time) b_time,
LEAD(event) OVER(PARTITION BY grp ORDER BY time) b_event
FROM (
SELECT *,
COUNTIF(STARTS_WITH(event, 'A')) OVER(ORDER BY time) grp
FROM `project.dataset.your_table` t
)
)
WHERE STARTS_WITH(event, 'A')
ORDER BY time
with result as
Row event duration b_event
1 A1 3600 B5
2 A3 7200 B9
Please note: the solution above relies on the statement in your question that B will always happen after A, so it expects each A to be followed directly by a B. If you have a sequence like the one below
WITH `project.dataset.your_table` AS (
SELECT 'A1' event, TIMESTAMP '2018-01-01 1:00:00' time UNION ALL
SELECT 'A2', TIMESTAMP '2018-01-01 1:30:00' UNION ALL
SELECT 'B5', TIMESTAMP '2018-01-01 2:00:00' UNION ALL
SELECT 'A3', TIMESTAMP '2018-01-01 3:00:00' UNION ALL
SELECT 'B9', TIMESTAMP '2018-01-01 5:00:00'
)
result will be
Row event duration b_event
1 A1 null null
2 A2 1800 B5
3 A3 7200 B9
If you need to handle this case, try the query below
#standardSQL
WITH `project.dataset.your_table` AS (
SELECT 'A1' event, TIMESTAMP '2018-01-01 1:00:00' time UNION ALL
SELECT 'A2', TIMESTAMP '2018-01-01 1:30:00' UNION ALL
SELECT 'B5', TIMESTAMP '2018-01-01 2:00:00' UNION ALL
SELECT 'A3', TIMESTAMP '2018-01-01 3:00:00' UNION ALL
SELECT 'B9', TIMESTAMP '2018-01-01 5:00:00'
)
SELECT event, TIMESTAMP_DIFF(b_time, time, SECOND) duration, b_event
FROM (
SELECT event, time, type, grp,
FIRST_VALUE(event) OVER(ORDER BY grp RANGE BETWEEN 1 FOLLOWING AND 1 FOLLOWING) b_event,
FIRST_VALUE(time) OVER(ORDER BY grp RANGE BETWEEN 1 FOLLOWING AND 1 FOLLOWING) b_time
FROM (
SELECT event, time, SUBSTR(event, 1, 1) type,
COUNTIF(STARTS_WITH(event, 'B')) OVER(ORDER BY time) grp
FROM `project.dataset.your_table` t
)
)
WHERE STARTS_WITH(event, 'A')
ORDER BY time
this version will return
Row event duration b_event
1 A1 3600 B5
2 A2 1800 B5
3 A3 7200 B9
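The grouping used in this last version can be traced by hand with a small Python sketch. This is only an illustration under the assumption that rows are (event, time) pairs; match_a_to_next_b is a made-up name. Each row gets grp = the number of B rows seen up to and including it, so the B that closes a run of A rows is the B row whose grp is one higher.

```python
from datetime import datetime

def match_a_to_next_b(rows):
    """rows: list of (event, time). Returns (event, duration_seconds, b_event) per A."""
    b_by_grp = {}   # grp value -> the B row that produced that running count
    tagged = []
    b_count = 0
    for event, time in sorted(rows, key=lambda r: r[1]):
        if event.startswith("B"):
            b_count += 1                  # running COUNTIF(event starts with 'B')
            b_by_grp[b_count] = (event, time)
        tagged.append((event, time, b_count))

    result = []
    for event, time, grp in tagged:
        if not event.startswith("A"):
            continue
        b = b_by_grp.get(grp + 1)         # the next B after this A, if any
        if b is None:
            result.append((event, None, None))
        else:
            result.append((event, int((b[1] - time).total_seconds()), b[0]))
    return result

rows = [
    ("A1", datetime(2018, 1, 1, 1, 0)),
    ("A2", datetime(2018, 1, 1, 1, 30)),
    ("B5", datetime(2018, 1, 1, 2, 0)),
    ("A3", datetime(2018, 1, 1, 3, 0)),
    ("B9", datetime(2018, 1, 1, 5, 0)),
]
matched = match_a_to_next_b(rows)
print(matched)
```

With the five sample rows, A1 and A2 both have grp 0 and therefore both pick up B5, which has grp 1.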

Related

LAG with condition

I want to get a value from the previous row that matches a certain condition.
For example: here I want for each row to get the timestamp from the last event = 1.
I feel I can do it without joins with LAG and PARTITION BY with CASE but I am not able to crack it.
Please help.
Here is one approach using analytic functions:
WITH cte AS (
SELECT *, COUNT(CASE WHEN event = 1 THEN 1 END) OVER
(PARTITION BY customer_id ORDER BY ts) cnt
FROM yourTable
)
SELECT ts, customer_id, event,
MAX(CASE WHEN event = 1 THEN ts END) OVER
(PARTITION BY customer_id, cnt) AS desired_result
FROM cte
ORDER BY customer_id, ts;
We can articulate your problem by saying that you want the desired_result column to contain the most recent timestamp at which the event was 1. The count (cnt) in the CTE above assigns each row to a pseudo group that starts each time the event is 1. Then we simply do a conditional aggregation over customer and pseudo group to find the timestamp value.
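If the pseudo-group step is hard to picture, the following Python sketch mirrors it outside the database. It is a rough illustration, assuming rows are (customer_id, ts, event) tuples with integer timestamps; the function name desired_result_by_group is invented for the example.

```python
from collections import defaultdict

def desired_result_by_group(rows):
    """rows: list of (customer_id, ts, event). Appends the ts of the latest event = 1."""
    rows = sorted(rows, key=lambda r: (r[0], r[1]))

    # Step 1: running count of event = 1 per customer (the cnt column).
    cnt = defaultdict(int)
    tagged = []
    for customer_id, ts, event in rows:
        if event == 1:
            cnt[customer_id] += 1
        tagged.append((customer_id, ts, event, cnt[customer_id]))

    # Step 2: conditional MAX of ts over each (customer, cnt) pseudo group.
    group_max = {}
    for customer_id, ts, event, c in tagged:
        if event == 1:
            key = (customer_id, c)
            group_max[key] = max(ts, group_max.get(key, ts))

    return [(customer_id, ts, event, group_max.get((customer_id, c)))
            for customer_id, ts, event, c in tagged]

sample = [(111, 1, 2), (111, 2, 1), (111, 3, 2), (111, 4, 3), (111, 5, 1), (111, 6, 2)]
filled = desired_result_by_group(sample)
print([row[3] for row in filled])  # [None, 2, 2, 2, 5, 5]
```

Rows that come before the first event = 1 fall into group 0, which has no matching timestamp, so they get None (NULL in SQL).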
One more approach with "one query":
with data as
(
select sysdate - 0.29 ts, 111 customer_id, 1 event from dual union all
select sysdate - 0.28 ts, 111 customer_id, 2 event from dual union all
select sysdate - 0.27 ts, 111 customer_id, 3 event from dual union all
select sysdate - 0.26 ts, 111 customer_id, 1 event from dual union all
select sysdate - 0.25 ts, 111 customer_id, 1 event from dual union all
select sysdate - 0.24 ts, 111 customer_id, 2 event from dual union all
select sysdate - 0.23 ts, 111 customer_id, 1 event from dual union all
select sysdate - 0.22 ts, 111 customer_id, 1 event from dual
)
select
ts, event,
last_value(case when event=1 then ts end) ignore nulls
over (partition by customer_id order by ts) desired_result,
max(case when event=1 then ts end)
over (partition by customer_id order by ts) desired_result_2
from data
order by ts
Edit: As suggested by MatBailie the max(case...) works as well and is a more general approach. The "last_value ... ignore nulls" is Oracle specific.
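Both window expressions here amount to carrying the last matching timestamp forward. A minimal Python sketch of that behaviour, assuming (customer_id, ts, event) tuples with integer timestamps and a hypothetical function name carry_forward_event1, could look like this:

```python
def carry_forward_event1(rows):
    """rows: list of (customer_id, ts, event). Returns (ts, event, last ts with event = 1)."""
    last_seen = {}  # customer_id -> ts of the most recent event = 1 so far
    out = []
    for customer_id, ts, event in sorted(rows, key=lambda r: (r[0], r[1])):
        if event == 1:
            last_seen[customer_id] = ts
        # Rows before the first event = 1 have nothing to carry forward yet.
        out.append((ts, event, last_seen.get(customer_id)))
    return out

# Same event sequence as the sample data above, with ts simplified to 1..8.
events = [1, 2, 3, 1, 1, 2, 1, 1]
data = [(111, ts, event) for ts, event in enumerate(events, start=1)]
carried = carry_forward_event1(data)
print([row[2] for row in carried])  # [1, 1, 1, 4, 5, 5, 7, 8]
```

The running max gives the same answer as the carried-forward value only because timestamps increase along the ordering.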

How can I select the first and the last row for each set returned

I have the following data which I want to select as follows:
How can I modify the query to select the output as shown below?
select primary_id, timestamp, secondary_id,... from tablename where
timestamp <= to_timestamp('2020-07-29 00:00:00', 'YYYY-MM-DD HH24:MI:SS') and
timestamp < to_timestamp('2020-07-29 04:00:00', 'YYYY-MM-DD HH24:MI:SS')
order by timestamp, secondary_id;
primary_id timestamp secondary_id attribute1 attribute2 ... -- I want to get
-------------------------------------------------------------------
1 2020/01/20 10 ... ... ... -- <- this
2 2020/02/28 10 ... ... ...
3 2020/03/01 10 ... ... ... -- <- and this
4 2020/04/08 20 ... ... ... -- <- this
5 2020/05/31 20 ... ... ...
6 2020/06/30 20 ... ... ...
7 2020/06/31 20 ... ... ...
8 2020/07/31 20 ... ... ... -- <- and this
You can use window functions to rank records having the same secondary_id by ascending and descending timestamp, and then use that information to filter in the first and last record per group:
select primary_id, timestamp, secondary_id, ...
from (
select
t.*,
row_number() over(partition by secondary_id order by timestamp asc ) rn_asc,
row_number() over(partition by secondary_id order by timestamp desc) rn_desc
from tablename t
where
timestamp <= timestamp '2020-07-29 00:00:00'
and timestamp < timestamp '2020-07-29 04:00:00'
) t
where 1 in (rn_asc, rn_desc)
order by timestamp, secondary_id;
Note that you don't need to_timestamp() to convert these literal strings: you can use timestamp literals instead.
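For anyone who wants to check the two-way ranking on paper, here is a small Python sketch of it. It assumes rows are (primary_id, timestamp, secondary_id) tuples with ISO date strings, and the name first_and_last_per_group is made up for the example.

```python
from itertools import groupby

def first_and_last_per_group(rows):
    """rows: list of (primary_id, timestamp, secondary_id)."""
    keep = []
    ordered = sorted(rows, key=lambda r: (r[2], r[1]))
    for _, members in groupby(ordered, key=lambda r: r[2]):
        members = list(members)
        n = len(members)
        for rn_asc, row in enumerate(members, start=1):
            rn_desc = n - rn_asc + 1      # rank when counting from the end
            if 1 in (rn_asc, rn_desc):    # same test as the WHERE clause
                keep.append(row)
    return sorted(keep, key=lambda r: (r[1], r[2]))

rows = [
    (1, "2020-01-20", 10), (2, "2020-02-28", 10), (3, "2020-03-01", 10),
    (4, "2020-04-08", 20), (5, "2020-05-31", 20), (6, "2020-06-15", 20),
    (7, "2020-06-30", 20), (8, "2020-07-31", 20),
]
kept = first_and_last_per_group(rows)
print([r[0] for r in kept])  # [1, 3, 4, 8]
```

A group with a single row has rank 1 in both directions and is returned once.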
The query below also works when a secondary_id value can be repeated in another group of rows; it simply checks whether the current id differs from that of the previous or next row:
select *
from (
select
t.*,
lag(secondary_id) over(order by timestamp asc ) lag_id,
lead(secondary_id) over(order by timestamp asc) lead_id
from tablename t
where timestamp <= timestamp '2020-07-29 00:00:00'
and timestamp < timestamp '2020-07-29 04:00:00'
) t
where lag_id is null
or lead_id is null
or lag_id <> secondary_id
or lead_id <> secondary_id
order by timestamp, secondary_id;
Should be quite efficient as there's the same ORDER BY for both LEAD & LAG.
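As a rough illustration of the boundary test, this Python sketch applies the same four conditions to a list ordered by timestamp. It assumes (primary_id, timestamp, secondary_id) tuples with non-null ids and integer timestamps; run_boundaries is a hypothetical name.

```python
def run_boundaries(rows):
    """rows: list of (primary_id, timestamp, secondary_id). Keeps the ends of each run."""
    rows = sorted(rows, key=lambda r: r[1])
    keep = []
    for i, row in enumerate(rows):
        lag_id = rows[i - 1][2] if i > 0 else None                # LAG(secondary_id)
        lead_id = rows[i + 1][2] if i + 1 < len(rows) else None   # LEAD(secondary_id)
        if (lag_id is None or lead_id is None
                or lag_id != row[2] or lead_id != row[2]):
            keep.append(row)
    return keep

# secondary_id 10 appears in two separate runs here.
rows = [(1, 1, 10), (2, 2, 10), (3, 3, 10), (4, 4, 20),
        (5, 5, 20), (6, 6, 10), (7, 7, 10), (8, 8, 10)]
edges = run_boundaries(rows)
print([r[0] for r in edges])  # [1, 3, 4, 5, 6, 8]
```

Partitioning by secondary_id would instead treat both runs of 10 as one group and return only rows 1 and 8 for it.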
Please use the query below:
select primary_id, timestamp, secondary_id,... from
(select primary_id, timestamp, secondary_id,...,
row_number() over (partition by secondary_id order by timestamp) as rnk1,
row_number() over (partition by secondary_id order by timestamp desc) as rnk2
from tablename where
timestamp <= to_timestamp('2020-07-29 00:00:00', 'YYYY-MM-DD HH24:MI:SS') and
timestamp < to_timestamp('2020-07-29 04:00:00', 'YYYY-MM-DD HH24:MI:SS') ) qry
where rnk1 = 1 or rnk2 = 1
order by timestamp, secondary_id;
You could use first_value and last_value. These are analytic functions and can be used as in the demo below.
with demo_data ( primary_id, secondary_id, timestamp)as
( select 1, 10, date '2020-01-01' from dual
union all
select 2 ,10, date '2020-01-28' from dual
union all
select 3, 10, date '2020-02-03' from dual
union all
select 4, 20, date '2020-03-02' from dual
union all
select 5, 20, date '2020-03-15' from dual
)
, grouped_data as
( select primary_id,
secondary_id,
timestamp,
decode(first_value(primary_id) over(partition by secondary_id order by timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ), primary_id, 'Y', 'N') first_row_in_group,
decode(last_value(primary_id) over(partition by secondary_id order by timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING), primary_id, 'Y', 'N') last_row_in_group
from demo_data
)
select primary_id, secondary_id, timestamp
from grouped_data s
where first_row_in_group = 'Y' or last_row_in_group = 'Y'
/

Finding the most recent thing prior to a specific event

I'm doing some timestamp problem solving but am stuck with some join logic.
I have a table of data like so:
id, event_time, event_type, location
1001, 2018-06-04 18:23:48.526895 UTC, I, d
1001, 2018-06-04 19:26:44.359296 UTC, I, h
1001, 2018-06-05 06:07:03.658263 UTC, I, w
1001, 2018-06-07 00:47:44.651841 UTC, I, d
1001, 2018-06-07 00:48:17.857729 UTC, C, d
1001, 2018-06-08 00:04:53.086240 UTC, I, a
1001, 2018-06-12 21:23:03.071829 UTC, I, d
...
And I'm trying to find the timestamp difference between when a user has an event_type of C and the most recent event type of I up to event_type C for a given location value.
Ultimately the schema I'm after is:
id, location, timestamp_diff
1001, d, 33
1001, z, 21
1002, a, 55
...
I tried the following, which works for only one id value but doesn't seem to work for multiple ids. I might be over-complicating the issue, but I wasn't sure. On one id it gives about 5 rows, which is right. However, when I open it up to two ids, I get upwards of 200 rows when I should get something like 7 (5 for the first id and 2 for the second):
with c as (
select
id
,event_time as c_time
,location
from data
where event_type = 'C'
and id = '1001'
)
,i as (
select
id
,event_time as i_time
,location
from data
where event_type = 'I'
)
,check1 as (
select
c.*
,i.i_time
from c
left join i on (c.id = i.id and c.location = i.location)
group by 1,2,3,4
having i_time <= c_time
)
,check2 as (
select
id
,c_time
,location
,max(i_time) as i_time
from check1
group by 1,2,3
)
select
id
,location
,timestamp_diff(c_time, i_time, second) as timestamp_diff
from check2
#standardSQL
SELECT id, location, TIMESTAMP_DIFF(event_time, i_event_time, SECOND) AS diff
FROM (
SELECT *, MAX(IF(event_type = 'I', event_time, NULL)) OVER(win2) AS i_event_time
FROM (
SELECT *, COUNTIF(event_type = 'C') OVER(win1) grp
FROM `project.dataset.table`
WINDOW win1 AS (PARTITION BY id, location ORDER BY event_time ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
)
WINDOW win2 AS (PARTITION BY id, location, grp ORDER BY event_time ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
)
WHERE event_type = 'C'
AND NOT i_event_time IS NULL
This version addresses some edge cases, for example consecutive 'C' events with "missing" 'I' events, as in the example below
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1001 id, TIMESTAMP '2018-06-04 18:23:48.526895 UTC' event_time, 'I' event_type, 'd' location UNION ALL
SELECT 1001, '2018-06-04 19:26:44.359296 UTC', 'I', 'h' UNION ALL
SELECT 1001, '2018-06-05 06:07:03.658263 UTC', 'I', 'w' UNION ALL
SELECT 1001, '2018-06-07 00:47:44.651841 UTC', 'I', 'd' UNION ALL
SELECT 1001, '2018-06-07 00:48:17.857729 UTC', 'C', 'd' UNION ALL
SELECT 1001, '2018-06-08 00:04:53.086240 UTC', 'C', 'd' UNION ALL
SELECT 1001, '2018-06-12 21:23:03.071829 UTC', 'I', 'd'
)
SELECT id, location, TIMESTAMP_DIFF(event_time, i_event_time, SECOND) AS diff
FROM (
SELECT *, MAX(IF(event_type = 'I', event_time, NULL)) OVER(win2) AS i_event_time
FROM (
SELECT *, COUNTIF(event_type = 'C') OVER(win1) grp
FROM `project.dataset.table`
WINDOW win1 AS (PARTITION BY id, location ORDER BY event_time ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
)
WINDOW win2 AS (PARTITION BY id, location, grp ORDER BY event_time ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
)
WHERE event_type = 'C'
AND NOT i_event_time IS NULL
result is
Row id location diff
1 1001 d 33
whereas without handling that edge case it would be
Row id location diff
1 1001 d 33
2 1001 d 83795
You can use a cumulative max() function to get the most recent i time before every event.
Then just filter based on the C event:
select id, location,
timestamp_diff(event_time, i_event_time, second) as diff
from (select t.*,
max(case when event_type = 'I' then event_time end) over (partition by id, location order by event_time) as i_event_time
from t
) t
where event_type = 'C';
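The cumulative max can be pictured as a per-(id, location) "latest I so far" value. Below is a minimal Python sketch of that idea, assuming rows are (id, event_time, event_type, location) tuples; the name c_minus_latest_i is made up. Unlike the query above, the sketch skips C events that have no earlier I, where the query would return a NULL diff.

```python
from datetime import datetime

def c_minus_latest_i(rows):
    """rows: list of (id, event_time, event_type, location)."""
    latest_i = {}  # (id, location) -> most recent I time seen so far
    out = []
    for id_, event_time, event_type, location in sorted(rows, key=lambda r: r[1]):
        key = (id_, location)
        if event_type == "I":
            latest_i[key] = event_time
        elif event_type == "C" and key in latest_i:
            seconds = int((event_time - latest_i[key]).total_seconds())
            out.append((id_, location, seconds))
    return out

rows = [
    (1001, datetime(2018, 6, 4, 18, 23, 48, 526895), "I", "d"),
    (1001, datetime(2018, 6, 4, 19, 26, 44, 359296), "I", "h"),
    (1001, datetime(2018, 6, 5, 6, 7, 3, 658263), "I", "w"),
    (1001, datetime(2018, 6, 7, 0, 47, 44, 651841), "I", "d"),
    (1001, datetime(2018, 6, 7, 0, 48, 17, 857729), "C", "d"),
    (1001, datetime(2018, 6, 8, 0, 4, 53, 86240), "I", "a"),
    (1001, datetime(2018, 6, 12, 21, 23, 3, 71829), "I", "d"),
]
diffs = c_minus_latest_i(rows)
print(diffs)  # [(1001, 'd', 33)]
```

Because the state is keyed by (id, location), adding more ids does not multiply rows the way a join on id and location does.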

Date wise hourly (on 24 hour) customer count

I have a data set with a customer id, a join time and a leave time for each customer. I want to count the customers present in each hour of each date.
Here is a sample data set
My expected output
Here is the code I tried: I first created 24 one-hour spans, then joined them to the logs and aggregated to get the expected result. It works for the current date, but I need it to work for any date, i.e. dynamically.
select logdate as date,timespan,count(customer_id)
(
SELECT userid,cast(joinTime as date) as logdate,customer_id
,starttime,endtime,timespan
FROM login_out_logs AS logTable
left join
(select '00:00:00 - 01:00:00' timespan,DATEadd(hh,0,cast(dateadd(dd,-1,getdate()))) starttime,dateadd(hh,1,cast(dateadd(dd,-1,getdate()))) endtime
union
select '01:00:00 - 02:00:00', dateadd(hh,1,cast(dateadd(dd,-1,getdate()))),dateadd(hh,2,cast(dateadd(dd,-1,getdate())))
union
select '02:00:00 - 03:00:00', dateadd(hh,2,cast(dateadd(dd,-1,getdate()))),dateadd(hh,3,cast(dateadd(dd,-1,getdate())))
union
select '03:00:00 - 04:00:00', dateadd(hh,3,cast(dateadd(dd,-1,getdate()))),dateadd(hh,4,cast(dateadd(dd,-1,getdate())))
union
select '04:00:00 - 05:00:00', dateadd(hh,4,cast(dateadd(dd,-1,getdate()))),dateadd(hh,5,cast(dateadd(dd,-1,getdate())))
union
select '05:00:00 - 06:00:00',dateadd(hh,5,cast(dateadd(dd,-1,getdate()))),dateadd(hh,6,cast(dateadd(dd,-1,getdate())))
union
select '06:00:00 - 07:00:00',dateadd(hh,6,cast(dateadd(dd,-1,getdate()))),dateadd(hh,7,cast(dateadd(dd,-1,getdate())))
union
select '07:00:00 - 08:00:00',dateadd(hh,7,cast(dateadd(dd,-1,getdate()))),dateadd(hh,8,cast(dateadd(dd,-1,getdate())))
union
select '08:00:00 - 09:00:00',dateadd(hh,8,cast(dateadd(dd,-1,getdate()))),dateadd(hh,9,cast(dateadd(dd,-1,getdate())))
union
select '09:00:00 - 10:00:00',dateadd(hh,9,cast(dateadd(dd,-1,getdate()))),dateadd(hh,10,cast(dateadd(dd,-1,getdate())))
union
select '10:00:00 - 11:00:00',dateadd(hh,10,cast(dateadd(dd,-1,getdate()))),dateadd(hh,11,cast(dateadd(dd,-1,getdate())))
union
select '11:00:00 - 12:00:00',dateadd(hh,11,cast(dateadd(dd,-1,getdate()))),dateadd(hh,12,cast(dateadd(dd,-1,getdate())))
union
select '12:00:00 - 13:00:00',dateadd(hh,12,cast(dateadd(dd,-1,getdate()))),dateadd(hh,13,cast(dateadd(dd,-1,getdate())))
union
select '13:00:00 - 14:00:00',dateadd(hh,13,cast(dateadd(dd,-1,getdate()))),dateadd(hh,14,cast(dateadd(dd,-1,getdate())))
union
select '14:00:00 - 15:00:00',dateadd(hh,14,cast(dateadd(dd,-1,getdate()))),dateadd(hh,15,cast(dateadd(dd,-1,getdate())))
union
select '15:00:00 - 16:00:00',dateadd(hh,15,cast(dateadd(dd,-1,getdate()))),dateadd(hh,16,cast(dateadd(dd,-1,getdate())))
union
select '16:00:00 - 17:00:00',dateadd(hh,16,cast(dateadd(dd,-1,getdate()))),dateadd(hh,17,cast(dateadd(dd,-1,getdate())))
union
select '17:00:00 - 18:00:00',dateadd(hh,17,cast(dateadd(dd,-1,getdate()))),dateadd(hh,18,cast(dateadd(dd,-1,getdate())))
union
select '18:00:00 - 19:00:00',dateadd(hh,18,cast(dateadd(dd,-1,getdate()))),dateadd(hh,19,cast(dateadd(dd,-1,getdate())))
union
select '19:00:00 - 20:00:00',dateadd(hh,19,cast(dateadd(dd,-1,getdate()))),dateadd(hh,20,cast(dateadd(dd,-1,getdate())))
union
select '20:00:00 - 21:00:00',dateadd(hh,20,cast(dateadd(dd,-1,getdate()))),dateadd(hh,21,cast(dateadd(dd,-1,getdate())))
union
select '21:00:00 - 22:00:00',dateadd(hh,21,cast(dateadd(dd,-1,getdate()))),dateadd(hh,22,cast(dateadd(dd,-1,getdate())))
union
select '22:00:00 - 23:00:00',dateadd(hh,22,cast(dateadd(dd,-1,getdate()))),dateadd(hh,23,cast(dateadd(dd,-1,getdate())))
union
select '23:00:00 - 00:00:00',dateadd(hh,23,cast(dateadd(dd,-1,getdate()))),dateadd(hh,23,dateadd(mi,59,cast(dateadd(dd,-1,getdate())))))a
on starttime between jointime and leaveTime
or endtime between jointime and leaveTime
or jointime>=starttime and jointime<endtime
) as T
group by leaveTime,timespan
Date Hour customer_count
2018-01-01 8-9 1
2018-01-01 9-10 1
2018-01-01 10-11 1
2018-01-01 11-12 1
2018-01-01 12-13 1
2018-01-01 13-14 1
2018-01-01 14-15 1
2018-01-01 15-16 1
2018-01-01 16-17 1
2018-01-01 17-18 1
2018-01-01 18-19 1
2018-01-01 19-20 1
2018-01-01 20-21 2
2018-01-01 21-22 3
2018-01-01 22-23 2
2018-01-01 23-00 1
Here is an approach that may already solve your problem. I designed it to work with any number of days between join and leave. However, I can't say anything about performance on larger sets, since I tested it with your example only, and evaluating all relevant hours might take longer on bigger data sets.
Anyway, I used a recursive CTE here to enumerate all hours between join and leave, and then I group by date and hour:
DECLARE @Cust TABLE(
customer_id INT,
joinTime DATETIME,
leaveTime DATETIME
)
INSERT INTO @Cust VALUES
(536, '2018-01-01 08:05:00', '2018-01-01 18:31:00'),
(344, '2018-01-01 19:37:00', '2018-01-01 20:16:00'),
(344, '2018-01-01 19:49:00', '2018-01-01 20:00:00'),
(899, '2018-01-01 20:49:00', '2018-01-01 21:14:00'),
(2336, '2018-01-01 21:02:00', '2018-01-01 21:03:00'),
(335, '2018-01-01 21:03:00', '2018-01-01 23:43:00'),
(2336, '2018-01-01 21:03:00', '2018-01-02 00:06:00'),
(899, '2018-01-01 21:18:00', '2018-01-01 22:24:00'),
(345, '2018-01-01 21:21:00', '2018-01-01 21:39:00'),
(345, '2018-01-01 21:53:00', '2018-01-02 00:13:00');
;WITH cte AS(
SELECT c.customer_id,
c.joinTime,
c.leaveTime,
c.joinTime x
FROM @Cust c
UNION ALL
SELECT c.customer_id,
c.joinTime,
c.leaveTime,
DATEADD(HOUR, 1, x) x
FROM cte c
WHERE DATEADD(HOUR, 1, x) <= CASE WHEN DATEPART(MINUTE, x) < DATEPART(MINUTE, c.leaveTime) THEN c.leaveTime ELSE DATEADD(HOUR, 1, c.leaveTime) END
)
SELECT CONVERT(DATE, x) AS cDate, DATEPART(HOUR, x) AS cHour, COUNT(*) AS cCount
FROM cte
GROUP BY CONVERT(DATE, x), DATEPART(HOUR, x)
ORDER BY 1,2
OPTION (MAXRECURSION 0)
Try this:
;WITH hourlist(starthour) AS (
SELECT 0 -- Seed Row
UNION ALL
SELECT starthour + 1 -- Recursion
FROM hourlist
where starthour+1<=23
)
SELECT
day
,convert(nvarchar,starthour)+'-'+convert(nvarchar,case when starthour+1=24 then 0 else starthour+1 end) hourtitle
,count(distinct customer_id) 'customer count'
FROM
hourlist h -- list of all hours
cross join
(
select distinct dateadd(day,datediff(day,0, joinTime),0) from #login_out_logs
union
select distinct dateadd(day,datediff(day,0,leaveTime),0) from #login_out_logs
)q10(day) -- list of all days of jointime and leavetime
inner join #login_out_logs l on -- a log counts for a given day/hour if it starts before the hour ends and ends at or after the hour starts
l.joinTime <dateadd(hour,starthour+1,q10.day)
and
l.leaveTime>=dateadd(hour,starthour ,q10.day)
group by day,starthour
order by day,starthour
Note: this will only work for jointimes and leavetimes that differ 0 or 1 days, not 2 or more.
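The overlap rule in the join (a session counts for an hour if it starts before the hour ends and ends at or after the hour starts) can also be applied by stepping through the hours of each session, which is not limited to a one-day gap. Here is a small Python sketch of that rule. The sample sessions and the name hourly_customer_counts are invented for illustration and are not taken from the question.

```python
from datetime import datetime, timedelta

def hourly_customer_counts(sessions):
    """sessions: list of (customer_id, join_time, leave_time).
    Returns {(date, hour): number of distinct customers present}."""
    buckets = {}
    for customer_id, join_time, leave_time in sessions:
        # First hour bucket touched by the session.
        hour_start = join_time.replace(minute=0, second=0, microsecond=0)
        # Keep going while the hour starts at or before the leave time.
        while hour_start <= leave_time:
            key = (hour_start.date(), hour_start.hour)
            buckets.setdefault(key, set()).add(customer_id)
            hour_start += timedelta(hours=1)
    return {key: len(ids) for key, ids in sorted(buckets.items())}

sessions = [
    (1, datetime(2018, 1, 1, 8, 5), datetime(2018, 1, 1, 9, 10)),
    (2, datetime(2018, 1, 1, 8, 30), datetime(2018, 1, 1, 8, 45)),
    (1, datetime(2018, 1, 1, 23, 50), datetime(2018, 1, 2, 0, 10)),
]
counts = hourly_customer_counts(sessions)
for (day, hour), n in counts.items():
    print(day, hour, n)
```

Using a set per bucket corresponds to count(distinct customer_id), so a customer with two sessions in the same hour is counted once.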

How to combine multiple SELECTs into a single SELECT by a common column in (BigQuery) SQL?

I have multiple tables in BigQuery, and therefore multiple SQL statements that each give me "the number of X per day". For example:
SELECT FORMAT_TIMESTAMP("%F",timestamp) AS day, COUNT(*) as installs
FROM database.table1
GROUP BY day
ORDER BY day ASC
Which would give the result:
| day | installs |
-------------------------
| 2017-01-01 | 11 |
| 2017-01-02 | 22 |
etc
Another statement:
SELECT FORMAT_TIMESTAMP("%F",timestamp) AS day, COUNT(*) as uninstalls
FROM database.table2
GROUP BY day
ORDER BY day ASC
Which would give the result:
| day | uninstalls |
---------------------------
| 2017-01-02 | 22 |
| 2017-01-03 | 33 |
etc
Another statement:
SELECT FORMAT_TIMESTAMP("%F",timestamp) AS day, COUNT(*) as cases
FROM database.table3
GROUP BY day
ORDER BY day ASC
Which would give the result:
| day | cases |
----------------------
| 2017-01-01 | 11 |
| 2017-01-03 | 33 |
etc
etc
Now I need to combine all these into a single SELECT statement that gives the following results:
| day | installs | uninstalls | cases |
----------------------------------------------
| 2017-01-01 | 11 | 0 | 11 |
| 2017-01-02 | 22 | 22 | 0 |
| 2017-01-03 | 0 | 33 | 33 |
etc
Is this even possible?
Or what's the closest SQL-statement I can write that would give me a similar result?
Any feedback is appreciated!
Here is a self-contained example that might help to get you started. It uses two dummy tables, InstallEvents and UninstallEvents, which contain timestamps for the respective actions. It creates a common table expression called StartAndEnd that computes the minimum and maximum dates for these events in order to decide which dates to aggregate over, then unions the contents of the InstallEvents and UninstallEvents, counting the events for each day.
WITH InstallEvents AS (
SELECT TIMESTAMP_ADD('2017-01-01 00:00:00', INTERVAL x HOUR) AS timestamp
FROM UNNEST(GENERATE_ARRAY(0, 100)) AS x
),
UninstallEvents AS (
SELECT TIMESTAMP_ADD('2017-01-02 00:00:00', INTERVAL 2 * x HOUR) AS timestamp
FROM UNNEST(GENERATE_ARRAY(0, 50)) AS x
),
StartAndEnd AS (
SELECT MIN(DATE(timestamp)) AS min_date, MAX(DATE(timestamp)) AS max_date
FROM (
SELECT * FROM InstallEvents UNION ALL
SELECT * FROM UninstallEvents
)
)
SELECT
day,
COUNTIF(is_install AND DATE(timestamp) = day) AS installs,
COUNTIF(NOT is_install AND DATE(timestamp) = day) AS uninstalls
FROM (
SELECT *, true AS is_install
FROM InstallEvents UNION ALL
SELECT *, false
FROM UninstallEvents
)
CROSS JOIN UNNEST(GENERATE_DATE_ARRAY(
(SELECT min_date FROM StartAndEnd),
(SELECT max_date FROM StartAndEnd)
)) AS day
GROUP BY day
ORDER BY day;
If you know what the start and end dates are in advance, you can hard-code them in the query instead and then omit the StartAndEnd CTE:
WITH InstallEvents AS (
SELECT TIMESTAMP_ADD('2017-01-01 00:00:00', INTERVAL x HOUR) AS timestamp
FROM UNNEST(GENERATE_ARRAY(0, 100)) AS x
),
UninstallEvents AS (
SELECT TIMESTAMP_ADD('2017-01-02 00:00:00', INTERVAL 2 * x HOUR) AS timestamp
FROM UNNEST(GENERATE_ARRAY(0, 50)) AS x
)
SELECT
day,
COUNTIF(is_install AND DATE(timestamp) = day) AS installs,
COUNTIF(NOT is_install AND DATE(timestamp) = day) AS uninstalls
FROM (
SELECT *, true AS is_install
FROM InstallEvents UNION ALL
SELECT *, false
FROM UninstallEvents
)
CROSS JOIN UNNEST(GENERATE_DATE_ARRAY('2017-01-01', '2017-01-04')) AS day
GROUP BY day
ORDER BY day;
To see the events in the sample data, use a query that unions the contents:
WITH InstallEvents AS (
SELECT TIMESTAMP_ADD('2017-01-01 00:00:00', INTERVAL x HOUR) AS timestamp
FROM UNNEST(GENERATE_ARRAY(0, 100)) AS x
),
UninstallEvents AS (
SELECT TIMESTAMP_ADD('2017-01-02 00:00:00', INTERVAL 2 * x HOUR) AS timestamp
FROM UNNEST(GENERATE_ARRAY(0, 50)) AS x
)
SELECT timestamp, true AS is_install
FROM InstallEvents UNION ALL
SELECT timestamp, false
FROM UninstallEvents;
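The counting logic in the queries above (one row per day in the range, zero when a day has no events of a kind) can be sketched in a few lines of Python. This is only an illustration, assuming the events are already reduced to non-empty lists of dates; daily_counts is a hypothetical name.

```python
from collections import Counter
from datetime import date, timedelta

def daily_counts(installs, uninstalls):
    """installs / uninstalls: non-empty lists of datetime.date.
    Returns (day, installs, uninstalls) for every day from the first to the last event."""
    install_counts = Counter(installs)
    uninstall_counts = Counter(uninstalls)
    all_days = installs + uninstalls
    day, last = min(all_days), max(all_days)
    rows = []
    while day <= last:
        # A Counter returns 0 for days it has never seen.
        rows.append((day, install_counts[day], uninstall_counts[day]))
        day += timedelta(days=1)
    return rows

installs = [date(2017, 1, 1), date(2017, 1, 1), date(2017, 1, 2)]
uninstalls = [date(2017, 1, 2), date(2017, 1, 4)]
table = daily_counts(installs, uninstalls)
for row in table:
    print(row)
```

January 3 appears with two zeros because the day range is generated independently of the events, just as GENERATE_DATE_ARRAY does in the query.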
Below is for BigQuery Standard SQL
#standardSQL
WITH calendar AS (
SELECT day
FROM (
SELECT MIN(min_day) AS min_day, MAX(max_day) AS max_day
FROM (
SELECT MIN(DATE(timestamp)) AS min_day, MAX(DATE(timestamp)) AS max_day FROM `database.table1` UNION ALL
SELECT MIN(DATE(timestamp)) AS min_day, MAX(DATE(timestamp)) AS max_day FROM `database.table2` UNION ALL
SELECT MIN(DATE(timestamp)) AS min_day, MAX(DATE(timestamp)) AS max_day FROM `database.table3`
)
), UNNEST(GENERATE_DATE_ARRAY(min_day, max_day, INTERVAL 1 DAY)) AS day
)
SELECT
c.day AS day,
IFNULL(SUM(installs), 0) AS installs,
IFNULL(SUM(uninstalls), 0) AS uninstalls,
IFNULL(SUM(cases),0) AS cases
FROM calendar AS c
LEFT JOIN (SELECT DATE(timestamp) day, COUNT(1) installs FROM `database.table1` GROUP BY day) t1 ON t1.day = c.day
LEFT JOIN (SELECT DATE(timestamp) day, COUNT(1) uninstalls FROM `database.table2` GROUP BY day) t2 ON t2.day = c.day
LEFT JOIN (SELECT DATE(timestamp) day, COUNT(1) cases FROM `database.table3` GROUP BY day) t3 ON t3.day = c.day
GROUP BY day
HAVING installs + uninstalls + cases > 0
-- ORDER BY day
Please note: you are using timestamp as a column name, which is not the best practice because it is a keyword. In my example I keep your naming, but consider changing it!
You can test / play with this solution using the dummy data below
#standardSQL
WITH `database.table1` AS (
SELECT TIMESTAMP '2017-01-01' AS timestamp, 1 AS installs
UNION ALL SELECT TIMESTAMP '2017-01-01', 22
),
`database.table2` AS (
SELECT TIMESTAMP '2016-12-01' AS timestamp, 1 AS installs UNION ALL SELECT TIMESTAMP '2017-01-01', 22 UNION ALL SELECT TIMESTAMP '2017-01-01', 22 UNION ALL
SELECT TIMESTAMP '2017-01-02', 22 UNION ALL SELECT TIMESTAMP '2017-01-02', 22 UNION ALL SELECT TIMESTAMP '2017-01-02', 22 UNION ALL SELECT TIMESTAMP '2017-01-02', 22 UNION ALL SELECT TIMESTAMP '2017-01-02', 22
),
`database.table3` AS (
SELECT TIMESTAMP '2017-01-01' AS timestamp, 1 AS installs UNION ALL SELECT TIMESTAMP '2017-01-01', 22 UNION ALL SELECT TIMESTAMP '2017-01-01', 22 UNION ALL
SELECT TIMESTAMP '2017-01-10', 22 UNION ALL SELECT TIMESTAMP '2017-01-02', 22 UNION ALL SELECT TIMESTAMP '2017-01-02', 22 UNION ALL SELECT TIMESTAMP '2017-01-02', 22 UNION ALL SELECT TIMESTAMP '2017-01-02', 22
),
calendar AS (
SELECT day
FROM (
SELECT MIN(min_day) AS min_day, MAX(max_day) AS max_day
FROM (
SELECT MIN(DATE(timestamp)) AS min_day, MAX(DATE(timestamp)) AS max_day FROM `database.table1` UNION ALL
SELECT MIN(DATE(timestamp)) AS min_day, MAX(DATE(timestamp)) AS max_day FROM `database.table2` UNION ALL
SELECT MIN(DATE(timestamp)) AS min_day, MAX(DATE(timestamp)) AS max_day FROM `database.table3`
)
), UNNEST(GENERATE_DATE_ARRAY(min_day, max_day, INTERVAL 1 DAY)) AS day
)
SELECT
c.day AS day,
IFNULL(SUM(installs), 0) AS installs,
IFNULL(SUM(uninstalls), 0) AS uninstalls,
IFNULL(SUM(cases),0) AS cases
FROM calendar AS c
LEFT JOIN (SELECT DATE(timestamp) day, COUNT(1) installs FROM `database.table1` GROUP BY day) t1 ON t1.day = c.day
LEFT JOIN (SELECT DATE(timestamp) day, COUNT(1) uninstalls FROM `database.table2` GROUP BY day) t2 ON t2.day = c.day
LEFT JOIN (SELECT DATE(timestamp) day, COUNT(1) cases FROM `database.table3` GROUP BY day) t3 ON t3.day = c.day
GROUP BY day
HAVING installs + uninstalls + cases > 0
ORDER BY day
I am not very familiar with BigQuery, so this is probably not going to be a copy-paste answer.
You'll first have to build a calendar table to make sure you have all dates. Here's an example for SQL Server. There are probably examples for BigQuery available as well. The following assumes a Calendar table with a Date attribute of type timestamp.
Once you have your calendar table you can join all your tables to that:
-- Caution: joining raw rows from both tables pairs every install with every uninstall
-- on the same day, so the counts can be inflated; pre-aggregating each table per day avoids this
SELECT FORMAT_TIMESTAMP("%F",C.Date) AS day
, COUNT(DATE(T1.TIMESTAMP)) AS installs --Here you could also use your FORMAT_TIMESTAMP
, COUNT(DATE(T2.TIMESTAMP)) AS uninstalls
FROM Calendar C
LEFT JOIN database.table1 T1
ON DATE(T1.TIMESTAMP) = DATE(C.Date) --Convert to date to remove times, you could also use your FORMAT_TIMESTAMP
LEFT JOIN database.table2 T2
ON DATE(T2.TIMESTAMP) = DATE(C.Date)
GROUP BY day
ORDER BY day ASC
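One thing to watch when several event tables are joined to the same calendar row: if a day has m rows in one table and n rows in the other, the join yields m × n rows for that day, and a plain COUNT sees all of them. The toy Python sketch below, with made-up data, shows the effect and why counting each table per day first avoids it.

```python
from collections import Counter
from itertools import product

installs = ["2017-01-02"] * 2    # two install events on the same day
uninstalls = ["2017-01-02"] * 3  # three uninstall events on that day

# Joining raw rows on the day pairs every install with every uninstall.
joined = [(i, u) for i, u in product(installs, uninstalls) if i == u]
naive_installs = len(joined)  # counts joined rows, not install events

# Counting per table first leaves one value per day and table.
install_counts = Counter(installs)
uninstall_counts = Counter(uninstalls)
per_day = (install_counts["2017-01-02"], uninstall_counts["2017-01-02"])

print(naive_installs)  # 6
print(per_day)         # (2, 3)
```

In SQL terms, that means joining the calendar to subqueries that already GROUP BY day, rather than to the raw tables.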