Query to find and subtract two timestamps associated with the same identifier

I'm very new to BigQuery and not terribly familiar with SQL. I have a table of data that looks like this, where MyDate is a Timestamp object:
Row  MyDate                   StateTransition  MyIdentifier
1    2022-09-23 00:08:00 UTC  Start            6371
2    2022-10-10 01:17:14 UTC  Finished         6371
3    2022-09-26 04:51:40 UTC  Start            7768
4    2022-10-05 03:44:32 UTC  Finished         7768
etc.
My query looks something like
SELECT *
FROM <my-data-source>
WHERE (StateTransition="Start" OR StateTransition="Finished")
ORDER BY MyIdentifier, MyDate
What I'm trying to do is calculate the elapsed time (in days) between the Start and Finished timestamps associated with each MyIdentifier, and to have that displayed in another column. It could look like:
Row  MyDate                   StateTransition  MyIdentifier  ElapsedTime
1    2022-09-23 00:08:00 UTC  Start            6371
2    2022-10-10 01:17:14 UTC  Finished         6371          0.33
3    2022-09-26 04:51:40 UTC  Start            7768
4    2022-10-05 03:44:32 UTC  Finished         7768          0.04
Alternatively, it could even be flattened a little to something like:
Row  StartTransition          FinishedTransition       MyIdentifier  ElapsedTime
1    2022-09-23 00:08:00 UTC  2022-10-10 01:17:14 UTC  6371          0.33
2    2022-09-26 04:51:40 UTC  2022-10-05 03:44:32 UTC  7768          0.04
I've tried looking through the BigQuery docs and Stack Overflow but haven't found anything that addresses this use case of selecting items from multiple rows with a common identifier and then performing an operation on them. It seems like subtracting the two timestamps would be done with the TIMESTAMP_DIFF function.
Any assistance is appreciated!

WITH sample_table AS (
SELECT TIMESTAMP '2022-09-23 00:08:00 UTC' MyDate, 'Start' StateTransition, 6371 MyIdentifier, 'aaa' AS author UNION ALL
SELECT '2022-10-10 01:17:14 UTC' MyDate, 'Finished' StateTransition, 6371 MyIdentifier, 'bbb' AS author UNION ALL
SELECT '2022-09-26 04:51:40 UTC' MyDate, 'Start' StateTransition, 7768 MyIdentifier, 'ccc' AS author UNION ALL
SELECT '2022-10-05 03:44:32 UTC' MyDate, 'Finished' StateTransition, 7768 MyIdentifier, 'ccc' AS author
)
SELECT MyIdentifier,
ARRAY_AGG(author ORDER BY MyDate LIMIT 1)[SAFE_OFFSET(0)] AS author,
MIN(MyDate) AS StartTransition,
MAX(MyDate) AS FinishedTransition,
TIMESTAMP_DIFF(MAX(MyDate), MIN(MyDate), DAY) AS ElapsedTime,
FROM sample_table
WHERE (StateTransition="Start" OR StateTransition="Finished")
GROUP BY 1;
If the Start and Finished rows have different author names and you want the name from the Finished row, you can use the line below instead.
ARRAY_AGG(author ORDER BY MyDate DESC LIMIT 1)[SAFE_OFFSET(0)] AS author,

For the flattened result, you might consider the query below, which uses an aggregation.
WITH sample_table AS (
SELECT TIMESTAMP '2022-09-23 00:08:00 UTC' MyDate, 'Start' StateTransition, 6371 MyIdentifier UNION ALL
SELECT '2022-10-10 01:17:14 UTC' MyDate, 'Finished' StateTransition, 6371 MyIdentifier UNION ALL
SELECT '2022-09-26 04:51:40 UTC' MyDate, 'Start' StateTransition, 7768 MyIdentifier UNION ALL
SELECT '2022-10-05 03:44:32 UTC' MyDate, 'Finished' StateTransition, 7768 MyIdentifier
)
SELECT MyIdentifier,
MIN(MyDate) AS StartTransition,
MAX(MyDate) AS FinishedTransition,
TIMESTAMP_DIFF(MAX(MyDate), MIN(MyDate), DAY) AS ElapsedTime,
FROM sample_table
WHERE (StateTransition="Start" OR StateTransition="Finished")
GROUP BY 1;
But for the intermediate result, we need a window function.
SELECT *,
IF(
StateTransition = 'Finished',
TIMESTAMP_DIFF(MyDate, FIRST_VALUE(IF(StateTransition = 'Start', MyDate, NULL) IGNORE NULLS) OVER w, DAY),
NULL
) AS ElapsedTime
FROM sample_table
WINDOW w AS (PARTITION BY MyIdentifier ORDER BY MyDate);
If you want the flattened result from the window-function version above, the query looks like the one below, which returns the same result as the first query using an aggregation.
SELECT MyIdentifier,
FIRST_VALUE(IF(StateTransition = 'Start', MyDate, NULL) IGNORE NULLS) OVER w AS StartTransition,
MyDate AS FinishedTransition,
IF(
StateTransition = 'Finished',
TIMESTAMP_DIFF(MyDate, FIRST_VALUE(IF(StateTransition = 'Start', MyDate, NULL) IGNORE NULLS) OVER w, DAY),
NULL
) AS ElapsedTime
FROM sample_table
QUALIFY StateTransition = 'Finished'
WINDOW w AS (PARTITION BY MyIdentifier ORDER BY MyDate);
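As a rough cross-check of the aggregation approach, here is a minimal Python sketch (not BigQuery code) that groups the question's four sample rows by MyIdentifier and measures the span from the earliest to the latest timestamp. It reports both whole days, which is what TIMESTAMP_DIFF(..., DAY) returns, and a fractional value derived from seconds. The helper names parse_utc and elapsed_by_identifier are made up for illustration.

```python
from datetime import datetime, timezone

# Sample rows copied from the question: (MyDate, StateTransition, MyIdentifier).
rows = [
    ("2022-09-23 00:08:00", "Start", 6371),
    ("2022-10-10 01:17:14", "Finished", 6371),
    ("2022-09-26 04:51:40", "Start", 7768),
    ("2022-10-05 03:44:32", "Finished", 7768),
]


def parse_utc(text):
    """Parse 'YYYY-MM-DD HH:MM:SS' as a UTC timestamp."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)


def elapsed_by_identifier(rows):
    """Return {identifier: (whole_days, fractional_days)} per identifier."""
    grouped = {}
    for my_date, state, identifier in rows:
        if state in ("Start", "Finished"):
            grouped.setdefault(identifier, []).append(parse_utc(my_date))
    result = {}
    for identifier, stamps in grouped.items():
        delta = max(stamps) - min(stamps)  # earliest to latest, like MIN/MAX
        whole_days = delta.days            # truncated to whole days
        fractional_days = round(delta.total_seconds() / 86400, 2)
        result[identifier] = (whole_days, fractional_days)
    return result


elapsed = elapsed_by_identifier(rows)
print(elapsed)  # {6371: (17, 17.05), 7768: (8, 8.95)}
```

In BigQuery itself a fractional value can be obtained the same way, by taking the difference in SECOND and dividing by 86400.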

Consider below approach
select *, timestamp_diff(Transition_Finished, Transition_Start, day) as ElapsedTime
from your_table
pivot (max(MyDate) Transition for StateTransition in ('Start', 'Finished'))
To test it against the sample data in your question, use the query below:
WITH your_table AS (
SELECT TIMESTAMP '2022-09-23 00:08:00 UTC' MyDate, 'Start' StateTransition, 6371 MyIdentifier UNION ALL
SELECT '2022-10-10 01:17:14 UTC' MyDate, 'Finished' StateTransition, 6371 MyIdentifier UNION ALL
SELECT '2022-09-26 04:51:40 UTC' MyDate, 'Start' StateTransition, 7768 MyIdentifier UNION ALL
SELECT '2022-10-05 03:44:32 UTC' MyDate, 'Finished' StateTransition, 7768 MyIdentifier
)
select *, timestamp_diff(Transition_Finished, Transition_Start, day) as ElapsedTime
from your_table
pivot (max(MyDate) Transition for StateTransition in ('Start', 'Finished'))

I think that for each MyIdentifier you should have only one start and one finish, so you can simply split and join:
WITH
ts AS ( SELECT * FROM <my-data-source> WHERE StateTransition = 'Start'),
tf AS ( SELECT * FROM <my-data-source> WHERE StateTransition = 'Finished')
SELECT
ts.MyIdentifier,
ts.MyDate StartTransition,
tf.MyDate FinishedTransition,
TIMESTAMP_DIFF(tf.MyDate, ts.MyDate, DAY) ElapsedTime
FROM ts
LEFT JOIN tf on ts.MyIdentifier = tf.MyIdentifier
UPDATE
If you have more than one start and finish for each identifier, you need to choose which ones to keep. I will assume you want to keep the first start and the last finish:
WITH
ts AS (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY MyIdentifier ORDER BY MyDate) n
FROM <my-data-source> WHERE StateTransition = 'Start'),
tf AS (
SELECT *,
ROW_NUMBER() over (partition by MyIdentifier ORDER BY MyDate DESC) n
FROM <my-data-source> WHERE StateTransition = 'Finished')
SELECT
ts.MyIdentifier,
ts.MyDate StartTransition,
tf.MyDate FinishedTransition,
TIMESTAMP_DIFF(tf.MyDate, ts.MyDate, DAY) ElapsedTime
FROM ts
LEFT JOIN tf on ts.MyIdentifier = tf.MyIdentifier and tf.n=1
WHERE ts.n=1

Related

ORACLE DATE FILTER NOT SHOWING CORRECT DATE RESULT

I have data as
ID MYDATE
1 2020-02-02 19:45:00:00
1 2020-02-02 20:00:00:00
I need to get data of only min_date.
So I have used query as (considering eastern and utc time zone)
SELECT ID, MYDATE
FROM MYTABLE
WHERE TO_CHAR(FROM_TZ(TO_TIMESTAMP(TO_CHAR(MYDATE,'YYYY-MM-DD HH24.MI.SS'),'YYYY-MM-DD HH24.MI.SS'), 'AMERICA/NEW_YORK') AT TIME ZONE 'UTC', 'YYYYMMDD') =
'20200202'
I get result for 2020-02-02 (which is expected)
1 2020-02-02 19:45:00:00
But when I run for date '20200203' I am getting
1 2020-02-02 20:00:00:00
which I should not be getting( I shouldn't be getting any results)
Help is appreciated!
With 2 rows of sample data it is ... challenging to give the solution that you are looking for. But here is a query that gives you the row with the lowest date for an id:
WITH tab (id, mydate)
AS
(
SELECT 1, TO_DATE('2020-02-02 19:45:00','YYYY-MM-DD HH24:MI:SS') FROM DUAL union
SELECT 1, TO_DATE('2020-02-02 20:00:00','YYYY-MM-DD HH24:MI:SS') FROM DUAL
),
ordered_dates AS
(
SELECT
id,
mydate,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY mydate) AS rn FROM tab
)
SELECT id, TO_CHAR(mydate,'YYYY-MM-DD HH24:MI:SS') FROM ordered_dates WHERE rn = 1;
On 2020-02-02 the AMERICA/NEW_YORK timezone is on standard time (UTC-5), so when you convert AMERICA/NEW_YORK to UTC, you get
the original time + 5 hours, i.e. 2020-02-02 20:00:00 becomes 2020-02-03 01:00:00 UTC:
WITH t (id, mydate)
AS
(
SELECT 1, TO_DATE('2020-02-02 19:45:00','YYYY-MM-DD HH24:MI:SS') FROM DUAL union
SELECT 2, TO_DATE('2020-02-02 20:00:00','YYYY-MM-DD HH24:MI:SS') FROM DUAL
)
select
TO_CHAR(MYDATE,'YYYY-MM-DD HH24.MI.SS') original_time,
FROM_TZ(
TO_TIMESTAMP(
TO_CHAR(MYDATE,'YYYY-MM-DD HH24.MI.SS')
,'YYYY-MM-DD HH24.MI.SS'
),
'AMERICA/NEW_YORK'
)
AT TIME ZONE 'UTC'
as UTC_TIME
from t;
ORIGINAL_TIME UTC_TIME
------------------- --------------------------------------
2020-02-02 19.45.00 2020-02-03 00:45:00.000000000 UTC
2020-02-02 20.00.00 2020-02-03 01:00:00.000000000 UTC
2 rows selected.
Also, you do not need to_timestamp(to_char(...)) in this case; it's easier to use CAST:
FROM_TZ(cast(MYDATE as timestamp),'AMERICA/NEW_YORK') AT TIME ZONE 'UTC'
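To see why both sample rows land on 2020-02-03 once converted, here is a small Python sketch of the same shift. It is only an illustration: it assumes a fixed UTC-5 offset in place of the real AMERICA/NEW_YORK zone (valid for this February date, not for summer dates), and to_utc is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC-5 offset stands in for America/New_York,
# which is on standard time (EST) on 2020-02-02.
EST = timezone(timedelta(hours=-5), "EST")
FMT = "%Y-%m-%d %H:%M:%S"


def to_utc(local_text):
    """Interpret the text as EST wall-clock time and convert it to UTC."""
    local = datetime.strptime(local_text, FMT).replace(tzinfo=EST)
    return local.astimezone(timezone.utc)


utc_times = [to_utc(t) for t in ("2020-02-02 19:45:00", "2020-02-02 20:00:00")]
# Same 'YYYYMMDD' rendering that the WHERE clause in the question compares against.
utc_days = [t.strftime("%Y%m%d") for t in utc_times]

for t in utc_times:
    print(t.strftime(FMT))
print(utc_days)  # ['20200203', '20200203']
```

Both rows therefore match a filter on '20200203' when the comparison is done on the UTC value.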

Get 1st Row of a particular event type in Bigquery?

Row EventType CloudId ts
1 stop 5201156607311872 2018-07-07 12:25:21 UTC
2 start 5201156607311872 2018-07-07 12:27:39 UTC
3 start 5201156607311872 2018-07-07 12:28:15 UTC
4 stop 5738776789778432 2018-07-07 12:28:54 UTC
5 stop 5201156607311872 2018-07-07 12:30:30 UTC
6 stop 5738776789778432 2018-07-07 12:37:45 UTC
7 stop 5738776789778432 2018-07-07 12:40:52 UTC
I have a table structure as above. I want to keep only the first event before the EventType of the row changes, i.e. rows 2 and 3 have the same EventType, so I need to remove row 3 from the table. Rows 4, 5, 6 and 7 have the same EventType, so I want to keep row 4 and remove rows 5, 6 and 7.
Use lag():
select t.*
from (select t.*,
lag(eventtype) over (order by row) as prev_eventtype
from t
) t
where prev_eventtype is null or prev_eventtype <> eventtype;
Below is for BigQuery Standard SQL
#standardSQL
SELECT * EXCEPT(prev_eventtype) FROM (
SELECT *, LAG(eventtype) OVER (ORDER BY ts) AS prev_eventtype
FROM `project.dataset.table`
)
WHERE prev_eventtype IS NULL OR prev_eventtype <> eventtype
You can test, play with above using dummy data from your question:
#standardSQL
WITH `project.dataset.table` AS (
SELECT 'stop' EventType, 5201156607311872 CloudId, TIMESTAMP '2018-07-07 12:25:21 UTC' ts UNION ALL
SELECT 'start', 5201156607311872, '2018-07-07 12:27:39 UTC' UNION ALL
SELECT 'start', 5201156607311872, '2018-07-07 12:28:15 UTC' UNION ALL
SELECT 'stop', 5738776789778432, '2018-07-07 12:28:54 UTC' UNION ALL
SELECT 'stop', 5201156607311872, '2018-07-07 12:30:30 UTC' UNION ALL
SELECT 'stop', 5738776789778432, '2018-07-07 12:37:45 UTC' UNION ALL
SELECT 'stop', 5738776789778432, '2018-07-07 12:40:52 UTC'
)
SELECT * EXCEPT(prev_eventtype) FROM (
SELECT *, LAG(eventtype) OVER (ORDER BY ts) AS prev_eventtype
FROM `project.dataset.table`
)
WHERE prev_eventtype IS NULL OR prev_eventtype <> eventtype
with result :
EventType CloudId ts
stop 5201156607311872 2018-07-07 12:25:21 UTC
start 5201156607311872 2018-07-07 12:27:39 UTC
stop 5738776789778432 2018-07-07 12:28:54 UTC
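The LAG-based filter can be mirrored in a few lines of Python, which may help when checking the expected rows by hand. This is a sketch over the question's seven sample rows; first_of_each_run is a hypothetical helper name, and sorting relies on the timestamps being ISO-formatted strings.

```python
# Sample rows from the question: (Row, EventType, CloudId, ts).
events = [
    (1, "stop", 5201156607311872, "2018-07-07 12:25:21"),
    (2, "start", 5201156607311872, "2018-07-07 12:27:39"),
    (3, "start", 5201156607311872, "2018-07-07 12:28:15"),
    (4, "stop", 5738776789778432, "2018-07-07 12:28:54"),
    (5, "stop", 5201156607311872, "2018-07-07 12:30:30"),
    (6, "stop", 5738776789778432, "2018-07-07 12:37:45"),
    (7, "stop", 5738776789778432, "2018-07-07 12:40:52"),
]


def first_of_each_run(events):
    """Keep a row only when its EventType differs from the previous row's."""
    kept = []
    previous_type = None  # plays the role of LAG() returning NULL on the first row
    for row in sorted(events, key=lambda r: r[3]):  # ORDER BY ts
        if row[1] != previous_type:
            kept.append(row)
        previous_type = row[1]
    return kept


kept_rows = first_of_each_run(events)
print([row[0] for row in kept_rows])  # [1, 2, 4]
```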
select
Row,
EventType,
CloudId,
ts
from
(
select
Row,
EventType,
CloudId,
ts,
row_number() over (partition by EventType order by CloudId,Row) as rnk
from table_name
)evnt where rnk=1
You can use a SELECT statement to just hide the unwanted rows:
select t.*
from table t
where t.row = (select min(t1.row) from table t1 where t1.EventType = t.EventType);

SQL join on timestamp differences from the same table

I'm not sure how to write this SQL query in BigQuery. I have a table of events with names and timestamps. Let's say I have only two events in the table: A and B. What I want to do is query the table to get all instances of event A, and get the next closest occurrence of B and create a new column with the time difference. B will always happen after A.
For example if I had a table that looks like:
A1 | 1:00 pm
B5 | 2:00 pm
A3 | 3:00 pm
B9 | 5:00 pm
My resultant table would be:
A1 | 1 hour
A3 | 2 hours
The query I came up with is the following:
SELECT
CAST(TIMESTAMP_DIFF((SELECT MIN(sub.time)
FROM table sub
WHERE sub.time > main.time), main.time, SECOND) AS INT64) duration
FROM table main
This works fine for getting the table I wanted above, but I would also like to include an additional column from the subquery. Something that looks like:
A1 | 1 hour | B5Column
A3 | 2 hours | B9Column
I attempted to use the query below:
SELECT
(SELECT
sub.SubQueryColumn
FROM table sub
WHERE sub.time > main.time
ORDER BY sub.time asc
LIMIT 1) SubColumn,
CAST(TIMESTAMP_DIFF((SELECT MIN(sub.time)
FROM table sub
WHERE sub.time > main.time), main.time, SECOND) AS INT64) duration
FROM table main
but it did not work. The error I get is
Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN.
Could I get some help with this?
Here is one method:
select m.*,
timestamp_diff(next_b_time, time, second) as duration
from (select m.*,
min(case when event like 'B%' then time end) over (order by time desc) as next_b_time
from main m
) m
where event like 'A%';
Below is for BigQuery Standard SQL
#standardSQL
SELECT event, TIMESTAMP_DIFF(b_time, time, SECOND) duration, b_event
FROM (
SELECT event, time,
LEAD(time) OVER(PARTITION BY grp ORDER BY time) b_time,
LEAD(event) OVER(PARTITION BY grp ORDER BY time) b_event
FROM (
SELECT *,
COUNTIF(STARTS_WITH(event, 'A')) OVER(ORDER BY time) grp
FROM `project.dataset.your_table` t
)
)
WHERE STARTS_WITH(event, 'A')
-- ORDER BY time
You can test / play with it using dummy data from your question as below
#standardSQL
WITH `project.dataset.your_table` AS (
SELECT 'A1' event, TIMESTAMP '2018-01-01 1:00:00' time UNION ALL
SELECT 'B5', TIMESTAMP '2018-01-01 2:00:00' UNION ALL
SELECT 'A3', TIMESTAMP '2018-01-01 3:00:00' UNION ALL
SELECT 'B9', TIMESTAMP '2018-01-01 5:00:00'
)
SELECT event, TIMESTAMP_DIFF(b_time, time, SECOND) duration, b_event
FROM (
SELECT event, time,
LEAD(time) OVER(PARTITION BY grp ORDER BY time) b_time,
LEAD(event) OVER(PARTITION BY grp ORDER BY time) b_event
FROM (
SELECT *,
COUNTIF(STARTS_WITH(event, 'A')) OVER(ORDER BY time) grp
FROM `project.dataset.your_table` t
)
)
WHERE STARTS_WITH(event, 'A')
ORDER BY time
with result as
Row event duration b_event
1 A1 3600 B5
2 A3 7200 B9
Please note: the above solution relies on the statement in your question that B will always happen after A, so if you have a sequence like the one below
WITH `project.dataset.your_table` AS (
SELECT 'A1' event, TIMESTAMP '2018-01-01 1:00:00' time UNION ALL
SELECT 'A2', TIMESTAMP '2018-01-01 1:30:00' UNION ALL
SELECT 'B5', TIMESTAMP '2018-01-01 2:00:00' UNION ALL
SELECT 'A3', TIMESTAMP '2018-01-01 3:00:00' UNION ALL
SELECT 'B9', TIMESTAMP '2018-01-01 5:00:00'
)
result will be
Row event duration b_event
1 A1 null null
2 A2 1800 B5
3 A3 7200 B9
If you need to address this - try below
#standardSQL
WITH `project.dataset.your_table` AS (
SELECT 'A1' event, TIMESTAMP '2018-01-01 1:00:00' time UNION ALL
SELECT 'A2', TIMESTAMP '2018-01-01 1:30:00' UNION ALL
SELECT 'B5', TIMESTAMP '2018-01-01 2:00:00' UNION ALL
SELECT 'A3', TIMESTAMP '2018-01-01 3:00:00' UNION ALL
SELECT 'B9', TIMESTAMP '2018-01-01 5:00:00'
)
SELECT event, TIMESTAMP_DIFF(b_time, time, SECOND) duration, b_event
FROM (
SELECT event, time, type, grp,
FIRST_VALUE(event) OVER(ORDER BY grp RANGE BETWEEN 1 FOLLOWING AND 1 FOLLOWING) b_event,
FIRST_VALUE(time) OVER(ORDER BY grp RANGE BETWEEN 1 FOLLOWING AND 1 FOLLOWING) b_time
FROM (
SELECT event, time, SUBSTR(event, 1, 1) type,
COUNTIF(STARTS_WITH(event, 'B')) OVER(ORDER BY time) grp
FROM `project.dataset.your_table` t
)
)
WHERE STARTS_WITH(event, 'A')
ORDER BY time
this version will return
Row event duration b_event
1 A1 3600 B5
2 A2 1800 B5
3 A3 7200 B9
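For comparison, the "next B after each A" pairing can be sketched in Python with a binary search over the sorted B timestamps. This is an illustrative cross-check on the five-row sample above rather than a translation of the SQL; pair_with_next_b is a hypothetical helper name.

```python
from bisect import bisect_right
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Sample data from the answer above, including the extra A2 event.
events = [
    ("A1", "2018-01-01 01:00:00"),
    ("A2", "2018-01-01 01:30:00"),
    ("B5", "2018-01-01 02:00:00"),
    ("A3", "2018-01-01 03:00:00"),
    ("B9", "2018-01-01 05:00:00"),
]


def pair_with_next_b(events):
    """For every A event, find the first B event strictly after it."""
    parsed = sorted((datetime.strptime(ts, FMT), name) for name, ts in events)
    b_events = [(t, name) for t, name in parsed if name.startswith("B")]
    b_times = [t for t, _ in b_events]
    result = []
    for t, name in parsed:
        if not name.startswith("A"):
            continue
        i = bisect_right(b_times, t)  # index of the first B later than t
        if i < len(b_events):
            b_time, b_name = b_events[i]
            result.append((name, int((b_time - t).total_seconds()), b_name))
        else:
            result.append((name, None, None))  # no B follows this A
    return result


pairs = pair_with_next_b(events)
print(pairs)  # [('A1', 3600, 'B5'), ('A2', 1800, 'B5'), ('A3', 7200, 'B9')]
```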

How to combine multiple SELECTs into a single SELECT by a common column in (BigQuery) SQL?

I have multiple tables in BigQuery, and therefore multiple SQL statements that each give me "the number of X per day". For example:
SELECT FORMAT_TIMESTAMP("%F",timestamp) AS day, COUNT(*) as installs
FROM database.table1
GROUP BY day
ORDER BY day ASC
Which would give the result:
| day | installs |
-------------------------
| 2017-01-01 | 11 |
| 2017-01-02 | 22 |
etc
Another statement:
SELECT FORMAT_TIMESTAMP("%F",timestamp) AS day, COUNT(*) as uninstalls
FROM database.table2
GROUP BY day
ORDER BY day ASC
Which would give the result:
| day | uninstalls |
---------------------------
| 2017-01-02 | 22 |
| 2017-01-03 | 33 |
etc
Another statement:
SELECT FORMAT_TIMESTAMP("%F",timestamp) AS day, COUNT(*) as cases
FROM database.table3
GROUP BY day
ORDER BY day ASC
Which would give the result:
| day | cases |
----------------------
| 2017-01-01 | 11 |
| 2017-01-03 | 33 |
etc
etc
Now I need to combine all these into a single SELECT statement that gives the following results:
| day | installs | uninstalls | cases |
----------------------------------------------
| 2017-01-01 | 11 | 0 | 11 |
| 2017-01-02 | 22 | 22 | 0 |
| 2017-01-03 | 0 | 33 | 33 |
etc
Is this even possible?
Or what's the closest SQL-statement I can write that would give me a similar result?
Any feedback is appreciated!
Here is a self-contained example that might help to get you started. It uses two dummy tables, InstallEvents and UninstallEvents, which contain timestamps for the respective actions. It creates a common table expression called StartAndEnd that computes the minimum and maximum dates for these events in order to decide which dates to aggregate over, then unions the contents of the InstallEvents and UninstallEvents, counting the events for each day.
WITH InstallEvents AS (
SELECT TIMESTAMP_ADD('2017-01-01 00:00:00', INTERVAL x HOUR) AS timestamp
FROM UNNEST(GENERATE_ARRAY(0, 100)) AS x
),
UninstallEvents AS (
SELECT TIMESTAMP_ADD('2017-01-02 00:00:00', INTERVAL 2 * x HOUR) AS timestamp
FROM UNNEST(GENERATE_ARRAY(0, 50)) AS x
),
StartAndEnd AS (
SELECT MIN(DATE(timestamp)) AS min_date, MAX(DATE(timestamp)) AS max_date
FROM (
SELECT * FROM InstallEvents UNION ALL
SELECT * FROM UninstallEvents
)
)
SELECT
day,
COUNTIF(is_install AND DATE(timestamp) = day) AS installs,
COUNTIF(NOT is_install AND DATE(timestamp) = day) AS uninstalls
FROM (
SELECT *, true AS is_install
FROM InstallEvents UNION ALL
SELECT *, false
FROM UninstallEvents
)
CROSS JOIN UNNEST(GENERATE_DATE_ARRAY(
(SELECT min_date FROM StartAndEnd),
(SELECT max_date FROM StartAndEnd)
)) AS day
GROUP BY day
ORDER BY day;
If you know what the start and end dates are in advance, you can hard-code them in the query instead and then omit the StartAndEnd CTE:
WITH InstallEvents AS (
SELECT TIMESTAMP_ADD('2017-01-01 00:00:00', INTERVAL x HOUR) AS timestamp
FROM UNNEST(GENERATE_ARRAY(0, 100)) AS x
),
UninstallEvents AS (
SELECT TIMESTAMP_ADD('2017-01-02 00:00:00', INTERVAL 2 * x HOUR) AS timestamp
FROM UNNEST(GENERATE_ARRAY(0, 50)) AS x
)
SELECT
day,
COUNTIF(is_install AND DATE(timestamp) = day) AS installs,
COUNTIF(NOT is_install AND DATE(timestamp) = day) AS uninstalls
FROM (
SELECT *, true AS is_install
FROM InstallEvents UNION ALL
SELECT *, false
FROM UninstallEvents
)
CROSS JOIN UNNEST(GENERATE_DATE_ARRAY('2017-01-01', '2017-01-04')) AS day
GROUP BY day
ORDER BY day;
To see the events in the sample data, use a query that unions the contents:
WITH InstallEvents AS (
SELECT TIMESTAMP_ADD('2017-01-01 00:00:00', INTERVAL x HOUR) AS timestamp
FROM UNNEST(GENERATE_ARRAY(0, 100)) AS x
),
UninstallEvents AS (
SELECT TIMESTAMP_ADD('2017-01-02 00:00:00', INTERVAL 2 * x HOUR) AS timestamp
FROM UNNEST(GENERATE_ARRAY(0, 50)) AS x
)
SELECT timestamp, true AS is_install
FROM InstallEvents UNION ALL
SELECT timestamp, false
FROM UninstallEvents;
Below is for BigQuery Standard SQL
#standardSQL
WITH calendar AS (
SELECT day
FROM (
SELECT MIN(min_day) AS min_day, MAX(max_day) AS max_day
FROM (
SELECT MIN(DATE(timestamp)) AS min_day, MAX(DATE(timestamp)) AS max_day FROM `database.table1` UNION ALL
SELECT MIN(DATE(timestamp)) AS min_day, MAX(DATE(timestamp)) AS max_day FROM `database.table2` UNION ALL
SELECT MIN(DATE(timestamp)) AS min_day, MAX(DATE(timestamp)) AS max_day FROM `database.table3`
)
), UNNEST(GENERATE_DATE_ARRAY(min_day, max_day, INTERVAL 1 DAY)) AS day
)
SELECT
c.day AS day,
IFNULL(SUM(installs), 0) AS installs,
IFNULL(SUM(uninstalls), 0) AS uninstalls,
IFNULL(SUM(cases),0) AS cases
FROM calendar AS c
LEFT JOIN (SELECT DATE(timestamp) day, COUNT(1) installs FROM `database.table1` GROUP BY day) t1 ON t1.day = c.day
LEFT JOIN (SELECT DATE(timestamp) day, COUNT(1) uninstalls FROM `database.table2` GROUP BY day) t2 ON t2.day = c.day
LEFT JOIN (SELECT DATE(timestamp) day, COUNT(1) cases FROM `database.table3` GROUP BY day) t3 ON t3.day = c.day
GROUP BY day
HAVING installs + uninstalls + cases > 0
-- ORDER BY day
Please note: you are using timestamp as a column name, which is not best practice because it is a keyword. In my example I kept your naming, but consider changing it!
You can test / play with this solution using the dummy data below
#standardSQL
WITH `database.table1` AS (
SELECT TIMESTAMP '2017-01-01' AS timestamp, 1 AS installs
UNION ALL SELECT TIMESTAMP '2017-01-01', 22
),
`database.table2` AS (
SELECT TIMESTAMP '2016-12-01' AS timestamp, 1 AS installs UNION ALL SELECT TIMESTAMP '2017-01-01', 22 UNION ALL SELECT TIMESTAMP '2017-01-01', 22 UNION ALL
SELECT TIMESTAMP '2017-01-02', 22 UNION ALL SELECT TIMESTAMP '2017-01-02', 22 UNION ALL SELECT TIMESTAMP '2017-01-02', 22 UNION ALL SELECT TIMESTAMP '2017-01-02', 22 UNION ALL SELECT TIMESTAMP '2017-01-02', 22
),
`database.table3` AS (
SELECT TIMESTAMP '2017-01-01' AS timestamp, 1 AS installs UNION ALL SELECT TIMESTAMP '2017-01-01', 22 UNION ALL SELECT TIMESTAMP '2017-01-01', 22 UNION ALL
SELECT TIMESTAMP '2017-01-10', 22 UNION ALL SELECT TIMESTAMP '2017-01-02', 22 UNION ALL SELECT TIMESTAMP '2017-01-02', 22 UNION ALL SELECT TIMESTAMP '2017-01-02', 22 UNION ALL SELECT TIMESTAMP '2017-01-02', 22
),
calendar AS (
SELECT day
FROM (
SELECT MIN(min_day) AS min_day, MAX(max_day) AS max_day
FROM (
SELECT MIN(DATE(timestamp)) AS min_day, MAX(DATE(timestamp)) AS max_day FROM `database.table1` UNION ALL
SELECT MIN(DATE(timestamp)) AS min_day, MAX(DATE(timestamp)) AS max_day FROM `database.table2` UNION ALL
SELECT MIN(DATE(timestamp)) AS min_day, MAX(DATE(timestamp)) AS max_day FROM `database.table3`
)
), UNNEST(GENERATE_DATE_ARRAY(min_day, max_day, INTERVAL 1 DAY)) AS day
)
SELECT
c.day AS day,
IFNULL(SUM(installs), 0) AS installs,
IFNULL(SUM(uninstalls), 0) AS uninstalls,
IFNULL(SUM(cases),0) AS cases
FROM calendar AS c
LEFT JOIN (SELECT DATE(timestamp) day, COUNT(1) installs FROM `database.table1` GROUP BY day) t1 ON t1.day = c.day
LEFT JOIN (SELECT DATE(timestamp) day, COUNT(1) uninstalls FROM `database.table2` GROUP BY day) t2 ON t2.day = c.day
LEFT JOIN (SELECT DATE(timestamp) day, COUNT(1) cases FROM `database.table3` GROUP BY day) t3 ON t3.day = c.day
GROUP BY day
HAVING installs + uninstalls + cases > 0
ORDER BY day
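The shape of the combined result can also be sketched in Python: take the union of all days that appear in any per-day count and fill the missing counts with 0. This is a minimal illustration using the per-day counts shown in the question, not the raw tables; combine_daily_counts is a hypothetical helper name.

```python
# Per-day counts as shown in the question's three result tables.
installs = {"2017-01-01": 11, "2017-01-02": 22}
uninstalls = {"2017-01-02": 22, "2017-01-03": 33}
cases = {"2017-01-01": 11, "2017-01-03": 33}


def combine_daily_counts(**tables):
    """Return one (day, count, count, ...) tuple per day, with 0 for missing days."""
    days = sorted(set().union(*(counts.keys() for counts in tables.values())))
    return [
        (day, *(counts.get(day, 0) for counts in tables.values()))
        for day in days
    ]


combined = combine_daily_counts(installs=installs, uninstalls=uninstalls, cases=cases)
for row in combined:
    print(row)
# ('2017-01-01', 11, 0, 11)
# ('2017-01-02', 22, 22, 0)
# ('2017-01-03', 0, 33, 33)
```

Like the HAVING clause in the query above, this only lists days on which at least one table has a row.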
I am not very familiar with BigQuery, so this is probably not going to be a copy-paste answer.
You'll first have to build a calendar table to make sure you have all dates. Here's an example for SQL Server. There are probably examples for BigQuery available as well. The following assumes a Calendar table with a Date attribute stored as a timestamp.
Once you have your calendar table you can join all your tables to that:
SELECT FORMAT_TIMESTAMP("%F",C.Date) AS day
, COUNT(DATE(T1.TIMESTAMP)) AS installs --Here you could also use your FORMAT_TIMESTAMP
, COUNT(DATE(T2.TIMESTAMP)) AS uninstalls
FROM Calendar C
LEFT JOIN database.table1 T1
ON DATE(T1.TIMESTAMP) = DATE(C.Date) --Convert to date to remove times, you could also use your FORMAT_TIMESTAMP
LEFT JOIN database.table2 T2
ON DATE(T2.TIMESTAMP) = DATE(C.Date)
GROUP BY day
ORDER BY day ASC

Getting minimum value at a given time in SQL

I have the following SQL table:
start_time end_time value
2016-01-01 00:00:00 2016-01-01 08:59:59 1
2016-01-01 06:00:00 2016-01-01 14:59:59 2
2016-01-01 12:00:00 2016-01-01 17:59:59 1.5
2016-01-01 03:00:00 2016-01-01 17:59:59 3
I want to convert it into:
start_time end_time min_value
2016-01-01 00:00:00 2016-01-01 08:59:59 1
2016-01-01 09:00:00 2016-01-01 11:59:59 2
2016-01-01 12:00:00 2016-01-01 17:59:59 1.5
where min_value is the minimum value at a given point in time. Is it possible to do this in SQL?
Try below. I think it does exactly what you asked
As you can see - I added one more entry in your example to make it a little spicier :o)
WITH YourTable AS (
SELECT TIMESTAMP '2016-01-01 00:00:00' AS start_time, TIMESTAMP '2016-01-01 08:59:59' AS end_time, 1 AS value UNION ALL
SELECT TIMESTAMP '2016-01-01 06:00:00' AS start_time, TIMESTAMP '2016-01-01 14:59:59' AS end_time, 2 AS value UNION ALL
SELECT TIMESTAMP '2016-01-01 12:00:00' AS start_time, TIMESTAMP '2016-01-01 17:59:59' AS end_time, 1.5 AS value UNION ALL
SELECT TIMESTAMP '2016-01-01 03:00:00' AS start_time, TIMESTAMP '2016-01-01 17:59:59' AS end_time, 3 AS value UNION ALL
SELECT TIMESTAMP '2016-01-01 12:30:00' AS start_time, TIMESTAMP '2016-01-01 12:40:59' AS end_time, 1 AS value
),
Intervals AS (
SELECT iStart AS start_time, LEAD(iStart) OVER(ORDER BY iStart) AS end_time
FROM (
SELECT DISTINCT iStart FROM (
SELECT start_time AS iStart FROM YourTable UNION ALL
SELECT end_time AS iStart FROM YourTable )
)
),
Intervals_Mins AS (
SELECT b.start_time, b.end_time, MIN(value) AS min_value
FROM YourTable AS a
JOIN Intervals AS b
ON b.start_time BETWEEN a.start_time AND a.end_time
AND b.end_time BETWEEN a.start_time AND a.end_time
GROUP BY b.start_time, b.end_time
),
Intervals_Group AS (
SELECT start_time, end_time, min_value, IFNULL(SUM(flag) OVER(PARTITION BY CAST(min_value AS STRING) ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING), 0) AS time_group
FROM (
SELECT start_time, end_time, min_value, IF(end_time = LEAD(start_time) OVER(PARTITION BY CAST(min_value AS STRING) ORDER BY start_time), 0, 1) AS flag
FROM Intervals_Mins
)
)
SELECT MIN(start_time) AS start_time, MAX(end_time) AS end_time, min_value
FROM Intervals_Group
GROUP BY min_value, time_group
-- ORDER BY start_time
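The idea of cutting the timeline at every start and end, taking the minimum per piece, and merging neighbours with the same minimum can be sketched in Python as a cross-check. This is an illustration on the question's four original rows; it assumes end times are inclusive with 1-second resolution, and min_value_timeline is a hypothetical helper name.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"
ONE_SECOND = timedelta(seconds=1)

# The four rows from the question: (start_time, end_time, value).
rows = [
    ("2016-01-01 00:00:00", "2016-01-01 08:59:59", 1),
    ("2016-01-01 06:00:00", "2016-01-01 14:59:59", 2),
    ("2016-01-01 12:00:00", "2016-01-01 17:59:59", 1.5),
    ("2016-01-01 03:00:00", "2016-01-01 17:59:59", 3),
]


def min_value_timeline(rows):
    """Return merged (start, end, min_value) pieces covering the input intervals."""
    # Assumption: end times are inclusive, so add one second to make them exclusive.
    spans = [
        (datetime.strptime(s, FMT), datetime.strptime(e, FMT) + ONE_SECOND, v)
        for s, e, v in rows
    ]
    cuts = sorted({t for s, e, _ in spans for t in (s, e)})
    merged = []
    for left, right in zip(cuts, cuts[1:]):
        active = [v for s, e, v in spans if s <= left and right <= e]
        if not active:
            continue  # nothing covers this piece of the timeline
        low = min(active)
        if merged and merged[-1][2] == low and merged[-1][1] == left:
            merged[-1] = (merged[-1][0], right, low)  # extend the previous piece
        else:
            merged.append((left, right, low))
    # Convert the exclusive ends back to inclusive ones for display.
    return [(s.strftime(FMT), (e - ONE_SECOND).strftime(FMT), v) for s, e, v in merged]


timeline = min_value_timeline(rows)
for piece in timeline:
    print(piece)
# ('2016-01-01 00:00:00', '2016-01-01 08:59:59', 1)
# ('2016-01-01 09:00:00', '2016-01-01 11:59:59', 2)
# ('2016-01-01 12:00:00', '2016-01-01 17:59:59', 1.5)
```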
Hmmm . . . This seems hard. I think the following strategy will work:
Break the data into two parts, for start times and end times.
For each start time calculate the minimum value in effect at that time.
For each end time, calculate the minimum value in effect starting at that time.
Recombine using a gaps-and-islands approach
I'm just not 100% sure you can do this in BQ, because it involves non-equijoins. But . . .
with starts as (
select start_time as time,
(select min(t2.value)
from t t2
where t.start_time between t2.start_time and t2.end_time
) as value
from t
),
ends as (
select end_time as time,
(select min(t2.value)
from t t2
where t2.end_time > t.end_time and
t2.start_time <= t.end_time
) as value
from t
)
select value, min(time), max(time)
from (select time, value,
row_number() over (order by time) as seqnum,
row_number() over (partition by value order by time) as seqnum_v
from ((select s.* from starts s) union all
(select e.* from ends e)
) t
) t
group by value, (seqnum - seqnum_v);
I'm not sure that I understand how the expected output relates to the input, but if you just want to associate the minimum value with distinct (start_time, end_time) pairs, you can do e.g.:
#standardSQL
WITH T AS (
SELECT TIMESTAMP '2016-01-01 00:00:00' AS start_time,
TIMESTAMP '2016-01-01 08:59:59' AS end_time, 1 AS value UNION ALL
SELECT TIMESTAMP '2016-01-01 06:00:00',
TIMESTAMP '2016-01-01 14:59:59', 2 UNION ALL
SELECT TIMESTAMP '2016-01-01 12:00:00',
TIMESTAMP '2016-01-01 17:59:59', 1.5 UNION ALL
SELECT TIMESTAMP '2016-01-01 3:00:00',
TIMESTAMP '2016-01-01 17:59:59', 3
)
SELECT
start_time,
end_time,
MIN(value) AS min_value
FROM T
GROUP BY start_time, end_time;