Count overlapping intervals by ID in BigQuery - sql

I want to count how many overlapping intervals I have, per ID:
WITH table AS (
SELECT 1001 as id, 1 AS start_time, 10 AS end_time UNION ALL
SELECT 1001, 2, 5 UNION ALL
SELECT 1002, 3, 4 UNION ALL
SELECT 1003, 5, 8 UNION ALL
SELECT 1003, 6, 8 UNION ALL
SELECT 1001, 6, 20
)
In this case the desired result should be:
2 overlapping for ID=1001
1 overlapping for ID=1003
0 overlapping for ID=1002
TOT OVERLAPPING = 3
Whenever there is an overlap (even partial), I need to count it as such.
How can I achieve this in BigQuery?

Below is for BigQuery Standard SQL, and is a simple and quite straightforward approach: self-join, then check for and count overlaps.
#standardSQL
SELECT a.id,
COUNTIF(
a.start_time BETWEEN b.start_time AND b.end_time
OR a.end_time BETWEEN b.start_time AND b.end_time
OR b.start_time BETWEEN a.start_time AND a.end_time
OR b.end_time BETWEEN a.start_time AND a.end_time
) overlaps
FROM `project.dataset.table` a
LEFT JOIN `project.dataset.table` b
ON a.id = b.id AND TO_JSON_STRING(a) < TO_JSON_STRING(b)
GROUP BY id
Applied to the sample data in your question, it produces:
Row id overlaps
1 1001 2
2 1002 0
3 1003 1
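The self-join logic above can be sanity-checked locally. Below is a minimal sketch using Python's sqlite3; `COUNTIF` becomes `SUM(CASE ...)` and the `TO_JSON_STRING` tie-break becomes a `rowid` comparison — both substitutions are artifacts of the port, not BigQuery features.

```python
# Sketch: the self-join overlap count ported to SQLite for local testing.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE intervals (id INTEGER, start_time INTEGER, end_time INTEGER);
INSERT INTO intervals VALUES
  (1001, 1, 10), (1001, 2, 5), (1002, 3, 4),
  (1003, 5, 8), (1003, 6, 8), (1001, 6, 20);
""")

rows = conn.execute("""
SELECT a.id,
       SUM(CASE WHEN a.start_time BETWEEN b.start_time AND b.end_time
                  OR a.end_time   BETWEEN b.start_time AND b.end_time
                  OR b.start_time BETWEEN a.start_time AND a.end_time
                  OR b.end_time   BETWEEN a.start_time AND a.end_time
           THEN 1 ELSE 0 END) AS overlaps
FROM intervals a
LEFT JOIN intervals b
  ON a.id = b.id AND a.rowid < b.rowid  -- rowid stands in for TO_JSON_STRING
GROUP BY a.id
ORDER BY a.id
""").fetchall()
print(rows)  # [(1001, 2), (1002, 0), (1003, 1)]
```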
Another option (avoiding the self-join in favor of analytic functions):
#standardSQL
SELECT id,
SUM((SELECT COUNT(1) FROM y.arr x
WHERE y.start_time BETWEEN x.start_time AND x.end_time
OR y.end_time BETWEEN x.start_time AND x.end_time
OR x.start_time BETWEEN y.start_time AND y.end_time
OR x.end_time BETWEEN y.start_time AND y.end_time
)) overlaps
FROM (
SELECT id, start_time, end_time,
ARRAY_AGG(STRUCT(start_time, end_time))
OVER(PARTITION BY id ORDER BY TO_JSON_STRING(t)
ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING
) arr
FROM `project.dataset.table` t
) y
GROUP BY id
Obviously with the same result/output as the previous version.

The logic for all overlaps compares the start and end times:
SELECT t1.id,
COUNTIF(t1.start_time < t2.end_time AND t2.start_time < t1.end_time) as num_overlaps
FROM `project.dataset.table` t1 LEFT JOIN
`project.dataset.table` t2
ON t1.id = t2.id
GROUP BY t1.id;
That is not exactly what you want, because this compares every interval to every other interval, including itself. Removing the "same" one basically requires a unique identifier. We can get this using row_number().
Further, you don't seem to want to count overlaps twice. So:
with t as (
select t.*, row_number() over (partition by id order by start_time) as seqnum
from `project.dataset.table` t
)
SELECT t1.id,
COUNTIF(t1.start_time < t2.end_time AND t2.start_time < t1.end_time) as num_overlaps
FROM t t1 LEFT JOIN
t t2
ON t1.id = t2.id AND t1.seqnum < t2.seqnum
GROUP BY t1.id;
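As a sketch, the seqnum variant can be reproduced in SQLite (3.25+ for window functions); the in-memory table is a stand-in for `project.dataset.table`, and the strict-inequality overlap predicate is the standard interval test:

```python
# Sketch: row_number()-based dedup so each overlapping pair is counted once.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE intervals (id INTEGER, start_time INTEGER, end_time INTEGER);
INSERT INTO intervals VALUES
  (1001, 1, 10), (1001, 2, 5), (1002, 3, 4),
  (1003, 5, 8), (1003, 6, 8), (1001, 6, 20);
""")

rows = conn.execute("""
WITH t AS (
  SELECT id, start_time, end_time,
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY start_time) AS seqnum
  FROM intervals
)
SELECT t1.id,
       SUM(CASE WHEN t1.start_time < t2.end_time
                 AND t2.start_time < t1.end_time
           THEN 1 ELSE 0 END) AS num_overlaps
FROM t t1
LEFT JOIN t t2
  ON t1.id = t2.id AND t1.seqnum < t2.seqnum  -- each pair compared once
GROUP BY t1.id
ORDER BY t1.id
""").fetchall()
print(rows)  # [(1001, 2), (1002, 0), (1003, 1)]
```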

Related

SQL Optimization: multiplication of two calculated field generated by window functions

Given two time-series tables tbl1(time, b_value) and tbl2(time, u_value).
https://www.db-fiddle.com/f/4qkFJZLkZ3BK2tgN4ycCsj/1
Suppose we want to find the last value of u_value in each day, the daily cumulative sum of b_value on that day, as well as their multiplication, i.e. daily_u_value * b_value_cum_sum.
The following query calculates the desired output:
WITH cte AS (
SELECT
t1.time,
t1.b_value,
t2.u_value * t1.b_value AS bu_value,
last_value(t2.u_value)
OVER
(PARTITION BY DATE_TRUNC('DAY', t1.time) ORDER BY DATE_TRUNC('DAY', t2.time) ) AS daily_u_value
FROM stackoverflow.tbl1 t1
LEFT JOIN stackoverflow.tbl2 t2
ON
t1.time = t2.time
)
SELECT
DATE_TRUNC('DAY', c.time) AS time,
AVG(c.daily_u_value) AS daily_u_value,
SUM( SUM(c.b_value)) OVER (ORDER BY DATE_TRUNC('DAY', c.time) ) as b_value_cum_sum,
AVG(c.daily_u_value) * SUM( SUM(c.b_value) ) OVER (ORDER BY DATE_TRUNC('DAY', c.time) ) as daily_u_value_mul_b_value
FROM cte c
GROUP BY 1
ORDER BY 1 DESC
I was wondering what I can do to optimize this query? Is there any alternative solution that generates the same result?
db fiddle demo
Your query: Execution Time: 250.666 ms; my query: Execution Time: 205.103 ms.
So there is some progress. The main gain is reducing the time spent on casts, since your query casts from timestamptz to timestamp many times. I wonder why not just add another date column.
I executed my query first and then yours, which makes the comparison fair, since a second execution is generally faster than the first.
alter table tbl1 add column t1_date date;
alter table tbl2 add column t2_date date;
update tbl1 set t1_date = time::date;
update tbl2 set t2_date = time::date;
WITH cte AS (
SELECT
t1.t1_date,
t1.b_value,
t2.u_value * t1.b_value AS bu_value,
last_value(t2.u_value)
OVER
(PARTITION BY t1_date ORDER BY t2_date ) AS daily_u_value
FROM stackoverflow.tbl1 t1
LEFT JOIN stackoverflow.tbl2 t2
ON
t1.time = t2.time
)
SELECT
t1_date,
AVG(c.daily_u_value) AS daily_u_value,
SUM( SUM(c.b_value)) OVER (ORDER BY t1_date ) as b_value_cum_sum,
AVG(c.daily_u_value) * SUM( SUM(c.b_value) ) OVER
(ORDER BY t1_date ) as daily_u_value_mul_b_value
FROM cte c
GROUP BY 1
ORDER BY 1 DESC
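The daily-rollup-plus-running-total shape can also be written with the daily aggregate in a subquery, which avoids the nested `SUM(SUM(...)) OVER`. A minimal sketch in SQLite with toy data (table name and values are hypothetical, not from the fiddle):

```python
# Sketch: per-day sums in a subquery, then a running total over the days.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tbl1 (time TEXT, b_value REAL);
INSERT INTO tbl1 VALUES
  ('2023-01-01 09:00', 1.0), ('2023-01-01 15:00', 2.0),
  ('2023-01-02 10:00', 3.0), ('2023-01-02 18:00', 4.0);
""")

rows = conn.execute("""
SELECT day,
       daily_b,
       SUM(daily_b) OVER (ORDER BY day) AS b_value_cum_sum
FROM (
  SELECT DATE(time) AS day, SUM(b_value) AS daily_b
  FROM tbl1
  GROUP BY DATE(time)
)
ORDER BY day
""").fetchall()
print(rows)  # [('2023-01-01', 3.0, 3.0), ('2023-01-02', 7.0, 10.0)]
```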

Calculating difference in rows for many columns in SQL (Access)

What's up guys. I have another question about using SQL for analysis. I have a table built like this.
ID Date Value
1 31.01.2019 10
1 30.01.2019 5
2 31.01.2019 20
2 30.01.2019 10
3 31.01.2019 30
3 30.01.2019 20
With many different IDs and many different dates. What I would like as output is an additional column that gives me the difference from the previous date for each ID, so that I can analyze the change in values between days for each category (ID). To do that, I need to avoid having the command compute the difference of the last day WHERE ID = 1 minus the first day WHERE ID = 2.
Desired Output:
ID Date Difference to previous Days
1 31.01.2019 5
2 31.01.2019 10
3 31.01.2019 10
In the end I want to find outliers, i.e. days where the difference in value between two days is very large. Does anyone have a solution? If it is not possible with Access, I am open to solutions in Excel, but Access should be the first choice as it is more scalable.
Greetings and thanks in advance!!
With a self join:
select t1.ID, t1.[Date],
t1.[Value] - t2.[Value] as [Difference to previous Day]
from tablename t1 inner join tablename t2
on t2.[ID] = t1.[ID] and t2.[Date] = t1.[Date] - 1
Results:
ID Date Difference to previous Day
1 31/1/2019 5
2 31/1/2019 10
3 31/1/2019 10
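A rough equivalent of this self-join, sketched in SQLite with ISO date strings (the table and column names `vals`, `d`, `v` are mine, chosen to sidestep reserved words):

```python
# Sketch: join each row to the same id's row for the previous calendar day.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE vals (id INTEGER, d TEXT, v INTEGER);
INSERT INTO vals VALUES
  (1, '2019-01-31', 10), (1, '2019-01-30', 5),
  (2, '2019-01-31', 20), (2, '2019-01-30', 10),
  (3, '2019-01-31', 30), (3, '2019-01-30', 20);
""")

rows = conn.execute("""
SELECT t1.id, t1.d, t1.v - t2.v AS diff
FROM vals t1
JOIN vals t2
  ON t2.id = t1.id AND t2.d = DATE(t1.d, '-1 day')  -- previous day, same id
ORDER BY t1.id
""").fetchall()
print(rows)  # [(1, '2019-01-31', 5), (2, '2019-01-31', 10), (3, '2019-01-31', 10)]
```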
Edit.
For the case that there are gaps between your dates:
select
t1.ID, t1.[Date], t1.[Value] - t2.[Value] as [Difference to previous Day]
from (
select t.ID, t.[Date], t.[Value],
(select max(tt.[Date]) from tablename as tt where ID = t.ID and tt.[Date] < t.[Date]) as prevdate
from tablename as t
) as t1 inner join tablename as t2
on t2.ID = t1.ID and t2.[Date] = t1.prevdate
In your example data, each id has the same two rows and the values are increasing. If this is generally true, then you can simply use aggregation:
select id, max(date), max(value) - min(value)
from t
group by id;
If the values might not be increasing, but the dates are the same, then you can use conditional aggregation:
select id,
max(date),
(max(iif(date = "31.01.2019", value, null)) -
max(iif(date = "30.01.2019", value, null))
) as diff
from t
group by id;
Note: Your date looks like it is using a bespoke format, so I am just doing the comparison as a string.
If previous date is exactly one day before, you can use a join:
select t.*,
(t.value - tprev.value) as diff
from t left join
t as tprev
on t.id = tprev.id and t.date = dateadd("d", 1, tprev.date);
If date is arbitrarily the previous date in the table, then you can use a correlated subquery
select t.*,
(t.value -
(select top (1) tprev.value
from t as tprev
where tprev.id = t.id and tprev.date < t.date
order by tprev.date desc
)
) as diff
from t;
You can use a self join with an additional condition using a sub-query to determine the previous date
SELECT t.ID, t.Date, t.Value - prev.Value AS Diff
FROM
dtvalues AS t
INNER JOIN dtvalues AS prev
ON t.ID = prev.ID
WHERE
prev.[Date] = (SELECT MAX(x.[Date]) FROM dtvalues x WHERE x.ID=t.ID AND x.[Date]<t.[Date])
ORDER BY t.ID, t.[Date];
You could also move the WHERE condition into the join condition, but then the query designer would no longer be able to handle the query. As written, you can still edit the query in the query designer.

Minimum difference between dates in the same column in Redshift

I have data like this:
person_id date1
1 2016-08-03
1 2016-08-04
1 2016-08-07
What I want as a result is the minimum difference between all dates per person_id; in this case the minimum difference is 1 day (between 8/3 and 8/4).
Is there a way to query for this grouped by person_id in Redshift?
Thanks!
I assume you want this for each person. If so, use lag() or lead() and aggregation:
select person_id, min(next_date1 - date1)
from (select t.*,
lead(date1) over (partition by person_id order by date1) as next_date1
from t
) t
group by person_id;
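A sketch of this lead()-and-aggregate approach in SQLite (Redshift's native date subtraction becomes JULIANDAY arithmetic here, which is an artifact of the port):

```python
# Sketch: lead() pairs each date with the next one, then MIN over the gaps.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (person_id INTEGER, date1 TEXT);
INSERT INTO t VALUES (1, '2016-08-03'), (1, '2016-08-04'), (1, '2016-08-07');
""")

rows = conn.execute("""
SELECT person_id,
       CAST(MIN(JULIANDAY(next_date1) - JULIANDAY(date1)) AS INTEGER) AS min_diff
FROM (
  SELECT person_id, date1,
         LEAD(date1) OVER (PARTITION BY person_id ORDER BY date1) AS next_date1
  FROM t
)
GROUP BY person_id
""").fetchall()
print(rows)  # [(1, 1)]
```

The last row per person has a NULL `next_date1`, which MIN silently ignores.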
A SELF JOIN should work for you. Try this way:
SELECT a.date1 - b.date1
FROM table1 a
JOIN table1 b
ON a.person_id = b.person_id
AND a.date1 <> b.date1
Where a.date1 - b.date1 > 0
ORDER BY a.date1 - b.date1 ASC
LIMIT 1
This one uses a self join to compare each date:
SELECT t1.person_id, MIN(datediff(t1.date1, t2.date1)) AS difference
FROM t t1
INNER JOIN t t2
ON t1.person_id = t2.person_id
AND t1.date1 > t2.date1
GROUP by t1.person_id
Tested here: http://sqlfiddle.com/#!9/1638f/1

Querying for a 'run' of consecutive columns in Postgres

I have a table:
create table table1 (event_id integer, event_time timestamp without time zone);
insert into table1 (event_id, event_time) values
(1, '2011-01-01 00:00:00'),
(2, '2011-01-01 00:00:15'),
(3, '2011-01-01 00:00:29'),
(4, '2011-01-01 00:00:58'),
(5, '2011-01-02 06:03:00'),
(6, '2011-01-02 06:03:09'),
(7, '2011-01-05 11:01:31'),
(8, '2011-01-05 11:02:15'),
(9, '2011-01-06 09:34:19'),
(10, '2011-01-06 09:34:41'),
(11, '2011-01-06 09:35:06');
I would like to construct a statement that given an event could return the length of the 'run' of events starting with that event. A run is defined by:
Two events are in a run together if they are within 30 seconds of one another.
If A and B are in a run together, and B and C are in a run together then A is in a run
with C.
However my query does not need to go backwards in time, so if I select on event 2, then only events 2, 3, and 4 should be counted as part of the run of events starting with 2, and 3 should be returned as the length of the run.
Any ideas? I'm stumped.
Here is the recursive-CTE solution. (Islands-and-gaps problems naturally lend themselves to recursive CTEs.)
WITH RECURSIVE runrun AS (
SELECT event_id, event_time
, event_time - ('30 sec'::interval) AS low_time
, event_time + ('30 sec'::interval) AS high_time
FROM table1
UNION
SELECT t1.event_id, t1.event_time
, LEAST ( rr.low_time, t1.event_time - ('30 sec'::interval) ) AS low_time
, GREATEST ( rr.high_time, t1.event_time + ('30 sec'::interval) ) AS high_time
FROM table1 t1
JOIN runrun rr ON t1.event_time >= rr.low_time
AND t1.event_time < rr.high_time
)
SELECT DISTINCT ON (event_id) *
FROM runrun rr
WHERE rr.event_time >= '2011-01-01 00:00:15'
AND rr.low_time <= '2011-01-01 00:00:15'
AND rr.high_time > '2011-01-01 00:00:15'
;
Result:
event_id | event_time | low_time | high_time
----------+---------------------+---------------------+---------------------
2 | 2011-01-01 00:00:15 | 2010-12-31 23:59:45 | 2011-01-01 00:00:45
3 | 2011-01-01 00:00:29 | 2010-12-31 23:59:45 | 2011-01-01 00:01:28
4 | 2011-01-01 00:00:58 | 2010-12-31 23:59:30 | 2011-01-01 00:01:28
(3 rows)
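The same recursive-CTE idea can be sketched in SQLite, using integer seconds instead of timestamps and scalar MIN/MAX in place of LEAST/GREATEST; the event times below are a compressed stand-in for the sample data, not the original values:

```python
# Sketch: each event starts with a +/-30s window; the recursion grows the
# window whenever another event falls inside it, merging runs together.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (event_id INTEGER, t INTEGER);  -- t: seconds
INSERT INTO events VALUES (1, 0), (2, 15), (3, 29), (4, 58), (5, 1000), (6, 1009);
""")

run_length = conn.execute("""
WITH RECURSIVE runrun AS (
  SELECT event_id, t, t - 30 AS low_t, t + 30 AS high_t FROM events
  UNION
  SELECT e.event_id, e.t,
         MIN(rr.low_t, e.t - 30), MAX(rr.high_t, e.t + 30)
  FROM events e
  JOIN runrun rr ON e.t >= rr.low_t AND e.t < rr.high_t
)
-- run starting at event 2 (t = 15): count events whose grown window covers it
SELECT COUNT(DISTINCT event_id) FROM runrun
WHERE t >= 15 AND low_t <= 15 AND high_t > 15
""").fetchone()[0]
print(run_length)  # 3  (events 2, 3, 4)
```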
Could look like this:
WITH x AS (
SELECT event_time
,row_number() OVER w AS rn
,lead(event_time) OVER w AS next_time
FROM table1
WHERE event_id >= <start_id>
WINDOW w AS (ORDER BY event_time, event_id)
)
SELECT COALESCE(
(SELECT x.rn
FROM x
WHERE (x.event_time + interval '30s') < x.next_time
ORDER BY x.rn
LIMIT 1)
,(SELECT count(*) FROM x)
) AS run_length
This version does not rely on a gap-less sequence of IDs, but on event_time only.
Identical event_time values are additionally sorted by event_id to be unambiguous.
Read about the window functions row_number() and lead(), and about CTEs (the WITH clause), in the manual.
Edit
If we cannot assume that a bigger event_id has a later (or equal) event_time, substitute this for the first WHERE clause:
WHERE event_time >= (SELECT event_time FROM table1 WHERE event_id = <start_id>)
Rows with the same event_time as the starting row but a smaller event_id will still be ignored.
In the special case of a single run lasting until the end, no gap is found and the subquery returns no row; COALESCE returns the count of all rows instead.
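This window-function version translates almost verbatim to SQLite; a sketch with the same integer-seconds toy data, starting at event_id 2:

```python
# Sketch: row_number + lead find the first gap > 30s after the start event;
# the run length is the rank of the row before that gap.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (event_id INTEGER, t INTEGER);
INSERT INTO events VALUES (1, 0), (2, 15), (3, 29), (4, 58), (5, 1000), (6, 1009);
""")

run_length = conn.execute("""
WITH x AS (
  SELECT t,
         ROW_NUMBER() OVER (ORDER BY t, event_id) AS rn,
         LEAD(t) OVER (ORDER BY t, event_id) AS next_t
  FROM events
  WHERE event_id >= 2
)
SELECT COALESCE(
  (SELECT rn FROM x WHERE t + 30 < next_t ORDER BY rn LIMIT 1),  -- first gap
  (SELECT COUNT(*) FROM x)                                       -- no gap at all
) AS run_length
""").fetchone()[0]
print(run_length)  # 3
```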
You can join a table onto itself with a date-difference condition. And since this is Postgres, a simple minus works.
This subquery will find all records that are 'start events', that is, all event records that do not have another event record occurring within 30 seconds before them:
(Select a.event_id, a.event_time from
(Select event_id, event_time from table1) a
left join
(select event_id, event_time from table1) b
on a.event_time - b.event_time < '00:00:30' and a.event_time - b.event_time > '00:00:00'
where b.event_time is null) startevent
With a few changes...same logic, except picking up an 'end' event:
(Select a.event_id, a.event_time from
(Select event_id, event_time from table1) a
left join
(select event_id, event_time from table1) b
on b.event_time - a.event_time < '00:00:30' and b.event_time - a.event_time > '00:00:00'
where b.event_time is null) end_event
Now we can join these together to associate which start event goes to which end event:
(Still writing... there are a couple of ways to go at this. I'm assuming linear ID numbers are only in the example, so you'll want to join each start event to the end event with the smallest positive difference in event times.)
Here's my end result... it nests quite a few subselects.
select a.start_id, case when a.event_id is null then t1.event_id::varchar else 'single event' end as end_id
from
(select start_event.event_id as start_id, start_event.event_time as start_time, last_event.event_id, min(end_event.event_time - start_event.event_time) as min_interval
from
(Select a.event_id, a.event_time from
(Select event_id, event_time from table1) a
left join
(select event_id, event_time from table1) b
on a.event_time - b.event_time < '00:00:30' and a.event_time - b.event_time > '00:00:00'
where b.event_time is null) start_event
inner join
(Select a.event_id, a.event_time from
(Select event_id, event_time from table1) a
left join
(select event_id, event_time from table1) b
on b.event_time - a.event_time < '00:00:30' and b.event_time - a.event_time > '00:00:00'
where b.event_time is null) end_event
on end_event.event_time > start_event.event_time
--check for only event
left join
(Select a.event_id, a.event_time from
(Select event_id, event_time from table1) a
left join
(select event_id, event_time from table1) b
on b.event_time - a.event_time < '00:00:30' and b.event_time - a.event_time > '00:00:00'
where b.event_time is null) last_event
on start_event.event_id = last_event.event_id
group by 1,2,3) a
left join table1 t1 on t1.event_time = a.start_time + a.min_interval
Results as start_id, end_id:
1;"4"
5;"6"
7;"single event"
8;"single event"
9;"11"
I had to use a third left join to pick out single events, as a way of detecting events that are both start and end events. The end result is in IDs and can be linked back to your original table if you want more than just the ID. Unsure how this solution will scale; if you've got millions of events it could be an issue.

Discrete Derivative in SQL

I've got sensor data in a table in the form:
Time Value
10 100
20 200
36 330
46 440
I'd like to pull the change in values for each time period. Ideally, I'd like to get:
Starttime Endtime Change
10 20 100
20 36 130
36 46 110
My SQL skills are pretty rudimentary, so my inclination is to pull all the data out to a script that processes it and then push it back to the new table, but I thought I'd ask if there was a slick way to do this all in the database.
Select a.Time as StartTime
, b.time as EndTime
, b.time-a.time as TimeChange
, b.value-a.value as ValueChange
FROM YourTable a
Left outer Join YourTable b ON b.time>a.time
Left outer Join YourTable c ON c.time<b.time AND c.time > a.time
Where c.time is null
Order By a.time
Select a.Time as StartTime, b.time as EndTime, b.time-a.time as TimeChange, b.value-a.value as ValueChange
FROM YourTable a, YourTable b
WHERE b.time = (Select MIN(c.time) FROM YourTable c WHERE c.time>a.time)
You could use a SQL window function; below is an example based on BigQuery syntax.
SELECT
LAG(time, 1) OVER (ORDER BY time) AS start_time,
time AS end_time,
value - LAG(value, 1) OVER (ORDER BY time) AS change
FROM data
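The same LAG pattern works in most engines with window functions; a runnable sketch in SQLite via Python:

```python
# Sketch: LAG pairs each reading with its predecessor; the first row has
# no predecessor, so it is filtered out.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE data (time INTEGER, value INTEGER);
INSERT INTO data VALUES (10, 100), (20, 200), (36, 330), (46, 440);
""")

rows = conn.execute("""
SELECT * FROM (
  SELECT LAG(time)  OVER (ORDER BY time) AS start_time,
         time                            AS end_time,
         value - LAG(value) OVER (ORDER BY time) AS change
  FROM data
)
WHERE start_time IS NOT NULL  -- drop the row with no predecessor
""").fetchall()
print(rows)  # [(10, 20, 100), (20, 36, 130), (36, 46, 110)]
```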
First off, I would add an id column to the table so that you have something that predictably increases from row to row.
Then, I would try the following query:
SELECT t1.Time AS 'Starttime', t2.Time AS 'Endtime',
(t2.Value - t1.Value) AS 'Change'
FROM SensorData t1
INNER JOIN SensorData t2 ON (t2.id - 1) = t1.id
ORDER BY t1.Time ASC
I'm going to create a test table to try this for myself so I don't know if it works yet but it's worth a shot!
Update
Fixed one minor issue (CHANGE is a reserved word and had to be quoted), but I tested it and it works! It produces exactly the results defined above.
Does this work?
WITH T AS
(
SELECT [Time]
, Value
, RN1 = ROW_NUMBER() OVER (ORDER BY [Time])
, RN2 = ROW_NUMBER() OVER (ORDER BY [Time]) - 1
FROM SensorData
)
SELECT
StartTime = ISNULL(t1.[time], t2.[time])
, EndTime = ISNULL(t2.[time], 0)
, Change = t2.value - t1.value
FROM T t1
LEFT OUTER JOIN
T t2
ON t1.RN1 = t2.RN2