Count of overlapping intervals in BigQuery - sql

Given a table of intervals, can I efficiently query for the number of currently open intervals at the start of each interval (including the current interval itself)?
For example, given the following table:
start_time end_time
1 10
2 5
3 4
5 6
7 11
19 20
I want the following output:
start_time count
1 1
2 2
3 3
5 3
7 2
19 1
On small datasets, I can solve this problem by joining the dataset against itself:
WITH intervals AS (
SELECT 1 AS start_time, 10 AS end_time UNION ALL
SELECT 2, 5 UNION ALL
SELECT 3, 4 UNION ALL
SELECT 5, 6 UNION ALL
SELECT 7, 11 UNION ALL
SELECT 19, 20
)
SELECT
a.start_time,
count(*)
FROM
intervals a CROSS JOIN intervals b
WHERE
a.start_time >= b.start_time AND
a.start_time <= b.end_time
GROUP BY a.start_time
ORDER BY a.start_time
With large datasets the CROSS JOIN is both impractical and unnecessary, since any given answer only depends on a small number of preceding intervals (when sorted by start_time). In fact, on the dataset that I have, it times out. Is there a better way to achieve this?

... CROSS JOIN is both impractical and unnecessary ...
Is there a better way to achieve this?
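Try the query below for BigQuery Standard SQL. No JOINs are involved: a window function collects every end_time seen up to the current start_time, and the subquery counts the ones that are still open.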
#standardSQL
SELECT
start_time,
(SELECT COUNT(1) FROM UNNEST(ends) AS e WHERE e >= start_time) AS cnt
FROM (
SELECT
start_time,
ARRAY_AGG(end_time) OVER(ORDER BY start_time) AS ends
FROM intervals
)
-- ORDER BY start_time
You can test it with the example below, which uses the dummy data from your question:
#standardSQL
WITH intervals AS (
SELECT 1 AS start_time, 10 AS end_time UNION ALL
SELECT 2, 5 UNION ALL
SELECT 3, 4 UNION ALL
SELECT 5, 6 UNION ALL
SELECT 7, 11 UNION ALL
SELECT 19, 20
)
SELECT
start_time,
(SELECT COUNT(1) FROM UNNEST(ends) AS e WHERE e >= start_time) AS cnt
FROM (
SELECT
start_time,
ARRAY_AGG(end_time) OVER(ORDER BY start_time) AS ends
FROM intervals
)
-- ORDER BY start_time
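For anyone who wants to check the expected counts outside BigQuery, here is a minimal Python sketch of the same idea (walk the intervals in start order and keep only the end times that are still open). The function name `open_interval_counts` is made up for illustration, and the sketch assumes start times are distinct and that an interval ending exactly at a start time still counts as open, as in the `e >= start_time` condition above.

```python
import heapq

def open_interval_counts(intervals):
    """Return (start, count) pairs: intervals open at each start, ends inclusive."""
    open_ends = []  # min-heap of end times of intervals that are still open
    result = []
    for start, end in sorted(intervals):
        # Drop intervals that ended strictly before this start.
        while open_ends and open_ends[0] < start:
            heapq.heappop(open_ends)
        heapq.heappush(open_ends, end)
        result.append((start, len(open_ends)))
    return result

intervals = [(1, 10), (2, 5), (3, 4), (5, 6), (7, 11), (19, 20)]
print(open_interval_counts(intervals))
# [(1, 1), (2, 2), (3, 3), (5, 3), (7, 2), (19, 1)]
```

Each end time is pushed and popped at most once, so the work stays close to the cost of the sort rather than growing with the square of the row count.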

Related

How can I divide hours to next working days in SQL?

I have a table that stores a start date and a number of hours. I also have another table that serves as a reference for working days. My main goal is to divide these hours across the working days.
For example:
ID Date Hour
1 20210504 40
I want it to be structured as
ID Date Hour
1 20210504 8
1 20210505 8
1 20210506 8
1 20210507 8
1 20210510 8
I managed to divide the hours with the code below, but couldn't make it respect working days.
WITH cte1 AS
(
select 1 AS ID, 20210504 AS Date, 40 AS Hours --just a test case
), working_days AS
(
select date from dateTable
),
cte2 AS
(
select ID, Date, Hours, IIF(Hours<=8, Hours, 8) AS dailyHours FROM cte1
UNION ALL
SELECT
cte2.ID,
cte2.Date + 1
,cte2.Hours - 8
,IIF(Hours<=8, Hours, 8)
FROM cte2
JOIN cte1 t ON cte2.ID = t.ID
WHERE cte2.HOURS > 8 AND cte2.Date + 1 IN (select * from working_days)
When I use it like this, it only gives me this output, with one day missing:
ID Date Hour
1 20210504 8
1 20210505 8
1 20210506 8
1 20210507 8
To solve your problem, you need to build your calendar the right way,
also adding a ROW_NUMBER to working_days so that consecutive working days are numbered in sequence.
declare @date_start date = '2021-05-01'
;WITH
cte1 AS (
SELECT * FROM
(VALUES
(1, '20210504', 40),
(2, '20210505', 55),
(3, '20210503', 44)
) X (ID, Date, Hour)
),
numbers as (
SELECT ROW_NUMBER() over (order by o.object_id) N
FROM sys.objects o
),
cal as (
SELECT cast(DATEADD(day, n, @date_start) as date) d, n-1 n
FROM numbers n
where n.n<32
),
working_days as (
select d, ROW_NUMBER() over (order by n) dn
from cal
where DATEPART(weekday, d) < 6 /* monday to friday in italy (country dependent) */
),
base as (
SELECT t.ID, t.Hour, w.d, w.dn
from cte1 t
join working_days w on w.d = t.date
)
SELECT t.ID, w.d, iif((8*n)<=Hour, 8, 8 + Hour - (8*n) ) h
FROM base t
join numbers m on m.n <= (t.Hour / 8.0) + 0.5
join working_days w on w.dn = t.dn + N -1
order by 1,2
You can use a recursive CTE. This should do the trick:
with cte as (
-- total_hour holds the hours still left after the current row
select id, date, 8 as hour, hour - 8 as total_hour
from t
union all
select id, dateadd(day, 1, date),
(case when total_hour < 8 then total_hour else 8 end),
total_hour - 8
from cte
where total_hour > 0
)
select *
from cte;
Note: This assumes that total_hour is at least 8, just to avoid a case expression in the anchor part of the CTE. That can trivially be added.
Also, if there might be more than 100 days, you will need option (maxrecursion 0).
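To sanity-check the expected rows independently of the SQL, here is a small Python sketch of the same allocation logic. It is only an illustration: `split_hours` is a hypothetical helper, and it assumes working days are Monday to Friday unless you pass your own `is_working_day` predicate (for example one backed by your working-days table).

```python
from datetime import date, timedelta

def split_hours(start, total_hours, per_day=8, is_working_day=None):
    """Spread total_hours over working days from start, at most per_day each."""
    if is_working_day is None:
        # Assumption: Monday (0) to Friday (4) are working days.
        is_working_day = lambda d: d.weekday() < 5
    rows = []
    day = start
    remaining = total_hours
    while remaining > 0:
        if is_working_day(day):
            hours = min(per_day, remaining)
            rows.append((day, hours))
            remaining -= hours
        # Move on whether or not the day was used; non-working days are skipped.
        day += timedelta(days=1)
    return rows

for day, hours in split_hours(date(2021, 5, 4), 40):
    print(day.strftime("%Y%m%d"), hours)
```

For the sample row (20210504, 40) this prints five rows of 8 hours, jumping from 20210507 to 20210510 over the weekend.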

How to include values that count nothing on certain day (APEX)

I have this query:
SELECT
COUNT(ID) AS FREQ,
TO_CHAR(TRUNC(CREATED_AT),'DD-MON') DATES
FROM TICKETS
WHERE TRUNC(CREATED_AT) > TRUNC(SYSDATE) - 32
GROUP BY TRUNC(CREATED_AT)
ORDER BY TRUNC(CREATED_AT) ASC
This counts how many tickets were created every day for the past month.
The result looks something like this: (first 10 rows)
FREQ DATES
3 28-DEC
4 04-JAN
8 05-JAN
1 06-JAN
4 07-JAN
5 08-JAN
2 11-JAN
6 12-JAN
3 13-JAN
8 14-JAN
The linechart that I created looks like this:
The problem is that on days when no tickets were created (in particular weekends), the line goes straight to the next day on which a ticket was created.
Is there a way in APEX or in my query to include the days that aren't counted?
As commented, you'd use one of the row generator techniques to create a "calendar" table and outer join it with the table that contains the data you're displaying.
Something like this (see comments within code):
SQL> with yours (amount, datum) as
2 -- your sample table
3 (select 100, date '2021-01-01' from dual union all
4 select 200, date '2021-01-03' from dual union all
5 select 300, date '2021-01-07' from dual
6 ),
7 minimax as
8 -- MIN and MAX date (so that they could be used in row generator --> CALENDAR CTE (below)
9 (select min(datum) min_datum,
10 max(datum) max_datum
11 from yours
12 ),
13 calendar as
14 -- calendar, from MIN to MAX date in YOUR table
15 (select min_datum + level - 1 datum
16 from minimax
17 connect by level <= max_datum - min_datum + 1
18 )
19 -- final query uses outer join
20 select c.datum,
21 nvl(y.amount, 0) amount
22 from calendar c left join yours y on y.datum = c.datum
23 order by c.datum;
DATUM AMOUNT
---------- ----------
01.01.2021 100
02.01.2021 0
03.01.2021 200
04.01.2021 0
05.01.2021 0
06.01.2021 0
07.01.2021 300
7 rows selected.
SQL>
Applied to your current query:
WITH
minimax
AS
-- MIN and MAX date (so that they could be used in row generator --> CALENDAR CTE (below)
-- truncate to whole days and restrict to the last month here, so the outer join below keeps empty days
(SELECT MIN (TRUNC (created_at)) min_datum, MAX (TRUNC (created_at)) max_datum
FROM tickets
WHERE TRUNC (created_at) > TRUNC (SYSDATE) - 32),
calendar
AS
-- calendar, from MIN to MAX date in YOUR table
( SELECT min_datum + LEVEL - 1 datum
FROM minimax
CONNECT BY LEVEL <= max_datum - min_datum + 1)
-- final query uses outer join
SELECT COUNT (t.id) AS freq, TO_CHAR (c.datum, 'DD-MON') dates
FROM calendar c LEFT JOIN tickets t ON TRUNC (t.created_at) = c.datum
GROUP BY c.datum
ORDER BY c.datum ASC
I added a with clause to generate last 31 days, then I left joined with your base table like below.
with last_31_days as (
select trunc(sysdate) - 32 + level dt from dual connect by trunc(sysdate) - 32 + level < trunc(sysdate)
)
SELECT
COUNT(t.ID) AS FREQ,
TO_CHAR(a.dt, 'DD-MON') DATES
FROM last_31_days a
LEFT JOIN TICKETS t
ON TRUNC(t.CREATED_AT) = a.dt
GROUP BY a.dt
-- order by the real date, not by the 'DD-MON' string
ORDER BY a.dt ASC
;
@Littlefoot's answer is perfect, but here is a cheeky way to get a similar table in a format that matches the OP's output, using a simple CTE.
WITH cte AS (
SELECT To_Char(Trunc(SYSDATE - ROWNUM),'DD-MON') dtcol
FROM DUAL
CONNECT BY ROWNUM < 366
)
SELECT * FROM cte
Here is a db<>fiddle.
You can then simply join this CTE to fill in the empty dates, since the date column in the original output looks like a string column.
CONNECT BY is Oracle-only, but I think you can use a recursive CTE to get a similar result in other DBMSs that support recursive CTEs.
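The calendar-plus-outer-join idea in the answers above is not tied to Oracle. As a rough illustration, here is the same gap filling in plain Python, using the three sample rows from the first answer. The helper name `fill_missing_days` is made up, and the sketch assumes the data has already been reduced to one value per day.

```python
from datetime import date, timedelta

def fill_missing_days(amounts_by_day):
    """Return (day, amount) for every day from the first to the last key, 0 if missing."""
    first, last = min(amounts_by_day), max(amounts_by_day)
    n_days = (last - first).days + 1
    # The "calendar": every day in the range, whether or not it has data.
    calendar = [first + timedelta(days=i) for i in range(n_days)]
    # The "outer join": look each calendar day up and default to 0.
    return [(d, amounts_by_day.get(d, 0)) for d in calendar]

sample = {
    date(2021, 1, 1): 100,
    date(2021, 1, 3): 200,
    date(2021, 1, 7): 300,
}
for day, amount in fill_missing_days(sample):
    print(day.strftime("%d.%m.%Y"), amount)
```

This produces seven rows, with 0 for the four days that have no data, which is what the chart needs in order to draw the line down to zero.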

SQL not returning a value if no row exist for time queried

I'm writing this SQL query which returns the number of records created in an hour in last 24 hours. I'm getting the result for only those hours that have a non zero value. If no records were created, it doesn't return anything at all.
Here's my query:
SELECT HOUR(timeStamp) as hour, COUNT(*) as count
FROM `events`
WHERE timeStamp > DATE_SUB(NOW(), INTERVAL 24 HOUR)
GROUP BY HOUR(timeStamp)
ORDER BY HOUR(timeStamp)
The output of current Query:
+-----------------+----------+
| hour | count |
+-----------------+----------+
| 14 | 6 |
| 15 | 5 |
+-----------------+----------+
But I'm expecting 0 for hours in which no records were created. Where am I going wrong?
One solution is to generate a table of numbers from 0 to 23 and left join it with your original table.
Here is a query that uses a recursive query to generate the list of hours (if you are running MySQL, this requires version 8.0):
with recursive hours as (
select 0 hr
union all select hr + 1 from hours where hr < 23
)
select h.hr, count(e.eventID) as cnt
from hours h
left join events e
on e.timestamp > now() - interval 1 day
and hour(e.timestamp) = h.hr
group by h.hr
If your RDBMS does not support recursive CTEs, then one option is to use an explicit derived table:
select h.hr, count(e.eventID) as cnt
from (
select 0 hr union all select 1 union all select 2 ... union all select 23
) h
left join events e
on e.timestamp > now() - interval 1 day
and hour(e.timestamp) = h.hr
group by h.hr
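If you want to try the recursive approach without a MySQL 8.0 server at hand, the sketch below runs the same pattern against an in-memory SQLite database from Python. Several things here are assumptions for the demo only: the table layout (`event_id`, `ts`), the sample timestamps, and the use of SQLite's `strftime('%H', ...)` in place of MySQL's `HOUR()`. The 24-hour filter is left out so the result does not depend on the current time.

```python
import sqlite3

def hourly_counts(timestamps):
    """Count events per hour of day (0..23), including hours with no events."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (event_id INTEGER PRIMARY KEY, ts TEXT)")
    conn.executemany("INSERT INTO events (ts) VALUES (?)", [(t,) for t in timestamps])
    rows = conn.execute(
        """
        WITH RECURSIVE hours(hr) AS (
            SELECT 0
            UNION ALL
            SELECT hr + 1 FROM hours WHERE hr < 23
        )
        SELECT h.hr, COUNT(e.event_id) AS cnt
        FROM hours h
        LEFT JOIN events e
          ON CAST(strftime('%H', e.ts) AS INTEGER) = h.hr
        GROUP BY h.hr
        ORDER BY h.hr
        """
    ).fetchall()
    conn.close()
    return rows

# Hypothetical sample data: two events at 14:xx and one at 15:xx.
sample = ["2024-01-01 14:05:00", "2024-01-01 14:30:00", "2024-01-01 15:10:00"]
for hr, cnt in hourly_counts(sample):
    print(hr, cnt)
```

Counting `e.event_id` rather than `*` is what makes the empty hours come out as 0: the left join yields one row with NULLs for those hours, and COUNT skips NULLs.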

Incremental business day column that resets each month

I need to create a table that contains records with 1) all 365 days of the year and 2) a counter representing which business day of the month the day is. Non-business days should be represented with a 0. For example:
Date | Business Day
2019-10-01 1
2019-10-02 2
2019-10-03 3
2019-10-04 4
2019-10-05 0 // Saturday
2019-10-06 0 // Sunday
2019-10-07 5
....
2019-11-01 1
2019-11-02 0 // Saturday
2019-11-03 0 // Sunday
2019-11-04 2
So far, I've been able to create a table that contains all dates of the year.
CREATE TABLE ${TMPID}_days_of_the_year
(
`theDate` STRING
);
INSERT OVERWRITE TABLE ${TMPID}_days_of_the_year
select
dt_set.theDate
from
(
-- offsets of 0 to 399 days back from 2019-12-31
select date_sub('2019-12-31', a.s + 10*b.s + 100*c.s) as theDate
from
(
select 0 as s union all select 1 union all select 2 union all select 3 union all select 4 union all select 5 union all select 6 union all select 7 union all select 8 union all select 9
) a
cross join
(
select 0 as s union all select 1 union all select 2 union all select 3 union all select 4 union all select 5 union all select 6 union all select 7 union all select 8 union all select 9
) b
cross join
(
select 0 as s union all select 1 union all select 2 union all select 3
) c
) dt_set
where dt_set.theDate between '2019-01-01' and '2019-12-31'
order by dt_set.theDate DESC;
And I also have a table that contains all of the weekend days and holidays (this data is loaded from a file, and the date format is YYYY-MM-DD)
CREATE TABLE ${TMPID}_company_holiday
(
`holidayDate` STRING
)
;
LOAD DATA LOCAL INPATH '${FILE}' INTO TABLE ${TMPID}_company_holiday;
My question is.... how do I join these tables together while creating the business day counter column shown as in the sample data above?
You can use row_number() for the enumeration. This is a little tricky, because it needs to be conditional, but the information you need is provided by a left join:
select dy.*,
(case when ch.holidayDate is null
then row_number() over (partition by trunc(dy.theDate, 'MONTH'), ch.holidayDate
order by dy.theDate
)
else 0
end) as business_day
from ${TMPID}_days_of_the_year dy left join
${TMPID}_company_holiday ch
on dy.theDate = ch.holidayDate;
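As a way to cross-check the query's output, here is a short Python sketch of the counter itself. It is an illustration rather than a replacement for the Hive query: `business_day_counters` is a hypothetical helper, and the set of non-business days is built from weekends only, whereas in your case it would come from the holiday table.

```python
from datetime import date, timedelta

def business_day_counters(first, last, non_business_days):
    """Return (day, counter) pairs; the counter restarts each month and is 0 on non-business days."""
    rows = []
    counter = 0
    current_month = None
    day = first
    while day <= last:
        if (day.year, day.month) != current_month:
            # New month: restart the numbering.
            current_month = (day.year, day.month)
            counter = 0
        if day in non_business_days:
            rows.append((day, 0))
        else:
            counter += 1
            rows.append((day, counter))
        day += timedelta(days=1)
    return rows

# Assumption for the demo: only Saturdays and Sundays are non-business days.
year_days = [date(2019, 1, 1) + timedelta(days=i) for i in range(365)]
weekends = {d for d in year_days if d.weekday() >= 5}
table = dict(business_day_counters(date(2019, 1, 1), date(2019, 12, 31), weekends))
print(table[date(2019, 10, 7)])  # 5
```

With weekends as the only non-business days, the values for early October and early November 2019 line up with the sample in the question.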

select maximum score grouped by date, display full datetime

Gday, I have a table that shows a series of scores and datetimes those scores occurred.
I'd like to select the maximum of these scores for each day, but display the datetime that the score occurred.
I am using an Oracle database (10g) and the table is structured like so:
scoredatetime score (integer)
---------------------------------------
01-jan-09 00:10:00 10
01-jan-09 01:00:00 11
01-jan-09 04:00:01 9
...
I'd like to be able to present the results such the above becomes:
01-jan-09 01:00:00 11
This following query gets me halfway there.. but not all the way.
select
trunc(t.scoredatetime), max(t.score)
from
mytable t
group by
trunc(t.scoredatetime)
I cannot join on score only because the same high score may have been achieved multiple times throughout the day.
I appreciate your help!
Simon Edwards
with mytableRanked as (
select
trunc(scoredatetime) as d,
scoredatetime,
score,
row_number() over (
partition by trunc(scoredatetime)
order by score desc, scoredatetime desc
) as rk
from mytable
)
select
scoredatetime,
score
from mytableRanked
where rk = 1
order by d desc
In the case of multiple high scores within a day, this returns the row corresponding to the one that occurred latest in the day. If you want to see all highest scores in a day, replace row_number() with rank() and remove scoredatetime desc from the order by specification in the window.
Alternatively, you can do this (it will list ties of high score for a date):
select
scoredatetime,
score
from mytable
where not exists (
select *
from mytable M2
where trunc(M2.scoredatetime) = trunc(mytable.scoredatetime)
and M2.score > mytable.score
)
order by scoredatetime desc
First of all, you did not yet specify what should happen if two or more rows within the same day contain an equal high score.
Two possible answers to that question:
1) Just select one of the scoredatetime's, it doesn't matter which one
In this case don't use self joins or analytics as you see in the other answers, because there is a special aggregate function that can do the job more efficiently. An example:
SQL> create table mytable (scoredatetime,score)
2 as
3 select to_date('01-jan-2009 00:10:00','dd-mon-yyyy hh24:mi:ss'), 10 from dual union all
4 select to_date('01-jan-2009 01:00:00','dd-mon-yyyy hh24:mi:ss'), 11 from dual union all
5 select to_date('01-jan-2009 04:00:00','dd-mon-yyyy hh24:mi:ss'), 9 from dual union all
6 select to_date('02-jan-2009 00:10:00','dd-mon-yyyy hh24:mi:ss'), 1 from dual union all
7 select to_date('02-jan-2009 01:00:00','dd-mon-yyyy hh24:mi:ss'), 1 from dual union all
8 select to_date('02-jan-2009 04:00:00','dd-mon-yyyy hh24:mi:ss'), 0 from dual
9 /
Table created.
SQL> select max(scoredatetime) keep (dense_rank last order by score) scoredatetime
2 , max(score)
3 from mytable
4 group by trunc(scoredatetime,'dd')
5 /
SCOREDATETIME MAX(SCORE)
------------------- ----------
01-01-2009 01:00:00 11
02-01-2009 01:00:00 1
2 rows selected.
2) Select all records with the maximum score.
In this case you need analytics with a RANK or DENSE_RANK function. An example:
SQL> select scoredatetime
2 , score
3 from ( select scoredatetime
4 , score
5 , rank() over (partition by trunc(scoredatetime,'dd') order by score desc) rnk
6 from mytable
7 )
8 where rnk = 1
9 /
SCOREDATETIME SCORE
------------------- ----------
01-01-2009 01:00:00 11
02-01-2009 00:10:00 1
02-01-2009 01:00:00 1
3 rows selected.
Regards,
Rob.
You might need two SELECT statements to pull this off: the first to collect the truncated date and associated max score, and the second to pull in the actual datetime values associated with the score.
Try:
SELECT T.ScoreDateTime, T.Score
FROM
(
SELECT
TRUNC(T.ScoreDateTime) ScoreDate, MAX(T.score) BestScore
FROM
MyTable T
GROUP BY
TRUNC(T.ScoreDateTime)
) ByDate
INNER JOIN MyTable T
ON TRUNC(T.ScoreDateTime) = ByDate.ScoreDate and T.Score = ByDate.BestScore
ORDER BY T.ScoreDateTime DESC
This will pull in best score ties as well.
For a version which selects only the most recently-posted high score for each day:
SELECT MAX(T.ScoreDateTime) ScoreDateTime, T.Score
FROM
(
SELECT
TRUNC(T.ScoreDateTime) ScoreDate,
MAX(T.score) BestScore
FROM
MyTable T
GROUP BY
TRUNC(T.ScoreDateTime)
) ByDate
INNER JOIN MyTable T
ON TRUNC(T.ScoreDateTime) = ByDate.ScoreDate and T.Score = ByDate.BestScore
GROUP BY ByDate.ScoreDate, T.Score
ORDER BY MAX(T.ScoreDateTime) DESC
This joins on the best score first and then keeps only the latest time at which it was reached, so it returns exactly one record per date.
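To compare the variants above against a known result, here is a minimal Python sketch of the "one row per day, latest time wins on ties" rule, run on the sample rows from Rob's answer. The function name `best_score_per_day` is made up for illustration.

```python
from datetime import datetime

def best_score_per_day(rows):
    """Return one (scoredatetime, score) per day: the highest score, latest time on ties."""
    best = {}
    for scored_at, score in rows:
        day = scored_at.date()
        current = best.get(day)
        # Compare (score, time) so that a tie on score goes to the later time.
        if current is None or (score, scored_at) > (current[1], current[0]):
            best[day] = (scored_at, score)
    return [best[day] for day in sorted(best)]

rows = [
    (datetime(2009, 1, 1, 0, 10), 10),
    (datetime(2009, 1, 1, 1, 0), 11),
    (datetime(2009, 1, 1, 4, 0), 9),
    (datetime(2009, 1, 2, 0, 10), 1),
    (datetime(2009, 1, 2, 1, 0), 1),
    (datetime(2009, 1, 2, 4, 0), 0),
]
for scored_at, score in best_score_per_day(rows):
    print(scored_at, score)
```

On this data it picks 01:00:00 with score 11 for the first day and 01:00:00 with score 1 for the second, the same rows as the KEEP (DENSE_RANK LAST ...) query.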