Tackling the building of a complex query - SQL

I have intermediate SQL skills, but this is the most complex query I've ever attempted.
My goal is to build a query that shows how many minutes of any given day a set of 6 drives is in use or idle. Drives that are 'in use' are writing backups to tape, i.e. running a job. A drive can handle only one job at a time. A drive may start and end a job on the same day, or start one day and end two days later if it's a big job. The most important thing is that I be able to report the number of minutes EACH drive is either UP or IDLE (both are important), and also to report only the minutes it worked on the respective day, even if the job carried into the next.
So, the complexity results from the following:
I can't just subtract start time from end time and SUM the elapsed time of all jobs run by a particular drive, because many jobs span midnight, and I must assign the minutes worked to the day in which they occurred. I.e., I can't report that a drive performed 50 hours of work in a 24-hour period just because the end time of the job was two days out.
The start time and end time columns are in UTC and must be converted to PST.
I need placeholders for the minutes of the day when any one of the drives is idle, so that I can show up/idle time for each of the drives.
The tables I need to put together are just two:
a time calendar table, with a row for each minute of the day, covering 10-10-2009 through 10-07-2021.
a table containing the start and end times of all jobs that have completed, the names of the drives that ran them, and the names of the jobs.
Here's code to build a calendar table containing a row for every minute of the day across a twelve-year window (two years back through ten years ahead).
WITH e1(n) AS(      -- 10 rows
SELECT * FROM (VALUES(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) x(n)
),
e2(n) AS(           -- 100 rows
SELECT e1.n FROM e1, e1 x
),
e4(n) AS(           -- 10,000 rows
SELECT e2.n FROM e2, e2 x
),
e8(n) AS(           -- 100,000,000 rows
SELECT e4.n FROM e4, e4 x
),
cteTally(n) AS(     -- 0 .. 6,307,203: roughly twelve years' worth of minutes
SELECT TOP 6307204 ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) - 1
FROM e8
),
Test(min) AS(       -- one datetime per minute, starting two years back, floored to the minute
SELECT DATEADD(MINUTE, n, DATEADD(MINUTE, DATEDIFF(MINUTE, 0, DATEADD(YEAR, -2, GETDATE())), 0))
FROM cteTally
)
SELECT min
FROM Test
WHERE min <= DATEADD(YEAR, 10, GETDATE());
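To persist the output as an actual table (the dbo.Calendar name here is just a placeholder), the final SELECT can be swapped for a SELECT ... INTO:
-- Materialize the minute rows into a table (hypothetical name)
SELECT min INTO dbo.Calendar
FROM Test
WHERE min <= DATEADD(YEAR, 10, GETDATE());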
Here's sample DDL for the table containing the device/job/start and end times.
CREATE TABLE JobHistorySummary
(JobName nvarchar(255),
ActualStartTime datetime,
EndTime datetime,
DeviceName nvarchar(128))
INSERT INTO JobHistorySummary
VALUES
('FOAMTools E: Weekly - FULL', '2013-08-04 03:20:00.000', '2013-08-04 20:20:00.000', '1 Drv'),
('HRDuplex D: Weekly - FULL', '2013-08-04 18:26:00.000', '2013-08-05 13:00:00.000', '2 Drv'),
('HRDuplex D: Daily - INC', '2013-08-04 20:44:00.000', '2013-08-05 15:50:00.000', '1 Drv'),
('PayNROLL C: Weekly - FULL', '2013-08-04 00:00:00.000', '2013-08-06 15:40:00.000', '3 Drv'),
('PayNROLL C: Daily - INC', '2013-08-05 06:30:00.000', '2013-08-05 06:50:00.000', '4 Drv'),
('SmallIBM F: Daily - FULL', '2013-08-04 00:30:00.000', '2013-08-04 06:30:00.000', '5 Drv'),
('BigIBM F: Daily - INC', '2013-08-06 12:30:00.000', '2013-08-06 12:50:00.000', '6 Drv');
The calculation needed to get local time is [ActualStartTime] + (GETDATE() - GETUTCDATE()).
Even though I just need two tables, I can't figure out the logic of joining them so that they create NULL placeholders for those datetimes where drives are idle. I would like to count up the rows with NULL values as the idle minutes per drive. Also, I can't figure out how to isolate minutes of usage to the day in which they occurred...meaning no more than 1440 minutes of work per day per drive, even for jobs spanning midnight. Minutes of the next day are allocated as minutes worked by respective drive to the following day.
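For illustration, here is a rough, untested sketch of the join shape I have in mind, assuming the minute rows from the tally query are materialized as dbo.Calendar(min):
-- Hypothetical sketch: pair every minute with every drive, then left-join jobs;
-- a NULL JobName marks an idle minute for that drive.
SELECT d.DeviceName, c.min, j.JobName
FROM dbo.Calendar c
CROSS JOIN (SELECT DISTINCT DeviceName FROM JobHistorySummary) d
LEFT JOIN JobHistorySummary j
  ON j.DeviceName = d.DeviceName
 AND c.min >= j.ActualStartTime + (GETDATE() - GETUTCDATE())
 AND c.min <  j.EndTime         + (GETDATE() - GETUTCDATE())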

The following shows how to generate the time in use per day per device; the idle time is then just 1440 minutes minus that. Assume you have generated a table with all the dates in the range of the data; let's call it cal, with one field dt of type DATE (easy to do). The following gives the general approach, in MySQL syntax:
select devicename, cal.dt,
       sum(timediff(least(jhs.endtime, cal.dt + interval 1 day),
                    greatest(jhs.actualstarttime, cal.dt))) as time_in_use
from JobHistorySummary jhs inner join cal
  on jhs.actualstarttime < cal.dt + interval 1 day and jhs.endtime >= cal.dt
group by devicename, cal.dt
Now use the same statement as above with the converted times, also assuming cal is in the converted time zone.
select devicename, cal.dt,
       sum(timediff(least(convert_tz(jhs.endtime,'UTC','US/Pacific'), cal.dt + interval 1 day),
                    greatest(convert_tz(jhs.actualstarttime,'UTC','US/Pacific'), cal.dt))) as time_in_use
from JobHistorySummary jhs inner join cal
  on convert_tz(jhs.actualstarttime,'UTC','US/Pacific') < cal.dt + interval 1 day
 and convert_tz(jhs.endtime,'UTC','US/Pacific') >= cal.dt
group by devicename, cal.dt
But the above isn't exactly right either, because MySQL does not do subtraction and aggregate summing correctly on TIME values. So you need to use something more like:
select devicename, cal.dt,
       sec_to_time(sum(time_to_sec(timediff(least(jhs.endtime, cal.dt + interval 1 day),
                                            greatest(jhs.actualstarttime, cal.dt))))) as time_in_use
from JobHistorySummary jhs inner join cal
  on jhs.actualstarttime < cal.dt + interval 1 day and jhs.endtime >= cal.dt
group by devicename, cal.dt
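Since the question's DDL is SQL Server, here is a minimal T-SQL sketch of the same per-day clamping, assuming a calendar table cal(dt date) already exists. It uses the question's own GETDATE()/GETUTCDATE() offset approximation for the UTC shift, and CASE stands in for GREATEST/LEAST, which older SQL Server versions lack:
WITH local_jobs AS (
    -- shift UTC to local using the question's approximation
    SELECT DeviceName,
           DATEADD(MINUTE, DATEDIFF(MINUTE, GETUTCDATE(), GETDATE()), ActualStartTime) AS StartLocal,
           DATEADD(MINUTE, DATEDIFF(MINUTE, GETUTCDATE(), GETDATE()), EndTime) AS EndLocal
    FROM JobHistorySummary
)
SELECT j.DeviceName, cal.dt,
       SUM(DATEDIFF(MINUTE,
           CASE WHEN j.StartLocal > CAST(cal.dt AS datetime) THEN j.StartLocal
                ELSE CAST(cal.dt AS datetime) END,                                       -- clamp start to midnight
           CASE WHEN j.EndLocal < DATEADD(DAY, 1, CAST(cal.dt AS datetime)) THEN j.EndLocal
                ELSE DATEADD(DAY, 1, CAST(cal.dt AS datetime)) END)) AS minutes_in_use   -- clamp end to next midnight
FROM cal
JOIN local_jobs j
  ON j.StartLocal < DATEADD(DAY, 1, CAST(cal.dt AS datetime))
 AND j.EndLocal > CAST(cal.dt AS datetime)
GROUP BY j.DeviceName, cal.dt;
-- idle minutes per drive per day = 1440 - minutes_in_use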

Related

Listing the hours between two timestamps and grouping by those hours

I am trying to ascertain a count of the couriers that are active every hour of a shift, using the start and end times of their shifts to create an array which I hope to group by. Firstly, when I run it I'm given epoch times back; secondly, I am not able to group by the hours array.
Does anyone have any solutions that they would kindly share with me?
SELECT
GENERATE_TIMESTAMP_ARRAY(CAST(fss.start_time_local AS TIMESTAMP), CAST(fss.end_time_local AS TIMESTAMP) , INTERVAL 1 hour) as hours,
#COUNT(sys_scheduled_shift_id) AS number_schedule_shift,
FROM just-data-warehouse.delco_analytics_team_dwh.fact_scheduled_shifts AS fss
#GROUP BY hours
For your reference, the shift data for the courier is structured like so:
To calculate how many couriers have been active at least one minute in every hour, I would do it like this:
SELECT
  CALENDAR.datetime
  ,SUM(workers.flag_worker) AS n_workers
FROM (
  -- CALENDAR
  SELECT
    CAST(datetime AS datetime) AS datetime
  FROM UNNEST(GENERATE_TIMESTAMP_ARRAY('2022-01-01T00:00:00', '2022-01-02T00:00:00'
              ,INTERVAL 1 HOUR)) AS datetime
) CALENDAR
-- TABLE of SHIFTS
LEFT JOIN (
  SELECT *, 1 AS flag_worker
  FROM UNNEST(
    ARRAY<STRUCT<worker_id STRING, shift_start DATETIME, shift_end DATETIME>>[
      ('Worker_01', '2022-01-01T06:00:00', '2022-01-01T14:00:00')
      ,('Worker_02', '2022-01-01T10:00:00', '2022-01-01T18:00:00')
    ]
  ) AS workers
) workers
  ON CALENDAR.datetime < workers.shift_end
 AND DATETIME_ADD(CALENDAR.datetime, INTERVAL 1 HOUR) > workers.shift_start
GROUP BY CALENDAR.datetime
The idea is to build a calendar of datetimes and then join it with a table of shifts.
Instead of hours, the calendar can be modified to have fractions of hours. Also, there may be a more elegant way to build the calendar.
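For instance, a minimal variant of the calendar subquery at 15-minute resolution (the join window in the ON clause would need to shrink to INTERVAL 15 MINUTE to match):
-- 15-minute calendar instead of hourly
SELECT CAST(dt AS datetime) AS datetime
FROM UNNEST(GENERATE_TIMESTAMP_ARRAY('2022-01-01T00:00:00', '2022-01-02T00:00:00'
            ,INTERVAL 15 MINUTE)) AS dt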

Working days between two dates in Snowflake

Is there any way to calculate working days between two dates in Snowflake without creating a calendar table, using only the DATEDIFF function?
After doing some research on the Snowflake DATEDIFF function, I have come to the following conclusions.
DATEDIFF(DAY/WEEK, START_DATE, END_DATE) will calculate the difference, but the last date is treated as END_DATE - 1.
DATEDIFF(WEEK, START_DATE, END_DATE) will count the number of Sundays between the two dates.
Putting these two points together, I have implemented the logic below.
SELECT
( DATEDIFF(DAY, START_DATE, DATEADD(DAY, 1, END_DATE))
- DATEDIFF(WEEK, START_DATE, DATEADD(DAY, 1, END_DATE))*2
- (CASE WHEN DAYNAME(START_DATE) != 'Sun' THEN 1 ELSE 0 END)
+ (CASE WHEN DAYNAME(END_DATE) != 'Sat' THEN 1 ELSE 0 END)
) AS WORKING_DAYS
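As a quick sanity check with hypothetical dates: Monday 2021-03-01 through Friday 2021-03-12 is exactly two Monday-to-Friday work weeks, so the formula should return 10.
SELECT
( DATEDIFF(DAY, '2021-03-01'::DATE, DATEADD(DAY, 1, '2021-03-12'::DATE))     -- 12 calendar days
- DATEDIFF(WEEK, '2021-03-01'::DATE, DATEADD(DAY, 1, '2021-03-12'::DATE))*2  -- minus 2 for the one weekend crossed
- (CASE WHEN DAYNAME('2021-03-01'::DATE) != 'Sun' THEN 1 ELSE 0 END)         -- start is a Monday: subtract 1
+ (CASE WHEN DAYNAME('2021-03-12'::DATE) != 'Sat' THEN 1 ELSE 0 END)         -- end is a Friday: add 1
) AS WORKING_DAYS;  -- 12 - 2 - 1 + 1 = 10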
Here's an article with a calendar table solution that also includes a UDF to solve this in Snowflake (the business days are hard-coded, so that does require some maintenance, but you don't have to maintain a calendar table at least):
https://medium.com/dandy-engineering-blog/how-to-calculate-the-number-of-working-hours-between-two-timestamps-in-sql-b5696de66e51
The best way to count the number of Sundays between two dates is possibly as follows:
CREATE OR REPLACE FUNCTION SUNDAYS_BETWEEN(a DATE,b DATE)
RETURNS INTEGER
AS $$
FLOOR( (DAYOFWEEKISO(a) + DATEDIFF('days',a,b)) / 7 ,0)
$$
The above is better than using DATEDIFF(WEEK, ...), because the output of that function changes if the WEEK_START session parameter is altered away from the legacy default of 0.
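A quick hypothetical check (2021-03-01 was a Monday, so DAYOFWEEKISO returns 1):
SELECT SUNDAYS_BETWEEN('2021-03-01'::DATE, '2021-03-14'::DATE);
-- FLOOR((1 + 13) / 7) = 2: the two Sundays in range are March 7 and March 14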
I have a way to calculate the number of business hours that elapse between a start time and an end time, but it only works if you make the following assumptions.
Assume only 1 time zone for all timestamps.
Any start or end times that occur outside of business hours should be rounded to the nearest business-hour time. (I.e., assuming a schedule of 10:00am - 6:00pm, timestamps occurring from midnight to 9:59am should be rounded to 10:00am, and times after 6:00pm should be set to the next day at 10:00am.)
Timestamps that occur on the weekends should be set to the opening time of the next business day. (In this case Monday at 10:00am)
My model does not account for any holidays.
If these 4 conditions are met, then the following code should be enough for a rough estimate of business hours elapsed.
(DATEDIFF(seconds, start_time, end_time) -- the raw number of seconds between the two timestamps
- (DATEDIFF(DAY, start_time, end_time) * 16 * 60 * 60) -- for every day boundary crossed, subtract X hours not worked per day, converted to seconds (X = 16 for a standard 8-hour day; X = 14 for a 10-hour day, etc.)
- (DATEDIFF(WEEK, start_time, end_time) * (8 * 2 * 60 * 60)) -- weekends are not work days, so subtract an additional 8 working hours for each of Saturday and Sunday
) / (60 * 60 * 8) -- divide by 60*60*8 to convert business seconds into business days; we use 8 hours rather than 24 because our "business day" is only 8 hours long
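A hypothetical worked check: Monday 2021-03-01 10:00 to Tuesday 2021-03-02 14:00 should come out to 8 business hours on Monday plus 4 on Tuesday, i.e. 1.5 eight-hour days:
SELECT (DATEDIFF(seconds, '2021-03-01 10:00:00'::TIMESTAMP, '2021-03-02 14:00:00'::TIMESTAMP)               -- 100800 raw seconds
      - (DATEDIFF(DAY,  '2021-03-01 10:00:00'::TIMESTAMP, '2021-03-02 14:00:00'::TIMESTAMP) * 16 * 60 * 60) -- one midnight crossed: minus 57600
      - (DATEDIFF(WEEK, '2021-03-01 10:00:00'::TIMESTAMP, '2021-03-02 14:00:00'::TIMESTAMP) * (8 * 2 * 60 * 60)) -- no weekend: minus 0
      ) / (60 * 60 * 8) AS business_days; -- (100800 - 57600) / 28800 = 1.5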

Block average in SQL

I have to take the average of the delta time between consecutive rows in SQL Server, where each delta represents the time elapsed between two consecutive operations. However, there are no operations during nights / holidays / weekends (e.g. between the last operation on Friday and the first one on Monday the delta time is more than 48h, but I don't want to consider it), so the average time is totally incorrect.
How do I deal with this problem? Is there a way to drop these entries and compute the real average delta time, doing a sort of block (per-day?) average?
Thanks!
An example:
Time
00:00:37
00:00:32
00:00:25
...
00:01:22
00:00:54 ---- e.g. Night ---
09:34:12 <--- Exclude this from the average calculation ---
00:00:22
00:00:41
00:00:36
...
Desired output
Avg time: 41.13s
For the time difference, you can apply a where clause. For the rest, just date functions and arithmetic:
select convert(date, time),
avg( datediff(second, prev_time, time) * 1.0 ) as avg_seconds
from (select t.*,
lag(time) over (order by time) as prev_time
from t
) t
where time < dateadd(hour, 4, prev_time) -- or whatever the threshold is
group by convert(date, time);
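A hypothetical check of the threshold filter, substituting a table variable for t:
-- three quick events, then one after a long gap
declare @t table ([time] datetime);
insert into @t values
('2021-03-01 09:00:00'), ('2021-03-01 09:00:30'),
('2021-03-01 09:01:10'), ('2021-03-01 13:30:00');
-- The first row has no prev_time and the 13:30:00 row follows a gap of more
-- than 4 hours, so the WHERE clause drops both; avg_seconds = (30 + 40) / 2 = 35.0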

Get count of matching time ranges for every minute of the day in Postgres

Problem
I have a table of records each containing id, in_datetime, and out_datetime. A record is considered "open" during the time between the in_datetime and out_datetime. I want to know how many time records were "open" for each minute of the day (regardless of date). For example, for the last 90 days I want to know how many records were "open" at 3:14 am, then 3:15 am, then 3:16 am, then... If no records were "open" at 2:00 am the query should return 0 or null instead of excluding the row, thus 1440 rows should always be returned (the number of minutes in a day). Datetimes are stored in UTC and need to be cast to a time zone.
Simplified example graphic
record_id | time_range
| 0123456789 (these are minutes past midnight)
1 | =========
2 | ===
3 | =======
4 | ===
5 | ==
______________________
result 3323343210
Desired output
time | count of open records at this time
00:00 120
00:01 135
00:02 132
...
23:57 57
23:58 62
23:59 60
No more than 1440 records would ever be returned as there are only 1440 minutes in the day.
What I've tried
1.) In a subquery, I currently generate a minutely series of times for the entire range of each time record. I then group those by time and get a count of the records per minute.
Here is a db-fiddle using my current query:
select
trs.minutes,
count(trs.minutes)
from (
select
generate_series(
DATE_TRUNC('minute', (time_records.in_datetime::timestamptz AT TIME ZONE 'America/Denver')),
DATE_TRUNC('minute', (time_records.out_datetime::timestamptz AT TIME ZONE 'America/Denver')),
interval '1 min'
)::time as minutes
from
time_records
) trs
group by
trs.minutes
This works but is quite inefficient and takes several seconds to run due to the size of my table. Additionally, it excludes times when no records were open. I think somehow I could use window functions to count the number of overlapping time records for each minute of the day, but I don't quite understand how to do that.
2.) Modifying Gordon Linoff's query in his answer below, I came to this (db-fiddle link):
with tr as (
select
date_trunc('minute', (tr.in_datetime::timestamptz AT TIME ZONE 'America/Denver'))::time as m,
1 as inc
from
time_records tr
union all
select
(date_trunc('minute', (tr.out_datetime::timestamptz AT TIME ZONE 'America/Denver')) + interval '1 minute')::time as m,
-1 as inc
from
time_records tr
union all
select
minutes::time,
0
from
generate_series(timestamp '2000-01-01 00:00', timestamp '2000-01-01 23:59', interval '1 min') as minutes
)
select
m,
sum(inc) as changes_at_inc,
sum(sum(inc)) over (order by m) as running_count
from
tr
where
m is not null
group by
m
order by
m;
This runs reasonably quickly, but towards the end of the day (about 22:00 onwards in the linked example) the values turn negative for some reason. Additionally, this query doesn't seem to work correctly with records with time ranges that cross over midnight. It's a step in the right direction, but I unfortunately don't understand it enough to improve on it further.
Here is a faster method. Generate "in" and "out" records for when something gets counted. Then aggregate and use a running sum.
To get all minutes, throw in a generate_series() for the time period in question:
with tr as (
select date_trunc('minute', (tr.in_datetime::timestamptz AT TIME ZONE 'America/Denver')) as m,
1 as inc
from time_records tr
union all
select date_trunc('minute', (tr.out_datetime::timestamptz AT TIME ZONE 'America/Denver')) + interval '1 minute' as m,
-1 as inc
from time_records tr
union all
select generate_series(date_trunc('minute',
min(tr.in_datetime::timestamptz AT TIME ZONE 'America/Denver')),
date_trunc('minute',
max(tr.out_datetime::timestamptz AT TIME ZONE 'America/Denver')),
interval '1 minute'
), 0
from time_records tr
)
select m,
sum(inc) as changes_at_inc,
sum(sum(inc)) over (order by m) as running_count
from tr
group by m
order by m;
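To fold the running count back onto the 1440 minutes of a single day, as the question asks, one option (a sketch building on the query above) is to group the absolute minutes by their time-of-day and sum across days:
with per_minute as (
    -- the aggregation from the query above (tr is the union-all CTE defined there)
    select m, sum(sum(inc)) over (order by m) as running_count
    from tr
    group by m
)
select m::time as minute_of_day, sum(running_count) as open_records
from per_minute
group by m::time
order by minute_of_day;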

Averaging event start time from DateTime column

I'm calculating average start times from events that run late at night and may not start until the next morning.
2018-01-09 00:01:38.000
2018-01-09 23:43:22.000
Currently all I can produce is an average of 11:52:30.0000000.
I would like the result to be ~23:52.
The times averaged will not remain static, as this event runs daily and I will have new data daily. I will likely take the most recent 10 records and average them.
It would be nice to see the SQL you're running, but you probably just need to format your output properly. It should be something like this:
FORMAT(CAST(<your column> AS time), N'hh\:mm') -- 24-hour clock
The following will both compute the average across the datetime field and return the result as a 24-hour time value only:
SELECT CAST(CAST(AVG(CAST(<YourDateTimeField_Here> AS FLOAT)) AS DATETIME) AS TIME) [AvgTime] FROM <YourTableContaining_DateTime>
The following will calculate the average time of day, regardless of what day that is.
--SAMPLE DATA
create table #tmp_sec_dif
(
sample_date_time datetime
)
insert into #tmp_sec_dif
values ('2018-01-09 00:01:38.000')
, ('2018-01-09 23:43:22.000')
--ANSWER
declare @avg_sec_dif int
set @avg_sec_dif =
(select avg(a.sec_dif) as avg_sec_dif
from (
--put the value in terms of seconds away from 00:00:00
--where 23:59:00 would be -60 and 00:01:00 would be 60
select iif(
datepart(hh, sample_date_time) < 12 --is it morning?
, datediff(s, '00:00:00', cast(sample_date_time as time)) --if morning
, datediff(s, '00:00:00', cast(sample_date_time as time)) - 86400 --if evening
) as sec_dif
from #tmp_sec_dif
) as a
)
select cast(dateadd(s, @avg_sec_dif, '00:00:00') as time) as avg_time_of_day
The output would be an answer of 23:52:30.0000000
This code allows you to define a day-division point, e.g. 18 identifies 6pm. The time calculation is then based on seconds after 6pm.
-- Defines the hour of the day when a new day starts
DECLARE @DayDivision INT = 18
IF OBJECT_ID(N'tempdb..#StartTimes') IS NOT NULL DROP TABLE #StartTimes
CREATE TABLE #StartTimes(
start DATETIME NOT NULL
)
INSERT INTO #StartTimes
VALUES
('2018-01-09 00:01:38.000')
,('2018-01-09 23:43:22.000')
SELECT
-- 3. Add the number of seconds to a day starting at the
-- day division hour, then extract the time portion
CAST(DATEADD(SECOND,
-- 2. Average number of seconds
AVG(
-- 1. Get the number of seconds from the day division point (#DayDivision)
DATEDIFF(SECOND,
CASE WHEN DATEPART(HOUR,start) < @DayDivision THEN
SMALLDATETIMEFROMPARTS(YEAR(DATEADD(DAY,-1,start)),MONTH(DATEADD(DAY,-1,start)),DAY(DATEADD(DAY,-1,start)),@DayDivision,0)
ELSE
SMALLDATETIMEFROMPARTS(YEAR(start),MONTH(start),DAY(start),@DayDivision,0)
END
,start)
)
,'01 jan 1900 ' + CAST(@DayDivision AS VARCHAR(2)) + ':00') AS TIME) AS AverageStartTime
FROM #StartTimes
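With the two sample rows above and @DayDivision = 18, this also returns 23:52:30.0000000.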