Hourly gaps in data for a month

Hourly gaps in data for a month - sql

Hi Fellow StackOverflowers
I'm really struggling with finding the right approach for this (in my mind) relatively simple Gap-finding problem.
I have a table with hourly datetimes (hourly log file imported to DB).
I need to find the missing hours during a period (lets say april).
So imagine having the following data in DB table [imported_logs]
[2018-04-02 10:00:000]
[2018-04-02 11:00:000]
[2018-04-02 12:00:000]
[2018-04-02 17:00:000]
i need the result for april gap analysis to be:
[ GAP-BEGIN ] [ GAB_END ]
[2018-04-01 00:00:000] [2018-04-02 10:00:000] <-- problem
[2018-04-02 13:00:000] [2018-04-02 17:00:000] <-- can be found using below code
[2018-04-02 18:00:000] [2018-05-01 00:00:000] <-- problem
My problem is mainly finding the start and end ranges, but the following code is helping finding the cap in between available data.
WITH t AS (
SELECT *, rn = ROW_NUMBER() OVER (PARTITION BY zone ORDER BY hourImported)
FROM logsImportedTable
Where hourImported > '2018-04-01' and hourImported < '2018-05-01' and zone = 1
)
SELECT t1.zone, t1.hourImported as GapStart, t2.hourImported as GapEnd
FROM t t1
INNER JOIN t t2 ON t2.zone = t1.zone AND t2.rn = t1.rn + 1
WHERE DATEDIFF(MINUTE, t1.hourImported, t2.hourImported) > 60
which gives me only the result:
[zone] [gap_start ] [gap_end ]
[1 ] [2018-04-02 13:00:00.000] [2018-04-02 17:00:00.000]
So basically if no logs has been imported during april at all the current implemetation would show no missing data at all (kinda wrong)
I'm thinking that i need to somehow add some new datapoints just before beginning and end of the april period to somehow get the query to catch the start and end of the month as missing data?
What would you bright guys/girls do?
/Kind regards

For this situation, just add the initial and end values, if appropriate:
<your query here>
union all
select '2018-04-01 00:00:000', min(lit.hourImported)
from logsImportedTable lit
where lit.hourImported >= '2018-04-01 00:00:00'
having min(lit.hourImported) > '2018-04-01 00:00:00'
union all
select '2018-05-01 00:00:000', max(lit.hourImported)
from logsImportedTable lit
where lit.hourImported <= '2018-05-01 00:00:00'
having max(lit.hourImported) > '2018-05-01 00:00:00';

Ok thx to the great help from #Gordon this is my final solution to the problem. It will give the gaps even if entire month of data is missing and all small gaps within.
DECLARE #zone INT = 1, #currentPeriodStart DATETIME = '2018-01-01',
#currentPeriodEnd DATETIME = '2018-02-01';
WITH t AS (
SELECT *, rn = ROW_NUMBER() over (PARTITION BY zone_id ORDER BY
time_of_file_present)
FROM test
Where time_of_file_present > #currentPeriodStart and time_of_file_present <
#currentPeriodEnd and zone_id = #zone
)
SELECT t1.zone_id, t1.time_of_file_present as gap_start,
t2.time_of_file_present as gap_end
FROM t t1
INNER JOIN t t2 ON t2.zone_id = t1.zone_id AND t2.rn = t1.rn + 1
WHERE DATEDIFF(MINUTE, t1.time_of_file_present, t2.time_of_file_present) >60
union all
select #zone, #currentPeriodStart, min(lit.time_of_file_present)
from test lit
where lit.time_of_file_present >= #currentPeriodStart
having min(lit.time_of_file_present) > #currentPeriodStart and
min(lit.time_of_file_present) < #currentPeriodEnd
union all
select #zone,max(lit.time_of_file_present), #currentPeriodEnd
from test lit
where lit.time_of_file_present <= #currentPeriodEnd
having max(lit.time_of_file_present) < #currentPeriodEnd and
max(lit.time_of_file_present) > #currentPeriodStart
union all
select #zone,#currentPeriodStart, #currentPeriodEnd
from test lit
having max(lit.time_of_file_present) < #currentPeriodStart or
max(lit.time_of_file_present) > #currentPeriodEnd
order by gap_start

Related

Proportional distribution of a given value between two dates in SQL Server

There's a table with three columns: start date, end date and task duration in hours. For example, something like that:
Id
StartDate
EndDate
Duration
1
07-11-2022
15-11-2022
40
2
02-09-2022
02-11-2022
122
3
10-10-2022
05-11-2022
52
And I want to get a table like that:
Id
Month
HoursPerMonth
1
11
40
2
09
56
2
10
62
2
11
4
3
10
42
3
11
10
Briefly, I wanted to know, how many working hours is in each month between start and end dates. Proportionally. How can I achieve that by MS SQL Query? Data is quite big so the query speed is important enough. Thanks in advance!
I've tried DATEDIFF and EOMONTH, but that solution doesn't work with tasks > 2 months. And I'm sure that this solution is bad decision. I hope, that it can be done more elegant way.

Here is an option using an ad-hoc tally/calendar table
Not sure I'm agree with your desired results
Select ID
,Month = month(D)
,HoursPerMonth = (sum(1.0) / (1+max(datediff(DAY,StartDate,EndDate)))) * max(Duration)
From YourTable A
Join (
Select Top 75000 D=dateadd(day,Row_Number() Over (Order By (Select NULL)),0)
From master..spt_values n1, master..spt_values n2
) B on D between StartDate and EndDate
Group By ID,month(D)
Order by ID,Month
Results

This answer uses CTE recursion.
This part just sets up a temp table with the OP's example data.
DECLARE #source
TABLE (
SOURCE_ID INT
,STARTDATE DATE
,ENDDATE DATE
,DURATION INT
)
;
INSERT
INTO
#source
VALUES
(1, '20221107', '20221115', 40 )
,(2, '20220902', '20221102', 122 )
,(3, '20221010', '20221105', 52 )
;
This part is the query based on the above data. The recursive CTE breaks the time period into months. The second CTE does the math. The final selection does some more math and presents the results the way you want to seem them.
WITH CTE AS (
SELECT
SRC.SOURCE_ID
,SRC.STARTDATE
,SRC.ENDDATE
,SRC.STARTDATE AS 'INTERIM_START_DATE'
,CASE WHEN EOMONTH(SRC.STARTDATE) < SRC.ENDDATE
THEN EOMONTH(SRC.STARTDATE)
ELSE SRC.ENDDATE
END AS 'INTERIM_END_DATE'
,SRC.DURATION
FROM
#source SRC
UNION ALL
SELECT
CTE.SOURCE_ID
,CTE.STARTDATE
,CTE.ENDDATE
,CASE WHEN EOMONTH(CTE.INTERIM_START_DATE) < CTE.ENDDATE
THEN DATEADD( DAY, 1, EOMONTH(CTE.INTERIM_START_DATE) )
ELSE CTE.STARTDATE
END
,CASE WHEN EOMONTH(CTE.INTERIM_START_DATE, 1) < CTE.ENDDATE
THEN EOMONTH(CTE.INTERIM_START_DATE, 1)
ELSE CTE.ENDDATE
END
,CTE.DURATION
FROM
CTE
WHERE
CTE.INTERIM_END_DATE < CTE.ENDDATE
)
, CTE2 AS (
SELECT
CTE.SOURCE_ID
,CTE.STARTDATE
,CTE.ENDDATE
,CTE.INTERIM_START_DATE
,CTE.INTERIM_END_DATE
,CAST( DATEDIFF( DAY, CTE.INTERIM_START_DATE, CTE.INTERIM_END_DATE ) + 1 AS FLOAT ) AS 'MNTH_DAYS'
,CAST( DATEDIFF( DAY, CTE.STARTDATE, CTE.ENDDATE ) + 1 AS FLOAT ) AS 'TTL_DAYS'
,CAST( CTE.DURATION AS FLOAT ) AS 'DURATION'
FROM
CTE
)
SELECT
CTE2.SOURCE_ID AS 'Id'
,MONTH( CTE2.INTERIM_START_DATE ) AS 'Month'
,ROUND( CTE2.MNTH_DAYS/CTE2.TTL_DAYS * CTE2.DURATION, 0 ) AS 'HoursPerMonth'
FROM
CTE2
ORDER BY
CTE2.SOURCE_ID
,CTE2.INTERIM_END_DATE
;
My results agree with Mr. Cappelletti's, not the OP's. Perhaps some tweaking regarding the definition of a "Day" is needed. I don't know.
If time between start and end date is large (more than 100 months) you may want to specify OPTION (MAXRECURSION 0) at the end.

SQL Rowwise comparison between groups

Question
The following is a snippet of my data:
Create Table Emps(person VARCHAR(50), started DATE, stopped DATE);
Insert Into Emps Values
('p1','2015-10-10','2016-10-10'),
('p1','2016-10-11','2017-10-11'),
('p1','2017-10-12','2018-10-13'),
('p2','2019-11-13','2019-11-13'),
('p2','2019-11-14','2020-10-14'),
('p3','2020-07-15','2021-08-15'),
('p3','2021-08-16','2022-08-16');
db<>fiddle.
I want to use T-SQL to get a count of how many persons fulfil the following criteria at least once - multiples should also count as one:
For a person:
One of the dates in 'started' (say s1) is larger than at least one of the dates in 'ended' (say e1)
s1 and e1 are in the same year, to be set manually - e.g. '2021-01-01' until '2022-01-01'
Example expected response
If I put the date range '2016-01-01' until '2017-01-01' somewhere in a WHERE / HAVING clause, the output should be 1 as only p1 has both a start date and an end date that fall in 2016 where the start date is larger than the end date:
s1 = '2016-10-11', and e1 = '2016-10-10'.
Why can't I do this myself
The reason I'm stuck is that I don't know how to do this rowwise comparison between groups. The question requires comparing values across columns (start with end) across rows, within a person ID.

Use conditional aggregation to get the maximum start date and the minimum stop date in the given range.
select person
from emps
group by person
having max(case when started >= '2016-01-01' and started < '2017-01-01'
then started end) >
min(case when stopped >= '2016-01-01' and stopped < '2017-01-01'
then stopped end);
Demo: https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=45adb153fcac9ce72708f1283cac7833

I would choose to use a self-outer-join with an exists correlation, it should be pretty much the most performant, all things being equal.
select Count(*)
from emps e
where exists (
select * from emps e2
where e2.person = e.person
and e2.stopped > e.started
and e.started between '20160101' and '20170101'
and e2.started between '20160101' and '20170101'
);

You said you plan to set the dates manually, so this works where we set the start date in one CTE, and the end date in another CTE. Then we calculate the min/max for each, and use that criteria in the query where statement.
with min_max_start as (
select person,
min(started) as min_start, --obsolete
max(started) as max_start
from emps
where started >= '2016-01-01'
group by person
),
min_max_end as (
select person,
min(stopped) as min_stop,
max(stopped) as max_stop --obsolete
from emps
where stopped < '2017-01-01'
group by person
)
select count(distinct e.person)
from emps e
join min_max_start mms
on e.person = mms.person
join min_max_end mme
on e.person = mme.person
where mms.max_start> mme.min_stop
Output: 1

Try the following:
With CTE as
(
Select D.person, D.started, T.stopped,
case
when Year(D.started) = Year(T.stopped) and D.started > T.stopped
then 1
else 0
end as chk
From
(Select person, started From Emps Where started >= '2016-01-01') D
Join
(Select person, stopped From Emps Where stopped <= '2017-01-01') T
On D.person = T.person
)
Select Count(Distinct person) as CNT
From CTE
Where chk = 1;
To get the employee list who met the criteria use the following on the CTE instead of the above Select Count... query:
Select person, started, stopped
From CTE
Where chk = 1;
See a demo from db<>fiddle.

Between date condition is not working in SQL Server

I have a data as below,
Trying to run below query but returning 0 rows,
Below query should return highlighted row data as shown above.
Can anybody please explain me, what i'm missing?
select * from Flt_OperativeFlight_SchedulePeriods
where
(
(cast('2018-04-05' as date) between cast(ScheduleStartDate as date) and cast(ScheduleEndDate as date) )
or
(cast('2018-04-11' as date) between cast(ScheduleStartDate as date) and cast(ScheduleEndDate as date) )
)
and CarrierCode='SQ' and FlightNumber='0004'

You could re-write as:
select *
from Flt_OperativeFlight_SchedulePeriods
where CarrierCode='SQ' and FlightNumber='0004' and
(ScheduleStartDate >= '2018-04-05' and ScheduleEndDate <= '2018-04-11')

It happens because
'2018-04-05' < '2018-04-06'
and
'2018-04-11' > '2018-04-10'
As variant maybe it is what you want
select *
from Flt_OperativeFlight_SchedulePeriods
where CarrierCode='SQ' and FlightNumber='0004' and
(
(ScheduleStartDate between '20180405' and '20180411')
or (ScheduleEndDate between '20180405' and '20180411')
)

You can try this
SELECT *
FROM `Flt_OperativeFlight_SchedulePeriods`
WHERE ScheduleStartDate >= '2018-04-05' AND ScheduleEndDate <= '2018-04-11'
AND CarrierCode='SQ' and FlightNumber='0004'

Seems like you want to get overlapping periods, then you need this logic:
start_1 <= end_2 and end_1 >= start_2
For your query:
where
(
cast('2018-04-05' as date) <= cast(ScheduleEndDate as date)
and
cast('2018-04-11' as date) >= cast(ScheduleStartDate as date)
)
Depending on your logic you might have to < or >

CONFLICT WITH "00:00:00" as END_TIME

I have this query:
SELECT `s`.`time` , SUM( s.love ) AS total_love, SUM( s.sad ) AS total_sad, SUM( s.angry ) AS total_angry, SUM( s.happy ) AS total_happy
FROM (`employee_workshift` AS e)
JOIN `workshift` AS w ON `e`.`workshift_uuid` = `w`.`uuid`
JOIN `shift_summary` AS s ON `w`.`uuid` = `s`.`workshift_uuid`
WHERE `s`.`location_uuid` = '81956feb-3fd7-0e84-e9fe-b640434dfad0'
AND `e`.`employee_uuid` = '3866a979-bc5e-56cb-cede-863afc47b8b5'
AND `s`.`workshift_uuid` = '8c9dbd85-18a3-6ca9-e3f3-06eb602b6f38'
AND `s`.`time` >= CAST( '18:00:00' AS TIME )
AND `s`.`time` <= CAST( '00:00:00' AS TIME )
AND `s`.`date` LIKE '%2014-03%'
My problem is it returns "NULL" but when I changed my 'end_time' to "23:59:59", it returned the right data. I've got an idea to pull the hour of both 'start_time' and 'end_time' and then insert it in a loop to get everything between them.
$time_start = 15;
$time_end = 03;
So it should produce: 15,16,17,18,19,20,21,22,23,00,01,02,03
Then I'll compare them all. But this would take a lot of line and effort than just simply using "BETWEEN". Or should I just use "in_array"? Have you encountered this? I hope someone could help. Thanks.

19:00 is certainly bigger then 00:00 - so your approach should not work.
Try using full timestamp (including date) to get all data you need.

Try to use this query. I don't know your data structure so check INNER JOIN between s and s1 tables. The join must be one row to one row - the difference only in date. Date of s1 rows must be earlier on 1 day than s table rows.
SELECT s.time , SUM( s.love ) AS total_love, SUM( s.sad ) AS total_sad, SUM( s.angry ) AS total_angry, SUM( s.happy ) AS total_happy
FROM (employee_workshift AS e)
JOIN workshift AS w ON e.workshift_uuid = w.uuid
JOIN shift_summary AS s ON w.uuid = s.workshift_uuid
JOIN shift_summary AS s1 ON (w.uuid = s.workshift_uuid AND CAST(s.date as DATE)=CAST(s1.date as DATE)+1)
WHERE s.location_uuid = '81956feb-3fd7-0e84-e9fe-b640434dfad0'
AND e.employee_uuid = '3866a979-bc5e-56cb-cede-863afc47b8b5'
AND s.workshift_uuid = '8c9dbd85-18a3-6ca9-e3f3-06eb602b6f38'
AND s1.time >= CAST( '18:00:00' AS TIME )
AND s.time <= CAST( '00:00:00' AS TIME )
AND s.date LIKE '%2014-03%'

SQL Table-value function optimisation / improvement

We've got a query that is taking a very long time to complete with a large dataset. I think I've tracked it down to a table-value function in the SQL server.
The query is designed to return the difference in printing usage between two dates. So if a printer had usage of 100 at date x and 200 at date y a row needs to be returned which reflects that it has had a usage change of 100.
These readings are taken periodically (but not every day) and stored in a table called MeterReadings. The code for the table-value function is below. This is then called from another SQL query which joins the returned table on a devices table with an inner join to get extra device information.
Any advise as to how to optimise the below would be appreciated.
ALTER FUNCTION [dbo].[DeviceUsage]
-- Add the parameters for the stored procedure here
( #StartDate DateTime , #EndDate DateTime )
RETURNS table
AS
RETURN
(
SELECT MAX(dbo.MeterReadings.ScanDateTime) AS MX,
MAX(dbo.MeterReadings.DeviceTotal - reading.DeviceTotal) AS TotalDiff,
MAX(dbo.MeterReadings.TotalCopy - reading.TotalCopy) AS CopyDiff,
MAX(dbo.MeterReadings.TotalPrint - reading.TotalPrint) AS PrintDiff,
MAX(dbo.MeterReadings.TotalScan - reading.TotalScan) AS ScanDiff,
MAX(dbo.MeterReadings.TotalFax - reading.TotalFax) AS FaxDiff,
MAX(dbo.MeterReadings.TotalMono - reading.TotalMono) AS MonoDiff,
MAX(dbo.MeterReadings.TotalColour - reading.TotalColour) AS ColourDiff,
MIN(reading.ScanDateTime) AS MN, dbo.MeterReadings.DeviceID
FROM dbo.MeterReadings INNER JOIN (SELECT * FROM dbo.MeterReadings WHERE
(dbo.MeterReadings.ScanDateTime > #StartDate) AND
(dbo.MeterReadings.ScanDateTime < #EndDate) )
AS reading ON dbo.MeterReadings.DeviceID = reading.DeviceID
WHERE (dbo.MeterReadings.ScanDateTime > #StartDate) AND (dbo.MeterReadings.ScanDateTime < #EndDate)
GROUP BY dbo.MeterReadings.DeviceID);

On the assumption that a value can only ever increase over time, it can certainly be simplified.
SELECT
DeviceID,
MIN(ScanDateTime) AS MN,
MAX(ScanDateTime) AS MX,
MAX(DeviceTotal ) - MIN(DeviceTotal) AS TotalDiff,
MAX(TotalCopy ) - MIN(TotalCopy ) AS CopyDiff,
MAX(TotalPrint ) - MIN(TotalPrint ) AS PrintDiff,
MAX(TotalScan ) - MIN(TotalScan ) AS ScanDiff,
MAX(TotalFax ) - MIN(TotalFax ) AS FaxDiff,
MAX(TotalMono ) - MIN(TotalMono ) AS MonoDiff,
MAX(TotalColour ) - MIN(TotalColour) AS ColourDiff
FROM
dbo.MeterReadings
WHERE
ScanDateTime > #StartDate
AND ScanDateTime < #EndDate
GROUP BY
DeviceID
This assumes that if you have reading on dates 1, 3, 5, 7, 9 and you want to report on 2 -> 8 then you want reading 7 - reading 3. I would have thought you wanted reading 7 - reading 1?
The above query should be fine for relatively small ranges. If you have Huge ranges of time, the MAX() - MIN() will be operating on large numbers of rows. This can then possibly be improved even further with the following (with correlated sub-queries to lookup just the two rows that you want).
As a side benefit, this also works even if the values can go down as well as up.
(I assume the existance of a Device table for a simpler query and faster performance.)
SELECT
Device.DeviceID,
start.ScanDateTime AS MN,
finish.ScanDateTime AS MX,
finish.DeviceTotal - start.DeviceTotal AS TotalDiff,
finish.TotalCopy - start.TotalCopy AS CopyDiff,
finish.TotalPrint - start.TotalPrint AS PrintDiff,
finish.TotalScan - start.TotalScan AS ScanDiff,
finish.TotalFax - start.TotalFax AS FaxDiff,
finish.TotalMono - start.TotalMono AS MonoDiff,
finish.TotalColour - start.TotalColour AS ColourDiff
FROM
dbo.Device AS device
INNER JOIN
dbo.MeterReadings AS start
ON start.DeviceID = device.DeviceID
AND start.ScanDateTime = (SELECT MIN(ScanDateTime)
FROM dbo.MeterReadings
WHERE DeviceID = device.DeviceID
AND ScanDateTime > #startDate
AND ScanDateTime < #endDate)
INNER JOIN
dbo.MeterReadings AS finish
ON finish.DeviceID = device.DeviceID
AND finish.ScanDateTime = (SELECT MAX(ScanDateTime)
FROM dbo.MeterReadings
WHERE DeviceID = device.DeviceID
AND ScanDateTime > #startDate
AND ScanDateTime < #endDate)
This can also be modified to pick up the start as being the first date on or before #startDate, if required.
EDIT: Modification to pick the start reading as being for the first date on or before #startDate.
SELECT
Device.DeviceID,
start.ScanDateTime AS MN,
finish.ScanDateTime AS MX,
COALESCE(finish.DeviceTotal, 0) - COALESCE(start.DeviceTotal, 0) AS TotalDiff,
COALESCE(finish.TotalCopy , 0) - COALESCE(start.TotalCopy , 0) AS CopyDiff,
COALESCE(finish.TotalPrint , 0) - COALESCE(start.TotalPrint , 0) AS PrintDiff,
COALESCE(finish.TotalScan , 0) - COALESCE(start.TotalScan , 0) AS ScanDiff,
COALESCE(finish.TotalFax , 0) - COALESCE(start.TotalFax , 0) AS FaxDiff,
COALESCE(finish.TotalMono , 0) - COALESCE(start.TotalMono , 0) AS MonoDiff,
COALESCE(finish.TotalColour, 0) - COALESCE(start.TotalColour, 0) AS ColourDiff
FROM
dbo.Device AS device
LEFT JOIN
dbo.MeterReadings AS start
ON start.DeviceID = device.DeviceID
AND start.ScanDateTime = (SELECT MAX(ScanDateTime)
FROM dbo.MeterReadings
WHERE DeviceID = device.DeviceID
AND ScanDateTime < #startDate)
LEFT JOIN
dbo.MeterReadings AS finish
ON finish.DeviceID = device.DeviceID
AND finish.ScanDateTime = (SELECT MAX(ScanDateTime)
FROM dbo.MeterReadings
WHERE DeviceID = device.DeviceID
AND ScanDateTime < #endDate)

Your query seems to compute a cross-product of all readings in a time range for each particular device. This works semantically because the MIN and MAX aggregates don't care about duplicates. But this is very slow. If you are comparing 100 dates with themselves you need to process 10,000 rows.
I suggest you calculate the MIN and MAX values for each metric/column over the entire time interval and then subtract them. That way you don't need to join and you need a single pass ofer the data. Like this:
select Diff = MAX(col) - MIN(col)
from readings
group by DeviceID

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Hourly gaps in data for a month - sql

Related

Proportional distribution of a given value between two dates in SQL Server

SQL Rowwise comparison between groups

Between date condition is not working in SQL Server

CONFLICT WITH "00:00:00" as END_TIME

SQL Table-value function optimisation / improvement

Categories

Resources