In my database I have a Reservation table with three columns: InitialDay, LastDay and HouseId.
I want to count the total days, omitting days that are repeated. For example:
+----------+------------+------------+
| House Id | InitialDay | LastDay    |
+----------+------------+------------+
| 1        | 2017-09-18 | 2017-09-20 |
| 1        | 2017-09-18 | 2017-09-22 |
| 19       | 2017-09-18 | 2017-09-22 |
| 20       | 2017-09-18 | 2017-09-22 |
+----------+------------+------------+
Notice that House Id 1 has two rows, and the date range of the first row lies inside the date range of the second. The total for that house should therefore be 5 days, because the days in the first row are already counted in the second.
The reason why this is happening is that each house has two rooms, and different persons can stay in that house on the same dates.
My question is: how can I omit those cases, and only count the real days the house was occupied?
If you are using SQL Server 2012 or higher, you can use LAG() to get the previous final date and adjust the initial date:
with ReservationAdjusted as (
select *,
lag(LastDay) over(partition by HouseID order by InitialDay, LastDay) as PreviousLast
from Reservation
)
select HouseId,
sum(case when PreviousLast>LastDay then 0 -- fully contained in the previous reservation
when PreviousLast>=InitialDay then datediff(day,PreviousLast,LastDay) -- overlap
else datediff(day,InitialDay,LastDay)+1 -- no overlap
end) as Days
from ReservationAdjusted
group by HouseId
The cases are:
The reservation is fully included in the previous reservation: we only need to compare end dates, because the previous row is obtained by ordering on InitialDay, LastDay, so the previous start date is always less than or equal to the current start date.
The current reservation overlaps with the previous one: in this case we adjust the start and don't add 1 (the initial day is already counted). This case also covers the previous end being equal to the current start (a one-day overlap).
There is no overlap: we just calculate the difference and add 1 so that the initial day is counted too.
Note that we don't need an extra condition for the first reservation of a HouseId: by default LAG() returns NULL when there is no previous row, and a comparison with NULL never evaluates to true, so the CASE falls through to the last branch.
Sample input and output:
| HouseId | InitialDay | LastDay |
|---------|------------|------------|
| 1 | 2017-09-18 | 2017-09-20 |
| 1 | 2017-09-18 | 2017-09-22 |
| 1 | 2017-09-21 | 2017-09-22 |
| 19 | 2017-09-18 | 2017-09-27 |
| 19 | 2017-09-24 | 2017-09-26 |
| 19 | 2017-09-29 | 2017-09-30 |
| 20 | 2017-09-19 | 2017-09-22 |
| 20 | 2017-09-22 | 2017-09-26 |
| 20 | 2017-09-24 | 2017-09-27 |
| HouseId | Days |
|---------|------|
| 1 | 5 |
| 19 | 12 |
| 20 | 9 |
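One caveat with the LAG() approach: it compares each reservation only with the row immediately before it, so a long reservation followed by two short ones inside it can be over-counted. A possible variant, sketched below under the assumption of SQL Server 2012 or later, replaces LAG() with a running maximum of LastDay over all earlier rows of the same house. The table variable @Reservation and its three rows are made up for illustration.

```sql
-- Hypothetical sample: one long stay and two shorter stays inside it
declare @Reservation table (HouseId int, InitialDay date, LastDay date);
insert into @Reservation values
    (1, '2017-09-18', '2017-09-30'),
    (1, '2017-09-19', '2017-09-20'),
    (1, '2017-09-25', '2017-09-26');

with ReservationAdjusted as (
    select *,
           -- latest end date among all earlier reservations of the same house
           max(LastDay) over(partition by HouseId
                             order by InitialDay, LastDay
                             rows between unbounded preceding and 1 preceding) as PreviousMaxLast
    from @Reservation
)
select HouseId,
       sum(case when PreviousMaxLast >= LastDay then 0                  -- already fully covered
                when PreviousMaxLast >= InitialDay
                     then datediff(day, PreviousMaxLast, LastDay)       -- partial overlap
                else datediff(day, InitialDay, LastDay) + 1             -- no overlap, or first row
           end) as Days
from ReservationAdjusted
group by HouseId;
```

For these rows the house is occupied on 13 distinct days (18 to 30 September). Comparing with only the previous row would add 2 extra days for the last reservation, because the row before it ends on the 20th.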
select HouseId, min(InitialDay), max(LastDay)
from Reservation
group by HouseId
If I understood correctly!
Try out and let me know how it works out for you.
Ted.
While thinking through your question I came across the wonder that is the idea of a Calendar table. You'd use this code to create one, with whatever range of dates you want for your calendar. Code is from http://blog.jontav.com/post/9380766884/calendar-tables-are-incredibly-useful-in-sql
declare @start_dt as date = '1/1/2010';
declare @end_dt as date = '1/1/2020';
declare @dates as table (
    date_id date primary key,
    date_year smallint,
    date_month tinyint,
    date_day tinyint,
    weekday_id tinyint,
    weekday_nm varchar(10),
    month_nm varchar(10),
    day_of_year smallint,
    quarter_id tinyint,
    first_day_of_month date,
    last_day_of_month date,
    start_dts datetime,
    end_dts datetime
)
while @start_dt < @end_dt
begin
    insert into @dates(
        date_id, date_year, date_month, date_day,
        weekday_id, weekday_nm, month_nm, day_of_year, quarter_id,
        first_day_of_month, last_day_of_month,
        start_dts, end_dts
    )
    values(
        @start_dt, year(@start_dt), month(@start_dt), day(@start_dt),
        datepart(weekday, @start_dt), datename(weekday, @start_dt), datename(month, @start_dt), datepart(dayofyear, @start_dt), datepart(quarter, @start_dt),
        dateadd(day,-(day(@start_dt)-1),@start_dt), dateadd(day,-(day(dateadd(month,1,@start_dt))),dateadd(month,1,@start_dt)),
        cast(@start_dt as datetime), dateadd(second,-1,cast(dateadd(day, 1, @start_dt) as datetime))
    )
    set @start_dt = dateadd(day, 1, @start_dt)
end
select *
into Calendar
from @dates
Once you have a calendar table your query is as simple as:
select distinct r.House_id, c.date_id
from Reservation as r
inner join Calendar as c
on
c.date_id >= r.InitialDay
and c.date_id <= r.LastDay
This gives you one row for each unique day on which each house was occupied. If you need the number of days each house was occupied, it becomes:
select a.House_id, count(a.House_id) as Days_occupied
from
(select distinct t.House_id, c.date_id
from Reservation as t
inner join Calendar as c
on
c.date_id >= t.InitialDay
and c.date_id <= t.LastDay) as a
group by a.House_id
Create a table of all the possible dates and then join it to the Reservations table so that you have a list of all days between InitialDay and LastDay. Like this:
DECLARE @i date
DECLARE @last date
CREATE TABLE #temp (Date date)
SELECT @i = MIN(InitialDay) FROM Reservation
SELECT @last = MAX(LastDay) FROM Reservation
WHILE @i <= @last
BEGIN
    INSERT INTO #temp VALUES(@i)
    SET @i = DATEADD(day, 1, @i)
END
SELECT HouseID, COUNT(*) FROM
(
SELECT DISTINCT HouseID, Date FROM Reservation
LEFT JOIN #temp
ON Reservation.InitialDay <= #temp.Date
AND Reservation.LastDay >= #temp.Date
) AS a
GROUP BY HouseID
DROP TABLE #temp
I'm currently working on some reports from MS Project Server and found this oddity:
For some obscure reason, whenever you appoint to the same task with the same amount of time in consecutive days, instead of creating an entry for each appointment, the application updates the start date and the finish date fields on database, leaving only one entry for that task, but with a range between the dates.
If the amounts of time appointed to the task on consecutive days are different, then one entry is created per appointment.
(Yes, I know, it's kind of confusing. I don't even know how to explain this better).
I want to know if it is somehow possible to generate more rows within SQL statement whenever there is a difference between the start and the finish date, one for each day in the range.
This is the query I have right now, I already can tell which rows have this date difference, but I don't know what I can do next.
select
r.WRES_ID, r.RES_NAME, PROJ_NAME, p.WPROJ_ID, TASK_NAME, WWORK_VALUE, WWORK_START, WWORK_FINISH,
datediff(d, WWORK_START, WWORK_FINISH) + 1 AS work_days
from MSP_WEB_RESOURCES r
join
MSP_WEB_ASSIGNMENTS a on a.WRES_ID = r.WRES_ID
join
MSP_WEB_PROJECTS p on p.WPROJ_ID = a.WPROJ_ID
join
MSP_WEB_WORK w on w.WASSN_ID = a.WASSN_ID
where RES_NAME = 'HenriqueBarcelos'
and WWORK_TYPE = 1
and WWORK_VALUE > 0
and WWORK_FINISH between '2014-01-27' and '2014-01-31'
order by WWORK_FINISH DESC
I know I could do this at the application level, but I was wondering if I could just do it within the database itself.
Thanks in advance.
Edit:
These are my current results:
WRES_ID | RES_NAME | TASK_NAME | WWORK_VALUE | WWORK_START | WWORK_FINISH | work_days
--------+------------------+-------------------------+---------------+---------------------+---------------------+----------
382 | HenriqueBarcelos | Outsourcing Initiatives | 60000.000000 | 2014-01-30 00:00:00 | 2014-01-30 00:00:00 | 1
382 | HenriqueBarcelos | Internal Training | 289800.000000 | 2014-01-29 00:00:00 | 2014-01-29 00:00:00 | 1
382 | HenriqueBarcelos | Outsourcing Initiatives | 120000.000000 | 2014-01-29 00:00:00 | 2014-01-29 00:00:00 | 1
382 | HenriqueBarcelos | Outsourcing Initiatives | 60000.000000 | 2014-01-27 00:00:00 | 2014-01-28 00:00:00 | 2
382 | HenriqueBarcelos | Infrastructure (TI) | 120000.000000 | 2014-01-27 00:00:00 | 2014-01-27 00:00:00 | 1
Notice that the second-to-last row has a range of 2 days. Indeed, there are 2 appointments, one on Jan 27th and the other on the 28th.
What I want to do is expand this and return one entry per day in this case.
It can be done, but it's not very elegant. First you need a function that will expand the date range into sequence of dates:
CREATE FUNCTION ufn_Expand(@start DATE, @end DATE)
RETURNS TABLE
AS
RETURN
WITH cte AS
(
    SELECT @start AS dt
    UNION ALL
    SELECT DATEADD(dd, 1, dt) FROM cte WHERE dt < @end
)
SELECT dt FROM cte
Then use that in your query with CROSS APPLY:
SELECT /* your columns */, x.dt
FROM /* your joins */
CROSS APPLY ufn_Expand(WWORK_START, WWORK_FINISH) x
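One thing to watch with this function: a recursive CTE in SQL Server stops after 100 recursion levels by default, so a range longer than about 100 days raises an error. A minimal sketch of a call that lifts the limit, assuming the function above was created in the dbo schema. The inline date pair is sample data, not the real MSP tables.

```sql
SELECT w.WWORK_START, w.WWORK_FINISH, x.dt
FROM (VALUES (CAST('2014-01-27' AS date), CAST('2014-06-30' AS date)))
     AS w(WWORK_START, WWORK_FINISH)                 -- hypothetical row with a range of 155 days
CROSS APPLY dbo.ufn_Expand(w.WWORK_START, w.WWORK_FINISH) AS x
OPTION (MAXRECURSION 366);                            -- allow up to 366 recursion levels for this statement
```

As far as I know the hint has to go on the calling statement, because an OPTION clause is not accepted inside the body of an inline table-valued function.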
I'd use a numbers table (nice and set-based, yum!)
SELECT start_date
, end_date
, DateDiff(dd, start_date, end_date) + 1 As number_of_days --rows to display
FROM your_table
INNER
JOIN dbo.numbers
ON numbers.number BETWEEN 1 AND DateDiff(dd, start_date, end_date) + 1
Use your favourite search engine to find a numbers table script. Here's one I made earlier.
As an aside: if you remove the +1s you just modify the join to be between zero and the DateDiff() - I added the +1s as I thought it might be clearer!
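The query above returns one row per day but does not show which day each row stands for. A small sketch of how the date could be derived from the number, using an inline stand-in for dbo.numbers and one made-up row in place of your_table (both are assumptions for the sake of a self-contained example):

```sql
WITH numbers(number) AS (
    -- stand-in for a real numbers table, 1 through 7 only
    SELECT n FROM (VALUES (1),(2),(3),(4),(5),(6),(7)) AS v(n)
),
your_table(start_date, end_date) AS (
    -- hypothetical row covering two days
    SELECT CAST('2014-01-27' AS date), CAST('2014-01-28' AS date)
)
SELECT start_date
     , end_date
     , DateAdd(dd, numbers.number - 1, start_date) AS work_date  -- the individual day for this row
FROM   your_table
 INNER
  JOIN numbers
    ON numbers.number BETWEEN 1 AND DateDiff(dd, start_date, end_date) + 1;
```

With the sample row this produces two rows, one for 2014-01-27 and one for 2014-01-28.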
You can look at this from another perspective. You don't really want a row for each worked day. What you really need is the number of worked days multiplied by the reported work time. Something like this:
(dbo.MSP_WEB_WORK.WWORK_VALUE / 60000) * (DATEDIFF(day, dbo.MSP_WEB_WORK.WWORK_START, dbo.MSP_WEB_WORK.WWORK_FINISH) + 1)
However, this creates an issue. Let's say you want to report on a given period. If you use the WWORK_START and WWORK_FINISH dates for your report, you need to be careful to include all work that has only some of its days inside the period. Something like this will do it:
DECLARE @InitDate DATETIME;
DECLARE @EndDate DATETIME;
SET @InitDate = '2016/06/01';
SET @EndDate = '2016/07/01';
--Full list of tasks
SELECT dbo.MSP_WEB_RESOURCES.RES_NAME AS Name, dbo.MSP_WEB_PROJECTS.PROJ_NAME AS Project,
    dbo.MSP_WEB_WORK.WWORK_VALUE / 60000 AS ReportedWork,
    CASE
        WHEN WWORK_START < @InitDate THEN DATEDIFF(day, @InitDate, dbo.MSP_WEB_WORK.WWORK_FINISH) + 1 --If the task started before the start of the period
        WHEN WWORK_FINISH > DATEADD(day,-1,@EndDate) THEN DATEDIFF(day, WWORK_START, DATEADD(day,-1,@EndDate)) + 1 --if the task ended after the end of the period
        ELSE DATEDIFF(day, dbo.MSP_WEB_WORK.WWORK_START, dbo.MSP_WEB_WORK.WWORK_FINISH) + 1 --All tasks with start and end date inside the period
    END AS RepeatedDays,
    CASE
        WHEN WWORK_START < @InitDate THEN (dbo.MSP_WEB_WORK.WWORK_VALUE / 60000) * (DATEDIFF(day, @InitDate, dbo.MSP_WEB_WORK.WWORK_FINISH) + 1)
        WHEN WWORK_FINISH > DATEADD(day,-1,@EndDate) THEN (dbo.MSP_WEB_WORK.WWORK_VALUE / 60000) * (DATEDIFF(day, WWORK_START, DATEADD(day,-1,@EndDate)) + 1)
        ELSE (dbo.MSP_WEB_WORK.WWORK_VALUE / 60000) * (DATEDIFF(day, dbo.MSP_WEB_WORK.WWORK_START, dbo.MSP_WEB_WORK.WWORK_FINISH) + 1)
    END AS ActualWork,
    dbo.MSP_WEB_WORK.WWORK_START,
    dbo.MSP_WEB_WORK.WWORK_FINISH
FROM dbo.MSP_WEB_RESOURCES INNER JOIN
    dbo.MSP_WEB_ASSIGNMENTS INNER JOIN
    dbo.MSP_WEB_PROJECTS ON dbo.MSP_WEB_ASSIGNMENTS.WPROJ_ID = dbo.MSP_WEB_PROJECTS.WPROJ_ID INNER JOIN
    dbo.MSP_WEB_WORK ON dbo.MSP_WEB_ASSIGNMENTS.WASSN_ID = dbo.MSP_WEB_WORK.WASSN_ID ON
    dbo.MSP_WEB_RESOURCES.WRES_ID = dbo.MSP_WEB_ASSIGNMENTS.WRES_ID
WHERE (dbo.MSP_WEB_WORK.WWORK_TYPE = 1) AND
    (
        @InitDate BETWEEN dbo.MSP_WEB_WORK.WWORK_START and dbo.MSP_WEB_WORK.WWORK_FINISH OR
        DATEADD(day,-1,@EndDate) BETWEEN dbo.MSP_WEB_WORK.WWORK_START and dbo.MSP_WEB_WORK.WWORK_FINISH OR
        (dbo.MSP_WEB_WORK.WWORK_START >= @InitDate) AND
        (dbo.MSP_WEB_WORK.WWORK_FINISH < @EndDate)
    )
ORDER BY dbo.MSP_WEB_WORK.WWORK_START;
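The CASE expressions above treat "started before the period" and "ended after the period" as separate cases, so a task that does both is not covered explicitly. One way to handle every case with a single expression is to clamp both ends to the period before taking the difference. A sketch with made-up date pairs (the column names follow the query above, everything else is illustrative):

```sql
DECLARE @InitDate date = '2016-06-01';   -- first day of the period
DECLARE @EndDate  date = '2016-07-01';   -- first day after the period

SELECT s.WWORK_START, s.WWORK_FINISH,
       DATEDIFF(day,
                -- later of task start and period start
                CASE WHEN s.WWORK_START < @InitDate THEN @InitDate ELSE s.WWORK_START END,
                -- earlier of task finish and last day of the period
                CASE WHEN s.WWORK_FINISH >= @EndDate THEN DATEADD(day, -1, @EndDate) ELSE s.WWORK_FINISH END
       ) + 1 AS DaysInPeriod
FROM (VALUES (CAST('2016-05-25' AS date), CAST('2016-06-03' AS date)),   -- starts before the period
             (CAST('2016-05-25' AS date), CAST('2016-07-10' AS date)),   -- spans the whole period
             (CAST('2016-06-10' AS date), CAST('2016-06-12' AS date)))   -- entirely inside the period
     AS s(WWORK_START, WWORK_FINISH);
```

For these three rows the expression gives 3, 30 and 3 days. It assumes the rows have already been filtered to those that touch the period, as the WHERE clause above does.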
I have a table with 2 columns. UTCTime and Values.
UTCTime is in 15-minute increments. I want a query that compares each value to the previous value within a one-hour span and displays a number between 0 and 4, depending on how many consecutive values are constant. In other words, there is an entry for every 15-minute increment, and the value can stay constant, so I just need to compare each value to the previous one, per hour.
For example
+---------+-------+
| UTCTime | Value |
+---------+-------+
| 12:00   | 18.2  |
| 12:15   | 87.3  |
| 12:30   | 55.91 |
| 12:45   | 55.91 |
| 1:00    | 37.3  |
| 1:15    | 47.3  |
| 1:30    | 47.3  |
| 1:45    | 47.3  |
| 2:00    | 37.3  |
+---------+-------+
In this case, I just want a query that compares the 12:45 value to the 12:30 value, the 12:30 value to the 12:15 value, and so on. Since we are comparing within a one-hour span only, the number of constant values must be between 0 and 4 (0 means there are no constant values, 1 means there is one, as in the 12:00 hour of the example above).
The query should display:
+---------+----------------+
| UTCTime | ConstantValues |
+---------+----------------+
| 12:00   | 1              |
| 1:00    | 2              |
+---------+----------------+
I just wanted to mention that I am new to SQL programming.
Thank you.
See SQL fiddle here
Below is the query you need and a working solution. Note: I changed the timeframe to 24 hrs.
;with SourceData(HourTime, Value, RowNum)
as
(
select
datepart(hh, UTCTime) HourTime,
Value,
row_number() over (partition by datepart(hh, UTCTime) order by UTCTime) RowNum
from foo
union
select
datepart(hh, UTCTime) - 1 HourTime,
Value,
5
from foo
where datepart(mi, UTCTime) = 0
)
select cast(A.HourTime as varchar) + ':00' UTCTime, sum(case when A.Value = B.Value then 1 else 0 end) ConstantValues
from SourceData A
inner join SourceData B on A.HourTime = B.HourTime and
(B.RowNum = (A.RowNum - 1))
group by cast(A.HourTime as varchar) + ':00'
select SUBSTRING_INDEX(UTCTime,':',1) as time,value, count(*)-1 as total
from foo group by value,time having total >= 1;
fiddle
Mine isn't much different from Vasanth's, same idea different approach.
The idea is that you need a self-join to carry it out simply. You could also use the LEAD() function to look at rows ahead of your current row, but in this case that would require a big case statement to cover every outcome.
;WITH T
AS (
SELECT a.UTCTime,b.VALUE,ROW_NUMBER() OVER(PARTITION BY a.UTCTime ORDER BY b.UTCTime DESC)'RowRank'
FROM (SELECT *
FROM #Table1
WHERE DATEPART(MINUTE,UTCTime) = 0
)a
JOIN #Table1 b
ON b.UTCTIME BETWEEN a.UTCTIME AND DATEADD(hour,1,a.UTCTIME)
)
SELECT T.UTCTime, SUM(CASE WHEN T.Value = T2.Value THEN 1 ELSE 0 END)
FROM T
JOIN T T2
ON T.UTCTime = T2.UTCTime
AND T.RowRank = T2.RowRank -1
GROUP BY T.UTCTime
If you run the portion inside the ;WITH T AS ( ) you'll see that it gets us the hour we're looking at and the values ordered by time. The outer query then joins T to itself and compares each row with the next row (hence the RowRank - 1 in the JOIN).
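For comparison, here is a rough sketch of the LEAD() route mentioned above, assuming SQL Server 2012 or later. It compares each reading with the one that follows it and sums the matches per hour. The table variable @foo and the full datetime values are assumptions made so the example is self-contained; the question only shows times of day.

```sql
-- Hypothetical copy of the sample data; full datetimes so that the rows sort correctly
DECLARE @foo TABLE (UTCTime datetime PRIMARY KEY, Value decimal(9,2));
INSERT INTO @foo (UTCTime, Value) VALUES
    ('2014-01-01T12:00:00', 18.2),  ('2014-01-01T12:15:00', 87.3),
    ('2014-01-01T12:30:00', 55.91), ('2014-01-01T12:45:00', 55.91),
    ('2014-01-01T13:00:00', 37.3),  ('2014-01-01T13:15:00', 47.3),
    ('2014-01-01T13:30:00', 47.3),  ('2014-01-01T13:45:00', 47.3),
    ('2014-01-01T14:00:00', 37.3);

WITH compared AS (
    SELECT DATEADD(hour, DATEDIFF(hour, 0, UTCTime), 0) AS HourStart,  -- truncate to the hour
           CASE WHEN LEAD(Value) OVER (ORDER BY UTCTime) = Value
                THEN 1 ELSE 0 END AS SameAsNext                         -- 1 if the next reading repeats this one
    FROM @foo
)
SELECT HourStart, SUM(SameAsNext) AS ConstantValues
FROM compared
GROUP BY HourStart
ORDER BY HourStart;
```

With this data the 12:00 hour gets 1 and the 13:00 hour gets 2, matching the requested output; the trailing 14:00 hour also appears, with 0. Because the last reading of an hour is compared with the first reading of the next hour, the count per hour can range from 0 to 4.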
I'm writing an app that handles scheduling time off for some of our employees. As part of this, I need to calculate how many minutes throughout the day that they have requested off.
In the first version of this tool, we disallowed overlapping time off requests, because we wanted to be able to just add up the total of EndTime minus StartTime for all requests. Preventing overlaps makes this calculation very fast.
This has become problematic, because Managers now want to schedule team meetings but are unable to do so when someone has already asked for the day off.
So, in the new version of the tool, we have a requirement to allow overlapping requests.
Here is an example set of data like what we have:
UserId | StartDate | EndDate
----------------------------
1 | 2:00 | 4:00
1 | 3:00 | 5:00
1 | 3:45 | 9:00
2 | 6:00 | 9:00
2 | 7:00 | 8:00
3 | 2:00 | 3:00
3 | 4:00 | 5:00
4 | 1:00 | 7:00
The result that I need to get, as efficiently as possible, is this:
UserId | StartDate | EndDate
----------------------------
1 | 2:00 | 9:00
2 | 6:00 | 9:00
3 | 2:00 | 3:00
3 | 4:00 | 5:00
4 | 1:00 | 7:00
We can easily detect overlaps with this query:
select
*
from
requests r1
cross join
requests r2
where
r1.RequestId < r2.RequestId
and
r1.StartTime < r2.EndTime
and
r2.StartTime < r1.EndTime
This is, in fact, how we were detecting and preventing the problems originally.
Now, we are trying to merge the overlapping items, but I'm reaching the limits of my SQL ninja skills.
It wouldn't be too hard to come up with a method using temp tables, but we want to avoid this if at all possible.
Is there a set-based way to merge overlapping rows?
Edit:
It would also be acceptable for all of the rows to show up, as long as they were collapsed into just their time. For example, if someone wants off from three to five, and from four to six, it would be acceptable for them to have two rows: one from three to five and the next from five to six, OR one from three to four and the next from four to six.
Also, here is a little test bench:
DECLARE @requests TABLE
(
    UserId int,
    StartDate time,
    EndDate time
)
INSERT INTO @requests (UserId, StartDate, EndDate) VALUES
(1, '2:00', '4:00'),
(1, '3:00', '5:00'),
(1, '3:45', '9:00'),
(2, '6:00', '9:00'),
(2, '7:00', '8:00'),
(3, '2:00', '3:00'),
(3, '4:00', '5:00'),
(4, '1:00', '7:00');
Complete Rewrite:
;WITH new_grp AS (
    SELECT r1.UserId, r1.StartTime
    FROM @requests r1
    WHERE NOT EXISTS (
        SELECT *
        FROM @requests r2
        WHERE r1.UserId = r2.UserId
        AND r2.StartTime < r1.StartTime
        AND r2.EndTime >= r1.StartTime)
    GROUP BY r1.UserId, r1.StartTime -- there can be > 1
),r AS (
    SELECT r.RequestId, r.UserId, r.StartTime, r.EndTime
    ,count(*) AS grp -- guaranteed to be 1+
    FROM @requests r
    JOIN new_grp n ON n.UserId = r.UserId AND n.StartTime <= r.StartTime
    GROUP BY r.RequestId, r.UserId, r.StartTime, r.EndTime
)
SELECT min(RequestId) AS RequestId
,UserId
,min(StartTime) AS StartTime
,max(EndTime) AS EndTime
FROM r
GROUP BY UserId, grp
ORDER BY UserId, grp
Now produces the requested result and really covers all possible cases, including disjunct sub-groups and duplicates.
Have a look at the comments to the test data in the working demo at data.SE.
CTE 1
Find the (unique!) points in time where a new group of overlapping intervals starts.
CTE 2
Count the starts of new group up to (and including) every individual interval, thereby forming a unique group number per user.
Final SELECT
Merge the groups, taking the earliest start and latest end of each group.
I faced some difficulty because, before SQL Server 2012, T-SQL aggregate window functions such as max() or sum() do not accept an ORDER BY clause in the window. They can only compute one value per partition, which makes it impossible to compute a running sum / count per partition. This would work in PostgreSQL or Oracle (but not in MySQL before 8.0, of course, which has neither window functions nor CTEs).
The final solution uses one extra CTE and should be just as fast.
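On SQL Server 2012 or later, aggregate window functions do accept ORDER BY and a ROWS frame, which allows a shorter formulation of the same three steps. The following is only a sketch under that version assumption, run against the question's test bench columns (UserId, StartDate, EndDate); the names marked, grouped, IsNewGroup and grp are made up for the example.

```sql
DECLARE @requests TABLE (UserId int, StartDate time, EndDate time);
INSERT INTO @requests (UserId, StartDate, EndDate) VALUES
    (1, '2:00', '4:00'), (1, '3:00', '5:00'), (1, '3:45', '9:00'),
    (2, '6:00', '9:00'), (2, '7:00', '8:00'),
    (3, '2:00', '3:00'), (3, '4:00', '5:00'),
    (4, '1:00', '7:00');

WITH marked AS (
    SELECT UserId, StartDate, EndDate,
           -- a row opens a new group when it starts after every earlier row has ended
           CASE WHEN StartDate <= MAX(EndDate) OVER (PARTITION BY UserId
                                                     ORDER BY StartDate, EndDate
                                                     ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
                THEN 0 ELSE 1 END AS IsNewGroup
    FROM @requests
), grouped AS (
    SELECT UserId, StartDate, EndDate,
           -- running count of group starts = group number per user
           SUM(IsNewGroup) OVER (PARTITION BY UserId
                                 ORDER BY StartDate, EndDate
                                 ROWS UNBOUNDED PRECEDING) AS grp
    FROM marked
)
SELECT UserId, MIN(StartDate) AS StartDate, MAX(EndDate) AS EndDate
FROM grouped
GROUP BY UserId, grp
ORDER BY UserId, MIN(StartDate);
```

For the sample data this returns the five rows asked for in the question. Requests that merely touch (one ends exactly when the next starts) are merged as well, because the comparison uses <=.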
Ok, it is possible to do with CTEs. I did not know how to use them at the beginning of the night, but here are the results of my research:
A recursive CTE has 2 parts, the "anchor" statement and the "recursive" statements.
The crucial part about the recursive statement is that when it is evaluated, only the rows that have not already been evaluated will show up in the recursion.
So, for example, if we wanted to use CTEs to get an all-inclusive list of times for these users, we could use something like this:
WITH
sorted_requests as (
SELECT
UserId, StartDate, EndDate,
ROW_NUMBER() OVER (PARTITION BY UserId ORDER BY StartDate, EndDate DESC) Instance
FROM @requests
),
no_overlap(UserId, StartDate, EndDate, Instance) as (
SELECT *
FROM sorted_requests
WHERE Instance = 1
UNION ALL
SELECT s.*
FROM sorted_requests s
INNER JOIN no_overlap n
ON s.UserId = n.UserId
AND s.Instance = n.Instance + 1
)
SELECT *
FROM no_overlap
Here, the "anchor" statement is just the first instance for every user, WHERE Instance = 1.
The "recursive" statement joins each row to the next row in the set, using the s.UserId = n.UserId AND s.Instance = n.Instance + 1
Now, we can use the property of the data, when sorted by start date, that any overlapping row will have a start date that is less than the previous row's end date. If we continually propagate the row number of the first intersecting row, every subsequent overlapping row will share that row number.
Using this query:
WITH
sorted_requests as (
SELECT
UserId, StartDate, EndDate,
ROW_NUMBER() OVER (PARTITION BY UserId ORDER BY StartDate, EndDate DESC) Instance
FROM
@requests
),
no_overlap(UserId, StartDate, EndDate, Instance, ConnectedGroup) as (
SELECT
UserId,
StartDate,
EndDate,
Instance,
Instance as ConnectedGroup
FROM sorted_requests
WHERE Instance = 1
UNION ALL
SELECT
s.UserId,
s.StartDate,
CASE WHEN n.EndDate >= s.EndDate
THEN n.EndDate
ELSE s.EndDate
END EndDate,
s.Instance,
CASE WHEN n.EndDate >= s.StartDate
THEN n.ConnectedGroup
ELSE s.Instance
END ConnectedGroup
FROM sorted_requests s
INNER JOIN no_overlap n
ON s.UserId = n.UserId AND s.Instance = n.Instance + 1
)
SELECT
UserId,
MIN(StartDate) StartDate,
MAX(EndDate) EndDate
FROM no_overlap
GROUP BY UserId, ConnectedGroup
ORDER BY UserId
We group by the aforementioned "first intersecting row" (called ConnectedGroup in this query) and find the minimum start time and maximum end time in that group.
The first intersecting row is propagated using this statement:
CASE WHEN n.EndDate >= s.StartDate
THEN n.ConnectedGroup
ELSE s.Instance
END ConnectedGroup
Which basically says, "if this row intersects with the previous row (based on us being sorted by start date), then consider this row to have the same 'row grouping' as the previous row. Otherwise, use this row's own row number as the 'row grouping' for itself."
This gives us exactly what we were looking for.
EDIT
When I had originally thought this up on my whiteboard, I knew that I would have to advance the EndDate of each row, to ensure that it would intersect with the next row, if any of the previous rows in the connected group would have intersected. I accidentally left that out. This has been corrected.
This works for postgres. Microsoft might need some modifications.
SET search_path='tmp';
DROP TABLE tmp.schedule CASCADE;
CREATE TABLE tmp.schedule
( person_id INTEGER NOT NULL
, dt_from timestamp with time zone
, dt_to timestamp with time zone
);
INSERT INTO schedule( person_id, dt_from, dt_to) VALUES
( 1, '2011-12-03 02:00:00' , '2011-12-03 04:00:00' )
, ( 1, '2011-12-03 03:00:00' , '2011-12-03 05:00:00' )
, ( 1, '2011-12-03 03:45:00' , '2011-12-03 09:00:00' )
, ( 2, '2011-12-03 06:00:00' , '2011-12-03 09:00:00' )
, ( 2, '2011-12-03 07:00:00' , '2011-12-03 08:00:00' )
, ( 3, '2011-12-03 02:00:00' , '2011-12-03 03:00:00' )
, ( 3, '2011-12-03 04:00:00' , '2011-12-03 05:00:00' )
, ( 4, '2011-12-03 01:00:00' , '2011-12-03 07:00:00' );
ALTER TABLE schedule ADD PRIMARY KEY (person_id,dt_from)
;
CREATE UNIQUE INDEX ON schedule (person_id,dt_to);
SELECT * FROM schedule ORDER BY person_id, dt_from;
WITH RECURSIVE ztree AS (
-- Terminal part
SELECT p1.person_id AS person_id
, p1.dt_from AS dt_from
, p1.dt_to AS dt_to
FROM schedule p1
UNION
-- Recursive part
SELECT p2.person_id AS person_id
, LEAST(p2.dt_from, zzt.dt_from) AS dt_from
, GREATEST(p2.dt_to, zzt.dt_to) AS dt_to
FROM ztree AS zzt
, schedule AS p2
WHERE 1=1
AND p2.person_id = zzt.person_id
AND (p2.dt_from < zzt.dt_from AND p2.dt_to >= zzt.dt_from)
)
SELECT *
FROM ztree zt
WHERE NOT EXISTS (
SELECT * FROM ztree nx
WHERE nx.person_id = zt.person_id
-- the recursive query returns *all possible combinations of
-- touching or overlapping intervals
-- we'll have to filter, keeping only the biggest ones
-- (the ones for which there is no bigger overlapping interval)
AND ( (nx.dt_from <= zt.dt_from AND nx.dt_to > zt.dt_to)
OR (nx.dt_from < zt.dt_from AND nx.dt_to >= zt.dt_to)
)
)
ORDER BY zt.person_id,zt.dt_from
;
Result:
DROP TABLE
CREATE TABLE
INSERT 0 8
NOTICE: ALTER TABLE / ADD PRIMARY KEY will create implicit index "schedule_pkey" for table "schedule"
ALTER TABLE
CREATE INDEX
person_id | dt_from | dt_to
-----------+------------------------+------------------------
1 | 2011-12-03 02:00:00+01 | 2011-12-03 04:00:00+01
1 | 2011-12-03 03:00:00+01 | 2011-12-03 05:00:00+01
1 | 2011-12-03 03:45:00+01 | 2011-12-03 09:00:00+01
2 | 2011-12-03 06:00:00+01 | 2011-12-03 09:00:00+01
2 | 2011-12-03 07:00:00+01 | 2011-12-03 08:00:00+01
3 | 2011-12-03 02:00:00+01 | 2011-12-03 03:00:00+01
3 | 2011-12-03 04:00:00+01 | 2011-12-03 05:00:00+01
4 | 2011-12-03 01:00:00+01 | 2011-12-03 07:00:00+01
(8 rows)
person_id | dt_from | dt_to
-----------+------------------------+------------------------
1 | 2011-12-03 02:00:00+01 | 2011-12-03 09:00:00+01
2 | 2011-12-03 06:00:00+01 | 2011-12-03 09:00:00+01
3 | 2011-12-03 02:00:00+01 | 2011-12-03 03:00:00+01
3 | 2011-12-03 04:00:00+01 | 2011-12-03 05:00:00+01
4 | 2011-12-03 01:00:00+01 | 2011-12-03 07:00:00+01
(5 rows)
I've just received a new data source for my application which inserts data into a Derby database only when it changes. Normally, missing data is fine: I'm drawing a line chart with the data (value over time), and I'd just draw a line between the two points, interpolating the expected value at any given point. The problem is that missing data in this case means the value has not changed (a flat line until the next point), so the graph would be incorrect if I did this.
There are two ways I could fix this: I could create a new class that handles missing data differently (which could be difficult due to the way prefuse, the drawing library I'm using, handles drawing), or I could duplicate the rows, leaving the y value the same while changing the x value in each row. I could do this in the Java that bridges the database and the renderer, or I could modify the SQL.
My question is, given a result set like the one below:
+-------+---------------------+
| value | received |
+-------+---------------------+
| 7 | 2000-01-01 08:00:00 |
| 10 | 2000-01-01 08:00:05 |
| 11 | 2000-01-01 08:00:07 |
| 2 | 2000-01-01 08:00:13 |
| 4 | 2000-01-01 08:00:16 |
+-------+---------------------+
Assuming I query it at 8:00:20, how can I make it look like the following using SQL? Basically, I'm duplicating the row for every second until it's already taken. received is, for all intents and purposes, unique (it's not, but it will be due to the WHERE clause in the query).
+-------+---------------------+
| value | received |
+-------+---------------------+
| 7 | 2000-01-01 08:00:00 |
| 7 | 2000-01-01 08:00:01 |
| 7 | 2000-01-01 08:00:02 |
| 7 | 2000-01-01 08:00:03 |
| 7 | 2000-01-01 08:00:04 |
| 10 | 2000-01-01 08:00:05 |
| 10 | 2000-01-01 08:00:06 |
| 11 | 2000-01-01 08:00:07 |
| 11 | 2000-01-01 08:00:08 |
| 11 | 2000-01-01 08:00:09 |
| 11 | 2000-01-01 08:00:10 |
| 11 | 2000-01-01 08:00:11 |
| 11 | 2000-01-01 08:00:12 |
| 2 | 2000-01-01 08:00:13 |
| 2 | 2000-01-01 08:00:14 |
| 2 | 2000-01-01 08:00:15 |
| 4 | 2000-01-01 08:00:16 |
| 4 | 2000-01-01 08:00:17 |
| 4 | 2000-01-01 08:00:18 |
| 4 | 2000-01-01 08:00:19 |
| 4 | 2000-01-01 08:00:20 |
+-------+---------------------+
Thanks for your help.
Due to the set based nature of SQL, there's no simple way to do this. I have used two solution strategies:
a) use a cycle to go from the initial to end date time and for each step get the value, and insert that into a temp table
b) generate a table (normal or temporary) with the 1 minute increments, adding the base date time to this table you can generate the steps.
Example of approach b) (SQL Server version)
Let's assume we will never query more than 24 hours of data. We create a table intervals that has a dttm field with the minute count for each step. That table must be populated previously.
select dateadd(minute,dttm,'2000-01-01 08:00') received,
(select top 1 value from table where received <=
dateadd(minute,dttm,'2000-01-01 08:00')
order by received desc) value
from intervals
It seems like in this case you really don't need to generate all of these data points. Would it be correct to generate the following instead? If it's drawing a straight line, you don't need to generate a data point for each second, just two for each data point: one at the current time, one right before the next time. This example subtracts 5 ms from the next time, but you could make it a full second if you need it.
+-------+---------------------+
| value | received |
+-------+---------------------+
| 7 | 2000-01-01 08:00:00 |
| 7 | 2000-01-01 08:00:04 |
| 10 | 2000-01-01 08:00:05 |
| 10 | 2000-01-01 08:00:06 |
| 11 | 2000-01-01 08:00:07 |
| 11 | 2000-01-01 08:00:12 |
| 2 | 2000-01-01 08:00:13 |
| 2 | 2000-01-01 08:00:15 |
| 4 | 2000-01-01 08:00:16 |
| 4 | 2000-01-01 08:00:20 |
+-------+---------------------+
If that's the case, then you can do the following:
SELECT * FROM
(SELECT * from TimeTable as t1
UNION
SELECT t2.value, dateadd(ms, -5, t2.received)
from ( Select t3.value, (select top 1 t4.received
from TimeTable t4
where t4.received > t3.received
order by t4.received asc) as received
from TimeTable t3) as t2
UNION
SELECT top 1 t6.value, GETDATE()
from TimeTable t6
order by t6.received desc
) as t5
where received IS NOT NULL
order by t5.received
The big advantage of this is that it is a set based solution and will be much faster than any iterative approach.
You could just walk a cursor, keep vars for the last value & time returned, and if the current one is more than a second ahead, loop one second at a time using the previous value and the new time until you get to the current row's time.
Trying to do this in SQL would be painful, and if you went and created the missing data, you would possibly have to add a column to track real / interpolated data points.
Better would be to have a table for each axial value you want to have on the graph, and then either join to it or even just put the data field there and update that record when/if values arrive.
The "missing values" problem is quite extensive, so I suggest you have a solid policy.
One thing that will happen is that you will have multiple adjacent slots with missing values.
This would be much easier if you could transform it into OLAP data.
Create a simple table that has all the minutes (warning, will run for a while):
Create Table Minutes(Value DateTime Not Null)
Go
Declare @D DateTime
Set @D = '1/1/2000'
While (Year(@D) < 2002)
Begin
    Insert Into Minutes(Value) Values(@D)
    Set @D = DateAdd(Minute, 1, @D)
End
Go
Create Clustered Index IX_Minutes On Minutes(Value)
Go
You can then use it somewhat like this:
Select
    Received = Minutes.Value,
    Value = (Select Top 1 Data.Value
             From Data
             Where Data.Received <= Minutes.Value
             Order By Data.Received Desc)
From
    Minutes
Where
    Minutes.Value Between @Start And @End
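The question asks for one row per second rather than per minute. The same "latest value at or before this instant" lookup can be tried at that granularity without a permanent helper table, by generating the seconds with a recursive CTE. This is a SQL Server sketch only (it will not run on Derby as written), and the table variable @data with its five rows is a made-up copy of the question's sample.

```sql
DECLARE @data TABLE (value int, received datetime);
INSERT INTO @data (value, received) VALUES
    (7,  '2000-01-01T08:00:00'), (10, '2000-01-01T08:00:05'),
    (11, '2000-01-01T08:00:07'), (2,  '2000-01-01T08:00:13'),
    (4,  '2000-01-01T08:00:16');

DECLARE @start datetime = '2000-01-01T08:00:00';
DECLARE @end   datetime = '2000-01-01T08:00:20';   -- the moment of the query

WITH seconds AS (
    SELECT @start AS ts                             -- anchor: first second
    UNION ALL
    SELECT DATEADD(second, 1, ts) FROM seconds WHERE ts < @end
)
SELECT d.value, s.ts AS received
FROM seconds AS s
CROSS APPLY (SELECT TOP (1) value                   -- latest known value at or before this second
             FROM @data
             WHERE received <= s.ts
             ORDER BY received DESC) AS d
ORDER BY s.ts
OPTION (MAXRECURSION 0);                            -- lift the default limit of 100 levels for longer ranges
```

For the sample this yields the 21 rows shown in the question, from 08:00:00 to 08:00:20. For long time ranges a pre-built table such as Minutes is likely to perform better than generating the rows on every call.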
I would recommend against solving this in SQL/the database due to the set based nature of it.
Also, you are dealing with seconds here, so I guess you could end up with a lot of rows with the same repeated data that would have to be transferred from the database to your application.
One way to handle this is to left join your data against a table that contains all of the received values. Then, when there is no value for that row, you calculate what the projected value should be based on the previous and next actual values you have.
You didn't say what database platform you are using. In SQL Server, I would create a User Defined Function that accepts a start datetime and end datetime value. It would return a table value with all of the received values you need.
I have simulated it below, which runs in SQL Server. The subselect aliased r is what would actually get returned by the user defined function.
select r.received,
isnull(d.value,(select top 1 data.value from data where data.received < r.received order by data.received desc)) as x
from (
select cast('2000-01-01 08:00:00' as datetime) received
union all
select cast('2000-01-01 08:00:01' as datetime)
union all
select cast('2000-01-01 08:00:02' as datetime)
union all
select cast('2000-01-01 08:00:03' as datetime)
union all
select cast('2000-01-01 08:00:04' as datetime)
union all
select cast('2000-01-01 08:00:05' as datetime)
union all
select cast('2000-01-01 08:00:06' as datetime)
union all
select cast('2000-01-01 08:00:07' as datetime)
union all
select cast('2000-01-01 08:00:08' as datetime)
union all
select cast('2000-01-01 08:00:09' as datetime)
union all
select cast('2000-01-01 08:00:10' as datetime)
union all
select cast('2000-01-01 08:00:11' as datetime)
union all
select cast('2000-01-01 08:00:12' as datetime)
union all
select cast('2000-01-01 08:00:13' as datetime)
union all
select cast('2000-01-01 08:00:14' as datetime)
union all
select cast('2000-01-01 08:00:15' as datetime)
union all
select cast('2000-01-01 08:00:16' as datetime)
union all
select cast('2000-01-01 08:00:17' as datetime)
union all
select cast('2000-01-01 08:00:18' as datetime)
union all
select cast('2000-01-01 08:00:19' as datetime)
union all
select cast('2000-01-01 08:00:20' as datetime)
) r
left outer join Data d on r.received = d.received
If you were in SQL Server, then this would be a good start. I am not sure how close Apache's Derby is to sql.
Usage: EXEC ElaboratedData '2000-01-01 08:00:00','2000-01-01 08:00:20'
CREATE PROCEDURE [dbo].[ElaboratedData]
@StartDate DATETIME,
@EndDate DATETIME
AS
--if not a valid interval, just quit
IF @EndDate<=@StartDate BEGIN
    SELECT 0;
    RETURN;
END;
/*
Store the value of 1 second locally, for readability
--*/
DECLARE @OneSecond FLOAT;
SET @OneSecond = (1.00000000/86400.00000000);
/*
create a temp table w/the same structure as the real table.
--*/
CREATE TABLE #SecondIntervals(TSTAMP DATETIME, DATAPT INT);
/*
For each second in the interval, check to see if we have a known value.
If we do, then use that. If not, make one up.
--*/
DECLARE @CurrentSecond DATETIME;
SET @CurrentSecond = @StartDate;
WHILE @CurrentSecond <= @EndDate BEGIN
    DECLARE @KnownValue INT;
    SET @KnownValue = NULL; --reset: a SELECT that finds no row leaves the variable unchanged
    SELECT @KnownValue=DATAPT
    FROM TESTME
    WHERE TSTAMP = @CurrentSecond;
    IF (0 = ISNULL(@KnownValue,0)) BEGIN
        --ok, we have to make up a fake value
        DECLARE @MadeUpValue INT;
        /*
        *******Put whatever logic you want to make up a fake value here
        --*/
        SET @MadeUpValue = 99;
        INSERT INTO #SecondIntervals(
            TSTAMP
            ,DATAPT
        )
        VALUES(
            @CurrentSecond
            ,@MadeUpValue
        );
    END; --if we had to make up a value
    SET @CurrentSecond = @CurrentSecond + @OneSecond;
END; --while looking thru our values
--finally, return our generated values + real values
SELECT TSTAMP, DATAPT FROM #SecondIntervals
UNION ALL
SELECT TSTAMP, DATAPT FROM TESTME
ORDER BY TSTAMP;
GO
Just as an idea, you might want to check out Anthony Molinaro's SQL Cookbook, chapter 9. He has a recipe, "Filling in Missing Dates" (check out pages 278-281), that discusses essentially what you are trying to do. It requires some sort of sequential handling, either via a helper table or by doing the query recursively. While he doesn't have examples for Derby directly, I suspect you could probably adapt them to your problem (particularly the PostgreSQL or MySQL one, since it seems somewhat platform agnostic).