In the plant at our company there is a physical process that has a two-stage start and a two-stage finish. As a widget starts to enter the process a new record is created containing the widget ID and a timestamp (DateTimeCreated) and once the widget fully enters the process another timestamp is logged in a different field for the same record (DateTimeUpdated). The interval is a matter of minutes.
Similarly, as a widget starts to exit the process another record is created containing the widget ID and the DateTimeCreated, with the DateTimeUpdated being populated when the widget has fully exited the process. In the current table design an "exiting" record is indistinguishable from an "entering" record (although a given widget ID occurs only either once or twice so a View could utilise this fact to make the distinction, but let's ignore that for now).
The overall time a widget is in the process is several days but that's not really of importance to the discussion. What is important is that the interval when exiting the process is always longer than when entering. So a very simplified, imaginary set of sorted interval values might look like this:
1, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 4, 6, 7, 7, 7, 7, 8, 8, 8, 8, 10, 10, 10
You can see there is a peak in the occurrences of intervals around the 3-minute-mark (the "enters") and another peak around the 7/8-minute-mark (the "exits"). I've also excluded intervals of 5 minutes to demonstrate that enter-intervals and exit-intervals can be considered mutually exclusive.
We want to monitor the performance of each stage in the process daily by using a query to determine the local averages of the entry and exit data point clusters. So conceptually the two data sets could be split either side of an overall average (in this case 5.375) and then an average calculated for the values below the split (2.75) and another average above the split (8). Using the data above (in a random distribution) the averages are depicted as the dotted lines in the chart below.
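The split described above can be checked by hand with a few lines of Python. This is only a sketch of the arithmetic, not of the SQL; the function name `split_averages` is made up for illustration, and the data is the simplified sample list from this question.

```python
from statistics import mean

# Sample intervals (minutes) copied from the question.
intervals = [1, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 4,
             6, 7, 7, 7, 7, 8, 8, 8, 8, 10, 10, 10]


def split_averages(values):
    """Return (overall mean, mean of values below it, mean of values above it)."""
    midpoint = mean(values)
    below = [v for v in values if v < midpoint]
    above = [v for v in values if v > midpoint]
    return midpoint, mean(below), mean(above)


midpoint, entry_avg, exit_avg = split_averages(intervals)
print(midpoint, entry_avg, exit_avg)
```

Values exactly equal to the overall average fall into neither group, which matches the strict `<` and `>` comparisons used in the queries.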
My current approach is to use two Common Table Expressions followed by a final three-table-join query. It seems okay, but I can't help feeling it could be better. Would anybody like to offer an alternative approach or other observations?
WITH cte_Raw AS
DATEDIFF(minute, DateTimeCreated, DateTimeUpdated) AS [Interval]
DateTimeCreated > CAST(CAST(GETDATE() AS date) AS datetime) -- Today
, cte_Midpoint AS
AVG(Interval) AS Interval
AVG([Entry].Interval) AS AverageEntryInterval
, AVG([Exit].Interval) AS AverageExitInterval
cte_Raw AS [Entry]
[Entry].Interval < cte_Midpoint.Interval
cte_Raw AS [Exit]
[Exit].Interval > cte_Midpoint.Interval

I don't think your query produces accurate results. Your two JOINs are producing a proliferation of rows, which throws the averages off. They might look correct (because one is less than the other), but if you did counts, you would see that the counts in your query have little to do with the sample data.
If you are just looking for the average of values that are less than the overall average and greater than the overall average, then you can use window functions:
SELECT t.*, v.[Interval],
AVG(v.[Interval]) OVER () as avg_interval
(VALUES (DATEDIFF(minute, DateTimeCreated, DateTimeUpdated))
) v(Interval)
WHERE DateTimeCreated > CAST(CAST(GETDATE() AS date) AS datetime)
SELECT AVG(CASE WHEN t.[Interval] < t.avg_interval THEN t.[Interval] END) AS AverageEntryInterval,
AVG(CASE WHEN t.[Interval] > t.avg_interval THEN t.[Interval] END) AS AverageExitInterval

I decided to post my own answer as at the time of writing neither of the two proposed answers will run. I have however removed the JOIN statements and used the CASE statement approach proposed by Gordon.
I've also multiplied the DATEDIFF result by 1.0 to prevent rounding of results from the AVG function.
WITH cte_Raw AS
1.0 * DATEDIFF(minute, DateTimeCreated, DateTimeUpdated) AS [Interval]
DateTimeCreated > CAST(CAST(GETDATE() AS date) AS datetime) -- Today
, cte_Midpoint AS
AVG(Interval) AS Interval
SELECT AVG(CASE WHEN cte_Raw.Interval < cte_Midpoint.Interval THEN cte_Raw.[Interval] END) AS AverageEntryInterval,
AVG(CASE WHEN cte_Raw.Interval > cte_Midpoint.Interval THEN cte_Raw.[Interval] END) AS AverageExitInterval
FROM cte_Raw CROSS JOIN cte_Midpoint
This solution does not cater for the theoretical pitfall indicated by Vladimir of uneven dispersions of Entry vs Exit intervals, as in practice we can be confident this does not occur.
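The effect of the 1.0 multiplier can be illustrated outside the database. The sketch below is Python, not T-SQL, and the helper names are hypothetical: integer division stands in for averaging an integer column, and float division stands in for averaging after the multiplication. The values are the "entry" intervals from the sample list in the question.

```python
# Entry intervals (minutes) below the overall average in the question's sample.
entry_intervals = [1, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 4]


def int_avg(values):
    """Mimic averaging an integer column: integer division drops the fraction."""
    return sum(values) // len(values)


def decimal_avg(values):
    """Mimic averaging after multiplying by 1.0: the fraction is kept."""
    return sum(1.0 * v for v in values) / len(values)


print(int_avg(entry_intervals))      # 2
print(decimal_avg(entry_intervals))  # 2.75
```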


Get time difference between Log records

I have a log table that tracks the bug's status. I would like to extract the amount of time spent when the log changes from OPEN (OldStatus) to FIXED or REQUEST CLOSE (NewStatus). Right now, my query looks at the max and min of the log which does not produce the result I want. For example, the bug #1 was fixed in 2 hours on 2020-01-01, then reopened (OldStatus) and got a REQUEST CLOSE (NewStatus) in 3 hours on 2020-12-12. I want the query result to return two rows with date and number of hours spent to fix the bug since its most recently opened time.
Here's the data:
CREATE TABLE Log (
    BugID int,
    CurrentTime timestamp,
    Person varchar(20),
    OldStatus varchar(20),
    NewStatus varchar(20)
);
INSERT INTO Log (BugID, CurrentTime, Person, OldStatus, NewStatus)
VALUES (1, '2020-01-01 00:00:00', 'A', 'OPEN', 'In Progress'),
(1, '2020-01-01 00:00:01', 'A', 'In Progress', 'REVIEW In Progress'),
(1, '2020-01-01 02:00:00', 'A', 'In Progress', 'FIXED'),
(1, '2020-01-01 06:00:00', 'B', 'OPEN', 'In Progress'),
(1, '2020-01-01 00:00:00', 'B', 'In Progress', 'REQUEST CLOSE')
SELECT DATEDIFF(HOUR, start_time, finish_time) AS Time_Spent_Min
FROM (
    SELECT MAX(CurrentTime) as finish_time,
           MIN(CurrentTime) as start_time
    FROM Log
    WHERE (OldStatus = 'OPEN' AND NewStatus = 'In Progress') OR NewStatus = 'FIXED'
) t
The actual data looks as below:
This is a type of gaps-and-islands problem.
There are a number of solutions, here is one:
We need to assign a grouping ID to each island of OPEN -> In Progress. We can use windowed conditional COUNT to get a grouping number for each start point.
To get a grouping for the end point, we need to assign the previous row's NewStatus using LAG, then do another conditional COUNT on that.
We then simply group by BugId and our calculated grouping and return the start and end times
WITH IslandStart AS (
COUNT(CASE WHEN OldStatus = 'OPEN' AND NewStatus = 'In Progress' THEN 1 END)
LAG(NewStatus) OVER (PARTITION BY BugID ORDER BY CurrentTime) AS Prev_NewStatus
FROM Log l
IslandEnd AS (
FROM IslandStart l
MAX(CurrentTime) as finish_time,
MIN(CurrentTime) as start_time,
DATEDIFF(minute, MIN(CurrentTime), MAX(CurrentTime)) AS Time_Spent_Min
FROM IslandEnd l
WHERE GroupStart = GroupEnd + 1
timestamp is not meant for actual dates and times, instead use datetime or datetime2
You may need to adjust the COUNT condition if OPEN -> In Progress is not always the first row of an island
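The grouping logic described in this answer (a running count of start rows, a running count of end states seen on the previous row, and keeping rows where the first equals the second plus one) can be traced procedurally. The following Python sketch is an illustration of that idea rather than a translation of the query; the function name, the `END_STATES` set and the sample rows are assumptions built from the scenario in the question (fixed in 2 hours, reopened, then REQUEST CLOSE after 3 hours).

```python
from datetime import datetime
from itertools import groupby

END_STATES = {"FIXED", "REQUEST CLOSE"}  # assumed closing states


def time_spent(rows):
    """rows: (bug_id, time, old_status, new_status) tuples.
    Returns [(bug_id, start, finish, minutes)], one entry per island."""
    results = []
    rows = sorted(rows, key=lambda r: (r[0], r[1]))
    for bug_id, bug_rows in groupby(rows, key=lambda r: r[0]):
        group_start = group_end = 0
        prev_new = None  # plays the role of LAG(NewStatus)
        islands = {}
        for _, when, old, new in bug_rows:
            if old == "OPEN" and new == "In Progress":
                group_start += 1
            if prev_new in END_STATES:
                group_end += 1
            prev_new = new
            # Same filter as WHERE GroupStart = GroupEnd + 1.
            if group_start == group_end + 1:
                islands.setdefault(group_start, []).append(when)
        for times in islands.values():
            start, finish = min(times), max(times)
            minutes = int((finish - start).total_seconds() // 60)
            results.append((bug_id, start, finish, minutes))
    return results


# Hypothetical rows matching the story told in the question.
sample = [
    (1, datetime(2020, 1, 1, 8, 0), "OPEN", "In Progress"),
    (1, datetime(2020, 1, 1, 10, 0), "In Progress", "FIXED"),
    (1, datetime(2020, 12, 10, 9, 0), "FIXED", "OPEN"),
    (1, datetime(2020, 12, 12, 9, 30), "OPEN", "In Progress"),
    (1, datetime(2020, 12, 12, 12, 30), "In Progress", "REQUEST CLOSE"),
]
for row in time_spent(sample):
    print(row)
```

The reopen row (FIXED to OPEN) is excluded because the end count has already advanced while the start count has not, so it never inflates either island.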
You have a few competing factors here:
You should use a SmallDateTime, DateTime2 or DateTimeOffset typed column to store the actual time in the log. These types allow you to calculate the difference between values using DateDiff() and DateAdd() and other date/time based comparison logic, whereas Timestamp is designed to be used as a concurrency token. You can use it to determine whether one record is more recent than another, but you shouldn't try to use it to determine the actual time of the event.
What is difference between datetime and timestamp
You have not explained the expected workflow, so we can only assume that the flow is [OPEN]=>[In Progress]=>[CLAIM FIXED]. There is also no mention of 'In Progress', which we assume is an interim state. This structure can really only tell you the time spent in the 'In Progress' state, which is probably OK for your needs, as this is the time spent actually working. It is important to recognise, however, that we do not know when the bug was changed to 'OPEN' in the first place, unless that is also logged, and we would need to see the data to tell.
Your example dataset does not cover enough combinations for you to notice that the existing logic will fail as soon as you add more than 1 bug. What is more, you have asked to calculate the number of hours, but your example data only shows a variation of minutes and has no example where the bug is completed at all.
Without a realistic set of data to test with, you will find it hard to debug your logic and hard to accept that it actually works before you execute this against a larger dataset. It can help to have a scripted scenario, much like your post here, but you should create the data to reflect that script.
You use 'FIXED' in your example, but 'CLAIM FIXED' in query, so which one is it?
Step 1: Structure
Change the datatype of CurrentTime to a DateTime based column. Your application logic may drive requirements here. If your system is cloud based or international, then you may see benefits from using DateTimeOffset instead of having to convert into UTC, otherwise if you do not need high precision timing in your logs, it is very common to use SmallDateTime for logging.
Many ORM and application frameworks will allow you to configure a DateTime based column as the concurrency token, if you need one at all. If you are not happy using a lower precision value for concurrency, then you could have the two columns side by side, but to compare the time difference between two records we need to use a DateTime based type.
In the case of log, we rarely allow or expect logs to be edited, if your logs are read-only then having a concurrency token at all may not be necessary, especially if you only use the concurrency token to determine concurrency during edits of individual records.
NOTE: You should consider using an enum or FK for the Status concept. Already in your example dataset there was a typo for 'In Progerss'. Using a numeric comparison for the status may provide some performance benefits, and it will also help to prevent spelling mistakes, especially when FK or lookup lists are used from any application logic.
Step 2: Example Data
If the requirement is to calculate the number of hours spent between records, then we need to create some simple examples that show a difference of a few hours, and then add some examples where the same bug is opened, fixed and then re-opened.
bug #1 was fixed in 2 hours on 2020-01-01, then reopened and got fixed in 3 hours on 2020-12-12
The following table shows the known data states and the expected hrs, we need to throw in a few more data stories to validate that the end query handles obvious boundary conditions like multiple Bugs and overlapping dates
CurrentTime         | Previous State | New State   | Hrs In Progress
2020-01-01 08:00:00 |                | In Progress |
2020-01-01 10:00:00 | In Progress    |             | (2 hrs)
2020-12-10 09:00:00 |                |             |
2020-12-12 9:30:00  |                | In Progress |
2020-12-12 12:30:00 | In Progress    |             | (3 hrs)
2020-03-17 11:15:00 |                | In Progress |
2020-03-17 14:30:00 | In Progress    |             | (3.25 hrs)
2020-08-22 10:00:00 |                | In Progress |
2020-08-22 16:30:00 | In Progress    |             | (6.5 hrs)
Step 3: Query
What is interesting to notice here is that 'In Progress' is actually the significant state to query against. What we actually want is to see all rows where the OldStatus is 'In Progress' and we want to link that row to the most recent record before this one with the same BugID and with a NewStatus equal to 'In Progress'
What is interesting in the above table is that not all the expected hours are whole numbers (integers), which makes using DateDiff a little bit tricky because it only counts the boundary changes, not the total number of hours. To highlight this, look at the next two queries: the first one represents 59 minutes, the other only 2 minutes:
SELECT DateDiff(HOUR, '2020-01-01 08:00:00', '2020-01-01 08:59:00') -- 0 (59m)
SELECT DateDiff(HOUR, '2020-01-01 08:59:00', '2020-01-01 09:01:00') -- 1 (2m)
However, the SQL results show the first query as 0 hours, but the second query returns 1 hour. That is because DateDiff only counts how many times the HOUR boundary is crossed; it does not actually subtract the time values at all.
To work around this, we can use MINUTE or MI as the date part argument and divide the result by 60.
SELECT CAST(ROUND(DateDiff(MI, '2020-01-01 08:00:00', '2020-01-01 08:59:00')/60.0,2) as Numeric(10,2)) -- 0.98
SELECT CAST(ROUND(DateDiff(MI, '2020-01-01 08:59:00', '2020-01-01 09:01:00')/60.0,2) as Numeric(10,2)) -- 0.03
You can choose to format this in other ways by calculating the modulo to get the minutes in whole numbers instead of a fraction but that is out of scope for this post, understanding the limitations of DateDiff is what is important to take this further.
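The difference between counting hour boundaries and measuring elapsed time can be reproduced outside SQL. The Python sketch below uses two hypothetical helpers: one mimics the boundary counting of DateDiff(HOUR, ...) by truncating both values to the hour first, and the other divides true elapsed minutes by 60. For the timestamps used here (zero seconds) elapsed minutes and DateDiff(MI, ...) agree, which is an assumption that would not hold for arbitrary values.

```python
from datetime import datetime


def hour_boundaries(start, end):
    """Count hour boundaries crossed, like DateDiff(HOUR, start, end)."""
    start_hour = start.replace(minute=0, second=0, microsecond=0)
    end_hour = end.replace(minute=0, second=0, microsecond=0)
    return int((end_hour - start_hour).total_seconds() // 3600)


def elapsed_hours(start, end):
    """Elapsed whole minutes divided by 60, rounded to two decimals."""
    minutes = (end - start).total_seconds() // 60
    return round(minutes / 60.0, 2)


a = (datetime(2020, 1, 1, 8, 0), datetime(2020, 1, 1, 8, 59))
b = (datetime(2020, 1, 1, 8, 59), datetime(2020, 1, 1, 9, 1))

print(hour_boundaries(*a), elapsed_hours(*a))  # 0 0.98
print(hour_boundaries(*b), elapsed_hours(*b))  # 1 0.03
```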
There are a number of ways to correlate a previous record within the same table. If you need other values from the record then you might use a join with a sub-query to return the TOP 1 from all the records before the current one, or you could use window queries or a CROSS APPLY to perform a nested lookup. The following uses CROSS APPLY, which is NOT standard across all RDBMS, but I feel it keeps MS SQL queries really clean:
SELECT [Fixed].BugID, [start_time], [Fixed].[CurrentTime] as [finish_time]
, DATEDIFF(MI, [start_time], [Fixed].[CurrentTime]) / 60 AS Time_Spent_Hr
, DATEDIFF(MI, [start_time], [Fixed].[CurrentTime]) % 60 AS Time_Spent_Min
FROM Log as Fixed
CROSS APPLY (SELECT MAX(CurrentTime) AS start_time
FROM Log as Started
WHERE Fixed.BugID = Started.BugID
AND Started.NewStatus = 'In Progress'
AND CurrentTime < Fixed.CurrentTime) as Started
WHERE Fixed.OldStatus = 'In Progress'
You can play with this fiddle:!18/c408d4/3
However the results show this:
If I assume that every "open" is followed by one "fixed" before the next open, then you can basically use lead() to solve this problem.
This version unpivots the data, so you could have "open" and "fixed" in the same row:
select l.*, datediff(hour, currenttime, fixed_time)
from (select v.*,
lead(v.currenttime) over (partition by v.bugid order by v.currenttime) as fixed_time
from log l cross apply
(values (bugid, currentTime, oldStatus),
(bugid, currentTime, newStatus)
) v(bugid, currentTime, status)
where v.status in ('OPEN', 'FIXED')
) l
where status = 'OPEN';
Here is a db<>fiddle, which uses data compatible with your explanation. (Your sample data is not correct.)

Oracle SQL: How to best go about counting how many values were in time intervals? Database query vs. pandas (or more efficient libraries)?

I currently have to wrap my head around programming the following task.
Situation: suppose we have one column where we have time data (Year-Month-Day Hours-Minutes). Our program shall get the input (weekday, starttime, endtime, timeslot) and we want to return the interval (specified by timeslot) where there are the least values. For further information, the database has several million entries.
So our program would be specified as
def calculate_optimal_window(weekday, starttime, endtime, timeslot):
return optimal_window
Example: suppose we want to input
weekday = Monday, starttime = 10:00, endtime = 12:00, timeslot = 30 minutes.
Here we want to count how many entries there are between 10:00 and 12:00 o'clock, and compute the number of values in every single 30 minute slot (i.e. 10:00 - 10:30, 10:01 - 10:31 etc.) and in the end return the slot with the least values. How would you go about formulating an efficient query?
Since I'm working with an Oracle SQL database, my second question is: would it be more efficient to work with libraries like Dask or Vaex to get the filtering and counting done? Where is the bottleneck in this situation?
Happy to provide more information if the formulation was too blurry.
All the best.
This part:
Since I'm working with an Oracle SQL database, my second question is:
would it be more efficient to work with libraries like Dask or Vaex to
get the filtering and counting done? Where is the bottleneck in this situation?
Depending on your server's specs and the cluster/machine you have available for Dask, it is rather likely that the bottleneck in your analysis would be the transfer of data between the SQL and Dask workers, even in the (likely) case that this can be efficiently parallelised. From the DB's point of view, selecting data and serialising it is likely at least as expensive as counting in a relatively small number of time bins.
I would start by investigating how long the process takes with SQL alone, and whether this is acceptable, before moving the analysis to Dask. Usual rules would apply: having good indexing and sharding on the time index.
You should at least do the basic filtering and counting in the SQL query. With a simple predicate, Oracle can decide whether to use an index or a partition and potentially reduce the database processing time. And sending fewer rows will significantly decrease the network overhead.
For example:
select trunc(the_time, 'MI') the_minute, count(*) the_count
from test1
where the_time between timestamp '2021-01-25 10:00:00' and timestamp '2021-01-25 11:59:59'
group by trunc(the_time, 'MI')
order by the_minute desc;
(The trickiest part of these queries will probably be off-by-one issues. Do you really want "between 10:00 and 12:00", or do you want "between 10:00 and 11:59:59"?)
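If the per-minute counts from a query like the one above are fetched into Python, picking the quietest slot is a small sliding-window sum. The sketch below is one possible shape for that post-processing, not the asker's actual function: the name `least_busy_slot`, the dictionary input and the half-open `[slot_start, slot_end)` convention are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def least_busy_slot(minute_counts, start, end, slot_minutes):
    """minute_counts maps a minute-truncated datetime to its row count.
    Returns (slot_start, slot_end, total) for the slot with the fewest rows.
    Slots start on every minute and cover [slot_start, slot_end)."""
    n_minutes = int((end - start).total_seconds() // 60)
    counts = [minute_counts.get(start + timedelta(minutes=i), 0)
              for i in range(n_minutes)]
    # Prefix sums make each window total a single subtraction.
    prefix = [0]
    for c in counts:
        prefix.append(prefix[-1] + c)
    best = None
    for i in range(n_minutes - slot_minutes + 1):
        total = prefix[i + slot_minutes] - prefix[i]
        if best is None or total < best[2]:
            best = (start + timedelta(minutes=i),
                    start + timedelta(minutes=i + slot_minutes),
                    total)
    return best


# Made-up counts: two rows per minute, except one row per minute
# from 10:45 up to (not including) 11:15.
start = datetime(2021, 1, 25, 10, 0)
end = datetime(2021, 1, 25, 12, 0)
demo = {start + timedelta(minutes=i): (1 if 45 <= i < 75 else 2)
        for i in range(120)}
print(least_busy_slot(demo, start, end, 30))
```

Using a half-open window is one way to sidestep the "12:00 or 11:59:59" question: the end bound is never itself counted. Ties are resolved in favour of the earliest slot.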
Optionally, you can perform the entire calculation in SQL. I would wager the SQL version will be slightly faster, again because of the network overhead. But sending one result row versus 120 aggregate rows probably won't make a noticeable difference unless this query is frequently executed.
At this point, the question veers into the more subjective question about where to put the "business logic". I bet most programmers would prefer your Python solution to my query. But one minor advantage of doing all the work in SQL is keeping all of the weird date logic in one place. If you process the results in multiple steps there are more chances for an off-by-one error.
--Time slots with the smallest number of rows.
--(There will be lots of ties because the data is so boring.)
with dates as
--Enter literals or bind variables here:
cast(timestamp '2021-01-25 10:00:00' as date) begin_date,
cast(timestamp '2021-01-25 11:59:59' as date) end_date,
30 timeslot
from dual
--Choose the rows with the smallest counts.
select begin_time, end_time, total_count
--Rank the time slots per count.
select begin_time, end_time, total_count,
dense_rank() over (order by total_count) smallest_when_1
--Counts per timeslot.
select begin_time, end_time, sum(the_count) total_count
--Counts per minute.
select trunc(the_time, 'MI') the_minute, count(*) the_count
from test1
where the_time between (select begin_date from dates) and (select end_date from dates)
group by trunc(the_time, 'MI')
order by the_minute desc
) counts
--Time ranges.
begin_date + ((level-1)/24/60) begin_time,
begin_date + ((level-1)/24/60) + (timeslot/24/60) end_time
from dates
connect by level <=
--The number of different time ranges.
select (end_date - begin_date) * 24 * 60 - timeslot + 1
from dates
) time_ranges
on the_minute between begin_time and end_time
group by begin_time, end_time
where smallest_when_1 = 1
order by begin_time;
You can run a db<>fiddle here.

SQL: Average value per day

I have a table called 'tweets'. It includes (amongst others) the columns 'tweet_id', 'created_at' (dd/mm/yyyy hh/mm/ss), 'classified' and 'processed_text'. Within the 'processed_text' column there are certain strings such as '{TICKER|IBM}', to which I will refer as ticker-strings.
My target is to get the average value of 'classified' per ticker-string per day. The column 'classified' contains the numerical values -1, 0 and 1.
At this moment, I have a working SQL query for the average value of ‘classified’ for one ticker-string per day. See the script below.
SELECT Date( `created_at` ) , AVG( `classified` ) AS Classified
FROM `tweets`
WHERE `processed_text` LIKE '%{TICKER|IBM}%'
GROUP BY Date( `created_at` )
There are however two problems with this script:
It does not include days on which there were zero ‘processed_text’s like {TICKER|IBM}. I would however like it to spit out the value zero in this case.
I have 100+ different ticker-strings and would thus like to have a script which can process multiple strings at the same time. I can also do them manually, one by one, but this would cost me a terrible lot of time.
When I had a similar question for counting the ‘tweet_id’s per ticker-string, somebody else suggested using the following:
SELECT d.date, coalesce(IBM, 0) as IBM, coalesce(GOOG, 0) as GOOG,
       coalesce(BAC, 0) AS BAC
FROM dates d LEFT JOIN
     (SELECT DATE(created_at) AS date,
             COUNT(DISTINCT CASE WHEN processed_text LIKE '%{TICKER|IBM}%' then tweet_id
                            END) as IBM,
             COUNT(DISTINCT CASE WHEN processed_text LIKE '%{TICKER|GOOG}%' then tweet_id
                            END) as GOOG,
             COUNT(DISTINCT CASE WHEN processed_text LIKE '%{TICKER|BAC}%' then tweet_id
                            END) as BAC
      FROM tweets
      GROUP BY DATE(created_at)
     ) t
     ON d.date = t.date;
This script worked perfectly for counting the tweet_ids per ticker-string. As I stated, however, I am now looking to find the average classified scores per ticker-string. My question is therefore: Could someone show me how to adjust this script in such a way that I can calculate the average classified scores per ticker-string per day?
SELECT d.date, t.ticker, COALESCE(COUNT(DISTINCT tweet_id), 0) AS tweets
FROM dates d
LEFT JOIN
    (SELECT DATE(created_at) AS date,
            tweet_id,
            SUBSTR(processed_text,
                   LOCATE('{TICKER|', processed_text) + 8,
                   LOCATE('}', processed_text, LOCATE('{TICKER|', processed_text))
                     - LOCATE('{TICKER|', processed_text) - 8) AS ticker
     FROM tweets) t
ON d.date = t.date
GROUP BY d.date, t.ticker
This will put each ticker on its own row, not a column. If you want them moved to columns, you have to pivot the result. How you do this depends on the DBMS. Some have built-in features for creating pivot tables. Others (e.g. MySQL) do not and you have to write tricky code to do it; if you know all the possible values ahead of time, it's not too hard, but if they can change you have to write dynamic SQL in a stored procedure.
See MySQL pivot table for how to do it in MySQL.
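For comparison, the per-day, per-ticker average the question asks for can be sketched in Python once the rows are fetched. This is an illustration under assumptions, not the SQL answer itself: the function name and the tuple layout are hypothetical, a regular expression stands in for the LOCATE arithmetic, every ticker-string in a tweet is counted (not just the first), and days without any matching tweet are not filled in with zero.

```python
import re
from collections import defaultdict
from datetime import date

TICKER = re.compile(r"\{TICKER\|([^}]+)\}")


def average_by_day_and_ticker(tweets):
    """tweets: iterable of (created_date, classified, processed_text).
    Returns {(created_date, ticker): average classified value}."""
    totals = defaultdict(lambda: [0, 0])  # [sum, count] per (day, ticker)
    for created, classified, text in tweets:
        # set() so a ticker repeated inside one tweet is counted once.
        for ticker in set(TICKER.findall(text)):
            entry = totals[(created, ticker)]
            entry[0] += classified
            entry[1] += 1
    return {key: s / n for key, (s, n) in totals.items()}


# Made-up rows for illustration only.
rows = [
    (date(2014, 1, 1), 1, "buy {TICKER|IBM} now"),
    (date(2014, 1, 1), -1, "{TICKER|IBM} vs {TICKER|GOOG}"),
    (date(2014, 1, 2), 0, "flat day for {TICKER|GOOG}"),
]
for key, avg in sorted(average_by_day_and_ticker(rows).items()):
    print(key, avg)
```

The result has one entry per day and ticker, which corresponds to the one-row-per-ticker shape described above rather than the pivoted one-column-per-ticker shape.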

db2 suppress recursive warning

I have a recursive sql that I am running which works but gives me the following warning.
SQL0347W The recursive common table expression "DT_LAST_YEAR" may
contain an infinite loop. SQLSTATE=01605
How can I get rid of the warning?
WITH dt_this_year (level, seqdate) AS
SELECT 1, date(current timestamp) -7 DAYS FROM sysibm.sysdummy1
SELECT level, seqdate + level days FROM dt_this_year WHERE level < 1000 AND seqdate + 1 days < date(current timestamp)
,dt_last_year (level, seqdate) AS
SELECT 1, date(current timestamp) -7 DAYS - 1 year FROM sysibm.sysdummy1
SELECT level, seqdate + level days FROM dt_last_year WHERE level < 1000 AND seqdate + 1 days < date(current timestamp) -1 year
select 10049, date(dts.calendarday), count(*) trancount
from (
SELECT seqdate AS calendarday FROM dt_this_year
SELECT seqdate AS calendarday FROM dt_last_year
) dts LEFT JOIN ccftrxheader ccf
ON date(dts.calendarday) = date(ccf.businessdate)
WHERE ccf.sitedirectoryid=10049
GROUP BY ccf.sitedirectoryid,dts.calendarday
How do you get rid of warnings?
By changing the code so that it no longer generates the warning in the first place. Hiding warnings is problematic, because it often disguises a potentially larger problem. I'm fairly certain it's complaining here because the termination clause you provide for level can't ever be reached (because you never manipulate it).
Personally, I'd probably re-write your query into something like this:
INSERT INTO Rep_Man_Tran_Counts (siteDirectoryId, businessDate, tranCount)
WITH dt_Calendar_Data (level, calendarDay) AS
     (SELECT l, c
      FROM (VALUES (1, CURRENT_DATE - 7 DAYS),
                   (1, CURRENT_DATE - 7 DAYS - 1 YEAR)) t(l, c)
      UNION ALL
      SELECT level + 1, calendarDay + 1 DAYS
      FROM dt_Calendar_Data
      WHERE level < 7)
SELECT 10049, dtCal.calendarDay, COALESCE(COUNT(*), 0) as tranCount
FROM dt_Calendar_Data dtCal
LEFT JOIN ccftrxHeader ccf
       ON ccf.businessDate = dtCal.calendarDay
          AND ccf.siteDirectoryId = 10049
GROUP BY dtCal.calendarDay
(untested, as you've provided no sample data, and I don't have a DB2 instance)
I've assumed you actually wanted a LEFT JOIN, as opposed to the regular INNER JOIN you were actually getting (due to the condition in the WHERE clause, and probably the GROUP BY as well). To avoid adding nulls to your data, I've wrapped the count in COALESCE(...), which will give you 0 instead.
I've also assumed that businessDate is a DATE type, and not a timestamp. If it is a timestamp this query needs to be adjusted (note that the function you were using would force the optimizer to ignore indices).
Note that the order of operations with dates matters! Thankfully when dealing with year ranges, you only have one day to worry about in the Gregorian calendar (February 29th). Your current ordering will compare identical calendar days at the start of the range (which one has the "gap" depends on whether this year or last year is a leap year).
Sure, let's look at that CTE:
FROM (VALUES (1, CURRENT_DATE - 7 DAYS),
             (1, CURRENT_DATE - 7 DAYS - 1 YEAR)) t(l, c)
This is just a standard VALUES clause used as a table reference. This is the SQL Standard way to construct a small temp table (Rather than referencing the dummy tables, which tend to be vendor-specific). If the statement is run on 2014-02-26 then the resulting table will be:
l c
1 "2014-02-19"
1 "2013-02-19"
These columns get renamed by the column listing of the CTE, which are then referenced in the join (and in the case of a recursive CTE, by the recursive portion).
This then forms the starting data for the rest of the recursive query:
SELECT level + 1, calendarDay + 1 DAYS
FROM dt_Calendar_Data
WHERE level < 7
In DB2 (and some other RDBMSs), recursive CTEs essentially execute iteratively, acting off the results of the "previous" invocation. Every time around, we increment level, and add another day to calendarDay. The "next" rows are then:
level calendarDay
2 "2014-02-20"
2 "2013-02-20"
This continues until the "previous" row has level = 7, which means a new row is not generated (check the WHERE clause). In general, it's best to only have one termination condition (and make progress every iteration), to make it easier for the optimizer to spot. The resulting data is then in the ranges:
level calendarDay
1 "2014-02-19"
. .....
7 "2014-02-25"
1 "2013-02-19"
. .....
7 "2013-02-25"
... as a side note, I generated the this year/last year data together to make the number of references shorter. If you only needed the one year, level is unnecessary.
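The iterative behaviour described above can be simulated in a few lines of Python, which may help when reasoning about why a level that is never incremented cannot terminate the recursion. This is a sketch only: `calendar_rows` and `minus_one_year` are hypothetical names, and the Feb 29 fallback to Feb 28 is an assumption about how the year subtraction is rounded.

```python
from datetime import date, timedelta


def minus_one_year(d):
    """Subtract one year; Feb 29 falls back to Feb 28 (assumed rounding)."""
    try:
        return d.replace(year=d.year - 1)
    except ValueError:
        return d.replace(year=d.year - 1, day=28)


def calendar_rows(today, max_level=7):
    """Expand the two seed rows the way the recursive step does."""
    seed = today - timedelta(days=7)
    current = [(1, seed), (1, minus_one_year(seed))]
    rows = list(current)
    while current:
        # Only rows with level < max_level produce a successor row,
        # so the loop stops once every row has reached max_level.
        current = [(level + 1, day + timedelta(days=1))
                   for level, day in current if level < max_level]
        rows.extend(current)
    return rows


for level, day in calendar_rows(date(2014, 2, 26)):
    print(level, day)
```

If the `level + 1` in the recursive step were replaced by plain `level`, the filter would never become false and the loop would not end, which is the situation the warning is pointing at.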

Calculating different tariff-periods for a call in SQL Server

For a call-rating system, I'm trying to split a telephone call duration into sub-durations for different tariff-periods. The calls are stored in a SQL Server database and have a starttime and total duration. Rates are different for night (0000 - 0800), peak (0800 - 1900) and offpeak (1900-235959) periods.
For example:
A call starts at 18:50:00 and has a duration of 1000 seconds. This would make the call end at 19:06:40, making it 10 minutes / 600 seconds in the peak-tariff and 400 seconds in the off-peak tariff.
Obviously, a call can wrap over an unlimited number of periods (we do not enforce a maximum call duration). A call lasting > 24 h can wrap all 3 periods, starting in peak, going through off-peak, night and back into peak tariff.
Currently, we are calculating the different tariff-periods using recursion in VB. We calculate how much of the call goes in the same tariff-period the call starts in, change the starttime and duration of the call accordingly and repeat this process till the full duration of the call has been reached (peakDuration + offpeakDuration + nightDuration == callDuration).
Regarding this issue, I have 2 questions:
Is it possible to do this effectively in a SQL Server statement? (I can think of subqueries or lots of coding in stored procedures, but that would not generate any performance improvement)
Will SQL Server be able to do such calculations in a way more resource-effective than the current VB scripts are doing it?
It seems to me that this is an operation with two phases.
Determine which parts of the phone call use which rates at which time.
Sum the times in each of the rates.
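Before turning to SQL, the two phases can be sketched procedurally. The Python below is a minimal illustration under stated assumptions, not the asker's VB code or the query developed in this answer: the period boundaries come from the question (night 00:00 to 08:00, peak 08:00 to 19:00, off-peak 19:00 to midnight), the names `PERIODS` and `split_call` are made up, and durations are assumed to be whole seconds.

```python
from datetime import datetime, timedelta

# Tariff boundaries as hours since midnight, taken from the question.
PERIODS = [(0, 8, "night"), (8, 19, "peak"), (19, 24, "offpeak")]


def split_call(start, duration_seconds):
    """Return the seconds a call spends in each tariff period."""
    totals = {"night": 0, "peak": 0, "offpeak": 0}
    current = start
    end = start + timedelta(seconds=duration_seconds)
    while current < end:
        midnight = current.replace(hour=0, minute=0, second=0, microsecond=0)
        for lo, hi, name in PERIODS:
            period_start = midnight + timedelta(hours=lo)
            period_end = midnight + timedelta(hours=hi)
            if period_start <= current < period_end:
                # Phase 1: clip the call to the period it is currently in.
                chunk_end = min(period_end, end)
                # Phase 2: add the clipped piece to that period's total.
                totals[name] += int((chunk_end - current).total_seconds())
                current = chunk_end
                break
    return totals


print(split_call(datetime(2009, 2, 26, 18, 50, 0), 1000))
# {'night': 0, 'peak': 600, 'offpeak': 400}
```

The loop advances to the next period boundary on every pass, so a call of any length (including one spanning several days) ends with the three totals adding up to its duration.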
Phase 1 is trickier than Phase 2. I've worked the example in IBM Informix Dynamic Server (IDS) because I don't have MS SQL Server. The ideas should translate easily enough. The INTO TEMP clause creates a temporary table with an appropriate schema; the table is private to the session and vanishes when the session ends (or you explicitly drop it). In IDS, you can also use an explicit CREATE TEMP TABLE statement and then INSERT INTO temp-table SELECT ... as a more verbose way of doing the same job as INTO TEMP.
As so often in SQL questions on SO, you've not provided us with a schema, so everyone has to invent a schema that might, or might not, match what you describe.
Let's assume your data is in two tables. The first table has the call log records, the basic information about the calls made, such as the phone making the call, the number called, the time when the call started and the duration of the call:
CREATE TABLE clr -- call log record
(
    phone_id VARCHAR(24) NOT NULL, -- billing plan
    called_number VARCHAR(24) NOT NULL, -- needed to validate call
    start_time TIMESTAMP NOT NULL, -- date and time when call started
    duration INTEGER NOT NULL -- duration of call in seconds
        CHECK(duration > 0),
    PRIMARY KEY(phone_id, start_time)
    -- other complicated range-based constraints omitted!
    -- foreign keys omitted
    -- there would probably be an auto-generated number here too.
);
INSERT INTO clr(phone_id, called_number, start_time, duration)
VALUES('650-656-3180', '650-794-3714', '2009-02-26 15:17:19', 186234);
For convenience (mainly to save writing the addition multiple times), I want a copy of the clr table with the actual end time:
SELECT phone_id, called_number, start_time AS call_start, duration,
start_time + duration UNITS SECOND AS call_end
FROM clr
INTO TEMP clr_end;
The tariff data is stored in a simple table:
CREATE TABLE tariff
(
    tariff_code CHAR(1) NOT NULL -- code for the tariff
        CHECK(tariff_code IN ('P','N','O')),
    rate_start TIME NOT NULL, -- time when rate starts
    rate_end TIME NOT NULL, -- time when rate ends
    rate_charged DECIMAL(7,4) NOT NULL -- rate charged (cents per second)
);
INSERT INTO tariff(tariff_code, rate_start, rate_end, rate_charged)
VALUES('N', '00:00:00', '08:00:00', 0.9876);
INSERT INTO tariff(tariff_code, rate_start, rate_end, rate_charged)
VALUES('P', '08:00:00', '19:00:00', 2.3456);
INSERT INTO tariff(tariff_code, rate_start, rate_end, rate_charged)
VALUES('O', '19:00:00', '23:59:59', 1.2345);
I debated whether the tariff table should use TIME or INTERVAL values; in this context, the times are very similar to intervals relative to midnight, but intervals can be added to timestamps where times cannot. I stuck with TIME, but it made things messy.
The tricky part of this query is generating the relevant date and time ranges for each tariff without loops. In fact, I ended up using a loop embedded in a stored procedure to generate a list of integers. (I also used a technique that is specific to IBM Informix Dynamic Server, IDS, using the table ID numbers from the system catalog as a source of contiguous integers in the range 1..N, which works for numbers from 1 to 60 in version 11.50.)
FOR i = lo TO hi STEP 1
In the simple case (and the most common case), the call falls in a single-tariff period; the multi-period calls add the excitement.
Let's assume we can create a table expression that matches this schema and covers all the timestamp values we might need:
CREATE TEMP TABLE tariff_date_time
tariff_code CHAR(1) NOT NULL,
rate_charged DECIMAL(7,4) NOT NULL
Fortunately, you haven't mentioned weekend rates, so you charge the customers the same
rates at the weekend as during the week. However, the answer should adapt to such
situations if at all possible. If you were to get as complex as giving weekend rates on
public holidays, except that at Christmas or New Year, you charge peak rate instead of
weekend rate because of the high demand, then you would be best off storing the rates in a permanent tariff_date_time table.
The first step in populating tariff_date_time is to generate a list of dates which are relevant to the calls:
SELECT DISTINCT EXTEND(DATE(call_start) + number, YEAR TO SECOND) AS call_date
FROM clr_end,
TABLE(integers(0, (SELECT DATE(call_end) - DATE(call_start) FROM clr_end)))
AS date_list(number)
INTO TEMP call_dates;
The difference between the two date values is an integer number of days (in IDS).
The procedure integers generates values from 0 to the number of days covered by the call and stores the result in a temp table. For the more general case of multiple records, it might be better to calculate the minimum and maximum dates and generate the dates in between rather than generate dates multiple times and then eliminate them with the DISTINCT clause.
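As a rough illustration of the min/max idea outside the database, here is a minimal Python sketch. The function name dates_covering and the (start, end) tuple layout are my own assumptions, not part of the schema above; the sample call is the one from the results further down (start 2009-02-26 15:17:19, duration 186234 seconds).

```python
from datetime import date, datetime, timedelta


def dates_covering(calls):
    """Return every calendar date from the earliest call start to the latest call end."""
    first = min(start for start, _ in calls).date()
    last = max(end for _, end in calls).date()
    # Subtracting two dates gives a whole number of days, as in the SQL above.
    return [first + timedelta(days=n) for n in range((last - first).days + 1)]


# The sample call: starts 2009-02-26 15:17:19 and lasts 186234 seconds.
call_start = datetime(2009, 2, 26, 15, 17, 19)
call_end = call_start + timedelta(seconds=186234)
print(dates_covering([(call_start, call_end)]))
```

Each date is produced exactly once, so no DISTINCT step is needed. The trade-off is that widely spaced calls would also produce the dates in between on which no call took place.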
Now use a Cartesian product of the tariff table with the call_dates table to generate the rate information for each day. This is where the tariff times would be neater as intervals.
SELECT r.tariff_code,
d.call_date + (r.rate_start - TIME '00:00:00') AS rate_start,
d.call_date + (r.rate_end - TIME '00:00:00') AS rate_end,
r.rate_charged
FROM call_dates AS d, tariff AS r
INTO TEMP tariff_date_time;
Now we need to match the call log record with the tariffs that apply. The condition is a standard way of dealing with overlaps - two time periods overlap if the end of the first is later than the start of the second and if the start of the first is before the end of the second:
SELECT tdt.*, clr_end.*
FROM tariff_date_time tdt, clr_end
WHERE tdt.rate_end > clr_end.call_start
AND tdt.rate_start < clr_end.call_end
INTO TEMP call_time_tariff;
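The overlap condition is easy to check in isolation. This is a small Python sketch of the same test; the function name periods_overlap is hypothetical, and the two tariff periods are the peak and night rates for 2009-02-26 taken from the tariff table above.

```python
from datetime import datetime


def periods_overlap(a_start, a_end, b_start, b_end):
    """True when the two periods share some time; merely touching does not count."""
    return a_end > b_start and a_start < b_end


call = (datetime(2009, 2, 26, 15, 17, 19), datetime(2009, 2, 28, 19, 1, 13))
peak_26 = (datetime(2009, 2, 26, 8, 0, 0), datetime(2009, 2, 26, 19, 0, 0))
night_26 = (datetime(2009, 2, 26, 0, 0, 0), datetime(2009, 2, 26, 8, 0, 0))

print(periods_overlap(*peak_26, *call))   # True: the call starts during the peak period
print(periods_overlap(*night_26, *call))  # False: the night period ends before the call starts
```

Because both comparisons are strict, a tariff period that ends at the exact second the call starts is not matched, so it never contributes a zero-length row.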
Then we need to establish the start and end times for the rate. The start time for the rate is the later of the start time for the tariff and the start time of the call. The end time for the rate is the earlier of the end time for the tariff and the end time of the call:
SELECT phone_id, called_number, tariff_code, rate_charged,
call_start, duration,
CASE WHEN rate_start < call_start THEN call_start
ELSE rate_start END AS rate_start,
CASE WHEN rate_end >= call_end THEN call_end
ELSE rate_end END AS rate_end
FROM call_time_tariff
INTO TEMP call_time_tariff_times;
Finally, we need to sum the times spent at each tariff rate, and take that time (in seconds) and multiply by the rate charged. Since the result of SUM(rate_end - rate_start) is an INTERVAL, not a number, I had to invoke a conversion function to convert the INTERVAL into a DECIMAL number of seconds, and that (non-standard) function is iv_seconds:
SELECT phone_id, called_number, tariff_code, rate_charged,
call_start, duration,
SUM(rate_end - rate_start) AS tariff_time,
rate_charged * iv_seconds(SUM(rate_end - rate_start)) AS tariff_cost
FROM call_time_tariff_times
GROUP BY phone_id, called_number, tariff_code, rate_charged,
call_start, duration;
For the sample data, this yielded the data (where I'm not printing the phone number and called number for compactness):
tariff_code  rate_charged  call_start           duration  tariff_time  tariff_cost
N            0.9876        2009-02-26 15:17:19  186234    0 16:00:00   56885.760000000
O            1.2345        2009-02-26 15:17:19  186234    0 10:01:11   44529.649500000
P            2.3456        2009-02-26 15:17:19  186234    1 01:42:41   217111.081600000
That's a very expensive call, but the telco will be happy with that. You can poke at any of the intermediate results to see how the answer is derived. You can use fewer temporary tables at the cost of some clarity.
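To cross-check those figures outside the database, here is a minimal Python sketch of the same day-by-tariff product followed by clamping. The TARIFF list copies the three rows inserted into the tariff table; the function name cost_by_tariff and the use of Decimal are my own choices, not anything from the IDS code.

```python
from datetime import datetime, time, timedelta
from decimal import Decimal

# (tariff_code, rate_start, rate_end, rate_charged in cents per second)
TARIFF = [
    ("N", time(0, 0, 0), time(8, 0, 0), Decimal("0.9876")),
    ("P", time(8, 0, 0), time(19, 0, 0), Decimal("2.3456")),
    ("O", time(19, 0, 0), time(23, 59, 59), Decimal("1.2345")),
]


def cost_by_tariff(call_start, duration_seconds):
    """Return {tariff_code: (seconds, cost)} for one call."""
    call_end = call_start + timedelta(seconds=duration_seconds)
    seconds = {}
    day = call_start.date()
    while day <= call_end.date():
        for code, rate_start, rate_end, _ in TARIFF:
            # Clamp the tariff period for this day to the call itself.
            lo = max(datetime.combine(day, rate_start), call_start)
            hi = min(datetime.combine(day, rate_end), call_end)
            if hi > lo:
                seconds[code] = seconds.get(code, 0) + int((hi - lo).total_seconds())
        day += timedelta(days=1)
    rates = {code: rate for code, _, _, rate in TARIFF}
    return {code: (secs, rates[code] * secs) for code, secs in seconds.items()}


for code, (secs, cost) in sorted(cost_by_tariff(datetime(2009, 2, 26, 15, 17, 19), 186234).items()):
    print(code, secs, cost)
```

This gives 57600, 36071 and 92561 seconds for N, O and P, matching the tariff_time column. Those add up to 186232 seconds, two fewer than the 186234-second duration, because the O tariff ends at 23:59:59 and the call crosses midnight twice. If that second matters, the tariff periods probably need to be treated as running up to, but not including, the start of the next one.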
For a single call, this will not be much different than running the code in VB in the client. For a lot of calls, this has the potential to be more efficient. I'm far from convinced that recursion is necessary in VB - straight iteration should be sufficient.
-- The bikari field is idle (non-working) time in minutes; you can remove it wherever it appears
select
hourwork =
case when timein <= timeout then
(abs(DATEDIFF(mi, timein, timeout)) - bikari)/60 -- calculate hours
else
SUM(abs(DATEDIFF(mi, timein, '23:59:00:00') + DATEDIFF(mi, '00:00:00', timeout) + 1) - bikari)/60 -- start time is later than end time
end,
minwork =
case when timein <= timeout then
(abs(DATEDIFF(MI, timein, timeout)) - bikari)%60 -- calculate minutes
else
SUM(abs(DATEDIFF(mi, timein, '23:59:00:00') + DATEDIFF(mi, '00:00:00', timeout) + 1) - bikari)%60 -- calculate minutes; start time is later than end time
end, tozihat
from kar_vasile
group by id, vid, datein, timein, timeout, tozihat, bikari
Effectively in T-SQL? I suspect not, with the schema as described at present.
It might be possible, however, if your rate table stores the three tariffs for each date. There is at least one reason why you might do this, apart from the problem at hand: it's likely at some point that rates for one period or another might change and you may need to have the historic rates available.
So say we have these tables:
CREATE TABLE rates (
from_date_time DATETIME
, to_date_time DATETIME
, rate MONEY
)
CREATE TABLE calls (
id INT
, started DATETIME
, ended DATETIME
)
I think there are three cases to consider (may be more, I'm making this up as I go):
1. a call occurs entirely within one rate period
2. a call starts in one rate period (a) and ends in the next (b)
3. a call spans at least one complete rate period
Assuming rate is per second, I think you might produce something like the following (completely untested) query:
SELECT id, DATEDIFF(ss, started, ended) * rate /* case 1 */
FROM rates JOIN calls ON started >= from_date_time AND ended <= to_date_time
UNION ALL
SELECT id, DATEDIFF(ss, started, to_date_time) * rate /* case 2a and the start of case 3 */
FROM rates JOIN calls ON started >= from_date_time AND started < to_date_time AND ended > to_date_time
UNION ALL
SELECT id, DATEDIFF(ss, from_date_time, ended) * rate /* case 2b and the last part of case 3 */
FROM rates JOIN calls ON started < from_date_time AND ended > from_date_time AND ended <= to_date_time
UNION ALL
SELECT id, DATEDIFF(ss, from_date_time, to_date_time) * rate /* case 3 for entire rate periods, should pick up all complete periods */
FROM rates JOIN calls ON started < from_date_time AND ended > to_date_time
You could apply a SUM..GROUP BY over that in SQL or handle it in your code. Alternatively, with carefully-constructed logic, you could probably merge the UNIONed parts into a single WHERE clause with lots of ANDs and ORs. I thought the UNION showed the intent rather more clearly.
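For what it's worth, the separate cases can also be seen as one clamping rule: charge from the later of the two start times to the earlier of the two end times, and charge nothing if that comes out negative. A small Python sketch of that idea follows; the function name seconds_charged is hypothetical, and times are plain second counts purely to keep it short.

```python
def seconds_charged(started, ended, from_time, to_time):
    """Seconds of the call (started..ended) that fall inside the rate period (from_time..to_time)."""
    return max(0, min(ended, to_time) - max(started, from_time))


# One rate period running from second 100 to second 200.
print(seconds_charged(120, 150, 100, 200))  # case 1: call entirely inside the period
print(seconds_charged(150, 260, 100, 200))  # case 2a: call starts inside, ends later
print(seconds_charged(40, 130, 100, 200))   # case 2b: call starts earlier, ends inside
print(seconds_charged(40, 260, 100, 200))   # case 3: call spans the whole period
print(seconds_charged(10, 50, 100, 200))    # no overlap at all
```

Multiplying each result by the period's rate and summing per call gives the same total as the SUM over the UNIONed parts.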
HTH & HIW (Hope It Works...)
We had a thread about this same problem elsewhere. Take a look, because it includes some pretty slick solutions.
Following on from Mike Woodhouse's answer, this may work for you:
SELECT id,
SUM(DATEDIFF(ss,
CASE WHEN started < from_date_time THEN from_date_time ELSE started END,
CASE WHEN ended > to_date_time THEN to_date_time ELSE ended END
) * rate)
FROM rates
JOIN calls ON started < to_date_time -- the call begins before the rate period ends
AND ended > from_date_time -- and finishes after the rate period begins
GROUP BY id
An actual schema for the relevant tables in your database would have been very helpful. I'll take my best guesses. I've assumed that the Rates table has start_time and end_time as the number of minutes past midnight.
Using a calendar table (a VERY useful table to have in most databases):
WHEN C.start_time < R.rate_start_time THEN R.rate_start_time
ELSE C.start_time
WHEN C.end_time > R.rate_end_time THEN R.rate_end_time
ELSE C.end_time
Calls C
DATEADD(mi, Rates.start_time, CAL.calendar_date) AS rate_start_time,
DATEADD(mi, Rates.end_time, CAL.calendar_date) AS rate_end_time,
Calendar CAL
1 = 1
CAL.calendar_date >= DATEADD(dy, -1, C.start_time) AND
CAL.calendar_date <= C.start_time
R.rate_start_time < C.end_time AND
R.rate_end_time > C.start_time
I just came up with this as I was typing, so it's untested and you will very likely need to tweak it, but hopefully you can see the general idea.
I also just realized that you use a start_time and a duration for your calls. You can just replace C.end_time wherever you see it with DATEADD(ss, C.duration, C.start_time), assuming that the duration is in seconds.
This should perform pretty quickly in any decent RDBMS assuming proper indexes, etc.
Provided that your calls last less than 100 days:
WITH generate_range(item) AS
(
SELECT 0
UNION ALL
SELECT item + 1
FROM generate_range
WHERE item < 100
)
SELECT tday, id, span
FROM (
SELECT tday, id,
DATEDIFF(minute,
CASE WHEN tbegin < clbegin THEN clbegin ELSE tbegin END,
CASE WHEN tend < clend THEN tend ELSE clend END
) AS span
FROM (
SELECT DATEADD(day, item, DATEDIFF(day, 0, clbegin)) AS tday,
ti.id, -- tariff identifier (assumed)
clbegin, clend, -- passed through so the outer CASE expressions can use them
DATEADD(minute, rangestart, DATEADD(day, item, DATEDIFF(day, 0, clbegin))) AS tbegin,
DATEADD(minute, rangeend, DATEADD(day, item, DATEDIFF(day, 0, clbegin))) AS tend
FROM calls, generate_range, tariff ti
WHERE DATEADD(day, 1, DATEDIFF(day, 0, clend)) > DATEADD(day, item, DATEDIFF(day, 0, clbegin))
) t1
) t2
WHERE span > 0
I'm assuming you keep your tariff ranges in minutes from midnight and count lengths in minutes too.
The big problem with performing this kind of calculation at the database level is that it takes resource away from your database while it's going on, both in terms of CPU and availability of rows and tables via locking. If you were calculating 1,000,000 tariffs as part of a batch operation, then that might run on the database for a long time and during that time you'd be unable to use the database for anything else.
If you have the resource, retrieve all the data you need with one transaction and do all the logic calculations outside the database, in a language of your choice. Then insert all the results. Databases are for storing and retrieving data, and any business logic they perform should be kept to an absolute bare minimum at all times. Whilst brilliant at some things, SQL isn't the best language for date or string manipulation work.
I suspect you're already on the right lines with your VBA work, and without knowing more it certainly feels like a recursive, or at least an iterative, problem to me. When done correctly recursion can be a powerful and elegant solution to a problem. Tying up the resources of your database very rarely is.
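For comparison, here is roughly what a plain iterative version could look like in client code. It is a Python sketch rather than VBA, and everything in it is an assumption made for illustration: the BANDS list (start hour, code, cents per second, with each band running until the next one starts), the function names, and the sample call of 186234 seconds starting at 2009-02-26 15:17:19.

```python
from datetime import datetime, time, timedelta
from decimal import Decimal

# Assumed bands: (start hour, code, cents per second). Each band runs until the
# next one starts, and the last one runs until midnight.
BANDS = [
    (0, "N", Decimal("0.9876")),
    (8, "P", Decimal("2.3456")),
    (19, "O", Decimal("1.2345")),
]


def seconds_per_band(call_start, call_end):
    """Walk the call from one tariff boundary to the next with a plain loop."""
    totals = {}
    cursor = call_start
    while cursor < call_end:
        midnight = datetime.combine(cursor.date(), time(0, 0))
        # Band in effect at the cursor: the last band that has already started.
        idx = max(i for i, (hour, _, _) in enumerate(BANDS)
                  if midnight + timedelta(hours=hour) <= cursor)
        if idx + 1 < len(BANDS):
            band_end = midnight + timedelta(hours=BANDS[idx + 1][0])
        else:
            band_end = midnight + timedelta(days=1)
        segment_end = min(band_end, call_end)
        code = BANDS[idx][1]
        totals[code] = totals.get(code, 0) + int((segment_end - cursor).total_seconds())
        cursor = segment_end
    return totals


def call_cost(call_start, call_end):
    """Total cost in cents: seconds in each band times that band's rate."""
    rates = {code: rate for _, code, rate in BANDS}
    bands = seconds_per_band(call_start, call_end)
    return sum((rates[code] * secs for code, secs in bands.items()), Decimal(0))


start = datetime(2009, 2, 26, 15, 17, 19)
end = start + timedelta(seconds=186234)
print(seconds_per_band(start, end), call_cost(start, end))
```

The loop advances the cursor to the next boundary on every pass, so it finishes after one step per band crossed and needs no recursion. Because the last band here runs right up to midnight, the seconds per band add up to the full call duration.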