Arithmetic overflow error when averaging/summing milliseconds - sql

Having a lot of trouble solving this (I looked through other posts with the same problem but unfortunately I didn't have much luck in applying proposed solutions to my situation).
I have a single (very large) table with transactional information. One of the columns is Transaction Execution Time (field type Time). Execution Time is usually <1 sec but it can go up to a couple of minutes.
Daily, weekly, monthly and yearly transaction reports need to be available, and they need to contain the average transaction time. Due to the large number of entries, I am hitting an overflow when computing the average/sum.
Here is a (simplified) sample I am using to test:
SELECT
DATEPART (YEAR, TimeStamp) as 'Year',
COUNT(*) as 'Transaction count',
AVG((DATEDIFF(MILLISECOND, '0:00:00',ExecutionTime))) as 'Average execution time',
SUM((DATEDIFF(MILLISECOND, '0:00:00',ExecutionTime))) as 'Total execution time'
FROM RecordedTransactions
GROUP BY
DATEPART (YEAR, TimeStamp)
What would be the best approach to solve the overflow?

You need to use DATEDIFF_BIG instead of DATEDIFF:
DATEDIFF_BIG ( datepart , startdate , enddate )
This function returns the count (as a signed big integer value) of the specified datepart boundaries crossed between the specified startdate and enddate.

Try instead ensuring that all of your datatypes are treated as a BIGINT rather than an int:
SELECT DATEPART(YEAR, TimeStamp) AS Year,
COUNT_BIG(*) AS [Transaction count],
AVG((DATEDIFF_BIG(MILLISECOND, '0:00:00', ExecutionTime))) AS [Average execution time],
SUM((DATEDIFF_BIG(MILLISECOND, '0:00:00', ExecutionTime))) AS [Total execution time]
FROM RecordedTransactions
GROUP BY DATEPART(YEAR, TimeStamp);
For int inputs, SUM and AVG return an int; for bigint inputs they return a bigint. COUNT returns an int, and COUNT_BIG returns a bigint. For SUM, this means that (as your query stands) it fails as soon as the total surpasses 2,147,483,647. Because DATEDIFF_BIG returns a bigint, your SUM can then return a value of up to 9,223,372,036,854,775,807.
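To get a feel for how quickly a 32-bit total runs out, here is a rough Python sketch of the arithmetic. The transaction count and per-transaction time are made-up numbers, not taken from the question, and `fits_in_int` is just a helper name for this illustration.

```python
# Limits of the two SQL Server integer types discussed above.
INT_MAX = 2**31 - 1       # 2,147,483,647
BIGINT_MAX = 2**63 - 1    # 9,223,372,036,854,775,807


def fits_in_int(total_ms: int) -> bool:
    """Return True if the total would fit in a 32-bit signed int."""
    return -2**31 <= total_ms <= INT_MAX


# Hypothetical workload: 5 million transactions of 500 ms each.
transaction_count = 5_000_000
execution_ms = 500
total_ms = transaction_count * execution_ms

print(total_ms)                # 2500000000
print(fits_in_int(total_ms))   # False: SUM over int would overflow
print(total_ms <= BIGINT_MAX)  # True: a bigint total has plenty of room

# An int holds less than 25 days' worth of milliseconds in total.
days_until_overflow = INT_MAX / 86_400_000
```

So with sub-second transactions, a yearly report only needs a few million rows before the int total overflows.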

Related

Why is the result of datediff year in Firebird too high?

I have a question about the datediff function in Firebird. When I try to diff two dates like 15.12.1999 and 30.06.2000 in SQL like this
SELECT
SUM(datediff (YEAR, W.FROM, W.TO)),
SUM(datediff (MONTH, W.FROM, W.TO)),
SUM(datediff (DAY, W.FROM, W.TO))
FROM WORKERS W
WHERE W.ID=1
I get 1 year, 6 months and 198 days as the result, but the value for years is wrong (it should of course be 0). How do I have to write my query to get the correct number of years? The documentation at https://firebirdsql.org/refdocs/langrefupd21-intfunc-datediff.html mentions this case, but it does not say how to solve the problem.
The documentation is not very clear, but I'm pretty sure that datediff() is counting the number of boundaries between two dates. (This is how the very similar function in SQL Server works.) Hence, for year, it is counting the number of "Dec 31st/Jan 1st" boundaries. This is explicitly explained in the documentation.
If you want a more accurate count, you can use a smaller increment. The following is pretty close:
(datediff(day, w.from, w.to) / 365.25) as years_diff
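The boundary-counting behaviour can be reproduced outside the database. Below is a small Python sketch using the two dates from the question. The function names are made up for this illustration, and `full_years` is an alternative not mentioned in the answer above: it only counts a year once the anniversary has been reached.

```python
from datetime import date


def boundary_years(start: date, end: date) -> int:
    """Count year boundaries crossed, ignoring month and day."""
    return end.year - start.year


def boundary_months(start: date, end: date) -> int:
    """Count month boundaries crossed, ignoring the day."""
    return (end.year - start.year) * 12 + (end.month - start.month)


def full_years(start: date, end: date) -> int:
    """Count completed years (assumes start <= end)."""
    years = end.year - start.year
    if (end.month, end.day) < (start.month, start.day):
        years -= 1  # anniversary not reached yet in the final year
    return years


start = date(1999, 12, 15)
end = date(2000, 6, 30)

print(boundary_years(start, end))     # 1
print(boundary_months(start, end))    # 6
print((end - start).days)             # 198
print(full_years(start, end))         # 0
print((end - start).days / 365.25)    # a bit over half a year
```

The first three values match what the query returned, which is consistent with boundary counting.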

Calculating the AVG value per GROUP in the GROUP BY Clause

I'm working on a query in SQL Server 2005 that looks at a table of recorded phone calls, groups them by the hour of the day, and computes the average wait time for each hour in the day.
I have a query that I think works, but I'm having trouble convincing myself it's right.
SELECT
DATEPART(HOUR, CallTime) AS Hour,
(AVG(calls.WaitDuration) / 60) AS WaitingTimesInMinutes
FROM (
SELECT
CallTime,
WaitDuration
FROM Calls
WHERE DATEADD(day, DATEDIFF(Day, 0, CallTime), 0) = DATEADD(day, DATEDIFF(Day, 0, GETDATE()), 0)
AND DATEPART(HOUR, CallTime) BETWEEN 6 AND 18
) AS calls
GROUP BY DATEPART(HOUR, CallTime)
ORDER BY DATEPART(HOUR, CallTime);
To clarify what I think is happening, this query looks at all calls made on the same day as today, and where the hour of the call is between 6 and 18 -- the times are recorded and SELECTed in 24-hour time, so this between hours is to get calls between 6am and 6pm.
Then, the outer query computes the average of the WaitDuration column (and converts seconds to minutes) and then groups each average by the hour.
What I'm uncertain of is this: Are the reported BY HOUR averages only for the calls made in that hour's timeframe? Or does it compute each reported average using all the calls made on the day and between the hours? I know the AVG function has an optional OVER/PARTITION clause, and it's been a while since I used the AVG group function. What I would like is that each result grouped by an hour shows ONLY the average wait time for that specific hour of the day.
Thanks for your time in this.
The grouping happens on the values that get spit out of datepart(hour, ...). You're already filtering on that value so you know they're going to range between 6 and 18. That's all that the grouping is going to see.
Now of course the datepart() function does what you're looking for in that it looks at the clock and gives the hour component of the time. If you want your group to coincide with HH:00:00 to HH:59:59.997 then you're in luck.
I've already noted in comments that you probably meant to filter your range from 6 to 17 and that your query will probably perform better if you change that and compare your raw CallTime value against a static range instead. Your reasoning looks correct to me. And because your reasoning is correct, you don't need the inner query (derived table) at all.
Also, if WaitDuration is an integer then you're going to be doing integer division in your output. You'd need to cast to decimal in that case, or change the divisor to a decimal value like 60.00.
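To see how much the truncation can matter, here is a tiny Python sketch with made-up wait times in seconds. It assumes positive values, where Python's `//` truncates the same way as integer division in SQL Server. Note that AVG over an int column also returns an int there, so the result is truncated twice.

```python
# Hypothetical wait durations for one hour, in seconds.
wait_seconds = [30, 45, 100]

# Integer arithmetic throughout, as with AVG(int_column) / 60.
avg_int_seconds = sum(wait_seconds) // len(wait_seconds)   # 175 // 3
minutes_int = avg_int_seconds // 60

# Decimal arithmetic, as with AVG(CAST(... AS decimal)) / 60.00.
minutes_decimal = sum(wait_seconds) / len(wait_seconds) / 60.0

print(avg_int_seconds)              # 58
print(minutes_int)                  # 0
print(round(minutes_decimal, 2))    # 0.97
```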
Yes if you use the AVG function with a GROUP BY only the items in that group are averaged. Just like if you use the COUNT function with a GROUP BY only the items in that group are counted.
You can use windowing functions (OVER/PARTITION) to conceptually perform GROUP BYs on different criteria for a single function.
eg
AVG(zed) OVER (PARTITION BY DATEPART(YEAR, CallTime)) as YEAR_AVG
Are the reported BY HOUR averages only for the calls made in that hour's timeframe?
Yes. The WHERE clause is applied before the grouping and aggregation, so the aggregation will apply to all records that fit the WHERE clause and within each group.
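The filter-then-group order can be mimicked in a few lines of Python. This is only a sketch with invented call data: it applies the hour filter from the question (the "same day as today" filter is left out for brevity) and then averages each hour's bucket on its own.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical (CallTime, WaitDuration in seconds) rows.
calls = [
    (datetime(2024, 1, 15, 5, 50), 600),   # dropped by the hour filter
    (datetime(2024, 1, 15, 6, 10), 60),
    (datetime(2024, 1, 15, 6, 40), 180),
    (datetime(2024, 1, 15, 9, 5), 300),
    (datetime(2024, 1, 15, 19, 0), 900),   # dropped by the hour filter
]

# Step 1 (WHERE): keep only calls whose hour is between 6 and 18.
filtered = [(t, w) for t, w in calls if 6 <= t.hour <= 18]

# Step 2 (GROUP BY): bucket the remaining rows by hour.
groups = defaultdict(list)
for call_time, wait in filtered:
    groups[call_time.hour].append(wait)

# Step 3 (AVG): each average only sees the rows in its own bucket.
avg_minutes = {
    hour: sum(waits) / len(waits) / 60
    for hour, waits in sorted(groups.items())
}

print(avg_minutes)  # {6: 2.0, 9: 5.0}
```

The 5:50 and 19:00 calls never reach the aggregation, and the 9 o'clock average is not affected by the calls at 6.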

Standard Deviation over Datetime column grouped by Processname

I have a Table with fields ProcessName, StartDate and EndDate. I need to know the standard deviation of the EndDate in order to know the limit of time I can wait for a process to finish. Based on this standard deviation e-mails will be sent if some process is taking too much time to run.
My idea was to use the following query:
select
ProcessName,
STDEV(tm)
from (
select
ProcessName,
cast(EndDate as decimal(18,6)) tm
from Reports..ExecutionControl
) t1
group by ProcessName
But, first I don't know what it returns (if it is percentage or not), and maybe this is a lack of statistical understanding, and also I need to get the time limit a process can take, and it is not calculating it.
Could someone help me to sort this out? Thanks in advance to all!
Hmmm, I'm not sure how you would use the standard deviation without an average, but that is your question.
I would expect a query like this:
select ProcessName,
STDEV(datediff(second, Startdate, Enddate)) as stdev_dur
from Reports..ExecutionControl
group by ProcessName;
That is, the standard deviation is calculated based on the duration, not EndDate.
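For what the number means: the result is in the same unit as the duration (seconds here), not a percentage. Here is a Python sketch with invented durations. `statistics.stdev` is the sample standard deviation, which is also what STDEV computes in SQL Server. The "mean plus three standard deviations" limit is only one possible rule and is an assumption of this sketch, not something from the question.

```python
from statistics import mean, stdev

# Hypothetical run durations in seconds per process, i.e. the values
# DATEDIFF(second, StartDate, EndDate) would produce for each row.
durations = {
    "load": [100, 110, 90, 105, 95],
    "export": [30, 60],
}

# Assumed alerting rule: flag runs longer than mean + 3 * stdev.
K = 3

limits = {}
for process_name, seconds in durations.items():
    avg = mean(seconds)
    sd = stdev(seconds)           # sample standard deviation, in seconds
    limits[process_name] = avg + K * sd
    print(process_name, avg, round(sd, 2), round(limits[process_name], 2))
```

With only a handful of runs per process the estimate is quite noisy, so the limit should be treated as a rough guide.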

DAX Utilization % - Measure Formula

We are trying to perform a Utilization % calculation in DAX and I cannot figure out the formula. Here's the setup:
Table images are here: http://imgh.us/dax.png
Table 1: There is a [timesheet] Table for our resources (R1, R2, R3, R4). Each timesheet record has a date and number of hours charged.
Table 2: There is a [resource] table with resource [hire date] and [termination date].
Table 3: There is a [calendar] table with available hours for each date. Weekdays have 8 h/day, weekends have 0 h/day
For any given filter context we need to calculate:
Utilization % = (hours charged) / (hours available)
This is not so hard, except for the fact that the (hours available) must only include dates between the [hire date] and [termination date] for each employee. Getting this to work at the aggregate level has been very difficult for us since the calculation must consider each individual employee's date range.
The closest we have gotten is:
[hours available] := SUMX(DISTINCT(timesheet[resource_key]), SUM(calendar[utility_hours]))
[hours charged] := SUM(timesheet[bill_hours])
[Utilization %] := [hours charged] / [hours available]
However, this does not perform the resource hire/term date range filtering that is required.
Thanks for any help you can offer!
Your measure for [hours available] needs to be revised so that instead of summing all the utility hours in the calendar, it only sums over a filtered set of rows where the calendar date falls between the start date and the termination date of the respective resource.
[hours available]:=
SUMX(
    DISTINCT(timesheet[resource_key]),
    CALCULATE(
        SUM(calendar[utility_hours]),
        FILTER('calendar', calendar[date_key] >= FIRSTDATE(resources[hire_date_key])),
        FILTER('calendar', calendar[date_key] <= LASTDATE(resources[termination_date_key]))
    )
)
You may want to amend the ">=" and "<=" depending on whether you wish to include the start and finish dates in the calculation.
EDIT: Revised version to pick up where resources are not used in the month, but are 'active'
[hours available]:=
SUMX(
    resources,
    CALCULATE(
        SUM(calendar[utility_hours]),
        FILTER('calendar', calendar[date_key] >= FIRSTDATE(resources[hire_date_key])),
        FILTER('calendar', calendar[date_key] <= LASTDATE(resources[termination_date_key]))
    )
)
But you also need to change your [hours charged] to give zeroes, rather than blanks by adding a zero:
[hours charged]:=SUM(timesheet[bill_hours])+0
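To check the numbers the measure should produce, the intended logic can be written out in plain Python. This is only a sketch of the calculation, not of how DAX evaluates it: the resource dates, the reporting period and the charged hours are invented, and weekdays are assumed to have 8 available hours as in the calendar table described in the question.

```python
from datetime import date, timedelta


def utility_hours(day: date) -> int:
    """8 hours on weekdays, 0 on weekends (Monday is weekday() == 0)."""
    return 8 if day.weekday() < 5 else 0


def hours_available(resources, period_start: date, period_end: date) -> int:
    """Sum calendar hours per resource, only between hire and termination."""
    total = 0
    day = period_start
    while day <= period_end:
        for hire_date, termination_date in resources.values():
            if hire_date <= day <= termination_date:
                total += utility_hours(day)
        day += timedelta(days=1)
    return total


# Hypothetical resources: (hire date, termination date).
resources = {
    "R1": (date(2023, 1, 1), date(2099, 12, 31)),
    "R2": (date(2024, 1, 8), date(2099, 12, 31)),  # hired in week two
}

# Two-week filter context starting on Monday 2024-01-01.
available = hours_available(resources, date(2024, 1, 1), date(2024, 1, 14))
charged = 90  # hypothetical total of timesheet[bill_hours]

print(available)            # 120 (80 for R1 + 40 for R2)
print(charged / available)  # 0.75
```

R2 only contributes hours from the hire date onwards, which is the behaviour the two FILTER conditions are meant to give.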

SQL: select one record for each day nearest to a specific time

I have one table that stores values with a point in time:
CREATE TABLE values
(
value DECIMAL,
datetime DATETIME
)
There may be many values on each day, there may also be only one value for a given day. Now I want to get the value for each day in a given timespan (e.g. one month) which is nearest to a given time of day. I only want to get one value per day if there are records for this day or no value if there are no records. My database is PostgreSQL. I'm quite stuck with that. I could just get all values in the timespan and select the nearest value for each day programmatically, but that would mean to pull a huge amount of data from the database, because there can be many values on one day.
(Update)
To formulate it a bit more abstract: I have data of arbitrary precision (could be one minute, could be two hours or two days) and I want to convert it to a fixed precision of one day, with a specific time of day.
(second update)
This is the query from the accepted answer with correct PostgreSQL type conversions, assuming the desired time is 16:00:
SELECT datetime, value FROM values, (
SELECT DATE(datetime) AS date, MIN(ABS(EXTRACT(EPOCH FROM TIME '16:00' - CAST(datetime AS TIME)))) AS timediff
FROM values
GROUP BY DATE(datetime)
) AS besttimes
WHERE
CAST(values.datetime AS TIME) BETWEEN TIME '16:00' - CAST(besttimes.timediff::text || ' seconds' AS INTERVAL)
AND TIME '16:00' + CAST(besttimes.timediff::text || ' seconds' AS INTERVAL)
AND DATE(values.datetime) = besttimes.date
How about going into this direction?
SELECT values.value, values.datetime
FROM values,
( SELECT DATE(datetime) AS date, MIN(ABS(_WANTED_TIME_ - TIME(datetime))) AS timediff
FROM values
GROUP BY DATE(datetime)
) AS besttimes
WHERE TIME(values.datetime) BETWEEN _WANTED_TIME_ - besttimes.timediff
AND _WANTED_TIME_ + besttimes.timediff
AND DATE(values.datetime) = besttimes.date
I am not sure about the date/time extracting and abs(time) functions, so you will have to replace them probably.
It appears you have two parts to solve:
Are there any results for a day at all?
If there are, then which is the nearest one?
By shortcircuiting the process at part 1 if you have no results you'll save a lot of execution time.
The next thing to note is that you don't have to pull the data out of the database: you can work out whether there is an answer on the server first, using PL/pgSQL functions (or something else).
Once you have a selection of times to check you can use intervals to compare them. Check the Postgres docs on intervals and datetime functions for precise instructions, but basically you subtract each selected time from the time you've given, and the one with the smallest interval is the one you want.
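For reference, the selection rule itself (smallest absolute distance to the wanted time, one row per day, nothing for days without rows) looks like this in Python. It is a sketch with invented sample data and helper names, and unlike the BETWEEN-based queries above it keeps only the first row when two rows are equally close.

```python
from datetime import datetime, time


def seconds_of_day(t: time) -> int:
    """Seconds since midnight for a time of day."""
    return t.hour * 3600 + t.minute * 60 + t.second


def nearest_per_day(rows, wanted: time):
    """Return {day: (timestamp, value)} for the row closest to `wanted`."""
    target = seconds_of_day(wanted)
    best = {}
    for stamp, value in rows:
        diff = abs(seconds_of_day(stamp.time()) - target)
        day = stamp.date()
        # Keep the row with the smallest distance seen so far for this day.
        if day not in best or diff < best[day][0]:
            best[day] = (diff, stamp, value)
    return {day: (stamp, value) for day, (_, stamp, value) in sorted(best.items())}


# Hypothetical rows: three values on 1 March, none on 2 March, one on 3 March.
rows = [
    (datetime(2024, 3, 1, 9, 0), 1.0),
    (datetime(2024, 3, 1, 15, 30), 2.0),
    (datetime(2024, 3, 1, 17, 0), 3.0),
    (datetime(2024, 3, 3, 23, 0), 4.0),
]

result = nearest_per_day(rows, time(16, 0))
for day, (stamp, value) in result.items():
    print(day, stamp.time(), value)
# 2024-03-01 15:30:00 2.0
# 2024-03-03 23:00:00 4.0
```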