PostgreSQL: column must appear in the GROUP BY clause or be used in an aggregate function

I'm trying to write a script which will take hourly averages of time series data. I.e. I have 5 years' worth of data at 5-minute intervals, but I need 5 years of hourly averages. PostgreSQL seems to be complaining about a column missing from the GROUP BY clause which is already in there? What schoolboy error am I making here?
SELECT
time_audit_primary.clockyear,
time_audit_primary.clockmonth,
time_audit_primary.clockday,
time_audit_primary.clockhour,
AVG (time_audit_primary.time_offset) AS timeoffset,
AVG (time_audit_primary.std_dev) AS std_dev
FROM tempin.time_audit_primary
GROUP BY (time_audit_primary.clockyear,
time_audit_primary.clockmonth,
time_audit_primary.clockday,
time_audit_primary.clockhour)

It's the parentheses around the GROUP BY columns that are causing the error. Just remove them:
SELECT
time_audit_primary.clockyear,
time_audit_primary.clockmonth,
time_audit_primary.clockday,
time_audit_primary.clockhour,
AVG (time_audit_primary.time_offset) AS timeoffset,
AVG (time_audit_primary.std_dev) AS std_dev
FROM tempin.time_audit_primary
GROUP BY time_audit_primary.clockyear,
time_audit_primary.clockmonth,
time_audit_primary.clockday,
time_audit_primary.clockhour
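As an aside, if the table had a single timestamp column instead of separate year/month/day/hour columns, the same hourly roll-up could be written with date_trunc. This is only a sketch; the ts column below is hypothetical:
-- Hypothetical variant: assumes a timestamp column "ts" in place of
-- clockyear/clockmonth/clockday/clockhour
SELECT date_trunc('hour', ts) AS clock_hour,
       AVG(time_offset)       AS timeoffset,
       AVG(std_dev)           AS std_dev
FROM tempin.time_audit_primary
GROUP BY date_trunc('hour', ts)
ORDER BY clock_hour;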

Related

How to get the Average of a Row After Getting Count?

I am trying to use an aggregate function to get the average of a count (AVG(COUNT(...))) of rows in SQL Server. However, I continue to get this message:
"Cannot perform an aggregate function on an expression containing an aggregate or a subquery."
The first picture shows the original table; the second shows the counts for each officer_id. I am trying to find the average number of calls per officer and cannot seem to find the right SQL query to do it.
The query I thought may work is:
SELECT AVG(COUNT(officer_id))
FROM crime_officers
ORDER BY officer_id;
But this is where I get the aggregate error. Does anyone have any recommendations?
Updated table, produced with this query:
SELECT officer_id, COUNT(crime_id)
FROM crime_officers
GROUP BY officer_id;
Original table: crime_officers
If I understand correctly, this query provides the average number of crimes per officer: a single value, equal to the total number of crimes divided by the number of distinct officers.
SELECT COUNT(*)*1.0/COUNT(distinct officer_id) as 'Average Crimes per Officer'
FROM crime_officers;
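An equivalent sketch that keeps the AVG-of-a-COUNT shape from the question is to compute the per-officer counts in a derived table first and then average them; the * 1.0 forces decimal rather than integer averaging:
-- Sketch: count calls per officer first, then average those counts
SELECT AVG(calls_per_officer * 1.0) AS avg_calls_per_officer
FROM (
    SELECT officer_id, COUNT(crime_id) AS calls_per_officer
    FROM crime_officers
    GROUP BY officer_id
) AS per_officer;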

DATETIME_DIFF throwing error when using safe divide

I've created a query that I'm hoping to use to fill a table with daily budgets at the end of every day. To do this, I just need some simple maths:
monthly budget / number of days left in the month.
Then, at the end of the month, I can SUM all of the rows to calculate an accurate budget.
The query is as follows:
SELECT *,
ROUND(SAFE_DIVIDE(budget, DATETIME_DIFF(CURRENT_DATE(), LAST_DAY(CURRENT_DATE()), DAY)),2) AS daily_budget
FROM `mm-demo-project.marketing_hub.budget_manager`
When executing the query, my results come out as negative numbers, which, according to the documentation for this function, is likely caused by the result overflowing the result type.
I've made a fool's guess at rounding the calculation. Needless to say, it did not work at all.
How can I stop my query from returning negative numbers?
Use the query below:
SELECT *,
ROUND(SAFE_DIVIDE(budget, DATETIME_DIFF(LAST_DAY(CURRENT_DATE()), CURRENT_DATE(), DAY)),2) AS daily_budget
FROM `mm-demo-project.marketing_hub.budget_manager`
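For context, DATETIME_DIFF(a, b, DAY) in BigQuery returns a minus b, so with CURRENT_DATE() as the first argument the days-until-month-end come out negative; it is the argument order, not overflow, that produces the sign. A minimal illustration:
-- DATETIME_DIFF(a, b, DAY) = a - b, so the later datetime must come first
SELECT
  DATETIME_DIFF(DATETIME '2023-06-10', DATETIME '2023-06-30', DAY) AS wrong_order,  -- -20
  DATETIME_DIFF(DATETIME '2023-06-30', DATETIME '2023-06-10', DAY) AS right_order;  --  20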

Calculating Outliers - Nested Aggregate Error

I am currently working in SQL Workbench/J and Amazon Redshift.
I am working on a query with the intent to identify the number of outliers within a data set.
My source data contains one record per day for multiple symbols. I am utilizing 30 days of trailing data. In short, for 30 days there are ten symbols with 30 records each.
I am then utilizing the following query to calculate the mean, standard deviation, and upper/lower control limits for each unique symbol based upon the 30 day data set.
select
symbol,
avg(high) as MEAN,
cast(stddev_samp(high) as dec(14,2)) STDV,
(MEAN+STDV*3) as UCL,
(MEAN-STDV*3) as LCL
from historical
group by symbol
;
My next step will be calculating how many individual values from the 'high' column exceed the upper control limit calculated value. I have tried to add the following count(case...) statement, but it is failing:
select
symbol,
avg(high) as MEAN,
cast(stddev_samp(high) as dec(14,2)) STDV,
(MEAN+STDV*3) as UCL,
(MEAN-STDV*3) as LCL,
count(case when high>avg(high) then 1 else 0 end) as outlier
from historical
group by symbol
;
The specific error is
Amazon Invalid operation: aggregate function calls may not have nested aggregate or window function
Is a count(case..) statement the right method to utilize here, or what would the recommended approach or example be?
There are a number of ways to do this but I think all of them involve a sub-query. This is because you are comparing an aggregate (avg) to a per-row value (high) and then summing the comparison.
I'd go with a sub-query where you perform an avg() window function partitioned by symbol. This will give you the average of the group on every row; then just do the query as you have it. Kinda like this:
select symbol,
       avg(high) as MEAN,
       cast(stddev_samp(high) as dec(14,2)) STDV,
       (MEAN+STDV*3) as UCL,
       (MEAN-STDV*3) as LCL,
       count(case when high > group_avg then 1 end) as outlier
from (
    select *, avg(high) over (partition by symbol) as group_avg
    from historical
) as sub
group by symbol;
(You could also replace "avg(high) as MEAN" with "min(group_avg) as MEAN" since you already computed the average in the window function. Just a possible slight optimization.)
Use window functions to calculate the values for the standard deviation and mean. Then aggregate:
select symbol, mean, STDV,
(MEAN+STDV*3) as UCL, (MEAN-STDV*3) as LCL,
sum( (high > mean)::int ) as outlier
from (select h.*,
avg(high) over (partition by symbol) as mean,
cast(stddev_samp(high) over (partition by symbol) as dec(14,2)) as STDV
from historical h
) h
group by symbol, mean, STDV;
Your definition of "outlier" is rather strange -- merely being higher than the average is going to happen (very roughly) about half the time. The more typical definition I have seen is outside the range of 2 standard deviations.
As a comment not directly related to the SQL: it seems unusual to be using future data to determine outliers. I would expect that a trailing 30 days would be used for that purpose. However, that is not the question you have asked here.
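For illustration, a sketch of that 2-standard-deviation definition, using the same window-function shape and the same historical table as above:
-- Sketch: count values outside mean +/- 2 standard deviations per symbol
select symbol, mean, stdv,
       sum(((high > mean + 2 * stdv) or (high < mean - 2 * stdv))::int) as outliers
from (select h.*,
             avg(high) over (partition by symbol) as mean,
             cast(stddev_samp(high) over (partition by symbol) as dec(14,2)) as stdv
      from historical h
     ) h
group by symbol, mean, stdv;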

Calculating the AVG value per GROUP in the GROUP BY Clause

I'm working on a query in SQL Server 2005 that looks at a table of recorded phone calls, groups them by the hour of the day, and computes the average wait time for each hour in the day.
I have a query that I think works, but I'm having trouble convincing myself it's right.
SELECT
DATEPART(HOUR, CallTime) AS Hour,
(AVG(calls.WaitDuration) / 60) AS WaitingTimesInMinutes
FROM (
SELECT
CallTime,
WaitDuration
FROM Calls
WHERE DATEADD(day, DATEDIFF(Day, 0, CallTime), 0) = DATEADD(day, DATEDIFF(Day, 0, GETDATE()), 0)
AND DATEPART(HOUR, CallTime) BETWEEN 6 AND 18
) AS calls
GROUP BY DATEPART(HOUR, CallTime)
ORDER BY DATEPART(HOUR, CallTime);
To clarify what I think is happening, this query looks at all calls made on the same day as today and where the hour of the call is between 6 and 18 -- the times are recorded and SELECTed in 24-hour time, so this BETWEEN on the hours is to get calls between 6 AM and 6 PM.
Then, the outer query computes the average of the WaitDuration column (and converts seconds to minutes) and then groups each average by the hour.
What I'm uncertain of is this: Are the reported BY HOUR averages only for the calls made in that hour's timeframe? Or does it compute each reported average using all the calls made on the day and between the hours? I know the AVG function has an optional OVER/PARTITION clause, and it's been a while since I used the AVG group function. What I would like is that each result grouped by an hour shows ONLY the average wait time for that specific hour of the day.
Thanks for your time in this.
The grouping happens on the values that get spit out of datepart(hour, ...). You're already filtering on that value so you know they're going to range between 6 and 18. That's all that the grouping is going to see.
Now of course the datepart() function does what you're looking for in that it looks at the clock and gives the hour component of the time. If you want your group to coincide with HH:00:00 to HH:59:59.997 then you're in luck.
I've already noted in comments that you probably meant to filter your range from 6 to 17 and that your query will probably perform better if you change that and compare your raw CallTime value against a static range instead. Your reasoning looks correct to me. And because your reasoning is correct, you don't need the inner query (derived table) at all.
Also, if WaitDuration is an integer then you're going to be doing integer division in your output. You'd need to cast to decimal in that case, or change the divisor to a decimal value like 60.00.
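Putting those suggestions together, a sketch of the simplified query (no derived table, a sargable date filter built with the same DATEADD/DATEDIFF trick, hours 6 through 17, and a decimal cast so an integer WaitDuration doesn't truncate) might look like this:
SELECT
    DATEPART(HOUR, CallTime) AS Hour,
    AVG(CAST(WaitDuration AS decimal(10,2))) / 60 AS WaitingTimesInMinutes
FROM Calls
WHERE CallTime >= DATEADD(day, DATEDIFF(day, 0, GETDATE()), 0)      -- today at 00:00
  AND CallTime <  DATEADD(day, DATEDIFF(day, 0, GETDATE()) + 1, 0)  -- tomorrow at 00:00
  AND DATEPART(HOUR, CallTime) BETWEEN 6 AND 17                     -- 6:00 AM through 5:59 PM
GROUP BY DATEPART(HOUR, CallTime)
ORDER BY DATEPART(HOUR, CallTime);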
Yes if you use the AVG function with a GROUP BY only the items in that group are averaged. Just like if you use the COUNT function with a GROUP BY only the items in that group are counted.
You can use windowing functions (OVER/PARTITION) to conceptually perform GROUP BYs on different criteria for a single function.
eg
AVG(zed) OVER (PARTITION BY DATEPART(YEAR, CallTime)) as YEAR_AVG
Are the reported BY HOUR averages only for the calls made in that hour's timeframe?
Yes. The WHERE clause is applied before the grouping and aggregation, so the aggregation will apply to all records that fit the WHERE clause and within each group.

Query to find a weekly average

I have an SQLite database with the following fields for example:
date (yyyymmdd format)
total (0.00 format)
There is typically 2 months of records in the database. Does anyone know a SQL query to find a weekly average?
I could easily just execute:
SELECT COUNT(1) as total_records, SUM(total) as total FROM stats_adsense
Then just divide total by 7, but unless the number of days in the db is exactly divisible by 7 I don't think it will be very accurate, especially if there are fewer than 7 days of records.
To get a daily summary it's obviously just total / total_records.
Can anyone help me out with this?
You could try something like this:
SELECT strftime('%W', thedate) theweek, avg(total) theaverage
FROM table GROUP BY strftime('%W', thedate)
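Building on that strftime() idea: since the dates are stored as yyyymmdd text they need to be rewritten as yyyy-mm-dd before strftime() can parse them, and grouping on %Y-%W keeps weeks from different years apart. A sketch that sums each week first and then averages those weekly sums:
-- Sketch: convert yyyymmdd to yyyy-mm-dd, sum per calendar week, then average
SELECT AVG(week_total) AS avg_weekly_total
FROM (
    SELECT strftime('%Y-%W',
                    substr(date, 1, 4) || '-' || substr(date, 5, 2) || '-' || substr(date, 7, 2)
           ) AS yearweek,
           SUM(total) AS week_total
    FROM stats_adsense
    GROUP BY yearweek
) AS weekly;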
I'm not sure how the syntax would work in SQLite, but one way would be to parse out the date parts of each [date] field, specify the WEEK and DAY boundaries in your WHERE clause, and then GROUP BY the week. This will give you a true average regardless of whether there are rows or not.
Something like this (using T-SQL):
SELECT DATEPART(w, theDate), Avg(theAmount) as Average
FROM Table
GROUP BY DATEPART(w, theDate)
This will return a row for every week. You could filter it in your WHERE clause to restrict it to a given date range.
Hope this helps.
Your weekly average is
daily * 7
Obviously this doesn't take into account specific weeks, but you can get that by narrowing the result set to a date range.
You'll have to omit from the sum those records which don't belong to a full week. So, prior to summing up, you'll have to find the min and max of the dates, manipulate them so that they form "whole" weeks, and then run your original query with a WHERE clause that limits the date values to the new range. Maybe you can even put all this into one query. I'll leave that up to you. ;-)
Those values which are "truncated" are then not used, obviously. If there aren't enough values for even one full week, there's no result at all. But there's no solution to that, apparently.
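A simpler variant of that trimming, assuming exactly one row per day, is to keep only the year/week groups that contain all 7 daily rows before averaging:
-- Sketch: average only over weeks that have a full 7 daily rows
SELECT AVG(week_total) AS avg_full_week_total
FROM (
    SELECT strftime('%Y-%W',
                    substr(date, 1, 4) || '-' || substr(date, 5, 2) || '-' || substr(date, 7, 2)
           ) AS yearweek,
           SUM(total) AS week_total,
           COUNT(*)   AS days_in_week
    FROM stats_adsense
    GROUP BY yearweek
) AS weekly
WHERE days_in_week = 7;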