This question already has answers here:
Function to Calculate Median in SQL Server
(37 answers)
Closed 5 years ago.
I have a number of select statements which calculate fields such as sums, divisions and averages. However, I now need to include a median based on the query, as well as a mean (AVG). The table contains 50,000 rows in the SQL Server database, and the results need to come back in the same query, one value per line.
I know there is no median function in SQL Server, or at least none that I am aware of. I am using SQL Server 2012. So if anyone has an idea, I would welcome your thoughts, as I cannot be the only person to have come up against this.
An example would be something like this:
Select
Round (AVG ([LENGTH]),2) as Length_X,
Round (Median ([LENGTH]),2) as Length_median,
I understand MEDIAN is not a valid SQL function; it is included above just for demo purposes to get my point across.
Cheers
If you have numeric data, then there is a "median" function. It is just spelled differently:
select percentile_cont(0.5) within group (order by ??) over ()
or
select percentile_disc(0.5) within group (order by ??) over ()
(The difference between the two is subtle and probably doesn't matter for most purposes.)
It is, unfortunately, not an aggregation function, so it is often used in a subquery.
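For example, to get both the mean and the median in one result, as the question asks, a sketch for SQL Server 2012 (the table name measurements is a hypothetical stand-in for the asker's actual table):

```sql
-- Sketch only: "measurements" and its [LENGTH] column are stand-ins.
-- PERCENTILE_CONT is available from SQL Server 2012 on.
SELECT DISTINCT
    ROUND(AVG([LENGTH]) OVER (), 2) AS Length_X,
    ROUND(PERCENTILE_CONT(0.5)
              WITHIN GROUP (ORDER BY [LENGTH]) OVER (), 2) AS Length_median
FROM measurements;
```

The DISTINCT collapses the identical per-row values down to a single row; add a PARTITION BY inside the OVER clauses to get one mean and median per group instead.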
Related
This question already has answers here:
Is SQL GROUP BY a design flaw? [closed]
(9 answers)
Is it really necessary to have GROUP BY in the SQL standard
(3 answers)
Closed 12 months ago.
Lately I have been dealing with extremely wide queries that perform a lot of transforms on data, and I am annoyed by having to maintain wide group by statements. This has me wondering: why do they exist?
For example
select
company,
sum(owing) as owing
from
receivables
group by
company
Given this statement, it seems to me that the group by is implied.
There is an aggregate function
The only field not part of an aggregation is company.
Therefore, I would expect that a query engine could determine that company should be the thing grouped on.
select
company,
sum(owing) as owing
from
receivables
My general assumption is that something like this exists for a reason; I just don't understand what that reason is.
What is the scenario that makes the existence of group by necessary?
Update
Based on comments: a point regarding multi-table queries making the grouping less obvious to the engine, and a point regarding multiple non-aggregated fields.
select
c.name as company,
t.curr as currency,
sum(t.amt) as owing
from
company c
inner join transactions t on c.id = t.comp_id
having
sum(t.amt) < 0
This (more realistic) version of the original query uses two tables. It is still unclear to me why the engine would not know to group on company and currency, as they are the non-aggregated fields.
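In other words, I would expect the engine to infer the grouping that I currently have to spell out by hand:

```sql
select
    c.name as company,
    t.curr as currency,
    sum(t.amt) as owing
from
    company c
    inner join transactions t on c.id = t.comp_id
group by
    c.name,
    t.curr
having
    sum(t.amt) < 0
```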
An example from Oracle which supports nested aggregate functions
Assume that you have a table of die (cube) roll results.
The following query shows us the distribution of throws:
select result
,count(*) as count
from cube_roll
group by result
RESULT  COUNT
------  -----
1       11
2       23
3       12
4       23
5       15
6       16
The following query shows us the maximum count for the results.
Please note that result does not appear in the SELECT clause.
select max(count(*)) as max_count
from cube_roll
group by result
MAX_COUNT
23
Please note that result cannot be added to the SELECT clause.
select result -- invalid reference
,max(count(*)) as max_count
from cube_roll
group by result
ORA-00937: not a single-group group function
Fiddle
I know it wasn't allowed in SQL-92, but it may have changed since then, particularly when a window is applied. Can you explain the changes and give the version (or versions, if there were several) in which they were introduced?
Examples
Is SUM(COUNT(votes.option_id)) OVER() valid syntax per standard SQL:2016 (or earlier)?
This is my comment (unanswered, and probably unlikely to be answered in such an old question) in Why can you nest aggregate functions when using a window function in PostgreSQL?.
The Calculating Running Total (SQL) kata at Codewars has as its most upvoted solution (using PostgreSQL 13.0, a highly standard-compliant engine, so the code is likely to be standard) this one:
SELECT
CREATED_AT::DATE AS DATE,
COUNT(CREATED_AT) AS COUNT,
SUM(COUNT(CREATED_AT)) OVER (ORDER BY CREATED_AT::DATE ROWS UNBOUNDED PRECEDING)::INT AS TOTAL
FROM
POSTS
GROUP BY
CREATED_AT::DATE
(Which could be simplified to:
SELECT
created_at::DATE date,
COUNT(*) COUNT,
SUM(COUNT(*)) OVER (ORDER BY created_at::DATE)::INT total
FROM posts
GROUP BY created_at::DATE
I assume the ::s are a new syntax for casting I didn't know of. And that casting from TIMESTAMP to DATE is now allowed (in SQL-92 it wasn't).)
As this SO answer explains, Oracle Database allows it even without a window, pulling in the GROUP BY from context. I don't know if the standard allows it.
You already noticed the difference yourself: It's all about the window. COUNT(*) without an OVER clause for instance is an aggregation function. COUNT(*) with an OVER clause is a window function.
By using aggregation functions you condense the original rows you get after the FROM clause and WHERE clause are applied to either the specified group in GROUP BY or to one row in the absence of a GROUP BY clause.
Window functions, aka analytic functions, are applied afterwards. They don't change the number of result rows, but merely add information by looking at all or some rows (the window) of the selected data.
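A minimal illustration of the difference, assuming a hypothetical orders table with customer_id and order_id columns:

```sql
-- Aggregate COUNT: condenses the rows to one per group.
SELECT customer_id, COUNT(*) AS order_count
FROM orders
GROUP BY customer_id;

-- Window COUNT: keeps every row and merely annotates it.
SELECT customer_id, order_id,
       COUNT(*) OVER (PARTITION BY customer_id) AS order_count
FROM orders;
```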
In
SELECT
options.id,
options.option_text,
COUNT(votes.option_id) as vote_count,
COUNT(votes.option_id) / SUM(COUNT(votes.option_id)) OVER() * 100.0 as vote_percentage
FROM options
LEFT JOIN votes on options.id = votes.option_id
GROUP BY options.id;
we first join votes to options and then count the votes per option by aggregating the joined rows down to one result row per option (GROUP BY options.id). We count on a non-nullable column of the votes table (COUNT(votes.option_id)), so we get a zero count when there are no votes, because in an outer-joined row this column is set to null.
After aggregating all rows, and thus getting one row per option, we apply a window function (SUM() OVER) to this result set. We apply the analytic SUM to the vote count (SUM(COUNT(votes.option_id))) by looking at the whole result set (empty OVER clause), thus getting the same total vote count in every row. We use this value for a calculation: the option's vote count divided by the total vote count times 100, which is the option's percentage of total votes.
The PostgreSQL query is very similar. We select the number of posts per date (COUNT(created_at) is nothing other than a mere COUNT(*) here) along with a running total of these counts (by using a window that looks at all rows up to the current row).
So, while this looks like we are nesting two aggregate functions, that is not really the case, because SUM OVER is not considered an aggregation function but an analytic/window function.
Oracle does allow applying an aggregate function directly to another, thus invoking a final aggregation on a previously grouped aggregation. This lets us get one result row of, say, the average of sums without having to write a subquery. This is not compliant with the SQL standard, however, and is fairly unpopular even among Oracle developers.
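For comparison, a standard-compliant rewrite of the Oracle max(count(*)) example above uses a derived table instead of nesting:

```sql
-- Standard SQL equivalent of Oracle's max(count(*)):
select max(cnt) as max_count
from (
    select count(*) as cnt
    from cube_roll
    group by result
) per_result;
```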
I don't know much at all about SQL, I've just toyed with it here and there through the years but never really 'used' it.
I'm trying to get a list of prices / volumes and aggregate them:
CREATE TABLE IF NOT EXISTS test (
ts timestamp without time zone NOT NULL,
price decimal NOT NULL,
volume decimal NOT NULL
);
what I'd like is to extract:
min price
max price
sum volume
sum (price * volume) / sum (volume)
by 1-hour slices.
If I forget about the last line for now, I have:
SELECT MIN(price) min_price, MAX(price) max_price, SUM(volume) sum_vol, date_trunc('hour', ts) ts_group FROM test
GROUP BY ts_group;
My understanding is that GROUP BY ts_group will calculate ts_group, build groups of rows, and then apply the MIN/MAX/SUM functions afterwards. Since the syntax doesn't make sense to me (entries on the select line would be treated differently while being declared together, versus building groups and then declaring operations on the groups), I could be dramatically wrong here.
But that will not return the min_price, max_price and sum_vol results after the grouping; I get ts, price and volume in the results.
If I remove the GROUP BY line to try to see all the output, I get the error:
column "test.ts" must appear in the GROUP BY clause or be used in an aggregate function
Which I don't really understand either...
I looked at:
the question "must appear in the GROUP BY clause or be used in an aggregate function", but I don't really get it,
and I looked at the docs (https://www.postgresqltutorial.com/postgresql-group-by/), which show a working example but don't really clarify what is wrong with what I'm trying to do here.
While I'd be happy to have a working solution, I'm more looking for an explanation, or pointers toward good resources, that would allow me to understand this.
I have this working solution:
SELECT MIN(price) min_price, MAX(price) max_price, SUM(volume) sum_vol, (SUM(price * volume)/SUM(volume)) vwap FROM test
GROUP BY date_trunc('hour', ts);
but I still don't understand the error message from my question
All of your expressions in SQL must use data elements and functions that are known to PostgreSQL. In your first example, ts_group is neither an element of your table, nor a defined function, so it complained that it did not know how to calculate it.
Your second example works because date_trunc is a known function and ts is defined as a data element of the test table.
It also gets you the correct grouping (by hour intervals) because date_trunc 'blurs' all of those unique timestamps that by themselves would not combine into groups.
Without a GROUP BY, having any aggregates in your select list means everything is aggregated down to just one row. But how would it aggregate date_trunc('hour', ts) down to one row, since no aggregating function is specified for it? If you were using MySQL, it would just pick some arbitrary value for the column from all the seen values and report that as the "aggregate". But PostgreSQL is not so cavalier with your data. If your query is vague in this way, it refuses to run it. If you just want to see some value from the set without caring which one it is, you can use min or max to aggregate it.
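For instance, the original query becomes valid without any GROUP BY once the timestamp expression is itself wrapped in an aggregate, a sketch:

```sql
-- Everything collapses to one row; a value for the hour is
-- picked explicitly via MIN instead of being left ambiguous.
SELECT MIN(price)                  AS min_price,
       MAX(price)                  AS max_price,
       SUM(volume)                 AS sum_vol,
       MIN(date_trunc('hour', ts)) AS first_hour
FROM test;
```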
Since the syntax doesn't make any sense to me (entries on the select line would be treated differently while being declared together vs. building groups and then declaring operations on the groups),
You are trying to understand SQL as if it were C. But it is very different. Just learn it for what it is, without trying to force it to be something else. The select list is where you define the columns you want to see in the output. They may be computed in different ways, but what they have in common is that you want each of them to show up in the output, so they are listed together in that spot.
I'm trying to write a script which will take hourly averages of time-series data. I.e. I have 5 years' worth of data at 5-minute intervals, but I need 5 years of hourly averages. PostgreSQL seems to be complaining about a missing column in the GROUP BY clause which is already in there. What schoolboy error am I making here?
SELECT
time_audit_primary.clockyear,
time_audit_primary.clockmonth,
time_audit_primary.clockday,
time_audit_primary.clockhour,
AVG (time_audit_primary.time_offset) AS timeoffset,
AVG (time_audit_primary.std_dev) AS std_dev
FROM tempin.time_audit_primary
GROUP BY (time_audit_primary.clockyear,
time_audit_primary.clockmonth,
time_audit_primary.clockday,
time_audit_primary.clockhour)
It's the parentheses around the GROUP BY columns that are causing the error. Just remove them:
SELECT
time_audit_primary.clockyear,
time_audit_primary.clockmonth,
time_audit_primary.clockday,
time_audit_primary.clockhour,
AVG (time_audit_primary.time_offset) AS timeoffset,
AVG (time_audit_primary.std_dev) AS std_dev
FROM tempin.time_audit_primary
GROUP BY time_audit_primary.clockyear,
time_audit_primary.clockmonth,
time_audit_primary.clockday,
time_audit_primary.clockhour
Hello All,
I am working on a report where I am doing calculations:
Let's take the first line as an example. In the remaining prior column we have 15 and in the taken column we have 0.5, so in the remaining column, we have 14.5.
Now the issue is to use the result in the remaining field and transfer it to the next line in the remaining prior column. So instead of having 14 we should be having 14.5.
Has anyone worked on something similar who can guide me on how to approach this? I really want to learn how to solve such an issue.
The ANSI standard lag() function does exactly what you want. SQL tables represent unordered sets, so I need to assume that you have some column -- which I will call id -- that identifies the ordering of the rows.
The syntax for lag() is:
select t.*,
       lag(Remaining) over (order by id) as prevRemaining
from yourTable t;
If you have a database that does not support the ANSI standard window functions, you can get the same effect with a subquery. However, the syntax for that might vary slightly among databases.
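A correlated-subquery sketch of that fallback (yourTable is a stand-in for the actual table name, and id is the assumed ordering column):

```sql
-- For each row, find the Remaining value of the row with the
-- largest id smaller than the current row's id.
select t.*,
       (select t2.Remaining
        from yourTable t2
        where t2.id = (select max(t3.id)
                       from yourTable t3
                       where t3.id < t.id)) as prevRemaining
from yourTable t;
```

Note that this re-scans the table for every row, so it is only practical on small data sets.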