Can I use non-aggregate columns with group by? - sql

You cannot (should not) put non-aggregates in the SELECT line of a GROUP BY query.
I would, however, like to access one of the non-aggregates associated with the max. In plain English, I want a table with the id of the oldest row of each kind.
CREATE TABLE stuff (
id int,
kind int,
age int
);
This query gives me the information I'm after:
SELECT kind, MAX(age)
FROM stuff
GROUP BY kind;
But it's not in the most useful form. I really want the id associated with each row so I can use it in later queries.
I'm looking for something like this:
SELECT id, kind, MAX(age)
FROM stuff
GROUP BY kind;
It should produce the same output as this join:
SELECT stuff.*
FROM
stuff,
( SELECT kind, MAX(age) AS age
FROM stuff
GROUP BY kind) maxes
WHERE
stuff.kind = maxes.kind AND
stuff.age = maxes.age
It really seems like there should be a way to get this information without needing to join. I just need the SQL engine to remember the other columns when it's calculating the max.

You can't get the Id of the row that MAX found, because there might not be only one id with the maximum age.
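To illustrate the point about ties, here is a minimal sketch using Python's built-in sqlite3 module. The sample rows are invented for the demonstration; ids 2 and 3 deliberately share the maximum age for kind 1, so the join to the per-kind maximum returns both.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stuff (id INT, kind INT, age INT)")
# Hypothetical sample data: ids 2 and 3 tie for the maximum age of kind 1.
conn.executemany(
    "INSERT INTO stuff VALUES (?, ?, ?)",
    [(1, 1, 10), (2, 1, 30), (3, 1, 30), (4, 2, 5)],
)

# Join every row to the maximum age of its kind; all tied rows survive.
tied_rows = conn.execute(
    """
    SELECT s.id, s.kind, s.age
    FROM stuff AS s
    JOIN (SELECT kind, MAX(age) AS max_age
          FROM stuff
          GROUP BY kind) AS m
      ON s.kind = m.kind AND s.age = m.max_age
    ORDER BY s.kind, s.id
    """
).fetchall()
print(tied_rows)  # [(2, 1, 30), (3, 1, 30), (4, 2, 5)]
```

Kind 1 comes back twice, which is why there is no single "the id" for MAX to hand back.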

You cannot (should not) put non-aggregates in the SELECT line of a GROUP BY query.
You can, and have to, define what you are grouping by for the aggregate function to return the correct result.
MySQL (and SQLite) decided in their infinite wisdom that they would go against spec, and allow queries to accept GROUP BY clauses missing columns quoted in the SELECT - it effectively makes these queries not portable.
It really seems like there should be a way to get this information without needing to join.
Without access to the analytic/ranking/windowing functions, which MySQL did not support before version 8.0, the self join to a derived table/inline view is the most portable means of getting the result you desire.

I think it's tempting indeed to ask the system to solve the problem in one pass rather than having to do the job twice (find the max, and then find the corresponding id). You can do this using CONCAT (as suggested in the article Naktibalda referred to), though I'm not sure it would be more efficient:
SELECT MAX(CONCAT(LPAD(age, 10, '0'), '-', id))
FROM stuff
GROUP BY kind;
This should work, but you have to split the result to get the age and the id back.
(That's really ugly though)

In recent databases you can use max() over (partition by ...) to solve this problem:
select id, kind, age as max_age from (
select id, kind, age, max(age) over (partition by kind) as mage
from stuff) t
where age = mage;
This can then be done in a single pass.
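A runnable sketch of the windowed approach, again with Python's sqlite3 module and invented sample rows. It assumes the bundled SQLite is version 3.25 or later, which is when window functions were added.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stuff (id INT, kind INT, age INT)")
# Hypothetical sample data, two rows per kind.
conn.executemany(
    "INSERT INTO stuff VALUES (?, ?, ?)",
    [(1, 1, 10), (2, 1, 30), (3, 2, 5), (4, 2, 7)],
)

# Attach the per-kind maximum to every row, then keep the rows that match it.
oldest = conn.execute(
    """
    SELECT id, kind, age
    FROM (SELECT id, kind, age,
                 MAX(age) OVER (PARTITION BY kind) AS max_age
          FROM stuff) AS t
    WHERE age = max_age
    ORDER BY kind
    """
).fetchall()
print(oldest)  # [(2, 1, 30), (4, 2, 7)]
```

As with the join, rows that tie for the maximum would all be returned.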

PostgreSQL's DISTINCT ON will be useful here.
SELECT DISTINCT ON (kind) kind, id, age
FROM stuff
ORDER BY kind, age DESC;
This groups by kind and returns the first row of each group in the given order. As we have ordered by age in descending order, we get the row with the max age for each kind.
P.S. The columns in DISTINCT ON must appear first in the ORDER BY.

You have to have a join because the aggregate function max reads many rows and chooses the max.
So you need a join to pick out the row that the aggregate function has found.
To put it a different way how would you expect the query to behave if you replaced max with sum?
An inner join might be more efficient than your sub query though.

Related

SQL aggregate function when count(*)=1 so there can be only one value

Sometimes you write a grouped query where each group is a single row, as with having count(*) = 1. This means that the usual aggregate functions like min, max, sum and so on are a bit pointless: the min equals the max, equals the sum, equals the average, since there's exactly one value to aggregate.
I usually end up picking min arbitrarily. If we take the familiar example of a table mapping a book to its author(s), I might want to query just books that have a single author:
-- For books that have a single author, pull back that author's id.
select book_id,
min(author_id) as author_id
-- I could equally well use max(author_id) or even sum(author_id)...
from book_authors
group by book_id
having count(*) = 1
That works, but it seems it could be expressed better. I'm not actually interested in the 'minimum' per se, but just to get the single value which I know exists. Some column types (such as bit in Microsoft SQL Server) do not support the min aggregate function so you have to do workarounds like convert(bit, min(convert(int, mycol))).
So, I expect the answer will be no, but is there some better way to specify my intent?
select book_id,
there_must_be_one_value_so_just_return_it(author_id) as author_id
from book_authors
group by book_id
having count(*) = 1
Clearly, if you're not requiring count(*)=1 then you no longer guarantee a single value and the special aggregate function could not be used. That error could be caught when the SQL is compiled.
The desired result would be equivalent to the min query above.
I'm using Microsoft SQL Server (2016) but as this is a fairly "blue sky" kind of question, I would be interested in replies about other SQL dialects too.
You could, instead, use a windowed COUNT and then filter based on that:
WITH CTE AS(
SELECT ba.book_id,
ba.author_id,
COUNT(ba.book_id) OVER (PARTITION BY ba.book_id) AS Authors
FROM dbo.book_authors ba)
SELECT c.book_id,
c.author_id
FROM CTE c
WHERE c.Authors = 1;
An alternative method would be to use a correlated subquery:
SELECT ba.book_id,
ba.author_id
FROM dbo.book_authors ba
WHERE EXISTS (SELECT 1
FROM dbo.book_authors e
WHERE e.book_id = ba.book_id
GROUP BY e.book_id
HAVING COUNT(*) = 1);
I have not tested the performance of either with a decent amount of data; however, I would hope that on a well-indexed table the correlated subquery gives better performance.
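As a quick check that the min-based query and the windowed COUNT return the same rows, here is a small sketch with Python's sqlite3 module. The sample data is invented, the dbo schema prefix is dropped because SQLite has none, and window functions need SQLite 3.25 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE book_authors (book_id INT, author_id INT)")
# Hypothetical sample data: book 2 has two authors, books 1 and 3 have one.
conn.executemany(
    "INSERT INTO book_authors VALUES (?, ?)",
    [(1, 100), (2, 200), (2, 201), (3, 300)],
)

# The question's approach: an arbitrary aggregate over a one-row group.
min_version = conn.execute(
    """
    SELECT book_id, MIN(author_id) AS author_id
    FROM book_authors
    GROUP BY book_id
    HAVING COUNT(*) = 1
    ORDER BY book_id
    """
).fetchall()

# The windowed COUNT: no aggregate over author_id is needed at all.
window_version = conn.execute(
    """
    WITH cte AS (
        SELECT book_id, author_id,
               COUNT(*) OVER (PARTITION BY book_id) AS authors
        FROM book_authors)
    SELECT book_id, author_id
    FROM cte
    WHERE authors = 1
    ORDER BY book_id
    """
).fetchall()
print(min_version, window_version)
```

Because the windowed version never aggregates author_id, it also sidesteps column types that do not support min.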

In SQL, does groupby on an ordered query behave the same as doing both in the same query?

Are the following queries identical, or might I get different results (in any major DB system, e.g. MSSQL, MySQL, Postgres, SQLite):
Doing both in the same query:
SELECT group, some_agg_func(some_value)
FROM my_table
GROUP BY group
ORDER BY some_other_value
vs. ordering in a subquery:
SELECT group, some_agg_func(some_value)
FROM (
SELECT group, some_value
FROM my_table
ORDER BY some_other_value
) as alias
GROUP BY group
Looking at the first sample:
SELECT group, some_agg_func(some_value)
FROM my_table
GROUP BY group
ORDER BY some_other_value
Let's think about what GROUP BY does by looking at this imaginary sample data:
A B
- -
1 1
1 2
Then think about this query:
SELECT A
FROM SampleData
GROUP BY A
ORDER BY B
The GROUP BY clause puts the two rows into a single group. Then we want to order by B... but the two rows in the group have different values for B. Which should it use?
Obviously in this situation it doesn't really matter: there's only one row in the results, so the order is not relevant. But generally, how does the database know what to do?
The database could guess which one you want, or just take the first value, or the last, whatever those mean in a setting where the data is unordered by definition. And in fact this is what MySQL will try to do for you: it will try to guess at your meaning. But this response is really inappropriate. You specified an inexact query; the only correct thing to do is throw an error, which is what most databases will do.
Now let's look at the second sample:
SELECT group, some_agg_func(some_value)
FROM (
SELECT group, some_value
FROM my_table
ORDER BY some_other_value
) as alias
GROUP BY group
Here it is important to remember databases have their roots in relational set theory, and what we think of as "tables" are more formally described as Unordered Relations. Again: the idea of being "unordered" is baked into the very nature of a table at the deepest level.
In this case the inner query can run and create results in the specified order, and then the outer query can use that with GROUP BY to create a new set... but just like tables, query results are unordered relations. Without an ORDER BY clause the final result is also unordered by definition.
Now you might tend to get results in the order you want, but the reality is all bets are off. In fact, the databases that run this query will tend to give you results in the order in which they first encountered each group, which will not tend to match the ORDER BY because the GROUP BY expression is looking at completely different columns. Other databases (SQL Server is in this group) will not even allow the query to run, though I might prefer a warning here.
So now we come to the final section, where we must re-think the question, like this:
How can I use GROUP BY on the one group column, while also ordering by some_other_column not in the group?
The answer is each group can contain multiple rows, and so you must tell the database which row to look at to get the correct (specific) some_other_column value. The typical way to do this is with another aggregate function, which might look like this:
SELECT group, some_agg_func(some_value)
FROM my_table
GROUP BY group
ORDER BY some_other_agg_func(some_other_column)
That code will run without error on pretty much any database.
Just be careful here. On one hand, when people want to do this it's often for the common case where they know every record for some_other_column in each group will have the same value. For example, you might GROUP BY UserID, but ORDER BY Email, where of course every record with the same UserID should have the same Email address. As humans, we have the ability to make that kind of inference. Computers, however, don't handle that kind of thinking as well, and so we help it out with an extra aggregate function like MIN() or MAX().
On the other hand, if you're not careful sometimes the two different aggregate functions don't match up, and you end up showing the value from one row in the group, while using a completely different row from the group for the ORDER BY expression in a way that is not good.
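The aggregate-in-ORDER-BY form can be tried out with a sketch like the following, using Python's sqlite3 module. The column is named grp rather than group because GROUP is a reserved word, and the rows are invented; MIN() stands in for some_other_agg_func and SUM() for some_agg_func.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_table (grp TEXT, some_value INT, some_other_value INT)"
)
# Hypothetical sample data: groups 'a' and 'c' each hold two different
# some_other_value entries, so a bare ORDER BY some_other_value is ambiguous.
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?, ?)",
    [("a", 1, 50), ("a", 2, 10), ("b", 5, 30), ("c", 7, 20), ("c", 1, 40)],
)

# MIN() tells the database which some_other_value represents each group.
ordered = conn.execute(
    """
    SELECT grp, SUM(some_value) AS total
    FROM my_table
    GROUP BY grp
    ORDER BY MIN(some_other_value)
    """
).fetchall()
print(ordered)  # [('a', 3), ('c', 8), ('b', 5)]
```

The per-group minimums are 10, 30 and 20 for a, b and c, so the groups come back as a, c, b. Swapping MIN for MAX would reorder them, which is the mismatch the caution above is about.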
Tables are unordered sets of data. A query result is a table. So if you select from a subquery that contains an ORDER BY clause, that clause means nothing; the data set is unordered by definition. The DBMS is free to ignore the ORDER BY clause. Some DBMS may even issue a warning or error, but I suppose it's more common that the ORDER BY clause just has no effect - at least not guaranteed.
In this query
SELECT group, some_agg_func(some_value)
FROM my_table
GROUP BY group
ORDER BY some_other_value
you try to order your results by some_other_value. If this is meant to be a column, you can't, because that other column is not part of your results. You'll get a syntax error. If some_other_value is a fixed value, then there is nothing ordered, because you'd have the same sort key for every row. But it can be an expression based on your result data (group key and aggregation results) and you can order your result rows by that.
In this query
SELECT group, some_agg_func(some_value)
FROM (
SELECT group, some_value
FROM my_table
ORDER BY some_other_value
) as alias
GROUP BY group
the ORDER BY clause has no effect. You could just as well just select FROM my_table directly:
SELECT group, some_agg_func(some_value)
FROM my_table as alias
GROUP BY group
This gets the results unordered (or at least the order you see is not guaranteed to be thus every time you run that query), because your query doesn't have an ORDER BY clause.

Why isn't FIRST_VALUE and LAST_VALUE an aggregation function in SQL?

Is there any special reason that SQL only implements FIRST_VALUE and LAST_VALUE as windowing functions instead of aggregation functions? I find it quite common to encounter problems such as "find the item with highest price in each category". While other languages (such as Python) provide min/max functions with a key argument such that
max(item_names, key=lambda x: revenue[x])
is possible, in SQL the only way to tackle this problem seems to be:
WITH temp as(
SELECT *, FIRST_VALUE(item_name) OVER(PARTITION BY category ORDER BY revenue DESC) as fv
FROM catalog)
SELECT category, MAX(fv) -- MIN(fv) also OK
FROM temp
GROUP BY category;
Is there a special reason that there is no "aggregation version" of FIRST_VALUE such that
SELECT category, FIRST_VALUE(item_name, revenue)
FROM catalog
GROUP BY
category
or is it just the way it is?
That’s just the way it is, as far as I’m concerned. I suspect the only real answer would be “because it’s not in the SQL spec”, and the only people who could really say why it’s not in the spec are the people who write it. Questions of the form “what was (name of relevant external authority) thinking when they mandated that (name of product) should operate like this” are typically off topic here because very few people can answer them reliably and factually. I don’t even like my own answer here, as it feels like an extended comment on a question that cannot realistically be answered.
Aggregate functions work on sets of data and while some of them might require some implied ordering operation such as median, the functions are always about the column they’re operating on, not a “give me the value of this column based on the ordering of that column”.
There are plenty of window/analytic functions that don’t have a corresponding aggregate version, and window functions have a different end use than aggregation. You could conceive that some of them perform aggregation and then join the aggregation result back to the main data in order to relate the aggregate result to the particular row, but I wouldn’t assume the two facilities (aggregate vs window) are related at all.
As far as I understand the Python (I'm not a Python dev), it is not doing any aggregation. It's searching a list of item_name strings, looking each up in a dictionary that returns the revenue for that item, and returning the item_name that has the largest revenue. There wasn't any grouping there; it's much more like a SELECT TOP 1 item_name ORDER BY revenue and is only really good for returning a single item, rather than a load of items that are all maxes within their group, unless it's used within a loop that processes a different list of item names each time.
I know your question wasn't exactly about this particular SQL query but it may be helpful for you if I mention a couple of things on it. I'm not really sure what:
WITH temp as(
SELECT *, FIRST_VALUE(item_name) OVER(PARTITION BY category ORDER BY revenue DESC) as fv
FROM catalog
)
SELECT category, MAX(fv) -- MIN(fv) also OK
FROM temp
GROUP BY category;
Gives you over something like:
SELECT DISTINCT category, FIRST_VALUE(item_name) OVER(PARTITION BY category ORDER BY revenue DESC) as fv
FROM catalog
The analytic/window function will produce the same value for every row in a category (the partition), so it seems that really all the extra group by is doing is reducing the repeated values. That could be done more simply by just getting the values you want and using distinct to quash the duplicates (one of the few cases where I would advocate such).
In the more general sense of "I want the entire most X row as determined by highest/lowest Y" we typically use row number for that:
WITH temp as(
SELECT *, ROW_NUMBER() OVER(PARTITION BY category ORDER BY revenue DESC) as rn
FROM catalog)
SELECT *
FROM temp
WHERE rn = 1;
Though I find it more compact and readable to dispense with the CTE and just use a subquery, but YMMV.
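For reference, a runnable sketch of the row-number pattern in its subquery form, using Python's sqlite3 module with an invented catalog table. It assumes SQLite 3.25 or later for window functions, and it orders by revenue descending so that rn = 1 is the highest-revenue item.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE catalog (category TEXT, item_name TEXT, revenue INT)")
# Hypothetical sample data: two items per category.
conn.executemany(
    "INSERT INTO catalog VALUES (?, ?, ?)",
    [("fruit", "apple", 5), ("fruit", "mango", 9),
     ("tools", "hammer", 20), ("tools", "saw", 15)],
)

# Number the rows of each category from highest revenue down, keep the first.
top_items = conn.execute(
    """
    SELECT category, item_name, revenue
    FROM (SELECT category, item_name, revenue,
                 ROW_NUMBER() OVER (PARTITION BY category
                                    ORDER BY revenue DESC) AS rn
          FROM catalog) AS t
    WHERE rn = 1
    ORDER BY category
    """
).fetchall()
print(top_items)  # [('fruit', 'mango', 9), ('tools', 'hammer', 20)]
```

Unlike the FIRST_VALUE variant, this returns the whole winning row, so other columns of that row are available without a further join.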

SQL Nested Where with Sums

I've run into a syntax issue with SQL. What I'm trying to do here is add together all of the amounts paid on each order (paid each) and then only select those that are greater than the sum of paid each for a specific order# (1008). I've been trying to move around lots of different things here and I'm not having any luck.
This is what I have right now, though I've had many different things. Trying to use this simply returns an SQL statement not ended properly error. Any help you guys could give would be greatly appreciated. Do I have to use DISTINCT anywhere here?
SELECT ORDER#,
TO_CHAR(SUM(PAIDEACH), '$999.99') AS "Amount > Order 1008"
FROM ORDERITEMS
GROUP BY ORDER#
WHERE TO_CHAR > (SUM (PAIDEACH))
WHERE ORDER# = 1008;
Some versions of SQL regard the hash character (#) as the beginning of a comment. Others use double hyphen (--) and some use both. So, my first thought is that your ORDER# field is named incorrectly (though I can't imagine the engine would let you create a field with that name).
You have two WHERE keywords, which isn't allowed. If you have multiple WHERE conditions, you must link them together using boolean logic, with AND and OR keywords.
You have your WHERE condition after GROUP BY which should be reversed. Specify WHERE conditions before GROUP BY.
One of your WHERE conditions makes no sense. TO_CHAR > (SUM(paideach)): TO_CHAR() is a function which as far as I know is an Oracle function that converts numeric values to strings according to a specified format. The equivalent in SQL Server is CAST or CONVERT.
I'm guessing that you are trying to write a query that finds orders with amounts exceeding a particular value, but it's not very clear because one of your WHERE conditions specifies that the order number should be 1008, which would presumably only return one record.
The query should probably look more like this:
SELECT order,
SUM(paideach) AS amount
FROM orderitems
GROUP BY order
HAVING SUM(paideach) > 999.99;
This would select records from the orderitems table where the sum of paideach exceeds 999.99.
I'm not sure how order 1008 fits into things, so you will have to elaborate on that.
Others have commented on some of the things wrong with your query. I'll try to give more explicit hints about what I think you need to do to get the result I think you're looking for.
The problem seems to break into distinct sections, first finding the total for each order which you're close to and I think probably started from:
SELECT ORDER#, SUM(PAIDEACH) AS AMOUNT
FROM ORDERITEMS
GROUP BY ORDER#;
... finding the total for a specific order:
SELECT SUM(PAIDEACH)
FROM ORDERITEMS
WHERE ORDER# = 1008;
... and combining them, which is where you're stuck. The simplest way, and hopefully something you've recently been taught, is to use the HAVING clause, which comes after the GROUP BY and acts as a kind of filter that can be applied to the aggregated columns (which you can't do in the WHERE clause). If you had a fixed amount you could do this:
SELECT ORDER#, SUM(PAIDEACH) AS AMOUNT
FROM ORDERITEMS
GROUP BY ORDER#
HAVING SUM(PAIDEACH) > 5;
(Note that, as @Bridge indicated, you can't use the column alias, AMOUNT, in the HAVING clause; you have to repeat the aggregate function SUM.) But you don't have a fixed value, you want to use the actual total for order 1008, so you need to replace that fixed value with another query. I'll let you take that last step...
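For readers who want to see the pieces combined, one possible shape of that last step is sketched below with Python's sqlite3 module. The column is named order_no instead of ORDER# because SQLite does not accept # in an unquoted identifier, and the rows are invented; treat it as an illustration of HAVING with a scalar subquery rather than the intended homework answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orderitems (order_no INT, paideach REAL)")
# Hypothetical sample data: order 1008 totals 25.0.
conn.executemany(
    "INSERT INTO orderitems VALUES (?, ?)",
    [(1007, 10.0), (1007, 5.0),
     (1008, 20.0), (1008, 5.0),
     (1009, 30.0),
     (1010, 40.0), (1010, 1.5)],
)

# HAVING filters on the aggregate; the scalar subquery supplies the threshold.
bigger_orders = conn.execute(
    """
    SELECT order_no, SUM(paideach) AS amount
    FROM orderitems
    GROUP BY order_no
    HAVING SUM(paideach) > (SELECT SUM(paideach)
                            FROM orderitems
                            WHERE order_no = 1008)
    ORDER BY order_no
    """
).fetchall()
print(bigger_orders)  # [(1009, 30.0), (1010, 41.5)]
```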
I'm not familiar with Oracle, and since it's homework I won't give you the answers, just a few ideas of what I think is wrong.
A select statement should only have one WHERE clause. It can have more than one condition, of course, joined by logical operators (a row is included when the whole condition evaluates to true). E.g.: WHERE (column1 > column2) AND (column3 = 100)
GROUP BY clauses should come after WHERE clauses.
You can't refer to columns you've aliased in the select in the where clause of the same statement by their aliased name. For example this won't work:
SELECT column1 as hello
FROM table1
WHERE hello = 1
If there's a group by, the columns you're selecting should be the same as in that statement (or aggregates of those). This page does a better explanation of this than I do.

SQL Having on columns not in SELECT

I have a table with 3 columns:
userid mac_address count
The entries for one user could look like this:
57193 001122334455 42
57193 000C6ED211E6 15
57193 FFFFFFFFFFFF 2
I want to create a view that displays only those MACs that are considered "commonly used" for this user. For example, I want to filter out the MACs whose count is less than 10% of the count of the most used MAC address for that user. Furthermore, I want 1 row per user. This could easily be achieved with a GROUP BY, HAVING & GROUP_CONCAT:
SELECT userid, GROUP_CONCAT(mac_address SEPARATOR ',') AS macs, count
FROM mactable
GROUP BY userid
HAVING count*10 >= MAX(count)
And indeed, the result is as follows:
57193 001122334455,000C6ED211E6 42
However I really don't want the count-column in my view. But if I take it out of the SELECT statement, I get the following error:
#1054 - Unknown column 'count' in 'having clause'
Is there any way I can perform this operation without being forced to have a nasty count-column in my view? I know I can probably do it using inner queries, but I would like to avoid doing that for performance reasons.
Your help is very much appreciated!
Because HAVING explicitly refers to the column names in the select list, what you want is not directly possible.
However, you can use your select as a subselect to a select that returns only the rows you want to have.
SELECT a.userid, a.macs
FROM
(
SELECT userid, GROUP_CONCAT(mac_address SEPARATOR ',') AS macs, count
FROM mactable
GROUP BY userid
HAVING count*10 >= MAX(count)
) as a
UPDATE:
Because of a limitation of MySQL this is not possible, although it works in other DBMS like Oracle.
One solution would be to create a view for the subquery. Another solution seems cleaner:
CREATE VIEW YOUR_VIEW (userid, macs) AS
SELECT userid, GROUP_CONCAT(mac_address SEPARATOR ',') AS macs, count
FROM mactable
GROUP BY userid
HAVING count*10 >= MAX(count)
This will declare the view as returning only the columns userid and macs although the underlying SELECT statement returns more columns than those two.
Although I am not sure whether the non-DBMS MySQL supports this or not...
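A different way to keep the count column out of the result is to filter the rows against each user's maximum before grouping. The sketch below is an assumption-laden illustration, not the approach from the answer above: it uses Python's sqlite3 module, renames the count column to cnt to avoid any clash with the COUNT function name, uses SQLite's two-argument group_concat in place of MySQL's SEPARATOR syntax, and adds an invented second user.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mactable (userid INT, mac_address TEXT, cnt INT)")
conn.executemany(
    "INSERT INTO mactable VALUES (?, ?, ?)",
    [(57193, "001122334455", 42),
     (57193, "000C6ED211E6", 15),
     (57193, "FFFFFFFFFFFF", 2),
     (60000, "AABBCCDDEEFF", 7)],  # hypothetical second user
)

# Compare each row with its user's maximum first, then concatenate survivors.
common_macs = conn.execute(
    """
    SELECT m.userid, group_concat(m.mac_address, ',') AS macs
    FROM mactable AS m
    JOIN (SELECT userid, MAX(cnt) AS max_cnt
          FROM mactable
          GROUP BY userid) AS mx
      ON mx.userid = m.userid
    WHERE m.cnt * 10 >= mx.max_cnt
    GROUP BY m.userid
    ORDER BY m.userid
    """
).fetchall()

# The order of values inside group_concat is not guaranteed, so compare sets.
result = {userid: set(macs.split(",")) for userid, macs in common_macs}
print(result)
```

This does use a derived table, which the question hoped to avoid, but the filter is applied per row rather than per group, so the rarely used FFFFFFFFFFFF entry is dropped before concatenation.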