SQL - Using MAX in a WHERE clause


Assume value is an int and the following query is valid:
SELECT blah
FROM table
WHERE attribute = value
Though MAX(expression) returns int, the following is not valid:
SELECT blah
FROM table
WHERE attribute = MAX(expression)
Of course the desired effect can be achieved using a subquery, but my question is why SQL was designed this way. Is there some reason why this sort of thing is not allowed? Students coming from programming languages where you can always replace a value of some data type with a function call that returns that type find this confusing. Is there an explanation one can give them rather than just saying "that's the way it is"?

It's just because of the logical order in which the clauses of a query are processed:
FROM clause
WHERE clause
GROUP BY clause
HAVING clause
SELECT clause
ORDER BY clause
WHERE just filters the rows returned by FROM. An aggregate function like MAX() has no result at that point, because it hasn't been applied to anything yet.
That's also the reason why you can't use aliases defined in the SELECT clause in a WHERE clause, but you can use aliases defined in the FROM clause.
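
To make the ordering concrete, here is a minimal sketch that runs both forms from Python's built-in sqlite3 module. The scores table and its rows are made up for illustration, and SQLite stands in for whatever engine you use; the exact error text differs between engines, but the aggregate-in-WHERE form is rejected in the same way.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical demo table, not from the question.
conn.execute("CREATE TABLE scores (player TEXT, points INTEGER)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?)",
    [("ann", 10), ("bob", 25), ("cy", 25), ("dee", 7)],
)

# An aggregate directly in WHERE is rejected: WHERE is evaluated per row,
# before any aggregation step exists.
try:
    conn.execute("SELECT player FROM scores WHERE points = MAX(points)")
    aggregate_in_where_failed = False
except sqlite3.OperationalError:
    aggregate_in_where_failed = True

# A subquery is a complete query of its own (FROM, then aggregate, then
# SELECT), so the outer WHERE only ever compares against a single value.
top_players = [
    row[0]
    for row in conn.execute(
        "SELECT player FROM scores "
        "WHERE points = (SELECT MAX(points) FROM scores) "
        "ORDER BY player"
    )
]

print(aggregate_in_where_failed, top_players)
```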

A where clause checks every row to see if it matches the conditions specified.
MAX computes a single value from a row set. If you put MAX, or any other aggregate function, into a WHERE clause, how can SQL Server figure out which rows the MAX function can use before the WHERE clause has finished its filtering?
This deals with the order that SQL Server processes commands in. It runs the WHERE clause before a GROUP BY or any aggregate. Since a where clause runs first, SQL Server can't tell if a row will be included in an aggregate until it processes the where. That is what the HAVING clause is for. HAVING runs after the GROUP BY and the WHERE and can include MAX since you have already filtered out the rows you don't want to use. See http://www.bennadel.com/blog/70-SQL-Query-Order-of-Operations.htm for a good explanation of the order in which SQL commands run.
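
A small sketch of that division of labour, again using Python's sqlite3 with a made-up orders table (the names and data are assumptions for illustration only): WHERE discards individual rows first, then HAVING filters whole groups using MAX.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical demo table, not from the question.
conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ann", 5), ("ann", 120), ("bob", 40), ("bob", 60), ("cy", 300), ("cy", -1)],
)

# WHERE runs first and drops single rows (the negative amount).
# GROUP BY then forms one group per customer.
# HAVING runs last and can use MAX because the groups now exist.
big_spenders = conn.execute(
    "SELECT customer, MAX(amount) "
    "FROM orders "
    "WHERE amount > 0 "
    "GROUP BY customer "
    "HAVING MAX(amount) >= 100 "
    "ORDER BY customer"
).fetchall()

print(big_spenders)
```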

Maybe this will work:
SELECT blah
FROM table
WHERE attribute = (SELECT MAX(expression) FROM table1)

The WHERE clause is specifically designed to test conditions against raw data (individual rows of the table). However, MAX is an aggregate function over multiple rows of data. Basically, without a sub-select, the WHERE clause knows nothing about any rows in the table except for the current row. So how can you determine the maximum value over a whole bunch of rows when you don't even know what those rows are?
Yes, it's a little bit of a simplification, especially when dealing with joins, but the same principle applies. WHERE is always row-by-row, so that's all it really knows about.
Even if you have a GROUP BY clause, the WHERE clause still only processes one row at a time in the raw data before grouping. It doesn't know the value of a column in any other rows, so it has no way of knowing which row has the maximum value.

Assuming this is MS SQL Server, the following would work.
SELECT TOP 1 blah
FROM table
ORDER BY expression DESC
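
One caveat worth checking: taking the top row is not the same as the subquery form when several rows share the maximum. The sketch below uses Python's sqlite3, where LIMIT 1 plays the role of SQL Server's TOP 1; the scores table is a made-up example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical demo table with a tie for the maximum.
conn.execute("CREATE TABLE scores (player TEXT, points INTEGER)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?)",
    [("ann", 10), ("bob", 25), ("cy", 25), ("dee", 7)],
)

# "Top one" form: always a single row, and which tied row you get
# is not specified.
one_row = conn.execute(
    "SELECT player FROM scores ORDER BY points DESC LIMIT 1"
).fetchall()

# Subquery form: every row whose value equals the maximum.
all_ties = conn.execute(
    "SELECT player FROM scores "
    "WHERE points = (SELECT MAX(points) FROM scores) "
    "ORDER BY player"
).fetchall()

print(one_row, all_ties)
```

In SQL Server itself, TOP 1 WITH TIES combined with ORDER BY returns all tied rows.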

Related

SQL Server order by expression

While watching Troy Hunt's fantastic course on SQLi, I've noticed that he ends up using this strategy to see if a table has a specific column:
select * from TableA order by (select top 1 some_column from TableB) desc
This expression will get executed by SQL Server, but what will it do for the order by clause? I've seen expressions being used with order by before (case when then else end), but I'm really curious to understand how SQL can process the previous query without any errors...
EDIT: Giving more info because it seems like my initial post was not clear enough:
I know this is not the best strategy for getting table or column names through SQLi (that's not what I'm asking)
I'm not interested in knowing how to protect against this (I know how to do that already)
I know that sorting by a constant value doesn't make sense (though it allows you to run these types of "boolean queries")
What I really want to know is why it works.
So, going back to the docs, the order by clause expects an order_by_expression, which is described as:
Specifies a column or expression on which to sort the query result set. A sort column can be specified as a name or column alias, or a nonnegative integer representing the position of the column in the select list.
According to the docs, an expression is:
Is a combination of symbols and operators that the SQL Server Database Engine evaluates to obtain a single data value. Simple expressions can be a single constant, variable, column, or scalar function. Operators can be used to join two or more simple expressions into a complex expression.
As @SMor demonstrated, the query does run if you replace the order by select expression with a simple select 'A':
select * from TableA order by (select 'A') desc
But this does not work:
select * from TableA order by 'A' desc
So, the question is: why is select 'A' accepted by SQL Server in the order by clause? Doesn't it produce a constant too? Since a constant is an expression, and taking into account the definition of the order by clause, shouldn't it throw an error in both cases?
Thanks.

The use of (select top 1 some_column from TableB) is an example of a scalar subquery. This is a subquery that returns exactly one column and at most one row. It can be used anywhere a literal value can be used -- and perhaps in some other places as well. Apparently, it can be used in an order by, even though SQL Server does not allow a literal value for order by.
The most common type of scalar subquery is a correlated subquery, which has a where clause that connects the subquery to the outer query. This is not an example of a correlated subquery.
In fact, this is not an example of anything useful as far as I can tell. It has one major shortcoming, which is the use of top without order by. The value returned by the subquery is indeterminate. That seems like a bad practice, and particularly bad if you are trying to teach people SQL.
And, it is probably going to be evaluated once. So the subquery would return a constant value and would not contribute much to a meaningful ordering.
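
As a rough illustration of why the trick works as a column-existence check, here is a sketch using Python's sqlite3. Two assumptions to keep in mind: SQLite is not SQL Server (SQLite also accepts a plain constant in ORDER BY, so it does not reproduce that particular difference), and the probe helper and table contents are made up for the demo. What it does show is that the scalar subquery is accepted as an ordering expression, and that the statement fails to compile when the column does not exist.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TableA (id INTEGER)")
conn.execute("CREATE TABLE TableB (some_column TEXT)")
conn.executemany("INSERT INTO TableA VALUES (?)", [(1,), (2,), (3,)])
conn.execute("INSERT INTO TableB VALUES ('x')")


def probe(column_name):
    """Hypothetical helper: True if the ORDER BY subquery compiles and runs."""
    # The name is interpolated only because this is a fixed, trusted demo.
    sql = (
        "SELECT id FROM TableA "
        f"ORDER BY (SELECT {column_name} FROM TableB LIMIT 1) DESC"
    )
    try:
        conn.execute(sql).fetchall()
        return True
    except sqlite3.OperationalError:
        # Raised when the name cannot be resolved to any column.
        return False


column_exists = probe("some_column")
column_missing = probe("no_such_column")
print(column_exists, column_missing)
```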

exclude a column from group by statement

I would like to exclude a column from group by statement, because it results in some redundant records. Are there any recommendations?
I use Oracle and have a complex query which joins six tables together. I want to use an SQL aggregate function (count) without getting duplicate results.

You can't.
When using aggregate functions every column/column expression which is not an aggregate must be in the GROUP BY.
This is completely logical. If you're not aggregating the column, then excluding it from the GROUP BY would force Oracle to choose a random value, which is not very useful.
If you don't want this column in your GROUP BY then you must decide what aggregation to apply to this column in order to return the appropriate data for your situation. You can't hand this responsibility off to the database engine.
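
A minimal sketch of what "decide what aggregation to apply" can look like, run through Python's sqlite3 rather than Oracle. The staff table and the choice of MIN are assumptions for illustration; MAX, COUNT(DISTINCT ...) or a string aggregate may suit your data better.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical demo table, not from the question.
conn.execute("CREATE TABLE staff (dept TEXT, office TEXT)")
conn.executemany(
    "INSERT INTO staff VALUES (?, ?)",
    [("eng", "berlin"), ("eng", "lyon"), ("eng", "berlin"), ("ops", "lyon")],
)

# Grouping by both columns: one row per (dept, office) pair, which is
# where the "redundant" rows come from.
grouped_by_both = conn.execute(
    "SELECT dept, office, COUNT(*) FROM staff "
    "GROUP BY dept, office ORDER BY dept, office"
).fetchall()

# Dropping office from GROUP BY means choosing an aggregate for it.
# MIN is used here purely as an example of an explicit choice.
grouped_by_dept = conn.execute(
    "SELECT dept, MIN(office), COUNT(*) FROM staff "
    "GROUP BY dept ORDER BY dept"
).fetchall()

print(grouped_by_both)
print(grouped_by_dept)
```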

Does this query include a correlated or non correlated subquery?

So I've written a simple query that gives me the ID #s for properties that show up only once in the property_usage table, along with the code for their associated usage type. Since I didn't want to include a column that shows the count of how many times each property ID shows up in the property_usage table, I wrote two subqueries to get a list of all the property IDs that only show up once. I then use the result of those subqueries (a single column of propertyIDs) to filter out those properties that show up more than once in the table.
Here's the query:
select pu.property_id, pu.usage_type_id
from acres_final_40.property_usage pu
where pu.property_id not in
(select multiple_use_properties
from
(select pu.property_id multiple_use_properties, count(pu.property_id)
from acres_final_40.property_usage pu
group by pu.property_id having count(pu.property_id) > 1))
order by pu.property_id;
My question is: is that innermost subquery correlated or noncorrelated with the outermost query?
I have the following thoughts (see the paragraph below), but I'd like to know for sure whether I'm right about this. I'm learning all this stuff on my own and don't have anyone I can ask about this in person!
My feeling is that it's not, because it seems like the pu.property_id column from the outermost query isn't a value that's passed into the innermost query. It seems like the innermost query may technically be a derived table, in which case my code is sloppy because I don't alias the derived table in the FROM clause of that SELECT statement.

In a SQL database query, a correlated subquery (also known as a synchronized subquery) is a subquery (a query nested inside another query) that uses values from the outer query.
(Wikipedia, emph. mine.)
Yours does not, so it's not a correlated subquery. Basically, if you can cut away the subquery and run it as an independent query, without the outer context, it's definitely uncorrelated. It can be done in your case.
BTW you could probably rewrite it using a not exists clause to check if another record with the same PK but another property_id exists, and get a better query plan than using count(). This is my speculation, though; only an explain plan would show if there's a benefit.
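
The "can you run it on its own" test is easy to try. The sketch below uses Python's sqlite3 with a cut-down property_usage table (the rows are made up, and the schema prefix is dropped). It runs the inner query standalone, then compares the uncorrelated NOT IN form with one possible correlated NOT EXISTS form. The correlated version assumes a property never appears twice with the same usage_type_id; that is my assumption, not something stated in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical data; table and column names follow the question.
conn.execute("CREATE TABLE property_usage (property_id INTEGER, usage_type_id INTEGER)")
conn.executemany(
    "INSERT INTO property_usage VALUES (?, ?)",
    [(1, 10), (2, 10), (2, 20), (3, 30), (4, 10), (4, 30)],
)

# The inner query runs fine with no outer context: it is uncorrelated.
inner_alone = conn.execute(
    "SELECT property_id FROM property_usage "
    "GROUP BY property_id HAVING COUNT(property_id) > 1 "
    "ORDER BY property_id"
).fetchall()

uncorrelated = conn.execute(
    "SELECT pu.property_id, pu.usage_type_id FROM property_usage pu "
    "WHERE pu.property_id NOT IN ("
    "  SELECT property_id FROM property_usage "
    "  GROUP BY property_id HAVING COUNT(property_id) > 1) "
    "ORDER BY pu.property_id"
).fetchall()

# Correlated variant: the subquery refers to pu from the outer query,
# so it cannot be run on its own.
correlated = conn.execute(
    "SELECT pu.property_id, pu.usage_type_id FROM property_usage pu "
    "WHERE NOT EXISTS ("
    "  SELECT 1 FROM property_usage other "
    "  WHERE other.property_id = pu.property_id "
    "    AND other.usage_type_id <> pu.usage_type_id) "
    "ORDER BY pu.property_id"
).fetchall()

print(inner_alone, uncorrelated, correlated)
```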

How should my query look like?

I use SQL Server 2008 and have a table whose data I group on one column. For various reasons I have to use GROUP BY instead of DISTINCT (it's part of a more complex query). The query results (only one column is returned) are fine for me.
The problem is that I want to use this query as a subquery in a WHERE clause, to filter data based on it. As far as I can see, the WHERE clause sees not only the data shown in the query results after grouping, but also the rest of the rows, and this is what I do not like.
My question is: how can I use my GROUP BY query as a subquery in a WHERE clause so that it sees only the results after grouping?

You can use a subquery to select the required column from your existing subquery. And then pass this new subquery as input to where clause.

I'm guessing that your issue is that your main query is returning multiple rows for each of the items in your WHERE sub-query. This is the correct behaviour. In order to limit the results in your main query you must use DISTINCT or GROUP BY in the main query.
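
If that guess is right, the behaviour looks like the sketch below (Python's sqlite3, with a made-up items table, since the original query was not posted). The grouped subquery really does return one row per group; it is the outer query that still matches every underlying row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical demo table; the asker's real schema is unknown.
conn.execute("CREATE TABLE items (category TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [("fruit", "apple"), ("fruit", "pear"), ("tool", "saw")],
)

grouped = "SELECT category FROM items GROUP BY category HAVING COUNT(*) > 1"

# The subquery alone yields a single row per group.
subquery_rows = conn.execute(grouped).fetchall()

# The outer query returns every row whose category is in that list.
outer_rows = conn.execute(
    f"SELECT category FROM items WHERE category IN ({grouped})"
).fetchall()

# DISTINCT (or GROUP BY) in the outer query collapses them again.
outer_distinct = conn.execute(
    f"SELECT DISTINCT category FROM items WHERE category IN ({grouped})"
).fetchall()

print(subquery_rows, outer_rows, outer_distinct)
```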

Why can't I perform an aggregate function on an expression containing an aggregate but I can do so by creating a new select statement around it?

Why is it that in SQL Server I can't do this:
select sum(count(id)) as 'count'
from table
But I can do
select sum(x.count)
from
(
select count(id) as 'count'
from table
) x
Are they not essentially the same thing? How am I meant to be thinking about this in order to understand why the first block of code isn't allowed?

SUM() in your example is a no-op: SUM() of a COUNT() means the same as just COUNT(). So neither of your example queries appears to do anything useful.
It seems to me that nesting aggregates would only make sense if you wanted to apply two different aggregations - meaning GROUP BY on different sets of columns. To specify two different aggregations you would need to use the GROUPING SETS feature or SUM() OVER feature. Maybe if you explain what you want to achieve someone could show you how.

The gist of the issue is that there is no such concept as aggregate of an aggregate applied to a relation, see Aggregation. Having such a concept would leave too many holes in the definition and makes the GROUP BY clause impossible to express: it needs to define both the inner aggregate GROUP BY clause and the outer aggregate as well! This applies also to the other aggregate attributes, like the HAVING clause.
However, the result of an aggregate applied to a relation is another relation, and this result relation in turn can support a new aggregate operator. This explains why you can aggregate the result into an outer SELECT. This leaves no ambiguity in the definition, each SELECT has its own distinct GROUP BY/HAVING clauses.
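
Here is a small sketch of that idea using Python's sqlite3 (SQLite rejects the nested form too, with its own error text). The visits table and its rows are made up. The inner SELECT has its own GROUP BY and produces a relation of per-page counts; the outer SELECT then aggregates that relation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical demo table, not from the question.
conn.execute("CREATE TABLE visits (id INTEGER, page TEXT)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [(1, "a"), (2, "a"), (3, "a"), (4, "b"), (5, "c"), (6, "c")],
)

# Nested aggregates in a single SELECT are rejected.
try:
    conn.execute("SELECT SUM(COUNT(id)) FROM visits")
    nested_failed = False
except sqlite3.OperationalError:
    nested_failed = True

# Inner SELECT: one count per page (its own GROUP BY).
# Outer SELECT: aggregates over those counts, with no ambiguity about
# which GROUP BY belongs to which aggregate.
max_per_page, total = conn.execute(
    "SELECT MAX(per_page.n), SUM(per_page.n) "
    "FROM (SELECT COUNT(id) AS n FROM visits GROUP BY page) AS per_page"
).fetchone()

print(nested_failed, max_per_page, total)
```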

In simple terms, aggregation functions operate over a column and generate a scalar value, hence they cannot be applied over their result. When you create a select statement over a scalar value you transform it into an artificial column, that's why it can be used by an aggregation function again.
Please note that most of the time there's no point in applying an aggregation function over the result of another aggregation function: in your sample sum(count(id)) == count(id).

I would like to know what result you expect from this SQL:
select sum(count(id)) as 'count'
from table
When you use the count function, only one result (the total count) is returned. So may I ask why you want to sum a single result?
You will certainly get the error, because an aggregate function cannot be performed on an expression containing an aggregate or a subquery.

It's working for me using SQLFiddle, not sure why it wouldn't work for you. But I do have an explanation as to why it might not be working for you and why the alternative would work...
Your example is using a keyword as a column name, and that may not always work. But when the column is only in a sub-expression, the query engine is free to discard the name (in fact it probably does), so the fact that it potentially conflicts with a keyword may be disregarded.
EDIT: in response to your edit/comment. No, the two aren't equivalent. The RESULT would be equivalent, but the process of getting to that result is not at all similar. For the first to work, the parser has to do some work that simply doesn't make sense for it to do (applying an aggregate to a single value, either on a row by row basis or as). In the second case, an aggregate is applied to a table. The fact that the table is a temporary virtual table is unimportant to the aggregate function.

I think you can write an SQL query that produces the 'count' of rows for the required output directly. Aggregate functions do not accept another aggregate such as 'sum', or a subquery, as their argument. My problem was resolved by using a simple SQL query to get the count out....

Microsoft SQL Server doesn’t support it.
You can get around this problem by using a Derived table:
select sum(x.count)
from
(
select count(id) as 'count'
from table
) x
On the other hand using the below code will give you an error message.
select sum(count(id)) as 'count'
from table
Cannot perform an aggregate function on an expression containing an aggregate or a subquery
