SQL Server order by expression - sql

While watching Troy Hunt's fantastic course on SQLi, I've noticed that he ends up using this strategy to see if a table has a specific column:
select * from TableA order by (select top 1 some_column from TableB) desc
This expression will get executed by SQL Server, but what will it do for the order by clause? I've seen expressions being used with order by before (case when then else end), but I'm really curious to understand how SQL can process the previous query without any errors...
EDIT: Giving more info because it seems like my initial post was not clear enough:
I know this is not the best strategy for getting table or column name though SQLi (that's not what I'm asking)
I'm not interested in knowing how to protect against this (I know how to do that already)
I know that sorting by a constant value doesn't make sense (though it allows you to run these types of "boolean queries")
What t I really want to know is why it works.
So, going back to the docs, the order by clause expects an order_by_expression, which is described as:
Specifies a column or expression on which to sort the query result set. A sort column can be specified as a name or column alias, or a nonnegative integer representing the position of the column in the select list.
According to the docs, an expression is:
Is a combination of symbols and operators that the SQL Server Database Engine evaluates to obtain a single data value. Simple expressions can be a single constant, variable, column, or scalar function. Operators can be used to join two or more simple expressions into a complex expression.
As #SMor demonstrated, the query does run if you replace the order by select expression with a simple select 'A':
select * from TableA order by (select 'A') desc
But this does not work:
select * from TableA order by 'A' desc
So, the question is: why is select 'A' accepted by SQL Server in the order by clause? Doesn't it produce a constant too? Since a constant is an expression and taking into account the definition for the order by clause, shouldn't it thrown an error in both cases?
Thanks.

The use of (select top 1 some_column from TableB) is an example of a scalar subquery. This is a subquery that returns exactly one column and at most one row. It can be used anywhere a literal value can be used -- and perhaps in some other places as well. Apparently, it can be used in an order by, even though SQL Server does not allow a literal value for order by.
The most common type of scalar subquery is a correlated subquery, which has a where clause that connects the subquery to the outer query. This is not an example of a scalar subquery.
In fact, this is not an example of anything useful as far as I can tell. It has one major shortcoming, which is the use of top without order by. The value returned by the subquery is indeterminate. That seems like a bad practice, and particularly bad if you are trying to teach people SQL.
And, it is probably going to be evaluated once. So the subquery would return a constant value and would not contribute much to a meaningful ordering.

Related

SQL clause vs expression terms

I had a discussion with a teammate on the topic whether the terms clause and expression can be used interchangeably. For example, is it correct/common to call a variable that stands for an expression a=b (e.g. that participates in a statement SELECT * WHERE expression) a clause?
Edit
It would be useful is someone could give precise definitions of what clause, expression and statement are in SQL world.
In SQL Terms, "clause" is usually used to refer to a section of a statement, usually introduced by the keyword it's named after - e.g. a typical SELECT statement would be composed of a SELECT clause, a FROM clause and a WHERE clause. Within the FROM clause, some people may refer to JOIN clauses and ON clauses. However, this is by no means 100% accepted usage.
When it comes to "statement" and "expression", it's fairly standard usage - an expression is something that produces a value. In most languages, this is understood, further, to be something that produces a scalar value. In SQL, this is slightly modified because when you encounter an expression when working with a row set, the expression will produce one scalar value per row (or per group or partition, if grouping or partitioning are involved and it's in the relevant location).
Finally, a statement is a complete "something" that your database engine can understand and produce results for. It doesn't produce a value but it may produce a result set. You can't just send a FROM clause to the database - it has to be part of a larger statement, such as the SELECT statement I mentioned in my first paragraph.
The answer is NO, expression evaluates to something may be a boolean value or string or number where as a clause forms a rule for the data to satisfy and only then the record forms part of the result.
select * from TABLE where /*clause 1*/ field1 = field2
and /*clause 2*/field3 = /*expression*/ field1 + field2
In the above select statement
first clause forms a rule which is field1 should be equal to field2
Second clause form a rule which is field3 should be equal to the result of the > expression field1 + field2
UPDATE
There are various clauses in SQL like from, where, order by, group by and having. from clause tells from which table to read and order by tells how to arrange the result. Clauses control from where data to be read, what data be formed as part of the select statement and how the data to be presented.
Expression on the other hand evaluate to a value of some datatype.
A Statement, is a structured query build with the clauses.

Using Order By in IN condition

Let's say we have this silly query:
Select * From Emp Where Id In (Select Id From Emp)
Let's do a small modification inside IN condition by adding an Order By clause.
Select * From Mail Where Id In (Select Id From Mail Order By Id)
Then we are getting the following error:
ORA-00907: missing right parenthesis
Oracle assumes that IN condition will end after declaring the From table. As a result waits for right parenthesis, but we give an Order By instead.
Why can't we add an Order By inside IN condition?
FYI: I don't ask for a equivalent query. This is an example after all. I just can't understand why this error occurs.
Consider the condition x in (select something from somewhere). It returns true if x is equal to any of the somethings returned from the query - regardless of whether it's the first, second, last, or anything in the middle. The order that the somethings are returned is inconsequential. Adding an order by clause to a query often comes with a hefty performance hit, so I'm guessing this this limitation was introduced to prevent adding a clause that has no impact on the query's correctness on the one hand and may be quite expensive on the other.
It would not make sense to sort the values inside the IN clause. Think in this way:
IN (a, b, c)
is same as
IN (c, b, a)
IS SAME AS
IN (b, c, a)
Internally Oracle applies OR condition, so it is equivalent to:
WHERE id = a OR id = b OR id = c
Would it make any sense to order the conditions?
Ordering comes at it's own expense. So, when there is no need to sort, just don't do it. And in this case, Oracle applied the same rule.
When it comes to performance of the query, optimizer needs choose the best possible execution plan i.e. with the least cost to achieve the desired output. ORDER BY is useless in this case, and Oracle did a good job to prevent using it at all.
For the documentation,
Type of Condition Operation
----------------- ----------
IN Equal-to-any-member-of test. Equivalent to =ANY.
So, when you need to test ANY value for membership in a list of values, there is no need of ordered list, just a random matching does the job.
If you look at Oracle SQL reference (syntax diagrams) you will find a reason. ORDER BY is part of "select" statement, while IN clause uses lover level "subquery" statement. Your problem relates to nature of the Oracle's SQL grammar.
PS: there might be more gotchas like multiple UNION, MINUS on "subqueries" and then also you can use ONLY one ORDER BY clause, as this applies only onto result of UNION operation.
This will fail too:
select * from dual order by 1
union all
select * from dual order by 1;

SQL - Using MAX in a WHERE clause

Assume value is an int and the following query is valid:
SELECT blah
FROM table
WHERE attribute = value
Though MAX(expression) returns int, the following is not valid:
SELECT blah
FROM table
WHERE attribute = MAX(expression)
OF course the desired effect can be achieved using a subquery, but my question is why was SQL designed this way - is there some reason why this sort of thing is not allowed? Students coming from programming languages where you can always replace a data-type by a function call that returns that type find this issue confusing. Is there an explanation one can give them rather than just saying "that's the way it is"?
It's just because of the order of operations of a query.
FROM clause
WHERE clause
GROUP BY clause
HAVING clause
SELECT clause
ORDER BY clause
WHERE just filters the rows returned by FROM. An aggregate function like MAX() can't have a result returned because it hasn't even been applied to anything.
That's also the reason, why you can't use aliases defined in the SELECT clause in a WHERE clause, but you can use aliases defined in FROM clause.
A where clause checks every row to see if it matches the conditions specified.
A max computes a single value from a row set. If you put a max, or any other aggregate function into a where clause, how can SQL server figure out what rows the max function can use until the where clause has finished it filter?
This deals with the order that SQL Server processes commands in. It runs the WHERE clause before a GROUP BY or any aggregate. Since a where clause runs first, SQL Server can't tell if a row will be included in an aggregate until it processes the where. That is what the HAVING clause is for. HAVING runs after the GROUP BY and the WHERE and can include MAX since you have already filtered out the rows you don't want to use. See http://www.bennadel.com/blog/70-SQL-Query-Order-of-Operations.htm for a good explanation of the order in which SQL commands run.
Maybe this work
SELECT blah
FROM table
WHERE attribute = (SELECT MAX(expresion) FROM table1)
The WHERE clause is specifically designed to test conditions against raw data (individual rows of the table). However, MAX is an aggregate function over multiple rows of data. Basically, without a sub-select, the WHERE clause knows nothing about any rows in the table except for the current row. So how can you determine the maximum value over a whole bunch of rows when you don't even know what those rows are?
Yes, it's a little bit of a simplification, especially when dealing with joins, but the same principle applies. WHERE is always row-by-row, so that's all it really knows about.
Even if you have a GROUP BY clause, the WHERE clause still only processes one row at a time in the raw data before grouping. It doesn't know the value of a column in any other rows, so it has no way of knowing which row has the maximum value.
Assuming this is MS SQL Server, the following would work.
SELECT TOP 1 blah
FROM table
ORDER BY expression DESC

Is it possible to have an SQL query that uses AGG functions in this way?

Assuming I have the following aggregate functions:
AGG1
AGG2
AGG3
AGG4
Is it possible to write valid SQL (in a db agnostic way) like this:
SELECT [COL1, COL2 ....], AGG1(param1), AGG2(param2) FROM [SOME TABLES]
WHERE [SOME CRITERIA]
HAVING AGG3(param2) >-1 and AGG4(param4) < 123
GROUP BY COL1, COL2, ... COLN
ORDER BY COL1, COLN ASC
LIMIT 10
Where COL1 ... COLN are columns in the tables being queried, and param1 ... paramX are parameters passed to the AGG funcs.
Note: AGG1 and AGG2 are returned in the results as columns (but do not appear in the HAVING CLAUSE, and AGG3 and AGG4 appear in the HAVING CLAUSE but are not returned in the result set.
Ideally, I want a DB agnostic answer to the solution, but if I have to be tied to a db, I am using PostgreSQL (v9.x).
Edit
Just a matter of clarification: I am not opposed to using GROUP BY in the query. My SQL is not very good, so the example SQL above may have been slightly misleading. I have edited the pseudo sql statement above to hopefully make my intent more clear.
The main thing I wanted to find out was whether a select query that used AGG functions could:
Have agg functions values in the returned column without them being specified in a HAVING clause.
Have agg functions specified in a HAVING clause, but are not returned in the result set.
From the answers I have received so far, it would seem the answer to both questions is YES. The only think I have to do to correct my SQL is to add a GROUP BY clause to make sure that the returned rows are unique.
PostgreSQL major version include the first digit after the dot, thus "PostgreSQL (v9.x)" is not specific enough. As #kekekela said, there is no (cheap) completely db agnostic way. Even between PostgreSQL 9.0 and 9.1 there is an important syntactical difference.
If you had only the grouped values AGG1(param1), AGG2(param2) you would get away without providing an explicit GROUP BY clause. Since you mix grouped and non-grouped columns you have to provide a GROUP BY clause with all non-grouped columns that appear in the SELECT. That's true for any version of PostgreSQL. Read about GROUP BY and HAVING it in the manual.
Starting with version 9.1, however, once you list a primary key in the GROUP BY you can skip additional columns for this table and still use them in the SELECT list. The release notes for version 9.1 tell us:
Allow non-GROUP BY columns in the query target list when the primary
key is specified in the GROUP BY clause (Peter Eisentraut)
Concerning parameters
Do you intend to feed a constant value to an aggregate function? What's the point? The docs tell us
An aggregate function computes a single result from multiple input rows.
Or do you want those parameters to be column names? That kind of dynamic SQL works as long as the statement is generated before committing to the database. Does not work for prepared statements or simple sql or plpgsql functions. You have to use EXECUTE in a plpgsql function for that purpose.
As safeguard against SQLi use the USING $1, $2 syntax for values and quote_ident() for your column or table names.
The only way to aggregate over columns without using GROUP BY is to use windowing functions. You left out details of your problem, but the following might be what you are looking for:
SELECT *
FROM (
SELECT [COL1, COL2 ....],
AGG1(param1) over (partition by some_grouping_column) as agg1,
AGG2(param2) over (partition by some_grouping_column) as agg2,
row_number() over () as rn
FROM [SOME TABLES]
WHERE [SOME CRITERIA]
ORDER BY COL1
) t
WHERE AGG3 >-1
AND AGG4 < 123
AND rn <= 10
ORDER BY col1
This is standard ANSI SQL and works on most database including PostgreSQL (since 8.4).
Note that you do not need to use the same grouping column for both aggregates in the partition by clause.
If you want to stick with ANSI SQL then you should use the row_number() function to limit the result. If you run this only on PostgreSQL (or other DBMS that support LIMIT in some way) move the LIMIT cause into the derived table (the inner query)
That should work from a high level perspective, except you'd need COL1, COL2 etc in a GROUP BY statement or else they won't be valid in the SELECT list. Having AGG1, etc in the SELECT list and not in the HAVING is not a problem.
As far as db agnostic, you're going to have to tweak syntax no matter what you do (the LIMIT for example is going to be different in PostgreSQL, SQL SERVER and Oracle that I know off the top of my head), but you could build logic to construct the statements properly for each provided your high-level representation is solid.

HAVING without GROUP BY

Is the following possible according to standard(!) SQL?
What minimal changes should be neccessary in order to be conforming to the standard (if it wasn't already)?
It works as expected in MySQL, iff the first row has the maximum value for NumberOfPages.
SELECT *
FROM Book
HAVING NumberOfPages = MAX(NumberOfPages)
The following is written in the standard:
HAVING <search condition>
Let G be the set consisting of every column referenced by a <column reference> contained in the <group by clause>.
Each column reference directly contained in the <search condition> shall be one of the following:
An unambiguous reference to a column that is functionally dependent on G.
An outer reference.
source
Can somebody explain me, why it should be possible according to the standard?
In MySQL, it perfectly works.
Despite the Mimer Validator result, I don't believe yours is valid Standard SQL.
A HAVING clause without a GROUP BY clause is valid and (arguably) useful syntax in Standard SQL. Because it operates on the table expression all-at-once as a set, so to speak, it only really makes sense to use aggregate functions. In your example:
Book HAVING NumberOfPages = MAX(NumberOfPages)
is not valid because when considering the whole table, which row does NumberOfPages refer to? Likewise, it only makes sense to use literal values in the SELECT clause.
Consider this example, which is valid Standard SQL:
SELECT 'T' AS result
FROM Book
HAVING MIN(NumberOfPages) < MAX(NumberOfPages);
Despite the absence of the DISTINCT keyword, the query will never return more than one row. If the HAVING clause is satisfied then the result will be a single row with a single column containing the value 'T' (indicating we have books with differing numbers of pages), otherwise the result will be the empty set i.e. zero rows with a single column.
I think the reason why the query does not error in mySQL is due to propritary extensions that cause the HAVING clause to (logically) come into existence after the SELECT clause (the Standard behaviour is the other way around), coupled with the implicit GROUP BY clause mentioned in other answers.
“When GROUP BY is not used, HAVING behaves like a WHERE clause.”
The difference between where and having: WHERE filters ROWS while HAVING filters groups
SELECT SUM(spending) as totSpending
FROM militaryspending
HAVING SUM(spending) > 200000;
Result
totSpending
1699154.3
More detail, please consult
https://dba.stackexchange.com/questions/57445/use-of-having-without-group-by-in-sql-queries/57453
From the standard (bold added from emphasis)
1) Let HC be the having clause. Let TE be the table expression that immediately contains HC. If TE does not immediately contain a group by clause, then “GROUP BY ()” is implicit. Let T be the
descriptor of the table defined by the GBC immediately contained in TE and let R be
the result of GBC.
With the implicit group by clause, the outer reference can access the TE columns.
However, the certification to these standards is very much a self-certification these days, and the example you gave would not work across all of the main RDBMS providers.
Yes We can write the SQL query without Group by but write the aggregate function
in our query.
select sum(Salary) from ibs having max(Salary)>1000