Are Postgres SELECT DISTINCT queries deterministic?
Will SELECT DISTINCT somecolumn FROM sometable return the same result (including order) if the table (and entire database) goes unchanged?
In the Select Query Documentation the Description section notes:
If the ORDER BY clause is specified, the returned rows are sorted in the specified order. If ORDER BY is not given, the rows are returned in whatever order the system finds fastest to produce.
In the DISTINCT ON clause section they add:
Note that the "first row" of each set is unpredictable unless ORDER BY is used to ensure that the desired row appears first.
Generally, is this still true when the database is unchanged?
This answer assumes that the expressions in the select are deterministic. Otherwise, the question seems trivial.
The ordering is not specified, so it could change between runs of the query -- or on a different system. However, the result set should be the same.
Your second quote from the documentation is about DISTINCT ON. That is non-deterministic unless you use an ORDER BY that amounts to a stable sort, meaning the sort keys decide unambiguously which row comes first in each set.
Note: You might get non-deterministic results if you are using a case-insensitive collation. The built-in collations are case-sensitive. With a case-insensitive collation, values that differ only in case compare as equal, so which of them DISTINCT returns is not determined.
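To make the "same result set, unspecified order" point concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module rather than Postgres, purely so it runs anywhere, and the table contents are made up for illustration. The behavior it shows (only ORDER BY makes the order part of the result) is general SQL behavior, but the actual unordered output you see will depend on the engine and plan.

```python
import sqlite3

# Hypothetical table and data, used only for illustration.
# SQLite is an assumption here; the question is about Postgres.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sometable (somecolumn TEXT)")
conn.executemany(
    "INSERT INTO sometable (somecolumn) VALUES (?)",
    [("b",), ("a",), ("b",), ("c",), ("a",)],
)

# Without ORDER BY, only the set of values is guaranteed, not their order.
unordered = [row[0] for row in conn.execute(
    "SELECT DISTINCT somecolumn FROM sometable")]

# With ORDER BY, the order is part of what the query promises.
ordered = [row[0] for row in conn.execute(
    "SELECT DISTINCT somecolumn FROM sometable ORDER BY somecolumn")]

conn.close()
```

So if you compare two runs, compare the unordered result as a set, or add ORDER BY and compare it as a list.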
I am learning how ORDER BY is used in SQL queries. I learned that ORDER BY and DISTINCT don't work together, but when I tried it in practice, it worked. I am still confused about the relationship between ORDER BY and DISTINCT, even after asking ChatGPT.
I learned that when executing SQL queries, the ORDER BY clause comes after the SELECT clause. This means that the database will first retrieve the data specified in the SELECT clause, and then sort it based on the criteria specified in the ORDER BY clause. If the column used in the ORDER BY clause is not present in the SELECT clause, the database will automatically include that column in the select, sort on both columns, and return only the columns given in SELECT.
However, when using both DISTINCT and ORDER BY together, the outcome may not be what is expected. This is because DISTINCT acts on both the columns in the SELECT clause and the column in the ORDER BY clause. This may cause unexpected results, especially in MySQL.
I found that when I tried this in practice, it still produced the desired results, which makes me question if I learned something incorrectly or if there is missing information that I am unaware of.
I am using the MySQL database.
It seems there are some spaces before your unique value. It would be appropriate to apply TRIM/RTRIM and remove these spaces in the DISTINCT clause itself.
It should be something like this:
DISTINCT TRIM(value) AS trim_value
...
ORDER BY trim_value
Also, it is possible that these are not spaces but some other characters, which would need to be replaced, too.
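A small sketch of the effect, assuming a hypothetical table named items with a value column. It runs on SQLite through Python's sqlite3 module rather than MySQL, only so that it is self-contained. TRIM with a single argument strips spaces at both ends in both systems.

```python
import sqlite3

# Hypothetical table and data; SQLite stands in for MySQL here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (value TEXT)")
conn.executemany(
    "INSERT INTO items (value) VALUES (?)",
    [("apple",), ("  apple",), ("banana ",), ("banana",)],
)

# Stray spaces make otherwise identical values count as distinct.
raw = [row[0] for row in conn.execute(
    "SELECT DISTINCT value FROM items ORDER BY value")]

# Trimming inside the DISTINCT expression collapses them again,
# and ORDER BY can refer to the alias.
trimmed = [row[0] for row in conn.execute(
    "SELECT DISTINCT TRIM(value) AS trim_value "
    "FROM items ORDER BY trim_value")]

conn.close()
```

Here raw contains four entries while trimmed contains only apple and banana. Tabs or non-breaking spaces would not be removed by a plain TRIM, which is the case the last sentence above warns about.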
While watching Troy Hunt's fantastic course on SQLi, I've noticed that he ends up using this strategy to see if a table has a specific column:
select * from TableA order by (select top 1 some_column from TableB) desc
This expression will get executed by SQL Server, but what will it do for the order by clause? I've seen expressions being used with order by before (case when then else end), but I'm really curious to understand how SQL can process the previous query without any errors...
EDIT: Giving more info because it seems like my initial post was not clear enough:
I know this is not the best strategy for getting table or column names through SQLi (that's not what I'm asking)
I'm not interested in knowing how to protect against this (I know how to do that already)
I know that sorting by a constant value doesn't make sense (though it allows you to run these types of "boolean queries")
What I really want to know is why it works.
So, going back to the docs, the order by clause expects an order_by_expression, which is described as:
Specifies a column or expression on which to sort the query result set. A sort column can be specified as a name or column alias, or a nonnegative integer representing the position of the column in the select list.
According to the docs, an expression is:
Is a combination of symbols and operators that the SQL Server Database Engine evaluates to obtain a single data value. Simple expressions can be a single constant, variable, column, or scalar function. Operators can be used to join two or more simple expressions into a complex expression.
As @SMor demonstrated, the query does run if you replace the order by select expression with a simple select 'A':
select * from TableA order by (select 'A') desc
But this does not work:
select * from TableA order by 'A' desc
So, the question is: why is select 'A' accepted by SQL Server in the order by clause? Doesn't it produce a constant too? Since a constant is an expression, and taking into account the definition of the order by clause, shouldn't it throw an error in both cases?
Thanks.
The use of (select top 1 some_column from TableB) is an example of a scalar subquery. This is a subquery that returns exactly one column and at most one row. It can be used anywhere a literal value can be used -- and perhaps in some other places as well. Apparently, it can be used in an order by, even though SQL Server does not allow a literal value for order by.
The most common type of scalar subquery is a correlated subquery, which has a where clause that connects the subquery to the outer query. This is not an example of a correlated subquery.
In fact, this is not an example of anything useful as far as I can tell. It has one major shortcoming, which is the use of top without order by. The value returned by the subquery is indeterminate. That seems like a bad practice, and particularly bad if you are trying to teach people SQL.
And, it is probably going to be evaluated once. So the subquery would return a constant value and would not contribute much to a meaningful ordering.
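Here is a rough sketch of both observations: an uncorrelated scalar subquery in order by is accepted and contributes nothing to the ordering, and the statement only compiles if the referenced column exists. It runs on SQLite via Python's sqlite3 module, which is an assumption made for portability. SQLite uses LIMIT 1 where SQL Server uses top 1, and the table contents are invented.

```python
import sqlite3

# Hypothetical tables; SQLite stands in for SQL Server here,
# so LIMIT 1 replaces TOP 1.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TableA (id INTEGER)")
conn.execute("CREATE TABLE TableB (some_column TEXT)")
conn.executemany("INSERT INTO TableA (id) VALUES (?)", [(1,), (2,), (3,)])
conn.execute("INSERT INTO TableB (some_column) VALUES ('x')")

# The subquery yields the same single value for every outer row,
# so it sorts nothing; all rows still come back.
rows = conn.execute(
    "SELECT id FROM TableA "
    "ORDER BY (SELECT some_column FROM TableB LIMIT 1) DESC"
).fetchall()

# Referencing a column that does not exist fails when the
# statement is compiled, before any row is read.
try:
    conn.execute(
        "SELECT id FROM TableA "
        "ORDER BY (SELECT missing_column FROM TableB LIMIT 1) DESC"
    )
    statement_compiled = True
except sqlite3.Error:
    statement_compiled = False

conn.close()
```

The success-or-error difference between the two statements is the only information the order by carries, which is why the trick works even though the ordering itself is meaningless.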
I had a discussion with a teammate on the topic whether the terms clause and expression can be used interchangeably. For example, is it correct/common to call a variable that stands for an expression a=b (e.g. that participates in a statement SELECT * WHERE expression) a clause?
Edit
It would be useful if someone could give precise definitions of what clause, expression and statement are in the SQL world.
In SQL Terms, "clause" is usually used to refer to a section of a statement, usually introduced by the keyword it's named after - e.g. a typical SELECT statement would be composed of a SELECT clause, a FROM clause and a WHERE clause. Within the FROM clause, some people may refer to JOIN clauses and ON clauses. However, this is by no means 100% accepted usage.
When it comes to "statement" and "expression", it's fairly standard usage - an expression is something that produces a value. In most languages, this is understood, further, to be something that produces a scalar value. In SQL, this is slightly modified because when you encounter an expression when working with a row set, the expression will produce one scalar value per row (or per group or partition, if grouping or partitioning are involved and it's in the relevant location).
Finally, a statement is a complete "something" that your database engine can understand and produce results for. It doesn't produce a value but it may produce a result set. You can't just send a FROM clause to the database - it has to be part of a larger statement, such as the SELECT statement I mentioned in my first paragraph.
The answer is no. An expression evaluates to a value, which may be a boolean, a string or a number, whereas a clause forms a rule that the data must satisfy before a record becomes part of the result.
select * from TABLE where /*clause 1*/ field1 = field2
and /*clause 2*/field3 = /*expression*/ field1 + field2
In the above select statement:
The first clause forms a rule, which is that field1 should be equal to field2.
The second clause forms a rule, which is that field3 should be equal to the result of the expression field1 + field2.
UPDATE
There are various clauses in SQL, such as from, where, order by, group by and having. The from clause tells the database which table to read, and order by tells it how to arrange the result. Clauses control where the data is read from, which data becomes part of the select statement's result, and how that data is presented.
An expression, on the other hand, evaluates to a value of some datatype.
A statement is a structured query built from clauses.
I have used a NOT IN clause in a SELECT statement. When I run that query, it returns the same result set each time, but the order is different.
Is this the default behavior of "NOT IN" clause?
The query which I am using is as below:
SELECT *,
       (ISNULL(AppFirstName,'')+' '+ISNULL(AppMiddleName,'')+' '+ISNULL(AppLastName,'')) as AppName
FROM BApp AF
WHERE AF.SId=11
  AND AF.SCId=5
  AND AF.CCId= 1
  AND AF.IsActive=1
  AND AF.ASId=16
  AND AF.AId NOT IN (SELECT AId FROM NumberDetails where AId = AF.AId)
The order of an SQL result is not defined and is left for the database to pick unless you use an ORDER BY clause. If you need to know more, post the query and what DB you are using.
If you don't specify an ORDER BY clause, then no query has a defined order. The database is free to return you the rows in whatever order is easiest for it.
The reason this sometimes seems consistent is that the rows will often be read out either in the order they exist on disk (probably the order they were inserted) or in the order of some index that was used to find the result.
The more complex your query, the more complex the processing the database needs to do, so the less likely the results are to come out in some obvious, repeatable, order.
Moral of the story: always use an ORDER BY clause.
SQL, by default, does not order or sort the records it returns. This behavior isn't specific to 'NOT IN', but is a general premise of the language. However, you can easily order your results by adding an 'ORDER BY table.column_name' to the end of your query.
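As a sketch of the fix, here is a cut-down version of the query with an ORDER BY added. The column list, the rows, and the choice of AId as the sort key are assumptions for illustration, the correlation inside the subquery is dropped for brevity, and SQLite (via Python's sqlite3) stands in for SQL Server.

```python
import sqlite3

# Simplified, hypothetical versions of the tables from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE BApp (AId INTEGER, AppFirstName TEXT)")
conn.execute("CREATE TABLE NumberDetails (AId INTEGER)")
conn.executemany(
    "INSERT INTO BApp (AId, AppFirstName) VALUES (?, ?)",
    [(3, "Cy"), (1, "Al"), (2, "Bo"), (4, "Di")],
)
conn.execute("INSERT INTO NumberDetails (AId) VALUES (2)")

# NOT IN only filters rows; ORDER BY is what fixes their order.
query = """
    SELECT AF.AId, AF.AppFirstName
    FROM BApp AF
    WHERE AF.AId NOT IN (SELECT AId FROM NumberDetails)
    ORDER BY AF.AId
"""
rows = conn.execute(query).fetchall()

conn.close()
```

Sorting on a unique key such as AId gives the same order on every run. Sorting on a non-unique column would still leave the order of tied rows undefined.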
Assume value is an int and the following query is valid:
SELECT blah
FROM table
WHERE attribute = value
Though MAX(expression) returns int, the following is not valid:
SELECT blah
FROM table
WHERE attribute = MAX(expression)
Of course the desired effect can be achieved using a subquery, but my question is why SQL was designed this way. Is there some reason why this sort of thing is not allowed? Students coming from programming languages where you can always replace a value of some data type with a function call that returns that type find this issue confusing. Is there an explanation one can give them rather than just saying "that's the way it is"?
It's just because of the order of operations of a query.
FROM clause
WHERE clause
GROUP BY clause
HAVING clause
SELECT clause
ORDER BY clause
WHERE just filters the rows returned by FROM. An aggregate function like MAX() can't have a result returned because it hasn't even been applied to anything.
That's also the reason why you can't use aliases defined in the SELECT clause in a WHERE clause, but you can use aliases defined in the FROM clause.
A where clause checks every row to see if it matches the conditions specified.
A max computes a single value from a row set. If you put a max, or any other aggregate function, into a where clause, how can SQL Server figure out which rows the max function can use before the where clause has finished its filtering?
This deals with the order that SQL Server processes commands in. It runs the WHERE clause before a GROUP BY or any aggregate. Since a where clause runs first, SQL Server can't tell if a row will be included in an aggregate until it processes the where. That is what the HAVING clause is for. HAVING runs after the GROUP BY and the WHERE and can include MAX since you have already filtered out the rows you don't want to use. See http://www.bennadel.com/blog/70-SQL-Query-Order-of-Operations.htm for a good explanation of the order in which SQL commands run.
Maybe this will work:
SELECT blah
FROM table
WHERE attribute = (SELECT MAX(expression) FROM table1)
The WHERE clause is specifically designed to test conditions against raw data (individual rows of the table). However, MAX is an aggregate function over multiple rows of data. Basically, without a sub-select, the WHERE clause knows nothing about any rows in the table except for the current row. So how can you determine the maximum value over a whole bunch of rows when you don't even know what those rows are?
Yes, it's a little bit of a simplification, especially when dealing with joins, but the same principle applies. WHERE is always row-by-row, so that's all it really knows about.
Even if you have a GROUP BY clause, the WHERE clause still only processes one row at a time in the raw data before grouping. It doesn't know the value of a column in any other rows, so it has no way of knowing which row has the maximum value.
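A minimal sketch of the three cases discussed in this thread: an aggregate directly in WHERE, the subquery workaround, and an aggregate in HAVING. It assumes a hypothetical table t (the name table itself is a reserved word) and runs on SQLite via Python's sqlite3 rather than SQL Server. The exact error text differs between engines, so the sketch only records whether an error occurred.

```python
import sqlite3

# Hypothetical table and data; SQLite stands in for SQL Server here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (blah TEXT, attribute INTEGER)")
conn.executemany(
    "INSERT INTO t (blah, attribute) VALUES (?, ?)",
    [("a", 1), ("b", 5), ("c", 3)],
)

# 1. An aggregate directly in WHERE is rejected: WHERE is evaluated
#    per row, before any set of rows exists to aggregate over.
try:
    conn.execute(
        "SELECT blah FROM t WHERE attribute = MAX(attribute)").fetchall()
    aggregate_in_where_ok = True
except sqlite3.Error:
    aggregate_in_where_ok = False

# 2. A scalar subquery computes the aggregate over its own row set
#    first, then WHERE compares each row against that single value.
via_subquery = conn.execute(
    "SELECT blah FROM t "
    "WHERE attribute = (SELECT MAX(attribute) FROM t)").fetchall()

# 3. HAVING runs after grouping, so aggregates are allowed there.
via_having = sorted(row[0] for row in conn.execute(
    "SELECT blah FROM t GROUP BY blah HAVING MAX(attribute) > 2"))

conn.close()
```

For students, the subquery version may be the clearest: it makes explicit which rows MAX ranges over, which is exactly the information a bare MAX in WHERE lacks.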
Assuming this is MS SQL Server, the following would work.
SELECT TOP 1 blah
FROM table
ORDER BY expression DESC
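One difference worth knowing between this and the subquery approach shows up when several rows share the maximum. The sketch below assumes a hypothetical table t with a tie, and uses SQLite via Python's sqlite3, where LIMIT 1 plays the role of TOP 1.

```python
import sqlite3

# Hypothetical table with two rows tied for the maximum.
# SQLite stands in for SQL Server, so LIMIT 1 replaces TOP 1.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (blah TEXT, attribute INTEGER)")
conn.executemany(
    "INSERT INTO t (blah, attribute) VALUES (?, ?)",
    [("a", 5), ("b", 5), ("c", 3)],
)

# Sort-and-take-one returns exactly one row; which of the tied
# rows it picks is not specified.
top_one = conn.execute(
    "SELECT blah FROM t ORDER BY attribute DESC LIMIT 1").fetchall()

# The subquery form returns every row that has the maximum value.
all_max = conn.execute(
    "SELECT blah FROM t "
    "WHERE attribute = (SELECT MAX(attribute) FROM t) "
    "ORDER BY blah").fetchall()

conn.close()
```

So the two forms are only interchangeable when the maximum is unique, or when any one of the tied rows will do.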