Does BigQuery guarantee column order when performing a SELECT * query from a subselect?

Does BigQuery guarantee column order when performing SELECT * FROM subquery?
For example, given table t with columns a, b, c, d, if I execute the query:
SELECT * EXCEPT (a) FROM (SELECT a, b, c, d FROM t)
Will the columns in the result always have the same order (b, c, d in this case)?

I am going to answer "yes"; however, it is not documented to do so. The only guarantee (according to the documentation) is:
SELECT *, often referred to as select star, produces one output column for each column that is visible after executing the full query.
This does NOT explicitly state that the columns are in the order they are defined -- i.e. first by the order of references in the FROM clause and then by the ordering within each reference. I'm not even sure if the SQL standard specifies the ordering (although I have a vague recollection that the SQL-92 standard might have).
That said, I have never seen any database not produce the columns in the specified order. Plus, databases (in general) have to keep the ordering of the column to support INSERT. You can see it yourself in INFORMATION_SCHEMA.COLUMNS.ORDINAL_POSITION. And, the REPLACE functionality (to replace a column expression in the SELECT list) would make less sense if the ordering were not guaranteed.
I might argue that you are "pretty safe" assuming that the ordering is the same. And that we should lobby Google to fix the documentation to make this clear.
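If you would rather check the observed order than rely on it, you can read the column names off the result metadata. The sketch below uses Python's built-in sqlite3 module with an in-memory table as a stand-in. It is not BigQuery and proves nothing about BigQuery's guarantees, and since SQLite has no SELECT * EXCEPT, the subquery simply leaves out column a. The helper result_columns is a made-up name for this example.

```python
import sqlite3

def result_columns(conn, sql):
    """Return the column names of a query result, in result order."""
    cur = conn.execute(sql)
    return [col[0] for col in cur.description]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b INTEGER, c INTEGER, d INTEGER)")

# SELECT * over a subquery: the order observed in SQLite (not a BigQuery guarantee).
star_order = result_columns(conn, "SELECT * FROM (SELECT b, c, d FROM t)")

# Declared column order from the catalog (SQLite's analogue of ORDINAL_POSITION).
table_order = [row[1] for row in conn.execute("PRAGMA table_info(t)")]

print(star_order)
print(table_order)
```

A check like this can run as a test against the real engine, so that a change in column order is caught rather than assumed away.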

Related

Is SELECT DISTINCT ON (col) * valid?

SELECT DISTINCT ON (some_col)
*
FROM my_table
I'm wondering if this is valid and will work as expected. Meaning, will this return all columns from my_table, based on distinct some_col? I've read the Postgres docs and don't see any reason why this wouldn't work as expected, but have read old comments here on SO which state that columns need to be explicitly listed when using distinct on.
I do know it's best practice to explicitly list columns, and also to use order by when doing the above.
Background that you probably don't need or care about
For background: the reason I ask is that we are migrating from MySQL to Postgres. MySQL has a very non-standards-compliant "trick" that allows SELECT * ... GROUP BY, which makes it easy to select * based on a group by. Previous answers and comments about migrating this non-standard trick to Postgres are murky at best.
SELECT DISTINCT ON (some_col) *
FROM my_table;
I'm wondering if this is valid
Yes. Typically, you want ORDER BY to go with it to determine which row to pick from each set of peers. But choosing an arbitrary row (without ORDER BY) is a valid (and sometimes useful!) application. You just need to know what you are doing. Maybe add a comment for the afterworld?
See:
Select first row in each GROUP BY group?
will this return all columns from my_table, based on distinct some_col?
It will return all columns. One arbitrary row per distinct value of some_col.
Note how I used the word "arbitrary", not "random". Returned rows are not chosen randomly at all. Just arbitrarily, depending on current implementation details. Typically the physically first row per distinct value, but that depends.
I do know it's best practice to explicitly list columns.
That really depends. Often it is. Sometimes it is not. Like when I want to get all columns to match a given row type.
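When the picked row has to be deterministic, or DISTINCT ON is not available, the same "one row per distinct value" result can be written with a correlated subquery. This is only a sketch, run through Python's sqlite3 because SQLite has no DISTINCT ON. my_table and some_col come from the question; the id and payload columns and the "lowest id wins" rule are assumptions made for the example. In Postgres the closer equivalent would be DISTINCT ON (some_col) combined with ORDER BY some_col, id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_table (id INTEGER PRIMARY KEY, some_col TEXT, payload TEXT)"
)
conn.executemany(
    "INSERT INTO my_table (id, some_col, payload) VALUES (?, ?, ?)",
    [(1, "x", "first x"), (2, "x", "second x"), (3, "y", "first y"), (4, "y", "second y")],
)

# Keep, for each some_col value, the row with the smallest id.
first_per_group = conn.execute(
    """
    SELECT *
    FROM my_table AS m
    WHERE m.id = (SELECT MIN(i.id) FROM my_table AS i WHERE i.some_col = m.some_col)
    ORDER BY m.some_col
    """
).fetchall()

print(first_per_group)
```

Unlike DISTINCT ON without ORDER BY, the row chosen here does not depend on physical storage order.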

SQL Server order by expression

While watching Troy Hunt's fantastic course on SQLi, I've noticed that he ends up using this strategy to see if a table has a specific column:
select * from TableA order by (select top 1 some_column from TableB) desc
This expression will get executed by SQL Server, but what will it do for the order by clause? I've seen expressions being used with order by before (case when then else end), but I'm really curious to understand how SQL can process the previous query without any errors...
EDIT: Giving more info because it seems like my initial post was not clear enough:
I know this is not the best strategy for getting table or column names through SQLi (that's not what I'm asking)
I'm not interested in knowing how to protect against this (I know how to do that already)
I know that sorting by a constant value doesn't make sense (though it allows you to run these types of "boolean queries")
What I really want to know is why it works.
So, going back to the docs, the order by clause expects an order_by_expression, which is described as:
Specifies a column or expression on which to sort the query result set. A sort column can be specified as a name or column alias, or a nonnegative integer representing the position of the column in the select list.
According to the docs, an expression is:
Is a combination of symbols and operators that the SQL Server Database Engine evaluates to obtain a single data value. Simple expressions can be a single constant, variable, column, or scalar function. Operators can be used to join two or more simple expressions into a complex expression.
As @SMor demonstrated, the query does run if you replace the order by select expression with a simple select 'A':
select * from TableA order by (select 'A') desc
But this does not work:
select * from TableA order by 'A' desc
So, the question is: why is select 'A' accepted by SQL Server in the order by clause? Doesn't it produce a constant too? Since a constant is an expression, and taking into account the definition of the order by clause, shouldn't it throw an error in both cases?
Thanks.
The use of (select top 1 some_column from TableB) is an example of a scalar subquery. This is a subquery that returns exactly one column and at most one row. It can be used anywhere a literal value can be used -- and perhaps in some other places as well. Apparently, it can be used in an order by, even though SQL Server does not allow a literal value for order by.
The most common type of scalar subquery is a correlated subquery, which has a where clause that connects the subquery to the outer query. This is not an example of a correlated subquery.
In fact, this is not an example of anything useful as far as I can tell. It has one major shortcoming, which is the use of top without order by. The value returned by the subquery is indeterminate. That seems like a bad practice, and particularly bad if you are trying to teach people SQL.
And, it is probably going to be evaluated once. So the subquery would return a constant value and would not contribute much to a meaningful ordering.
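The difference between the two kinds of scalar subquery in an order by can be sketched as follows. This runs on SQLite through Python's sqlite3, not on SQL Server, so SQL Server's specific rule about literal constants is not reproduced here. TableA and TableB are from the question; their columns and rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE TableA (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE TableB (a_id INTEGER, score INTEGER);
    INSERT INTO TableA VALUES (1, 'one'), (2, 'two'), (3, 'three');
    INSERT INTO TableB VALUES (1, 10), (2, 30), (3, 20);
    """
)

# Uncorrelated scalar subquery: the sort key is the same value for every row,
# so it does not determine the row order.
constant_key = conn.execute(
    "SELECT id FROM TableA ORDER BY (SELECT MAX(score) FROM TableB) DESC"
).fetchall()

# Correlated scalar subquery: the sort key is computed per outer row.
per_row_key = conn.execute(
    """
    SELECT a.id
    FROM TableA AS a
    ORDER BY (SELECT b.score FROM TableB AS b WHERE b.a_id = a.id) DESC
    """
).fetchall()

print(sorted(constant_key))
print(per_row_key)
```

The first query returns every row but in no order you can rely on, which is why it only serves as a "does this run or fail" probe. The second actually sorts by the looked-up score.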

Using Order By in IN condition

Let's say we have this silly query:
Select * From Emp Where Id In (Select Id From Emp)
Let's do a small modification inside IN condition by adding an Order By clause.
Select * From Emp Where Id In (Select Id From Emp Order By Id)
Then we are getting the following error:
ORA-00907: missing right parenthesis
Oracle assumes that the IN condition ends after the From table is declared. As a result it waits for a right parenthesis, but we give it an Order By instead.
Why can't we add an Order By inside IN condition?
FYI: I'm not asking for an equivalent query. This is an example after all. I just can't understand why this error occurs.
Consider the condition x in (select something from somewhere). It returns true if x is equal to any of the somethings returned from the query - regardless of whether it's the first, second, last, or anything in the middle. The order in which the somethings are returned is inconsequential. Adding an order by clause to a query often comes with a hefty performance hit, so I'm guessing this limitation was introduced to prevent adding a clause that has no impact on the query's correctness on the one hand and may be quite expensive on the other.
It would not make sense to sort the values inside the IN clause. Think in this way:
IN (a, b, c)
is the same as
IN (c, b, a)
is the same as
IN (b, c, a)
Internally Oracle applies OR condition, so it is equivalent to:
WHERE id = a OR id = b OR id = c
Would it make any sense to order the conditions?
Ordering comes at its own expense. So, when there is no need to sort, just don't do it. And in this case, Oracle applied the same rule.
When it comes to the performance of the query, the optimizer needs to choose the best possible execution plan, i.e. the one with the least cost to achieve the desired output. ORDER BY is useless in this case, and Oracle did a good job of preventing its use at all.
From the documentation:
Type of Condition  Operation
-----------------  ---------
IN                 Equal-to-any-member-of test. Equivalent to =ANY.
So, when you need to test ANY value for membership in a list of values, there is no need for an ordered list; any match does the job.
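The order-insensitivity of IN is easy to confirm. Here is a minimal sketch using Python's sqlite3 rather than Oracle; the Emp rows and the helper ids_matching are invented for the example, and it only illustrates the membership semantics, not Oracle's parser.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Emp (Id INTEGER PRIMARY KEY, Name TEXT)")
conn.executemany("INSERT INTO Emp VALUES (?, ?)", [(1, "Ann"), (2, "Bob"), (3, "Cy")])

def ids_matching(in_list_sql):
    """Run the outer query with the given IN list; the outer ORDER BY fixes the output order."""
    sql = f"SELECT Id FROM Emp WHERE Id IN ({in_list_sql}) ORDER BY Id"
    return [row[0] for row in conn.execute(sql)]

# The same members listed in different orders select the same rows.
forward = ids_matching("1, 2, 3")
shuffled = ids_matching("3, 1, 2")

# The equivalent chain of OR conditions gives the same result again.
as_or = [
    row[0]
    for row in conn.execute(
        "SELECT Id FROM Emp WHERE Id = 1 OR Id = 2 OR Id = 3 ORDER BY Id"
    )
]

print(forward, shuffled, as_or)
```

Only the ORDER BY on the outer query affects what the caller sees; the order of the IN members never does.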
If you look at the Oracle SQL reference (syntax diagrams) you will find the reason. ORDER BY is part of the "select" statement, while the IN clause uses the lower-level "subquery" element. Your problem relates to the nature of Oracle's SQL grammar.
PS: there are more gotchas of this kind. When you combine "subqueries" with UNION or MINUS, you can use ONLY one ORDER BY clause, because it applies to the result of the whole UNION operation.
This will fail too:
select * from dual order by 1
union all
select * from dual order by 1;
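The "only one ORDER BY, at the end" rule can be seen in other engines too. The sketch below uses Python's sqlite3, not Oracle, so the error text differs from ORA-00907, but the grammar restriction is of the same kind. The literal values and the column alias n are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One ORDER BY at the very end sorts the combined result.
combined = conn.execute(
    "SELECT 2 AS n UNION ALL SELECT 1 AS n ORDER BY 1"
).fetchall()

# An ORDER BY attached to the first branch is rejected when the statement is prepared.
try:
    conn.execute("SELECT 2 AS n ORDER BY 1 UNION ALL SELECT 1 AS n ORDER BY 1")
    rejected = False
except sqlite3.OperationalError:
    rejected = True

print(combined, rejected)
```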

How does SQL choose which row to show when grouping multiple rows together?

Consider the following table:
CREATE TABLE t
(
a INTEGER NOT NULL,
b INTEGER NOT NULL,
c INTEGER,
PRIMARY KEY (a, b)
)
Now if I do this:
SELECT a,b,c FROM t GROUP BY a;
I expect to get each distinct value of a only once. But since I'm asking for b and c as well, it's going to give me a row for every value of a. Therefore, if, for a single value of a, there are many rows to choose from, how can I predict which row SQL will choose? My tests show that it chooses to return the row for which b is the greatest. But what is the logic in that? How would this apply to strings or blobs or dates or anything else?
My question: How does SQL choose which row to show when grouping multiple rows together?
btw: My particular problem concerns SQLITE3, but I'm guessing this is an SQL issue independent of the DBMS...
That shouldn't actually work in a decent DBMS :-)
Any column not used in the group by clause should be subject to an aggregation function, such as:
select a, max(b), sum(c) from t group by a
If it doesn't complain in SQLite (and I have no immediate reason to doubt you), I'd just put it down to the way the DBMS is built. From memory, there's a few areas where it doesn't worry too much about the "purity" of the data (such as every column being able to hold multiple types, the type belonging to the data in that row/column intersect rather than the column specification).
All the SQL engines that I know will complain about the query that you mentioned with an error message like "b and c appear in the field list but not in the group by list". You are only allowed to use b or c in an aggregate function (like MAX / MIN / COUNT / AVG whatever) or you'll be forced to add them in the GROUP BY list.
You're not quite correct in your assumption that this is RDBMS-independent. Most RDBMS don't allow selecting fields that are not also in the GROUP BY clause. Exceptions to this (to my knowledge) are SQLite and MySQL. In general, you shouldn't do this, because values for b and c are chosen pretty arbitrarily (depending on the applied grouping algorithm). Even if this may be documented in your database, it's always better to express a query in a way that fully and unambiguously specifies the outcome.
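Two ways of making the outcome fully specified can be sketched with the table from the question, run here through Python's sqlite3. The sample rows are invented, and "the row with the greatest b" is just one possible rule, picked because the question mentions it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (a INTEGER NOT NULL, b INTEGER NOT NULL, c INTEGER, PRIMARY KEY (a, b))"
)
conn.executemany(
    "INSERT INTO t (a, b, c) VALUES (?, ?, ?)",
    [(1, 1, 10), (1, 2, 20), (2, 1, 30), (2, 5, 40)],
)

# Option 1: every non-grouped column goes through an aggregate.
aggregated = conn.execute(
    "SELECT a, MAX(b), SUM(c) FROM t GROUP BY a ORDER BY a"
).fetchall()

# Option 2: say explicitly which whole row to keep per group
# (here: the row with the greatest b), by joining back to the grouped result.
row_with_max_b = conn.execute(
    """
    SELECT t.a, t.b, t.c
    FROM t
    JOIN (SELECT a, MAX(b) AS max_b FROM t GROUP BY a) AS m
      ON t.a = m.a AND t.b = m.max_b
    ORDER BY t.a
    """
).fetchall()

print(aggregated)
print(row_with_max_b)
```

Both queries are valid in engines that reject the original SELECT a,b,c ... GROUP BY a, and neither leaves the choice of row to the grouping algorithm.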
It's not a matter of what the database will choose, but the order your data are going to be returned.
Your primary key is handling your sort order by default since you didn't provide one.
You can use Order By a, c if that's what you want.

SQL Server UNION - What is the default ORDER BY Behaviour

If I have a few UNION Statements as a contrived example:
SELECT * FROM xxx WHERE z = 1
UNION
SELECT * FROM xxx WHERE z = 2
UNION
SELECT * FROM xxx WHERE z = 3
What is the default order by behaviour?
The test data I'm seeing essentially does not return the data in the order that is specified above. I.e. the data is ordered, but I wanted to know what are the rules of precedence on this.
Another thing is that in this case xxx is a View. The view joins 3 different tables together to return the results I want.
There is no default order.
Without an Order By clause the order returned is undefined. That means SQL Server can bring them back in any order it likes.
EDIT:
Based on what I have seen, without an Order By, the order that the results come back in depends on the query plan. So if there is an index that it is using, the result may come back in that order but again there is no guarantee.
In regards to adding an ORDER BY clause:
This is probably elementary to most here, but I thought I'd add this.
Sometimes you don't want the results mixed, so you want the first query's results then the second and so on. To do that I just add a dummy first column and order by that. Because of possible issues with forgetting to alias a column in unions, I usually use ordinals in the order by clause, not column names.
For example:
SELECT 1, * FROM xxx WHERE z = 'abc'
UNION ALL
SELECT 2, * FROM xxx WHERE z = 'def'
UNION ALL
SELECT 3, * FROM xxx WHERE z = 'ghi'
ORDER BY 1
The dummy ordinal column is also useful for times when I'm going to run two queries and I know only one is going to return any results. Then I can just check the ordinal of the returned results. This saves me from having to do multiple database calls and most empty resultset checking.
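The dummy-column technique can be tried end to end with a small sketch. It uses Python's sqlite3 instead of SQL Server; the table xxx is from the question, while its columns, its rows, the branch alias, and the extra sort on column 3 (added to make the order within each branch deterministic) are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE xxx (z TEXT, val INTEGER)")
conn.executemany(
    "INSERT INTO xxx (z, val) VALUES (?, ?)",
    [("ghi", 7), ("abc", 9), ("def", 8), ("abc", 3)],
)

# Column 1 is the dummy branch number; ordering by ordinal avoids relying on aliases.
rows = conn.execute(
    """
    SELECT 1 AS branch, * FROM xxx WHERE z = 'abc'
    UNION ALL
    SELECT 2, * FROM xxx WHERE z = 'def'
    UNION ALL
    SELECT 3, * FROM xxx WHERE z = 'ghi'
    ORDER BY 1, 3
    """
).fetchall()

# The branch number also tells the caller which query produced each row.
branches_with_rows = sorted({row[0] for row in rows})

print(rows)
print(branches_with_rows)
```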
Just found the actual answer.
Because UNION removes duplicates it does a DISTINCT SORT. This is done before all the UNION statements are concatenated (check out the execution plan).
To stop a sort, do a UNION ALL and this will also not remove duplicates.
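The duplicate handling is the part you can rely on; any ordering that falls out of it is not. A minimal sketch with Python's sqlite3 (not SQL Server, so the execution plan described above is not what runs here); the table contents are invented, and the explicit ORDER BY is there precisely because neither operator promises an order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE xxx (z INTEGER, val TEXT)")
conn.executemany(
    "INSERT INTO xxx (z, val) VALUES (?, ?)",
    [(1, "same"), (2, "same"), (3, "other")],
)

# UNION removes duplicate rows across the branches.
union_rows = conn.execute(
    "SELECT val FROM xxx WHERE z = 1 "
    "UNION SELECT val FROM xxx WHERE z = 2 "
    "UNION SELECT val FROM xxx WHERE z = 3 "
    "ORDER BY val"
).fetchall()

# UNION ALL keeps every row from every branch.
union_all_rows = conn.execute(
    "SELECT val FROM xxx WHERE z = 1 "
    "UNION ALL SELECT val FROM xxx WHERE z = 2 "
    "UNION ALL SELECT val FROM xxx WHERE z = 3 "
    "ORDER BY val"
).fetchall()

print(union_rows)
print(union_all_rows)
```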
If you care what order the records are returned, you MUST use an order by.
If you leave it out, it may appear organized (based on the indexes chosen by the query plan), but the results you see today may NOT be the results you expect, and it could even change when the same query is run tomorrow.
Edit: Some good, specific examples: (all examples are MS SQL server)
Pinal Dave's blog describes how two very similar queries can show a different apparent order, because different indexes are used:
SELECT ContactID FROM Person.Contact
SELECT * FROM Person.Contact
Conor Cunningham shows how the apparent order can change when the table gets larger (if the query optimizer decides to use a parallel execution plan).
Hugo Kornelis proves that the apparent order is not always based on primary key. Here is his follow-up post with explanation.
A UNION can be deceptive with respect to result set ordering because a database will sometimes use a sort method to provide the DISTINCT that is implicit in UNION , which makes it look like the rows are deliberately ordered -- this doesn't apply to UNION ALL for which there is no implicit distinct, of course.
However there are algorithms for the implicit distinct, such as Oracle's hash method in 10g+, for which no ordering will be applied.
As DJ says, always use an ORDER BY
It's very common to come across poorly written code that assumes table data is returned in insert order, and 95% of the time the coder gets away with it and is never aware that this is a problem, because on many common databases (MSSQL, Oracle, MySQL) the rows often happen to come back that way. It is of course a complete fallacy and should be corrected whenever it's come across. Always, without exception, use an Order By clause yourself.