How does SQL choose which row to show when grouping multiple rows together?

Consider the following table:
CREATE TABLE t
(
    a INTEGER NOT NULL,
    b INTEGER NOT NULL,
    c INTEGER,
    PRIMARY KEY (a, b)
);
Now if I do this:
SELECT a,b,c FROM t GROUP BY a;
I expect to get each distinct value of a only once. But since I'm asking for b and c as well, it's going to give me a single row for every value of a. Therefore, if there are many rows to choose from for a single value of a, how can I predict which row SQL will choose? My tests show that it chooses to return the row for which b is the greatest. But what is the logic in that? And how would this apply to strings, blobs, dates or anything else?
My question: How does SQL choose which row to show when grouping multiple rows together?
btw: My particular problem concerns SQLite3, but I'm guessing this is an SQL issue, not dependent on the DBMS...

That shouldn't actually work in a decent DBMS :-)
Any column not used in the group by clause should be subject to an aggregation function, such as:
select a, max(b), sum(c) from t group by a
If it doesn't complain in SQLite (and I have no immediate reason to doubt you), I'd just put it down to the way the DBMS is built. From memory, there are a few areas where it doesn't worry too much about the "purity" of the data (such as every column being able to hold multiple types, the type belonging to the data at that row/column intersection rather than to the column specification).
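For instance, here is a minimal sketch of that looseness (the table name is made up): in SQLite, a declared column type is only an "affinity", not a constraint, so one column can end up holding values of different types:
CREATE TABLE loose (v INTEGER);
INSERT INTO loose VALUES (1);        -- coerced and stored as an integer
INSERT INTO loose VALUES ('hello');  -- accepted anyway, stored as text
SELECT v, typeof(v) FROM loose;      -- returns 1|integer, then hello|text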

All the SQL engines that I know of will complain about the query you mentioned, with an error message like "b and c appear in the field list but not in the group by list". You are only allowed to use b or c in an aggregate function (like MAX / MIN / COUNT / AVG, whatever), or you'll be forced to add them to the GROUP BY list.

You're not quite correct in your assumption that this is RDBMS-independent. Most RDBMS don't allow you to select fields that are not also in the GROUP BY clause. Exceptions to this (to my knowledge) are SQLite and MySQL. In general, you shouldn't do this, because the values for b and c are chosen pretty arbitrarily (depending on the applied grouping algorithm). Even if this behavior is documented for your database, it's always better to express a query in a way that fully and unambiguously specifies the outcome.
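For example, if the row you actually want per a is "the one with the greatest b", one unambiguous way to say so is a join against the grouped maximum (a sketch using the table from the question; since (a, b) is the primary key, this returns exactly one row per a, in SQLite as well):
SELECT t.a, t.b, t.c
FROM t
JOIN (SELECT a, MAX(b) AS max_b
      FROM t
      GROUP BY a) m ON m.a = t.a AND m.max_b = t.b;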

It's not a matter of which row the database will choose, but of the order in which your data are returned.
Your primary key is determining your sort order by default, since you didn't provide one.
You can use ORDER BY a, c if that's what you want.

Related

How to order alphabetically in SQL with two columns, but where either column may be empty

I'm using SQL Server, and I have a table with two columns, both varchar, column A and column B. I need to produce a list in alphabetical order; however, only one of these columns will ever have a value in it (i.e., if column A has a value, then column B will be NULL, and vice versa).
How can I write an ORDER BY clause in my T-SQL query to produce a list that checks both columns to see which one has the value present, then order the rows alphabetically?
Use COALESCE, which returns the first non-null argument:
order by coalesce(columnA, columnB) asc
There are some standard options to do this. Which one you choose is mostly personal "taste". The most "explicit" way is using CASE WHEN:
ORDER BY CASE WHEN columnA IS NULL THEN columnB ELSE columnA END;
By explicit, I mean you clearly understand it without knowing about specific functions that check this.
The standard function to do this, which works on every DB, is COALESCE:
ORDER BY COALESCE(columnA,columnB);
This has the advantage of being much shorter, especially when you have more columns that should replace each other when NULL.
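For instance, with a hypothetical third column the fallback chain simply grows:
ORDER BY COALESCE(columnA, columnB, columnC);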
SQL Server furthermore provides the function ISNULL, which expects exactly two arguments:
ORDER BY ISNULL(columnA,columnB);
The advantage of this is that the name tells a bit more than "COALESCE", and it is faster than the other two options according to some performance articles and tests. The disadvantage is that this function will not work on other DBs.
Overall, as I said, it's mainly a matter of personal taste which option you take.

Is SELECT DISTINCT ON (col) * valid?

SELECT DISTINCT ON (some_col)
*
FROM my_table
I'm wondering if this is valid and will work as expected. Meaning, will this return all columns from my_table, based on distinct some_col? I've read the Postgres docs and don't see any reason why this wouldn't work as expected, but have read old comments here on SO which state that columns need to be explicitly listed when using distinct on.
I do know it's best practice to explicitly list columns, and also to use order by when doing the above.
Background that you probably don't need or care about
For background, and the reason I ask: we are migrating from MySQL to Postgres. MySQL has a very non-standards-compliant "trick" that allows SELECT * ... GROUP BY, making it easy to select * based on a GROUP BY. Previous answers and comments about migrating this trick to Postgres are murky at best.
SELECT DISTINCT ON (some_col) *
FROM my_table;
I'm wondering if this is valid
Yes. Typically, you want an ORDER BY to go with it, to determine which row to pick from each set of peers. But choosing an arbitrary row (without ORDER BY) is a valid (and sometimes useful!) application. You just need to know what you are doing. Maybe add a comment for posterity?
See:
Select first row in each GROUP BY group?
will this return all columns from my_table, based on distinct some_col?
It will return all columns. One arbitrary row per distinct value of some_col.
Note how I used the word "arbitrary", not "random". Returned rows are not chosen randomly at all. Just arbitrarily, depending on current implementation details. Typically the physically first row per distinct value, but that depends.
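If you want the choice to be deterministic instead, add an ORDER BY whose leading expressions match the DISTINCT ON expressions; the tie-break column (created_at) is made up for this sketch:
SELECT DISTINCT ON (some_col) *
FROM my_table
ORDER BY some_col, created_at DESC;  -- the newest row per some_col value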
I do know it's best practice to explicitly list columns.
That really depends. Often it is. Sometimes it is not. Like when I want to get all columns to match a given row type.

How to avoid exponential time cost with SQL Server SUM function?

I realized my query takes exponentially longer every time I use the SUM function...
For Example, the below code takes 2 seconds
SELECT sub.a, SUM(sub.b)
FROM (
    SELECT a, b, c
    FROM temp
) sub
GROUP BY a;
And using a second SUM it now takes 4 seconds, and so on...
SELECT sub.a, SUM(sub.b), SUM(sub.c)
FROM (
    SELECT a, b, c
    FROM temp
) sub
GROUP BY a;
It seems the subquery is being executed again for every SUM I add. Is this correct, and what would be the best practice to avoid the time cost?
The example above just represents the question in its most basic form.
TL;DR: No, this is completely wrong.
When you run a query in SQL Server, the optimizer compiles it into the most efficient method it can find. You can see the result by clicking Include Actual Execution Plan in SSMS.
For the query you specify, it would typically do something like this:
It notes that the subquery can be inlined into the query, and does so:
SELECT sub.a, SUM(sub.b), SUM(sub.c)
FROM temp
GROUP BY a;
It then evaluates the best way to aggregate the table by a values. Let's assume there is no index at all; a Hash Aggregate would most likely be chosen here.
On execution, every row is fed into the Hash Aggregate, which builds up an in-memory hash table with a values as the key. Each row is looked up based on a; a key is added to the hash table if it hasn't been seen before, and the row's b and c values are added to the running totals for that key.
Let's say you have an index on (a, b, c). Now a much faster method is possible, called a Stream Aggregate, because the values pass through the aggregate already sorted by a.
Each row passes through the aggregate. If the a value is the same as the row before, its b and c values are added to whatever we have so far. When the a value changes, the existing result is output, and we start aggregating again.
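A sketch of such an index in SQL Server (the index names are made up); either form keeps the rows sorted by a so the Stream Aggregate can be used:
CREATE INDEX ix_temp_abc ON temp (a, b, c);
-- or, keying on a only and carrying b and c along as included columns:
CREATE INDEX ix_temp_a ON temp (a) INCLUDE (b, c);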
It is true that summing extra columns is extra overhead, but it's pretty small compared to reading the table off disk or hashing, which is only done once for the whole query.

Does BigQuery guarantee column order when performing SELECT * query from subselect?

Does BigQuery guarantee column order when performing SELECT * FROM subquery?
For example, given table t with columns a, b, c, d if I'd execute a query:
SELECT * EXCEPT (a) FROM (SELECT a, b, c, d FROM t)
Will the columns in the result always have the same order (b, c, d in this case)?
I am going to answer "yes"; however, it is not documented to do so. The only guarantee (according to the documentation) is:
SELECT *, often referred to as select star, produces one output column for each column that is visible after executing the full query.
This does NOT explicitly state that the columns are in the order they are defined -- i.e. first by the order of references in the FROM clause and then by the ordering within each reference. I'm not even sure whether the SQL standard specifies the ordering (although I have a vague recollection that the SQL-92 standard might have).
That said, I have never seen any database not produce the columns in the specified order. Plus, databases (in general) have to keep the ordering of the columns to support INSERT. You can see it yourself in INFORMATION_SCHEMA.COLUMNS.ORDINAL_POSITION. And the REPLACE functionality (to replace a column expression in the SELECT list) would make less sense if the ordering were not guaranteed.
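For instance, a quick way to inspect the declared order (the table name is from the example; note that in BigQuery the view must be qualified with a dataset):
SELECT column_name, ordinal_position
FROM INFORMATION_SCHEMA.COLUMNS  -- in BigQuery: <dataset>.INFORMATION_SCHEMA.COLUMNS
WHERE table_name = 't'
ORDER BY ordinal_position;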
I might argue that you are "pretty safe" assuming that the ordering is the same. And that we should lobby Google to fix the documentation to make this clear.

Query applying Where clause BEFORE Joins?

I'm honestly really confused here, so I'll try to keep it simple.
We have Table A:
id
Table B:
id || number
Table A is a "prefilter" to B, since B contains a lot of different objects, including A.
So my query, trying to get all A's with a filter:
SELECT * FROM A a
JOIN B b ON b.id = a.id
WHERE CAST(SUBSTRING(b.number, 2, 30) AS integer) between 151843 and 151865
Since ALL instances of A start with a letter ("X******"), I just want to strip off the first letter and let the filter do its work with the number specified by the user.
At first glance, there should be absolutely no worries. But it seems I was wrong. And on something I didn't expect to be...
It seems like my WHERE clause is executed BEFORE my JOIN. Therefore, since many B's have numbers with more than one letter at the start, I get an invalid conversion, despite the fact that it would NEVER happen if we stayed within A's.
I always thought the WHERE clause was executed after joins, but in this case, it seems Postgres wants to prove me wrong.
Any explanations?
SQLFiddle demonstrating problem: http://sqlfiddle.com/#!15/cd7e6e/7
And even with a subquery, it still produces the same error...
You can use the regex substring function to remove everything but digits: CAST(substring(B.number from '\d+') AS integer).
See working example here: http://sqlfiddle.com/#!15/cd7e6e/18
SQL is a declarative language. For a SELECT statement, you declare the criteria that the data you are looking for must meet. You don't get to choose the execution path; your query isn't executed procedurally.
Thus the optimizer is free to choose any execution plan it likes, as long as it returns records specified by your criteria.
I suggest you change your query to cast to string instead of to integer. Something like:
WHERE SUBSTRING(b.number, 2, 30) between CAST(151843 AS varchar) and CAST(151865 AS varchar)
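Another option (a sketch; the regex assumes exactly one leading letter followed by digits) is to make the cast itself safe for every row, so it no longer matters when the optimizer evaluates the filter:
WHERE CASE WHEN b.number ~ '^[A-Za-z][0-9]+$'
           THEN CAST(SUBSTRING(b.number, 2, 30) AS integer)
      END BETWEEN 151843 AND 151865
-- rows that don't match the pattern yield NULL, which simply fails the BETWEEN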
Do the records of A that are in B have the same id in table B as in A? If those records were inserted in a different order, this may not be the case, and the query may therefore return different records than expected.