Do multiple aggregations in a query always have the same order? - sql

I'm asking about PostgreSQL specifically, but answers for other popular SQL implementations are appreciated as well.
Given an SQL query with multiple aggregates, especially array_agg, is the order of the aggregated values deterministic?
Example:
SELECT ARRAY_AGG(columnA), ARRAY_AGG(columnB) FROM myTable
GROUP BY columnC
Can I rely on both arrays to have the same order, meaning values at position i in both arrays will belong to the same source row?
I can't find anything about this in the docs and I'm unsure because I've read that parallelization could be used in calculating aggregates, which I'm afraid could possibly result in non-deterministic orders.

Without an ORDER BY, the order is never guaranteed.
So if you need a specific order, specify it:
SELECT ARRAY_AGG(columnA order by some_sort_column),
ARRAY_AGG(columnB order by some_sort_column)
FROM myTable
GROUP BY columnC

I'm not sure why the other answers misread the question: they quote the documentation and give a correct answer, but to a different question.
Yes, both aggregate functions are fed the same row at the same time as the executor walks through each group, so element i of one array and element i of the other come from the same source row.
If you don't believe or trust that, you can do something better than sorting:
SELECT ARRAY_AGG(CONCAT(myTable.ctid, ':', columnA)),
ARRAY_AGG(CONCAT(myTable.ctid, ':', columnB))
FROM myTable
GROUP BY columnC
Now you'll see that each element carries the same ctid in both result arrays, simply because both aggregates are called at the same time and there is no way for PG to do anything else.
Assuming that is not enough to convince you, then you have no other choice but to refactor your original query like this:
SELECT ARRAY_AGG(ARRAY[columnA, columnB])
FROM myTable
GROUP BY columnC
Now you have what you want: columnA and columnB together!
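The idea of aggregating the two columns as one combined value can be tried out without a PostgreSQL server. The sketch below is only an illustration: it uses Python's built-in sqlite3 module, and because SQLite has no ARRAY_AGG it substitutes GROUP_CONCAT over a concatenated 'columnA:columnB' string. The sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE myTable (columnA TEXT, columnB TEXT, columnC TEXT)")
# Made-up sample rows, for illustration only
conn.executemany(
    "INSERT INTO myTable VALUES (?, ?, ?)",
    [("a1", "b1", "x"), ("a2", "b2", "x"), ("a3", "b3", "y")],
)

# A single aggregate over a combined value: every element carries both
# columns of its source row, so the pairing cannot drift apart.
rows = conn.execute(
    "SELECT columnC, GROUP_CONCAT(columnA || ':' || columnB, ',') "
    "FROM myTable GROUP BY columnC"
).fetchall()

# Split the combined strings back into (columnA, columnB) pairs.
# The pairs are sorted because the order inside each group is still unspecified.
pairs_by_group = {
    group: sorted(tuple(item.split(":")) for item in joined.split(","))
    for group, joined in rows
}
print(pairs_by_group)
```

Whatever order the rows arrive in, each pair stays intact, which is the property the question asks for.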

Related

Why don't DISTINCT and ORDER BY work together in an SQL query?

I am learning how ORDER BY is used in SQL queries. I learned that ORDER BY and DISTINCT don't work together, but when I tried it in practice, it worked. I am still confused about the relationship between ORDER BY and DISTINCT, even after asking ChatGPT.
I learned that when an SQL query is executed, the ORDER BY clause is processed after the SELECT clause: the database first retrieves the data specified in the SELECT clause and then sorts it by the criteria in the ORDER BY clause. If the column used in ORDER BY is not present in the SELECT clause, the database automatically includes that column, sorts on it, and returns only the columns listed in SELECT.
However, when using both DISTINCT and ORDER BY together, the outcome may not be what is expected. This is because DISTINCT acts on both the columns in the SELECT clause and the column in the ORDER BY clause. This may cause unexpected results, especially in MySQL.
I found that when I tried this in practice, it still produced the desired results, which makes me question if I learned something incorrectly or if there is missing information that I am unaware of.
I am using the MySQL database.
It seems there are some spaces before your unique value. It would be appropriate to apply TRIM/RTRIM and remove these spaces in the DISTINCT clause itself.
It should be something like this:
DISTINCT TRIM(value) AS trim_value
...
ORDER BY trim_value
Also, it is possible that these are not spaces but some other characters, which would need to be replaced too.

Is SELECT DISTINCT ON (col) * valid?

SELECT DISTINCT ON (some_col)
*
FROM my_table
I'm wondering if this is valid and will work as expected. Meaning, will this return all columns from my_table, based on distinct some_col? I've read the Postgres docs and don't see any reason why this wouldn't work as expected, but have read old comments here on SO which state that columns need to be explicitly listed when using distinct on.
I do know it's best practice to explicitly list columns, and also to use order by when doing the above.
Background that you probably don't need or care about
For background, the reason I ask is that we are migrating from MySQL to Postgres. MySQL has a very non-standards-compliant "trick": SELECT * ... GROUP BY, which lets one easily select * based on a group by. Previous answers and comments about migrating this trick to Postgres are murky at best.
SELECT DISTINCT ON (some_col) *
FROM my_table;
I'm wondering if this is valid
Yes. Typically, you want ORDER BY to go with it to determine which row to pick from each set of peers. But choosing an arbitrary row (without ORDER BY) is a valid (and sometimes useful!) application. You just need to know what you are doing. Maybe add a comment for the afterworld?
See:
Select first row in each GROUP BY group?
will this return all columns from my_table, based on distinct some_col?
It will return all columns. One arbitrary row per distinct value of some_col.
Note how I used the word "arbitrary", not "random". Returned rows are not chosen randomly at all. Just arbitrarily, depending on current implementation details. Typically the physically first row per distinct value, but that depends.
I do know it's best practice to explicitly list columns.
That really depends. Often it is. Sometimes it is not. Like when I want to get all columns to match a given row type.
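To see what "one row per distinct value, picked by an explicit rule" looks like, here is a small sketch. It is not DISTINCT ON itself: it runs on Python's built-in sqlite3 module, which has no DISTINCT ON, so a correlated subquery stands in for the PostgreSQL form shown in the comment. The id and payload columns and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# id and payload are hypothetical columns added for the illustration
conn.execute("CREATE TABLE my_table (id INTEGER, some_col TEXT, payload TEXT)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?, ?)",
    [(1, "a", "first a"), (2, "a", "second a"), (3, "b", "only b")],
)

# PostgreSQL form with a deterministic pick:
#   SELECT DISTINCT ON (some_col) * FROM my_table ORDER BY some_col, id;
# SQLite stand-in: keep the row with the smallest id for each some_col.
picked = conn.execute(
    "SELECT * FROM my_table AS t "
    "WHERE id = (SELECT MIN(id) FROM my_table WHERE some_col = t.some_col) "
    "ORDER BY some_col"
).fetchall()
print(picked)
```

All columns come back, one row per distinct some_col, and the tie-breaker on id makes the choice repeatable instead of arbitrary.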

SQL Server order by expression

While watching Troy Hunt's fantastic course on SQLi, I've noticed that he ends up using this strategy to see if a table has a specific column:
select * from TableA order by (select top 1 some_column from TableB) desc
This expression will get executed by SQL Server, but what will it do for the order by clause? I've seen expressions being used with order by before (case when then else end), but I'm really curious to understand how SQL can process the previous query without any errors...
EDIT: Giving more info because it seems like my initial post was not clear enough:
I know this is not the best strategy for getting table or column names through SQLi (that's not what I'm asking)
I'm not interested in knowing how to protect against this (I know how to do that already)
I know that sorting by a constant value doesn't make sense (though it allows you to run these types of "boolean queries")
What I really want to know is why it works.
So, going back to the docs, the order by clause expects an order_by_expression, which is described as:
Specifies a column or expression on which to sort the query result set. A sort column can be specified as a name or column alias, or a nonnegative integer representing the position of the column in the select list.
According to the docs, an expression is:
Is a combination of symbols and operators that the SQL Server Database Engine evaluates to obtain a single data value. Simple expressions can be a single constant, variable, column, or scalar function. Operators can be used to join two or more simple expressions into a complex expression.
As @SMor demonstrated, the query does run if you replace the order by select expression with a simple select 'A':
select * from TableA order by (select 'A') desc
But this does not work:
select * from TableA order by 'A' desc
So, the question is: why is select 'A' accepted by SQL Server in the order by clause? Doesn't it produce a constant too? Since a constant is an expression, and taking into account the definition of the order by clause, shouldn't it throw an error in both cases?
Thanks.
The use of (select top 1 some_column from TableB) is an example of a scalar subquery. This is a subquery that returns exactly one column and at most one row. It can be used anywhere a literal value can be used -- and perhaps in some other places as well. Apparently, it can be used in an order by, even though SQL Server does not allow a literal value for order by.
The most common type of scalar subquery is a correlated subquery, which has a where clause that connects the subquery to the outer query. This is not an example of a correlated subquery.
In fact, this is not an example of anything useful as far as I can tell. It has one major shortcoming, which is the use of top without order by. The value returned by the subquery is indeterminate. That seems like a bad practice, and particularly bad if you are trying to teach people SQL.
And, it is probably going to be evaluated once. So the subquery would return a constant value and would not contribute much to a meaningful ordering.

Most efficient way of getting mariaDB/SQL record count

A question on a simple SQL statement, but one which I sometimes wonder about. Thought I'd see if anyone knew the answer.
When counting the records in a table using a simple SQL statement, which has the least overhead:
1) SELECT COUNT(single_primary_field) FROM table, e.g. SELECT COUNT(user_ID) FROM users;
2) SELECT COUNT(*) FROM table
I initially thought the first may be quickest. But perhaps not having a specific field to associate with makes the second quicker?
Probably makes very little difference speed wise either way.
Thanks
COUNT(column) counts only the rows where that column is not NULL.
COUNT(*) counts rows and doesn't care about the values in the columns.
Using COUNT(*) is the better way to count rows.
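The difference only shows up when the counted column can contain NULL. A quick sketch, using Python's built-in sqlite3 module rather than MariaDB so it runs anywhere (the NULL semantics of COUNT are the same standard behaviour); the email column and the sample rows are invented for the illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# email is a hypothetical nullable column; user_ID is the primary key
conn.execute("CREATE TABLE users (user_ID INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(1, "a@example.com"), (2, None), (3, "c@example.com")],
)

count_star, count_pk, count_email = conn.execute(
    "SELECT COUNT(*), COUNT(user_ID), COUNT(email) FROM users"
).fetchone()

# COUNT(*) and COUNT(user_ID) agree because a primary key is never NULL;
# COUNT(email) skips the row whose email is NULL.
print(count_star, count_pk, count_email)
```

So on a primary key column the two forms in the question return the same number; they only diverge on nullable columns.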
According to the MySQL documentation for COUNT, COUNT(*) is the most efficient way to count.
Have a read through https://mariadb.com/kb/en/library/explain/ to see which index your query is using; it usually hints at its performance.
I think COUNT(*) is going to be the fastest, because MariaDB stores a running row count for some storage engines (MyISAM, for example, but not InnoDB).

SQL Query: Which one should I use? count("columnname") or count(1)

In my SQL query I just need to check whether data exists for a particular userid.
I only ever want a single row to be returned when data exists.
I have two options
1. select count(columnname) from tablename where userid=:userid
2. select count(1) from tablename where userid=:userid
I am thinking second one is the one I should use because it may have a better response time as compared with first one.
There can be differences between count(*) and count(column). count(*) is often fastest for reasons discussed here. Basically, with count(column) the database has to check whether column is null or not in each row. With count(*) it just returns the total number of rows in the table, which it probably has on hand. The exact details may depend on the database and the version of the database.
Short answer: use count(*) or count(1). Hell, forget the count and select userid.
You should also make sure the where clause is performing well and that it's using an index. Look into EXPLAIN.
I'd like to point out that this:
select count(*) from tablename where userid=:userid
has the same effect as your second solution, with the advantage that count(*) unambiguously means "count all rows".
The * in COUNT(*) will not expand into all columns - that is to say, the * in SELECT COUNT(*) is not the same as in SELECT *. So you need not worry about performance when writing COUNT(*).
The disadvantage of writing COUNT(1) is that it is less clear: what did you mean? A literal one (1) may look like a lower case L (this: l) in some fonts.
The two will give different results if columnname can be NULL; otherwise performance is identical.
The optimiser (SQL Server at least) realises COUNT(1) is trivial. You can also use COUNT(1/0)
It depends what you want to do.
The first one counts rows with non-null values of columnname. The second one counts ALL rows.
Which behaviour do you want? From the way your question is worded, I guess that you want the second one.
To count the number of records you should use the second option, or rather:
select count(*) from tablename where userid=:userid
You could also use the exists() function:
select case when exists(select * from tablename where userid=:userid) then 1 else 0 end
It might be possible for the database to do the latter more efficiently in some cases, as it can stop looking as soon as a match is found instead of comparing all records.
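A minimal sketch of the existence check, again on Python's built-in sqlite3 module so that it is self-contained; the payload column, the sample rows and the user_has_data helper are made up for the illustration, and your database's syntax for the same check may differ slightly:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# payload is a hypothetical column; the rows are sample data
conn.execute("CREATE TABLE tablename (userid INTEGER, payload TEXT)")
conn.executemany(
    "INSERT INTO tablename VALUES (?, ?)",
    [(42, "first"), (42, "second"), (99, "other")],
)


def user_has_data(conn, userid):
    """Return True if at least one row exists for userid (hypothetical helper)."""
    row = conn.execute(
        "SELECT EXISTS(SELECT 1 FROM tablename WHERE userid = :userid)",
        {"userid": userid},
    ).fetchone()
    return bool(row[0])


# count(*) and count(1) count the same rows; EXISTS only answers yes or no.
count_star, count_one = conn.execute(
    "SELECT COUNT(*), COUNT(1) FROM tablename WHERE userid = :userid",
    {"userid": 42},
).fetchone()

print(user_has_data(conn, 42), user_has_data(conn, 7), count_star, count_one)
```

The EXISTS form always returns exactly one row with a 0 or 1, which matches the requirement of getting back a single row.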
Hey how about Select count(userid) from tablename where userid=:userid ? That way the query looks more friendly.