MySQL GROUP BY, and testing the grouped items - sql

I've got a query like this:
select a, b, c, group_concat(d separator ', ')
from t
group by a;
This seems to work just fine. As I understand it (forgive me, I'm a MySQL rookie!), it's returning rows of:
each unique a value
for each a value, one b and c value
also for each a value, all the d values, concatenated into one string
This is what I want, but I also want to check that for each a, the b and c are always the same, for all rows with that a value.
My first thought is to compare:
select count(*) from t group by a, b, c;
with:
select count(*) from t group by a;
and make sure they're equal. But I've not convinced myself that this is correct, and I certainly am not sure there isn't a better way. Is there a SQL/MySQL idiom for this?
Thanks!

The issue with relying on MySQL's Hidden Columns functionality is spelled out in the documentation:
When using this feature, all rows in each group should have the same values for the columns that are omitted from the GROUP BY part. The server is free to return any value from the group, so the results are indeterminate unless all values are the same.
Applied to your example, that means the values returned for b and c are arbitrary -- they can't be relied upon to be the same from one run to the next, and the likelihood of seeing inconsistent results increases with the number of distinct values b and c take within a group. So there's not a lot of value in comparing GROUP BY a with GROUP BY a, b, c...
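One way to test the condition directly, rather than comparing two counts, is to ask for the a values whose group holds more than one distinct b or c. The sketch below is only an illustration: it uses Python's sqlite3 module so it can be run anywhere, the sample rows are made up, and the table t with columns a, b, c, d is taken from the question. The query itself is plain SQL and should behave the same way in MySQL. One caveat: COUNT(DISTINCT ...) ignores NULLs, so a group mixing NULL with a single non-NULL value would not be reported.

```python
import sqlite3


def inconsistent_groups(rows):
    """Return the a values whose rows disagree on b or on c."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (a, b, c, d)")
    conn.executemany("INSERT INTO t VALUES (?, ?, ?, ?)", rows)
    query = """
        SELECT a
        FROM t
        GROUP BY a
        HAVING COUNT(DISTINCT b) > 1 OR COUNT(DISTINCT c) > 1
        ORDER BY a
    """
    result = [row[0] for row in conn.execute(query)]
    conn.close()
    return result


# Made-up sample data: a = 1 is consistent, a = 2 has two different c values.
rows = [
    (1, "x", "y", "d1"),
    (1, "x", "y", "d2"),
    (2, "x", "y", "d3"),
    (2, "x", "z", "d4"),
]
print(inconsistent_groups(rows))
```

An empty result means every a has a single b and a single c, which is the case where the original GROUP BY a query is safe.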

Related

Can I (in a many-to-many relationship) select only those ids in column A that have a connection to all ids in column B?

I need to retrieve only those ids in "A" that have a connection to all ids in "B".
In the example below, the result should be '...fa3e' because '...65d6' does NOT have a reference to all ids in "B".
However, if '...fa3e' and '...65d6' reference the same ids in column B, then the query should return both '...fa3e' and '...65d6'.
And, subsequently, if a fifth row connected '...fa3e' with a completely new id in "B", then '...65d6' would be excluded because it would no longer hold a reference to all ids in column "B".
Is there a way to accomplish this in SQL Server?
I can't really come up with a good description/search term for what I'm trying to do ("Exclude column A based on values in column B" is not quite right). Hence I'm striking out looking for resources.
I believe these values reside in the same table.
For distinct a values only:
select a
from T
group by a
having count(distinct b) = (select count(distinct b) from T);
To return all the rows:
select * from T where a in (
select a from T group by a
having count(distinct b) = (select count(distinct b) from T)
);
If (a, b) pairs are always unique then you wouldn't need the distinct qualifier on the left-hand counts. In fact you could even use count(*) for that.
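To see the counting approach in action, here is a small sketch that runs the first query against an in-memory SQLite database from Python. The shortened ids and the b1..b3 values are invented for illustration (the question only shows '...fa3e' and '...65d6'), and SQLite stands in for SQL Server purely so the example is runnable; the SQL is the same.

```python
import sqlite3


def a_linked_to_every_b(pairs):
    """Return the a values that are paired with every distinct b in the table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE T (a TEXT, b TEXT)")
    conn.executemany("INSERT INTO T VALUES (?, ?)", pairs)
    query = """
        SELECT a
        FROM T
        GROUP BY a
        HAVING COUNT(DISTINCT b) = (SELECT COUNT(DISTINCT b) FROM T)
        ORDER BY a
    """
    result = [row[0] for row in conn.execute(query)]
    conn.close()
    return result


# Hypothetical data: fa3e references b1, b2 and b3; 65d6 references only b1.
pairs = [("fa3e", "b1"), ("fa3e", "b2"), ("fa3e", "b3"), ("65d6", "b1")]
print(a_linked_to_every_b(pairs))

# When both ids reference exactly the same set of b values, both are returned.
print(a_linked_to_every_b([("fa3e", "b1"), ("65d6", "b1")]))
```

The first call returns only 'fa3e'; the second returns both ids, which matches the behaviour the question asks for.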
This seems like it's going to be a terrible query, but at its most basic, you want
All A where B in...
All B that are fully distinct
In SQL, that looks like
select distinct A
from test
where B in (select B from test group by B having count(1) = 1);
Absolutely zero guarantees on performance, but this gives you the right value of A for the example data. If you want to see which A/B pair actually made the cut, it could be SELECT A, B FROM test... too.

Aggregating on a column that is also being grouped on

I know there's a lot of confusion related to grouping/aggregation etc, and I thought that I had a pretty decent grasp on the whole thing until I saw something along the lines of
SELECT A, SUM(B)
FROM T
GROUP BY A
HAVING COUNT(A)>1;
At first this puzzled me, since it seemed that performing an aggregate on a column that is also being grouped on is redundant: by definition the value for the group will be distinct. But then I thought about it, and it kind of makes sense for duplicate values in the table, if the aggregation was done before the grouping. In my head, it seems like it's treating it more like this kind of query
SELECT A, SUM(B)
FROM T
WHERE A in (SELECT A FROM T GROUP BY A HAVING COUNT(*)>1)
GROUP BY A;
As opposed to another selection operator on each group after the grouping is done (since to me that doesn't make much sense).
So my question is multifold: Can elements being grouped on be included in the HAVING clause at all? Can elements being grouped on be aggregated on (in the HAVING clause or elsewhere like SELECT clause)? If the previous statements hold, is my understanding of what this operation means correct?
NOTE: This question is mainly about standard (ANSI) SQL, but info on particular implementations would also be interesting.
The arguments to an aggregation function can include the keys being aggregated.
That said, the more common way to count rows in each group is to use COUNT(*). I would recommend:
SELECT A, SUM(B)
FROM T
GROUP BY A
HAVING COUNT(*) > 1;
There is a slight overhead to using COUNT(A) because the value of A needs to be checked against NULL in each row.
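The NULL check also means the two forms are not strictly interchangeable when A can be NULL: rows with a NULL A form their own group, COUNT(A) is 0 for that group, and COUNT(*) counts its rows. A rough sketch with made-up data, run through Python's sqlite3 module only so that it is easy to try:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T (A INTEGER, B INTEGER)")
# Made-up rows: two rows with A = 1, one with A = 2, two with A = NULL.
conn.executemany(
    "INSERT INTO T VALUES (?, ?)",
    [(1, 10), (1, 20), (2, 5), (None, 7), (None, 8)],
)


def sums_having(condition):
    """Run the grouped query with the given HAVING condition; return {A: SUM(B)}."""
    query = f"SELECT A, SUM(B) FROM T GROUP BY A HAVING {condition}"
    return dict(conn.execute(query).fetchall())


# COUNT(A) skips NULLs, so the NULL group has a count of 0 and is filtered out.
print(sums_having("COUNT(A) > 1"))
# COUNT(*) counts rows, so the NULL group (two rows) is kept.
print(sums_having("COUNT(*) > 1"))
```

If A is declared NOT NULL, both conditions select the same groups.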

Postgres select distinct based on timestamp

I have a table with two columns a and b where a is an ID and b is a timestamp.
I need to select all of the a's which are distinct but I only care about the most up to date row per ID.
I.e. I need a way of selecting distinct a's conditional on the b values.
Is there a way to do this using DISTINCT ON in postgres?
Cheers
Like @a_horse_with_no_name suggests, the solution is
SELECT DISTINCT ON (a) a, b FROM the_table ORDER BY a, b DESC
As the manual says,
Note that the "first row" of a set is unpredictable unless the query
is sorted on enough columns to guarantee a unique ordering of the rows
arriving at the DISTINCT filter. (DISTINCT ON processing occurs after
ORDER BY sorting.)
As posted by the upvoted answers, SELECT DISTINCT ON (a) a, b FROM the_table ORDER BY a, b DESC works on PostgreSQL 12. However, I am posting this answer to highlight a few important points:
The results will be sorted by column a, not column b.
For each value of a, the row with the most recent (highest) value of column b is picked.
In case someone wants the most recent value of column b across the entire result set, we can run: SELECT MAX(b) FROM (SELECT DISTINCT ON (a) a, b FROM the_table ORDER BY a, b DESC) AS latest. (PostgreSQL versions before 16 require the alias on the subquery.)
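For comparison, when the table really has only the two columns a and b, a plain GROUP BY with MAX(b) returns the same rows as the DISTINCT ON query. Note that this is a different technique, not DISTINCT ON itself: the sketch below runs on SQLite (via Python's sqlite3), which has no DISTINCT ON, and the timestamps are invented ISO-8601 strings, which compare correctly as text.

```python
import sqlite3


def latest_per_id(rows):
    """Return one (a, b) row per a, keeping the highest b for each a."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE the_table (a INTEGER, b TEXT)")
    conn.executemany("INSERT INTO the_table VALUES (?, ?)", rows)
    query = "SELECT a, MAX(b) AS b FROM the_table GROUP BY a ORDER BY a"
    result = conn.execute(query).fetchall()
    conn.close()
    return result


# Made-up sample rows: id 1 has two timestamps, id 2 has one.
rows = [
    (1, "2020-01-01 10:00:00"),
    (1, "2020-03-01 09:00:00"),
    (2, "2019-12-31 23:59:59"),
]
print(latest_per_id(rows))
```

Once further columns are needed from the most recent row, this shortcut no longer applies, and DISTINCT ON is the simpler tool in Postgres.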

SQL Syntax - Why do we need to list individual fields in an SQL group-by statement?

My understanding of using summary functions in SQL is that each field in the select statement that doesn't use a summary function, should be listed in the group by statement.
select a, b, c, sum(n) as sum_of_n
from table
group by a, b, c
My question is, why do we need to list the fields? Shouldn't the SQL syntax parser be implemented in a way that we can just tell it to group and it can figure out the groups based on whichever fields are in the select and aren't using summary functions?:
select a, b, c, sum(n) as sum_of_n
from table
group
I feel like I'm unnecessarily repeating myself when I write SQL code. What circumstances exist where we would not want it to automatically figure this out, or where it couldn't automatically figure this out?
To decrease the chances of errors in your statement. Explicitly spelling out the GROUP BY columns helps to ensure that the user wrote what they intended to write. You might be surprised at the number of posts that show up on Stack Overflow in which the user is grouping on columns that make no sense, but they have no idea why they aren't getting the data that they expect.
Also, consider the scenario where a user might want to group on more columns than are actually in the SELECT statement. For example, if I wanted the average of the most money that my customers have spent then I might write something like this:
SELECT
AVG(max_amt)
FROM (SELECT MAX(amt) AS max_amt FROM Invoices GROUP BY customer_id) SQ
In this case I can't simply use GROUP; I need to spell out the column(s) on which I'm grouping. The SQL engine could allow the user to explicitly list columns, but use a default if they are not listed, but then the chances of bugs drastically increase.
One way to think of it is like strongly typed programming languages. Making the programmer explicitly spell things out decreases the chance of bugs popping up because the engine made an assumption that the programmer didn't expect.
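Here is a runnable sketch of the Invoices example, using Python's sqlite3 module and made-up amounts. The inner query groups by customer_id even though that column never appears in any select list, which is exactly the case an implicit GROUP could not infer. The inner aggregate is aliased as max_amt so the outer query can refer to it.

```python
import sqlite3


def average_of_customer_maxima(invoices):
    """Average, over customers, of each customer's largest invoice amount."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Invoices (customer_id INTEGER, amt REAL)")
    conn.executemany("INSERT INTO Invoices VALUES (?, ?)", invoices)
    query = """
        SELECT AVG(max_amt)
        FROM (
            SELECT MAX(amt) AS max_amt
            FROM Invoices
            GROUP BY customer_id   -- grouping column is not selected anywhere
        ) AS SQ
    """
    (result,) = conn.execute(query).fetchone()
    conn.close()
    return result


# Made-up invoices: the per-customer maxima are 30, 20 and 7.
invoices = [(1, 10.0), (1, 30.0), (2, 20.0), (3, 5.0), (3, 7.0)]
print(average_of_customer_maxima(invoices))
```

With these rows the function returns 19.0, the mean of 30, 20 and 7.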
It is required so that you state explicitly how you want to group the records; for example, you may group by columns that are not listed in the result set.
However, some RDBMSs, such as MySQL, let you select columns that are not listed in the GROUP BY clause alongside aggregate functions.
My first reaction would be that 'it is what it is' =)
But on thinking it through, the reason TSQL works like this is because the SELECT and the GROUP BY are two distinct parts of all the operations going on in the query.
This might not be the best example, but it does show that you can GROUP on different (well, 'more') fields than you are actually SELECTing.
SELECT brand = Convert(varchar(100), ''), model = Convert(varchar(100), ''), some_number = Convert(int, 0)
INTO #test
WHERE 1 = 2
INSERT #test (brand, model, some_number)
VALUES ('Ford', 'Focus', 10),
('Ford', 'Focus', 25),
('Ford', 'Kagu', 23),
('DMC', '12', 88)
SELECT brand, model, MAX(some_number)
FROM #test
GROUP BY brand, model
SELECT brand, MAX(some_number)
FROM #test
GROUP BY brand, model
Not all RDBMSs are like this; e.g. MySQL allows omitting fields from the GROUP BY that are nevertheless in the SELECT part. From what I've seen, it then picks an arbitrary value ('there is no such thing as an implicit first') and uses that in the SELECT. At least I think so: my knowledge of MySQL is rather limited, but I've seen some examples here and there, and they always confused me as I'm used to the strict requirement of TSQL you just described.
In addition, you can group by your columns in a different order than select
select a, b, c, sum(d)
from table
group by c,a,b
Also, a lot of DBs allow you to skip the column names: you can specify the GROUP BY columns by their position in the select list
select a, b, c, sum(d)
from table
group by 3,1,2

Custom Sorting in SQL order by clause?

Here is the situation that I am trying to solve:
I have a query that could return a set of records. The field being sorted by could have a number of different values - for the sake of this question we will say that the value could be A, B, C, D, E or Z
Now depending on the results of the query, the sorting needs to behave as follows:
If only A-E records are found then sorting them "naturally" is okay. But if a Z record is in the results, then it needs to be the first result in the query, but the rest of the records should be in "natural" sort order.
For instance, if A C D are found, then the result should be
A
C
D
But if A B D E Z are found then the result should be sorted:
Z
A
B
D
E
Currently, the query looks like:
SELECT NAME, SOME_OTHER_FIELDS FROM TABLE ORDER BY NAME
I know I can code a sort function to do what I want, but because of how I am using the results, I can't seem to use it: the results are being handled by a third-party library, to which I am just passing the SQL query. It then processes the results, and there seem to be no hooks for me to sort the results and just pass them to the library. It needs to do the SQL query itself, and I have no access to the source code of the library.
So for all of you SQL gurus out there, can you provide a query for me that will do what I want?
How do you identify the Z record? What sets it apart? Once you understand that, add it to your ORDER BY clause.
SELECT name, *
FROM [table]
WHERE (x)
ORDER BY
(
CASE
WHEN (record matches Z) THEN 0
ELSE 1
END
),
name
This way, only the Z record will match the first ordering, and all other records will be sorted by the second-order sort (name). You can exclude the second-order sort if you really don't need it.
For example, if Z is the character string 'Bob', then your query might be:
SELECT name, *
FROM [table]
WHERE (x)
ORDER BY
(
CASE
WHEN name='Bob' THEN 0
ELSE 1
END
), name
My examples are for T-SQL, since you haven't mentioned which database you're using.
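The CASE expression is standard SQL, so the same idea can be tried outside T-SQL as well. Below is a small sketch using Python's sqlite3 module; the table name records is made up, and 'Z' plays the role of the special value from the question.

```python
import sqlite3


def sorted_names(names):
    """Return the names with 'Z' first (if present) and the rest in natural order."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE records (name TEXT)")
    conn.executemany("INSERT INTO records VALUES (?)", [(n,) for n in names])
    query = """
        SELECT name
        FROM records
        ORDER BY
            CASE WHEN name = 'Z' THEN 0 ELSE 1 END,  -- the Z record sorts first
            name                                     -- then natural order
    """
    result = [row[0] for row in conn.execute(query)]
    conn.close()
    return result


print(sorted_names(["D", "A", "C"]))
print(sorted_names(["E", "Z", "B", "A", "D"]))
```

The two calls reproduce the two orderings from the question: A, C, D and Z, A, B, D, E.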
There are a number of ways to solve this problem and the best solution depends on a number of factors that you don't discuss such as the nature of those A..Z values and what database product you're using.
If you have only a single value that has to sort on top, you can ORDER BY an expression that maps that value to the lowest possible sort value (with CASE or IIF or IFEQ, depending on your database).
If you have several different special sort values you could ORDER BY a more complicated expression or you could UNION together several SELECTs, with one SELECT for the default sorts and an extra SELECT for each special value. The SELECTs would include a sort column.
Finally, if you have quite a few values you can put the sort values into a separate table and JOIN that table into your query.
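As a rough sketch of the last option, the special values can live in a small lookup table that is LEFT JOINed into the query. Everything here is hypothetical: the tables records and sort_keys, the column sort_rank, and the fallback rank of 1000000 for names without an entry. It runs on SQLite through Python's sqlite3 module, but the JOIN and COALESCE are ordinary SQL.

```python
import sqlite3


def sorted_with_lookup(names, special_ranks):
    """Sort names so that those listed in special_ranks come first, by rank."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE records (name TEXT)")
    conn.execute("CREATE TABLE sort_keys (name TEXT PRIMARY KEY, sort_rank INTEGER)")
    conn.executemany("INSERT INTO records VALUES (?)", [(n,) for n in names])
    conn.executemany("INSERT INTO sort_keys VALUES (?, ?)", special_ranks.items())
    query = """
        SELECT r.name
        FROM records AS r
        LEFT JOIN sort_keys AS k ON k.name = r.name
        ORDER BY
            COALESCE(k.sort_rank, 1000000),  -- names without an entry sort last
            r.name
    """
    result = [row[0] for row in conn.execute(query)]
    conn.close()
    return result


# Z goes first, then E, then everything else in natural order.
print(sorted_with_lookup(["A", "B", "D", "E", "Z"], {"Z": 0, "E": 1}))
```

Changing the ordering then only requires editing rows in the lookup table, not the query text handed to the library.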
Not sure what DB you use - the following works for Oracle:
SELECT
NAME,
SOME_OTHER_FIELDS,
DECODE (NAME, 'Z', ' ', NAME ) SORTFIELD  -- a space sorts before letters in a binary sort, whereas '_' sorts after uppercase letters
FROM TABLE
ORDER BY DECODE (NAME, 'Z', ' ', NAME ) ASC