SQL two criteria from one group-by

I have a table with some "functionally duplicate" records: the IDs differ, but four of the "user data" columns (out of many more) are identical. I have a working query that selects all records that have such duplicates.
Now, from each group of duplicates, I want to select the row where column A is not null (I've verified from the data that there is at most one such row per group) and, if the group has no such row, the row with the minimum ID.
How do I select that? I can't use a non-aggregate in the THEN of a CASE and an aggregate in the ELSE. E.g. this doesn't work:
SELECT CASE
WHEN d.A IS NULL THEN d.ID
ELSE MIN(d.ID) END,
d.B,
d.C,
d.E,
d.F
FROM TABLE T
JOIN (my duplicate query here) D ON T.B=D.B
AND T.C=D.C
AND T.E=D.E
AND T.F=D.F
GROUP BY T.B,
T.C,
T.E,
T.F
Error being:
column A must appear in the GROUP BY clause or be used in an aggregate function.

This can be radically simpler:
SELECT DISTINCT ON (b, c, e, f)
b, c, e, f, id -- add more columns freely
FROM (<duplicate query here>) sub
ORDER BY b, c, e, f, (a IS NULL), id
Your duplicate query has all columns. No need to JOIN to the base table again.
Use the Postgres extension of the standard SQL DISTINCT: DISTINCT ON:
Select first row in each GROUP BY group?
Postgres has a proper boolean type. You can ORDER BY a boolean expression directly. The sort order is FALSE (0), TRUE (1), NULL (NULL). If a is not NULL, the expression (a IS NULL) is FALSE and sorts first, so a row with a non-null a is preferred within its group. The rest is ordered by id. Voilà.
Selection of ID happens automatically. According to your description you want the ID of the row selected in this query. Nothing more to do.
You can probably integrate this into your duplicate query directly.
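To see the ordering trick in action, here is a minimal sketch for Postgres. The table dup_sample and its rows are made up for illustration and stand in for the output of the duplicate query; only the column names a, b, c, e, f, id come from the question.

```sql
-- Hypothetical sample data standing in for the duplicate query's output.
CREATE TEMP TABLE dup_sample (id int, a text, b int, c int, e int, f int);

INSERT INTO dup_sample (id, a, b, c, e, f) VALUES
  (1, NULL, 1, 1, 1, 1),
  (2, 'x',  1, 1, 1, 1),  -- the only row in group 1 with a non-null a
  (3, NULL, 2, 2, 2, 2),
  (4, NULL, 2, 2, 2, 2);  -- group 2 has no row with a non-null a

-- (a IS NULL) is FALSE for non-null a, and FALSE sorts before TRUE,
-- so a non-null row wins; ties fall back to the smallest id.
SELECT DISTINCT ON (b, c, e, f)
       b, c, e, f, id
FROM   dup_sample
ORDER  BY b, c, e, f, (a IS NULL), id;
-- group (1,1,1,1) yields id 2; group (2,2,2,2) yields id 3
```

Note that the leading ORDER BY expressions have to match the DISTINCT ON expressions; the extra sort keys after them decide which row of each group is kept.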

Related

Why doesn't my DISTINCT ON expression work?

Query:
SELECT DISTINCT ON (geom_line),gid
FROM edge_table;
I have an edge table that contains duplicates, and I want to remove the duplicate edges while keeping one of each. Is the syntax itself wrong?
The comma is the problem.
If you want geom_line included in the result, use
SELECT DISTINCT ON (geom_line) geom_line, gid FROM edge_table;
Else use
SELECT DISTINCT ON (geom_line) gid FROM edge_table;
But if your objective is just to remove duplicates, I'd say that you should use
SELECT DISTINCT geom_line, gid FROM edge_table;
DISTINCT guarantees uniqueness over the whole result set, while DISTINCT ON guarantees uniqueness over the expression in parentheses. If there are several rows where the expression in parentheses is identical, one of these rows is picked. If you have an ORDER BY clause, the first row will be picked.
DISTINCT a, b is the same as DISTINCT ON (a, b) a, b.
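The difference between the two forms is easiest to see on a small example. The sketch below assumes Postgres, uses a made-up table edge_sample, stores geom_line as plain text instead of a real geometry type, and assumes gid is unique per row.

```sql
-- Hypothetical sample: two rows share the same geom_line.
CREATE TEMP TABLE edge_sample (gid int, geom_line text);

INSERT INTO edge_sample (gid, geom_line) VALUES
  (1, 'A-B'),
  (2, 'A-B'),
  (3, 'B-C');

-- One row per geom_line; ORDER BY makes the kept row deterministic
-- (here: the smallest gid of each group).
SELECT DISTINCT ON (geom_line) geom_line, gid
FROM   edge_sample
ORDER  BY geom_line, gid;
-- returns ('A-B', 1) and ('B-C', 3)

-- Uniqueness over the whole row: every (geom_line, gid) pair is
-- already unique here, so all three rows come back.
SELECT DISTINCT geom_line, gid
FROM   edge_sample;
```

So if gid differs between the duplicate edges, plain DISTINCT over both columns does not collapse them, and DISTINCT ON (geom_line) is the form that keeps one row per edge.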

Oracle SQL query to GROUP BY and subtract?

If I have four columns: A, B, C and D in a table how would an Oracle SQL query group by column D, then amongst each grouping select the rows where C = 'c' and for those selected rows, returns the value of B minus A?
SELECT Aggfunction(B - A), D FROM TABLENAME WHERE C='c' GROUP BY D
Replace Aggfunction with the aggregate function you want, e.g. SUM or AVG. In a grouped query, ungrouped columns can appear in the select list only inside an aggregate function. That makes sense: you get one row per group, so the ungrouped values have to be combined in some way into a single value per group.

SQL Basic Syntax

I have the following problem:
What happens if the query didn't ask for B in the select? I think it would give an error because the aggregate is computed based on the values in the select clause.
I have the following relation schema and queries:
Suppose R(A,B) is a relation with a single tuple (NULL, NULL).
SELECT A, COUNT(B)
FROM R
GROUP BY A;
SELECT A, COUNT(*)
FROM R
GROUP BY A;
SELECT A, SUM(B)
FROM R
GROUP BY A;
The first query returns NULL and 0. I am not sure what the second query returns. The aggregate COUNT(*) counts the number of tuples in a table; however, I don't know what it does to a group. The third returns NULL, NULL.
The only rule about SELECT and GROUP BY is that the unaggregated columns in the SELECT must be in the GROUP BY (with very specific exceptions).
You can have columns in the GROUP BY that never appear in the SELECT. That is fine. It doesn't affect the definition of a group, but several result rows may then seem to have the same values, because a column that distinguishes them is not shown.
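For the three queries in the question, the results can be checked directly. This is a small sketch that assumes both columns are integers (the question does not state the types); the behavior shown is standard SQL: GROUP BY puts NULLs into a single group, COUNT(B) skips NULLs, COUNT(*) counts rows, and SUM over only NULLs is NULL.

```sql
-- Assumed column types; the question only gives R(A, B).
CREATE TABLE R (A int, B int);
INSERT INTO R (A, B) VALUES (NULL, NULL);

SELECT A, COUNT(B) FROM R GROUP BY A;  -- one row: (NULL, 0)
SELECT A, COUNT(*) FROM R GROUP BY A;  -- one row: (NULL, 1)
SELECT A, SUM(B)   FROM R GROUP BY A;  -- one row: (NULL, NULL)
```

In other words, within a group COUNT(*) counts the rows of that group, so the second query returns NULL and 1.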

Postgres select distinct based on timestamp

I have a table with two columns a and b where a is an ID and b is a timestamp.
I need to select all of the a's which are distinct but I only care about the most up to date row per ID.
I.e. I need a way of selecting distinct a's conditional on the b values.
Is there a way to do this using DISTINCT ON in postgres?
Cheers
As @a_horse_with_no_name suggests, the solution is
SELECT DISTINCT ON (a) a, b FROM the_table ORDER BY a, b DESC
As the manual says,
Note that the "first row" of a set is unpredictable unless the query
is sorted on enough columns to guarantee a unique ordering of the rows
arriving at the DISTINCT filter. (DISTINCT ON processing occurs after
ORDER BY sorting.)
As the upvoted answers say, SELECT DISTINCT ON (a) a, b FROM the_table ORDER BY a, b DESC works on PostgreSQL 12. However, I am posting this answer to highlight a few important points:
The results are sorted by column a, not column b.
For each value of a, the row with the most recent (highest) value of b is picked.
If someone wants the most recent value of b across the entire result set, we can run: SELECT MAX(b) FROM (SELECT DISTINCT ON (a) a, b FROM the_table ORDER BY a, b DESC) AS latest. Note that PostgreSQL 12 requires an alias (here latest) on a subquery in FROM.
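A short sketch of the latest-row-per-ID pattern, including one way to get the final result sorted by b rather than a. The table reading_sample and its rows are hypothetical; the columns a and b follow the question.

```sql
-- Hypothetical sample data: a is an ID, b is a timestamp.
CREATE TEMP TABLE reading_sample (a int, b timestamp);

INSERT INTO reading_sample (a, b) VALUES
  (1, '2020-01-01 00:00:00'),
  (1, '2020-03-01 00:00:00'),
  (2, '2020-02-01 00:00:00');

-- Latest row per a: within each a, b DESC puts the newest row first.
SELECT DISTINCT ON (a) a, b
FROM   reading_sample
ORDER  BY a, b DESC;
-- returns (1, 2020-03-01 00:00:00) and (2, 2020-02-01 00:00:00)

-- To present those rows newest first, wrap the query and re-sort outside.
SELECT a, b
FROM  (SELECT DISTINCT ON (a) a, b
       FROM   reading_sample
       ORDER  BY a, b DESC) AS latest
ORDER  BY b DESC;
```

The inner ORDER BY has to start with a because of DISTINCT ON (a), which is why a different presentation order needs the outer query.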

MySQL GROUP BY, and testing the grouped items

I've got a query like this:
select a, b, c, group_concat(d separator ', ')
from t
group by a;
This seems to work just fine. As I understand it (forgive me, I'm a MySQL rookie!), it's returning rows of:
each unique a value
for each a value, one b and c value
also for each a value, all the d values, concatenated into one string
This is what I want, but I also want to check that for each a, the b and c are always the same, for all rows with that a value.
My first thought is to compare:
select count(*) from t group by a, b, c;
with:
select count(*) from t group by a;
and make sure they're equal. But I've not convinced myself that this is correct, and I certainly am not sure there isn't a better way. Is there a SQL/MySQL idiom for this?
Thanks!
The issue with relying on MySQL's Hidden Columns functionality is spelled out in the documentation:
When using this feature, all rows in each group should have the same values for the columns that are omitted from the GROUP BY part. The server is free to return any value from the group, so the results are indeterminate unless all values are the same.
Applied to your example, that means that the values for b and c are arbitrary: the results can't be relied upon to consistently return the same value, and the likelihood of seeing this behavior increases with the number of possible values that b/c can take. So there's not a lot of value in comparing GROUP BY a with GROUP BY a, b, c...
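One possible way to test the assumption directly, offered as a sketch rather than an established idiom, is to list every a whose group contains more than one distinct b or c. The table and column names are those from the question.

```sql
-- Groups of a in which b or c is not constant.
-- An empty result means b and c are consistent within every a.
SELECT a,
       COUNT(DISTINCT b) AS distinct_b,
       COUNT(DISTINCT c) AS distinct_c
FROM   t
GROUP  BY a
HAVING COUNT(DISTINCT b) > 1
    OR COUNT(DISTINCT c) > 1;
```

One caveat: COUNT(DISTINCT ...) ignores NULLs, so a group that mixes NULL with a single non-null value of b or c would not be reported by this check.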