I have the following problem:
What happens if the query didn't ask for B in the SELECT? I think it would give an error, because the aggregate is computed from the values in the SELECT clause.
I have the following relation schema and queries:
Suppose R(A,B) is a relation with a single tuple (NULL, NULL).
SELECT A, COUNT(B)
FROM R
GROUP BY A;
SELECT A, COUNT(*)
FROM R
GROUP BY A;
SELECT A, SUM(B)
FROM R
GROUP BY A;
The first query returns NULL and 0. I am not sure what the second query returns. The aggregate COUNT(*) counts the number of tuples in a table; however, I don't know what it does to a group. The third returns NULL, NULL.
The only rule about SELECT and GROUP BY is that the unaggregated columns in the SELECT must be in the GROUP BY (with very specific exceptions).
You can have columns in the GROUP BY that never appear in the SELECT. That is fine. It doesn't affect the definition of a group, but multiple result rows may then appear to have the same values, because the column that distinguishes them is not shown.
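The three queries from the question can be checked directly. The following is a minimal sketch that assumes SQLite (through Python's standard sqlite3 module) as a stand-in engine; the function name run_null_group_queries and the fourth query, which groups by A without selecting it, are illustrative additions and not part of the original question.

```python
import sqlite3


def run_null_group_queries():
    """Run the question's queries against R(A, B) = {(NULL, NULL)} in SQLite."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE R (A INTEGER, B INTEGER)")
    conn.execute("INSERT INTO R VALUES (NULL, NULL)")
    queries = {
        # COUNT(B) counts only the non-NULL values of B in each group.
        "COUNT(B)": "SELECT A, COUNT(B) FROM R GROUP BY A",
        # COUNT(*) counts the rows in each group, NULLs included.
        "COUNT(*)": "SELECT A, COUNT(*) FROM R GROUP BY A",
        # SUM over a group with no non-NULL values is NULL.
        "SUM(B)": "SELECT A, SUM(B) FROM R GROUP BY A",
        # Hypothetical extra query: the grouping column is not selected at all.
        "A not selected": "SELECT COUNT(*) FROM R GROUP BY A",
    }
    results = {label: conn.execute(sql).fetchall() for label, sql in queries.items()}
    conn.close()
    return results


if __name__ == "__main__":
    for label, rows in run_null_group_queries().items():
        print(label, rows)
```

Under these assumptions the NULL value of A forms a single group, so COUNT(*) returns 1 for it, and leaving the grouping column out of the SELECT does not raise an error.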
Related
Query:
SELECT DISTINCT ON (geom_line),gid
FROM edge_table;
I have an edge table that contains duplicates, and I want to remove the duplicate edges, keeping one of each. Is the syntax itself wrong?
The comma is the problem.
If you want geom_line included in the result, use
SELECT DISTINCT ON (geom_line) geom_line, gid FROM edge_table;
Else use
SELECT DISTINCT ON (geom_line) gid FROM edge_table;
But if your objective is just to remove duplicates, I'd say that you should use
SELECT DISTINCT geom_line, gid FROM edge_table;
DISTINCT guarantees uniqueness over the whole result set, while DISTINCT ON guarantees uniqueness over the expression in parentheses. If there are several rows where the expression in parentheses is identical, one of these rows is picked. If you have an ORDER BY clause, the first row will be picked.
DISTINCT a, b is the same as DISTINCT ON (a, b) a, b.
I am aware of select count(distinct a), but I recently came across select distinct count(a).
I'm not very sure if that is even valid.
If it is valid, could you give me sample code with sample data that explains the difference?
Hive doesn't allow the latter.
Any leads would be appreciated!
The query select count(distinct a) gives you the number of unique values in a.
The query select distinct count(a) gives you the list of unique counts of values in a. Without grouping, it is just one row with the total count.
See following example
create table t(a int);
insert into t values (1),(2),(3),(3);
select count(distinct a) from t;
select distinct count(a) from t
group by a;
It will give you 3 for first query and values 1 and 2 for second query.
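The example above can be reproduced as a small script. This is a sketch that assumes SQLite via Python's sqlite3 module rather than the engine the answer was written for; the function name compare_distinct_counts is made up for illustration, and the ungrouped variant is included to show the "one row with the total count" case.

```python
import sqlite3


def compare_distinct_counts():
    """Contrast COUNT(DISTINCT a) with SELECT DISTINCT COUNT(a) on t = {1, 2, 3, 3}."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (a INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,), (3,)])

    # Number of unique values in a.
    count_distinct = conn.execute("SELECT COUNT(DISTINCT a) FROM t").fetchone()[0]

    # Without GROUP BY there is a single count, so DISTINCT changes nothing.
    distinct_count = [row[0] for row in conn.execute("SELECT DISTINCT COUNT(a) FROM t")]

    # With GROUP BY, DISTINCT removes duplicate per-group counts.
    # Sorted here because the row order of the result is not guaranteed.
    distinct_count_grouped = sorted(
        row[0] for row in conn.execute("SELECT DISTINCT COUNT(a) FROM t GROUP BY a")
    )

    conn.close()
    return count_distinct, distinct_count, distinct_count_grouped


if __name__ == "__main__":
    print(compare_distinct_counts())  # (3, [4], [1, 2])
```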
I cannot think of any useful situation where you would want to use:
select distinct count(a)
If the query has no group by, then the distinct is anomalous. The query only returns one row anyway. If there is a group by, then the grouping columns should be in the select, to identify each row.
I mean, technically, with a group by, it would be answering the question: "which different counts of non-null values of a occur across the groups?". Usually, it is much more useful to know the value per group.
If you want to count the number of distinct values of a, then use count(distinct a).
I have a table with two columns a and b where a is an ID and b is a timestamp.
I need to select all of the a's which are distinct but I only care about the most up to date row per ID.
I.e. I need a way of selecting distinct a's conditional on the b values.
Is there a way to do this using DISTINCT ON in postgres?
Cheers
As @a_horse_with_no_name suggests, the solution is
SELECT DISTINCT ON (a) a, b FROM the_table ORDER BY a, b DESC
As the manual says,
Note that the "first row" of a set is unpredictable unless the query
is sorted on enough columns to guarantee a unique ordering of the rows
arriving at the DISTINCT filter. (DISTINCT ON processing occurs after
ORDER BY sorting.)
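DISTINCT ON is specific to PostgreSQL, so the following is only a rough model of its behaviour in plain Python, not the way the database implements it: sort the rows as the ORDER BY would, then keep the first row seen for each value of a. The function name latest_per_id and the sample rows are hypothetical, and the timestamps are assumed to be ISO-formatted strings so that they sort chronologically.

```python
def latest_per_id(rows):
    """Model of: SELECT DISTINCT ON (a) a, b FROM the_table ORDER BY a, b DESC."""
    # ORDER BY a, b DESC: sort by b descending first, then stably by a ascending.
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    ordered.sort(key=lambda row: row[0])

    # DISTINCT ON (a): keep only the first row encountered for each a.
    first_seen = {}
    for a, b in ordered:
        first_seen.setdefault(a, (a, b))
    return list(first_seen.values())


if __name__ == "__main__":
    sample = [
        (2, "2021-01-05"),
        (1, "2021-01-01"),
        (1, "2021-03-01"),
        (2, "2020-12-31"),
    ]
    print(latest_per_id(sample))  # [(1, '2021-03-01'), (2, '2021-01-05')]
```

If the sort on b were dropped, the row kept for each a would depend on the input order, which mirrors the manual's warning that the "first row" is unpredictable without a sufficient ORDER BY.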
As the upvoted answers say, SELECT DISTINCT ON (a) a, b FROM the_table ORDER BY a, b DESC works on PostgreSQL 12. However, I am posting this answer to highlight a few important points:
The results will be sorted based on column a, not column b.
For each value of a, the row with the most recent (highest) value of column b is picked.
If someone wants the most recent value of column b across the entire result set, they can run: SELECT MAX(b) FROM (SELECT DISTINCT ON (a) a, b FROM the_table ORDER BY a, b DESC) AS latest. Note that the subquery needs an alias.
I have a table with some "functionally duplicate" records - different IDs, but the 4 columns of "user data" (of even more columns) are identical. I've got a query working that will select all records that have such duplicates.
Now I want to select, from each group of duplicates, first any row that has column A not null (I've verified from the data that there is at most one such row per group), and if there is none in that particular group, then the row with the minimum ID.
How do I select that? I can't exactly use a non-aggregate in the THEN of a CASE and an aggregate in the ELSE. E.g. this doesn't work:
SELECT CASE
WHEN d.A IS NULL THEN d.ID
ELSE MIN(d.ID) END,
d.B,
d.C,
d.E,
d.F
FROM TABLE T
JOIN (my duplicate query here) D ON T.B=D.B
AND T.C=D.C
AND T.E=D.E
AND T.F=D.F
GROUP BY T.B,
T.C,
T.E,
T.F
Error being:
column A must appear in the GROUP BY clause or be used in an aggregate function.
This can be radically simpler:
SELECT DISTINCT ON (b, c, e, f)
b, c, e, f, id -- add more columns freely
FROM (<duplicate query here>) sub
ORDER BY b, c, e, f, (a IS NULL), id
Your duplicate query has all columns. No need to JOIN to the base table again.
Use the Postgres extension of the standard SQL DISTINCT: DISTINCT ON:
Select first row in each GROUP BY group?
Postgres has a proper boolean type. You can ORDER BY a boolean expression directly. The sequence is FALSE (0), TRUE (1), NULL (NULL). If a is not NULL, the expression (a IS NULL) is FALSE and sorts first, so a row with a non-null a is preferred. The rest is ordered by id. Voilà.
Selection of ID happens automatically. According to your description you want the ID of the row selected in this query. Nothing more to do.
You can probably integrate this into your duplicate query directly.
Okay, so this query should be easy, but I'm having a bit of difficulty. Let's say I have a table called 'foo' with columns 'a' and 'b'.
I'm trying to figure out the following in one query:
Select how many of column 'a' are available for each value of column 'b'; this is done with the following:
mysql> select count(a),b from foo GROUP BY b;
That's straightforward. But now I want to add a third output column to that query, showing count(a) divided by count(*) as a percentage. So if I have 100 rows in total and one of the GROUP BY results comes back with 20, the third column should output 20%, meaning that this group accounts for 20% of the aggregate pool.
Assuming you have > 0 rows in foo
SELECT count(a), b, (count(a) / (SELECT count(*) FROM foo)) * 100
FROM foo
GROUP BY b
There is a risk of it running slowly; the best bet is to have your program perform two separate queries.
SELECT count(*) INTO @c FROM foo;
SELECT count(a), b, (count(a)/@c)*100 FROM foo GROUP BY b;
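A runnable sketch of the single-query variant follows, assuming SQLite via Python's sqlite3 module instead of MySQL. One difference matters here: in MySQL the / operator returns a decimal, whereas SQLite truncates integer division, so this sketch multiplies by 100.0 before dividing. The function name group_percentages and the sample data are hypothetical.

```python
import sqlite3


def group_percentages(rows):
    """Return (b, count(a), percentage of all rows) for each group of b."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE foo (a INTEGER, b TEXT)")
    conn.executemany("INSERT INTO foo VALUES (?, ?)", rows)
    sql = """
        SELECT b,
               COUNT(a),
               -- 100.0 forces floating-point arithmetic; plain integer
               -- division would truncate to 0 in SQLite.
               100.0 * COUNT(a) / (SELECT COUNT(*) FROM foo)
        FROM foo
        GROUP BY b
        ORDER BY b
    """
    result = conn.execute(sql).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    # Ten rows in total: two in group 'x' and eight in group 'y'.
    sample = [(i, "x" if i < 2 else "y") for i in range(10)]
    print(group_percentages(sample))  # [('x', 2, 20.0), ('y', 8, 80.0)]
```

As in the answer, this assumes foo has at least one row. On an empty table the GROUP BY produces no rows at all, so the division is never evaluated.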