Why doesn't my DISTINCT ON expression work? - sql

Query:
SELECT DISTINCT ON (geom_line),gid
FROM edge_table;
I have an edge table that contains duplicates, and I want to remove the duplicate edges while keeping one of each. Is the syntax itself wrong?

The comma is the problem.
If you want geom_line included in the result, use
SELECT DISTINCT ON (geom_line) geom_line, gid FROM edge_table;
Else use
SELECT DISTINCT ON (geom_line) gid FROM edge_table;
But if your objective is just to remove duplicates, I'd say that you should use
SELECT DISTINCT geom_line, gid FROM edge_table;
DISTINCT guarantees uniqueness over the whole result row, while DISTINCT ON guarantees uniqueness over the expression in parentheses. If there are several rows where the expression in parentheses is identical, one of these rows is picked. If you have an ORDER BY clause, the first row in that order is picked; without one, which row you get is unpredictable.
DISTINCT a, b is the same as DISTINCT ON (a, b) a, b.
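To make the difference concrete, here is a minimal sketch using Python's built-in sqlite3 module. The sample rows and the use of a text column for geom_line are assumptions made for illustration. SQLite has no DISTINCT ON, so the second query uses GROUP BY with MIN(gid) as a stand-in for "one row per geom_line"; in PostgreSQL the corresponding query would be SELECT DISTINCT ON (geom_line) geom_line, gid FROM edge_table ORDER BY geom_line, gid.

```python
import sqlite3

# Hypothetical data: two rows share a geom_line but have different gid values.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edge_table (gid INTEGER, geom_line TEXT)")
conn.executemany(
    "INSERT INTO edge_table VALUES (?, ?)",
    [(1, "A-B"), (2, "A-B"), (3, "B-C")],
)

# DISTINCT compares whole rows, so rows that differ only in gid all survive.
distinct_rows = conn.execute(
    "SELECT DISTINCT geom_line, gid FROM edge_table ORDER BY gid"
).fetchall()

# One row per geom_line, keeping the smallest gid
# (GROUP BY used here as a stand-in for PostgreSQL's DISTINCT ON).
one_per_edge = conn.execute(
    "SELECT geom_line, MIN(gid) FROM edge_table "
    "GROUP BY geom_line ORDER BY geom_line"
).fetchall()

print(distinct_rows)  # three rows: the duplicate edge is still there
print(one_per_edge)   # two rows: one per geom_line
```

Note that plain DISTINCT only removes the duplicate edge if gid is left out of the select list or is itself duplicated.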

Related

Basic SQL question: displaying single results of multiple rows

I'm just beginning SQL, forgive me for my very basic questions. I have two. Here is the relevant code:
SELECT column_1, column_2, COUNT(*) AS total
FROM table1
This simple query shows a single row for all identical instances. The COUNT column will show there are multiple instances, but it will only show up as one row. Even without COUNT, there is only one row. Why only one row if I didn't use DISTINCT? Without DISTINCT it would seem that all identical rows would show up individually.
What does the * do in COUNT? I understand it usually indicates "all", but what does it change in this case?
Thank you.
https://dev.mysql.com/doc/refman/8.0/en/group-by-handling.html says:
Without GROUP BY, there is a single group and it is nondeterministic which name value to choose for the group.
That is, the use of any aggregating function such as COUNT() causes all the rows in the table to be treated as a single group. The result is one row for that group.
If it didn't work this way, there might not be any way to do an aggregation against the whole table.
You also asked about the purpose of the *. By default, if you use COUNT(<expression>) the rows where the expression is NULL are not counted. But COUNT(*) is a special syntax that always counts all the rows, so you don't have to think up an expression that is guaranteed to be non-NULL.
Some people use a constant value that is not NULL, e.g. COUNT(1), which achieves the same result. But it makes some people wonder if COUNT(2) would somehow count the rows differently (it doesn't). The standard SQL specification provides COUNT(*) as a special syntax to make this more clear.
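A small sketch of the counting rules described above, run against an in-memory SQLite database. The table contents are made up for illustration, and other engines should behave the same way for these particular expressions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (column_1 TEXT, column_2 TEXT)")
# Hypothetical rows; None is stored as SQL NULL.
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?)",
    [("x", "p"), ("x", None), (None, None)],
)

# An aggregate without GROUP BY treats the whole table as one group,
# so exactly one row comes back.
counts = conn.execute(
    "SELECT COUNT(*), COUNT(1), COUNT(2), COUNT(column_1), COUNT(column_2) "
    "FROM table1"
).fetchone()

print(counts)  # (3, 3, 3, 2, 1)
```

COUNT(*), COUNT(1) and COUNT(2) all count every row, while COUNT(column_1) and COUNT(column_2) skip the rows where that column is NULL.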
This query:
SELECT column_1, column_2, COUNT(*) AS total
FROM table1
violates a rule in SQL: when using aggregate functions (like COUNT), any other expression in the SELECT clause should either:
be an aggregate itself, or
appear in the GROUP BY clause.
There are some nuances to this rule (like functional dependency), but this shows that the above SQL is ambiguous: count(*) without group by will ensure you get only one record in your output, but then it is not clear what the values would be of those other two expressions? Which record would they be based on? Some database engines have allowed these constructs, and choose a record to base those column values on.
However, if you remove all aggregations, you should get all the rows of your table:
SELECT column_1, column_2
FROM table1
If you want to get a row for each distinct value of column_1, column_2, each with a count, then use group by
SELECT column_1, column_2, COUNT(*) AS total
FROM table1
GROUP BY column_1, column_2
The number of rows in the result will depend on how many distinct pairs of column_1, column_2 appear in your table.
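As a rough illustration of the grouped query, the following sketch runs it in SQLite on a few made-up rows (the data is an assumption, not taken from the question).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (column_1 TEXT, column_2 TEXT)")
# Hypothetical rows: the pair ("a", "x") appears twice.
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?)",
    [("a", "x"), ("a", "x"), ("a", "y"), ("b", "x")],
)

# No aggregate: every row is returned, duplicates included.
all_rows = conn.execute("SELECT column_1, column_2 FROM table1").fetchall()

# One output row per distinct (column_1, column_2) pair, with its count.
grouped = conn.execute(
    "SELECT column_1, column_2, COUNT(*) AS total FROM table1 "
    "GROUP BY column_1, column_2 ORDER BY column_1, column_2"
).fetchall()

print(len(all_rows))  # 4
print(grouped)        # [('a', 'x', 2), ('a', 'y', 1), ('b', 'x', 1)]
```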
As to the * in COUNT(*): the alternative would be to mention an expression, like COUNT(column3): in that case null values will not be counted. * thus can be understood to mean: count all records, even when they have null values.
In a select list, * is a shortcut for all the columns.
If you select * from table1 it will display all the columns without requiring you to write them all out.
COUNT() is an aggregate function. If you pass COUNT(column_1) it counts only the rows where column_1 is not NULL. COUNT(*) does not look at any particular column: it counts every row, including rows where some or all columns are NULL.
I'm thinking your SQL may work better with a group by clause at the end.
SELECT column_1, column_2, COUNT(*) AS total
FROM table1 GROUP BY column_1, column_2

Combine multiple rows with N-1 identical columns and 1 different column into one row, preserving the first N-1 columns and summing the last column

I have a query that produces a table with 26 columns, A-Z. For some rows, columns A-Y are identical, and column Z is the only one that differs. Is there an easy and clean way to combine duplicate rows, such that columns A-Y are the same and column Z is summed over? My solution is to do something like
SELECT A, B, C,...,Y,SUM(Z)
-- lots of work
FROM [table produced by multiple joins]
GROUP BY A, B, C,...,Y
The final GROUP BY clause ends up being very long. It's also error-prone if columns are ever added to or removed from the SELECT statement. Is this the only way to go about what I want to do?
Below is for BigQuery Standard SQL
#standardSQL
SELECT
ANY_VALUE((SELECT AS STRUCT t.* EXCEPT(z))).*,
SUM(z) AS z
FROM `project.dataset.table_produced_by_multiple_joins` t
GROUP BY FORMAT('%t', (SELECT AS STRUCT t.* EXCEPT(z)))

What's the difference between select distinct count, and select count distinct?

I am aware of select count(distinct a), but I recently came across select distinct count(a).
I'm not very sure if that is even valid.
If it is valid, could you give me sample code with sample data that would explain the difference?
Hive doesn't allow the latter.
Any leads would be appreciated!
The query select count(distinct a) gives you the number of unique values in a.
The query select distinct count(a) gives you the list of unique counts of values in a. Without grouping, that is just one row with the total count.
See following example
create table t(a int);
insert into t values (1),(2),(3),(3);
select count(distinct a) from t;
select distinct count(a) from t
group by a;
It will give you 3 for the first query and the values 1 and 2 for the second query.
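The same example can be checked with a short script. This is only a sketch that runs the statements through Python's sqlite3 module; SQLite stands in here for whichever engine you actually use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INT)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,), (3,)])

# Number of unique values in a.
unique_values = conn.execute("SELECT COUNT(DISTINCT a) FROM t").fetchone()[0]

# Per-group counts are 1, 1 and 2; DISTINCT then removes the repeated 1.
unique_counts = sorted(
    row[0]
    for row in conn.execute("SELECT DISTINCT COUNT(a) FROM t GROUP BY a")
)

# Without GROUP BY there is a single row, so DISTINCT has nothing to remove.
ungrouped = conn.execute("SELECT DISTINCT COUNT(a) FROM t").fetchall()

print(unique_values)  # 3
print(unique_counts)  # [1, 2]
print(ungrouped)      # [(4,)]
```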
I cannot think of any useful situation where you would want to use:
select distinct count(a)
If the query has no group by, then the distinct is anomalous. The query only returns one row anyway. If there is a group by, then the grouping columns should be in the select, to identify each row.
I mean, technically, with a group by, it would be answering the question: "which distinct counts of non-null values of a occur across the groups?" Usually, it is much more useful to know the count for each group.
If you want to count the number of distinct values of a, then use count(distinct a).

SQL Basic Syntax

I have the following problem:
What happens if the query didn't ask for B in the select? I think it would give an error because the aggregate is computed based on the values in the select clause.
I have the following relation schema and queries:
Suppose R(A,B) is a relation with a single tuple (NULL, NULL).
SELECT A, COUNT(B)
FROM R
GROUP BY A;
SELECT A, COUNT(*)
FROM R
GROUP BY A;
SELECT A, SUM(B)
FROM R
GROUP BY A;
The first query returns NULL and 0. I am not sure what the second query returns. The aggregate COUNT(*) counts the number of tuples in a table; however, I don't know what it does to a group. The third returns NULL, NULL.
The only rule about SELECT and GROUP BY is that the unaggregated columns in the SELECT must be in the GROUP BY (with very specific exceptions).
You can have columns in the GROUP BY that never appear in the SELECT. That is fine. It doesn't affect the definition of a group, but several result rows may then look identical, because the GROUP BY column that distinguishes them is not shown.
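For the three queries in the question, a quick check in SQLite (used here only as a convenient stand-in; this is a sketch, not a statement about every engine) shows what each returns for the single tuple (NULL, NULL). The fourth query is an extra illustration of grouping by a column that is not in the select list.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (A INT, B INT)")
conn.execute("INSERT INTO R VALUES (NULL, NULL)")

# GROUP BY puts the NULL value of A into one group of its own.
q1 = conn.execute("SELECT A, COUNT(B) FROM R GROUP BY A").fetchall()
q2 = conn.execute("SELECT A, COUNT(*) FROM R GROUP BY A").fetchall()
q3 = conn.execute("SELECT A, SUM(B) FROM R GROUP BY A").fetchall()

# The grouping column does not have to appear in the select list.
q4 = conn.execute("SELECT COUNT(*) FROM R GROUP BY A").fetchall()

print(q1)  # [(None, 0)]    COUNT(B) skips the NULL in B
print(q2)  # [(None, 1)]    COUNT(*) counts the row itself
print(q3)  # [(None, None)] SUM over no non-NULL values is NULL
print(q4)  # [(1,)]
```

So within a group, COUNT(*) counts the tuples in that group, here exactly one.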

SQL Select Distinct with ints and varchars

I have a query which returns a number of ints and varchars.
I want the result Distinct but by only one of the ints.
SELECT DISTINCT
t.TaskID ,
td.[TestSteps]
FROM SOX_Task t
snip
Is there a simple way to do this?
DISTINCT doesn't work that way ... it ensures that entire rows are not duplicated. Besides, if it did, how would it decide which values should appear in other columns?
What you probably want to do here is a GROUP BY on the column you want to be distinct, and then apply an appropriate aggregate function to the other columns to get the values you want (e.g. MIN, MAX, SUM, AVG).
For example, the following returns a distinct list of task IDs with the maximum number of steps for that task. That may not be exactly what you want, but it should give you the general idea.
SELECT t.TaskID, MAX(t.TestSteps)
FROM SOX_Task t
GROUP BY t.TaskID
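Here is a minimal sketch of that idea in SQLite. The joins in the original query were snipped, so the single-table layout, the integer TestSteps column, and the sample rows below are all assumptions made purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: the real query joins other tables that were snipped.
conn.execute("CREATE TABLE SOX_Task (TaskID INTEGER, TestSteps INTEGER)")
conn.executemany(
    "INSERT INTO SOX_Task VALUES (?, ?)",
    [(1, 3), (1, 5), (2, 7)],
)

# DISTINCT works on whole rows, so TaskID 1 still appears twice.
distinct_rows = conn.execute(
    "SELECT DISTINCT t.TaskID, t.TestSteps FROM SOX_Task t "
    "ORDER BY t.TaskID, t.TestSteps"
).fetchall()

# GROUP BY yields one row per TaskID; MAX decides which TestSteps value to show.
per_task = conn.execute(
    "SELECT t.TaskID, MAX(t.TestSteps) FROM SOX_Task t "
    "GROUP BY t.TaskID ORDER BY t.TaskID"
).fetchall()

print(distinct_rows)  # [(1, 3), (1, 5), (2, 7)]
print(per_task)       # [(1, 5), (2, 7)]
```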
I want the result Distinct but by only one of the ints.
This:
SELECT DISTINCT t.taskid
FROM SOX_TASK t
...will return a list of unique taskid values - there will not be any duplicates.
If you want to return a specific value, you need to specify it in the WHERE clause:
SELECT t.*
FROM SOX_TASK t
WHERE t.taskid = ?
That will return all the rows with the specific taskid value. If you want a distinct list of the values associated with the taskid value:
SELECT DISTINCT t.*
FROM SOX_TASK t
WHERE t.taskid = ?
GROUP BY is another means of getting distinct/unique values, and it's my preference to use, but in order to use GROUP BY we need to know the columns in the table and what grouping you want.