How to do count(distinct) for multiple columns - sql

This does not work:
select count(distinct colA, colB) from mytable
I know I can simply solve this by making a double select.
select count(*) from (
select distinct colA, colB from mytable
)
Is there any way I can do this without having to do the sub-select?

The subquery is the standard solution, and the one I recommend too. Concatenation-based solutions, besides being error-prone when a problematic character occurs in the data, may also perform worse.
Note: in case you are collecting obscure ways to avoid the subquery, a window function also works here (not to be used in production; your code reviewers won't praise you for it):
select distinct count(*) over ()
from my_table
group by colA, colB
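A runnable sketch of the trick above, using Python's `sqlite3` (SQLite also supports window functions, since 3.25). The table and data are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (colA TEXT, colB TEXT)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?)",
    [("a", "1"), ("a", "1"), ("a", "2"), ("b", "1")],
)

# count(*) OVER () is evaluated after GROUP BY, so it counts the number
# of groups; DISTINCT collapses the per-group copies of that one value.
row = conn.execute(
    "SELECT DISTINCT count(*) OVER () FROM my_table GROUP BY colA, colB"
).fetchone()
print(row[0])  # 3 distinct (colA, colB) pairs
```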

[TL;DR] Just use a sub-query.
If you are trying to use concatenation then you need to ensure that you delimit the terms with a string that is never going to appear in the values otherwise you will find non-distinct terms grouped together.
For example: if you have two numeric columns then using COUNT(DISTINCT col1 || col2) will group together 1||23 and 12||3 and count them as one.
You could use COUNT(DISTINCT col1 || '-' || col2) but if the columns are string values and you have 'ab-'||'-'||'c' and 'ab'||'-'||'-c' then, once again, they would be identical once concatenated.
The simplest method is to use a sub-query.
If you can't do that then you can combine columns via string concatenation, but you need to analyse the contents of the columns and pick a delimiter that does not appear in your strings, otherwise your results might be erroneous. Even better, ensure that the delimiter character can never appear in the columns by adding check constraints:
ALTER TABLE mytable ADD CONSTRAINT mytable__col1__chk CHECK (col1 NOT LIKE '%¬%');
ALTER TABLE mytable ADD CONSTRAINT mytable__col2__chk CHECK (col2 NOT LIKE '%¬%');
Then:
SELECT COUNT(DISTINCT col1 || '¬' || col2)
FROM mytable;
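The collision and its fix can be demonstrated end to end; here is a sketch via Python's `sqlite3`, with data chosen so that the naive concatenation collides:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (col1 TEXT, col2 TEXT)")
# Two distinct pairs whose naive concatenation collides: '1'||'23' == '12'||'3'
conn.executemany("INSERT INTO mytable VALUES (?, ?)", [("1", "23"), ("12", "3")])

# The sub-query counts correctly:
sub = conn.execute(
    "SELECT count(*) FROM (SELECT DISTINCT col1, col2 FROM mytable)"
).fetchone()[0]

# Naive concatenation undercounts: both rows concatenate to '123'.
naive = conn.execute(
    "SELECT count(DISTINCT col1 || col2) FROM mytable"
).fetchone()[0]

# A delimiter that never appears in the data restores the correct count:
delim = conn.execute(
    "SELECT count(DISTINCT col1 || '¬' || col2) FROM mytable"
).fetchone()[0]

print(sub, naive, delim)  # 2 1 2
```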

Just for fun, you can (ab)use window functions and the limit clause. These are evaluated after grouping. So:
SELECT COUNT(*) OVER()
FROM t
GROUP BY col_a, col_b
OFFSET 0 ROWS FETCH NEXT 1 ROWS ONLY

If you're trying to avoid sub-selects at all costs, one variant would be to concatenate them as such:
SELECT count(DISTINCT concat(colA, colB)) FROM mytable;

Concatenate them.
SELECT count(DISTINCT colA ||'-'|| colB) FROM mytable;

Related

Select Distinct (case insensitive) on Postgres

This has been asked here and a few other places before, but it seems like the suggested answers either don't apply to Postgres or don't work in this situation.
I'm looking to select distinct column names, eg:
SELECT DISTINCT column_name FROM table_name WHERE ... ORDER BY column_name
However, I'm looking to eliminate duplicates that differ only in case (e.g. A and a should be considered the same thing).
I tried COLLATE but all available collations were case sensitive. And changing case via LOWER() or UPPER() won't work because in this situation I need the case information.
I thought about something like this to grab unique values but still maintain the case:
SELECT DISTINCT upper(my_column) as upper_case, my_column
FROM my_table
ORDER BY upper(my_column)
But introducing my_column in the distinct query negates the whole thing.
How can I get unique values (case insensitive) without modifying the case of the results itself?
In PostgreSQL (but not many other databases), you can use a DISTINCT ON clause:
SELECT DISTINCT ON (upper(my_column)) my_column
FROM my_table
ORDER BY upper(my_column)
You can even choose which of the results you get, by adding another column to the ORDER BY clause to make the desired result appear first:
SELECT DISTINCT ON (upper(my_column)) my_column
FROM my_table
ORDER BY upper(my_column), other_column
Documentation: DISTINCT Clause
You can use an aggregation function:
SELECT MAX(my_column)
FROM my_table
GROUP BY upper(my_column);
This returns one value. If you want all the values:
SELECT ARRAY_AGG(DISTINCT my_column)
FROM my_table
GROUP BY upper(my_column);
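The aggregation approach is portable beyond Postgres; here is a sketch via Python's `sqlite3` with illustrative data (SQLite has no DISTINCT ON, so GROUP BY with an aggregate is the portable form):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (my_column TEXT)")
conn.executemany(
    "INSERT INTO my_table VALUES (?)",
    [("Apple",), ("APPLE",), ("banana",)],
)

# One representative per case-insensitive group, original case preserved.
# MIN picks the representative by binary order ('APPLE' sorts before 'Apple').
rows = conn.execute(
    "SELECT MIN(my_column) FROM my_table "
    "GROUP BY upper(my_column) ORDER BY upper(my_column)"
).fetchall()
print([r[0] for r in rows])  # ['APPLE', 'banana']
```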

find duplicate records in the table with all columns are the same

Say I have a table with hundreds of columns. The task is to find duplicate records where all the columns are the same: basically, to find identical records.
I tried group by as the following
select *
from some_table
group by *
having count(*) > 1
but it seems like group by * is not allowed in sql. Anyone has some idea as to what kind of command I could run to find out identical records? Thanks in advance.
Just put a comma-separated list of columns instead of * in both places, select and group by. But not in count: the count(*) should remain as is.
I verified it on SQL Server, but I am pretty sure it is ANSI SQL and should work on most (any?) ANSI SQL compatible RDBMS.
A PostgreSQL solution, I think:
SELECT all rows, and use EXCEPT ALL to remove one copy of each distinct row (the SELECT DISTINCT). What remains is the duplicates only.
select * from table
except all
select distinct * from table
You have to list out all the columns:
select col1, col2, col3, . . .
from t
group by col1, col2, col3, . . .
having count(*) > 1;
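The list-every-column approach above can be sketched with Python's `sqlite3` on a small made-up table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (col1 TEXT, col2 INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [("a", 1), ("a", 1), ("a", 1), ("b", 2)],
)

# Listing every column in both SELECT and GROUP BY finds fully
# identical rows; HAVING keeps only the groups with more than one copy.
dups = conn.execute(
    "SELECT col1, col2, count(*) FROM t "
    "GROUP BY col1, col2 HAVING count(*) > 1"
).fetchall()
print(dups)  # [('a', 1, 3)]
```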
MSSQL 2016+
Add a new column to the table that hashes all the other columns, using MSSQL's HashBytes.
Notes to consider:
you need to convert all the columns to Varchar or Varbinary;
if your comparison is case sensitive, use upper() or lower();
for NULL values, use a column separator;
consider the hashing algorithm's performance on the server.
I usually go for something like:
select col1, col2, col3, col4,
  HASHBYTES('MD5',
    concat(
      Convert(varbinary, col1), '|',
      Convert(varbinary, col2), '|',
      Convert(varbinary, col3), '|',
      Convert(varbinary, col4), '|'
    )
  ) as Row_Hash
from table1
The Row_Hash can then be used as a single column in the table/CTE to represent the content of all the other columns; you can count by it and order by it to find the duplicates.
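A rough Python analogue of the same idea, using the standard-library `hashlib` (the `row_hash` helper and the data are illustrative, not part of the original answer): hash the joined column values, with a separator so adjacent values can't run together, and a marker for NULLs so NULL and the empty string stay distinct.

```python
import hashlib

def row_hash(row):
    # "\\N" marks NULL (None) so it cannot collide with '';
    # '|' separates columns so 'ab'+'c' differs from 'a'+'bc'.
    parts = ("\\N" if v is None else str(v) for v in row)
    return hashlib.md5("|".join(parts).encode("utf-8")).hexdigest()

rows = [("a", 1, None), ("a", 1, None), ("b", 2, "")]
hashes = [row_hash(r) for r in rows]
print(hashes[0] == hashes[1], hashes[1] == hashes[2])  # True False
```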

How is it possible for count distinct to show duplicates, but group by does not?

I want to query for duplicates in my data.
So, the first thing I do is I do a count distinct:
select count(distinct colA, colB ....) from Table
and a count:
select count(*) from Table
And I see that the count distinct is lower than the count(*).
So, now I want to actually see the duplicates, so I do this:
select colA, colB, .... count(*) from Table
group by colA, colB ... having count(*) > 1;
Now, for some reason, this does not return any records at all. The table is too big for me to show results here, and the columns too many.
How is it possible for both of these to be true? the counts are different, but no rows show up when I group them and filter for count(*) >1?
Thanks.
The behavior you see may depend on the database you are using. However, I'm pretty sure that the problem is due to NULL values in the columns. For instance, MySQL explicitly describes COUNT(DISTINCT) as:
COUNT(DISTINCT expr,[expr...])
Returns a count of the number of rows with different non-NULL expr
values.
Not all databases support COUNT(DISTINCT) with multiple expressions, and different databases may handle NULL values differently. But NULL values seem to be the most likely cause of the discrepancy.
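A single-column sketch of the NULL explanation, via Python's `sqlite3` (SQLite doesn't accept multiple expressions in COUNT(DISTINCT), so one column stands in for the pair; the data is made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (colA TEXT)")
conn.executemany("INSERT INTO t VALUES (?)", [("x",), ("y",), (None,)])

total = conn.execute("SELECT count(*) FROM t").fetchone()[0]
# count(DISTINCT ...) ignores NULLs entirely...
distinct = conn.execute("SELECT count(DISTINCT colA) FROM t").fetchone()[0]
# ...yet GROUP BY finds no value occurring more than once:
dups = conn.execute(
    "SELECT colA, count(*) FROM t GROUP BY colA HAVING count(*) > 1"
).fetchall()
print(total, distinct, dups)  # 3 2 [] -- counts differ, but no duplicates
```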

SQL select distinct by 2 or more columns

I have a table with a lot of columns, and I need to write a select that takes only unique values. The main problem is that I need to check three columns at the same time: if all three columns have the same values (each compared within its own column, not between columns), the row should be collapsed. The idea is something like distinct(column1 and column2 and column3).
Any ideas? Or you need more information, because I'm not sure if everybody gets what I have in mind.
This is an example. The select should return two rows from this: one where the last column has Yes, and the other row with No.
This is exactly what the distinct keyword is for:
SELECT distinct col1, col2, col3
FROM mytable

Row_number() function for Informix

Does Informix have a function similar to SQL Server's and Oracle's row_number()?
I have to make a query using row_number() between two values, but I don't know how.
This is my query in SQLServer:
SELECT col1, col2
FROM (SELECT col1, col2, ROW_NUMBER()
OVER (ORDER BY col1) AS ROWNUM FROM table) AS TB
WHERE TB.ROWNUM BETWEEN value1 AND value2
Some help?
If, as it appears, you are seeking to get rows 1-100 first, then rows 101-200, and so on, then you can use a more direct (but non-standard) syntax. Other DBMSs have analogous notations, handled somewhat differently.
To fetch rows 101-200:
SELECT SKIP 100 FIRST 100 t.*
FROM Table AS T
WHERE ...other criteria...
You can use a host variable in place of either literal 100 (or a single prepared statement with different values for the placeholders on different iterations).
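Both forms can be sketched via Python's `sqlite3` (with made-up data): the portable ROW_NUMBER() subquery from the question, and LIMIT/OFFSET, which is SQLite's analogue of Informix's SKIP/FIRST.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (col1 INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(1, 11)])

# The ROW_NUMBER() form from the question, fetching "rows" 4-6:
rows = conn.execute(
    "SELECT col1 FROM ("
    "  SELECT col1, ROW_NUMBER() OVER (ORDER BY col1) AS rownum FROM t"
    ") WHERE rownum BETWEEN 4 AND 6"
).fetchall()

# The same page via LIMIT/OFFSET (analogous to SKIP 3 FIRST 3):
rows2 = conn.execute(
    "SELECT col1 FROM t ORDER BY col1 LIMIT 3 OFFSET 3"
).fetchall()
print([r[0] for r in rows], [r[0] for r in rows2])  # [4, 5, 6] [4, 5, 6]
```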