Select Distinct (case insensitive) on Postgres - sql

This has been asked here and a few other places before but seems like the suggested answers either don't apply to postgres or don't work in this situation.
I'm looking to select distinct column names, eg:
SELECT DISTINCT column_name FROM table_name WHERE ... ORDER BY column_name however I'm looking to eliminate case sensitive duplicates (eg A and a should be considered the same thing)
I tried COLLATE, but all the available collations were case sensitive. And changing case via LOWER() or UPPER() won't work, because in this situation I need the case information.
I thought about something like this to grab unique values but still maintain the case:
SELECT DISTINCT upper(my_column) as upper_case, my_column
FROM my_table
ORDER BY upper(my_column)
But introducing my_column in the distinct query negates the whole thing.
How can I get unique values (case insensitive) without modifying the case of the results itself?

In PostgreSQL (but not many other databases), you can use a DISTINCT ON clause:
SELECT DISTINCT ON (upper(my_column)) my_column
FROM my_table
ORDER BY upper(my_column)
You can even choose which of the results you get, by adding another column to the ORDER BY clause to make the desired result appear first:
SELECT DISTINCT ON (upper(my_column)) my_column
FROM my_table
ORDER BY upper(my_column), other_column
Documentation: DISTINCT Clause
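To make the "first row per group wins" behaviour concrete, here is a rough client-side sketch in Python of what DISTINCT ON (upper(my_column)) with ORDER BY upper(my_column), other_column selects. The sample rows and the helper name distinct_on_upper are made up for illustration, and Python's str.upper() is only a stand-in for PostgreSQL's upper(), which depends on the database locale.

```python
from itertools import groupby

# Hypothetical (my_column, other_column) rows; not taken from the question.
rows = [("apple", 2), ("Apple", 1), ("BANANA", 5), ("banana", 7), ("Cherry", 3)]


def distinct_on_upper(rows):
    """Return my_column from the first row of each upper(my_column) group.

    Rows are ordered by (upper(my_column), other_column) first, so the row
    with the smallest other_column represents its group.
    """
    ordered = sorted(rows, key=lambda r: (r[0].upper(), r[1]))
    return [next(group)[0]
            for _, group in groupby(ordered, key=lambda r: r[0].upper())]


result = distinct_on_upper(rows)
print(result)
```

Changing the second sort key changes which spelling survives, which is what the extra ORDER BY column does in the SQL version.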

You can use an aggregation function:
SELECT MAX(my_column)
FROM my_table
GROUP BY upper(my_column);
This returns one value per group. If you want all the values:
SELECT ARRAY_AGG(DISTINCT my_column)
FROM my_table
GROUP BY upper(my_column);
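The GROUP BY approach can be tried out without a PostgreSQL server. The sketch below uses Python's built-in sqlite3 module with made-up sample data; SQLite has no ARRAY_AGG, so group_concat stands in for it here, and SQLite's upper() only folds ASCII letters, so treat this as an approximation of the PostgreSQL behaviour rather than a reproduction of it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (my_column TEXT)")
# Hypothetical sample values, chosen to contain case-only duplicates.
conn.executemany(
    "INSERT INTO my_table VALUES (?)",
    [("Apple",), ("apple",), ("Banana",), ("banana",), ("Cherry",)],
)

# One representative per case-insensitive group, original case preserved.
one_per_group = [row[0] for row in conn.execute(
    "SELECT MAX(my_column) FROM my_table "
    "GROUP BY upper(my_column) ORDER BY upper(my_column)"
)]

# All spellings per group (group_concat is SQLite's substitute for ARRAY_AGG).
variants = {
    key: sorted(values.split(","))
    for key, values in conn.execute(
        "SELECT upper(my_column), group_concat(DISTINCT my_column) "
        "FROM my_table GROUP BY upper(my_column)"
    )
}

print(one_per_group)
print(variants)
```

Which spelling MAX picks depends on the collation in use; with SQLite's default binary collation the lowercase spelling sorts higher.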

Related

How to do count(distinct) for multiple columns

This does not work:
select count(distinct colA, colB) from mytable
I know I can simply solve this by making a double select.
select count(*) from (
select distinct colA, colB from mytable
)
Is there anyway I can do this without having to do the sub-select?
A subquery is the standard solution, and I recommend it too. Concatenation-based solutions, besides being error-prone if a dangerous character occurs in the data, might also perform worse.
Note: in case you collect obscure ways to avoid a subquery, a window function can also be used here (not to be used in production; your code reviewers won't praise you for it):
select distinct count(*) over ()
from my_table
group by colA, colB
[TL;DR] Just use a sub-query.
If you are trying to use concatenation then you need to ensure that you delimit the terms with a string that is never going to appear in the values otherwise you will find non-distinct terms grouped together.
For example: if you have two numeric columns, then using COUNT(DISTINCT col1 || col2) will group together 1||23 and 12||3 and count them as one group.
You could use COUNT(DISTINCT col1 || '-' || col2) but if the columns are string values and you have 'ab-'||'-'||'c' and 'ab'||'-'||'-c' then, once again, they would be identical once concatenated.
The simplest method is to use a sub-query.
If you can't do that then you can combine columns via string-concatenation but you need to analyse the contents of the column and pick a delimiter that does not appear in your strings otherwise your results might be erroneous. Even better is to ensure that the delimiter character will never be in the sub-string with check constraints.
ALTER TABLE mytable ADD CONSTRAINT mytable__col1__chk CHECK (col1 NOT LIKE '%¬%');
ALTER TABLE mytable ADD CONSTRAINT mytable__col2__chk CHECK (col2 NOT LIKE '%¬%');
Then:
SELECT COUNT(DISTINCT col1 || '¬' || col2)
FROM mytable;
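The collision problem is easy to reproduce. The following sketch runs the three counting strategies against an in-memory SQLite database from Python; the four sample rows are invented to contain exactly the two collisions described above, and the helper scalar is just a local convenience, not part of any library.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (col1 TEXT, col2 TEXT)")
# Four distinct (col1, col2) pairs, built to collide when concatenated.
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?)",
    [("1", "23"), ("12", "3"), ("ab-", "c"), ("ab", "-c")],
)


def scalar(sql):
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchone()[0]


# The sub-query counts the pairs themselves, so nothing can collide.
subquery = scalar(
    "SELECT COUNT(*) FROM (SELECT DISTINCT col1, col2 FROM mytable) d"
)
# No delimiter: '1'||'23' and '12'||'3' both become '123'.
bare = scalar("SELECT COUNT(DISTINCT col1 || col2) FROM mytable")
# '-' delimiter: 'ab-'||'-'||'c' and 'ab'||'-'||'-c' both become 'ab--c'.
dashed = scalar("SELECT COUNT(DISTINCT col1 || '-' || col2) FROM mytable")

print(subquery, bare, dashed)
```

Only the sub-query reports all four pairs; each concatenation variant silently undercounts.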
Just for fun, you can (ab)use window functions and limit clause. These are evaluated after grouping. So:
SELECT COUNT(*) OVER()
FROM t
GROUP BY col_a, col_b
OFFSET 0 ROWS FETCH NEXT 1 ROWS ONLY
If you're trying to avoid sub-selects at all costs, one variant would be to concatenate them as such:
SELECT count(DISTINCT concat(colA, colB)) FROM mytable;
Concatenate them.
Select count(distinct colA ||'-'|| colB) from mytable;

SQL Having on columns not in SELECT

I have a table with 3 columns:
userid mac_address count
The entries for one user could look like this:
57193 001122334455 42
57193 000C6ED211E6 15
57193 FFFFFFFFFFFF 2
I want to create a view that displays only those MACs that are considered "commonly used" for this user. For example, I want to filter out the MACs whose count is less than 10% of the count of the most used MAC address for that user. Furthermore, I want one row per user. This could easily be achieved with GROUP BY, HAVING and GROUP_CONCAT:
SELECT userid, GROUP_CONCAT(mac_address SEPARATOR ',') AS macs, count
FROM mactable
GROUP BY userid
HAVING count*10 >= MAX(count)
And indeed, the result is as follows:
57193 001122334455,000C6ED211E6 42
However I really don't want the count-column in my view. But if I take it out of the SELECT statement, I get the following error:
#1054 - Unknown column 'count' in 'having clause'
Is there any way I can perform this operation without being forced to have a nasty count-column in my view? I know I can probably do it using inner queries, but I would like to avoid doing that for performance reasons.
Your help is very much appreciated!
As HAVING explicitly refers to the column names in the select list, what you want is not possible.
However, you can use your select as a subselect to a select that returns only the rows you want to have.
SELECT a.userid, a.macs
FROM
(
SELECT userid, GROUP_CONCAT(mac_address SEPARATOR ',') AS macs, count
FROM mactable
GROUP BY userid
HAVING count*10 >= MAX(count)
) as a
UPDATE:
Because of a limitation of MySQL this is not possible, although it works in other DBMS like Oracle.
One solution would be to create a view for the subquery. Another solution seems cleaner:
CREATE VIEW YOUR_VIEW (userid, macs) AS
SELECT userid, GROUP_CONCAT(mac_address SEPARATOR ',') AS macs, count
FROM mactable
GROUP BY userid
HAVING count*10 >= MAX(count)
This will declare the view as returning only the columns userid and macs although the underlying SELECT statement returns more columns than those two.
Although I am not sure, whether the non-DBMS MySQL supports this or not...
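For comparison, here is a sketch of a variant that filters the rows before concatenating them, so the count column never has to appear in the outer select list. It does use an inner query, which the question hoped to avoid, and it is run against SQLite from Python purely for illustration: the view name common_macs is hypothetical, SQLite's group_concat(x, ',') replaces MySQL's GROUP_CONCAT(x SEPARATOR ','), and whether a given MySQL version accepts a derived table inside a view is not checked here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE mactable (userid INTEGER, mac_address TEXT, "count" INTEGER)'
)
# The three sample rows from the question.
conn.executemany(
    "INSERT INTO mactable VALUES (?, ?, ?)",
    [
        (57193, "001122334455", 42),
        (57193, "000C6ED211E6", 15),
        (57193, "FFFFFFFFFFFF", 2),
    ],
)

# Keep only MACs whose count is at least 10% of the user's maximum,
# then concatenate the survivors into one row per user.
conn.execute(
    """
    CREATE VIEW common_macs AS
    SELECT m.userid, group_concat(m.mac_address, ',') AS macs
    FROM mactable m
    JOIN (SELECT userid, MAX("count") AS max_count
          FROM mactable
          GROUP BY userid) mx
      ON mx.userid = m.userid
    WHERE m."count" * 10 >= mx.max_count
    GROUP BY m.userid
    """
)

rows = conn.execute("SELECT userid, macs FROM common_macs").fetchall()
print(rows)
```

The order of the MACs inside the concatenated string is not guaranteed, so compare it as a set rather than as a fixed string.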

Distinct with select *

Is it possible to use select * with distinct or write easily something that has the same impact?
I need to select all columns from a table with distinct value, but listing all the columns in select clause would be nerve-breaking because the number of columns is over 20!
In Microsoft SQL Server you can write:
select distinct * from MyTable
However, it is considered "best practice" to specify the columns explicitly, partly because it improves the performance of the query, but also to protect yourself from failures that would arise if the database schema were to change in the future.
This should work:
SELECT DISTINCT * FROM TABLE_NAME
Use this query:
SELECT DISTINCT Employee, Rank
FROM Employees
Adding the "distinct" keyword right after "select" does the work.
For example:
SELECT DISTINCT * FROM TABLE_NAME

Query works in MySQL not Oracle

The following SQL statement works in MySQL but not with Oracle:
SELECT *, MAX(COLUMN_A)
FROM table_xyz
WHERE COLUMN_A <= 100
GROUP BY COLUMN_A
Oracle complaint: "FROM keyword not found where expected"
Actually, the statement was incorrect: we were not grouping by COLUMN_A but by another column instead. What we actually want is this:
SELECT *, MAX(COLUMN_A)
FROM table_xyz
WHERE COLUMN_A <= 100
GROUP BY COLUMN_B
This works, but gives us only columns A and B:
SELECT COLUMN_B, MAX(COLUMN_A)
FROM table_xyz
WHERE COLUMN_A <= 100
GROUP BY COLUMN_B
What we want is this, but it doesn't work (GROUP BY error):
SELECT COLUMN_B, COLUMN_C .... COLUMN_X, MAX(COLUMN_A)
FROM table_xyz
WHERE COLUMN_A <= 100
GROUP BY COLUMN_B
That's because Oracle requires you to list, in the GROUP BY clause, every selected column that is not wrapped in an aggregate function (MIN, MAX, COUNT, etc.). SQL Server would return a similar error. MySQL's behavior is documented here.
Because your query is using SELECT *, I can't re-write it properly for you. But I also can't guarantee a syntactically correct version would return the same results as you see on MySQL either. Grouping by the same column you want the MAX is quite odd...
If you want the max() for column_a you don't need the group by at all:
SELECT MAX(COLUMN_A)
FROM table_xyz
WHERE COLUMN_A <= 100
In addition to what everyone else is saying, Oracle does not allow mixing * with explicit column definitions in queries:
SQL> select *, table_name from user_tables;
select *, table_name from user_tables
*
ERROR at line 1:
ORA-00923: FROM keyword not found where expected
Oracle hasn't even looked at the fact that you are trying to get columns outside of those included in the group by clause. Which as others have stated, Oracle will not do.
This doesn't answer your MAX issue, but the only way to follow a '*' with other columns is if you use an explicit reference to a table alias - e.g.
SELECT e.*, zip_code
FROM addresses a,
employees e
WHERE e.addressId = a.Id
For the MAX value, you will either need to group by all other columns, or look into analytic functions (plenty of previous answers on Stack Overflow).
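One portable way to get the full rows that carry the maximum, without analytic functions, is to compute the maximum per group in a derived table and join back to it. The sketch below is an assumption about what the asker is after (the whole row holding the largest COLUMN_A <= 100 for each COLUMN_B); it runs against SQLite from Python with invented sample rows, and the table aliases are written without AS because Oracle does not accept AS there.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table_xyz (column_a INTEGER, column_b TEXT, column_c TEXT)"
)
# Hypothetical rows: group 'x' has a value above 100 that must be ignored.
conn.executemany(
    "INSERT INTO table_xyz VALUES (?, ?, ?)",
    [(10, "x", "first"), (90, "x", "second"), (150, "x", "too big"),
     (40, "y", "third")],
)

# g holds the per-group maximum; joining back recovers the other columns.
max_rows = conn.execute(
    """
    SELECT t.column_a, t.column_b, t.column_c
    FROM table_xyz t
    JOIN (SELECT column_b, MAX(column_a) AS max_a
          FROM table_xyz
          WHERE column_a <= 100
          GROUP BY column_b) g
      ON g.column_b = t.column_b AND g.max_a = t.column_a
    ORDER BY t.column_b
    """
).fetchall()

print(max_rows)
```

If two rows in a group tie for the maximum, both are returned, which is a difference from MySQL's pick-an-arbitrary-row behaviour.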
Multiple problems. Your GROUP BY clause is backwards. You need to define your GROUP BY by the columns in the *. Also what OMG Ponies said before.

Why shouldn’t you use DISTINCT when you could use GROUP BY?

According to tips from MySQL performance wiki:
Don't use DISTINCT when you have or could use GROUP BY.
Can somebody post example of queries where GROUP BY can be used instead of DISTINCT?
If you know that two columns from your result are always directly related then it's slower to do this:
SELECT DISTINCT CustomerId, CustomerName FROM (...)
than this:
SELECT CustomerId, CustomerName FROM (...) GROUP BY CustomerId
because in the second case it only has to compare the id, but in the first case it has to compare both fields. This is a MySQL specific trick. It won't work with other databases.
SELECT Code
FROM YourTable
GROUP BY Code
vs
SELECT DISTINCT Code
FROM YourTable
The basic rule : Put all the columns from the SELECT clause into the GROUP BY clause
so
SELECT DISTINCT a,b,c FROM D
becomes
SELECT a,b,c FROM D GROUP BY a,b,c
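A quick sanity check of that rewrite rule, sketched with Python's sqlite3 module and made-up rows: both forms should return the same set of rows. This only shows that the results agree; it says nothing about which form a particular database executes faster.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE D (a INTEGER, b TEXT, c TEXT)")
# Hypothetical data with one exact duplicate row.
conn.executemany(
    "INSERT INTO D VALUES (?, ?, ?)",
    [(1, "x", "p"), (1, "x", "p"), (2, "y", "q"), (1, "x", "r")],
)

# Sort in Python so the comparison does not depend on output order.
distinct_rows = sorted(conn.execute("SELECT DISTINCT a, b, c FROM D").fetchall())
grouped_rows = sorted(
    conn.execute("SELECT a, b, c FROM D GROUP BY a, b, c").fetchall()
)

print(distinct_rows == grouped_rows, distinct_rows)
```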
Example.
Relation customer(ssnum,name, zipcode, address) PK(ssnum). ssnum is social security number.
SQL:
Select DISTINCT ssnum from customer where zipcode=1234 group by name
This SQL statement returns unique records for those customers that have zipcode 1234. At the end, the results are grouped by name.
Here DISTINCT is not necessary, because you are selecting ssnum, which is already unique because ssnum is the primary key. Two people cannot have the same ssnum.
In this case, Select ssnum from customer where zipcode=1234 group by name will give better performance than "... DISTINCT.......".
DISTINCT is an expensive operation in a DBMS.