SQL to find the number of distinct values in a column - sql

I can select all the distinct values in a column in the following ways:
SELECT DISTINCT column_name FROM table_name;
SELECT column_name FROM table_name GROUP BY column_name;
But how do I get the row count from that query? Is a subquery required?

You can use the DISTINCT keyword within the COUNT aggregate function:
SELECT COUNT(DISTINCT column_name) AS some_alias FROM table_name
This will count only the distinct values for that column.

This will give you BOTH the distinct column values and the count of each value. I usually find that I want to know both pieces of information.
SELECT [columnName], count([columnName]) AS CountOf
FROM [tableName]
GROUP BY [columnName]

An sql sum of column_name's unique values and sorted by the frequency:
SELECT column_name, COUNT(*) FROM table_name GROUP BY column_name ORDER BY 2 DESC;

Be aware that Count() ignores null values, so if you need to allow for null as its own distinct value you can do something tricky like:
select count(distinct my_col)
+ count(distinct Case when my_col is null then 1 else null end)
from my_table
/

SELECT COUNT(DISTINCT column_name) FROM table as column_name_count;
you've got to count that distinct col, then give it an alias.

select count(*) from
(
SELECT distinct column1,column2,column3,column4 FROM abcd
) T
This will give count of distinct group of columns.

select Count(distinct columnName) as columnNameCount from tableName

Using following SQL we can get the distinct column value count in Oracle 11g.
select count(distinct(Column_Name)) from TableName

After MS SQL Server 2012, you can use window function too.
SELECT column_name, COUNT(column_name) OVER (PARTITION BY column_name)
FROM table_name
GROUP BY column_name

To do this in Presto using OVER:
SELECT DISTINCT my_col,
count(*) OVER (PARTITION BY my_col
ORDER BY my_col) AS num_rows
FROM my_tbl
Using this OVER based approach is of course optional. In the above SQL, I found specifying DISTINCT and ORDER BY to be necessary.
Caution: As per the docs, using GROUP BY may be more efficient.

select count(distinct(column_name)) AS columndatacount from table_name where somecondition=true
You can use this query, to count different/distinct data.

Without using DISTINCT this is how we could do it-
SELECT COUNT(C)
FROM (SELECT COUNT(column_name) as C
FROM table_name
GROUP BY column_name)

Count(distinct({fieldname})) is redundant
Simply Count({fieldname}) gives you all the distinct values in that table. It will not (as many presume) just give you the Count of the table [i.e. NOT the same as Count(*) from table]

Related

In SQL Server, how to concat --> group by using the concat column

SELECT
'dbo.our_table' as table_name,
CONCAT(col1, '-', col2, '-', col3) as table_id,
COUNT(*) as ct
FROM dbo.our_table
group by table_name, table_id
-- group by 1, 2 -- this doesn't work either...
order by ct desc
This does not work in SQL server because it does not recognize table_name or table_id. I understand that this can be done by nesting a 2nd SELECT clause into the FROM, so that table_name and table_id are explicitly available, however I am trying to understand if it is possible to achieve this output without having to create a nested SELECT statement, but rather by keeping the structure of my current query but only making a tweak.
Thanks
As mentioned in the comments, you need to put your 3 columns (col1, col2 & col3) into the GROUP BY. Unlike these 3 columns, dbo.our_table is not needed in the GROUP BY as it is a string.
SQL Server executes the components of SELECT queries in a particular order starting with FROM, then WHERE, GROUP BY, etc. In this case, SQL Server doesn't recognize the aliases table_name & table_id in the GROUP BY because they are not set until the SELECT, which is executed after the GROUP BY.
Googling "SQL Server SELECT query execution order" should give you a number of resources which will explain the order of execution in more detail.
You need to specify the full calculation for the GROUP BY as well as for the SELECT. This is because GROUP BY is logically considered before SELECT, so cannot access those calculations.
You could do it like this (table_name is not necessary because it's purely computed):
SELECT
'dbo.our_table' as table_name,
CONCAT(col1, '-', col2, '-', col3) as table_id,
COUNT(*) as ct
FROM dbo.our_table
group by CONCAT(col1, '-', col2, '-', col3)
order by ct desc;
But much better is to place calculations in a CROSS APPLY, this means it is accessible later by name as you wished:
SELECT
'dbo.our_table' as table_name,
v.table_id,
COUNT(*) as ct
FROM dbo.our_table
CROSS APPLY (VALUES (CONCAT(col1, '-', col2, '-', col3) ) ) as v(table_id)
group by v.table_id
order by ct desc;

Defining a subtable and then query from that table using SQL

I have a table with many columns, and I want to count the unique values of each column. I know that I can do
SELECT sho_01, COUNT(*) from sho GROUP BY sho_01
UNION ALL
SELECT sho_02, COUNT(*) from sho GROUP BY sho_02
UNION ALL
....
Here sho is the table and sho_01,.... are the individual columns. This is BigQuery by the way, so they use UNION ALL.
Next, I want to do the same thing, but for a subset of sho, say SELECT * FROM sho WHERE id in (1,2,3). Is there a way where I can create a subtable first, and then query the subtable? Something like this
SELECT * FROM (SELECT * FROM sho WHERE id IN (1,2,3)) AS t1;
SELECT sho_01, COUNT(*) from t1 GROUP BY sho_01
UNION ALL
SELECT sho_02, COUNT(*) from t1 GROUP BY sho_02
UNION ALL
....
Thanks
Presumably, the columns are all of the same type. If so, you can simplify this using arrays:
select el.which, el.val, count(*)
from (select t1.*,
array[struct('sho_01' as which, sho_01 as val),
struct('sho_2', show_02),
. . .
] as ar
from t
) t cross join
unnest(ar) el
group by el.which, el.val;
You can then easily filter however you want by adding a where clause before the group by.
Below is for BigQuery Standard SQL and allows you to avoid manual typing of column names or even knowing them in advance
#standardSQL
SELECT
TRIM(SPLIT(kv, ':')[OFFSET(0)], '"') column,
SPLIT(kv, ':')[OFFSET(1)] value,
COUNT(1) cnt
FROM `project.dataset.table` t,
UNNEST(SPLIT(TRIM(TO_JSON_STRING(t), '{}'))) kv
GROUP BY column, value
-- ORDER BY column, value

Optimized way to select distinct records

I have a table in hive with 23 columns out of which 5 columns make up the composite primary keys.
what is the best optimized way to select all the distinct records from the table.
select *
from (select t.*
,count(*) over (partition by Col1,Col2,Col3,Col4,Col5) as cnt
from tablename t
) t
where t.cnt = 1
;
Use the group by statements with where statements where count(1)>=1 this will give you the distinct records based on your composite key.
Like
Select Col1,Col2,Col3,Col4,Col5,Count(1) from tablename group by Col1,Col2,Col3,Col4,Col5 having Count(1)>=1

SQL query to determine that values in a column are unique

How to write a query to just determine that the values in a column are unique?
Try this:
SELECT CASE WHEN count(distinct col1)= count(col1)
THEN 'column values are unique' ELSE 'column values are NOT unique' END
FROM tbl_name;
Note: This only works if 'col1' does not have the data type 'ntext' or 'text'. If you have one of these data types, use 'distinct CAST(col1 AS nvarchar(4000))' (or similar) instead of 'distinct col1'.
select count(distinct column_name), count(column_name)
from table_name;
If the # of unique values is equal to the total # of values, then all values are unique.
IF NOT EXISTS (
SELECT
column_name
FROM
your_table
GROUP BY
column_name
HAVING
COUNT(*)>1
)
PRINT 'All are unique'
ELSE
PRINT 'Some are not unique'
If you want to list those that aren't unique, just take the inner query and run it. HTH.
With this following query, you have the advantage of not only seeing if your columns are unique, but you can also see which combination is most non-unique. Furthermore, because you still see frequency 1 is your key is unique, you know your results are good, and not for example simply missing; something is less clear when using a HAVING clause.
SELECT Col1, Col2, COUNT(*) AS Freq
FROM Table
GROUP BY Col1, Col2
ORDER BY Freq DESC
Are you trying to return only distinct values of a column? If so, you can use the DISTINCT keyword. The syntax is:
SELECT DISTINCT column_name,column_name
FROM table_name;
If you want to check if all the values are unique and you care about NULL values, then do something like this:
select (case when count(distinct column_name) = count(column_name) and
(count(column_name) = count(*) or count(column_name) = count(*) - 1)
then 'All Unique'
else 'Duplicates'
end)
from table t;
select (case when count(distinct column1 ) = count(column1)
then 'Unique'
else 'Duplicates'
end)
from table_name
By my understanding you want to know which values are unique in a column. Therefore, using select distinct to do so doesn't solve the problem, because only lists the value as if they are unique, but they may not.
A simple solution as follows:
SELECT COUNT(column_name), column_name
FROM table_name
GROUP BY column_name
HAVING COUNT(column_name) = 1;
this code return distinct value
SELECT code FROM #test
group by code
having count(distinct code)= count(code)
return 14 that is just unique value
Use the DISTINCT keyword inside a COUNT aggregate function as shown below:
SELECT COUNT(DISTINCT column_name) AS some_alias FROM table_name
The above query will give you the count of distinct values in that column.

SQL group by clause with two columns

Have a TSQL view, which I need to group by one column, however, am using nhibernate(C#) and am required to specify the Id column too.. my query looks like:
SELECT
row_number() over(order by id)as Id,
column_name,..etc
from tblName
group by column_name
which gives me an error that the Id has to be included in the group by clause.
Alternatively, I can write:
SELECT
row_number() over(order by id)as Id,
column_name,..etc
from tblName
group by column_name, id
which return multiple rows of the same column_name name.
Is there a way around this?
I think you want to do this:
Select row_number() over(order by column_name) as ID, column_name from (
Select distinct column_name from tblName
) as A
Do you mean this?
SELECT
row_number() over(partition by column_name order by id)as Id,
column_name,..etc
from tblName