SQL Count distinct number of rows in table in GBQ - sql

I'd like to count the number of distinct rows in a table. I know that I can do that using groupby or by naming all the columns one by one, but would like to just do:
select count(distinct *) from my_table
Is that possible?

Do SELECT DISTINCT in a derived table (the subquery), then count the number of rows returned.
select count(*) from
(select distinct * from my_table) dt
(Doesn't your table have any primary key?)
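To try the derived-table approach without a BigQuery project, here is a minimal sketch using Python's built-in sqlite3 module. The table name, columns and sample rows are made up for illustration, and SQLite stands in for BigQuery, so treat it as a demonstration of the pattern rather than of GBQ behaviour.

```python
import sqlite3

# Hypothetical table and data, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (a INTEGER, b TEXT)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?)",
    [(1, "x"), (1, "x"), (2, "y"), (2, "z")],
)

# Deduplicate whole rows in the derived table, then count what is left.
distinct_rows = conn.execute(
    "SELECT COUNT(*) FROM (SELECT DISTINCT * FROM my_table) dt"
).fetchone()[0]

# Plain row count, for comparison.
total_rows = conn.execute("SELECT COUNT(*) FROM my_table").fetchone()[0]

print(total_rows, distinct_rows)
```

The row (1, 'x') appears twice, so the distinct count comes out one lower than the total.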

You can use to_json_string():
select count(distinct to_json_string(t))
from t;

Below are more options for BigQuery Standard SQL:
select count(distinct format('%t', t))
from `project.dataset.table` t
Depending on your use case, an approximate count can be an even better option:
select approx_count_distinct(format('%t', t))
from `project.dataset.table` t
APPROX_COUNT_DISTINCT - returns the approximate result for COUNT(DISTINCT expression). The value returned is a statistical estimate—not necessarily the actual value. This function is less accurate than COUNT(DISTINCT expression), but performs better on huge input.

The use of count(distinct *) is not permitted.
Alternatively you could explicitly name the columns (what defines uniqueness).

Related

What's the difference between select distinct count, and select count distinct?

I am aware of select count(distinct a), but I recently came across select distinct count(a).
I'm not very sure if that is even valid.
If it is valid, could you give me sample code with sample data that explains the difference?
Hive doesn't allow the latter.
Any leads would be appreciated!
The query select count(distinct a) gives you the number of unique values in a.
The query select distinct count(a) gives you the list of unique counts of values in a. Without grouping, it returns a single row with the total count.
See the following example:
create table t(a int);
insert into t values (1),(2),(3),(3);
-- number of unique values in a
select count(distinct a) from t;
-- unique per-group counts
select distinct count(a) from t
group by a;
It gives 3 for the first query and the values 1 and 2 for the second.
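The same example can be run end to end with Python's sqlite3 module. This is only a sketch: SQLite is used as a stand-in engine, and the variable names are my own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,), (3,)])

# Number of unique values in a.
unique_values = conn.execute("SELECT COUNT(DISTINCT a) FROM t").fetchone()[0]

# Unique per-group counts: the groups 1 and 2 each have one row, 3 has two.
unique_counts = sorted(
    row[0] for row in conn.execute("SELECT DISTINCT COUNT(a) FROM t GROUP BY a")
)

# Without GROUP BY there is a single row, so DISTINCT changes nothing.
no_group = conn.execute("SELECT DISTINCT COUNT(a) FROM t").fetchall()

print(unique_values, unique_counts, no_group)
```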
I cannot think of any useful situation where you would want to use:
select distinct count(a)
If the query has no group by, then the distinct is anomalous. The query only returns one row anyway. If there is a group by, then the grouping columns should be in the select, to identify each row.
I mean, technically, with a group by, it would be answering the question: "how many different non-null values of a are in groups". Usually, it is much more useful to know the value per group.
If you want to count the number of distinct values of a, then use count(distinct a).

COUNT(DISTINCT) and COUNT(*) + GROUP BY give different results

We're querying one of our data sets for unique IDs
SELECT count(distinct id) FROM [MyTable] LIMIT 1
Another query ran a similar command
SELECT count(*) From ( select id FROM MyTable group by id) A ;
The first command is more efficient, but the output should be identical. However, they give different results. The first query returns a count about 1.5% higher, on a dataset of over 100 million rows.
In BigQuery legacy SQL, COUNT(DISTINCT field) is just an estimate. If you need exact results you can use EXACT_COUNT_DISTINCT(field).
This is explained in the query reference: https://cloud.google.com/bigquery/query-reference?hl=en#countdistinct
Check COUNT([DISTINCT] field [, n]) definition
It is a statistical approximation and is not guaranteed to be exact.
The second query returns exact count, thus the difference

"group by" needed in count(*) SQL statement?

The following statement works in my database:
select column_a, count(*) from my_schema.my_table group by 1;
but this one doesn't:
select column_a, count(*) from my_schema.my_table;
I get the error:
ERROR: column "my_table.column_a" must appear in the GROUP BY clause
or be used in an aggregate function
Helpful note: This thread: What does SQL clause "GROUP BY 1" mean? discusses the meaning of "group by 1".
Update:
The reason why I am confused is because I have often seen count(*) as follows:
select count(*) from my_schema.my_table
where there is no group by statement. Is COUNT always required to be followed by group by? Is the group by statement implicit in this case?
This error makes perfect sense. COUNT is an "aggregate" function. So you need to tell it which field to aggregate by, which is done with the GROUP BY clause.
The one which probably makes most sense in your case would be:
SELECT column_a, COUNT(*) FROM my_schema.my_table GROUP BY column_a;
If you use only COUNT(*), you are asking for the total number of rows instead of aggregating by another condition. Your question of whether GROUP BY is implicit in that case could be answered with "sort of": not specifying anything is a bit like asking to "group by nothing", which means you get one huge aggregate covering the whole table.
As an example, executing:
SELECT COUNT(*) FROM table;
will show you the number of rows in that table, whereas:
SELECT col_a, COUNT(*) FROM table GROUP BY col_a;
will show you the number of rows per value of col_a. Something like:
col_a  | COUNT(*)
-------+---------
value1 |      100
value2 |       10
value3 |      123
You should also take into account that * means every row is counted, including rows that contain NULLs! If you want to count only the rows where a specific expression is non-null, use COUNT(expression). See the docs about aggregate functions for more details on this topic.
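To see the difference between COUNT(*) and COUNT(expression), and between the whole-table aggregate and the per-group one, here is a small sketch with Python's sqlite3 module. The column column_b and the sample rows are invented for illustration, and SQLite stands in for the asker's database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# column_b is a hypothetical nullable column added for this example.
conn.execute("CREATE TABLE my_table (column_a TEXT, column_b INTEGER)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?)",
    [("value1", 10), ("value1", None), ("value2", 5), (None, 7)],
)

# No GROUP BY: one aggregate over the whole table.
all_rows, non_null_b = conn.execute(
    "SELECT COUNT(*), COUNT(column_b) FROM my_table"
).fetchone()

# GROUP BY column_a: one aggregate per distinct value (NULL forms its own group).
per_group = {
    a: (n_rows, n_b)
    for a, n_rows, n_b in conn.execute(
        "SELECT column_a, COUNT(*), COUNT(column_b) FROM my_table GROUP BY column_a"
    )
}

print(all_rows, non_null_b, per_group)
```

COUNT(*) counts the row with the NULL column_b, while COUNT(column_b) skips it.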
If you don't use the Group by clause at all then all that will be returned is a count of 1 for each row, which is already assumed anyway and therefore redundant data. By adding GROUP BY 1 you have categorized the information thereby making it non-redundant even though it returns the same result in theory as the statement that creates an error.
When you have a function like count, sum etc. you need to group the other columns. This would be equivalent to your query:
select column_a, count(*) from my_schema.my_table group by column_a;
When you use count(*) with no other column, you are counting all rows from SELECT * from the table. When you use count(*) alongside another column, you are counting the number of rows for each different value of that other column. So in this case you need to group the results, in order to show each value and its count only once.
group by 1 in this case refers to column_a which has the column position 1 in your query.
This is why it works on your server. However, this is not good practice in SQL.
You should name the column explicitly, because the column order in the select list may change, which makes this code hard to maintain.
The best solution is:
select column_a, count(*) from my_schema.my_table group by column_a;
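As a quick check that the positional and the named form are equivalent, here is a sketch with Python's sqlite3 module. It assumes an engine that accepts positional GROUP BY (SQLite does) and uses made-up sample data; the schema prefix is dropped because the in-memory database has no my_schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (column_a TEXT)")
conn.executemany(
    "INSERT INTO my_table VALUES (?)",
    [("a",), ("a",), ("b",), ("c",), ("c",), ("c",)],
)

# GROUP BY 1 refers to the first expression in the select list (column_a).
by_position = conn.execute(
    "SELECT column_a, COUNT(*) FROM my_table GROUP BY 1 ORDER BY column_a"
).fetchall()

# The explicit form names the column and does not depend on select-list order.
by_name = conn.execute(
    "SELECT column_a, COUNT(*) FROM my_table GROUP BY column_a ORDER BY column_a"
).fetchall()

print(by_position == by_name, by_name)
```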

sql divide column by column max

I have a column of count and want to divide the column by max of this column to get the rate.
I tried
select t.count/max(t.count)
from table t
group by t.count
but failed.
I also tried the one without GROUP BY, still failed.
Ordering the counts in descending order and picking the first one as the divisor didn't work in my case. Consider that I have different counts per product subcategory. For each product category, I want to divide the count of each subcategory by the max count in that category. I can't think of a way to avoid an aggregate function.
If you want the MAX() per category you need a correlated subquery:
select t.count*1.0/(SELECT max(a.count)
FROM table a
WHERE t.category = a.category)
from table t
Or you need to PARTITION BY your MAX()
select t.count/(max(t.count) over (PARTITION BY category))
from table t
The following works in all dialects of SQL:
select t.count/(select max(t2.count) from table t2)
from table t
group by t.count;
Note that some versions of SQL do integer division, so the result will be either 0 or 1. You can fix this by multiplying by 1.0 or casting to a float.
Most versions of SQL also support:
select t.count/(max(t.count) over ())
from table t
group by t.count;
The same caveat applies about integer division.
You might want to try using a subquery to derive the max value (including both in the same query might not work the way that you are expecting, since you are grouping on the same column that you are aggregating)
Select t.count / (select max(sub.count) from table sub)
from table t
group by t.count
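For the per-category case, the correlated-subquery idea can be sketched with Python's sqlite3 module as follows. The table name subcategory_counts, the column cnt (used instead of the reserved word count) and the sample numbers are all assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: one row per subcategory with its count.
conn.execute(
    "CREATE TABLE subcategory_counts (category TEXT, subcategory TEXT, cnt INTEGER)"
)
conn.executemany(
    "INSERT INTO subcategory_counts VALUES (?, ?, ?)",
    [
        ("books", "fiction", 40),
        ("books", "travel", 10),
        ("toys", "puzzles", 5),
        ("toys", "robots", 20),
    ],
)

# The subquery finds the max within the outer row's category.
# Multiplying by 1.0 avoids integer division.
query = """
SELECT t.category, t.subcategory,
       t.cnt * 1.0 / (SELECT MAX(a.cnt)
                      FROM subcategory_counts a
                      WHERE a.category = t.category) AS rate
FROM subcategory_counts t
ORDER BY t.category, t.subcategory
"""
rates = {(cat, sub): rate for cat, sub, rate in conn.execute(query)}

# Integer operands give integer division in SQLite, hence the 1.0 above.
int_div = conn.execute("SELECT 10 / 40").fetchone()[0]

print(rates, int_div)
```

The largest subcategory in each category gets a rate of 1.0 and the others are scaled against it.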

What does "select count(1) from table_name" on any database tables mean?

When we execute select count(*) from table_name it returns the number of rows.
What does count(1) do? What does 1 signify here? Is this the same as count(*) (as it gives the same result on execution)?
The parameter to the COUNT function is an expression that is to be evaluated for each row. The COUNT function returns the number of rows for which the expression evaluates to a non-null value. ( * is a special expression that is not evaluated, it simply returns the number of rows.)
There are two additional modifiers for the expression: ALL and DISTINCT. These determine whether duplicates are discarded. Since ALL is the default, your example is the same as count(ALL 1), which means that duplicates are retained.
Since the expression "1" evaluates to non-null for every row, and since you are not removing duplicates, COUNT(1) should always return the same number as COUNT(*).
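Here is a small sketch of those rules using Python's sqlite3 module. The table mirrors the test1 example used elsewhere in this thread, with one extra row whose name is NULL added by me to make the COUNT(column) case visible; SQLite stands in for Oracle here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test1 (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO test1 VALUES (?, ?)",
    [(1, "abc"), (1, "abc"), (2, None)],
)

count_star, count_one, count_distinct_one, count_name = conn.execute(
    """
    SELECT COUNT(*),           -- every row
           COUNT(1),           -- rows where the constant 1 is non-null: every row
           COUNT(DISTINCT 1),  -- distinct non-null values of the constant: one
           COUNT(name)         -- rows where name is non-null
    FROM test1
    """
).fetchone()

print(count_star, count_one, count_distinct_one, count_name)
```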
Here is a link that will help answer your questions. In short:
count(*) is the correct way to write it and count(1) is OPTIMIZED TO BE count(*) internally -- since a) count the rows where 1 is not null is less efficient than b) count the rows
Difference between count(*) and count(1) in oracle?
count(*) means it will count all records i.e each and every cell
BUT
count(1) means it will add one pseudo column with value 1 and returns count of all records
This is similar to the difference between
SELECT * FROM table_name and SELECT 1 FROM table_name.
If you do
SELECT 1 FROM table_name
it will give you the number 1 for each row in the table. So yes, count(*) and count(1) will provide the same results, as will count(8), or count(column_name) when that column contains no NULLs.
There is no difference.
COUNT(1) is basically just counting a constant value 1 column for each row. As other users here have said, it's the same as COUNT(0) or COUNT(42). Any non-NULL value will suffice.
http://asktom.oracle.com/pls/asktom/f?p=100:11:2603224624843292::::P11_QUESTION_ID:1156151916789
The Oracle optimizer did apparently use to have bugs in it, which caused the count to be affected by which column you picked and whether it was in an index, so the COUNT(1) convention came into being.
SELECT COUNT(1) from <table name>
should do the exact same thing as
SELECT COUNT(*) from <table name>
There may have been, or may still be, some reasons why it would perform better than SELECT COUNT(*) on some databases, but I would consider that a bug in the DB.
SELECT COUNT(col_name) from <table name>
however has a different meaning, as it counts only the rows with a non-null value for the given column.
In Oracle, I believe these have exactly the same meaning.
You can test like this:
create table test1(
id number,
name varchar2(20)
);
insert into test1 values (1,'abc');
insert into test1 values (1,'abc');
select * from test1;
select count(*) from test1;
select count(1) from test1;
select count(ALL 1) from test1;
select count(DISTINCT 1) from test1;
Depending on who you ask, some people report that executing select count(1) from random_table; runs faster than select count(*) from random_table. Others claim they are exactly the same.
This link claims that the speed difference between the 2 is due to a FULL TABLE SCAN vs FAST FULL SCAN.