What happens when you use DISTINCT * in COUNT() in SQL? - sql

I've just learned about the COUNT() function, and how it is possible to get the number of rows in a table by passing * as the argument.
SELECT COUNT(*) FROM table;
I've also learned that we can get the number of distinct values in a column of a table by using DISTINCT.
SELECT COUNT(DISTINCT column) FROM table;
I've noticed that the following returns nothing.
SELECT COUNT(DISTINCT *) FROM table;
Why is this?
I suppose the root of my issue is that I don't fully understand what the COUNT() function does with * as the argument. My resource says that COUNT() takes a column as an argument and counts how many rows have a non-NULL value in it. So say we have a table with a column in which some rows are NULL and others are not. If COUNT(column) doesn't count the NULL rows, what happens differently in COUNT(*) so that all the rows are counted? And by extension, what happens during COUNT(DISTINCT *)?

This would be a syntax error in most databases. If it were allowed, it would probably be equivalent to:
select count(*)
from (select distinct * from t) t
However, NULL values might throw it off.
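The difference between these COUNT forms, and the subquery workaround above, can be checked with a minimal sketch. It uses Python's built-in sqlite3 module; the table t, its columns a and b, and the sample rows are made up for illustration, and other databases may treat NULLs in SELECT DISTINCT differently.

```python
import sqlite3

# Hypothetical table t(a, b); duplicate rows and NULLs are included on purpose.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(1, "x"), (1, "x"), (2, None), (2, None), (None, None)],
)


def scalar(sql):
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchone()[0]


total_rows = scalar("SELECT COUNT(*) FROM t")           # every row, NULLs included
non_null_b = scalar("SELECT COUNT(b) FROM t")           # rows where b IS NOT NULL
distinct_b = scalar("SELECT COUNT(DISTINCT b) FROM t")  # distinct non-NULL values of b
# The workaround from the answer: de-duplicate whole rows first, then count them.
distinct_rows = scalar("SELECT COUNT(*) FROM (SELECT DISTINCT * FROM t) AS d")

print(total_rows, non_null_b, distinct_b, distinct_rows)  # 5 2 1 3
```

In SQLite, SELECT DISTINCT treats two NULLs as equal, so the two (2, NULL) rows collapse into one and the all-NULL row still counts as a row.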

Related

SQL returning 0 for count(), but returning multiple rows with simple SELECT

Sorry if the phrasing of my question was not very clear.
I am running this simple query below
SELECT count(cg)
FROM all_data
WHERE cg is null
and am getting 0 as the result. When I run this query
SELECT cg
FROM all_data
WHERE cg is null
I get a bunch of records that fit the criteria. There are very obviously many records that have a cg value of null, but they are not counted by the count() query.
Is there a reason for this? Am I doing something wrong?
Thanks for any help
Aggregates (COUNT(), SUM() etc.) ignore NULL values.
Use COUNT(*) to count all rows matching your condition.
SELECT COUNT(*)
FROM all_data
WHERE cg IS NULL
Further reading - Count Function (Microsoft Access SQL):
The Count function does not count records that have Null fields unless expr is the asterisk (*) wildcard character. If you use an asterisk, Count calculates the total number of records, including those that contain Null fields. Count(*) is considerably faster than Count([Column Name]).
If you want to count the number of NULL values across the whole table, use the following query:
SELECT
SUM(CASE WHEN CG IS NULL THEN 1 ELSE 0 END) AMOUNT_CG
FROM all_data
Otherwise, follow the tip in the answer above.
According to the SQL Reference Manual section on Aggregate Functions:
All aggregate functions except COUNT(*) and GROUPING ignore nulls. You can use the NVL function in the argument to an aggregate function to substitute a value for a null. COUNT never returns null, but returns either a number or zero. For all the remaining aggregate functions, if the data set contains no rows, or contains only rows with nulls as arguments to the aggregate function, then the function returns null.
So from above information we can conclude that to solve your problem use count(*) instead of count(cg).
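The behaviour described in these answers can be reproduced with a small sketch using Python's built-in sqlite3 module. The all_data table and cg column come from the question; the id column and the sample rows are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The id column and the four rows below are made up; only cg matters here.
conn.execute("CREATE TABLE all_data (id INTEGER, cg TEXT)")
conn.executemany(
    "INSERT INTO all_data VALUES (?, ?)",
    [(1, None), (2, "a"), (3, None), (4, "b")],
)


def scalar(sql):
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchone()[0]


# COUNT(cg) skips NULLs, and the WHERE clause keeps only NULLs, so nothing is counted.
count_cg = scalar("SELECT COUNT(cg) FROM all_data WHERE cg IS NULL")
# COUNT(*) counts rows, regardless of what the columns contain.
count_star = scalar("SELECT COUNT(*) FROM all_data WHERE cg IS NULL")
# Conditional sum over the whole table, with no WHERE clause needed.
sum_case = scalar("SELECT SUM(CASE WHEN cg IS NULL THEN 1 ELSE 0 END) FROM all_data")

print(count_cg, count_star, sum_case)  # 0 2 2
```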

"group by" needed in count(*) SQL statement?

The following statement works in my database:
select column_a, count(*) from my_schema.my_table group by 1;
but this one doesn't:
select column_a, count(*) from my_schema.my_table;
I get the error:
ERROR: column "my_table.column_a" must appear in the GROUP BY clause
or be used in an aggregate function
Helpful note: This thread: What does SQL clause "GROUP BY 1" mean? discusses the meaning of "group by 1".
Update:
The reason why I am confused is because I have often seen count(*) as follows:
select count(*) from my_schema.my_table
where there is no group by statement. Is COUNT always required to be followed by group by? Is the group by statement implicit in this case?
This error makes perfect sense. COUNT is an "aggregate" function. So you need to tell it which field to aggregate by, which is done with the GROUP BY clause.
The one which probably makes most sense in your case would be:
SELECT column_a, COUNT(*) FROM my_schema.my_table GROUP BY column_a;
If you select only COUNT(*), you are asking for the total number of rows instead of aggregating by another condition. Your question, whether GROUP BY is implicit in that case, could be answered with "sort of": not specifying anything is a bit like asking to "group by nothing", which means you get one huge aggregate covering the whole table.
As an example, executing:
SELECT COUNT(*) FROM table;
will show you the number of rows in that table, whereas:
SELECT col_a, COUNT(*) FROM table GROUP BY col_a;
will show you the number of rows per value of col_a. Something like:
col_a | COUNT(*)
---------+----------------
value1 | 100
value2 | 10
value3 | 123
You should also take into account that * means to count every row, including rows that contain NULLs. If you want to count only the rows where a particular expression is not NULL, use COUNT(expression). See the docs about aggregate functions for more details on this topic.
Without a GROUP BY clause, the query asks for one overall count next to a column_a value from every row, and the database cannot combine the two, hence the error. Adding GROUP BY 1 groups the rows by the first selected column (column_a), so each distinct value is returned once with its own count.
When you have a function like count, sum etc. you need to group the other columns. This would be equivalent to your query:
select column_a, count(*) from my_schema.my_table group by column_a;
When you use count(*) with no other column, you are counting all rows from SELECT * from the table. When you use count(*) alongside another column, you are counting the number of rows for each different value of that other column. So in this case you need to group the results, in order to show each value and its count only once.
group by 1 in this case refers to column_a which has the column position 1 in your query.
This is why it works on your server. However, it is not good practice in SQL.
You should use the column name instead, because the position of the column in the SELECT list may change, which makes the code hard to maintain.
The best solution is:
select column_a, count(*) from my_schema.my_table group by column_a;
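The grouping behaviour discussed in this thread can be tried out with a short sketch using Python's built-in sqlite3 module. The sample rows are made up, and the my_schema prefix is dropped because the in-memory database has no such schema. Note that SQLite, unlike the database in the question, does not reject a bare column next to an aggregate, so only the valid queries are shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (column_a TEXT)")
conn.executemany(
    "INSERT INTO my_table VALUES (?)",
    [("value1",), ("value1",), ("value2",), (None,)],
)

# No GROUP BY: the whole table is one group, so exactly one row comes back.
whole_table = conn.execute("SELECT COUNT(*) FROM my_table").fetchall()

# GROUP BY column name: one row per distinct value (NULLs form their own group).
by_name = conn.execute(
    "SELECT column_a, COUNT(*) FROM my_table GROUP BY column_a ORDER BY column_a"
).fetchall()

# GROUP BY 1: the same grouping, written by position in the SELECT list.
by_position = conn.execute(
    "SELECT column_a, COUNT(*) FROM my_table GROUP BY 1 ORDER BY 1"
).fetchall()

print(whole_table)
print(by_name)
print(by_name == by_position)
```

Both grouped queries return the same rows; the positional form simply stops matching the intent if someone later reorders the SELECT list.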

sqlite SELECT AVG returns null

Does anyone know why a SQL SELECT query returns no rows when SELECTing from an empty table, but when trying to SELECT the AVG from a column in an empty table it returns < null >? The difference in behavior just seems odd to me. I’m using a sqlite database if that makes any difference.
Here are the two queries:
Normal select: SELECT a FROM table1
If table1 is empty I get no rows back
Avg select: SELECT AVG(a) FROM table1
If table1 is empty I get back a < null > row.
From the ANSI 92 spec
b) If AVG, MAX, MIN, or SUM is specified, then
Case:
i) If TXA is empty, then the result is the null value.
Read more at: http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt
I'm not positive, but to determine average, you must divide by the number of rows. If the number of rows is zero, dividing by it would be undefined. Thus, the NULL return. Just a guess.
You're doing an aggregate. Since the aggregate is defined for 0-n rows (in this case, 0 rows yields null), you will always get one result back (exactly one in this case).
To put it another way, you're not asking for rows from the table--you're asking for the average of one column in the table and that's what you're getting back. Getting anything other than one row in this case would be weirder.
If you had asked for non-aggregated columns, too, e.g.
SELECT Salesperson, AVG(Sale)
FROM Sales
GROUP BY Salesperson
then I would expect you to get no rows back because there wouldn't be anything to satisfy the non-aggregate selects.
AVG is an aggregate function similar to COUNT. If you do:
SELECT COUNT(a) FROM table1 you'd expect to get one row containing zero.
It's the same with AVG, SUM, etc. You get one row with the result of the aggregate function.
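The answers above can be verified against an empty table with a minimal sketch using Python's built-in sqlite3 module. The table1 and a names come from the question; the REAL column type is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (a REAL)")  # deliberately left empty

# Plain select: no rows in the table, so no rows in the result.
plain = conn.execute("SELECT a FROM table1").fetchall()
# Aggregate without GROUP BY: always exactly one row; AVG of nothing is NULL.
avg = conn.execute("SELECT AVG(a) FROM table1").fetchall()
# COUNT is the exception among aggregates: it yields 0 rather than NULL.
cnt = conn.execute("SELECT COUNT(a) FROM table1").fetchall()
# With GROUP BY there are no groups, hence no result rows at all.
grouped = conn.execute("SELECT a, AVG(a) FROM table1 GROUP BY a").fetchall()

print(plain, avg, cnt, grouped)  # [] [(None,)] [(0,)] []
```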

What does "select count(1) from table_name" on any database tables mean?

When we execute select count(*) from table_name it returns the number of rows.
What does count(1) do? What does 1 signify here? Is this the same as count(*) (as it gives the same result on execution)?
The parameter to the COUNT function is an expression that is to be evaluated for each row. The COUNT function returns the number of rows for which the expression evaluates to a non-null value. ( * is a special expression that is not evaluated, it simply returns the number of rows.)
There are two additional modifiers for the expression: ALL and DISTINCT. These determine whether duplicates are discarded. Since ALL is the default, your example is the same as count(ALL 1), which means that duplicates are retained.
Since the expression "1" evaluates to non-null for every row, and since you are not removing duplicates, COUNT(1) should always return the same number as COUNT(*).
Here is a link that will help answer your questions. In short:
count(*) is the correct way to write it and count(1) is OPTIMIZED TO BE count(*) internally -- since
a) count the rows where 1 is not null
is less efficient than
b) count the rows
Difference between count(*) and count(1) in oracle?
count(*) means it will count all records i.e each and every cell
BUT
count(1) means it will add one pseudo column with value 1 and returns count of all records
This is similar to the difference between
SELECT * FROM table_name and SELECT 1 FROM table_name.
If you do
SELECT 1 FROM table_name
it will give you the number 1 for each row in the table. So yes, count(*) and count(1) will provide the same results, as will count(8). count(column_name) matches them only when the column contains no NULLs.
There is no difference.
COUNT(1) is basically just counting a constant value 1 column for each row. As other users here have said, it's the same as COUNT(0) or COUNT(42). Any non-NULL value will suffice.
http://asktom.oracle.com/pls/asktom/f?p=100:11:2603224624843292::::P11_QUESTION_ID:1156151916789
The Oracle optimizer did apparently use to have bugs in it, which caused the count to be affected by which column you picked and whether it was in an index, so the COUNT(1) convention came into being.
SELECT COUNT(1) from <table name>
should do the exact same thing as
SELECT COUNT(*) from <table name>
There may have been, or may still be, some reasons why it would perform better than SELECT COUNT(*) on some database, but I would consider that a bug in the DB.
SELECT COUNT(col_name) from <table name>
however has a different meaning, as it counts only the rows with a non-null value for the given column.
In Oracle, I believe these have exactly the same meaning.
You can test like this:
create table test1(
id number,
name varchar2(20)
);
insert into test1 values (1,'abc');
insert into test1 values (1,'abc');
select * from test1;
select count(*) from test1;
select count(1) from test1;
select count(ALL 1) from test1;
select count(DISTINCT 1) from test1;
Depending on who you ask, some people report that executing select count(1) from random_table; runs faster than select count(*) from random_table. Others claim they are exactly the same.
This link claims that the speed difference between the 2 is due to a FULL TABLE SCAN vs FAST FULL SCAN.
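The results (though not the performance claims) can be compared directly with a small sketch using Python's built-in sqlite3 module instead of Oracle. The test1 table follows the example above, with generic column types and one extra row with a NULL name added as an assumption to show where COUNT(column) differs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test1 (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO test1 VALUES (?, ?)",
    [(1, "abc"), (1, "abc"), (2, None)],  # third row is an added assumption
)


def scalar(sql):
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchone()[0]


count_star = scalar("SELECT COUNT(*) FROM test1")          # all rows
count_one = scalar("SELECT COUNT(1) FROM test1")           # constant 1 is never NULL
count_42 = scalar("SELECT COUNT(42) FROM test1")           # any non-NULL constant works
count_distinct_one = scalar("SELECT COUNT(DISTINCT 1) FROM test1")  # one distinct value
count_name = scalar("SELECT COUNT(name) FROM test1")       # skips the NULL name

print(count_star, count_one, count_42, count_distinct_one, count_name)  # 3 3 3 1 2
```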

Not getting the correct count in SQL

I am totally new to SQL. I have a simple select query similar to this:
SELECT COUNT(col1) FROM table1
There are some 120 records in the table and shown on the GUI.
For some reason, this query always returns a number which is less than the actual count.
Can somebody please help me?
Try
select count(*) from table1
Edit: To explain further, count(*) gives you the rowcount for a table, including duplicates and nulls. count(isnull(col1,0)) will do the same thing, but slightly slower, since isnull must be evaluated for each row.
You might have some null values in col1 column. Aggregate functions ignore nulls.
try this
SELECT COUNT(ISNULL(col1,0)) FROM table1
Slightly tangential, but there's also the useful
SELECT count(distinct cola) from table1
which gives you the number of distinct non-NULL values of cola in the table.
You are getting the correct count
As per https://learn.microsoft.com
COUNT(*) returns the number of items in a group. This includes NULL values and duplicates.
COUNT(ALL expression) evaluates an expression for each row in a group and returns the number of nonnull values.
COUNT(DISTINCT expression) evaluates an expression for each row in a group and returns the number of unique, non null values.
In your case you passed the column name to COUNT, which is why you get the count of non-NULL records only; your table probably has NULL values in that column (col1).
Hope this helps!
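A quick way to see why COUNT(col1) comes up short is the sketch below, using Python's built-in sqlite3 module. The sample values are made up, and COALESCE is swapped in for the two-argument ISNULL used above because SQLite does not provide ISNULL in that form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (col1 INTEGER)")
conn.executemany(
    "INSERT INTO table1 VALUES (?)",
    [(10,), (10,), (None,), (20,), (None,)],  # hypothetical data with two NULLs
)


def scalar(sql):
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchone()[0]


all_rows = scalar("SELECT COUNT(*) FROM table1")                    # 5 rows in total
non_null = scalar("SELECT COUNT(col1) FROM table1")                 # the 2 NULLs are skipped
distinct = scalar("SELECT COUNT(DISTINCT col1) FROM table1")        # distinct non-NULL: 10, 20
# Replacing NULL with 0 before counting makes every row count again.
coalesced = scalar("SELECT COUNT(COALESCE(col1, 0)) FROM table1")

print(all_rows, non_null, distinct, coalesced)  # 5 3 2 5
```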