Basic SQL question: displaying single results of multiple rows - sql

I'm just beginning SQL, forgive me for my very basic questions. I have two. Here is the relevant code:
SELECT column_1, column_2, COUNT(*) AS total
FROM table1
This simple query shows a single row for all identical instances. The COUNT column will show there are multiple instances, but it will only show up as one row. Even without COUNT, there is only one row. Why only one row if I didn't use DISTINCT? Without DISTINCT it would seem that all identical rows would show up individually.
What does the * do in COUNT? I understand it usually indicates "all", but what does it change in this case?
Thank you.

https://dev.mysql.com/doc/refman/8.0/en/group-by-handling.html says:
Without GROUP BY, there is a single group and it is nondeterministic which name value to choose for the group.
That is, the use of any aggregating function such as COUNT() causes all the rows in the table to be treated as a single group. The result is one row for that group.
If it didn't work this way, there might not be any way to do an aggregation against the whole table.
You also asked about the purpose of the *. By default, if you use COUNT(<expression>) the rows where the expression is NULL are not counted. But COUNT(*) is a special syntax that always counts all the rows, so you don't have to think up an expression that is guaranteed to be non-NULL.
Some people use a constant value that is not NULL, e.g. COUNT(1), which achieves the same result. But it makes some people wonder if COUNT(2) would somehow count the rows differently (it doesn't). The standard SQL specification provides COUNT(*) as a special syntax to make this more clear.
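To make the NULL handling concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table contents are made up for illustration, and SQLite stands in for whatever database you are using, so treat it as a demonstration of the idea rather than of your engine's exact behaviour.

```python
import sqlite3

# Hypothetical data: three rows, one of which has NULL in column_1.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (column_1 TEXT, column_2 INTEGER)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?)",
    [("a", 1), (None, 2), ("a", 3)],
)

# Aggregates with no GROUP BY: the whole table is treated as a single group.
rows = conn.execute(
    "SELECT COUNT(*), COUNT(1), COUNT(2), COUNT(column_1) FROM table1"
).fetchall()
count_star, count_one, count_two, count_col = rows[0]

print(len(rows))                         # 1: one result row for the one group
print(count_star, count_one, count_two)  # 3 3 3: constants count every row
print(count_col)                         # 2: the NULL in column_1 is skipped
```

COUNT(1) and COUNT(2) agree with COUNT(*) because a non-NULL constant is never NULL for any row; only COUNT(column_1) can come out lower.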

This query:
SELECT column_1, column_2, COUNT(*) AS total
FROM table1
violates a rule in SQL: when using aggregation functions (like count), any other expression in the select clause should either:
also be an aggregation, or
appear in the group by clause
There are some nuances to this rule (like functional dependency), but this shows that the above SQL is ambiguous: count(*) without group by will ensure you get only one record in your output, but then it is not clear what the values would be of those other two expressions? Which record would they be based on? Some database engines have allowed these constructs, and choose a record to base those column values on.
However, if you remove all aggregations, you should get all the rows of your table:
SELECT column_1, column_2
FROM table1
If you want to get a row for each distinct value of column_1, column_2, each with a count, then use group by
SELECT column_1, column_2, COUNT(*) AS total
FROM table1
GROUP BY column_1, column_2
The number of rows in the result will depend on how many distinct pairs of column_1, column_2 appear in your table.
As to the * in COUNT(*): the alternative would be to mention an expression, like COUNT(column3): in that case null values will not be counted. * thus can be understood to mean: count all records, even when they have null values.
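The three variants discussed in this answer can be compared side by side. This is only a sketch: the rows are invented, and SQLite is used because it ships with Python. SQLite happens to be one of the engines that accepts the ungrouped query and picks the values for column_1 and column_2 arbitrarily, so only its row count and total are meaningful here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (column_1 TEXT, column_2 TEXT)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?)",
    [("x", "p"), ("x", "p"), ("x", "q"), ("y", "p")],
)

# No aggregate at all: every row comes back, duplicates included.
all_rows = conn.execute("SELECT column_1, column_2 FROM table1").fetchall()

# Aggregate without GROUP BY: the table collapses into a single row.
ungrouped = conn.execute(
    "SELECT column_1, column_2, COUNT(*) AS total FROM table1"
).fetchall()

# Aggregate with GROUP BY: one row per distinct (column_1, column_2) pair.
grouped = conn.execute(
    "SELECT column_1, column_2, COUNT(*) AS total "
    "FROM table1 GROUP BY column_1, column_2 "
    "ORDER BY column_1, column_2"
).fetchall()

print(len(all_rows))    # 4
print(len(ungrouped))   # 1
print(grouped)          # [('x', 'p', 2), ('x', 'q', 1), ('y', 'p', 1)]
```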

In a select list, the * is just a shortcut for all the columns.
If you select * from table1 it will display all the columns without requiring you to write them all out.
COUNT() is an aggregate function. If you pass COUNT(column_1) it counts only the rows where column_1 is not NULL. COUNT(*) is different: it is not expanded into a list of columns, it simply counts every row.
I'm thinking your SQL may work better with a group by clause at the end.
SELECT column_1, column_2, COUNT(*) AS total
FROM table1 GROUP BY column_1, column_2

Related

What's the difference between select distinct count, and select count distinct?

I am aware of select count(distinct a), but I recently came across select distinct count(a).
I'm not very sure if that is even valid.
If it is a valid use, could you give me a sample code with a sample data, that would explain me the difference.
Hive doesn't allow the latter.
Any leads would be appreciated!
The query select count(distinct a) will give you the number of unique values in a.
The query select distinct count(a) will give you the list of unique counts of values in a. Without grouping it will be just one row with the total count.
See following example
create table t(a int);
insert into t values (1),(2),(3),(3);
select count(distinct a) from t;
select distinct count(a) from t
group by a;
It will give you 3 for first query and values 1 and 2 for second query.
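If you want to try this without setting up a database, the same table and queries can be run through Python's built-in sqlite3 module. This is a sketch that assumes SQLite behaves like the answerer's database for these statements; the intermediate per_group query is added only to show where the values 1 and 2 come from.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INT)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,), (3,)])

# Number of unique values in a.
unique_values = conn.execute("SELECT COUNT(DISTINCT a) FROM t").fetchall()

# Count per group, before DISTINCT is applied: groups 1, 2 and 3.
per_group = conn.execute(
    "SELECT COUNT(a) FROM t GROUP BY a ORDER BY a"
).fetchall()

# DISTINCT then removes the repeated count.
distinct_counts = sorted(
    conn.execute("SELECT DISTINCT COUNT(a) FROM t GROUP BY a").fetchall()
)

# Without GROUP BY there is only one row, so DISTINCT changes nothing.
no_group = conn.execute("SELECT DISTINCT COUNT(a) FROM t").fetchall()

print(unique_values)    # [(3,)]
print(per_group)        # [(1,), (1,), (2,)]
print(distinct_counts)  # [(1,), (2,)]
print(no_group)         # [(4,)]
```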
I cannot think of any useful situation where you would want to use:
select distinct count(a)
If the query has no group by, then the distinct is anomalous. The query only returns one row anyway. If there is a group by, then the grouping columns should be in the select, to identify each row.
I mean, technically, with a group by, it would be answering the question: "which different counts of non-null values of a occur across the groups?". Usually, it is much more useful to know the count per group.
If you want to count the number of distinct values of a, then use count(distinct a).

How is it possible for count distinct to show duplicates, but group by does not?

I want to query for duplicates in my data.
So, the first thing I do is I do a count distinct:
select count(distinct colA, colB ....) from Table
and a count:
select count(*) from Table
And I see that the count distinct is lower than the count(*).
So, now I want to actually see the duplicates, so I do this:
select colA, colB, .... count(*) from Table
group by colA, colB ... having count(*) > 1;
Now, for some reason, this does not return any records at all. The table is too big for me to show results here, and the columns too many.
How is it possible for both of these to be true? the counts are different, but no rows show up when I group them and filter for count(*) >1?
Thanks.
The behavior you see may depend on the database you are using. However, I'm pretty sure that the problem is due to NULL values in the columns. For instance, MySQL explicitly describes COUNT(DISTINCT) as:
COUNT(DISTINCT expr,[expr...])
Returns a count of the number of rows with different non-NULL expr values.
Not all databases support COUNT(DISTINCT) with multiple expressions. Different databases may handle NULL values differently. But, they seem to be the most likely cause of the discrepancy.
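A small sketch of how a NULL can produce exactly this symptom. It uses Python's sqlite3 module with an invented three-row table. SQLite does not accept COUNT(DISTINCT) with several expressions, so a single column (colA) stands in for the column list in the question; that substitution is an assumption made for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (colA INTEGER)")
# No value is repeated, but one row holds NULL.
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (None,)])

total = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
distinct_non_null = conn.execute(
    "SELECT COUNT(DISTINCT colA) FROM t"
).fetchone()[0]
duplicates = conn.execute(
    "SELECT colA, COUNT(*) FROM t GROUP BY colA HAVING COUNT(*) > 1"
).fetchall()

print(total)              # 3
print(distinct_non_null)  # 2: the NULL row is not counted
print(duplicates)         # []: yet no group has more than one row
```

So a gap between the two counts does not by itself prove there are duplicates; counting the rows that contain a NULL in any of the columns is a quick way to check whether this explains the difference.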

"group by" needed in count(*) SQL statement?

The following statement works in my database:
select column_a, count(*) from my_schema.my_table group by 1;
but this one doesn't:
select column_a, count(*) from my_schema.my_table;
I get the error:
ERROR: column "my_table.column_a" must appear in the GROUP BY clause
or be used in an aggregate function
Helpful note: This thread: What does SQL clause "GROUP BY 1" mean? discusses the meaning of "group by 1".
Update:
The reason why I am confused is because I have often seen count(*) as follows:
select count(*) from my_schema.my_table
where there is no group by statement. Is COUNT always required to be followed by group by? Is the group by statement implicit in this case?
This error makes perfect sense. COUNT is an "aggregate" function. So you need to tell it which field to aggregate by, which is done with the GROUP BY clause.
The one which probably makes most sense in your case would be:
SELECT column_a, COUNT(*) FROM my_schema.my_table GROUP BY column_a;
If you only use the COUNT(*) clause, you are asking for the total number of rows, instead of aggregating by another condition. Your question, whether GROUP BY is implicit in that case, could be answered with "sort of": not specifying anything is a bit like asking to "group by nothing", which means you get one huge aggregate covering the whole table.
As an example, executing:
SELECT COUNT(*) FROM table;
will show you the number of rows in that table, whereas:
SELECT col_a, COUNT(*) FROM table GROUP BY col_a;
will show you the number of rows per value of col_a. Something like:
col_a  | COUNT(*)
-------+---------
value1 | 100
value2 | 10
value3 | 123
You should also take into account that the * means to count everything. Including NULLs! If you want to count a specific condition, you should use COUNT(expression)! See the docs about aggregate functions for more details on this topic.
If you don't use the GROUP BY clause at all, the whole table is treated as a single group and COUNT(*) returns one total, so there is no single column_a value that could be shown next to it; that is why the second statement raises an error. By adding GROUP BY 1 you split the rows into one group per value of column_a, so every output row has a well-defined column_a and its own count.
When you have a function like count, sum etc. you need to group the other columns. This would be equivalent to your query:
select column_a, count(*) from my_schema.my_table group by column_a;
When you use count(*) with no other column, you are counting all rows from SELECT * from the table. When you use count(*) alongside another column, you are counting the number of rows for each different value of that other column. So in this case you need to group the results, in order to show each value and its count only once.
group by 1 in this case refers to column_a which has the column position 1 in your query.
This is why it works on your server. However, this is not good practice in SQL.
You should mention the column name, because the order of the columns in the select list may change, which makes this code hard to maintain.
The best solution is:
select column_a, count(*) from my_schema.my_table group by column_a;
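To see the three forms from this thread next to each other, here is a sketch using Python's sqlite3 module. The rows are invented and the my_schema prefix is dropped because an in-memory SQLite database has no such schema; SQLite, like the asker's database, reads an integer in GROUP BY as a position in the select list.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (column_a TEXT)")
conn.executemany(
    "INSERT INTO my_table VALUES (?)", [("x",), ("y",), ("x",), ("x",)]
)

# COUNT(*) alone: the whole table is one group, so one row comes back.
whole_table = conn.execute("SELECT COUNT(*) FROM my_table").fetchall()

# Grouping by name: one row per distinct column_a.
by_name = conn.execute(
    "SELECT column_a, COUNT(*) FROM my_table "
    "GROUP BY column_a ORDER BY column_a"
).fetchall()

# Grouping by position: 1 refers to the first item in the select list.
by_position = conn.execute(
    "SELECT column_a, COUNT(*) FROM my_table "
    "GROUP BY 1 ORDER BY column_a"
).fetchall()

print(whole_table)             # [(4,)]
print(by_name)                 # [('x', 3), ('y', 1)]
print(by_position == by_name)  # True
```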

comparing rows in Sql, without using distinct operator? (distinct operator implementation)

I want to compare two rows from a query result, for instance, to check whether the 1st row is equal to the 2nd row.
Given a query of the form
SELECT * FROM table_name
If the query returns 100 rows, how do we compare the rows with each other for equality? I am curious how SQL Server implements this behind the scenes, basically how the DISTINCT operator is implemented, as it would help me understand the concept more clearly.
The simplest approach SQL Server might use is to compare hashes of whole rows:
SELECT CHECKSUM(*)
from YourTable
or of chosen columns:
SELECT CHECKSUM(col1, col2, col3)
from YourTable
If the checksums differ, the rows differ. If the checksums match, the exact column values still need to be compared, but the checksum makes it cheap to filter out the rows that cannot match.
To find candidate duplicates:
SELECT CHECKSUM(*)
from YourTable
GROUP BY CHECKSUM(*)
HAVING COUNT(*) > 1
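The two-step idea (cheap hash first, exact comparison only when hashes match) can be sketched outside the database. The function below is a hypothetical illustration in Python, with the built-in hash() standing in for CHECKSUM; it is not a description of how SQL Server actually implements DISTINCT.

```python
from collections import defaultdict


def find_duplicate_rows(rows):
    """Return lists of row indexes whose rows are exactly equal."""
    # Step 1: bucket rows by hash. Rows with different hashes cannot be equal.
    buckets = defaultdict(list)
    for index, row in enumerate(rows):
        buckets[hash(row)].append(index)

    # Step 2: equal hashes do not guarantee equal rows, so compare values.
    duplicates = []
    for indexes in buckets.values():
        if len(indexes) < 2:
            continue
        groups = []  # each group holds indexes of rows that compare equal
        for index in indexes:
            for group in groups:
                if rows[group[0]] == rows[index]:
                    group.append(index)
                    break
            else:
                groups.append([index])
        duplicates.extend(group for group in groups if len(group) > 1)
    return duplicates


# Made-up rows: indexes 0 and 2 match, and so do indexes 1 and 4.
rows = [(1, "ann"), (2, "bob"), (1, "ann"), (3, "cy"), (2, "bob"), (2, "bo")]
duplicate_groups = sorted(find_duplicate_rows(rows))
print(duplicate_groups)  # [[0, 2], [1, 4]]
```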
You could use the following query:
SELECT col1,col2,... -- the same columns as in the GROUP BY
FROM table_name
GROUP BY col1,col2,... -- all columns to test for equality here
HAVING COUNT(*)>1
In the GROUP BY you put the name of every column you want to be equal. If you want entire rows to be equal, put down the name of every column in the table there.
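Here is a runnable sketch of that GROUP BY ... HAVING pattern using Python's sqlite3 module. The two columns and their contents are invented, and a COUNT(*) column (aliased copies) is added to the select list to show how often each combination occurs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_name (col1 TEXT, col2 INTEGER)")
conn.executemany(
    "INSERT INTO table_name VALUES (?, ?)",
    [("a", 1), ("b", 2), ("a", 1), ("a", 2), ("b", 2), ("a", 1)],
)

# Keep only the (col1, col2) combinations that occur more than once.
repeated = conn.execute(
    "SELECT col1, col2, COUNT(*) AS copies FROM table_name "
    "GROUP BY col1, col2 HAVING COUNT(*) > 1 "
    "ORDER BY col1, col2"
).fetchall()

print(repeated)  # [('a', 1, 3), ('b', 2, 2)]
```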
A table in a relational database will normally have a primary key that other tables can reference.
Because of this key, your rows 1-100 will all be unique when compared as whole rows.
However, if you are trying to compare specific columns, you will need to build something similar to this:
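$temp = null;    // name seen on the previous row
$saveIDs = [];   // id of the first row for each distinct name
// Sort by name so that equal names end up on adjacent rows.
$stmt = $mysqli->prepare("SELECT id, name FROM users ORDER BY name");
$stmt->execute();
$stmt->bind_result($id, $name);
while ($stmt->fetch()) {
    // A change in name marks the start of a new distinct value.
    if ($temp !== $name) {
        $temp = $name;
        $saveIDs[] = $id;
    }
}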

Not getting the correct count in SQL

I am totally new to SQL. I have a simple select query similar to this:
SELECT COUNT(col1) FROM table1
There are some 120 records in the table and shown on the GUI.
For some reason, this query always returns a number which is less than the actual count.
Can somebody please help me?
Try
select count(*) from table1
Edit: To explain further, count(*) gives you the rowcount for a table, including duplicates and nulls. count(isnull(col1,0)) will do the same thing, but slightly slower, since isnull must be evaluated for each row.
You might have some null values in col1 column. Aggregate functions ignore nulls.
try this
SELECT COUNT(ISNULL(col1,0)) FROM table1
Slightly tangential, but there's also the useful
SELECT count(distinct cola) from table1
which gives you the number of distinct non-NULL values of cola in the table.
You are getting the correct count
As per https://learn.microsoft.com
COUNT(*) returns the number of items in a group. This includes NULL values and duplicates.
COUNT(ALL expression) evaluates an expression for each row in a group and returns the number of nonnull values.
COUNT(DISTINCT expression) evaluates an expression for each row in a group and returns the number of unique, non null values.
In your case you have passed the column name to COUNT, which is why you get the count of non-NULL records. Your table data may have NULL values in the given column (col1).
Hope this helps!
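For anyone who wants to reproduce the difference, here is a sketch with Python's sqlite3 module and five made-up rows. The ISNULL(col1, 0) call from the answers above is T-SQL, so COALESCE, which SQLite accepts, is used in its place here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (col1 TEXT)")
conn.executemany(
    "INSERT INTO table1 VALUES (?)",
    [("a",), ("a",), (None,), ("b",), (None,)],
)

count_all, count_col, count_filled, count_distinct = conn.execute(
    "SELECT COUNT(*), "                  # every row
    "COUNT(col1), "                      # rows where col1 is not NULL
    "COUNT(COALESCE(col1, '')), "        # NULLs replaced, so every row again
    "COUNT(DISTINCT col1) "              # unique non-NULL values
    "FROM table1"
).fetchone()

print(count_all, count_col, count_filled, count_distinct)  # 5 3 5 2
```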