In SQL, what's the difference between count(column) and count(*)?

I have the following query:
select column_name, count(column_name)
from table
group by column_name
having count(column_name) > 1;
What would be the difference if I replaced all calls to count(column_name) with count(*)?
This question was inspired by How do I find duplicate values in a table in Oracle?.
To clarify the accepted answer (and maybe my question), replacing count(column_name) with count(*) would return an extra row in the result that contains a null and the count of null values in the column.

count(*) counts NULLs and count(column) does not
[edit] added this code so that people can run it
create table #bla(id int,id2 int)
insert #bla values(null,null)
insert #bla values(1,null)
insert #bla values(null,1)
insert #bla values(1,null)
insert #bla values(null,1)
insert #bla values(1,null)
insert #bla values(null,null)
select count(*),count(id),count(id2)
from #bla
results
7 3 2

Another minor difference, between using * and a specific column, is that in the column case you can add the keyword DISTINCT, and restrict the count to distinct values:
select column_a, count(distinct column_b)
from table
group by column_a
having count(distinct column_b) > 1;

A further and perhaps subtle difference is that in some database implementations the count(*) is computed by looking at the indexes on the table in question rather than the actual data rows. Since no specific column is specified, there is no need to bother with the actual rows and their values (as there would be if you counted a specific column). Allowing the database to use the index data can be significantly faster than making it count "real" rows.

The explanation in the docs helps to clarify this:
COUNT(*) returns the number of items in a group, including NULL values and duplicates.
COUNT(expression) evaluates expression for each row in a group and returns the number of nonnull values.
So count(*) includes nulls, the other method doesn't.

We can use the Stack Exchange Data Explorer to illustrate the difference with a simple query. The Users table in Stack Overflow's database has columns that are often left blank, like the user's Website URL.
-- count(column_name) vs. count(*)
-- Illustrates the difference between counting a column
-- that can hold null values, a 'not null' column, and count(*)
select count(WebsiteUrl), count(Id), count(*) from Users
If you run the query above in the Data Explorer, you'll see that the count is the same for count(Id) and count(*) because the Id column doesn't allow null values. The WebsiteUrl count is much lower, though, because that column allows null.

COUNT(*) tells SQL Server to count all the rows in a table, including rows that contain NULLs.
COUNT(column_name) counts only the rows that have a non-null value in that column.
Please see the following test code, run on SQL Server 2008:
-- Table variable
DECLARE @Table TABLE
(
    CustomerId int NULL
    , Name nvarchar(50) NULL
)
-- Insert some records for tests
INSERT INTO @Table VALUES( NULL, 'Pedro')
INSERT INTO @Table VALUES( 1, 'Juan')
INSERT INTO @Table VALUES( 2, 'Pablo')
INSERT INTO @Table VALUES( 3, 'Marcelo')
INSERT INTO @Table VALUES( NULL, 'Leonardo')
INSERT INTO @Table VALUES( 4, 'Ignacio')
-- Count all the rows by specifying *
SELECT COUNT(*) AS 'AllRowsCount'
FROM @Table
-- Count only rows with a non-NULL CustomerId (excludes NULLs)
SELECT COUNT(CustomerId) AS 'OnlyNotNullCounts'
FROM @Table

COUNT(*) – Returns the total number of records in a table (including records with NULL values).
COUNT(Column Name) – Returns the total number of non-NULL records. That is, it does not count records where that particular column is NULL.

Basically, COUNT(*) counts all the rows in a table, whereas COUNT(COLUMN_NAME) excludes rows where that column is NULL, as the other answers here have also said.
The more interesting point is performance: to keep queries and the database optimized, it is better to use COUNT(*) rather than COUNT(COLUMN_NAME) unless you are doing multiple counts or a complex query. Otherwise, it can noticeably lower database performance when dealing with a large amount of data.

Further elaborating on the answers given by @SQLMenace and @Brannon, here is an example using the GROUP BY clause, which the OP mentioned but which is not present in the answer by @SQLMenace:
CREATE TABLE table1 (
id INT
);
INSERT INTO table1 VALUES
(1),
(2),
(NULL),
(2),
(NULL),
(3),
(1),
(4),
(NULL),
(2);
SELECT * FROM table1;
+------+
| id |
+------+
| 1 |
| 2 |
| NULL |
| 2 |
| NULL |
| 3 |
| 1 |
| 4 |
| NULL |
| 2 |
+------+
10 rows in set (0.00 sec)
SELECT id, COUNT(*) FROM table1 GROUP BY id;
+------+----------+
| id | COUNT(*) |
+------+----------+
| 1 | 2 |
| 2 | 3 |
| NULL | 3 |
| 3 | 1 |
| 4 | 1 |
+------+----------+
5 rows in set (0.00 sec)
Here, COUNT(*) counts the number of occurrences of each type of id including NULL
SELECT id, COUNT(id) FROM table1 GROUP BY id;
+------+-----------+
| id | COUNT(id) |
+------+-----------+
| 1 | 2 |
| 2 | 3 |
| NULL | 0 |
| 3 | 1 |
| 4 | 1 |
+------+-----------+
5 rows in set (0.00 sec)
Here, COUNT(id) counts the number of occurrences of each type of id but does not count the number of occurrences of NULL
SELECT id, COUNT(DISTINCT id) FROM table1 GROUP BY id;
+------+--------------------+
| id | COUNT(DISTINCT id) |
+------+--------------------+
| NULL | 0 |
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 4 | 1 |
+------+--------------------+
5 rows in set (0.00 sec)
Here, COUNT(DISTINCT id) counts the number of occurrences of each type of id only once (does not count duplicates) and also does not count the number of occurrences of NULL
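To tie this back to the HAVING clause in the original question, here is a minimal sketch that loads the same ten values into an in-memory SQLite database through Python's sqlite3 module. SQLite and the variable names are assumptions made for illustration; the answer above used MySQL.

```python
import sqlite3

# In-memory SQLite database holding the same ids as table1 above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (id INT)")
conn.executemany(
    "INSERT INTO table1 VALUES (?)",
    [(1,), (2,), (None,), (2,), (None,), (3,), (1,), (4,), (None,), (2,)],
)

# COUNT(id) is 0 for the NULL group, so HAVING drops that group.
dupes_by_column = conn.execute(
    "SELECT id, COUNT(id) FROM table1 "
    "GROUP BY id HAVING COUNT(id) > 1 ORDER BY id"
).fetchall()

# COUNT(*) counts rows, so the NULL group (3 rows) passes the filter.
dupes_by_star = conn.execute(
    "SELECT id, COUNT(*) FROM table1 "
    "GROUP BY id HAVING COUNT(*) > 1 ORDER BY id"
).fetchall()

print(dupes_by_column)
print(dupes_by_star)
```

The second result contains one extra row for the NULL group, which is the difference described in the question.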

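It is best to use
Count(1) in place of a column name or *
to count the number of rows in a table. The claim is that it is faster than the other forms because it never has to check whether the column name exists in the table, although most modern optimizers treat COUNT(1) and COUNT(*) identically.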

There is no difference if the column is fixed (never NULL) in your table. If you want to count over more than one column, you have to specify which columns you need to count.

As mentioned in the previous answers, Count(*) counts rows even where the column is NULL, whereas count(Columnname) counts a row only if the column has a value.
It's always best practice to avoid * (Select *, count *, …)

Related

HQL, insert two rows if a condition is met

I have the following table called table_persons in Hive:
+--------+------+------------+
| people | type | date |
+--------+------+------------+
| lisa | bot | 19-04-2022 |
| wayne | per | 19-04-2022 |
+--------+------+------------+
If type is "bot", I have to add two rows to the table d1_info; if type is "per", I only have to add one row, so that the result is the following:
+---------+------+------------+
| db_type | info | date |
+---------+------+------------+
| x_bot | x | 19-04-2022 |
| x_bnt | x | 19-04-2022 |
| x_per | b | 19-04-2022 |
+---------+------+------------+
How can I add two rows if this condition is met?
with a Case When maybe?
You may try using a union to duplicate the rows with bot. The following example unions a first query, which selects all records, with a second query, which selects only those with bot.
Edit
In response to the edited question, I have added an additional parity column (storing 1 or 0) named original to differentiate the duplicate entry.
SELECT
p1.*,
1 as original
FROM
table_persons p1
UNION ALL
SELECT
p1.*,
0 as original
FROM
table_persons p1
WHERE p1.type='bot'
You may then insert this into your other table d1_info, using the above query as a subquery or CTE and applying the desired transformations with CASE expressions, e.g.
INSERT INTO d1_info
(`db_type`, `info`, `date`)
WITH merged_data AS (
SELECT
p1.*,
1 as original
FROM
table_persons p1
UNION ALL
SELECT
p1.*,
0 as original
FROM
table_persons p1
WHERE p1.type='bot'
)
SELECT
CONCAT('x_',CASE
WHEN m1.type='per' THEN m1.type
WHEN m1.original=1 AND m1.type='bot' THEN m1.type
ELSE 'bnt'
END) as db_type,
CASE
WHEN m1.type='per' THEN 'b'
ELSE 'x'
END as info,
m1.date
FROM
merged_data m1
ORDER BY m1.people,m1.date;
I think what you want is to create a new table that captures your logic. This would simplify your query and make it so you could easily add new types without having to edit logic of a case statement. It may also make it cleaner to view your logic later.
CREATE TABLE table_persons (
`people` VARCHAR(5),
`type` VARCHAR(3),
`date` VARCHAR(10)
);
INSERT INTO table_persons
VALUES
('lisa', 'bot', '19-04-2022'),
('wayne', 'per', '19-04-2022');
CREATE TABLE info (
`type` VARCHAR(5),
`db_type` VARCHAR(5),
`info` VARCHAR(1)
);
insert into info
values
('bot', 'x_bot', 'x'),
('bot', 'x_bnt', 'x'),
('per','x_per','b');
and then you can easily do a join:
select
info.db_type,
info.info,
persons.date date
from
table_persons persons inner join info
on
info.type = persons.type

PostgreSQL CTE UPDATE-FROM query skips rows

2 tables
table_1 rows: NOTE: id 2 has two rows
-----------------------
| id | counts | track |
-----------------------
| 1 | 10 | 1 |
| 2 | 10 | 2 |
| 2 | 10 | 3 |
-----------------------
table_2 rows
---------------
| id | counts |
---------------
| 1 | 0 |
| 2 | 0 |
---------------
Query:
with t1_rows as (
select id, sum(counts) as counts, track
from table_1
group by id, track
)
update table_2 set counts = (coalesce(table_2.counts, 0) + t1.counts)::float
from t1_rows t1
where table_2.id = t1.id;
select * from table_2;
When I ran the above query, I got the following table_2 output:
---------------
| id | counts |
---------------
| 1 | 10 |
| 2 | 10 | (expected counts as 20 but got 10)
---------------
I noticed that the above update query considers only the first match and skips the rest.
I can make it work by changing the query as shown below. Now table_2 updates as expected, since there are no duplicate rows from table_1.
But I would like to know why my previous query does not work. Is there anything wrong with it?
with t1_rows as (
select id, sum(counts) as counts, array_agg(track) as track
from table_1
group by id
)
update table_2 set counts = (coalesce(table_2.counts, 0) + t1.counts)::float
from t1_rows t1
where table_2.id = t1.id;
Schema
CREATE TABLE IF NOT EXISTS table_1(
id varchar not null,
counts integer not null,
track integer not null
);
CREATE TABLE IF NOT EXISTS table_2(
id varchar not null,
counts integer not null
);
insert into table_1(id, counts, track) values(1, 10, 1), (2, 10, 2), (2, 10, 3);
insert into table_2(id, counts) values(1, 0), (2, 0);
The problem is that an UPDATE in PostgreSQL creates a new version of the row rather than changing the row in place, but the new row version is not visible in the snapshot of the current query. So from the point of view of the query, the row “vanishes” when it is updated the first time.
The documentation says:
When a FROM clause is present, what essentially happens is that the target table is joined to the tables mentioned in the from_list, and each output row of the join represents an update operation for the target table. When using FROM you should ensure that the join produces at most one output row for each row to be modified. In other words, a target row shouldn't join to more than one row from the other table(s). If it does, then only one of the join rows will be used to update the target row, but which one will be used is not readily predictable.
So if I read your question correctly, you expect rows 2 and 3 from table_1 to get added together? If so, the reason your first approach didn't work is that it grouped by id, track.
Since rows 2 and 3 have different numbers in the track column, they didn't get added together by the group by clause.
Your second approach worked because it only grouped by id.
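The rule quoted from the documentation (at most one source row per target row) can be sketched as follows. This is an illustration under assumptions, not the PostgreSQL query itself: it runs on an in-memory SQLite database via Python's sqlite3 module and swaps UPDATE ... FROM for a correlated subquery, which sums all matching table_1 rows so that each table_2 row receives exactly one value.

```python
import sqlite3

# In-memory SQLite copy of the question's schema and data (ids stored as text).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_1 (id TEXT NOT NULL, counts INTEGER NOT NULL, track INTEGER NOT NULL);
    CREATE TABLE table_2 (id TEXT NOT NULL, counts INTEGER NOT NULL);
    INSERT INTO table_1 (id, counts, track) VALUES ('1', 10, 1), ('2', 10, 2), ('2', 10, 3);
    INSERT INTO table_2 (id, counts) VALUES ('1', 0), ('2', 0);
""")

# Aggregate per target row: the subquery returns a single summed value
# for each table_2.id, so no target row is matched more than once.
conn.execute("""
    UPDATE table_2
    SET counts = counts + (
        SELECT COALESCE(SUM(t1.counts), 0)
        FROM table_1 AS t1
        WHERE t1.id = table_2.id
    )
""")

updated = conn.execute("SELECT id, counts FROM table_2 ORDER BY id").fetchall()
print(updated)
```

Here id 2 ends up with 20, because both of its table_1 rows are summed before the update is applied.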

SQL Identify records which occur more than once in the same year

I have records in which each of a set of procedure codes should occur only once per year per member. I'm trying to identify occurrences where this rule is broken.
I've tried the SQL below. Is it correct?
Table
+---------------+--------+-------------+
| ProcedureCode | Member | ServiceDate |
+---------------+--------+-------------+
| G0443 | 1234 | 01-03-2017 |
+---------------+--------+-------------+
| G0443 | 1234 | 05-03-2018 |
+---------------+--------+-------------+
| G0443 | 1234 | 07-03-2018 |
+---------------+--------+-------------+
| G0444 | 3453 | 01-03-2017 |
+---------------+--------+-------------+
| G0443 | 5676 | 07-03-2018 |
+---------------+--------+-------------+
Expected results where rule is broken
+---------------+--------+
| ProcedureCode | Member |
+---------------+--------+
| G0443 | 1234 |
+---------------+--------+
SQL
Select ProcedureCD, Mbr_Id
From CLAIMS
Where ProcedureCD IN ('G0443', 'G0444')
GROUP BY ProcedureCD,Mbr_Id, YEAR(ServiceFromDate)
having count(YEAR(ServiceFromDate))>1
The query you've written will work (if you correct the column names; your query uses different column names from the sample data you posted). It can be simplified visually by using COUNT(*) in the HAVING clause. COUNT accumulates a 1 for each non-null value and a 0 for each null, but there is no significance to using YEAR inside the count in this case, because all the dates are non-null and COUNT isn't interested in the value: count(*), count(1), count(0), and count(member) would all work equally here.
The only time count(column) works differently to count(*) is when column contains null values. There is also an option of COUNT where you put DISTINCT inside the brackets, and this causes the counting to ignore repeated values.
COUNT DISTINCT on a table column that contains 6 rows of values 1, 1, 2, null, 3, 3 would return 3 (3 unique values). COUNTing the same column would return 5 (5 non null values), COUNT(*) would return 6
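The three counts described above can be checked with a small sketch. It uses an in-memory SQLite database through Python's sqlite3 module; the table name t and column name v are hypothetical.

```python
import sqlite3

# Six rows: 1, 1, 2, NULL, 3, 3 (table and column names are made up).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (v INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?)",
    [(1,), (1,), (2,), (None,), (3,), (3,)],
)

# One pass over the table, three different ways of counting.
distinct_count, non_null_count, row_count = conn.execute(
    "SELECT COUNT(DISTINCT v), COUNT(v), COUNT(*) FROM t"
).fetchone()

print(distinct_count, non_null_count, row_count)
```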
You should understand that by putting the YEAR(...) in the group by but not the select, you might produce duplicate-looking rows in the output. For example if you had these rows also:
Member, Code, Date
1234, G0443, 1-1-19
1234, G0443, 2-1-19
And you're grouping on year (but not showing it) then you'll see:
1234, G0443 --it's for year 2018
1234, G0443 --it's for year 2019
Personally I think it'd be handy to show the year in the select list, so you can better pinpoint where the problem is, but if you want to squish these duplicate rows, do a SELECT DISTINCT. Alternatively, leverage the difference between count and count distinct: remove the year from the GROUP BY and instead say HAVING COUNT(*) > COUNT(DISTINCT YEAR(ServiceDate))
As discussed above a count(*) will be greater than a count distinct year if there are duplicated years
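A minimal sketch of the COUNT(*) > COUNT(DISTINCT year) idea, run on an in-memory SQLite database via Python's sqlite3 module. Two assumptions are made for illustration: the dates are stored as ISO strings (only the year matters here), and SQLite's strftime('%Y', ...) stands in for YEAR(...), which SQLite does not have.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE claims (ProcedureCode TEXT, Member TEXT, ServiceDate TEXT)"
)
# Sample rows from the question, with dates rewritten as ISO strings (assumption).
conn.executemany(
    "INSERT INTO claims VALUES (?, ?, ?)",
    [
        ("G0443", "1234", "2017-03-01"),
        ("G0443", "1234", "2018-03-05"),
        ("G0443", "1234", "2018-03-07"),
        ("G0444", "3453", "2017-03-01"),
        ("G0443", "5676", "2018-03-07"),
    ],
)

# A group has more rows than distinct years only if some year repeats.
violations = conn.execute("""
    SELECT ProcedureCode, Member
    FROM claims
    WHERE ProcedureCode IN ('G0443', 'G0444')
    GROUP BY ProcedureCode, Member
    HAVING COUNT(*) > COUNT(DISTINCT strftime('%Y', ServiceDate))
""").fetchall()

print(violations)
```

Because the year is no longer in the GROUP BY, each offending code and member pair appears once, however many years break the rule.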
Select ProcedureCode, Member,YEAR(ServiceDate) [Year],Count(*) Occurrences
From CLAIMS
Where ProcedureCode IN ('G0443', 'G0444')
GROUP BY ProcedureCode, Member,YEAR(ServiceDate)
HAVING Count(*) > 1
I hope this code helps:
create table #temp (ProcedureCode varchar(20),Member varchar(20),ServiceDate Date)
insert into #temp (ProcedureCode,Member,ServiceDate) values ('G0443','1234','01-03-2017')
insert into #temp (ProcedureCode,Member,ServiceDate) values ('G0443','1234','05-03-2018 ')
insert into #temp (ProcedureCode,Member,ServiceDate) values ('G0443','1234','07-03-2018')
insert into #temp (ProcedureCode,Member,ServiceDate) values ('G0444','3453','01-03-2017')
insert into #temp (ProcedureCode,Member,ServiceDate) values ('G0443','5676','07-03-2018')
select ProcedureCode,Member from #temp
where YEAR(ServiceDate) in (Select year(ServiceDate) ServiceDate from #temp group by
ServiceDate having count(ServiceDate)>1)
and Member in (Select Member from #temp group by Member having count(Member)>1)
Group by ProcedureCode,Member
drop table #temp

Counting the total number of rows with SELECT DISTINCT ON without using a subquery

I am performing some queries using PostgreSQL's SELECT DISTINCT ON syntax. I would like the query to return the total number of rows alongside every result row.
Assume I have a table my_table like the following:
CREATE TABLE my_table(
id int,
my_field text,
id_reference bigint
);
I then have a couple of values:
id | my_field | id_reference
----+----------+--------------
1 | a | 1
1 | b | 2
2 | a | 3
2 | c | 4
3 | x | 5
Basically my_table contains some versioned data. The id_reference is a reference to a global version of the database. Every change to the database will increase the global version number and changes will always add new rows to the tables (instead of updating/deleting values) and they will insert the new version number.
My goal is to perform a query that retrieves only the latest values in the table, along with the total number of rows.
For example, in the above case I would like to retrieve the following output:
| total | id | my_field | id_reference |
+-------+----+----------+--------------+
| 3 | 1 | b | 2 |
+-------+----+----------+--------------+
| 3 | 2 | c | 4 |
+-------+----+----------+--------------+
| 3 | 3 | x | 5 |
+-------+----+----------+--------------+
My attempt is the following:
select distinct on (id)
count(*) over () as total,
*
from my_table
order by id, id_reference desc
This returns almost the correct output, except that total is the number of rows in my_table instead of being the number of rows of the resulting query:
total | id | my_field | id_reference
-------+----+----------+--------------
5 | 1 | b | 2
5 | 2 | c | 4
5 | 3 | x | 5
(3 rows)
As you can see it has 5 instead of the expected 3.
I can fix this by using a subquery and applying count(*) over () to its result:
with my_values as (
select distinct on (id)
*
from my_table
order by id, id_reference desc
)
select count(*) over (), * from my_values
Which produces my expected output.
My question: is there a way to avoid using this subquery and have something similar to count(*) over () return the result I want?
You are looking at my_table 3 ways:
to find the latest id_reference for each id
to find my_field for the latest id_reference for each id
to count the distinct number of ids in the table
I therefore prefer this solution:
select
c.id_count as total,
a.id,
a.my_field,
b.max_id_reference
from
my_table a
join
(
select
id,
max(id_reference) as max_id_reference
from
my_table
group by
id
) b
on
a.id = b.id and
a.id_reference = b.max_id_reference
join
(
select
count(distinct id) as id_count
from
my_table
) c
on true;
This is a bit longer (especially the long thin way I write SQL) but it makes it clear what is happening. If you come back to it in a few months time (somebody usually does) then it will take less time to understand what is going on.
The "on true" at the end is a deliberate cartesian product because there can only ever be exactly one result from the subquery "c" and you do want a cartesian product with that.
There is nothing necessarily wrong with subqueries.
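As a rough check of the join-based query above, here is a sketch that runs it on an in-memory SQLite database via Python's sqlite3 module. SQLite is an assumption made so the example is self-contained, and CROSS JOIN is used in place of "join ... on true" for the single-row count subquery; the rest follows the answer's query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id INT, my_field TEXT, id_reference BIGINT)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?, ?)",
    [(1, "a", 1), (1, "b", 2), (2, "a", 3), (2, "c", 4), (3, "x", 5)],
)

# b: latest id_reference per id; c: one row holding the number of distinct ids.
latest_rows = conn.execute("""
    SELECT c.id_count AS total, a.id, a.my_field, b.max_id_reference
    FROM my_table AS a
    JOIN (
        SELECT id, MAX(id_reference) AS max_id_reference
        FROM my_table
        GROUP BY id
    ) AS b
      ON a.id = b.id AND a.id_reference = b.max_id_reference
    CROSS JOIN (
        SELECT COUNT(DISTINCT id) AS id_count
        FROM my_table
    ) AS c
    ORDER BY a.id
""").fetchall()

for row in latest_rows:
    print(row)
```

Each output row carries total = 3, the number of distinct ids, rather than 5, the number of rows in my_table.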

Insert multiple values and Insert value in parallel

I have a question about SQL in parallel queries. For example, suppose that I have this query:
INSERT INTO tblExample (num) VALUES (1), (2)
And this query:
INSERT INTO tblExample (num) VALUES (3)
The final table should look like this:
num
---
1
2
3
But I wonder whether it is possible for those two queries to run in parallel, so that the final table looks like this:
num
---
1
3
2
Does anyone know the answer?
Thanks in advance!
Rows in an SQL table have no inherent order. You can sort your query results by adding an ORDER BY clause:
SELECT * FROM tblExample ORDER BY num
Or you could add a timestamp column to the table and order by that.
How your table "looks" depends on how you asked for it (in the SELECT statement). Without an ORDER BY clause, the order of your table is undefined:
ORDER BY is the only way to sort the rows in the result set. Without this clause, the relational database system may return the rows in any order. If an ordering is required, the ORDER BY must be provided in the SELECT statement sent by the application.
For example:
SELECT num FROM tblExample ORDER BY num ASC
1
2
3
SELECT num FROM tblExample ORDER BY num DESC
3
2
1
If you want to order your rows manually, you can add a new column and sort on it. Note that order is a reserved word, so a column with that name must be quoted (or renamed):
+-----+-------+
| num | order |
+-----+-------+
| 1 | 1 |
| 2 | 3 |
| 3 | 2 |
+-----+-------+
SELECT num FROM tblExample ORDER BY "order" ASC
1
3
2
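The effect of an explicit ordering column can be sketched with an in-memory SQLite database through Python's sqlite3 module. The column name sort_order is hypothetical, chosen here to avoid quoting a reserved word.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# sort_order is a made-up column that records the desired display position.
conn.execute("CREATE TABLE tblExample (num INTEGER, sort_order INTEGER)")
conn.executemany(
    "INSERT INTO tblExample VALUES (?, ?)",
    [(1, 1), (2, 3), (3, 2)],
)

# The same rows, returned in whichever order the query asks for.
by_num = [row[0] for row in conn.execute(
    "SELECT num FROM tblExample ORDER BY num ASC"
)]
by_sort_order = [row[0] for row in conn.execute(
    "SELECT num FROM tblExample ORDER BY sort_order ASC"
)]

print(by_num)
print(by_sort_order)
```

The stored rows are identical in both cases; only the ORDER BY clause decides the sequence in which they come back.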