NOT IN vs NOT EXISTS - sql

Which of these queries is the faster?
NOT EXISTS:
SELECT ProductID, ProductName
FROM Northwind..Products p
WHERE NOT EXISTS (
SELECT 1
FROM Northwind..[Order Details] od
WHERE p.ProductId = od.ProductId)
Or NOT IN:
SELECT ProductID, ProductName
FROM Northwind..Products p
WHERE p.ProductID NOT IN (
SELECT ProductID
FROM Northwind..[Order Details])
The query execution plan says they both do the same thing. If that is the case, which is the recommended form?
This is based on the NorthWind database.
[Edit]
Just found this helpful article:
http://weblogs.sqlteam.com/mladenp/archive/2007/05/18/60210.aspx
I think I'll stick with NOT EXISTS.

I always default to NOT EXISTS.
The execution plans may be the same at the moment but if either column is altered in the future to allow NULLs the NOT IN version will need to do more work (even if no NULLs are actually present in the data) and the semantics of NOT IN if NULLs are present are unlikely to be the ones you want anyway.
When neither Products.ProductID or [Order Details].ProductID allow NULLs the NOT IN will be treated identically to the following query.
SELECT ProductID,
ProductName
FROM Products p
WHERE NOT EXISTS (SELECT *
FROM [Order Details] od
WHERE p.ProductId = od.ProductId)
The exact plan may vary but for my example data I get the following.
A reasonably common misconception seems to be that correlated sub queries are always "bad" compared to joins. They certainly can be when they force a nested loops plan (sub query evaluated row by row) but this plan includes an anti semi join logical operator. Anti semi joins are not restricted to nested loops but can use hash or merge (as in this example) joins too.
/*Not valid syntax but better reflects the plan*/
SELECT p.ProductID,
p.ProductName
FROM Products p
LEFT ANTI SEMI JOIN [Order Details] od
ON p.ProductId = od.ProductId
If [Order Details].ProductID is NULL-able the query then becomes
SELECT ProductID,
ProductName
FROM Products p
WHERE NOT EXISTS (SELECT *
FROM [Order Details] od
WHERE p.ProductId = od.ProductId)
AND NOT EXISTS (SELECT *
FROM [Order Details]
WHERE ProductId IS NULL)
The reason for this is that the correct semantics if [Order Details] contains any NULL ProductIds is to return no results. See the extra anti semi join and row count spool to verify this that is added to the plan.
If Products.ProductID is also changed to become NULL-able the query then becomes
SELECT ProductID,
ProductName
FROM Products p
WHERE NOT EXISTS (SELECT *
FROM [Order Details] od
WHERE p.ProductId = od.ProductId)
AND NOT EXISTS (SELECT *
FROM [Order Details]
WHERE ProductId IS NULL)
AND NOT EXISTS (SELECT *
FROM (SELECT TOP 1 *
FROM [Order Details]) S
WHERE p.ProductID IS NULL)
The reason for that one is because a NULL Products.ProductId should not be returned in the results except if the NOT IN sub query were to return no results at all (i.e. the [Order Details] table is empty). In which case it should. In the plan for my sample data this is implemented by adding another anti semi join as below.
The effect of this is shown in the blog post already linked by Buckley. In the example there the number of logical reads increase from around 400 to 500,000.
Additionally the fact that a single NULL can reduce the row count to zero makes cardinality estimation very difficult. If SQL Server assumes that this will happen but in fact there were no NULL rows in the data the rest of the execution plan may be catastrophically worse, if this is just part of a larger query, with inappropriate nested loops causing repeated execution of an expensive sub tree for example.
This is not the only possible execution plan for a NOT IN on a NULL-able column however. This article shows another one for a query against the AdventureWorks2008 database.
For the NOT IN on a NOT NULL column or the NOT EXISTS against either a nullable or non nullable column it gives the following plan.
When the column changes to NULL-able the NOT IN plan now looks like
It adds an extra inner join operator to the plan. This apparatus is explained here. It is all there to convert the previous single correlated index seek on Sales.SalesOrderDetail.ProductID = <correlated_product_id> to two seeks per outer row. The additional one is on WHERE Sales.SalesOrderDetail.ProductID IS NULL.
As this is under an anti semi join if that one returns any rows the second seek will not occur. However if Sales.SalesOrderDetail does not contain any NULL ProductIDs it will double the number of seek operations required.

Also be aware that NOT IN is not equivalent to NOT EXISTS when it comes to null.
This post explains it very well
http://sqlinthewild.co.za/index.php/2010/02/18/not-exists-vs-not-in/
When the subquery returns even one null, NOT IN will not match any
rows.
The reason for this can be found by looking at the details of what the
NOT IN operation actually means.
Let’s say, for illustration purposes that there are 4 rows in the
table called t, there’s a column called ID with values 1..4
WHERE SomeValue NOT IN (SELECT AVal FROM t)
is equivalent to
WHERE SomeValue != (SELECT AVal FROM t WHERE ID=1)
AND SomeValue != (SELECT AVal FROM t WHERE ID=2)
AND SomeValue != (SELECT AVal FROM t WHERE ID=3)
AND SomeValue != (SELECT AVal FROM t WHERE ID=4)
Let’s further say that AVal is NULL where ID = 4. Hence that !=
comparison returns UNKNOWN. The logical truth table for AND states
that UNKNOWN and TRUE is UNKNOWN, UNKNOWN and FALSE is FALSE. There is
no value that can be AND’d with UNKNOWN to produce the result TRUE
Hence, if any row of that subquery returns NULL, the entire NOT IN
operator will evaluate to either FALSE or NULL and no records will be
returned

If the execution planner says they're the same, they're the same. Use whichever one will make your intention more obvious -- in this case, the second.

Actually, I believe this would be the fastest:
SELECT ProductID, ProductName
FROM Northwind..Products p
outer join Northwind..[Order Details] od on p.ProductId = od.ProductId)
WHERE od.ProductId is null

I have a table which has about 120,000 records and need to select only those which does not exist (matched with a varchar column) in four other tables with number of rows approx 1500, 4000, 40000, 200. All the involved tables have unique index on the concerned Varchar column.
NOT IN took about 10 mins, NOT EXISTS took 4 secs.
I have a recursive query which might had some untuned section which might have contributed to the 10 mins, but the other option taking 4 secs explains, atleast to me that NOT EXISTS is far better or at least that IN and EXISTS are not exactly the same and always worth a check before going ahead with code.

I was using
SELECT * from TABLE1 WHERE Col1 NOT IN (SELECT Col1 FROM TABLE2)
and found that it was giving wrong results (By wrong I mean no results). As there was a NULL in TABLE2.Col1.
While changing the query to
SELECT * from TABLE1 T1 WHERE NOT EXISTS (SELECT Col1 FROM TABLE2 T2 WHERE T1.Col1 = T2.Col2)
gave me the correct results.
Since then I have started using NOT EXISTS every where.

In your specific example they are the same, because the optimizer has figured out what you are trying to do is the same in both examples. But it is possible that in non-trivial examples the optimizer may not do this, and in that case there are reasons to prefer one to other on occasion.
NOT IN should be preferred if you are testing multiple rows in your outer select. The subquery inside the NOT IN statement can be evaluated at the beginning of the execution, and the temporary table can be checked against each value in the outer select, rather than re-running the subselect every time as would be required with the NOT EXISTS statement.
If the subquery must be correlated with the outer select, then NOT EXISTS may be preferable, since the optimizer may discover a simplification that prevents the creation of any temporary tables to perform the same function.

Database table model
Let’s assume we have the following two tables in our database, that form a one-to-many table relationship.
The student table is the parent, and the student_grade is the child table since it has a student_id Foreign Key column referencing the id Primary Key column in the student table.
The student table contains the following two records:
id
first_name
last_name
admission_score
1
Alice
Smith
8.95
2
Bob
Johnson
8.75
And, the student_grade table stores the grades the students received:
id
class_name
grade
student_id
1
Math
10
1
2
Math
9.5
1
3
Math
9.75
1
4
Science
9.5
1
5
Science
9
1
6
Science
9.25
1
7
Math
8.5
2
8
Math
9.5
2
9
Math
9
2
10
Science
10
2
11
Science
9.4
2
SQL EXISTS
Let’s say we want to get all students that have received a 10 grade in Math class.
If we are only interested in the student identifier, then we can run a query like this one:
SELECT
student_grade.student_id
FROM
student_grade
WHERE
student_grade.grade = 10 AND
student_grade.class_name = 'Math'
ORDER BY
student_grade.student_id
But, the application is interested in displaying the full name of a student, not just the identifier, so we need info from the student table as well.
In order to filter the student records that have a 10 grade in Math, we can use the EXISTS SQL operator, like this:
SELECT
id, first_name, last_name
FROM
student
WHERE EXISTS (
SELECT 1
FROM
student_grade
WHERE
student_grade.student_id = student.id AND
student_grade.grade = 10 AND
student_grade.class_name = 'Math'
)
ORDER BY id
When running the query above, we can see that only the Alice row is selected:
id
first_name
last_name
1
Alice
Smith
The outer query selects the student row columns we are interested in returning to the client. However, the WHERE clause is using the EXISTS operator with an associated inner subquery.
The EXISTS operator returns true if the subquery returns at least one record and false if no row is selected. The database engine does not have to run the subquery entirely. If a single record is matched, the EXISTS operator returns true, and the associated other query row is selected.
The inner subquery is correlated because the student_id column of the student_grade table is matched against the id column of the outer student table.
SQL NOT EXISTS
Let’s consider we want to select all students that have no grade lower than 9. For this, we can use NOT EXISTS, which negates the logic of the EXISTS operator.
Therefore, the NOT EXISTS operator returns true if the underlying subquery returns no record. However, if a single record is matched by the inner subquery, the NOT EXISTS operator will return false, and the subquery execution can be stopped.
To match all student records that have no associated student_grade with a value lower than 9, we can run the following SQL query:
SELECT
id, first_name, last_name
FROM
student
WHERE NOT EXISTS (
SELECT 1
FROM
student_grade
WHERE
student_grade.student_id = student.id AND
student_grade.grade < 9
)
ORDER BY id
When running the query above, we can see that only the Alice record is matched:
id
first_name
last_name
1
Alice
Smith
So, the advantage of using the SQL EXISTS and NOT EXISTS operators is that the inner subquery execution can be stopped as long as a matching record is found.

They are very similar but not really the same.
In terms of efficiency, I've found the left join is null statement more efficient (when an abundance of rows are to be selected that is)

If the optimizer says they are the same then consider the human factor. I prefer to see NOT EXISTS :)

It depends..
SELECT x.col
FROM big_table x
WHERE x.key IN( SELECT key FROM really_big_table );
would not be relatively slow the isn't much to limit size of what the query check to see if they key is in. EXISTS would be preferable in this case.
But, depending on the DBMS's optimizer, this could be no different.
As an example of when EXISTS is better
SELECT x.col
FROM big_table x
WHERE EXISTS( SELECT key FROM really_big_table WHERE key = x.key);
AND id = very_limiting_criteria

Related

Exists sub-query with a HAVING clause

I'm trying to understand how EXISTS work.
The following query is based on this answer, and it queries for all SalesOrderIDs that have more than 1 record in the table, where at lease one of those records has OrderQty > 1 and ProductID = 777:
USE AdventureWorks2012;
GO
SELECT SalesOrderID, OrderQty, ProductID
FROM Sales.SalesOrderDetail s
WHERE EXISTS
( SELECT 1
FROM Sales.SalesOrderDetail s2
WHERE s.SalesOrderID = s2.SalesOrderID
GROUP BY SalesOrderID
HAVING COUNT(*) > 1
AND COUNT(CASE WHEN OrderQty > 1 AND ProductID = 777 THEN 1 END) >= 1
);
What I don't understand is this: The sub-query returns a single-columned table filled with the value 1 on each row. So the way I understand it, the WHERE in the outer query has no real condition to apply, just a bunch of 1s. Why\How, then, the outer query returns only part of the Sales.SalesOrderDetail, and not its entirety?
What happens in EXISTS is that, it only checks if the record from the outer table satisfies the conditions given in the inner query. That's why we specify "1" unlike IN where we need to specify the individual columns (and data is checked for each and every record).
So, it does not return any bunch of 1's and validates it. As the name implies, it checks only for the existence of the record as per the given condition.
Hope this clarifies.
Note : Always use table alias names for the columns to prevent ambiguity.
the inner SELECT 1 ... will not always return 1.
When inner WHERE/HAVING condition is not met you will not get 1 returned. Instead there will be nothing, I mean the SQL Server Management Studio (if I recall correctly) will display NO result at all, not even NULL for the inner SELECT 1 thus failing the whole outer WHERE for that particular row.
Therefore part of your outer query result set will be cut off and the total number of rows returned with EXITS(...) will be less then if EXISTS(...) was not present.

LEFT JOIN WHERE RIGHT IS NULL for same table in Teradata SQL

I have a table with 51 records . The table structure looks something like below :
ack_extract_id query_id cnst_giftran_key field1 value1
Now ack_extract_ids can be 8,9.
I want to check for giftran keys which are there for extract_id 9 and not there in 8.
What I tried was
SELECT *
FROM ddcoe_tbls.ack_flextable ack_flextable1
INNER JOIN ddcoe_tbls.ack_main_config config
ON ack_flextable1.ack_extract_id = config.ack_extract_id
LEFT JOIN ddcoe_tbls.ack_flextable ack_flextable2
ON ack_flextable1.cnst_giftran_key = ack_flextable2.cnst_giftran_key
WHERE ack_flextable2.cnst_giftran_key IS NULL
AND config.ack_extract_file_nm LIKE '%Dtl%'
AND ack_flextable2.ack_extract_id = 8
AND ack_flextable1.ack_extract_id = 9
But it is returning me 0 records. Ideally the left join where right is null should have returned the record for which the cnst_giftran_key is not present in the right hand side table, right ?
What am I missing here ?
When you test columns from the left-joined table in the where clause (ack_flextable2.ack_extract_id in your case), you force that join to behave as if it were an inner join. Instead, move that test to be part of the join condition.
Then to find records where that value is missing, test for a NULL key in the where clause.
SELECT *
FROM ddcoe_tbls.ack_flextable ack_flextable1
INNER JOIN ddcoe_tbls.ack_main_config config
ON ack_flextable1.ack_extract_id = config.ack_extract_id
LEFT JOIN ddcoe_tbls.ack_flextable ack_flextable2
ON ack_flextable1.cnst_giftran_key = ack_flextable2.cnst_giftran_key
AND ack_flextable2.ack_extract_id = 8
WHERE ack_flextable2.cnst_giftran_key IS NULL
AND config.ack_extract_file_nm LIKE '%Dtl%'
AND ack_flextable1.ack_extract_id = 9
AND ack_flextable2.cnst_giftran_key IS NULL
THIS IS NO ANSWER, JUST AN EXPLANATION
From your comment to Joe Stefanelli's answer I gather that you don't fully understand the issue with WHERE and ON in an outer join. So let's look at an example.
We are looking for all supplier's last orders, i.e. the order records where there is no newer order for the supplier.
select *
from order
where not exists
(
select *
from order newer
where newer.supplier = order.supplier
and newer.orderdate > order.orderdate
);
This is straight-forward; the query matches what we just put in words: Find orders for which NOT EXISTS a newer order for the same supplier.
The same query with the anti-join pattern:
select order.*
from order
left join order newer on newer.supplier = order.supplier
and newer.orderdate > order.orderdate
where newer.id is null;
Here we join every order with all their newer orders, thus probably creating a huge intermediate result. With the left outer join we make sure we get a dummy record attached when there is no newer order for the supplier. Then at last we scan the intermediate result with the WHERE clause, keeping only records where the attached record has an ID null. Well, the ID is obviously the table's primary key and can never be null, so what we keep here is only the outer-joined results where the newer data is just a dummy record containing nulls. Thus we get exactly the orders for which no newer order exists.
Talking about a huge intermediate result: How can this be faster than the first query? Well, it shouldn't. The first query should actually either run equally fast or faster. A good DBMS will see through this and make the same execution plan for both queries. A rather young DBMS however may really execute the anti join quicker. That is because the developers put so much effort into join techniques, as these are needed in about every query, and didn't yet care about IN and EXISTS that much. In such a case one may run into performance issues with NOT IN or NOT EXISTS and use the anti-join pattern instead.
Now as to the WHERE / ON problem:
select order.*
from order
left join order newer on newer.orderdate > order.orderdate
where newer.supplier = order.supplier
and newer.id is null;
This looks almost the same as before, but some criteria has moved from ON to WHERE. This means the outer join gets different criteria. Here is what happens: for every order find all newer orders &dash; no matter which supplier! So it is all orders of the last order date that get an outer-join dummy record. But then in the WHERE clause we remove all pairs where the supplier doesn't match. Notice that the outer-joined records contain NULL for newer.supplier, so newer.supplier = order.supplier is never true for them; they get removed. But then, if we remove all outer-joined records we get exactly the same result as with a vanilla inner join. When we put outer join criteria in the WHERE clause we turn the outer join into an inner join. So the query can be re-written as
select order.*
from order
inner join order newer on newer.orderdate > order.orderdate
where newer.supplier = order.supplier
and newer.id is null;
And with tables in FROM and INNER JOIN it doesn't matter whether the criteria is in ON or WHERE; it's rather a matter of readability, because both criteria will equally get applied.
Now we see that newer.id is null can never be true. The final result will be empty &dash; which is exactly what happened with your query.
You can try with this query:
select * from ddcoe_tbls.ack_main_config
where cnst_giftran_key not in
(
select cnst_giftran_key from ddcoe_tbls.ack_main_config
where ack_extract_id = 8
)
and ack_extract_id = 9;

Use of IN and EXISTS in SQL

Assuming that one has three Tables in a Relational Database as :
Customer(Id, Name, City),
Product(Id, Name, Price),
Orders(Cust_Id, Prod_Id, Date)
My first question is what is the best way to excecute the query: "Get all the Customers who ordered a Product".
Some people propose the query with EXISTS as:
Select *
From Customer c
Where Exists (Select Cust_Id from Orders o where c.Id=o.cust_Id)
Is the above query equivalent (can it be written?) as:
Select *
From Customer
Where Exists (select Cust_id from Orders o Join Customer c on c.Id=o.cust_Id)
What is the problem when we use IN instead of EXISTS apart from the performance as:
Select *
From Customer
Where Customer.Id IN (Select o.cust_Id from Order o )
Do the three above queries return exactly the same records?
Update: How does really the EXISTS evaluation works in the second query (or the first), considering that it checks only if the Subquery returns true or false? What is the "interpretation" of the query i.e.?
Select *
From Customer c
Where Exists (True)
The first two queries are different.
The first has a correlated subquery and will return what you want -- information about customers who have an order.
The second has an uncorrelated subquery. It will return either all customers or no customers, depending on whether or not any customers have placed an order.
The third query is an alternative way of expressing what you want.
The only possible issue that I can think of would arise when cust_id might have NULL values. In such a case, the first and third queries may not return the same results.
Yes, each of those three should return identical result sets.
Your second query is incorrect, as #ypercube points out in the commends. You're checking whether an uncorrellated subquery EXISTS
Of the two that work (1, 3), I'd expect #3 to be the fastest depending on your tables because it only executes the subquery one time.
However your most effective result is probably none of them but this:
SELECT DISTINCT
c.*
FROM
Customer c
JOIN
Orders o
ON o.[cust_id] = c.[Id]
because it should just be an index scan and a hash.
You should check the query plans and/or benchmark each one.
The best way to execute that query is to add orders to the from clause and join to it.
select distinct c.*
from customers c,
orders o
where c.id = o.cust_id
Your other queries may be more inefficient (depending on the shape of the data) but they should all return the same result set.

Does EXCEPT execute faster than a JOIN when the table columns are the same

To find all the changes between two databases, I am left joining the tables on the pk and using a date_modified field to choose the latest record. Will using EXCEPT increase performance since the tables have the same schema. I would like to rewrite it with an EXCEPT, but I'm not sure if the implementation for EXCEPT would out perform a JOIN in every case. Hopefully someone has a more technical explanation for when to use EXCEPT.
There is no way anyone can tell you that EXCEPT will always or never out-perform an equivalent OUTER JOIN. The optimizer will choose an appropriate execution plan regardless of how you write your intent.
That said, here is my guideline:
Use EXCEPT when at least one of the following is true:
The query is more readable (this will almost always be true).
Performance is improved.
And BOTH of the following are true:
The query produces semantically identical results, and you can demonstrate this through sufficient regression testing, including all edge cases.
Performance is not degraded (again, in all edge cases, as well as environmental changes such as clearing buffer pool, updating statistics, clearing plan cache, and restarting the service).
It is important to note that it can be a challenge to write an equivalent EXCEPT query as the JOIN becomes more complex and/or you are relying on duplicates in part of the columns but not others. Writing a NOT EXISTS equivalent, while slightly less readable than EXCEPT should be far more trivial to accomplish - and will often lead to a better plan (but note that I would never say ALWAYS or NEVER, except in the way I just did).
In this blog post I demonstrate at least one case where EXCEPT is outperformed by both a properly constructed LEFT OUTER JOIN and of course by an equivalent NOT EXISTS variation.
In the following example, the LEFT JOIN is faster than EXCEPT by 70%
(PostgreSQL 9.4.3)
Example:
There are three tables. suppliers, parts, shipments.
We need to get all parts not supplied by any supplier in London.
Database(has indexes on all involved columns):
CREATE TABLE suppliers (
id bigint primary key,
city character varying NOT NULL
);
CREATE TABLE parts (
id bigint primary key,
name character varying NOT NULL,
);
CREATE TABLE shipments (
id bigint primary key,
supplier_id bigint NOT NULL,
part_id bigint NOT NULL
);
Records count:
db=# SELECT COUNT(*) FROM suppliers;
count
---------
1281280
(1 row)
db=# SELECT COUNT(*) FROM parts;
count
---------
1280000
(1 row)
db=# SELECT COUNT(*) FROM shipments;
count
---------
1760161
(1 row)
Query using EXCEPT.
SELECT parts.*
FROM parts
EXCEPT
SELECT parts.*
FROM parts
LEFT JOIN shipments
ON (parts.id = shipments.part_id)
LEFT JOIN suppliers
ON (shipments.supplier_id = suppliers.id)
WHERE suppliers.city = 'London'
;
-- Execution time: 3327.728 ms
Query using LEFT JOIN with table, returned by subquery.
SELECT parts.*
FROM parts
LEFT JOIN (
SELECT parts.id
FROM parts
LEFT JOIN shipments
ON (parts.id = shipments.part_id)
LEFT JOIN suppliers
ON (shipments.supplier_id = suppliers.id)
WHERE suppliers.city = 'London'
) AS subquery_tbl
ON (parts.id = subquery_tbl.id)
WHERE subquery_tbl.id IS NULL
;
-- Execution time: 1136.393 ms

COUNT(*) vs. COUNT(1) vs. COUNT(pk): which is better? [duplicate]

This question already has answers here:
Count(*) vs Count(1) - SQL Server
(13 answers)
Closed 8 years ago.
I often find these three variants:
SELECT COUNT(*) FROM Foo;
SELECT COUNT(1) FROM Foo;
SELECT COUNT(PrimaryKey) FROM Foo;
As far as I can see, they all do the same thing, and I find myself using the three in my codebase. However, I don't like to do the same thing different ways. To which one should I stick? Is any one of them better than the two others?
Bottom Line
Use either COUNT(field) or COUNT(*), and stick with it consistently, and if your database allows COUNT(tableHere) or COUNT(tableHere.*), use that.
In short, don't use COUNT(1) for anything. It's a one-trick pony, which rarely does what you want, and in those rare cases is equivalent to count(*)
Use count(*) for counting
Use * for all your queries that need to count everything, even for joins, use *
SELECT boss.boss_id, COUNT(subordinate.*)
FROM boss
LEFT JOIN subordinate on subordinate.boss_id = boss.boss_id
GROUP BY boss.id
But don't use COUNT(*) for LEFT joins, as that will return 1 even if the subordinate table doesn't match anything from parent table
SELECT boss.boss_id, COUNT(*)
FROM boss
LEFT JOIN subordinate on subordinate.boss_id = boss.boss_id
GROUP BY boss.id
Don't be fooled by those advising that when using * in COUNT, it fetches entire row from your table, saying that * is slow. The * on SELECT COUNT(*) and SELECT * has no bearing to each other, they are entirely different thing, they just share a common token, i.e. *.
An alternate syntax
In fact, if it is not permitted to name a field as same as its table name, RDBMS language designer could give COUNT(tableNameHere) the same semantics as COUNT(*). Example:
For counting rows we could have this:
SELECT COUNT(emp) FROM emp
And they could make it simpler:
SELECT COUNT() FROM emp
And for LEFT JOINs, we could have this:
SELECT boss.boss_id, COUNT(subordinate)
FROM boss
LEFT JOIN subordinate on subordinate.boss_id = boss.boss_id
GROUP BY boss.id
But they cannot do that (COUNT(tableNameHere)) since SQL standard permits naming a field with the same name as its table name:
CREATE TABLE fruit -- ORM-friendly name
(
fruit_id int NOT NULL,
fruit varchar(50), /* same name as table name,
and let's say, someone forgot to put NOT NULL */
shape varchar(50) NOT NULL,
color varchar(50) NOT NULL
)
Counting with null
And also, it is not a good practice to make a field nullable if its name matches the table name. Say you have values 'Banana', 'Apple', NULL, 'Pears' on fruit field. This will not count all rows, it will only yield 3, not 4
SELECT count(fruit) FROM fruit
Though some RDBMS do that sort of principle (for counting the table's rows, it accepts table name as COUNT's parameter), this will work in Postgresql (if there is no subordinate field in any of the two tables below, i.e. as long as there is no name conflict between field name and table name):
SELECT boss.boss_id, COUNT(subordinate)
FROM boss
LEFT JOIN subordinate on subordinate.boss_id = boss.boss_id
GROUP BY boss.id
But that could cause confusion later if we will add a subordinate field in the table, as it will count the field(which could be nullable), not the table rows.
So to be on the safe side, use:
SELECT boss.boss_id, COUNT(subordinate.*)
FROM boss
LEFT JOIN subordinate on subordinate.boss_id = boss.boss_id
GROUP BY boss.id
count(1): The one-trick pony
In particular to COUNT(1), it is a one-trick pony, it works well only on one table query:
SELECT COUNT(1) FROM tbl
But when you use joins, that trick won't work on multi-table queries without its semantics being confused, and in particular you cannot write:
-- count the subordinates that belongs to boss
SELECT boss.boss_id, COUNT(subordinate.1)
FROM boss
LEFT JOIN subordinate on subordinate.boss_id = boss.boss_id
GROUP BY boss.id
So what's the meaning of COUNT(1) here?
SELECT boss.boss_id, COUNT(1)
FROM boss
LEFT JOIN subordinate on subordinate.boss_id = boss.boss_id
GROUP BY boss.id
Is it this...?
-- counting all the subordinates only
SELECT boss.boss_id, COUNT(subordinate.boss_id)
FROM boss
LEFT JOIN subordinate on subordinate.boss_id = boss.boss_id
GROUP BY boss.id
Or this...?
-- or is that COUNT(1) will also count 1 for boss regardless if boss has a subordinate
SELECT boss.boss_id, COUNT(*)
FROM boss
LEFT JOIN subordinate on subordinate.boss_id = boss.boss_id
GROUP BY boss.id
By careful thought, you can infer that COUNT(1) is the same as COUNT(*), regardless of type of join. But for LEFT JOINs result, we cannot mold COUNT(1) to work as: COUNT(subordinate.boss_id), COUNT(subordinate.*)
So just use either of the following:
-- count the subordinates that belongs to boss
SELECT boss.boss_id, COUNT(subordinate.boss_id)
FROM boss
LEFT JOIN subordinate on subordinate.boss_id = boss.boss_id
GROUP BY boss.id
Works on Postgresql, it's clear that you want to count the cardinality of the set
-- count the subordinates that belongs to boss
SELECT boss.boss_id, COUNT(subordinate.*)
FROM boss
LEFT JOIN subordinate on subordinate.boss_id = boss.boss_id
GROUP BY boss.id
Another way to count the cardinality of the set, very English-like (just don't make a column with a name same as its table name) : http://www.sqlfiddle.com/#!1/98515/7
select boss.boss_name, count(subordinate)
from boss
left join subordinate on subordinate.boss_code = boss.boss_code
group by boss.boss_name
You cannot do this: http://www.sqlfiddle.com/#!1/98515/8
select boss.boss_name, count(subordinate.1)
from boss
left join subordinate on subordinate.boss_code = boss.boss_code
group by boss.boss_name
You can do this, but this produces wrong result: http://www.sqlfiddle.com/#!1/98515/9
select boss.boss_name, count(1)
from boss
left join subordinate on subordinate.boss_code = boss.boss_code
group by boss.boss_name
Two of them always produce the same answer:
COUNT(*) counts the number of rows
COUNT(1) also counts the number of rows
Assuming the pk is a primary key and that no nulls are allowed in the values, then
COUNT(pk) also counts the number of rows
However, if pk is not constrained to be not null, then it produces a different answer:
COUNT(possibly_null) counts the number of rows with non-null values in the column possibly_null.
COUNT(DISTINCT pk) also counts the number of rows (because a primary key does not allow duplicates).
COUNT(DISTINCT possibly_null_or_dup) counts the number of distinct non-null values in the column possibly_null_or_dup.
COUNT(DISTINCT possibly_duplicated) counts the number of distinct (necessarily non-null) values in the column possibly_duplicated when that has the NOT NULL clause on it.
Normally, I write COUNT(*); it is the original recommended notation for SQL. Similarly, with the EXISTS clause, I normally write WHERE EXISTS(SELECT * FROM ...) because that was the original recommend notation. There should be no benefit to the alternatives; the optimizer should see through the more obscure notations.
Asked and answered before...
Books on line says "COUNT ( { [ [ ALL | DISTINCT ] expression ] | * } )"
"1" is a non-null expression so it's the same as COUNT(*).
The optimiser recognises it as trivial so gives the same plan. A PK is unique and non-null (in SQL Server at least) so COUNT(PK) = COUNT(*)
This is a similar myth to EXISTS (SELECT * ... or EXISTS (SELECT 1 ...
And see the ANSI 92 spec, section 6.5, General Rules, case 1
a) If COUNT(*) is specified, then the result is the cardinality
of T.
b) Otherwise, let TX be the single-column table that is the
result of applying the <value expression> to each row of T
and eliminating null values. If one or more null values are
eliminated, then a completion condition is raised: warning-
null value eliminated in set function.
At least on Oracle they are all the same: http://www.oracledba.co.uk/tips/count_speed.htm
I feel the performance characteristics change from one DBMS to another. It's all on how they choose to implement it. Since I have worked extensively on Oracle, I'll tell from that perspective.
COUNT(*) - Fetches entire row into result set before passing on to the count function, count function will aggregate 1 if the row is not null
COUNT(1) - Will not fetch any row, instead count is called with a constant value of 1 for each row in the table when the WHERE matches.
COUNT(PK) - The PK in Oracle is indexed. This means Oracle has to read only the index. Normally one row in the index B+ tree is many times smaller than the actual row. So considering the disk IOPS rate, Oracle can fetch many times more rows from Index with a single block transfer as compared to entire row. This leads to higher throughput of the query.
From this you can see the first count is the slowest and the last count is the fastest in Oracle.