SQL Server group by foreign key and select dependant columns - sql

I have some performance issues when querying in SQL Server. I need to GROUP BY a foreign key (academic_unit_id), but I also need to select a column that is dependent on the FK (academic_unit_name).
In SQL Server I can't just select academic_unit_name in the same query; it must be aggregated or appear in the GROUP BY.
I think the options I have are:
SELECT academic_unit_id (foreign key) and academic_unit_name (dependent on the FK), and then group by both
SELECT
ria.COD_DOCENTE_SCD,
ria.COD_CURSO_SECCION_SCD,
ria.COD_ITEM_SCD,
ria.COD_GRUPO_PREGUNTA_SCD,
alumno.IDN_UNIDAD_ACADEM_SCD, -- id
alumno.NOM_UNIDAD_ACADEM_SCD, -- name
ROUND(COUNT(case when opcion.punto = 1 then 1 end), 2) as amount_yes,
ROUND(COUNT(case when opcion.punto = 0 then 1 end), 2) as amount_no
FROM BANNER_ENCUESTA.R_RESP_ITEM_ALUMNO_CSP_SCD AS ria INNER JOIN
BANNER_ENCUESTA.TIPO_OPCION_SCD AS opcion ON ria.COD_TIPO_OPCION_SCD = opcion.COD_TIPO_OPCION_SCD INNER JOIN
BANNER_ENCUESTA.ALUMNO_SCD AS alumno ON ria.COD_ALUMNO_SCD = alumno.COD_ALUMNO_SCD
GROUP BY
ria.COD_DOCENTE_SCD,
ria.COD_CURSO_SECCION_SCD,
ria.COD_ITEM_SCD,
ria.COD_GRUPO_PREGUNTA_SCD,
alumno.IDN_UNIDAD_ACADEM_SCD, -- group by FK
alumno.NOM_UNIDAD_ACADEM_SCD -- group by name
GROUP BY PK and aggregate the academic_unit_name. I can aggregate using MAX, since all names are equal for a given id.
SELECT
ria.COD_DOCENTE_SCD,
ria.COD_CURSO_SECCION_SCD,
ria.COD_ITEM_SCD,
ria.COD_GRUPO_PREGUNTA_SCD,
alumno.IDN_UNIDAD_ACADEM_SCD, -- id
MAX(alumno.NOM_UNIDAD_ACADEM_SCD), -- aggregate name
ROUND(COUNT(case when opcion.punto = 1 then 1 end), 2) as amount_yes,
ROUND(COUNT(case when opcion.punto = 0 then 1 end), 2) as amount_no
FROM BANNER_ENCUESTA.R_RESP_ITEM_ALUMNO_CSP_SCD AS ria INNER JOIN
BANNER_ENCUESTA.TIPO_OPCION_SCD AS opcion ON ria.COD_TIPO_OPCION_SCD = opcion.COD_TIPO_OPCION_SCD INNER JOIN
BANNER_ENCUESTA.ALUMNO_SCD AS alumno ON ria.COD_ALUMNO_SCD = alumno.COD_ALUMNO_SCD
GROUP BY
ria.COD_DOCENTE_SCD,
ria.COD_CURSO_SECCION_SCD,
ria.COD_ITEM_SCD,
ria.COD_GRUPO_PREGUNTA_SCD,
alumno.IDN_UNIDAD_ACADEM_SCD --Group by FK
SELECT only academic_unit_id and then JOIN with AcademicUnit again to obtain the name.
with banner_questions as (
SELECT
ria.COD_DOCENTE_SCD,
ria.COD_CURSO_SECCION_SCD,
ria.COD_ITEM_SCD,
ria.COD_GRUPO_PREGUNTA_SCD,
alumno.IDN_UNIDAD_ACADEM_SCD, -- id
ROUND(COUNT(case when opcion.punto = 1 then 1 end), 2) as amount_yes,
ROUND(COUNT(case when opcion.punto = 0 then 1 end), 2) as amount_no
FROM BANNER_ENCUESTA.R_RESP_ITEM_ALUMNO_CSP_SCD AS ria INNER JOIN
BANNER_ENCUESTA.TIPO_OPCION_SCD AS opcion ON ria.COD_TIPO_OPCION_SCD = opcion.COD_TIPO_OPCION_SCD INNER JOIN
BANNER_ENCUESTA.ALUMNO_SCD AS alumno ON ria.COD_ALUMNO_SCD = alumno.COD_ALUMNO_SCD
GROUP BY
ria.COD_DOCENTE_SCD,
ria.COD_CURSO_SECCION_SCD,
ria.COD_ITEM_SCD,
ria.COD_GRUPO_PREGUNTA_SCD,
alumno.IDN_UNIDAD_ACADEM_SCD) -- group by FK
SELECT
banner_questions.*,
student_ua.name -- Join with name
from NORMALIZADO_PRELIMINAR.AcademicUnit as student_ua INNER JOIN
banner_questions on student_ua.id = banner_questions.IDN_UNIDAD_ACADEM_SCD
In terms of performance, I'd like to know if one of these alternatives is better and under what assumptions. Also, I'd like to know if there are better choices to get the same result.

In the question I think you mean foreign key rather than primary key: the field is a primary key in another table (Academic_unit), but here you are looking at, say, student records that have an FK to Academic_unit.
So the question is for the field alumno.NOM_UNIDAD_ACADEM_SCD - do you GROUP BY it, MAX() it or JOIN it later?
Personally, I suggest:
trying all three and seeing which runs fastest. Which is best really depends on the specific circumstances, and they often perform very similarly
using the simplest version if they run at similar speeds, which is likely to be the GROUP BY version
In particular, the GROUP BY and MAX() versions should result in almost identical plans, as they are sorted the same way.
The 'join it later' approach can have some speed advantages in certain circumstances (particularly when it's not just being joined to a reference table, but to a broader set of sub-queries), but I'm often wary about these. They have the disadvantage of making your code a bit more complex, which can cause issues if you use the data for other things, or if SQL Server has bad estimates for the amount of data it expects. In this case, as this is just linking to the reference table alumno, it's unlikely to give any specific advantage.
In your code for option 3 above, you still have links to BANNER_ENCUESTA.ALUMNO_SCD AS alumno. The advantage of doing the join later would be to remove that from the initial grouping component, then link to it later to get the specific values e.g.,
In the GROUP BY within the CTE, also group by ria.COD_ALUMNO_SCD, but remove BANNER_ENCUESTA.ALUMNO_SCD AS alumno from the FROM clause
Put BANNER_ENCUESTA.ALUMNO_SCD AS alumno into the main SELECT part of the query, and join to banner_questions on that field
Note there is also a fourth option (temporary tables) which is used when
SQL Server gets estimates for how many rows it expects really wrong - and makes a really bad plan
You're joining not to reference tables, but to views (particularly if they have 'TOP' expressions or 'GROUP BY' in them) - in these cases, SQL Server may sometimes run the view completely once for every row in the join.
In these cases, it can be useful to split the query into two parts along the lines of #3, but instead of a CTE, you save it into a temporary table e.g., SELECT .... INTO #temp FROM ... GROUP BY.
You then use the temporary table, joined to the view that was problematic, and it will often run better.
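As a rough way to convince yourself that the three query shapes are interchangeable before timing them, here is a minimal sketch that runs all three against a tiny in-memory SQLite database from Python. The tables unit, student and answer and their rows are made up for illustration (student plays the role of ALUMNO_SCD, unit the role of AcademicUnit). SQLite is only used to check that the results match; it says nothing about SQL Server's plans or timings.

```python
import sqlite3

# Hypothetical miniature schema; all names and rows are invented for this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE unit    (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE student (student_id INTEGER PRIMARY KEY, unit_id INTEGER, unit_name TEXT);
CREATE TABLE answer  (student_id INTEGER, item_id INTEGER, point INTEGER);
INSERT INTO unit    VALUES (10, 'Engineering'), (20, 'Law');
INSERT INTO student VALUES (1, 10, 'Engineering'), (2, 10, 'Engineering'), (3, 20, 'Law');
INSERT INTO answer  VALUES (1, 100, 1), (2, 100, 0), (3, 100, 1), (1, 101, 1), (2, 101, 1);
""")

# Shared tail of the SELECT list plus the FROM clause, reused by every variant.
COUNTS = """
       COUNT(CASE WHEN a.point = 1 THEN 1 END) AS amount_yes,
       COUNT(CASE WHEN a.point = 0 THEN 1 END) AS amount_no
FROM answer a JOIN student s ON s.student_id = a.student_id
"""

# Option 1: group by both the id and the dependent name.
group_by_both = conn.execute(
    "SELECT a.item_id, s.unit_id, s.unit_name," + COUNTS +
    "GROUP BY a.item_id, s.unit_id, s.unit_name ORDER BY a.item_id, s.unit_id"
).fetchall()

# Option 2: group by the id only and aggregate the name with MAX.
aggregate_name = conn.execute(
    "SELECT a.item_id, s.unit_id, MAX(s.unit_name)," + COUNTS +
    "GROUP BY a.item_id, s.unit_id ORDER BY a.item_id, s.unit_id"
).fetchall()

# Option 3: group by the id only in a CTE, then join to the lookup table for the name.
join_later = conn.execute(
    "WITH q AS (SELECT a.item_id, s.unit_id," + COUNTS +
    "GROUP BY a.item_id, s.unit_id) "
    "SELECT q.item_id, q.unit_id, u.name, q.amount_yes, q.amount_no "
    "FROM q JOIN unit u ON u.id = q.unit_id ORDER BY q.item_id, q.unit_id"
).fetchall()

print(group_by_both)
print(group_by_both == aggregate_name == join_later)
```

Once the variants agree on your real data, comparing their actual execution plans and timings on SQL Server is what decides between them.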

Related

Semi-join vs Subqueries

What is the difference between semi-joins and a subquery? I am currently taking a course on this on DataCamp and I'm having a hard time making a distinction between the two.
Thanks in advance.
A join or a semi-join is needed whenever you want to combine records from two or more entities based on some common attributes.
A subquery, by contrast, is needed whenever you want a lookup or reference against the same table or other tables.
In short, when you need additional reference columns added to the existing table's attributes, go for a join; when you want a lookup on records from the same table or other tables while keeping the same existing columns as output, go for a subquery.
Also, a semi-join can be written as a subquery: most of the time we don't actually join the right table, but instead use a subquery as a check to limit the records returned from the existing table. Hence the name semi-join, although it isn't a subquery by itself.
I don't really think of a subquery and a semi-join as anything similar. A subquery is nothing more interesting than a query that is used inside another query:
select * -- this is often called the "outer" query
from (
select columnA -- this is the subquery inside the parentheses
from mytable
where columnB = 'Y'
) sub -- many databases require an alias for a derived table
A semi-join is a concept based on join. Of course, joining tables will combine both tables and return the combined rows based on the join criteria. From there you select the columns you want from either table based on further where criteria (and of course whatever else you want to do). The concept of a semi-join is when you want to return rows from the first table only, but you need the 2nd table to decide which rows to return. Example: you want to return the people in a class:
select p.FirstName, p.LastName, p.DOB
from people p
inner join class c on c.pID = p.pID
where c.ClassName = 'SQL 101'
group by p.pID, p.FirstName, p.LastName, p.DOB
This accomplishes the concept of a semi-join. We are only returning columns from the first table (people). The use of the group by is necessary for the concept of a semi-join because a true join can return duplicate rows from the first table (depending on the join criteria). The above example is not often referred to as a semi-join, and is not the most typical way to accomplish it. The following query is a more common method of accomplishing a semi-join:
select FirstName, LastName, DOB
from people
where pID in (select pID
from class
where ClassName = 'SQL 101'
)
There is no formal join here. But we're using the 2nd table to determine which rows from the first table to return. It's a lot like saying if we did join the 2nd table to the first table, what rows from the first table would match?
For performance, exists is often preferred, although many optimizers produce the same plan for in and exists:
select FirstName, LastName, DOB
from people p
where exists (select pID
from class c
where c.pID = p.pID
and c.ClassName = 'SQL 101'
)
In my opinion, this is the most direct way to understand the semi-join. There is still no formal join, but you can see the idea of a join hinted at by the usage of directly matching the first table's pID column to the 2nd table's pID column.
Final note. The last 2 queries above each use a subquery to accomplish the concept of a semi-join.
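To make the difference concrete, here is a small sketch using Python's sqlite3 module. The rows are invented for illustration, and the people table is cut down to two columns. One person is enrolled in 'SQL 101' twice, so a plain join repeats that person while the IN and EXISTS forms return each person at most once.

```python
import sqlite3

# Invented sample data: Ann has two matching rows in class.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE people (pID INTEGER PRIMARY KEY, FirstName TEXT);
CREATE TABLE class  (pID INTEGER, ClassName TEXT);
INSERT INTO people VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
INSERT INTO class  VALUES (1, 'SQL 101'), (1, 'SQL 101'), (2, 'SQL 101'), (3, 'Art');
""")

# A true join: one output row per matching pair, so Ann appears twice.
join_rows = conn.execute("""
    SELECT p.FirstName FROM people p
    JOIN class c ON c.pID = p.pID
    WHERE c.ClassName = 'SQL 101'
    ORDER BY p.pID
""").fetchall()

# Semi-join via IN: the second table only decides which people qualify.
in_rows = conn.execute("""
    SELECT FirstName FROM people
    WHERE pID IN (SELECT pID FROM class WHERE ClassName = 'SQL 101')
    ORDER BY pID
""").fetchall()

# Semi-join via EXISTS: same result, with the match condition spelled out.
exists_rows = conn.execute("""
    SELECT p.FirstName FROM people p
    WHERE EXISTS (SELECT 1 FROM class c
                  WHERE c.pID = p.pID AND c.ClassName = 'SQL 101')
    ORDER BY p.pID
""").fetchall()

print(join_rows)    # Ann is listed twice
print(in_rows)      # each person at most once
print(exists_rows)  # same as the IN form
```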

Alternative for joining two tables multiple times

I have a situation where I have to join a table multiple times. Most of them need to be left joins, since some of the values are not available. How can I overcome the poor query performance when joining multiple times?
The Scenario
Tables
[Project]: ProjectId Guid, Name VARCHAR(MAX).
[UDF]: EntityId Guid, EntityType Char(1), UDFCode Guid, UDFName varchar(20)
[UDFDetail]: UDFCode Guid, Description VARCHAR(MAX)
Relationship:
[Project].ProjectId - [UDF].EntityId
[UDFDetail].UDFCode - [UDF].UDFCode
The UDF table holds custom fields for projects, based on the UDFName column. The value for these fields, however, is stored on the UDFDetail, in the column Description.
I have lots of custom columns for Project, and they are stored in the UDF table.
So for example, to get two fields for the project I do the following select:
SELECT
p.Name ProjectName,
ud1.Description Field1,
ud1.UDFCode Field1Id,
ud2.Description Field2,
ud2.UDFCode Field2Id
FROM
Project p
LEFT JOIN UDF u1 ON
u1.EntityId = p.ProjectId AND u1.UDFName='Field1'
LEFT JOIN UDFDetail ud1 ON
ud1.UDFCode = u1.UDFCode
LEFT JOIN UDF u2 ON
u2.EntityId = p.ProjectId AND u2.UDFName='Field2'
LEFT JOIN UDFDetail ud2 ON
ud2.UDFCode = u2.UDFCode
The Problem
Imagine the above select but joining about 15 fields. In my query I have around 10 fields already and the performance is not very good. It is taking about 20 seconds to run. I have good indexes for these tables, so looking at the execution plan, it is doing only index seeks without any lookups. Regarding the joins, they need to be left joins, because Field 1 might not exist for that specific project.
The Question
Is there a more performant way to retrieve the data?
How would you do the query to retrieve 10 different fields for one project in a schema like this?
Your choices are pivot, explicit aggregation (with conditional functions), or the joins. If you have the appropriate indexes set up, the joins may be the fastest method.
The correct index would be UDF(EntityId, UDFName, UDFCode).
You can test if the group by is faster by running a query such as:
SELECT count(*)
FROM Project p LEFT JOIN
UDF u1
ON u1.EntityId = p.ProjectId LEFT JOIN
UDFDetail ud1
ON ud1.UDFCode = u1.UDFCode;
If this runs fast enough, then you can consider the group by approach.
You can try this rather odd contraption (it does not look pretty, but it does a single set of outer joins). The intermediate result is a very "wide" and "long" dataset, which we can then "compact" with aggregation: for each ProjectName, each Field1 column will have N results, N-1 NULLs and 1 non-null value, which is then selected with a simple MAX aggregation (N is the number of fields).
select ProjectName, max(Field1) as Field1, max(Field1Id) as Field1Id, max(Field2) as Field2, max(Field2Id) as Field2Id
from (
select
p.Name as ProjectName,
case when u.UDFName='Field1' then ud.Description else NULL end as Field1,
case when u.UDFName='Field1' then ud.UDFCode else NULL end as Field1Id,
case when u.UDFName='Field2' then ud.Description else NULL end as Field2,
case when u.UDFName='Field2' then ud.UDFCode else NULL end as Field2Id
from Project p
left join UDF u on p.ProjectId=u.EntityId
left join UDFDetail ud on u.UDFCode=ud.UDFCode
) tmp
group by ProjectName
The query can actually be rewritten without the inner query, but that should not make a big difference :), and looking at Gordon Linoff's suggestion and your answer, it might actually take just about 20 seconds as well, but it is still worth giving a try.
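For anyone who wants to check that the aggregation version returns the same rows as the repeated left joins, here is a minimal sketch on SQLite through Python's sqlite3 module. The sample rows are invented, plain text stands in for the Guid columns, and the query is written without the inner query. It only checks that the two shapes agree; it is not a benchmark.

```python
import sqlite3

# Invented sample data; text codes stand in for the Guid columns.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Project   (ProjectId TEXT PRIMARY KEY, Name TEXT);
CREATE TABLE UDF       (EntityId TEXT, UDFName TEXT, UDFCode TEXT);
CREATE TABLE UDFDetail (UDFCode TEXT PRIMARY KEY, Description TEXT);
INSERT INTO Project   VALUES ('p1', 'Alpha'), ('p2', 'Beta'), ('p3', 'Gamma');
INSERT INTO UDF       VALUES ('p1', 'Field1', 'c1'), ('p1', 'Field2', 'c2'), ('p2', 'Field1', 'c3');
INSERT INTO UDFDetail VALUES ('c1', 'red'), ('c2', 'large'), ('c3', 'blue');
""")

# One pair of outer joins, then conditional aggregation to fold the rows
# back into one row per project (MAX ignores the NULLs from other fields).
pivoted = conn.execute("""
    SELECT p.Name,
           MAX(CASE WHEN u.UDFName = 'Field1' THEN ud.Description END) AS Field1,
           MAX(CASE WHEN u.UDFName = 'Field2' THEN ud.Description END) AS Field2
    FROM Project p
    LEFT JOIN UDF u        ON u.EntityId = p.ProjectId
    LEFT JOIN UDFDetail ud ON ud.UDFCode = u.UDFCode
    GROUP BY p.ProjectId, p.Name
    ORDER BY p.Name
""").fetchall()

# The original shape: one pair of left joins per field.
joined = conn.execute("""
    SELECT p.Name, ud1.Description, ud2.Description
    FROM Project p
    LEFT JOIN UDF u1        ON u1.EntityId = p.ProjectId AND u1.UDFName = 'Field1'
    LEFT JOIN UDFDetail ud1 ON ud1.UDFCode = u1.UDFCode
    LEFT JOIN UDF u2        ON u2.EntityId = p.ProjectId AND u2.UDFName = 'Field2'
    LEFT JOIN UDFDetail ud2 ON ud2.UDFCode = u2.UDFCode
    ORDER BY p.Name
""").fetchall()

print(pivoted)
print(pivoted == joined)
```

Grouping by ProjectId as well as Name is a choice made in this sketch so that two projects with the same name are not merged.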

Simple Table Join with access to joined table columns

I have a simple table of items. Call them "Parts".
Each part can have zero or more related entries in a separate table, call them "SubParts".
A simple view of the tables is, as you would probably expect:
Parts
-----
PartID int (PK)
PartName varchar
SubParts
--------
SubPartID int (PK)
PartID int (FK_Parts)
SubPartName varchar
SubPartAdded datetime
I would like to return all parts from the primary table, but also have access to the LATEST (order by SubPartAdded DESC) related SubPart if it exists.
My confusion is that there are 1M+ entries in the subparts table (for many different parts) and I only need the latest one for the current part, if it exists.
Earlier I wrote a statement that performed a left join between the Parts table and a derived table of related Subparts (which works) but the derived table seems to return ALL rows in the subparts table, causing a performance hit. I essentially need to do a TOP 1 and order by DESC in the derived select statement, to prefilter the subparts by PartID (and some other columns). However, as I can't seem to make reference to the Parts table (outer) columns in the derived select statement, I can't add a WHERE clause to the derived table.
I have also tried the following snippet which does execute, but doesn't return any related records:
SELECT p.PartName, sp.SubPartName, sp.SubPartAdded
FROM Parts p
LEFT JOIN (SELECT TOP 1 SubPartID, SubpartAdded, PartID FROM SubParts ORDER BY SubPartAdded) AS sp
ON sp.PartID = p.PartID
I imagine the "TOP 1" statement is executing against the whole SubParts table, before being filtered by the "ON" statement (?)
Ultimately I need to use some columns from the Subparts table in multiple locations throughout the main stored proc, so I don't simply want a correlated subquery as this would need to be called multiple times.
(This proc will return multiple parts on each execution. ie. The proc will not be filtered by a single PartID)
I hope this is pretty clear?
It sounds like it should have a very simple solution, but I'm currently stumped!
(Compatibility with SQL Server 2K and above is required)
Regards
Nick
The following should work back to SQL Server 2000.
SELECT PartName, SubPartName, SubPartAdded
FROM Parts
LEFT JOIN
( SELECT SubParts.PartID, SubParts.SubPartName, SubParts.SubPartAdded
FROM SubParts
INNER JOIN
( SELECT PartID, MAX(SubPartAdded) [SubPartAdded]
FROM SubParts
GROUP BY PartID
) MaxSubPart
ON MaxSubPart.PartID = SubParts.PartID
AND MaxSubPart.SubPartAdded = SubParts.SubPartAdded
) Subpart
ON SubPart.PartID = Parts.PartID
There are more efficient and elegant ways to do this in later versions (OUTER APPLY, or window functions), but I am not certain how many of the methods are backwards compatible with SQL Server 2000.
You need a derived table to extract the latest SubPartAdded per part. Then you can filter SubParts by joining it to the derived table on the PartID and SubPartAdded columns:
SELECT p.PartName, sp.SubPartName, sp.SubPartAdded
FROM Parts p
LEFT JOIN
(
SELECT PartID, max (SubPartAdded) MaxSubPartAdded
FROM SubParts
GROUP BY PartID
) AS MaxSP
ON p.PartID = MaxSP.PartID
LEFT JOIN SubParts sp
ON sp.PartID = MaxSP.PartID
AND sp.SubPartAdded = MaxSP.MaxSubPartAdded
If Sql Server 2005 or newer:
SELECT p.PartName, sp.SubPartName, sp.SubPartAdded
FROM Parts p
OUTER APPLY
(
select top 1
SubPartName, SubPartAdded
from SubParts
where SubParts.PartID = p.PartID
order by SubPartAdded desc
) sp
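Here is a small sketch of the MAX-per-group approach, run on SQLite through Python's sqlite3 module with invented rows. SQLite has no OUTER APPLY, so only the derived-table form is shown. Dates are stored as ISO-8601 text, which is an assumption of this sketch so that MAX picks the latest one.

```python
import sqlite3

# Invented sample data: Engine has two subparts, Wheel one, Mirror none.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Parts    (PartID INTEGER PRIMARY KEY, PartName TEXT);
CREATE TABLE SubParts (SubPartID INTEGER PRIMARY KEY, PartID INTEGER,
                       SubPartName TEXT, SubPartAdded TEXT);
INSERT INTO Parts    VALUES (1, 'Engine'), (2, 'Wheel'), (3, 'Mirror');
INSERT INTO SubParts VALUES (1, 1, 'Piston', '2020-01-01'),
                            (2, 1, 'Valve',  '2021-06-01'),
                            (3, 2, 'Rim',    '2019-03-01');
""")

# Latest subpart per part: find MAX(SubPartAdded) per PartID, join back to
# SubParts to get the full row, then left join so parts without subparts stay.
latest = conn.execute("""
    SELECT p.PartName, sp.SubPartName, sp.SubPartAdded
    FROM Parts p
    LEFT JOIN (
        SELECT s.PartID, s.SubPartName, s.SubPartAdded
        FROM SubParts s
        JOIN (SELECT PartID, MAX(SubPartAdded) AS MaxAdded
              FROM SubParts GROUP BY PartID) m
          ON m.PartID = s.PartID AND m.MaxAdded = s.SubPartAdded
    ) sp ON sp.PartID = p.PartID
    ORDER BY p.PartName
""").fetchall()

# For comparison: a plain left join returns every subpart, not just the latest.
naive_count = conn.execute("""
    SELECT COUNT(*) FROM Parts p LEFT JOIN SubParts sp ON sp.PartID = p.PartID
""").fetchone()[0]

print(latest)
print(naive_count)
```

Note that if two subparts of the same part share the same SubPartAdded value, this form returns both of them.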

Whether Inner Queries Are Okay?

I often see something like...
SELECT events.id, events.begin_on, events.name
FROM events
WHERE events.user_id IN ( SELECT contacts.user_id
FROM contacts
WHERE contacts.contact_id = '1')
OR events.user_id IN ( SELECT contacts.contact_id
FROM contacts
WHERE contacts.user_id = '1')
Is it okay to have a query within a query? Is it an "inner query"? A "sub-query"? Does it count as three queries (my example)? If it's bad to do so... how can I rewrite my example?
Your example isn't too bad. The biggest problems usually come from cases where there is what's called a "correlated subquery". That's when the subquery is dependent on a column from the outer query. These are particularly bad because the subquery effectively needs to be rerun for every row in the potential results.
You can rewrite your subqueries using joins and GROUP BY, but as you have it performance can vary, especially depending on your RDBMS.
It varies from database to database, especially depending on whether the columns compared are indexed or not, nullable or not, and so on. But generally, if your query is not using columns from the table joined to, you should be using either IN or EXISTS:
SELECT e.id, e.begin_on, e.name
FROM EVENTS e
WHERE EXISTS (SELECT NULL
FROM CONTACTS c
WHERE ( c.contact_id = '1' AND c.user_id = e.user_id )
OR ( c.user_id = '1' AND c.contact_id = e.user_id ))
Using a JOIN (INNER or OUTER) can inflate records if the child table has more than one record related to a parent table record. That's fine if you need that information, but if not then you need to use either GROUP BY or DISTINCT to get a result set of unique values -- and that can cost you when you review the query costs.
EXISTS
Though EXISTS clauses look like correlated subqueries, they do not execute as such (RBAR: Row By Agonizing Row). EXISTS returns a boolean based on the criteria provided, and exits on the first instance that is true -- this can make it faster than IN when dealing with duplicates in a child table.
You could JOIN to the Contacts table instead:
SELECT events.id, events.begin_on, events.name
FROM events
JOIN contacts
ON (contacts.contact_id = '1' AND events.user_id = contacts.user_id)
OR (contacts.user_id = '1' AND events.user_id = contacts.contact_id)
GROUP BY events.id
-- exercise: without the GROUP BY, how many duplicate rows can you end up with?
This leaves the following question up to the database: "Should we look through all the contacts table and find all the '1's in the various columns, or do something else?" where your original SQL didn't give it much choice.
The most common term for this sort of query is "subquery." There is nothing inherently wrong with using them, and they can make your life easier. However, performance can often be improved by rewriting queries with subqueries to use JOINs instead, because the server can find optimizations.
In your example, three queries are executed: the main SELECT query, and the two SELECT subqueries.
SELECT events.id, events.begin_on, events.name
FROM events
JOIN contacts
ON (contacts.contact_id = '1' AND events.user_id = contacts.user_id)
OR (contacts.user_id = '1' AND events.user_id = contacts.contact_id)
GROUP BY events.id
In your case, I believe the JOIN version will be better as you can avoid two SELECT queries on contacts, opting for the JOIN instead.
See the mysql docs on the topic.
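To see how the IN, EXISTS and JOIN forms compare on actual rows, here is a minimal sketch on SQLite via Python's sqlite3 module. The data is invented, and integer ids are used instead of the quoted '1' from the question. User 2 is linked to user 1 in both directions, which is exactly the case where a join produces a duplicate.

```python
import sqlite3

# Invented sample data: user 1 lists user 2 as a contact, and users 2 and 3
# each list user 1 as a contact.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events   (id INTEGER PRIMARY KEY, user_id INTEGER, name TEXT);
CREATE TABLE contacts (user_id INTEGER, contact_id INTEGER);
INSERT INTO events   VALUES (10, 2, 'a'), (11, 3, 'b'), (12, 4, 'c'), (13, 1, 'd');
INSERT INTO contacts VALUES (1, 2), (3, 1), (2, 1), (4, 5);
""")

def ids(sql):
    """Run a query and return the first column as a plain list."""
    return [row[0] for row in conn.execute(sql)]

# The question's form: two uncorrelated IN subqueries.
in_ids = ids("""
    SELECT e.id FROM events e
    WHERE e.user_id IN (SELECT user_id FROM contacts WHERE contact_id = 1)
       OR e.user_id IN (SELECT contact_id FROM contacts WHERE user_id = 1)
    ORDER BY e.id
""")

# A single EXISTS with both conditions.
exists_ids = ids("""
    SELECT e.id FROM events e
    WHERE EXISTS (SELECT 1 FROM contacts c
                  WHERE (c.contact_id = 1 AND c.user_id = e.user_id)
                     OR (c.user_id = 1 AND c.contact_id = e.user_id))
    ORDER BY e.id
""")

JOIN_BODY = """
    FROM events e
    JOIN contacts c
      ON (c.contact_id = 1 AND c.user_id = e.user_id)
      OR (c.user_id = 1 AND c.contact_id = e.user_id)
    ORDER BY e.id
"""
join_ids = ids("SELECT e.id" + JOIN_BODY)               # inflated by the join
distinct_ids = ids("SELECT DISTINCT e.id" + JOIN_BODY)  # deduplicated again

print(in_ids, exists_ids, join_ids, distinct_ids)
```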

COUNT(*) vs. COUNT(1) vs. COUNT(pk): which is better? [duplicate]

I often find these three variants:
SELECT COUNT(*) FROM Foo;
SELECT COUNT(1) FROM Foo;
SELECT COUNT(PrimaryKey) FROM Foo;
As far as I can see, they all do the same thing, and I find myself using the three in my codebase. However, I don't like to do the same thing different ways. To which one should I stick? Is any one of them better than the two others?
Bottom Line
Use either COUNT(field) or COUNT(*), and stick with it consistently, and if your database allows COUNT(tableHere) or COUNT(tableHere.*), use that.
In short, don't use COUNT(1) for anything. It's a one-trick pony, which rarely does what you want, and in those rare cases is equivalent to count(*)
Use count(*) for counting
Use * for all queries that need to count everything, even for joins:
SELECT boss.boss_id, COUNT(subordinate.*)
FROM boss
LEFT JOIN subordinate on subordinate.boss_id = boss.boss_id
GROUP BY boss.boss_id
But don't use COUNT(*) for LEFT joins, as that will return 1 even if the subordinate table doesn't match anything from the parent table:
SELECT boss.boss_id, COUNT(*)
FROM boss
LEFT JOIN subordinate on subordinate.boss_id = boss.boss_id
GROUP BY boss.boss_id
Don't be fooled by those who claim that * in COUNT fetches the entire row from your table and is therefore slow. The * in SELECT COUNT(*) and the * in SELECT * have no bearing on each other; they are entirely different things that just share a common token, i.e. *.
An alternate syntax
In fact, if it were not permitted to give a field the same name as its table, an RDBMS language designer could give COUNT(tableNameHere) the same semantics as COUNT(*). Example:
For counting rows we could have this:
SELECT COUNT(emp) FROM emp
And they could make it simpler:
SELECT COUNT() FROM emp
And for LEFT JOINs, we could have this:
SELECT boss.boss_id, COUNT(subordinate)
FROM boss
LEFT JOIN subordinate on subordinate.boss_id = boss.boss_id
GROUP BY boss.boss_id
But they cannot do that (COUNT(tableNameHere)) since the SQL standard permits giving a field the same name as its table:
CREATE TABLE fruit -- ORM-friendly name
(
fruit_id int NOT NULL,
fruit varchar(50), /* same name as table name,
and let's say, someone forgot to put NOT NULL */
shape varchar(50) NOT NULL,
color varchar(50) NOT NULL
)
Counting with null
Also, it is not good practice to make a field nullable if its name matches the table name. Say you have the values 'Banana', 'Apple', NULL, 'Pears' in the fruit field. The following will not count all rows; it will yield only 3, not 4:
SELECT count(fruit) FROM fruit
Some RDBMSs do follow that principle (for counting a table's rows, they accept the table name as COUNT's parameter). This works in PostgreSQL, as long as there is no subordinate field in either of the two tables below, i.e. as long as there is no name conflict between a field name and a table name:
SELECT boss.boss_id, COUNT(subordinate)
FROM boss
LEFT JOIN subordinate on subordinate.boss_id = boss.boss_id
GROUP BY boss.boss_id
But that could cause confusion later if we add a subordinate field to the table, as it would then count the field (which could be nullable), not the table rows.
So to be on the safe side, use:
SELECT boss.boss_id, COUNT(subordinate.*)
FROM boss
LEFT JOIN subordinate on subordinate.boss_id = boss.boss_id
GROUP BY boss.boss_id
count(1): The one-trick pony
COUNT(1) in particular is a one-trick pony; it works well only on single-table queries:
SELECT COUNT(1) FROM tbl
But when you use joins, that trick won't work on multi-table queries without its semantics being confused, and in particular you cannot write:
-- count the subordinates that belongs to boss
SELECT boss.boss_id, COUNT(subordinate.1)
FROM boss
LEFT JOIN subordinate on subordinate.boss_id = boss.boss_id
GROUP BY boss.boss_id
So what's the meaning of COUNT(1) here?
SELECT boss.boss_id, COUNT(1)
FROM boss
LEFT JOIN subordinate on subordinate.boss_id = boss.boss_id
GROUP BY boss.boss_id
Is it this...?
-- counting all the subordinates only
SELECT boss.boss_id, COUNT(subordinate.boss_id)
FROM boss
LEFT JOIN subordinate on subordinate.boss_id = boss.boss_id
GROUP BY boss.boss_id
Or this...?
-- or is that COUNT(1) will also count 1 for boss regardless if boss has a subordinate
SELECT boss.boss_id, COUNT(*)
FROM boss
LEFT JOIN subordinate on subordinate.boss_id = boss.boss_id
GROUP BY boss.boss_id
With careful thought, you can infer that COUNT(1) is the same as COUNT(*), regardless of the type of join. But for LEFT JOIN results, we cannot mold COUNT(1) to work like COUNT(subordinate.boss_id) or COUNT(subordinate.*).
So just use either of the following:
-- count the subordinates that belongs to boss
SELECT boss.boss_id, COUNT(subordinate.boss_id)
FROM boss
LEFT JOIN subordinate on subordinate.boss_id = boss.boss_id
GROUP BY boss.boss_id
The following works on PostgreSQL, and it makes clear that you want to count the cardinality of the set:
-- count the subordinates that belongs to boss
SELECT boss.boss_id, COUNT(subordinate.*)
FROM boss
LEFT JOIN subordinate on subordinate.boss_id = boss.boss_id
GROUP BY boss.boss_id
Another way to count the cardinality of the set, very English-like (just don't give a column the same name as its table): http://www.sqlfiddle.com/#!1/98515/7
select boss.boss_name, count(subordinate)
from boss
left join subordinate on subordinate.boss_code = boss.boss_code
group by boss.boss_name
You cannot do this: http://www.sqlfiddle.com/#!1/98515/8
select boss.boss_name, count(subordinate.1)
from boss
left join subordinate on subordinate.boss_code = boss.boss_code
group by boss.boss_name
You can do this, but it produces a wrong result: http://www.sqlfiddle.com/#!1/98515/9
select boss.boss_name, count(1)
from boss
left join subordinate on subordinate.boss_code = boss.boss_code
group by boss.boss_name
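The LEFT JOIN behaviour described above can be checked with a small sketch on SQLite through Python's sqlite3 module. The rows are invented, and since SQLite does not accept COUNT(subordinate.*), the sketch counts subordinate.boss_id instead. Bob has no subordinates, yet COUNT(*) and COUNT(1) both report 1 for him because the outer join still produces one row.

```python
import sqlite3

# Invented sample data: Ann has two subordinates, Bob has none.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE boss        (boss_id INTEGER PRIMARY KEY, boss_name TEXT);
CREATE TABLE subordinate (sub_id INTEGER PRIMARY KEY, boss_id INTEGER);
INSERT INTO boss        VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO subordinate VALUES (1, 1), (2, 1);
""")

# Columns: name, COUNT(*), COUNT(1), COUNT of a column from the outer-joined table.
counts = conn.execute("""
    SELECT boss.boss_name,
           COUNT(*),
           COUNT(1),
           COUNT(subordinate.boss_id)
    FROM boss
    LEFT JOIN subordinate ON subordinate.boss_id = boss.boss_id
    GROUP BY boss.boss_id, boss.boss_name
    ORDER BY boss.boss_name
""").fetchall()

print(counts)
```

Only the last column gives 0 for Bob, because the NULL that the outer join fills in for subordinate.boss_id is not counted.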
Two of them always produce the same answer:
COUNT(*) counts the number of rows
COUNT(1) also counts the number of rows
Assuming the pk is a primary key and that no nulls are allowed in the values, then
COUNT(pk) also counts the number of rows
However, if pk is not constrained to be not null, then it produces a different answer:
COUNT(possibly_null) counts the number of rows with non-null values in the column possibly_null.
COUNT(DISTINCT pk) also counts the number of rows (because a primary key does not allow duplicates).
COUNT(DISTINCT possibly_null_or_dup) counts the number of distinct non-null values in the column possibly_null_or_dup.
COUNT(DISTINCT possibly_duplicated) counts the number of distinct (necessarily non-null) values in the column possibly_duplicated when that has the NOT NULL clause on it.
Normally, I write COUNT(*); it is the original recommended notation for SQL. Similarly, with the EXISTS clause, I normally write WHERE EXISTS(SELECT * FROM ...) because that was the original recommended notation. There should be no benefit to the alternatives; the optimizer should see through the more obscure notations.
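The rules listed above can be verified with a short sketch on SQLite via Python's sqlite3 module. The table t and its rows are invented for illustration: pk is a primary key, while possibly_null contains one NULL and one duplicate.

```python
import sqlite3

# Invented sample data: four rows, one NULL and one duplicated value.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (pk INTEGER PRIMARY KEY, possibly_null TEXT);
INSERT INTO t VALUES (1, 'a'), (2, 'a'), (3, NULL), (4, 'b');
""")

result = conn.execute("""
    SELECT COUNT(*),                      -- all rows
           COUNT(1),                      -- all rows (1 is never NULL)
           COUNT(pk),                     -- all rows (pk is never NULL)
           COUNT(possibly_null),          -- rows with a non-NULL value
           COUNT(DISTINCT pk),            -- all rows (pk has no duplicates)
           COUNT(DISTINCT possibly_null)  -- distinct non-NULL values
    FROM t
""").fetchone()

print(result)
```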
Asked and answered before...
Books Online says "COUNT ( { [ [ ALL | DISTINCT ] expression ] | * } )"
"1" is a non-null expression so it's the same as COUNT(*).
The optimiser recognises it as trivial so gives the same plan. A PK is unique and non-null (in SQL Server at least) so COUNT(PK) = COUNT(*)
This is a similar myth to EXISTS (SELECT * ... or EXISTS (SELECT 1 ...
And see the ANSI 92 spec, section 6.5, General Rules, case 1
a) If COUNT(*) is specified, then the result is the cardinality
of T.
b) Otherwise, let TX be the single-column table that is the
result of applying the <value expression> to each row of T
and eliminating null values. If one or more null values are
eliminated, then a completion condition is raised: warning-
null value eliminated in set function.
At least on Oracle they are all the same: http://www.oracledba.co.uk/tips/count_speed.htm
I feel the performance characteristics change from one DBMS to another. It's all on how they choose to implement it. Since I have worked extensively on Oracle, I'll tell from that perspective.
COUNT(*) - Fetches entire row into result set before passing on to the count function, count function will aggregate 1 if the row is not null
COUNT(1) - Will not fetch any row, instead count is called with a constant value of 1 for each row in the table when the WHERE matches.
COUNT(PK) - The PK in Oracle is indexed. This means Oracle has to read only the index. Normally one row in the index B+ tree is many times smaller than the actual row. So considering the disk IOPS rate, Oracle can fetch many times more rows from Index with a single block transfer as compared to entire row. This leads to higher throughput of the query.
From this you can see the first count is the slowest and the last count is the fastest in Oracle.