COUNT(*) in SQL - sql

I understand how count(*) works in SQL when addressing one table, but how does it work on inner joins?
e.g.
SELECT branch, staffNo, Count(*)
FROM Staff s, Properties p
WHERE s.staffNo = p.staffNo
GROUP BY s.staffNo, p.staffNo
staff contains staffNo staffName
properties contains property management details (i.e. which staff manages which property)
This returns the number of properties managed by staff, but how does the count work? As in how does it know what to count?

It's an aggregate function - as such it's managed by your group by clause - each row will correspond to a unique grouping (i.e. staffNo) and Count(*) will return the number of records in the join that match that grouping.
So for example:
SELECT branch, grade, Count(*)
FROM Staff s, Properties p
WHERE s.staffNo = p.staffNo
GROUP BY branch, grade
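would return the number of joined rows, i.e. the number of properties managed by staff of a given grade at each branch (a staff member with three properties contributes three to the count).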
SELECT branch, Count(*)
FROM Staff s, Properties p
WHERE s.staffNo = p.staffNo
GROUP BY branch
would return the total number of properties managed by staff at each branch
SELECT grade, Count(*)
FROM Staff s, Properties p
WHERE s.staffNo = p.staffNo
GROUP BY grade
would return the total number of properties managed by staff at each grade
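To see that Count(*) counts the records in the join for each grouping, here is a minimal sketch using Python's built-in sqlite3 module. The table contents and the branch column on Staff are made up for illustration (the question only lists staffNo and staffName), so treat it as a sketch under those assumptions rather than the asker's real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical sample data; the branch column is an assumption.
    CREATE TABLE Staff (staffNo TEXT PRIMARY KEY, staffName TEXT, branch TEXT);
    CREATE TABLE Properties (propertyNo TEXT PRIMARY KEY, staffNo TEXT);
    INSERT INTO Staff VALUES ('S1', 'Ann', 'B1'), ('S2', 'Bob', 'B1'), ('S3', 'Cy', 'B2');
    INSERT INTO Properties VALUES ('P1', 'S1'), ('P2', 'S1'), ('P3', 'S1'), ('P4', 'S2');
""")

# The join yields one row per property; GROUP BY buckets those rows by
# staff member, and COUNT(*) reports the size of each bucket.
per_staff = conn.execute("""
    SELECT s.branch, s.staffNo, COUNT(*)
    FROM Staff s, Properties p
    WHERE s.staffNo = p.staffNo
    GROUP BY s.branch, s.staffNo
    ORDER BY s.staffNo
""").fetchall()

print(per_staff)  # S3 manages no properties, so it has no joined rows and no group
```

S1 gets a count of 3 and S2 a count of 1 because that is how many joined rows carry each staffNo; nothing in the query names "properties" as the thing to count.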

The aggregate function (whether it's count(), sum(), avg(), etc.) is computed on the rows in each group: that group is then collapsed/summarized/aggregated to a single row according to the select-list defined in the query.
The conceptual model for the execution of a select query is this:
1. Compute the cartesian product of all table references in the FROM clause (as if a cross join were being performed).
2. Apply the join criteria.
3. Filter according to the criteria defined in the where clause.
4. Partition into groups, based on the criteria defined in the group by clause.
5. Reduce each group to a single row, computing the values of each aggregate function on the rows in that group.
6. Filter according to the criteria defined in the having clause.
7. Sort according to the criteria defined in the order by clause.
This conceptual model omits dealing with any compute or compute...by clauses.
Note that this is not actually how anything but a very naive SQL engine would execute a query, but the results should be identical to what you'd [eventually] get if you did it this way.
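The conceptual model above can be acted out by hand in a few lines of plain Python. This is only a sketch of the naive model, not how a real engine works; the sample rows and the trivial HAVING condition are invented for illustration.

```python
from collections import defaultdict
from itertools import product

# Hypothetical sample rows standing in for the two tables.
staff = [{"staffNo": "S1", "branch": "B1"}, {"staffNo": "S2", "branch": "B1"}]
properties = [
    {"propertyNo": "P1", "staffNo": "S1"},
    {"propertyNo": "P2", "staffNo": "S1"},
    {"propertyNo": "P3", "staffNo": "S2"},
]

# Cartesian product, then keep only the pairs that satisfy the join/WHERE criteria.
joined = [(s, p) for s, p in product(staff, properties) if s["staffNo"] == p["staffNo"]]

# Partition into groups keyed by the GROUP BY columns.
groups = defaultdict(list)
for s, p in joined:
    groups[(s["branch"], s["staffNo"])].append((s, p))

# Reduce each group to a single row; COUNT(*) is simply the group's size.
rows = [(branch, staff_no, len(members)) for (branch, staff_no), members in groups.items()]

# HAVING filters whole groups (here COUNT(*) >= 1, which keeps every group).
rows = [r for r in rows if r[2] >= 1]

# ORDER BY staffNo.
rows.sort(key=lambda r: r[1])
print(rows)
```

Note that the aggregate never looks at column names: by the time it runs, each group is just a list of joined rows.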

Your query is invalid.
You have an ambiguous column name staffno.
You are selecting branch but not grouping by it - prepare for a Syntax error (everything but MySQL) or random branches to be selected for you (MySQL).
I think what you want to know, though, is that it will return a count for each "set" of your grouped-by fields, so for each combination of s.staffno, p.staffno how many rows belong in that set.

count(*) simply counts the number of rows in the query result, or in each group when there is a group by.
In your query, it will print the number of rows by staffNo. (It is redundant to have s.staffNo, p.staffNo; either will suffice).

It counts the number of rows for each distinct StaffNo in the joined result (the cartesian product filtered by the WHERE condition).
Also, you should group by Branch, StaffNo.

Related

Confused with the Group By function in SQL

Q1: After using the Group By function, why does it only output one row of each group at most? Does this mean that having is supposed to filter the group rather than filter the records in each group?
Q2: I want to find the records in each group whose ages are greater than the average age of that group. I tried the following, but it returns nothing. How should I fix this?
SELECT *, avg(age) FROM Mytable Group By country Having age > avg(age)
Thanks!!!!
You can calculate the average age for each country in a subquery and join that to your table for filtering:
SELECT mt.*, MTAvg.AvgAge
FROM Mytable mt
INNER JOIN
(
SELECT mtavgs.country
, AVG(mtavgs.age) AS AvgAge
FROM Mytable mtavgs
GROUP BY mtavgs.country
) MTAvg
ON MTAvg.country = mt.country
AND mt.age > MTAvg.AvgAge
GROUP BY always returns one row per unique combination of values in the listed GROUP BY columns (provided that the row is not removed by a HAVING clause). The subquery in our example (alias: MTAvg) calculates a single row per country. We use its results to filter the main table rows through the condition in the INNER JOIN clause, and we also report the calculated average age in the output.
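Here is a small runnable sketch of that subquery-join approach using Python's built-in sqlite3 module. The name column and the sample rows are assumptions made up for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical sample data.
    CREATE TABLE Mytable (name TEXT, country TEXT, age INTEGER);
    INSERT INTO Mytable VALUES
        ('Ana', 'BR', 20), ('Bia', 'BR', 40),
        ('Carl', 'DE', 30), ('Dora', 'DE', 50), ('Emil', 'DE', 70);
""")

# The derived table has one row per country; joining it back keeps only
# the people who are older than their own country's average.
rows = conn.execute("""
    SELECT mt.name, mt.country, mt.age, MTAvg.AvgAge
    FROM Mytable mt
    INNER JOIN (
        SELECT country, AVG(age) AS AvgAge
        FROM Mytable
        GROUP BY country
    ) MTAvg
        ON MTAvg.country = mt.country
       AND mt.age > MTAvg.AvgAge
    ORDER BY mt.name
""").fetchall()

print(rows)
```

With this data the BR average is 30 and the DE average is 50, so only Bia and Emil come back; Dora is exactly at the average and is excluded by the strict comparison.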
GROUP BY is a clause (not itself an aggregate function) that works together with aggregate functions such as AVG and COUNT. A SQL GROUP BY tutorial is worth reading for more detail.
What it does is lump all the rows that share the same grouping value together into one row. In your example it would lump all the rows with the same country together.
Not quite sure what exactly your query needs to be to solve your exact problem. I would however look into what are called window functions in SQL. I believe what you first need to do is write a window function to find the average age in each group. Then you can write a query to return the results you need
Depending on your DBMS type and version, you may be able to use a "window function" that calculates the average per country; with this approach the calculation is available on every row. Once that data is present as a "derived table" you can simply use a where clause to filter for the ages that are greater than the calculated average per country.
SELECT mt.*
FROM (
SELECT *
, avg(age) OVER(PARTITION BY country) AS AvgAge
FROM Mytable
) mt
WHERE mt.Age > mt.AvgAge
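A quick way to try the window-function version is Python's built-in sqlite3 module, assuming the bundled SQLite is 3.25 or newer (the first release with window functions). The name column and the rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical sample data.
    CREATE TABLE Mytable (name TEXT, country TEXT, age INTEGER);
    INSERT INTO Mytable VALUES
        ('Ana', 'BR', 20), ('Bia', 'BR', 40),
        ('Carl', 'DE', 30), ('Dora', 'DE', 50), ('Emil', 'DE', 70);
""")

# AVG(...) OVER (PARTITION BY ...) attaches the per-country average to
# every row without collapsing the rows, so a plain WHERE can compare them.
rows = conn.execute("""
    SELECT mt.name, mt.country, mt.age
    FROM (
        SELECT *, AVG(age) OVER (PARTITION BY country) AS AvgAge
        FROM Mytable
    ) mt
    WHERE mt.age > mt.AvgAge
    ORDER BY mt.name
""").fetchall()

print(rows)
```

Unlike GROUP BY, the window function keeps all five input rows in the derived table; the filtering happens afterwards in the outer WHERE.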

Usage of aggregate function Group by

I have observed that Count function can be used without the usage of aggregate function Group by. Like for example:
Select Count(*) from Employee
It would surely return the count of all the rows without the usage of aggregate function. Then where do we really need the usage of group by?
Omitting the GROUP BY implies that the entire table is one group. Sometimes you want there to be multiple groups. Consider the following example:
SELECT month, SUM(sales) AS total_sales
FROM all_sales
GROUP BY month;
This query gives you a month-by-month breakdown of sales. If you omitted month and the GROUP BY clause, you would only receive the total sales of all time which may not have the granularity you require.
You can also group by multiple columns, giving finer detail still:
SELECT state, city, COUNT(*) AS population
FROM all_people
GROUP BY state, city;
Additionally, GROUP BY pairs naturally with a HAVING clause, which lets us filter groups. Using the above example, we can filter the result to cities with over 1,000,000 people:
SELECT state, city, COUNT(*) AS population
FROM all_people
GROUP BY state, city
HAVING COUNT(*) > 1000000;
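The three cases (no GROUP BY, GROUP BY, GROUP BY with HAVING) can be compared side by side in a small sqlite3 sketch. The six rows are invented, and the HAVING threshold is lowered to 1 so the tiny sample still shows the filtering.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical sample data: one row per person.
    CREATE TABLE all_people (name TEXT, state TEXT, city TEXT);
    INSERT INTO all_people VALUES
        ('a', 'TX', 'Austin'), ('b', 'TX', 'Austin'), ('c', 'TX', 'Austin'),
        ('d', 'TX', 'Waco'), ('e', 'OR', 'Salem'), ('f', 'OR', 'Salem');
""")

# No GROUP BY: the whole table is a single group.
total = conn.execute("SELECT COUNT(*) FROM all_people").fetchone()[0]

# GROUP BY: one output row per (state, city) combination.
by_city = conn.execute("""
    SELECT state, city, COUNT(*) AS population
    FROM all_people
    GROUP BY state, city
    ORDER BY state, city
""").fetchall()

# HAVING: drop groups after aggregation (threshold scaled down for the sample).
big_cities = conn.execute("""
    SELECT state, city, COUNT(*) AS population
    FROM all_people
    GROUP BY state, city
    HAVING COUNT(*) > 1
    ORDER BY state, city
""").fetchall()

print(total, by_city, big_cities)
```

The same COUNT(*) expression appears in all three queries; only the grouping changes what set of rows it is applied to.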
The group by clause is used to break up aggregate results to groups of unique values. E.g., let's say you don't want to know how many employees you have, but how many by each first name (e.g., two Gregs, one Adam and three Scotts):
SELECT first_name, COUNT(*)
FROM employee
GROUP BY first_name

How does GROUP BY use COUNT(*)

I have this query which finds the number of properties handled by each staff member along with their branch number:
SELECT s.branchNo, s.staffNo, COUNT(*) AS myCount
FROM Staff s, PropertyForRent p
WHERE s.staffNo=p.staffNo
GROUP BY s.branchNo, s.staffNo
The two relations are:
Staff{staffNo, fName, lName, position, sex, DOB, salary, branchNo}
PropertyForRent{propertyNo, street, city, postcode, type, rooms, rent, ownerNo, staffNo, branchNo}
How does SQL know what COUNT(*) is referring to? Why does it count the number of properties and not (say for example), the number of staff per branch?
This is a bit long for a comment.
COUNT(*) is counting the number of rows in each group. It is not specifically counting any particular column. Instead, what is happening is that the join produces one row per property, so the properties are what cause multiple rows for given values of s.branchNo and s.staffNo.
It gets even a little more "confusing" if you include a column name. The following would all typically return the same value:
COUNT(*)
COUNT(s.branchNo)
COUNT(s.staffNo)
COUNT(p.propertyNo)
With a column name, COUNT() determines the number of rows that do not have a NULL value in the column.
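The NULL behaviour is easy to check with a small sqlite3 sketch. The three rows are invented; the third property deliberately has no staff member assigned.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical sample data; P3 has a NULL staffNo.
    CREATE TABLE PropertyForRent (propertyNo TEXT, staffNo TEXT);
    INSERT INTO PropertyForRent VALUES ('P1', 'S1'), ('P2', 'S1'), ('P3', NULL);
""")

# COUNT(*) counts rows; COUNT(column) counts rows where that column is not NULL.
counts = conn.execute("""
    SELECT COUNT(*), COUNT(propertyNo), COUNT(staffNo)
    FROM PropertyForRent
""").fetchone()

print(counts)
```

In the original join the four COUNT variants agree because an inner join on staffNo can never produce a row where the join column is NULL, and the other columns are keys.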
And finally, you should learn to use proper, explicit join syntax in your queries. Put join conditions in the on clause, not the where clause:
SELECT s.branchNo, s.staffNo, COUNT(*) AS myCount
FROM Staff s JOIN
PropertyForRent p
ON s.staffNo = p.staffNo
GROUP BY s.branchNo, s.staffNo;
GROUP BY clauses partition your result set. These partitions are all the sql engine needs to know - it simply counts their sizes.
Try your query with only count(*) in the select part.
In particular, COUNT(*) does not produce the number of distinct rows/columns in your result set!
Some people might think that count(*) really counts all the columns; however, the SQL optimizer is smarter than that.
COUNT(*) returns the number of rows in the specified table without removing duplicates, which means you can't combine DISTINCT with COUNT(*).
Count(*) will return the cardinality (the number of elements in the table) of the specified mapping.
What you have to remember is that COUNT over a specific column skips rows where that column is NULL, while COUNT(*) counts every row regardless of NULLs.
How does SQL know what COUNT(*) is referring to?
I'm pretty sure, though not 100% sure as I can't find it in the docs, that the SQL optimizer simply does a count on the primary key (which is never null) instead of trying to handle nulls in rows.

Is GROUP BY needed in the following correlated subquery?

Given scenario:
table fd
(cust_id, fd_id) primary-key and amount
table loan
(cust_id, l_id) primary-key and amount
I want to list all customers who have a fixed deposit with an amount less than the sum of all their loans.
Query:
SELECT cust_id
FROM fd
WHERE amount
<
(SELECT sum(amount)
FROM loan
WHERE fd.cust_id = loan.cust_id);
OR should we use
SELECT cust_id
FROM fd
WHERE amount
<
(SELECT sum(amount)
FROM loan
WHERE fd.cust_id = loan.cust_id group by cust_id);
A customer can have multiple loans but one FD is considered at a time.
GROUP BY can be omitted in this case, because there is only (one) aggregate function(s) in the SELECT list and all rows are guaranteed to belong to the same group of cust_id ( by the WHERE clause).
The aggregation will be over all rows with matching cust_id in both cases. So both queries are correct.
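A small sqlite3 sketch can confirm that the two forms agree. The rows are made up, and the GROUP BY column is written as loan.cust_id here to make explicit which table it refers to.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical sample data. Customer 3 has a deposit but no loans.
    CREATE TABLE fd (cust_id INTEGER, fd_id INTEGER, amount INTEGER,
                     PRIMARY KEY (cust_id, fd_id));
    CREATE TABLE loan (cust_id INTEGER, l_id INTEGER, amount INTEGER,
                       PRIMARY KEY (cust_id, l_id));
    INSERT INTO fd VALUES (1, 1, 100), (1, 2, 500), (2, 1, 50), (3, 1, 10);
    INSERT INTO loan VALUES (1, 1, 200), (1, 2, 100), (2, 1, 20);
""")

without_group_by = conn.execute("""
    SELECT cust_id FROM fd
    WHERE amount < (SELECT sum(amount) FROM loan
                    WHERE fd.cust_id = loan.cust_id)
    ORDER BY cust_id
""").fetchall()

with_group_by = conn.execute("""
    SELECT cust_id FROM fd
    WHERE amount < (SELECT sum(amount) FROM loan
                    WHERE fd.cust_id = loan.cust_id
                    GROUP BY loan.cust_id)
    ORDER BY cust_id
""").fetchall()

print(without_group_by, with_group_by)
```

Only customer 1's deposit of 100 is below that customer's loan total of 300. Customer 3 drops out in both forms because the comparison against a missing sum is NULL, not true.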
This would be a cleaner way to implement the same thing:
SELECT fd.cust_id
FROM fd
JOIN loan USING (cust_id)
GROUP BY fd.cust_id, fd.amount
HAVING fd.amount < sum(loan.amount)
There is one difference: rows with identical (cust_id, amount) in fd only appear once in the result of my query, while they would appear multiple times in the original.
Either way, if there is no matching row with a non-null amount in table loan, you get no rows at all. I assume you are aware of that.
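One more caveat with the JOIN/HAVING form is worth checking on sample data: when a customer has two deposits with the same amount, the join repeats every loan row once per deposit, so sum(loan.amount) is inflated for that group. The sketch below uses invented rows, and the third query (grouping by fd_id as well) is a suggested variation, not part of the answer above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical sample data. Customer 1 has two deposits of 400 and
    -- loans totalling 300, so neither deposit should qualify.
    CREATE TABLE fd (cust_id INTEGER, fd_id INTEGER, amount INTEGER,
                     PRIMARY KEY (cust_id, fd_id));
    CREATE TABLE loan (cust_id INTEGER, l_id INTEGER, amount INTEGER,
                       PRIMARY KEY (cust_id, l_id));
    INSERT INTO fd VALUES (1, 1, 400), (1, 2, 400), (2, 1, 50);
    INSERT INTO loan VALUES (1, 1, 200), (1, 2, 100), (2, 1, 80);
""")

subquery_form = conn.execute("""
    SELECT cust_id FROM fd
    WHERE amount < (SELECT sum(amount) FROM loan
                    WHERE fd.cust_id = loan.cust_id)
    ORDER BY cust_id
""").fetchall()

# Grouping by (cust_id, amount) merges both deposits into one group of
# 2 x 2 joined rows, so the loan total is counted twice (600, not 300).
join_form = conn.execute("""
    SELECT fd.cust_id
    FROM fd JOIN loan USING (cust_id)
    GROUP BY fd.cust_id, fd.amount
    HAVING fd.amount < sum(loan.amount)
    ORDER BY fd.cust_id
""").fetchall()

# Adding fd_id keeps one group per deposit, which matches the subquery form.
per_deposit_form = conn.execute("""
    SELECT fd.cust_id
    FROM fd JOIN loan USING (cust_id)
    GROUP BY fd.cust_id, fd.fd_id, fd.amount
    HAVING fd.amount < sum(loan.amount)
    ORDER BY fd.cust_id
""").fetchall()

print(subquery_form, join_form, per_deposit_form)
```

If (cust_id, amount) pairs are unique in fd, the two JOIN variants behave the same; the difference only shows up with duplicate amounts.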
There is no need for GROUP BY since you have already filtered the data by cust_id. Either way, the inner query will return the same result.
No, it isn't, because you calculate sum(amount) for customer with id = fd.cust_id, so for a single customer.
However, if your subquery somehow calculated the sum for more than one customer, the GROUP BY would make it return more than one row, which would cause the comparison (<) to fail and, with it, the whole query.
A query with an aggregate like sum but without a group by will output one group. The aggregates will be computed over all matching rows.
A subquery in a condition clause is only allowed to return one row. If the subquery returned multiple rows, what would the following expression mean?
where 1 > (... subquery ...)
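So the GROUP BY can be omitted. Your second query only works because the WHERE clause limits the subquery to a single cust_id, so it yields at most one group; a GROUP BY that produced several groups would raise an error in most DBMSs.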
N.B. When you specify all, any, or some a subquery can return multiple rows:
where 1 > ALL (... subquery ...)
But it's easy to see why that doesn't make sense in your case; you'd compare one customer's data to that of another.

I'm not sure what is the purpose of "group by" here

I'm struggling to understand what this query is doing:
SELECT branch_name, count(distinct customer_name)
FROM depositor, account
WHERE depositor.account_number = account.account_number
GROUP BY branch_name
What's the need of GROUP BY?
You must use GROUP BY in order to use an aggregate function like COUNT in this manner (using an aggregate function to aggregate data corresponding to one or more values within the table).
The query essentially selects distinct branch_names using that column as the grouping column, then within the group it counts the distinct customer_names.
You couldn't use COUNT to get the number of distinct customer_names per branch_name without the GROUP BY clause (at least not with a simple query specification - you can use other means, joins, subqueries etc...).
It's giving you the total number of distinct customers for each branch; GROUP BY defines the groups that the COUNT function is applied to.
It could be written also as:
SELECT branch_name, count(distinct customer_name)
FROM depositor INNER JOIN account
ON depositor.account_number = account.account_number
GROUP BY branch_name
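To see what the GROUP BY and the DISTINCT each contribute, here is a small sqlite3 sketch. The rows are invented: Hayes holds two accounts at the same branch, which is exactly the case where COUNT(*) and COUNT(DISTINCT customer_name) diverge.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical sample data.
    CREATE TABLE account (account_number TEXT, branch_name TEXT);
    CREATE TABLE depositor (customer_name TEXT, account_number TEXT);
    INSERT INTO account VALUES ('A1', 'Downtown'), ('A2', 'Downtown'), ('A3', 'Uptown');
    INSERT INTO depositor VALUES
        ('Hayes', 'A1'), ('Hayes', 'A2'), ('Jones', 'A2'), ('Smith', 'A3');
""")

# With GROUP BY: one row per branch, comparing row count and distinct customers.
per_branch = conn.execute("""
    SELECT branch_name, COUNT(*), COUNT(DISTINCT customer_name)
    FROM depositor INNER JOIN account
        ON depositor.account_number = account.account_number
    GROUP BY branch_name
    ORDER BY branch_name
""").fetchall()

# Without GROUP BY: the whole join is one group, so there is a single total.
overall = conn.execute("""
    SELECT COUNT(DISTINCT customer_name)
    FROM depositor INNER JOIN account
        ON depositor.account_number = account.account_number
""").fetchone()[0]

print(per_branch, overall)
```

Downtown has three depositor rows but only two distinct customers, while dropping the GROUP BY collapses everything into one figure for all branches.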
Let's take a step away from SQL for a moment and look at the relational training language Tutorial D.
Because the two relations (tables) are joined on the common attribute (column) name account_number, we can use a natural join:
depositor JOIN account
(Because the result is a relation, which by definition has only distinct tuples (rows), we don't need a DISTINCT keyword.)
Now we just need to aggregate using SUMMARIZE..BY:
SUMMARIZE (depositor JOIN account)
BY { branch_name }
ADD ( COUNT ( customer_name ) AS customer_tally )
Back in SQLland, the GROUP BY branch_name is doing the same as SUMMARIZE..BY { branch_name }. Because SQL has a very rigid structure, the branch_name column must be repeated in the SELECT clause.
If you want to COUNT something per branch_name (see the SELECT part of the statement), you have to use GROUP BY to tell the query which rows to aggregate together. The GROUP BY clause is used in conjunction with aggregate functions to group the result set by one or more columns.
Neglecting it will lead to SQL errors in most RDBMS, or senseless results in others.
Useful link:
http://www.w3schools.com/sql/sql_groupby.asp