How does GROUP BY use COUNT(*) - sql

I have this query which finds the number of properties handled by each staff member along with their branch number:
SELECT s.branchNo, s.staffNo, COUNT(*) AS myCount
FROM Staff s, PropertyForRent p
WHERE s.staffNo=p.staffNo
GROUP BY s.branchNo, s.staffNo
The two relations are:
Staff{staffNo, fName, lName, position, sex, DOB, salary, branchNO}
PropertyToRent{propertyNo, street, city, postcode, type, rooms, rent, ownerNo, staffNo, branchNo}
How does SQL know what COUNT(*) is referring to? Why does it count the number of properties and not (say for example), the number of staff per branch?

This is a bit long for a comment.
COUNT(*) is counting the number of rows in each group. It is not specifically counting any particular column. Instead, what is happening is that the join is producing multiple properties, because the properties are what cause multiple rows for given values of s.branchNo and s.staffNo.
It gets even a little more "confusing" if you include a column name. The following would all typically return the same value:
COUNT(*)
COUNT(s.branchNo)
COUNT(s.staffNo)
COUNT(p.propertyNo)
With a column name, COUNT() determines the number of rows that do not have a NULL value in the column.
And finally, you should learn to use proper, explicit join syntax in your queries. Put join conditions in the on clause, not the where clause:
SELECT s.branchNo, s.staffNo, COUNT(*) AS myCount
FROM Staff s JOIN
PropertyForRent p
ON s.staffNo = p.staffNO
GROUP BY s.branchNo, s.staffNo;

GROUP BY clauses partition your result set. These partitions are all the sql engine needs to know - it simply counts their sizes.
Try your query with only count(*) in the select part.
In particular, COUNT(*) does not produce the number of distinct rows/columns in your result set!

Some people might think that count(*) really count all the columns, however the sql optimizer is smarter than that.
COUNT(*) returns the number of rows in a specified table without getting rid of duplicates. Which mean that you can't use Distinct with count(*)
Count(*) will return the cardinality (elements in table) of the specified mapping.
What you have to remember is that when using count over a specific column, null won't be allowed while count(*) will allow null in the rows as it could be any field.
How does SQL know what COUNT(*) is referring to?
I'm pretty sure, however not 100% sure as I can't find in doc, that the sql optimizer simply do a count on the primary key (not null) instead of trying to handle null in rows.

Related

Confused with the Group By function in SQL

Q1: After using the Group By function, why does it only output one row of each group at most? Does this mean that having is supposed to filter the group rather than filter the records in each group?
Q2: I want to find the records in each group whose ages are greater than the average age of that group. I tried the following, but it returns nothing. How should I fix this?
SELECT *, avg(age) FROM Mytable Group By country Having age > avg(age)
Thanks!!!!
You can calculate the average age for each country in a subquery and join that to your table for filtering:
SELECT mt.*, MtAvg.AvgAge
FROM Mytable mt
inner join
(
select mtavgs.country
, avg(mtavgs.age) as AvgAge
from Mytable mtavgs
group by mtavgs.country
) MTAvg
on mtavg.country=mt.country
and mt.Age > mtavg.AvgAge
GROUP BY returns always 1 row per unique combination of values in the GROUP BY columns listed (provided that they are not removed by a HAVING clause). The subquery in our example (alias: MTAvg) will calculate a single row per country. We will use its results for filtering the main table rows by applying the condition in the INNER JOIN clause; we will also report that average by including the calculated average age.
GROUP BY is a keyword that is called an aggregate function. Check this out here for further reading SQL Group By tutorial
What it does is it lumps all the results together into one row. In your example it would lump all the results with the same country together.
Not quite sure what exactly your query needs to be to solve your exact problem. I would however look into what are called window functions in SQL. I believe what you first need to do is write a window function to find the average age in each group. Then you can write a query to return the results you need
Depending on your dbms type and version, you may be able to use a "window function" that will calculate the average per country and with this approach it makes the calculation available on every row. Once that data is present as a "derived table" you can simply use a where clause to filter for the ages that are greater then the calculated average per country.
SELECT mt.*
FROM (
SELECT *
, avg(age) OVER(PARTITION BY country) AS AvgAge
FROM Mytable
) mt
WHERE mt.Age > mt.AvgAge

Why use many columns in GROUP BY and HAVING clause in these examples

Given the schema here I'm trying to understand and solve the below 3 sql queries as I'm confused:
1- Present a table giving the names of the countries with ≥ 50% urbanization
rates, their urbanization rates, and their per capita GDP. Note that
urbanization rate is the percentage of population living in cities. Do not
count cities with NULL values for population.
SELECT country.name, round(sum(city.population)/country.population, 3) AS urban, round(gdp/country.population, 3) AS gdppc
FROM city
INNER JOIN country ON code = country
INNER JOIN economy ON code = economy.country
WHERE city.population IS NOT NULL
GROUP BY country.name, country.population, economy.gdp
HAVING round(sum(city.population)/country.population, 3) >= 0.5
ORDER BY urban DESC;
In the above query, Why I need to include country.population and economy.gdp in the GROUP BY? If I tried using just country.name in the GROUP BY I get an error saying I should include the others.
2- Show organizations that have as members all the European countries with over 50 million people?
SELECT name
FROM organization
INNER JOIN (SELECT organization
FROM country
INNER JOIN encompasses
ON code = encompasses.country
INNER JOIN ismember
ON code = ismember.country
WHERE population > 50000000 AND continent = 'Europe'
GROUP BY organization
HAVING count(ismember.country) = (SELECT count(*)
FROM country
INNER JOIN encompasses
ON code = country
WHERE population > 50000000 AND continent = 'Europe'))
AS innerQuery
ON abbreviation = innerQuery.organization;
Why I need the HAVING Part above?
3- Insert a new organization called “Tivoli” and a trigger that says if Germany joins “Tivoli” then so too must the UK and France. Insert Germany into the “Tivoli” organization. Confirm proper behavior.
I tried the below script but it's not working, any advice please?
do $$
begin
IF(NOT EXISTS ( SELECT 1 FROM organization WHERE organization."name" = 'Tivoli' AND organization.country = 'D' ))
BEGIN
INSERT INTO organization VALUES ('Tivoli','Tivoli organization',NULL,'F',NULL,NULL);
INSERT INTO organization VALUES ('Tivoli','Tivoli organization',NULL,'GB',NULL,NULL);
END;
end $$
1)
You used country.population and economy.gdp in the select, outside of aggregate functions ( COUNT(), AVG() and SUM() ), and you have a GROUP BY. Everything that you select has to be in GROUP BY or inside of aggregate functions.
2)
Because you were asked to show organizations that have ALL of 50mil + people countries. With HAVING, you check if that organization has the right amount of countries.
3)
organization."name" = 'Tivoli'
It's supposed to be :
organization.name
First of all, you should limit a question to one only, not 3. But here are some pointers for all 3:
In the above query, Why I need to include country.population and economy.gdp in the GROUP BY? If I tried using just country.name in the GROUP BY I get an error saying I should include the others.
This is a requirement. A group by country.name alone would work (in Postgres 9.1+) only if the other two fields are known to be functionally dependent on country.name. But probably country.name is not the primary key of the country table, so in theory it is possible to have two records in that table with the same name, but different population.
The rule is as follows:
When GROUP BY is present, it is not valid for the SELECT list expressions to refer to ungrouped columns except within aggregate functions or if the ungrouped column is functionally dependent on the grouped columns, since there would otherwise be more than one possible value to return for an ungrouped column. A functional dependency exists if the grouped columns (or a subset thereof) are the primary key of the table containing the ungrouped column.
This is implemented since version 9.1.
Why I need the HAVING Part above?
Because a condition on an aggregate (count in this case) can only be performed after grouping, and can thus not be expressed in the where clause. In this case the having clause makes sure that the organisation is not only present in some big EU Member States, but all big EU Member states.
I tried the below script but it's not working, any advice please?
Without a proper database schema, it is not possible to provide you with the correct SQ, but from the ERD diagram it seems that the organization table does not have a country field. Instead the ismember table connects organizations with countries. You would only insert one organization, but several ismember records (one per Member State involved)
It is better also to name the fields in your insert statement, so it is clear which value corresponds to which field.

How to insert a count column into a sql query

I need the second column of the table retrieved from a query to have a count of the number of rows, so row one would have a 1, row 2 would have a 2 and so on. I am not very proficient with sql so I am sorry if this is a simple task.
A basic example of what I am doing would be is:
SELECT [Name], [I_NEED_ROW_COUNT_HERE],[Age],[Gender]
FROM [customer]
The row count must be the second column and will act as an ID for each row. It must be the second row as the text file it is generating will be sent to the state and they require a specific format.
Thanks for any help.
With your edit, I see that you want a row ID (normally called row number rather than "count") which is best gathered from a unique ID in the database (person_id or some other unique field). If that isn't possible, you can make one for this report with ROW_NUMBER() OVER (ORDER BY EMPLOYEE_ID DESC) AS ID, in your select statement.
select Name, ROW_NUMBER() OVER (ORDER BY Name DESC) AS ID,
Age, Gender
from customer
This function adds a field to the output called ID (see my tips at the bottom to describe aliases). Since this isn't in the database, it needs a method to determine how it will increment. After the over keyword it orders by Name in descending order.
Information on Counting follows (won't be unique by row):
If each customer has multiple entries but the selected fields are the same for that user and you are counting that user's records (summed in one result record for the user) then you would write:
select Name, count(*), Age, Gender
from customer
group by name, age, gender
This will count (see MSDN) all the user's records as grouped by the name, age and gender (if they match, it's a single record).
However, if you are counting all records so that your whole report has the grand total on every line, then you want:
select Name, (select count(*) from customer) as "count", Age, Gender
from customer
TIP: If you're using something like SSMS to write a query, dragging in columns will put brackets around the columns. This is only necessary if you have spaces in column names, but a DBA will tend to avoid that like the plague. Also, if you need a column header to be something specific, you can use the as keyword like in my first example.
W3Schools has a good tutorial on count()
The COUNT(column_name) function returns
the number of values (NULL values will not be counted) of the
specified column:
SELECT COUNT(column_name) FROM table_name;
The COUNT(*) function returns the number of records in a table:
SELECT COUNT(*) FROM table_name;
The COUNT(DISTINCT column_name) function returns the number of
distinct values of the specified column:
SELECT COUNT(DISTINCT column_name) FROM table_name;
COUNT(DISTINCT) works with ORACLE and Microsoft SQL Server, but
not with Microsoft Access.
It's odd to repeat the same number in every row but it sounds like this is what you're asking for. And note that this might not work in your flavor of SQL. MS Access?
SELECT [Name], (select count(*) from [customer]), [Age], [Gender]
FROM [customer]

I'm not sure what is the purpose of "group by" here

I'm struggling to understand what this query is doing:
SELECT branch_name, count(distinct customer_name)
FROM depositor, account
WHERE depositor.account_number = account.account_number
GROUP BY branch_name
What's the need of GROUP BY?
You must use GROUP BY in order to use an aggregate function like COUNT in this manner (using an aggregate function to aggregate data corresponding to one or more values within the table).
The query essentially selects distinct branch_names using that column as the grouping column, then within the group it counts the distinct customer_names.
You couldn't use COUNT to get the number of distinct customer_names per branch_name without the GROUP BY clause (at least not with a simple query specification - you can use other means, joins, subqueries etc...).
It's giving you the total distinct customers for each branch; GROUP BY is used for grouping COUNT function.
It could be written also as:
SELECT branch_name, count(distinct customer_name)
FROM depositor INNER JOIN account
ON depositor.account_number = account.account_number
GROUP BY branch_name
Let's take a step away from SQL for a moment at look at the relational trainging language Tutorial D.
Because the two relations (tables) are joined on the common attribute (column) name account_number, we can use a natural join:
depositor JOIN account
(Because the result is a relation, which by definition has only distinct tuples (rows), we don't need a DISTINCT keyword.)
Now we just need to aggregate using SUMMARIZE..BY:
SUMMARIZE (depositor JOIN account)
BY { branch_name }
ADD ( COUNT ( customer_name ) AS customer_tally )
Back in SQLland, the GROUP BY branch_name is doing the same as SUMMARIZE..BY { branch_name }. Because SQL has a very rigid structure, the branch_name column must be repeated in the SELECT clause.
If you want to COUNT something (see SELECT-Part of the statement), you have to use GROUP BY in order to tell the query what to aggregate. The GROUP BY statement is used in conjunction with the aggregate functions to group the result-set by one or more columns.
Neglecting it will lead to SQL errors in most RDBMS, or senseless results in others.
Useful link:
http://www.w3schools.com/sql/sql_groupby.asp

COUNT(*) in SQL

I understand how count(*) in SQL when addressing one table but how does it work on inner joins?
e.g.
SELECT branch, staffNo, Count(*)
FROM Staff s, Properties p
WHERE s.staffNo = p.staffNo
GROUP BY s.staffNo, p.staffNo
staff contains staffNo staffName
properties contains property management details (i.e. which staff manages which property)
This returns the number of properties managed by staff, but how does the count work? As in how does it know what to count?
It's an aggregate function - as such it's managed by your group by clause - each row will correspond to a unique grouping (i.e. staffNo) and Count(*) will return the number of records in the join that match that grouping.
So for example:
SELECT branch, grade, Count(*)
FROM Staff s, Properties p
WHERE s.staffNo = p.staffNo
GROUP BY branch, grade
would return the number of staff members of a given grade at each branch.
SELECT branch, Count(*)
FROM Staff s, Properties p
WHERE s.staffNo = p.staffNo
GROUP BY branch
would return the total number of staff members at each branch
SELECT grade, Count(*)
FROM Staff s, Properties p
WHERE s.staffNo = p.staffNo
GROUP BY grade
would return the total number of staff at each grade
The aggregate function (whether it's count(), sum(), avg(), etc.) is computed on the rows in each group: that group is then collapsed/summarized/aggregated to a single row according to the select-list defined in the query.
The conceptual model for the execution of a select query is this:
Compute the cartesian product of all tables references in the FROM clause (as if a full join were being performed.
Apply the join criteria.
Filter according to the criteria defined in the where clause.
Partitition into groups, based on the criteria defined in the group by clause.
Reduce each group to a single row, computing the values of each aggregate function on the rows in that group.
Filter according to the criteria defined in the having clause
Sort according to the criteria defined in the order by clause
This conceptual model omits dealing with any compute or compute...by clauses.
Not this this is not actually how anything but a very naive SQL engine would actually execute a query, but the results should be identical to what you'd [eventually] get if you did it this way.
Your query is invalid.
You have an ambiguous column name staffno.
You are selecting branch but not grouping by it - prepare for a Syntax error (everything but MySQL) or random branches to be selected for you (MySQL).
I think what you want to know, though, is that it will return a count for each "set" of your grouped-by fields, so for each combination of s.staffno, p.staffno how many rows belong in that set.
count (*) simply counts the number of rows in the query or the group by.
In your query, it will print the number of rows by staffNo. (It is redundant to have s.staffNo, p.staffNo; either will suffice).
It counts the number of rows for each distinct StaffNo in the cartesian product.
Also, you should group by Branch, StaffNo.