I'm not sure what is the purpose of "group by" here - sql

I'm struggling to understand what this query is doing:
SELECT branch_name, count(distinct customer_name)
FROM depositor, account
WHERE depositor.account_number = account.account_number
GROUP BY branch_name
What's the need of GROUP BY?

You must use GROUP BY in order to use an aggregate function like COUNT in this manner (using an aggregate function to aggregate data corresponding to one or more values within the table).
The query essentially selects distinct branch_names using that column as the grouping column, then within the group it counts the distinct customer_names.
You couldn't use COUNT to get the number of distinct customer_names per branch_name without the GROUP BY clause (at least not with a simple query specification - you can use other means, joins, subqueries etc...).

It's giving you the total distinct customers for each branch; GROUP BY is used for grouping COUNT function.
It could be written also as:
SELECT branch_name, count(distinct customer_name)
FROM depositor INNER JOIN account
ON depositor.account_number = account.account_number
GROUP BY branch_name

Let's take a step away from SQL for a moment at look at the relational trainging language Tutorial D.
Because the two relations (tables) are joined on the common attribute (column) name account_number, we can use a natural join:
depositor JOIN account
(Because the result is a relation, which by definition has only distinct tuples (rows), we don't need a DISTINCT keyword.)
Now we just need to aggregate using SUMMARIZE..BY:
SUMMARIZE (depositor JOIN account)
BY { branch_name }
ADD ( COUNT ( customer_name ) AS customer_tally )
Back in SQLland, the GROUP BY branch_name is doing the same as SUMMARIZE..BY { branch_name }. Because SQL has a very rigid structure, the branch_name column must be repeated in the SELECT clause.

If you want to COUNT something (see SELECT-Part of the statement), you have to use GROUP BY in order to tell the query what to aggregate. The GROUP BY statement is used in conjunction with the aggregate functions to group the result-set by one or more columns.
Neglecting it will lead to SQL errors in most RDBMS, or senseless results in others.
Useful link:
http://www.w3schools.com/sql/sql_groupby.asp

Related

Confused with the Group By function in SQL

Q1: After using the Group By function, why does it only output one row of each group at most? Does this mean that having is supposed to filter the group rather than filter the records in each group?
Q2: I want to find the records in each group whose ages are greater than the average age of that group. I tried the following, but it returns nothing. How should I fix this?
SELECT *, avg(age) FROM Mytable Group By country Having age > avg(age)
Thanks!!!!
You can calculate the average age for each country in a subquery and join that to your table for filtering:
SELECT mt.*, MtAvg.AvgAge
FROM Mytable mt
inner join
(
select mtavgs.country
, avg(mtavgs.age) as AvgAge
from Mytable mtavgs
group by mtavgs.country
) MTAvg
on mtavg.country=mt.country
and mt.Age > mtavg.AvgAge
GROUP BY returns always 1 row per unique combination of values in the GROUP BY columns listed (provided that they are not removed by a HAVING clause). The subquery in our example (alias: MTAvg) will calculate a single row per country. We will use its results for filtering the main table rows by applying the condition in the INNER JOIN clause; we will also report that average by including the calculated average age.
GROUP BY is a keyword that is called an aggregate function. Check this out here for further reading SQL Group By tutorial
What it does is it lumps all the results together into one row. In your example it would lump all the results with the same country together.
Not quite sure what exactly your query needs to be to solve your exact problem. I would however look into what are called window functions in SQL. I believe what you first need to do is write a window function to find the average age in each group. Then you can write a query to return the results you need
Depending on your dbms type and version, you may be able to use a "window function" that will calculate the average per country and with this approach it makes the calculation available on every row. Once that data is present as a "derived table" you can simply use a where clause to filter for the ages that are greater then the calculated average per country.
SELECT mt.*
FROM (
SELECT *
, avg(age) OVER(PARTITION BY country) AS AvgAge
FROM Mytable
) mt
WHERE mt.Age > mt.AvgAge

Beginning SQL group by and AVG

I am trying to pull information from two columns titled clientstate and clientrevenue in my table. I want clientstate to show up as the state, and have only distinct names in it, and under client revenue I want the average revenue per state, and that will only show up if there are at least two clients from that state. I am very new at this, so what I have is pretty iffy:
SELECT clientstate, clientrevenue
FROM client
GROUP BY clientrevenue
HAVING COUNT (*) >=2;
Where am I going wrong here?
SELECT clientstate AS [State]
, AVG(clientrevenue) AS [Average Revenue]
FROM client
GROUP BY clientstate
Grouping by ClientRevenue will try to group similar values and that doesn't have a logical sense.
First, in order to get distinct states, clientstate column needs to be used in the GROUP BY statement.
Thus, the code would be :
SELECT clientstate, AVG(clientrevenue)
FROM Source_Table
GROUP BY clientstate --this would get you distinct states
Now, considering the 2 clients per state, it's rather a condition than a HAVING statement. HAVING statement limits your query results according to the aggregate function you are using. For instance, in the code aforementioned, the aggregate function is AVG(clientrevenue). So, we can only use it in HAVING. we can not add count(*) unless it was used in SELECT.
So, you need to add it as a condition like
SELECT clientstate, AVG(clientrevenue)
FROM Source_Table A
WHERE (SELECT count(DISTINCT client_ID) FROM Source_Table B
WHERE A.clientstate = B.clientstate) >= 2 --Condition
GROUP BY clientstate --this would get you distinct states

How does GROUP BY use COUNT(*)

I have this query which finds the number of properties handled by each staff member along with their branch number:
SELECT s.branchNo, s.staffNo, COUNT(*) AS myCount
FROM Staff s, PropertyForRent p
WHERE s.staffNo=p.staffNo
GROUP BY s.branchNo, s.staffNo
The two relations are:
Staff{staffNo, fName, lName, position, sex, DOB, salary, branchNO}
PropertyToRent{propertyNo, street, city, postcode, type, rooms, rent, ownerNo, staffNo, branchNo}
How does SQL know what COUNT(*) is referring to? Why does it count the number of properties and not (say for example), the number of staff per branch?
This is a bit long for a comment.
COUNT(*) is counting the number of rows in each group. It is not specifically counting any particular column. Instead, what is happening is that the join is producing multiple properties, because the properties are what cause multiple rows for given values of s.branchNo and s.staffNo.
It gets even a little more "confusing" if you include a column name. The following would all typically return the same value:
COUNT(*)
COUNT(s.branchNo)
COUNT(s.staffNo)
COUNT(p.propertyNo)
With a column name, COUNT() determines the number of rows that do not have a NULL value in the column.
And finally, you should learn to use proper, explicit join syntax in your queries. Put join conditions in the on clause, not the where clause:
SELECT s.branchNo, s.staffNo, COUNT(*) AS myCount
FROM Staff s JOIN
PropertyForRent p
ON s.staffNo = p.staffNO
GROUP BY s.branchNo, s.staffNo;
GROUP BY clauses partition your result set. These partitions are all the sql engine needs to know - it simply counts their sizes.
Try your query with only count(*) in the select part.
In particular, COUNT(*) does not produce the number of distinct rows/columns in your result set!
Some people might think that count(*) really count all the columns, however the sql optimizer is smarter than that.
COUNT(*) returns the number of rows in a specified table without getting rid of duplicates. Which mean that you can't use Distinct with count(*)
Count(*) will return the cardinality (elements in table) of the specified mapping.
What you have to remember is that when using count over a specific column, null won't be allowed while count(*) will allow null in the rows as it could be any field.
How does SQL know what COUNT(*) is referring to?
I'm pretty sure, however not 100% sure as I can't find in doc, that the sql optimizer simply do a count on the primary key (not null) instead of trying to handle null in rows.

Can peewee nest SELECT queries such that the outer query selects on an aggregate of the inner query?

I'm using peewee2.1 with python3.3 and an sqlite3.7 database.
I want to perform certain SELECT queries in which:
I first select some aggregate (count, sum), grouping by some id column; then
I then select from the results of (1), aggregating over its aggregate. Specifically, I want to count the number of rows in (1) that have each aggregated value.
My database has an 'Event' table with 1 record per event, and a 'Ticket' table with 1..N tickets per event. Each ticket record contains the event's id as a foreign key. Each ticket also contains a 'seats' column that specifies the number of seats purchased. (A "ticket" is really best thought of as a purchase transaction for 1 or more seats at the event.)
Below are two examples of working SQLite queries of this sort that give me the desired results:
SELECT ev_tix, count(1) AS ev_tix_n FROM
(SELECT count(1) AS ev_tix FROM ticket GROUP BY event_id)
GROUP BY ev_tix
SELECT seat_tot, count(1) AS seat_tot_n FROM
(SELECT sum(seats) AS seat_tot FROM ticket GROUP BY event_id)
GROUP BY seat_tot
But using Peewee, I don't know how to select on the inner query's aggregate (count or sum) when specifying the outer query. I can of course specify an alias for that aggregate, but it seems I can't use that alias in the outer query.
I know that Peewee has a mechanism for executing "raw" SQL queries, and I've used that workaround successfully. But I'd like to understand if / how these queries can be done using Peewee directly.
I posted the same question on the peewee-orm Google group. Charles Leifer responded promptly with both an answer and new commits to the peewee master. So although I'm answering my own question, obviously all credit goes to him.
You can see that thread here: https://groups.google.com/forum/#!topic/peewee-orm/FSHhd9lZvUE
But here's the essential part, which I've copied from Charles' response to my post:
I've added a couple commits to master which should make your queries
possible
(https://github.com/coleifer/peewee/commit/22ce07c43cbf3c7cf871326fc22177cc1e5f8345).
Here is the syntax,roughly, for your first example:
SELECT ev_tix, count(1) AS ev_tix_n FROM
(SELECT count(1) AS ev_tix FROM ticket GROUP BY event_id)
GROUP BY ev_tix
ev_tix = SQL('ev_tix') # the name of the alias.
(Ticket
.select(ev_tix, fn.count(ev_tix).alias('ev_tix_n'))
.from_(
Ticket.select(fn.count(Ticket.id).alias('ev_tix')).group_by(Ticket.event))
.group_by(ev_tix))
This yields the following SQL:
SELECT ev_tix, count(ev_tix) AS ev_tix_n FROM (SELECT Count(t2."id")
AS ev_tix FROM "ticket" AS t2 GROUP BY t2."event_id")
GROUP BY ev_tix

COUNT(*) in SQL

I understand how count(*) in SQL when addressing one table but how does it work on inner joins?
e.g.
SELECT branch, staffNo, Count(*)
FROM Staff s, Properties p
WHERE s.staffNo = p.staffNo
GROUP BY s.staffNo, p.staffNo
staff contains staffNo staffName
properties contains property management details (i.e. which staff manages which property)
This returns the number of properties managed by staff, but how does the count work? As in how does it know what to count?
It's an aggregate function - as such it's managed by your group by clause - each row will correspond to a unique grouping (i.e. staffNo) and Count(*) will return the number of records in the join that match that grouping.
So for example:
SELECT branch, grade, Count(*)
FROM Staff s, Properties p
WHERE s.staffNo = p.staffNo
GROUP BY branch, grade
would return the number of staff members of a given grade at each branch.
SELECT branch, Count(*)
FROM Staff s, Properties p
WHERE s.staffNo = p.staffNo
GROUP BY branch
would return the total number of staff members at each branch
SELECT grade, Count(*)
FROM Staff s, Properties p
WHERE s.staffNo = p.staffNo
GROUP BY grade
would return the total number of staff at each grade
The aggregate function (whether it's count(), sum(), avg(), etc.) is computed on the rows in each group: that group is then collapsed/summarized/aggregated to a single row according to the select-list defined in the query.
The conceptual model for the execution of a select query is this:
Compute the cartesian product of all tables references in the FROM clause (as if a full join were being performed.
Apply the join criteria.
Filter according to the criteria defined in the where clause.
Partitition into groups, based on the criteria defined in the group by clause.
Reduce each group to a single row, computing the values of each aggregate function on the rows in that group.
Filter according to the criteria defined in the having clause
Sort according to the criteria defined in the order by clause
This conceptual model omits dealing with any compute or compute...by clauses.
Not this this is not actually how anything but a very naive SQL engine would actually execute a query, but the results should be identical to what you'd [eventually] get if you did it this way.
Your query is invalid.
You have an ambiguous column name staffno.
You are selecting branch but not grouping by it - prepare for a Syntax error (everything but MySQL) or random branches to be selected for you (MySQL).
I think what you want to know, though, is that it will return a count for each "set" of your grouped-by fields, so for each combination of s.staffno, p.staffno how many rows belong in that set.
count (*) simply counts the number of rows in the query or the group by.
In your query, it will print the number of rows by staffNo. (It is redundant to have s.staffNo, p.staffNo; either will suffice).
It counts the number of rows for each distinct StaffNo in the cartesian product.
Also, you should group by Branch, StaffNo.