When to use multiple GROUP BY in SQL? - sql

I'm practicing SQL on SQLZOO, and I'm working on Joins. Question 11 of that section asks: "For every match involving 'POL', show the matchid, date and the number of goals scored."
So I tried the following code:
SELECT matchid, mdate, COUNT(player)
FROM goal JOIN game ON matchid = id
WHERE (team1 = 'POL' OR team2 = 'POL')
GROUP BY matchid
But it throws an error:
'gisq.game.mdate' isn't in GROUP BY
So the answer is:
SELECT matchid, mdate, COUNT(player)
FROM goal JOIN game ON matchid = id
WHERE (team1 = 'POL' OR team2 = 'POL')
GROUP BY matchid, mdate
My question is, why is it required to also include mdate in the GROUP BY clause if it's not part of the aggregate function? Thank you and sorry for the newbie question. Here is the table's format: https://sqlzoo.net/wiki/The_JOIN_operation

It is required because SQL insists that the SELECT columns be compatible with the GROUP BY columns: every unaggregated column in the SELECT must also appear in the GROUP BY. Those are the rules of the language.
Your query slightly simplified is:
SELECT matchid, mdate, COUNT(player)
FROM goal JOIN
game
ON matchid = id
WHERE 'POL' IN (team1, team2)
GROUP BY matchid;
The query is saying: return one row per matchid, because of the GROUP BY. But then which mdate gets returned? Each group is built from several joined rows, and as far as the engine can tell, those rows could carry different mdate values.
SQL requires that you be explicit about what you want. You might intend the most recent date, in which case you would use MAX(mdate). Or you might want a separate row for each date, in which case you would include it in the GROUP BY. Or you might intend something else. The query needs to be clear.
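As a minimal sketch of the two explicit options just described, the following runs both forms against a tiny in-memory SQLite database using Python's standard sqlite3 module. The table contents (match ids, dates, player names) are made up for illustration and are not the SQLZOO data. SQLite is also more permissive than the engine that produced the error above and may accept the original query without complaint, so only the two explicit forms are shown.

```python
import sqlite3

# Hypothetical miniature of the game/goal tables; values are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE game (id INTEGER PRIMARY KEY, mdate TEXT, team1 TEXT, team2 TEXT);
    CREATE TABLE goal (matchid INTEGER, teamid TEXT, player TEXT, gtime INTEGER);
    INSERT INTO game VALUES (1001, '2012-06-08', 'POL', 'GRE'),
                            (1004, '2012-06-12', 'POL', 'RUS'),
                            (1002, '2012-06-08', 'RUS', 'CZE');
    INSERT INTO goal VALUES (1001, 'POL', 'Player A', 17),
                            (1001, 'GRE', 'Player B', 51),
                            (1004, 'RUS', 'Player C', 37),
                            (1002, 'RUS', 'Player C', 15);
""")

# Option 1: list every unaggregated SELECT column in the GROUP BY.
by_both = conn.execute("""
    SELECT matchid, mdate, COUNT(player)
    FROM goal JOIN game ON matchid = id
    WHERE 'POL' IN (team1, team2)
    GROUP BY matchid, mdate
    ORDER BY matchid
""").fetchall()

# Option 2: group by matchid only and state which mdate to return.
by_max = conn.execute("""
    SELECT matchid, MAX(mdate), COUNT(player)
    FROM goal JOIN game ON matchid = id
    WHERE 'POL' IN (team1, team2)
    GROUP BY matchid
    ORDER BY matchid
""").fetchall()

print(by_both)
print(by_max)
```

Because each match has exactly one date in this sample, both forms return the same rows; they differ only in how the choice of mdate is spelled out.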

When aggregate functions (COUNT, MAX, MIN, AVG, etc.) appear in the SELECT part of a query together with plain (unaggregated) columns, every unaggregated column from the SELECT part must be repeated in the GROUP BY part of the query. As a result, and this is what is required, every column is accounted for: some are collapsed by aggregate functions in the SELECT part, and the rest define the groups in the GROUP BY clause.

Group by a single column: all rows with the same value in that one column are placed in one group.
Group by multiple columns: for example, GROUP BY column1, column2 places all rows with the same values in both column1 and column2 in one group.
Since the question asks you to select the date as well, you have to put it in the GROUP BY clause. Note that matchid alone already keeps the games separate, even if POL had multiple games on the same date; adding mdate does not change the groups, it only makes the column legal in the SELECT list.

Related

Confused with the Group By function in SQL

Q1: After using the Group By function, why does it only output one row of each group at most? Does this mean that having is supposed to filter the group rather than filter the records in each group?
Q2: I want to find the records in each group whose ages are greater than the average age of that group. I tried the following, but it returns nothing. How should I fix this?
SELECT *, avg(age) FROM Mytable Group By country Having age > avg(age)
Thanks!!!!
You can calculate the average age for each country in a subquery and join that to your table for filtering:
SELECT mt.*, MtAvg.AvgAge
FROM Mytable mt
inner join
(
select mtavgs.country
, avg(mtavgs.age) as AvgAge
from Mytable mtavgs
group by mtavgs.country
) MTAvg
on mtavg.country=mt.country
and mt.Age > mtavg.AvgAge
GROUP BY always returns one row per unique combination of values in the listed GROUP BY columns (provided that the row is not removed by a HAVING clause). The subquery in our example (alias: MTAvg) calculates a single row per country. We use its results to filter the main table rows by applying the condition in the INNER JOIN clause; we also report the calculated average age by including it in the SELECT list.
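The subquery-join approach above can be tried out with a small sketch like the following, which uses Python's standard sqlite3 module. The name column and all the sample rows are invented for illustration; only the table name, country and age come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical sample data; 'name' is an assumed extra column.
    CREATE TABLE Mytable (name TEXT, country TEXT, age INTEGER);
    INSERT INTO Mytable VALUES ('Ann', 'US', 30), ('Bob', 'US', 50),
                               ('Cy', 'FR', 20), ('Di', 'FR', 40),
                               ('Ed', 'FR', 90);
""")

# The grouped subquery yields one row per country; joining it back
# keeps only the people older than their own country's average.
rows = conn.execute("""
    SELECT mt.name, mt.country, mt.age, MTAvg.AvgAge
    FROM Mytable mt
    INNER JOIN (
        SELECT country, AVG(age) AS AvgAge
        FROM Mytable
        GROUP BY country
    ) MTAvg
      ON MTAvg.country = mt.country
     AND mt.age > MTAvg.AvgAge
    ORDER BY mt.name
""").fetchall()

print(rows)
```

With this sample the averages are 40 for US and 50 for FR, so only Bob and Ed are returned.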
GROUP BY is a clause, not an aggregate function; it is used together with aggregate functions such as COUNT or AVG. For further reading, see the SQL Group By tutorial.
What it does is lump all the rows of each group together into one row. In your example it would lump all the rows with the same country together.
I'm not quite sure what your query needs to be to solve your exact problem. I would, however, look into what are called window functions in SQL. I believe you first need to write a window function to find the average age in each group. Then you can write a query to return the results you need.
Depending on your DBMS type and version, you may be able to use a "window function" that calculates the average per country; this approach makes the calculation available on every row. Once that data is present as a "derived table", you can simply use a WHERE clause to filter for the ages that are greater than the calculated average per country.
SELECT mt.*
FROM (
SELECT *
, avg(age) OVER(PARTITION BY country) AS AvgAge
FROM Mytable
) mt
WHERE mt.Age > mt.AvgAge
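To see what the window function contributes, here is a hedged sketch using Python's standard sqlite3 module. It assumes the bundled SQLite is version 3.25 or later (the first release with window functions); the name column and the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical sample data; 'name' is an assumed extra column.
    CREATE TABLE Mytable (name TEXT, country TEXT, age INTEGER);
    INSERT INTO Mytable VALUES ('Ann', 'US', 30), ('Bob', 'US', 50),
                               ('Cy', 'FR', 20), ('Di', 'FR', 40),
                               ('Ed', 'FR', 90);
""")

# Step 1: the window function keeps every row and attaches the
# per-country average to each of them (no rows are collapsed).
with_avg = conn.execute("""
    SELECT name, country, age,
           AVG(age) OVER (PARTITION BY country) AS AvgAge
    FROM Mytable
    ORDER BY name
""").fetchall()

# Step 2: wrap it as a derived table and filter with a plain WHERE.
above_avg = conn.execute("""
    SELECT mt.name, mt.country, mt.age
    FROM (
        SELECT name, country, age,
               AVG(age) OVER (PARTITION BY country) AS AvgAge
        FROM Mytable
    ) mt
    WHERE mt.age > mt.AvgAge
    ORDER BY mt.name
""").fetchall()

print(with_avg)
print(above_avg)
```

Unlike GROUP BY, the first query still returns all five rows, which is why the outer WHERE can compare each person's age against the group average.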

Why does MAX statement require a Group By?

I understand why the first query needs a GROUP BY, as it doesn't know which date to apply the sum to, but I don't understand why this is the case with the second query. The value that ultimately is the max amount is already contained in the table - it is not calculated like SUM is. thank you
-- First Query
select
sum(OrderSales),OrderDates
From Orders
-- Second Query
select
max(FilmOscarWins),FilmName
From tblFilm
It is not the SUM and MAX that require the GROUP BY, it is the unaggregated column.
If you just write this, you will get a single row, for the maximum value of the FilmOscarWins column across the whole table:
select
max(FilmOscarWins)
From
tblFilm
If the most Oscars any film won was 12, that one row will say 12. But there could be multiple films, all of which won 12 Oscars, so if we ask for the FilmName alongside that 12, there is no single answer.
By adding the Group By, we fundamentally change the query: instead of returning one number for the whole table, it will return one row for each group - which in this case, means one row for each film.
If you do want to get a list of all those films which had the maximum 12 Oscars, you have to do something more complicated, such as using a sub-query to first find that single number (12) and then find all the rows matching it:
select
FilmOscarWins,
FilmName
From
tblFilm
Where FilmOscarWins = (
select
max(FilmOscarWins)
From
tblFilm
)
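A small sketch of the difference, run through Python's standard sqlite3 module; the film names and win counts are placeholders, not real data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Placeholder rows: two films tie for the maximum.
    CREATE TABLE tblFilm (FilmName TEXT, FilmOscarWins INTEGER);
    INSERT INTO tblFilm VALUES ('Film A', 12), ('Film B', 12), ('Film C', 4);
""")

# MAX alone: one row for the whole table, no GROUP BY needed.
max_only = conn.execute(
    "SELECT MAX(FilmOscarWins) FROM tblFilm"
).fetchone()

# Sub-query: find that single number, then every film matching it.
top_films = conn.execute("""
    SELECT FilmOscarWins, FilmName
    FROM tblFilm
    WHERE FilmOscarWins = (SELECT MAX(FilmOscarWins) FROM tblFilm)
    ORDER BY FilmName
""").fetchall()

print(max_only)
print(top_films)
```

The first query returns the single value 12; the second returns both films that reached it, which is the case a bare FilmName next to MAX cannot express.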
If you want just one film with the most Oscar wins, then use select top (SQL Server syntax; other engines use LIMIT or FETCH FIRST instead):
select top (1) f.*
From tblFilm f
order by FilmOscarWins desc;
In an aggregation query, the select columns need to be consistent with the group by columns -- the unaggregated columns in the select must match the group by.

SQL - Distinct Not Providing Unique Results for Designated Column

I'm currently learning SQL by working through these exercises: https://sqlzoo.net/wiki/The_JOIN_operation
I'm on Example 8 which asks: "Show the name of all players who scored a goal against Germany."
Here is what I currently have:
SELECT DISTINCT(goal.player), goal.gtime, game.team1, game.team2
FROM game JOIN goal ON (goal.matchid = game.id)
WHERE (game.team1='GER' OR game.team2='GER') AND (goal.teamid<>'GER')
I would expect that results would be returned with only unique names. However, that is not the case as we can see "Mario Balotelli" is listed twice. Why doesn't the DISTINCT command work in this instance?
Thank you!
DISTINCT operates at the record level, so it applies to the whole row. If you need extra fields to show up in your result, you need to perform a GROUP BY on the player and bring along the other fields by joining to the grouped result.
But I reckon the intended answer is only the player name, so the query would be something like this:
SELECT DISTINCT player
FROM game JOIN goal ON matchid = id
WHERE (game.team1='GER' OR game.team2='GER') AND (goal.teamid<>'GER')
It looks like "Mario Balotelli" has more than one goal.gtime, or appears with different team1 and team2 values. So try removing the additional columns from your SELECT clause.
DISTINCT gets the distinct rows based on all selected columns. As the goal times differ, selecting that column makes the rows different (distinct) from one another.
The question only asks you to select the player's name:
SELECT DISTINCT player
FROM game
JOIN goal ON matchid = id
WHERE (team1='GER' OR team2='GER')
AND (teamid <>'GER')
This link looks like it would be useful further reading: https://www.designcise.com/web/tutorial/what-is-the-order-of-execution-of-an-sql-query
Edit: If you want more than one column but only a distinct list of players, you are in the realm of aggregation: you would apply MIN/MAX/SUM/AVG to the other data for the group.
SELECT player, team1, team2, MIN(gtime) AS min_gtime, MAX(gtime) AS max_gtime, COUNT(1) AS goals_scored
FROM game
JOIN goal ON matchid = id
WHERE (team1='GER' OR team2='GER')
AND (teamid <>'GER')
GROUP BY player, team1, team2
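To illustrate that DISTINCT applies to the whole selected row rather than to the first column, here is a minimal sketch using Python's standard sqlite3 module. The goal table is cut down to two columns and the player names and times are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Invented rows: 'Player X' scores twice, at different times.
    CREATE TABLE goal (player TEXT, gtime INTEGER);
    INSERT INTO goal VALUES ('Player X', 20), ('Player X', 36),
                            ('Player Y', 55);
""")

# DISTINCT over (player, gtime): the two 'Player X' rows differ in
# gtime, so both survive.
with_time = conn.execute(
    "SELECT DISTINCT player, gtime FROM goal ORDER BY player, gtime"
).fetchall()

# DISTINCT over player only: one row per name.
names_only = conn.execute(
    "SELECT DISTINCT player FROM goal ORDER BY player"
).fetchall()

# One row per player plus extra data requires aggregation.
per_player = conn.execute("""
    SELECT player, MIN(gtime), MAX(gtime), COUNT(*)
    FROM goal
    GROUP BY player
    ORDER BY player
""").fetchall()

print(with_time)
print(names_only)
print(per_player)
```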

Take the value of the first row met in non-aggregate expressions

I have a query like this:
SELECT PlayerID, COUNT(PlayerID) AS "MatchesPlayed", Name, Role, Team,
SUM(Goals) As "TotalGoals", SUM(Autogoals) As "TotalAutogoals",
SUM(...)-2*SUM(...)+2*SUM(...) AS Score, ...
FROM raw_ordered
GROUP BY PlayerID
ORDER BY Score DESC
where in raw_ordered each row describes the performance of some player in some match, in reverse chronological order.
Since I'm grouping by PlayerID what I get from this query is a table where each row provides the cumulative data about some player. Now, there's no problem with columns with aggregate functions; my problem is with the Team column.
A player may change team during a season; what I'm interested in here is the last Team he played with, so I'd like to have a way to tell SELECT to take the first value met in each group for the Team column (or, in general, for non-aggregate-function columns).
Unfortunately, I don't seem to find any (easy) way to do this in SQLite: the documentation of SELECT says:
If the expression is an aggregate expression, it is evaluated across all rows in the group. Otherwise, it is evaluated against a single arbitrarily chosen row from within the group.
with no suggestion about how to alter this behavior, and I can't find between the aggregate functions anything that just takes the first value it encounters.
Any idea?
SQLite does not have a 'first' aggregate function; you would have to implement it yourself.
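However, the documentation is out of date. Since SQLite 3.7.11, if the query contains a MIN() or MAX() aggregate, the unaggregated columns are guaranteed to come from the row that holds that minimum/maximum value (with more than one MIN()/MAX() in the query, which of their rows is used is arbitrary).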
Therefore, just add MAX(MatchDate) to the SELECT column list.
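A minimal sketch of that behaviour, run through Python's standard sqlite3 module. The MatchDate column is an assumption carried over from the answer (the question does not show one), and the rows, team names and the reduced column list are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Invented rows; MatchDate is an assumed column name.
    CREATE TABLE raw_ordered (PlayerID INTEGER, Team TEXT,
                              MatchDate TEXT, Goals INTEGER);
    INSERT INTO raw_ordered VALUES
        (1, 'New Team',  '2013-05-01', 1),
        (1, 'Old Team',  '2012-09-01', 2),
        (2, 'Only Team', '2013-04-01', 0);
""")

# With a single MAX() in the query, SQLite takes the bare column Team
# from the same row that supplies the maximum MatchDate.
rows = conn.execute("""
    SELECT PlayerID, Team, MAX(MatchDate) AS LastMatch,
           SUM(Goals) AS TotalGoals
    FROM raw_ordered
    GROUP BY PlayerID
    ORDER BY PlayerID
""").fetchall()

print(rows)
```

This relies on SQLite-specific behaviour; other engines would reject the bare Team column or require a different formulation.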
SELECT PlayerID, COUNT(PlayerID) AS "MatchesPlayed", Name, Role,
-- correlated subquery: the most recent Team for this player (some_date is a placeholder column)
(SELECT r2.Team FROM raw_ordered r2 WHERE r2.PlayerID = r.PlayerID ORDER BY r2.some_date DESC LIMIT 1) AS Team,
SUM(Goals) As "TotalGoals", SUM(Autogoals) As "TotalAutogoals",
SUM(...)-2*SUM(...)+2*SUM(...) AS Score, ...
FROM raw_ordered r
GROUP BY PlayerID
ORDER BY Score DESC
Presumably you have some way in your table to order the output such that you can use a subquery to achieve your goal.

I'm not sure what is the purpose of "group by" here

I'm struggling to understand what this query is doing:
SELECT branch_name, count(distinct customer_name)
FROM depositor, account
WHERE depositor.account_number = account.account_number
GROUP BY branch_name
What's the need of GROUP BY?
You must use GROUP BY in order to use an aggregate function like COUNT in this manner (using an aggregate function to aggregate data corresponding to one or more values within the table).
The query essentially selects distinct branch_names using that column as the grouping column, then within the group it counts the distinct customer_names.
You couldn't use COUNT to get the number of distinct customer_names per branch_name without the GROUP BY clause (at least not with a simple query specification - you can use other means, joins, subqueries etc...).
It gives you the number of distinct customers for each branch; GROUP BY defines the groups over which the COUNT function is evaluated.
It could be written also as:
SELECT branch_name, count(distinct customer_name)
FROM depositor INNER JOIN account
ON depositor.account_number = account.account_number
GROUP BY branch_name
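As a rough illustration of what the GROUP BY changes, the following sketch runs the query with and without it using Python's standard sqlite3 module. The account numbers, branch names and customer names are invented sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Invented sample data for the two tables in the question.
    CREATE TABLE account (account_number TEXT, branch_name TEXT);
    CREATE TABLE depositor (customer_name TEXT, account_number TEXT);
    INSERT INTO account VALUES ('A-1', 'Downtown'), ('A-2', 'Downtown'),
                               ('A-3', 'Uptown');
    INSERT INTO depositor VALUES ('Hayes', 'A-1'), ('Hayes', 'A-2'),
                                 ('Jones', 'A-2'), ('Smith', 'A-3');
""")

# With GROUP BY: one row per branch, counting its distinct customers.
per_branch = conn.execute("""
    SELECT branch_name, COUNT(DISTINCT customer_name)
    FROM depositor INNER JOIN account
      ON depositor.account_number = account.account_number
    GROUP BY branch_name
    ORDER BY branch_name
""").fetchall()

# Without GROUP BY (and without branch_name): one row for everything.
overall = conn.execute("""
    SELECT COUNT(DISTINCT customer_name)
    FROM depositor INNER JOIN account
      ON depositor.account_number = account.account_number
""").fetchone()

print(per_branch)
print(overall)
```

Hayes holds two Downtown accounts but is counted once there because of DISTINCT; dropping the GROUP BY collapses the whole join into a single count.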
Let's take a step away from SQL for a moment and look at the relational training language Tutorial D.
Because the two relations (tables) are joined on the common attribute (column) name account_number, we can use a natural join:
depositor JOIN account
(Because the result is a relation, which by definition has only distinct tuples (rows), we don't need a DISTINCT keyword.)
Now we just need to aggregate using SUMMARIZE..BY:
SUMMARIZE (depositor JOIN account)
BY { branch_name }
ADD ( COUNT ( customer_name ) AS customer_tally )
Back in SQLland, the GROUP BY branch_name is doing the same as SUMMARIZE..BY { branch_name }. Because SQL has a very rigid structure, the branch_name column must be repeated in the SELECT clause.
If you want to COUNT something per group (see the SELECT part of the statement), you have to use GROUP BY to tell the query which rows to aggregate together. The GROUP BY clause is used in conjunction with the aggregate functions to group the result set by one or more columns.
Omitting it while still selecting an unaggregated column such as branch_name will lead to SQL errors in most RDBMSs, or meaningless results in others.
Useful link:
http://www.w3schools.com/sql/sql_groupby.asp