GROUP BY and HAVING in SQL

If I write
SELECT continent FROM world GROUP BY continent HAVING sum(population) >= 100000000
it will return all continents that have a total sum over 100 million. But if I leave out the GROUP BY like so
SELECT continent FROM world HAVING sum(population) >= 100000000
it will only return one continent (in this case Asia).
Why is that?

When you don't have GROUP BY, aggregate functions like SUM() operate over the entire table, treating it all as one big group. That's why you just get one row of results.
Strictly speaking, when a query uses an aggregate function, every column in the SELECT list must either appear in the GROUP BY clause or be wrapped in an aggregate function, so your query isn't valid standard SQL. Some databases, such as MySQL (depending on its settings), allow other columns as an extension; in that case, the value is taken from an arbitrary row in the group. And if there's no GROUP BY clause at all, the entire table is one group, so you get the continent column from some arbitrary row in the table.
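To see the "whole table is one group" behaviour for yourself, here is a minimal sketch using Python's built-in sqlite3 module. The table contents are made up for illustration (the country names and populations are hypothetical), and SQLite is only standing in for whatever database you actually use.

```python
import sqlite3

# Hypothetical sample data; names and populations are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE world (name TEXT, continent TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO world VALUES (?, ?, ?)",
    [
        ("A-land", "Asia", 90_000_000),
        ("B-land", "Asia", 60_000_000),
        ("C-land", "Europe", 70_000_000),
        ("D-land", "Europe", 40_000_000),
        ("E-land", "Oceania", 5_000_000),
    ],
)

# With GROUP BY: one group per continent, and HAVING filters whole groups.
grouped = conn.execute(
    """
    SELECT continent, SUM(population)
    FROM world
    GROUP BY continent
    HAVING SUM(population) >= 100000000
    ORDER BY continent
    """
).fetchall()

# Without GROUP BY: the aggregate runs over the entire table, so one row comes back.
whole_table = conn.execute("SELECT SUM(population) FROM world").fetchall()

print(grouped)
print(whole_table)
```

The grouped query returns one row per continent that passes the HAVING test, while the ungrouped aggregate always returns exactly one row.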

Related

Confused with the Group By function in SQL

Q1: After using the Group By function, why does it only output one row of each group at most? Does this mean that having is supposed to filter the group rather than filter the records in each group?
Q2: I want to find the records in each group whose ages are greater than the average age of that group. I tried the following, but it returns nothing. How should I fix this?
SELECT *, avg(age) FROM Mytable Group By country Having age > avg(age)
Thanks!!!!
You can calculate the average age for each country in a subquery and join that to your table for filtering:
SELECT mt.*, MtAvg.AvgAge
FROM Mytable mt
INNER JOIN
(
    SELECT mtavgs.country
         , AVG(mtavgs.age) AS AvgAge
    FROM Mytable mtavgs
    GROUP BY mtavgs.country
) MtAvg
    ON MtAvg.country = mt.country
   AND mt.age > MtAvg.AvgAge
GROUP BY always returns one row per unique combination of values in the listed GROUP BY columns (provided the group is not removed by a HAVING clause). The subquery in our example (alias: MtAvg) produces a single row per country. We use its results to filter the rows of the main table through the condition in the INNER JOIN clause, and we also report the calculated average age alongside each row.
GROUP BY is a clause, not an aggregate function; it is used together with aggregate functions such as AVG() or COUNT(). Any SQL GROUP BY tutorial covers this in more detail.
What it does is collapse all the rows that share the same grouping value into one output row. In your example, it would lump all the rows with the same country together.
I'm not quite sure what exactly your query needs to be to solve your exact problem. I would, however, look into what are called window functions in SQL. I believe you first need a window function to find the average age in each group; then you can write a query that returns the rows you need.
Depending on your DBMS type and version, you may be able to use a "window function" that calculates the average per country and makes that value available on every row. Once that data is present as a "derived table", you can simply use a WHERE clause to keep the rows whose age is greater than the calculated average for their country.
SELECT mt.*
FROM (
SELECT *
, avg(age) OVER(PARTITION BY country) AS AvgAge
FROM Mytable
) mt
WHERE mt.Age > mt.AvgAge
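As a rough check that the join-on-subquery approach and the window-function approach agree, here is a small sketch using Python's sqlite3 module. The rows are hypothetical, the name column is an assumption added so the results are readable, and the window-function query assumes an SQLite build of 3.25 or later.

```python
import sqlite3

# Hypothetical sample rows; the "name" column is added only for readability.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Mytable (name TEXT, country TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO Mytable VALUES (?, ?, ?)",
    [
        ("Ann", "US", 20), ("Bob", "US", 40), ("Cy", "US", 30),
        ("Dee", "FR", 50), ("Eli", "FR", 30),
    ],
)

# Approach 1: compute one average per country in a subquery, then join back.
join_rows = conn.execute(
    """
    SELECT mt.name, mt.country, mt.age
    FROM Mytable mt
    INNER JOIN (
        SELECT country, AVG(age) AS AvgAge
        FROM Mytable
        GROUP BY country
    ) MtAvg
        ON MtAvg.country = mt.country
       AND mt.age > MtAvg.AvgAge
    ORDER BY mt.name
    """
).fetchall()

# Approach 2: attach the per-country average to every row with a window function.
window_rows = conn.execute(
    """
    SELECT name, country, age
    FROM (
        SELECT *, AVG(age) OVER (PARTITION BY country) AS AvgAge
        FROM Mytable
    ) mt
    WHERE mt.age > mt.AvgAge
    ORDER BY name
    """
).fetchall()

print(join_rows)
print(window_rows)
```

With this data the US average is 30 and the FR average is 40, so both queries keep only the people strictly above their own country's average.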

Why does MAX statement require a Group By?

I understand why the first query needs a GROUP BY, as it doesn't know which date to apply the sum to, but I don't understand why this is the case with the second query. The value that ultimately is the max amount is already contained in the table - it is not calculated like SUM is. thank you
-- First Query
select
sum(OrderSales),OrderDates
From Orders
-- Second Query
select
max(FilmOscarWins),FilmName
From tblFilm
It is not the SUM and MAX that require the GROUP BY, it is the unaggregated column.
If you just write this, you will get a single row, for the maximum value of the FilmOscarWins column across the whole table:
select
max(FilmOscarWins)
From
tblFilm
If the most Oscars any film won was 12, that one row will say 12. But there could be multiple films, all of which won 12 Oscars, so if we ask for the FilmName alongside that 12, there is no single answer.
By adding a GROUP BY on FilmName, we fundamentally change the query: instead of returning one number for the whole table, it returns one row for each group, which in this case means one row for each film.
If you do want to get a list of all those films which had the maximum 12 Oscars, you have to do something more complicated, such as using a sub-query to first find that single number (12) and then find all the rows matching it:
select
FilmOscarWins,
FilmName
From
tblFilm
Where FilmOscarWins = (
select
max(FilmOscarWins)
From
tblFilm
)
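If you want just one film with the most Oscar wins, you can use SELECT TOP (SQL Server syntax); note that if several films tie for the maximum, only one of them is returned: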
select top (1) f.*
From tblFilm f
order by FilmOscarWins desc;
In an aggregation query, the select columns need to be consistent with the group by columns -- the unaggregated columns in the select must match the group by.
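Here is a small sketch of the difference between "one row per film" and "all films that match the table-wide maximum", run through Python's sqlite3 module. The film names and win counts are hypothetical, chosen so that two films tie for the maximum.

```python
import sqlite3

# Hypothetical films; two of them tie for the highest number of wins.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tblFilm (FilmName TEXT, FilmOscarWins INTEGER)")
conn.executemany(
    "INSERT INTO tblFilm VALUES (?, ?)",
    [("Film A", 11), ("Film B", 11), ("Film C", 4)],
)

# GROUP BY FilmName answers a different question: the maximum within each film.
per_film = conn.execute(
    """
    SELECT FilmName, MAX(FilmOscarWins)
    FROM tblFilm
    GROUP BY FilmName
    ORDER BY FilmName
    """
).fetchall()

# The subquery finds the single table-wide maximum; the outer query
# then returns every film that matches it, including ties.
top_films = conn.execute(
    """
    SELECT FilmName, FilmOscarWins
    FROM tblFilm
    WHERE FilmOscarWins = (SELECT MAX(FilmOscarWins) FROM tblFilm)
    ORDER BY FilmName
    """
).fetchall()

print(per_film)
print(top_films)
```

The grouped query returns all three films, each with its own count, while the subquery version returns only the two films that share the maximum.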

Group By Vs Distinct in SQL

SELECT continent, COUNT(name)
FROM world
WHERE population>200000000
GROUP BY continent
When I execute the query above, it runs fine. It shows, for each continent, the number of countries with a population larger than 200000000.
However, when I modify my query to the one below:
SELECT DISTINCT(continent), COUNT(name)
FROM world
WHERE population>200000000
This does not work. I am wondering what the reason is. In this case I am saying for each distinct continent count the total countries with population larger than 200000000.
I just want to understand the reasoning so I can become better at writing queries.
Why does this not work?
SELECT DISTINCT(continent), COUNT(name)
FROM world
WHERE population > 200000000;
That is simple. You have an aggregation query, because you have COUNT() in the SELECT. You have no GROUP BY, so any other column referenced in the SELECT must appear only as the argument of an aggregate function. So, continent generates an error.
You seem to also be under the impression that the parentheses around continent have some significance. They do not. Not at all. SQL has a construct, SELECT DISTINCT, which removes duplicate rows from the result; it applies to the whole row, not to a single column.
Also note that SELECT DISTINCT is almost never needed in an aggregation query.
I think you want:
SELECT continent
, COUNT(DISTINCT name) AS DistinctCountries
FROM world
WHERE population>200000000
GROUP BY continent
If you want each row to represent a continent, you need to group by continent. Then count the distinct countries in each continent where your condition is met.
The first query and its order of evaluation:
1. FROM world: Get rows from the world table.
2. WHERE population>200000000: Only accept rows (countries?) with a population greater than 200000000.
3. GROUP BY continent: Aggregate the rows so as to get one result row per continent.
4. SELECT COUNT(name): For each continent, show the count of its rows from step 3 where name is not null.
5. SELECT continent: Show the continent.
The second query and its order of evaluation:
1. FROM world: Get rows from the world table.
2. WHERE population>200000000: Only accept rows (countries?) with a population greater than 200000000.
3. (No GROUP BY clause): The rows are not split into groups.
4. SELECT COUNT(name): As there is no GROUP BY clause, this is saying you want one result row only, with the count of all rows from step 2 where name is not null.
5. SELECT (continent): The parentheses are superfluous. You are saying you want to show the continent. However, since COUNT(name) already said you want one result row only, which continent are you talking about? It makes no sense to the DBMS and is invalid SQL. (There is one DBMS making an exception here, though: with a certain setting, MySQL would just pick a continent arbitrarily rather than raising an error.)
6. SELECT DISTINCT: Of all result rows, you want duplicates removed, i.e. all rows showing the same continent and count.
Your error, as you can see, is in steps 4 and 5, where SELECT COUNT(name) without GROUP BY and SELECT (continent) don't match semantically.
GROUP BY and DISTINCT do quite different jobs.
GROUP BY is used to form groups and perform aggregation per group, while DISTINCT only removes duplicate rows from the result and nothing else.
SELECT continent, COUNT(name)
FROM world
WHERE population>200000000
GROUP BY continent
The first query has a GROUP BY on continent: after filtering with WHERE, it puts all rows with the same continent into the same group.
This query gives you one row per continent, with the count for that continent.
SELECT DISTINCT continent,
COUNT(name)
FROM world
WHERE population>200000000
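The second query applies DISTINCT and COUNT to the whole table (after the WHERE filter), not to groups. In standard SQL it is invalid, because continent is neither aggregated nor grouped; a database that accepts it returns a single row with one arbitrary continent and a count covering all the filtered rows, not a count per continent.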
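To make the contrast concrete, here is a small sketch using Python's sqlite3 module with invented data (the country names and populations are hypothetical). It runs the grouped count, and then an ungrouped query in which every selected expression is an aggregate, which is the valid way to ask a whole-table question.

```python
import sqlite3

# Hypothetical sample data; three countries exceed the 200,000,000 threshold.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE world (name TEXT, continent TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO world VALUES (?, ?, ?)",
    [
        ("X-land", "Asia", 1_400_000_000),
        ("Y-land", "Asia", 1_300_000_000),
        ("Z-land", "North America", 330_000_000),
        ("W-land", "Europe", 80_000_000),
    ],
)

# One row per continent: WHERE filters rows first, then GROUP BY forms the groups.
per_continent = conn.execute(
    """
    SELECT continent, COUNT(name)
    FROM world
    WHERE population > 200000000
    GROUP BY continent
    ORDER BY continent
    """
).fetchall()

# No GROUP BY: the filtered rows are one group, so only aggregates are selected.
overall = conn.execute(
    """
    SELECT COUNT(DISTINCT continent), COUNT(name)
    FROM world
    WHERE population > 200000000
    """
).fetchall()

print(per_continent)
print(overall)
```

The first result has one row for each continent that still has rows after the filter; the second is a single row saying how many distinct continents and how many countries passed the filter overall.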

How can I query to find max population of country from countries table?

I have a table "countries" with the columns name, continent, area, population.
Let's say I want to find the name and population of the country with the highest population.
SELECT MAX(population) FROM countries;
The example above returns the maximum population.
I want it to also see the name of the country with that population.
SELECT name,MAX(population) FROM countries;
I am getting the error like below.
ERROR: column "countries.name" must appear in the GROUP BY clause or be used in an aggregate function
I can't think of another way to do it.
Here is an example of my query.
SELECT name,population
FROM countries
WHERE population >= (
SELECT MAX(population)
FROM countries)
;
This query works, but I am also curious why am I getting the error or if anyone knows if there is any better ways to accomplish this?
SELECT name, population
FROM countries
ORDER BY population DESC
LIMIT 1
MAX selects the maximum element from a list of values. In your first query,
SELECT MAX(population) FROM countries;
the list is formed by extracting the population field from all rows in countries, and then the maximum is selected. This collapses the list of rows down to a single row containing just the maximum.
In your second query,
SELECT name,MAX(population) FROM countries;
you (conceptually) get a list of all name fields from countries, but there's only one MAX(population). The database system doesn't know what to do with this: SELECT name FROM countries would return as many rows as there are in countries, but SELECT MAX(population) FROM countries would only return one row. This doesn't match up; it's unclear how many rows you want returned from this. This is why you get an error.
The error message says you need to either
use name in an aggregate function, which would collapse the list of rows down to a single value, which could be returned along the single MAX value, or
use a GROUP BY name clause, which would group the list of countries into entries with equal names first, then compute MAX(population) separately for each group. This makes no sense if all your countries have different names.
As far as I know there's no SQL syntax for "select the maximum population and then get the name field from the same row" (it's not quite clear what this would do anyway because there can be more than one country with a population equal to the maximum).
What you can do instead is sort the whole table, then select only fields from the first row:
SELECT name, population
FROM countries
ORDER BY population DESC
LIMIT 1
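(I'm pretty sure Postgres can avoid a full sort here: with an index on population it can read the first row from the index, and without one it only needs to keep track of the top row while scanning.)
Now if there is more than one country with the maximum population, you'll get an arbitrary one of them (we haven't told the database how to order rows with equal population).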
In SQL Server, you can use the TOP keyword to select only a single record from the countries table (PostgreSQL, which produced the error above, does not support TOP; use LIMIT there):
SELECT Top 1 name,population
FROM countries
order by population DESC
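If ties matter, here is a sketch of two ways to handle them, again using Python's sqlite3 module and hypothetical data in which two countries share the maximum. The tie-breaker on name is an assumption for illustration; any column that makes the ordering unambiguous would do.

```python
import sqlite3

# Hypothetical data; two countries share the maximum population.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE countries (name TEXT, continent TEXT, area INTEGER, population INTEGER)"
)
conn.executemany(
    "INSERT INTO countries VALUES (?, ?, ?, ?)",
    [
        ("Aland", "Asia", 100, 500),
        ("Bland", "Asia", 200, 500),
        ("Cland", "Europe", 50, 100),
    ],
)

# Option 1: return every country whose population equals the maximum.
all_max = conn.execute(
    """
    SELECT name, population
    FROM countries
    WHERE population = (SELECT MAX(population) FROM countries)
    ORDER BY name
    """
).fetchall()

# Option 2: return exactly one row, with name as a tie-breaker
# so the result does not depend on how the database happens to order ties.
single = conn.execute(
    """
    SELECT name, population
    FROM countries
    ORDER BY population DESC, name ASC
    LIMIT 1
    """
).fetchall()

print(all_max)
print(single)
```

Option 1 is the one to use when you need every tied row; option 2 is fine when any single, predictable winner will do.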

Usage of aggregate function Group by

I have observed that Count function can be used without the usage of aggregate function Group by. Like for example:
Select Count(*) from Employee
It would surely return the count of all the rows without the usage of aggregate function. Then where do we really need the usage of group by?
Omitting the GROUP BY implies that the entire table is one group. Sometimes you want there to be multiple groups. Consider the following example:
SELECT month, SUM(sales) AS total_sales
FROM all_sales
GROUP BY month;
This query gives you a month-by-month breakdown of sales. If you omitted month and the GROUP BY clause, you would only receive the total sales of all time which may not have the granularity you require.
You can also group by multiple columns, giving finer detail still:
SELECT state, city, COUNT(*) AS population
FROM all_people
GROUP BY state, city;
Additionally, GROUP BY is where HAVING clauses become really useful, because HAVING lets us filter whole groups by their aggregate values. Using the above example, we can filter the result to cities with over 1,000,000 people:
SELECT state, city, COUNT(*) AS population
FROM all_people
GROUP BY state, city
HAVING COUNT(*) > 1000000;
The group by clause is used to break up aggregate results to groups of unique values. E.g., let's say you don't want to know how many employees you have, but how many by each first name (e.g., two Gregs, one Adam and three Scotts):
SELECT first_name, COUNT(*)
FROM employee
GROUP BY first_name
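Here is a small runnable sketch of grouping by multiple columns and then filtering the groups with HAVING, using Python's sqlite3 module. The rows are hypothetical, and the threshold is lowered to 2 so that the tiny sample table still shows the filter doing something.

```python
import sqlite3

# Hypothetical residents; the same city name appears in two different states.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE all_people (person TEXT, state TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO all_people VALUES (?, ?, ?)",
    [
        ("p1", "IL", "Chicago"), ("p2", "IL", "Chicago"), ("p3", "IL", "Chicago"),
        ("p4", "IL", "Springfield"), ("p5", "IL", "Springfield"),
        ("p6", "MA", "Springfield"),
    ],
)

# Grouping by (state, city) keeps the two Springfields apart.
by_city = conn.execute(
    """
    SELECT state, city, COUNT(*) AS population
    FROM all_people
    GROUP BY state, city
    ORDER BY state, city
    """
).fetchall()

# HAVING filters the groups after aggregation (threshold lowered for the sample).
big_cities = conn.execute(
    """
    SELECT state, city, COUNT(*) AS population
    FROM all_people
    GROUP BY state, city
    HAVING COUNT(*) >= 2
    ORDER BY state, city
    """
).fetchall()

print(by_city)
print(big_cities)
```

Grouping on city alone would merge the two Springfields into one row, which is why the extra state column in the GROUP BY matters.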