Group By Vs Distinct in SQL

SELECT continent, COUNT(name)
FROM world
WHERE population>200000000
GROUP BY continent
When I execute the query above, it runs fine. It shows, for each continent, the number of countries that have a population larger than 200000000.
However, when I modify my query to the one below:
SELECT DISTINCT(continent), COUNT(name)
FROM world
WHERE population>200000000
This does not work. I am wondering what the reason is. In this case I am saying: for each distinct continent, count the total countries with a population larger than 200000000.
I just want to understand the reasoning so I can become better at writing queries.

Why does this not work?
SELECT DISTINCT(continent), COUNT(name)
FROM world
WHERE population > 200000000;
That is simple. You have an aggregation query, because you have COUNT() in the SELECT. You have no GROUP BY, so any other column referenced in the SELECT must be an argument to an aggregation function. So, continent generates an error.
You seem to also be under the impression that the parentheses around continent have some significance. They do not. Not at all. SQL has a construct, SELECT DISTINCT, which selects distinct values of rows.
Also note that SELECT DISTINCT is almost never used in aggregation queries.
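To see that the parentheses change nothing, here is a minimal sketch using Python's built-in sqlite3 module. The table contents are made-up sample data and SQLite is only a convenient stand-in for whatever DBMS you use. It shows that DISTINCT(continent) and DISTINCT continent return the same rows, and that DISTINCT applies to the whole selected row, not to the column in parentheses.

```python
import sqlite3

# Hypothetical sample data, only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE world (name TEXT, continent TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO world VALUES (?, ?, ?)",
    [
        ("China", "Asia", 1400000000),
        ("India", "Asia", 1380000000),
        ("Indonesia", "Asia", 270000000),
        ("USA", "North America", 330000000),
        ("Brazil", "South America", 210000000),
        ("France", "Europe", 67000000),
    ],
)

# The parentheses are just a parenthesized expression; both queries are the same.
with_parens = conn.execute(
    "SELECT DISTINCT(continent) FROM world ORDER BY continent"
).fetchall()
without_parens = conn.execute(
    "SELECT DISTINCT continent FROM world ORDER BY continent"
).fetchall()

# DISTINCT works on whole rows: every (continent, name) pair here is unique,
# so no row is removed even though "Asia" appears three times.
two_columns = conn.execute(
    "SELECT DISTINCT(continent), name FROM world"
).fetchall()

print(with_parens == without_parens)  # True
print(len(two_columns))               # 6
```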

I think you want:
SELECT continent
, COUNT(DISTINCT name) AS DistinctCountries
FROM world
WHERE population>200000000
GROUP BY continent
If you want each row to represent a continent, you need to group by continent. Then count the distinct countries in the continent where your condition is met.

The first query and its order of evaluation:
1. FROM world: Get rows from the world table.
2. WHERE population>200000000: Only accept rows (countries?) with a population greater than 200000000.
3. GROUP BY continent: Aggregate the rows so as to get one result row per continent.
4. SELECT COUNT(name): For each continent, show the count of its rows from step 3 where name is not null.
5. SELECT continent: Show the continent.
The second query and its order of evaluation:
1. FROM world: Get rows from the world table.
2. WHERE population>200000000: Only accept rows (countries?) with a population greater than 200000000.
3. (No GROUP BY clause: the rows are not split into groups.)
4. SELECT COUNT(name): As there is no GROUP BY clause, this is saying you want one result row only, with the count of all rows left after step 2 where name is not null.
5. SELECT (continent): The parentheses are superfluous. You are saying you want to show the continent. However, as you said with COUNT(name) that you wanted to show one result row only, which continent are you talking about? It makes no sense to the DBMS and is invalid SQL. (There is one DBMS making an exception here, though: MySQL would just pick a continent arbitrarily rather than raising an error, a certain setting provided.)
6. SELECT DISTINCT: Of all result rows, you want duplicates removed, i.e. all rows showing the same continent and count.
Your error, as you can see, is in steps 4 and 5, where SELECT COUNT(name) without GROUP BY and SELECT (continent) don't match semantically.
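The difference between the two evaluation orders can be checked with a small sketch, again using Python's sqlite3 and made-up sample rows (both are assumptions for illustration, not part of the question). With GROUP BY you get one row per continent. Without it, COUNT collapses all filtered rows into a single row, which is why a plain continent column has nothing well defined to show.

```python
import sqlite3

# Hypothetical sample data, only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE world (name TEXT, continent TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO world VALUES (?, ?, ?)",
    [
        ("China", "Asia", 1400000000),
        ("India", "Asia", 1380000000),
        ("Indonesia", "Asia", 270000000),
        ("USA", "North America", 330000000),
        ("Brazil", "South America", 210000000),
        ("France", "Europe", 67000000),
    ],
)

# With GROUP BY: one result row per continent that survives the WHERE filter.
grouped = conn.execute(
    """
    SELECT continent, COUNT(name)
    FROM world
    WHERE population > 200000000
    GROUP BY continent
    ORDER BY continent
    """
).fetchall()

# Without GROUP BY: the whole filtered table is one group, so one row only.
ungrouped = conn.execute(
    "SELECT COUNT(name) FROM world WHERE population > 200000000"
).fetchall()

print(grouped)    # [('Asia', 3), ('North America', 1), ('South America', 1)]
print(ungrouped)  # [(5,)]
```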

GROUP BY and DISTINCT do different jobs.
GROUP BY is used to form groups of rows and perform aggregation per group, while DISTINCT only removes duplicate rows from the result, nothing else.
SELECT continent, COUNT(name)
FROM world
WHERE population>200000000
GROUP BY continent
The first query has a GROUP BY on continent: after filtering via WHERE, it puts all rows that have the same continent into separate groups.
This query gives you one row per continent with its count.
SELECT DISTINCT continent,
COUNT(name)
FROM world
WHERE population>200000000
The second query applies DISTINCT and COUNT to the whole table after filtering on population, not to groups. Standard SQL rejects it; a DBMS that accepts it (MySQL with ONLY_FULL_GROUP_BY disabled, for example) returns a single row whose count covers the whole filtered table, together with one arbitrarily chosen continent.
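As a rough illustration of "DISTINCT only removes duplicates", the sketch below (Python's sqlite3, made-up sample rows, both assumptions) compares a plain SELECT, a SELECT DISTINCT and a GROUP BY without any aggregate. When no aggregate function is involved, DISTINCT and GROUP BY happen to produce the same rows; only GROUP BY can additionally compute something per group.

```python
import sqlite3

# Hypothetical sample data, only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE world (name TEXT, continent TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO world VALUES (?, ?, ?)",
    [
        ("China", "Asia", 1400000000),
        ("India", "Asia", 1380000000),
        ("Indonesia", "Asia", 270000000),
        ("USA", "North America", 330000000),
        ("Brazil", "South America", 210000000),
        ("France", "Europe", 67000000),
    ],
)

where = "WHERE population > 200000000"

# Plain SELECT: one row per matching country, so "Asia" shows up three times.
plain = conn.execute(f"SELECT continent FROM world {where}").fetchall()

# DISTINCT removes the duplicate rows.
distinct_rows = conn.execute(
    f"SELECT DISTINCT continent FROM world {where} ORDER BY continent"
).fetchall()

# GROUP BY without an aggregate yields the same rows as DISTINCT.
grouped_rows = conn.execute(
    f"SELECT continent FROM world {where} GROUP BY continent ORDER BY continent"
).fetchall()

print(len(plain))                     # 5
print(distinct_rows == grouped_rows)  # True
```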

Related

Confused with the Group By function in SQL

Q1: After using the Group By function, why does it only output one row of each group at most? Does this mean that having is supposed to filter the group rather than filter the records in each group?
Q2: I want to find the records in each group whose ages are greater than the average age of that group. I tried the following, but it returns nothing. How should I fix this?
SELECT *, avg(age) FROM Mytable Group By country Having age > avg(age)
Thanks!!!!
You can calculate the average age for each country in a subquery and join that to your table for filtering:
SELECT mt.*, MtAvg.AvgAge
FROM Mytable mt
INNER JOIN
(
SELECT mtavgs.country
, AVG(mtavgs.age) AS AvgAge
FROM Mytable mtavgs
GROUP BY mtavgs.country
) MtAvg
ON MtAvg.country = mt.country
AND mt.age > MtAvg.AvgAge
GROUP BY always returns one row per unique combination of values in the listed GROUP BY columns (provided that the row is not removed by a HAVING clause). The subquery in our example (alias: MtAvg) calculates a single row per country. We use its results to filter the main table rows by applying the condition in the INNER JOIN clause; we also report the calculated average age in the output.
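Here is a small runnable sketch of that join, using Python's sqlite3 module. The name column and the sample rows are invented for the example; only country and age come from the question.

```python
import sqlite3

# Hypothetical sample data; the "name" column is an assumption for readability.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Mytable (name TEXT, country TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO Mytable VALUES (?, ?, ?)",
    [
        ("Ann", "US", 30),
        ("Bob", "US", 40),
        ("Cid", "US", 50),
        ("Dee", "FR", 20),
        ("Eve", "FR", 60),
    ],
)

# The subquery yields one row per country; the join condition keeps only
# the people who are older than the average of their own country.
above_average = conn.execute(
    """
    SELECT mt.name, mt.country, mt.age, MtAvg.AvgAge
    FROM Mytable mt
    INNER JOIN (
        SELECT country, AVG(age) AS AvgAge
        FROM Mytable
        GROUP BY country
    ) MtAvg
      ON MtAvg.country = mt.country
     AND mt.age > MtAvg.AvgAge
    ORDER BY mt.name
    """
).fetchall()

print(above_average)  # [('Cid', 'US', 50, 40.0), ('Eve', 'FR', 60, 40.0)]
```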
GROUP BY is a clause that is used together with aggregate functions. For further reading, see an SQL GROUP BY tutorial.
What it does is lump all the rows of a group together into one row. In your example it would lump all the rows with the same country together.
Not quite sure what exactly your query needs to be to solve your exact problem. I would however look into what are called window functions in SQL. I believe what you first need to do is write a window function to find the average age in each group. Then you can write a query to return the results you need
Depending on your DBMS type and version, you may be able to use a "window function" that calculates the average per country and makes that value available on every row. Once that data is present as a "derived table", you can simply use a WHERE clause to filter for the ages that are greater than the calculated average per country.
SELECT mt.*
FROM (
SELECT *
, avg(age) OVER(PARTITION BY country) AS AvgAge
FROM Mytable
) mt
WHERE mt.Age > mt.AvgAge
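A quick way to try the window-function version is the sketch below. It assumes Python's sqlite3 module linked against SQLite 3.25 or newer (older versions have no window functions), and the name column and sample rows are made up.

```python
import sqlite3

# Assumption: the bundled SQLite is 3.25+ so that OVER (PARTITION BY ...) works.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Mytable (name TEXT, country TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO Mytable VALUES (?, ?, ?)",
    [
        ("Ann", "US", 30),
        ("Bob", "US", 40),
        ("Cid", "US", 50),
        ("Dee", "FR", 20),
        ("Eve", "FR", 60),
    ],
)

# The inner query keeps every row and adds the per-country average to it;
# the outer WHERE then filters on that extra column.
older_than_average = conn.execute(
    """
    SELECT mt.name, mt.age, mt.AvgAge
    FROM (
        SELECT *, AVG(age) OVER (PARTITION BY country) AS AvgAge
        FROM Mytable
    ) mt
    WHERE mt.age > mt.AvgAge
    ORDER BY mt.name
    """
).fetchall()

print(older_than_average)  # [('Cid', 50, 40.0), ('Eve', 60, 40.0)]
```

Unlike GROUP BY, the window function does not collapse the rows, which is why no join back to the table is needed.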

When is aliasing required when using SQL set theory clauses?

I just started learning SQL and am trying to learn from my mistakes. In one of my practice exercises, I had to find city names from the cities table that are not listed as capital cities in the countries table. Initially I tried the code below, but it yielded an error.
SELECT name
FROM cities
EXCEPT
SELECT capital
FROM countries
ORDER BY capital ASC;
The correct code is:
SELECT city.name
FROM cities AS city
EXCEPT
SELECT country.capital
FROM countries AS country
ORDER BY name;
Can someone explain to me why aliasing made all the difference here?
An ORDER BY for a UNION, EXCEPT or INTERSECT sorts the complete result. The column names of the overall query are defined by the first query. So this query:
SELECT name
FROM cities
EXCEPT
SELECT capital
FROM countries
returns a result with a single column named name.
Adding an order by is conceptually the same as:
select *
from (
SELECT name
FROM cities
EXCEPT
SELECT capital
FROM countries
) x
order by ....;
As the inner query only returns a single column name, that's the only column you can use in the order by.
The aliases that you used in your second query don't change the column name of the overall result which determines the column names available for the order by clause.
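You can inspect the column name of the combined result yourself. The sketch below uses Python's sqlite3 with made-up rows (both assumptions), and reads the result column name from the cursor. Note that SQLite is only used here to show the naming; engines differ in how strictly they check the ORDER BY of a set operation, so it is not a test of the original error.

```python
import sqlite3

# Hypothetical sample data, only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cities (name TEXT)")
conn.execute("CREATE TABLE countries (capital TEXT)")
conn.executemany(
    "INSERT INTO cities VALUES (?)",
    [("Paris",), ("Munich",), ("Berlin",), ("Lyon",)],
)
conn.executemany("INSERT INTO countries VALUES (?)", [("Paris",), ("Berlin",)])

cur = conn.execute(
    """
    SELECT name FROM cities
    EXCEPT
    SELECT capital FROM countries
    ORDER BY name
    """
)

# The result column takes its name from the first SELECT of the set operation.
column_names = [col[0] for col in cur.description]
rows = cur.fetchall()

print(column_names)  # ['name']
print(rows)          # [('Lyon',), ('Munich',)]
```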

How can I query to find max population of country from countries table?

I have a table "countries" with columns -> name,continent,area,population.
Let's say I want to find the name and population of the country with the highest population.
SELECT MAX(population) FROM countries;
The example above returns the maximum population.
I also want to see the name of the country with that population.
SELECT name,MAX(population) FROM countries;
I am getting the error like below.
ERROR: column "countries.name" must appear in the GROUP BY clause or be used in an aggregate function
I can't think of another way to do it.
Here is an example of my query.
SELECT name,population
FROM countries
WHERE population >= (
SELECT MAX(population)
FROM countries)
;
This query works, but I am also curious why I am getting the error, and whether there is a better way to accomplish this.
SELECT name, population
FROM countries
ORDER BY population DESC
LIMIT 1
MAX selects the maximum element from a list of values. In your first query,
SELECT MAX(population) FROM countries;
the list is formed by extracting the population field from all rows in countries, and then the maximum is selected. This collapses the list of rows down to a single row containing just the maximum.
In your second query,
SELECT name,MAX(population) FROM countries;
you (conceptually) get a list of all name fields from countries, but there's only one MAX(population). The database system doesn't know what to do with this: SELECT name FROM countries would return as many rows as there are in countries, but SELECT MAX(population) FROM countries would only return one row. This doesn't match up; it's unclear how many rows you want returned from this. This is why you get an error.
The error message says you need to either
use name in an aggregate function, which would collapse the list of names down to a single value that could be returned alongside the single MAX value, or
use a GROUP BY name clause, which would group the list of countries into entries with equal names first, then compute MAX(population) separately for each group. This makes no sense if all your countries have different names.
As far as I know there's no SQL syntax for "select the maximum population and then get the name field from the same row" (it's not quite clear what this would do anyway because there can be more than one country with a population equal to the maximum).
What you can do instead is sort the whole table, then select only fields from the first row:
SELECT name, population
FROM countries
ORDER BY population DESC
LIMIT 1
(I'm pretty sure Postgres optimizes this so there's no actual sort involved.)
Now if there is more than one country with the maximum population, you'll get an arbitrary one of them (we haven't told the database how to sort rows with equal population).
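To make the tie behaviour concrete, here is a minimal sketch with Python's sqlite3 and three invented countries, two of which share the maximum population. The subquery version returns every country at the maximum, while ORDER BY ... LIMIT 1 returns exactly one; adding name as a second sort key (my addition, not in the queries above) makes that one row predictable.

```python
import sqlite3

# Hypothetical sample data with a deliberate tie for the maximum.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE countries (name TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO countries VALUES (?, ?)",
    [("A", 100), ("B", 300), ("C", 300)],
)

# Subquery approach: every row whose population equals the maximum.
all_at_max = conn.execute(
    """
    SELECT name, population
    FROM countries
    WHERE population = (SELECT MAX(population) FROM countries)
    ORDER BY name
    """
).fetchall()

# Sort-and-limit approach: one row only; the tie-breaker on name is an
# assumption added here so that the result is deterministic.
top_one = conn.execute(
    """
    SELECT name, population
    FROM countries
    ORDER BY population DESC, name
    LIMIT 1
    """
).fetchall()

print(all_at_max)  # [('B', 300), ('C', 300)]
print(top_one)     # [('B', 300)]
```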
In SQL Server (TOP is not available in PostgreSQL), you can use the TOP keyword to select only a single record from the countries table.
SELECT Top 1 name,population
FROM countries
order by population DESC

GROUP BY and HAVING in SQL

If I write
SELECT continent FROM world GROUP BY continent HAVING sum(population) >= 100000000
it will return all continents that have a total sum over 100 million. But if I leave out the GROUP BY like so
SELECT continent FROM world HAVING sum(population) >= 100000000
it will only return one continent (in this case Asia).
Why is that?
When you don't have GROUP BY, aggregate functions like SUM() operate over the entire table, treating it all as one big group. That's why you just get one row of results.
When you use an aggregate function, it's not technically valid to return any columns in the SELECT list other than those in the GROUP BY clause, so your query isn't valid SQL. Some databases, such as MySQL, allow returning other columns as an extension; in that case, it selects the values from arbitrary rows in the group. And if there's no GROUP BY clause at all, the entire table is one group, so you get the continent column from some random row in the table.
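The "one big group" behaviour is easy to reproduce. The sketch below uses Python's sqlite3 and invented population figures (both assumptions). It runs the grouped HAVING query and, separately, the same SUM() with no GROUP BY; the ungrouped HAVING query itself is left out because databases disagree on whether to accept it.

```python
import sqlite3

# Hypothetical sample data, only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE world (name TEXT, continent TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO world VALUES (?, ?, ?)",
    [
        ("China", "Asia", 1400000000),
        ("India", "Asia", 1380000000),
        ("France", "Europe", 67000000),
        ("Germany", "Europe", 83000000),
        ("Australia", "Oceania", 26000000),
    ],
)

# With GROUP BY, SUM() is computed per continent and HAVING filters the groups.
per_continent = conn.execute(
    """
    SELECT continent
    FROM world
    GROUP BY continent
    HAVING SUM(population) >= 100000000
    ORDER BY continent
    """
).fetchall()

# Without GROUP BY, SUM() runs over the entire table: exactly one row.
whole_table = conn.execute("SELECT SUM(population) FROM world").fetchall()

print(per_continent)  # [('Asia',), ('Europe',)]
print(whole_table)    # [(2956000000,)]
```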

Beginning SQL How do I sum and pick columns in the same line?

I just started learning SQL on w3schools.com today, and want to make sure I'm on the right track.
I'm trying to solve this problem:
Write a SQL statement finding the combined population of the U.S. and Mexico (in this database).
I can't post the table here because of lack of reputation, but it is very simple. You are given that Mexico's country ID is 2 and U.S. id is 5.
The table has 4 columns: CITY_ID, NAME, COUNTRY_ID, and POPULATION. You do not know how many rows there are. So basically I need to add up the POPULATION values of the rows that have a country ID of '2' or '5'.
This is what I have so far:
-- this statement gives a result set with all the cities in Mexico and the U.S.
SELECT * FROM City
WHERE country_id='5'
OR country_id='2'
-- here, I don't know how to reference the result set
SELECT SUM(population) FROM result-set
The question also says to do it in one statement, is there a simpler way to do this? Thanks.
You put the SUM(population) expression directly into the query:
SELECT SUM(population) AS TotalPopulation
FROM City
WHERE country_id='5'
OR country_id='2'
Note that you can also write the x=a or x=b as x in (a,b), i.e.
WHERE country_id in (5,2)
(you can also drop the quotes if country_id is integer although it won't fail with them)
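If you want to convince yourself that the OR form and the IN form are the same, here is a small sketch using Python's sqlite3. The city rows and population figures are invented, and country_id is assumed to be an integer column.

```python
import sqlite3

# Hypothetical sample data; ids 2 (Mexico) and 5 (U.S.) follow the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE City (city_id INTEGER, name TEXT, country_id INTEGER, population INTEGER)"
)
conn.executemany(
    "INSERT INTO City VALUES (?, ?, ?, ?)",
    [
        (1, "Mexico City", 2, 9000000),
        (2, "Guadalajara", 2, 1500000),
        (3, "New York", 5, 8000000),
        (4, "Chicago", 5, 2700000),
        (5, "Toronto", 1, 2900000),
    ],
)

# SUM() with the two conditions combined by OR.
total_with_or = conn.execute(
    "SELECT SUM(population) AS TotalPopulation FROM City "
    "WHERE country_id = 5 OR country_id = 2"
).fetchone()[0]

# The same filter written with IN.
total_with_in = conn.execute(
    "SELECT SUM(population) AS TotalPopulation FROM City "
    "WHERE country_id IN (5, 2)"
).fetchone()[0]

print(total_with_or, total_with_in)  # 21200000 21200000
```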
SELECT SUM(population)
FROM City
WHERE country_id='5'
OR country_id='2'
You're nearly there.
SELECT Sum(population) FROM City
WHERE country_id=2
OR country_id=5
I assume the country id is an integer, so quotation marks are not needed.