Why does this correlated subquery work? (SQLZOO Select within Select 7) - sql

So you don't have to go searching for it, here is the data they present for the question set. The table is called world:
name continent area population gdp
Afghanistan Asia 652230 25500100 20343000000
Albania Europe 28748 2831741 12960000000
Algeria Africa 2381741 37100000 188681000000
Andorra Europe 468 78115 3712000000
Angola Africa 1246700 20609294 100990000000
They present an exercise where you use a query to select the largest country by area in each continent. They do most of it for you so getting to the answer isn't hard. This is the correct query:
SELECT continent, name, area FROM world x
WHERE area >= ALL
(SELECT area FROM world y
WHERE y.continent=x.continent
AND area>0)
I can understand what must be happening for it to work, but not why. y.continent = x.continent must be some sort of fancy GROUP BY, but... the lesson doesn't explain it and I'd really like to understand what's happening behind the scenes.

It's not a loop, or grouping. Let's picture the rowset aliased as x in the query:
name continent area population gdp
Afghanistan Asia 652230 25500100 20343000000
Albania Europe 28748 2831741 12960000000
Algeria Africa 2381741 37100000 188681000000
Andorra Europe 468 78115 3712000000
Angola Africa 1246700 20609294 100990000000
Now let's add an extra column that "contains" the subquery¹, with the outer x value substituted:
name continent area population gdp subquery
Afghanistan Asia 652230 25500100 20343000000 (select area FROM world y WHERE y.continent='Asia' AND area>0)
Albania Europe 28748 2831741 12960000000 (select area FROM world y WHERE y.continent='Europe' AND area>0)
Algeria Africa 2381741 37100000 188681000000 (select area FROM world y WHERE y.continent='Africa' AND area>0)
Andorra Europe 468 78115 3712000000 (select area FROM world y WHERE y.continent='Europe' AND area>0)
Angola Africa 1246700 20609294 100990000000 (select area FROM world y WHERE y.continent='Africa' AND area>0)
Let's represent those results that are returned by the subquery:
name continent area population gdp subquery
Afghanistan Asia 652230 25500100 20343000000 (652230)
Albania Europe 28748 2831741 12960000000 (28748,468)
Algeria Africa 2381741 37100000 188681000000 (2381741,1246700)
Andorra Europe 468 78115 3712000000 (28748,468)
Angola Africa 1246700 20609294 100990000000 (2381741,1246700)
Now, for each row, we compare our area column against each value returned by the subquery. That's what the ALL forces - the WHERE clause is only satisfied if all of those comparisons are true. And the nature of the comparison (>=) means that it's only true across all comparisons for the country with the largest area within each continent.
¹ Since it's a correlated subquery, it's effectively evaluated once per row, so I think it's reasonable to show what is evaluated on a per-row basis. Note that a naive implementation may in fact evaluate the subquery a row at a time, and so it will e.g. gather all of the areas within Europe (and Africa) twice whilst processing the entire outer query.
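The per-row picture above can be mimicked outside the database. This is only a rough Python sketch of the naive row-at-a-time evaluation, not how any real engine executes the query; the function name largest_per_continent is made up for illustration, and NULL handling is ignored because the sample rows contain no NULLs.

```python
# Sample rows from the question: (name, continent, area)
world = [
    ("Afghanistan", "Asia", 652230),
    ("Albania", "Europe", 28748),
    ("Algeria", "Africa", 2381741),
    ("Andorra", "Europe", 468),
    ("Angola", "Africa", 1246700),
]


def largest_per_continent(rows):
    """Hypothetical helper: naive emulation of the correlated subquery."""
    result = []
    for name, continent, area in rows:  # outer query: world x
        # Correlated subquery, re-evaluated with this row's continent plugged in.
        areas = [a for _, c, a in rows if c == continent and a > 0]
        # area >= ALL (...): every single comparison has to be true.
        if all(area >= a for a in areas):
            result.append((continent, name, area))
    return result


largest = largest_per_continent(world)
print(largest)
```

Each outer row gets its own list of areas, which is exactly the "subquery" column shown in the tables above.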

You simply want the subquery to return area values for a specific continent. In other words, you want to compare the area of a country with the areas of all countries on the same continent.
For example, for the second row you compare 28748 with all values in sequence (28748, 468) when you evaluate the condition. That sequence is returned by the subquery, and it considers the fact that you want to compare only with countries in Europe.
EDIT: you ask how the nested query does the group by. The answer is: it does not. Because the data has just one country per continent with the largest area, it may seem that we perform a group by. However, if we had different data:
name continent area population gdp
--------------------------------------------------------
Afghanistan Asia 652230 25500100 20343000000
Pakistan Asia 652230 2500100 2034300000
then we return both rows for one continent value, since they both satisfy the condition that you want a country with largest area in continent.
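The tie case can be checked with a small script. This sketch uses Python's sqlite3 module, and since SQLite does not support the ALL quantifier, the condition is written as an equivalent NOT EXISTS (equivalent only as long as no area is NULL, which is an assumption here). The two rows are the made-up ones from the table above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE world (name TEXT, continent TEXT, area INTEGER)")
conn.executemany(
    "INSERT INTO world VALUES (?, ?, ?)",
    [
        ("Afghanistan", "Asia", 652230),
        ("Pakistan", "Asia", 652230),  # same area on purpose, to force a tie
    ],
)

# "No country on my continent is strictly larger than me" stands in for
# "area >= ALL (areas on my continent)".
tied = conn.execute(
    """
    SELECT continent, name, area FROM world x
    WHERE NOT EXISTS (
        SELECT 1 FROM world y
        WHERE y.continent = x.continent AND y.area > x.area)
    ORDER BY name
    """
).fetchall()
print(tied)
```

Both rows come back for Asia, which a GROUP BY continent would never do.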

Related

How to show SUM() value in T-SQL with condition?

I am trying to solve third problem from this site https://sqlzoo.net/wiki/SUM_and_COUNT.
3. Give the total GDP of Africa:
Given relation to solve this:
name continent area population gdp
Afghanistan Asia 652230 25500100 20343000000
Albania Europe 28748 2831741 12960000000
Algeria Africa 2381741 37100000 188681000000
Andorra Europe 468 78115 3712000000
Angola Africa 1246700 20609294 100990000000
...
I wrote this:
SELECT SUM(gdp)
FROM world
GROUP BY continent = 'Africa'
It basically gives me 2 sums (Africa and the rest of the world).
SUM(gdp)
69762111000000
1811788000000
How do I show only the sum of gdp for Africa?
Add the where clause:
SELECT SUM(gdp) FROM world WHERE continent = 'Africa'
This way only the rows for Africa are included in the sum.
SELECT SUM(gdp)
FROM world
WHERE continent = 'Africa'
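To see why the GROUP BY version yields two sums, here is a small sketch using Python's sqlite3 module on the five sample rows only (so the totals differ from the SQLZOO output above). It assumes an engine where a comparison evaluates to 0 or 1, as SQLite and MySQL do; T-SQL would reject the GROUP BY expression as written.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE world (name TEXT, continent TEXT, gdp INTEGER)")
conn.executemany(
    "INSERT INTO world VALUES (?, ?, ?)",
    [
        ("Afghanistan", "Asia", 20343000000),
        ("Albania", "Europe", 12960000000),
        ("Algeria", "Africa", 188681000000),
        ("Andorra", "Europe", 3712000000),
        ("Angola", "Africa", 100990000000),
    ],
)

# Grouping by a boolean expression makes one group per distinct value:
# 0 (not Africa) and 1 (Africa), hence two sums.
grouped = conn.execute(
    "SELECT continent = 'Africa' AS is_africa, SUM(gdp) FROM world "
    "GROUP BY continent = 'Africa' ORDER BY is_africa"
).fetchall()

# WHERE filters the rows first, so only one sum is left.
africa_only = conn.execute(
    "SELECT SUM(gdp) FROM world WHERE continent = 'Africa'"
).fetchall()

print(grouped)
print(africa_only)
```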

SQL issue that I should be able to answer but I cannot

Here's the tiny bit of data I am to query:
name continent area population gdp
Afghani Asia 652230 25500100 20343000000
Albania Europe 28748 2831741 12960000000
Algeria Africa 2381741 37100000 188681000000
Andorra Europe 468 78115 3712000000
Angola Africa 1246700 20609294 100990000000
Given the above data, the request was to select two columns with France, Germany, Italy and their populations.
Here was my thought:
Select name, population
where name = 'France','Germany','Italy'
Where did I go wrong, if you would be so kind?
The = operator doesn't take multiple arguments. You're looking for the in operator. Additionally, you're missing a from clause:
SELECT name, population
FROM populations
WHERE name IN ('France', 'Germany', 'Italy')
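A quick way to try IN is with Python's sqlite3 module. This is a sketch on the five sample rows only; the table is called world here, which is an assumption, and because France, Germany and Italy are not in the sample, the list mixes two names that are present with one that is not.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE world (name TEXT, continent TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO world VALUES (?, ?, ?)",
    [
        ("Afghanistan", "Asia", 25500100),
        ("Albania", "Europe", 2831741),
        ("Algeria", "Africa", 37100000),
        ("Andorra", "Europe", 78115),
        ("Angola", "Africa", 20609294),
    ],
)

# IN matches a row when its name equals any item of the list.
with_in = conn.execute(
    "SELECT name, population FROM world "
    "WHERE name IN ('Albania', 'Andorra', 'France') ORDER BY name"
).fetchall()

# The same condition spelled out with = and OR.
with_or = conn.execute(
    "SELECT name, population FROM world "
    "WHERE name = 'Albania' OR name = 'Andorra' OR name = 'France' "
    "ORDER BY name"
).fetchall()

print(with_in)
```

Names in the list that match no row (France in this sample) simply contribute nothing to the result.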

A subquery with the ALL operator

I'm trying to write a SQL (Postgres engine) query that answers the following question:
Which countries have a GDP greater than every country in Asia? [Give the name only.] (Some countries may have NULL gdp values)
Below is an abbreviated SQL table containing sample data.
+-------------+-----------+---------+------------+--------------+
| name | continent | area | population | gdp |
+-------------+-----------+---------+------------+--------------+
| Afghanistan | Asia | 652230 | 25500100 | 20343000000 |
| Albania | Europe | 28748 | 2831741 | 12960000000 |
| Algeria | Africa | 2381741 | 37100000 | 188681000000 |
| Andorra | Europe | 468 | 78115 | 3712000000 |
| Angola | Africa | 1246700 | 20609294 | 100990000000 |
+-------------+-----------+---------+------------+--------------+
I wrote something like the following, which returns nothing (although I know this isn't the answer due to the online interactive guide I'm using):
SELECT name
FROM world
WHERE gdp > ALL (SELECT gdp from world WHERE continent = 'Asia')
AND continent<>'Asia'
What query should I use? I'm pretty new to SQL.
Strictly speaking, if there is a single Asian country in your table with an unknown (NULL) GDP, then the correct answer must be:
"We do not know."
Accordingly, this query returns no row:
SELECT name
FROM world
WHERE gdp > ALL (SELECT gdp from world WHERE continent = 'Asia')
AND continent <> 'Asia';
A NULL value in the set of Asian GDPs makes it impossible for the first WHERE expression to be TRUE, so the query can never return rows - which is the correct answer. There is no country that we know to fit the requirement. The query you had is correct.
The variant you squeezed into a comment comparing to max(gdp) answers a slightly different question:
Which countries have a GDP greater than the greatest known GDP in Asia?
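The difference between the two questions can be sketched in Python, with None standing in for NULL. The helper gt_all below is hypothetical, written only to imitate SQL's three-valued logic for "> ALL"; the second Asian value is an invented unknown GDP, not a row from the sample table.

```python
def gt_all(value, candidates):
    """Imitate SQL `value > ALL (candidates)`.

    Returns True, False, or None (unknown), with None playing NULL.
    """
    unknown = False
    for c in candidates:
        if value is None or c is None:
            unknown = True  # a comparison with NULL is unknown
        elif not value > c:
            return False  # one false comparison decides the whole result
    return None if unknown else True


# Afghanistan's gdp plus one hypothetical Asian country with unknown gdp.
asia_gdps = [20343000000, None]

algeria = gt_all(188681000000, asia_gdps)  # unknown, so WHERE drops the row
andorra = gt_all(3712000000, asia_gdps)    # false regardless of the NULL

# max() over the known values is what the max(gdp) variant compares against,
# because aggregate functions skip NULLs.
greatest_known = max(g for g in asia_gdps if g is not None)
algeria_vs_max = 188681000000 > greatest_known

print(algeria, andorra, algeria_vs_max)
```

A WHERE clause keeps a row only when the condition is TRUE, so "unknown" filters the row out just like FALSE does.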
It sounds like you want the sum of all Asian countries. If so, try something like this:
select w1.name
from world w1
where w1.gdp > (
select sum(coalesce(w2.gdp,0))
from world w2
where w2.continent='Asia' )
If not, and you want just the single Asian country with the highest GDP, replace sum with max. You don't need the additional continent <> 'Asia' because no Asian country can have a higher GDP than the returned value.
Based on @Tony's answer, it should be like this:
select w1.name
from world w1
where w1.gdp > (
select max(coalesce(w2.gdp,0))
from world w2
where w2.continent='Asia'
and w2.gdp is not null );
Why use max instead of sum? Because the question asked:
Which countries have a GDP greater than every country in Asia?
So finding the largest gdp in Asia will work and you don't have to compare with all of the countries in Asia.
I am sorry I can't comment on @Tony's answer since my reputation is under 50.

SQL GROUP BY and SUM

List the continents with total populations of at least 100 million.
World Table
name continent area population gdp
Afghanistan Asia 652230 25500100 20343000000
Albania Europe 28748 2831741 12960000000
Algeria Africa 2381741 37100000 188681000000
Andorra Europe 468 78115 3712000000
Angola Africa 1246700 20609294 10009000990
...
...
I started with
SELECT continent FROM world WHERE ... and kind of got stuck here.
Not sure how I can leverage GROUP BY and SUM. I need to GROUP BY continent and SUM(population) somehow, but I am still learning how to put things together.
expected output
continent
Africa
Asia
Eurasia
Europe
North America
South America
SELECT continent, SUM(population)
FROM world
GROUP BY continent
HAVING SUM(population) >= 100000000
I'll give you a good framework for thinking through this question.
Since there are multiple records with the same continent, we know we need GROUP BY. Once we do group by, we can use aggregate operations to get the sum, namely SUM. By using this aggregate operation, we can filter using the HAVING clause post group-by. If we wanted to filter pre-groupby, we would use the WHERE clause.
SELECT continent FROM world GROUP BY continent HAVING SUM(population) >= 100000000;
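Here is a small sketch of that framework using Python's sqlite3 module on the five sample rows. None of the sample continents reaches 100 million on its own, so the threshold is passed as a parameter and a lower value of 50 million is used purely for illustration; the helper name continents_at_least is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE world (name TEXT, continent TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO world VALUES (?, ?, ?)",
    [
        ("Afghanistan", "Asia", 25500100),
        ("Albania", "Europe", 2831741),
        ("Algeria", "Africa", 37100000),
        ("Andorra", "Europe", 78115),
        ("Angola", "Africa", 20609294),
    ],
)


def continents_at_least(threshold):
    """Hypothetical helper: continents whose total population >= threshold."""
    # GROUP BY builds one group per continent; HAVING filters the groups
    # after SUM has been computed for each of them.
    return conn.execute(
        "SELECT continent, SUM(population) FROM world "
        "GROUP BY continent HAVING SUM(population) >= ? "
        "ORDER BY continent",
        (threshold,),
    ).fetchall()


over_50m = continents_at_least(50_000_000)
over_100m = continents_at_least(100_000_000)
print(over_50m, over_100m)
```

On the full SQLZOO table the 100 million threshold would return the continents listed in the expected output above.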

Sams Teach Yourself SQL in 10 minutes - Question about GROUP BY

I read the book "Sams Teach Yourself SQL in 10 minutes, Third Edition" and in lesson 10 "Grouping Data", section "Creating Groups", I can't understand the following:
"Aside from the aggregate calculations statements, every column in your SELECT statement must be present in the GROUP BY clause."
Why? I tried this and I think that it is not true.
For example, consider a table 'World' with the columns 'continent', 'country', 'population'.
SELECT continent, country
FROM World
GROUP BY continent;
According to the book, this should lead to an error, right? But it doesn't. I can group my data by continent (so the results contain 7 continents) and next to each continent I get a random country name.
Like this
continent country
North America Canada
South America Brazil
Europe France
Africa Cameroon
Asia Japan
Australia New Zealand
Antarctica TuxLand
You are most probably using MySQL, which allows ungrouped and unaggregated expressions in the SELECT clause.
This is a violation of the standard, of course.
This is intended to simplify GROUP BY with joins on a PRIMARY KEY:
SELECT a.*, SUM(b.value)
FROM a
JOIN b
ON b.a_id = a.id
GROUP BY
a.id
Normally, you would either have to add all columns from a to the GROUP BY clause or use a subquery.
MySQL allows you not to do it since all values from a are guaranteed to be the same for a given value of the PRIMARY KEY (which is grouped on).
This is correct and should produce no error in some forms of SQL such as MySQL. You may optionally use the GROUP BY statement on more than one column but it's not required.
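For the columns that are not grouped, GROUP BY will list one value per group, often the first one it encounters, though which value you get is not guaranteed - so in your case, it would return one country/continent pair per continent.
MySQL allows this, using one field for the group by, as long as the ONLY_FULL_GROUP_BY mode is disabled. PostgreSQL rejects the query unless the grouped column is the table's primary key.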
The tutorial probably assumes you should use GROUP BY on all fields so from what you select, you don't lose any data - it would show every country/continent in the above example, but only once.
Here's an example table:
Continent | Country | Random_Field
---------------------------------------------
North America Canada Cake
North America Canada Dog
South America Brazil Cat
Europe France Frog
Africa Cameroon House
Asia Japan Gadget
Asia India Dance
Australia New Zealand Frodo
Antarctica TuxLand Linux
In your first statement:
SELECT continent, country
FROM World
GROUP BY continent;
The output would be:
Continent | Country
--------------------------
North America Canada
South America Brazil
Europe France
Africa Cameroon
Asia Japan
Australia New Zealand
Antarctica TuxLand
Notice one of the Asia rows was lost, despite being different.
Using a GROUP BY on both:
SELECT continent, country
FROM World
GROUP BY continent, country;
Would yield:
Continent | Country
-----------------------------
North America Canada
South America Brazil
Europe France
Africa Cameroon
Asia Japan
Asia India
Australia New Zealand
Antarctica TuxLand
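The two statements can be compared with Python's sqlite3 module. SQLite is used here only because it also accepts the ungrouped country column (like MySQL in its permissive mode); which country it picks per continent is arbitrary, so the sketch counts rows instead of relying on a particular pick. The rows are the example table from above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE World (continent TEXT, country TEXT, random_field TEXT)")
conn.executemany(
    "INSERT INTO World VALUES (?, ?, ?)",
    [
        ("North America", "Canada", "Cake"),
        ("North America", "Canada", "Dog"),
        ("South America", "Brazil", "Cat"),
        ("Europe", "France", "Frog"),
        ("Africa", "Cameroon", "House"),
        ("Asia", "Japan", "Gadget"),
        ("Asia", "India", "Dance"),
        ("Australia", "New Zealand", "Frodo"),
        ("Antarctica", "TuxLand", "Linux"),
    ],
)

# One row per continent; the country shown for Asia is an arbitrary one.
per_continent = conn.execute(
    "SELECT continent, country FROM World GROUP BY continent"
).fetchall()

# One row per distinct continent/country pair; nothing is lost.
per_pair = conn.execute(
    "SELECT continent, country FROM World "
    "GROUP BY continent, country ORDER BY continent, country"
).fetchall()

print(len(per_continent), len(per_pair))
```

Grouping on both columns keeps Japan and India as separate rows, while the two Canada rows still collapse into one.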