A subquery with the ALL operator - sql

I'm trying to write a SQL (Postgres engine) query that answers the following question:
Which countries have a GDP greater than every country in Asia? [Give the name only.] (Some countries may have NULL gdp values)
Below is an abbreviated SQL table containing sample data.
+-------------+-----------+---------+------------+--------------+
| name | continent | area | population | gdp |
+-------------+-----------+---------+------------+--------------+
| Afghanistan | Asia | 652230 | 25500100 | 20343000000 |
| Albania | Europe | 28748 | 2831741 | 12960000000 |
| Algeria | Africa | 2381741 | 37100000 | 188681000000 |
| Andorra | Europe | 468 | 78115 | 3712000000 |
| Angola | Africa | 1246700 | 20609294 | 100990000000 |
+-------------+-----------+---------+------------+--------------+
I wrote something like the following, which returns nothing (although I know this isn't the answer due to the online interactive guide I'm using):
SELECT name
FROM world
WHERE gdp > ALL (SELECT gdp from world WHERE continent = 'Asia')
AND continent<>'Asia'
What query should I use? I'm pretty new to SQL.

Strictly speaking, if there is a single Asian country in your table with an unknown (NULL) GDP, then the correct answer must be:
"We do not know."
Accordingly, this query returns no row:
SELECT name
FROM world
WHERE gdp > ALL (SELECT gdp from world WHERE continent = 'Asia')
AND continent <> 'Asia';
A NULL value in the set of Asian GDPs makes it impossible for the first WHERE expression to be TRUE: any comparison with NULL yields NULL (unknown), so gdp > ALL (...) can only be FALSE or NULL. The query can therefore never return rows, which is the correct answer. There is no country that we know to fit the requirement. The query you had is correct.
The variant you squeezed into a comment comparing to max(gdp) answers a slightly different question:
Which countries have a GDP greater than the greatest known GDP in Asia?
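To make the three-valued logic concrete, here is a minimal Python sketch of how `x > ALL (subquery)` behaves. The function name `gt_all` is made up for illustration, `None` stands in for NULL, and the Asian row with an unknown GDP is a hypothetical addition to the sample data.

```python
from typing import Iterable, Optional


def gt_all(x: Optional[float], values: Iterable[Optional[float]]) -> Optional[bool]:
    """Model SQL's `x > ALL (values)`. None plays the role of NULL (unknown)."""
    unknown = False
    for v in values:
        if x is None or v is None:
            unknown = True      # comparing with NULL gives unknown, keep looking
        elif not x > v:
            return False        # one definite failure makes the whole test false
    # No definite failure: unknown if any comparison was unknown, else true.
    return None if unknown else True


afghanistan_gdp = 20343000000
algeria_gdp = 188681000000
albania_gdp = 12960000000

# Hypothetical: a second Asian country whose GDP is NULL.
asian_gdps = [afghanistan_gdp, None]

print(gt_all(algeria_gdp, asian_gdps))         # None: unknown, so WHERE drops the row
print(gt_all(albania_gdp, asian_gdps))         # False: smaller than a known Asian GDP
print(gt_all(algeria_gdp, [afghanistan_gdp]))  # True once the NULL is gone
```

A WHERE clause keeps a row only when the condition is true, so both unknown and false results filter the row out, which is why the original query returns nothing.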

It sounds like you want to compare against the combined GDP of all Asian countries. If so, try something like this:
select w1.name
from world w1
where w1.gdp > (
select sum(coalesce(w2.gdp,0))
from world w2
where w2.continent='Asia' )
If not, and you want just the single Asian country with the highest GDP, replace sum with max. You don't need the additional continent <> 'Asia' because no Asian country can have a higher GDP than the returned value.

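Based on @Tony's answer, it should be like this:
select w1.name
from world w1
where w1.gdp > (
select max(w2.gdp)
from world w2
where w2.continent='Asia'
-- max() already skips NULLs; this filter only makes that explicit.
-- It has to sit inside the subquery, because w2 is not visible outside it.
and w2.gdp is not null );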
Why use max instead of sum? Because the question asked:
Which countries have a GDP greater than every country in Asia?
So finding the largest gdp in Asia will work and you don't have to compare with all of the countries in Asia.
I am sorry I can't comment on @Tony's answer, since my reputation is under 50.
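For anyone who wants to try the MAX variant locally, here is a small sketch using Python's built-in sqlite3 module rather than Postgres. The table holds only the sample rows from the question plus one hypothetical Asian row with a NULL gdp (labelled as such), so the output says nothing about the real dataset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE world (name TEXT, continent TEXT, gdp INTEGER)")
conn.executemany(
    "INSERT INTO world VALUES (?, ?, ?)",
    [
        ("Afghanistan", "Asia", 20343000000),
        ("Asia-X (hypothetical)", "Asia", None),  # made-up row with unknown GDP
        ("Albania", "Europe", 12960000000),
        ("Algeria", "Africa", 188681000000),
        ("Andorra", "Europe", 3712000000),
        ("Angola", "Africa", 100990000000),
    ],
)

# MAX() skips NULLs, so the unknown GDP does not affect the result.
max_asia = conn.execute(
    "SELECT MAX(gdp) FROM world WHERE continent = 'Asia'"
).fetchone()[0]

names = [
    row[0]
    for row in conn.execute(
        """
        SELECT name
        FROM world
        WHERE gdp > (SELECT MAX(gdp) FROM world WHERE continent = 'Asia')
        ORDER BY name
        """
    )
]

print(max_asia)  # 20343000000
print(names)     # ['Algeria', 'Angola']
```

Note that this answers "greater than the greatest known GDP in Asia"; the NULL row is silently ignored rather than treated as unknown.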

Related

Change column values to column headers in Postgres SQL replacing with values from another column

I am unable to figure out how to turn one column's values into column headers and fill them with the matching values from another column.
Say I have a Postgres database with the following table:
Name Subject Score Region
======= ========= ======= =======
Joe Chemistry 20 America
Robert Math 30 Europe
Jason Physics 50 Europe
Joe Math 70 America
Robert Physics 80 Europe
Jason Math 40 Europe
Jason Chemistry 60 Europe
I want to select/fetch data in the following form:
Name Chemistry Math Physics Region
======= ========== ======= ======== ========
Joe 20 70 null America
Robert null 30 80 Europe
Jason 60 40 50 Europe
Consider that there are 80 subjects. How do I write an SQL select statement that returns data in this format?
In Postgres, I recommend using the FILTER syntax for conditional aggregation:
SELECT name,
MAX(score) FILTER (WHERE subject = 'Chemistry') AS Chemistry,
MAX(score) FILTER (WHERE subject = 'Math') AS Math,
MAX(score) FILTER (WHERE subject = 'Physics') AS Physics,
region
FROM grades
GROUP BY name, region;
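With 80 subjects, writing one aggregate per subject by hand gets tedious, so one option is to generate the column list from the data. The sketch below does that in Python with the built-in sqlite3 module; it uses MAX(CASE WHEN ... END), which is the portable equivalent of the FILTER form, and the helper `quote_ident` is a name I made up. Treat it as an illustration of the idea, not a Postgres-specific recipe.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE grades (name TEXT, subject TEXT, score INTEGER, region TEXT)"
)
conn.executemany(
    "INSERT INTO grades VALUES (?, ?, ?, ?)",
    [
        ("Joe", "Chemistry", 20, "America"),
        ("Robert", "Math", 30, "Europe"),
        ("Jason", "Physics", 50, "Europe"),
        ("Joe", "Math", 70, "America"),
        ("Robert", "Physics", 80, "Europe"),
        ("Jason", "Math", 40, "Europe"),
        ("Jason", "Chemistry", 60, "Europe"),
    ],
)


def quote_ident(identifier: str) -> str:
    """Quote a value so it can be used safely as a column alias."""
    return '"' + identifier.replace('"', '""') + '"'


# Step 1: discover the subjects that should become columns.
subjects = [
    row[0]
    for row in conn.execute("SELECT DISTINCT subject FROM grades ORDER BY subject")
]

# Step 2: build one conditional aggregate per subject.
# The subject value is bound as a parameter; only the alias is interpolated.
columns = ", ".join(
    f"MAX(CASE WHEN subject = ? THEN score END) AS {quote_ident(s)}"
    for s in subjects
)
sql = f"SELECT name, {columns}, region FROM grades GROUP BY name, region ORDER BY name"

rows = conn.execute(sql, subjects).fetchall()
for row in rows:
    print(row)
# ('Jason', 60, 40, 50, 'Europe')
# ('Joe', 20, 70, None, 'America')
# ('Robert', None, 30, 80, 'Europe')
```

The same two-step approach (query the distinct subjects, then build the statement) should carry over to Postgres with the FILTER syntax, although the result's column set is still fixed at the moment the statement is built.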

How to assign equal revenue weight to every location of a company in a table? Google Big Query

I am working on a problem where I have the following table:
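+------------+---------+---------------+
| company_id | country | total revenue |
+------------+---------+---------------+
| 1 | Russia | 1200 |
| 2 | Croatia | 1200 |
| 2 | Italy | 1200 |
| 3 | USA | 1200 |
| 3 | UK | 1200 |
| 3 | Italy | 1200 |
+------------+---------+---------------+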
There are 3 companies in this table, but company '2' and company '3' have offices in 2 and 3 countries respectively. All companies pay 1200 per month, and because company 2 has 2 offices it shows as if they paid 1200 per month 2 times, and because company 3 has 3 offices it shows as if it paid 1200 per month 3 times. Instead, I would like revenue to be equally distributed based on how many times company_id appears in the table. company_id will only appear more than once for every additional country in which a company is based.
Assuming each company always pays 1,200 per month, my desired output is:
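+------------+---------+---------------+
| company_id | country | total revenue |
+------------+---------+---------------+
| 1 | Russia | 1200 |
| 2 | Croatia | 600 |
| 2 | Italy | 600 |
| 3 | USA | 400 |
| 3 | UK | 400 |
| 3 | Italy | 400 |
+------------+---------+---------------+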
Being new to SQL, I was thinking this could maybe be done with a CASE WHEN statement, but I have only learned to use CASE WHEN to output a string depending on a condition. Here, I am trying to assign an equal share of the revenue to each of a company's countries, depending on how many countries the company is based in.
Thank you in advance for your help!
The following is for BigQuery Standard SQL:
#standardSQL
SELECT company_id, country,
total_revenue / (COUNT(1) OVER(PARTITION BY company_id)) AS total_revenue
FROM `project.dataset.table`
Applied to the sample data from your question, the output is:
Row company_id country total_revenue
1 1 Russia 1200.0
2 2 Croatia 600.0
3 2 Italy 600.0
4 3 USA 400.0
5 3 UK 400.0
6 3 Italy 400.0
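If you want to try the window-function idea without a BigQuery project, here is a rough equivalent using Python's built-in sqlite3 module (window functions need SQLite 3.25 or newer). The table name `revenue` is my own choice, and the `* 1.0` is there because SQLite would otherwise do integer division, unlike BigQuery's `/`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE revenue (company_id INTEGER, country TEXT, total_revenue INTEGER)"
)
conn.executemany(
    "INSERT INTO revenue VALUES (?, ?, ?)",
    [
        (1, "Russia", 1200),
        (2, "Croatia", 1200),
        (2, "Italy", 1200),
        (3, "USA", 1200),
        (3, "UK", 1200),
        (3, "Italy", 1200),
    ],
)

# COUNT(*) OVER (PARTITION BY company_id) attaches to every row the number of
# rows sharing its company_id, without collapsing the rows as GROUP BY would.
rows = conn.execute(
    """
    SELECT company_id,
           country,
           total_revenue * 1.0 / COUNT(*) OVER (PARTITION BY company_id)
               AS total_revenue
    FROM revenue
    ORDER BY company_id, country
    """
).fetchall()

for row in rows:
    print(row)
# (1, 'Russia', 1200.0)
# (2, 'Croatia', 600.0)
# (2, 'Italy', 600.0)
# (3, 'Italy', 400.0)
# (3, 'UK', 400.0)
# (3, 'USA', 400.0)
```

No CASE WHEN is needed: the divisor is computed per company by the window, so the same expression works for any number of offices.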

Why does this correlated subquery work? (SQLZOO Select within Select 7)

So you don't have to go searching out for it, the data they're presenting for the question set looks like this and the table is called world
name continent area population gdp
Afghanistan Asia 652230 25500100 20343000000
Albania Europe 28748 2831741 12960000000
Algeria Africa 2381741 37100000 188681000000
Andorra Europe 468 78115 3712000000
Angola Africa 1246700 20609294 100990000000
They present an exercise where you use a query to select the largest country by area in each continent. They do most of it for you so getting to the answer isn't hard. This is the correct query:
SELECT continent, name, area FROM world x
WHERE area >= ALL
(SELECT area FROM world y
WHERE y.continent=x.continent
AND area>0)
I can understand what must be happening for it to work, but not why. y.continent = x.continent must be some sort of fancy GROUP BY, but the lesson doesn't explain it and I'd really like to understand what's happening behind the scenes.
It's not a loop, or grouping. Let's picture the rowset aliased as x in the query:
name continent area population gdp
Afghanistan Asia 652230 25500100 20343000000
Albania Europe 28748 2831741 12960000000
Algeria Africa 2381741 37100000 188681000000
Andorra Europe 468 78115 3712000000
Angola Africa 1246700 20609294 100990000000
Now let's add an extra column that "contains" the subquery [1], with the outer x value substituted:
name continent area population gdp subquery
Afghanistan Asia 652230 25500100 20343000000 (select area FROM world y WHERE y.continent='Asia' AND area>0)
Albania Europe 28748 2831741 12960000000 (select area FROM world y WHERE y.continent='Europe' AND area>0)
Algeria Africa 2381741 37100000 188681000000 (select area FROM world y WHERE y.continent='Africa' AND area>0)
Andorra Europe 468 78115 3712000000 (select area FROM world y WHERE y.continent='Europe' AND area>0)
Angola Africa 1246700 20609294 100990000000 (select area FROM world y WHERE y.continent='Africa' AND area>0)
Let's represent those results that are returned by the subquery:
name continent area population gdp subquery
Afghanistan Asia 652230 25500100 20343000000 (652230)
Albania Europe 28748 2831741 12960000000 (28748,468)
Algeria Africa 2381741 37100000 188681000000 (2381741,1246700)
Andorra Europe 468 78115 3712000000 (28748,468)
Angola Africa 1246700 20609294 100990000000 (2381741,1246700)
Now, for each row, we compare our area column against each value returned by the subquery. That's what the ALL forces: the WHERE clause is only satisfied if all of those comparisons are true. And the nature of the comparison (>=) means that it's only true across all comparisons for the country with the largest area within each continent.
[1] Since it's a correlated subquery, it's effectively evaluated once per row, so I think it's reasonable to show what is evaluated on a per-row basis. Note that a naive implementation may in fact evaluate the subquery a row at a time, and so it will, for example, gather all of the areas within Europe (and Africa) twice whilst processing the entire outer query.
You simply want the subquery to return area values for a specific continent. In other words, you want to compare the area of a country with area of all countries being on the same continent.
For example, for the second row you compare 28748 with all values in sequence (28748, 468) when you evaluate the condition. That sequence is returned by the subquery, and it considers the fact that you want to compare only with countries in Europe.
EDIT: You ask how the nested query does the GROUP BY. The answer is: it does not. Because the data has just one country per continent with the largest area, it may seem that a GROUP BY is performed. However, with different data:
name continent area population gdp
--------------------------------------------------------
Afghanistan Asia 652230 25500100 20343000000
Pakistan Asia 652230 2500100 2034300000
then both rows are returned for that one continent value, since both satisfy the condition of having the largest area on the continent.
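The per-row picture above can be written out as a short Python sketch. This is only a mental model of the semantics, not how a database engine actually executes the query, and the function name `largest_per_continent` is made up. The Pakistan row is the illustrative tie from the example above.

```python
world = [
    ("Afghanistan", "Asia", 652230),
    ("Albania", "Europe", 28748),
    ("Algeria", "Africa", 2381741),
    ("Andorra", "Europe", 468),
    ("Angola", "Africa", 1246700),
]


def largest_per_continent(rows):
    """Mimic: WHERE area >= ALL (SELECT area FROM world y
                                 WHERE y.continent = x.continent AND area > 0)"""
    result = []
    for name, continent, area in rows:          # each outer row x
        # The subquery, re-evaluated with x.continent substituted in.
        areas = [a for _, c, a in rows if c == continent and a > 0]
        # ALL: every single comparison must be true for the row to survive.
        if all(area >= a for a in areas):
            result.append((continent, name, area))
    return result


print(largest_per_continent(world))
# [('Asia', 'Afghanistan', 652230), ('Europe', 'Albania', 28748),
#  ('Africa', 'Algeria', 2381741)]

# With a tie, both rows pass the test, which a GROUP BY would not give you.
tied = world + [("Pakistan", "Asia", 652230)]
print([name for continent, name, _ in largest_per_continent(tied) if continent == "Asia"])
# ['Afghanistan', 'Pakistan']
```

Nothing is grouped here: every outer row is tested on its own against the areas from its own continent, and the "one row per continent" effect falls out of the >= comparison.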

SQL GROUP BY and SUM

List the continents with total populations of at least 100 million.
World Table
name continent area population gdp
Afghanistan Asia 652230 25500100 20343000000
Albania Europe 28748 2831741 12960000000
Algeria Africa 2381741 37100000 188681000000
Andorra Europe 468 78115 3712000000
Angola Africa 1246700 20609294 10009000990
...
...
I started with
SELECT continent FROM world WHERE ... and kind of got stuck here.
Not sure how I can leverage GROUP BY and SUM. I need to GROUP BY continent and SUM(population) somehow, but I am still learning how to put things together.
expected output
continent
Africa
Asia
Eurasia
Europe
North America
South America
SELECT continent, SUM(population)
FROM world
GROUP BY continent
HAVING SUM(population) >= 100000000
I'll give you a good framework for thinking through this question.
Since there are multiple records with the same continent, we know we need GROUP BY. Once we group, we can use an aggregate function, namely SUM, to get the total per continent. Because the condition is on that aggregate, we filter with the HAVING clause, which is applied after grouping. If we wanted to filter individual rows before grouping, we would use the WHERE clause. "At least 100 million" means the comparison is >=, not >:
SELECT continent FROM world GROUP BY continent HAVING SUM(population) >= 100000000;
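Here is a small runnable sketch of the WHERE versus HAVING difference, using Python's built-in sqlite3 module and only the five sample rows shown in the question. Since none of those five rows add up to 100 million per continent, the threshold is lowered to 50 million purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE world (name TEXT, continent TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO world VALUES (?, ?, ?)",
    [
        ("Afghanistan", "Asia", 25500100),
        ("Albania", "Europe", 2831741),
        ("Algeria", "Africa", 37100000),
        ("Andorra", "Europe", 78115),
        ("Angola", "Africa", 20609294),
    ],
)

threshold = 50_000_000  # illustrative; the exercise itself uses 100000000

# HAVING filters groups: it sees the per-continent total.
by_total = conn.execute(
    """
    SELECT continent, SUM(population)
    FROM world
    GROUP BY continent
    HAVING SUM(population) >= ?
    ORDER BY continent
    """,
    (threshold,),
).fetchall()

# WHERE filters rows before grouping: it sees each country's own population.
by_row = conn.execute(
    """
    SELECT continent
    FROM world
    WHERE population >= ?
    GROUP BY continent
    """,
    (threshold,),
).fetchall()

print(by_total)  # [('Africa', 57709294)]
print(by_row)    # []
```

Africa qualifies only because Algeria and Angola are summed first; no single country in the sample reaches the threshold, so the WHERE version returns nothing.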

Sams Teach Yourself SQL in 10 minutes - Question about GROUP BY

I read the book "Sams Teach Yourself SQL in 10 Minutes, Third Edition", and in Lesson 10, "Grouping Data", section "Creating Groups", I can't understand the following:
"Aside from the aggregate calculations statements, every column in your SELECT statement must be present in the GROUP BY clause."
Why? I tried this and I think that it is not true.
For example, consider a table 'World' with the columns 'continent', 'country', 'population'.
SELECT continent, country
FROM World
GROUP BY continent;
According to the book, this should lead to an error, right? But it doesn't. I can group my data by continent (so the result has 7 continents) and next to each continent I get a random country name.
Like this
continent country
North America Canada
South America Brazil
Europe France
Africa Cameroon
Asia Japan
Australia New Zealand
Antarctica TuxLand
You are most probably using MySQL, which (when the ONLY_FULL_GROUP_BY SQL mode is disabled) allows ungrouped and unaggregated expressions in the SELECT clause.
This deviates from what standard SQL allows in general.
It is intended to simplify GROUP BY with joins on a PRIMARY KEY:
SELECT a.*, SUM(b.value)
FROM a
JOIN b
ON b.a_id = a.id
GROUP BY
a.id
Normally, you would have either to add all columns from a into the GROUP BY clause or use a subquery.
MySQL allows you not to do it since all values from a are guaranteed to be the same for a given value of the PRIMARY KEY (which is grouped on).
This is correct and should produce no error in some forms of SQL such as MySQL. You may optionally use the GROUP BY statement on more than one column but it's not required.
For the columns that are not in the GROUP BY, you get the value from one arbitrary row of each group; it is often the first one encountered, but that is not guaranteed. So in your case, it would return one country/continent pair per continent.
MySQL (in its permissive mode) allows this with a single field in the GROUP BY. PostgreSQL does not: it raises an error unless the ungrouped columns are functionally dependent on a grouped primary key.
The tutorial probably assumes you should use GROUP BY on all selected fields so that you don't lose any data: it would show every country/continent pair in the above example, but only once.
Here's an example table:
Continent | Country | Random_Field
---------------------------------------------
North America Canada Cake
North America Canada Dog
South America Brazil Cat
Europe France Frog
Africa Cameroon House
Asia Japan Gadget
Asia India Dance
Australia New Zealand Frodo
Antarctica TuxLand Linux
In your first statement:
SELECT continent, country
FROM World
GROUP BY continent;
The output would be:
Continent | Country
--------------------------
North America Canada
South America Brazil
Europe France
Africa Cameroon
Asia Japan
Australia New Zealand
Antarctica TuxLand
Notice one of the Asia rows was lost, despite being different.
Using a GROUP BY on both:
SELECT continent, country
FROM World
GROUP BY continent, country;
Would yield:
Continent | Country
-----------------------------
North America Canada
South America Brazil
Europe France
Africa Cameroon
Asia Japan
Asia India
Australia New Zealand
Antarctica TuxLand
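The difference is easy to reproduce with Python's built-in sqlite3 module, since SQLite happens to accept the ungrouped column much like MySQL's permissive mode does (PostgreSQL would reject the first query). This is a sketch on the example table above; which Asian country the first query shows is up to the engine, so the code only checks row counts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE World (continent TEXT, country TEXT, random_field TEXT)")
conn.executemany(
    "INSERT INTO World VALUES (?, ?, ?)",
    [
        ("North America", "Canada", "Cake"),
        ("North America", "Canada", "Dog"),
        ("South America", "Brazil", "Cat"),
        ("Europe", "France", "Frog"),
        ("Africa", "Cameroon", "House"),
        ("Asia", "Japan", "Gadget"),
        ("Asia", "India", "Dance"),
        ("Australia", "New Zealand", "Frodo"),
        ("Antarctica", "TuxLand", "Linux"),
    ],
)

# One row per continent; country comes from an arbitrary row of each group.
one_column = conn.execute(
    "SELECT continent, country FROM World GROUP BY continent"
).fetchall()

# One row per distinct (continent, country) pair; nothing is lost.
both_columns = conn.execute(
    "SELECT continent, country FROM World GROUP BY continent, country"
).fetchall()

asia_pairs = sorted(country for continent, country in both_columns if continent == "Asia")

print(len(one_column))    # 7
print(len(both_columns))  # 8
print(asia_pairs)         # ['India', 'Japan']
```

Grouping on both columns is what the book's rule leads to: every selected column is either grouped or aggregated, so the result no longer depends on which row the engine happens to pick.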