Sams Teach Yourself SQL in 10 minutes - Question about GROUP BY - sql

I am reading the book "Sams Teach Yourself SQL in 10 Minutes, Third Edition", and in Lesson 10, "Grouping Data", section "Creating Groups", I can't understand the following:
"Aside from the aggregate calculations statements, every column in your SELECT statement must be present in the GROUP BY clause."
Why? I tried this, and I don't think it is true.
For example, consider a table 'World' with the columns 'continent', 'country', 'population'.
SELECT continent, country
FROM World
GROUP BY continent;
According to the book, this should lead to an error, right? But it doesn't. The data is grouped by continent (so the result has 7 continents), and next to each continent is a seemingly random country name, like this:
continent     | country
--------------------------
North America | Canada
South America | Brazil
Europe        | France
Africa        | Cameroon
Asia          | Japan
Australia     | New Zealand
Antarctica    | TuxLand

You are most probably using MySQL, which (when the ONLY_FULL_GROUP_BY SQL mode is disabled) allows ungrouped and unaggregated expressions in the SELECT clause.
This is a violation of the standard, of course.
It is intended to simplify GROUP BY with joins on a PRIMARY KEY:
SELECT a.*, SUM(b.value)
FROM a
JOIN b
ON b.a_id = a.id
GROUP BY
a.id
Normally, you would have to either add all columns from a to the GROUP BY clause or use a subquery.
MySQL lets you skip this, since all values from a are guaranteed to be the same for a given value of the PRIMARY KEY (which is what you group on).
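For comparison, here is a minimal sketch of the subquery alternative mentioned above. It assumes Python's built-in sqlite3 module purely so the SQL can be run anywhere; the tables a(id, name) and b(a_id, value) and their rows are hypothetical stand-ins for the tables in the query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and data, only for illustration.
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE b (a_id INTEGER REFERENCES a(id), value INTEGER);
    INSERT INTO a VALUES (1, 'first'), (2, 'second');
    INSERT INTO b VALUES (1, 10), (1, 5), (2, 7);
""")

# Aggregate b on its own, then join the totals back to a.
# No column of a has to appear in a GROUP BY clause.
portable_rows = conn.execute("""
    SELECT a.id, a.name, t.total
    FROM a
    JOIN (SELECT a_id, SUM(value) AS total
          FROM b
          GROUP BY a_id) AS t
      ON t.a_id = a.id
    ORDER BY a.id
""").fetchall()

print(portable_rows)
```

Written this way, the query should not depend on any MySQL-specific leniency, at the cost of being a little more verbose.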

This is correct and should produce no error in some forms of SQL such as MySQL. You may optionally use the GROUP BY statement on more than one column but it's not required.

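In MySQL, GROUP BY returns one row per group, and for a column that is neither grouped nor aggregated it picks the value from some row in the group. In practice this is often the first one encountered, but that is not guaranteed. In your case, it returns one country per continent.
MySQL allows this with a single field in the GROUP BY. PostgreSQL accepts an ungrouped column only when it is functionally dependent on the grouped columns (for example, when you group by the table's primary key), so it would reject this query.
The tutorial probably assumes you should GROUP BY every field you select so that you don't lose any data: in the example above, that would show every country/continent pair, but only once.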
Here's an example table:
Continent     | Country     | Random_Field
---------------------------------------------
North America | Canada      | Cake
North America | Canada      | Dog
South America | Brazil      | Cat
Europe        | France      | Frog
Africa        | Cameroon    | House
Asia          | Japan       | Gadget
Asia          | India       | Dance
Australia     | New Zealand | Frodo
Antarctica    | TuxLand     | Linux
In your first statement:
SELECT continent, country
FROM World
GROUP BY continent;
The output would be:
Continent     | Country
--------------------------
North America | Canada
South America | Brazil
Europe        | France
Africa        | Cameroon
Asia          | Japan
Australia     | New Zealand
Antarctica    | TuxLand
Notice one of the Asia rows was lost, despite being different.
Using a GROUP BY on both:
SELECT continent, country
FROM World
GROUP BY continent, country;
Would yield:
Continent     | Country
-----------------------------
North America | Canada
South America | Brazil
Europe        | France
Africa        | Cameroon
Asia          | Japan
Asia          | India
Australia     | New Zealand
Antarctica    | TuxLand
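If you want one row per continent without leaving the choice of country to the database, one option is to aggregate the country column explicitly. The sketch below assumes Python's built-in sqlite3 module (chosen only so the example is runnable) and loads the example table from this answer; MIN(country) is just one arbitrary way to pick a representative country.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE World (continent TEXT, country TEXT, random_field TEXT)")
conn.executemany(
    "INSERT INTO World VALUES (?, ?, ?)",
    [
        ("North America", "Canada", "Cake"),
        ("North America", "Canada", "Dog"),
        ("South America", "Brazil", "Cat"),
        ("Europe", "France", "Frog"),
        ("Africa", "Cameroon", "House"),
        ("Asia", "Japan", "Gadget"),
        ("Asia", "India", "Dance"),
        ("Australia", "New Zealand", "Frodo"),
        ("Antarctica", "TuxLand", "Linux"),
    ],
)

# Every selected column is either grouped (continent) or aggregated,
# so the result does not depend on which row the engine happens to pick.
summary = conn.execute("""
    SELECT continent,
           COUNT(DISTINCT country) AS countries,
           MIN(country)            AS first_country
    FROM World
    GROUP BY continent
    ORDER BY continent
""").fetchall()

for row in summary:
    print(row)
```

Here Asia reports two distinct countries, which makes the row that the single-column GROUP BY silently dropped visible again.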

Related

How to find IF single rows meet a criteria ELSE aggregate multiple rows within a group

I have some accounting data where I need to select a single row within a group if it meets a dollar amount criteria OR if it does not I need to sum/combine multiple rows in that group to see if that group meets the criteria. Example data:
Continent     | Region | Sales Amount
---------------------------------------
South America | North  | $300
South America | South  | $100
South America | West   | $500
South America | East   | $200
North America | North  | $100
North America | South  | $50
North America | West   | $50
North America | East   | $400
Europe        | North  | $100
Europe        | South  | $200
Europe        | West   | $100
Europe        | East   | $100
Asia          | North  | $75
Asia          | South  | $100
Asia          | West   | $100
Asia          | East   | $100
Africa        | North  | $500
Africa        | South  | $700
Africa        | West   | $100
Africa        | East   | $100
In the above example, I want to find all continents that have a single region/row with at least $500 in sales, OR continents where 2 or more regions can be combined to reach the $500 amount. My expected result would be:
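Continent | Region_1               | Region_2       | Sales Amount_1         | Sales Amount_2
-----------------------------------------------------------------------------------------------
Canada    | West                   | not applicable | $500                   |
USA       | North,East             | not applicable | $500                   |
Europe    | North,South,West,East  | not applicable | $500                   |
Asia      | does not meet criteria | not applicable | does not meet criteria |
Africa    | South                  | North          | $700                   | $500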
Region_2 is only applicable if more than one region within a continent meets the sales amount criteria of $500 on its own.

Why does this correlated subquery work? (SQLZOO Select within Select 7)

So you don't have to go searching out for it, the data they're presenting for the question set looks like this and the table is called world
name continent area population gdp
Afghanistan Asia 652230 25500100 20343000000
Albania Europe 28748 2831741 12960000000
Algeria Africa 2381741 37100000 188681000000
Andorra Europe 468 78115 3712000000
Angola Africa 1246700 20609294 100990000000
They present an exercise where you use a query to select the largest country by area in each continent. They do most of it for you so getting to the answer isn't hard. This is the correct query:
SELECT continent, name, area FROM world x
WHERE area >= ALL
(SELECT area FROM world y
WHERE y.continent=x.continent
AND area>0)
I can understand what must be happening for it to work, but not why. y.continent = x.continent must be some sort of fancy GROUP BY, but... the lesson doesn't explain it and I'd really like to understand what's happening behind the scenes.
It's not a loop, or grouping. Let's picture the rowset that is aliased as x in the query:
name continent area population gdp
Afghanistan Asia 652230 25500100 20343000000
Albania Europe 28748 2831741 12960000000
Algeria Africa 2381741 37100000 188681000000
Andorra Europe 468 78115 3712000000
Angola Africa 1246700 20609294 100990000000
Now let's add an extra column that "contains" the subquery [1], with the outer x value substituted:
name continent area population gdp subquery
Afghanistan Asia 652230 25500100 20343000000 (select area FROM world y WHERE y.continent='Asia' AND area>0)
Albania Europe 28748 2831741 12960000000 (select area FROM world y WHERE y.continent='Europe' AND area>0)
Algeria Africa 2381741 37100000 188681000000 (select area FROM world y WHERE y.continent='Africa' AND area>0)
Andorra Europe 468 78115 3712000000 (select area FROM world y WHERE y.continent='Europe' AND area>0)
Angola Africa 1246700 20609294 100990000000 (select area FROM world y WHERE y.continent='Africa' AND area>0)
Let's represent those results that are returned by the subquery:
name continent area population gdp subquery
Afghanistan Asia 652230 25500100 20343000000 (652230)
Albania Europe 28748 2831741 12960000000 (28748,468)
Algeria Africa 2381741 37100000 188681000000 (2381741,1246700)
Andorra Europe 468 78115 3712000000 (28748,468)
Angola Africa 1246700 20609294 100990000000 (2381741,1246700)
Now, for each row, we compare our area column against each value returned by the subquery. That's what the ALL forces: the WHERE clause is only satisfied if all of those comparisons are true. And the nature of the comparison (>=) means that it's only true across all comparisons for the country with the largest area within each continent.
[1] Since it's a correlated subquery, it's effectively evaluated once per row, so I think it's reasonable to show what is evaluated on a per-row basis. Note that a naive implementation may in fact evaluate the subquery a row at a time, and so it will e.g. gather all of the areas within Europe (and Africa) twice whilst processing the entire outer query.
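To make the per-row evaluation concrete, here is a rough simulation in plain Python rather than SQL. It is only a sketch of the naive strategy described above, not how any particular database executes the query; the function name largest_per_continent is made up for the illustration, and only the name, continent and area columns of the sample rows are kept.

```python
# Sample rows from the question: (name, continent, area).
world = [
    ("Afghanistan", "Asia", 652230),
    ("Albania", "Europe", 28748),
    ("Algeria", "Africa", 2381741),
    ("Andorra", "Europe", 468),
    ("Angola", "Africa", 1246700),
]


def largest_per_continent(rows):
    """Mimic: WHERE area >= ALL (SELECT area FROM world y
                                 WHERE y.continent = x.continent AND area > 0)"""
    result = []
    for name, continent, area in rows:  # one pass per outer row (alias x)
        # The correlated subquery: areas of rows on the same continent (alias y).
        areas = [a for _, c, a in rows if c == continent and a > 0]
        # ALL: the comparison must hold against every value returned.
        if all(area >= other for other in areas):
            result.append((continent, name, area))
    return result


largest = largest_per_continent(world)
for row in largest:
    print(row)
```

With the five sample rows this keeps Afghanistan, Albania and Algeria, one per continent, even though nothing in the code groups anything.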
You simply want the subquery to return area values for a specific continent. In other words, you want to compare the area of a country with area of all countries being on the same continent.
For example, for the second row you compare 28748 with all values in sequence (28748, 468) when you evaluate the condition. That sequence is returned by the subquery, and it considers the fact that you want to compare only with countries in Europe.
EDIT: you ask how the nested query does the group by. The answer is: it does not. Because the data has just one country with the largest area per continent, it may seem that we perform a group by. However, if we have different data:
name continent area population gdp
--------------------------------------------------------
Afghanistan Asia 652230 25500100 20343000000
Pakistan Asia 652230 2500100 2034300000
then we return both rows for one continent value, since they both satisfy the condition that you want a country with largest area in continent.
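The tie case can be checked with a small runnable sketch. It assumes Python's built-in sqlite3 module; as far as I know SQLite does not accept the ALL quantifier, so the sketch swaps in a correlated MAX subquery, which behaves the same way for this data (no NULL areas). The Albania and Andorra rows are carried over from the question so that a second continent is present.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE world (name TEXT, continent TEXT, area INTEGER)")
conn.executemany(
    "INSERT INTO world VALUES (?, ?, ?)",
    [
        ("Afghanistan", "Asia", 652230),
        ("Pakistan", "Asia", 652230),  # same area as in the example above
        ("Albania", "Europe", 28748),
        ("Andorra", "Europe", 468),
    ],
)

# Stand-in for "area >= ALL (...)": compare against the continent's maximum.
tied_rows = conn.execute("""
    SELECT continent, name, area
    FROM world x
    WHERE area >= (SELECT MAX(y.area)
                   FROM world y
                   WHERE y.continent = x.continent
                     AND y.area > 0)
    ORDER BY name
""").fetchall()

for row in tied_rows:
    print(row)
```

Asia contributes two rows here, which a GROUP BY continent could never do.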

SQL issue that I should be able to answer but I cannot

Here's the tiny bit of data I am to query:
name continent area population gdp
Afghani Asia 652230 25500100 20343000000
Albania Europe 28748 2831741 12960000000
Algeria Africa 2381741 37100000 188681000000
Andorra Europe 468 78115 3712000000
Angola Africa 1246700 20609294 100990000000
Given the above data, the request was to select two columns with France, Germany, Italy and their populations.
Here was my thought:
Select name, population
where name = 'France','Germany','Italy'
Where was any screw-up, if you would be so kind.
The = operator doesn't take multiple arguments. You're looking for the IN operator. Additionally, you're missing a FROM clause:
SELECT name, population
FROM populations
WHERE name IN ('France', 'Germany', 'Italy')
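A quick way to try this out is sketched below, assuming Python's built-in sqlite3 module. The table name world and the population figures for France, Germany and Italy are placeholders made up for the illustration (the sample data in the question does not include those rows).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE world (name TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO world VALUES (?, ?)",
    [
        ("Albania", 2831741),   # from the question's sample data
        ("France", 65000000),   # placeholder figure, not real data
        ("Germany", 80000000),  # placeholder figure, not real data
        ("Italy", 60000000),    # placeholder figure, not real data
    ],
)

# IN matches a row when name equals any value in the list.
in_rows = conn.execute("""
    SELECT name, population
    FROM world
    WHERE name IN ('France', 'Germany', 'Italy')
    ORDER BY name
""").fetchall()

print(in_rows)
```

The IN list is shorthand for name = 'France' OR name = 'Germany' OR name = 'Italy'.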

SQL GROUP BY and SUM

List the continents with total populations of at least 100 million.
World Table
name continent area population gdp
Afghanistan Asia 652230 25500100 20343000000
Albania Europe 28748 2831741 12960000000
Algeria Africa 2381741 37100000 188681000000
Andorra Europe 468 78115 3712000000
Angola Africa 1246700 20609294 10009000990
...
...
I started with
SELECT continent FROM world WHERE ... and kind of got stuck here.
Not sure how I can leverage GROUP BY and SUM. I need to GROUP BY continent and
SUM(population) somehow, but I am still learning how to put things together.
expected output
continent
Africa
Asia
Eurasia
Europe
North America
South America
SELECT continent, SUM(population)
FROM world
GROUP BY continent
HAVING SUM(population) >= 100000000
I'll give you a good framework for thinking through this question.
Since there are multiple records with the same continent, we know we need GROUP BY. Once we do group by, we can use aggregate operations to get the sum, namely SUM. By using this aggregate operation, we can filter using the HAVING clause post group-by. If we wanted to filter pre-groupby, we would use the WHERE clause.
SELECT continent FROM world GROUP BY continent HAVING SUM(population) >= 100000000;
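The difference between filtering before and after grouping can be seen in a small sketch, assuming Python's built-in sqlite3 module and only the five sample rows shown in the question. Because those five rows do not add up to 100 million for any continent, the sketch uses a hypothetical threshold of 25 million instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE world (name TEXT, continent TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO world VALUES (?, ?, ?)",
    [
        ("Afghanistan", "Asia", 25500100),
        ("Albania", "Europe", 2831741),
        ("Algeria", "Africa", 37100000),
        ("Andorra", "Europe", 78115),
        ("Angola", "Africa", 20609294),
    ],
)

THRESHOLD = 25000000  # hypothetical cut-off chosen for the tiny sample

# HAVING filters groups: the condition is checked against each continent's total.
having_rows = conn.execute("""
    SELECT continent, SUM(population)
    FROM world
    GROUP BY continent
    HAVING SUM(population) >= ?
    ORDER BY continent
""", (THRESHOLD,)).fetchall()

# WHERE filters rows: small countries are dropped before any summing happens.
where_rows = conn.execute("""
    SELECT continent, SUM(population)
    FROM world
    WHERE population >= ?
    GROUP BY continent
    ORDER BY continent
""", (THRESHOLD,)).fetchall()

print(having_rows)
print(where_rows)
```

Both queries list Africa and Asia, but the Africa total differs: the WHERE version has already discarded Angola before summing.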

Same-table Tree Table Query in SQL Server

I've searched but found nothing that could help.
I have the following table in a SQL Server 2005 database:
Parent Child Value
---- -------- ---------
America Mexico 8
America Canada 1
Asia Japan 5
Asia Korea 7
Europe Spain 0
Europe Italy 2
Africa Zimbabwe 1
Mexico Baja California 0
America USA 3
USA California 1
USA Texas 2
Parent and Child together form the primary key; Value is not important (IMO). I would like to create a view that results in something like this:
Parent Child Value
---- -------- ---------
America USA 3
USA California 1
USA Texas 2
I would search for America, and the result will give back every nested child there is, recursively, no matter how many it has, since I could include cities, localities, etc.
What I need is similar to what some call a BOM explosion.
Here is how you can do it:
with cte as (
    select parent, child
    from t
    union all
    select cte.parent, t.child
    from cte
    join t
      on cte.child = t.parent
)
select cte.*
from cte
where parent = 'America';
Here is a small SQL Fiddle example.
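For anyone who wants to run the recursion locally, here is a sketch using Python's built-in sqlite3 module instead of SQL Server (an assumption made only for convenience). SQLite spells the construct WITH RECURSIVE, whereas SQL Server uses plain WITH as in the query above; the table name t and the rows are taken from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (parent TEXT, child TEXT, value INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        ("America", "Mexico", 8), ("America", "Canada", 1),
        ("Asia", "Japan", 5), ("Asia", "Korea", 7),
        ("Europe", "Spain", 0), ("Europe", "Italy", 2),
        ("Africa", "Zimbabwe", 1), ("Mexico", "Baja California", 0),
        ("America", "USA", 3), ("USA", "California", 1), ("USA", "Texas", 2),
    ],
)

# The anchor member lists every direct parent/child pair; the recursive
# member keeps the original ancestor and walks one level further down.
descendants = conn.execute("""
    WITH RECURSIVE cte(parent, child) AS (
        SELECT parent, child FROM t
        UNION ALL
        SELECT cte.parent, t.child
        FROM cte
        JOIN t ON cte.child = t.parent
    )
    SELECT parent, child
    FROM cte
    WHERE parent = 'America'
    ORDER BY child
""").fetchall()

for row in descendants:
    print(row)
```

Note that every returned row carries the searched ancestor ('America') in the parent column rather than the immediate parent.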