Needing Clarity on SQL Join Query - sql

Having some trouble understanding this query, particularly the WHERE in the subquery. I don't really get what it is accomplishing. Any help would be appreciated. Thanks
# Find the largest country (by area) in each continent. Show the continent,
# name, and area.
SELECT continent, name, area
FROM countries AS a
WHERE area = (
SELECT MAX(area)
FROM countries AS b
WHERE a.continent = b.continent
)

Consider the following subset of the countries data:
Continent      Country  Area
North America  USA      3718691
North America  Canada   3855081
North America  Mexico   761602
Europe         France   211208
Europe         Germany  137846
Europe         UK       94525
Europe         Italy    116305
This is a correlated query that behaves as follows:
1. Reads the first row returned by the outer query (North America, USA, 3718691).
2. Runs the subquery, which is correlated on a.continent (North America) and returns 3855081, the maximum area in North America.
3. Evaluates the WHERE equality, which checks whether 3855081 matches the area on the row being processed.
4. It doesn't match, so the next row of the outer query is read and we start over at step 1, this time working on the second row.
5. Repeats for all rows in the outer query.
When we're looking at rows 2 and 4 (Canada and France), the check in step 3 will match, so those rows will be returned by the query.
You can check the results by using this data in your countries table and running the query.
Note that this is a very poor way to determine the country with the maximum area per continent because it repeats the subquery for every country. Using my sample data, it determines the maximum area for North America 3 times and the maximum area for Europe 4 times.
Since you asked in your comment, I would write this query as follows:
SELECT a.continent, a.name, a.area
FROM countries AS a
INNER JOIN (SELECT continent, MAX(area) AS max_area
            FROM countries
            GROUP BY continent) AS b ON a.continent = b.continent
WHERE a.area = b.max_area
In this version of the query, the maximum for each continent is only determined once. The original query was written to illustrate correlated queries and it's important to understand them. Correlated queries can often be used to resolve complex logic.
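To check that the two versions agree, here is a minimal sketch that loads the sample rows into an in-memory SQLite database through Python's sqlite3 module and runs both queries. SQLite is an assumption made only so the example is self-contained; the ORDER BY clauses were added so the two result lists can be compared directly.

```python
import sqlite3

# In-memory database holding the sample rows from the answer.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE countries (continent TEXT, name TEXT, area INTEGER)")
conn.executemany(
    "INSERT INTO countries VALUES (?, ?, ?)",
    [
        ("North America", "USA", 3718691),
        ("North America", "Canada", 3855081),
        ("North America", "Mexico", 761602),
        ("Europe", "France", 211208),
        ("Europe", "Germany", 137846),
        ("Europe", "UK", 94525),
        ("Europe", "Italy", 116305),
    ],
)

# Correlated version: the subquery is evaluated against each outer row's continent.
correlated_sql = """
SELECT continent, name, area
FROM countries AS a
WHERE area = (
    SELECT MAX(area)
    FROM countries AS b
    WHERE a.continent = b.continent
)
ORDER BY name
"""

# Join version: the per-continent maximum is computed once in a derived table.
join_sql = """
SELECT a.continent, a.name, a.area
FROM countries AS a
INNER JOIN (SELECT continent, MAX(area) AS max_area
            FROM countries
            GROUP BY continent) AS b ON a.continent = b.continent
WHERE a.area = b.max_area
ORDER BY a.name
"""

correlated_rows = conn.execute(correlated_sql).fetchall()
join_rows = conn.execute(join_sql).fetchall()

print(correlated_rows)
print(correlated_rows == join_rows)
```

Both queries return the Canada and France rows for this data.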

The subquery is finding the maximum area for countries. Which countries? All countries that match the continent of the country in the outer query.
So, for each country it gets the area of the largest country on the same continent.
The WHERE clause then says "are the two areas the same -- the maximum area and the area of this country?". It chooses only countries that have the maximum area.

Related

When a statement contains an item in a list, show it in a new column

I would appreciate a little help with a script in SQL. I have a list like the one below and a database table, Table1, with Statement as a column name. I would like to create a column called location, where the script searches the Statement column and, once it finds any of the items in the list in a row, states that item in the location column.
(Tema, london, Sydney, Germany, China, Africa)
Statement
-------------------
Going to london
Apples in Tema
Sydney is a city
China is a country
Africa is a continent
In the end I hope to see a table like this:
Statement              location
Going to london        London
Apples in Tema         Tema
Sydney is a city       Sydney
china is a country     China
Africa is a continent  Africa
By using this script,
SELECT Statement,
Case
WHEN Statement::text ~~* '%london%'::character varying::text
THEN 'london'::character varying
ELSE NULL::character varying
END AS location
FROM Table1
I think I would have to write a very long script this way, so I was wondering if I could get help with something efficient and simple to achieve this.
If you have a list of places, you can join against it:
select t1.*, v.place
from table1 t1 left join
     (values ('tema'), ('london'), ('sydney'), ('germany'), ('china'), ('africa')
     ) v(place)
     on t1.Statement::text ilike '%' || v.place || '%';
Note: You might want to use regular expressions so you can include word boundaries, but your example code doesn't do this.
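The same idea can be tried out without a PostgreSQL server. The sketch below uses SQLite through Python's sqlite3 module, which is an assumption made for portability: SQLite has no ILIKE, but its LIKE is case-insensitive for ASCII text by default, and the list of places is supplied through a CTE instead of a VALUES alias. The row 'Pears in Paris' is hypothetical and only shows what happens when nothing matches.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (statement TEXT)")
conn.executemany(
    "INSERT INTO table1 VALUES (?)",
    [
        ("Going to london",),
        ("Apples in Tema",),
        ("Sydney is a city",),
        ("China is a country",),
        ("Africa is a continent",),
        ("Pears in Paris",),  # hypothetical row with no matching place
    ],
)

# The places live in a CTE, so adding a place means adding one value,
# not another WHEN branch in a CASE expression.
sql = """
WITH places(place) AS (
    VALUES ('tema'), ('london'), ('sydney'), ('germany'), ('china'), ('africa')
)
SELECT t1.statement, p.place
FROM table1 AS t1
LEFT JOIN places AS p
       ON t1.statement LIKE '%' || p.place || '%'
ORDER BY t1.statement
"""

location_rows = conn.execute(sql).fetchall()
for statement, place in location_rows:
    print(statement, "->", place)
```

Because this is a LEFT JOIN, a statement that mentions none of the places is kept with a NULL location, and a statement that mentions two places would appear twice.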

SQL (COUNT(*) / locations.area)

We are learning SQL at school, and my professor has this sql code in his documents.
SELECT wp.city, (COUNT(*) / locations.area) AS population_density
FROM world_poulation AS wp
INNER JOIN location
ON wp.city = locations.city
WHERE locations.state = “Hessen”
GROUP BY wp.city, locations.area
Everything is almost clear for me, just the aggregate function with /locations.area doesn't make any sense to me. Can anybody help?
Thank you in advance!
Look at what the query is grouped on, that tells you what each group consists of. In this case, each group is a city, and contains all the rows that have the same value for wp.city (and as the location table is joined on that value too, the locations.area is only included in the grouping so that it can be used in the result).
So each group has a number of rows, and the COUNT(*) aggregate will contain the number of rows for each group. The value of (COUNT(*) / locations.area) will be the number of rows in the group divided by the value of locations.area for that group.
If you had data like this:
world_population
name      city
--------- ---------
John      London
Peter     London
Sarah     London
Malcolm   London
Ian       Cardiff
Johanna   Stockholm
Sven      Stockholm
Egil      Stockholm
locations
city        state           area
----------- --------------- -----
London      Hessen          2
Cardiff     Somewhere else  14
Stockholm   Hessen          1
Then you would get a result with two groups (as Cardiff is not in the state Hessen). One group has four people from London which has the area 2, so the population density would be 2. The other group has three people from Stockholm which has the area 1, so the population density would be 3.
Side note: There is a typo in the query, as it joins in the table location but refers to it as locations everywhere else.
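The walkthrough above can be reproduced with a small sketch that loads the same sample data into an in-memory SQLite database via Python's sqlite3 module. SQLite is an assumption made for the sake of a runnable example, the table names use the corrected spellings, and the extra population column is added only to make the division visible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE world_population (name TEXT, city TEXT);
CREATE TABLE locations (city TEXT, state TEXT, area INTEGER);

INSERT INTO world_population VALUES
    ('John', 'London'), ('Peter', 'London'), ('Sarah', 'London'),
    ('Malcolm', 'London'), ('Ian', 'Cardiff'), ('Johanna', 'Stockholm'),
    ('Sven', 'Stockholm'), ('Egil', 'Stockholm');

INSERT INTO locations VALUES
    ('London', 'Hessen', 2),
    ('Cardiff', 'Somewhere else', 14),
    ('Stockholm', 'Hessen', 1);
""")

# One group per city; COUNT(*) is the number of people rows in that group.
# Both operands are integers here, so SQLite performs integer division.
sql = """
SELECT wp.city,
       COUNT(*) AS population,
       locations.area,
       COUNT(*) / locations.area AS population_density
FROM world_population AS wp
INNER JOIN locations ON wp.city = locations.city
WHERE locations.state = 'Hessen'
GROUP BY wp.city, locations.area
ORDER BY wp.city
"""

density_rows = conn.execute(sql).fetchall()
for row in density_rows:
    print(row)
```

Cardiff is filtered out by the WHERE clause, leaving London (4 people over area 2) and Stockholm (3 people over area 1).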
Try writing it like:
SELECT wp.city,
       locations.area,
       COUNT(*) AS population,
       (COUNT(*) / locations.area) AS population_density
FROM world_population AS wp
INNER JOIN locations
    ON wp.city = locations.city
WHERE locations.state = 'Hessen'
GROUP BY wp.city, locations.area
The key is the GROUP BY clause. You are showing pairs of cities and areas. The COUNT(*) is the number of times a given pair shows up in the table you created by joining world population and locations. The area is just a number, so you can divide the COUNT by the area.

How to design table relationship where the foreign key can mean "all rows", "some rows" or "one row"?

I hope you can help me with this. I've used pseudocode to keep everything simple.
I have a table which describes locations.
location_table
location = charfield(200) # New York, London, Tokyo
A product manager now wants locations to be as follows:
Global = select every location
Asia = select every location in Asia
US = select every location in US
Current system = London (etc.)
This is my proposed redesign.
location_table
location = charfield(200) # New York, London, Tokyo
continent = foreign key to continent_table
continent_table
continent = charfield(50) # "None", "Global", Asia, Europe
But this seems horrible. It means in my code I'll always need to check if the customer is using "global" or "none", and then select the corresponding location records. For example, there will be code like this scattered everywhere:
get continent
if continent is global, select everything from location_table
else if continent is none, select location from location_table
else select location from location_table where foreign key is continent
My feeling is this is a known problem, and there is a known solution for it. Any ideas?
Thank you.
What you seem to have here is a set of locations, and then a set of location groups. Those groups might be all of the locations (global), or a subset of them.
You can build this with an intermediate table between the locations and a new location sets table which associates locations and location sets.
You might build the location set table and the join table so that the individual locations are also location sets, but ones which join only to one location. That way all location selections come from one table -- the location sets.
So you end up with three different types of location set:
Ones which map 1:1 with a location
One which maps 1:all ("global")
Ones which map 1:many (continents and other areas)
It's conceivable that this could be created as a hierarchy, but those queries can be inefficient because the join cardinalities tend to be obscured from the optimiser.
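A minimal sketch of the location-set design described in this answer, using SQLite through Python's sqlite3 module. The table and column names (location_set, location_set_member, and so on) and the helper locations_in are hypothetical, chosen only for illustration; the locations are the ones from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE location (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE location_set (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
-- Join table: which locations belong to which set.
CREATE TABLE location_set_member (
    set_id      INTEGER NOT NULL REFERENCES location_set(id),
    location_id INTEGER NOT NULL REFERENCES location(id),
    PRIMARY KEY (set_id, location_id)
);

INSERT INTO location VALUES (1, 'New York'), (2, 'London'), (3, 'Tokyo');

-- Three kinds of set: 1:all (Global), 1:many (Asia, US), 1:1 (single cities).
INSERT INTO location_set VALUES
    (1, 'Global'), (2, 'Asia'), (3, 'US'),
    (4, 'New York'), (5, 'London'), (6, 'Tokyo');

INSERT INTO location_set_member VALUES
    (1, 1), (1, 2), (1, 3),
    (2, 3),
    (3, 1),
    (4, 1), (5, 2), (6, 3);
""")


def locations_in(set_name):
    """Return the location names in a set; the query is the same for every kind of set."""
    rows = conn.execute(
        """
        SELECT l.name
        FROM location_set AS s
        JOIN location_set_member AS m ON m.set_id = s.id
        JOIN location AS l ON l.id = m.location_id
        WHERE s.name = ?
        ORDER BY l.name
        """,
        (set_name,),
    ).fetchall()
    return [name for (name,) in rows]


print(locations_in("Global"))
print(locations_in("Asia"))
print(locations_in("London"))
```

With this layout the application code never branches on "global" versus "none": it always selects a set and joins through to its locations.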
You could do this using a hierarchy, and a self referencing foreign key, e.g.
LocationID  Name           ParentLocationID  LocationType
------------------------------------------------------------------
1           Planet Earth   NULL              Planet
2           Africa         1                 Continent
3           Antarctica     1                 Continent
4           Asia           1                 Continent
5           Australasia    1                 Continent
6           Europe         1                 Continent
7           North America  1                 Continent
8           South America  1                 Continent
9           United States  7                 Country
10          Canada         7                 Country
11          Mexico         7                 Country
12          California     9                 State
13          San Diego      12                City
14          England        6                 Country
15          Cornwall       14                County
16          Truro          15                City
Hierarchical data usually requires either recursion or multiple joins to get all levels; this answer contains links to articles comparing performance on the major DBMS.
Many DBMS now support recursive common table expressions. Since no DBMS is specified, I will use SQL Server syntax because it is what I am most comfortable with. A quick example would be:
DECLARE @LocationID INT = 7; -- NORTH AMERICA
WITH LocationCTE AS
(   SELECT l.LocationID, l.Name, l.ParentLocationID, l.LocationType
    FROM dbo.Location AS l
    WHERE l.LocationID = @LocationID
    UNION ALL
    SELECT l.LocationID, l.Name, l.ParentLocationID, l.LocationType
    FROM dbo.Location AS l
    INNER JOIN LocationCTE AS c
        ON c.LocationID = l.ParentLocationID
)
SELECT *
FROM LocationCTE;
Output based on above sample data
LocationID  Name           ParentLocationID  LocationType
-----------------------------------------------------------------
7           North America  1                 Continent
9           United States  7                 Country
10          Canada         7                 Country
11          Mexico         7                 Country
12          California     9                 State
13          San Diego      12                City
Supplying a value of 1 (Planet Earth) for the location ID will return the full table, or supplying a locationID of 11 (Mexico) would only return this one row, because there is nothing smaller than this in the sample data.
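For readers without SQL Server at hand, here is a sketch of the same recursive query against SQLite via Python's sqlite3 module. SQLite is an assumption made for portability; its syntax differs slightly (WITH RECURSIVE, and a bound parameter instead of a declared variable). The helper name subtree is hypothetical, and the final ORDER BY is added so the output order is predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Location ("
    "LocationID INTEGER PRIMARY KEY, Name TEXT, "
    "ParentLocationID INTEGER, LocationType TEXT)"
)
conn.executemany(
    "INSERT INTO Location VALUES (?, ?, ?, ?)",
    [
        (1, "Planet Earth", None, "Planet"),
        (2, "Africa", 1, "Continent"),
        (3, "Antarctica", 1, "Continent"),
        (4, "Asia", 1, "Continent"),
        (5, "Australasia", 1, "Continent"),
        (6, "Europe", 1, "Continent"),
        (7, "North America", 1, "Continent"),
        (8, "South America", 1, "Continent"),
        (9, "United States", 7, "Country"),
        (10, "Canada", 7, "Country"),
        (11, "Mexico", 7, "Country"),
        (12, "California", 9, "State"),
        (13, "San Diego", 12, "City"),
        (14, "England", 6, "Country"),
        (15, "Cornwall", 14, "County"),
        (16, "Truro", 15, "City"),
    ],
)


def subtree(location_id):
    """Return (LocationID, Name) for a location and everything beneath it."""
    sql = """
    WITH RECURSIVE LocationCTE AS (
        -- Anchor member: the starting location.
        SELECT LocationID, Name, ParentLocationID, LocationType
        FROM Location
        WHERE LocationID = ?
        UNION ALL
        -- Recursive member: children of rows already found.
        SELECT l.LocationID, l.Name, l.ParentLocationID, l.LocationType
        FROM Location AS l
        INNER JOIN LocationCTE AS c ON c.LocationID = l.ParentLocationID
    )
    SELECT LocationID, Name FROM LocationCTE ORDER BY LocationID
    """
    return conn.execute(sql, (location_id,)).fetchall()


print(subtree(7))   # North America and its descendants
print(subtree(11))  # Mexico only
```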
I'll go with your approach and say that I don't find it all that horrible to check, every time, whether a customer is searching by city, by continent, or by nothing. That is the role of the backend code, and it would always lead to different queries depending on which option they choose.
But I would remove "None" and "Global" from the continent table and just use other queries for those cases. You would end up with the 3 possible SQL queries you have, and I don't find that to be bad design per se. Other solutions may be more performant, but this one seems more readable and logical. It's just optional querying with join tables.
The other answers trade performance or duplication for readability (which isn't a bad thing, depending on how many times you will rely on this condition in your application, in how many queries you'll use it, and how many cities you have).
For readability and non-repetition, the best thing would be to concentrate these conditions in one SQL function which takes a string parameter and returns all locations depending on the input (but at the cost of performance).
Use levels:
0 -> None
00 -> Global
001 -> Europe
002 -> Asia
003 -> Africa
select location from location_table where continent like '[value]%'
Using a fixed length code, you can prefix regions, and then add one more digit for a region inside a region, and so on.
Ok, let me try to improve it.
Consider the world, it has the minimum level (or maximum depending on how you see it)
World ID = '0' (1 digit)
Now, select how you want to divide the world: (Continents, Half-Continents, ...) and assign the next level.
Europe ID = '01' (First digit World + Second digit Europe)
Asia ID = '02'
America ID = '03'
...
Next Level: Countries. (At least 2 digits)
England ID = '0101' (World + Continent + Country)
Deutschland ID = '0102'
....
Texas ID = '0301'
....
Next Level: Regions (2 digits)
Yorkshire ID = '010101' (World + Continent + Country + Region)
....
Next Level: Cities (2 or 3 digits)
London ID = '01010101' (World + Continent + Country + Region + City)
And so on.
Now, the same SELECT some_aggregate, statistics, ... FROM ... can be used for any region; simply change the WHERE clause:
WHERE Region like '0%' --> The whole world
WHERE Region like '02%' --> Asia
WHERE Region like '01010101%' --> London
WHERE Region like '02%' OR Region like '01%' --> Asia & Europe
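A small sketch of the prefix-code idea, run against SQLite through Python's sqlite3 module (an assumption made so the example is self-contained). The table layout and the helper names_for_prefixes are hypothetical; the codes are the ones listed above. Note that combining two regions needs OR, since no single code can start with both prefixes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE location_table (location TEXT, region TEXT)")
conn.executemany(
    "INSERT INTO location_table VALUES (?, ?)",
    [
        ("World", "0"),
        ("Europe", "01"),
        ("Asia", "02"),
        ("America", "03"),
        ("England", "0101"),
        ("Deutschland", "0102"),
        ("Texas", "0301"),
        ("Yorkshire", "010101"),
        ("London", "01010101"),
    ],
)


def names_for_prefixes(*prefixes):
    """Return locations whose region code starts with any of the given prefixes."""
    # One "region LIKE ?" test per prefix, combined with OR.
    where = " OR ".join("region LIKE ?" for _ in prefixes)
    params = [prefix + "%" for prefix in prefixes]
    rows = conn.execute(
        f"SELECT location FROM location_table WHERE {where} ORDER BY region",
        params,
    ).fetchall()
    return [name for (name,) in rows]


print(names_for_prefixes("0"))           # the whole world
print(names_for_prefixes("01010101"))    # London
print(names_for_prefixes("02", "01"))    # Asia and Europe
```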

Oracle REGEXP_SUBSTR for string matching b/w two columns

The problem
Users are frequently inputting "country name" strings into the "city name" field. Heuristically, this appears to be an extremely common practice. For example, a user might put "TAIPEI TAIWAN" in the city name when only "TAIPEI" should be input and then the country would be "TAIWAN". I am working to aggregate these instances for this specific field (your help will allow me to expand this to other columns and tables) and then identify where possible rankings associated with strictly the "country" names in the "city" field.
I have two tables that I am attempting to leverage to track down data validation issues. Tbl1 is named "Customer_Address" and comprises geographic columns like (Customer_Num, Address, City_Name, State, Country_Code, Zipcode). Tbl2, named "HR_Countries", is a clean table of 2-letter ISO country codes with their corresponding name values (Lebanon, Taiwan, China, Syria, Russia, Ukraine, etc.) and some other fields not presently used.
The initial step is to query "Customer_Address" to find City_Names LIKE a series of OR statements (LIKE '%CHINA', OR LIKE 'TAIWAN', OR etc etc) and count the number of occurrences where the City_Name is like the designated country_name string I passed it and the results are pretty good. I've coded in some exclusions to deal with things like "Lebanon, OH" so my overall results are satisfactory for the first phase.
Part of the query does a LEFT join from Tbl1 to Tbl2 to add the risk rating from tbl2 as a result of the query against tbl1:
LEFT JOIN tbl2 risk
ON INSTR(addr.CITY_NM, risk.COUNTRY_NAME,1) <> 0
Example of Tbl1 Data Output (head(tbl1), n=7)
CountryNameInCity   CountOfOccurences  RR
China               15                 High
Taiwan              2000               Medium
Japan               250                Low
Taipei, Taiwan      25                 NULL
Kabul, Afghanistan  10                 NULL
Shenzen China       100                NULL
Afghanistan         52                 Very High
Example of Tbl2 Data (head(tbl2), n=6)
CountryName  CountryCode  RR
China        CN           High
Taiwan       TW           High
Iraq         IQ           Very High
Cuba         CU           Medium
Lebanon      LB           Very High
Greece       GR           High
So my question(s) are as follows:
1) Instead of manually passing in a series of OR statements for country names, is there a better way to use Tbl2 as the matching "LIKE" that drives the query?
2) Can you recommend a better way of comparing the output of the query (see the Tbl1 example) and ensuring that multiple strings (Taipei, Taiwan, etc.) are appropriately aggregated and bring back the correct 'RR' rating?
Thanks for taking the time to review this and respond.
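One possible direction for question 1, sketched under stated assumptions: let the country table drive the match by joining on a substring test and grouping by the matched country, so that "TAIPEI TAIWAN" and "Taipei, Taiwan" are counted under the same name. The example runs on SQLite through Python's sqlite3 module rather than Oracle, the city rows are hypothetical, and the tables are cut down to the columns used here. It does not handle exclusions such as "Lebanon, OH".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customer_Address (CUSTOMER_NUM INTEGER, CITY_NM TEXT);
CREATE TABLE HR_Countries (COUNTRY_NAME TEXT, COUNTRY_CODE TEXT, RR TEXT);

-- Hypothetical city values, some with a country name mixed in.
INSERT INTO Customer_Address VALUES
    (1, 'TAIPEI TAIWAN'), (2, 'Taipei, Taiwan'),
    (3, 'Shenzen China'), (4, 'China'), (5, 'Athens');

-- Country rows taken from the Tbl2 example.
INSERT INTO HR_Countries VALUES
    ('China', 'CN', 'High'), ('Taiwan', 'TW', 'High'),
    ('Iraq', 'IQ', 'Very High'), ('Cuba', 'CU', 'Medium'),
    ('Lebanon', 'LB', 'Very High'), ('Greece', 'GR', 'High');
""")

# The country table replaces the hand-written OR list: every country name
# is tested against every city, case-insensitively via UPPER().
sql = """
SELECT risk.COUNTRY_NAME, risk.RR, COUNT(*) AS occurrences
FROM Customer_Address AS addr
JOIN HR_Countries AS risk
  ON instr(upper(addr.CITY_NM), upper(risk.COUNTRY_NAME)) > 0
GROUP BY risk.COUNTRY_NAME, risk.RR
ORDER BY risk.COUNTRY_NAME
"""

country_hits = conn.execute(sql).fetchall()
for row in country_hits:
    print(row)
```

Grouping by the matched country name rather than by the raw city string is what lets the risk rating come back non-NULL for the combined city-and-country values.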

Find average GDP according to continent in SQL

I have 2 tables- Economics (land_code, gdp) and Continent (Land_code, Cont, Percentage). I need to create a query that calculates average GDP for Continent. In case the country is at the same time in several continents, we should also consider the percentage of GDP that belongs to continent. As I have understood if Egypt has GDP of 100, then 90 belongs to Africa and 10 to Asia, how can I implement this expression?
!!! ALREADY DONE)
Obviously,
select Economics.land_code, Economics.gdp * Continent.Percentage
from ...