Modelling country adjacency in SQL - sql

I'm trying to model which countries border each other in MySQL. I have three tables:
nodes
-----
node_id MEDIUMINT
countries
---------
country_id MEDIUMINT (used as a foreign key for nodes.node_id)
country CHAR(64)
iso_code CHAR(2)
node_adjacency
--------------
node_id_1 MEDIUMINT (used as a foreign key for nodes.node_id)
node_id_2 MEDIUMINT (used as a foreign key for nodes.node_id)
I appreciate the nodes table is redundant in this example, but this is part of a larger architecture where nodes can represent many other items other than countries.
Here's some data (IDs (which appear in all three tables) and countries)
59 Bosnia and Herzegovina
86 Croatia
130 Hungary
178 Montenegro
227 Serbia
232 Slovenia
Croatia is bordered by all the other countries, and this is represented in the node_adjacency table as:
59 86
86 130
86 178
86 227
86 232
So Serbia's ID may appear as a node_id_1 or a node_id_2. The data in this table is essentially undirected graph data.
Questions:
Given the name 'Croatia', what SQL should I use to retrieve its neighbours?
Bosnia and Herzegovina
Hungary
Montenegro
Serbia
Slovenia
Would there be any retrieval efficiency gains in storing the adjacency information as directed graph data? E.g. Croatia borders Hungary, and Hungary borders Croatia, essentially duplicating storage of the relationships:
86 130
130 86

This is just off the top of my head, so I don't know if it's the most performant solution and it may need a tweak, but I think it should work:
SELECT
BORDER.country
FROM
Countries AS C
LEFT OUTER JOIN Node_Adjacency NA1 ON
NA1.node_id_1 = C.country_id OR
NA1.node_id_2 = C.country_id
INNER JOIN Countries AS BORDER ON
(
BORDER.country_id = NA1.node_id_1 OR
BORDER.country_id = NA1.node_id_2
) AND
BORDER.country_id <> C.country_id
WHERE
C.country = 'Croatia'
Since your graph is not directed, I don't think that it makes sense to store it as a directed graph. You might also want to Google "Celko SQL Graph" as he has done a lot of advanced work on trees, graphs, and hierarchies in SQL and has an excellent book devoted to the subject.

I would store both relations (Hungary borders Croatia, Croatia borders Hungary) so that you only ever need to query one column.
SELECT c.country FROM countries AS c
INNER JOIN node_adjacency AS n
ON n.node_id_2 = c.country_id
WHERE n.node_id_1 = 86

To do both columns, simply union two queries together (borrowing from Matthew Jones):
SELECT c.country FROM countries AS c
INNER JOIN node_adjacency AS n
ON n.node_id_2 = c.country_id
WHERE n.node_id_1 = 86
UNION
SELECT c.country FROM countries AS c
INNER JOIN node_adjacency AS n
ON n.node_id_1 = c.country_id
WHERE n.node_id_2 = 86
If you do it this way, you duplicate your query instead of your data (use 50% less space), and it's still really simple.

You can create a union view to avoid duplication:
CREATE VIEW adjacency_view (node_id_1, node_id_2)
AS
SELECT node_id_1, node_id_2 FROM node_adjacency
UNION ALL
SELECT node_id_2, node_id_1 FROM node_adjacency
So your query becomes quite straightforward:
SELECT c1.country
FROM adjacency_view
INNER JOIN countries AS c1 on c1.country_id = adjacency_view.node_id_1
INNER JOIN countries AS c2 on c2.country_id = adjacency_view.node_id_2
WHERE c2.country = 'Croatia'
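The view approach can be tried end to end with the sample data from the question. This is only a sketch: it uses Python's built-in sqlite3 module with an in-memory SQLite database as a stand-in for MySQL (an assumption made for portability), and it leaves out the nodes table and the iso_code column.

```python
import sqlite3

# In-memory SQLite stands in for MySQL here (assumption, not the asker's setup).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE countries (country_id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE node_adjacency (node_id_1 INTEGER, node_id_2 INTEGER);

    INSERT INTO countries VALUES
        (59, 'Bosnia and Herzegovina'), (86, 'Croatia'), (130, 'Hungary'),
        (178, 'Montenegro'), (227, 'Serbia'), (232, 'Slovenia');

    -- One row per border, exactly as listed in the question.
    INSERT INTO node_adjacency VALUES
        (59, 86), (86, 130), (86, 178), (86, 227), (86, 232);

    -- Expose each border in both directions without duplicating stored rows.
    CREATE VIEW adjacency_view AS
        SELECT node_id_1, node_id_2 FROM node_adjacency
        UNION ALL
        SELECT node_id_2, node_id_1 FROM node_adjacency;
""")


def neighbours_of(name):
    """Return the sorted names of the countries bordering `name`."""
    rows = conn.execute(
        """
        SELECT c1.country
        FROM adjacency_view AS a
        JOIN countries AS c1 ON c1.country_id = a.node_id_1
        JOIN countries AS c2 ON c2.country_id = a.node_id_2
        WHERE c2.country = ?
        ORDER BY c1.country
        """,
        (name,),
    )
    return [row[0] for row in rows]


croatia_neighbours = neighbours_of("Croatia")
hungary_neighbours = neighbours_of("Hungary")
print(croatia_neighbours)
print(hungary_neighbours)
```

Note that SQLite compares strings case-sensitively by default, so the name is passed as 'Croatia' to match the stored value.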

If you are duplicating relationships (i.e. country A shares a border with B, and B shares a border with A), you can get away with a simple select. If you store only one relationship per pair of countries, you will need to search by both columns in the node_adjacency table (running two select statements and performing a union).

Related

Use of USING in SQL

A restaurant provides wine pairings for most food items on its menu. The structure of two of the tables containing this information is shown below
Join these two tables by their id columns to find the country that the recommended wine is produced in.
Here is the code I have tried:
SELECT country, item
FROM regions
INNER JOIN pairing
ON regions.id = pairing.id
ORDER BY item
LIMIT 5;
But the compiler gives the solution as:
SELECT country, item
FROM regions
INNER JOIN pairing
USING(id)
ORDER BY item
LIMIT 5;
OUTPUT:
country    item
France     caviar
Italy      curry
Italy      grilled vegetables
Argentina  lamb
Germany    roast duck
Doubt:
Is there any difference between USING and an equality condition on id, or are they the same?
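One way to explore the question is to run both forms side by side. The sketch below uses Python's sqlite3 module with two tiny tables; the table layouts and id values are assumptions, since the original table structure is not shown. In SQLite (and likewise in MySQL), both joins match the same rows, and the visible difference is that `SELECT *` with USING returns the shared id column once, while ON keeps both copies.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical layouts; only the column names id, country, item come from the question.
    CREATE TABLE regions (id INTEGER, country TEXT);
    CREATE TABLE pairing (id INTEGER, item TEXT);
    INSERT INTO regions VALUES (1, 'France'), (2, 'Italy');
    INSERT INTO pairing VALUES (1, 'caviar'), (2, 'curry');
""")


def run(sql):
    """Execute a query and return (column names, rows)."""
    cur = conn.execute(sql)
    return [d[0] for d in cur.description], cur.fetchall()


# Same named columns: both forms give identical results.
_, on_named = run(
    "SELECT country, item FROM regions "
    "INNER JOIN pairing ON regions.id = pairing.id ORDER BY item")
_, using_named = run(
    "SELECT country, item FROM regions "
    "INNER JOIN pairing USING (id) ORDER BY item")

# SELECT * exposes the difference: ON keeps both id columns, USING merges them.
on_cols, _ = run("SELECT * FROM regions INNER JOIN pairing ON regions.id = pairing.id")
using_cols, _ = run("SELECT * FROM regions INNER JOIN pairing USING (id)")

print(on_named == using_named)
print(on_cols)
print(using_cols)
```

USING also requires the join column to have the same name in both tables, whereas ON accepts any condition.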

How does SQL count(distinct) work in this case?

I'm trying to find the match number in which Germany played against Poland. This is from https://www.w3resource.com/sql-exercises/soccer-database-exercise/sql-subqueries-exercise-soccer-database-4.php. There are two tables: match_details and soccer_country. I don't understand how the count(distinct) works in this case. Can someone please clarify? Thanks!
SELECT match_no
FROM match_details
WHERE team_id = (
SELECT country_id
FROM soccer_country
WHERE country_name = 'Germany')
OR team_id = (
SELECT country_id
FROM soccer_country
WHERE country_name = 'Poland')
GROUP BY match_no
HAVING COUNT(DISTINCT team_id) = 2;
As Lamak mentioned, this is an ugly way to write the query, but there are many ways to approach it.
As mentioned, COUNT(DISTINCT team_id) = 2 makes sure that exactly 2 distinct teams remain for the match. If there is ever a Cartesian result, you could get multiple rows repeating the same team; counting distinct TEAM_ID values eliminates that.
That said, other "team" data structures I have seen have a single record for the match and a column for EACH TEAM playing the match. That is easier by a long shot, but this is still a relatively easy query.
Break the query down a little, and consider a large-scale set of data (not that this, or any professional league, would have record counts large enough to slow down a SQL engine).
Your first criterion is games with Germany. So let's start with that.
SELECT
md1.match_no
FROM
match_details md1
JOIN soccer_country sc1
on md1.team_id = sc1.country_id
AND sc1.country_name = 'Germany'
So, why even look at any other record/match if Germany is not part of the match on either side? This query by itself returns 6 matches from the sample data of 51 matches. So now all you need to do is join to the match details table a second time for only those matches, this time requiring that the second team is Poland:
SELECT
md1.match_no
FROM
match_details md1
JOIN soccer_country sc1
on md1.team_id = sc1.country_id
AND sc1.country_name = 'Germany'
-- joining again for the same match Germany was already qualified
JOIN match_details md2
on md1.match_no = md2.match_no
-- but we want the OTHER team record since Germany was first team
and md1.team_id != md2.team_id
-- and on to the second country table based on the SECOND team ID
JOIN soccer_country sc2
on md2.team_id = sc2.country_id
-- and the second team was Poland
AND sc2.country_name = 'Poland'
Yes, it may be a longer query, but by eliminating the 45 other matches up front (again, thinking of a LARGE database), you avoid blowing through tons of data and work on a very finite set, finishing with only the Germany / Poland matches. No aggregates, counts, or distincts, just direct joins.
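To check that the join-only version and the GROUP BY / HAVING version agree, here is a small sketch using Python's sqlite3 module. The country IDs and the three matches are made up for illustration and are not the exercise's real data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE soccer_country (country_id INTEGER, country_name TEXT);
    CREATE TABLE match_details (match_no INTEGER, team_id INTEGER);

    -- Hypothetical IDs: 1 = Germany, 2 = Poland, 3 = France.
    INSERT INTO soccer_country VALUES (1, 'Germany'), (2, 'Poland'), (3, 'France');

    -- Match 10: Germany v France, 11: Germany v Poland, 12: Poland v France.
    INSERT INTO match_details VALUES
        (10, 1), (10, 3), (11, 1), (11, 2), (12, 2), (12, 3);
""")

# Join-only version: start from Germany's rows, then find the opposing row.
self_join = conn.execute("""
    SELECT md1.match_no
    FROM match_details md1
    JOIN soccer_country sc1
      ON md1.team_id = sc1.country_id AND sc1.country_name = 'Germany'
    JOIN match_details md2
      ON md1.match_no = md2.match_no AND md1.team_id != md2.team_id
    JOIN soccer_country sc2
      ON md2.team_id = sc2.country_id AND sc2.country_name = 'Poland'
""").fetchall()

# Aggregate version from the question, with the two subqueries folded into IN.
grouped = conn.execute("""
    SELECT match_no
    FROM match_details
    WHERE team_id IN (SELECT country_id FROM soccer_country
                      WHERE country_name IN ('Germany', 'Poland'))
    GROUP BY match_no
    HAVING COUNT(DISTINCT team_id) = 2
""").fetchall()

print(self_join)
print(grouped)
```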
FEEDBACK
Let's take a look at some BAD sample data... which, as all programmers know, does not exist (NOT). Anyhow, let's take a look at these few matches.
Match  Team ID   (names shown instead of IDs for simplicity)
52     Poland
52     Poland
53     Germany
53     Germany
If you were to run the query with COUNT(team_id) instead of COUNT(DISTINCT team_id), both match 52 and 53 would show up, as Poland appears 2 times for match 52 and Germany 2 times for match 53. With DISTINCT, each of these matches counts only 1 team and is thus excluded. Does that help? Again, no such thing as bad data :)
And yet another sample, with a match where more than 2 teams were created:
Match  Team ID
54     France
54     Poland
54     England
55     Hungary
56     Austria
None of these matches would be returned. Match 54 has 3 distinct teams (and only Poland survives the WHERE filter), and matches 55 and 56 each have only a single entry, thus no opponent to compete against.
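The effect of DISTINCT on the bad rows above can be reproduced with a short sketch in Python's sqlite3 module. The numeric team IDs are assumptions (1 = Germany, 2 = Poland, others for the remaining teams), and match 51 is added as one clean Germany v Poland match for contrast.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE match_details (match_no INTEGER, team_id INTEGER);

    -- Hypothetical IDs: 1 = Germany, 2 = Poland, 3 = France, 4 = England,
    -- 5 = Hungary, 6 = Austria.
    INSERT INTO match_details VALUES
        (51, 1), (51, 2),          -- clean match: Germany v Poland
        (52, 2), (52, 2),          -- bad data: Poland listed twice
        (53, 1), (53, 1),          -- bad data: Germany listed twice
        (54, 3), (54, 2), (54, 4), -- bad data: three teams in one match
        (55, 5), (56, 6);          -- single-entry matches
""")


def matches(having):
    """Return match numbers for Germany/Poland rows passing the HAVING test."""
    sql = f"""
        SELECT match_no
        FROM match_details
        WHERE team_id IN (1, 2)
        GROUP BY match_no
        HAVING {having}
        ORDER BY match_no
    """
    return [row[0] for row in conn.execute(sql)]


without_distinct = matches("COUNT(team_id) = 2")
with_distinct = matches("COUNT(DISTINCT team_id) = 2")
print(without_distinct)
print(with_distinct)
```

Without DISTINCT the duplicated rows in matches 52 and 53 are counted twice and slip through; with DISTINCT only the clean match remains.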
2nd FEEDBACK
To clarify the query. If you look at the short query for just Germany, that aliased instance of "md1" is already sitting on any given record for a Germany match. So the second join to the "md2", I only care about the same match, so I can join on the same match_no. However, in the "md2" alias, the "!=" means NOT EQUAL. ! = logical NOT. So the join is saying from the MD1, join to the MD2 alias on the same match id. However, only give me where the teams are NOT the same. So the first instance holds Germany's team ID (already qualified) and thus give me the secondary team id. So now I can use the secondary (md2) instance team ID to join to the country to confirm only for Poland.
Does this now clarify things for you?

Inner join filtering out desired results

I'm just wondering how to fix this issue:
For example, when I type in this simple query
SELECT CUSTOMERS.CUSTOMER_ID, CUSTOMERS.NAME
FROM CUSTOMERS
WHERE (CUSTOMERS.NAME LIKE 'O%' AND CUSTOMERS.NAME LIKE '%e%') OR CUSTOMERS.NAME LIKE '%t';
I get the following output:
127 Alphabet
128 Comcast
129 Target
196 DuPont
197 Avnet
44 Jabil Circuit
58 Health Net
69 Whole Foods Market
226 Office Depot
260 Occidental Petroleum
27 Assurant
158 Owens & Minor
174 Oracle
255 Waste Management
88 Walmart
113 Microsoft
117 Home Depot
However, when I add INNER JOIN ORDERS ON CUSTOMERS.CUSTOMER_ID = ORDERS.CUSTOMER_ID I get this output instead.
44 Jabil Circuit
44 Jabil Circuit
44 Jabil Circuit
58 Health Net
69 Whole Foods Market
44 Jabil Circuit
44 Jabil Circuit
It seems like it's only displaying IDs and names for customers that have an ID in the ORDERS table. How do I make it so it runs through every customer again, not just the ones in the ORDERS table?
You should use a LEFT JOIN instead of the INNER JOIN.
A LEFT JOIN tries to match an entry of the second table to one of the first table while also displaying results for the first table where no matching entry in the second table exists. In this case the entries for the second table will just be NULL. An INNER JOIN only gives you data where entries for both sides are given.
Other possibilities are RIGHT JOIN which is like LEFT JOIN but switches the roles of the first and second table and FULL OUTER JOIN which only requires one side to be given.
Further information and some good examples can be found at w3schools.
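A minimal sketch of the difference, using Python's sqlite3 module. The three customers are taken from the output in the question; the order rows and the order_id column are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER, name TEXT);
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER);

    INSERT INTO customers VALUES
        (44, 'Jabil Circuit'), (58, 'Health Net'), (127, 'Alphabet');

    -- Hypothetical orders: two for customer 44, one for 58, none for 127.
    INSERT INTO orders VALUES (1, 44), (2, 44), (3, 58);
""")

# INNER JOIN: customers without orders disappear, and customers with
# several orders are repeated once per order.
inner_rows = conn.execute("""
    SELECT c.customer_id, c.name
    FROM customers c
    INNER JOIN orders o ON c.customer_id = o.customer_id
    ORDER BY c.customer_id
""").fetchall()

# LEFT JOIN: every customer is kept; order columns are NULL (None in Python)
# when there is no matching order.
left_rows = conn.execute("""
    SELECT c.customer_id, c.name, o.order_id
    FROM customers c
    LEFT JOIN orders o ON c.customer_id = o.customer_id
    ORDER BY c.customer_id, o.order_id
""").fetchall()

print(inner_rows)
print(left_rows)
```

The LEFT JOIN still repeats a customer once per matching order, which is why Jabil Circuit shows up several times in the question's output.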

JOIN the same table on two columns

I use JOINs to replace country and product IDs in import and export data with actual country and products names stored in separate tables. In the data source table (data), there are two columns with country IDs, for origin and destination, both of which I am replacing with country names.
The code I have come up with refers to the country_names table twice – as country_names, and country_names2, – which doesn’t seem to be very elegant. I expected to be able to refer to the table just once, by a single name. I would be grateful if someone pointed me to a more elegant and maybe more efficient way to achieve the same result.
SELECT
country_names.name AS origin,
country_names2.name AS dest,
product_names.name AS product,
SUM(data.export_val) AS export_val,
SUM(data.import_val) AS import_val
FROM
OEC.year_origin_destination_hs92_6 AS data
JOIN
OEC.products_hs_92 AS product_names
ON
data.hs92 = product_names.hs92
JOIN
OEC.country_names AS country_names
ON
data.origin = country_names.id_3char
JOIN
OEC.country_names AS country_names2
ON
data.dest = country_names2.id_3char
WHERE
data.year > 2012
AND data.export_val > 1E8
GROUP BY
origin,
dest,
product
The table to convert product IDs to product names has 6K+ rows. Here is a small sample:
id hs92 name
63215 3215 Ink
2130110 130110 Lac
21002 1002 Rye
2100200 100200 Rye
52706 2706 Tar
20902 902 Tea
42203 2203 Beer
42302 2302 Bran
178703 8703 Cars
The table to convert country IDs to country names (which is the table I have to JOIN on twice) has 264 rows for all countries in the world. (id_3char is the column used.) Here is a sample:
id id_3char name
euchi chi Channel Islands
askhm khm Cambodia
eublx blx Belgium-Luxembourg
eublr blr Belarus
eumne mne Montenegro
euhun hun Hungary
asmng mng Mongolia
nabhs bhs Bahamas
afsen sen Senegal
And here is a sample of data from the import and export data table with a total of 205M rows that has the two columns origin and dest that I am making a join on:
year origin dest hs92 export_val import_val
2009 can isr 300410 2152838.47 3199.24
1995 chn jpn 590190 275748.65 554154.24
2000 deu gmb 100610 1573508.44 1327.0
2008 deu jpn 540822 10000.0 202062.43
2010 deu ukr 950390 1626012.04 159423.38
2006 esp prt 080530 2470699.19 125291.33
2006 grc ind 844859 8667.0 3182.0
2000 ltu deu 630399 6018.12 5061.96
2005 usa zaf 290219 2126216.52 34561.61
1997 ven ecu 281122 155347.73 1010.0
I think you already have it done such that it can be considered good enough to just use as is :o)
In the meantime, if for some reason you really want to avoid two joins on that country table, what you can do is materialize the select statement below into, let's say, an `OEC.origin_destination_pairs` table:
SELECT
o.id_3char o_id_3char,
o.name o_name,
d.id_3char d_id_3char,
d.name d_name
FROM `OEC.country_names` o
CROSS JOIN `OEC.country_names` d
Then you can just join on that new table as below
SELECT
country_names.o_name AS origin,
country_names.d_name AS dest,
product_names.name AS product,
SUM(data.export_val) AS export_val,
SUM(data.import_val) AS import_val
FROM OEC.year_origin_destination_hs92_6 AS data
JOIN OEC.products_hs_92 AS product_names
ON data.hs92 = product_names.hs92
JOIN OEC.origin_destination_pairs AS country_names
ON data.origin = country_names.o_id_3char
AND data.dest = country_names.d_id_3char
WHERE data.year > 2012
AND data.export_val > 1E8
GROUP BY
origin,
dest,
product
The motivation behind the above is the cost of storing and querying in your particular case.
Your `OEC.country_names` table is only about 10KB in size.
Each time you query it, you pay as if it were 10MB (charges are rounded to the nearest MB, with a minimum 10 MB data processed per table referenced by the query, and with a minimum 10 MB data processed per query).
So, if you materialize the table mentioned above, it will still be less than 10MB, so there is no difference in querying charges.
The situation is similar for storing that table: no visible change in charges.
You can check more about pricing here
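To see that the pairs table gives the same answer as joining the country table twice, here is a small sketch in Python's sqlite3 module. It runs on SQLite rather than BigQuery, so it says nothing about pricing; the table name `trade` and the three sample rows are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE country_names (id_3char TEXT, name TEXT);
    CREATE TABLE trade (origin TEXT, dest TEXT, export_val REAL);

    INSERT INTO country_names VALUES
        ('deu', 'Germany'), ('jpn', 'Japan'), ('ukr', 'Ukraine');

    -- Hypothetical rows standing in for the 205M-row data table.
    INSERT INTO trade VALUES
        ('deu', 'jpn', 10.0), ('deu', 'ukr', 5.0), ('jpn', 'deu', 7.0);

    -- Materialize every (origin, destination) combination once.
    CREATE TABLE origin_destination_pairs AS
        SELECT o.id_3char AS o_id_3char, o.name AS o_name,
               d.id_3char AS d_id_3char, d.name AS d_name
        FROM country_names o
        CROSS JOIN country_names d;
""")

# Original approach: the same lookup table joined twice under two aliases.
two_joins = conn.execute("""
    SELECT o.name, d.name, SUM(t.export_val)
    FROM trade t
    JOIN country_names o ON t.origin = o.id_3char
    JOIN country_names d ON t.dest = d.id_3char
    GROUP BY o.name, d.name
    ORDER BY o.name, d.name
""").fetchall()

# Alternative: a single join on the materialized pairs table.
one_join = conn.execute("""
    SELECT p.o_name, p.d_name, SUM(t.export_val)
    FROM trade t
    JOIN origin_destination_pairs p
      ON t.origin = p.o_id_3char AND t.dest = p.d_id_3char
    GROUP BY p.o_name, p.d_name
    ORDER BY p.o_name, p.d_name
""").fetchall()

pair_count = conn.execute(
    "SELECT COUNT(*) FROM origin_destination_pairs").fetchone()[0]
print(two_joins == one_join, pair_count)
```

The pairs table grows with the square of the number of countries (9 rows for 3 countries here), which is the trade-off for the single join.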

How can I prevent duplicate values in SQL?

I can't think of a good title for this question but here goes..
I have this SQL query
SELECT
J.SRV_JOB_ID,
C.UNIT_COST * C.QTY AS COST_PRICE,
E.SERIAL_NO
FROM
SRV_JOB J
LEFT JOIN SRV_JOB_COST C ON C.SRV_JOB_ID = J.SRV_JOB_ID
LEFT JOIN SRV_JOB_EQUIPMENT JE ON JE.SRV_JOB_ID = J.SRV_JOB_ID
LEFT JOIN SRV_EQUIPMENT E ON E.SRV_EQUIPMENT_ID = JE.SRV_EQUIPMENT_ID
WHERE
j.srv_job_id = 52423
which is somewhat simplified for the purpose of the question, and gives these results;
srv_job_id cost_price serial_no
52423 89 400887
52423 89 400888
52423 89 400889
because there is one job with an id of 52423 and a cost of 89, but there are three associated serial numbers.
There is nothing wrong with the result, but it is misleading because it looks like each serial number has a cost of 89, when in fact the total cost for all three is 89.
How can I prevent the cost of 89 being duplicated? I can't change the database schema, but I can change the query.
The result I would like would be
srv_job_id cost_price serial_no
52423 89 400887
52423 null 400888
52423 null 400889
I've broken it into two separate queries. One to list the job(s) details and one to list the serial numbers for each job.
Thanks for your help. Your comments led me to think about the problem differently.
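For anyone landing here, the two-query split might look roughly like the sketch below, written with Python's sqlite3 module. The table layouts are reduced to the columns used in the question, and the unit cost of 89 with a quantity of 1 is an assumption (the question only shows the product).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE srv_job (srv_job_id INTEGER);
    CREATE TABLE srv_job_cost (srv_job_id INTEGER, unit_cost REAL, qty INTEGER);
    CREATE TABLE srv_job_equipment (srv_job_id INTEGER, srv_equipment_id INTEGER);
    CREATE TABLE srv_equipment (srv_equipment_id INTEGER, serial_no TEXT);

    INSERT INTO srv_job VALUES (52423);
    INSERT INTO srv_job_cost VALUES (52423, 89, 1);  -- assumed split of the 89
    INSERT INTO srv_job_equipment VALUES (52423, 1), (52423, 2), (52423, 3);
    INSERT INTO srv_equipment VALUES (1, '400887'), (2, '400888'), (3, '400889');
""")

job_id = 52423

# Query 1: job details, one row per cost line, no equipment join.
job_costs = conn.execute("""
    SELECT j.srv_job_id, c.unit_cost * c.qty AS cost_price
    FROM srv_job j
    LEFT JOIN srv_job_cost c ON c.srv_job_id = j.srv_job_id
    WHERE j.srv_job_id = ?
""", (job_id,)).fetchall()

# Query 2: serial numbers for the same job, no cost join.
serials = [row[0] for row in conn.execute("""
    SELECT e.serial_no
    FROM srv_job_equipment je
    JOIN srv_equipment e ON e.srv_equipment_id = je.srv_equipment_id
    WHERE je.srv_job_id = ?
    ORDER BY e.serial_no
""", (job_id,))]

print(job_costs)
print(serials)
```

Because cost rows and equipment rows are never joined to each other, the cost appears once regardless of how many serial numbers the job has.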