SQL query that will retrieve set containing all entries from another set - sql

I have the following relations in my db:
Organization: information about political and economic organizations.
  name: the full name of the organization
  abbreviation: its abbreviation
isMember: memberships in political and economic organizations.
  organization: the abbreviation of the organization
  country: the code of the member country
geo_desert: geographical information about deserts.
  desert: the name of the desert
  country: the code of the country where it is located
  province: the province of this country
My task is to retrieve the organizations whose members include the full set of countries with deserts. Such an organization may also have member countries without deserts. In other words, given the set of countries with deserts, every organization in the result must have all of them as members, plus an arbitrary number of other (desert-free) countries.
So far I have tried the following code, but it doesn't work.
WITH CountriesWithDeserts AS (
SELECT DISTINCT country
FROM dbmaster.geo_desert
), OrganizationsWithAllDesertMembers AS (
SELECT organization
FROM dbmaster.isMember AS ism
WHERE (
SELECT count(*)
FROM (
SELECT *
FROM CountriesWithDeserts
EXCEPT
SELECT country
FROM dbmaster.isMember
WHERE organization = ism.organization
)
) IS NULL
), OrganizationCode AS (
SELECT name, abbreviation
FROM dbmaster.Organization
)
SELECT oc.name AS Organization
FROM OrganizationCode AS oc, OrganizationsWithAllDesertMembers AS owadm
WHERE oc.abbreviation=owadm.organization;
UPD: DBMS says: "ism.organization is not defined"
I'm using DB2/LINUXX8664 9.7.0
Output should look like this:
NAME
--------------------------------------------------------------------------------
African, Caribbean, and Pacific Countries
African Development Bank
Agency for Cultural and Technical Cooperation
Andean Group

I find the easiest way to handle this is by using group by and having. You just want to focus on the deserts, so the rest of the countries don't matter.
select m.organization
from isMember m join
geo_desert d
on m.country = d.country
group by m.organization
having count(distinct m.country) = (select count(distinct country) from geo_desert);
The having clause simply counts the number of matching (i.e. desert) countries and checks that all are included.
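The counting approach above can be tried out on a small in-memory database. This is only a sketch: it uses Python's sqlite3 module rather than DB2, drops the dbmaster schema prefix, and the rows (countries A, B, C and organizations ORG1 to ORG3) are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE geo_desert (desert TEXT, country TEXT, province TEXT);
    CREATE TABLE isMember (organization TEXT, country TEXT);
    -- Made-up rows: deserts lie in countries A and B only.
    INSERT INTO geo_desert VALUES
        ('D1', 'A', 'p1'), ('D1', 'B', 'p2'), ('D2', 'A', 'p3');
    INSERT INTO isMember VALUES
        ('ORG1', 'A'), ('ORG1', 'B'), ('ORG1', 'C'),
        ('ORG2', 'A'), ('ORG2', 'C'),
        ('ORG3', 'C');
""")

# Keep only memberships in desert countries, then require that the number of
# distinct desert countries per organization equals the total number of them.
query = """
    SELECT m.organization
    FROM isMember AS m
    JOIN geo_desert AS d ON m.country = d.country
    GROUP BY m.organization
    HAVING COUNT(DISTINCT m.country) =
           (SELECT COUNT(DISTINCT country) FROM geo_desert)
    ORDER BY m.organization
"""
orgs_with_all_deserts = [row[0] for row in conn.execute(query)]
print(orgs_with_all_deserts)
```

COUNT(DISTINCT ...) matters here because a country with several deserts (A in this sample) appears more than once after the join.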

Phrase it like this: you are looking for organizations for which there is no desert country that they do not include.
select *
from organization o
where not exists
(
select country from geo_desert
except
select country from ismember
where organization = o.abbreviation
);
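A minimal sketch of the same NOT EXISTS / EXCEPT formulation, run through Python's sqlite3 module instead of DB2. The table contents are invented for illustration, and it assumes the engine accepts a correlated reference inside a compound subquery, as SQLite does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE organization (name TEXT, abbreviation TEXT);
    CREATE TABLE isMember (organization TEXT, country TEXT);
    CREATE TABLE geo_desert (desert TEXT, country TEXT, province TEXT);
    -- Made-up rows: deserts lie in countries A and B.
    INSERT INTO organization VALUES ('Org One', 'ORG1'), ('Org Two', 'ORG2');
    INSERT INTO geo_desert VALUES ('D1', 'A', 'p1'), ('D2', 'B', 'p2');
    INSERT INTO isMember VALUES
        ('ORG1', 'A'), ('ORG1', 'B'), ('ORG1', 'C'),
        ('ORG2', 'A');
""")

# For each organization, the EXCEPT yields the desert countries that are NOT
# among its members; the organization qualifies when that set is empty.
query = """
    SELECT o.name
    FROM organization AS o
    WHERE NOT EXISTS (
        SELECT country FROM geo_desert
        EXCEPT
        SELECT country FROM isMember
        WHERE organization = o.abbreviation
    )
    ORDER BY o.name
"""
complete_orgs = [row[0] for row in conn.execute(query)]
print(complete_orgs)
```

ORG2 is rejected because desert country B survives the EXCEPT for it.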

Here are two equivalent solutions:
First:
WITH CountriesWithDeserts AS (
SELECT DISTINCT country
FROM dbmaster.geo_desert
), OrganizationsWithAllDesertMembers AS (
SELECT ism.organization
FROM dbmaster.isMember AS ism
JOIN CountriesWithDeserts AS cwd
ON ism.country = cwd.country
GROUP BY ism.organization
HAVING count(ism.country) = (SELECT count(*) FROM CountriesWithDeserts)
), OrganizationCode AS (
SELECT name, abbreviation
FROM dbmaster.Organization
)
SELECT oc.name AS Organization
FROM OrganizationCode AS oc, OrganizationsWithAllDesertMembers AS owadm
WHERE oc.abbreviation=owadm.organization;
Second:
WITH CountriesWithDeserts AS (
SELECT DISTINCT country
FROM dbmaster.geo_desert
)
SELECT org.name AS Organization
FROM dbmaster.Organization AS org
WHERE NOT EXISTS (
SELECT *
FROM CountriesWithDeserts
EXCEPT
SELECT country
FROM dbmaster.isMember
WHERE organization = org.abbreviation
);

Related

Find string from table in cell in BigQuery --> Query exceeded resource limits

I have two tables in BigQuery:
City list: table invertible-fin-XXX238.Reports.City
Station names: table invertible-fin-XXX238.Reports.Station
Most of the station names contain a city name. Now I want to extract the city from the Station table.
Here some example data:
City: Berlin
Stationname: inStore_Berlin_Alexanderplatz
Stationname: Berlin Schönefeld Airport
Stationname: Train Station Franchise Berlin
I tried the INSTR Function, but had no success (the INSTR works only with Legacy SQL and there I couldn’t use SUBSELECTS).
SELECT City,
INSTR((SELECT AdGroupName
FROM [invertible-fin-XXX238.Reports.City]),City) AS Match
FROM [invertible-fin-XXX238.Reports.Station]
Therefore I tried it with WHERE LIKE. Below the SQL Code:
SELECT a.City
FROM [invertible-fin-XXX238.Reports.City] a
CROSS JOIN [invertible-fin-XXX238.Reports.Station] b
WHERE b. Name LIKE '%' + a.City + '%'
GROUP BY a.City
But now the query is too computationally intensive, and I get the error “Query exceeded resource limits for tier 1. Tier 18 or higher required.”
Could someone please help me write a more resource-friendly query?
Thanks in advance,
Philipp
Below are a few of many possible versions for BigQuery Standard SQL:
#standardSQL
SELECT city, station
FROM `invertible-fin-XXX238.Reports.Station` AS s
JOIN `invertible-fin-XXX238.Reports.City` AS c
ON REPLACE(LOWER(station), LOWER(city), '') <> LOWER(station)
or
#standardSQL
SELECT city, station
FROM `invertible-fin-XXX238.Reports.Station` AS s
JOIN `invertible-fin-XXX238.Reports.City` AS c
ON LOWER(station) LIKE CONCAT('%',LOWER(city),'%')
You can remove the LOWER() calls if city names are spelled in the same case in both tables.
While the versions above look more straightforward, I would prefer the one below because it lets you control how the city is extracted from the station name. In the pattern r'([^ _]+)' you should list all characters that you observe acting as delimiters in the station column. That way a city is matched only when it is a whole token, not when it is part of a longer name.
Of course, you should first check whether you need to worry about this at all.
#standardSQL
WITH tokens AS (
SELECT token, station
FROM `invertible-fin-XXX238.Reports.Station` AS s,
UNNEST(REGEXP_EXTRACT_ALL(LOWER(station), r'([^ _]+)')) token
)
SELECT city, station
FROM tokens AS s
JOIN `invertible-fin-XXX238.Reports.City` AS c
ON LOWER(city) = token
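To see why token matching differs from substring matching, here is a rough plain-Python sketch of the same idea (it is not BigQuery code). The helper name cities_in_station and the station "Berlinger Strasse" are hypothetical, added only to show a case where the city name occurs inside a longer word.

```python
import re

def cities_in_station(station, cities):
    """Return the known cities that appear as whole tokens in a station name."""
    # Split on the same delimiters as the pattern above: spaces and underscores.
    tokens = set(re.findall(r"[^ _]+", station.lower()))
    return sorted(c for c in cities if c.lower() in tokens)

cities = ["Berlin", "Hamburg"]
stations = [
    "inStore_Berlin_Alexanderplatz",
    "Berlin Schönefeld Airport",
    "Train Station Franchise Berlin",
    "Berlinger Strasse",  # hypothetical: "berlin" occurs only as a substring
]
matches = {s: cities_in_station(s, cities) for s in stations}
for station, found in matches.items():
    print(station, "->", found)
```

A LIKE '%berlin%' test would also match the last station; the token comparison does not. The trade-off is that a city name containing a delimiter (for example a two-word name) is split into separate tokens and would need extra handling.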
I also wonder how the performance for a sub-query would be in this case. For instance:
WITH City AS(
SELECT 'Berlin' As Name UNION ALL
SELECT 'Hamburg'
),
StationNames AS(
SELECT 'inStore_Berlin_Alexanderplatz' AS Name UNION ALL
SELECT 'Berlin Schönefeld Airport' UNION ALL
SELECT 'Train Station Franchise Berlin' UNION ALL
SELECT 'Train Station Hamburg' UNION ALL
SELECT 'Train Station Pluton'
)
SELECT
Name StationName,
(SELECT Name FROM City c WHERE LOWER(s.Name) LIKE CONCAT('%', LOWER(c.Name), '%')) city
FROM StationNames s
Or in your case:
SELECT
Name StationName,
(SELECT Name FROM `invertible-fin-XXX238.Reports.City` c WHERE LOWER(s.Name) LIKE CONCAT('%', LOWER(c.Name), '%')) city
FROM `invertible-fin-XXX238.Reports.Station` s
I know the conventional wisdom for most databases is that a JOIN performs better than a sub-query, but BigQuery has many different optimization techniques for storing and querying data, so I was curious how different the performance would be in this case.

How to write a query for "Select the site_id and location of sites in Europe where the students are not living in the UK"

I have two separate queries:
SELECT SITE_ID, LOCATION
FROM SITES
WHERE LOCATION LIKE 'Europe%';
and
SELECT STUDENT_ID, STUDENT_FNAME, STUDENT_LNAME, COUNTRY
FROM STUDENTS
WHERE COUNTRY NOT LIKE 'UK';
How to write a query which would select the site_id and location of sites in Europe where the students are not living in the UK?
A query that takes one from the other and applies to the following structure:
SELECT column_name1
FROM table_name
WHERE condition
OPERATOR
SELECT column_name1
FROM table_name
WHERE condition
I have added UNION ALL operator (I have no idea if I'm using it correctly), but the result is:
SELECT SITE_ID, LOCATION
*
ERROR at line 1:
ORA-01789: query block has incorrect number of result columns
.
SELECT SITE_ID, LOCATION
FROM SITES
WHERE LOCATION LIKE 'Europe%'
UNION ALL
SELECT STUDENT_ID, STUDENT_FNAME, STUDENT_LNAME, COUNTRY
FROM STUDENTS
WHERE COUNTRY NOT LIKE 'UK';
Try the code below if you just want a combined list of the data from both tables:
SELECT
SITE_ID AS S, LOCATION AS L, '' As SL, '' As C, 'Site' As UType
FROM
SITES
WHERE
LOCATION LIKE 'Europe%'
UNION ALL
SELECT
STUDENT_ID As S, STUDENT_FNAME As L, STUDENT_LNAME AS SL, COUNTRY AS C, 'Student' As UType
FROM
STUDENTS
WHERE
COUNTRY NOT LIKE 'UK';
That sounds like you want the MINUS operator.

Divide two queries in SQL then group by

I am looking for the rate of new accounts relative to all accounts; both queries are listed below. I need to divide NewAccounts by AllAccounts and report that percentage grouped by town in the same query. Thanks
SELECT DISTINCT Count(NewAccounts), Town
FROM (SELECT Stuff)
WHERE (Newaccounts)
Group By Town
;
SELECT DISTINCT Count(AllAccounts), Town
FROM (SELECT DifferentSTUFF)
WHERE (AllAccounts)
Group By Town
You need to rewrite your queries as subqueries and join them together:
SELECT aa.Town, CAST(na.NewAccounts AS FLOAT) / aa.AllAccounts
FROM ( SELECT Count(NewAccounts) AS NewAccounts, Town
FROM (SELECT Stuff)
WHERE (Newaccounts)
GROUP BY Town
) na
INNER JOIN
( SELECT Count(AllAccounts) AS AllAccounts, Town
FROM (SELECT DifferentSTUFF)
WHERE (AllAccounts)
GROUP BY Town
) aa
ON aa.Town = na.Town;
N.B. I have removed DISTINCT from both queries because it is redundant: GROUP BY Town already yields one row per town. The cast to float on NewAccounts avoids integer division, which would otherwise truncate the result to an integer.
You may need to alter this slightly depending on the availability of data in each of the queries, i.e. if you won't always have a result in the new accounts for a town it would be better written as:
SELECT aa.Town, CAST(COALESCE(na.NewAccounts, 0) AS FLOAT) / aa.AllAccounts
FROM
( SELECT Count(AllAccounts) AS AllAccounts, Town
FROM (SELECT DifferentSTUFF)
WHERE (AllAccounts)
GROUP BY Town
) aa
LEFT JOIN
( SELECT Count(NewAccounts) AS NewAccounts, Town
FROM (SELECT Stuff)
WHERE (Newaccounts)
GROUP BY Town
) na
ON aa.Town = na.Town
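As a concrete illustration of the LEFT JOIN variant, here is a small sketch using Python's sqlite3 module. The accounts table, its is_new flag, and the sample towns are assumptions standing in for the elided subqueries in the question; REAL is SQLite's floating-point type.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical table: one row per account, is_new = 1 for new accounts.
    CREATE TABLE accounts (town TEXT, is_new INTEGER);
    INSERT INTO accounts VALUES
        ('Alpha', 1), ('Alpha', 0), ('Alpha', 0), ('Alpha', 0),
        ('Beta', 0), ('Beta', 0);
""")

# All-accounts counts drive the result; towns without new accounts still
# appear because of the LEFT JOIN, with COALESCE turning NULL into 0.
query = """
    SELECT aa.town,
           CAST(COALESCE(na.new_accounts, 0) AS REAL) / aa.all_accounts AS rate
    FROM (SELECT town, COUNT(*) AS all_accounts
          FROM accounts GROUP BY town) AS aa
    LEFT JOIN (SELECT town, COUNT(*) AS new_accounts
               FROM accounts WHERE is_new = 1 GROUP BY town) AS na
           ON aa.town = na.town
    ORDER BY aa.town
"""
rates = conn.execute(query).fetchall()
print(rates)
```

Without the cast, 1 / 4 would be evaluated as integer division and come out as 0.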

SQL Query: Largest number of guns

Schema is below:
Ships(name, yearLaunched, country, numGuns, gunSize, displacement)
Battles(ship, battleName, result)
where name and ship refer to the same value. By this I mean that if 'Missouri' is one of the values of name, 'Missouri' also appears as a value of ship
(i.e. name = 'Missouri', ship = 'Missouri').
Now the question I have is what SQL statement would I make in order to list
the battleship amongst a list of battleships that has the largest amount
of guns (i.e. gunSize)
I tried:
SELECT name, max(gunSize)
FROM Ships
But this gave me the wrong result.
I then tried:
SELECT s.name
FROM Ships s,
(SELECT MAX(gunSize) as "Largest # of Guns"
FROM Ships
GROUP BY name) maxGuns
WHERE s.name = maxGuns.name
But then SQLite Admin gave me an error saying that no such column 'maxGuns' exists
even though I assigned it as an alias: maxGuns
Do any of you know what the correct query for this problem would be?
Thanks!
The problem in your query is that the subquery has no column named name.
Anyway, to find the largest amount of guns, just use SELECT MAX(gunSize) FROM Ships.
To get all ships with that number of guns, you need nothing more than a simple comparison with that value:
SELECT name
FROM Ships
WHERE gunSize = (SELECT MAX(gunSize)
FROM Ships)
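A quick way to check this query is to run it against a few rows in an in-memory SQLite database. The sketch below uses Python's sqlite3 module, only a subset of the Ships columns, and made-up values; it also shows that ties are all returned.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Subset of the Ships schema; the values are made up for illustration.
    CREATE TABLE Ships (name TEXT, numGuns INTEGER, gunSize INTEGER);
    INSERT INTO Ships VALUES
        ('Missouri', 9, 16), ('Iowa', 9, 16), ('Hood', 8, 15);
""")

# The scalar subquery computes the maximum once; the outer query keeps
# every ship whose gunSize equals it.
query = """
    SELECT name
    FROM Ships
    WHERE gunSize = (SELECT MAX(gunSize) FROM Ships)
    ORDER BY name
"""
largest = [row[0] for row in conn.execute(query)]
print(largest)
```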
The error occurs because maxGuns is an alias for the subquery (a derived table), not for a column, and that subquery exposes only the column "Largest # of Guns", so maxGuns.name does not exist. To identify the ship with the most guns you could try something like:
with cte as (select s.*
,ROW_NUMBER() over (order by s.gunsize desc) seq
from ships s )
select * from cte
where seq = 1
Another approach, which also returns only the first row, i.e. the ship with the highest number of guns (TOP is SQL Server syntax; SQLite uses ORDER BY ... LIMIT 1 instead):
select Top 1 *
from ships s
order by s.gunsize desc
WITH TAB_SHIPS(NAME, NUMGUNS,DISPLACEMENT) AS (SELECT NAME, NUMGUNS,DISPLACEMENT FROM SHIPS AS S
LEFT JOIN CLASSES AS C
ON S.CLASS=C.CLASS
WHERE C.NUMGUNS >=ALL(SELECT NUMGUNS FROM CLASSES C1 WHERE C1.DISPLACEMENT = C.DISPLACEMENT )
UNION
SELECT SHIP, NUMGUNS,DISPLACEMENT FROM OUTCOMES AS O
LEFT JOIN CLASSES AS C
ON C.CLASS=O.SHIP
WHERE C.NUMGUNS >=ALL(SELECT NUMGUNS FROM CLASSES C1 WHERE C1.DISPLACEMENT = C.DISPLACEMENT ) )
SELECT NAME FROM TAB_SHIPS
WHERE NUMGUNS IS NOT NULL

Speeding up a slow SQL query

I am using the MySQL world.sql database. Exactly what is in it doesn't matter, but the schema that matters to us looks like:
CREATE TABLE city (
name char(35),
country_code char(3),
population int(11)
);
CREATE TABLE country (
code char(3),
name char(52),
population int(11)
);
The query in question is, in english, "for each country, give me its name and population, along with the name and population for the city who has the highest ratio of its population to the country's population"
Currently I have the following SQL:
SELECT t.name, t.population, c.name, c.population
FROM country c
JOIN city t
ON t.country_code = c.code
WHERE t.population / c.population = (
SELECT MAX(tt.population / c.population)
FROM city tt
WHERE t.country_code = tt.country_code
)
Currently the query takes about 10 minutes to run on my SQLite database. The world.sql database isn't large (4000-5000 rows?) so I'm guessing I'm doing something wrong here.
I currently don't have any sort of indexes or anything: the database is an empty database with this dataset (https://dl.dropboxusercontent.com/u/7997532/world.sql) entered into it. Could anyone give me any pointers as to what I need to fix to make it run in a reasonable amount of time?
EDIT: well here's another twist to the question:
This runs in <2 seconds
SELECT t.name, t.population, c.name, c.population
FROM country c
JOIN city t
ON t.country_code = c.code
WHERE t.population * 1.0 / c.population = (
SELECT MAX(tt.population * 1.0 / c.population)
FROM city tt
WHERE tt.country_code = t.country_code
)
While this take 10 minutes to run
SELECT t.name, t.population, c.name, c.population
FROM country c
JOIN city t
ON t.country_code = c.code
AND t.population * 1.0 / c.population = (
SELECT MAX(tt.population * 1.0 / c.population)
FROM city tt
WHERE tt.country_code = t.country_code
)
Is the solution then to simply stuff as much as possible into the ON clause when i'm doing JOINs? It seems in this case I can get away without an index if I do that...
For each country, the city with the highest ratio of its population to its country's population is simply the city with the highest population, so try this:
SELECT t.name, t.population, c.name, c.population
FROM country c
JOIN city t
ON t.country_code = c.code
AND t.population =
(SELECT MAX(population) FROM city
WHERE country_code = c.code)
But this may still not improve performance much if you have no indexes. You need to put an index on country.code and on city.country_code.
Ideally, I would start with indexes and then consider adding a computed field, stored in a link table, that pre-calculates t.population / c.population.
Then, for each country and city, you can look up its population ratio without computing it RBAR (row by agonizing row).
I suggest adding numeric primary keys to both tables and a foreign key on country_code in your city table. One of the benefits will be better performance, because primary keys are indexed.
Edit starts here
Since the question doesn't ask you to provide the actual ratio, don't worry about trying to calculate it. The city with the highest population in the country will have the highest proportion of the country's population.
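The suggestions above (add indexes, and compare populations instead of ratios) can be combined in a small sketch using Python's sqlite3 module. The index names, the composite index on (country_code, population), and the sample rows are assumptions for illustration, not part of world.sql.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE city (name TEXT, country_code TEXT, population INTEGER);
    CREATE TABLE country (code TEXT, name TEXT, population INTEGER);
    -- Hypothetical index names; they let the join and the per-country MAX
    -- lookup use index searches instead of full table scans.
    CREATE INDEX idx_country_code ON country (code);
    CREATE INDEX idx_city_country ON city (country_code, population);
    -- Made-up sample rows.
    INSERT INTO country VALUES ('AAA', 'Country A', 1000), ('BBB', 'Country B', 500);
    INSERT INTO city VALUES
        ('A-town', 'AAA', 300), ('A-ville', 'AAA', 700),
        ('B-burg', 'BBB', 200);
""")

# Within one country the denominator is constant, so the city with the
# highest ratio is the city with the highest population.
query = """
    SELECT t.name, t.population, c.name, c.population
    FROM country AS c
    JOIN city AS t ON t.country_code = c.code
    WHERE t.population = (SELECT MAX(population)
                          FROM city
                          WHERE country_code = c.code)
    ORDER BY c.code
"""
rows = conn.execute(query).fetchall()
print(rows)
```

On a real dataset, running EXPLAIN QUERY PLAN on the query is one way to confirm whether the indexes are actually used.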