Speeding up a slow SQL query - sql

I am using the MySQL world.sql database. Exactly what is in it doesn't matter, but the part of the schema that matters to us looks like:
CREATE TABLE city (
name char(35),
country_code char(3),
population int(11)
);
CREATE TABLE country (
code char(3),
name char(52),
population int(11)
);
The query in question is, in English, "for each country, give me its name and population, along with the name and population of the city that has the highest ratio of its population to the country's population".
Currently I have the following SQL:
SELECT t.name, t.population, c.name, c.population
FROM country c
JOIN city t
ON t.country_code = c.code
WHERE t.population / c.population = (
SELECT MAX(tt.population / c.population)
FROM city tt
WHERE t.country_code = tt.country_code
)
Currently the query takes about 10 minutes to run on my SQLite database. The world.sql database isn't large (4000-5000 rows?) so I'm guessing I'm doing something wrong here.
I currently don't have any sort of indexes or anything: the database is an empty database with this dataset (https://dl.dropboxusercontent.com/u/7997532/world.sql) entered into it. Could anyone give me any pointers as to what I need to fix to make it run in a reasonable amount of time?
EDIT: well here's another twist to the question:
This runs in <2 seconds
SELECT t.name, t.population, c.name, c.population
FROM country c
JOIN city t
ON t.country_code = c.code
WHERE t.population * 1.0 / c.population = (
SELECT MAX(tt.population * 1.0 / c.population)
FROM city tt
WHERE tt.country_code = t.country_code
)
While this takes 10 minutes to run
SELECT t.name, t.population, c.name, c.population
FROM country c
JOIN city t
ON t.country_code = c.code
AND t.population * 1.0 / c.population = (
SELECT MAX(tt.population * 1.0 / c.population)
FROM city tt
WHERE tt.country_code = t.country_code
)
Is the solution then simply to stuff as much as possible into the ON clause when I'm doing JOINs? It seems that in this case I can get away without an index if I do that...

For each country, the city that has the highest ratio of population to its country's population is simply the city with the highest population, so try this:
SELECT t.name, t.population, c.name, c.population
FROM country c
JOIN city t
ON t.country_code = c.code
AND t.population =
(SELECT MAX(population) FROM city
WHERE country_code = c.code)
But this may still not improve performance much if you have no indexes. You need to put an index on country.code and one on city.country_code.
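As a rough illustration of the suggestion above, here is a minimal sketch using Python's built-in sqlite3 module (the asker mentions running on SQLite). The sample rows and the index names are made up, and the composite index on (country_code, population) is an assumption that goes slightly beyond the plain city.country_code index mentioned above.

```python
import sqlite3

# In-memory database with a tiny, made-up sample of the schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE country (code TEXT, name TEXT, population INTEGER);
CREATE TABLE city (name TEXT, country_code TEXT, population INTEGER);

-- Indexes on the join/filter columns (index names are hypothetical).
CREATE INDEX idx_country_code ON country (code);
CREATE INDEX idx_city_country ON city (country_code, population);

INSERT INTO country VALUES ('AAA', 'Aland', 1000), ('BBB', 'Bland', 500);
INSERT INTO city VALUES
  ('A-one', 'AAA', 300), ('A-two', 'AAA', 450),
  ('B-one', 'BBB', 200), ('B-two', 'BBB', 50);
""")

# The most populous city per country is also the one with the highest
# population ratio, so no division is needed.
rows = conn.execute("""
SELECT t.name, t.population, c.name, c.population
FROM country c
JOIN city t ON t.country_code = c.code
WHERE t.population = (SELECT MAX(population)
                      FROM city
                      WHERE country_code = c.code)
ORDER BY c.code
""").fetchall()

for row in rows:
    print(row)
```

If two cities in a country tie for the maximum, both rows are returned.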

Ideally, I would start with indexes first, and then consider adding a computed field in a link table that pre-calculates t.population / c.population.
That way, for each country and city, you can look up its population ratio without computing it RBAR (row by agonizing row).

I suggest adding numeric primary keys to both tables and a foreign key on country_code in your city table. One of the benefits will be better performance because primary keys are indexed.
Edit starts here
Since the question doesn't ask you to provide the actual ratio, don't worry about trying to calculate it. The city with the highest population in the country will have the highest proportion of the country's population.

Related

Subquery yields different results when used alone

I have to write a query across two different tables, country and city. The goal is to get every district and that district's population for every country. As the district is just an attribute of each city, I have to sum up the populations of all cities belonging to a district.
My query so far looks like this:
SELECT country.name, country.population, array_agg(
(SELECT (c.district, sum(city.population))
FROM city GROUP BY c.district))
AS districts
FROM country
FULL OUTER JOIN city c ON country.code = c.countrycode
GROUP BY country.name, country.population;
The result:
name | population | districts
---------------------------------------------+------------+------------------------------------------------------------------------------------------------------------------
Afghanistan | 22720000 | {"(Balkh,1429559884)","(Qandahar,1429559884)","(Herat,1429559884)","(Kabol,1429559884)"}
Albania | 3401200 | {"(Tirana,1429559884)"}
Algeria | 31471000 | {"(Blida,1429559884)","(Béjaïa,1429559884)","(Annaba,1429559884)","(Batna,1429559884)","(Mostaganem,1429559884)"
American Samoa | 68000 | {"(Tutuila,1429559884)","(Tutuila,1429559884)"}
So apparently it sums all the city-populations of the world. I need to limit that somehow to each district alone.
But if I run the Subquery alone as
SELECT (city.district, sum(city.population)) FROM city GROUP BY city.district;
it gives me the districts with their population:
row
----------------------------------
(Bali,435000)
(,4207443)
(Dnjestria,194300)
(Mérida,224887)
(Kochi,324710)
(Qazvin,291117)
(Izmir,2130359)
(Meta,273140)
(Saint-Denis,131480)
(Manitoba,618477)
(Changhwa,354117)
I realized it has something to do with the abbreviation (alias) that I use when joining. I used it for convenience, but it seems to have real consequences, because if I don't use it, I get the error
more than one row returned by a subquery used as an expression
Also, if I use
sum(c.population)
in the subquery it won't execute because
aggregate function calls cannot be nested
This abbreviation when joining apparently changes a lot.
I hope someone can shed some light on that.
Solved it myself.
Window functions are the most convenient method for this kind of task:
SELECT DISTINCT
country.name
, country.population
, city.district
, sum(city.population) OVER (PARTITION BY city.district)
AS district_population
, sum(city.population) OVER (PARTITION BY city.district)/ CAST(country.population as float)
AS district_share
FROM
country JOIN city ON country.code = city.countrycode
;
But it also works with subselects:
SELECT DISTINCT
country.name
, country.population
, city.district
,(
SELECT
sum(ci.population)
FROM
city ci
WHERE ci.district = city.district
) AS district_population
,(
SELECT
sum(ci2.population)/ CAST(country.population as float)
FROM
city ci2
WHERE ci2.district = city.district
) AS district_share
FROM
country JOIN city ON country.code = city.countrycode
ORDER BY
country.name
, country.population
;
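For comparison, here is a minimal sketch of a third variant that uses a plain GROUP BY instead of window functions or subselects. This is not the author's query: it is an alternative under the assumption that district totals should be computed per country, and the sample rows are made up. It runs on SQLite through Python's sqlite3 module rather than on PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE country (code TEXT, name TEXT, population INTEGER);
CREATE TABLE city (name TEXT, countrycode TEXT, district TEXT, population INTEGER);

-- Made-up sample data; note that 'North' exists in both countries.
INSERT INTO country VALUES ('AAA', 'Aland', 1000), ('BBB', 'Bland', 400);
INSERT INTO city VALUES
  ('A-one',   'AAA', 'North', 100),
  ('A-two',   'AAA', 'North', 150),
  ('A-three', 'AAA', 'South', 250),
  ('B-one',   'BBB', 'North', 100);
""")

# One output row per (country, district); the share is computed in
# floating point by multiplying with 1.0 before dividing.
rows = conn.execute("""
SELECT country.name,
       city.district,
       SUM(city.population) AS district_population,
       SUM(city.population) * 1.0 / country.population AS district_share
FROM country
JOIN city ON country.code = city.countrycode
GROUP BY country.code, country.name, country.population, city.district
ORDER BY country.name, city.district
""").fetchall()

for row in rows:
    print(row)
```

Grouping by the country as well keeps same-named districts in different countries apart, which partitioning by the district name alone does not.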

Find string from table in cell in BigQuery --> Query exceeded resource limits

I have two tables in BigQuery:
City list: table invertible-fin-XXX238.Reports.City
Station names: table invertible-fin-XXX238.Reports.Station
Most of the station names contain city names. Now I want to extract the city from the Station table.
Here is some example data:
City: Berlin
Stationname: inStore_Berlin_Alexanderplatz
Stationname: Berlin Schönefeld Airport
Stationname: Train Station Franchise Berlin
I tried the INSTR function, but had no success (INSTR works only with Legacy SQL, and there I couldn't use subselects).
SELECT City,
INSTR((SELECT AdGroupName
FROM [invertible-fin-XXX238.Reports.City]),City) AS Match
FROM [invertible-fin-XXX238.Reports.Station]
Therefore I tried it with WHERE ... LIKE. Below is the SQL code:
SELECT a.City
FROM [invertible-fin-XXX238.Reports.City] a
CROSS JOIN [invertible-fin-XXX238.Reports.Station] b
WHERE b.Name LIKE '%' + a.City + '%'
GROUP BY a.City
But now the query is too computationally intensive, and I got the error "Query exceeded resource limits for tier 1. Tier 18 or higher required." back.
Could someone please help me write a more resource-friendly query?
Thanks in advance,
Philipp
Below are a few of the many possible versions for BigQuery Standard SQL:
#standardSQL
SELECT city, station
FROM `invertible-fin-XXX238.Reports.Station` AS s
JOIN `invertible-fin-XXX238.Reports.City` AS c
ON REPLACE(LOWER(station), LOWER(city), '') <> LOWER(station)
or
#standardSQL
SELECT city, station
FROM `invertible-fin-XXX238.Reports.Station` AS s
JOIN `invertible-fin-XXX238.Reports.City` AS c
ON LOWER(station) LIKE CONCAT('%',LOWER(city),'%')
You can remove the LOWER() calls if city names are spelled in the same case in both tables.
While the versions above look more straightforward, I would prefer the one below, as it lets you control how the city is extracted from the station name. In the pattern r'([^ _]+)' you should list all characters that you observe acting as delimiters in the station column. That way a city is extracted only when it is not part of a longer name.
Of course, you should first check whether you even need to worry about this.
#standardSQL
WITH tokens AS (
SELECT token, station
FROM `invertible-fin-XXX238.Reports.Station` AS s,
UNNEST(REGEXP_EXTRACT_ALL(LOWER(station), r'([^ _]+)')) token
)
SELECT city, station
FROM tokens AS s
JOIN `invertible-fin-XXX238.Reports.City` AS c
ON LOWER(city) = token
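To make the token idea concrete outside BigQuery, here is a minimal Python sketch of the same logic: split each station name on the delimiters (space and underscore, as in the pattern above) and compare whole tokens with the city names. The function name match_cities and the 'Berlinchen Market' row are hypothetical; the other sample values come from the question.

```python
import re

cities = ["Berlin", "Hamburg"]
stations = [
    "inStore_Berlin_Alexanderplatz",
    "Berlin Schönefeld Airport",
    "Train Station Franchise Berlin",
    "Berlinchen Market",  # hypothetical: city name inside a longer word
]


def match_cities(station, cities):
    """Return the cities that appear as a whole token in the station name."""
    # Same character class as the SQL pattern: anything except ' ' and '_'.
    tokens = set(re.findall(r"[^ _]+", station.lower()))
    return [city for city in cities if city.lower() in tokens]


for station in stations:
    print(station, "->", match_cities(station, cities))
```

A substring match with LIKE would also report Berlin for 'Berlinchen Market', whereas the token comparison does not. On the other hand, a city name that itself contains a delimiter can never equal a single token.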
I also wonder how the performance for a sub-query would be in this case. For instance:
WITH City AS(
SELECT 'Berlin' As Name UNION ALL
SELECT 'Hamburg'
),
StationNames AS(
SELECT 'inStore_Berlin_Alexanderplatz' AS Name UNION ALL
SELECT 'Berlin Schönefeld Airport' UNION ALL
SELECT 'Train Station Franchise Berlin' UNION ALL
SELECT 'Train Station Hamburg' UNION ALL
SELECT 'Train Station Pluton'
)
SELECT
Name StationName,
(SELECT Name FROM City c WHERE LOWER(s.Name) LIKE CONCAT('%', LOWER(c.Name), '%')) city
FROM StationNames s
Or in your case:
SELECT
Name StationName,
(SELECT Name FROM `invertible-fin-XXX238.Reports.City` c WHERE LOWER(s.Name) LIKE CONCAT('%', LOWER(c.Name), '%')) city
FROM `invertible-fin-XXX238.Reports.Station` s
I know it's common wisdom for most databases that a JOIN performs better than a sub-query, but BigQuery has lots of different optimization techniques for storing and querying data, so I was curious to know how different the performance would be in this case.

SQL query that will retrieve set containing all entries from another set

I have the following relations in my db:
Organization: information about political and economical organizations.
  name: the full name of the organization
  abbreviation: its abbreviation
isMember: memberships in political and economical organizations.
  organization: the abbreviation of the organization
  country: the code of the member country
geo_desert: geographical information about deserts
  desert: the name of the desert
  country: the country code where it is located
  province: the province of this country
My task is to retrieve the organizations whose members include the full set of countries with deserts. Such an organization may also have member countries without deserts. So I have a set of countries with deserts, and every organization in the result should have all of them as members, plus an arbitrary number of other (non-desert) countries.
So far I have tried to write the following code, but it doesn't work.
WITH CountriesWithDeserts AS (
SELECT DISTINCT country
FROM dbmaster.geo_desert
), OrganizationsWithAllDesertMembers AS (
SELECT organization
FROM dbmaster.isMember AS ism
WHERE (
SELECT count(*)
FROM (
SELECT *
FROM CountriesWithDeserts
EXCEPT
SELECT country
FROM dbmaster.isMember
WHERE organization = ism.organization
)
) IS NULL
), OrganizationCode AS (
SELECT name, abbreviation
FROM dbmaster.Organization
)
SELECT oc.name AS Organization
FROM OrganizationCode AS oc, OrganizationsWithAllDesertMembers AS owadm
WHERE oc.abbreviation=owadm.organization;
UPD: DBMS says: "ism.organization is not defined"
I'm using DB2/LINUXX8664 9.7.0
Output should look like this:
NAME --------------------------------------------------------------------------------
African, Caribbean, and Pacific Countries
African Development Bank
Agency for Cultural and Technical Cooperation
Andean Group
I find the easiest way to handle this is by using group by and having. You just want to focus on the deserts, so the rest of the countries don't matter.
select m.organization
from isMember m join
geo_desert d
on m.country = d.country
group by m.organization
having count(distinct m.country) = (select count(distinct gd.country) from geo_desert gd);
The having clause simply counts the number of matching (i.e. desert) countries and checks that all are included.
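The counting approach above can be tried on a toy data set. The following is a minimal sketch using Python's sqlite3 module instead of DB2; the organization codes, country codes, and desert names are made up, and the subquery uses its own alias (gd, an assumed name) so that it counts all desert countries rather than referring to the outer table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE isMember (organization TEXT, country TEXT);
CREATE TABLE geo_desert (desert TEXT, country TEXT, province TEXT);

-- Made-up data: the desert countries are X and Y.
INSERT INTO geo_desert VALUES
  ('D1', 'X', 'p1'), ('D1', 'Y', 'p2'), ('D2', 'X', 'p3');
INSERT INTO isMember VALUES
  ('ORG1', 'X'), ('ORG1', 'Y'), ('ORG1', 'Z'),  -- all desert countries + one other
  ('ORG2', 'X'), ('ORG2', 'Z'),                 -- missing Y
  ('ORG3', 'Y');                                -- missing X
""")

# Keep only memberships in desert countries, then require that each
# organization covers as many distinct desert countries as exist overall.
rows = conn.execute("""
SELECT m.organization
FROM isMember m
JOIN geo_desert d ON m.country = d.country
GROUP BY m.organization
HAVING COUNT(DISTINCT m.country) =
       (SELECT COUNT(DISTINCT gd.country) FROM geo_desert gd)
ORDER BY m.organization
""").fetchall()

print(rows)
```

COUNT(DISTINCT ...) matters here because a country with several deserts joins to several geo_desert rows.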
Word it like this: you are looking for organizations for which there does not exist a desert country that they don't include.
select *
from organization o
where not exists
(
select country from geo_desert
except
select country from ismember
where organization = o.abbreviation
);
Here are two equivalent solutions:
First:
WITH CountriesWithDeserts AS (
SELECT DISTINCT country
FROM dbmaster.geo_desert
), OrganizationsWithAllDesertMembers AS (
SELECT ism.organization
FROM dbmaster.isMember AS ism
JOIN CountriesWithDeserts AS cwd
ON ism.country = cwd.country
GROUP BY ism.organization
HAVING count(ism.country) = (SELECT count(*) FROM CountriesWithDeserts)
), OrganizationCode AS (
SELECT name, abbreviation
FROM dbmaster.Organization
)
SELECT oc.name AS Organization
FROM OrganizationCode AS oc, OrganizationsWithAllDesertMembers AS owadm
WHERE oc.abbreviation=owadm.organization;
Second:
WITH CountriesWithDeserts AS (
SELECT DISTINCT country
FROM dbmaster.geo_desert
)
SELECT org.name AS Organization
FROM dbmaster.Organization AS org
WHERE NOT EXISTS (
SELECT *
FROM CountriesWithDeserts
EXCEPT
SELECT country
FROM dbmaster.isMember
WHERE organization = org.abbreviation
);

Divide two queries in SQL then group by

I am looking for the rate change between new accounts and all accounts, I have both queries listed below. I need to divide NewAccounts by AllAccounts, take that percentage and group by town in the same query. Thanks
SELECT DISTINCT Count(NewAccounts), Town
FROM (SELECT Stuff)
WHERE (Newaccounts)
Group By Town
;
SELECT DISTINCT Count(AllAccounts), Town
FROM (SELECT DifferentSTUFF)
WHERE (AllAccounts)
Group By Town
You need to rewrite your queries as subqueries and join them together:
SELECT CAST(na.NewAccounts AS FLOAT) / aa.AllAccounts
FROM ( SELECT Count(NewAccounts) AS NewAccounts, Town
FROM (SELECT Stuff)
WHERE (Newaccounts)
GROUP BY Town
) na
INNER JOIN
( SELECT Count(AllAccounts) AS AllAccounts, Town
FROM (SELECT DifferentSTUFF)
WHERE (AllAccounts)
GROUP BY Town
) aa
ON aa.Town = na.Town;
N.B. I have removed DISTINCT from both queries, as it is redundant with GROUP BY. The cast to float on NewAccounts avoids integer division, which would otherwise truncate the result to an integer.
You may need to alter this slightly depending on the availability of data in each of the queries, i.e. if you won't always have a result in the new accounts for a town it would be better written as:
SELECT CAST(COALESCE(na.NewAccounts, 0) AS FLOAT) / aa.AllAccounts
FROM
( SELECT Count(AllAccounts) AS AllAccounts, Town
FROM (SELECT DifferentSTUFF)
WHERE (AllAccounts)
GROUP BY Town
) aa
LEFT JOIN
( SELECT Count(NewAccounts) AS NewAccounts, Town
FROM (SELECT Stuff)
WHERE (Newaccounts)
GROUP BY Town
) na
ON aa.Town = na.Town
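Since the queries in the question are placeholders, here is a minimal runnable sketch of the LEFT JOIN variant under an assumed schema: a single hypothetical accounts table with a town column and an is_new flag. It uses Python's sqlite3 module, where the floating-point type is REAL rather than FLOAT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical table standing in for the question's placeholder queries.
CREATE TABLE accounts (id INTEGER PRIMARY KEY, town TEXT, is_new INTEGER);
INSERT INTO accounts (town, is_new) VALUES
  ('Alpha', 1), ('Alpha', 0), ('Alpha', 0), ('Alpha', 0),
  ('Beta', 0), ('Beta', 0);
""")

# All accounts per town on the left, new accounts per town on the right.
# COALESCE turns a missing right-hand row into 0, and the CAST avoids
# integer division.
rows = conn.execute("""
SELECT aa.town,
       CAST(COALESCE(na.new_accounts, 0) AS REAL) / aa.all_accounts AS new_rate
FROM (SELECT town, COUNT(*) AS all_accounts
      FROM accounts
      GROUP BY town) aa
LEFT JOIN (SELECT town, COUNT(*) AS new_accounts
           FROM accounts
           WHERE is_new = 1
           GROUP BY town) na
  ON aa.town = na.town
ORDER BY aa.town
""").fetchall()

for row in rows:
    print(row)
```

With an INNER JOIN instead, the town without new accounts ('Beta' in this sample) would drop out of the result.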

Query for Counting number of orders by UK postcode

I have a table of orders placed by customers. I want to check which parts of the country orders have historically come from, and I can only do this by postcode. For instance, an order with a postcode starting with SK is from Stockport; similarly, a postcode starting with M means the order is from Manchester. Is it possible to write a query that counts the orders by postcode?
Some of the fields of the Order table:
OrderNumber OGUID custID firstname last name address postcode email authorisation date etc...
Any suggestion or assistance will be appreciated.
Thanks
Here is a way that works, but it can get too long for a huge list. I will try to find a way around that problem.
SELECT
CASE
WHEN postcode LIKE 'SK%' THEN 'SK'
WHEN postcode LIKE 'M%' THEN 'M'
END AS group_by_value
, COUNT(*) AS group_by_count
FROM [Table] a
GROUP BY
CASE
WHEN postcode LIKE 'SK%' THEN 'SK'
WHEN postcode LIKE 'M%' THEN 'M'
END
If you have a table that contains the city code and city name, then you might be able to use something like the following which joins your orders table to the codes using a LIKE:
select o.postcode,
c.city,
count(c.code) over(partition by c.code) Total
from orders o
inner join codes c
on o.postcode like '%'+c.code+'%'
You can use GROUP BY to get the total number of orders in each postcode:
select postcode, count(postcode) TotalOrdersByPostCode
from orders
group by postcode
If you want the City included, then you can also GROUP BY city:
select city, postcode, count(postcode) TotalOrdersByPostCode
from orders
group by city, postcode
select count(1) over(partition by postcode) as countByPostcode, othercolumnhere
from [Order]
Have you tried something like this? The town part of the postcode will be the first one or two letters, followed by a digit, I think. So this will give you the first few letters.
select substring(postcode,1, patindex('%[0-9]%',postcode)-1), count(*)
from [Order]
group by substring(postcode,1, patindex('%[0-9]%',postcode)-1)
Then you'll have to decode M into Manchester, W into West London, GU into Guildford etc...
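The extraction and decoding steps can also be sketched outside the database. Below is a minimal Python illustration that takes the leading letters of each postcode, counts orders per area, and decodes the area with a lookup table. The sample postcodes are made up, postcode_area and AREA_NAMES are hypothetical names, and the lookup holds only the three areas mentioned in this thread.

```python
import re
from collections import Counter

# Lookup for decoding the area letters; only the examples from this thread.
AREA_NAMES = {"SK": "Stockport", "M": "Manchester", "GU": "Guildford"}


def postcode_area(postcode):
    """Return the leading letters of a postcode (e.g. 'SK' for 'SK1 3AA')."""
    match = re.match(r"[A-Za-z]+", postcode.strip())
    return match.group(0).upper() if match else None


# Made-up sample postcodes standing in for the postcode column.
postcodes = ["SK1 3AA", "sk4 2HD", "M1 1AE", "M20 4BX", "GU1 1AA"]

counts = Counter(postcode_area(p) for p in postcodes)

for area, total in counts.items():
    print(area, AREA_NAMES.get(area, "unknown"), total)
```

Postcodes that do not start with a letter end up under None, so malformed values stay visible in the counts instead of raising an error.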