Recursion in PostgreSQL - sql

Let's assume we have a table borders(country1, country2) in which each row holds two countries that border each other, e.g. (Sweden, Norway). I would like to find all the countries that can be reached from a given country, say Sweden, by border crossings only.
Here's the first part of my solution:
WITH RECURSIVE border(countryin) AS (
select distinct country
from (select country2::character varying(4) as country
from borders where country1 = 'S'
union
select country1::character varying(4) as country
from borders where country2 = 'S' ) a
UNION
select distinct sp.country::varchar(4)
from (select country1::varchar(4) as country, country2 as n
from borders) sp
join (select country2::varchar(4) as country, country1 as n, countryin as temp
from borders, border) st
on sp.country = st.n
and sp.country in st.temp
where true
)
SELECT distinct countryin, name
FROM border, country
where countryin = code ;
The only thing I cannot get to work is a constraint that requires a specific country to already exist in the recursive border result. I tried using and sp.country in st.temp, and several other ways, but none of them worked.
Could someone give me a hint on how this can be solved?
Current Results:
Right now, I get an error stating "
ERROR: syntax error at or near "st"
LINE 4: ...s, border) st on sp.country = st.n and sp.country in st.temp
"
Desired Results
List all countries that can be reached recursively using borders, starting from 'S'. So, if we have (S,N), (N,R), (R,C), (D,A), we would get: (N,R,C)

I believe there is room for improvement, but this seems to do the job.
Base case: get the "other" country from every row where 'S' appears on either side.
Recursive case: get each new country that borders any country already in the travel path, but skip rows containing 'S' so the path does not return to the origin. A variable tracks the recursion depth so the query does not loop forever (I don't remember how many countries there are now).
After the recursion finishes, I add a DISTINCT filter to remove duplicates.
Maybe I could add a filter to the recursive case to avoid travelling back to countries already visited. I am not sure which is more efficient.
AND ( b.country1 NOT IN (SELECT country FROM Travel)
AND b.country2 NOT IN (SELECT country FROM Travel)
)
SQL Fiddle DEMO
WITH RECURSIVE travel(r_level, country) AS (
select distinct 1 as r_level,
CASE WHEN country1 = 'S' THEN country2
ELSE country1
END as country
from borders
where country1 = 'S'
or country2 = 'S'
UNION
select distinct t.r_level + 1 as r_level,
CASE WHEN b.country1 = t.country THEN b.country2
ELSE b.country1
END as country
from borders b
join travel t
ON (b.country1 = t.country OR b.country2 = t.country)
AND (b.country1 <> 'S' AND b.country2 <> 'S')
WHERE t.r_level < 300
)
SELECT DISTINCT country
FROM travel
OUTPUT
| country |
|---------|
| N |
| R |
| C |
Please feel free to provide a more complete sqlFiddle with more countries to improve the testing.
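As a minimal sketch of the same idea, here is a variant that tracks only the country (no r_level column), so UNION itself removes rows that were already produced and the recursion stops once no new country appears. It runs the SQL through Python's sqlite3 module on the sample rows from the question; the helper names build_demo_db and reachable are made up for this illustration, and PostgreSQL behaviour is assumed to match for this simple query.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory borders table with the sample rows from the question."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE borders (country1 TEXT, country2 TEXT)")
    conn.executemany(
        "INSERT INTO borders VALUES (?, ?)",
        [("S", "N"), ("N", "R"), ("R", "C"), ("D", "A")],
    )
    return conn


def reachable(conn, start):
    """Return every country reachable from `start` via border rows, sorted."""
    sql = """
        WITH RECURSIVE travel(country) AS (
            -- base case: the "other" side of every row that contains the start
            SELECT CASE WHEN country1 = :start THEN country2 ELSE country1 END
            FROM borders
            WHERE country1 = :start OR country2 = :start
            UNION
            -- recursive case: neighbours of countries found so far;
            -- UNION (not UNION ALL) drops rows that were already produced
            SELECT CASE WHEN b.country1 = t.country THEN b.country2 ELSE b.country1 END
            FROM borders b
            JOIN travel t ON b.country1 = t.country OR b.country2 = t.country
        )
        SELECT country FROM travel WHERE country <> :start ORDER BY country
    """
    return [row[0] for row in conn.execute(sql, {"start": start})]


if __name__ == "__main__":
    print(reachable(build_demo_db(), "S"))
```

The start country can re-enter the working set through its own border rows, so it is filtered out in the final SELECT rather than inside the recursion.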

Related

How to select some rows from a table but average values from WHOLE column in T-SQL?

I have a database of airline delays and I need to average the delays over ALL flights of each airline, but then display only the airlines that fly from city X.
I tried this code:
SELECT
B.airline_name,
AVG(A.arrival_delay) avg_delay
FROM
TableDelays A
JOIN
TableAirlines B ON A.airline_id = B.airline_id
WHERE
A.city = 'X'
GROUP BY
B.airline_name
But when I use the WHERE Origin = 'X' line, I get an incorrect average delay, computed from only the flights that depart from city X. When I don't use the WHERE line, all airlines are displayed with correct averages (from all their flights), but I only need to display the ones flying from city X.
Does anyone know how to "extract" only the airlines departing from city X so that I don't take it into consideration while averaging the values?
Use a HAVING clause:
SELECT ta.airline_name,
AVG(td.arrival_delay) as avg_delay
FROM TableDelays td JOIN
TableAirlines ta
ON td.airline_id = ta.airline_id
GROUP BY ta.airline_name
HAVING SUM(CASE WHEN td.city = 'X' THEN 1 ELSE 0 END) > 0
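To see why HAVING works where WHERE does not, here is a small sketch that runs the same query through Python's sqlite3 module. The sample rows and the helper names are invented for this illustration; only the table and column names come from the question.

```python
import sqlite3


def build_demo_db():
    """Two airlines; only 'Alpha' has a flight from city X (invented sample data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE TableAirlines (airline_id INTEGER PRIMARY KEY, airline_name TEXT);
        CREATE TABLE TableDelays (airline_id INTEGER, city TEXT, arrival_delay REAL);
        INSERT INTO TableAirlines VALUES (1, 'Alpha'), (2, 'Beta');
        INSERT INTO TableDelays VALUES
            (1, 'X', 10), (1, 'Y', 30),
            (2, 'Y', 5),  (2, 'Z', 15);
    """)
    return conn


def avg_delay_for_airlines_serving(conn, city):
    """Average over ALL flights, but keep only airlines with a flight from `city`."""
    sql = """
        SELECT ta.airline_name, AVG(td.arrival_delay) AS avg_delay
        FROM TableDelays td
        JOIN TableAirlines ta ON td.airline_id = ta.airline_id
        GROUP BY ta.airline_name
        -- the filter runs after grouping, so the average still sees every row
        HAVING SUM(CASE WHEN td.city = ? THEN 1 ELSE 0 END) > 0
        ORDER BY ta.airline_name
    """
    return conn.execute(sql, (city,)).fetchall()


if __name__ == "__main__":
    print(avg_delay_for_airlines_serving(build_demo_db(), "X"))
```

With this data a WHERE city = 'X' filter would average only the single 10-minute delay for Alpha, while the HAVING version averages both of its flights.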
Perhaps a CTE would help achieve this?
WITH CTE AS(
SELECT TA.airline_name,
AVG(TD.arrival_delay) AS avg_delay,
MAX(CASE WHEN TD.City = 'X' THEN 1 END) AS XCity
FROM dbo.TableDelays TD --"A" is a poor choice for an alias here
JOIN dbo.TableAirlines TA ON TD.airline_id = TA.airline_id --"B" doesn't even appear in "Airlines", why use it?
GROUP BY TA.airline_name)
SELECT airline_name,
avg_delay
FROM CTE
WHERE XCity = 1;
As I note in my comments, A and B are poor choices for your table aliases. Use meaningful ones when aliasing your tables. See: Bad habits to kick: using table aliases like (a, b, c) or (t1, t2, t3)

Subquery yields different results when used alone

I have to write a query across two different tables, country and city. The goal is to get every district and that district's population for every country. As the district is just an attribute of each city, I have to sum up the populations of all cities belonging to a district.
My query so far looks like this:
SELECT country.name, country.population, array_agg(
(SELECT (c.district, sum(city.population))
FROM city GROUP BY c.district))
AS districts
FROM country
FULL OUTER JOIN city c ON country.code = c.countrycode
GROUP BY country.name, country.population;
The result:
name | population | districts
---------------------------------------------+------------+------------------------------------------------------------------------------------------------------------------
Afghanistan | 22720000 | {"(Balkh,1429559884)","(Qandahar,1429559884)","(Herat,1429559884)","(Kabol,1429559884)"}
Albania | 3401200 | {"(Tirana,1429559884)"}
Algeria | 31471000 | {"(Blida,1429559884)","(Béjaïa,1429559884)","(Annaba,1429559884)","(Batna,1429559884)","(Mostaganem,1429559884)"
American Samoa | 68000 | {"(Tutuila,1429559884)","(Tutuila,1429559884)"}
So apparently it sums all the city-populations of the world. I need to limit that somehow to each district alone.
But if I run the Subquery alone as
SELECT (city.district, sum(city.population)) FROM city GROUP BY city.district;
it gives me the districts with their population:
row
----------------------------------
(Bali,435000)
(,4207443)
(Dnjestria,194300)
(Mérida,224887)
(Kochi,324710)
(Qazvin,291117)
(Izmir,2130359)
(Meta,273140)
(Saint-Denis,131480)
(Manitoba,618477)
(Changhwa,354117)
I realized it has something to do with the alias that I use when joining. I used it for convenience, but it seems to have real consequences, because if I don't use it, it gives me the error
more than one row returned by a subquery used as an expression
Also, if I use
sum(c.population)
in the subquery it won't execute because
aggregate function calls cannot be nested
This abbreviation when joining apparently changes a lot.
I hope someone can shed some light on that.
Solved it myself.
Window functions are the most convenient method for this kind of task:
SELECT DISTINCT
country.name
, country.population
, city.district
, sum(city.population) OVER (PARTITION BY city.district)
AS district_population
, sum(city.population) OVER (PARTITION BY city.district)/ CAST(country.population as float)
AS district_share
FROM
country JOIN city ON country.code = city.countrycode
;
But it also works with subselects:
SELECT DISTINCT
country.name
, country.population
, city.district
,(
SELECT
sum(ci.population)
FROM
city ci
WHERE ci.district = city.district
) AS district_population
,(
SELECT
sum(ci2.population)/ CAST(country.population as float)
FROM
city ci2
WHERE ci2.district = city.district
) AS district_share
FROM
country JOIN city ON country.code = city.countrycode
ORDER BY
country.name
, country.population
;
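Here is a small sketch of the subselect variant, run through Python's sqlite3 module on invented sample rows. One deliberate difference, which is my assumption and not part of the query above: the subselect also matches on countrycode, in case the same district name occurs in more than one country. The helper names are made up for this illustration.

```python
import sqlite3


def build_demo_db():
    """Tiny country/city tables with invented populations."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE country (code TEXT PRIMARY KEY, name TEXT, population INTEGER);
        CREATE TABLE city (name TEXT, countrycode TEXT, district TEXT, population INTEGER);
        INSERT INTO country VALUES ('AFG', 'Afghanistan', 1000), ('ALB', 'Albania', 300);
        INSERT INTO city VALUES
            ('Kabul',   'AFG', 'Kabol',  100),
            ('Suburb',  'AFG', 'Kabol',   25),
            ('Herat',   'AFG', 'Herat',   50),
            ('Tirana',  'ALB', 'Tirana',  30);
    """)
    return conn


def district_populations(conn):
    """One row per (country, district) with the summed city population."""
    sql = """
        SELECT DISTINCT
               country.name,
               city.district,
               (SELECT SUM(ci.population)
                FROM city ci
                -- correlated on the OUTER city row, so the sum is per district
                WHERE ci.district = city.district
                  AND ci.countrycode = city.countrycode) AS district_population
        FROM country
        JOIN city ON country.code = city.countrycode
        ORDER BY country.name, city.district
    """
    return conn.execute(sql).fetchall()


if __name__ == "__main__":
    for row in district_populations(build_demo_db()):
        print(row)
```

The original query summed the whole world because its subquery aggregated over an unfiltered city table; the correlation in the WHERE clause is what limits each sum to one district.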

DB2 SQL Getting distinct value when grouping rows

BUSINESSTABLE looks like this:
HOTEL_CHAIN HOTEL_LOCATION HOTEL_OWNER
_____________________________________________________
Marriott Las Vegas Nelson
Best Western New York Richards
Best Western San Francisco Smith
Marriott New York Nelson
Hilton Boston James
I'm trying to execute an SQL statement in a DB2 database that groups these entries by HOTEL_CHAIN. If the rows that are grouped together contain the same HOTEL_LOCATION or HOTEL_OWNER, that info should be preserved. Otherwise, a value of 'NULL' should be displayed. For example, both Marriott hotels have the same owner, Nelson, so I want to display that information in the new table. However, each Marriott hotel is in a different location, so I'd like to display 'NULL' in that column.
The resulting table (HOTELTABLE) should look like this:
HOTEL_CHAIN HOTEL_LOCATION HOTEL_OWNER
_____________________________________________________
Marriott NULL Nelson
Best Western NULL NULL
Hilton Boston James
I'm trying to use the following SQL statement to accomplish this:
INSERT INTO HOTELTABLE(HOTEL_CHAIN,HOTEL_LOCATION,HOTEL_OWNER)
SELECT
HOTEL_CHAIN,
CASE COUNT(DISTINCT(HOTEL_LOCATION)) WHEN 1 THEN HOTEL_LOCATION ELSE 'NULL' END,
CASE COUNT(DISTINCT(HOTEL_OWNER)) WHEN 1 THEN HOTEL_OWNER ELSE 'NULL' END,
FROM BUSINESSTABLE GROUP BY HOTEL_CHAIN
I get an SQL error SQLCODE-119 A COLUMN OR EXPRESSION IN A HAVING CLAUSE IS NOT VALID. It seems to be complaining about the 2nd HOTEL_LOCATION and the 2nd HOTEL_OWNER within my case statements. I also tried using DISTINCT(HOTEL_LOCATION) and that threw another error. Can someone please explain the correct way to code this? Thank you!
Don't use COUNT(DISTINCT). Use MIN() and MAX():
INSERT INTO HOTELTABLE(HOTEL_CHAIN,HOTEL_LOCATION,HOTEL_OWNER)
SELECT HOTEL_CHAIN,
(CASE WHEN MIN(HOTEL_LOCATION) = MAX(HOTEL_LOCATION)
THEN MIN(HOTEL_LOCATION) ELSE 'NULL'
END),
(CASE WHEN MIN(HOTEL_OWNER) = MAX(HOTEL_OWNER)
THEN MIN(HOTEL_OWNER) ELSE 'NULL'
END)
FROM BUSINESSTABLE
GROUP BY HOTEL_CHAIN;
Notes:
Why not COUNT(DISTINCT)? It is generally much more expensive than MIN() and MAX() because it needs to maintain internal lists of all values.
I don't approve of a string value called 'NULL'. Seems like it is designed to foster confusion. Perhaps just NULL the value itself?
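To check the MIN()/MAX() comparison against the sample table, here is a sketch that runs just the SELECT part through Python's sqlite3 module (SQLite here, not DB2, so treat it as an illustration of the logic only). The helper names are invented; the string 'NULL' is kept as in the query above.

```python
import sqlite3


def build_demo_db():
    """BUSINESSTABLE with the five rows shown in the question."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE BUSINESSTABLE (HOTEL_CHAIN TEXT, HOTEL_LOCATION TEXT, HOTEL_OWNER TEXT)"
    )
    conn.executemany(
        "INSERT INTO BUSINESSTABLE VALUES (?, ?, ?)",
        [
            ("Marriott", "Las Vegas", "Nelson"),
            ("Best Western", "New York", "Richards"),
            ("Best Western", "San Francisco", "Smith"),
            ("Marriott", "New York", "Nelson"),
            ("Hilton", "Boston", "James"),
        ],
    )
    return conn


def summarize_chains(conn):
    """Keep a column value only when every row of the chain agrees on it."""
    sql = """
        SELECT HOTEL_CHAIN,
               CASE WHEN MIN(HOTEL_LOCATION) = MAX(HOTEL_LOCATION)
                    THEN MIN(HOTEL_LOCATION) ELSE 'NULL' END,
               CASE WHEN MIN(HOTEL_OWNER) = MAX(HOTEL_OWNER)
                    THEN MIN(HOTEL_OWNER) ELSE 'NULL' END
        FROM BUSINESSTABLE
        GROUP BY HOTEL_CHAIN
        ORDER BY HOTEL_CHAIN
    """
    return conn.execute(sql).fetchall()


if __name__ == "__main__":
    for row in summarize_chains(build_demo_db()):
        print(row)
```

If the smallest and largest value in a group are equal, every non-NULL value in the group is the same, which is why no COUNT(DISTINCT) is needed.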
I agree with Gordon about the NULL (good job, Gordon).
Another method:
INSERT INTO HOTELTABLE(HOTEL_CHAIN,HOTEL_LOCATION,HOTEL_OWNER)
select distinct f1.HOTEL_CHAIN,
case when f2.HasDiffLocation is not null then 'NULL' else f1.HOTEL_LOCATION end as HOTEL_LOCATION,
case when f3.HasDiffOwner is not null then 'NULL' else f1.HOTEL_OWNER end as HOTEL_OWNER
from BUSINESSTABLE f1
left outer join lateral
(
select 1 HasDiffLocation from BUSINESSTABLE f2b
where f1.HOTEL_CHAIN=f2b.HOTEL_CHAIN and f1.HOTEL_LOCATION<>f2b.HOTEL_LOCATION
fetch first rows only
) f2 on 1=1
left outer join lateral
(
select 1 HasDiffOwner from BUSINESSTABLE f3b
where f1.HOTEL_CHAIN=f3b.HOTEL_CHAIN and f1.HOTEL_OWNER<>f3b.HOTEL_OWNER
fetch first rows only
) f3 on 1=1
or like this :
INSERT INTO HOTELTABLE(HOTEL_CHAIN,HOTEL_LOCATION,HOTEL_OWNER)
select distinct f1.HOTEL_CHAIN,
ifnull(f2.result, f1.HOTEL_LOCATION) as HOTEL_LOCATION,
ifnull(f3.result, f1.HOTEL_OWNER) as HOTEL_OWNER
from BUSINESSTABLE f1
left outer join lateral
(
select 'NULL' result from BUSINESSTABLE f2b
where f1.HOTEL_CHAIN=f2b.HOTEL_CHAIN and f1.HOTEL_LOCATION<>f2b.HOTEL_LOCATION
fetch first rows only
) f2 on 1=1
left outer join lateral
(
select 'NULL' result from BUSINESSTABLE f3b
where f1.HOTEL_CHAIN=f3b.HOTEL_CHAIN and f1.HOTEL_OWNER<>f3b.HOTEL_OWNER
fetch first rows only
) f3 on 1=1

Comparing a list of values

For example, I have a head table with one column, id, and a position table with id, head-id (a reference to the head table, 1 to N), and a value. Now I select one row in the head table, say id 1. I look into the position table and find 2 rows which reference that head row and have the values 1337 and 1338. Now I want to select all heads which also have 2 positions with these values 1337 and 1338. The position ids are not the same, only the values, because it is not an M to N relation. Can anyone tell me an SQL statement for this? I have no idea how to get it done :/
Assuming that the value is not repeated for a given headid in the position table, and that it is never NULL, then you can do this using the following logic. Do a full outer join on the position table to the specific head positions you care about. Then check whether there is a full match.
The following query does this:
select *
from (select p.headid,
sum(case when p.value is not null then 1 else 0 end) as pmatches,
sum(case when ref.value is not null then 1 else 0 end) as refmatches
from (select p.headid, p.value
from position p
where p.headid = <whatever>
) ref full outer join
position p
on p.value = ref.value and
p.headid <> ref.headid
group by p.headid
) t
where t.pmatches = t.refmatches
If you do have NULLs in the values, you can accommodate these using coalesce. If you have duplicates, you need to specify more clearly what to do in this case.
Assuming you have:
Create table head
(
id int
)
Create table pos
(
id int,
head_id int,
value int
)
and you need to find duplicates by value, then I'd use:
Select distinct p.head_id, p1.head_id
from pos p
join pos p1 on p.value = p1.value and p.head_id<>p1.head_id
where p.head_id = 1
This works for a specific head_id; drop the final WHERE clause to get the pairs for every head_id.
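If the goal is heads whose complete set of values equals the reference head's set (not just any shared value), a different technique, counting matches per head, can be sketched as follows. This is my own illustration, not either answerer's query. It uses the head/pos schema assumed above, runs through Python's sqlite3 module, and assumes a value is never NULL and never repeated within one head. The helper names are invented.

```python
import sqlite3


def build_demo_db():
    """pos rows for five heads; heads 2 and 5 carry exactly the values of head 1."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE pos (id INTEGER PRIMARY KEY, head_id INTEGER, value INTEGER)")
    conn.executemany(
        "INSERT INTO pos (head_id, value) VALUES (?, ?)",
        [
            (1, 1337), (1, 1338),
            (2, 1337), (2, 1338),
            (3, 1337),
            (4, 1337), (4, 1338), (4, 1339),
            (5, 1338), (5, 1337),
        ],
    )
    return conn


def heads_with_same_values(conn, ref_head_id):
    """Return head ids whose value set equals that of `ref_head_id`."""
    sql = """
        SELECT p.head_id
        FROM pos p
        WHERE p.head_id <> :ref
        GROUP BY p.head_id
        -- same number of positions as the reference head ...
        HAVING COUNT(*) = (SELECT COUNT(*) FROM pos WHERE head_id = :ref)
           -- ... and every one of them also occurs in the reference head
           AND SUM(CASE WHEN p.value IN (SELECT value FROM pos WHERE head_id = :ref)
                        THEN 1 ELSE 0 END) = COUNT(*)
        ORDER BY p.head_id
    """
    return [row[0] for row in conn.execute(sql, {"ref": ref_head_id})]


if __name__ == "__main__":
    print(heads_with_same_values(build_demo_db(), 1))
```

Head 3 is rejected because it has too few positions and head 4 because it has an extra value, which the plain self-join on value would not catch.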

How can I choose the closest match in SQL Server 2005?

In SQL Server 2005, I have a table of input coming in of successful sales, and a variety of tables with information on known customers, and their details. For each row of sales, I need to match 0 or 1 known customers.
We have the following information coming in from the sales table:
ServiceId,
Address,
ZipCode,
EmailAddress,
HomePhone,
FirstName,
LastName
The customers information includes all of this, as well as a 'LastTransaction' date.
Any of these fields can map back to 0 or more customers. We count a match as being any time that a ServiceId, Address+ZipCode, EmailAddress, or HomePhone in the sales table exactly matches a customer.
The problem is that we have information on many customers, sometimes multiple in the same household. This means that we might have John Doe, Jane Doe, Jim Doe, and Bob Doe in the same house. They would all match on Address+ZipCode and HomePhone, and possibly more than one of them would match on ServiceId as well.
I need some way to elegantly keep track of, in a transaction, the 'best' match of a customer. If one matches 6 fields, and the others only match 5, that customer should be kept as a match to that record. In the case of multiple matching 5, and none matching more, the most recent LastTransaction date should be kept.
Any ideas would be quite appreciated.
Update: To be a little more clear, I am looking for a good way to verify the number of exact matches in the row of data, and choose which rows to associate based on that information. If the last name is 'Doe', it must exactly match the customer last name, to count as a matching parameter, rather than be a very close match.
for SQL Server 2005 and up try:
;WITH SalesScore AS (
SELECT
s.PK_ID as S_PK
,c.PK_ID AS c_PK
,CASE
WHEN c.PK_ID IS NULL THEN 0
ELSE CASE WHEN s.ServiceId=c.ServiceId THEN 1 ELSE 0 END
+CASE WHEN (s.Address=c.Address AND s.Zip=c.Zip) THEN 1 ELSE 0 END
+CASE WHEN s.EmailAddress=c.EmailAddress THEN 1 ELSE 0 END
+CASE WHEN s.HomePhone=c.HomePhone THEN 1 ELSE 0 END
END AS Score
FROM Sales s
LEFT OUTER JOIN Customers c ON s.ServiceId=c.ServiceId
OR (s.Address=c.Address AND s.Zip=c.Zip)
OR s.EmailAddress=c.EmailAddress
OR s.HomePhone=c.HomePhone
)
SELECT
s.*,c.*
FROM (SELECT
S_PK,MAX(Score) AS Score
FROM SalesScore
GROUP BY S_PK
) dt
INNER JOIN Sales s ON dt.s_PK=s.PK_ID
INNER JOIN SalesScore ss ON dt.S_PK=ss.S_PK AND dt.Score=ss.Score
LEFT OUTER JOIN Customers c ON ss.c_PK=c.PK_ID
EDIT
I hate to write so much actual code when no schema was given, because I can't actually run this and be sure it works. However, to answer the question of how to handle ties using the last transaction date, here is a newer version of the above code:
;WITH SalesScore AS (
SELECT
s.PK_ID as S_PK
,c.PK_ID AS c_PK
,CASE
WHEN c.PK_ID IS NULL THEN 0
ELSE CASE WHEN s.ServiceId=c.ServiceId THEN 1 ELSE 0 END
+CASE WHEN (s.Address=c.Address AND s.Zip=c.Zip) THEN 1 ELSE 0 END
+CASE WHEN s.EmailAddress=c.EmailAddress THEN 1 ELSE 0 END
+CASE WHEN s.HomePhone=c.HomePhone THEN 1 ELSE 0 END
END AS Score
FROM Sales s
LEFT OUTER JOIN Customers c ON s.ServiceId=c.ServiceId
OR (s.Address=c.Address AND s.Zip=c.Zip)
OR s.EmailAddress=c.EmailAddress
OR s.HomePhone=c.HomePhone
)
SELECT
*
FROM (SELECT
s.*,c.*,row_number() over(partition by s.PK_ID order by s.PK_ID ASC,c.LastTransaction DESC) AS RankValue
FROM (SELECT
S_PK,MAX(Score) AS Score
FROM SalesScore
GROUP BY S_PK
) dt
INNER JOIN Sales s ON dt.s_PK=s.PK_ID
INNER JOIN SalesScore ss ON dt.S_PK=ss.S_PK AND dt.Score=ss.Score
LEFT OUTER JOIN Customers c ON ss.c_PK=c.PK_ID
) dt2
WHERE dt2.RankValue=1
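Since no schema was given, here is a runnable sketch of the scoring idea on an invented schema, using Python's sqlite3 module rather than SQL Server. It is a simplified variant, not the exact query above: it ranks each sale's candidates directly by score and then by LastTransaction with ROW_NUMBER() instead of joining back to MAX(Score). The table layout, sample rows and helper names are all assumptions, and window functions need SQLite 3.25 or newer.

```python
import sqlite3


def build_demo_db():
    """Invented Sales/Customers rows: one clear winner, one no-match, one tie."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Sales (PK_ID INTEGER PRIMARY KEY, ServiceId TEXT, Address TEXT,
                            ZipCode TEXT, EmailAddress TEXT, HomePhone TEXT);
        CREATE TABLE Customers (PK_ID INTEGER PRIMARY KEY, ServiceId TEXT, Address TEXT,
                                ZipCode TEXT, EmailAddress TEXT, HomePhone TEXT,
                                LastTransaction TEXT);
        INSERT INTO Sales VALUES
            (1, 'svc-1', '1 Main St', '12345', 'john@example.com',   '555-0100'),
            (2, 'svc-9', '9 Elm St',  '99999', 'nobody@example.com', '555-0199'),
            (3, 'svc-x', '1 Main St', '12345', 'x@example.com',      '555-0100');
        INSERT INTO Customers VALUES
            (10, 'svc-1', '1 Main St', '12345', 'john@example.com', '555-0100', '2009-01-15'),
            (11, 'svc-2', '1 Main St', '12345', 'jane@example.com', '555-0100', '2009-06-01'),
            (12, 'svc-3', '1 Main St', '12345', 'jim@example.com',  '555-0100', '2009-07-01');
    """)
    return conn


def best_matches(conn):
    """Return (sale id, best customer id or None, score) for every sale."""
    sql = """
        WITH SalesScore AS (
            SELECT s.PK_ID AS S_PK, c.PK_ID AS C_PK, c.LastTransaction,
                   CASE WHEN s.ServiceId = c.ServiceId THEN 1 ELSE 0 END
                 + CASE WHEN s.Address = c.Address AND s.ZipCode = c.ZipCode THEN 1 ELSE 0 END
                 + CASE WHEN s.EmailAddress = c.EmailAddress THEN 1 ELSE 0 END
                 + CASE WHEN s.HomePhone = c.HomePhone THEN 1 ELSE 0 END AS Score
            FROM Sales s
            LEFT JOIN Customers c
                   ON s.ServiceId = c.ServiceId
                   OR (s.Address = c.Address AND s.ZipCode = c.ZipCode)
                   OR s.EmailAddress = c.EmailAddress
                   OR s.HomePhone = c.HomePhone
        ),
        Ranked AS (
            -- highest score first; ties broken by most recent transaction
            SELECT S_PK, C_PK, Score,
                   ROW_NUMBER() OVER (PARTITION BY S_PK
                                      ORDER BY Score DESC, LastTransaction DESC) AS rn
            FROM SalesScore
        )
        SELECT S_PK, C_PK, Score FROM Ranked WHERE rn = 1 ORDER BY S_PK
    """
    return conn.execute(sql).fetchall()


if __name__ == "__main__":
    for row in best_matches(build_demo_db()):
        print(row)
```

Sale 3 matches all three household members on address and phone only, so the most recent LastTransaction decides; sale 2 matches nobody and keeps a NULL customer with score 0.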
Here's a fairly ugly way to do this, using SQL Server code. Assumptions:
- Column CustomerId exists in the Customer table, to uniquely identify customers.
- Only exact matches are supported (as implied by the question).
SELECT top 1 CustomerId, LastTransaction, count(*) HowMany
from (select Customerid, LastTransaction
from Sales sa
inner join Customers cu
on cu.ServiceId = sa.ServiceId
union all select Customerid, LastTransaction
from Sales sa
inner join Customers cu
on cu.EmailAddress = sa.EmailAddress
union all select Customerid, LastTransaction
from Sales sa
inner join Customers cu
on cu.Address = sa.Address
and cu.ZipCode = sa.ZipCode
union all [etcetera -- repeat for each possible link]
) xx
group by CustomerId, LastTransaction
order by count(*) desc, LastTransaction desc
I dislike using "top 1", but it is quicker to write. (The alternative is to use ranking functions, and that would require either another subquery level or implementing it as a CTE.) Of course, if your tables are large this would fly like a cow unless you had indexes on all your columns.
Frankly I would be wary of doing this at all as you do not have a unique identifier in your data.
John Smith lives with his son John Smith and they both use the same email address and home phone. These are two people, but you would match them as one. We run into this all the time with our data and have no solution for automated matching because of it. We identify possible dups and actually call them to find out if they are dups.
I would probably create a stored function for that (in Oracle) and order by the highest match:
SELECT * FROM (
SELECT c.*, MATCH_CUSTOMER( c.Id, par1, par2, par3 ) matches FROM Customer c
) WHERE matches >0 ORDER BY matches desc
The function match_customer returns the number of matches based on the input parameters. I guess it is probably slow, as this query will always scan the complete customer table.
For close matches you can also look at a number of string similarity algorithms.
For example, in Oracle there is the UTL_MATCH.JARO_WINKLER_SIMILARITY function:
http://www.psoug.org/reference/utl_match.html
There is also the Levenshtein distance algorithm.
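For reference, a minimal sketch of Levenshtein distance in plain Python, using the usual two-row dynamic programming table. This is a generic textbook implementation for illustration, not what UTL_MATCH does internally.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes or substitutions
    needed to turn `a` into `b`."""
    # Keep the shorter string in the inner loop so the rows stay small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] = distance between the first i-1 chars of a and first j chars of b
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning i chars of a into an empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            insert_cost = current[j - 1] + 1
            delete_cost = previous[j] + 1
            replace_cost = previous[j - 1] + (char_a != char_b)
            current.append(min(insert_cost, delete_cost, replace_cost))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(levenshtein("Smith", "Smyth"))
```

A small distance relative to the string length could then flag candidate near-matches for manual review; where to set that threshold depends on the data.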