Oracle REGEXP_SUBSTR for string matching between two columns - sql

The problem
Users frequently enter "country name" strings into the "city name" field; this appears to be an extremely common practice. For example, a user might put "TAIPEI TAIWAN" in the city name when only "TAIPEI" should be entered there and "TAIWAN" should go in the country field. I am working to aggregate these instances for this specific field (your help will allow me to expand this to other columns and tables) and then, where possible, identify the risk rankings associated strictly with the "country" names found in the "city" field.
I have two tables that I am attempting to leverage to track down data validation issues. Tbl1, named "Customer_Address", is comprised of geographic columns such as (Customer_Num, Address, City_Name, State, Country_Code, Zipcode). Tbl2, named "HR_Countries", is a clean table of 2-letter ISO country codes with their corresponding name values (Lebanon, Taiwan, China, Syria, Russia, Ukraine, etc.) and some other fields not presently used.
The initial step is to query "Customer_Address" to find City_Names matching a series of OR'd LIKE conditions (LIKE '%CHINA' OR LIKE 'TAIWAN', etc.) and count the number of occurrences where the City_Name is like the designated country_name string I passed in, and the results are pretty good. I've coded in some exclusions to deal with things like "Lebanon, OH", so my overall results are satisfactory for the first phase.
Part of the query does a LEFT join from Tbl1 to Tbl2 to add the risk rating from tbl2 as a result of the query against tbl1:
LEFT JOIN tbl2 risk
ON INSTR(addr.CITY_NM, risk.COUNTRY_NAME,1) <> 0
Example of Tbl1 Data Output (head(tbl1), n=7)
CountryNameInCity CountOfOccurrences RR
China 15 High
Taiwan 2000 Medium
Japan 250 Low
Taipei, Taiwan 25 NULL
Kabul, Afghanistan 10 NULL
Shenzen China 100 NULL
Afghanistan 52 Very High
Example of Tbl2 Data (head(tbl2), n=6)
CountryName CountryCode RR
China CN High
Taiwan TW High
Iraq IQ Very High
Cuba CU Medium
Lebanon LB Very High
Greece GR High
So my question(s) are as follows:
1) Instead of manually passing in a series of OR'd LIKE statements for country names, is there a better way to use Tbl2 as the matching "LIKE" source driving the query?
2) Can you recommend a better way of comparing the output of the query (see the Tbl1 example) and ensuring that multi-part strings ("Taipei, Taiwan", etc.) are appropriately aggregated and bring back the correct 'RR' rating? (A sketch touching on both points follows below.)
Thanks for taking the time to review this and respond.
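A rough sketch of one way to let Tbl2 drive both the matching and the aggregation in a single pass (a sketch only: table and column names are taken from the examples above, and the exclusion shown is just a placeholder for the ones already coded):
-- Sketch: HR_Countries drives the match, so "Taiwan" and "Taipei, Taiwan"
-- both roll up under COUNTRY_NAME = 'Taiwan' together with its RR rating.
SELECT risk.COUNTRY_NAME AS CountryNameInCity,
       risk.RR,
       COUNT(*) AS CountOfOccurrences
FROM Customer_Address addr
JOIN HR_Countries risk
  ON INSTR(UPPER(addr.CITY_NM), UPPER(risk.COUNTRY_NAME)) > 0
WHERE addr.CITY_NM NOT LIKE '%, OH'   -- placeholder for exclusions such as "Lebanon, OH"
GROUP BY risk.COUNTRY_NAME, risk.RR
ORDER BY CountOfOccurrences DESC;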

Related

How does SQL count(distinct) work in this case?

I'm trying to find the match_no in which Germany played against Poland. This is from https://www.w3resource.com/sql-exercises/soccer-database-exercise/sql-subqueries-exercise-soccer-database-4.php. There are two tables: match_details and soccer_country. I don't understand how the COUNT(DISTINCT) works in this case. Can someone please clarify? Thanks!
SELECT match_no
FROM match_details
WHERE team_id = (
SELECT country_id
FROM soccer_country
WHERE country_name = 'Germany')
OR team_id = (
SELECT country_id
FROM soccer_country
WHERE country_name = 'Poland')
GROUP BY match_no
HAVING COUNT(DISTINCT team_id) = 2;
As Lamak mentioned, this is an ugly data layout to query, but there are many ways to approach it.
As mentioned, COUNT(DISTINCT team_id) makes sure that there are exactly 2 unique teams in the match. If there is ever a Cartesian result, you could get repeated rows showing more than one instance of either team, and the distinct count on TEAM_ID eliminates that.
Now, that said, other "team" data structures I have seen have a single record for the match and a column for EACH TEAM playing in it. That is easier by a long shot, but even then it is still a relatively easy query; see the sketch below.
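For illustration only, a sketch of that single-row-per-match layout; match_summary, home_team_id and away_team_id are hypothetical names, not part of this exercise's schema:
-- Hypothetical one-row-per-match layout; table and column names are assumed.
SELECT ms.match_no
FROM match_summary ms
JOIN soccer_country home ON home.country_id = ms.home_team_id
JOIN soccer_country away ON away.country_id = ms.away_team_id
WHERE (home.country_name = 'Germany' AND away.country_name = 'Poland')
   OR (home.country_name = 'Poland' AND away.country_name = 'Germany')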
Back to your schema: break the query down a little and consider a large-scale set of data (not that this, or any professional league, would have record counts large enough to slow down a SQL engine).
Your first criterion is games with Germany, so let's start with that.
SELECT
md1.match_no
FROM
match_details md1
JOIN soccer_country sc1
on md1.team_id = sc1.country_id
AND sc1.country_name = 'Germany'
So, why even look at any other record/match if Germany is not part of the match on either side? This by itself returns 6 matches out of the 51 in the sample data. Now all you need to do is join AGAIN to the match_details table, a second time, for only those matches, but where the second team is Poland:
SELECT
md1.match_no
FROM
match_details md1
JOIN soccer_country sc1
on md1.team_id = sc1.country_id
AND sc1.country_name = 'Germany'
-- joining again for the same match Germany was already qualified
JOIN match_details md2
on md1.match_no = md2.match_no
-- but we want the OTHER team record since Germany was first team
and md1.team_id != md2.team_id
-- and on to the second country table based on the SECOND team ID
JOIN soccer_country sc2
on md2.team_id = sc2.country_id
-- and the second team was Poland
AND sc2.country_name = 'Poland'
Yes, it may be a longer query, but by eliminating the 45 other matches (again, thinking of a LARGE database) you have already narrowed tons of data down to a very finite set, and only the Germany / Poland matches remain. No aggregates, counts, or distincts, just direct joins.
FEEDBACK
Let's take a look at some BAD sample data... which, as all programmers know, is something that never happens (NOT). Anyhow, consider these few matches:
Match Team ID (team names shown instead of IDs, for simplicity)
52 Poland
52 Poland
53 Germany
53 Germany
If you were to run the query without DISTINCT, both match 52 and match 53 would show up, since Poland is one team that appears 2 times for match 52, and similarly Germany appears 2 times for match 53, so the plain count is 2. With COUNT(DISTINCT team_id), each of these matches has only 1 distinct team and is thus excluded. Does that help? Again, no such thing as bad data :)
And here is yet another sample where matches have something other than exactly 2 teams:
Match Team ID
54 France
54 Poland
54 England
55 Hungary
56 Austria
NONE of these matches would be returned. Match 54 has 3 distinct teams, while matches 55 and 56 each have only a single entry and thus no opponent to compete against.
2nd FEEDBACK
To clarify the query: if you look at the short query for just Germany, the aliased instance "md1" is already sitting on a record for a Germany match. For the second join to "md2", I only care about the same match, so I join on the same match_no. In that join, "!=" means NOT EQUAL ("!" is a logical NOT). So the join says: from md1, join to the md2 alias on the same match number, but only give me rows where the teams are NOT the same. The first instance holds Germany's team ID (already qualified), so md2 gives me the other team's ID, which I can then join to the country table to confirm it is Poland.
Does this now clarify things for you?

match tables with intermediate mapping table (fuzzy joins with similar strings)

I'm using BigQuery.
I have two simple tables with "bad" data quality from our systems. One represents revenue and the other represents production rows for bus journeys.
I need to match every journey to a revenue transaction but I only have a set of fields and no key and I don't really know how to do this matching.
This is a sample of the data:
Revenue
Year, Agreement, Station_origin, Station_destination, Product
2020, 123123, London, Manchester, Qwerty
Journeys
Year, Agreement, Station_origin, Station_destination, Product
2020, 123123, Kings Cross, Piccadilly Gardens, Qwer
2020, 123123, Kings Cross, Victoria Station, Qwert
2020, 123123, London, Manchester, Qwerty
Every station has a maximum of 9 alternative names and these are stored in a "station" table.
Stations
Station Name, Station Name 2, Station Name 3,...
London, Kings Cross, Euston,...
Manchester, Piccadilly Gardens, Victoria Station,...
I would like to first test matching or joining the tables with the original fields. This will generate some matches, but there are many journeys that are not matched. For the unmatched revenue rows, I would like to change the product name (shorten it to two letters, which could produce many matches from the production table) and then the station names, first changing station_origin and then station_destination. When using a shorter product name I could get many matches, but I want the row from the production table with the most common product.
Something like this:
1. Do a direct match. That is, I can use the fields as they are in the tables.
2. Do a match where the revenue.product is changed by shortening it to two letters. substr(product,0,2)
3. Change the rev.station_origin to the first alternative, Station Name 2, and then try a join. The product or other station are not changed.
4. Change the rev.station_origin to the first alternative, Station Name 2, and then try a join. The product is changed as above with a substr(product,0,2) but rev.station_destination is not changed.
5. Change the rev.station_destination to the first alternative, Station Name 2, and then try a join. The product or other station are not changed.
I was told that maybe I should create an intermediate table with all combinations of stations and products and let a rank column decide the order. The station names in the stations table are in order of importance, so "station name" is more important than "station name 2" and so on.
I started to write a query with one subquery per rank combined with UNION ALL (see the sketch further below), but there are so many combinations that there must be another way to do this.
Don't know if this makes any sense but I would appreciate any help or ideas to do this in a better way.
Cheers,
Cris
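A rough sketch of the ranked idea mentioned above, written in plain BigQuery SQL (table and column names follow the samples; everything else, including the rule set, is assumed). Each matching rule is one UNION ALL branch with a rank, and only the best-ranked hit per journey is kept; branches using the alternative station names from the Stations table would be added at lower ranks in the same pattern:
-- Sketch: rank 1 = exact match, rank 2 = shortened product; lowest rank wins.
WITH attempts AS (
  SELECT j.*, r.Product AS rev_product, 1 AS match_rank
  FROM Journeys j
  JOIN Revenue r
    ON  j.Year = r.Year AND j.Agreement = r.Agreement
    AND j.Station_origin = r.Station_origin
    AND j.Station_destination = r.Station_destination
    AND j.Product = r.Product
  UNION ALL
  SELECT j.*, r.Product AS rev_product, 2 AS match_rank
  FROM Journeys j
  JOIN Revenue r
    ON  j.Year = r.Year AND j.Agreement = r.Agreement
    AND j.Station_origin = r.Station_origin
    AND j.Station_destination = r.Station_destination
    AND SUBSTR(j.Product, 1, 2) = SUBSTR(r.Product, 1, 2)
)
SELECT * EXCEPT(rn)
FROM (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY Year, Agreement, Station_origin, Station_destination, Product
           ORDER BY match_rank) AS rn
  FROM attempts
)
WHERE rn = 1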
To implement a complex joining strategy with approximate matching, it might make more sense to define the strategy within JavaScript - and call the function from a BigQuery SQL query.
For example, the following query performs these steps:
Take the top 200 male names in the US.
Find if one of the top 200 female names matches.
If not, look for the most similar female name within the options.
Note that the logic to choose the closest option is encapsulated within the JS UDF fhoffa.x.fuzzy_extract_one(). See https://medium.com/@hoffa/new-in-bigquery-persistent-udfs-c9ea4100fd83 to learn more about this.
WITH data AS (
SELECT name, gender, SUM(number) c
FROM `bigquery-public-data.usa_names.usa_1910_2013`
GROUP BY 1,2
), top_men AS (
SELECT * FROM data WHERE gender='M'
ORDER BY c DESC LIMIT 200
), top_women AS (
SELECT * FROM data WHERE gender='F'
ORDER BY c DESC LIMIT 200
)
SELECT name male_name,
COALESCE(
(SELECT name FROM top_women WHERE name=a.name)
, fhoffa.x.fuzzy_extract_one(name, ARRAY(SELECT name FROM top_women))
) female_version
FROM top_men a

Output different query results in same table

After the recent update, Google BigQuery now allows querying from country-specific tables. I wanted to find the number of origins (websites) in the US table containing the word 'space' and display it side by side with a similar result from the Japan table. The query I'm making is:
WITH
query_1 as
(select distinct origin as japan
from `chrome-ux-report.country_jp.201712` where
origin like "%space%"),
query_2 as
(select distinct origin as usa
from `chrome-ux-report.country_us.201712`
where origin like "%space%" )
SELECT japan,usa from query_1,query_2
But it results in a table with multiple repetitions of the same origin in both the japan and usa columns. Another strange thing is that the output table contains the same number of rows for japan and usa, when clearly the number of sites containing the word 'space' is not the same in the 2 tables. I'm using standard SQL, not legacy.
Any help is appreciated. Thanks.
Note: by side by side, I mean there will be two columns, the japan column displaying sites for japan and the usa column displaying results for usa.
In BigQuery Standard SQL (which you are using in your query), a comma between tables in the FROM clause means CROSS JOIN. This explains why it results in a table with multiple repetitions of the same origin in both the japan and usa columns.
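A tiny illustration of that behavior with made-up values (nothing here comes from the real tables): three japan rows crossed with two usa rows produce six output rows.
-- Made-up values only, to show the comma / CROSS JOIN multiplication.
SELECT japan, usa
FROM UNNEST(['a.jp', 'b.jp', 'c.jp']) AS japan,
     UNNEST(['x.us', 'y.us']) AS usa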
It depends on how exactly you want your result to look - you can construct your query in many different ways - for example:
WITH
query_1 AS
(SELECT DISTINCT origin AS japan
FROM `chrome-ux-report.country_jp.201712` WHERE
origin LIKE "%space%"),
query_2 AS
(SELECT DISTINCT origin AS usa
FROM `chrome-ux-report.country_us.201712`
WHERE origin LIKE "%space%" )
SELECT
ARRAY(SELECT japan FROM query_1) AS japan,
ARRAY(SELECT usa FROM query_2) AS usa
You can also check the counts as below:
WITH
query_1 AS
(SELECT DISTINCT origin AS japan
FROM `chrome-ux-report.country_jp.201712` WHERE
origin LIKE "%space%"),
query_2 AS
(SELECT DISTINCT origin AS usa
FROM `chrome-ux-report.country_us.201712`
WHERE origin LIKE "%space%" )
SELECT
ARRAY_LENGTH(ARRAY(SELECT japan FROM query_1)) AS japan_count,
ARRAY_LENGTH(ARRAY(SELECT usa FROM query_2)) AS usa_count

SQL JOINing on max value, even if it is 0

I have two tables that look roughly like this:
Airports
uniqueID | Name
0001 | Dallas
Runways
uniqueID | AirportID | Length
000101 | 0001 | 8000
I'm doing a join that looks like this:
SELECT Airports.Name, Runways.Length FROM Airports, Runways
WHERE Airports.uniqueID = Runways.AirportID
Obviously, each runway has exactly one airport, and each airport has 1..n runways.
For an airport with multiple runways, this gives me several rows, one for each runway at that airport.
I want a result set that contains ONLY the row for the longest runway, i.e. MAX(Length).
Sometimes, the Length is 0 for several runways in the database, because the source data is missing. In that case I only want one row with the Length = 0 obviously.
I've tried the approach laid out here: Inner Join table with respect to a maximum value but that's actually not helpful because that's like searching for the longest runway of all, not for the longest at one particular airport.
This seems too simple to be what you want, but it seems to meet all the cases you've described...
SELECT A.Name, Max(R.Length)
FROM Airports A
INNER JOIN Runways R
on A.uniqueID=R.AirportID
Group by A.Name
This should give you the max runway for each airport.
If you need additional data elements, then use the above as an inline view (a subquery within the joins) to limit the result set to just those airports and their max runway; a sketch follows below.
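A sketch of that inline-view idea, keeping the column names from the question (nothing else is assumed); note that if several runways tie for the maximum length at one airport (e.g. several zeros), this still returns one row per tied runway:
-- Sketch: join back to Runways through the per-airport max to pull extra columns.
SELECT A.Name, R.uniqueID, R.Length
FROM Airports A
INNER JOIN Runways R
  ON A.uniqueID = R.AirportID
INNER JOIN (SELECT AirportID, MAX(Length) AS MaxLength
            FROM Runways
            GROUP BY AirportID) M
  ON R.AirportID = M.AirportID
 AND R.Length = M.MaxLength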

SQL Server - copy data across tables, but copy the data only when it matches with a specific column name

For example, I've got these 2 tables:
dbo.fc_states
StateId Name
6316 Alberta
6317 British Columbia
and dbo.fc_Query
Name StatesName StateId
Abbotsford Quebec NULL
Abee Alberta NULL
100 Mile House British Columbia NULL
OK, pretty straightforward: how do I copy the StateId over from fc_states to fc_Query, but matched on the StatesName? Let's say the result would be:
Name StatesName StateId
Abee Alberta 6316
100 Mile House British Columbia 6317
Thanks, and both state name columns are of type text.
How about:
update fc_Query set StateId =
(select StateId from fc_states where fc_states.Name = fc_Query.StatesName)
That should give you the result you're looking for.
This is a different way from what Eddie did; I like MERGE for updates if they're not dead simple (and I wouldn't consider yours dead simple). So if you're bored/curious, also try:
WITH stateIds AS
(SELECT Name, MAX(StateId) AS stID
 FROM fc_states
 GROUP BY Name)
MERGE fc_Query
USING stateIds
ON stateIds.Name = fc_Query.StatesName
WHEN MATCHED THEN UPDATE
SET fc_Query.StateId = CONVERT(int, stID)
;
The first part, from "WITH" to the "GROUP BY Name)", is a CTE: it creates a table-like thing named 'stateIds' that can be used as a table by the immediately following part of the query, and it is guaranteed to have only one row per state name. The MERGE then looks for anything in fc_Query with a matching name, and if there's a match, it sets the StateId as you want. You can make a small edit if you don't want to overwrite existing StateIds in fc_Query:
WITH stateIds AS
(SELECT Name, MAX(StateId) AS stID
 FROM fc_states
 GROUP BY Name)
MERGE fc_Query
USING stateIds
ON stateIds.Name = fc_Query.StatesName
WHEN MATCHED AND fc_Query.StateId IS NULL THEN UPDATE
SET fc_Query.StateId = CONVERT(int, stID)
;
And you can have it do something different for rows that don't match, so I think MERGE is good for a lot of applications. You need a semicolon at the end of a MERGE statement, and you have to guarantee that there will be only one match or zero matches in the source (that is "stateIds", my CTE) for each row in the target; if there's more than one match, something horrible happens, Satan wins or the US economy falters, I'm not sure which, just never let it happen.
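If you want to check that guarantee up front, a quick sanity query (just a sketch using the columns above) lists the names in fc_states that carry more than one distinct StateId, i.e. the cases where the MAX() in the CTE would silently pick a winner:
-- Any rows returned here are names the CTE collapses by picking MAX(StateId).
SELECT Name, COUNT(DISTINCT StateId) AS distinct_ids
FROM fc_states
GROUP BY Name
HAVING COUNT(DISTINCT StateId) > 1;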