How to get an accurate JOIN using Fuzzy matching in Oracle - sql

I'm trying to join a set of county names from one table with county names in another table. The issue is that the county names in the two tables are not normalized. The tables do not contain the same number of counties, and the names do not always follow the same pattern. For instance, the county 'SAINT JOHNS' in "Table A" may be represented as 'ST JOHNS' in "Table B". We cannot predict a common pattern for them.
That means we cannot use an "equal to" (=) condition when joining. So I'm trying to join them using the JARO_WINKLER_SIMILARITY function in Oracle.
My LEFT OUTER JOIN condition looks like this:
Table_A.State = Table_B.State
AND UTL_MATCH.JARO_WINKLER_SIMILARITY(Table_A.County_Name,Table_B.County_Name)>=80
I chose the threshold of 80 after some testing of the results; it seemed to be optimal.
The issue is that I'm getting a set of false positives when joining. For instance, if some counties in the same state have similar names ('BARRY' and 'BAY', for example), they will be matched whenever their similarity is >= 80.
This produces an inaccurate set of joined data.
Can anyone please suggest a workaround?
Thanks,
DAV

Can you please help me build a query that looks up Table_A for each record in Table B/C/D and matches it against the county name in A with the highest-ranked similarity that is >= 80?
Oracle Setup:
CREATE TABLE official_words ( word ) AS
SELECT 'SAINT JOHNS' FROM DUAL UNION ALL
SELECT 'MONTGOMERY' FROM DUAL UNION ALL
SELECT 'MONROE' FROM DUAL UNION ALL
SELECT 'SAINT JAMES' FROM DUAL UNION ALL
SELECT 'BOTANY BAY' FROM DUAL;
CREATE TABLE words_to_match ( word ) AS
SELECT 'SAINT JOHN' FROM DUAL UNION ALL
SELECT 'ST JAMES' FROM DUAL UNION ALL
SELECT 'MONTGOMERY BAY' FROM DUAL UNION ALL
SELECT 'MONROE ST' FROM DUAL;
Query:
SELECT *
FROM (
SELECT wtm.word,
ow.word AS official_word,
UTL_MATCH.JARO_WINKLER_SIMILARITY( wtm.word, ow.word ) AS similarity,
ROW_NUMBER() OVER ( PARTITION BY wtm.word ORDER BY UTL_MATCH.JARO_WINKLER_SIMILARITY( wtm.word, ow.word ) DESC ) AS rn
FROM words_to_match wtm
INNER JOIN
official_words ow
ON ( UTL_MATCH.JARO_WINKLER_SIMILARITY( wtm.word, ow.word )>=80 )
)
WHERE rn = 1;
Output:
WORD OFFICIAL_WO SIMILARITY RN
-------------- ----------- ---------- ----------
MONROE ST MONROE 93 1
MONTGOMERY BAY MONTGOMERY 94 1
SAINT JOHN SAINT JOHNS 98 1
ST JAMES SAINT JAMES 80 1

Using some made up test data inline (you would use your own TABLE_A and TABLE_B in place of the first two with clauses, and begin at with matches as ...):
with table_a (state, county_name) as
( select 'A', 'ST JOHNS' from dual union all
select 'A', 'BARRY' from dual union all
select 'B', 'CHEESECAKE' from dual union all
select 'B', 'WAFFLES' from dual union all
select 'C', 'UMBRELLAS' from dual )
, table_b (state, county_name) as
( select 'A', 'SAINT JOHNS' from dual union all
select 'A', 'SAINT JOANS' from dual union all
select 'A', 'BARRY' from dual union all
select 'A', 'BARRIERS' from dual union all
select 'A', 'BANANA' from dual union all
select 'A', 'BANOFFEE' from dual union all
select 'B', 'CHEESE' from dual union all
select 'B', 'CHIPS' from dual union all
select 'B', 'CHICKENS' from dual union all
select 'B', 'WAFFLING' from dual union all
select 'B', 'KITTENS' from dual union all
select 'C', 'PUPPIES' from dual union all
select 'C', 'UMBRIA' from dual union all
select 'C', 'UMBRELLAS' from dual )
, matches as
( select a.state, a.county_name, b.county_name as matched_name
, utl_match.jaro_winkler_similarity(a.county_name,b.county_name) as score
from table_a a
join table_b b on b.state = a.state )
, ranked_matches as
( select m.*
, rank() over (partition by m.state, m.county_name order by m.score desc) as ranking
from matches m
where score > 50 )
select rm.state, rm.county_name, rm.matched_name, rm.score
from ranked_matches rm
where ranking = 1
order by 1,2;
Results:
STATE COUNTY_NAME MATCHED_NAME SCORE
----- ----------- ------------ ----------
A BARRY BARRY 100
A ST JOHNS SAINT JOHNS 80
B CHEESECAKE CHEESE 92
B WAFFLES WAFFLING 86
C UMBRELLAS UMBRELLAS 100
The idea is that matches computes all scores, ranked_matches ranks them within each (state, county_name), and the final query picks all the top scorers (i.e. filters on ranking = 1).
You may still get some duplicates as there is nothing to stop two different fuzzy matches scoring the same.
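One way to deal with those tied scores is to pick one row deterministically and also report how many candidates shared the top score, so ambiguous matches can be reviewed by hand. Below is a minimal sketch of that idea, run through Python's sqlite3 module so it is self-contained (it assumes SQLite 3.25+ for window functions). The matches table and its scores are made up for illustration and stand in for the matches subquery above; in Oracle the scores would come from UTL_MATCH.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE matches (state TEXT, county_name TEXT, matched_name TEXT, score INTEGER);
-- Hypothetical, hand-written scores: two candidates tie for ST JOHNS.
INSERT INTO matches VALUES
  ('A', 'ST JOHNS', 'SAINT JOHNS', 80),
  ('A', 'ST JOHNS', 'SAINT JOANS', 80),
  ('A', 'BARRY',    'BARRY',      100),
  ('A', 'BARRY',    'BARRIERS',    85);
""")

best_matches = conn.execute("""
SELECT state, county_name, matched_name, score, ties
FROM (
    SELECT m.*,
           -- ROW_NUMBER with a tie-breaker returns exactly one row per county
           ROW_NUMBER() OVER (PARTITION BY state, county_name
                              ORDER BY score DESC, matched_name) AS rn,
           -- how many candidates share this row's score
           COUNT(*) OVER (PARTITION BY state, county_name, score) AS ties
    FROM matches m
    WHERE score > 50
)
WHERE rn = 1
ORDER BY state, county_name
""").fetchall()

for row in best_matches:
    print(row)
```

Note that the alphabetical tie-breaker is arbitrary: here it picks 'SAINT JOANS' over 'SAINT JOHNS', which is why the ties column is worth keeping. Any row with ties > 1 probably needs a manual check or an extra rule.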

Related

Apply value to group

I need help with the following:
If any ID within the same GroupID has Yes in Payable, set Result to Yes for every ID in that group; otherwise leave it blank.
This should work for hundreds of IDs grouped into hundreds of GroupIDs.
ID   GroupID  Payable  Result
---  -------  -------  ------
111  a        Yes      Yes
222  a                 Yes
333  a                 Yes
444  b                 Yes
555  b        Yes      Yes
777  b                 Yes
888  c
I tried grouping by GroupID and writing a CASE where the GroupID count is greater than or equal to 1 and the eligibility is Yes.
Your question is a bit unclear because it lacks information about constraints, the possible values in the table, and so on.
With the given information, though, the task can probably be done with the following query:
with groupData as
(
-- one row per group; max() is needed because payable is not in the group by
Select groupid, max(payable) as payable from put_your_table_name_here
where payable is not null
group by groupid
)
Select pt.id
,pt.groupId
,pt.payable
,gd.payable as result
from put_your_table_name_here pt
left join groupData gd on gd.groupid=pt.groupid
The query has its drawbacks (you should give some more information about constraints), but in general it should work.
If you don't want rows where the group-level payable value is null, change the left join to an inner join.
WITH CTE(ID, GroupID, Payable)AS
(
SELECT 111,'A','YES'
UNION ALL
SELECT 222,'A',''
UNION ALL
SELECT 333,'A',''
UNION ALL
SELECT 444,'B',''
UNION ALL
SELECT 555,'B','YES'
UNION ALL
SELECT 777,'B',''
UNION ALL
SELECT 888,'C',''
)
SELECT C.ID,C.GroupID,C.Payable,F.FLAG
FROM CTE AS C
JOIN
(
SELECT X.GROUPID,MAX(PAYABLE)FLAG
FROM CTE AS X
GROUP BY X.GroupID
)F ON C.GroupID=F.GroupID
ORDER BY C.ID;
You can try something like the query above, or this alternative:
SELECT C.ID,C.GroupID,C.Payable,
FIRST_VALUE(C.PAYABLE)OVER(PARTITION BY C.GROUPID ORDER BY C.PAYABLE DESC)XCOL
FROM CTE AS C
ORDER BY C.ID;
with data (ID,GroupID,Payable) as (
Select 111, 'a', 'Yes' from dual union all
Select 222, 'a', null from dual union all
Select 333, 'a', null from dual union all
Select 444, 'b', null from dual union all
Select 555, 'b', 'Yes' from dual union all
Select 777, 'b', null from dual union all
Select 888, 'c', null from dual
)
,result as(
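-- count only the rows whose Payable is 'Yes', not every row in the group
select GroupID, case when Count(case when Payable = 'Yes' then 1 end) > 0 then 'Yes' else null end Result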
from data
group by GroupID)
Select * from data
Join result on data.GroupID = result.GroupID
order by data.ID,data.GroupID
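A window function can also do this in one pass without a join. The following is a minimal sketch run through Python's sqlite3 module so it can be executed as is (it assumes SQLite 3.25+ for window functions); the table name data mirrors the sample above. The same MAX(...) OVER (PARTITION BY ...) expression should carry over to Oracle, but that is not tested here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE data (id INTEGER, groupid TEXT, payable TEXT);
INSERT INTO data VALUES
  (111, 'a', 'Yes'), (222, 'a', NULL), (333, 'a', NULL),
  (444, 'b', NULL),  (555, 'b', 'Yes'), (777, 'b', NULL),
  (888, 'c', NULL);
""")

flagged = conn.execute("""
SELECT id, groupid, payable,
       -- 'Yes' if any row in the group has payable = 'Yes', otherwise NULL
       MAX(CASE WHEN payable = 'Yes' THEN 'Yes' END)
           OVER (PARTITION BY groupid) AS result
FROM data
ORDER BY id
""").fetchall()

for row in flagged:
    print(row)
```

Group c has no payable row, so its result stays NULL (None in Python), which matches the "otherwise blank" requirement.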

Select total average of averages grouped by id

In my database, which represents a car service station, I am trying to write a SQL query that gives me the overall average of how much a customer pays for a single service. Instead of taking AVG() of the price over all existing invoices, I want to group the invoices by reservation_id first and then take the total average of those grouped results.
I am using the two tables listed in the picture below. I want to get the total average price by applying AVG() to the averages produced by grouping prices on the same FK Reservation_reservation_id.
I tried to make this into a single query but failed, so I came looking for help from more experienced users. Also, I need to select only the total average as the result. This result should give me an overview of how much each customer pays on average for one reservation.
Thanks for your time
You appear to want to aggregate twice:
SELECT AVG( avg_price ) avg_avg_price
FROM (
SELECT AVG( price ) AS avg_price
FROM invoice
GROUP BY reservation_reservation_id
)
Which, for the sample data:
CREATE TABLE invoice ( reservation_reservation_id, price ) AS
SELECT 1, 10 FROM DUAL UNION ALL
SELECT 1, 12 FROM DUAL UNION ALL
SELECT 1, 14 FROM DUAL UNION ALL
SELECT 1, 16 FROM DUAL UNION ALL
SELECT 2, 10 FROM DUAL UNION ALL
SELECT 2, 11 FROM DUAL UNION ALL
SELECT 2, 12 FROM DUAL;
Outputs:
AVG_AVG_PRICE
12
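To see why the two-level aggregation matters, here is a small sketch that runs the same sample data through Python's sqlite3 module (used only so the example is self-contained; the SQL itself is the nested query above) and compares it with a plain AVG() over all invoices.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE invoice (reservation_reservation_id INTEGER, price INTEGER);
INSERT INTO invoice VALUES
  (1, 10), (1, 12), (1, 14), (1, 16),
  (2, 10), (2, 11), (2, 12);
""")

# Average of the per-reservation averages: (13 + 11) / 2
avg_of_avgs = conn.execute("""
SELECT AVG(avg_price)
FROM (
    SELECT AVG(price) AS avg_price
    FROM invoice
    GROUP BY reservation_reservation_id
)
""").fetchone()[0]

# Plain average over every invoice row: 85 / 7
overall_avg = conn.execute("SELECT AVG(price) FROM invoice").fetchone()[0]

print(avg_of_avgs, overall_avg)
```

The two numbers differ because reservation 1 has more invoices than reservation 2: the plain average weights every invoice equally, while the average of averages weights every reservation equally.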
If you want this per customer:
SELECT customer_customer_id, AVG(avg_reservation_price)
FROM (SELECT i.customer_customer_id, i.reservation_reservation_id,
AVG(i.price) as avg_reservation_price
FROM invoice i
GROUP BY i.customer_customer_id, i.reservation_reservation_id
) ir
GROUP BY customer_customer_id;
If you want this for a particular "checkup type" (which is the closest thing I can imagine "service" means), then join in the reservation table and filter:
SELECT customer_customer_id, AVG(avg_reservation_price)
FROM (SELECT i.customer_customer_id, i.reservation_reservation_id,
AVG(i.price) as avg_reservation_price
FROM invoice i JOIN
reservation r
ON i.reservation_reservation_id = r.reservation_id
WHERE r.checkup_type = ?
GROUP BY i.customer_customer_id, i.reservation_reservation_id
) ir
GROUP BY customer_customer_id;
You might want to try the below:
with aux (gr, subgr, val) as (
select 'a', 'a1', 1 from dual union all
select 'a', 'a2', 2 from dual union all
select 'a', 'a3', 3 from dual union all
select 'a', 'a4', 4 from dual union all
select 'b', 'b1', 5 from dual union all
select 'b', 'b2', 6 from dual union all
select 'b', 'b3', 7 from dual union all
select 'b', 'b4', 8 from dual)
SELECT
gr,
avg(val) average_gr,
avg(avg(val)) over () average_total
FROM
aux
group by gr;
Which, applied to your table, would result in:
SELECT
reservation_reservation_id,
avg(price) average_rn,
avg(avg(price)) over () average_total
FROM
invoice
group by reservation_reservation_id;

How filter rows by matched values using BigQuery?

I have a table in BigQuery
SELECT 1 as big_id, 1 as temp_id, '101' as names
UNION ALL SELECT 1,1, 'z3Awwer',
UNION ALL SELECT 1,1, 'gA1sd03',
UNION ALL SELECT 1,2, 'z3Awwer',
UNION ALL SELECT 1,2, 'gA1sd03',
UNION ALL SELECT 1,3, 'gA1sd03',
UNION ALL SELECT 1,3, 'sAs10sdf4',
UNION ALL SELECT 1,4, 'sAs10sdf4',
UNION ALL SELECT 1,5, 'Adf105',
UNION ALL SELECT 2,1, 'A1sdf02',
UNION ALL SELECT 2,1, '345A103',
UNION ALL SELECT 2,2, '345A103',
UNION ALL SELECT 2,2, 'A1sd04',
UNION ALL SELECT 2,3, 'A1sd04',
UNION ALL SELECT 2,4, '6_0Awe105'
I want to drop a temp_id when all of its names are included in some other temp_id within the same big_id. For example, I do not need the rows where temp_id = 2 (for big_id = 1), because all names of temp_id = 2 are included in temp_id = 1. Likewise, I need to keep all rows of temp_id = 1, because its set of names covers the set of names of temp_id = 2.
So expected output:
SELECT 1 as big_id, 1 as temp_id, '101' as names
UNION ALL SELECT 1,1, 'z3Awwer',
UNION ALL SELECT 1,1, 'gA1sd03',
UNION ALL SELECT 1,3, 'gA1sd03',
UNION ALL SELECT 1,3, 'sAs10sdf4',
UNION ALL SELECT 1,5, 'Adf105',
UNION ALL SELECT 2,1, 'A1sdf02',
UNION ALL SELECT 2,1, '345A103',
UNION ALL SELECT 2,2, '345A103',
UNION ALL SELECT 2,2, 'A1sd04',
UNION ALL SELECT 2,4, '6_0Awe105'
How can I make it using BigQuery?
Below is for BigQuery Standard SQL
#standardsql
with temp as (
select big_id, temp_id, array_agg(names) names
from `project.dataset.table`
group by big_id, temp_id
)
select big_id, temp_id, names
from (
select big_id, temp_id, any_value(names) names
from (
select t1.*,
( select count(1)
from t1.names name
join t2.names name
using(name)
where t1.temp_id != t2.temp_id
) = array_length(t1.names) as flag
from temp t1
join temp t2
using (big_id)
)
group by big_id, temp_id
having countif(flag) = 0
), unnest(names) names
If you apply the above to the sample data from your question, the output matches the expected output listed there.
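For anyone who wants to check the logic outside BigQuery, the same subset test can be sketched in plain Python. This is not the query above, just an illustration of the rule it implements: a temp_id is dropped when its set of names is contained in the set of another temp_id with the same big_id. The helper name is_covered is made up for this example.

```python
from collections import defaultdict

rows = [
    (1, 1, '101'), (1, 1, 'z3Awwer'), (1, 1, 'gA1sd03'),
    (1, 2, 'z3Awwer'), (1, 2, 'gA1sd03'),
    (1, 3, 'gA1sd03'), (1, 3, 'sAs10sdf4'),
    (1, 4, 'sAs10sdf4'),
    (1, 5, 'Adf105'),
    (2, 1, 'A1sdf02'), (2, 1, '345A103'),
    (2, 2, '345A103'), (2, 2, 'A1sd04'),
    (2, 3, 'A1sd04'),
    (2, 4, '6_0Awe105'),
]

# Collect the set of names for every (big_id, temp_id) pair.
names_by_key = defaultdict(set)
for big_id, temp_id, name in rows:
    names_by_key[(big_id, temp_id)].add(name)

def is_covered(key):
    """True if another temp_id in the same big_id contains all names of key."""
    big_id, temp_id = key
    return any(
        other_big == big_id
        and other_temp != temp_id
        and names_by_key[key] <= other_names
        for (other_big, other_temp), other_names in names_by_key.items()
    )

kept = [row for row in rows if not is_covered((row[0], row[1]))]
for row in kept:
    print(row)
```

One caveat of this rule: if two temp_ids have exactly the same names, each is covered by the other and both are dropped. The sample data has no such case, so decide separately how you want to handle it.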

Write a query that gets Poorly Mastered records and correctly Mastered Records?

I have to write a query that returns all correctly mastered recipients (grouped by first_name and last_name).
I have to write another query that returns all poorly mastered recipients (grouped by first_name and last_name).
Please see the images below. If there are multiple Master IDs for the same first name and last name, the recipient is poorly mastered; if all records have the same Master ID, it is correctly mastered.
Sample data for the query is provided below
WITH DATA1 AS
(
SELECT 5175133 ID,'Yun' FIRST_NAME,'Yue' LAST_NAME,NULL MASTER_ID FROM dual UNION ALL
SELECT 5157093,'Yun','Yue',5157093 FROM dual UNION ALL
SELECT 5226656,'Yun','Yue',NULL FROM dual UNION ALL
SELECT 6345852,'Yun','Yue',5157093 FROM dual UNION ALL
SELECT 5882603,'Ye','Han',5157093 FROM dual UNION ALL
SELECT 5902219,'Ye','Han',5157093 FROM dual UNION ALL
SELECT 6362890,'Rick','Kaylor',NULL FROM dual UNION ALL
SELECT 6362940,'Rick','Kaylor',NULL FROM dual UNION ALL
SELECT 5215659,'Rick','Kaylor',NULL FROM dual UNION ALL
SELECT 5962837,'Rick','Kaylor',5962837 FROM dual UNION ALL
SELECT 5841556,'Rick','Kaylor',5841556 FROM dual UNION ALL
SELECT 5916218,'Sherlene','Heard',5916218 FROM dual UNION ALL
SELECT 6356086,'Sherlene','Heard',5916218 FROM dual UNION ALL
SELECT 5885157,'Ye','Kong',5884937 FROM dual UNION ALL
SELECT 5884937,'Ye','Kong',NULL FROM dual UNION ALL
SELECT 5898890,'Ye','Kong',5884937 FROM dual
)
SELECT * FROM DATA1
I think it's a simple query; please help.
Thanks
As this is very probably some kind of homework or assignment, just a clue:
Have you thought about using COUNT(*) in a sub-query? As far as I can tell, "correctly mastered recipients" will all have one and only one master_id...
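To make that clue concrete, here is a minimal sketch using COUNT(DISTINCT master_id) per name, run through Python's sqlite3 module so it is self-contained. It rests on an assumption the question does not settle: NULL master_id values are ignored, so a name with one master_id plus some NULLs still counts as correctly mastered. The table name data1 mirrors the sample data above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE data1 (id INTEGER, first_name TEXT, last_name TEXT, master_id INTEGER);
INSERT INTO data1 VALUES
  (5175133, 'Yun', 'Yue', NULL),         (5157093, 'Yun', 'Yue', 5157093),
  (5226656, 'Yun', 'Yue', NULL),         (6345852, 'Yun', 'Yue', 5157093),
  (5882603, 'Ye', 'Han', 5157093),       (5902219, 'Ye', 'Han', 5157093),
  (6362890, 'Rick', 'Kaylor', NULL),     (6362940, 'Rick', 'Kaylor', NULL),
  (5215659, 'Rick', 'Kaylor', NULL),     (5962837, 'Rick', 'Kaylor', 5962837),
  (5841556, 'Rick', 'Kaylor', 5841556),  (5916218, 'Sherlene', 'Heard', 5916218),
  (6356086, 'Sherlene', 'Heard', 5916218), (5885157, 'Ye', 'Kong', 5884937),
  (5884937, 'Ye', 'Kong', NULL),         (5898890, 'Ye', 'Kong', 5884937);
""")

# COUNT(DISTINCT ...) skips NULLs, which is the assumption stated above.
counts = conn.execute("""
SELECT first_name, last_name, COUNT(DISTINCT master_id) AS master_ids
FROM data1
GROUP BY first_name, last_name
ORDER BY first_name, last_name
""").fetchall()

correctly_mastered = [(f, l) for f, l, n in counts if n == 1]
poorly_mastered = [(f, l) for f, l, n in counts if n > 1]
print(correctly_mastered)
print(poorly_mastered)
```

In pure SQL the split would be a HAVING COUNT(DISTINCT master_id) = 1 for one query and > 1 for the other. Names whose master_id is NULL on every row would fall into neither list, so decide how you want to classify those.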

How to get query to return rows where first three characters of one row match another row?

Here's my data:
with first_three as
(
select 'AAAA' as code from dual union all
select 'BBBA' as code from dual union all
select 'BBBB' as code from dual union all
select 'BBBC' as code from dual union all
select 'CCCC' as code from dual union all
select 'CCCD' as code from dual union all
select 'FFFF' as code from dual union all
select 'GFFF' as code from dual )
select substr(code,1,3) as r1
from first_three
group by substr(code,1,3)
having count(*) >1
This query returns the three-character prefixes that meet the criteria. Now, how do I select from this to get the desired results? Or is there another way?
Desired Results
BBBA
BBBB
BBBC
CCCC
CCCD
WITH code_frequency AS (
SELECT code,
COUNT(1) OVER ( PARTITION BY SUBSTR( code, 1, 3 ) ) AS frequency
FROM table_name
)
SELECT code
FROM code_frequency
WHERE frequency > 1
WITH first_three AS (
...
)
SELECT *
FROM first_three f1
WHERE EXISTS (
SELECT 1 FROM first_three f2
WHERE f1.code != f2.code
AND substr(f1.code, 1, 3) = substr(f2.code, 1, 3)
)
select code from (select code, count(*) over
(partition by substr(code,1,3)) cn from table_name) where cn>1;
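As a quick check of the window-count approach against the sample data from the question, here is a small sketch run through Python's sqlite3 module (chosen only to keep the example self-contained; it assumes SQLite 3.25+ for window functions). The table name first_three mirrors the CTE in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE first_three (code TEXT);
INSERT INTO first_three VALUES
  ('AAAA'), ('BBBA'), ('BBBB'), ('BBBC'),
  ('CCCC'), ('CCCD'), ('FFFF'), ('GFFF');
""")

codes = [row[0] for row in conn.execute("""
SELECT code
FROM (
    SELECT code,
           -- number of rows sharing the same first three characters
           COUNT(*) OVER (PARTITION BY SUBSTR(code, 1, 3)) AS frequency
    FROM first_three
)
WHERE frequency > 1
ORDER BY code
""")]

print(codes)
```

'FFFF' and 'GFFF' are both excluded because their prefixes ('FFF' and 'GFF') differ, and 'AAAA' is excluded because nothing else starts with 'AAA'.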