Check if string variations exists in another string - sql

I need to check if a partial name matches full name. For example:
Partial_Name | Full_Name
--------------------------------------
John,Smith | Smith William John
Eglid,Timothy | Timothy M Eglid
I have no clue how to approach this type of matching.
Another thing is that name and last name may come in the wrong order, making it harder.
I could do something like this, but this only works if names are in the same order and 100% match
decode(LOWER(REGEXP_REPLACE(Partial_Name,'[^a-zA-Z'']','')), LOWER(REGEXP_REPLACE(Full_Name,'[^a-zA-Z'']','')), 'Same', 'Different')

you could use this pattern on the text provided - works for most engines
([^ ,]+),([^ ,]+)(?=.*\b\1\b)(?=.*\b\2\b)
Demo

WITH
/*
tab AS
(
SELECT 'Smith William John' Full_Name, 'John,Smith' Partial_Name FROM dual
UNION ALL SELECT 'Timothy M Eglid', 'Eglid,timothy' FROM dual
UNION ALL SELECT 'Tim M Egli', 'Egli,Tim,M2' FROM dual
UNION ALL SELECT 'Timot M Eg', 'Eg' FROM dual
),
*/
tmp AS (
SELECT Full_Name, Partial_Name,
trim(CASE WHEN instr(Partial_Name, ',') = 0 THEN Partial_Name
ELSE regexp_substr(Partial_Name, '[^,]+', 1, lvl+1)
END) token
FROM tab t CROSS JOIN (SELECT lvl FROM (SELECT LEVEL-1 lvl FROM dual
CONNECT BY LEVEL <= (SELECT MAX(LENGTH(Partial_Name) - LENGTH(REPLACE(Partial_Name, ',')))+1 FROM tab)))
WHERE LENGTH(Partial_Name) - LENGTH(REPLACE(Partial_Name, ',')) >= lvl
)
SELECT Full_Name, Partial_Name
FROM tmp
GROUP BY Full_Name, Partial_Name
HAVING count(DISTINCT token)
= count(DISTINCT CASE WHEN REGEXP_LIKE(Full_Name, token, 'i')
THEN token ELSE NULL END);
In the tmp each partial_name is splitted on tokens (separated by comma)
The resulting query retrieves only those rows which full_name matches all the corresponding tokens.
This query works with the dynamic number of commas in partial_name. If there can be only zero or one commas then the query will be much easier:
SELECT * FROM tab
WHERE instr(Partial_Name, ',') > 0
AND REGEXP_LIKE(full_name, substr(Partial_Name, 1, instr(Partial_Name, ',')-1), 'ix')
AND REGEXP_LIKE(full_name, substr(Partial_Name,instr(Partial_Name, ',')+1), 'ix')
OR instr(Partial_Name, ',') = 0
AND REGEXP_LIKE(full_name, Partial_Name, 'ix');

This is what I ended up doing... Not sure if this is the best approach.
I split partials by comma and check if first name present in full name and last name present in full name. If both are present then match.
CASE
WHEN
instr(trim(lower(Full_Name)),
trim(lower(REGEXP_SUBSTR(Partial_Name, '[^,]+', 1, 1)))) > 0
AND
instr(trim(lower(Full_Name)),
trim(lower(REGEXP_SUBSTR(Partial_Name, '[^,]+', 1, 2)))) > 0
THEN 'Y'
ELSE 'N'
END AS MATCHING_NAMES

Related

Is there an Oracle Function to determine the number of repeat character

I need to validate the number of repeat character in a email.
I try the next code who give me the percentage of repeat character, but only work if character are next to each other. So one posibily its order the email by character to get my result.
SELECT
round(((REGEXP_COUNT(regexp_replace(SUBSTR('999824123#HOTMAIL.COM',1,INSTR('989824123#HOTMAIL.COM', '#', 1)-1), '(.)\1+','&'),'&')+length(SUBSTR('989824123#HOTMAIL.COM',1,INSTR('989824123#HOTMAIL.COM', '#', 1)-1)) - length(regexp_replace(SUBSTR('989824123#HOTMAIL.COM',1,INSTR('989824123#HOTMAIL.COM', '#', 1)-1), '(.)\1+','\1')))* 100)/length(SUBSTR('989824123#HOTMAIL.COM',1,INSTR('989824123#HOTMAIL.COM', '#', 1)-1)),2) AS PORCENTAJE_IGUAL
FROM DUAL A;
I expect 60% of repeat character for this email 989824123#HOTMAIL.COM. not incluing domain.
please Help.
PD: sorry for the bad english
Numbers 9, 8, 2 repeats in email, so we have 6 characters (9, 9, 8, 8, 2, 2) which repeats and 3 unique (1, 3, 4). 6/9 gives us 66,67%.
You can use this query to count this:
with
t(email) as (select '989824123#hotmail.com' from dual),
a(email) as (select substr(email, 1,instr(email, '#', 1)-1) from t),
l as (select substr(email, level, 1) ltr from a connect by level <= length(email))
select sum(case when cnt <> 1 then cnt end) / sum(cnt)
from (select ltr, count(1) cnt from l group by ltr)
I cut domain, then in subquery l I divided string into one-letter rows, rest was only to count non-unique chars and divide by number of all chars.
edit:
how do you apply something like this in a update or select for a large
scale data base with many email?
You can create function:
create or replace function rpt_similarity(i_email in varchar2) return number is
v_email varchar2(100);
v_ret number;
begin
v_email := substr(i_email, 1, instr(i_email, '#', 1) - 1);
with l as (
select substr(v_email, level, 1) ltr
from dual
connect by level <= length(v_email))
select sum(case when cnt <> 1 then cnt end) / sum(cnt)
into v_ret
from (select ltr, count(1) cnt from l group by ltr);
return v_ret;
end;
and use it like here:
select rpt_similarity('abxabc#pqr.com') from dual;
or:
select rpt_similarity(email) from your_table;
Also you can use above solution in select directly, without function, here is the example:
create table test(id, email) as (
select 101, '989824123#hotmail.com' from dual union all
select 102, 'hsimpson#gmail.com' from dual union all
select 103, 'msimpson#gmail.com' from dual union all
select 104, 'bsimpson121314#hotmail.com' from dual union all
select 105, 'abxabx#hotmail.com' from dual );
with
a(id, email) as (select id, substr(email, 1,instr(email, '#', 1)-1) from test),
l as (
select id, email, substr(email, level, 1) ltr from a
connect by level <= length(email)
and prior id = id and prior sys_guid() is not null)
select id, email, sum(case when cnt <> 1 then cnt end) / sum(cnt)
from (select id, email, ltr, count(1) cnt from l group by id, ltr, email)
group by id, email;
connect by queries tends to be slow for large sets of data. Maybe you can adapt your regexp functions and it will be faster. I tried to do it, but your regexp_replace changes 99 into $ and 999 also into one $.

How to trim out letter in the column

I don't know the effective way to trim out letter in the name. For example, the f_name column have Jenny, Johnny, Doe, Ken, Smith.
I wanted to trim out the letter in these name so it consist only the first 2 letter. Like Je, Jo, Do, Ke, Sm as the output for the new column.
But the letter in these name don't have equal number of letter, like Johnny have 6 letter and John have 4 letter.
Is there any effective way to trim the uneven character's length without count all the character's length in f_name and place all the condition to trim all names. Like these below.
CASE WHEN LENGTH(f_name) > 4 THEN LTRIM(f_name, 2)
For Oracle use substr():
with data (f_name) as (
select 'Jenny' from dual union all
select 'Johnny' from dual union all
select 'Doe' from dual union all
select 'Ken' from dual union all
select 'Smith' from dual
)
select substr(f_name, 1, 2)
from data
Returns:
SUBSTR(F_NAME,1,2)
------------------
Je
Jo
Do
Ke
Sm
USE SUBSTRING
CASE WHEN LENGTH(f_name) > 4 THEN SUBSTR(f_name,1, 2)
If you want to get least acronym by all names. You may write something like
with s as (select level as lvl from dual connect by level <(select max(LENGTH(f_name)) from your_table ))
select f_name,
max(sub_f_name) keep (dense_rank FIRST order by cnt, t.lvl desc) as least_acronym
select f_name
, substr(t.f_name,-lvl) as sub_f_name
, t.lvl
, count(*) over (partition by substr(t.f_name,-lvl)) as cnt
from your_table t
, s)
group by f_name
NB. Just as Idea. Not tested yet

Select Value Match - SQL

Can you advise if it is possible, to select a count for numerous substrings in a query
so if I have a message field which contains for example, text messages and I could do
SELECT COUNT(1)
FROM MESSAGES
WHERE MESSAGE_BODY LIKE '%hello%'
but what I want to do is more:
SELECT STRING, COUNT(1)
FROM MESSAGES
WHERE MESSAGE_BODY IN (list of strings with wild card)
is this possible?
to break down example:
ID | Message_Body
1 | Hello, How Are You?
2 | Hi, Great Thanks
3 | Hello, How is things?
4 | Ciao
Output wanted:
hello , 2
ciao, 1
SELECT (input strings), COUNT(1)
FROM TABLE
WHERE (input strings) IN ('%hello%','%ciao%')
If I understood you correctly, you can try something like this:
SELECT t.string,
CASE WHEN t.MESSAGE_BODY LIKE '%laptop%' then 1 else 0 END +
CASE WHEN t.MESSAGE_BODY LIKE '%one%' then 1 else 0 END +
CASE WHEN t.MESSAGE_BODY LIKE '%two%' then 1 else 0 END as count_col
FROM YourTable t
If you just want multiple LIKE comaparison, use REGEXP_LIKE() :
SELECT STRING, COUNT(1)
FROM MESSAGES
where regexp_like(MESSAGE_BODY, 'one|two|laptop')
EDIT: You can use a derived table containing all strings you are intrested on and left join to the original table for count:
SELECT t.wrd,COUNT(s.id) as cnt
FROM (
SELECT 'hello' as wrd FROM DUAL
UNION ALL
SELECT 'ciao' as wrd FROM DUAL) t
LEFT OUTER JOIN messages s
ON(s.message_body LIKE '%' || t.wrd || '%')
GROUP BY t.wrd
Here is with looking for whole words:
SELECT a.word, COUNT (message.message_body)
FROM ( SELECT REGEXP_SUBSTR ('hello,ciao', '[^,]+', 1, LEVEL) word
FROM DUAL
CONNECT BY REGEXP_SUBSTR ('hello,ciao', '[^,]+', 1, LEVEL) IS NOT NULL) a
LEFT OUTER JOIN MESSAGES ON REGEXP_INSTR (MESSAGE_BODY, '(^|\s)' || a.word || '(\s|$)', 1, 1, 0, 'i') > 0
GROUP BY a.word

How to convert only first letter uppercase without using Initcap in Oracle?

Is there a way to convert the first letter uppercase in Oracle SQl without using the Initcap Function?
I have the problem, that I must work with the DISTINCT keyword in SQL clause and the Initcap function doesn´t work.
Heres is my SQL example:
select distinct p.nr, initcap(p.firstname), initcap(p.lastname), ill.describtion
from patient p left join illness ill
on p.id = ill.id
where p.deleted = 0
order by p.lastname, p.firstname;
I get this error message: ORA-01791: not a SELECTed expression
When SELECT DISTINCT, you can't ORDER BY columns that aren't selected. Use column aliases instead, as:
select distinct p.nr, initcap(p.firstname) fname, initcap(p.lastname) lname, ill.describtion
from patient p left join illness ill
on p.id = ill.id
where p.deleted = 0
order by lname, fname
this would do it, but i think you need to post your query as there may be a better solution
select upper(substr(<column>,1,1)) || substr(<column>,2,9999) from dual
To change string to String, you can use this:
SELECT
regexp_replace ('string', '[a-z]', upper (substr ('string', 1, 1)), 1, 1, 'i')
FROM dual;
This assumes that the first letter is the one you want to convert. It your input text starts with a number, such as 2 strings then it won't change it to 2 Strings.
You can also use the column number instead of the name or alias:
select distinct p.nr, initcap(p.firstname), initcap(p.lastname), ill.describtion
from patient p left join illness ill
on p.id = ill.id
where p.deleted = 0
order by 3, 2;
WITH inData AS
(
SELECT 'word1, wORD2, word3, woRD4, worD5, word6' str FROM dual
),
inRows as
(
SELECT 1 as tId, LEVEL as rId, trim(regexp_substr(str, '([A-Za-z0-9])+', 1, LEVEL)) as str
FROM inData
CONNECT BY instr(str, ',', 1, LEVEL - 1) > 0
)
SELECT tId, LISTAGG( upper(substr(str, 1, 1)) || substr(str, 2) , '') WITHIN GROUP (ORDER BY rId) AS camelCase
FROM inRows
GROUP BY tId;

Remove duplicate values from comma separated string in Oracle

I need your help with the regexp_replace function. I have a table which has a column for concatenated string values which contain duplicates. How do I eliminate them?
Example:
Ian,Beatty,Larry,Neesha,Beatty,Neesha,Ian,Neesha
I need the output to be
Ian,Beatty,Larry,Neesha
The duplicates are random and not in any particular order.
Update--
Here's how my table looks
ID Name1 Name2 Name3
1 a b c
1 c d a
2 d e a
2 c d b
I need one row per ID having distinct name1,name2,name3 in one row as a comma separated string.
ID Name
1 a,c,b,d,c
2 d,c,e,a,b
I have tried using listagg with distinct but I'm not able to remove the duplicates.
The easiest option I would go with -
SELECT ID, LISTAGG(NAME_LIST, ',')
FROM (SELECT ID, NAME1 NAME_LIST FROM DATA UNION
SELECT ID, NAME2 FROM DATA UNION
SELECT ID, NAME3 FROM DATA
)
GROUP BY ID;
Demo.
So, try this out...
([^,]+),(?=.*[A-Za-z],[] ]*\1)
I don't think you can do it just with regexp_replace if the repeated values are not next to each other. One approach is to split the values up, eliminate the duplicates, and then put them back together.
The common method to tokenize a delimited string is with regexp_substr and a connect by clause. Using a bind variable with your string to make the code a bit clearer:
var value varchar2(100);
exec :value := 'Ian,Beatty,Larry,Neesha,Beatty,Neesha,Ian,Neesha';
select regexp_substr(:value, '[^,]+', 1, level) as value
from dual
connect by regexp_substr(:value, '[^,]+', 1, level) is not null;
VALUE
------------------------------
Ian
Beatty
Larry
Neesha
Beatty
Neesha
Ian
Neesha
You can use that as a subquery (or CTE), get the distinct values from it, then reassemble it with listagg:
select listagg(value, ',') within group (order by value) as value
from (
select distinct value from (
select regexp_substr(:value, '[^,]+', 1, level) as value
from dual
connect by regexp_substr(:value, '[^,]+', 1, level) is not null
)
);
VALUE
------------------------------
Beatty,Ian,Larry,Neesha
It's a bit more complicated if you're looking at multiple rows in a table as that confused the connect-by syntax, but you can use a non-determinisitic reference to avoid loops:
with t42 (id, value) as (
select 1, 'Ian,Beatty,Larry,Neesha,Beatty,Neesha,Ian,Neesha' from dual
union all select 2, 'Mary,Joe,Mary,Frank,Joe' from dual
)
select id, listagg(value, ',') within group (order by value) as value
from (
select distinct id, value from (
select id, regexp_substr(value, '[^,]+', 1, level) as value
from t42
connect by regexp_substr(value, '[^,]+', 1, level) is not null
and id = prior id
and prior dbms_random.value is not null
)
)
group by id;
ID VALUE
---------- ------------------------------
1 Beatty,Ian,Larry,Neesha
2 Frank,Joe,Mary
Of course this wouldn't be necessary if you were storing relational data properly; having a delimited string in a column is not a good idea.
There is a way to find duplicates in this case, but it is a problem to remove them if there are more than one duplicated name within a string per id. Here is code that can deal with one duplicate per id.
Sample data:
WITH
tbl AS
(
Select 1 "ID", 'a' "NAME_1", 'b' "NAME_2", 'c' "NAME_3" From Dual Union All
Select 1 "ID", 'c' "NAME_1", 'd' "NAME_2", 'a' "NAME_3" From Dual Union All
Select 2 "ID", 'd' "NAME_1", 'e' "NAME_2", 'a' "NAME_3" From Dual Union All
Select 2 "ID", 'c' "NAME_1", 'd' "NAME_2", 'b' "NAME_3" From Dual
),
lists AS
(
Select 1 "ID", 'a,c,b,d,c' "NAME" From Dual Union All
Select 2 "ID", 'd,c,e,a,b' "NAME" From Dual
),
Creating CTE that compares your LISTAGG sttring with original data finding duplicate values:
grid AS
(
Select DISTINCT l.ID, l.NAME,
CASE WHEN ( Length(l.NAME || ',') - Length(Replace(l.NAME || ',', t.NAME_1 || ',', '')) ) / Length(t.NAME_1 || ',') > 1 THEN NAME_1 END "NAME_1",
CASE WHEN ( Length(l.NAME || ',') - Length(Replace(l.NAME || ',', t.NAME_2 || ',', '')) ) / Length(t.NAME_2 || ',') > 1 THEN NAME_2 END "NAME_2",
CASE WHEN ( Length(l.NAME || ',') - Length(Replace(l.NAME || ',', t.NAME_3 || ',', '')) ) / Length(t.NAME_3 || ',') > 1 THEN NAME_3 END "NAME_3"
From
lists l
Inner Join
tbl t ON(t.ID = l.ID)
)
ID NAME NAME_1 NAME_2 NAME_3
---------- --------- ------ ------ ------
2 d,c,e,a,b
1 a,c,b,d,c c
1 a,c,b,d,c c
Main SQL, using Union, builds new string (removing second appearance) where the duplicate was found and then puts that new string after comparison with the old one.
SELECT DISTINCT l.ID, Nvl(g.NAME, l.NAME) NAME
FROM
lists l
LEFT JOIN
(
SELECT ID, CASE WHEN NAME_1 Is Not Null
THEN REPLACE(NAME, NAME, COALESCE( REPLACE( SubStr(NAME, 1, InStr(NAME, NAME_1, 1, 2) - 1) || SubStr(NAME, InStr(NAME, NAME_1, 1, 2) + Length(NAME_1)), ',,', ','), NULL ) )
END "NAME"
FROM grid
WHERE COALESCE(NAME_1, NAME_2, NAME_3) IS NOT NULL
UNION ALL
SELECT ID, CASE WHEN NAME_2 Is Not Null
THEN REPLACE(NAME, NAME, COALESCE( REPLACE( SubStr(NAME, 1, InStr(NAME, NAME_2, 1, 2) - 1) || SubStr(NAME, InStr(NAME, NAME_2, 1, 2) + Length(NAME_2)), ',,', ','), NULL ) )
END "NAME"
FROM grid
WHERE COALESCE(NAME_1, NAME_2, NAME_3) IS NOT NULL
UNION ALL
SELECT ID, CASE WHEN NAME_3 Is Not Null
THEN REPLACE(NAME, NAME, COALESCE( REPLACE( SubStr(NAME, 1, InStr(NAME, NAME_3, 1, 2) - 1) || SubStr(NAME, InStr(NAME, NAME_3, 1, 2) + Length(NAME_3)), ',,', ','), NULL ) )
END "NAME"
FROM grid
WHERE COALESCE(NAME_1, NAME_2, NAME_3) IS NOT NULL
) g ON(g.ID = l.ID And Length(g.NAME) < Length(l.NAME))
R e s u l t :
ID NAME
---------- -------------
2 d,c,e,a,b
1 a,c,b,d
For multiple occurences within a string or for multiplicated different names there should be done some recursions or multiplied nestings to get it done...