Comparing string values within a table - sql

Is there any way to compare two string columns with each other and get the matches?
I have two columns containing names: one with the full name, the other with (mostly) just the surname.
I tried soundex, but it only returns rows where the values in both columns sound almost the same.
SELECT * FROM TABLE
WHERE soundex(FullName) = soundex(Surname)
1 John Doe Doe
2 Peter Parker Parker
3 Brian Griffin Brian Griffin
With soundex, only the third row matches.

A simple option is to use instr, which shows whether surname exists in fullname:
SQL> with test (id, fullname, surname) as
       (select 1, 'John Doe'     , 'Doe'           from dual union all
        select 2, 'Peter Parker' , 'Parker'        from dual union all
        select 3, 'Brian Griffin', 'Brian Griffin' from dual
       )
     select *
     from test
     where instr(fullname, surname) > 0;
        ID FULLNAME      SURNAME
---------- ------------- -------------
         1 John Doe      Doe
         2 Peter Parker  Parker
         3 Brian Griffin Brian Griffin
Another option is to use one of the UTL_MATCH functions, e.g. Jaro-Winkler similarity, which shows how well those strings match:
SQL> with test (id, fullname, surname) as
       (select 1, 'John Doe'     , 'Doe'           from dual union all
        select 2, 'Peter Parker' , 'Parker'        from dual union all
        select 3, 'Brian Griffin', 'Brian Griffin' from dual
       )
     select id, fullname, surname,
            utl_match.jaro_winkler_similarity(fullname, surname) jws
     from test
     order by id;
        ID FULLNAME      SURNAME              JWS
---------- ------------- ------------- ----------
         1 John Doe      Doe                   48
         2 Peter Parker  Parker                62
         3 Brian Griffin Brian Griffin        100
SQL>
Feel free to explore the other functions that package offers.
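As a rough sketch of what else UTL_MATCH offers, the query below puts its four functions side by side on the same sample rows. The column aliases (ed, eds, jw, jws) are made up for illustration, and any cut-off you apply to these scores would be your own choice.

```sql
-- Sketch: compare the UTL_MATCH functions on the sample data.
-- Aliases ed / eds / jw / jws are illustrative names only.
with test (id, fullname, surname) as
  (select 1, 'John Doe'     , 'Doe'           from dual union all
   select 2, 'Peter Parker' , 'Parker'        from dual union all
   select 3, 'Brian Griffin', 'Brian Griffin' from dual
  )
select id, fullname, surname,
       utl_match.edit_distance(fullname, surname)            as ed,   -- edit (Levenshtein) distance, lower = closer
       utl_match.edit_distance_similarity(fullname, surname) as eds,  -- normalized to 0..100
       utl_match.jaro_winkler(fullname, surname)             as jw,   -- 0..1
       utl_match.jaro_winkler_similarity(fullname, surname)  as jws   -- 0..100
from test
order by id;
```

These are similarity scores rather than yes/no answers, so they suit "how close is it" questions better than a strict "does the surname occur in the full name" test.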
Also, note that I didn't pay attention to possible letter case differences (e.g. "DOE" vs. "Doe"). If you need that as well, compare e.g. upper(surname) to upper(fullname).
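A minimal sketch of the case-insensitive variant, assuming the same test CTE but with the surnames deliberately written in a different letter case:

```sql
-- Sketch: case-insensitive INSTR by upper-casing both sides.
-- The mixed-case sample surnames are assumptions made for this example.
with test (id, fullname, surname) as
  (select 1, 'John Doe'     , 'DOE'           from dual union all
   select 2, 'Peter Parker' , 'parker'        from dual union all
   select 3, 'Brian Griffin', 'Brian Griffin' from dual
  )
select *
from test
where instr(upper(fullname), upper(surname)) > 0;
```

Without the upper() calls, only the third row would pass the INSTR test on this data.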

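Use the INSTR function. Note the argument order: the string to search comes first, the substring to look for comes second:
SELECT * FROM TABLE
WHERE instr(FullName, Surname) > 0;
For a case-insensitive comparison, convert both sides to upper case:
SELECT * FROM TABLE
WHERE instr(upper(FullName), upper(Surname)) > 0;
The same test can also be written with LIKE:
SELECT * FROM TABLE
WHERE upper(FullName) LIKE '%' || upper(Surname) || '%';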

As far as I know there is nothing out of the box when matching becomes complicated. For the cases shown, however, the following expression would suffice:
where fullname like '%' || surname
Update
The main problem may be false positives:
The last name 'Park' appears in 'Peter Parker'. The query above avoids this by matching only at the end of the full name.
Another problem may be upper / lower case, as mentioned in the other answers (not shown in your sample data).
You want the last name 'PARKER' to match 'Peter Parker'.
But when the strings are compared case-insensitively, another problem arises:
The last name 'Strong' will suddenly match 'Louis Armstrong'.
A solution is to prepend a blank to both sides, so that the match has to start at a word boundary:
where ' ' || upper(fullname) like '% ' || upper(surname)
' LOUIS ARMSTRONG' like '% STRONG' -> false
' LOUIS ARMSTRONG' like '% ARMSTRONG' -> true
' LOUIS ARMSTRONG' like '% LOUIS ARMSTRONG' -> true
Demo: https://dbfiddle.uk/?rdbms=oracle_18&fiddle=0ac5c80061b4aeac1153a8c5976e6e54
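For convenience, here is a self-contained sketch of the blank-padding condition. The sample rows are assumptions chosen to exercise the cases discussed above (case difference, 'Park' vs. 'Parker', 'Strong' vs. 'Armstrong').

```sql
-- Sketch: surname must match the end of the full name, starting at a word boundary.
-- Sample rows are invented for illustration.
with test (id, fullname, surname) as
  (select 1, 'John Doe'       , 'DOE'           from dual union all
   select 2, 'Peter Parker'   , 'Park'          from dual union all
   select 3, 'Louis Armstrong', 'Strong'        from dual union all
   select 4, 'Brian Griffin'  , 'Brian Griffin' from dual
  )
select *
from test
where ' ' || upper(fullname) like '% ' || upper(surname);
```

With this data the condition should keep rows 1 and 4 and reject rows 2 and 3. One caveat: a surname that itself contains % or _ would be treated as a LIKE wildcard.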

Related

How do I split entire full name column in Oracle [duplicate]

In Oracle, I've got a full name column. I want to split that into a first name column and a last name column. SQL code?
If it's not working for all rows, then your rows have different delimiters.
with my_data as (
select 'john smith' as full_name from dual union all
select 'rudy chan' from dual union all
select 'h gonzalez' from dual
)
SELECT full_name,
SUBSTR(full_name, 1, INSTR(full_name, ' ')-1) AS first_name,
SUBSTR(full_name, INSTR(full_name, ' ')+1) AS last_name
FROM my_data
FULL_NAME   FIRST_NAME  LAST_NAME
----------  ----------  ---------
john smith  john        smith
rudy chan   rudy        chan
h gonzalez  h           gonzalez
UPDATE
Based on your comments below, you are looking for how to ADD columns to a table. Broke this out into two steps....
--adding two columns
Alter table my_data
add (
first_name varchar(20),
last_name varchar(20)
);
--update the newly added columns
update my_data
set first_name = SUBSTR(full_name, 1, INSTR(full_name, ' ')-1),
last_name = SUBSTR(full_name, INSTR(full_name, ' ')+1);
select *
from my_data
FULL_NAME   FIRST_NAME  LAST_NAME
----------  ----------  ---------
john smith  john        smith
rudy chan   rudy        chan
h gonzalez  h           gonzalez
Isolated's option is just fine (according to the task); another option might be regular expressions.
They look simpler and work OK on small data sets, such as your table with 40 rows; for large tables, e.g. hundreds of millions of rows, the substr + instr combination will most probably be much faster.
Sample data:
SQL> select * from my_data;
FULL_NAME FIRST_NAME LAST_NAME
---------- -------------------- --------------------
john smith
rudy chan
h gonzalez
Query:
SQL> update my_data set
       first_name = regexp_substr(full_name, '^\w+'),
       last_name  = regexp_substr(full_name, '\w+$');
3 rows updated.
Result:
SQL> select * from my_data;
FULL_NAME FIRST_NAME LAST_NAME
---------- -------------------- --------------------
john smith john smith
rudy chan rudy chan
h gonzalez h gonzalez
SQL>
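The two approaches agree on two-word names but behave differently elsewhere. The sketch below runs both on a few extra rows; the three-word and one-word names are assumptions added for illustration, as are the column aliases.

```sql
-- Sketch: substr + instr vs. regexp_substr on names that are not exactly two words.
-- Sample names and aliases are invented for this comparison.
with my_data (full_name) as (
  select 'john smith'     from dual union all
  select 'mary ann jones' from dual union all
  select 'cher'           from dual
)
select full_name,
       substr(full_name, 1, instr(full_name, ' ') - 1) as first_substr,  -- text before the first space
       substr(full_name, instr(full_name, ' ') + 1)    as last_substr,   -- everything after the first space
       regexp_substr(full_name, '^\w+')                as first_regexp,  -- leading run of word characters
       regexp_substr(full_name, '\w+$')                as last_regexp    -- trailing run of word characters
from my_data;
```

For 'mary ann jones' the substr version keeps 'ann jones' as the last name, while the regexp version returns only 'jones'. For a single word, instr returns 0, so first_substr comes back null, whereas both regexp columns return the whole word. Also keep in mind that \w does not match apostrophes or hyphens, so a name like O'Brien would be cut short by the regexp version.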

SQL query to remove the comma at the end with no characters beyond

I have created a query on table IRC_TABLE that produces output like this:
Requistion_number Name
12 John Mayer, Andrew,
11 Swastak,
If a value in Name has a comma at the end and nothing after it, the comma should be removed:
Requistion_number Name
12 John Mayer, Andrew
11 Swastak
Which function will help me achieve this?
The easiest and probably most performant way to do this would be to use TRIM:
SELECT Requistion_number, TRIM(TRAILING ',' FROM Name) AS Name
FROM yourTable;
You could also use REGEXP_REPLACE here:
SELECT Requistion_number, REGEXP_REPLACE(Name, ',$', '') AS Name
FROM yourTable;
The regex option would be of more value if the replacement logic were more complex than just stripping off a certain final character.
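As one example of slightly more complex logic, the sketch below also removes a trailing comma that is followed only by whitespace. The sample rows, including the trailing space in the first one, are assumptions made for illustration.

```sql
-- Sketch: strip a final comma even when trailing whitespace follows it.
-- Sample data is invented; the first row ends in ", " on purpose.
with your_table (requistion_number, name) as (
  select 12, 'John Mayer, Andrew, ' from dual union all
  select 11, 'Swastak,'             from dual union all
  select 33, 'No comma, here'       from dual
)
select requistion_number,
       regexp_replace(name, ',\s*$', '') as name_regexp,  -- comma plus optional trailing whitespace
       rtrim(name, ', ')                 as name_rtrim    -- strips ANY trailing commas and spaces
from your_table;
```

The two columns are not equivalent in general: rtrim with ', ' also removes trailing spaces from values that have no comma at all, while the regular expression only touches values that end in a comma (optionally followed by whitespace).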
Yet another option (apart from what Tim already said) is the rtrim (right-trim) function.
When we're searching for various options, even substr with case expression might do (but hey, you surely will not want to use it):
SQL> select * From your_table;
REQUISTION_NUMBER NAME
----------------- -------------------
122 John Mayer, Andrew,
111 Swastak,
333 No comma, here
SQL> select requistion_number,
            rtrim(name, ',') as name_1,
            substr(name, 1, length(name) - case when substr(name, -1) = ',' then 1
                                                else 0
                                           end) as name_2
     from your_table;
REQUISTION_NUMBER NAME_1 NAME_2
----------------- ------------------- -------------------
122 John Mayer, Andrew John Mayer, Andrew
111 Swastak Swastak
333 No comma, here No comma, here
SQL>

Filter invalid ids from the oracle table

My table
NAME
Peter
Lance
Oscar
Steve
Reddy
The input to my query is an array of strings, let's say Peter, Bond, Steve, Smith.
My query should return the invalid values from my input, i.e. Bond and Smith.
I am using Oracle 12.1.0 and odcivarchar2list is not supported.
Any suggestions would be highly appreciated.
You can use a CTE:
-- your_table is a placeholder for the real table name
with list_string as (
select 'Peter' as name from dual union all
select 'Bond' as name from dual union all
select 'Steve' as name from dual union all
select 'Smith' as name from dual
)
select ls.name, 'Invalid Values' as status
from list_string ls
where not exists (select 1 from your_table t1 where t1.name = ls.name);
Some more options.
Data you have:
SQL> select * from test;
NAME
-----
Peter
Lance
Oscar
Steve
Reddy
If you don't mind enclosing names into single quotes, then this might be an option:
SQL> select column_value result
     from table(sys.odcivarchar2list('Peter', 'Bond', 'Steve', 'Smith'))
     minus
     select t.name
     from test t;
RESULT
-----------------------------------------------------------------------------
Bond
Smith
SQL>
If you'd just want to enter those names "normally", comma-separated, then:
SQL> with
       sample (val) as
         (select 'Peter, Bond, Steve, Smith' from dual)
     select trim(regexp_substr(s.val, '[^,]+', 1, level)) result
     from sample s
     connect by level <= regexp_count(s.val, ',') + 1
     minus
     select t.name
     from test t;
RESULT
---------------------------------------------------------------------
Bond
Smith
SQL>

Oracle SQL Repeated words in the String

I need your suggestions on the following task. I have the following table:
ID ID_NAME
------ ---------------------------------
1 TOM HANKS TOM JR
2 PETER PATROL PETER JOHN PETER
3 SAM LIVING
4 JOHNSON & JOHNSON INC
5 DUHGT LLC
6 THE POST OF THE OFFICE
7 TURNING REP WEST
8 GEORGE JOHN
I need a SQL query to find repeated words for every ID. If a repeated word exists, I need the number of times it is repeated.
For instance, in ID 2 the word PETER is repeated 3 times and in ID 1 the word TOM is repeated twice, so I need output something like this:
ID ID_NAME COUNT
------ --------------------------------- --------
1 TOM HANKS TOM JR 2
2 PETER PATROL PETER JOHN PETER 3
3 SAM LIVING 0
4 JOHNSON & JOHNSON INC 2
5 DUHGT LLC 0
6 THE POST OF THE OFFICE 2
7 TURNING REP WEST 0
8 GEORGE JOHN 0
Just an FYI: the table has 560K rows.
I tried the query below, but it didn't work; it literally counts every single word.
SELECT RESULT, COUNT(*)
FROM (SELECT
REGEXP_SUBSTR(COL_NAME, '[^ ]+', 1, COLUMN_VALUE) RESULT
FROM TABLE_NAME T ,
TABLE(CAST(MULTISET(SELECT DISTINCT LEVEL
FROM TABLE_NAME X
CONNECT BY LEVEL <= LENGTH(X.COL_NAME) - LENGTH(REPLACE(X.COL_NAME, ' ', '')) + 1
) AS SYS.ODCINUMBERLIST)) T1
)
WHERE RESULT IS NOT NULL
GROUP BY RESULT
ORDER BY 1;
Please let me know your inputs.
The query below counts repeated words and returns the highest count (if a word appears three times and another appears twice, the result will be the number 3). It treats JOHN as different from John (if capitalization shouldn't count as "different" then wrap the input strings within UPPER(...)). It only considers space as a word delimiter; if something else, like dash, is also considered as a delimiter, add to the REGEXP search pattern. Make sure you put a dash right at the end of a square-bracketed matching character list, etc. - the usual "tricks" for matching character lists. More generally, adapt as needed.
The query first breaks each input string into individual words, and counts how many times each word appears. For the count, I only need the words ("tokens") in the GROUP BY clause, I don't need to actually SELECT them, this is why the innermost query may look odd if you aren't forewarned. (Now you are!)
It also seems you want to show null rather than 1 if there are no repeated words, so I wrote the query to accommodate that. (Not sure why 1 wasn't OK.)
with
test_data ( id, id_name ) as (
select 1, 'TOM HANKS TOM JR' from dual union all
select 2, 'PETER PATROL PETER JOHN PETER' from dual union all
select 3, 'SAM LIVING' from dual union all
select 4, 'JOHNSON & JOHNSON INC' from dual union all
select 5, 'DUHGT LLC' from dual union all
select 6, 'THE POST OF THE OFFICE' from dual union all
select 7, 'TURNING REP WEST' from dual union all
select 8, 'GEORGE JOHN' from dual
)
-- end of test data; SQL query begins below this line
select id, id_name, case when max(cnt) >= 2 then max(cnt) end as max_count
from (
select id, id_name, count(*) as cnt
from test_data
connect by level <= 1 + regexp_count(id_name, ' ')
and prior id = id
and prior sys_guid() is not null
group by id, id_name, regexp_substr(id_name, '[^ ]+', 1, level)
)
group by id, id_name
order by id -- if needed
;
Output:
ID ID_NAME MAX_COUNT
-- ----------------------------- ----------
1 TOM HANKS TOM JR 2
2 PETER PATROL PETER JOHN PETER 3
3 SAM LIVING
4 JOHNSON & JOHNSON INC 2
5 DUHGT LLC
6 THE POST OF THE OFFICE 2
7 TURNING REP WEST
8 GEORGE JOHN
8 rows selected.
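The adaptations mentioned above (ignoring letter case, treating a dash as a delimiter) could look roughly like the sketch below. It is an assumption-laden variant, not the original query: the sample rows are invented, and the number of tokens is taken from regexp_count on the token pattern itself, so that runs of consecutive delimiters such as ' - ' do not produce empty tokens.

```sql
-- Sketch: same idea as the query above, with UPPER() and '-' as an extra delimiter.
-- Sample rows are invented for illustration.
with
  test_data ( id, id_name ) as (
    select 1, 'Tom Hanks TOM-ALAN' from dual union all
    select 2, 'GEORGE JOHN'        from dual union all
    select 3, 'JOHN - JOHN'        from dual
  )
select id, id_name, case when max(cnt) >= 2 then max(cnt) end as max_count
from   (
         select id, id_name, count(*) as cnt
         from   test_data
         -- one level per token; tokens are runs of characters other than space or dash
         connect by level <= regexp_count(id_name, '[^ -]+')
                and prior id = id
                and prior sys_guid() is not null
         -- group on the upper-cased token so that Tom and TOM count as the same word
         group by id, id_name, regexp_substr(upper(id_name), '[^ -]+', 1, level)
       )
group by id, id_name
order by id;
```

Note that the dash sits at the end of the bracketed list, where it is taken literally.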
EDIT:
If you only need to find the rows where the string column has at least one repeated word, and you don't care what the highest "repeated word count" is or how many words are repeated, the solution is simpler and more efficient; you don't need to split the input string into component words and count them.
(The OP indicated in the comments, after long dialogue, that this would suffice.)
In the solution the "match pattern" in regexp_like searches for a string of letters, preceded by either the beginning of the string or a space or a dash and ended by space, comma, period, question mark, exclamation point or dash. Both "markers", for beginning and end of a word, can be modified as needed. Make sure the dash is either the first or last character in [...], anywhere else it has a special meaning.
Then it looks for a repetition of the word. That's what \2 does in the match pattern. It's 2 and not 1 because the "word" is in the second pair of parentheses; I need the first pair for the alternation, EITHER start-of-string OR (space or dash).
Look at the first and the last string for special situations that this query covers correctly. Think of any other possible situations that the query may or may not cover.
with
test_data ( id, id_name ) as (
select 1, 'TOM HANKS TOM-ALAN' from dual union all
select 2, 'PETER PATROL PETER JOHN PETER' from dual union all
select 3, 'SAM LIVING' from dual union all
select 4, 'JOHNSON & JOHNSON INC' from dual union all
select 5, 'DUHGT LLC' from dual union all
select 6, 'THE POST OF THE OFFICE' from dual union all
select 7, 'TURNING REP WEST' from dual union all
select 8, 'GEORGE JOHN-JOHN' from dual
)
-- end of test data; SQL query begins below this line
select id, id_name
from test_data
where regexp_like(id_name, '(^|[ -])([[:alpha:]]+)[ ,.?!-].*\2')
order by id -- if needed
;
ID ID_NAME
-- -----------------------------
1 TOM HANKS TOM-ALAN
2 PETER PATROL PETER JOHN PETER
4 JOHNSON & JOHNSON INC
6 THE POST OF THE OFFICE
8 GEORGE JOHN-JOHN
The next solution finds the first repeated word and, in a second step, counts how many times it occurs. It was edited to stop it from counting longer words that merely contain the repeated word.
with s (ID, ID_NAME) as (
select 1, 'TOM HANKS TOM JR' from dual union all
select 10, 'TO TOM TOM TOM TOM TO TO TO STOM HANKS TOM TOMMY' from dual union all
select 2, 'PETER PATROL PETER JOHN PETER' from dual union all
select 3, 'SAM LIVING' from dual union all
select 4, 'qwe JOHNSON & JOHNSON INC' from dual union all
select 5, 'DUHGT LLC' from dual union all
select 6, 'THE POST OF THE OFFICE ' from dual union all
select 7, 'TURNING REP WEST ' from dual union all
select 8, 'GEORGE JOHN ' from dual)
select id,
case when r1 = 0 then 0
else regexp_count(id_name, r3)
- regexp_count(id_name, r3||'\w+') -- exclude words with a tail
- regexp_count(id_name, '\w+'||r3) -- exclude words with a head
+ regexp_count(id_name, '\w+'||r3||'\w+') -- add back words with both head and tail (subtracted twice above)
end as rep_count
from (
select
s.*,
regexp_instr(s.id_name, '(^|\s)(\w+)(\s|$)(.*(\2))+') as r1 ,
regexp_replace(s.id_name, '.*?(^|\s)(\w+)(\s)(.*(\s)\2(\s|$))+.*$', '\2') as r3
from s);
The result is:
ID REP_COUNT
---------- ----------
1 2
10 4
2 3
3 0
4 2
5 0
6 2
7 0
8 0

Detect duplicate string or word in a row

I want to know how to detect a duplicate word in a row. This is to ensure that we have clean data in our database.
For example, see below:
Name count
James James McCarthy 1
Donald Hughes Hughes 1
I want the result to be like
Name count
James McCarthy 1
Donald Hughes 1
Is there a solution to this using Oracle SQL?
For adjacent words
select 1
from dual
where regexp_like ('John John Doe','(^|\s)(\S+)\s+\2(\s|$)')
;
or
select case when regexp_like ('John John Doe','(^|\s)(\S+)\s+\2(\s|$)') then 'Y' end as adj_duplicate
from dual
;
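The queries above only detect the duplicates. To produce the cleaned value, a similar pattern can be used with regexp_replace. This is a minimal sketch under stated assumptions: the names CTE is invented for illustration, the comparison is case-sensitive, and only adjacent repeats are handled.

```sql
-- Sketch: collapse adjacent repeated words into a single occurrence.
-- The names CTE and its rows are sample data made up for this example.
with names (name) as (
  select 'James James McCarthy' from dual union all
  select 'Donald Hughes Hughes' from dual union all
  select 'John Doe'             from dual
)
select name,
       -- \1 = leading boundary, \2 = the word, \3 = its adjacent repeats, \4 = trailing boundary
       regexp_replace(name, '(^|\s)(\S+)(\s+\2)+(\s|$)', '\1\2\4') as cleaned
from names;
```

On these rows the expression should yield 'James McCarthy' and 'Donald Hughes' and leave 'John Doe' unchanged. Because each match consumes the whitespace after it, two different duplicated pairs directly next to each other (e.g. 'a a b b') may need a second pass.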