Oracle SQL Repeated words in the String - sql

I need your suggestions/inputs on the following task. I have the following table:
ID ID_NAME
------ ---------------------------------
1 TOM HANKS TOM JR
2 PETER PATROL PETER JOHN PETER
3 SAM LIVING
4 JOHNSON & JOHNSON INC
5 DUHGT LLC
6 THE POST OF THE OFFICE
7 TURNING REP WEST
8 GEORGE JOHN
I need a SQL query to find repeated words for every ID. If a repeated word exists, I need the count of that word.
For instance, in ID 2 the word PETER appears 3 times, and in ID 1 the word TOM appears twice, so I need output something like this:
ID ID_NAME COUNT
------ --------------------------------- --------
1 TOM HANKS TOM JR 2
2 PETER PATROL PETER JOHN PETER 3
3 SAM LIVING 0
4 JOHNSON & JOHNSON INC 2
5 DUHGT LLC 0
6 THE POST OF THE OFFICE 2
7 TURNING REP WEST 0
8 GEORGE JOHN 0
Just an FYI, the table has 560K rows.
I tried the query below, but it didn't work; it is literally looking for every single word.
SELECT RESULT, COUNT(*)
FROM (SELECT
REGEXP_SUBSTR(COL_NAME, '[^ ]+', 1, COLUMN_VALUE) RESULT
FROM TABLE_NAME T ,
TABLE(CAST(MULTISET(SELECT DISTINCT LEVEL
FROM TABLE_NAME X
CONNECT BY LEVEL <= LENGTH(X.COL_NAME) - LENGTH(REPLACE(X.COL_NAME, ' ', '')) + 1
) AS SYS.ODCINUMBERLIST)) T1
)
WHERE RESULT IS NOT NULL
GROUP BY RESULT
ORDER BY 1;
Please let me know your inputs.

The query below counts repeated words and returns the highest count (if one word appears three times and another appears twice, the result is 3). It treats JOHN as different from John; if capitalization shouldn't count as a difference, wrap the input strings in UPPER(...). It only considers a space as a word delimiter; if something else, like a dash, should also count as a delimiter, add it to the regular expression search pattern. Make sure you put a dash right at the end of a square-bracketed matching character list, and so on: the usual "tricks" for matching character lists. More generally, adapt as needed.
The query first breaks each input string into individual words, and counts how many times each word appears. For the count, I only need the words ("tokens") in the GROUP BY clause, I don't need to actually SELECT them, this is why the innermost query may look odd if you aren't forewarned. (Now you are!)
It also seems you want to show null rather than 1 if there are no repeated words, so I wrote the query to accommodate that. (Not sure why 1 wasn't OK.)
with
test_data ( id, id_name ) as (
select 1, 'TOM HANKS TOM JR' from dual union all
select 2, 'PETER PATROL PETER JOHN PETER' from dual union all
select 3, 'SAM LIVING' from dual union all
select 4, 'JOHNSON & JOHNSON INC' from dual union all
select 5, 'DUHGT LLC' from dual union all
select 6, 'THE POST OF THE OFFICE' from dual union all
select 7, 'TURNING REP WEST' from dual union all
select 8, 'GEORGE JOHN' from dual
)
-- end of test data; SQL query begins below this line
select id, id_name, case when max(cnt) >= 2 then max(cnt) end as max_count
from (
select id, id_name, count(*) as cnt
from test_data
connect by level <= 1 + regexp_count(id_name, ' ')
and prior id = id
and prior sys_guid() is not null
group by id, id_name, regexp_substr(id_name, '[^ ]+', 1, level)
)
group by id, id_name
order by id -- if needed
;
Output:
ID ID_NAME MAX_COUNT
-- ----------------------------- ----------
1 TOM HANKS TOM JR 2
2 PETER PATROL PETER JOHN PETER 3
3 SAM LIVING
4 JOHNSON & JOHNSON INC 2
5 DUHGT LLC
6 THE POST OF THE OFFICE 2
7 TURNING REP WEST
8 GEORGE JOHN
8 rows selected.
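As a rough sketch of the adaptations mentioned above (ignoring case with UPPER, and treating a dash as a delimiter too), the query could be changed along these lines. The two test strings are made up for illustration. The loop bound here counts tokens instead of spaces, which is my own variation so that consecutive delimiters don't produce empty tokens; treat it as a starting point, not something tested against the real 560K-row table.

```sql
with
test_data ( id, id_name ) as (
  -- made-up rows: mixed case and a dash used as a separator
  select 1, 'Tom HANKS tom-JR' from dual union all
  select 2, 'SAM LIVING'       from dual
)
select   id, id_name,
         case when max(cnt) >= 2 then max(cnt) end as max_count
from     (
           select   id, id_name, count(*) as cnt
           from     test_data
           -- one level per token; a token is a run of non-space, non-dash characters
           connect by level <= regexp_count(id_name, '[^ -]+')
                  and prior id = id
                  and prior sys_guid() is not null
           -- UPPER makes 'Tom' and 'tom' fall into the same group
           group by id, id_name,
                    regexp_substr(upper(id_name), '[^ -]+', 1, level)
         )
group by id, id_name
order by id;
```

With these rows, ID 1 should come back with MAX_COUNT 2 (TOM twice, once in each case) and ID 2 with null.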
EDIT:
If you only need to find the rows where the string column has at least one repeated word, and you don't care what the highest "repeated word count" is or how many words are repeated, the solution is simpler and more efficient; you don't need to split the input string into component words and count them.
(The OP indicated in the comments, after long dialogue, that this would suffice.)
In the solution the "match pattern" in regexp_like searches for a string of letters, preceded by either the beginning of the string or a space or a dash and ended by space, comma, period, question mark, exclamation point or dash. Both "markers", for beginning and end of a word, can be modified as needed. Make sure the dash is either the first or last character in [...], anywhere else it has a special meaning.
Then it looks for a repetition of the word. That's what \2 does in the match pattern. It's 2 and not 1 because the "word" is in the second pair of parentheses; I need the first pair for the alternation, EITHER start-of-string OR (space or dash).
Look at the first and the last string for special situations that this query covers correctly. Think of any other possible situations that the query may or may not cover.
with
test_data ( id, id_name ) as (
select 1, 'TOM HANKS TOM-ALAN' from dual union all
select 2, 'PETER PATROL PETER JOHN PETER' from dual union all
select 3, 'SAM LIVING' from dual union all
select 4, 'JOHNSON & JOHNSON INC' from dual union all
select 5, 'DUHGT LLC' from dual union all
select 6, 'THE POST OF THE OFFICE' from dual union all
select 7, 'TURNING REP WEST' from dual union all
select 8, 'GEORGE JOHN-JOHN' from dual
)
-- end of test data; SQL query begins below this line
select id, id_name
from test_data
where regexp_like(id_name, '(^|[ -])([[:alpha:]]+)[ ,.?!-].*\2')
order by id -- if needed
;
ID ID_NAME
-- -----------------------------
1 TOM HANKS TOM-ALAN
2 PETER PATROL PETER JOHN PETER
4 JOHNSON & JOHNSON INC
6 THE POST OF THE OFFICE
8 GEORGE JOHN-JOHN
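One situation the pattern above does not rule out is a "repeat" that is only the beginning of a longer word: \2 may match anywhere after the first occurrence, so TOM followed later by TOMMY also counts. A possible tightening, assuming the same delimiters as above, is to require a delimiter (or the end of the string) on both sides of the second occurrence. The test strings below are made up for illustration.

```sql
with
test_data ( id, id_name ) as (
  select 1, 'TOM HANKS TOM-ALAN' from dual union all
  select 2, 'TOM HANKS TOMMY'    from dual union all  -- TOM only as a prefix
  select 3, 'GEORGE JOHN-JOHN'   from dual
)
select id, id_name
from   test_data
-- group 1: start of string or delimiter; group 2: the word;
-- group 3 (optional): anything up to a delimiter right before the repeat;
-- the repeat must then be followed by a delimiter or the end of the string
where  regexp_like(id_name,
         '(^|[ -])([[:alpha:]]+)[ ,.?!-](.*[ ,.?!-])?\2([ ,.?!-]|$)')
order by id;
```

This should return IDs 1 and 3 but not ID 2, whereas the original pattern also returns ID 2.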

The next solution finds the first repeated word and then, in a second step, counts how many times it is repeated. Edited to stop it from counting the word when it appears only as part of a longer word.
with s (ID, ID_NAME) as (
select 1, 'TOM HANKS TOM JR' from dual union all
select 10, 'TO TOM TOM TOM TOM TO TO TO STOM HANKS TOM TOMMY' from dual union all
select 2, 'PETER PATROL PETER JOHN PETER' from dual union all
select 3, 'SAM LIVING' from dual union all
select 4, 'qwe JOHNSON & JOHNSON INC' from dual union all
select 5, 'DUHGT LLC' from dual union all
select 6, 'THE POST OF THE OFFICE ' from dual union all
select 7, 'TURNING REP WEST ' from dual union all
select 8, 'GEORGE JOHN ' from dual)
select id,
case when r1 = 0 then 0
else regexp_count(id_name, r3)
- regexp_count(id_name, r3||'\w+') -- exclude words with a tail
- regexp_count(id_name, '\w+'||r3) -- exclude words with a head
+ regexp_count(id_name, '\w+'||r3||'\w+') -- add back words with both head and tail, which were subtracted twice
end as rep_count
from (
select
s.*,
regexp_instr(s.id_name, '(^|\s)(\w+)(\s|$)(.*(\2))+') as r1 ,
regexp_replace(s.id_name, '.*?(^|\s)(\w+)(\s)(.*(\s)\2(\s|$))+.*$', '\2') as r3
from s);
The result is:
ID REP_COUNT
---------- ----------
1 2
10 4
2 3
3 0
4 2
5 0
6 2
7 0
8 0

Related

Comparing string values within a table

Is there any way to compare two string columns with each other and get the matches?
I have two columns containing names: one with the full name, the other with (mostly) just the surname.
I tried soundex, but it only returns rows where the values in both columns sound almost the same.
SELECT * FROM TABLE
WHERE soundex(FullName) = soundex(Surname)
1 John Doe Doe
2 Peter Parker Parker
3 Brian Griffin Brian Griffin
With soundex it only matches the 3rd line.
A simple option is to use instr, which shows whether surname exists in fullname:
with test (id, fullname, surname) as
(select 1, 'John Doe' , 'Doe' from dual union all
select 2, 'Peter Parker' , 'Parker' from dual union all
select 3, 'Brian Griffin', 'Brian Griffin' from dual
)
select *
from test
where instr(fullname, surname) > 0;
ID FULLNAME SURNAME
---------- ------------- -------------
1 John Doe Doe
2 Peter Parker Parker
3 Brian Griffin Brian Griffin
Another option is to use one of UTL_MATCH functions, e.g. Jaro-Winkler similarity which shows how well those strings match:
with test (id, fullname, surname) as
(select 1, 'John Doe' , 'Doe' from dual union all
select 2, 'Peter Parker' , 'Parker' from dual union all
select 3, 'Brian Griffin', 'Brian Griffin' from dual
)
select id, fullname, surname,
utl_match.jaro_winkler_similarity(fullname, surname) jws
from test
order by id;
ID FULLNAME SURNAME JWS
---------- ------------- ------------- ----------
1 John Doe Doe 48
2 Peter Parker Parker 62
3 Brian Griffin Brian Griffin 100
Feel free to explore the other functions that package offers.
Also, note that I didn't pay attention to possible letter case differences (e.g. "DOE" vs. "Doe"). If you need that as well, compare e.g. upper(surname) to upper(fullname).
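Please use the INSTR function (the first argument is the string to search in, the second is the substring to look for):
SELECT * FROM TABLE
WHERE instr(FullName, Surname) > 0;
SELECT * FROM TABLE
WHERE instr(upper(FullName), upper(Surname)) > 0;
SELECT * FROM TABLE
WHERE upper(FullName) > upper(Surname); -- note: a plain alphabetical comparison, not a containment test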
As far as I know there is nothing out of the box when matching becomes complicated. For the cases shown, however, the following expression would suffice:
where fullname like '%' || surname
Update
The main problem may be false positives:
The last name 'Park' appears in 'Peter Parker'. The query above solves this by looking at the end of the full name.
Another problem may be upper / lower case as mentioned in the other answers (not shown in your sample data).
You want the last name 'PARKER' match 'Peter Parker'.
But when looking at the strings case insensitively, another problem arises:
The last name 'Strong' will suddenly match 'Louis Armstrong'.
A solution for this is to add a blank to make the difference:
where ' ' || upper(fullname) like '% ' || upper(surname)
' LOUIS ARMSTRONG' like '% STRONG' -> false
' LOUIS ARMSTRONG' like '% ARMSTRONG' -> true
' LOUIS ARMSTRONG' like '% LOUIS ARMSTRONG' -> true
Demo: https://dbfiddle.uk/?rdbms=oracle_18&fiddle=0ac5c80061b4aeac1153a8c5976e6e54
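For reference, the blank-prefixed comparison can be tried on a few inline rows like this. The rows are made up for illustration. One assumption to keep in mind: a surname containing % or _ would be interpreted as a LIKE wildcard, which this sketch does not handle.

```sql
with test (id, fullname, surname) as (
  select 1, 'Louis Armstrong', 'Strong'        from dual union all
  select 2, 'Louis Armstrong', 'ARMSTRONG'     from dual union all
  select 3, 'Peter Parker',    'Park'          from dual union all
  select 4, 'Brian Griffin',   'Brian Griffin' from dual
)
select id, fullname, surname
from   test
-- the leading blank forces the surname to start at a word boundary,
-- and the missing trailing % forces it to run to the end of the full name
where  ' ' || upper(fullname) like '% ' || upper(surname)
order by id;
```

Only IDs 2 and 4 should be returned: 'Strong' fails because it does not start a word, and 'Park' fails because the full name does not end with it.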

Filter invalid ids from the oracle table

My table
NAME
Peter
Lance
Oscar
Steve
Reddy
Input to my query is array of string, let's say Peter, Bond, Steve, Smith
My query should return me the invalid values of my input (i.e) Bond & Smith
I am using Oracle 12.1.0 and odcivarchar2list is not supported.
Any suggestions would be highly appreciated
You can use a CTE:
with list_string as (
select 'Peter' as name from dual union all
select 'Bond' as name from dual union all
select 'Steve' as name from dual union all
select 'Smith' as name from dual
)
select ls.name, 'Invalid Values'
from list_string ls
-- my_table stands for your actual table name
where not exists (select 1 from my_table t1 where t1.name = ls.name);
Some more options.
Data you have:
select * from test;
NAME
-----
Peter
Lance
Oscar
Steve
Reddy
If you don't mind enclosing names into single quotes, then this might be an option:
select column_value result
from table(sys.odcivarchar2list('Peter', 'Bond', 'Steve', 'Smith'))
minus
select t.name
from test t;
RESULT
-----------------------------------------------------------------------------
Bond
Smith
If you'd just want to enter those names "normally", comma-separated, then:
with
sample (val) as
(select 'Peter, Bond, Steve, Smith' from dual)
select trim(regexp_substr(s.val, '[^,]+', 1, level)) result
from sample s
connect by level <= regexp_count(s.val, ',') + 1
minus
select t.name
from test t;
RESULT
---------------------------------------------------------------------
Bond
Smith

How to get a row of unique values from prior row followed by nulls till the next value?

I have a query that has multiple joins and fields. I have one row that has a lot of duplicates. I need to get only the distinct values from this specific row while leaving the size of the query the same because of the other joins.
I have tried GROUP BY and DISTINCT, but they eliminate other critical information in the query. I need to leave the query length the same.
example:(pseudocode)
SELECT
Name
,StateID
,Age
,Toy
,ManufactureName
From
peopleTable as people
LEFT JOIN toyTable on people.id = toytable.id
LEFT JOIN ManufactureTable on toyTable.toyId=ManufactureTable.ManId
WHERE
toytable.id >1000
output
Name StateID Age Toy Manufacture
Carlo 1 10 Woody Disney
Sid 1 10 Buzz Disney
Abby 1 10 Car RaceMan
Bobby 4 10 Doll Barbie
Sally 6 10 Book Barns&
Jim 6 10 Woody Disney
ExpectedOutput
Name StateID Age Toy Manufacture NewField
Carlo 1 10 Woody Disney 1
Sid 1 10 Buzz Disney NULL
Abby 1 10 Car RaceMan NULL
Bobby 4 10 Doll Barbie 4
Sally 6 10 Book Barns& 6
Jim 6 10 Woody Disney Null
Would something like this help?
Using ROW_NUMBER analytic function, find out the first row in a group of those that share the same stateid. Note that I used order by null as I don't know which one is the first (name isn't, nor is age or toy or manufacture). If you don't care, leave it as is. If you know how to sort them, use that column.
with test (name, stateid, age, toy, manufacture) as
(select 'Carlo', 1, 10, 'Woody', 'Disney' from dual union all
select 'Sid' , 1, 10, 'Buzz' , 'Disney' from dual union all
select 'Abby' , 1, 10, 'Car' , 'RaceMan' from dual union all
select 'Bobby', 4, 10, 'Doll' , 'Barbie' from dual union all
select 'Sally', 6, 10, 'Book' , 'Barns&' from dual union all
select 'Jim' , 6, 10, 'Woody', 'Disney' from dual
)
select name, stateid, age, toy, manufacture,
case when row_number() over (partition by stateid order by null) = 1 then stateid
else null
end new_field
from test;
NAME STATEID AGE TOY MANUFAC NEW_FIELD
----- ---------- ---------- ----- ------- ----------
Carlo 1 10 Woody Disney 1
Sid 1 10 Buzz Disney
Abby 1 10 Car RaceMan
Bobby 4 10 Doll Barbie 4
Sally 6 10 Book Barns& 6
Jim 6 10 Woody Disney
6 rows selected.
SQL>
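If you do know how the rows within a state should be sorted, the same idea works with a real ordering column. As a sketch, assuming (purely for illustration) that the alphabetically first name in each state should carry the value:

```sql
with test (name, stateid) as (
  select 'Carlo', 1 from dual union all
  select 'Sid',   1 from dual union all
  select 'Abby',  1 from dual union all
  select 'Bobby', 4 from dual
)
select   name, stateid,
         -- only the first row per stateid (by name) keeps the value; the rest get null
         case when row_number() over (partition by stateid order by name) = 1
              then stateid
         end as new_field
from     test
order by stateid, name;
```

Here Abby (not Carlo) gets NEW_FIELD = 1 for state 1, because the ordering is now deterministic instead of being left to the database.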

Extracting numbers from a string without the following characters in SQL

So I have street addresses like the following:
123 Street Ave
1234 Road St Apt B
12345 Passage Way
Now, I'm having a hard time extracting just the street numbers without any of the street names.
I just want:
123
1234
12345
The way you put it, two simple options return the desired result. One uses regular expressions (selects the first number in a string), while another one returns the first substring (which is delimited by a space).
with test (address) as
(select '123 Street Ave' from dual union all
select '1234 Road St Apt B' from dual union all
select '12345 Passage Way' from dual
)
select
address,
regexp_substr(address, '^\d+') result_1,
substr(address, 1, instr(address, ' ') - 1) result_2
from test;
ADDRESS RESULT_1 RESULT_2
------------------ ------------------ ------------------
123 Street Ave 123 123
1234 Road St Apt B 1234 1234
12345 Passage Way 12345 12345
SQL>
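The two options only agree as long as every address starts with a number followed by a space. A small sketch with made-up addresses shows where they differ:

```sql
with test (address) as (
  select '123 Street Ave' from dual union all
  select 'PO Box 77'      from dual union all  -- does not start with a digit
  select '12345'          from dual            -- contains no space at all
)
select address,
       regexp_substr(address, '^\d+')              as result_1,  -- leading digits only
       substr(address, 1, instr(address, ' ') - 1) as result_2   -- everything before the first space
from   test;
```

For 'PO Box 77' the regular expression returns null while the SUBSTR version returns 'PO'; for '12345' the regular expression returns 12345 while the SUBSTR version returns null, because INSTR finds no space. Which behaviour is preferable depends on the data.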

Oracle find common value in two different columns

If I have a structure like this:
CREATE TABLE things (
id,
personA varchar2,
personB varchar2,
attribute ...,
)
And I want to find, for a given attribute, if I have at least 1 common person for all my things, how would I go about it?
So if my data is (and it could be more than 2 per attribute):
1, John, Steve, Apple
2, Steve, Larry, Apple
3, Paul, Larry, Orange
4, Paul, Larry, Orange
5, Chris, Michael, Tomato
6, Steve, Larry, Tomato
For Apple, Steve is my common person, For Orange both Paul and Larry are, and for Tomato I have no common people. I don't need a query that returns all of these at once, however. I have one of these attributes and want 0, 1, or 2 rows depending on what kind of commonality I have. I've been trying to come up with something but can't quite figure out.
This will give you your common person / attribute list. I ran it against your sample data and got the expected result. Hope it's at least pointing in the right direction :)
WITH NormNames AS (
SELECT PersonA AS Person, Attribute FROM things
UNION ALL SELECT PersonB AS Person, Attribute FROM things
)
SELECT
Person, Attribute, COUNT(*)
FROM NormNames
GROUP BY Person, Attribute
HAVING COUNT(*) >= 2
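Note that COUNT(*) >= 2 finds people who appear at least twice for an attribute, which equals "common to all things" only when an attribute has exactly two rows. Since the question says there could be more than two, a possible variation is to compare the number of things a person appears in with the total number of things for that attribute. This is a sketch on made-up data; the CTE and column names norm_names, counted and things_in_attribute are mine, not from the question.

```sql
with things (id, persona, personb, attribute) as (
  -- three Apple rows: nobody appears in all of them
  select 1, 'John',  'Steve', 'Apple' from dual union all
  select 2, 'Steve', 'Larry', 'Apple' from dual union all
  select 3, 'Larry', 'Paul',  'Apple' from dual
),
norm_names as (
  select id, persona as person, attribute from things
  union all
  select id, personb as person, attribute from things
),
counted as (
  select person, attribute, id,
         -- how many distinct things exist for this attribute
         count(distinct id) over (partition by attribute) as things_in_attribute
  from   norm_names
)
select   person, attribute
from     counted
group by person, attribute, things_in_attribute
-- keep only people who appear in every thing of the attribute
having   count(distinct id) = things_in_attribute;
```

On this data the query should return no rows, while the COUNT(*) >= 2 version returns Steve and Larry.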
If you're on 11gR2 you could also use the unpivot operator to avoid the self-join:
select person, attribute
from (
select *
from things
unpivot (person for which_person in (persona as 'A', personb as 'B'))
)
group by person, attribute
having count(*) > 1;
PERSON ATTRIBUTE
---------- ----------
Steve Apple
Paul Orange
Larry Orange
3 rows selected.
Or to get just the people who match the attribute, which I think is what the end of your question is looking for:
select person
from (
select *
from things
unpivot (person for which_person in (persona as 'A', personb as 'B'))
)
where attribute = 'Apple'
group by person, attribute
having count(*) > 1;
PERSON
----------
Steve
1 row selected.
The unpivot translates columns into rows. Run on its own it transforms your original six rows into twelve, replacing the original persona/personb columns with a single person and an additional column indicating which column the new row was formed from, which we don't really care about here:
select *
from things
unpivot (person for which_person in (persona as 'A', personb as 'B'));
ID ATTRIBUTE W PERSON
---------- ---------- - ----------
1 Apple A John
1 Apple B Steve
2 Apple A Steve
2 Apple B Larry
3 Orange A Paul
3 Orange B Larry
4 Orange A Paul
4 Orange B Larry
5 Tomato A Chris
5 Tomato B Michael
6 Tomato A Steve
6 Tomato B Larry
12 rows selected.
The outer query is then doing a simple group.
Here's one method.
It implements an unpivot method by cross-joining to a list of numbers (you could use the unpivot method Alex uses) and then joins the result set, hopefully with a hash join for added goodness.
with
row_generator as (
select 1 counter from dual union all
select 2 counter from dual),
data_generator as (
select
attribute,
id ,
case counter
when 1 then persona
when 2 then personb
end person
from
things,
row_generator)
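select distinct -- each matching pair is found in both directions, so remove the duplicates
t1.attribute,
t1.person
from
data_generator t1, -- the unpivoted rows, not the two-row number list
data_generator t2
where
t1.attribute = t2.attribute and
t1.person = t2.person and
t1.id != t2.id;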