Detect duplicate string or word in a row - sql

I want to know how to detect a duplicated word within a row. This is to ensure that we have clean data in our database.
For example, given:
Name count
James James McCarthy 1
Donald Hughes Hughes 1
I want the result to be:
Name count
James McCarthy 1
Donald Hughes 1
Is there a solution to this using Oracle SQL?

For adjacent words, regexp_like can detect a word that is immediately followed by itself:
select 1
from dual
where regexp_like ('John John Doe','(^|\s)(\S+)\s+\2(\s|$)')
;
or, to flag the row instead of filtering on it:
select case when regexp_like ('John John Doe','(^|\s)(\S+)\s+\2(\s|$)') then 'Y' end as adj_duplicate
from dual
;
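The queries above only detect an adjacent duplicate; the question also asks for the cleaned value. As a rough sketch of that second step outside the database, the Python below reuses the same pattern for detection and collapses repeated neighbours. The helper names has_adjacent_duplicate and collapse_adjacent_duplicates are hypothetical, the comparison is case-sensitive, and Python's re module is assumed to treat this pattern the same way Oracle does.

```python
import re
from itertools import groupby

# Same pattern as the regexp_like call: a whitespace-delimited word
# that is immediately followed by an identical word.
ADJACENT_DUPLICATE = re.compile(r"(^|\s)(\S+)\s+\2(\s|$)")


def has_adjacent_duplicate(name: str) -> bool:
    """Return True when some word is directly followed by itself."""
    return ADJACENT_DUPLICATE.search(name) is not None


def collapse_adjacent_duplicates(name: str) -> str:
    """Keep one copy of each run of identical neighbouring words."""
    words = name.split()
    # groupby groups consecutive equal items, so each key is one run.
    return " ".join(word for word, _run in groupby(words))


print(collapse_adjacent_duplicates("James James McCarthy"))
print(collapse_adjacent_duplicates("Donald Hughes Hughes"))
```

Inside Oracle itself the analogous tool would presumably be REGEXP_REPLACE with a backreference, which is not shown here.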

Related

proc sql function to find multiple LIKE matches?

I'm having trouble with a LIKE function in proc sql.
PROC SQL;
CREATE TABLE NAMES_IDS AS
SELECT DISTINCT
T1.*
,T2.NAMES
,T2.NAME_ID
FROM WORK.table1 T1
LEFT JOIN data.table2 T2 ON T2.NAMES like T1.NAMES1
;QUIT;
I have several names in T2, for example John 1, John 2, John 3, John 4, etc., and T1.NAMES1 contains %John%.
proc sql pulls in only the first match, John 1 and its associated ID, and applies it to all the data in T1, instead of producing one row for each matching name (which is what I want to achieve).
So the end table would have something like
COLUMN A COLUMN B
John John 1
John John 2
John John 3
John John 4
But instead, what I get is:
COLUMN A COLUMN B
John John 1
John John 1
John John 1
John John 1
Hopefully this makes some sort of sense...
I think I figured it out: I added TRIM to my code, and I guess there were some stray spaces somewhere, because that seems to fix my issue. Thanks for your responses!
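To illustrate why stray spaces can break a LIKE join, here is a small sketch using Python's built-in sqlite3 module rather than SAS, so it only approximates the original environment. The table and column names follow the question, the three trailing blanks on the pattern are an assumption about what the data looked like, and the helper matches is hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE table1 (names1 TEXT);
    CREATE TABLE table2 (names TEXT, name_id INTEGER);
    -- Assumed data problem: the pattern carries trailing blanks.
    INSERT INTO table1 VALUES ('%John%   ');
    INSERT INTO table2 VALUES
        ('John 1', 1), ('John 2', 2), ('John 3', 3), ('John 4', 4);
""")


def matches(condition):
    """Run the LEFT JOIN with the given ON condition and return all rows."""
    sql = f"""
        SELECT t2.names, t2.name_id
        FROM table1 t1
        LEFT JOIN table2 t2 ON {condition}
        ORDER BY t2.name_id
    """
    return con.execute(sql).fetchall()


# The trailing blanks are part of the pattern, so nothing matches and the
# LEFT JOIN returns a single row of NULLs.
untrimmed = matches("t2.names LIKE t1.names1")

# Trimming the pattern lets every John row match, one output row each.
trimmed = matches("t2.names LIKE TRIM(t1.names1)")

print(untrimmed)
print(trimmed)
```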

Comparing string values within a table

Is there any way to compare two string columns with each other and get the matches?
I have two columns containing names: one with the full name, the other with (mostly) just the surname.
I tried soundex, but it only returns rows where the values in both columns are nearly identical.
SELECT * FROM TABLE
WHERE soundex(FullName) = soundex(Surname)
1 John Doe Doe
2 Peter Parker Parker
3 Brian Griffin Brian Griffin
With soundex it only matches the third line.
A simple option is to use instr, which shows whether surname exists in fullname:
with test (id, fullname, surname) as
  (select 1, 'John Doe'     , 'Doe'           from dual union all
   select 2, 'Peter Parker' , 'Parker'        from dual union all
   select 3, 'Brian Griffin', 'Brian Griffin' from dual
  )
select *
from test
where instr(fullname, surname) > 0;

        ID FULLNAME      SURNAME
---------- ------------- -------------
         1 John Doe      Doe
         2 Peter Parker  Parker
         3 Brian Griffin Brian Griffin
Another option is to use one of UTL_MATCH functions, e.g. Jaro-Winkler similarity which shows how well those strings match:
with test (id, fullname, surname) as
  (select 1, 'John Doe'     , 'Doe'           from dual union all
   select 2, 'Peter Parker' , 'Parker'        from dual union all
   select 3, 'Brian Griffin', 'Brian Griffin' from dual
  )
select id, fullname, surname,
       utl_match.jaro_winkler_similarity(fullname, surname) jws
from test
order by id;

        ID FULLNAME      SURNAME              JWS
---------- ------------- ------------- ----------
         1 John Doe      Doe                   48
         2 Peter Parker  Parker                62
         3 Brian Griffin Brian Griffin        100
Feel free to explore the other functions that package offers.
Also, note that I didn't pay attention to possible letter case differences (e.g. "DOE" vs. "Doe"). If you need that as well, compare e.g. upper(surname) to upper(fullname).
Use the instr function. The string to search comes first and the substring to look for second:
SELECT * FROM TABLE
WHERE instr(FullName, Surname) > 0;
SELECT * FROM TABLE
WHERE instr(upper(FullName), upper(Surname)) > 0;
SELECT * FROM TABLE
WHERE upper(FullName) LIKE '%' || upper(Surname) || '%';
As far as I know there is nothing out of the box when matching becomes complicated. For the cases shown, however, the following expression would suffice:
where fullname like '%' || surname
Update
The main problem may be false positives:
The last name 'Park' appears in 'Peter Parker'. The query above solves this by looking only at the end of the full name.
Another problem may be upper / lower case as mentioned in the other answers (not shown in your sample data).
You want the last name 'PARKER' to match 'Peter Parker'.
But when looking at the strings case insensitively, another problem arises:
The last name 'Strong' will suddenly match 'Louis Armstrong'.
A solution for this is to add a blank to make the difference:
where ' ' || upper(fullname) like '% ' || upper(surname)
' LOUIS ARMSTRONG' like '% STRONG' -> false
' LOUIS ARMSTRONG' like '% ARMSTRONG' -> true
' LOUIS ARMSTRONG' like '% LOUIS ARMSTRONG' -> true
Demo: https://dbfiddle.uk/?rdbms=oracle_18&fiddle=0ac5c80061b4aeac1153a8c5976e6e54
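The leading-blank trick can be checked outside the database as well. The sketch below mirrors the final predicate in Python; surname_matches is a hypothetical helper, and str.endswith stands in for LIKE '% ...', so wildcard characters inside a surname are treated literally here.

```python
def surname_matches(fullname: str, surname: str) -> bool:
    """Case-insensitive test that surname is a whole-word suffix of fullname.

    Prefixing both sides with a blank means the match must start at a word
    boundary, so 'Strong' no longer matches 'Louis Armstrong'.
    """
    return (" " + fullname.upper()).endswith(" " + surname.upper())


print(surname_matches("Louis Armstrong", "Strong"))
print(surname_matches("Louis Armstrong", "Armstrong"))
print(surname_matches("Peter Parker", "PARKER"))
```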

PostgreSQL: how to select unique values and count them?

I have a column in PostgreSQL:
Names
Mike
Alex
Mike
Bill, Abigail
Abigail
Bill
Kurt, Adele, John
Mike
John
A comma is the delimiter when the field holds two or more values.
How can I select this as the result?
Abigail 2
Adele 1
Alex 1
Bill 2
John 2
Kurt 1
Mike 3
I read about distinct and join, but I can't work out the query.
You need to split the values and then count:
select u.name, count(*)
from t cross join lateral
unnest(string_to_array(names, ', ')) u(name)
group by u.name;
Here is a db<>fiddle.
Then you should fix your data model. Do not store multiple values in a string column. Postgres supports arrays, which is one option. Another is a proper junction/association table.
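The split-then-count idea is independent of the database. As a rough illustration using the sample data from the question, the Python sketch below does the same two steps; count_names is a hypothetical helper, and it splits on a bare comma and strips whitespace, which is slightly more lenient than splitting on the exact string ', '.

```python
from collections import Counter

rows = [
    "Mike",
    "Alex",
    "Mike",
    "Bill, Abigail",
    "Abigail",
    "Bill",
    "Kurt, Adele, John",
    "Mike",
    "John",
]


def count_names(rows):
    """Split each comma-delimited field into names, then count each name."""
    counts = Counter()
    for row in rows:
        counts.update(name.strip() for name in row.split(","))
    # Sort by name to match the ordering shown in the question.
    return dict(sorted(counts.items()))


for name, total in count_names(rows).items():
    print(name, total)
```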

Sorting Middle string

I have the following values in a column
Name
Smith
Marry
Tom
Robert
Albert
I have to display them in the following order. I want Marry on top. For the rest of the values the ordering does not matter.
Name
Marry
Smith
Tom
Robert
Albert
How can I achieve this?
You can use case in order by:
order by (case when name = 'Marry' then 1 else 2 end)
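The same "rank one value first" idea can be sketched in Python with a sort key, using the names from the question. One difference worth noting: Python's sorted is stable, so the remaining names keep their original order, whereas the SQL query makes no promise about the order of the rows that tie.

```python
names = ["Smith", "Marry", "Tom", "Robert", "Albert"]

# Mirror the CASE expression: 'Marry' gets rank 1, everything else rank 2.
ordered = sorted(names, key=lambda name: 1 if name == "Marry" else 2)

print(ordered)
```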

pl/sql query to remove duplicates and replace the data

I have the following table:
data_id  new_data_id  first_name  last_name
1                     john        smith
2                     john        smith
3                     john        smith
4                     jeff        louis
5                     jeff        louis
6                     jeff        louis
The above table has duplicate first and last names, each with a different data_id. To consolidate these duplicates, I need a SQL query that fills the new_data_id column with the highest data_id among the rows that share the same name. My output would look something like this:
data_id  new_data_id  first_name  last_name
1        3            john        smith
2        3            john        smith
3        3            john        smith
4        6            jeff        louis
5        6            jeff        louis
6        6            jeff        louis
How would I do this?
What you're looking for is an Oracle analytic function.
The aggregate function MAX can be used to select the highest data_id from your entire resultset, but that's not exactly what you need. Instead, use its alter ego, the MAX analytic function like so:
SELECT
data_id,
MAX(data_id) OVER (PARTITION BY first_name, last_name) AS new_data_id,
first_name,
last_name
FROM employees
ORDER BY data_id
This works by "partitioning" your resultset by first_name and last_name, and then it performs the given function within that subset.
Good luck!
Here's a fiddle: http://sqlfiddle.com/#!4/48b29/4
More info can be found here:
http://docs.oracle.com/cd/E11882_01/server.112/e41084/functions004.htm#SQLRF06174
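To make the partitioning idea concrete, here is a minimal Python sketch of what MAX(...) OVER (PARTITION BY ...) computes for the sample rows: every row keeps its own data_id and additionally receives the maximum of its (first_name, last_name) group. The function name with_group_max is hypothetical, and this illustrates only the semantics, not how Oracle evaluates the query.

```python
rows = [
    (1, "john", "smith"),
    (2, "john", "smith"),
    (3, "john", "smith"),
    (4, "jeff", "louis"),
    (5, "jeff", "louis"),
    (6, "jeff", "louis"),
]


def with_group_max(rows):
    """Attach the highest data_id of each (first_name, last_name) group."""
    highest = {}
    # First pass: find the maximum data_id per partition key.
    for data_id, first_name, last_name in rows:
        key = (first_name, last_name)
        highest[key] = max(data_id, highest.get(key, data_id))
    # Second pass: every row is kept, unlike with a GROUP BY aggregate.
    return [
        (data_id, highest[(first_name, last_name)], first_name, last_name)
        for data_id, first_name, last_name in rows
    ]


for row in with_group_max(rows):
    print(row)
```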
If you need a change in place, a correlated update is probably the simplest way to write that:
UPDATE T
SET "new_data_id" =
(SELECT MAX("data_id") FROM T T2
WHERE T2."first_name" = T."first_name"
AND T2."last_name" = T."last_name")
See http://sqlfiddle.com/#!4/51a69/1
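For anyone who wants to try the correlated update without an Oracle instance, the sketch below runs an equivalent statement in SQLite through Python's built-in sqlite3 module. The table name t and the unquoted lowercase column names are assumptions made for this example; Oracle's quoting rules differ, so treat it as an approximation of the statement above rather than a drop-in copy.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE t (
        data_id     INTEGER,
        new_data_id INTEGER,
        first_name  TEXT,
        last_name   TEXT
    );
    INSERT INTO t (data_id, first_name, last_name) VALUES
        (1, 'john', 'smith'), (2, 'john', 'smith'), (3, 'john', 'smith'),
        (4, 'jeff', 'louis'), (5, 'jeff', 'louis'), (6, 'jeff', 'louis');

    -- For each row, look up the highest data_id among rows with the same name.
    UPDATE t
    SET new_data_id = (
        SELECT MAX(t2.data_id)
        FROM t AS t2
        WHERE t2.first_name = t.first_name
          AND t2.last_name  = t.last_name
    );
""")

updated = con.execute(
    "SELECT data_id, new_data_id FROM t ORDER BY data_id"
).fetchall()
print(updated)
```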