How to find partial duplicate values in Oracle - sql

I have a table in which one of the columns holds thousands of records, most of which are duplicates. Finding exact duplicates is easy, but in this situation they are partial duplicates, e.g.:
ID  NAME                Status
1   abc Capital         Approved
2   (abc Capital)       Terminated
3   abc capital (dupe)  Null
4   abc capitalx        Null
5   BT Capital          Approved
6   XE Capital          Approved
7   xyz Finance         Approved
8   xyz Finance X       Null
9   xyz finance dupe    Null
So, from the above data, I want to retrieve the names that are partial duplicates of one another, e.g.:
Output:
1   abc Capital         Approved
2   (abc Capital)       Terminated
3   abc capital (dupe)  Null
4   abc capitalx        Null
5   xyz Finance         Approved
6   xyz Finance X       Null
7   xyz finance dupe    Null

One option is to use the UTL_MATCH package and one of its functions. I chose edit_distance_similarity for this demonstration, but you might pick another.
SQL> with test (id, name) as
       (select 1, 'abc Capital'        from dual union all
        select 2, '(abc Capital)'      from dual union all
        select 3, 'abc capital (dupe)' from dual union all
        select 4, 'abc capitalx'       from dual union all
        select 5, 'BT Capital'         from dual union all
        select 6, 'XE Capital'         from dual union all
        select 7, 'xyz Finance'        from dual union all
        select 8, 'xyz Finance X'      from dual union all
        select 9, 'xyz finance dupe'   from dual
       ),
     simil as
       (select a.id, a.name aname, b.name bname,
               utl_match.edit_distance_similarity(a.name, b.name) simil
        from test a cross join test b
       )
     select distinct id, aname as name
     from simil
     where aname <> bname
       and simil > 80
     order by id;

        ID NAME
---------- ------------------
         1 abc Capital
         2 (abc Capital)
         4 abc capitalx
         7 xyz Finance
         8 xyz Finance X
SQL>
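Before settling on a threshold, it can help to look at the scores themselves. The following is a minimal sketch, not part of the original answer: it assumes the same test CTE as above (shortened here to a few rows), lower-cases the names and strips parentheses before comparing, and lists both the edit-distance and the Jaro-Winkler similarity for each pair. The norm CTE and the nname, ed_simil and jw_simil aliases are names made up for illustration.

```sql
with test (id, name) as
  (select 1, 'abc Capital'        from dual union all
   select 2, '(abc Capital)'      from dual union all
   select 3, 'abc capital (dupe)' from dual union all
   select 5, 'BT Capital'         from dual
  ),
-- hypothetical normalization step: lower-case, drop '(' and ')', trim blanks
norm as
  (select id, name,
          trim(lower(translate(name, 'x()', 'x'))) nname
   from test
  )
select a.id, b.id as other_id, a.name, b.name as other_name,
       utl_match.edit_distance_similarity(a.nname, b.nname) ed_simil,
       utl_match.jaro_winkler_similarity(a.nname, b.nname)  jw_simil
from norm a join norm b on a.id < b.id   -- each pair once, never a row with itself
order by a.id, b.id;
```

Joining on a.id < b.id rather than filtering on aname <> bname also keeps pairs whose names are exactly identical, which the name comparison would drop. Which function and which threshold work best depends on the data, so both are something to tune rather than fixed values.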

How to create a unique pairwise list of a column value based on table entry and its referenced entry in Oracle

I'm trying to extract the following information from the Oracle table below: a list of all the unique pairwise status combinations for entries and their referenced entries. Entries with no referenced entry are ignored. For example, for entry 10 I expect the output to be (1,3), because its status is 1 and the status of the referenced entry 7 is 3. If the list doesn't already contain this combination, it should be added. Can anyone point me in the right direction? I have no idea how to even google what I want to achieve.
EDIT: The first column is the ID of the entry, the second column is the status of the entry, and the third column is the ID of another entry in the same table that is referenced.
Looks like a self join:
Sample data:
SQL> with test (id, status, ref_id) as
  2    (select  1, 0, null from dual union all
  3     select  2, 1, 3    from dual union all
  4     select  3, 3, null from dual union all
  5     select  4, 6, 6    from dual union all
  6     select  5, 0, 1    from dual union all
  7     select  6, 4, null from dual union all
  8     select  7, 3, null from dual union all
  9     select  8, 5, 9    from dual union all
 10     select  9, 2, null from dual union all
 11     select 10, 1, 7    from dual
 12    )
Query:
 13  select a.id, a.status, b.status
 14  from test a join test b on b.id = a.ref_id
 15  where a.ref_id is not null
 16  order by a.id;

        ID     STATUS     STATUS
---------- ---------- ----------
         2          1          3
         4          6          4
         5          0          0
         8          5          2
        10          1          3
SQL>
If you want distinct pairs but still want to know the IDs involved, you could use listagg (it works as long as the resulting string doesn't exceed 4000 characters; if it does, use xmlagg instead):
 13  select listagg(a.id, ', ') within group (order by a.id) id,
 14         a.status, b.status
 15  from test a join test b on b.id = a.ref_id
 16  where a.ref_id is not null
 17  group by a.status, b.status
 18  order by id;

ID                       STATUS     STATUS
-------------------- ---------- ----------
2, 10                         1          3
4                             6          4
5                             0          0
8                             5          2
SQL>
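If the aggregated ID list can grow past the listagg limit, one commonly used xmlagg variant looks like the sketch below. It assumes the same test CTE as in the sample data (shortened here to four rows); the element name e and the aliases id_list and ref_status are arbitrary, and the result is a CLOB rather than a VARCHAR2.

```sql
with test (id, status, ref_id) as
  (select  2, 1, 3    from dual union all
   select  3, 3, null from dual union all
   select  7, 3, null from dual union all
   select 10, 1, 7    from dual
  )
select rtrim(xmlagg(xmlelement(e, a.id || ', ') order by a.id).extract('//text()').getclobval(), ', ') id_list,
       a.status,
       b.status ref_status
from test a join test b on b.id = a.ref_id   -- the inner join already drops rows without a reference
group by a.status, b.status;
```

Each ID is wrapped in an XML element, the elements are concatenated in ID order, the text is extracted again, and the trailing separator is trimmed off. Values containing characters such as & or < would come back XML-escaped, which is not an issue for numeric IDs.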
If you don't care about IDs, then
 13  select distinct a.status, b.status
 14  from test a join test b on b.id = a.ref_id
 15  where a.ref_id is not null
 16  order by a.status, b.status;

    STATUS     STATUS
---------- ----------
         0          0
         1          3
         5          2
         6          4
SQL>

ORA-00904: "VW_D"."CL_NAME": invalid identifier

I am trying to run this code in an Oracle database, but it raises an error:
ORA-00904: "vw_d"."cl_name": invalid identifier
What's wrong with the query?
SELECT *
FROM vw_doctrans vw_d
WHERE (SELECT COUNT(*)
       FROM (SELECT *
             FROM vw_doctrans vw
             WHERE vw.cl_name = vw_d.cl_name
             GROUP BY vw.country)) > 1
I tried this query in MySQL and it works fine.
To me, it looks like you want the names that appear with two or more distinct countries.
Sample data:
SQL> with vw_doctrans (id, cl_name, country) as
  2    (select  1, 'Bob'     , 'UK'     from dual union all
  3     select  2, 'Bob'     , 'USA'    from dual union all
  4     select  3, 'Bob'     , 'UAE'    from dual union all
  5     select  4, 'Bob'     , 'UAE'    from dual union all
  6     select  5, 'Bob'     , 'UAE'    from dual union all
  7     --
  8     select  6, 'John'    , 'Canada' from dual union all
  9     select  7, 'John'    , 'Canada' from dual union all
 10     --
 11     select  8, 'Caroline', 'India'  from dual union all
 12     select  9, 'Caroline', 'USA'    from dual union all
 13     select 10, 'Caroline', 'USA'    from dual union all
 14     select 11, 'Caroline', 'USA'    from dual
 15    ),
Query begins here:
 16  temp as
 17    (select v.*,
 18            count(distinct country) over (partition by cl_name) cnt
 19     from vw_doctrans v
 20    )
 21  select distinct cl_name, country
 22  from temp
 23  where cnt >= 2
 24  order by cl_name, country;

CL_NAME  COUNTR
-------- ------
Bob      UAE
Bob      UK
Bob      USA
Caroline India
Caroline USA
SQL>
[EDIT: if the table already existed (i.e. without a CTE)]
Table contents:
SQL> select * from vw_doctrans;

        ID CL_NAME  COUNTR
---------- -------- ------
         1 Bob      UK
         2 Bob      USA
         3 Bob      UAE
         4 Bob      UAE
         5 Bob      UAE
         6 John     Canada
         7 John     Canada
         8 Caroline India
         9 Caroline USA
        10 Caroline USA
        11 Caroline USA

11 rows selected.
Here's how to use a CTE (that's line #16 in previous query):
SQL> with temp as
  2    (select v.*,
  3            count(distinct country) over (partition by cl_name) cnt
  4     from vw_doctrans v
  5    )
  6  select distinct cl_name, country
  7  from temp
  8  where cnt >= 2
  9  order by cl_name, country;

CL_NAME  COUNTR
-------- ------
Bob      UAE
Bob      UK
Bob      USA
Caroline India
Caroline USA
SQL>
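As for the error itself, a likely cause (the answer above does not state it, so treat this as an assumption) is that Oracle resolves a correlated reference only one level down, while vw_d.cl_name in the original query sits two levels down, inside an inline view. A minimal sketch that keeps the shape of the original query but moves the correlation up one level:

```sql
select *
from vw_doctrans vw_d
where (select count(distinct vw.country)   -- replaces COUNT(*) over a GROUP BY country
       from vw_doctrans vw
       where vw.cl_name = vw_d.cl_name) > 1;
```

This returns every column of every row whose cl_name occurs with more than one distinct country. Unlike counting groups, count(distinct ...) ignores NULL countries.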
You are way overthinking it. This is just:
SELECT vw_d.cl_name, vw_d.country
FROM vw_doctrans vw_d
WHERE EXISTS (
    SELECT *
    FROM vw_doctrans vw
    WHERE vw.cl_name = vw_d.cl_name AND vw.country != vw_d.country
)
GROUP BY vw_d.cl_name, vw_d.country
But your "I need also other columns too: Bob with countries, date, message of text. Not only cl_name and country column" directly conflicts with your "if message came from same country 2-3times show only one time the country." You can't have both; if you want only one row for each name/country combination, you need to decide how to pick which values for the other columns you want.
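One way to make that decision explicit is to rank the rows within each name/country pair and keep the first one. The sketch below assumes an id column to order by (as in the sample data of the other answer); any column or expression that defines "the row to keep", such as a date, could take its place. The cnt and rn aliases are made up for illustration.

```sql
select *
from (select v.*,
             count(distinct v.country) over (partition by v.cl_name) cnt,
             row_number() over (partition by v.cl_name, v.country
                                order by v.id) rn   -- assumed tie-breaker: lowest id wins
      from vw_doctrans v)
where cnt > 1    -- names seen in more than one country
  and rn = 1;    -- one row per name/country pair
```

All other columns then come from the single row that was picked, so they stay consistent with each other.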

Join two tables in SQL with one-to-many mapping and do not join for an Id if any one of its rows does not satisfy the condition

I have to join two tables that have a one-to-many mapping. For a given Id, I have to select its rows only if all of them satisfy the condition; otherwise none of them should be selected.
Example:
Table A
Id  Company_Name
1   ABC
2   DEF
3   GHI
4   JKL
Table B
ID  REGION     BRANCH_NUMBER
1   ASIA       1
1   AMERICA    1
1   AUSTRALIA  2
2   ASIA       1
2   AFRICA     2
3   ASIA       3
3   AMERICA    3
4   ASIA       1
4   ASIA       2
4   ASIA       3
Here I want to join table A and table B only when the company is present in Asia and America only.
Output:
ID  company_name  region   branch_number
3   GHI           Asia     3
3   GHI           America  3
4   JKL           ASIA     1
4   JKL           ASIA     2
4   JKL           ASIA     3
It should not select ID 1, since it is also present in Australia.
It should also not select ID 2, as it is not present in America.
It selects ID 3, as both ASIA and AMERICA are present.
It selects ID 4, as ASIA is present.
Here's one option: the subquery counts, for each ID, the rows whose region is neither Asia nor America. For IDs that have such rows, sumreg is greater than 0, so you omit them from the final result.
Sample data:
SQL> with
  2  a (id, company_name) as
  3    (select 1, 'abc' from dual union all
  4     select 2, 'def' from dual union all
  5     select 3, 'ghi' from dual union all
  6     select 4, 'jkl' from dual
  7    ),
  8  b (id, region, branch) as
  9    (select 1, 'asia'     , 1 from dual union all
 10     select 1, 'america'  , 1 from dual union all
 11     select 1, 'australia', 2 from dual union all
 12     select 2, 'asia'     , 1 from dual union all
 13     select 2, 'africa'   , 2 from dual union all
 14     select 3, 'asia'     , 3 from dual union all
 15     select 3, 'america'  , 3 from dual union all
 16     select 4, 'asia'     , 1 from dual union all
 17     select 4, 'asia'     , 2 from dual union all
 18     select 4, 'asia'     , 3 from dual
 19    )
Query begins here:
 20  select b.id, a.company_name, b.region, b.branch
 21  from a join b on a.id = b.id
 22  join -- the C subquery returns only IDs that are valid only for Asia and America
 23    (select x.id,
 24            sum(case when x.region not in ('asia', 'america') then 1 else 0 end) sumreg
 25     from b x
 26     group by x.id
 27    ) c on c.id = b.id and c.sumreg = 0;

        ID COM REGION        BRANCH
---------- --- --------- ----------
         3 ghi america            3
         3 ghi asia               3
         4 jkl asia               3
         4 jkl asia               2
         4 jkl asia               1
SQL>
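The same filter can also be written with NOT EXISTS. This is only a sketch of an equivalent formulation, assuming tables a and b shaped like the sample data above (with lower-case region names); with mixed-case data, as in the question, wrapping the region in lower() would be needed.

```sql
select b.id, a.company_name, b.region, b.branch
from a join b on a.id = b.id
where not exists (select null
                  from b x
                  where x.id = b.id
                    and x.region not in ('asia', 'america'))   -- any "foreign" region disqualifies the ID
order by b.id, b.region;
```

Whether this or the aggregate version performs better depends on the data and the available indexes; the logic is the same in both.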

SQL query to check uniqueness of a column value

I need a query to check (select) whether every combination of values in columns A and B has a unique value in column C. Please help. Thank you.
This is the query I tried, which doesn't give me the desired result:
select unique CONCAT(A,B),C
from tab1
group by CONCAT(A,B),C
having count(distinct (CONCAT(A,B)))>1;
Sample table:
A     B    C
Tim   123  1
Jill  123  1
Jill  456  2
John  456  1
Jill  456  1
Here rows 3 and 5 have the same values in columns A and B but different values in column C, which is incorrect. I need to select those rows.
Something like this? (Sample data in lines #1 - 7; query begins at line #8):
SQL> with test (a, b, c) as
  2    (select 'Tim' , 123, 1 from dual union all
  3     select 'Jill', 123, 1 from dual union all
  4     select 'Jill', 456, 2 from dual union all
  5     select 'John', 456, 1 from dual union all
  6     select 'Jill', 456, 1 from dual
  7    )
  8  select t.*
  9  from test t
 10  where (t.a, t.b) in (select x.a, x.b
 11                       from test x
 12                       group by x.a, x.b
 13                       having count(distinct x.c) > 1
 14                      );

A             B          C
---- ---------- ----------
Jill        456          1
Jill        456          2
SQL>
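An analytic variant avoids the IN subquery. This is a sketch under the same assumptions as the answer above (the test CTE stands in for the real table tab1, and cnt is a made-up alias); with this sample data it should again pick the two Jill / 456 rows.

```sql
with test (a, b, c) as
  (select 'Tim' , 123, 1 from dual union all
   select 'Jill', 123, 1 from dual union all
   select 'Jill', 456, 2 from dual union all
   select 'John', 456, 1 from dual union all
   select 'Jill', 456, 1 from dual
  )
select a, b, c
from (select t.*,
             count(distinct t.c) over (partition by t.a, t.b) cnt   -- distinct C values per (A, B)
      from test t)
where cnt > 1;
```

Partitioning by a and b separately also sidesteps the ambiguity of CONCAT(A,B) in the original attempt, where different pairs can concatenate to the same string.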

Split words and insert them into a new table, counting the words

I have the following table and data.
I need to:
1- split the sentence in each row into separate rows, one per word
2- count the words in each row, treating words with the same soundex value as the same word
create table a (id number(9), words varchar(500));
insert into a values(1,'UK,LONDON,YEMEN,JOHN,CAIRO,OMAR ALI,EGYPT,Cairo,YEMAN,OMAR AMR ALI,LONDAN');
insert into a values(2,'UK,SUDAI,SUDAIN,AYHAM SHAHER YAFOOZ,ALI YAFOOZ');
insert into a values(3,'MALAYSIA, AHMED ALI,MALYSIAN');
Expected output:
create table temp_words(id number(9),words varchar2(100), count_words number(9));
id  words                count_words
1   UK                   1
1   LONDON               2
1   YEMEN                2
1   CAIRO                2
1   OMAR ALI             2
1   JOHN                 1
2   UK                   1
2   SUDAI                2
2   AYHAM SHAHER YAFOOZ  2
3   MALAYSIA             2
3   AHMED ALI            1
regards
To split the data as you want, you can use "connect by" as a row generator.
SQL> with src as (select id, ',' || words || ',' as words,
                         length(words) - length(translate(words, '.,', '.')) + 1 no_of_words
                  from a)
     select a.id,
            substr(a.words,
                   instr(words, ',', 1, r) + 1,
                   instr(words, ',', 1, r + 1) - instr(words, ',', 1, r) - 1) word,
            a.no_of_words
     from (select level r
           from dual
           connect by level <= (select max(no_of_words) from src)) d
          inner join src a
             on d.r <= a.no_of_words
     where a.no_of_words is not null
     order by a.id, d.r
     /

        ID WORD                 NO_OF_WORDS
---------- -------------------- -----------
         1 UK                            11
         1 LONDON                        11
         1 YEMEN                         11
         1 JOHN                          11
         1 CAIRO                         11
         1 OMAR ALI                      11
         1 EGYPT                         11
         1 Cairo                         11
         1 YEMAN                         11
         1 OMAR AMR ALI                  11
         1 LONDAN                        11
         2 UK                             5
         2 SUDAI                          5
         2 SUDAIN                         5
         2 AYHAM SHAHER YAFOOZ            5
         2 ALI YAFOOZ                     5
         3 MALAYSIA                       3
         3 AHMED ALI                      3
         3 MALYSIAN                       3

19 rows selected.
SQL>
Here is a SQLFiddle demo
select id, words,
       case when i = 0 then
              SUBSTR(words,
                     1,
                     case when INSTR(words, ',', 1, 1) = 0
                          then 100000
                          else INSTR(words, ',', 1, 1) - 1
                     end
                    )
            ELSE
              SUBSTR(words,
                     INSTR(words, ',', 1, i) + 1,
                     case when INSTR(words, ',', 1, i + 1) = 0
                          then 100000
                          else INSTR(words, ',', 1, i + 1) - INSTR(words, ',', 1, i) - 1
                     end
                    )
       END word,
       i + 1 COUNTWORDS
from a,
     (
       select * from
       (
         select 0 i from dual
         union
         select 1 i from dual
         union
         select 2 i from dual
         union
         select 3 i from dual
         union
         select 4 i from dual
         union
         select 5 i from dual
         union
         select 6 i from dual
         union
         select 7 i from dual
         union
         select 8 i from dual
         union
         select 9 i from dual
         union
         select 10 i from dual
         union
         select 11 i from dual
         union
         select 12 i from dual
       )
     )
     table_i
where
  case when i > 0 then INSTR(words, ',', 1, i)
       else 100000 end <> 0
order by id, i
Another approach (using the regexp_count and regexp_substr regular expression functions):
SQL> with Occurence(oc) as (
       select level
       from ( select max(regexp_count(words, '[^,]+')) ml
              from a
            ) t
       connect by level <= t.ml
     )
     select id
          , word
          , count(word) over(partition by id, soundex(word) order by id) as count_words
     from ( select a.id
                 , regexp_substr(words, '[^,]+', 1, o.oc) as word
            from occurence o
            cross join a
          ) s
     where s.word is not null
     order by id
     ;

        ID WORD                 COUNT_WORDS
---------- -------------------- -----------
         1 Cairo                          2
         1 CAIRO                          2
         1 EGYPT                          1
         1 JOHN                           1
         1 LONDAN                         2
         1 LONDON                         2
         1 OMAR ALI                       1
         1 OMAR AMR ALI                   1
         1 UK                             1
         1 YEMEN                          2
         1 YEMAN                          2
         2 ALI YAFOOZ                     1
         2 AYHAM SHAHER YAFOOZ            1
         2 SUDAI                          1
         2 SUDAIN                         1
         2 UK                             1
         3 AHMED ALI                      1
         3 MALAYSIA                       1
         3 MALYSIAN                       1

19 rows selected
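To actually populate the target table from the question, the same split can feed an INSERT. This is a sketch, assuming the tables a and temp_words exactly as created in the question; the word and oc aliases are made up, and trim() is added because some entries carry a leading blank.

```sql
insert into temp_words (id, words, count_words)
select id, word,
       count(*) over (partition by id, soundex(word)) count_words   -- size of the soundex group
from (select a.id,
             trim(regexp_substr(a.words, '[^,]+', 1, o.oc)) word
      from a
      cross join (select level oc
                  from dual
                  connect by level <= (select max(regexp_count(words, '[^,]+'))
                                       from a)) o)
where word is not null;
```

Note that this stores every word with the size of its soundex group, whereas the expected output in the question shows only one row per group; collapsing each group to a single row needs a rule for which spelling to keep.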
You would need to insert your data as separate records. You can keep them as a concatenated string, if you like, but it'll just make your life very difficult. So:
create table words (
  id number,
  w  varchar2(100),
  s  varchar2(4)
);

create or replace trigger words_auto
before insert or update on words
for each row
begin
  -- normalise the word and derive its Soundex code
  select trim(upper(:new.w)), soundex(:new.w)
    into :new.w, :new.s
    from dual;
end;
/

insert into words (id, w) values (1, 'UK');
insert into words (id, w) values (1, 'LONDON');
...
insert into words (id, w) values (3, ' AHMED ALI');
insert into words (id, w) values (3, 'MALYSIAN');
You could write a procedure to split your concatenated string and populate the words table appropriately. Note that I have created a trigger that normalises your input to uppercase, trims leading and trailing whitespace, and automatically produces the Soundex codes.
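Such a procedure could look like the sketch below. The name split_into_words and its parameters are hypothetical, and it assumes a plain comma-separated list with no quoting or escaping; the trigger above takes care of upper-casing, trimming and the Soundex code for every inserted row.

```sql
create or replace procedure split_into_words (p_id   in number,
                                              p_list in varchar2) as
begin
  insert into words (id, w)
  select p_id, word
  from (select regexp_substr(p_list, '[^,]+', 1, level) word   -- level-th token between commas
        from dual
        connect by level <= regexp_count(p_list, '[^,]+'))
  where word is not null;   -- skips the single empty row produced by a NULL or empty list
end;
/

-- example call with one of the strings from the question
begin
  split_into_words(3, 'MALAYSIA, AHMED ALI,MALYSIAN');
end;
/
```

Looping over the rows of the original table a and calling the procedure once per row would migrate the existing data.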
Now here's a question: You want to group words by their Soundex codes; however, how do you determine the baseline? For example, 'LONDON' and 'LONDAN' both have code 'L535', but how do you know which record is the 'main' one?... You can't, without further lookup tables! As such, the best you can do is to group by the Soundex codes. This doesn't have to be stored in a table and it makes more sense to be a view:
create or replace view word_counts as
select id,
       s        soundex,
       count(w) count_rows
from words
group by id,
         s;
Note that I've called the count field count_rows because it counts records rather than distinct words. That is, records of 'LONDON', 'LONDAN' and 'LONDON' will show a count of 3, not 2 (which you might be expecting). Anyway, with your data, the view will look something like this:
id     soundex    count_rows
-----  ---------  -----------
1      U200       1
1      L535       2
...    ...        ...
3      M420       2
3      A534       1
As I say, that's really the best you can expect without further infrastructure.
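If a representative spelling and a distinct-word count are still wanted, a variant of the view could expose both. This is only a sketch: the view name word_groups and the columns sample_word and count_words are made up, and min(w) is an arbitrary choice (the alphabetically first spelling), not a real answer to the question of which record is the main one.

```sql
create or replace view word_groups as
select id,
       s                 soundex,
       min(w)            sample_word,   -- arbitrary representative of the group
       count(w)          count_rows,    -- all records in the group
       count(distinct w) count_words    -- different spellings in the group
from words
group by id, s;
```

With records of 'LONDON', 'LONDAN' and 'LONDON' for the same id, count_rows would be 3 and count_words 2.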