SQL - Find duplicate fields and count how many fields are matched

I have a large customer database where customers have been added multiple times in some circumstances, which is causing problems. I am able to use a query to identify the records which are an exact match, although some records have slight variations such as different addresses or given names.
I want to query across 10 fields. Some records will match on all 10, which is clearly a duplicate, while others may match on only 5 fields with another record and require further investigation. Therefore I want to create a result set which has a field with a count of how many fields have been matched, basically to create a rating of the likelihood that the result is an actual match. All 10 would be a clear dup, but 5 would only be a possible duplicate.
Some will only match on POSTCODE and FIRSTNAME, which can generally be discounted.
Something like this helps, but as it only returns records which explicitly match on all 3 fields it's not really useful due to the sheer amount of data.
SELECT field1,field2,field3, count(*)
FROM table_name
GROUP BY field1,field2,field3
HAVING count(*) > 1

You are just missing the magic of CUBE(), which generates all the combinations of columns automatically
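-- Minimum number of matching columns for a group to count as a likely duplicate
DECLARE @duplicate_column_threshold int = 5;
WITH cte AS (
SELECT
field1,field2,...,field10
-- Columns rolled up by CUBE come back as NULL, so COUNT(col) counts the columns that took part in the match
,duplicate_column_count = (SELECT COUNT(col) FROM (VALUES (field1),(field2),...,(field10)) c(col))
FROM table_name
GROUP BY CUBE(field1,field2,...,field10)
HAVING COUNT(*) > 1
)
SELECT *
INTO #duplicated_rows
FROM cte
WHERE duplicate_column_count >= @duplicate_column_threshold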
Update: to fetch the rows from the original table, join it against the #duplicated_rows using a technique that treats NULLs as wildcards when comparing the columns.
SELECT
a.*
,b.duplicate_column_count
FROM table_name a
INNER JOIN #duplicated_rows b
ON NULLIF(b.field1,a.field1) IS NULL
AND NULLIF(b.field2,a.field2) IS NULL
...
AND NULLIF(b.field10,a.field10) IS NULL

You might try something like
Select field1, field2, field3, ... , field10, count(1)
from customerdatabase
group by field1, field2, field3, ... , field10
order by field1, field2, field3, ... , field10
Where field1 through field10 are ordered by the "most identifiable/important" to least.

This is as close as I've got to what I'm trying to achieve. It returns all records which have any duplicate fields. I want to add a column to the results which indicates how many fields have matched any other record in the table. There are around 40,000 records in total.
select * from [CUST].[dbo].[REPORTA] as a
where exists
(select [GIVEN.NAMES],[FAMILY.NAME],[DATE.OF.BIRTH],[POST.CODE],[STREET],[TOWN.COUNTRY]
from [CUST].[dbo].[REPORTA] as b
where a.[GIVEN.NAMES] = b.[GIVEN.NAMES]
or a.[FAMILY.NAME] = b.[FAMILY.NAME]
or a.[DATE.OF.BIRTH] = b.[DATE.OF.BIRTH]
or a.[POST.CODE] = b.[POST.CODE]
or a.[STREET] = b.[STREET]
or a.[TOWN.COUNTRY] = b.[TOWN.COUNTRY]
group by [GIVEN.NAMES],[FAMILY.NAME],[DATE.OF.BIRTH],[POST.CODE],[STREET],[TOWN.COUNTRY]
having count(*) >= 1)
This query will return thousands of records, but I'm generally interested in the records with a high count of exactly matching fields.
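One possible way to get the "how many fields matched" column asked for above is a self-join that adds up one equality test per field for every pair of rows. This is a minimal sketch, not taken from any of the answers: it uses Python's built-in sqlite3 module as a stand-in for SQL Server, and the `customers` table, its simplified column names, the sample rows and the threshold of 3 are all assumptions made for illustration.

```python
import sqlite3

# Hypothetical sample data; the column names are simplified stand-ins
# for the bracketed names used in the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, given_names TEXT, "
    "family_name TEXT, date_of_birth TEXT, post_code TEXT, street TEXT, town TEXT)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (1, "John", "Smith", "1980-01-01", "AB1 2CD", "1 High St", "Leeds"),
        (2, "John", "Smith", "1980-01-01", "AB1 2CD", "1 High St", "Leeds"),
        (3, "Jon", "Smith", "1980-01-01", "AB1 2CD", "1 High Street", "Leeds"),
        (4, "John", "Jones", "1975-05-05", "AB1 2CD", "9 Low Rd", "York"),
    ],
)

FIELDS = ["given_names", "family_name", "date_of_birth", "post_code", "street", "town"]

# In SQLite each comparison evaluates to 1 (equal) or 0 (different);
# COALESCE turns a NULL comparison into 0 so that NULLs never count as a match.
score = " + ".join(f"COALESCE(a.{f} = b.{f}, 0)" for f in FIELDS)

query = f"""
    SELECT a.id, b.id, {score} AS matched_fields
    FROM customers a
    JOIN customers b ON a.id < b.id      -- each pair of rows once, no self-pairs
    WHERE {score} >= ?                   -- keep only pairs at or above the threshold
    ORDER BY matched_fields DESC, a.id, b.id
"""
pairs = conn.execute(query, (3,)).fetchall()
for pair in pairs:
    print(pair)
```

In SQL Server the same sum would need `CASE WHEN a.f = b.f THEN 1 ELSE 0 END` per field, since a comparison cannot be added up directly there. Comparing every pair is quadratic in the number of rows, so on a table of around 40,000 records it may be worth restricting the join to pairs that already share at least one field.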

Related

Find full matches in one field when related to a second field

I'd be grateful for some help with a problem that I hope to summarise reasonably with the two tables below:
Table1 contains the primary raw data where FieldA has a relationship with specific items in FieldB.
The items in FieldB are unique with respect to each unique item in FieldA - that is, cat, dog, rabbit, chicken will only ever appear once under the "a" group in FieldA (they can appear elsewhere in the field). Similarly for the b,c and d items in FieldA (all FieldB items only appear once against each).
Table2 lists the total count of each unique item in Table1, FieldB and is generated by the following query:
qryCount:
select FieldB, count(FieldB) AS FCount
from Table1
GROUP BY FieldB;
My problem:
The user enters unique values from FieldA in Table1; the query should then return all unique values in FieldB (Table1) where a full match is achieved with respect to the respective FCount total in Table2.
e.g.
If the user enters "a,b,d" the query outputs "cat, dog, rabbit, ferret" since the total count for cat(3), dog(2), rabbit(1) and ferret(1) are met.
If the user enters "a,c" the query outputs "chicken,rabbit" since the total count is met for chicken(2) and rabbit(1).
If the user enters "b" the query returns nothing since the respective FieldB items are also present elsewhere.
I do have this problem solved using VBA in Excel (building a hit table and seeing if the respective total counts for user-entered values are met), but though I do have some experience in using Access SQL (2007), I'm struggling to convert this idea from VBA. I'd be grateful for some help.
This query should give you the results you want. It uses a subquery to generate an effective copy of Table2 but only for the desired values of FieldA. This is then joined to Table2, giving only rows where the values of FCount match:
SELECT t1.FieldB
FROM (SELECT FieldB, COUNT(FieldB) AS FCount
FROM Table1
WHERE FieldA IN ('a', 'b', 'd')
GROUP BY FieldB) t1
INNER JOIN Table2 t2 ON t2.FieldB = t1.FieldB AND t2.FCount = t1.FCount
Output:
FieldB
cat
dog
ferret
rabbit
Unless I've misunderstood the logic, I would suggest the following:
select distinct t1.fieldb from table1 t1
where
t1.fielda in ('a', 'b', 'd') and
not exists
(
select 1 from table1 t2
where t2.fieldb = t1.fieldb and t2.fielda not in ('a', 'b', 'd')
)
A few notes on the above:
The query is essentially selecting records for which the value held by FieldB only appears in the targeted FieldA groups (in this case a,b,d) and in no other groups.
Only table1 is referenced by the query, as no aggregation or counting is used.
The use of select 1 is purely an optimisation, since we don't care what the correlated subquery returns, but only that one or more records exist - as such, it can return the minimum amount of information necessary to verify this.
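To compare the two answers side by side, here is a small sketch that runs both queries through Python's sqlite3 module rather than Access. The rows in Table1 are an assumption: they are one data set that is consistent with the three examples in the question, not the poster's actual table. Table2 is replaced by an inline copy of qryCount, and the function names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Table1 (FieldA TEXT, FieldB TEXT)")
# Assumed sample data, chosen to agree with the question's examples.
conn.executemany(
    "INSERT INTO Table1 VALUES (?, ?)",
    [
        ("a", "cat"), ("a", "dog"), ("a", "rabbit"), ("a", "chicken"),
        ("b", "cat"), ("b", "dog"),
        ("c", "chicken"),
        ("d", "cat"), ("d", "ferret"),
    ],
)


def full_matches_by_count(groups):
    """First answer: the count inside the chosen groups must equal the overall count."""
    marks = ", ".join("?" for _ in groups)
    sql = f"""
        SELECT t1.FieldB
        FROM (SELECT FieldB, COUNT(FieldB) AS FCount
              FROM Table1 WHERE FieldA IN ({marks}) GROUP BY FieldB) t1
        INNER JOIN (SELECT FieldB, COUNT(FieldB) AS FCount
                    FROM Table1 GROUP BY FieldB) t2   -- plays the role of Table2 / qryCount
          ON t2.FieldB = t1.FieldB AND t2.FCount = t1.FCount
        ORDER BY t1.FieldB
    """
    return [row[0] for row in conn.execute(sql, groups)]


def full_matches_by_not_exists(groups):
    """Second answer: the FieldB value must not appear in any group outside the chosen ones."""
    marks = ", ".join("?" for _ in groups)
    sql = f"""
        SELECT DISTINCT t1.FieldB
        FROM Table1 t1
        WHERE t1.FieldA IN ({marks})
          AND NOT EXISTS (SELECT 1 FROM Table1 t2
                          WHERE t2.FieldB = t1.FieldB
                            AND t2.FieldA NOT IN ({marks}))
        ORDER BY t1.FieldB
    """
    # The placeholder list appears twice in the SQL, so the values are passed twice.
    return [row[0] for row in conn.execute(sql, groups + groups)]


for chosen in (["a", "b", "d"], ["a", "c"], ["b"]):
    print(chosen, full_matches_by_count(chosen), full_matches_by_not_exists(chosen))
```

On this sample data both approaches agree for all three inputs from the question. They rely on each FieldB value appearing at most once per FieldA group, which the question states.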

How can I check dual data entry in SQL / PL/SQL

Dual data entry checking: the same data is entered by two people, and now I want to compare the two entries to ensure data quality.
This will depend a lot on the measure of quality that you want to use.
As an example, you can just check for the fraction of entries that match exactly,
CASE WHEN COLUMN1 = COLUMN2 THEN 1 ELSE 0 END AS MatchedData
Then you can sum MatchedData and divide by the total number of entries.
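As a rough illustration of that calculation, the sketch below computes the matched count, the total and the fraction in a single query. It runs on SQLite through Python's sqlite3 module rather than Oracle, and the `entries` table (with `column1` holding the first person's value and `column2` the second person's) and its rows are assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical layout: one row per record, one column per data-entry person.
conn.execute("CREATE TABLE entries (record_id INTEGER, column1 TEXT, column2 TEXT)")
conn.executemany(
    "INSERT INTO entries VALUES (?, ?, ?)",
    [
        (1, "Smith", "Smith"),
        (2, "Jones", "Jonse"),   # typo by the second person
        (3, "Brown", "Brown"),
        (4, "Green", None),      # second person left the value empty
    ],
)

matched, total, fraction = conn.execute(
    """
    SELECT SUM(CASE WHEN column1 = column2 THEN 1 ELSE 0 END) AS matched,
           COUNT(*)                                            AS total,
           AVG(CASE WHEN column1 = column2 THEN 1.0 ELSE 0.0 END) AS fraction
    FROM entries
    """
).fetchone()

print(matched, total, fraction)
```

Note that a comparison with NULL is never true, so a record where one person left the value empty falls into the ELSE branch and counts as a mismatch.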
You can use a correlated subquery for this. First, decide which columns must hold identical values for two records to count as duplicates. As you said, the records were entered by different users, so a column such as created_by_user (if one exists) may differ while all the others are the same. Put the columns that should match into the subquery below to get the list of duplicate records.
SELECT
*
FROM
MY_TABLE t1
WHERE
ROWID <> (
SELECT
MAX(ROWID)
FROM
MY_TABLE t2
WHERE
t1.col1 = t2.col1
AND
t1.col2 = t2.col2
)

Minus operator in sql

I am trying to create a SQL query with MINUS.
I have query1, which returns 28 rows with 2 columns.
I have query2, which returns 22 rows with the same 2 columns.
When I create a query as query1 MINUS query2, it should show only the 28-22=6 rows.
But it is showing all 28 rows returned by query1.
Please advise.
Try using EXCEPT instead of MINUS. For example:
Let's consider a case where you want to find out which tasks in a table haven't been assigned to you (so basically you are trying to find which tasks could be available to do).
SELECT TaskID, TaskType
FROM Tasks
EXCEPT
SELECT TaskID, TaskType
FROM Tasks
WHERE Username = 'Vidya'
That would return all the tasks that haven't been assigned to you. Hope that helps.
If MINUS won't work for you, the general form you want is the main query in the outer select and a variation of the other query in a not exists clause.
select <insert list of fields here>
from mytable a
join myothertable b
on b.aId = a.aid
where not exists (select * from tablec c where a.aid = c.aid)
The fields might not be exactly alike. Maybe one of the fields is char(10) and the other is char(20), and they both have the string "TEST" in them. They might "look" the same.
If the database you are working on supports "INTERSECT", try this query and see how many are perfectly matching results.
select field1, field2 from table1
intersect
select field1, field2 from table2
To get the results you are expecting, this query should give you 22 rows.
Something like this:
select field1, field2, ..., field_n
from tables
MINUS
select field1, field2, ..., field_n
from tables;
MINUS works on the same principle as it does in the set operations. Suppose if you have set A and B,
A = {1,2,3,4}; B = {3,5,6}
then, A-B = {1,2,4}
If A = {1,3,5} and B = {2,4,6}
then, A-B = {1,3,5}. Here the count(A) before and after the MINUS operation will be the same, as it does not contain any overlapping terms with set B.
Along similar lines, maybe the result set obtained in query2 has no rows that match the result of query1. Hence you are still getting 28 rows instead of 6.
Hope this helps.
It returns the records from the upper query which are not contained in the second query.
In your case for example
A={1,2,3,4,5...28} AND B={29,30} then A-B={1,2,3....28}
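The set-difference behaviour described in these answers, and the "values that only look the same" explanation, can be tried out with a small sketch. It uses SQLite through Python's sqlite3 module, where the operator is spelled EXCEPT (Oracle spells it MINUS). The tables `q1` and `q2` stand in for the two queries and their contents are invented for illustration; how trailing spaces compare also depends on the database and column type, so treat this as one possible cause rather than a diagnosis.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE q1 (field1 TEXT, field2 TEXT);
    CREATE TABLE q2 (field1 TEXT, field2 TEXT);
    INSERT INTO q1 VALUES ('1', 'TEST'), ('2', 'TEST'), ('3', 'TEST');
    -- The second row of q2 carries trailing spaces, so it only "looks" the same.
    INSERT INTO q2 VALUES ('1', 'TEST'), ('2', 'TEST  ');
    """
)

# Rows of q1 that have no exact counterpart in q2.
difference = conn.execute(
    "SELECT field1, field2 FROM q1 EXCEPT SELECT field1, field2 FROM q2 ORDER BY 1"
).fetchall()

# Rows that match exactly in both; a quick way to see how much really overlaps.
common = conn.execute(
    "SELECT field1, field2 FROM q1 INTERSECT SELECT field1, field2 FROM q2 ORDER BY 1"
).fetchall()

# Normalising the values on both sides makes the look-alike row match again.
difference_trimmed = conn.execute(
    "SELECT field1, TRIM(field2) FROM q1 "
    "EXCEPT SELECT field1, TRIM(field2) FROM q2 ORDER BY 1"
).fetchall()

print(difference)
print(common)
print(difference_trimmed)
```

If the INTERSECT of your two real queries returns far fewer than the 22 rows you expect, the rows are not matching exactly, and that is why MINUS removes so little.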

How to compare two rows in one mdb table?

I have one mdb table with the following structure:
Field1 Field2 Field3 Field4
A ...
B ...
I am trying to use a query to list all the fields that differ between rows A and B in a result set:
SELECT * From Table1
WHERE Field1 = 'A'
UNION
SELECT * From Table1
WHERE Field1 = 'B';
However this query has two problems:
it lists all the fields, including the identical cells
with a large table it gives an error message: "too many fields defined"
How could I get around these issues?
Is it not easiest to just select all fields needed from the table, based on the Field1 value and group on the values needed?
So something like this:
SELECT field1, field2,...field195
FROM Table1
WHERE field1 = 'A' or field1 = 'B'
GROUP BY field1, field2, ....field195
This will give you all rows where field1 is A or B and there is a difference in one of the selected fields.
Oh and for the group by statement as well as the SELECT part, indeed use the previously mentioned edit mode for the query. There you can add all fields (by selecting them in the table and dragging them down) that are needed in the result, then click the 'totals' button in the ribbon to add the group by- statements for all. Then you only have to add the Where-clause and you are done.
Now that the question is more clear (you want the query to select fields instead of records based on the particular requirements), I'll have to change my answer to:
This is not possible.
(until proven otherwise) ;)
As far as I know, a query is used to select records, for example with a where clause, and never to determine which fields should be shown depending on a certain criterion.
One thing that MIGHT help in this case is to look at the database design. Are those tables correctly made?
Suppose you have 190 of those fields that are merely details of the main data. You could separate this in another table, so you have a main table and details table.
The details table could look something like:
ID ID_Main Det_desc Det_value
This way you can list all detail values that differ between the two main values A and B using something like:
Select A.det_desc, A.det_value, B.det_value
from
(Select Det_desc, det_value
from tblDetails
where id_main = 'A') as A inner join
(Select Det_desc, det_value
from tblDetails
where id_main = 'B') as B
on A.det_desc = B.det_desc and A.det_value <> B.det_value
This you can join with your main table again if needed.
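Here is a minimal sketch of that details-table idea, run on SQLite through Python's sqlite3 module instead of Access. The table and column names follow the layout suggested above; the sample rows, and the choice to store the former field names ('Field2', 'Field3', ...) in Det_desc, are assumptions for the sake of the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tblDetails ("
    "ID INTEGER PRIMARY KEY, ID_Main TEXT, Det_desc TEXT, Det_value TEXT)"
)
# Hypothetical data: each former column of row A / row B becomes one detail row.
conn.executemany(
    "INSERT INTO tblDetails (ID_Main, Det_desc, Det_value) VALUES (?, ?, ?)",
    [
        ("A", "Field2", "red"), ("A", "Field3", "10"), ("A", "Field4", "x"),
        ("B", "Field2", "red"), ("B", "Field3", "12"), ("B", "Field4", "y"),
    ],
)

# Pair up the details of A and B by description and keep only the ones that differ.
differences = conn.execute(
    """
    SELECT a.Det_desc, a.Det_value, b.Det_value
    FROM tblDetails a
    JOIN tblDetails b
      ON a.Det_desc = b.Det_desc
     AND a.Det_value <> b.Det_value
    WHERE a.ID_Main = ? AND b.ID_Main = ?
    ORDER BY a.Det_desc
    """,
    ("A", "B"),
).fetchall()

for desc, value_a, value_b in differences:
    print(desc, value_a, value_b)
```

With this layout the "which fields differ" question becomes an ordinary row filter, so the number of original fields no longer matters. A detail that exists for only one of the two rows, or whose value is NULL, would not show up with this inner join.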
You can full join the table on itself, matching identical rows. Then you can filter on mismatches if one of the two join parts is null. For example:
select *
from (
select *
from Table1
where Field1 = 'A'
) A
full join
(
select *
from Table1
where Field1 = 'B'
) B
on A.Field2 = B.Field2
and A.Field3 = B.Field3
where A.Field1 is null
or B.Field1 is null
If you have 200 fields, ask Access to generate the column list by creating a query in design view. Switch to SQL view and copy/paste. An editor with column mode (like UltraEdit) will help create the query.

Use of CASE statement values in THEN expression

I am attempting to use a case statement but keep getting errors. Here's the statement:
select TABLE1.acct,
CASE
WHEN TABLE1.acct_id in (select acct_id
from TABLE2
group by acct_id
having count(*) = 1 ) THEN
(select name
from TABLE3
where TABLE1.acct_id = TABLE3.acct_id)
ELSE 'All Others'
END as Name
from TABLE1
When I replace the TABLE1.acct_id in the THEN expression with a literal value, the query works. When I try to use TABLE1.acct_id from the WHEN part of the query, I get an error saying the result is more than one row. It seems like the THEN expression is ignoring the single value that the WHEN statement was using. No idea, maybe this isn't even a valid use of the CASE statement.
I am trying to see names for accounts that have one entry in TABLE2.
Any ideas would be appreciated, I'm kind of new at SQL.
First, you are missing a comma after TABLE1.acct. Second, you have aliased TABLE1 as acct, so you should use that.
Select acct.acct
, Case
When acct.acct_id in ( Select acct_id
From TABLE2
Group By acct_id
Having Count(*) = 1 )
Then ( Select name
From TABLE3
Where acct.acct_id = TABLE3.acct_id
Fetch First 1 Rows Only)
Else 'All Others'
End as Name
From TABLE1 As acct
As others have said, you should adjust your THEN clause to ensure that only one value is returned. You can do that by adding Fetch First 1 Rows Only to your subquery.
Then ( Select name
From TABLE3
Where acct.acct_id = TABLE3.acct_id
Fetch First 1 Rows Only)
FETCH is not accepted inside the CASE expression: "Keyword FETCH not expected. Valid tokens: ) UNION EXCEPT."
select name from TABLE3 where TABLE1.acct_id = TABLE3.acct_id
will give you all the names in Table3, which have a accompanying row in Table 1. The row selected from Table2 in the previous line doesn't enter into it.
Must be getting more than one value.
You can replace the body with...
(select count(name) from TABLE3 where TABLE1.acct_id = TABLE3.acct_id)
... to narrow down which rows are returning multiples.
It may be the case that you just need a DISTINCT or a TOP 1 to reduce your result set.
Good luck!
I think that what is happening here is that your CASE must return a single value, because it becomes the value of the "Name" column for that row. The subquery (select acct_id from TABLE2 group by acct_id having count(*) = 1) is OK because it is used with IN, which accepts any number of rows. (select name from TABLE3 where TABLE1.acct_id = TABLE3.acct_id), however, could return multiple rows depending on your data. The problem is that you are trying to shove multiple values into a single field of a single row.
The next thing to do would be to find out what data causes multiple rows to be returned by (select name from TABLE3 where TABLE1.acct_id= TABLE3.acct_id), and see if you can further limit this query to only return one row. If need be, you could even try something like ...AND ROWNUM = 1 (for Oracle - other DBs have similar ways of limiting rows returned).
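The two steps suggested in these answers (find the accounts that return several names, then force the THEN subquery down to one value) could look roughly like the sketch below. It runs on SQLite through Python's sqlite3 module; the sample rows are made up, and using MIN(name) to collapse the subquery to a single value is a substitute for FETCH FIRST / ROWNUM, not something proposed in the thread. SQLite itself does not raise the "more than one row" error (it silently takes the first row), so the sketch only shows the diagnosis and the fix, not the failure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE TABLE1 (acct TEXT, acct_id INTEGER);
    CREATE TABLE TABLE2 (acct_id INTEGER);
    CREATE TABLE TABLE3 (acct_id INTEGER, name TEXT);
    -- Hypothetical data: account 2 has two entries in TABLE2,
    -- account 3 has two names in TABLE3.
    INSERT INTO TABLE1 VALUES ('A-1', 1), ('A-2', 2), ('A-3', 3);
    INSERT INTO TABLE2 VALUES (1), (2), (2), (3);
    INSERT INTO TABLE3 VALUES (1, 'Alice'), (3, 'Carol'), (3, 'Caroline');
    """
)

# Step 1: which acct_id values would make the scalar subquery return several rows?
offenders = conn.execute(
    "SELECT acct_id, COUNT(name) FROM TABLE3 GROUP BY acct_id HAVING COUNT(name) > 1"
).fetchall()

# Step 2: an aggregate without GROUP BY always yields exactly one row,
# so MIN(name) keeps the THEN branch single-valued.
result = conn.execute(
    """
    SELECT TABLE1.acct,
           CASE
             WHEN TABLE1.acct_id IN (SELECT acct_id
                                     FROM TABLE2
                                     GROUP BY acct_id
                                     HAVING COUNT(*) = 1)
             THEN (SELECT MIN(name)
                   FROM TABLE3
                   WHERE TABLE3.acct_id = TABLE1.acct_id)
             ELSE 'All Others'
           END AS Name
    FROM TABLE1
    ORDER BY TABLE1.acct
    """
).fetchall()

print(offenders)
print(result)
```

Which of the several names is the "right" one is a question about the data, so picking the minimum is only a placeholder until you know why TABLE3 holds more than one name per account.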