Find full matches in one field when related to a second field - sql

I'd be grateful for some help with a problem that I hope to summarise reasonably with the two tables below:
Table1 contains the primary raw data where FieldA has a relationship with specific items in FieldB.
The items in FieldB are unique with respect to each unique item in FieldA; that is, cat, dog, rabbit and chicken will each appear only once under the "a" group in FieldA (though they can appear under other groups). Similarly for the b, c and d groups in FieldA: each FieldB item appears at most once against each.
Table2 lists the total count of each unique item in Table1.FieldB and is generated by the following query:
qryCount:
select FieldB, count(FieldB) AS FCount
from Table1
GROUP BY FieldB;
My problem:
The user enters a set of unique FieldA values from Table1; the query should then return all unique FieldB values (Table1) for which a full match is achieved with respect to the corresponding FCount total in Table2.
e.g.
If the user enters "a,b,d" the query outputs "cat, dog, rabbit, ferret" since the total count for cat(3), dog(2), rabbit(1) and ferret(1) are met.
If the user enters "a,c" the query outputs "chicken,rabbit" since the total count is met for chicken(2) and rabbit(1).
If the user enters "b" the query returns nothing since the respective FieldB items are also present elsewhere.
I have this problem solved using VBA in Excel (building a hit table and checking whether the respective total counts for the user-entered values are met), but although I have some experience with Access SQL (2007), I'm struggling to translate the idea from VBA. I'd be grateful for some help.

This query should give you the results you want. It uses a subquery to generate an effective copy of Table2 but only for the desired values of FieldA. This is then joined to Table2, giving only rows where the values of FCount match:
SELECT t1.FieldB
FROM (SELECT FieldB, COUNT(FieldB) AS FCount
      FROM Table1
      WHERE FieldA IN ('a', 'b', 'd')
      GROUP BY FieldB) t1
INNER JOIN Table2 t2
        ON t2.FieldB = t1.FieldB
       AND t2.FCount = t1.FCount
Output:
FieldB
cat
dog
ferret
rabbit

Unless I've misunderstood the logic, I would suggest the following:
select distinct t1.fieldb from table1 t1
where
t1.fielda in ('a', 'b', 'd') and
not exists
(
select 1 from table1 t2
where t2.fieldb = t1.fieldb and t2.fielda not in ('a', 'b', 'd')
)
A few notes on the above:
The query is essentially selecting records for which the value held by FieldB only appears in the targeted FieldA groups (in this case a,b,d) and in no other groups.
Only table1 is referenced by the query, as no aggregation or counting is used.
The use of select 1 is purely an optimisation: we don't care what the correlated subquery returns, only whether one or more records exist, so it can return the minimum amount of information necessary to verify this.
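As a quick sanity check, the NOT EXISTS approach can be exercised in SQLite from Python. The sample rows below are hypothetical but constructed to reproduce the question's three examples:

```python
import sqlite3

# Hypothetical sample data consistent with the question's examples.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Table1 (FieldA TEXT, FieldB TEXT)")
conn.executemany("INSERT INTO Table1 VALUES (?, ?)", [
    ("a", "cat"), ("b", "cat"), ("d", "cat"),
    ("a", "dog"), ("b", "dog"),
    ("a", "rabbit"), ("d", "ferret"),
    ("a", "chicken"), ("c", "chicken"),
])

def full_matches(groups):
    # Keep FieldB values that never occur outside the selected FieldA groups.
    placeholders = ",".join("?" * len(groups))
    sql = f"""
        SELECT DISTINCT t1.FieldB FROM Table1 t1
        WHERE t1.FieldA IN ({placeholders})
          AND NOT EXISTS (
            SELECT 1 FROM Table1 t2
            WHERE t2.FieldB = t1.FieldB
              AND t2.FieldA NOT IN ({placeholders})
          )
        ORDER BY t1.FieldB
    """
    return [r[0] for r in conn.execute(sql, groups * 2)]

print(full_matches(["a", "b", "d"]))  # ['cat', 'dog', 'ferret', 'rabbit']
print(full_matches(["a", "c"]))      # ['chicken', 'rabbit']
print(full_matches(["b"]))           # []
```

All three of the question's examples come out as expected.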

Related

How can I check dual data entry in SQL/PLSQL

Dual data entry checking: the same data is entered by two people, and I now want to compare the two copies to ensure data quality.
This will depend a lot on the measure of quality that you want to use.
As an example, you can just check for the fraction of entries that match exactly,
CASE WHEN COLUMN1 = COLUMN2 THEN '1' ELSE '0' END AS MatchedData
Then you can sum MatchedData and divide by the total number of entries.
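As a sketch of that calculation in SQLite (via Python), using numeric 1/0 so that AVG gives the matched fraction directly; the table and column names are illustrative:

```python
import sqlite3

# Hypothetical table holding both operators' entries side by side.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entries (COLUMN1 TEXT, COLUMN2 TEXT)")
conn.executemany("INSERT INTO entries VALUES (?, ?)",
                 [("x", "x"), ("y", "y"), ("z", "w"), ("q", "q")])

# Fraction of rows where both operators typed the same value:
# AVG of a 1/0 flag is exactly sum-of-matches / total-entries.
row = conn.execute("""
    SELECT AVG(CASE WHEN COLUMN1 = COLUMN2 THEN 1.0 ELSE 0.0 END)
    FROM entries
""").fetchone()
print(row[0])  # 0.75 -> 3 of 4 entries match exactly
```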
You can use a correlated subquery for this. First, decide which columns must hold identical values for two records to be considered duplicates. As you said, the records are entered by different users, so a created_by_user column (if one exists) may differ while all the others are the same. Then put those columns into the subquery below to get the list of duplicate records.
SELECT *
FROM MY_TABLE t1
WHERE ROWID <> (
    SELECT MAX(ROWID)
    FROM MY_TABLE t2
    WHERE t1.col1 = t2.col1
      AND t1.col2 = t2.col2
)
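SQLite happens to expose a rowid as well, so the query above can be exercised directly from Python; the table contents are illustrative:

```python
import sqlite3

# Two identical rows plus one distinct row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MY_TABLE (col1 TEXT, col2 TEXT)")
conn.executemany("INSERT INTO MY_TABLE VALUES (?, ?)",
                 [("a", "1"), ("a", "1"), ("b", "2")])

# Every copy except the one with the highest rowid is flagged.
dupes = conn.execute("""
    SELECT col1, col2 FROM MY_TABLE t1
    WHERE rowid <> (SELECT MAX(rowid) FROM MY_TABLE t2
                    WHERE t1.col1 = t2.col1 AND t1.col2 = t2.col2)
""").fetchall()
print(dupes)  # [('a', '1')] -- the earlier copy of the duplicated row
```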

SQL - Find duplicate fields and count how many fields are matched

I have a large customer database where customers have in some circumstances been added multiple times, which is causing problems. I am able to use a query to identify the records which are an exact match, although some records have slight variations such as different addresses or given names.
I want to query across 10 fields. Some records will match another record on all 10, which is clearly a duplicate, while others may match on only 5 fields and require further investigation. Therefore I want to create a result set with a field counting how many fields have been matched, essentially a rating of the likelihood that the result is an actual match. All 10 would be a clear dup, but 5 would only be a possible duplicate.
Some will only match on POSTCODE and FIRSTNAME, which can generally be discounted.
Something like this helps, but as it only returns records that match exactly on all three fields, it's not really useful due to the sheer amount of data.
SELECT field1,field2,field3, count(*)
FROM table_name
GROUP BY field1,field2,field3
HAVING count(*) > 1
You are just missing the magic of CUBE(), which generates all the combinations of columns automatically:
DECLARE @duplicate_column_threshold int = 5;

WITH cte AS (
    SELECT
        field1, field2, ..., field10,
        duplicate_column_count = (SELECT COUNT(col)
                                  FROM (VALUES (field1),(field2),...,(field10)) c(col))
    FROM table_name
    GROUP BY CUBE(field1, field2, ..., field10)
    HAVING COUNT(*) > 1
)
SELECT *
INTO #duplicated_rows
FROM cte
WHERE duplicate_column_count >= @duplicate_column_threshold;
Update: to fetch the rows from the original table, join it against the #duplicated_rows using a technique that treats NULLs as wildcards when comparing the columns.
SELECT
a.*
,b.duplicate_column_count
FROM table_name a
INNER JOIN #duplicated_rows b
ON NULLIF(b.field1,a.field1) IS NULL
AND NULLIF(b.field2,a.field2) IS NULL
...
AND NULLIF(b.field10,a.field10) IS NULL
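The NULLIF trick relies on NULLIF(b.col, a.col) being NULL either when b.col is itself NULL (the CUBE wildcard) or when the two values are equal. A minimal SQLite sketch of just that join condition, with illustrative columns:

```python
import sqlite3

# Three rows; we'll match them against the pattern ('x', NULL).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (f1 TEXT, f2 TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [("x", "1"), ("x", "2"), ("y", "1")])

# NULLIF('x', f1) IS NULL  -> f1 must equal 'x'
# NULLIF(NULL, f2) IS NULL -> always true: NULL acts as a wildcard
rows = conn.execute("""
    SELECT t.f1, t.f2 FROM t
    WHERE NULLIF('x', t.f1) IS NULL
      AND NULLIF(NULL, t.f2) IS NULL
    ORDER BY t.f2
""").fetchall()
print(rows)  # [('x', '1'), ('x', '2')]
```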
You might try something like
Select field1, field2, field3, ... , field10, count(1)
from customerdatabase
group by field1, field2, field3, ... , field10
order by field1, field2, field3, ... , field10
Where field1 through field10 are ordered by the "most identifiable/important" to least.
This is as close as I've got to what I'm trying to achieve, which will return all records that have any duplicate fields. I want to add a column to the results indicating how many fields have matched any other record in the table. There are around 40,000 records in total.
select * from [CUST].[dbo].[REPORTA] as a
where exists
(select [GIVEN.NAMES],[FAMILY.NAME],[DATE.OF.BIRTH],[POST.CODE],[STREET],[TOWN.COUNTRY]
from [CUST].[dbo].[REPORTA] as b
where a.[GIVEN.NAMES] = b.[GIVEN.NAMES]
or a.[FAMILY.NAME] = b.[FAMILY.NAME]
or a.[DATE.OF.BIRTH] = b.[DATE.OF.BIRTH]
or a.[POST.CODE] = b.[POST.CODE]
or a.[STREET] = b.[STREET]
or a.[TOWN.COUNTRY] = b.[TOWN.COUNTRY]
group by [GIVEN.NAMES],[FAMILY.NAME],[DATE.OF.BIRTH],[POST.CODE],[STREET],[TOWN.COUNTRY]
having count(*) >= 1)
This query will return thousands of records, but I'm mainly interested in records with a high count of exactly matching fields.
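If CUBE isn't available, the pairwise rating itself is straightforward to sketch in plain Python; the field names and sample records below are illustrative, not from the actual database:

```python
from itertools import combinations

# For every pair of records, count how many fields agree; a high score
# suggests a likely duplicate worth further investigation.
FIELDS = ["given_names", "family_name", "date_of_birth",
          "post_code", "street", "town_country"]

records = [
    {"given_names": "Ann", "family_name": "Lee", "date_of_birth": "1980-01-01",
     "post_code": "AB1", "street": "High St", "town_country": "Leeds"},
    {"given_names": "Ann", "family_name": "Lee", "date_of_birth": "1980-01-01",
     "post_code": "AB1", "street": "High Street", "town_country": "Leeds"},
    {"given_names": "Bob", "family_name": "Kay", "date_of_birth": "1975-05-05",
     "post_code": "AB1", "street": "Low St", "town_country": "York"},
]

def match_score(a, b):
    # Number of fields with exactly equal values.
    return sum(a[f] == b[f] for f in FIELDS)

for (i, a), (j, b) in combinations(enumerate(records), 2):
    print(i, j, match_score(a, b))  # rows 0 and 1 score 5 of 6
```

With 40,000 records the naive all-pairs loop is ~800 million comparisons, so in practice you would block on one or two key fields (e.g. postcode) first.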

Filter SQL query results by shared data values, similar to iTunes Get Info

I am trying to do an SQL query (MS Access) and have only the "like" (shared) fields returned, similar to iTunes, where you can select multiple songs and then, when editing their respective data, filter out data that is not shared by each selected song.
For example, I have a table similar to
ID,date,weight,buyerid
123,21/07/2014,5,22
124,21/07/2014,5,23
125,22/08/2014,5,23
If I search for like results for all three of the IDs (123, 124, and 125), I would only receive data under the weight column, as all three selections have the same weight. Likewise, if I searched using IDs 123 and 124, the date and weight values would be returned, as both IDs share those data values. The non-similar data would return null or no result.
Is this possible in a single query at all?
EDIT:
rephrase (trying my best to explain) ... I want to search a table of data (multiple fields) and only receive one row of results. Normally a similar search would return multiple rows with some fields containing the same data and others not, but I would like it filtered to return only one row that contains data in each field that is the same across all results (or nothing if not the same).
My other option is to loop through a standard query and pull all data out that matches for each field but i was hoping it might be able to be done in a single SQL query.
Hope that is better.
I think you are looking for something like this (assuming your table is named Table1):
SELECT t1.SharedAttributes
FROM
(
SELECT ID, CStr([Weight]) AS SharedAttributes
FROM Table1
WHERE ID IN (123, 124)
UNION ALL
SELECT ID, CStr([Date]) AS SharedAttributes
FROM Table1
WHERE ID IN (123, 124)
) t1,
(
SELECT COUNT(*) AS DistinctIDs
FROM (SELECT DISTINCT ID FROM Table1 WHERE ID IN (123, 124))
) t2
GROUP BY t1.SharedAttributes, t2.DistinctIDs
HAVING COUNT(*) = t2.DistinctIDs;
This produces:
SharedAttributes
21/07/2014
5
While:
SELECT t1.SharedAttributes
FROM
(
SELECT ID, CStr([Weight]) AS SharedAttributes
FROM Table1
WHERE ID IN (123, 124, 125)
UNION ALL
SELECT ID, CStr([Date]) AS SharedAttributes
FROM Table1
WHERE ID IN (123, 124, 125)
) t1,
(
SELECT COUNT(*) AS DistinctIDs
FROM (SELECT DISTINCT ID FROM Table1 WHERE ID IN (123, 124, 125))
) t2
GROUP BY t1.SharedAttributes, t2.DistinctIDs
HAVING COUNT(*) = t2.DistinctIDs;
produces:
SharedAttributes
5
The code is a bit ugly, repeating the same WHERE clause multiple times, but I don't think that can be helped. Another way to do it would be to drop all the searched IDs into a table and, instead of the WHERE clauses, INNER JOIN to that table in each query. Also, a caveat: because you want all the results in one column, we have to cast all the values to a common datatype; in this case, string.
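For what it's worth, the same UNION ALL + GROUP BY/HAVING pattern can be reproduced in SQLite from Python with the question's sample rows; CStr becomes CAST(... AS TEXT), and the date is stored as text here:

```python
import sqlite3

# The question's sample table (date kept as text for simplicity).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Table1 (ID INT, dt TEXT, weight INT, buyerid INT)")
conn.executemany("INSERT INTO Table1 VALUES (?, ?, ?, ?)", [
    (123, "21/07/2014", 5, 22),
    (124, "21/07/2014", 5, 23),
    (125, "22/08/2014", 5, 23),
])

def shared(ids):
    # A value is "shared" if it appears once per selected ID.
    ph = ",".join("?" * len(ids))
    sql = f"""
        SELECT SharedAttributes FROM (
            SELECT ID, CAST(weight AS TEXT) AS SharedAttributes
            FROM Table1 WHERE ID IN ({ph})
            UNION ALL
            SELECT ID, dt FROM Table1 WHERE ID IN ({ph})
        ) AS u
        GROUP BY SharedAttributes
        HAVING COUNT(*) = (SELECT COUNT(DISTINCT ID)
                           FROM Table1 WHERE ID IN ({ph}))
        ORDER BY SharedAttributes
    """
    return [r[0] for r in conn.execute(sql, ids * 3)]

print(shared([123, 124]))       # ['21/07/2014', '5']
print(shared([123, 124, 125]))  # ['5']
```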

How to compare two rows in one mdb table?

I have one mdb table with the following structure:
Field1 Field2 Field3 Field4
A ...
B ...
I'm trying to use a query to list all the differing fields of rows A and B in a result set:
SELECT * From Table1
WHERE Field1 = 'A'
UNION
SELECT * From Table1
WHERE Field1 = 'B';
However, this query has two problems:
it lists all the fields, including the identical cells;
with a large table, it gives out an error message: "too many fields defined".
How could I get around these issues?
Is it not easiest to just select all the fields needed from the table, based on the Field1 value, and group on the values needed?
So something like this:
SELECT field1, field2,...field195
FROM Table1
WHERE field1 = 'A' or field1 = 'B'
GROUP BY field1, field2, ....field195
This will give you all rows where field1 is A or B and there is a difference in one of the selected fields.
Oh, and for the GROUP BY statement as well as the SELECT part, you can indeed use the query design view mentioned earlier. There you can add all the fields needed in the result (by selecting them in the table and dragging them down), then click the 'Totals' button in the ribbon to add the GROUP BY for all of them. Then you only have to add the WHERE clause and you are done.
Now that the question is clearer (you want the query to select fields instead of records based on the particular requirements), I'll have to change my answer to:
This is not possible.
(until proven otherwise) ;)
As far as I know, a query is used to select records, for example with a WHERE clause; it is never used to determine which fields should be shown based on some criterion.
One thing that MIGHT help in this case is to look at the database design. Are those tables correctly designed?
Suppose 190 of those fields are merely details of the main data. You could separate these into another table, so you have a main table and a details table.
The details table could look something like:
ID ID_Main Det_desc Det_value
This way you can list all detail values that differ between the two main records A and B, using something like:
Select a.det_desc, a.det_value, b.det_value
From (Select det_desc, det_value
      from tblDetails
      where id_main = 'A') as a
inner join
     (Select det_desc, det_value
      from tblDetails
      where id_main = 'B') as b
on a.det_desc = b.det_desc and a.det_value <> b.det_value
This you can join with your main table again if needed.
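As a sketch, this attribute/value comparison can be tried in SQLite from Python; the details table contents are illustrative:

```python
import sqlite3

# A narrow details table: one row per (main record, attribute) pair.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tblDetails (id_main TEXT, det_desc TEXT, det_value TEXT)")
conn.executemany("INSERT INTO tblDetails VALUES (?, ?, ?)", [
    ("A", "color", "red"), ("A", "size", "10"),
    ("B", "color", "blue"), ("B", "size", "10"),
])

# Self-join on the attribute name, keeping only differing values.
diffs = conn.execute("""
    SELECT a.det_desc, a.det_value, b.det_value
    FROM tblDetails a INNER JOIN tblDetails b
      ON a.det_desc = b.det_desc AND a.det_value <> b.det_value
    WHERE a.id_main = 'A' AND b.id_main = 'B'
""").fetchall()
print(diffs)  # [('color', 'red', 'blue')] -- 'size' is identical, so omitted
```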
You can full join the table on itself, matching identical rows, then filter for mismatches by keeping rows where one of the two join sides is null. (Note that Access has no FULL JOIN; there you would emulate it with two LEFT JOINs combined with UNION.) For example:
select *
from (
select *
from Table1
where Field1 = 'A'
) A
full join
(
select *
from Table1
where Field1 = 'B'
) B
on A.Field2 = B.Field2
and A.Field3 = B.Field3
where A.Field1 is null
or B.Field1 is null
If you have 200 fields, ask Access to generate the column list by creating a query in design view. Switch to SQL view and copy/paste. An editor with column mode (like UltraEdit) will help create the query.

How to append distinct records from one table to another

How do I append only distinct records from a master table to another table, when the master may have duplicates? Example: I only want the distinct records in the smaller table, but I need to insert/append records to what I already have in there.
Ignoring any concurrency issues:
insert into smaller (field, ... )
select distinct field, ... from bigger
except
select field, ... from smaller;
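SQLite also supports EXCEPT, so the pattern can be sanity-checked from Python; the single field column here is illustrative:

```python
import sqlite3

# 'bigger' has duplicates and one value already present in 'smaller'.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bigger (field TEXT)")
conn.execute("CREATE TABLE smaller (field TEXT)")
conn.executemany("INSERT INTO bigger VALUES (?)",
                 [("a",), ("a",), ("b",), ("c",)])
conn.execute("INSERT INTO smaller VALUES ('a')")

# EXCEPT removes rows already in smaller; it also deduplicates.
conn.execute("""
    INSERT INTO smaller (field)
    SELECT DISTINCT field FROM bigger
    EXCEPT
    SELECT field FROM smaller
""")
rows = sorted(r[0] for r in conn.execute("SELECT field FROM smaller"))
print(rows)  # ['a', 'b', 'c'] -- only 'b' and 'c' were appended
```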
You can also rephrase it as a join:
insert into smaller (field, ... )
select distinct b.field, ...
from bigger b
left join smaller s on s.key = b.key
where s.key is NULL
If you don't like NOT EXISTS and EXCEPT/MINUS (cute, Remus!), you have also LEFT JOIN solution:
INSERT INTO smaller(a,b)
SELECT DISTINCT master.a, master.b FROM master
LEFT JOIN smaller ON smaller.a=master.a AND smaller.b=master.b
WHERE smaller.pkey IS NULL
You don't say the scale of the problem so I'll mention something I recently helped a friend with.
He works for an insurance company that provides supplemental Dental and Vision benefits management for other insurance companies. When they get a new client they also get a new database that can have 10's of millions of records. They wanted to identify all possible dupes with the data they already had in a master database of 100's of millions of records.
The solution we came up with was to identify two distinct combinations of field values (normalized in various ways) that would indicate a high probability of a dupe. We then created a new table containing MD5 hashes of the combos plus the id of the master record they applied to. The MD5 columns were indexed. All new records would have their combo hashes computed and if either of them had a collision with the master the new record would be kicked out to an exceptions file for some human to deal with it.
The speed of this surprised the hell out of us (in a nice way) and it has had a very acceptable false-positive rate.
You could use the distinct keyword to filter out duplicates:
insert into AnotherTable
(col1, col2, col3)
select distinct col1, col2, col3
from MasterTable
Based on Microsoft SQL Server and its Transact-SQL. Untested, as always, and this assumes target_table has the same number of columns as the source table (otherwise use column names between INSERT INTO and SELECT):
INSERT INTO target_table
SELECT DISTINCT s.row1, s.row2
FROM source_table s
WHERE NOT EXISTS (
    SELECT 1
    FROM target_table t
    WHERE t.row1 = s.row1
      AND t.row2 = s.row2
)
Something like this would work for SQL Server (you don't mention what RDBMS you're using):
INSERT INTO table (col1, col2, col3)
SELECT DISTINCT t2.a, t2.b, t2.c
FROM table2 AS t2
WHERE NOT EXISTS (
SELECT 1
FROM table
WHERE table.col1 = t2.a AND table.col2 = t2.b AND table.col3 = t2.c
)
Tune where appropriate, depending on exactly what defines "distinctness" for your table.