Finding duplicate entries in any of several columns - sql

I have code that identifies potential duplicate records based on the fact that several rows (with different IDs) have the same value in various other columns. This info gets manually reviewed, so I am not worried about the fact that a husband and wife could legitimately share an email address, for example. An example of the query I am using is this:
SELECT DISTINCT ID, Email
FROM Customers
WHERE Email IS NOT NULL AND Email != '' AND Email IN
(SELECT Email FROM Customers GROUP BY Email HAVING COUNT(DISTINCT ID) > 1)
ORDER BY Email;
Which gives me results like this:
ID Email
108 bob@hotmail.com
381 bob@hotmail.com
205 mary@gmail.com
772 mary@gmail.com
908 mary@gmail.com
This works great for my purposes, except when I try matching by phone number, which has multiple columns (HomePhone, BusinessPhone, CellPhone). This creates two problems - the first, which has been pretty well documented on this forum, is how to identify rows in which any of three columns contain a matching value (If a value in [row 1 column A, B, or C] matches a column in [row 2 column A, B, or C] then I want to select both rows). The second problem, which I haven't figured out yet and haven't found an answer to, is how to select [ID], [Value that Matched] as my output.
I suppose that I could select all three columns and do some further code magic in my program to make sense of it, but that prevents me from reusing existing code and also seems like the type of hack that a developer would use to keep from admitting that he needs help from a DBA. (Help!) In all seriousness, though, I am stuck trying to find an elegant solution, and any help would be appreciated.

Based on my understanding of the question, you can first use UNION ALL to gather the different phone numbers into one column, then group by that column to find the phones shared by more than one ID. After that, join back to the original table to get the customer ID together with the value that matched.
with cnts as (
select phone
from (select id,homephone phone from customers
union all
select id,businessphone from customers
union all
select id,cellphone from customers) x
group by phone
having count(distinct id) > 1
)
select c.id,cn.phone value_matched
from customers c
join cnts cn on cn.phone in (c.homephone,c.businessphone,c.cellphone)
order by 1,2
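As a rough check of the UNION ALL approach above, here is a minimal sketch that runs the same query shape against an in-memory SQLite database through Python's `sqlite3` module. The sample rows are invented for illustration, and the `WHERE phone IS NOT NULL AND phone <> ''` filter is an addition (not in the query above) that keeps blank phones from being reported as matches.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample data; only the column names come from the question.
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, homephone TEXT,
                        businessphone TEXT, cellphone TEXT);
INSERT INTO customers VALUES
  (1, '555-0100', NULL,       '555-0199'),
  (2, NULL,       '555-0100', NULL),
  (3, '555-0123', NULL,       NULL),
  (4, '555-0199', '555-0199', NULL);
""")

query = """
WITH cnts AS (
  SELECT phone
  FROM (SELECT id, homephone AS phone FROM customers
        UNION ALL
        SELECT id, businessphone FROM customers
        UNION ALL
        SELECT id, cellphone FROM customers) x
  WHERE phone IS NOT NULL AND phone <> ''   -- added: ignore blank phones
  GROUP BY phone
  HAVING COUNT(DISTINCT id) > 1             -- shared by more than one customer
)
SELECT c.id, cn.phone AS value_matched
FROM customers c
JOIN cnts cn ON cn.phone IN (c.homephone, c.businessphone, c.cellphone)
ORDER BY 1, 2
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)  # (id, value_matched)
```

Customer 4 has the same number in two columns but is listed once for it, because the join produces one row per customer and matched phone.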

I would do this with apply:
select c.*, phone
from (select c.*, count(*) over (partition by phone) as cnt
from customers c cross apply
(select distinct v.phone
from (values (homephone), (businessphone), (cellphone)
) v(phone)
where v.phone is not null
) v(phone)
) c
where cnt > 1
order by phone;
The innermost subquery selects the distinct phones for each customer. The count(*) over . . . then counts the number of times that the phone appears (which because of the distinct is for different customers). The final where chooses phones that appear for multiple customers.
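CROSS APPLY is SQL Server syntax, so the query above will not run everywhere. The sketch below swaps in a different technique for the unpivot step: a plain UNION, which removes duplicate (id, phone) pairs and so plays the role of the `distinct`, followed by the same `count(*) over (partition by phone)`. It runs on SQLite through Python's `sqlite3` (window functions need SQLite 3.25 or later); the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample data; only the column names come from the question.
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, homephone TEXT,
                        businessphone TEXT, cellphone TEXT);
INSERT INTO customers VALUES
  (1, '555-0100', NULL,       '555-0199'),
  (2, NULL,       '555-0100', NULL),
  (3, '555-0123', NULL,       NULL),
  (4, '555-0199', '555-0199', NULL);
""")

query = """
SELECT id, phone
FROM (SELECT id, phone,
             COUNT(*) OVER (PARTITION BY phone) AS cnt  -- customers per phone
      FROM (SELECT id, homephone AS phone FROM customers
            UNION                     -- UNION (not UNION ALL) dedupes per customer
            SELECT id, businessphone FROM customers
            UNION
            SELECT id, cellphone FROM customers) p
      WHERE phone IS NOT NULL) counted
WHERE cnt > 1
ORDER BY phone, id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)  # (id, phone)
```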

Related

Oracle - Join multiple columns trying different combinations

I'll try to explain my problem:
I need to find the most efficient way to join two table on 4 columns, but data is really crappy so there could be cases where I can join only on 3 or 2 columns because the fourth and/or third were stored badly (with spaces, zeros, dashes,...)
I should try to achieve something like this:
select * from table a
join table b
on a.key1=b.key1
and a.key2=b.key2
or a.key3=b.key3
or a.key4=b.key4
I have already done some data-quality cleanup, but the number of records is really high (table a has 300k records and table b about 25M).
I know that the example I provided is not efficient and it would be better making separate joins and then "union" them, but I'm asking you if there could be some better way to do it.
Thanks in advance
You haven't explained your problem very well, so let's create an example:
There is a table of clients and a table of orders. Both are not related via keys, because both are imported from different systems. Your task is now to find the client per order.
Both tables contain the client's last name, first name, city, and a client number. However, these columns are optional in the order table (but either last name or client number are always given). And sometimes a first name or city may be abbreviated or misspelled (e.g. J./James, NY/New York, Cris/Chris).
So, if the order contains a client number, we have a match and are done. Otherwise the last name must match. In the latter case we look at first name and city, too. Do both match? Only one? Neither?
We use RANK to rank the clients per order and pick the best matches. Some orders will end up with exactly one match, others will have ties and we must examine the data manually then (the worst case being no client number and no last name match because of a misspelled name).
select *
from
(
select
o.*,
c.*,
rank() over
(
partition by o.order_number
order by
case
when c.client_number = o.client_number then 1
when c.last_name = o.last_name and c.first_name = o.first_name and c.city = o.city then 2
when c.last_name = o.last_name and (c.first_name = o.first_name or c.city = o.city) then 3
when c.last_name = o.last_name then 4
else 5
end
) as rnk
from orders o
left join clients c on c.client_number = o.client_number or c.last_name = o.last_name
) ranked
where rnk = 1
order by order_number;
I hope this gives you an idea of how to write such a query, and that you will be able to adapt the concept to your case.
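To see how the ranking behaves, here is a minimal sketch of the same RANK query on SQLite via Python's `sqlite3` (window functions need SQLite 3.25 or later). The tables follow the clients/orders example described above; every row in them is invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample data for the clients/orders example.
conn.executescript("""
CREATE TABLE clients (client_number INTEGER, last_name TEXT,
                      first_name TEXT, city TEXT);
CREATE TABLE orders (order_number INTEGER, client_number INTEGER,
                     last_name TEXT, first_name TEXT, city TEXT);
INSERT INTO clients VALUES
  (10, 'Smith', 'James', 'New York'),
  (11, 'Smith', 'Chris', 'Boston'),
  (12, 'Jones', 'Mary',  'Boston');
INSERT INTO orders VALUES
  (100, 12,   NULL,    NULL, NULL),
  (101, NULL, 'Smith', 'J.', 'New York'),
  (102, NULL, 'Smith', NULL, NULL);
""")

query = """
SELECT order_number, matched_client
FROM (
  SELECT o.order_number,
         c.client_number AS matched_client,
         RANK() OVER (
           PARTITION BY o.order_number
           ORDER BY CASE
             WHEN c.client_number = o.client_number THEN 1
             WHEN c.last_name = o.last_name AND c.first_name = o.first_name
                  AND c.city = o.city THEN 2
             WHEN c.last_name = o.last_name
                  AND (c.first_name = o.first_name OR c.city = o.city) THEN 3
             WHEN c.last_name = o.last_name THEN 4
             ELSE 5
           END
         ) AS rnk
  FROM orders o
  LEFT JOIN clients c
    ON c.client_number = o.client_number OR c.last_name = o.last_name
) ranked
WHERE rnk = 1
ORDER BY order_number, matched_client
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)  # (order_number, matched_client)
```

Order 100 matches on client number, order 101 is settled by the city, and order 102 keeps both Smiths because they tie, which is the case that needs manual review.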

How to modify query to walk entire table rather than a single

I wrote several SQL queries and executed them against my table. Each individual query worked. I kept adding functionality until I got a really ugly working query. The problem is that I have to manually change a value every time I want to use it. Can you assist in making this query automatic rather than “manual”?
I am working with DB2.
Table below shows customers (cid) from 1 to 3. 'club' is a book seller, and 'qnty' is the number of books the customer bought from each 'club'. The full table has 45 customers.
Image below shows all the table elements for the first 3 users (cid=1 OR cid=2 OR cid=3). The final purpose of all my queries (once combined) is it to find the single 'club' with the largest 'qnty' for each 'cid'. So for 'cid =1' the 'club' is Readers Digest with 'qnty' of 3. For 'cid=2' the 'club' is YRB Gold with 'qnty' of 5. On and on until cid 45 is reached.
To give you a background on what I did here are my queries:
(Query 1-starting point for cid=1)
SELECT * FROM yrb_purchase WHERE cid=1
(Query 2 - find the 'club' with the highest 'qnty' for cid=1)
SELECT *
FROM
(SELECT club,
sum(qnty) AS t_qnty
FROM yrb_purchase
WHERE cid=1
GROUP BY club)results
ORDER BY t_qnty DESC
(Query 3 – combine the record from the above query with its cid)
SELECT cid,
temp.club,
temp.t_qnty
FROM yrb_purchase AS p,
(SELECT *
FROM
(SELECT club,
sum(qnty) AS t_qnty
FROM yrb_purchase
WHERE cid=1
GROUP BY club)results
ORDER BY t_qnty DESC FETCH FIRST 1 ROWS ONLY) AS TEMP
WHERE p.cid=1
AND p.club=temp.club
(Query 4) make sure there is only one record for cid=1
SELECT cid,
temp.club,
temp.t_qnty
FROM yrb_purchase AS p,
(SELECT *
FROM
(SELECT club,
sum(qnty) AS t_qnty
FROM yrb_purchase
WHERE cid=1
GROUP BY club)results
ORDER BY t_qnty DESC FETCH FIRST 1 ROWS ONLY) AS TEMP
WHERE p.cid=1
AND p.club=temp.club FETCH FIRST ROWS ONLY
To get the 'club' with the highest 'qnty' for customer 2, I would simply change the text cid=1 to cid=2 in the last query above. My query seems to always produce the correct results. My question is, how do I modify my query to get the results for all 'cid's from 1 to 45 in a single table? How do I get one table with all the cid values, along with the club which sold that cid the most books and how many books were sold? Please keep in mind I am hoping you can modify my query as opposed to you providing a better query.
If you decide that my query is way too ugly (I agree with you) and choose to provide another query, please be aware that I just started learning SQL and may not be able to understand your query. You should be aware that I already asked this question: For common elements, how to find the value based on two columns? SQL but I was not able to make the answer work (due to my SQL limitations - not because the answer wasn't good); and in the absence of a working answer I could not reverse engineer it to understand how it works.
Thanks in advance
****************************EDIT #1*******************************************
The results of the answer is:
You could use OLAP/Window Functions to achieve this:
SELECT
cid,
club,
qnty
FROM
(
SELECT
cid,
club,
qnty,
ROW_NUMBER() OVER (PARTITION BY cid order by qnty desc) as cid_club_rank
FROM
(
SELECT
cid,
club,
sum(qnty) as qnty
FROM yrb_purchase
GROUP BY cid, club
) as sub1
) as sub2
WHERE cid_club_rank = 1
The innermost statement (sub1) just computes the total quantity for each cid/club combination. The next statement out (sub2) assigns a row number to each cid/club combination within its cid, ordered by quantity (highest first). The outermost query then keeps only the records where that row_number() is 1.
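A minimal sketch of that query running on SQLite through Python's `sqlite3` (window functions need SQLite 3.25 or later). The rows are invented, chosen only so the totals agree with the two results stated in the question (Readers Digest with 3 for cid 1, YRB Gold with 5 for cid 2); the `ORDER BY cid` is added to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical purchase rows; the real table is not shown in the question.
conn.executescript("""
CREATE TABLE yrb_purchase (cid INTEGER, club TEXT, qnty INTEGER);
INSERT INTO yrb_purchase VALUES
  (1, 'Readers Digest', 1), (1, 'Readers Digest', 2), (1, 'Basic', 1),
  (2, 'YRB Gold', 3),       (2, 'YRB Gold', 2),       (2, 'Basic', 4),
  (3, 'Basic', 2);
""")

query = """
SELECT cid, club, qnty
FROM (SELECT cid, club, qnty,
             ROW_NUMBER() OVER (PARTITION BY cid ORDER BY qnty DESC)
               AS cid_club_rank
      FROM (SELECT cid, club, SUM(qnty) AS qnty   -- total per cid/club
            FROM yrb_purchase
            GROUP BY cid, club) AS sub1) AS sub2
WHERE cid_club_rank = 1                           -- best club per cid
ORDER BY cid
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)  # (cid, club, total qnty)
```

If two clubs tie for a customer, ROW_NUMBER keeps one of them arbitrarily; using RANK instead would keep both.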

SQL - I need to see how many users are associated with a specific set of ids

I'm trying to identify a list of users that all have the same set of IDs from another table.
I have users 1, 2, 3, and 4, all that can have multiple IDs from the list A, B, C, and D. I need to see how many users from list one have ONLY 3 IDs, and those three IDs must match (so how many users from list one have ONLY A, B, and C, but not D).
I can identify which users have which IDs, but I can't work out how to count the users that have one specific set of them.
Here is the SQL that I'm using where the counts just aren't looking correct. I've identified that there are about 7k users with exactly 16 IDs (of any type), but when I try to use this sql to get a count of a specific set of 16, the count I get is 15k.
select
count(user_id)
from
(
SELECT
user_id
FROM user_id_type
where user_id_type not in ('1','2','3','4','5')
GROUP BY user_id
HAVING COUNT(user_id_type)='16'
)
So you want users with 3 IDs as long as one of the IDs is not D. How about;
select user
from table
group by user
having count(*) = 3 and max(ID) <> 'D'
The HAVING clause is useful in situations like this. This approach will work as long as the excluded ID is the max (or an easy change for min).
Following your comment, if the min/max(ID) approach isn't viable then you could use NOT IN;
select user
from table
where user not in (select user from table where ID = 'D')
group by user
having count(*) = 3
Following the updated question, if I've understood the mapping between the initial example and reality correctly then the query should be something like this;
SELECT user_id
FROM user_id_type
WHERE user_id not in (select user_id from user_id_type where user_id_type in ('1','2','3','4','5'))
GROUP BY user_id
HAVING COUNT(user_id_type)='16'
What is odd is that you appear to have both a table and a column in the table with the same name 'user_id_type'. This isn't the clearest of designs.
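The NOT IN plus HAVING approach can be tried on the original A/B/C/D example with the sketch below, which uses SQLite through Python's `sqlite3`. The table name `user_ids`, the column name `id_type`, and the rows are all hypothetical. Note the assumption it relies on: A, B, C, and D are the only possible IDs, so "exactly three IDs and no D" means "exactly A, B, and C".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table and rows modelled on the A/B/C/D example.
conn.executescript("""
CREATE TABLE user_ids (user_id INTEGER, id_type TEXT);
INSERT INTO user_ids VALUES
  (1, 'A'), (1, 'B'), (1, 'C'),
  (2, 'A'), (2, 'B'), (2, 'C'), (2, 'D'),
  (3, 'A'), (3, 'B'),
  (4, 'A'), (4, 'B'), (4, 'C'),
  (5, 'A'), (5, 'B'), (5, 'D');
""")

query = """
SELECT user_id
FROM user_ids
WHERE user_id NOT IN (SELECT user_id FROM user_ids WHERE id_type = 'D')
GROUP BY user_id
HAVING COUNT(DISTINCT id_type) = 3   -- exactly three different IDs remain
ORDER BY user_id
"""
rows = conn.execute(query).fetchall()
print(rows)        # users holding A, B and C but not D
print(len(rows))   # how many such users there are
```

The important difference from the query in the question is that the exclusion removes whole users rather than individual rows; filtering rows first lets a user with extra IDs still reach the target count.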

Return all Fields and Distinct Rows

What's the best way to do this when looking for distinct rows?
SELECT DISTINCT name, address
FROM table;
I still want to return all fields, ie address1, city etc but not include them in the DISTINCT row check.
Then you have to decide what to do when there are multiple rows with the same value for the column you want the distinct check to run against, but with different values in the other columns. In that case, how does the query processor know which of the multiple values in the other columns to output? If you don't care, just write a GROUP BY on the distinct column, with Min() or Max() on all the other ones.
EDIT: I agree with comments from others that as long as you have multiple dependent columns in the same table (e.g., Address1, Address2, City, State), this approach is going to give you mixed (and therefore inconsistent) results. If each column attribute in the table is independent (if addresses are all in an Address table and only an AddressId is in this table), then it's not as significant an issue, because at least all the columns from a join to the Address table will come from the same address. But you are still getting a more or less random selection of one of the set of multiple addresses.
This will not mix and match your city, state, etc. and should give you the last one added even:
select b.*
from (
select max(id) id, Name, Address
from table a
group by Name, Address) as a
inner join table b
on a.id = b.id
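Here is a minimal sketch of that MAX(id) join on SQLite via Python's `sqlite3`. The table name `people` (standing in for `table`, which is a reserved word) and the rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table and rows; two rows share the same name and address.
conn.executescript("""
CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, address TEXT, city TEXT);
INSERT INTO people VALUES
  (1, 'abc', '123', 'def'),
  (2, 'abc', '123', 'hij'),
  (3, 'xyz', '9',   'klm');
""")

query = """
SELECT b.*
FROM (SELECT MAX(id) AS id, name, address   -- newest id per name/address
      FROM people
      GROUP BY name, address) AS a
INNER JOIN people AS b ON a.id = b.id       -- fetch that row's full contents
ORDER BY b.id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)  # one complete row per distinct (name, address)
```

Every column in a result row comes from the same source row, so city and the other fields cannot be mixed between duplicates.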
When you have a mixed set of fields, some of which you want to be DISTINCT and others that you just want to appear, you require an aggregate query rather than DISTINCT. DISTINCT is only for returning single copies of identical fieldsets. Something like this might work:
SELECT name,
GROUP_CONCAT(DISTINCT address) AS addresses,
GROUP_CONCAT(DISTINCT city) AS cities
FROM the_table
GROUP BY name;
The above will get one row for each name. addresses contains a comma-delimited string of all the distinct addresses for that name. cities does the same for all the cities.
However, I don't see how the results of this query are going to be useful. It will be impossible to tell which address belongs to which city.
If, as is often the case, you are trying to create a query that will output rows in the format you require for presentation, you're much better off accepting multiple rows and then processing the query results in your application layer.
I don't think you can do this because it doesn't really make sense.
name | address | city | etc...
abc | 123 | def | ...
abc | 123 | hij | ...
if you were to include city, but not have it as part of the distinct clause, the value of city would be unpredictable unless you did something like Max(city).
You can do
SELECT Name, Address, MAX(Address1), MAX(City)
FROM table
GROUP BY Name, Address
Use @JBrooks's answer below. He has a better answer.
If you're using SQL Server 2005 or above you can use the ROW_NUMBER() function. This will get you the row with the lowest ID for each name. If you want to 'group' by more columns, add them to the PARTITION BY clause of the ROW_NUMBER().
SELECT id, Name, Address, ...
FROM (select id, Name, Address, ...,
ROW_NUMBER() OVER (PARTITION BY Name ORDER BY id) AS RowNo
from table) sub
WHERE RowNo = 1

SQL Query to find if different values exist for a column

I have a temporary table with three columns
pay_id,
id_client_grp,
id_user
Basically, I want to ensure that all rows in this table have the same client group and the same id_user. If not, I want to know which pay_id is the culprit and throw an error to the user.
Can somebody help me with a query.
Thanks,
Rishi
When you say 'culprit,' I assume you mean the pay_id(s) that are not like the others, assuming there is a majority.
The problem is all of the pay_id's could potentially become culprits once your SELECT COUNT(DISTINCT id_client_grp, id_user) returns > 1 record, if there is a relatively even distribution. It is difficult to program for this scenario, since you will need to determine what exactly a majority is.
Your best bet will be to return all distinct combinations of those 3 fields, then decide where to go from there based on your business logic.
So could this question be asked like this:
If I wanted to add a unique index on my table across the three columns: client group, id user, pay id, identify those that break the unique condition where we have non unique pay id for a client group and id user??
select a.id_client_grp, a.id_user, count(1) as distinct_pay_ids from (
/* this should return 1 row per client group and user, */
/* if the pay id is the same for all */
select id_client_grp, id_user, pay_id, count(1) as count
from table t
group by id_client_grp, id_user, pay_id ) a
group by a.id_client_grp, a.id_user
/* if we have more than one row per client group and user, then we have a dupe, so report them all */
having count (1) > 1
If you want all the rows to have the same values for some set of columns (your question is not entirely clear to me as to what you want to be the same):
Do you know going in WHICH pay_id, id_client_grp all the rows should be? Or do you not care, as long as they are all the same?
If you know the values you are looking for, simply test for rows that are not set to those desired values
Select distinct id_user
From tempTable
Where pay_id <> @PayIdValue
Or id_client_grp <> @ClientGroupIDValue
If you don't care, and just want them all to be the same, and they're not, then you need to specify which of the more than one set of values IS the "culprit" as you said...
If you want some other question answered, please explain more clearly...
Based on your comment, then, to determine if there is more than one id_client_grp, pay_id
Select Count(Distinct id_client_grp, pay_id)
From tempTable
If this = 1 then every record has the same values for these 2 fields. Any other value indicates that there is more than one set of distinct values in the table.
SELECT DISTINCT p.pay_id,
t.[count]
FROM rishi_table p
INNER JOIN ( SELECT id_client_grp, id_user, COUNT(*) As 'count'
FROM rishi_table
GROUP BY id_client_grp, id_user
HAVING COUNT(*) > 1 ) t
ON p.id_client_grp = t.id_client_grp AND p.id_user = t.id_user
basically create a set with the dupes, and bounce that against the main table to get your offending list.
SELECT DISTINCT id_client_grp, id_user
should let you do something like
IF @@ROWCOUNT > 1 THEN
...
Or possibly SELECT COUNT(DISTINCT id_client_grp, id_user) ...
but that's more vendor-dependent as to its availability and proper syntax.
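Where multi-column COUNT(DISTINCT ...) is not available (SQLite, for one, rejects it), counting the rows of a SELECT DISTINCT subquery is a more portable way to run the same check. A minimal sketch using SQLite through Python's `sqlite3`; the table name `temp_pay` and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical temp table; pay_id 3 belongs to a different client group.
conn.executescript("""
CREATE TABLE temp_pay (pay_id INTEGER, id_client_grp INTEGER, id_user INTEGER);
INSERT INTO temp_pay VALUES (1, 10, 7), (2, 10, 7), (3, 11, 7);
""")

# Portable replacement for COUNT(DISTINCT id_client_grp, id_user).
combo_count = conn.execute("""
SELECT COUNT(*)
FROM (SELECT DISTINCT id_client_grp, id_user FROM temp_pay) AS combos
""").fetchone()[0]

# When the count is above 1, list each combination with its row count
# so the minority ("culprit") combination is easy to spot.
combos = conn.execute("""
SELECT id_client_grp, id_user, COUNT(*) AS n
FROM temp_pay
GROUP BY id_client_grp, id_user
ORDER BY n DESC, id_client_grp, id_user
""").fetchall()

if combo_count > 1:
    print("inconsistent rows:", combos)
```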