SQL: find duplicates, with a different field - sql

I have to find duplicates in an Access table, where one field is different.
I'll try to explain: assuming to have this data set
ID Country CountryB Customer
====================================================
1 Italy Austria James
2 Italy Austria James
3 USA Austria James
I have to find all the records with duplicated CountryB and Customer, but with different Country.
For instance, with the data above, the ID 1 and 2 are NOT duplicated (as they are from the same Country), while 1 and 3 (or 2 and 3) are.
The "best" query I got is the following one:
SELECT COUNT(*), CountryB, Customer FROM
(SELECT MIN(ID) as MinID, Country, CountryB, Customer FROM myTable GROUP BY Country, CountryB, Customer)
GROUP BY CountryB, Customer
HAVING COUNT(*)>1
I'm not sure if this is the smartest option, anyhow.
Furthermore, since I need to "mark" all the duplicates, I have to do something more, like this:
SELECT ID, a.Country, a.CountryB, a.Customer FROM myTable a
INNER JOIN
(
SELECT COUNT(*), CountryB, Customer FROM
(SELECT MIN(ID) as MinID, Country, CountryB, Customer FROM myTable GROUP BY Country, CountryB, Customer)
GROUP BY CountryB, Customer
HAVING COUNT(*)>1
) dt
ON a.Country=dt.Country and a.CountryB=dt.CountryB and a.Customer=dt.Customer
Any suggestion this approach is greatly appreciated.

I finally found a solution.
The correct solution is in this answer:
SELECT DISTINCT HAVING Count unique conditions
Adapted with this version, since I'm using Access 2010:
Count Distinct in a Group By aggregate function in Access 2007 SQL
Therefore, in my example table above, I can use this query to find duplicate records:
SELECT CountryB, Customer, Count(cd.Country)
FROM (SELECT DISTINCT Country, CountryB, Customer FROM myTable) AS cd
GROUP BY CountryB, Customer
HAVING COUNT(*) > 1
or this query to find all the IDs of the duplicated records:
SELECT ID FROM myTable a INNER JOIN
(
SELECT CountryB, Customer, Count(cd.Country)
FROM (SELECT DISTINCT Country, CountryB, Customer FROM myTable) AS cd
GROUP BY CountryB, Customer
HAVING COUNT(*) > 1
) dt
ON a.CountryB=dt.CountryB AND a.Customer=dt.Customer

Related

SQL: Select inside select?

I have a table of car accident in a major city, and the structure is like:
accident_table has the following columns:
id, caseno, date_of_occurrence, street, iucr, primary_type,
description, district, community_area, year, updated_on
I want to write a query that finds the street which has the most accidents for each district(I think the street count for each street is the number of accident that happened on that street).
Here is what I have:
SELECT DISTINCT on (street)
street,
district
FROM
(
SELECT
count(street) as street_cnt,
street,
district
FROM accident_table
)
WHERE street_count = (SELECT max(street_cnt))
It did not give me syntax error, but timed out, so I guess it took too long to run.
What's wrong and how to fix it?
Thanks,
Philip
First aggregate to get the count of accidents for each street. Then use the rank() window function to rank the streets within a district by the count of accidents in them. Then only select the ones that were ranked at the top.
SELECT x.district,
x.street,
x.accidents
FROM (SELECT a.district,
a.street,
count(*) accidents,
rank() OVER (PARTITION BY a.district
ORDER BY count(*) DESC) r
FROM accident_table a
GROUP BY a.district,
a.street) x
WHERE x.r = 1;
Your code looks like Postgres. In that database, you can express this without a subquery:
SELECT DISTINCT ON (a.district)
a.district, a.street, COUNT(*) as accidents
FROM accident_table a
GROUP BY a.district, a.street
ORDER BY a.district, COUNT(*) DESC;
That said, your problem is performance, which is probably not affected by subqueries. An index on accident_table(district, street) might help performance.

How to create an additional column with the percentages related to a count distinct statement

I'm trying to query each distinct medical speciality (e.g. oncologist, pediatrician, etc.) in a table and then count the number of times a claim (claim_id) is linked to it, which I've done using this:
select distinct specialization, count(distinct claim_id) AS Claim_Totals
from table1
group by specialization
order by Claim_Totals DESC
However, I also want to include an additional column which lists the % that each speciality makes up in the table (based on the number of claim_id related to it). So for instance, if there were 100 total claims and "cardiologist" had 25 claim_id records related to it, "oncologist" had 15, "general surgeon" had 10, and so forth, I want the output to look like this:
specialization | Claims_Totals | PERCENTAGE
___________________________________________
cardiologist 25 25%
oncologist 15 15%
general surgeon 10 10%
Could do this? I'm not familiar with Barbaros's syntax. If that works its more concise and better.
select specialization, count(distinct claim_id) AS Claim_Totals, count(distinct claim_id)/total_claims
from table1
INNER JOIN ( SELECT COUNT(DISTINCT claim_id)*1.0000 total_claims AS total_claims
FROM table1 ) TMP
ON 1 = 1
group by specialization
order by Claim_Totals DESC
select specialization,
count(distinct claim_id) AS claim_by_spec,
count(distinct claim_id)/
( SELECT COUNT(DISTINCT claim_id)*1.0000
FROM table1 ) AS percentage_calc
from table1
group by specialization
order by Claim_Totals DESC
You can use sum(count(distinct)) over() to get the overall claims and use it in the denominator to get the percentage.
select specialization
,count(distinct claim_id) AS Claim_Totals
,round(100*count(distinct claim_id)/sum(count(distinct claim_id)) over(),3) as percentage
from table1
group by specialization
You can use
,concat_ws('',count(distinct claim_id),'%') as percentage
or
,concat(count(distinct claim_id),'%') as percentage
as added to the select list's tail
Btw, distinct before specialization in the select list is redundant, since already included in the group by list.
Because you are using count(distinct), window functions are less useful. You can try:
select t1.specialization,
count(distinct t1.claim_id) AS Claim_Totals,
count(distinct t1.claim_id) / tt1.num_claims
from table1 t1 cross join
(select count(distinct claim_id) as num_claims
from table1
) tt1
group by t1.specialization
order by Claim_Totals DESC

Show duplicate rows(all columns of that row) where all columns are duplicate except one column

In below table, I need to select duplicate records where all columns are duplicate except Customer Type and Price for a particular week.
For e.g
Week Customer Product Customer Type Price
1 Alex Cycle Consumer 100
1 Alex Cycle Reseller 101
2 John Motor Consumer 200
3 John Motor Consumer 200
3 John Motor Reseller 201
I am using below query but this query doesn't show me both costumer type, it just shows me consumer count(*) for a combination.
select Week, Customer, product, count(distinct Customer Type)
from table
group by Week, Customer, product
having count(distinct Customer Type) > 1
I would like to see below result, that shows me duplicate values and not just the count(*) of duplicate row. I am trying to see customers assigned to multiple customer types in a particular week for a product and at the same time show me all columns. It doesn't matter if the price is different.
Week Customer Product Customer Type Price
1 Alex Cycle Consumer 100
1 Alex Cycle Reseller 101
3 John Motor Consumer 200
3 John Motor Reseller 201
Thanks
Shaki
WITH CustomerDistribution_CTE (WeekC ,CustomerC, ProductC)
AS
(
select Week, Customer, product
from Your_Table_Name group by Week, Customer,
product having count(distinct CustomerType) > 1
)
SELECT Y.*
FROM CustomerDistribution_CTE C
inner join Your_Table_Name Y on C.WeekC =Y.Week
and C.CustomerC =Y.Customer and C.productC =Y.product
Note :Please replace "Your_Table_Name" with exact table name and Try.
One way to achieve this, using generic SQL, is to use a "derived table" like this:
select x.*
from tablex x
inner join (
select Week, Customer, Product
from tablex
group by Week, Customer, Product
having count(*) > 1
) d on x.Week = d.Week and x.Customer = d.Customer and x.Product = d.Product
You can do that by using DISTINCT like
select DISTINCT Customer,Product,Customer_Type,Price from Your_Table_Name
will look for DISTINCT combination.
Note: This query if of SQL Server
From the expected result that you have pasted, it looks like you are not concerned about the week.
If you have a ID (incremental PK), it would be much simpler like below
select * from table where ID not in
(select max(ID) from table group by Customer, Product, CustomerType having count(*) > 1 )
This is tested on MySQL. Do you have a ID column?
In case you don't have a ID column, try the below:
select max(week) week, Customer, Product, CustomerType, max(price) from device group by Customer, Product, CustomerType;
I have not verified this one.
This will return your expected result set:
select *
from table
-- Teradata syntax to filter the result of an OLAP-function
-- (similar to HAVING after GROUP BY)
qualify
count(*)
over (partition by Week, Customer, product) > 1
For other DBMSes you will need to nest your query:
select *
from
(
select ...,
count(*)
over (partition by Week, Customer, product) as cnt
from table
) as dt
where cnt > 1
Edit:
After re-reading your description above Select might be not exactly what you want, because it will also return rows with a single type. Then switch to:
select *
from table
-- Teradata syntax to filter the result of an OLAP-function
-- (similar to HAVING after GROUP BY)
qualify -- at least two different types:
min(Customer_Type) over (partition by Week, Customer, product)
<> max(Customer_Type) over (partition by Week, Customer, product)

Find not unique rows in Oracle SQL

I have a question which looks easy but I can't figure it out.
I have the following:
Name Zipcode
ER 5354
OL 1234
AS 1234
BH 3453
BH 3453
HZ 1234
I want to find those rows where the ID does not define clearly one row.
So here I want to see:
OL 1234
AS 1234
HZ 1234
Or simply the zipcode enough.
I am sorry I forget to mention an important part. If the name is the same its not a problem, only if there are different names for the same zipcode.
So this means: BH 3453 does not return
I think this is what you want
select zipcode
from yourTable
group by zipcode
having count(*) > 1
It selects the zipcodes associated to more than one record
to answer your updated question:
select zipcode
from
(
select name, zipcode
from yourTable
group by name, zipcode
)
group by zipcode
having count(*) > 1
should do it. It might not be optimal in terms of performance in which case you could use window functions as suggested by #a1ex07
Try this:
select yt.*
from YOUR_TABLE yt
, (select zipcode
from YOUR_TABLE
group by zipcode
having count(*) > 1
) m
where yt.zipcode = m.zipcode
If you need just zipcode, use vc 74's solution. For all columns , solution based on window functions supposedly outperforms self join approach:
SELECT a.zipcode, a.name
FROM
(
SELECT zipcode, name, count(1) over(partition by zipcode) as cnt
FROM your_table
)a
WHERE a.cnt >1

Find duplicate records in a table using SQL Server

I am validating a table which has a transaction level data of an eCommerce site and find the exact errors.
I want your help to find duplicate records in a 50 column table on SQL Server.
Suppose my data is:
OrderNo shoppername amountpayed city Item
1 Sam 10 A Iphone
1 Sam 10 A Iphone--->>Duplication to be detected
1 Sam 5 A Ipod
2 John 20 B Macbook
3 John 25 B Macbookair
4 Jack 5 A Ipod
Suppose I use the below query:
Select shoppername,count(*) as cnt
from dbo.sales
having count(*) > 1
group by shoppername
will return me
Sam 2
John 2
But I don't want to find duplicate just over 1 or 2 columns. I want to find the duplicate over all the columns together in my data. I want the result as:
1 Sam 10 A Iphone
with x as (select *,rn = row_number()
over(PARTITION BY OrderNo,item order by OrderNo)
from #temp1)
select * from x
where rn > 1
you can remove duplicates by replacing select statement by
delete x where rn > 1
SELECT OrderNo, shoppername, amountPayed, city, item, count(*) as cnt
FROM dbo.sales
GROUP BY OrderNo, shoppername, amountPayed, city, item
HAVING COUNT(*) > 1
SQL> SELECT JOB,COUNT(JOB) FROM EMP GROUP BY JOB;
JOB COUNT(JOB)
--------- ----------
ANALYST 2
CLERK 4
MANAGER 3
PRESIDENT 1
SALESMAN 4
Just add all fields to the query and remember to add them to Group By as well.
Select shoppername, a, b, amountpayed, item, count(*) as cnt
from dbo.sales
group by shoppername, a, b, amountpayed, item
having count(*) > 1
To get the list of multiple records use following command
select field1,field2,field3, count(*)
from table_name
group by field1,field2,field3
having count(*) > 1
Try this instead
SELECT MAX(shoppername), COUNT(*) AS cnt
FROM dbo.sales
GROUP BY CHECKSUM(*)
HAVING COUNT(*) > 1
Read about the CHECKSUM function first, as there can be duplicates.
Try this
with T1 AS
(
SELECT LASTNAME, COUNT(1) AS 'COUNT' FROM Employees GROUP BY LastName HAVING COUNT(1) > 1
)
SELECT E.*,T1.[COUNT] FROM Employees E INNER JOIN T1 ON T1.LastName = E.LastName
with x as (
select shoppername,count(shoppername)
from sales
having count(shoppername)>1
group by shoppername)
select t.* from x,win_gp_pin1510 t
where x.shoppername=t.shoppername
order by t.shoppername
First of all, I doubt that the result it not accurate? Seem like there are Three 'Sam' from the original table. But it is not critical to the question.
Then here we come for the question itself. Based on your table, the best way to show duplicate value is to use count(*) and Group by clause. The query would look like this
SELECT OrderNo, shoppername, amountPayed, city, item, count(*) as RepeatTimes FROM dbo.sales GROUP BY OrderNo, shoppername, amountPayed, city, item HAVING COUNT(*) > 1
The reason is that all columns together from your table uniquely identified each record, which means the records will be considered as duplicate only when all values from each column are exactly the same, also you want to show all fields for duplicate records, so the group by will not miss any column, otherwise yes because you can only select columns that participate in the 'group by' clause.
Now I would like to give you any example for With...Row_Number()Over(...), which is using table expression together with Row_Number function.
Suppose you have a nearly same table but with one extra column called Shipping Date, and the value may change even the rest are the same. Here it is:
OrderNo shoppername amountpayed city Item Shipping Date
1 Sam 10 A Iphone 2016-01-01
1 Sam 10 A Iphone 2016-02-02
1 Sam 5 A Ipod 2016-03-03
2 John 20 B Macbook 2016-04-04
3 John 25 B Macbookair 2016-05-05
4 Jack 5 A Ipod 2016-06-06
Notice that row# 2 is not a duplicate one if you still take all columns as a unit. But what if you want to treat them as duplicate as well in this case? You should use With...Row_Number()Over(...), and the query would look like this:
WITH TABLEEXPRESSION
AS
(SELECT *,ROW_NUMBER() OVER (PARTITION BY OrderNo, shoppername, amountPayed, city, item ORDER BY [Shipping Date] as Identifier) --if you consider the one with late shipping date as the duplicate
FROM dbo.sales)
SELECT * FROM TABLEEXPRESSION
WHERE Identifier !=1 --or use '>1'
The above query will give result together with Shipping Date, for example:
OrderNo shoppername amountpayed city Item Shipping Date Identifier
1 Sam 10 A Iphone 2016-02-02 2
Note this one is different from the one with 2016-01-01, and the reason why 2016-02-02 has been filtered out is PARTITION BY OrderNo, shoppername, amountPayed, city, item ORDER BY [Shipping Date] as Identifier, and Shipping Date is NOT one of the column that need to be took care of for duplicate records, which means the one with 2016-02-02 still could be a perfect result for your question.
Now summarize it little bit, using count(*) and Group by clause together is the best choice when you only want to show all columns from Group byclause as the result, otherwise you will miss the columns that do not participate in group by.
While For With...Row_Number()Over(...), it is suitable in every scenario that you want to find duplicate records, however, it is little bit complicated to write the query and little bit over engineered compared to the former one.
If your purpose is to delete duplicate records from table, you have to use the later WITH...ROW_NUMBER()OVER(...)...DELETE FROM...WHERE one.
Hope this helps!
You can use below methods to find the output
with Ctec AS
(
select *,Row_number() over(partition by name order by Name)Rnk
from Table_A
)
select Name from ctec
where rnk>1
select name from Table_A
group by name
having count(*)>1
Select *
from dbo.sales
group by shoppername
having(count(Item) > 1)
Select EventID,count() as cnt
from dbo.EventInstances
group by EventID
having count() > 1
The following is running code:
SELECT abnno, COUNT(abnno)
FROM tbl_Name
GROUP BY abnno
HAVING ( COUNT(abnno) > 1 )