SQL - Way to find duplicate fields within multiple rows?

I'm wondering if there's a way to return duplicates of parts of rows.
IDTable setup:
ID# | Customer | EventID#
1 | Steve | 123
2 | Steve | 123
3 | John | 987
4 | John | 924
Since Steve and 123 appear twice together, I want to treat that as a 'duplicate' even though they have two different ID#'s. And if there's a 'duplicate', ideally I'd like to only return columns: ID#, Customer & EventID#. So for the above IDTable example, only return:
1 | Steve | 123
2 | Steve | 123
By running the following, it counts each ID + Customer + EventID# separately and returns all Count values as 1 (I'm using SQL Server 2008):
SELECT ID#, Customer, EventID#, COUNT({fn CONCAT(Customer,EventID#)})
FROM IDTable
GROUP BY ID#, Customer, EventID#
HAVING COUNT({fn CONCAT(Customer,EventID#)}) > 1
If I take out the ID# from the Select, it'll work, but then we won't know what the ID#'s are.
EDIT:
I'm joining in the select columns from other tables. I initially left those out for simplicity's sake, but when trying to apply the solutions below I'm getting confused. Apologies! Here's what is more in line with what I'm using:
SELECT A.ID#, C.Customer, E.EventID#
FROM IDTable A
INNER JOIN CustomerTable C
ON C.AccountID = A.AccountID
INNER JOIN EventTable E
ON E.AccountType = C.AccountType
WHERE C.StatusID = 'Active'

Self-Join should do the trick:
SELECT A.ID#, A.Customer, A.EventID#
FROM IDTable A
INNER JOIN IDTable A2 ON A.Customer = A2.Customer
AND A.EventID# = A2.EventID#
AND A.ID# <> A2.ID#
Edit for your joins:
You can still use a self-join, just with derived tables like so:
SELECT A.ID#, A.Customer, A.EventID#
FROM (SELECT ID#, Customer, EventID#
FROM IDTable A
INNER JOIN CustomerTable C ON C.AccountID = A.AccountID
INNER JOIN EventTable E ON E.AccountType = C.AccountType
WHERE C.StatusID = 'Active') A
INNER JOIN (SELECT ID#, Customer, EventID#
FROM IDTable A
INNER JOIN CustomerTable C ON C.AccountID = A.AccountID
INNER JOIN EventTable E ON E.AccountType = C.AccountType
WHERE C.StatusID = 'Active') A2 ON A.Customer = A2.Customer
AND A.EventID# = A2.EventID#
AND A.ID# <> A2.ID#
And cleaner with #TEMP:
SELECT A.ID#, C.Customer, E.EventID#
INTO #TEMP
FROM IDTable A
INNER JOIN CustomerTable C
ON C.AccountID = A.AccountID
INNER JOIN EventTable E
ON E.AccountType = C.AccountType
WHERE C.StatusID = 'Active'
;
SELECT A.ID#, A.Customer, A.EventID#
FROM #TEMP A
INNER JOIN #TEMP A2 ON A.Customer = A2.Customer
AND A.EventID# = A2.EventID#
AND A.ID# <> A2.ID#

Most versions of SQL support window functions. The easiest way to solve this is:
select id#, customer, eventid#
from (select i.*, count(*) over (partition by customer, eventid#) as cnt
from idtable i
) i
where cnt > 1;
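
As a rough, runnable sketch of the window-function approach above, the snippet below loads the question's sample rows into an in-memory SQLite database through Python's sqlite3 module. The table and column names (id_table, id, customer, event_id) are stand-ins chosen for the illustration because `#` is not convenient in SQLite identifiers, and it assumes a SQLite build with window-function support (3.25 or later).

```python
import sqlite3


def find_duplicates(rows):
    """Return rows whose (customer, event_id) pair occurs more than once."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical stand-in for IDTable; the names are assumptions.
    conn.execute(
        "CREATE TABLE id_table (id INTEGER, customer TEXT, event_id INTEGER)"
    )
    conn.executemany("INSERT INTO id_table VALUES (?, ?, ?)", rows)
    result = conn.execute(
        """
        SELECT id, customer, event_id
        FROM (SELECT i.*,
                     COUNT(*) OVER (PARTITION BY customer, event_id) AS cnt
              FROM id_table i) AS counted
        WHERE cnt > 1
        ORDER BY id
        """
    ).fetchall()
    conn.close()
    return result


sample = [(1, "Steve", 123), (2, "Steve", 123), (3, "John", 987), (4, "John", 924)]
duplicates = find_duplicates(sample)
```

With the sample data, only the two Steve / 123 rows come back, each with its own id.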

SELECT i.*
FROM
IDTable i
INNER JOIN
(SELECT Customer, EventID#
FROM IDTable
GROUP BY Customer, EventID#
HAVING COUNT(*) > 1) t
ON i.Customer = t.Customer
AND i.EventId# = t.EventId#
There may be other ways of doing this, and if you tag your specific RDBMS (sql-server, oracle, mysql etc.) I am sure you will get additional answers. This is one way to do it: use your query (without the ID column) to identify the duplicates, and then match the result back to your original table via an inner join.

Starting from @Aaron_D's answer, I would not join subqueries; instead you can do:
SELECT A.ID#, C.Customer, E.EventID#
FROM IDTable A
INNER JOIN IDTable B ON A.ID# = B.ID#
AND A.AccountID <> B.AccountID
INNER JOIN CustomerTable C
ON C.AccountID = A.AccountID
INNER JOIN EventTable E
ON E.AccountType = C.AccountType
WHERE C.StatusID = 'Active'
Since both CustomerTable and EventTable are ultimately derived from AccountID, this should work and be faster.

Related

How to make LEFT JOIN with row having max date?

I have two tables in Oracle DB
Person (
id
)
Bill (
id,
date,
amount,
person_id
)
I need to get each person and the amount from their last bill, if one exists.
I am trying to do it this way:
SELECT
p.id,
b.amount
FROM Person p
LEFT JOIN Bill b
ON b.person_id = p.id AND b.date = (SELECT MAX(date) FROM Bill WHERE person_id = 1)
WHERE p.id = 1;
But this query works only with INNER JOIN. With LEFT JOIN it throws ORA-01799: a column may not be outer-joined to a subquery.
How can I get the amount from the last bill using a left join?

Please try the following, which avoids outer-joining to a subquery:
SELECT
p.id,
b.amount
FROM Person p
LEFT JOIN(select * from Bill where date =
(SELECT MAX(date) FROM Bill b1 WHERE person_id = 1)) b ON b.person_id = p.id
WHERE p.id = 1;

What you are looking for is a way to identify each person's latest record in Bill, and to join only to that one. One way is to use row_number:
select * from person p
left join (select b.*,
row_number() over (partition by person_id order by date desc) as seq_num
from Bill b) b
on p.id = b.person_id
and seq_num = 1
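
A minimal sketch of the row_number idea, run against SQLite from Python so it can be tried without an Oracle instance. The sample bills, the lowercase table names, and the bill_date column (renamed from date to avoid a keyword clash) are assumptions made for the illustration, and a SQLite build with window functions (3.25 or later) is assumed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical sample data: person 1 has three bills, person 2 has none.
    CREATE TABLE person (id INTEGER PRIMARY KEY);
    CREATE TABLE bill (
        id INTEGER PRIMARY KEY,
        bill_date TEXT,
        amount INTEGER,
        person_id INTEGER
    );
    INSERT INTO person VALUES (1), (2);
    INSERT INTO bill VALUES (10, '2017-01-01', 100, 1),
                            (11, '2017-03-01', 300, 1),
                            (12, '2017-02-01', 200, 1);
    """
)

# Number each person's bills from newest to oldest, then outer-join
# only the row numbered 1 so that people without bills are kept.
latest_bills = conn.execute(
    """
    SELECT p.id, b.amount
    FROM person p
    LEFT JOIN (SELECT bl.*,
                      ROW_NUMBER() OVER (PARTITION BY person_id
                                         ORDER BY bill_date DESC) AS seq_num
               FROM bill bl) b
      ON b.person_id = p.id AND b.seq_num = 1
    ORDER BY p.id
    """
).fetchall()
conn.close()
```

Here person 1 is paired with the 300 bill from 2017-03-01, and person 2 still appears with a NULL amount, which is the point of keeping the LEFT JOIN.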

Oracle does not allow a subquery in the ON clause of an outer join.
Instead, move the subquery into a derived table and LEFT JOIN to that.
Not tested, but this should work:
SELECT
p.id,
b.amount
FROM Person p
LEFT JOIN (
SELECT person_id, amount FROM Bill
WHERE date = (SELECT MAX(date) FROM Bill WHERE person_id = 1)) b
ON b.person_id = p.id
WHERE p.id = 1;
I'm not quite sure why you would want to filter for the date though.
Simply filtering for the person_id should do the trick

You should join Person to the max date per person_id, and then join Bill on that date:
select Person.id, b.amount
from Person
left join (
select person_id, max(date) as max_date
from bill
group by person_id ) t on t.person_id = Person.id
left join bill b on b.person_id = t.person_id and b.date = t.max_date

Hey you can do like this
SELECT
p.id,
b.amount
FROM Person p
LEFT JOIN Bill b
ON b.person_id = p.id AND b.date = (SELECT max(date) FROM Bill WHERE person_id = 1)
WHERE p.id = 1

SELECT
p.id,
b.amount
FROM Person p
LEFT JOIN Bill b
ON b.person_id = p.id
WHERE b.date = (SELECT max(date) FROM Bill sb WHERE sb.person_id = p.id)
OR b.person_id IS NULL;

SELECT
p.id,
c.amount
FROM Person p
LEFT JOIN (select b.person_id as personid,b.amount as amount from Bill b where b.date1= (select max(date1) from Bill where person_id=1)) c
ON c.personid = p.id
WHERE p.id = 1;

try this
select * from person p
left join (select person_id, MAX(amount) KEEP (DENSE_RANK FIRST ORDER BY date DESC) as amount
from Bill
group by person_id) b
on p.id = b.person_id

I use GREATEST() function in join condition:
SELECT
p.id,
b.amount
FROM Person p
LEFT JOIN Bill b
ON b.person_id = p.id
AND b.date = GREATEST(b.date)
WHERE p.id = 1

This allows you to grab the whole row if necessary and grab the top x rows
SELECT p.id
,b.amount
FROM person p
LEFT JOIN
(
SELECT * FROM
(
SELECT person_id, amount, date
,ROW_NUMBER() OVER (PARTITION BY person_id ORDER BY date DESC) AS row_num
FROM bill
)
WHERE row_num = 1
) b ON p.id = b.person_id
WHERE p.id = 1
;

SQL Query Refactoring Possible?

Assume this is part of the result set
AND
Assume Dob,Name,Adress,Postcode,Telephone,EmailAddress are the same for each ID - and these columns are used in the group by clause
Sample data:
ID date Amount
---------------------------
12345 1/1/2017 100
12345 1/2/2017 200
12345 1/3/2017 300
With the outer query included I get the following which is what I want to achieve
ID date Amount
--------------------------
12345 1/1/2017 600
I want to confirm if there's a better way in terms of performance for this code. I feel like I could do a join, or a shorter version of the query but I can't get the logic right.
When I remove the outer query and apply the MIN and SUM aggregate functions inside, the results don't group correctly: more than one row is returned for each id.
Also is it possible for a shorter group by?
Here's the partial version of the final code
SELECT
a.id, a.dob, a.claim_id,
a.name, a.Address, a.postcode,
a.Telephone, a.EmailAdress,
MIN(a.date), SUM(a.amount) as Amount
FROM
(SELECT DISTINCT
i.date, i.id, cl.name, cl.address,
cl.postcode, cl.telephone, cl.dob,
cl.EmailAdress, i.amount, cm.claim_id
FROM
testdb.dbo.invoice i
JOIN
testdb.dbo.claim cm with (nolock) ON i.id = cm.id
JOIN
testdb.dbo.clients cl with (nolock) ON cm.clientid = cl.id
JOIN
(....) c ON i.id = c.id
WHERE
.....) AS a
GROUP BY
a.id, a.dob, a.claim_id, a.name, a.Address,
a.postcode, a.Telephone, a.EmailAdress
ORDER BY
1

SELECT DISTINCT
i.date ,i.id ,cl.name ,cl.address
,cl.postcode ,cl.telephone,cl.dob
,cl.EmailAdress ,i.amount ,cm.claim_id
FROM
testdb.dbo.invoice i
JOIN
testdb.dbo.claim cm with (nolock) on i.id = cm.id
JOIN
testdb.dbo.clients cl with (nolock) on cm.clientid = cl.id
JOIN
( .... ) c on i.id = c.id
WHERE
.....
GROUP BY
i.id,i.dob,cm.claim_id,cl.name,cl.Address,cl.postcode,
cl.Telephone,cl.EmailAdress
ORDER BY 1
This is pretty much the previous code with the outer query removed. I'm not sure why it returned multiple records per id earlier, or what differs now, but it no longer does so with this code.

Why not do the calculation inline and then join the detail tables afterwards,
something like:
SELECT
a.id, a.dob, claimDetails.claim_id,
a.name, a.Address, a.postcode,
a.Telephone, a.EmailAdress,
claimDetails.FirstDate, claimDetails.Amount
FROM a
LEFT JOIN
(
SELECT i.id, cm.claim_id, MIN(i.date) as FirstDate, SUM(i.amount) as Amount
FROM testdb.dbo.invoice i
JOIN testdb.dbo.claim cm ON i.id = cm.id
GROUP BY i.id, cm.claim_id
) claimDetails
ON claimDetails.id = a.id
LEFT JOIN Clients....

Getting oldest Date SQL Complexity

I have a problem that I cannot solve in a SQL script alone, without resorting to application code.
I have 2 tables
Person
ID Name Type
1 A A1
2 B A2
3 C A3
4 D A4
5 E A6
PersonHomes
HOMEID Location PurchaseDate PersonID
1 CA 20160101 1
2 CT 20160202 1
3 DT 20160101 2
4 BT 20170102 3
5 CT 20160303 1
6 CA 20160101 2
PersonID is foreign key of Person Table
There are no other rows in the tables
So, we have to show detail of EACH person WITH home
The rule to write output is
IF Person has SINGLE entry in PersonHomes then use it
IF Person has MORE than ONE entry in PersonHomes then we have to look at purchase date, IF they are different then USE the PersonHomes ROW with OLDEST date in it. AND DELETE OTHER ROWS OF HIM
IF Person has MORE than ONE entry in PersonHomes then we have to look at purchase date, and IF DATES are SAME then USE the ROW with LOWER ID AND DELETE THE OTHER ROWS of HIM
This is very easy to do in code but using SQL it is complex
What I tried was to
WITH PERSON (
SELECT * FROM Person)
SELECT * FROM PERSON
INNER JOIN PersonHomes ON Person.ID = PersonHomes.PersonID
WHERE PersonHomes.PersonID = CASE WHEN (COUNT (*) FROM PersonHomes...)
Then I think I can write SQL function ?
I am stuck, Please help!
SAMPLE OUTPUT for PERSON A
ID NAME Type HOMEID Location PurchaseDate
1 A A1 5 CT 20160303
For PERSON B
ID NAME Type HOMEID Location PurchaseDate
2 B A2 3 DT 20160101
Aiden

It is not so easy to get the desired output with SQL; it takes more than one query.
First I created a temp table that summarizes the homes for each person:
select PersonID, count(*) as HomeCount,
count(distinct PurchaseDate) as PurchaseDateCount,
min(PurchaseDate) as oldestPurchaseDate,
min(HOMEID) as LowerHomeID
into #PersonHomesAbstractTable
from PersonHomes
group by PersonID
Then for the output of your first rule:
select p.ID, p.NAME, p.Type, ph.HOMEID, ph.Location, ph.PurchaseDate from Person p
inner join #PersonHomesAbstractTable a on p.ID = a.PersonID
inner join PersonHomes ph on p.ID = ph.PersonID
where a.HomeCount = 1
For the output of your second rule:
select p.ID, p.NAME, p.Type, ph.HOMEID, ph.Location, ph.PurchaseDate
from Person p inner join #PersonHomesAbstractTable a on p.ID = a.PersonID
inner join PersonHomes ph on p.ID = ph.PersonID and
ph.PurchaseDate = a.oldestPurchaseDate
where a.HomeCount > 1 and a.PurchaseDateCount <> 1
And finally for the output of your third rule:
select p.ID, p.NAME, p.Type, ph.HOMEID, ph.Location, ph.PurchaseDate
from Person p inner join #PersonHomesAbstractTable a on p.ID = a.PersonID
inner join PersonHomes ph on p.ID = ph.PersonID and
ph.HOMEID = a.LowerHomeID
where a.HomeCount > 1 and a.PurchaseDateCount = 1
Of course there are other ways, but this is the one that comes to mind right now.
If you want to delete undesired rows, you can use scripts below:
delete from PersonHomes where HOMEID in
(
select ph.HOMEID from #PersonHomesAbstractTable a
inner join PersonHomes ph on a.PersonID = ph.PersonID and
ph.PurchaseDate <> a.oldestPurchaseDate
where a.HomeCount > 1 and a.PurchaseDateCount <> 1
union
select ph.HOMEID from #PersonHomesAbstractTable a
inner join PersonHomes ph on a.PersonID = ph.PersonID and
ph.HOMEID <> a.LowerHomeID
where a.HomeCount > 1 and a.PurchaseDateCount = 1
)

You seem to have a prioritization query. I would solve this using row_number():
select ph.*
from (select ph.*,
row_number() over (partition by personid
order by purchasedate asc, homeid asc
) as seqnum
from personhomes ph
) ph
where seqnum = 1;
This doesn't actually change the data in the table. Although you say delete, it seems like you just want a result set with one home per person.
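
To check the row_number() prioritization against the question's own PersonHomes rows, here is a small sketch that runs the same ordering (oldest purchase date first, then lowest home id) in SQLite from Python. The snake_case table and column names are stand-ins for the illustration, and window-function support (SQLite 3.25 or later) is assumed.

```python
import sqlite3

# The PersonHomes rows from the question:
# (home_id, location, purchase_date, person_id)
homes = [
    (1, "CA", "20160101", 1),
    (2, "CT", "20160202", 1),
    (3, "DT", "20160101", 2),
    (4, "BT", "20170102", 3),
    (5, "CT", "20160303", 1),
    (6, "CA", "20160101", 2),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person_homes "
    "(home_id INTEGER, location TEXT, purchase_date TEXT, person_id INTEGER)"
)
conn.executemany("INSERT INTO person_homes VALUES (?, ?, ?, ?)", homes)

# Rank each person's homes by oldest date, breaking ties on the lower id,
# and keep only the first-ranked home per person.
kept = conn.execute(
    """
    SELECT person_id, home_id, location, purchase_date
    FROM (SELECT ph.*,
                 ROW_NUMBER() OVER (PARTITION BY person_id
                                    ORDER BY purchase_date ASC, home_id ASC) AS seqnum
          FROM person_homes ph) AS ranked
    WHERE seqnum = 1
    ORDER BY person_id
    """
).fetchall()
conn.close()
```

Under the stated rules this keeps home 1 for person 1 (oldest date), home 3 for person 2 (same date, lower id), and home 4 for person 3 (single entry).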

This is the shortest approach, taken from a linked answer (adapt the table and column names to PersonHomes):
;WITH cte AS
(
SELECT *, RowN = ROW_NUMBER() OVER (PARTITION BY ID ORDER BY AddressMoveDate DESC) FROM Address
)
DELETE FROM cte WHERE RowN > 1

Getting individual counts of last three distinct rows in column of data retrieved from multiple tables

I have a query that returns several rows of a single DateTime column, obtained by joining multiple SQL tables. I now want the individual counts for the latest three distinct dates, that is, the last three when sorted from earliest to latest.
SQL Query
SELECT
ST.EffectiveDate
FROM Person.Contact C
INNER JOIN Sales.SalesPerson SP
ON C.ContactID = SP.SalesPersonID
FULL OUTER JOIN Sales.SalesTerritory ST
ON ST.TerritoryID = SP.TerritoryID
The above query returns around 200 rows, but I want the count for each of the three latest dates (the bottom three).

I would do this with top and group by:
SELECT TOP 3 ST.EffectiveDate, COUNT(*) as cnt
FROM Person.Contact C INNER JOIN
Sales.SalesPerson SP
ON C.ContactID = SP.SalesPersonID FULL OUTER JOIN
Sales.SalesTerritory ST
ON ST.TerritoryID = SP.TerritoryID
GROUP BY ST.EffectiveDate
ORDER BY ST.EffectiveDate DESC

added another query to get the latest 3 distinct dates
SELECT count(1)
FROM Person.Contact C
INNER JOIN Sales.SalesPerson SP
ON C.ContactID = SP.SalesPersonID
FULL OUTER JOIN Sales.SalesTerritory ST
ON ST.TerritoryID = SP.TerritoryID
WHERE ST.effectivedate in (select distinct top 3 effectivedate
from salesterritory
order by effectivedate desc)
Or if you need to see the counts for the 3 dates broken out
SELECT st.effectivedate, count(1)
FROM Person.Contact C
INNER JOIN Sales.SalesPerson SP
ON C.ContactID = SP.SalesPersonID
FULL OUTER JOIN Sales.SalesTerritory ST
ON ST.TerritoryID = SP.TerritoryID
WHERE ST.effectivedate in (select distinct top 3 effectivedate
from salesterritory
order by effectivedate desc)
GROUP BY st.effectivedate

You can also use the analytic DENSE_RANK function. This query will number the latest date as 1, the next latest as 2, and so forth, with rows that share a date getting the same number:
SELECT
ST.EffectiveDate,
DENSE_RANK() OVER (ORDER BY ST.EffectiveDate DESC) AS DateRank
FROM Person.Contact C
INNER JOIN Sales.SalesPerson SP ON C.ContactID = SP.SalesPersonID
FULL OUTER JOIN Sales.SalesTerritory ST ON ST.TerritoryID = SP.TerritoryID
You can't use the ranked value in the WHERE clause, so you'll need to take the query above and make it a subquery or a common table expression (CTE).
Subquery version:
SELECT EffectiveDate, COUNT(*)
FROM (
SELECT
ST.EffectiveDate,
DENSE_RANK() OVER (ORDER BY ST.EffectiveDate DESC) AS DateRank
FROM Person.Contact C
INNER JOIN Sales.SalesPerson SP ON C.ContactID = SP.SalesPersonID
FULL OUTER JOIN Sales.SalesTerritory ST ON ST.TerritoryID = SP.TerritoryID
) DateList
WHERE DateRank <= 3
GROUP BY EffectiveDate
CTE version:
WITH DateList AS (
SELECT
ST.EffectiveDate,
DENSE_RANK() OVER (ORDER BY ST.EffectiveDate DESC) AS DateRank
FROM Person.Contact C
INNER JOIN Sales.SalesPerson SP ON C.ContactID = SP.SalesPersonID
FULL OUTER JOIN Sales.SalesTerritory ST ON ST.TerritoryID = SP.TerritoryID
)
SELECT EffectiveDate, COUNT(*)
FROM DateList
WHERE DateRank <= 3
GROUP BY EffectiveDate
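
Counting rows for the three latest distinct dates needs a ranking that gives tied dates the same number, which is what DENSE_RANK does. The sketch below tries that on a single hypothetical table in SQLite from Python; the joins from the question are left out, and the territory table, its effective_date column, and the sample dates are all invented for the illustration. It assumes SQLite 3.25 or later for window functions.

```python
import sqlite3

# Hypothetical dates: four distinct values, some repeated.
dates = [
    "2020-01-01", "2020-01-01",
    "2020-02-01",
    "2020-03-01", "2020-03-01", "2020-03-01",
    "2020-04-01", "2020-04-01",
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE territory (effective_date TEXT)")
conn.executemany("INSERT INTO territory VALUES (?)", [(d,) for d in dates])

# DENSE_RANK gives every row with the same date the same rank, so
# date_rank <= 3 keeps all rows of the three latest distinct dates.
latest_counts = conn.execute(
    """
    SELECT effective_date, COUNT(*) AS cnt
    FROM (SELECT effective_date,
                 DENSE_RANK() OVER (ORDER BY effective_date DESC) AS date_rank
          FROM territory) AS ranked
    WHERE date_rank <= 3
    GROUP BY effective_date
    ORDER BY effective_date DESC
    """
).fetchall()
conn.close()
```

With ROW_NUMBER in place of DENSE_RANK, the filter would keep just three rows in total rather than every row for three dates.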

If you're dealing with SQL Server 2005 and above, you could even try this:
;with cte as
(
SELECT
ST.EffectiveDate
FROM Person.Contact C
INNER JOIN Sales.SalesPerson SP
ON C.ContactID = SP.SalesPersonID
FULL OUTER JOIN Sales.SalesTerritory ST
ON ST.TerritoryID = SP.TerritoryID
)
Select EffectiveDate, count(1)
from cte
where EffectiveDate in (select distinct top 3 effectivedate
from cte
order by EffectiveDate desc)
group by EffectiveDate
Though untested, it should work; it may be unnecessarily elaborate though.

How to calculate a total of values from other tables in a column of the select statement

I have three tables:
CustOrder: id, CreateDate, Status
DenominationOrder: id, DenID, OrderID
Denomination: id, amount
I want to create a view based on all of these tables, with an additional Total column that holds the sum of the amounts for each order.
e.g.
order 1 total denominations 3, total amount = 250+250+250=750
order 2 total denominations 2, total amount = 250+250=500
Is it possible?

I am guessing at your table relations (and at the data, since you did not provide a sample):
SELECT co.id,
COUNT(do.DenID) AS `Total denominations`,
SUM(d.amount) AS `Total amount`
FROM CustOrder co
INNER JOIN DenominationOrder do ON co.id = do.OrderId
INNER JOIN Denomination d ON do.DenId = d.id
GROUP BY co.id
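
Since the question asks for a view, here is a hedged sketch that wraps the same aggregate in CREATE VIEW and queries it, using SQLite from Python. The snake_case names, the order_totals view name, and the sample rows (one 250 denomination used three times by order 1 and twice by order 2) are assumptions modelled on the question's example, not the asker's real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE cust_order (id INTEGER PRIMARY KEY);
    CREATE TABLE denomination (id INTEGER PRIMARY KEY, amount INTEGER);
    CREATE TABLE denomination_order (
        id INTEGER PRIMARY KEY,
        den_id INTEGER,
        order_id INTEGER
    );

    INSERT INTO cust_order VALUES (1), (2);
    INSERT INTO denomination VALUES (1, 250);
    INSERT INTO denomination_order (den_id, order_id)
    VALUES (1, 1), (1, 1), (1, 1), (1, 2), (1, 2);

    -- One row per order: how many denominations it has and their total.
    CREATE VIEW order_totals AS
    SELECT co.id AS order_id,
           COUNT(dord.den_id) AS total_denominations,
           SUM(d.amount) AS total_amount
    FROM cust_order co
    INNER JOIN denomination_order dord ON co.id = dord.order_id
    INNER JOIN denomination d ON dord.den_id = d.id
    GROUP BY co.id;
    """
)

totals = conn.execute(
    "SELECT order_id, total_denominations, total_amount "
    "FROM order_totals ORDER BY order_id"
).fetchall()
conn.close()
```

This reproduces the question's figures: order 1 has 3 denominations totalling 750, and order 2 has 2 totalling 500. Orders without any denominations drop out because of the inner joins.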

Try this:
SELECT o.CreateDate, COUNT(o.id), SUM(d.amount) AS 'Total Amount'
FROM CustOrder o
INNER JOIN DenominationOrder do ON o.id = do.OrderID
INNER JOIN Denomination d ON do.DenId = d.id
GROUP BY o.CreateDate
Another way to do this, by using CTE, like this:
;WITH CustomersTotalOrders
AS
(
SELECT o.id, SUM(d.amount) AS 'TotalAmount'
FROM CustOrder o
INNER JOIN DenominationOrder do ON o.id = do.OrderID
INNER JOIN Denomination d ON do.DenId = d.id
GROUP BY o.id
)
SELECT o.id, COUNT(ot.id) AS 'Orders Count', ot.TotalAmount
FROM CustOrder o
INNER JOIN CustomersTotalOrders ot on o.id = ot.id
INNER JOIN DenominationOrder do ON ot.id = do.OrderID
INNER JOIN Denomination d ON do.DenId = d.id
GROUP BY o.id, ot.TotalAmount
This will give you:
id | Orders Count | Total Amount
-------+---------------+-------------
1 3 750
2 2 500
