Checking for (and Deleting) Complex Object Duplicates in SQL Server

I need to check a complex object for duplicates and then cascade-delete the duplicates from all associated tables. I'm wondering whether I can do this efficiently in SQL Server or whether I should do it in my application code. Structurally, I have the following tables:
Claim
ClaimCaseSubTypes (mapping table for many to many relationship)
ClaimDiagnosticCodes (ditto)
ClaimTreatmentCodes (ditto)
Basically, a Claim is a duplicate only if it matches on 8 of its own fields AND has the same relationships in all the mapping tables.
For example, the following records would be flagged as duplicates:
Claim
Id CreateDate Other Fields
1 1/1/2015 matched
2 6/1/2015 matched
ClaimCaseSubTypes
ClaimId SubTypeId
1 34
1 64
2 34
2 64
ClaimDiagnosticCodes
ClaimId DiagnosticCodeId
1 1
2 1
ClaimTreatmentCodes
ClaimId TreatmentCodeId
1 5
1 6
2 6
2 5
In this case I would want to keep claim 1 and delete claim 2 from the Claim table, as well as any rows in the mapping tables with a ClaimId of 2.

This is the kind of problem that window functions are for:
;WITH cte AS (
SELECT c.ID,
ROW_NUMBER() OVER (PARTITION BY field1, field2, field3, ... ORDER BY c.CreateDate) As ClaimOrder
FROM Claim c
INNER JOIN other tables...
)
UPDATE Claim
SET IsDuplicate = IIF(cte.ClaimOrder = 1, 0, 1)
FROM Claim c
INNER JOIN cte ON c.ID = cte.ID
The fields that you include in the PARTITION BY indicate which fields need to be identical for two claims to be considered a match. The ORDER BY tells SQL Server to assign the earliest claim the order of 1. Everything that doesn't have the order of 1 is a duplicate of something else.
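As a rough illustration of how the partitioning behaves, here is a minimal sketch in Python using the standard-library sqlite3 module as a stand-in for SQL Server (it needs SQLite 3.25 or later for window functions). The columns PatientId and Amount are hypothetical placeholders for the matching fields, and the mapping tables are left out, so this only shows the ROW_NUMBER step:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- PatientId and Amount are hypothetical stand-ins for the matching fields.
CREATE TABLE Claim (Id INTEGER PRIMARY KEY, CreateDate TEXT, PatientId INTEGER, Amount INTEGER);
INSERT INTO Claim VALUES
  (1, '2015-01-01', 10, 100),
  (2, '2015-06-01', 10, 100),
  (3, '2015-03-01', 11, 250);
""")

# Number the claims inside each group of identical field values, earliest first.
ranked = conn.execute("""
    SELECT Id,
           ROW_NUMBER() OVER (PARTITION BY PatientId, Amount
                              ORDER BY CreateDate, Id) AS ClaimOrder
    FROM Claim
    ORDER BY Id
""").fetchall()

# Any claim that is not first in its partition duplicates an earlier claim.
duplicate_ids = [claim_id for claim_id, claim_order in ranked if claim_order > 1]
print(ranked)
print(duplicate_ids)
```

Claim 2 shares its partition with claim 1 and has the later CreateDate, so it is the only one reported as a duplicate.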

Related

How to write sql clause where column not equal and any id associated

I want to set up a simple query that filters out any row that contains "A" in the ItemID, but I also do NOT want to display any other row whose JournalID matches a row containing "A". I tried googling for a solution, but I am probably not using the right keywords. I am using Microsoft SQL Server 2008, but I am not a database admin, so I am not too familiar with it. I tried using DISTINCT and also GROUP BY, but neither works in this situation.
This is a simplified version of the table that I am working with:
JournalID ItemID PrimaryKEY
1 A 1
1 B 2
2 A 3
2 C 4
3 B 5
4 D 6
And here is how I would like to make it look:
JournalID ItemID PrimaryKEY
3 B 5
4 D 6
This will exclude any rows where the ItemID is 'A' and also any rows that have the same JournalID as a row where the ItemID is 'A'.
SELECT JournalID, ItemID, PrimaryKEY
FROM TABLE
WHERE JournalID NOT IN (Select JournalID FROM TABLE WHERE ItemID = 'A')
Try this:
SELECT *
FROM table_name
WHERE JournalID
NOT IN (SELECT JournalID
FROM table_name
WHERE ItemID = 'A')
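To check the NOT IN approach against the sample data, here is a small sketch in Python using the standard-library sqlite3 module in place of SQL Server. The table name Journal is an assumption (the question does not name the table), and the NOT EXISTS variant is an alternative not taken from the answers; it can be preferable if JournalID may be NULL, because NOT IN returns no rows at all when the subquery yields a NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- "Journal" is a hypothetical table name; the rows are the question's sample data.
CREATE TABLE Journal (JournalID INTEGER, ItemID TEXT, PrimaryKEY INTEGER PRIMARY KEY);
INSERT INTO Journal VALUES
  (1, 'A', 1), (1, 'B', 2), (2, 'A', 3), (2, 'C', 4), (3, 'B', 5), (4, 'D', 6);
""")

# Exclude every journal that has at least one row with ItemID 'A'.
not_in_rows = conn.execute("""
    SELECT JournalID, ItemID, PrimaryKEY
    FROM Journal
    WHERE JournalID NOT IN (SELECT JournalID FROM Journal WHERE ItemID = 'A')
    ORDER BY PrimaryKEY
""").fetchall()

# Same filter written with NOT EXISTS (an alternative, not from the answers).
not_exists_rows = conn.execute("""
    SELECT j.JournalID, j.ItemID, j.PrimaryKEY
    FROM Journal j
    WHERE NOT EXISTS (SELECT 1 FROM Journal a
                      WHERE a.JournalID = j.JournalID AND a.ItemID = 'A')
    ORDER BY j.PrimaryKEY
""").fetchall()

print(not_in_rows)
print(not_exists_rows)
```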

unable to use LIMIT when using correlated query

I have two tables in Postgres. I want to get the data for the latest 3 records from a table.
Below is the query:
select two.sid as sid,
two.sidname as sidname,
two.myPercent as mypercent,
two.saccur as saccur,
one.totalSid as totalSid
from table1 one,table2 two
where one.sid = two.sid;
The above query returns all records that satisfy the condition one.sid = two.sid. I want only the data for the 3 most recent records (sid 4, 5, 6) from table2.
I know that in Postgres we can use LIMIT to restrict the number of rows retrieved, but table2 has multiple rows for each ID. So I guess I cannot use LIMIT on table2 and should use it on table1 instead. Any suggestions?
table1:
sid totalSid
1 10
2 20
3 30
4 40
5 50
6 60
table2:
sid sidname myPercent saccur
1 aaaa 11 11t
1 bbb 13 13g
1 ccc 11 11g
1 qw 88 88k
//more data for 2,3,4,5....
6 xyz 89 895W
6 xyz1 90 90k
6 xyz2 91 91p
6 xyz3 92 92q
Given a changed understanding of the question, a simple subquery and join should suffice.
We select everything from table1, limited to 3 records in descending sid order. This gives us the 3 most recent SIDs, which we then join to table2 to get the other relevant data for each SID. The assumption here is that SID is unique in table1 and that "most recent" means the records with the highest SID.
SELECT two.sid as sid
, two.sidname as sidname
, two.myPercent as mypercent
, two.saccur as saccur
, one.totalSid as totalSid
FROM (SELECT * FROM table1 ORDER BY SID DESC LIMIT 3) one
INNER JOIN table2 two
ON one.sid = two.sid;
*Note: I removed a stray comma after the alias one in the query above.
Below is the same query using the older ANSI-89 join syntax with comma notation:
SELECT two.sid as sid
, two.sidname as sidname
, two.myPercent as mypercent
, two.saccur as saccur
, one.totalSid as totalSid
FROM (SELECT * FROM table1 ORDER BY SID DESC LIMIT 3) one
, table2 two
WHERE one.sid = two.sid;
This syntax basically says: get the 3 most recent SIDs from table one, cross join them to all records in table two (that is, match each record in one to every record in two), and then return only the records that have the same SID on both sides. Modern query optimizers may use cost-based optimization to avoid performing the entire cross join; however, the logical order of operations says this is what the database would have to do. If one and two are both tables of substantial size, the cross join could produce a very large temporary dataset.
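The subquery-plus-join form can be tried out with a short sketch in Python using the standard-library sqlite3 module as a stand-in for Postgres (both accept LIMIT inside a derived table). The table2 rows for sid 4 and 5 are made-up placeholders, since the question elides that data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (sid INTEGER PRIMARY KEY, totalSid INTEGER);
CREATE TABLE table2 (sid INTEGER, sidname TEXT, myPercent INTEGER, saccur TEXT);
INSERT INTO table1 VALUES (1, 10), (2, 20), (3, 30), (4, 40), (5, 50), (6, 60);
INSERT INTO table2 VALUES
  (1, 'aaaa', 11, '11t'), (1, 'bbb', 13, '13g'),
  -- hypothetical rows for sid 4 and 5 (not given in the question)
  (4, 'p4', 40, '40a'), (5, 'p5', 50, '50b'),
  (6, 'xyz', 89, '895W'), (6, 'xyz1', 90, '90k'),
  (6, 'xyz2', 91, '91p'), (6, 'xyz3', 92, '92q');
""")

# Limit table1 to its 3 highest sids first, then join to table2.
rows = conn.execute("""
    SELECT two.sid, two.sidname, two.myPercent, two.saccur, one.totalSid
    FROM (SELECT * FROM table1 ORDER BY sid DESC LIMIT 3) one
    INNER JOIN table2 two ON one.sid = two.sid
    ORDER BY two.sid, two.sidname
""").fetchall()

returned_sids = sorted({row[0] for row in rows})
print(returned_sids)
print(len(rows))
```

Because the LIMIT is applied to table1 before the join, every table2 row for the three selected sids is kept, not just three rows in total.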

Count number of repeats in SQL

I tried to solve a problem but without success.
I have two lists of numbers:
{1,2,3,4}
{5,6,7,8,9}
And I have table
ID Number
1 1
1 2
1 7
1 2
1 6
2 8
2 7
2 3
2 9
Now I need to count how many times a number from the second list comes after a number from the first list, but each ID should be counted only once.
For the example table above, the result should be 2:
there are three matching pairs, but because there are only two different IDs the result is 2 instead of 3.
Pairs:
1 2
1 7
1 2
1 6
2 3
2 9
Note: I am working with MSSQL.
Edit: There is one more column, Date, which determines the order.
Edit 2 - Solution
I wrote this query:
SELECT * FROM table t
left JOIN table tt ON tt.ID = t.ID
AND tt.Date > t.Date
AND t.Number IN (1,2,3,4)
AND tt.Number IN (6,7,8,9)
After this I planned to group by ID and keep only one match per ID, but the query takes a long time to execute.
Here is a query that would do it:
select a.id, min(a.number) as a, min(b.number) as b
from mytable a
inner join mytable b
on a.id = b.id
and a.date < b.date
and b.number in (5,6,7,8,9)
where a.number in (1,2,3,4)
group by a.id
Output is:
id a b
1 1 6
2 3 9
Each pair is output on its own line: column a holds a value from the first list of numbers, and column b holds a value from the second list.
Comments on attempt (edit 2 to question)
Later you added a query attempt to your question. Some comments about that attempt:
You don't need a left join, because you really want a match for both values. An inner join generally performs better, so use that.
The condition t.Number IN (1,2,3,4) does not belong in the ON clause. In combination with a left join, the result will include t records that violate this condition. It should go in the WHERE clause.
Your concern about performance may be warranted, but it can likely be addressed by adding a suitable index on your table, e.g. on (id, number, date) or (id, date, number).
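The query above can be run against the sample data with a short sketch in Python using the standard-library sqlite3 module in place of MSSQL. The date values are an assumption: plain integers that follow the order in which the rows are listed in the question. The final COUNT(DISTINCT ...) query is an added step, not part of the answer, that produces the single number the question asks for.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- date is a hypothetical integer sequence matching the listed row order.
CREATE TABLE mytable (id INTEGER, number INTEGER, date INTEGER);
INSERT INTO mytable VALUES
  (1, 1, 1), (1, 2, 2), (1, 7, 3), (1, 2, 4), (1, 6, 5),
  (2, 8, 6), (2, 7, 7), (2, 3, 8), (2, 9, 9);
""")

# One row per id that has a first-list number followed by a second-list number.
pairs = conn.execute("""
    SELECT a.id, MIN(a.number) AS a, MIN(b.number) AS b
    FROM mytable a
    INNER JOIN mytable b
            ON a.id = b.id
           AND a.date < b.date
           AND b.number IN (5, 6, 7, 8, 9)
    WHERE a.number IN (1, 2, 3, 4)
    GROUP BY a.id
    ORDER BY a.id
""").fetchall()

# Collapse to the single count: ids with at least one qualifying pair.
(id_count,) = conn.execute("""
    SELECT COUNT(DISTINCT a.id)
    FROM mytable a
    INNER JOIN mytable b
            ON a.id = b.id
           AND a.date < b.date
           AND b.number IN (5, 6, 7, 8, 9)
    WHERE a.number IN (1, 2, 3, 4)
""").fetchone()

print(pairs)
print(id_count)
```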

Find rows in a table based on the existence of two different rows in a 1:N-related table

Say I have a table Clients, with a field ClientID, and that client has orders that are loaded in another table Orders, with foreign key ClientID to link both.
A client can have many orders (1:N), but orders have different types, described by the field TypeID.
Now, I want to select the clients that have orders of a number of types. For instance, the clients that have orders of type 1 and 2 (both, not one or the other).
How do I build this query? I'm really at a loss here.
EDIT: Assume I'm on SQL Server.
This query assumes that TypeId can only be 1 or 2. It returns every ClientId that has at least one order of Type 1 and at least one of Type 2, no matter how many of each.
Select ClientId, COUNT(distinct TypeId) as cnt
from tblOrders o
group by ClientId
Having COUNT(distinct TypeId) >= 2
COUNT(distinct TypeId) is what makes this work: it counts the number of distinct TypeIds for each ClientId. If you had, say, 5 different types and needed all of them, change the threshold in the HAVING clause to 5.
This is a small sample DataSet
ClientId TypeId
1 1
1 2
1 2
2 2
2 1
3 1
3 1
Here is the result; it excludes client 3 because that client only has orders of Type 1.
Result Set
ClientId cnt
1 2
2 2
If you have many different TypeIds but only want to check Type 1 and Type 2, put those IDs in a WHERE clause:
where TypeId in (1,2)
Here's one solution:
select * from clients c
where exists (select 1 from orders o where typeid = 1 and o.clientid = c.clientid)
and exists (select 1 from orders o where typeid = 2 and o.clientid = c.clientid)
and exists (select 1 from orders o where typeid = 3 and o.clientid = c.clientid)
-- additional types ...
You can use INTERSECT which will give the intersection of the resultsets.
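As a sketch of the INTERSECT suggestion, the following Python snippet uses the standard-library sqlite3 module as a stand-in for SQL Server and loads the small sample data set shown earlier into a table assumed to be named Orders. It also runs the HAVING variant with the WHERE filter for comparison:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Orders (ClientId INTEGER, TypeId INTEGER);
INSERT INTO Orders VALUES (1, 1), (1, 2), (1, 2), (2, 2), (2, 1), (3, 1), (3, 1);
""")

# Clients with a Type 1 order, intersected with clients with a Type 2 order.
intersect_ids = [row[0] for row in conn.execute("""
    SELECT ClientId FROM Orders WHERE TypeId = 1
    INTERSECT
    SELECT ClientId FROM Orders WHERE TypeId = 2
    ORDER BY ClientId
""")]

# Same result via grouping: both wanted types must be present for the client.
having_ids = [row[0] for row in conn.execute("""
    SELECT ClientId
    FROM Orders
    WHERE TypeId IN (1, 2)
    GROUP BY ClientId
    HAVING COUNT(DISTINCT TypeId) = 2
    ORDER BY ClientId
""")]

print(intersect_ids)
print(having_ids)
```

Client 3 drops out of both results because it only has orders of Type 1.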

Select a subgroup of records by one distinct column

Sorry if this has been answered before, but all the related questions didn't quite seem to match my purpose.
I have a table that looks like the following:
ID POSS_PHONE CELL_FLAG
=======================
1 111-111-1111 0
2 222-222-2222 0
2 333-333-3333 1
3 444-444-4444 1
I want to select only distinct ID values for an insert, but I don't care which specific row gets picked from among the duplicates.
For Example(a valid SELECT would be):
1 111-111-1111 0
2 222-222-2222 0
3 444-444-4444 1
Before I had the CELL_FLAG column, I was just using an aggregate function as so:
SELECT ID, MAX(POSS_PHONE)
FROM TableA
GROUP BY ID
But I can't do:
SELECT ID, MAX(POSS_PHONE), MAX(CELL_FLAG)...
because I would lose integrity within the row, correct?
I've seen some similar examples using CTEs, but once again, nothing that quite fit.
So maybe this is solvable by a CTE or some type of self-join subquery? I'm at a block right now, so I can't see any other solutions.
Just get your aggregation in a subquery and join to it:
SELECT a.ID, sub.Poss_Phone, CELL_FLAG
FROM TableA as a
INNER JOIN (SELECT ID, MAX(POSS_PHONE) as [Poss_Phone]
FROM TableA
GROUP BY ID) Sub
ON Sub.ID = a.ID and SUB.Poss_Phone = A.Poss_Phone
This will keep integrity between your non-aggregated fields but still give you the MAX(Poss_Phone) per ID.
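Here is a minimal sketch of that join-to-aggregate pattern run against the question's sample rows, written in Python with the standard-library sqlite3 module standing in for SQL Server. It assumes that the (ID, POSS_PHONE) combination is unique; if the same phone appeared twice for one ID, the join would return both rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TableA (ID INTEGER, POSS_PHONE TEXT, CELL_FLAG INTEGER);
INSERT INTO TableA VALUES
  (1, '111-111-1111', 0),
  (2, '222-222-2222', 0),
  (2, '333-333-3333', 1),
  (3, '444-444-4444', 1);
""")

# Pick MAX(POSS_PHONE) per ID, then join back to fetch that row's CELL_FLAG.
rows = conn.execute("""
    SELECT a.ID, sub.Poss_Phone, a.CELL_FLAG
    FROM TableA AS a
    INNER JOIN (SELECT ID, MAX(POSS_PHONE) AS Poss_Phone
                FROM TableA
                GROUP BY ID) AS sub
            ON sub.ID = a.ID AND sub.Poss_Phone = a.POSS_PHONE
    ORDER BY a.ID
""").fetchall()

print(rows)
```

For ID 2 the row with phone 333-333-3333 is chosen, and its CELL_FLAG of 1 comes from that same row rather than from a separate aggregate.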