Querying duplicates table into related sets

Querying duplicates table into related sets - sql

We have a process that creates a table of duplicate records based on some arbitrary rules (details not relevant).
Every record gets checked against all other records and if a suspected duplicate is found both it and the duplicate are stored in a dupes table to be manually reviewed.
This results in a table something like this:
dupId, originalId, duplicateId
1 1 2
2 1 3
3 1 4
4 2 3
5 2 4
6 3 4
7 5 6
8 5 7
9 6 7
10 8 9
You can see here record #1 has 3 other records it is similar to (#2,#3 and #4) and they are each similar to each other.
Record #5 has 2 duplicates (#6 and #7) and record #8 has only 1 (#9).
I want to query the duplicates into sets, so my results would look something like this:
setId recordId
1 1
1 2
1 3
1 4
2 5
2 6
2 7
3 8
3 9
But I am too old/slow/tired/rubbish and a bit out of my depth here.
Currently, when checking for duplicates if the record pairing is already in the table we don't insert it twice (i.e. you don't see both sides of the duplicate pairing) but can easily do so if it makes the querying simpler.
Any advice much appreciated!

Duplicates seems to be transitive, so you have all pairs. That is, the "original" id has the information you need.
But it is not included in the duplicates and you want that. So:
select dense_rank() over (order by originalid) as setid, duplicateid
from ((select originalid, duplicateid
from t
where not exists (select 1 from t t2 where t.originalid = t2.duplicateid)
) union all
(select distinct originalid, originalid
from t
where not exists (select 1 from t t2 where t.originalid = t2.duplicateid)
)
) i
order by setid;

Related

Recursive query with CTE

I need some help with one query.
So, I already have CTE with the next data:
ApplicationID
CandidateId
JobId
Row
1
1
1
1
2
1
2
2
3
1
3
3
4
2
1
1
5
2
2
2
6
2
5
3
7
3
2
1
8
3
6
2
9
3
3
3
I need to find one job per candidate in a way, that this job was distinct for table.
I expect that next data from query (for each candidate select the first available jobid that's not taken by the previous candidate):
ApplicationID
CandidateId
JobId
Row
1
1
1
1
5
2
2
2
8
3
6
2
I have never worked with recursive queries in CTE, having read about them, to be honest, I don't fully understand how this can be applied in my case. I ask for help in this regard.

The following query returns the expected result.
WITH CTE AS
(
SELECT TOP 1 *,ROW_NUMBER() OVER(ORDER BY ApplicationID) N,
CONVERT(varchar(max), CONCAT(',',JobId,',')) Jobs
FROM ApplicationCandidateCTE
ORDER BY ApplicationID
UNION ALL
SELECT a.*,ROW_NUMBER() OVER(ORDER BY a.ApplicationID),
CONCAT(Jobs,a.JobId,',') Jobs
FROM ApplicationCandidateCTE a JOIN CTE b
ON a.ApplicationID > b.ApplicationID AND
a.CandidateId > b.CandidateId AND
CHARINDEX(CONCAT(',',a.JobId,','), b.Jobs)=0 AND
b.N = 1
)
SELECT * FROM CTE WHERE N = 1;
However, I have the following concerns:
The recursive CTE may extract too many rows.
The concatenated JobId may exceed varchar(max).
See dbfiddle.

Combining values from one column by the key value from another column

I need to combine all values by one column depends on the key from another column. Can someone help me to get out of this problem please?
here is the short example of my problem.
CUST_ID CUST_REL_ID
100 1
100 2
100 3
100 4
200 5
200 6
200 7
CUST_ID CUST_REL_ID
1 1
1 2
1 3
1 4
2 1
2 2
2 3
2 4
...
5 5
5 6
5 7

I think you just want a self-join:
select t1.cust_rel_id, t2.cust_rel_id
from t t1 join
t t2
on t1.cust_id = t2.cust_id
order by t1.cust_rel_id, t2.cust_rel_id;
I don't understand your naming conventions. The column called cust_id in the result set looks nothing like the column called cust_id in the source data. But this appears to be what you want to do.

Output all duplicate rows (SQL Server)

I have a table which holds what I consider duplicate rows. the values in these records may not be exactly the same, but it’s been calculated that they’re possible duplicates by fuzzy logic. For example:
RecordCD key_in key_out
---------------------------
1 1 2
2 2 2
3 3 3
4 4 6
5 5 5
6 6 6
7 7 7
8 8 11
9 9 9
10 10 10
11 11 11
key_in column has a unique ID of the record.
key_out column has a possible duplicate if it’s not equal to key_in
I need my output to look like this and list all of the possible duplicates:
RecordCD key_in key_out
---------------------------
1 1 2
2 2 2
4 4 6
6 6 6
8 8 11
11 11 11
but I’m struggling to construct a query that would do that.
Thanks.

I think this is what you want:
select t.*
from t
where exists (select 1
from t t2
where t2.key_out = t.key_out and t2.key_in <> t.key_in
)
order by t.key_out;
Here is a db<>fiddle.

It seems like if there is a mismatch between key_in, key_out you want to pull all rows where key_in has either value`
I would create a temp table with all values in rows with mismatched key_in, key_out, call this value bad_match
If either of your key_in, key_out values match this value, include it in output
select mytable.* from mytable
where key_in in
(select key_in bad_match from mytable where key_in <> key_out
union all
select key_out from mytable where key_in <> key_out);
This sample builds your schema and returns the desired output

Using VBA ADO on Excel, I am looking for an Update query to get an univocal, 1 to 1 match

I have a table similar to the following (tab1):
ID Descr Related_ID
1 a 2
2 a 1
3 a 1
4 a 1
5 b 6
6 b 5
7 b 5
8 b 5
my query is something like:
Update [tab1] as T1 INNER JOIN [tab1] as t2
ON t1.[Descr] = t2.[Descr]
SET t1.[Related_ID] = t2.[ID]
WHERE t1.[ID] <> t1.[Related_ID]
the idea is to create couple/pair between the rows of the table to get for each line one on only one ID of another line.
What the query is actually doing is overwriting each time Related_ID, with the latter table2 ID.
Instead of just use any ID only once.
The result I would like is like:
ID Descr Related_ID
1 a 2
2 a 1
3 a 4
4 a 3
5 b 6
6 b 5
7 b 8
8 b 7
I would like to create a couple / pair of element where the related_id of one is the ID of the other and viceversa.
Could you help me please?
PS:
A. I am using vba adodb on excel not on access
B. I need to maintain a good degree of efficiency, if I would implement sub query such as WHERE not [ID] IN (SELECT [ID] FROM [tab2] WHERE ...)
the efficiency will be highly impacted (from few seconds to half an hour) as the table has hundreds of rows
thanks

PSQL get duplicate row

I have table like this-
id object_id product_id
1 1 1
2 1 1
4 2 2
6 3 2
7 3 2
8 1 2
9 1 1
I want to delete all rows except these-
1 1 1
4 2 2
6 3 2
9 1 2
Basically there are duplicates and I want to remove them but keep one copy intact.
what would be the most efficient way for this?

If this is a one-off then you can simply identify the records you want to keep like so:
SELECT MIN(id) AS id
FROM yourtable
GROUP BY object_id, product_id;
You want to check that this works before you do the next thing and actually throw records out. To actually delete those duplicate records you do:
DELETE FROM yourtable WHERE id NOT IN (
SELECT MIN(id) AS id
FROM yourtable
GROUP BY object_id, product_id
);
The MIN(id) obviously always returns the record with the lowest id for a set of (object_id, product_id). Change as desired.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Querying duplicates table into related sets - sql

Related

Recursive query with CTE

Combining values from one column by the key value from another column

Output all duplicate rows (SQL Server)

Using VBA ADO on Excel, I am looking for an Update query to get an univocal, 1 to 1 match

PSQL get duplicate row

Categories

Resources