Efficient query for finding duplicate records

I need to query a table for duplicate deposit records, where two deposits made at one cash terminal, for the same amount, within a certain time window, are considered duplicate records. I've started working on a query now, but I would appreciate any advice or suggestions on doing this 'properly'.

Generally, you'd do a self join to the same table, and put your "duplicate" criteria in the join conditions.
E.g.
SELECT
    *
FROM
    Transactions t1
    INNER JOIN Transactions t2
        ON  t1.Terminal = t2.Terminal
        AND t1.Amount = t2.Amount
        AND DATEDIFF(minute, t2.TransactionDate, t1.TransactionDate) BETWEEN 0 AND 10
        AND t1.TransactionID > t2.TransactionID /* prevent matching the same row */
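As a minimal sketch of how this behaves, the self join can be run against a small in-memory SQLite database from Python. The sample rows are made up, and because SQLite has no DATEDIFF, the minute difference is computed with julianday() as a stand-in, so treat the date arithmetic as an assumption rather than the SQL Server form above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Transactions (
        TransactionID   INTEGER PRIMARY KEY,
        Terminal        TEXT,
        Amount          INTEGER,
        TransactionDate TEXT   -- ISO-8601 text; hypothetical sample data
    );
    INSERT INTO Transactions VALUES
        (1, 'T1', 100, '2024-01-01 09:00:00'),
        (2, 'T1', 100, '2024-01-01 09:04:00'),  -- 4 minutes after row 1
        (3, 'T1', 100, '2024-01-01 11:00:00'),  -- same terminal and amount, too late
        (4, 'T2', 100, '2024-01-01 09:01:00');  -- different terminal
""")

# Self join on the duplicate criteria. julianday() differences are in days,
# so multiplying by 1440 converts them to minutes (SQLite stand-in for DATEDIFF).
duplicate_pairs = conn.execute("""
    SELECT t2.TransactionID AS first_id, t1.TransactionID AS later_id
    FROM Transactions t1
    INNER JOIN Transactions t2
        ON  t1.Terminal = t2.Terminal
        AND t1.Amount = t2.Amount
        AND (julianday(t1.TransactionDate) - julianday(t2.TransactionDate)) * 1440
            BETWEEN 0 AND 10
        AND t1.TransactionID > t2.TransactionID
    ORDER BY first_id, later_id
""").fetchall()

print(duplicate_pairs)
```

Only the pair (1, 2) qualifies here: row 3 falls outside the 10-minute window and row 4 is at another terminal.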

Simple aggregate
SELECT
col1, col2, col3, ...
FROM
MyTable
GROUP BY
col1, col2, col3, ...
HAVING
COUNT(*) >= 2
Don't include your identity/key/PK column: this will be unique per row and mess up the aggregate.
To get a row to remove or keep, do a MAX or MIN on that
SELECT
col1, col2, col3, ...,
MAX(IDCol) AS RowToDelete,
MIN(IDCol) AS RowToKeep
FROM
MyTable
GROUP BY
col1, col2, col3, ...
HAVING
COUNT(*) >= 2
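Of course, if a group can contain three or more duplicates, MAX identifies only one of the rows to delete, so use the "keep" value instead and delete every other row in the group.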
Edit:
For rows within a time window, use a self join or window/ranking function
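For the window-function route, one possible sketch is to compare each row with the previous row for the same terminal and amount using LAG. The Deposits table, its columns, and the 10-minute threshold are assumptions for illustration, and the example runs on SQLite (3.25 or later for window functions) through Python rather than on any particular production database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical table and sample rows.
    CREATE TABLE Deposits (id INTEGER PRIMARY KEY, terminal TEXT, amount INTEGER, made_at TEXT);
    INSERT INTO Deposits VALUES
        (1, 'T1', 50, '2024-01-01 09:00:00'),
        (2, 'T1', 50, '2024-01-01 09:03:00'),
        (3, 'T1', 50, '2024-01-01 10:30:00'),
        (4, 'T1', 75, '2024-01-01 09:01:00');
""")

# For each row, look up the previous deposit with the same terminal and amount,
# then keep the rows that follow their predecessor by at most 10 minutes.
suspects = conn.execute("""
    SELECT id, prev_id
    FROM (
        SELECT id, made_at,
               LAG(id)      OVER (PARTITION BY terminal, amount ORDER BY made_at) AS prev_id,
               LAG(made_at) OVER (PARTITION BY terminal, amount ORDER BY made_at) AS prev_made_at
        FROM Deposits
    ) AS ordered
    WHERE prev_made_at IS NOT NULL
      AND (julianday(made_at) - julianday(prev_made_at)) * 1440 <= 10
    ORDER BY id
""").fetchall()

print(suspects)
```

Unlike the self join, this reads the table once and only compares neighbouring rows, so a burst of three deposits is reported as two consecutive pairs rather than three.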

Related

Find differences in both tables without multiple joins

After comparing two data sets I'd like to extract information such as:
Rows that are only present in table A
Rows that are only present in table B
Non-key value differences after join
What's the preferred way to go about this? Is there a way to do this without having to do LEFT and RIGHT joins separately?

It sounds like you want a FULL OUTER JOIN. That gives you all rows from both tables, joining the ones whose keys match. Then you can see which rows are present in only one table, and compare values for rows that are in both.

I would typically use group by for this, so I'm not sure what the reference to multiple joins is.
select col1, col2, col3, sum(in_a) as a_cnt, sum(in_b) as b_cnt
from ((select col1, col2, col3, 1 as in_a, 0 as in_b
       from a
      ) union all
      (select col1, col2, col3, 0 as in_a, 1 as in_b
       from b
      )
     ) ab
group by col1, col2, col3;
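Here is a small runnable sketch of the union all/group by comparison, using SQLite from Python with made-up rows in tables a and b. A count of 0 on one side marks a row that exists only in the other table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical sample data.
    CREATE TABLE a (col1 TEXT, col2 TEXT, col3 TEXT);
    CREATE TABLE b (col1 TEXT, col2 TEXT, col3 TEXT);
    INSERT INTO a VALUES ('x', '1', 'p'), ('y', '2', 'q');
    INSERT INTO b VALUES ('y', '2', 'q'), ('z', '3', 'r');
""")

# Tag each row with its source table, stack both tables, then count per row value.
rows = conn.execute("""
    SELECT col1, col2, col3, SUM(in_a) AS a_cnt, SUM(in_b) AS b_cnt
    FROM (
        SELECT col1, col2, col3, 1 AS in_a, 0 AS in_b FROM a
        UNION ALL
        SELECT col1, col2, col3, 0 AS in_a, 1 AS in_b FROM b
    ) AS ab
    GROUP BY col1, col2, col3
    ORDER BY col1
""").fetchall()

only_in_a = [r[:3] for r in rows if r[4] == 0]  # b_cnt is zero
only_in_b = [r[:3] for r in rows if r[3] == 0]  # a_cnt is zero
print(only_in_a, only_in_b)
```

To compare non-key values as well, one option is to group by the key columns only and compare MIN/MAX of the value columns per side.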

Counting matching rows of two same tables and counting rows of the table

I have the same table structure called "table1" under two different schemas "schema1" and "schema2". "table1" contains columns "col1, col2, col3". Initially I wanted to see whether there are records with the same values of col1 and col2 in schema1.table1 and schema2.table1. But I mistyped schema2.table1 as schema1.table1, and now I am confused by the query result.
SELECT COUNT(*) FROM schema1.table1 AS s1t, schema1.table1 AS s2t
WHERE s1t.col1 = s2t.col1 AND s1t.col2 = s2t.col2;
I got
count
-------
530
(1 row)
However, SELECT COUNT(*) FROM schema1.table1; shows that there are 17815 rows.
Why would the first query show there are only 530 satisfied records? Shouldn't it be 17815 as well?

You can use a FULL OUTER JOIN to see the mismatched rows as well; they come back with NULLs in the columns of the side that has no match. That way the query returns at least 17815 rows:
SELECT COUNT(*)
FROM schema1.table1 AS s1t
FULL OUTER JOIN schema1.table1 AS s2t
ON s1t.col1 = s2t.col1 AND s1t.col2 = s2t.col2
With your inner join, only rows that match on both columns (col1 and col2) are returned.

You are joining the table to itself. That is really strange.
In any case, your join is going to filter out any rows where col1 or col2 are NULL.
In addition, the self-join might multiply the number of rows if there are duplicates (with respect to the two columns) in the table.
It is really unclear why you would be doing this, but the above explains the results you are seeing.
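Both effects can be seen on a tiny made-up table. This is only a sketch on SQLite via Python, but the NULL comparison rule it relies on is standard SQL: NULL = NULL is not true, so such rows never satisfy the join condition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical sample data: one duplicated pair and three rows containing NULL.
    CREATE TABLE table1 (col1 TEXT, col2 TEXT);
    INSERT INTO table1 VALUES
        ('a', 'x'),
        ('a', 'x'),
        ('c', 'z'),
        ('b', NULL),
        (NULL, 'y'),
        (NULL, NULL);
""")

total_rows = conn.execute("SELECT COUNT(*) FROM table1").fetchone()[0]

# Same shape as the query in the question: the table joined to itself.
self_join_rows = conn.execute("""
    SELECT COUNT(*)
    FROM table1 AS s1t, table1 AS s2t
    WHERE s1t.col1 = s2t.col1 AND s1t.col2 = s2t.col2
""").fetchone()[0]

print(total_rows, self_join_rows)
```

The two ('a', 'x') rows contribute 2 x 2 = 4 matches, ('c', 'z') contributes 1, and the three rows with a NULL contribute none, so the join count (5) differs from the row count (6) in both directions at once.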
If you want to compare the results in the two schemas allowing for duplicates and missing values, I recommend union all/group by:
select col1, col2, sum(cnt1) as cnt1, sum(cnt2) as cnt2
from ((select col1, col2, count(*) as cnt1, 0 as cnt2
from schema1.table1
group by col1, col2
) union all
(select col1, col2, 0 as cnt1, count(*) as cnt2
from schema2.table1
group by col1, col2
)
) t12
group by col1, col2
having sum(cnt1) <> sum(cnt2);
This returns pairs where the counts are not the same in the two tables. It even works for NULL values. If you ran this on the same table, no rows would be returned.
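A runnable sketch of that comparison, assuming made-up data. SQLite has no schemas in the PostgreSQL sense, so two attached in-memory databases named schema1 and schema2 stand in for them here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Attached databases play the role of the two schemas (an assumption of this sketch).
    ATTACH DATABASE ':memory:' AS schema1;
    ATTACH DATABASE ':memory:' AS schema2;
    CREATE TABLE schema1.table1 (col1 TEXT, col2 TEXT, col3 TEXT);
    CREATE TABLE schema2.table1 (col1 TEXT, col2 TEXT, col3 TEXT);
    INSERT INTO schema1.table1 VALUES ('a','x','1'), ('a','x','2'), (NULL,'y','3'), ('b','z','4');
    INSERT INTO schema2.table1 VALUES ('a','x','9'), (NULL,'y','8'), ('c','w','7');
""")

# Count each (col1, col2) pair per schema, stack the counts, and keep the pairs that differ.
mismatches = conn.execute("""
    SELECT col1, col2, SUM(cnt1) AS cnt1, SUM(cnt2) AS cnt2
    FROM (
        SELECT col1, col2, COUNT(*) AS cnt1, 0 AS cnt2
        FROM schema1.table1 GROUP BY col1, col2
        UNION ALL
        SELECT col1, col2, 0 AS cnt1, COUNT(*) AS cnt2
        FROM schema2.table1 GROUP BY col1, col2
    ) AS t12
    GROUP BY col1, col2
    HAVING SUM(cnt1) <> SUM(cnt2)
    ORDER BY col1
""").fetchall()

print(mismatches)
```

The (NULL, 'y') pair appears once in each schema and is correctly left out of the result, because GROUP BY puts NULLs in the same group even though = would never match them.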

Most efficient way to find distinct records, retaining unique ID

I have a large dataset stored in a SQL Server table, with 1 unique ID, and many attributes. I need to select the distinct attribute records, along with one of the unique IDs associated with that unique combination.
Example dataset:
ID|Col1|Col2|Col3...
1|big|blue|ball
2|big|red|ball
3|big|blue|ball
4|small|red|ball
Example Goal (2,3,4 would also have been acceptable) :
ID|Col1|Col2|Col3...
1|big|blue|ball
2|big|red|ball
4|small|red|ball
I have tried a few different methods, but all of them seem to be taking very long (hours), so I was wondering if there was a more efficient approach. Failing this, my next idea is to partition the table.
I have tried:
Using Where exists, e.g.
SELECT * from Table as T1
where exists (select *
from table as T2
where
ISNULL(T1.ID,'') <> ISNULL(T2.ID,'')
AND ISNULL([T1].[Col1],'') = ISNULL([T2].[Col1],'')
AND ISNULL([T1].[Col2],'') = ISNULL([T2].[Col2],'')
)
MAX(ID) and Group By Attributes.
GROUP BY Attributes, having count > 1.

How about just using group by?
select min(id), col1, col2, col3
from t
group by col1, col2, col3;
This will probably take a while. This might be more efficient:
select t.*
from t
where t.id = (select min(t2.id)
from t t2
where t.col1 = t2.col1 and t.col2 = t2.col2 and . . .
);
This requires an index on t(col1, col2, col3, . . ., id). Given your request, that is on all columns.
In addition, this will not work for columns that are NULL. Some databases support the ANSI standard IS NOT DISTINCT FROM operator for null-safe comparisons. If yours does, it should be able to use the index for this construct as well.
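To make the NULL caveat concrete, here is a sketch on SQLite via Python that runs the correlated subquery twice: once with = and once with a null-safe comparison. SQLite's IS operator is used as the null-safe form here, which is an assumption specific to this sketch; other databases would use IS NOT DISTINCT FROM or their own equivalent. Rows 5 and 6 are made-up additions to the question's sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (id INTEGER PRIMARY KEY, col1 TEXT, col2 TEXT, col3 TEXT);
    INSERT INTO t VALUES
        (1, 'big',   'blue', 'ball'),
        (2, 'big',   'red',  'ball'),
        (3, 'big',   'blue', 'ball'),
        (4, 'small', 'red',  'ball'),
        (5, 'small', NULL,   'ball'),   -- hypothetical rows with a NULL attribute
        (6, 'small', NULL,   'ball');
    CREATE INDEX t_cols_id ON t (col1, col2, col3, id);
""")

# {eq} is replaced by the comparison operator under test.
QUERY = """
    SELECT t.id
    FROM t
    WHERE t.id = (SELECT MIN(t2.id)
                  FROM t AS t2
                  WHERE t2.col1 {eq} t.col1
                    AND t2.col2 {eq} t.col2
                    AND t2.col3 {eq} t.col3)
    ORDER BY t.id
"""


def first_ids(eq):
    """Return the smallest id of each attribute combination using operator `eq`."""
    return [row[0] for row in conn.execute(QUERY.format(eq=eq))]


plain_ids = first_ids("=")       # rows with NULL attributes match nothing and vanish
null_safe_ids = first_ids("IS")  # NULLs compare equal, so row 5 represents its group
print(plain_ids, null_safe_ids)
```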

SELECT Id,Col1,Col2,Col3 FROM (
SELECT Id,Col1,Col2,Col3,ROW_NUMBER() OVER (Partition By Col1,Col2,Col3 Order By ID,Col1,Col2,Col3) valid
from Table as T1) t
WHERE valid=1
Hope this helps...

merge two queries with different where and different grouping into 1

Sorry, I asked this question just before and got some good answers, but then I realised I had made a mistake in the query. Changing the original post could invalidate those answers, so I'm posting again with the right query this time. Please forgive me; I hope this is acceptable.
DECLARE #Temp TABLE
(MeasureDate, col1, col2, type)
INSERT INTO #Temp
SELECT MeasureDate, col1, col2, 1
FROM Table1
WHERE Col3 = 1
INSERT INTO #Temp
SELECT MeasureDate, col1, col2, 3
FROM Table1
WHERE Col3 = 1
AND Col4 = 7000
SELECT SUM(col1) / SUM(col2) AS Percentage, MeasureDate, Type
FROM #Temp
GROUP BY MeasureDate, Type
I do two inserts into the temp table, 2nd insert with an extra WHERE but same columns same table, but different type, then I do SUM(col1) / SUM(col2) on the temp table to return the result I need per MeasureDate and type. Is there a way to merge all these inserts and selects into one statement so I don't use a temp table and do a single select from Table1? Or even if I still need the temp table, merge the selects into one select instead of two separate selects? Stored procedure works fine as it is, just looking for a way to shorten it.
Thanks.

Sure can. I might start with combining the two queries from your inserts using UNION ALL (this variation of UNION will not remove duplicates), wrapped up in a CTE from which you can perform your final query:
WITH MeasureData(MeasureDate, col1, col2, type) AS (
SELECT MeasureDate, col1, col2, 1
FROM Table1
WHERE Col3 = 1
UNION ALL
SELECT MeasureDate, col1, col2, 3
FROM Table1
WHERE Col3 = 1
AND Col4 = 7000
)
SELECT SUM(col1) / SUM(col2) AS Percentage, MeasureDate, Type
FROM MeasureData
GROUP BY MeasureDate, Type
That's it, no more table variable or insert statements.

No real need for a UNION, you can handle this with a CASE statement:
SELECT SUM(col1) / SUM(col2) AS Percentage, MeasureDate, Type
FROM (
SELECT MeasureDate, col1, col2, case when Col4 = 7000 then 3 else 1 end type
FROM Table1
WHERE Col3 = 1
) t
GROUP BY MeasureDate, Type
Edit: as Gordon correctly points out, this query wouldn't produce the same results for Type = 1. Here's a variation on Gordon's good answer, using a CROSS JOIN and IF logic, that might be easier to understand visually:
SELECT T1.MeasureDate,
T.Type,
SUM(IF(T.Type=1,Col1,IF(T.Type=3 AND T1.Col4=7000,T1.Col1,0))) /
SUM(IF(T.Type=1,Col2,IF(T.Type=3 AND T1.Col4=7000,T1.Col2,0))) AS Percentage
FROM Table1 T1
CROSS JOIN (SELECT 1 Type UNION SELECT 3) T
WHERE T1.Col3 = 1
GROUP BY T1.MeasureDate, T.Type

Your method is double counting cases where col3 = 1 and col4 = 7000. Here is a method that takes this into account, without union on the overall table:
select SUM(t1.col1) / SUM(t1.col2) AS Percentage, t1.MeasureDate, t.Type
from table1 t1 join
     (select 1 as type union all
      select 3 as type
     ) t
     on t.type = 1 or t1.col4 = 7000
where t1.col3 = 1
group by t1.MeasureDate, t.Type;
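A quick way to check this join is to run it on a few made-up rows. The sketch below uses SQLite via Python; the column types (REAL for col1 and col2) are an assumption, chosen so the division is not an integer division.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical sample data and column types.
    CREATE TABLE Table1 (MeasureDate TEXT, col1 REAL, col2 REAL, Col3 INTEGER, Col4 INTEGER);
    INSERT INTO Table1 VALUES
        ('2024-01-01', 1.0, 4.0, 1, 7000),
        ('2024-01-01', 2.0, 4.0, 1, 5000),
        ('2024-01-01', 9.0, 9.0, 0, 7000);  -- filtered out: Col3 <> 1
""")

# Every Col3 = 1 row joins to Type 1; only Col4 = 7000 rows also join to Type 3.
results = conn.execute("""
    SELECT t1.MeasureDate, t.Type, SUM(t1.col1) / SUM(t1.col2) AS Percentage
    FROM Table1 AS t1
    JOIN (SELECT 1 AS Type UNION ALL SELECT 3 AS Type) AS t
        ON t.Type = 1 OR t1.Col4 = 7000
    WHERE t1.Col3 = 1
    GROUP BY t1.MeasureDate, t.Type
    ORDER BY t.Type
""").fetchall()

print(results)
```

Type 1 sums both qualifying rows (3 / 8) and Type 3 sums only the Col4 = 7000 row (1 / 4), which matches what the two separate inserts into the temp table would have produced.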

Hive: Select all rows with a range from the max of a column

So I am trying to write a query in Hive that will then be automated. The idea is I have a table that shows Requests with a timestamp field called updated. So there are a lot of rows with the date and time at which the Request was made. Regardless of when the query is run, I want to get the Requests from the 7 days leading up to the most recent one.
I tried:
SELECT col1, col2, col3, count(*) cnt
FROM table
WHERE updated BETWEEN date_sub(SELECT MAX(updated) AS maxdate FROM table, 7)
AND SELECT MAX(updated) AS maxdate FROM table
GROUP BY col1, col2, col3
HAVING cnt > 10
I have looked over this and it seems like it should do what I am looking for, however I get:
ParseException line 4:79 cannot recognize input near 'select' 'max' '(' in function specification
Any help with this error, or a suggested different approach, would be great.

Can you try this query, assuming the data type of the column "updated" is datetime in all tables:
SELECT col1, col2, col3, count(*) cnt
FROM table
WHERE updated BETWEEN (SELECT MAX(updated)-7 AS maxdate FROM table)
AND (SELECT MAX(updated) AS maxdate FROM table)
GROUP BY col1, col2, col3
HAVING count(*) > 10
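Depending on the Hive version, scalar subqueries in the WHERE clause may not be accepted, so another approach worth trying is to compute the maximum once in a derived table and join it to every row. The sketch below only illustrates that shape on SQLite via Python: the requests table and its rows are made up, datetime(..., '-7 days') is SQLite's date arithmetic rather than Hive's, and the count threshold is lowered to 1 to suit the tiny sample.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical sample data.
    CREATE TABLE requests (col1 TEXT, updated TEXT);
    INSERT INTO requests VALUES
        ('a', '2024-03-10 12:00:00'),
        ('a', '2024-03-08 08:00:00'),
        ('a', '2024-02-01 00:00:00'),  -- older than 7 days before the max
        ('b', '2024-03-09 00:00:00');
""")

# The one-row derived table m carries MAX(updated); the cross join exposes it to every row.
recent_counts = conn.execute("""
    SELECT r.col1, COUNT(*) AS cnt
    FROM requests AS r
    CROSS JOIN (SELECT MAX(updated) AS maxdate FROM requests) AS m
    WHERE r.updated BETWEEN datetime(m.maxdate, '-7 days') AND m.maxdate
    GROUP BY r.col1
    HAVING COUNT(*) > 1
    ORDER BY r.col1
""").fetchall()

print(recent_counts)
```

In Hive the lower bound would be expressed with its own date functions (the question already uses date_sub), applied to m.maxdate instead of to a nested SELECT.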
