How to find out the duplicate records

How to find out the duplicate records - sql

Using Sql Server 2000
I want to find out the duplicate record in the table
Table1
ID Transaction Value
001 020102 10
001 020103 20
001 020102 10 (Duplicate Records)
002 020102 10
002 020103 20
002 020102 10 (Duplicate Records)
...
...
Transaction and value can be repeat for different id's, not for the same id...
Expected Output
Duplicate records are...
ID Transaction Value
001 020102 10
002 020102 10
...
...
How to make a query for view the duplicate records.
Need Query help

You can use
SELECT
ID, Transaction, Value
FROM
Table1
GROUP BY
ID, Transaction, Value
HAVING count(ID) > 1

Select Id, Transaction, Value, Count(id)
from table
group by Id, Transaction, Value
having count(id) > 1
This query will show you the count of times the ID has been repeated with each entry of the Id. If you don't need it you can simply remove the Count(Id) column from the select clause.

Self join (with additional PK or Timestamp or...)
I can see that people've provided solution with grouping but none has provided the self join solution. The only problem is that you'd need some other row descriptor that should be unique for each record. Be it primary key, timestamp or anything else... Suppose that the unique column's name is Uniq this would be the solution:
select distinct ID, [Transaction], Value
from Records r1
join Records r2
on ((r2.ID = r1.ID) and
(r2.[Transaction] = r1.[Transaction]) and
(r2.Value = r1.Value) and
(r2.Uniq != r1.Uniq))
The last join column makes it possible to not join each row to itself but only to other duplicates...
To find out which one works best for you, you can check their execution plan and execute some testing.

You can do this:
SELECT ID, Transaction, Value
FROM Table
GROUP BY ID, Transaction, Value
HAVING COUNT(*) > 1
To delete the duplicates, if you have no primary key then you need to select the distinct values into a separate table, delete everything from this one, then copy the distinct records back:
SELECT ID, Transaction, Value
INTO #tmpDeduped
FROM Table
GROUP BY ID, Transaction, Value
DELETE FROM Table
INSERT Table
SELECT * FROM #tmpDeduped

Related

How to create a new table that only keeps rows with more than 5 data records under the same id in Bigquery

I have a table like this:
Id
Date
Steps
Distance
1
2016-06-01
1000
1
There are over 1000 records and 50 Ids in this table, most ids have about 20 records, and some ids only have 1, or 2 records which I think are useless.
I want to create a table that excludes those ids with less than 5 records.
I wrote this code to find the ids that I want to exclude:
SELECT
Id,
COUNT(Id) AS num_id
FROM `table`
GROUP BY
Id
ORDER BY
num_id
Since there are only two ids I need to exclude, I use WHERE clause:
CREATE TABLE `` AS
SELECT
*
FROM ``
WHERE
Id <> 2320127002
AND Id <> 7007744171
Although I can get the result I want, I think there are better ways to solve this kind of problem. For example, if there are over 20 ids with less than 5 records in this table, what shall I do? Thank you.

Consider this:
CREATE TABLE `filtered_table` AS
SELECT *
FROM `table`
WHERE TRUE QUALIFY COUNT(*) OVER (PARTITION BY Id) >= 5
Note: You can remove WHERE TRUE if it runs successfully without it.

Optimisation of sql query for deleting duplicate items from large table

Could anyone please help me optimise one of the queries which is taking more than 20 minutes to run against 3 Million data.
Table Structure
-----------------------------------------------------------------------------------------
|id [INT Auto Inc]| name_id (uuid) | name (varchar)| city (varchar) | name_type(varchar)|
-----------------------------------------------------------------------------------------
Query
The purpose of the query is to eliminate the duplicate, here duplicate means having same name_id and name.
DELETE
FROM records
WHERE id NOT IN
(SELECT DISTINCT
ON (name_id, name) id
FROM records);

I would write your delete using exists logic:
DELETE
FROM records r1
WHERE EXISTS (SELECT 1 FROM records r2
WHERE r2.name_id = r1.name_id AND r2.name = r2.name AND
r2.id < r1.id);
This delete query will spare the duplicate having the smallest id value. To speed this up, you may try adding the following index:
CREATE INDEX idx ON records (name_id, name, id);

You probably already have a primary key on the identity column, then you can use it to exclude redundant rows by id in the following way:
WITH cte AS (
SELECT MIN(id) AS id FROM records GROUP BY name_id, name)
DELETE FROM records
WHERE NOT EXISTS (SELECT id FROM cte WHERE id=records.id)
Even without the index, this should work relatively fast, probably because of merge join strategy.

SQL Server Sum multiple rows into one - no temp table

I would like to see a most concise way to do what is outlined in this SO question: Sum values from multiple rows into one row
that is, combine multiple rows while summing a column.
But how to then delete the duplicates. In other words I have data like this:
Person Value
--------------
1 10
1 20
2 15
And I want to sum the values for any duplicates (on the Person col) into a single row and get rid of the other duplicates on the Person value. So my output would be:
Person Value
-------------
1 30
2 15
And I would like to do this without using a temp table. I think that I'll need to use OVER PARTITION BY but just not sure. Just trying to challenge myself in not doing it the temp table way. Working with SQL Server 2008 R2
Simply put, give me a concise stmt getting from my input to my output in the same table. So if my table name is People if I do a select * from People on it before the operation that I am asking in this question I get the first set above and then when I do a select * from People after the operation, I get the second set of data above.

Not sure why not using Temp table but here's one way to avoid it (tho imho this is an overkill):
UPDATE MyTable SET VALUE = (SELECT SUM(Value) FROM MyTable MT WHERE MT.Person = MyTable.Person);
WITH DUP_TABLE AS
(SELECT ROW_NUMBER()
OVER (PARTITION BY Person ORDER BY Person) As ROW_NO
FROM MyTable)
DELETE FROM DUP_TABLE WHERE ROW_NO > 1;
First query updates every duplicate person to the summary value. Second query removes duplicate persons.
Demo: http://sqlfiddle.com/#!3/db7aa/11

All you're asking for is a simple SUM() aggregate function and a GROUP BY
SELECT Person, SUM(Value)
FROM myTable
GROUP BY Person
The SUM() by itself would sum up the values in a column, but when you add a secondary column and GROUP BY it, SQL will show distinct values from the secondary column and perform the aggregate function by those distinct categories.

Last id value in a table. SQL Server

Is there a way to know the last nth id field of a table, without scanning it completely? (just go to the end of table and get id value)
table
id fieldvalue
1 2323
2 4645
3 556
... ...
100000000 1232
So for example here n = 100000000 100 Million
--------------EDIT-----
So which one of the queries proposed would be more efficient?

SELECT MAX(id) FROM <tablename>

Assuming ID is the IDENTITY for the table, you could use SELECT IDENT_CURRENT('TABLE NAME').
See here for more info.
One thing to note about this approach: If you have INSERTs that fail but increment the IDENTITY counter, then you will get back a result that is higher than the result returned by SELECT MAX(id) FROM <tablename>

You can also use system tables to get all last values from all identity columns in system:
select
OBJECT_NAME(object_id) + '.' + name as col_name
, last_value
from
sys.identity_columns
order by last_value desc

In case when table1 rows are inserted first, and then rows to table2 which depend on ids from the table1, you can use SELECT:
INSERT INTO `table2` (`some_id`, `some_value`)
VALUES ((SELECT some_id
FROM `table1`
WHERE `other_key_1` = 'xxx'
AND `other_key_2` = 'yyy'),
'some value abc abc 123 123 ...');
Of course, this can work only if there are other identifiers that can uniquely identify rows from table1

First of all, you want to access the table in DESCENDING order by ID.
Then you would select the TOP N records.
At this point, you want the last record of the set which hopefully is obvious. Assuming that the id field is indexed, this would at most retrieve the last N records of the table and most likely would end up being optimized into a single record fetch.

Select Ident_Current('Your Table Name') gives the last Id of your table.

how to query with child relations to same table and order this correctly

Take this table:
id name sub_id
---------------------------
1 A (null)
2 B (null)
3 A2 1
4 A3 1
The sub_id column is a relation to his own table, to column ID.
subid --- 0:1 --- id
Now I have the problem to make a correctly SELECT query to show that the child rows (which sub_id is not null) directly selected under his parent row. So this must be a correctly order:
1 A (null)
3 A2 1
4 A3 1
2 B (null)
A normal SELECT order the id. But how or which keyword help me to order this correctly?
JOIN isn't possible I think because I want to get all the rows separated. Because the rows will be displayed on a Gridview (ASP.Net) with EntityDataSource but the child rows must be displayed directly under his parent.
Thank you.

Look at Managing Hierarchical Data in MySQL.
Since recursion is an expensive operation because basicly you're firing multiple queries to your database you could consider using the Nested Set Model. In short you're assigning numbers to ranges in your table. It's a long article but it worth reading it. I've used it during my internship as a solution not to have 1000+ queries, But bring it down to 1 query.
Your handling 'overhead' now lies at the point of updating the table by adding, updating or deleting records. Since you then have to update all the records with a bigger 'right-value'. But when you're retrieving the data, it all goes with 1 query :)

select * from table1 order by name, sub_id will in this case return your desired result but only because the parents names and the child name are similar. If you're using SQL 2005 a recursive CTE will work:
WITH recurse (id, Name, childID, Depth)
AS
(
SELECT id, Name, ISNULL(childID, id) as id, 0 AS Depth
FROM table1 where childid is null
UNION ALL
SELECT table1.id, table1.Name, table1.childID, recurse.Depth + 1 AS Depth FROM table1
JOIN recurse ON table1.childid = recurse.id
)
SELECT * FROM recurse order by childid, depth

SELECT
*
FROM
table
ORDER BY
COALESCE(id,sub_id), id
btw, this will work only for one level.. any thing more than that requires recursive/cte function

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

How to find out the duplicate records - sql

You can use SELECT ID, Transaction, Value FROM Table1 GROUP BY ID, Transaction, Value HAVING count(ID) > 1

Select Id, Transaction, Value, Count(id) from table group by Id, Transaction, Value having count(id) > 1 This query will show you the count of times the ID has been repeated with each entry of the Id. If you don't need it you can simply remove the Count(Id) column from the select clause.

Related

How to create a new table that only keeps rows with more than 5 data records under the same id in Bigquery

Optimisation of sql query for deleting duplicate items from large table

SQL Server Sum multiple rows into one - no temp table

Last id value in a table. SQL Server

how to query with child relations to same table and order this correctly

Categories

Resources