delete rows if duplicates exist - sql

I want to delete duplicate rows in my table. It should keep the first one (the one on top), and any other duplicates should be deleted.
As the image shows, there are two 12s and two 13s. So keep the first 12 in the database and delete any others; the same goes for 13 or any ID.
My Idea:
DELETE from [Table]
WHERE [ID]
HAVING COUNT(TABLE.ID) > 1;

You could try using a CTE and ROW_NUMBER.
A common table expression (CTE) can be thought of as a temporary
result set that is defined within the execution scope of a single
SELECT, INSERT, UPDATE, DELETE, or CREATE VIEW statement. A CTE is
similar to a derived table in that it is not stored as an object and
lasts only for the duration of the query. Unlike a derived table, a
CTE can be self-referencing and can be referenced multiple times in
the same query.
Something like
;WITH DeleteRows AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY ID ORDER BY ID) AS RowID
    FROM [Table]
)
DELETE FROM DeleteRows
WHERE RowID > 1
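Note that the ORDER BY inside OVER() decides which duplicate survives. If "first" means the oldest row rather than an arbitrary one, order by a suitable column instead; a sketch assuming a hypothetical CreatedDate column:
;WITH DeleteRows AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY ID ORDER BY CreatedDate) AS RowID
    FROM [Table]
)
DELETE FROM DeleteRows
WHERE RowID > 1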

Related

Postgres: select for update using CTE

I always had a query in the form of:
UPDATE
    users
SET
    col = 1
WHERE
    user_id IN (
        SELECT
            user_id
        FROM
            users
        WHERE
            ...
        LIMIT 1
        FOR UPDATE
    );
And I was pretty sure that it generates a lock on the affected row until the update is done.
Now I wrote the same query using CTE and doing
WITH query AS (
    SELECT
        user_id
    FROM
        users
    WHERE
        ...
    LIMIT 1
    FOR UPDATE
)
UPDATE
    users
SET
    col = 1
WHERE
    user_id IN (
        SELECT
            user_id
        FROM
            query
    );
I’m actually having some doubts that it is applying a row lock because of the results I get, but I couldn’t find anything documented about this.
Can someone make it clear? Thanks
Edit:
I managed to find this:
If specific tables are named in FOR UPDATE or FOR SHARE, then only rows coming from those tables are locked; any other tables used in the SELECT are simply read as usual. A FOR UPDATE or FOR SHARE clause without a table list affects all tables used in the statement. If FOR UPDATE or FOR SHARE is applied to a view or sub-query, it affects all tables used in the view or sub-query. However, FOR UPDATE/FOR SHARE do not apply to WITH queries referenced by the primary query. If you want row locking to occur within a WITH query, specify FOR UPDATE or FOR SHARE within the WITH query.
https://www.postgresql.org/docs/9.0/sql-select.html#SQL-FOR-UPDATE-SHARE
So I guess it should work only if the FOR UPDATE is in the WITH query and not in the query that is using the WITH?
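For what it's worth, the locking can be checked from a second session; a minimal sketch, assuming the users table above with col = 0 standing in for the elided WHERE condition:
-- Session 1: take the lock inside the WITH query, keep the transaction open
BEGIN;
WITH query AS (
    SELECT user_id FROM users WHERE col = 0 LIMIT 1 FOR UPDATE
)
UPDATE users SET col = 1 WHERE user_id IN (SELECT user_id FROM query);

-- Session 2: this blocks until session 1 commits, which confirms the row lock
-- taken inside the WITH query is effective
SELECT user_id FROM users WHERE col = 0 LIMIT 1 FOR UPDATE;

-- Session 1:
COMMIT;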

about CTE and physically deleting records from table

For deleting duplicate records I found the query below on Stack Overflow, and it works fine. In this query we are deleting records from "a" and not from tblEmployee. So my question is: how do duplicate records get physically deleted from the physical table even though we don't have any unique or primary key?
WITH a AS (
    SELECT Firstname,
           ROW_NUMBER() OVER (PARTITION BY Firstname, empID ORDER BY Firstname)
               AS duplicateRecCount
    FROM dbo.tblEmployee
)
--Now Delete Duplicate Records
DELETE
FROM a
WHERE duplicateRecCount > 1
In order to understand this, consider one of the differences between temp tables and CTEs.
When we use temporary tables, the temp table is saved in the tempdb database. It is just a copy of your table tblEmployee; no matter what changes you make to the temp table, they won't affect tblEmployee.
But when you use a CTE, it actually points to the same table itself. That is why, if you delete from the CTE, it affects tblEmployee as well.
CTE is nothing but a disposable view.
You appear to be using an updatable CTE in SQL Server. In this case, the CTE acts the same as a view.
The view is updatable because it refers to only one table and does not have aggregation. Hence, the effect of the CTE is merely to add columns to the table, and these can be referenced in the DELETE statement.
The conditions for an updatable view are explained in the documentation. These conditions are the same as for your CTE.
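To see the difference concretely, a minimal sketch against the question's dbo.tblEmployee: deleting from a temp-table copy leaves the base table alone, while deleting through the CTE removes the base table's own rows.
-- Temp table: a physical copy in tempdb; this DELETE leaves tblEmployee intact
SELECT * INTO #emp FROM dbo.tblEmployee;
DELETE FROM #emp;

-- CTE: just a named query over tblEmployee; this DELETE removes base-table rows
WITH a AS (
    SELECT ROW_NUMBER() OVER (PARTITION BY Firstname, empID
                              ORDER BY Firstname) AS duplicateRecCount
    FROM dbo.tblEmployee
)
DELETE FROM a
WHERE duplicateRecCount > 1;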

Getting a specific number of rows from Database using RowNumber; Inconsistent results

Here is my SQL query:
select * from TABLE T where ROWNUM<=100
If I execute this and then re-execute it, I don't get the same result. Why?
Also, on a Sybase system, if I execute
set rowcount 100
select * from TABLE
even on re-execution I get the same result.
Can someone explain why, and provide a possible solution for ROWNUM?
Thanks
If you don't use ORDER BY in your query you get the results in natural order.
Natural order is whatever is fastest for the database at the moment.
A possible solution is to ORDER BY your primary key, if it's an INT:
SELECT TOP 100 START AT 0 * FROM TABLE
ORDER BY TABLE.ID;
If your primary key is not a sequentially incrementing integer and you don't have another column to order by (such as a timestamp), you may need to create an extra column SORT_ORDER INT and increment it automatically on insert, using either an autoincrement column or a sequence and an insert trigger, depending on the database.
Make sure to create an index on that column to speed up the query.
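A minimal sketch of that idea in SQL Server syntax (the table name is an assumption):
-- Adding an IDENTITY column numbers the existing rows as well
ALTER TABLE MyTable ADD SORT_ORDER INT IDENTITY(1,1);
CREATE INDEX IX_MyTable_SortOrder ON MyTable (SORT_ORDER);

-- The first 100 rows are now stable across executions
SELECT TOP 100 * FROM MyTable ORDER BY SORT_ORDER;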
You need to specify an ORDER BY. Queries without explicit ORDER BY clause make no guarantee about the order in which the rows are returned. And from this result set you take the first 100 rows. As the order in which the rows can be different every time, so can be your first 100 rows.
You need to use ORDER BY first, followed by ROWNUM. You will get inconsistent results if you don't follow this order.
select * from
(
    select * from TABLE T ORDER BY rowid
)
where ROWNUM <= 100

Deleting duplicate rows in a database without using rowid or creating a temp table

Many years ago, I was asked during a phone interview to delete duplicate rows in a database. After giving several solutions that do work, I was eventually told the restrictions are:
Assume table has one VARCHAR column
Cannot use rowid
Cannot use temporary tables
The interviewer refused to give me the answer. I've been stumped ever since.
After asking several colleagues over the years, I'm convinced there is no solution. Am I wrong?!
And if you did have an answer, would a new restriction suddenly present itself? Since you mention ROWID, I assume you were using Oracle. The solutions are for SQL Server.
Inspired by SQLServerCentral.com http://www.sqlservercentral.com/scripts/T-SQL/62866/
while (1=1) begin
    delete top (1)
    from MyTable
    where VarcharColumn in
        (select VarcharColumn
         from MyTable
         group by VarcharColumn
         having count(*) > 1)
    if @@rowcount = 0
        break
end
Deletes one row at a time. When the second to last row of a set of duplicates disappears then the remaining row won't be in the subselect on the next pass through the loop. (BIG Yuck!)
Also, see http://www.sqlservercentral.com/articles/T-SQL/63578/ for inspiration. There RBarry Young suggests a way that might be modified to store the deduplicated data in the same table, delete all the original rows, then convert the stored deduplicated data back into the right format. He had three columns, so not exactly analogous to what you are doing.
And then it might be do-able with a cursor. Not sure and don't have time to look it up. But create a cursor to select everything out of the table, in order, and then a variable to track what the last row looked like. If the current row is the same, delete, else set the variable to the current row.
This is a completely jacked-up way to do it, but given the asinine requirements, here is a workable solution assuming SQL 2005 or later:
DELETE from MyTable
WHERE ROW_NUMBER() over(PARTITION BY [MyField] order by MyField)>1
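As noted further down, ROW_NUMBER() can't actually be used directly in a WHERE clause; wrapped in an updatable CTE, though, the same idea does parse and run on SQL Server 2005+. A sketch reusing the MyTable/VarcharColumn names from above:
WITH d AS (
    SELECT ROW_NUMBER() OVER (PARTITION BY VarcharColumn
                              ORDER BY VarcharColumn) AS rn
    FROM MyTable
)
DELETE FROM d
WHERE rn > 1;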
I would put a unique number of fixed size in the VARCHAR column for the duplicated rows, then parse out the number and delete all but the minimum row. Maybe that's what his VARCHAR constraint is for. But that stinks because it assumes that your unique number will fit. Lame question. You didn't want to work there anyway. ;-)
Assume you are implementing the DELETE statement for a SQL engine. How will you delete two rows from a table that are exactly identical? You need something to distinguish one from the other!
You actually cannot delete entirely duplicate rows (ALL columns being equal) under the following constraints (as provided to you):
No use of ROWID or ROWNUM
No Temporary Table
No procedural code
It can, however, be done if any one of the conditions is relaxed. Here are solutions, each relaxing at least one of the three conditions.
Assume table is defined as below
Create Table t1 (
col1 varchar2(100),
col2 number(5),
col3 number(2)
);
Duplicate rows identification:
Select col1, col2, col3
from t1
group by col1, col2, col3
having count(*) >1
Duplicate rows can also be identified using this:
select c1, c2, c3, row_number() over (partition by c1, c2, c3 order by c1, c2, c3) rn from t1
NOTE: The row_number() analytic function cannot be used in a DELETE statement as suggested by JohnFx at least in Oracle 10g.
Solution using ROWID
Delete from t1 where rowid > (select min(t1_inner.rowid) from t1 t1_inner where t1_inner.c1 = t1.c1 and t1_inner.c2 = t1.c2 and t1_inner.c3 = t1.c3)
Solution using temp table
create table t1_dups as (
    -- the duplicate-identification query listed above goes here
);

delete from t1
where (t1.c1, t1.c2, t1.c3) in (select c1, c2, c3 from t1_dups);

insert into t1
select c1, c2, c3 from t1_dups;
Solution using procedural code
This will use an approach similar to the case where we use a temp table.
create table temp as
    select c1, c2, c3
    from t1
    group by c1, c2, c3
    having count(*) >= 1;
Now drop the base table and rename the temp table to the base table's name.
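In Oracle syntax, those last two steps would be something like:
drop table t1;
alter table temp rename to t1;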
Mine was resolved using this query:
delete from <table> where <columns> in (select <columns> from <table> group by <columns> having count(*) > 1)
in PL/SQL

Insert into temp values (select.... order by id)

I'm using an Informix (Version 7.32) DB. On one operation I create a temp table with the ID of a regular table and a serial column (so I would have all the IDs from the regular table numbered continuously). But I want to insert the info from the regular table ordered by ID something like:
CREATE TEMP TABLE tempTable (id serial, folio int );
INSERT INTO tempTable(id,folio)
SELECT 0,folio FROM regularTable ORDER BY folio;
But this produces a syntax error (because of the ORDER BY).
Is there any way I can order the data and then insert it into the tempTable?
UPDATE: The reason I want to do this is that the regular table has about 10,000 items and a JSP page has to show every record, which would take too long, so the real goal is to paginate the output. This version of Informix has neither LIMIT nor SKIP. I can't renumber the serial because it is in a relationship, and this is the only way we could find to get a fixed number of results per page (for example, 500 results per page). The regular table has skipped IDs (called folio) because rows have been deleted, so if I were to use
SELECT * FROM regularTable WHERE folio BETWEEN X AND Y
I might get 300 on one page and then 500 on the next.
You can do this by breaking up the SQL into two temp tables:
CREATE TEMP TABLE tempTable1 (
    id serial,
    folio int);
SELECT folio FROM regularTable ORDER BY folio
INTO TEMP tempTable2;
INSERT INTO tempTable1(id,folio) SELECT 0,folio FROM tempTable2;
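Once tempTable1 is populated, its serial id is continuous in folio order, so a fixed-size page (say rows 501 to 1000 for the second page of 500) can be fetched with a plain range predicate:
SELECT id, folio
FROM tempTable1
WHERE id BETWEEN 501 AND 1000
ORDER BY id;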
In Informix when using a SELECT as a sub-clause in an INSERT statement, you are limited
to a subset of the SELECT syntax.
The following SELECT clauses are not supported in this case:
INTO TEMP
ORDER BY
UNION.
Additionally, the FROM clause of the SELECT cannot reference the same table as the one referenced by the INSERT (not that this matters in your case).
It's been years since I worked on Informix, but perhaps something like this will work:
INSERT INTO tempTable(id,folio)
SELECT 0, folio
FROM (
SELECT folio FROM regularTable ORDER BY folio
);
You might try iterating a cursor over the SELECT ... ORDER BY and doing the INSERTs within the loop.
It makes no sense to order the rows as you insert into a table. Relational databases do not allow you to specify the order of rows in a table.
Even if you could, SQL does not guarantee a query will return rows in any order, such as the order you inserted them. You must specify an ORDER BY clause to guarantee an order for a query result.
So it would do you no good to change the order in which you insert the rows.
As stated by Bill, there's not a lot of point ordering the input, you really need to order the output. In the simplistic example you've provided, it just makes no sense, so I can only assume that the real problem you're trying to solve is more complex - deduplication perhaps?
The functionality you're after is CREATE SEQUENCE, but I'm pretty sure it's not available in such an old version of Informix.
If you really need to do what you're asking, you could look into UNLOADing the data in the required order, and then LOADing it again. That would ensure the SERIAL values get allocated sequentially.
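A rough sketch of that route (dbaccess statements; the file name is made up):
UNLOAD TO 'folio.unl'
    SELECT folio FROM regularTable ORDER BY folio;

-- reloading with the serial column omitted lets the engine assign id values
-- in file order
LOAD FROM 'folio.unl'
    INSERT INTO tempTable(folio);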
Would something like this work?
SELECT
    folio
FROM
    (
        SELECT
            ROWNUM n,
            folio
        FROM
            regularTable
        ORDER BY
            folio
    )
WHERE
    n BETWEEN 501 AND 1000
It may not be terribly efficient if the table grows larger or you're fetching later "pages", but 10K rows is pretty small.
I don't recall if Informix has a ROWNUM concept, I use Oracle.