As shown below, I have a large db table with several columns. I want to be able to select the last inserted row in that table. How can I achieve that in PostgreSQL?
Output:
"timeofinsertion","selectedsiteid","devenv","threshold","ec50ewco","dose","apprateofproduct","concentrationofactingr","unclass_intr_nzccs","unclass_inbu_nzccs","vl_intr_nzccs","percentage_vl_per_total_nzccs_intr","vl_inbu_nzccs","percentage_vl_per_total_nzccs_inbu","totalvlnzccsinsite","percentage_total_vlnzccs_per_site","l_intr_nzccs","percentage_l_per_total_nzccs_intr","l_inbu_nzccs","percentage_l_per_total_nzccs_inbu","totallnzccsinsite","percentage_total_lnzccs_per_site","m_intr_nzccs","percentage_m_per_total_nzccs_intr","m_inbu_nzccs","percentage_m_per_total_nzccs_inbu","totalmnzccsinsite","percentage_total_mnzccs_per_site","h_intr_nzccs","percentage_h_per_total_nzccs_intr","h_inbu_nzccs","percentage_h_per_total_nzccs_inbu","totalhnzccsinsite","percentage_total_hnzccs_per_site","unclass_intr_zccs","unclass_inbu_zccs","vl_intr_zccs","percentage_vl_per_total_zccs_intr","vl_inbu_zccs","percentage_vl_per_total_zccs_inbu","totalvlzccsinsite","percentage_total_vlzccs_per_site","l_intr_zccs","percentage_l_per_total_zccs_intr","l_inbu_zccs","percentage_l_per_total_zccs_inbu","totallzccsinsite","percentage_total_lzccs_per_site","m_intr_zccs","percentage_m_per_total_zccs_intr","m_inbu_zccs","percentage_m_per_total_zccs_inbu","totalmlzccsinsite","percentage_total_mzccs_per_site","h_intr_zccs","percentage_h_per_total_zccs_intr","h_inbu_zccs","percentage_h_per_total_zccs_inbu","totalhzccsinsite","percentage_total_hzccs_per_site","totalunclassnzccs","totalunclasszccs","totalnzccsintr","totalnzccsinbu","totalnzccsinsite","totalzccsintr","totalzccsinbu","totalzccsinsite","totalvlinsite","percentageof_total_vl_insite_per_site","totallinsite","percentageof_total_l_insite_per_site","totalminsite","percentageof_total_m_insite_per_site","totalhinsite","percentageof_total_h_insite_per_site","total_unclass_with_nodatacells_excluded","total_unclass_with_nodatacells_included","total_with_nodatacells_excluded","total_with_nodatacells_included"
"3-2-2023 10:0:3:745762","202311011423",test,1,"3.125","0.75","75","100","0","0","0","0","0","0","0","0.0","0","0","0","0","0","0.0","0","0","0","0","0","0.0","0","0","0","0","0","0.0","0","0","0","0.0","32","91.4","32","82.1","0","0.0","3","8.6","3","7.7","4","100.0","0","0.0","4","10.3","0","0.0","0","0.0","0","0.0","0","0","0","0","0","4","35","39","32","82.1","3","7.7","4","10.3","0","0.0","0","0","39","39"
If you have a column to sort by, you can use:
select *
from the_table
order by timeofinsertion desc
limit 1;
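If the table is large, an index on the sort column keeps this lookup fast. A minimal sketch (the index name is just illustrative):
create index if not exists the_table_timeofinsertion_idx
    on the_table (timeofinsertion);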
If you want to get the complete row that you have just inserted, it might be easier to use the returning clause with your INSERT statement:
insert into the_table (timeofinsertion, selectedsiteid, ...)
values (current_timestamp, ....)
returning *;
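If you only need a few columns back rather than the whole row, you can list them instead of *. A sketch reusing some column names from the question:
insert into the_table (timeofinsertion, selectedsiteid, devenv)
values (current_timestamp, '202311011423', 'test')
returning timeofinsertion, selectedsiteid;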
In SQL Server you can use SCOPE_IDENTITY(); it returns the last inserted identity column value.
Otherwise you can use ORDER BY ... DESC with TOP 1 on the identity column.
Example:
Create Table tbl_Name
(
RowId Int Not Null Primary Key Identity(1,1),
Name Varchar(100)
)
Insert Into tbl_Name (Name) Values ('My Name')
Select * From tbl_Name Where RowId = SCOPE_IDENTITY()
Select Top 1 * From tbl_Name Order by RowId desc
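For completeness, the OUTPUT clause can also return the inserted row directly (similar to RETURNING in PostgreSQL). A sketch against the same example table:
Insert Into tbl_Name (Name)
Output Inserted.RowId, Inserted.Name
Values ('My Name')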
I have a table I want to insert into based on two other tables.
In the table I'm inserting into, I need to find the Max value and then do +1 every time to basically create new IDs for each of the 2000 values I'm inserting.
I tried something like
MAX(column_name) + 1
But it didn't work. I CANNOT make the column an IDENTITY and ideally the increment by one should happen in the INSERT INTO ... SELECT ... statement.
Many Thanks!
You can declare a variable with the last value from the table and put it on the insert statement, like this:
DECLARE @Id INT
SET @Id = (SELECT TOP 1 Id FROM YourTable ORDER BY Id DESC)
INSERT INTO YourTable VALUES (@Id + 1, Value, Value)
If it's MySQL, you could do something like this:
insert into yourtable
select
@rownum:=@rownum+1 'colname', t.* from yourtable t, (SELECT @rownum:=2000) r
The example for generating the row number was taken from here.
If it's PostgreSQL, you could use
insert into yourtable
select t.*,((row_number() over () ) + 2000) from yourtable t
Please note that the column order in the SELECT differs between the two queries; you may need to adjust your INSERT statement accordingly.
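To avoid depending on column order at all, you can list the columns explicitly in both the INSERT and the SELECT. A sketch with hypothetical column names col_a and col_b:
insert into yourtable (id_column, col_a, col_b)
select (row_number() over ()) + 2000, t.col_a, t.col_b
from yourtable t;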
Use a sequence, that's what they are for.
create sequence table_id_sequence;
Then adjust the sequence to the current max value:
select setval('table_id_sequence', (select max(id_column) from the_table));
The above only needs to be done once.
After the sequence is set up, always use that for any subsequent inserts:
insert into the_table (id_column, column_2, column_3)
select nextval('table_id_sequence'), column_2, column_3
from some_other_table;
If you will never have any concurrent inserts into that table (but only then), you can get away with using max() + 1:
insert into the_table (id_column, column_2, column_3)
select row_number() over () + mx.id, column_2, column_3
from some_other_table
cross join (
select max(id_column) from the_table
) as mx(id);
But again: the above is NOT safe for concurrent inserts.
The sequence solution is also going to perform better (especially as the target table grows in size).
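If you also want plain INSERTs (without an explicit nextval call) to pick up generated ids automatically, you can make the sequence the column default. A sketch using the names from above:
alter table the_table
    alter column id_column set default nextval('table_id_sequence');
alter sequence table_id_sequence owned by the_table.id_column;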
I am trying to delete some duplicate data in my Redshift table.
Below is my query:
With duplicates
As
(Select *, ROW_NUMBER() Over (PARTITION by record_indicator Order by record_indicator) as Duplicate From table_name)
delete from duplicates
Where Duplicate > 1 ;
This query is giving me an error.
Amazon Invalid operation: syntax error at or near "delete";
Not sure what the issue is as the syntax for with clause seems to be correct.
Has anybody faced this situation before?
Redshift being what it is (no enforced uniqueness for any column), Ziggy's 3rd option is probably best. Once we decide to go the temp table route it is more efficient to swap things out whole. Deletes and inserts are expensive in Redshift.
begin;
create table table_name_new as select distinct * from table_name;
alter table table_name rename to table_name_old;
alter table table_name_new rename to table_name;
drop table table_name_old;
commit;
If space isn't an issue you can keep the old table around for a while and use the other methods described here to validate that the row count in the original accounting for duplicates matches the row count in the new.
If you're doing constant loads to such a table you'll want to pause that process while this is going on.
If the number of duplicates is a small percentage of a large table, you might want to try copying distinct records of the duplicates to a temp table, then delete all records from the original that join with the temp. Then append the temp table back to the original. Make sure you vacuum the original table after (which you should be doing for large tables on a schedule anyway).
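A minimal sketch of that approach, assuming the table is table_name and duplicates are identified by record_indicator (both from the question):
CREATE TEMP TABLE dup_rows AS
SELECT DISTINCT *
FROM table_name
WHERE record_indicator IN (
    SELECT record_indicator
    FROM table_name
    GROUP BY record_indicator
    HAVING COUNT(*) > 1);

DELETE FROM table_name
USING dup_rows
WHERE table_name.record_indicator = dup_rows.record_indicator;

INSERT INTO table_name SELECT * FROM dup_rows;

VACUUM table_name;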
If you're dealing with a lot of data it's not always possible or smart to recreate the whole table. It may be easier to locate and delete those rows:
-- First identify all the rows that are duplicate
CREATE TEMP TABLE duplicate_saleids AS
SELECT saleid
FROM sales
WHERE saledateid BETWEEN 2224 AND 2231
GROUP BY saleid
HAVING COUNT(*) > 1;
-- Extract one copy of all the duplicate rows
CREATE TEMP TABLE new_sales(LIKE sales);
INSERT INTO new_sales
SELECT DISTINCT *
FROM sales
WHERE saledateid BETWEEN 2224 AND 2231
AND saleid IN(
SELECT saleid
FROM duplicate_saleids
);
-- Remove all rows that were duplicated (all copies).
DELETE FROM sales
WHERE saledateid BETWEEN 2224 AND 2231
AND saleid IN(
SELECT saleid
FROM duplicate_saleids
);
-- Insert back in the single copies
INSERT INTO sales
SELECT *
FROM new_sales;
-- Cleanup
DROP TABLE duplicate_saleids;
DROP TABLE new_sales;
COMMIT;
Full article: https://elliot.land/post/removing-duplicate-data-in-redshift
That should have worked. Alternatively, you can do:
With
duplicates As (
Select *, ROW_NUMBER() Over (PARTITION by record_indicator
Order by record_indicator) as Duplicate
From table_name)
delete from table_name
where id in (select id from duplicates Where Duplicate > 1);
or
delete from table_name
where id in (
select id
from (
Select id, ROW_NUMBER() Over (PARTITION by record_indicator
Order by record_indicator) as Duplicate
From table_name) x
Where Duplicate > 1);
If you have no primary key, you can do the following:
BEGIN;
CREATE TEMP TABLE mydups ON COMMIT DROP AS
SELECT DISTINCT ON (record_indicator) *
FROM table_name
ORDER BY record_indicator --, other_optional_priority_field DESC
;
DELETE FROM table_name
WHERE record_indicator IN (
SELECT record_indicator FROM mydups);
INSERT INTO table_name SELECT * FROM mydups;
COMMIT;
This method will preserve permissions and the table definition of the original_table.
The most upvoted answer does not preserve permissions on the table or the original definition of the table.
In a real-world production environment, this is the safest and easiest method to execute.
This will have DOWN TIME in PROD.
Create Table with unique rows
CREATE TABLE unique_table as
(
SELECT DISTINCT * FROM original_table
)
;
Backup the original_table
CREATE TABLE backup_table as
(
SELECT * FROM original_table
)
;
Truncate the original_table
TRUNCATE original_table;
Insert records from unique_table into original_table
INSERT INTO original_table
(
SELECT * FROM unique_table
)
;
To avoid DOWN TIME, run the queries below in a TRANSACTION, and use DELETE instead of TRUNCATE:
BEGIN transaction;
CREATE TABLE unique_table as
(
SELECT DISTINCT * FROM original_table
)
;
CREATE TABLE backup_table as
(
SELECT * FROM original_table
)
;
DELETE FROM original_table;
INSERT INTO original_table
(
SELECT * FROM unique_table
)
;
END transaction;
Simple answer to this question:
First, create a temporary table from the main table containing only the rows where row_number = 1.
Second, delete all the rows from the main table that have duplicates.
Then insert the values from the temporary table back into the main table.
Queries:
Temporary table
select id, date into #temp_a
from
(select *
 from (select a.*,
              row_number() over(partition by id order by etl_createdon desc) as rn
       from table a
       where a.id between 59 and 75 and a.date = '2018-05-24') t
 where rn = 1) a
Deleting all the rows from the main table:
delete from table a
where a.id between 59 and 75 and a.date = '2018-05-24'
Inserting all values from the temp table into the main table:
insert into table a select * from #temp_a
The following deletes all records in 'tablename' that have a duplicate; it will not deduplicate the table:
DELETE FROM tablename
WHERE id IN (
SELECT id
FROM (
SELECT id,
ROW_NUMBER() OVER (partition BY column1, column2, column3 ORDER BY id) AS rnum
FROM tablename
) t
WHERE t.rnum > 1);
Source: Postgres administrative snippets
Your query does not work because Redshift does not allow DELETE after the WITH clause. Only SELECT and UPDATE and a few others are allowed (see WITH clause)
Solution (in my situation):
I had an id column on my table events that contained duplicate rows and uniquely identifies the record. This id column is the equivalent of your record_indicator.
Unfortunately I was unable to create a temporary table because I ran into the following error using SELECT DISTINCT:
ERROR: Intermediate result row exceeds database block size
But this worked like a charm:
CREATE TABLE temp as (
SELECT *,ROW_NUMBER() OVER (PARTITION BY id ORDER BY id) AS rownumber
FROM events
);
resulting in the temp table:
id | rownumber | ...
----------------
1 | 1 | ...
1 | 2 | ...
2 | 1 | ...
2 | 2 | ...
Now the duplicates can be deleted by removing the rows having rownumber larger than 1:
DELETE FROM temp WHERE rownumber > 1
After that, rename the tables and you're done.
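A sketch of that last step, assuming the original table is events; the helper rownumber column can be dropped before the swap:
ALTER TABLE temp DROP COLUMN rownumber;
ALTER TABLE events RENAME TO events_with_dups;
ALTER TABLE temp RENAME TO events;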
with duplicates as
(
select a.*, row_number() over (partition by first_name, last_name, email order by first_name, last_name, email) as rn from contacts a
)
delete from contacts
where contact_id in (
select contact_id from duplicates where rn >1
)
I have a table like
CREATE TABLE meta.fk_payment1
(
id serial NOT NULL,
settlement_ref_no character varying,
order_type character varying,
fulfilment_type character varying,
seller_sku character varying,
wsn character varying,
order_id character varying,
order_item_id bigint,
....
);
I am inserting data from a CSV file where all columns are the same except for the
id column.
If the CSV file is uploaded more than once, the data will be duplicated,
but the id will not be, since id is the primary key.
So I want to remove all duplicate rows without using the primary key.
I have to do this on a single table.
You can do it like this, e.g.:
DELETE FROM table_name
WHERE ctid NOT IN
(SELECT MAX(dt.ctid)
FROM table_name As dt
GROUP BY dt.*);
Run this query:
DELETE FROM meta.fk_payment1
WHERE ctid NOT IN
(SELECT MAX(dt.ctid)
FROM meta.fk_payment1 As dt
GROUP BY dt.*);
Copy distinct data to a work table fk_payment1_copy. The simplest way to do that is to use SELECT ... INTO:
SELECT max(id),settlement_ref_no ...
INTO fk_payment1_copy
from fk_payment1
GROUP BY settlement_ref_no ...
Delete all rows from fk_payment1:
delete from fk_payment1
Then copy the data from the fk_payment1_copy table back into fk_payment1:
insert into fk_payment1
select id,settlement_ref_no ...
from fk_payment1_copy
If the table isn't very large you can do:
-- create temporary table and select distinct into it.
CREATE TEMP TABLE tmp_table AS
SELECT DISTINCT column_1, column_2
FROM original_table ORDER BY column_1, column_2;
-- clear the original table
TRUNCATE original_table;
-- copy data back in again
INSERT INTO original_table(column_1, column_2)
SELECT * FROM tmp_table ORDER BY column_1, column_2;
-- clean up
DROP TABLE tmp_table;
For larger tables, remove the TEMP keyword from the tmp_table creation.
This solution comes in handy when working with JPA (Hibernate) produced @ElementCollection tables, which are created without a primary key.
A bit unsure about the primary key part in the question, but in any case id doesn't need to be a primary key, it just needs to be unique. As it should be since it's serial. So if it has unique values, you can do it this way:
DELETE FROM fk_payment1 f WHERE EXISTS
(SELECT * FROM fk_payment1 WHERE id<f.id
AND settlement_ref_no=f.settlement_ref_no
AND ...)
Just need to add all columns in the select query. This way all rows that have the same values (except id) and are after this row (sorted by id) will be deleted.
(Also, naming a table with fk_ prefix makes it look like a foreign key.)
So there is a slick way right in the PG wiki: https://wiki.postgresql.org/wiki/Deleting_duplicates
This query does that for all rows of tablename having the same column1, column2, and column3.
DELETE FROM tablename
WHERE id IN (SELECT id
FROM (SELECT id,
ROW_NUMBER() OVER (partition BY column1, column2, column3 ORDER BY id) AS rnum
FROM tablename) t
WHERE t.rnum > 1);
I tested this de-duping 600k rows down to 200k unique rows. The solution using GROUP BY and NOT IN took 3h+; this takes about 3s.
I need to remove duplicate rows from a fairly large SQL Server table (i.e. 300,000+ rows).
The rows, of course, will not be perfect duplicates because of the existence of the RowID identity field.
MyTable
RowID int not null identity(1,1) primary key,
Col1 varchar(20) not null,
Col2 varchar(2048) not null,
Col3 tinyint not null
How can I do this?
Assuming no nulls, you GROUP BY the unique columns, and SELECT the MIN (or MAX) RowId as the row to keep. Then, just delete everything that didn't have a row id:
DELETE MyTable
FROM MyTable
LEFT OUTER JOIN (
SELECT MIN(RowId) as RowId, Col1, Col2, Col3
FROM MyTable
GROUP BY Col1, Col2, Col3
) as KeepRows ON
MyTable.RowId = KeepRows.RowId
WHERE
KeepRows.RowId IS NULL
In case you have a GUID instead of an integer, you can replace
MIN(RowId)
with
CONVERT(uniqueidentifier, MIN(CONVERT(char(36), MyGuidColumn)))
Another possible way of doing this is
;
--Ensure that any immediately preceding statement is terminated with a semicolon above
WITH cte
AS (SELECT ROW_NUMBER() OVER (PARTITION BY Col1, Col2, Col3
ORDER BY ( SELECT 0)) RN
FROM #MyTable)
DELETE FROM cte
WHERE RN > 1;
I am using ORDER BY (SELECT 0) above as it is arbitrary which row to preserve in the event of a tie.
To preserve the latest one in RowID order for example you could use ORDER BY RowID DESC
Execution Plans
The execution plan for this is often simpler and more efficient than that in the accepted answer as it does not require the self join.
This is not always the case however. One place where the GROUP BY solution might be preferred is situations where a hash aggregate would be chosen in preference to a stream aggregate.
The ROW_NUMBER solution will always give pretty much the same plan whereas the GROUP BY strategy is more flexible.
Factors which might favour the hash aggregate approach would be
No useful index on the partitioning columns
relatively fewer groups with relatively more duplicates in each group
In extreme versions of this second case (if there are very few groups with many duplicates in each) one could also consider simply inserting the rows to keep into a new table then TRUNCATE-ing the original and copying them back to minimise logging compared to deleting a very high proportion of the rows.
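A sketch of that insert / truncate / copy-back variant, using the table and columns from the question (the temp table name is illustrative; IDENTITY_INSERT is needed because RowID is an identity column):
SELECT MIN(RowID) AS RowID, Col1, Col2, Col3
INTO #KeepRows
FROM MyTable
GROUP BY Col1, Col2, Col3;

TRUNCATE TABLE MyTable;

SET IDENTITY_INSERT MyTable ON;
INSERT INTO MyTable (RowID, Col1, Col2, Col3)
SELECT RowID, Col1, Col2, Col3 FROM #KeepRows;
SET IDENTITY_INSERT MyTable OFF;

DROP TABLE #KeepRows;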
There's a good article on removing duplicates on the Microsoft Support site. It's pretty conservative - they have you do everything in separate steps - but it should work well against large tables.
I've used self-joins to do this in the past, although it could probably be prettied up with a HAVING clause:
DELETE dupes
FROM MyTable dupes, MyTable fullTable
WHERE dupes.dupField = fullTable.dupField
AND dupes.secondDupField = fullTable.secondDupField
AND dupes.uniqueField > fullTable.uniqueField
The following query is useful to delete duplicate rows. The table in this example has ID as an identity column and the columns which have duplicate data are Column1, Column2 and Column3.
DELETE FROM TableName
WHERE ID NOT IN (SELECT MAX(ID)
FROM TableName
GROUP BY Column1,
Column2,
Column3
/*Even if ID is not null-able SQL Server treats MAX(ID) as potentially
nullable. Because of semantics of NOT IN (NULL) including the clause
below can simplify the plan*/
HAVING MAX(ID) IS NOT NULL)
The following script shows usage of GROUP BY, HAVING, ORDER BY in one query, and returns the results with duplicate column and its count.
SELECT YourColumnName,
COUNT(*) TotalCount
FROM YourTableName
GROUP BY YourColumnName
HAVING COUNT(*) > 1
ORDER BY COUNT(*) DESC
delete t1
from table t1, table t2
where t1.columnA = t2.columnA
and t1.rowid>t2.rowid
Postgres:
delete
from table t1
using table t2
where t1.columnA = t2.columnA
and t1.rowid > t2.rowid
DELETE LU
FROM (SELECT *,
Row_number()
OVER (
partition BY col1, col2, col3
ORDER BY rowid DESC) [Row]
FROM mytable) LU
WHERE [row] > 1
This will delete duplicate rows, except the first row
DELETE
FROM
Mytable
WHERE
RowID NOT IN (
SELECT
MIN(RowID)
FROM
Mytable
GROUP BY
Col1,
Col2,
Col3
)
Refer (http://www.codeproject.com/Articles/157977/Remove-Duplicate-Rows-from-a-Table-in-SQL-Server)
I would prefer a CTE for deleting duplicate rows from a SQL Server table.
I strongly recommend following this article: http://codaffection.com/sql-server-article/delete-duplicate-rows-in-sql-server/
Keeping the original:
WITH CTE AS
(
SELECT *,ROW_NUMBER() OVER (PARTITION BY col1,col2,col3 ORDER BY col1,col2,col3) AS RN
FROM MyTable
)
DELETE FROM CTE WHERE RN<>1
Without keeping the original:
WITH CTE AS
(SELECT *,R=RANK() OVER (ORDER BY col1,col2,col3)
FROM MyTable)
DELETE CTE
WHERE R IN (SELECT R FROM CTE GROUP BY R HAVING COUNT(*)>1)
To Fetch Duplicate Rows:
SELECT
name, email, COUNT(*)
FROM
users
GROUP BY
name, email
HAVING COUNT(*) > 1
To Delete the Duplicate Rows:
DELETE users
WHERE rowid NOT IN
(SELECT MIN(rowid)
FROM users
GROUP BY name, email);
Quick and Dirty to delete exact duplicated rows (for small tables):
select distinct * into t2 from t1;
delete from t1;
insert into t1 select * from t2;
drop table t2;
I prefer the subquery / HAVING COUNT(*) > 1 solution to the inner join because I found it easier to read, and it was very easy to turn into a SELECT statement to verify what would be deleted before running it.
--DELETE FROM table1
--WHERE id IN (
SELECT MIN(id) FROM table1
GROUP BY col1, col2, col3
-- could add a WHERE clause here to further filter
HAVING count(*) > 1
--)
SELECT DISTINCT *
INTO tempdb.dbo.tmpTable
FROM myTable
TRUNCATE TABLE myTable
INSERT INTO myTable SELECT * FROM tempdb.dbo.tmpTable
DROP TABLE tempdb.dbo.tmpTable
I thought I'd share my solution since it works under special circumstances.
In my case the table with duplicate values did not have a foreign key (because the values were duplicated from another db).
begin transaction
-- create temp table with identical structure as source table
Select * Into #temp From tableName Where 1 = 2
-- insert distinct values into temp
insert into #temp
select distinct *
from tableName
-- delete from source
delete from tableName
-- insert into source from temp
insert into tableName
select *
from #temp
rollback transaction
-- if this works, change rollback to commit and execute again to keep your changes!!
PS: when working on things like this I always use a transaction; this not only ensures everything is executed as a whole, but also allows me to test without risking anything. But of course you should take a backup anyway, just to be sure...
This query showed very good performance for me:
DELETE tbl
FROM
MyTable tbl
WHERE
EXISTS (
SELECT
*
FROM
MyTable tbl2
WHERE
tbl2.SameValue = tbl.SameValue
AND tbl.IdUniqueValue < tbl2.IdUniqueValue
)
It deleted 1M rows in a little more than 30 sec from a table of 2M rows (50% duplicates).
Using CTE. The idea is to join on one or more columns that form a duplicate record and then remove whichever you like:
;with cte as (
select
min(PrimaryKey) as PrimaryKey,
UniqueColumn1,
UniqueColumn2
from dbo.DuplicatesTable
group by
UniqueColumn1, UniqueColumn2
having count(*) > 1
)
delete d
from dbo.DuplicatesTable d
inner join cte on
d.PrimaryKey > cte.PrimaryKey and
d.UniqueColumn1 = cte.UniqueColumn1 and
d.UniqueColumn2 = cte.UniqueColumn2;
Yet another easy solution can be found at the link pasted here. It is easy to grasp and seems to be effective for most similar problems. It is for SQL Server, but the concept is broadly applicable.
Here are the relevant portions from the linked page:
Consider this data:
EMPLOYEE_ID ATTENDANCE_DATE
A001 2011-01-01
A001 2011-01-01
A002 2011-01-01
A002 2011-01-01
A002 2011-01-01
A003 2011-01-01
So how can we delete those duplicate data?
First, insert an identity column in that table by using the following code:
ALTER TABLE dbo.ATTENDANCE ADD AUTOID INT IDENTITY(1,1)
Use the following code to resolve it:
DELETE FROM dbo.ATTENDANCE WHERE AUTOID NOT IN (SELECT MIN(AUTOID)
FROM dbo.ATTENDANCE GROUP BY EMPLOYEE_ID,ATTENDANCE_DATE)
This is the easiest way to delete duplicate records:
DELETE FROM tblemp WHERE id IN
(
SELECT MIN(id) FROM tblemp
GROUP BY title HAVING COUNT(id)>1
)
Use this
WITH tblTemp as
(
SELECT ROW_NUMBER() Over(PARTITION BY Name,Department ORDER BY Name)
As RowNumber,* FROM <table_name>
)
DELETE FROM tblTemp where RowNumber >1
Here is another good article on removing duplicates.
It discusses why it's hard: "SQL is based on relational algebra, and duplicates cannot occur in relational algebra, because duplicates are not allowed in a set."
It covers the temp table solution, and two MySQL examples.
In the future, are you going to prevent it at the database level, or from an application perspective? I would suggest the database level, because your database should be responsible for maintaining referential integrity; developers will just cause problems ;)
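A sketch of the database-level prevention: a unique constraint over the columns that define a duplicate, assuming the combined key fits within SQL Server's index key size limit (column names are from the question):
ALTER TABLE MyTable
ADD CONSTRAINT UQ_MyTable_NoDups UNIQUE (Col1, Col2, Col3);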
I had a table where I needed to preserve non-duplicate rows.
I'm not sure on the speed or efficiency.
DELETE FROM myTable WHERE RowID IN (
SELECT MIN(RowID) AS IDNo FROM myTable
GROUP BY Col1, Col2, Col3
HAVING COUNT(*) = 2 )
Oh sure. Use a temp table. If you want a single, not-very-performant statement that "works" you can go with:
DELETE FROM MyTable WHERE NOT RowID IN
(SELECT
(SELECT TOP 1 RowID FROM MyTable mt2
WHERE mt2.Col1 = mt.Col1
AND mt2.Col2 = mt.Col2
AND mt2.Col3 = mt.Col3)
FROM MyTable mt)
Basically, for each row in the table, the sub-select finds the top RowID of all rows that are exactly like the row under consideration. So you end up with a list of RowIDs that represent the "original" non-duplicated rows.
The other way is to create a new table with the same fields and a unique index, then move all the data from the old table to the new one. SQL Server will automatically ignore the duplicate values (there is also an option controlling what to do when a duplicate value is found: ignore, interrupt, etc.), so you end up with the same table without duplicate rows. If you don't want the unique index, you can drop it after transferring the data.
Especially for larger tables you may use DTS (an SSIS package to import/export data) in order to transfer all the data rapidly to your new uniquely indexed table. For 7 million rows it takes just a few minutes.
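A sketch of that approach using IGNORE_DUP_KEY, which makes SQL Server silently skip rows that would violate the unique index during the copy (table and column names here are hypothetical):
CREATE TABLE CleanTable
(
    RowId   int NOT NULL,
    DupCol1 varchar(20) NOT NULL,
    DupCol2 tinyint NOT NULL
)

CREATE UNIQUE INDEX UQ_CleanTable_Dups
    ON CleanTable (DupCol1, DupCol2)
    WITH (IGNORE_DUP_KEY = ON)

-- duplicate keys are skipped with a warning instead of aborting the insert
INSERT INTO CleanTable (RowId, DupCol1, DupCol2)
SELECT RowId, DupCol1, DupCol2
FROM OldTable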
Using the query below we are able to delete duplicate records based on a single column or multiple columns. The query below deletes based on two columns. The table name is testing and the column names are empno and empname.
DELETE FROM testing WHERE empno not IN (SELECT empno FROM (SELECT empno, ROW_NUMBER() OVER (PARTITION BY empno ORDER BY empno)
AS [ItemNumber] FROM testing) a WHERE ItemNumber > 1)
or empname not in
(select empname from (select empname,row_number() over(PARTITION BY empno ORDER BY empno)
AS [ItemNumber] FROM testing) a WHERE ItemNumber > 1)
Create a new blank table with the same structure.
Execute a query like this:
INSERT INTO tc_category1
SELECT *
FROM tc_category
GROUP BY category_id, application_id
HAVING count(*) > 1
Then execute this query:
INSERT INTO tc_category1
SELECT *
FROM tc_category
GROUP BY category_id, application_id
HAVING count(*) = 1
Another way of doing this:
DELETE A
FROM TABLE A,
TABLE B
WHERE A.COL1 = B.COL1
AND A.COL2 = B.COL2
AND A.UNIQUEFIELD > B.UNIQUEFIELD
I would mention this approach as well, since it can be helpful and works in all SQL Server versions:
Pretty often there are only one or two duplicates, and the IDs and count of the duplicates are known. In this case:
SET ROWCOUNT 1 -- or set to number of rows to be deleted
delete from myTable where RowId = DuplicatedID
SET ROWCOUNT 0
From the application level (unfortunately). I agree that the proper way to prevent duplication is at the database level through the use of a unique index, but in SQL Server 2005, an index is allowed to be only 900 bytes, and my varchar(2048) field blows that away.
I dunno how well it would perform, but I think you could write a trigger to enforce this, even if you couldn't do it directly with an index. Something like:
-- given a table stories(story_id int not null primary key, story varchar(max) not null)
CREATE TRIGGER prevent_plagiarism
ON stories
after INSERT, UPDATE
AS
DECLARE #cnt AS INT
SELECT #cnt = Count(*)
FROM stories
INNER JOIN inserted
ON ( stories.story = inserted.story
AND stories.story_id != inserted.story_id )
IF #cnt > 0
BEGIN
RAISERROR('plagiarism detected',16,1)
ROLLBACK TRANSACTION
END
Also, varchar(2048) sounds fishy to me (some things in life are 2048 bytes, but it's pretty uncommon); should it really not be varchar(max)?
DELETE
FROM
table_name T1
WHERE
rowid > (
SELECT
min(rowid)
FROM
table_name T2
WHERE
T1.column_name = T2.column_name
);
CREATE TABLE car(Id int identity(1,1), PersonId int, CarId int)
INSERT INTO car(PersonId,CarId)
VALUES(1,2),(1,3),(1,2),(2,4)
--SELECT * FROM car
;WITH CTE as(
SELECT ROW_NUMBER() over (PARTITION BY personid,carid order by personid,carid) as rn,Id,PersonID,CarId from car)
DELETE FROM car where Id in(SELECT Id FROM CTE WHERE rn>1)
If you want to preview the rows you are about to remove and keep control over which of the duplicate rows to keep, see http://developer.azurewebsites.net/2014/09/better-sql-group-by-find-duplicate-data/
with MYCTE as (
SELECT ROW_NUMBER() OVER (
PARTITION BY DuplicateKey1
,DuplicateKey2 -- optional
ORDER BY CreatedAt -- the first row among duplicates will be kept, other rows will be removed
) RN
FROM MyTable
)
DELETE FROM MYCTE
WHERE RN > 1