Eliminating Duplicate Records in a DB2 Table

How do I delete duplicate records in a DB2 table? I want to be left with a single record for each group of dupes.

1. Create another table, "no_dups", with exactly the same columns as the table you want to eliminate the duplicates from. (You may want to add an identity column, just to make individual rows easier to identify.)
2. Insert into "no_dups", select distinct column1, column2...columnN from the original table. The "select distinct" should bring back only one row for each group of duplicates in the original table. If it doesn't, you may have to alter the list of columns or take a closer look at your data; it may look like duplicate data without actually being so.
3. When step 2 is done, you will have your original table, and "no_dups" will hold all the rows without duplicates. At this point you can do any number of things: drop and rename tables, or delete everything from the original and then insert into the original, select * from no_dups.
4. If you're running into problems identifying duplicates, and you've added an identity column to "no_dups", you should be able to delete rows one by one using the identity column value.
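The steps above can be tried out end to end with a small script. This is only a sketch: it uses Python's built-in sqlite3 module with an in-memory SQLite database as a stand-in for DB2, and the "orders" table and its columns are made up for illustration. The SQL statements themselves are plain enough that they should carry over to DB2 with little change, but that is an assumption you should verify on a test table.

```python
import sqlite3

# In-memory SQLite stands in for DB2; "orders" and its columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, item TEXT, qty INTEGER);
    INSERT INTO orders VALUES
        ('alice', 'pen', 2),
        ('alice', 'pen', 2),   -- exact duplicate
        ('bob',   'ink', 1),
        ('bob',   'ink', 1),   -- exact duplicate
        ('bob',   'ink', 3);   -- differs in qty, so not a duplicate

    -- Step 1: a table with the same columns as the original.
    CREATE TABLE no_dups (customer TEXT, item TEXT, qty INTEGER);

    -- Step 2: copy one row per distinct combination of values.
    INSERT INTO no_dups SELECT DISTINCT customer, item, qty FROM orders;

    -- Step 3: empty the original and reload it from no_dups.
    DELETE FROM orders;
    INSERT INTO orders SELECT customer, item, qty FROM no_dups;
    DROP TABLE no_dups;
""")
conn.commit()

remaining = conn.execute(
    "SELECT customer, item, qty FROM orders ORDER BY customer, item, qty"
).fetchall()
print(remaining)
```

On a real table you would probably keep "no_dups" around until you have checked the row counts, rather than dropping it straight away as this sketch does.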

Related

Copying rows to the same table without listing the columns

I need to create a stored procedure that copies existing rows matching a condition into the same table and deactivates the existing records by setting the active flag column to 0.
I tried insert into [tablename] select * from [tablename] where [condition] and, obviously, got an error from the primary key constraint. Listing every column except the primary key in the select would work, but there are multiple tables and some of them have around 300 columns, so I don't want a long column list in the select. Since I use SQL Server, I found a solution here on SO that reads the column list from information_schema.columns and builds a dynamic query. However, I am not satisfied with any of those solutions. Is there any other way to do this?

SQLITE3 Insert into the most recent row where 1 column is an exact match

I have a fairly large DB with lots of columns. One of them holds a value that I need to select on, but several rows in the DB have that value. How can I insert into a column of the newest row in the DB that has a matching column value?
Without knowing the ins and outs of your database, I think you would likely want to select the largest id you have in the auto-incrementing column. For instance:
SELECT MAX(UNIQUE_ID) FROM TABLE WHERE MATCHING_COLUMN = MATCHING_VALUE
From there you can take your unique ID and write into that row with an UPDATE that filters on it.
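As a rough sketch of that idea, the lookup and the write can be combined into one statement by using the MAX query as a subquery. The "readings" table, its columns, and the annotate_newest helper below are hypothetical names chosen for the example, and the sketch assumes the table has an auto-incrementing id column as the answer supposes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical table; unique_id plays the role of the auto-incrementing id.
    CREATE TABLE readings (
        unique_id INTEGER PRIMARY KEY AUTOINCREMENT,
        sensor    TEXT,
        note      TEXT
    );
    INSERT INTO readings (sensor) VALUES ('a'), ('b'), ('a');
""")

def annotate_newest(conn, sensor, note):
    """Write `note` into the newest row (largest unique_id) for `sensor`.

    Returns the number of rows changed: 1 on a match, 0 if no row matches.
    """
    cur = conn.execute(
        """
        UPDATE readings
           SET note = ?
         WHERE unique_id = (SELECT MAX(unique_id)
                              FROM readings
                             WHERE sensor = ?)
        """,
        (note, sensor),
    )
    conn.commit()
    return cur.rowcount

updated = annotate_newest(conn, "a", "checked")
rows = conn.execute(
    "SELECT unique_id, sensor, note FROM readings ORDER BY unique_id"
).fetchall()
print(updated, rows)
```

Passing the values as parameters (the ? placeholders) avoids building the SQL string by hand.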

Remove duplicate SQL rows by looking at all columns

I have this table, where every column is a VARCHAR (or equivalent):
field001 field002 field003 field004 field005 .... field500
500 VARCHAR columns. No primary keys. And no column is guaranteed to be unique. So the only way to know for sure if two rows are the same is to compare the values of all columns.
(Yes, this should be in TheDailyWTF. No, it's not my fault. Bear with me here).
I inserted a duplicate set of rows by mistake, and I need to find them and remove them.
There's 12 million rows on this table, so I'd rather not recreate it.
However, I do know what rows were mistakenly inserted (I have the .sql file).
So I figured I'd create another table and load it with those. And then I'd do some sort of join that would compare all columns on both tables and then delete the rows that are equal from the first table. I tried a NATURAL JOIN as that looked promising, but nothing was returned.
What are my options?
I'm using Amazon Redshift (so PostgreSQL 8.4 if I recall), but I think this is a general SQL question.
You can treat the whole row as a single record in Postgres (and thus I think in Redshift).
The following works in Postgres and keeps one row from each group of duplicates:
delete from the_table
where ctid not in (select min(ctid)
from the_table
group by the_table); --<< Yes, the group by is correct!
This is going to be slow!
Grouping over so many columns and then deleting with a NOT IN will take quite some time, especially if a lot of rows are going to be deleted.
If you want to delete all duplicate rows (not keeping any of them), you can use the following:
delete from the_table
where the_table in (select the_table
from the_table
group by the_table
having count(*) > 1);
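For experimenting with the keep-one-row variant on a small copy of the data, here is a sketch of the same idea in SQLite via Python's sqlite3 module. It is an analogue, not the Postgres statement itself: SQLite has no ctid and cannot group by a whole row, so the sketch uses rowid in place of ctid and builds the GROUP BY list from the table's metadata. The three-column the_table is a stand-in for the 500-column one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Small stand-in for the wide table: no key, every column is text.
    CREATE TABLE the_table (field001 TEXT, field002 TEXT, field003 TEXT);
    INSERT INTO the_table VALUES
        ('a', 'b', 'c'),
        ('a', 'b', 'c'),
        ('a', 'b', 'x'),
        ('d', 'e', 'f'),
        ('d', 'e', 'f'),
        ('d', 'e', 'f');
""")

# Read the column names from the table's metadata so that nobody has to
# type out every column by hand (index 1 of each row is the column name).
columns = [row[1] for row in conn.execute("PRAGMA table_info(the_table)")]
group_by = ", ".join('"' + name + '"' for name in columns)

# Keep the row with the smallest rowid in each group of identical rows.
deleted = conn.execute(
    "DELETE FROM the_table WHERE rowid NOT IN "
    f"(SELECT MIN(rowid) FROM the_table GROUP BY {group_by})"
).rowcount
conn.commit()

remaining = conn.execute(
    "SELECT * FROM the_table ORDER BY field001, field002, field003"
).fetchall()
print(deleted, remaining)
```

Generating the column list from metadata is a design choice that also applies on other engines when grouping by the whole row is not available.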
You should be able to identify all the mistakenly inserted rows using CREATEXID. If you group by CREATEXID on your table as below and look at the counts, you can see how many rows were inserted in your transaction and then remove them using a DELETE command.
SELECT CREATEXID,COUNT(1)
FROM yourtable
GROUP BY 1;
One simplistic solution is to recreate the table, e.g.
CREATE TABLE my_temp_table (
-- add column definitions here, just like the original table
);
INSERT INTO my_temp_table SELECT DISTINCT * FROM original_table;
DROP TABLE original_table;
ALTER TABLE my_temp_table RENAME TO original_table;
or even
CREATE TABLE my_temp_table AS SELECT DISTINCT * FROM original_table;
DROP TABLE original_table;
ALTER TABLE my_temp_table RENAME TO original_table;
This is a trick, but it will probably help.
Each row in the table carries the ID of the transaction in which it was inserted or updated (see System Columns). It is in the xmin column, so you can use it to find the ID of the transaction in which you inserted the wrong data. Then just delete the rows using
delete from my_table where xmin = <the_wrong_transaction_id>;
PS: Be careful and try it on a test table first.

Oracle SQL merge tables without specifying columns

I have a table people with less than 100,000 records and I have taken a backup of this table using the following:
create table people_backup as select * from people
I add some new records to my people table over time, but eventually I want to merge the records from my backup table into people. Unfortunately I cannot simply DROP my table as my new records will be lost!
So I want to update the records in my people table using the records from people_backup, based on their primary key id and I have found 2 ways to do this:
MERGE the tables together
use some sort of fancy correlated update
Great! However, both of these methods use SET and make me specify what columns I want to update. Unfortunately I am lazy and the structure of people may change over time and while my CTAS statement doesn't need to be updated, my update/merge script will need changes, which feels like unnecessary work for me.
Is there a way to merge entire rows without having to specify columns? I see here that not specifying columns during an INSERT directs SQL to insert values by position. Can the same approach be applied here, and is it safe?
NB: The structure of the table will not change between backups
Given that your table is small, you could simply
DELETE FROM table t
WHERE EXISTS( SELECT 1
FROM backup b
WHERE t.key = b.key );
INSERT INTO table
SELECT *
FROM backup;
That is slow and not particularly elegant (particularly if most of the data from the backup hasn't changed) but assuming the columns in the two tables match, it does allow you to not list out the columns. Personally, I'd much prefer writing out the column names (presumably those don't change all that often) so that I could do an update.
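To see the delete-then-insert approach in action, here is a minimal sketch using Python's sqlite3 module with an in-memory SQLite database standing in for Oracle. The people and people_backup names come from the question; the columns, sample rows, and the restore_from_backup helper are assumptions made for the example. Both statements run in one transaction so a failure partway through does not leave the table half restored.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical columns; only the primary key "id" matters to the technique.
    CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    INSERT INTO people VALUES (1, 'Ann', 'Oslo'), (2, 'Bob', 'Rome');
    CREATE TABLE people_backup AS SELECT * FROM people;

    -- Changes made after the backup was taken.
    UPDATE people SET city = 'Paris' WHERE id = 2;
    INSERT INTO people VALUES (3, 'Cy', 'Lima');
""")

def restore_from_backup(conn):
    """Replace every row that exists in the backup; leave newer rows alone."""
    with conn:  # commits on success, rolls back if either statement fails
        conn.execute(
            """
            DELETE FROM people
             WHERE EXISTS (SELECT 1
                             FROM people_backup b
                            WHERE b.id = people.id)
            """
        )
        # No column list: relies on both tables having the same column order.
        conn.execute("INSERT INTO people SELECT * FROM people_backup")

restore_from_backup(conn)
people = conn.execute("SELECT id, name, city FROM people ORDER BY id").fetchall()
print(people)
```

Row 2 goes back to its backed-up values while row 3, which was added after the backup, is left in place.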

Updating rows in order with SQL

I have a table with 4 columns. The first column is unique for each row, but it's a string (URL format).
I want to update my table, but instead of using "WHERE", I want to update the rows in order.
The first query will update the first row, the second query updates the second row and so on.
What's the SQL code for that? I'm using Sqlite.
Edit: My table schema
CREATE table (
url varchar(150),
views int(5),
clicks int(5)
)
Edit2: What I'm doing right now is a loop of SQL queries
update table set views = 5, clicks = 10 where url = "http://someurl.com";
There are around 4 million records in the database, and the update takes around 16 seconds on my server. Since the loop updates the rows in order (the first query updates the first row, and so on), I'm wondering whether updating the rows in order could be faster than using the WHERE clause, which needs to scan 4 million rows.
You can't do what you want without using WHERE as this is the only way to select rows from a table for reading, updating or deleting. So you will want to use:
UPDATE table SET url = ... WHERE url = '<whatever>'
HOWEVER... SQLite has an extra feature: the autogenerated column, ROWID. You can use this column in queries. You don't see this data by default, so if you want the data within it you need to explicitly request it, e.g.:
SELECT ROWID, * FROM table
What this means is that you may be able to do what you want referencing this column directly:
UPDATE table SET url = ... WHERE ROWID = 1
You still need to use the WHERE clause, but this allows you to access the rows in insert order without doing anything else.
CAVEAT
ROWID effectively stores the INSERT order of the rows. If you delete rows from the table, the ROWIDs for remaining rows will NOT change - hence it is possible to have gaps in the ROWID sequence. This is by design and there is no workaround short of re-creating the table and re-populating the data.
PORTABILITY
Note that this only applies to SQLite - you may not be able to do the same thing with other SQL engines should you ever need to port this. It would be MUCH better to add an EXPLICIT auto-number column (aka an IDENTITY field) that you can use and manage.
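A small sketch of the ROWID behaviour described above, run through Python's sqlite3 module. The "pages" table name and the example URLs are made up (the question's table is unnamed); the columns follow the schema in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical table name; columns as in the question's schema.
    CREATE TABLE pages (url VARCHAR(150), views INT, clicks INT);
    INSERT INTO pages VALUES ('http://a.example', 0, 0);
    INSERT INTO pages VALUES ('http://b.example', 0, 0);
    INSERT INTO pages VALUES ('http://c.example', 0, 0);
""")

# Update the second inserted row by its ROWID instead of by url.
conn.execute("UPDATE pages SET views = 5, clicks = 10 WHERE rowid = 2")

# Deleting a row leaves a gap: the remaining rows keep their ROWIDs.
conn.execute("DELETE FROM pages WHERE rowid = 1")
conn.commit()

rows = conn.execute(
    "SELECT rowid, url, views, clicks FROM pages ORDER BY rowid"
).fetchall()
print(rows)
```

If you go with the explicit auto-number column suggested above, declaring it as INTEGER PRIMARY KEY in SQLite makes it an alias for the ROWID, so lookups on it stay just as direct.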