Changing table field to UNIQUE - sql

I want to run the following sql command:
ALTER TABLE `my_table` ADD UNIQUE (
`ref_id` ,
`type`
);
The problem is that some of the data in the table would make this invalid, therefore altering the table fails.
Is there a clever way in MySQL to delete the duplicate rows?

SQL can, at best, resolve this arbitrarily. To put it another way: this is your decision to make.
You have data that currently isn't unique. You want to make it unique. You need to decide how to handle the duplicates.
There are a variety of ways of handling this:
Modifying or deleting duplicate rows by hand if the numbers are sufficiently small;
Running statements to update or delete duplicates that meet certain criteria, to get to a point where the exceptions can be dealt with on an individual basis;
Copying the data to a temporary table, emptying the original and using queries to repopulate the table; and
so on.
Note: these all require user intervention.
You could of course just copy the table to a temporary table, empty the original, and copy the rows back in while ignoring those that fail, but I expect that won't give you the results you really want.
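For illustration anyway, a minimal sketch of that copy-and-repopulate approach in MySQL, using the names from the question (the temporary table name is invented):
CREATE TABLE my_table_tmp LIKE my_table;
ALTER TABLE my_table_tmp ADD UNIQUE (ref_id, type);
-- INSERT IGNORE silently skips rows that violate the unique key,
-- so an arbitrary first row wins for each (ref_id, type) pair.
INSERT IGNORE INTO my_table_tmp SELECT * FROM my_table;
RENAME TABLE my_table TO my_table_old, my_table_tmp TO my_table;
-- DROP TABLE my_table_old; -- once you have verified the result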

If you don't care which row gets deleted, use IGNORE (note that ALTER IGNORE was removed in MySQL 5.7):
ALTER IGNORE TABLE `my_table` ADD UNIQUE (
`ref_id` ,
`type`
);

What you can do is add a temporary identity column to your table. With that you can write a query to identify and delete the duplicates (you can modify the query a little to make sure only one copy from each set of duplicate rows is retained).
Once this is done, drop the temporary column and add the unique constraint to your original columns.
Hope this helps.
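A rough sketch of that idea in MySQL, reusing the question's table and columns (the tmp_id column name is invented):
ALTER TABLE my_table
  ADD COLUMN tmp_id INT NOT NULL AUTO_INCREMENT,
  ADD UNIQUE KEY tmp_id_uq (tmp_id);
-- For each (ref_id, type) group, keep the row with the lowest tmp_id.
DELETE a
FROM my_table a
JOIN my_table b
  ON  a.ref_id = b.ref_id
  AND a.type   = b.type
  AND a.tmp_id > b.tmp_id;
ALTER TABLE my_table DROP COLUMN tmp_id;
ALTER TABLE my_table ADD UNIQUE (ref_id, type);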

What I've done in the past is export the unique set of data, drop the table, recreate it with the unique constraint in place, and import the data.
It is often faster than trying to figure out how to delete the duplicate data.

There is a good KB article that provides a step-by-step approach to finding and removing rows that have duplicate values. It provides two approaches - a one-off approach for finding and removing a single row and a broader solution to solving this when many rows are involved.
http://support.microsoft.com/kb/139444

Here is a snippet I used to delete duplicate rows in one of the tables (SQL Server):
BEGIN TRANSACTION
-- Rank the rows within each duplicate group; the row with the
-- highest CoInsrAmt in a group gets rank 1.
SELECT *,
       RANK() OVER (PARTITION BY PolicyId, PlanSeqNum, BaseProductSeqNum,
                                 CoInsrTypeCd, SupplierTypeSeqNum
                    ORDER BY CoInsrAmt DESC) AS MyRank
INTO #tmpTable
FROM PlanCoInsr
-- Keep one copy per group (DISTINCT collapses any rank-1 ties).
SELECT DISTINCT PolicyId, PlanSeqNum, BaseProductSeqNum,
       SupplierTypeSeqNum, CoInsrTypeCd, CoInsrAmt
INTO #tmpTable2
FROM #tmpTable
WHERE MyRank = 1
-- Swap the de-duplicated rows back in.
TRUNCATE TABLE PlanCoInsr
INSERT INTO PlanCoInsr
SELECT * FROM #tmpTable2
DROP TABLE #tmpTable
DROP TABLE #tmpTable2
COMMIT

This worked for me:
ALTER TABLE table_name ADD UNIQUE KEY field_name (field_name)

You will have to use some other field that is unique, because deleting by matching on ref_id and type alone will delete all copies of each duplicate, not just the extras.
To get the duplicates:
select ref_id, type from my_table group by ref_id, type having count(*)>1
Xaprb has some clever tricks (maybe too clever): http://www.xaprb.com/blog/2007/02/06/how-to-delete-duplicate-rows-with-sql-part-2/

Related

How to delete 3 billion rows from 2 related tables

I have a table with 5 billion rows (table1) and another table with 3 billion rows (table2). These two tables are related: table1 is a child of table2. I have to delete 3 billion rows from table1 and their related rows from table2. I tried using the FORALL method from PL/SQL, but it didn't help much. Then I thought of using an Oracle partitioning strategy. Since I am not a DBA, I would like to know whether partitioning an existing table is possible on the primary key column for a selected set of ids. My primary key is a 64-bit auto-generated number.
It is hard to partition the objects online (it can be done using dbms_redefinition), and it is not necessary with the details you gave.
The best idea would be to recreate the objects without the undesired rows.
For example, some simple code would be like:
create table undesired_data as (select undesired rows from table1);
create table table1_new as (select * from table1 where key not in (select key from undesired_data));
create table table2_new as (select * from table2 where key not in (select key from undesired_data));
rename table1 to table1_old;
rename table2 to table2_old;
rename table1_new to table1;
rename table2_new to table2;
recreate constraints;
check if everything is ok;
drop table1_old and table2_old;
This can be done with consumers offline, and the downtime for them will be very small if the scripts are right (you should test them in a test environment first).
Sounds very dubious.
If it is a real use-case then you don't delete; you create another, well-defined table, including partitioning, and you fill it using insert /*+ append */ into MyNewTable select ....
The most common practice is to define partitions on dates (record create date, event date, etc.).
Again, if this is a real use-case I strongly recommend that you get real help rather than seeking advice on the internet and doing it yourself.
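That said, as a rough illustration only (Oracle; all column names, types, and partition bounds below are invented), the recreate-with-partitions approach might look like:
-- A new, range-partitioned table to replace table1.
CREATE TABLE table1_new (
  id          NUMBER(19)   NOT NULL,
  create_date DATE         NOT NULL,
  payload     VARCHAR2(100)
)
PARTITION BY RANGE (create_date) (
  PARTITION p2022 VALUES LESS THAN (DATE '2023-01-01'),
  PARTITION p2023 VALUES LESS THAN (DATE '2024-01-01'),
  PARTITION pmax  VALUES LESS THAN (MAXVALUE)
);
-- Direct-path (append) insert of only the rows you want to keep.
INSERT /*+ APPEND */ INTO table1_new
SELECT id, create_date, payload
FROM table1
WHERE id NOT IN (SELECT key FROM undesired_data);
COMMIT;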

Remove duplicate SQL rows by looking at all columns

I have this table, where every column is a VARCHAR (or equivalent):
field001 field002 field003 field004 field005 .... field500
500 VARCHAR columns. No primary keys. And no column is guaranteed to be unique. So the only way to know for sure if two rows are the same is to compare the values of all columns.
(Yes, this should be in TheDailyWTF. No, it's not my fault. Bear with me here).
I inserted a duplicate set of rows by mistake, and I need to find them and remove them.
There are 12 million rows in this table, so I'd rather not recreate it.
However, I do know what rows were mistakenly inserted (I have the .sql file).
So I figured I'd create another table and load it with those. And then I'd do some sort of join that would compare all columns on both tables and then delete the rows that are equal from the first table. I tried a NATURAL JOIN as that looked promising, but nothing was returned.
What are my options?
I'm using Amazon Redshift (so PostgreSQL 8.4 if I recall), but I think this is a general SQL question.
You can treat the whole row as a single record in Postgres (and thus I think in Redshift).
The following works in Postgres, and will keep one of the duplicates
delete from the_table
where ctid not in (select min(ctid)
                   from the_table
                   group by the_table); --<< Yes, the group by is correct!
This is going to be slow!
Grouping over so many columns and then deleting with a NOT IN will take quite some time. Especially if a lot of rows are going to be deleted.
If you want to delete all duplicate rows (not keeping any of them), you can use the following:
delete from the_table
where the_table in (select the_table
                    from the_table
                    group by the_table
                    having count(*) > 1);
You should be able to identify all the mistakenly inserted rows using CREATEXID. If you group by CREATEXID on your table as below and get the count, you should be able to see how many rows were inserted in your transaction and remove them using a DELETE command.
SELECT createxid, COUNT(1)
FROM yourtable
GROUP BY 1;
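If that hidden column behaves as described above, the cleanup could then be as simple as (the transaction ID below is a placeholder):
-- Assumes createxid can be filtered on like a normal column.
DELETE FROM yourtable
WHERE createxid = 1234567;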
One simplistic solution is to recreate the table, e.g.
CREATE TABLE my_temp_table (
-- add column definitions here, just like the original table
);
INSERT INTO my_temp_table SELECT DISTINCT * FROM original_table;
DROP TABLE original_table;
ALTER TABLE my_temp_table RENAME TO original_table;
or even
CREATE TABLE my_temp_table AS SELECT DISTINCT * FROM original_table;
DROP TABLE original_table;
ALTER TABLE my_temp_table RENAME TO original_table;
It is a trick, but it may help.
Each row in the table carries the ID of the transaction that inserted or last updated it (see System Columns in the Postgres docs); it is the xmin column. Using it you can find the transaction ID in which you inserted the wrong data.
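A quick way to spot the bad batch (hypothetical table name; xmin is cast to text because the xid type does not always group or sort cleanly):
-- The mistakenly inserted batch should stand out by its row count.
SELECT xmin::text AS tx_id, COUNT(*) AS row_cnt
FROM my_table
GROUP BY 1
ORDER BY row_cnt DESC;
Then just delete the rows using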
delete from my_table where xmin = <the_wrong_transaction_id>;
PS: Be careful and try it on a test table first.

Oracle SQL merge tables without specifying columns

I have a table people with less than 100,000 records and I have taken a backup of this table using the following:
create table people_backup as select * from people
I add some new records to my people table over time, but eventually I want to merge the records from my backup table into people. Unfortunately I cannot simply DROP my table as my new records will be lost!
So I want to update the records in my people table using the records from people_backup, based on their primary key id and I have found 2 ways to do this:
MERGE the tables together
use some sort of fancy correlated update
Great! However, both of these methods use SET and make me specify which columns I want to update. Unfortunately I am lazy: the structure of people may change over time, and while my CTAS statement doesn't need to be updated, my update/merge script will, which feels like unnecessary work.
Is there a way to merge entire rows without having to specify columns? I see here that not specifying columns during an INSERT will direct SQL to insert values by order; can the same methodology be applied here, and is it safe?
NB: The structure of the table will not change between backups
Given that your table is small, you could simply:
DELETE FROM people p
WHERE EXISTS( SELECT 1
              FROM people_backup b
              WHERE p.id = b.id );
INSERT INTO people
SELECT *
FROM people_backup;
That is slow and not particularly elegant (particularly if most of the data in the backup hasn't changed), but assuming the columns in the two tables match, it does let you avoid listing out the columns. Personally, I'd much prefer writing out the column names (presumably they don't change all that often) so that I could do an update.
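For reference, a MERGE with the columns written out might look like this (first_name and last_name are stand-ins for whatever people actually contains):
MERGE INTO people p
USING people_backup b
  ON (p.id = b.id)
WHEN MATCHED THEN UPDATE
  SET p.first_name = b.first_name,
      p.last_name  = b.last_name
WHEN NOT MATCHED THEN INSERT (id, first_name, last_name)
  VALUES (b.id, b.first_name, b.last_name);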

Change the order of database columns

I want to change the order of the columns in a table. For example, name is the first column of my table and there are 10 other columns; I want to insert a new column in the 2nd position, after the name column.
How is this possible?
1 - It's not possible without rebuilding the table, as Martin rightly points out.
2 - It's good practice anyway to specify what fields you want, and in what order, in your SELECT statements, as n8wrl points out.
3 - If you really, really need a fixed order for your fields, you could create a view that selects the fields in the order you want, as sketched below.
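A sketch of option 3 (all names invented):
-- A view fixes the presented column order without touching the table.
CREATE VIEW my_table_v AS
SELECT name, new_col, col_a, col_b
FROM my_table;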
Like the rows in the table, there is no meaning to the order of the columns. In fact, it is best to specify the order you want the columns in your select statements rather than using select *, so you can 'insert' new columns wherever you want just by writing your SELECT statements accordingly.
It's possible to change the order, and in some instances it really matters; I have personal experience of this.
Anyway, this query works fine in MySQL:
ALTER TABLE user MODIFY Name VARCHAR(150) AFTER address;
You can achieve this by following these steps (a sketch follows the list):
remove all foreign keys and the primary key of the original table;
rename the original table;
using CTAS, recreate the original table with the columns in the order you want;
drop the old table;
apply all constraints back to the original table.
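A sketch of those steps (Oracle-style; every name and type here is hypothetical):
-- (after removing the original table's constraints)
ALTER TABLE my_table RENAME TO my_table_old;
-- CTAS with the columns listed in the desired order; the new second
-- column starts out empty.
CREATE TABLE my_table AS
SELECT name,
       CAST(NULL AS VARCHAR2(50)) AS new_col,
       col_a, col_b
FROM my_table_old;
DROP TABLE my_table_old;
-- Re-apply primary/foreign keys and any other constraints here.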

Ordering the results of a query from a temp table

I have a SQL query where I am going to be transferring a fair amount of response data down the wire, but I want to get the total rowcount as quickly as possible to facilitate binding in the UI. Basically I need to get a snapshot of all of the rows that meet a certain criteria, and then be able to page through all of the resulting rows.
Here's what I currently have:
SELECT --primary key column
INTO #tempTable
FROM --some table
--some filter clause
ORDER BY --primary key column
SELECT @@ROWCOUNT
SELECT --the primary key column and some others
FROM #tempTable
JOIN -- some table
DROP TABLE #tempTable
Every once in a while, the query results end up out of order (presumably because I am doing an unordered select from the temp table).
As I see it, I have a couple of options:
Add a second order by clause to the select from the temp table.
Move the order by clause to the second select and let the first select be unordered.
Create the temporary table with a primary key column to force the ordering of the temp table.
What is the best way to do this?
Use number 2. Just because you have a primary key on the table does not mean that the result set from a select statement will be ordered by it (even if what you see suggests it is).
There's no need to order the data when putting it into the temp table, so take that out. You'll get the same @@ROWCOUNT value either way.
So do this:
SELECT --primary key column
INTO #tempTable
FROM --some table
--some filter clause
SELECT @@ROWCOUNT
SELECT --the primary key column and some others
FROM #tempTable
JOIN -- some table
ORDER BY --primary key column
DROP TABLE #tempTable
Move the order by from the first select to the second select.
A database isn't a spreadsheet. You don't put the data into a table in a particular order.
Just make sure you order it properly when you get it back out.
Personally, I would select out the data in the order you eventually want it. So in your first select, have your order by; that way it can take advantage of any existing indexes and access paths.