Which delete statement is better for deleting millions of rows - SQL

I have a table which contains millions of rows.
I want to delete all the data which is over a week old, based on the value of the column last_updated.
So here are my two queries,
Approach 1:
Delete from A where to_date(last_updated,'yyyy-mm-dd') < sysdate-7;
Approach 2:
l_lastupdated varchar2(255) := to_char(sysdate-nvl(p_days,7),'YYYY-MM-DD');
insert into B(ID) select ID from A where LASTUPDATED < l_lastupdated;
delete from A where id in (select id from B);
Which one is better considering performance, safety and locking?

Assuming the delete removes a significant fraction of the data (millions of rows), consider approach three: create a new table keeping only the rows you want, then swap it in:
create table tmp as
  select * from a where to_date(last_updated,'yyyy-mm-dd') >= sysdate-7;
drop table a;
rename tmp to a;
https://asktom.oracle.com/pls/apex/f?p=100:11:0::::P11_QUESTION_ID:2345591157689
Obviously you'll need to copy over all the indexes, grants, etc. But online redefinition can help with this https://oracle-base.com/articles/11g/online-table-redefinition-enhancements-11gr1
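For reference, a hedged sketch of what the online redefinition route looks like (it assumes an interim table A_TMP that you have pre-created with the desired structure; see the linked article for the full checklist):
declare
  l_errors pls_integer;
begin
  dbms_redefinition.start_redef_table(user, 'A', 'A_TMP');
  -- copy indexes, grants, triggers and constraints onto the interim table
  dbms_redefinition.copy_table_dependents(
    user, 'A', 'A_TMP',
    copy_indexes => dbms_redefinition.cons_orig_params,
    num_errors   => l_errors);
  -- swap the tables
  dbms_redefinition.finish_redef_table(user, 'A', 'A_TMP');
end;
/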
When you get to 12.2, there's another simpler option: a filtered move.
This is an alter table move operation, with an extra clause stating which rows you want to keep:
create table t (
c1 int
);
insert into t values ( 1 );
insert into t values ( 2 );
commit;
alter table t
move including rows where c1 > 1;
select * from t;
C1
2
While you're waiting to upgrade to 12.2+, and if you don't want to use the create-as-select method for some reason, approach 1 is superior:
Both methods delete the same rows from A* => it's the same amount of work to do the delete
Option 1 has one statement; Option 2 has two statements; 2 > 1 => option 2 is more work
*Statement level consistency means you might get different results running the processes. Say another session tries to update an old row that your process will remove.
With just the delete, the update will be blocked until the delete finishes. At which point the row's gone, so the update does nothing.
Whereas if you do the insert first, the other session can update & commit the row before the insert completes. So the update "succeeds". But the delete will then remove it! Which can lead to some unhappy customers...
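To make that race concrete, a sketch of the timeline (row id 42 is illustrative):
-- Session 1 (cleanup):
insert into B(ID) select ID from A where LASTUPDATED < l_lastupdated;  -- snapshot sees row 42 as old
-- Session 2 (application), before session 1's delete runs:
update A set last_updated = to_char(sysdate, 'YYYY-MM-DD') where id = 42;
commit;  -- the update "succeeds"
-- Session 1 again:
delete from A where id in (select id from B);  -- ...and row 42 is deleted anyway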

Your stored date format sorts correctly as a string, so you could go the other way round and convert sysdate to a string:
--this is false today
select * from dual where '2019-06-05' < to_char(sysdate-7, 'YYYY-MM-DD');
--this is true today
select * from dual where '2019-05-05' < to_char(sysdate-7, 'YYYY-MM-DD');
So it would be:
Delete from A where last_updated < to_char(sysdate-7, 'yyyy-mm-dd');
It has the benefit that an index on last_updated (if there is one) can be used, since the column is no longer wrapped in a function.
It has the disadvantage of relying on string/varchar ordering, which might change, e.g. through NLS settings (if I remember right), so in any case you should do a little testing first...
In the long term you should of course alter the column to a proper date datatype, but I guess that doesn't help you right now ;)
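A minimal sketch of that long-term fix (table and column names taken from the question; the temporary column name is an assumption):
alter table A add (last_updated_dt date);
update A set last_updated_dt = to_date(last_updated, 'YYYY-MM-DD');
alter table A drop column last_updated;
alter table A rename column last_updated_dt to last_updated;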

If you are trying to delete most of the rows in the table, I would advise you to go with a different approach, namely:
create <new table name> as
select *
from <old table name>
where <predicates for the data you want to keep>;
then
drop table <old table name>;
and finally you can rename the new table back to the old table.
You could always partition the new table (i.e. create the new table with a separate statement containing the partitioning clauses, and then have an insert as select into the new table from the old table).
That way, when you need to delete rows, it's a simple matter of dropping the relevant partition(s).
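A hedged sketch of that partitioned variant, assuming Oracle interval partitioning (11g+) and illustrative column names (it also converts the string column to a proper DATE on the way in):
create table a_new (
  id           number,
  last_updated date
)
partition by range (last_updated)
interval (numtodsinterval(7, 'DAY'))
( partition p0 values less than (date '2019-01-01') );

insert /*+ append */ into a_new
select id, to_date(last_updated, 'YYYY-MM-DD') from a;

-- deleting a week of old data then becomes a partition drop:
alter table a_new drop partition for (date '2019-06-01');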

Related

UPDATE two columns with new values in a large table

We have a table like:
mytable (pid, string_value, int_value)
This table has more than 20M rows in total. Now we have a feature that needs to mark all the rows in this table as invalid, so we need to update the columns to string_value = NULL and int_value = 0, which indicates an invalid row (we still want to keep the pid, as it is important to us).
So what is the best way?
I use the following SQL:
UPDATE Mytable
SET string_value = NULL,
int_value = 0;
but this query takes more than 4 minutes in my test env. Is there any better way we can improve it?
Updating all the rows can be quite expensive. Often, it is faster to empty the table and reload it.
In generic SQL this looks like:
create table mytable_temp as
select pid
from mytable;
truncate table mytable; -- back it up first!
insert into mytable (pid, string_value, int_value)
select pid, null, 0
from mytable_temp;
The creation of the temporary table may use different syntax, depending on your database.
Updates can take time to complete. Another way of achieving this is to follow these steps (a sketch follows the list):
Add new columns with the values you need set as the default value
Drop the original columns
Rename the new columns with the names of the original columns.
You can then drop the default values on the new columns.
This needs to be tested, as different DBMSs allow different levels of table alteration (e.g. not all DBMSs allow dropping a default or dropping a column).
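A minimal sketch of that column swap in Oracle-style syntax (the column lengths and temporary names are assumptions; other DBMSs differ):
alter table mytable add (string_value_new varchar2(255) default null,
                         int_value_new    number        default 0 not null);
alter table mytable drop column string_value;
alter table mytable drop column int_value;
alter table mytable rename column string_value_new to string_value;
alter table mytable rename column int_value_new to int_value;
-- finally, drop the default on the new column:
alter table mytable modify int_value default null;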

Copy table data from one database to another database (SQL)

I have had a look at similar problems, however none of the answers helped in my case.
Just a little bit of background: I have two databases, and both have the same table with the same fields and structure. Data already exists in both tables. I want to overwrite and add to the data in db1.table from db2.table; the primary ID is causing a problem with the update.
When I use the query:
USE db1;
INSERT INTO db2.table(field_id,field1,field2)
SELECT table.field_id,table.field1,table.field2
FROM table;
It works with a blank table, because none of the primary keys exist yet. As soon as a primary key exists, it fails.
Would it be easier for me to overwrite the primary keys, or to find the primary key and update the fields related to the field_id? I'm really not sure how to go ahead from here. The data needs to be migrated every 5 minutes, so possibly a stored procedure is required?
First you should add the new records, then update all records. You can create a procedure like the code below:
PROCEDURE sync_Data(a IN NUMBER) IS
BEGIN
  insert into db2.table
  select *
  from db1.table t
  where t.field_id not in (select tt.field_id from db2.table tt);

  for t in (select * from db1.table) loop
    update db2.table aa
    set aa.field1 = t.field1,
        aa.field2 = t.field2
    where aa.field_id = t.field_id;
  end loop;
END sync_Data;
Set IsIdentity to No in Identity Specification on the table in which you want to move data, and after executing your script, set it to Yes again
I ended up just removing the data in the new database and sending it again.
DELETE FROM db2.table WHERE db2.table.field_id != 0;
USE db1;
INSERT INTO db2.table(field_id,field1,field2)
SELECT table.field_id,table.field1,table.field2
FROM table;
It's not very efficient, but it gets the job done. I couldn't figure out the syntax to correctly do an UPDATE or to change the IsIdentity field within MariaDB, so I'm not sure whether those would work or not.
The overhead of deleting and replacing non-trivial amounts of data for an entire table will be prohibitive. That said, I'd prefer an update in place (merge) over delete/replace.
USE db1;
INSERT INTO db2.table(field_id,field1,field2)
SELECT t.field_id,t.field1,t.field2
FROM table t
ON DUPLICATE KEY UPDATE field1 = t.field1, field2 = t.field2;
This can be used inside a procedure and called every 5 minutes (not recommended) or you could build a trigger that fires on INSERT and UPDATE to keep the tables in sync.
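If you do want it scheduled inside the database, a hedged sketch using the MariaDB event scheduler (the event name is an assumption; the scheduler must be enabled, and a table literally named "table" would need back-quoting):
SET GLOBAL event_scheduler = ON;

CREATE EVENT sync_db1_to_db2
ON SCHEDULE EVERY 5 MINUTE
DO
  INSERT INTO db2.table (field_id, field1, field2)
  SELECT s.field_id, s.field1, s.field2
  FROM db1.table s
  ON DUPLICATE KEY UPDATE field1 = s.field1, field2 = s.field2;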
INSERT INTO database1.tabledata SELECT * FROM database2.tabledata;
But you have to make sure each varchar column in database1 is at least as long as its counterpart in database2, and the column names must match.

Delete new record if same data exists

I want to delete the new record if the same record was created before.
My columns are date, time and MsgLog. If date and time are the same, I want to delete the new one.
I need help.
You can check whether that value already exists in the table using a query. If it exists, you can show a message that the record already exists.
To prevent this kind of erroneous addition you can add a restriction to your table to ensure unique (Date, Time) pairs; if you don't want to change the data structure (e.g. you only want to add records with such restrictions once or twice) you can exploit an insert ... select construction.
-- MS SQL version, check your DBMS
insert into MyTable(
       Date,
       Time,
       MsgLog)
select @Date,
       @Time,
       @MsgLog
where not exists(
       select 1
       from MyTable
       where (@Date = Date) and
             (@Time = Time)
     );
P.S. "delete the new one" amounts to "do not insert the new one".
You should create a unique constraint in the DB level to avoid invalid data no matter who writes to your DB.
It's always important to have your schema well defined. That way you're safe that no matter how many apps are using your DB or even in case someone just writes some inserts manually.
I don't know which DB you are using, but in MySQL you can use the following DDL:
alter table MY_TABLE add unique index(date, time);
And in Oracle you can:
alter table MY_TABLE ADD CONSTRAINT constaint_name UNIQUE (date, time);
That said, you can also (not instead) do some checks before inserting new values, to avoid dealing with exceptions or to improve performance by avoiding unnecessary round-trips to your DB (length/null checks, for example, could easily be handled at the application level).
You can avoid deleting by checking for duplicates while inserting.
Just modify your insert procedure like this, so no duplicates will be entered:
declare @intCount as int;
select @intCount = count(MsgLog)
from MyTable
where (date = @date) and (time = @time);
if @intCount = 0
begin
    -- insert statement goes here
end
Edited:
Since what you want is to delete the duplicate entries after your bulk insert, think about this logic:
create a temporary table
insert LogId, date, time from your table into the temp table, ordered by date, time
now declare four variables: @preTime, @preDate, @currTime, @currDate
loop over each row in the temp table, like this:
declare logCursor cursor for
    select pkLogId, [date], [time] from tblTemp order by [date], [time];
open logCursor;
fetch next from logCursor into @pkLogID, @currDate, @currTime;
while @@FETCH_STATUS = 0
begin
    -- delete condition check: same date and time as the previous row => duplicate
    if (@currDate = @preDate) and (@currTime = @preTime)
        delete from MAINTABLE where pkLogId = @pkLogID;

    -- assign current values as previous values for the next entry
    select @preDate = @currDate, @preTime = @currTime;
    fetch next from logCursor into @pkLogID, @currDate, @currTime;
end
close logCursor;
deallocate logCursor;
The strategy above sorts all entries by date and time so duplicates end up adjacent, then compares each entry with its predecessor; when a match is found, the duplicate entry is deleted.
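For what it's worth, the same cleanup can be done set-based, without a loop. A hedged sketch using ROW_NUMBER() (SQL Server syntax; table and column names as above, assuming the lowest pkLogId per (date, time) is the row to keep):
with numbered as (
    select row_number() over (partition by [date], [time]
                              order by pkLogId) as rn
    from MAINTABLE
)
delete from numbered
where rn > 1;  -- removes every duplicate, keeping the first row per (date, time)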

Prevent inserting overlapping date ranges using a SQL trigger

I have a table that simplified looks like this:
create table Test
(
ValidFrom date not null,
ValidTo date not null,
check (ValidTo > ValidFrom)
)
I would like to write a trigger that prevents inserting values that overlap an existing date range. I've written a trigger that looks like this:
create trigger Trigger_Test
on Test
for insert
as
begin
    if exists(
        select *
        from Test t
        join inserted i
          on ((i.ValidTo >= t.ValidFrom) and (i.ValidFrom <= t.ValidTo))
    )
    begin
        raiserror (N'Overlapping range.', 16, 1);
        rollback transaction;
        return;
    end;
end
But it doesn't work, since my newly inserted record is part of both the Test and inserted tables while inside the trigger. So the new record in the inserted table is always joined to itself in the Test table, and the trigger always rolls back the transaction.
I can't distinguish new records from existing ones, so if I excluded identical date ranges, I would be able to insert multiple exactly-same ranges into the table.
The main question is
Is it possible to write a trigger that would work as expected without adding an additional identity column to my Test table that I could use to exclude newly inserted records from my exists() statement like:
create trigger Trigger_Test
on Test
for insert
as
begin
    if exists(
        select *
        from Test t
        join inserted i
          on (
               i.ID <> t.ID and /* exclude myself out */
               i.ValidTo >= t.ValidFrom and i.ValidFrom <= t.ValidTo
             )
    )
    begin
        raiserror (N'Overlapping range.', 16, 1);
        rollback transaction;
        return;
    end;
end
Important: If impossible without identity is the only answer, I welcome you to present it along with a reasonable explanation why.
I know this is already answered, but I tackled this problem recently and came up with something that works (and performs well doing a singleton seek for each inserted row). See the example in this article:
http://michaeljswart.com/2011/06/enforcing-business-rules-vs-avoiding-triggers-which-is-better/
(and it doesn't make use of an identity column)
Two minor changes and everything should work just fine.
First, add a where clause to your trigger to exclude the duplicate records from the join. Then you won't be comparing the inserted records to themselves:
select *
from Test t
join inserted i
  on ((i.ValidTo >= t.ValidFrom) and (i.ValidFrom <= t.ValidTo))
where not (i.ValidTo = t.ValidTo and i.ValidFrom = t.ValidFrom)
Except, this would still allow exact duplicate ranges, so you will also have to add a unique constraint across the two columns. Actually, you may want a unique constraint on each column separately, since any two ranges that start (or finish) on the same day overlap by definition.
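A hedged sketch of those supporting constraints (SQL Server syntax; the constraint names are assumptions):
alter table Test add constraint UQ_Test_Range unique (ValidFrom, ValidTo);
-- or, per the stronger suggestion, one per column:
alter table Test add constraint UQ_Test_ValidFrom unique (ValidFrom);
alter table Test add constraint UQ_Test_ValidTo unique (ValidTo);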

Create a unique primary key (hash) from database columns

I have this table which doesn't have a primary key.
I'm going to insert some records into a new table to analyze them, and I'm thinking of creating a new primary key with the values from all the available columns.
If this were a programming language like Java I would:
int hash = column1 * 31 + column2 * 31 + column3 * 31;
Or something like that. But this is SQL.
How can I create a primary key from the values of the available columns? It won't work for me to simply mark all the columns as PK, because what I need to do is compare them with data from another DB table.
My table has 3 numbers and a date.
EDIT: What my problem is
I think a bit more background is needed. I'm sorry for not providing it before.
I have a database (dm) that is being updated every day from another db (originalsource). It has records from the past two years.
Last month (July) the update process broke, and for a month no data was updated into dm.
I manually created a table with the same structure in my Oracle XE, and copied the records from the original source into my db (myxe). I copied only records from July, to create a report needed by the end of the month.
Finally, on Aug 8 the update process was fixed, and the records which had been waiting to be migrated by this automatic process got copied into the database (from originalsource to dm).
This process cleans the data out of the original source once it is copied (into dm).
Everything looked fine, but we have just realized that some of the records got lost (about 25% of July).
So what I want to do is use my backup (myxe) and insert into the database (dm) all those missing records.
The problems here are:
They don't have a well-defined PK.
They are in separate databases.
So I thought that if I could compute the same unique key from both tables, I could tell which records were missing and insert them.
EDIT 2
So I did the following in my local environment:
select a.* from the_table@PRODUCTION a, the_table b where
a.idle = b.idle and
a.activity = b.activity and
a.finishdate = b.finishdate
Which returns all the rows that are present in both databases (the intersection). I've got 2,000 records.
What I'm going to do next is delete them all from the target db and then just insert all the rows from my db into the target table.
I hope I don't get into something worse :-S
The danger of creating a hash value by combining the 3 numbers and the date is that it might not be unique and hence cannot be used safely as a primary key.
Instead I'd recommend using an autoincrementing ID for your primary key.
Just create a surrogate key:
ALTER TABLE mytable ADD pk_col INT
UPDATE mytable
SET pk_col = rownum
ALTER TABLE mytable MODIFY pk_col INT NOT NULL
ALTER TABLE mytable ADD CONSTRAINT pk_mytable_pk_col PRIMARY KEY (pk_col)
or this:
ALTER TABLE mytable ADD pk_col RAW(16)
UPDATE mytable
SET pk_col = SYS_GUID()
ALTER TABLE mytable MODIFY pk_col RAW(16) NOT NULL
ALTER TABLE mytable ADD CONSTRAINT pk_mytable_pk_col PRIMARY KEY (pk_col)
The latter uses GUIDs, which are unique across databases, but consume more space and are much slower to generate (your INSERTs will be slow).
Update:
If you need to create same PRIMARY KEYs on two tables with identical data, use this:
MERGE
INTO   mytable v
USING  (
       SELECT rid, rownum AS rn
       FROM   (
              SELECT rowid AS rid
              FROM   mytable
              ORDER BY
                     col1, col2, col3
              )
       ) d
ON     (v.rowid = d.rid)
WHEN MATCHED THEN
UPDATE
SET    pk_col = d.rn
Note that tables should be identical up to a single row (i. e. have same number of rows with same data in them).
Update 2:
For your very problem, you don't need a PK at all.
If you just want to select the records missing in dm, use this one (on dm side)
SELECT *
FROM mytable@myxe
MINUS
SELECT *
FROM mytable
This will return all records that exist in mytable@myxe but not in mytable@dm.
Note that it will shrink all duplicates if any.
Assuming that you have ensured uniqueness... you can do almost the same thing in SQL. The only problem will be converting the date to a numeric value so that you can hash it.
Select Table2.SomeFields
FROM Table1 LEFT OUTER JOIN Table2 ON
(Table1.col1 * 31) + (Table1.col2 * 31) + (Table1.col3 * 31) +
((DatePart(year,Table1.date) + DatePart(month,Table1.date) + DatePart(day,Table1.date) )* 31) = Table2.hashedPk
The above query would work for SQL Server, the only difference for Oracle would be in terms of how you handle the date conversion. Moreover, there are other functions for converting dates in SQL Server as well, so this is by no means the only solution.
And, you can combine this with Quassnoi's SET statement to populate the new field as well. Just use the left side of the Join condition logic for the value.
If you're loading your new table with values from the old table, and you then need to join the two tables, you can only "properly" do this if you can uniquely identify each row in the original table. Quassnoi's solution will allow you to do this, IF you can first alter the old table by adding a new column.
If you cannot alter the original table, generating some form of hash code based on the columns of the old table would work -- but, again, only if the hash codes uniquely identify each row. (Oracle has checksum functions, right? If so, use them.)
If hash code uniqueness cannot be guaranteed, you may have to settle for a primary key composed of as many columns as are required to ensure uniqueness (i.e. the natural key). If there is no natural key, well, I heard once that Oracle provides a rownum for each row of data; could you use that?
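On the checksum point: Oracle does ship hash functions, e.g. ORA_HASH. A hedged sketch (column names are illustrative; collisions remain possible, so verify uniqueness before trusting the result as a key):
select t.*,
       ora_hash(col1 || '|' || col2 || '|' || col3 || '|' ||
                to_char(datecol, 'YYYY-MM-DD')) as row_hash
from mytable t;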