SQL Query to delete records older than two years

I need to clean out a very bloated SQL database by deleting records that are older than two years from a number of tables. What is the most efficient way of doing this?

Do you have any way to determine how "old" a record is? (i.e., is there a column in the table that represents either the age of the row or a date that can be used to calculate the age?) If so, it should be as simple as:
DELETE FROM Table WHERE Age > 2
For example, if you have a DateTime column called CreateDate, you could do this:
DELETE FROM Table WHERE DATEADD(year, 2, CreateDate) < getdate()

In addition to Adam Robinson's good answer, when performing this type of operation:
Run a SELECT query with the DELETE's WHERE clause first to make sure you're getting "the right data" (see the sketch after this list)
Do a full backup
Run the thing in "off" hours so as not to affect users too much
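For example, a dry run of the first point, reusing the hypothetical CreateDate column from Adam's answer, could look like this:
-- preview what would be removed before running the real delete
SELECT COUNT(*)
FROM Table
WHERE DATEADD(year, 2, CreateDate) < getdate()

-- once the count looks right, run the delete with the identical WHERE clause
DELETE FROM Table
WHERE DATEADD(year, 2, CreateDate) < getdate()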

I've seen DBAs do this at a few different companies, and it always seems to follow this format:
Backup the table
Drop any indexes
Select the rows you want to keep into a temp table
Truncate the original table
Insert (into your source table) from your temp table
Recreate the indexes
The benefit of this approach (sketched below) is that the work is minimally logged, so the transaction log doesn't get blown up by thousands of individual delete entries. It's also faster.
The drawback is that, because the work is minimally logged, your only recovery option if something goes wrong is to restore the backup.
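A minimal sketch of that sequence, assuming a hypothetical Orders table with a CreateDate column (the backup and index steps are omitted here):
-- copy the rows worth keeping into a temp table
SELECT *
INTO #KeepRows
FROM Orders
WHERE DATEADD(year, 2, CreateDate) >= getdate()

-- wipe the original table (minimally logged)
TRUNCATE TABLE Orders

-- put the kept rows back (may need SET IDENTITY_INSERT ON if there is an identity column)
INSERT INTO Orders
SELECT *
FROM #KeepRows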
You should think about putting housekeeping in place. If the above is too scary, then you could also use the housekeeping to winnow the database over a period of time.
In MSSQL, you could create a job that runs daily and deletes the first 1000 rows of your query. To steal Adam's query:
DELETE TOP (1000) FROM Table WHERE DATEADD(year, 2, CreateDate) < getdate()
This would be very safe, would get rid of your old data over three months or so, and would then also maintain the size of the db in the future.
Your database will reuse this space in the future, but if you want to recover the space now you will need to shrink the database. Read around before doing so; whether it is worth it depends on the amount of space to recover versus the total size of the db.
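If you do decide to reclaim the space, the shrink might look like this (a sketch with a hypothetical logical file name and target size; shrinking fragments indexes, so plan to rebuild them afterwards):
-- list the logical file names for the current database
EXEC sp_helpfile

-- shrink the data file down to roughly 10 GB (10240 MB)
DBCC SHRINKFILE (N'MyDatabase_Data', 10240)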


Can we delete some rows from a partition instead of looping over all records of the big table?

I'm new to SQL and the database world, and I'm facing this situation:
I have a table partitioned by day: every day a new partition is created that collects all the rows added that day.
Now we are trying to reduce the amount of data, since the size of the DB is getting bigger, so we decided to delete some rows based on certain conditions.
What we are trying to do is delete some rows of unused data, but only from the last 2 days.
So my question is:
Can we delete some rows from a partition? If so, does it delete data from the actual table and free some space?
example :
delete from MyTable where condition1 and time >= (sysdate -2) ;
-- is it the same as (from a performance perspective)
delete from Mytable partition (MyTble_Partition) where condition1;
Is index fragmentation a concern, or is a rebuild of the indexes needed after deleting some rows in this case?
Please correct me if I'm saying stupid things.
I will be grateful for any guidance. Thanks in advance.
Main rule: you almost never have a reason to reference a partition explicitly.
Your predicates (join/where conditions) should give the database all the information it needs to target only the required partitions.
If you want to delete some data from the last 2 days, then yes, it's fine to pass the time >= predicate; only the needed partitions will be scanned by Oracle.
You don't need to rebuild indexes; that work is handled by the DBMS.
Your next question, clearing data to free up space, is a bit trickier.
You should think of every partition of your table as an "independent table" in the DB.
In many respects it is.
When you DELETE, you don't get any free space back on your hard drive. You just get some "free space" inside your table, which can be reused for further INSERTs.
But (attention!) that only helps if you really will add records to that old day's partition in the future.
If not, then you gain nothing from the DELETE. At all.
Also read this article to understand how to free real disk space after deleting table rows.
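As a rough Oracle sketch (the table and partition names are just the ones from the question), real disk space can be reclaimed either by shrinking the segment after the DELETE, which needs an ASSM tablespace and row movement, or by dropping a whole obsolete partition:
-- shrink the table's segments after a large delete
ALTER TABLE MyTable ENABLE ROW MOVEMENT;
ALTER TABLE MyTable SHRINK SPACE;

-- or, for data that is no longer needed at all, drop the old partition entirely
ALTER TABLE MyTable DROP PARTITION MyTble_Partition UPDATE GLOBAL INDEXES;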

Create a Historical Auditing Table

Currently we have an AuditLog table that holds over 11M records. Regardless of the indexes and statistics, any query referencing this table takes a long time. Most reports don't check for audit records more than a year old, but we would still like to keep these records. What's the best way to handle this?
I was thinking of keeping the AuditLog table to hold all records less than or equal to a year old, then moving any records greater than a year old to an AuditLogHistory table. Maybe just running a batch file every night to move these records over and then updating the indexes and statistics of the AuditLog table. Is this an okay way to complete this task? Or how else should I be storing the older records?
The records brought back from the AuditLog table hit a linked server and are checked against 6 different DBs to see if a certain member exists in them based on a condition. I don't have access to make any changes to the linked server DBs, so I can only optimize what I have, which is the AuditLog. Hitting the linked server DBs accounts for over 90% of the query's cost, so I'm just trying to limit what I can.
First, I find it hard to believe that you cannot optimize a query on a table with 11 million records. You should investigate the indexes that you have relative to the queries that are frequently run.
In any case, the answer to your question is "partitioning". You would partition by the date column and be sure to include this condition in all queries. That will reduce the amount of data and probably speed up the processing.
The documentation is a good place to start for learning about partitioning.
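A minimal sketch of what date-based partitioning could look like here (the function, scheme, and column names are hypothetical, and a real setup would map partitions to suitable filegroups):
-- one partition per year, with boundaries aligned to the one-year reporting cutoff
CREATE PARTITION FUNCTION pfAuditByYear (datetime)
AS RANGE RIGHT FOR VALUES ('2013-01-01', '2014-01-01', '2015-01-01')

CREATE PARTITION SCHEME psAuditByYear
AS PARTITION pfAuditByYear ALL TO ([PRIMARY])

-- rebuild the table's clustered index on the partition scheme, keyed on the audit date
CREATE CLUSTERED INDEX IX_AuditLog_AuditDate
ON dbo.AuditLog (AuditDate)
ON psAuditByYear (AuditDate)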

How to merge a 500 million row table with another 500 million row table

I have to merge two 500M+ row tables.
What is the best method to merge them?
I just need to display the records from these two SQL-Server tables if somebody searches on my webpage.
These are fixed tables, no one will ever change data in these tables once they are live.
create a view myview as select * from table1 union select * from table2
Is there any harm using the above method?
If I start merging 500M rows it will run for days, and if the machine reboots the database will go into recovery mode and I will have to start from the beginning again.
Why am I merging these tables?
I have a website which provides a search on the person table.
This table has columns like Name, Address, Age etc.
We got 500 million similar records in .txt files, which we loaded into another table.
Now we want the website search page to query both tables to see if a person exists in either table.
We get similar .txt files of 100 million or 20 million records, which we load into this huge table.
How are we currently doing it?
We import the .txt files into separate staging tables (some columns are different in the .txt files).
Then we arrange the columns and do the data type conversions.
Then we insert this staging table into the liveCopy huge table (in a test environment).
We have SQL Server 2008 R2.
Can we use table partitioning for performance benefits?
Is it OK to create small monthly tables and create a view on top of them?
How can indexing be done in this case?
We only load new data once a month and then just do selects.
Does replication help?
The biggest issue I am facing is managing these huge tables.
I hope I explained the situation.
Thanks & Regards
1) Usually, to achieve more performance, developers split large tables into smaller ones and call this partitioning (horizontal partitioning, to be more precise, because there is also a vertical kind). Your view is an example of such partitions joined back together. Of course, it is mostly used to split a large amount of data into ranges of values (for example, table1 contains records with column [col1] < 0, while table2 holds those with [col1] >= 0). But it is fine for unsorted data too, because you get more room for speed improvements, for example parallel reads if the tables are placed on different storage. So this is a good choice.
2) Another way is to use MERGE statement supported in SQL Server 2008 and higher - http://msdn.microsoft.com/en-us/library/bb510625(v=sql.100).aspx.
3) Of course you can copy using INSERT+DELETE, but in that case, or when using the MERGE command, do it in small batches. Something like:
SET ROWCOUNT 10000
DECLARE @Count [int] = 1
WHILE @Count > 0 BEGIN
    -- ... INSERT+DELETE/MERGE transaction ...
    SET @Count = @@ROWCOUNT
END
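A slightly more concrete sketch of the INSERT+DELETE variant, assuming table2 is being folded into table1 and both share the person columns (all names are hypothetical):
DECLARE @BatchSize int = 100000
DECLARE @Rows int = 1

WHILE @Rows > 0
BEGIN
    BEGIN TRANSACTION

    -- move one batch: delete from the source and route the deleted rows into the target
    DELETE TOP (@BatchSize)
    FROM table2
    OUTPUT deleted.Name, deleted.Address, deleted.Age
    INTO table1 (Name, Address, Age)

    SET @Rows = @@ROWCOUNT
    COMMIT TRANSACTION
END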
If your purpose is truly just to move the data from the two tables into one table, you will want to do it in batches - 100K records at a time, or something like that. I'd guess you crashed before because your T-Log got full, although that's just speculation. Make sure to take a log backup after each batch if you are in Full recovery mode (a checkpoint is enough in Simple mode) so the log space can be reused.
That said, I agree with all the comments that you should provide why you are doing this - it may not be necessary at all.
You may want to have a look at an Indexed View.
In this way, you can set up indexes on your view and get the best performance out of it. The expensive part of using Indexed Views is in the insert/update/delete operations - but for read performance it would be your best solution.
http://www.brentozar.com/archive/2013/11/what-you-can-and-cant-do-with-indexed-views/
https://www.simple-talk.com/sql/learn-sql-server/sql-server-indexed-views-the-basics/
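For reference, the mechanics look roughly like this (all names hypothetical). Note that SQL Server indexed views require SCHEMABINDING and cannot contain UNION or outer joins, so this only applies once the data sits in one table or can be expressed as an inner join:
CREATE VIEW dbo.vPersonSearch
WITH SCHEMABINDING
AS
SELECT a.PersonId, a.Name, b.Address
FROM dbo.PersonA AS a
JOIN dbo.PersonB AS b ON b.PersonId = a.PersonId
GO

-- the first index on the view must be unique and clustered; it materializes the view
CREATE UNIQUE CLUSTERED INDEX IX_vPersonSearch
ON dbo.vPersonSearch (PersonId)
GO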
If the two tables are linked one to one, then you are wasting a lot of CPU time on each read, especially since you mentioned that the tables don't change at all. You should have only one table in this case.
Try creating a new table including (at least) the needed columns from the two tables.
You can do this by:
SELECT *
INTO newTable
FROM A LEFT JOIN B ON A.x = B.y
or (if some people don't have information from the text file):
SELECT *
INTO newTable
FROM A INNER JOIN B ON A.x = B.y
And note that you should at least have indexes on the join fields (to speed up the process).
More details about the fields may help giving more precise answer as well.

Incremental extraction from DB2

What would be the most efficient way to select only rows from a DB2 table that have been inserted/updated since the last select (or since some specified time)? There is no field in the table that would allow us to do this easily.
We are extracting data from the table for reporting purposes, and right now we have to extract the whole table every time, which is causing big performance issues.
I found an example of how to select only rows changed in the last day:
SELECT * FROM ORDERS
WHERE ROW CHANGE TIMESTAMP FOR ORDERS >
CURRENT TIMESTAMP - 24 HOURS;
But I am not sure how efficient this would be, since the table is enormous.
Is there some other way to select only the rows that have changed that might be more efficient than this?
I also found a solution called ParStream. This seems like something that could speed up demanding queries on the data, but I was unable to find any useful documentation about it.
I propose these options:
You can use Change Data Capture, which will automatically replay the modifications to another data source.
Normally, a select statement does not guarantee the order of the rows. That means you cannot use a select without a time reference to retrieve only the most recent rows; you have to have a time column. You can keep track of the most recent row in a global variable, and the next time retrieve the rows with a time greater than that variable (a sketch follows). If you want to increase performance, you can put the table in append mode, so the new rows are stored physically together. Keeping an index on this time column can be expensive to maintain, but it will speed things up (no table scan) when you need to extract the rows.
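A rough sketch of that watermark idea, assuming the ORDERS table from the question has (or can be given) a timestamp column CHANGE_TS, and using a hypothetical one-row control table instead of a global variable:
-- one-row control table holding the last extraction point (hypothetical)
CREATE TABLE ETL_WATERMARK (LAST_EXTRACT TIMESTAMP NOT NULL)

-- extract only the rows changed since the previous run
SELECT O.*
FROM ORDERS O, ETL_WATERMARK W
WHERE O.CHANGE_TS > W.LAST_EXTRACT

-- after a successful extraction, advance the watermark
UPDATE ETL_WATERMARK SET LAST_EXTRACT = CURRENT TIMESTAMP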
If your server is DB2 for i, use database journaling. You can extract after images of inserted records by time period or journal entry number from the journal receiver(s). The data entries can then be copied to your target file.

Very large "delete from" statement with "where" clause, should I optimize?

I have a classic "sales" database that contains millions of rows in certain tables. On each of these large tables, I have an associated "delete" trigger and "backup" table.
This backup table keeps "deleted" rows for the last 7 days: the trigger starts by copying the deleted rows into that backup table, then purges old rows from the backup, in this fashion:
CREATE TRIGGER dbo.TRIGGER
ON dbo.EXAMPLE_DATA
FOR DELETE AS
    -- copy the deleted rows into the backup table, stamped with the current date
    INSERT INTO EXAMPLE_BACKUP
    SELECT getdate(), *
    FROM deleted

    -- purge backup rows older than 7 days
    DELETE FROM EXAMPLE_BACKUP
    WHERE modified < dateadd(dd, -7, getdate())
The structure of the backup table is similar to the original data table (keys, values). The only difference is that I add a "modified" field to the backup tables, which I include in the key.
A colleague of mine told me I should use "a loop" because my delete statement will cause timeouts/issues as soon as the backup table contains several million rows. Will that delete actually blow up at some point? Should I do something differently?
It looks like Sybase 12.5 supports table partitioning; if your design is such that the data can be retained for exactly 7 days (using a hard breakpoint), you could partition your table on the day of the year, and construct a view to represent the current data. As the clock ticks past a certain day, you could truncate the older partitions explicitly.
Just a thought.
http://infocenter.sybase.com/help/index.jsp?topic=/com.sybase.dc20020_1251/html/databases/X15880.htm
Otherwise, deleting in a loop is a reasonable method for deleting large subsets of data without blowing up your transaction log. Here's one method using SQL Server:
http://sqlserverperformance.wordpress.com/2011/08/13/gradually-deleting-data-in-sql-server/
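On Sybase ASE the same looping idea can be sketched with set rowcount (reusing the names from the trigger above; treat this as an untested outline rather than production code):
-- delete at most 10000 rows per pass until no old rows remain
set rowcount 10000

declare @rows int
select @rows = 1

while @rows > 0
begin
    delete from EXAMPLE_BACKUP
    where modified < dateadd(dd, -7, getdate())

    select @rows = @@rowcount
end

set rowcount 0   -- reset the limit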