TSQL - Delete all records in a database older than x days - sql

I have a business requirement to delete all records from multiple databases that are older than 1 year. Is there a way to do this at the database level, or must each individual table be scripted?
I see there are various ways to do this at the table level, but I was wondering if it can be done at the database level?
Table level (for anyone looking for how to delete at the table level):
sql delete all rows older than 30 days
Delete all rows with timestamp older than x days

The essence of this will be using SQL to write SQL. Here's a starting point:
SELECT DISTINCT REPLACE(REPLACE(REPLACE(
    'DELETE FROM {s}.{t} WHERE {c} < DATEADD(YEAR, -1, GETUTCDATE())',
    '{s}', table_schema),
    '{t}', table_name),
    '{c}', column_name)
FROM
    INFORMATION_SCHEMA.COLUMNS
WHERE
    column_name LIKE '%Created%'
It queries the information schema for a list of all the columns containing the word Created (assumption: your tables contain columns with names like CreatedAt, CreatedOn, CreatedDate, Created, RecordCreated etc.; adjust for your scenario) and puts the schema/table/column names into a string that is itself an SQL statement that deletes data.
You run this, then copy the results out of the grid into another query window and run them. It is thus SQL that writes SQL.
Once you're comfortable with the concept, you could look to automate it by having some program or dynamic SQL select the delete queries and run them.
You can also "level up" if you have multiple databases, by making a query that hits sys.databases to find out all the database names that need querying, and then write an SQL that hits the DB list and generates variations of this sql that hits a specific DB, that is an SQL that hits specific tables for deletion
It's like Inception; be sure you understand the outermost layer of the onion before you start digging deeper
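A minimal sketch of that outermost layer, assuming you only want user databases and that the same Created-column convention holds everywhere (untested; copy out and inspect the generated text per database before running any of it):

SELECT
    'USE ' + QUOTENAME(name) + ';
SELECT DISTINCT
    ''DELETE FROM '' + QUOTENAME(table_schema) + ''.'' + QUOTENAME(table_name) +
    '' WHERE '' + QUOTENAME(column_name) + '' < DATEADD(YEAR, -1, GETUTCDATE());''
FROM INFORMATION_SCHEMA.COLUMNS
WHERE column_name LIKE ''%Created%'';'
FROM sys.databases
WHERE database_id > 4;   -- skip master, tempdb, model, msdb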
Bear in mind also that your data may be subject to constraints that mean you either have to:
delete them in a certain order or
keep re-running the delete until no more deletes are done (the first time you try a parent-then-child delete it may fail because children exist; the second time you might have deleted all the children so the parent can be removed - if any children remain this route fails, but maybe you want it to fail) or
configure your constraints to cascade deletes, so that deleting a record from the parent deletes the related records from the children (sketched below)
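A minimal sketch of the cascade option, with purely illustrative table and constraint names (you would redefine each existing foreign key):

ALTER TABLE dbo.ChildTable
    DROP CONSTRAINT FK_ChildTable_ParentTable;

ALTER TABLE dbo.ChildTable
    ADD CONSTRAINT FK_ChildTable_ParentTable
    FOREIGN KEY (ParentId) REFERENCES dbo.ParentTable (Id)
    ON DELETE CASCADE;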
Warning: this is a seriously destructive action, and you could easily end up removing more than you intended or leaving your database in a state where referential integrity is broken and your app stops working.
This is not something that is just flippantly designed in the quick 5 minutes of an SO question; this is a weeks-plus project to make sure you're only deleting relevant stuff. For example, configuring the keys for cascade delete and then accidentally removing the "Application Status: Approved" enum record, just because it was put into the DB more than a year ago, could wipe out the entire database: erasing every approved application (even those approved yesterday), every account that came from it, all their transactions, etc.
Think very carefully about this

Related

Oracle: Data consistency across multiple tables to be displayed

I have 3 reports based on 3 different tables, which ideally should match each other in audit.
They are updated sequentially once a day.
The problem here is that when one of the tables has been updated and the second one is still in progress, the customer sees data discrepancies between the reports for some time.
We tried a solution where we commit only after all 3 tables are updated, but we started having issues with the undo tablespace. The application has many other things running as well.
I am looking for a solution where we can restrict the user to data as of a specific point, so that they only see updated data after all 3 tables are refreshed/updated.
I think you can use SELECT ... FOR UPDATE on all 3 tables before starting the update procedure.
In that case users can still select data and will see only the unchanged data until the update session finishes and commits.
You can use a flashback query to show data as-of a point in time:
select * from table1 as of timestamp timestamp '2021-12-10 12:00:00';
The application would need to determine the latest time when the tables were synchronized - perhaps with a log table that records when the update process last ran; a sketch of that follows. However, the flashback query also uses the UNDO tablespace, though it should at least use less UNDO, since some of the committed transactions will now free up space.
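A minimal sketch of that idea, assuming a hypothetical refresh_log table maintained by the update job (table and column names are illustrative):

DECLARE
    v_sync TIMESTAMP;
BEGIN
    -- hypothetical log table maintained by the update job
    SELECT MAX(sync_time) INTO v_sync FROM refresh_log;

    -- each report reads all 3 tables as of the same point in time
    FOR r IN (SELECT * FROM table1 AS OF TIMESTAMP v_sync) LOOP
        NULL;  -- render the row in the report here
    END LOOP;
END;
/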

Saving delete information to temp tables using into

Is there a way of using into with delete keyword to save the deleted rows in a temp table in sql server?
Without using a trigger, the OUTPUT clause lets you do it
-- create if needed
SELECT * INTO #KeepItSafe FROM TheTable WHERE 0 = 1;
--redirect the deleted rows to the temp table using OUTPUT
DELETE TheTable OUTPUT DELETED.* INTO #KeepItSafe WHERE ...;
I guess my main question is what are you trying to do?
1 - Save a copy of the data before deleting?
Both the SELECT/INTO followed by DELETE or DELETE OUTPUT patterns will work fine.
But why keep the data??
2 - If you are looking to audit the table (inserts, updates, deletes), look at my how to prevent unwanted transactions slide deck w/code - http://craftydba.com/?page_id=880.
The audit table can hold information from multiple tables since the data is saved as XML. Therefore, you can un-delete if necessary. It tracks who and what made the change.
3 - If it is a one time delete, I use the SELECT INTO followed by the DELETE. Wrap the DELETE in a TRANSACTION. Check the copied data in the saved table. That way, unexpected deletions can be rolled back.
4 - Last but not least, if you are never going to purge the data from the audit table or the copy table, why not mark the row as deleted but keep it forever?
Many systems like PeopleSoft use effective dating to show whether a record is still active. In the BI world this is called a type 2 dimensional table (slowly changing dimensions).
See the data warehouse institute article.
http://www.bidw.org/datawarehousing/scd-type-2/
Each record has a begin and end date. All active records have an end date of NULL.
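A minimal sketch of that soft-delete / type 2 style, with purely illustrative table and column names:

CREATE TABLE dbo.Customer
(
    CustomerId INT          NOT NULL,
    Name       VARCHAR(100) NOT NULL,
    BeginDate  DATETIME     NOT NULL DEFAULT (GETDATE()),
    EndDate    DATETIME     NULL       -- NULL = currently active
);

-- a "delete" closes out the row instead of removing it
UPDATE dbo.Customer SET EndDate = GETDATE() WHERE CustomerId = 42;

-- active rows only
SELECT * FROM dbo.Customer WHERE EndDate IS NULL;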
Again, all the above solutions work.
The main question is what is the business requirements?
Sincerely
John
The Crafty Dba

T-SQL Optimize DELETE of many records

I have a table that can grow to millions of records (50 million, for example). Every 20 minutes, records that are older than 20 minutes are deleted.
The problem is that if the table has so many records, such a deletion can take a lot of time and I want to make it faster.
I can not do "truncate table" because I want to remove only records that are older than 20 minutes. I suppose that when doing the "delete" and filtering the information that needs to be deleted, the server is creating a log file or something and this takes much time?
Am I right? Is there a flag or option I can turn off to optimize the delete, and then turn back on afterwards?
To expand on the batch delete suggestion, I'd suggest you do this far more regularly (every 20 seconds perhaps) - batch deletions are easy:
WHILE 1 = 1
BEGIN
    DELETE TOP (4000)
    FROM YOURTABLE
    WHERE YourIndexedDateColumn < DATEADD(MINUTE, -20, GETDATE())

    IF @@ROWCOUNT = 0
        BREAK
END
Your inserts may lag slightly whilst they wait for the locks to release, but they should insert rather than error.
In regards to your table though, with this much traffic I'd expect to see it on a very fast RAID 10 array, perhaps even partitioned - are your disks up to it? Are your transaction logs on different disks to your data files? They should be.
EDIT 1 - Response to your comment
To put a database into SIMPLE recovery:
ALTER DATABASE [YourDatabaseName] SET RECOVERY SIMPLE;
This basically turns off transaction logging on the given database, meaning that in the event of data loss you would lose all data since your last full backup. If you're OK with that, this should save a lot of time when running large transactions. (NOTE that while a transaction is running, logging still takes place in SIMPLE - to enable rolling the transaction back.)
If there are tables within your database where you can't afford to lose data, you'll need to leave your database in FULL recovery mode (i.e. every transaction gets logged, and hopefully flushed to *.trn files by your server's maintenance plans). As I stated in my question though, there is nothing stopping you having two databases: one in FULL and one in SIMPLE. The FULL database would be for tables where you can't afford to lose any data (i.e. you could apply the transaction logs to restore data to a specific time), and the SIMPLE database would be for these massive high-traffic tables that you can allow data loss on in the event of a failure.
All of this is relevant assuming you're creating full (*.bak) backups every night and flushing your log files to *.trn files every half hour or so.
In regards to your index question, it's imperative your date column is indexed; if you check your execution plan and see any "TABLE SCAN", that is an indicator of a missing index.
Your date column, I presume, is DATETIME with a constraint setting the DEFAULT to GETDATE()?
You may find that you get better performance by replacing that with a BIGINT in YYYYMMDDHHMMSS form and then applying a CLUSTERED index to that column - note however that you can only have 1 clustered index per table, so if that table already has one you'll need to use a non-clustered index. (In case you didn't know, a clustered index basically tells SQL to store the information in that order, meaning that when you delete rows older than 20 minutes, SQL can literally delete stuff sequentially rather than hopping from page to page.)
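A minimal sketch of that BIGINT idea (table and column names are illustrative, not from the original post; FORMAT needs SQL Server 2012+):

CREATE TABLE dbo.Events
(
    EventKey BIGINT       NOT NULL,   -- e.g. 20240131235959 (YYYYMMDDHHMMSS)
    Payload  VARCHAR(200) NULL
);
CREATE CLUSTERED INDEX IX_Events_EventKey ON dbo.Events (EventKey);

-- build the cutoff in the same format; old rows then sit on contiguous pages
DECLARE @cutoff BIGINT =
    CONVERT(BIGINT, FORMAT(DATEADD(MINUTE, -20, GETDATE()), 'yyyyMMddHHmmss'));
DELETE FROM dbo.Events WHERE EventKey < @cutoff;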
The log problem is probably due to the number of records deleted in the transaction; to make things worse, the engine may be requesting a lock per record (or by page, which is not so bad).
The one big thing here is how you determine the records to be deleted. I'm assuming you use a datetime field; if so, make sure you have an index on that column, otherwise it's a sequential scan of the table that will really penalize your process.
There are two things you may do, depending on the concurrency of users and the time of the delete:
If you can guarantee that no one is going to read or write while you delete, you can lock the table in exclusive mode, delete (this takes only one lock from the engine) and release the lock (see the sketch after this list).
You can use batch deletes: you would make a script with a cursor that provides the rows you want to delete, and you begin a transaction and commit every X records (ideally 5000), so you keep the transactions short and don't take that many locks.
Take a look at the query plan for the delete process and see what it shows; a sequential scan of a big table is never good.
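A minimal sketch of the exclusive-lock option (the table hint and names are illustrative; only do this if nothing else needs the table during that window):

BEGIN TRAN;

DELETE FROM dbo.YourTable WITH (TABLOCKX)
WHERE YourDateColumn < DATEADD(MINUTE, -20, GETDATE());

COMMIT;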
Unfortunately for the purpose of this question, and fortunately for the sake of consistency and recoverability of databases in SQL Server, putting a database into SIMPLE recovery mode DOES NOT disable logging.
Every transaction still gets logged before being committed to the data file(s); the only difference is that in SIMPLE recovery the space in the log is released (in most cases) right after the transaction is either rolled back or committed, so this is not going to affect the performance of the DELETE statement one way or the other.
I had a similar problem when I needed to delete more than 70% of the rows from a big table with 3 indexes and a lot of foreign keys.
For this scenario, I saved the rows I wanted in a temp table, truncated the original table and reinserted the rows, something like:
SELECT * INTO #tempuser FROM [User] WHERE [Status] >= 600;
TRUNCATE TABLE [User];
INSERT [User] SELECT * FROM #tempuser;
I learned this technique from this link, which explains:
DELETE is a fully logged operation, and can be rolled back if something goes wrong
TRUNCATE removes all rows from a table without logging the individual row deletions
In the article you can explore other strategies to resolve the delay in deleting many records; that one worked for me.

Very large "delete from" statement with "where" close, should I optimize?

I have a classic "sales" database that contains millions of rows in certain tables. On each of these large tables, I have an associated "delete" trigger and "backup" table.
This backup table keeps "deleted" rows for the last 7 days: the trigger starts by copying deleted rows into that backup table, then performs a delete on the backup table in this fashion:
CREATE TRIGGER dbo.TRIGGER
ON dbo.EXAMPLE_DATA
FOR DELETE AS
INSERT INTO EXAMPLE_BACKUP
select getDate(), *
from deleted
DELETE from EXAMPLE_BACKUP
where modified < dateadd(dd, -7, getDate())
The structure of the backup table is similar to the original data table (keys, values). The only difference is that I add a "modified" field in the backup tables, which I include in the key.
A colleague of mine told me I should use "a loop" because my delete statement will cause timeouts/issues as soon as the backup table contains several million rows. Will that delete actually blow up at some point? Should I do something in a different manner?
It looks like Sybase 12.5 supports table partitioning; if your design is such that the data can be retained for exactly 7 days (using a hard breakpoint), you could partition your table on the day of the year, and construct a view to represent the current data. As the clock ticks past a certain day, you could truncate the older partitions explicitly.
Just a thought.
http://infocenter.sybase.com/help/index.jsp?topic=/com.sybase.dc20020_1251/html/databases/X15880.htm
Otherwise, deleting in a loop is a reasonable method for deleting large subsets of data without blowing up your transaction log. Here's one method using SQL Server:
http://sqlserverperformance.wordpress.com/2011/08/13/gradually-deleting-data-in-sql-server/
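For the backup-table purge specifically, a minimal sketch of that looping approach, assuming Sybase ASE's SET ROWCOUNT is available to cap each pass (untested; the batch size is illustrative):

SET ROWCOUNT 5000
WHILE 1 = 1
BEGIN
    DELETE EXAMPLE_BACKUP
    WHERE modified < DATEADD(dd, -7, GETDATE())

    IF @@ROWCOUNT = 0
        BREAK
END
SET ROWCOUNT 0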

Finding changed records in a database table

I have a problem that I haven't been able to come up with a solution for yet. I have a database (actually thousands of them at customer sites) that I want to extract data from periodically. I'd like to do a full data extract one time (select * from table) then after that only get rows that have changed.
The challenge is that there aren't any updated date columns in most of the tables that could be used to constrain the SQL query. I can't use a trigger based approach nor change the application that writes to the database since it's another group that develops the app and they are way backed up already.
I may be able to write to the database tables when doing the data extract, but would prefer not to do that. Does anyone have any ideas for how we might be able to do this?
You will have to programmatically mark the records. I see suggestions of an auto-incrementing field, but that will only get newly inserted records. How will you track updated or deleted records?
If you only want newly inserted records, then an auto-incrementing field will do the job; in subsequent data dumps, grab everything since the last value of the auto-increment field and then record the current value.
If you want updates, the minimum I can see is to have a last_update field and probably a trigger to populate it (a minimal sketch follows at the end of this answer). If the last_update is later than the last data dump, grab that record. This will get inserts and updates but not deletes.
You could try something like an 'instead of delete' trigger, if your RDBMS supports it, and NULL the last_update field. On subsequent data dumps, grab all records where this field is NULL and then delete them. But there would be problems with this (e.g. how to stop the app seeing them between the logical and physical delete).
The most foolproof method I can see is a set of history (audit) tables, with each change written to them. Then you select your data dump from there.
By the way, do you only care about knowing that updates have happened? What if 2 (or more) updates have happened? The history table is the only way I can see you capturing this scenario.
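A minimal sketch of the last_update idea in T-SQL (table, key, and trigger names are purely illustrative, and the questioner noted triggers may not be an option):

ALTER TABLE dbo.SourceTable ADD last_update DATETIME NULL;
GO

CREATE TRIGGER trg_SourceTable_Touch
ON dbo.SourceTable
AFTER INSERT, UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    UPDATE t
    SET last_update = GETDATE()
    FROM dbo.SourceTable AS t
    JOIN inserted AS i ON i.Id = t.Id;   -- assumes a key column named Id
END;
GO

-- subsequent extracts then only need rows newer than the last dump, e.g.:
-- SELECT * FROM dbo.SourceTable WHERE last_update > @LastDumpTime;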
This should isolate rows that have changed since your last backup. It assumes DestinationTable is a copy of SourceTable, right down to the key fields; if not, you could list out the important fields.
SELECT * FROM SourceTable
EXCEPT
SELECT * FROM DestinationTable
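Running the same comparison in the other direction (an added note, not part of the original answer) catches rows that were deleted or changed in the source:

SELECT * FROM DestinationTable
EXCEPT
SELECT * FROM SourceTable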