Best place to enforce retention policies for tables in SQL databases? - sql

I've got a table that records information about customer contacts. The table is defined as only "recent" contacts and I would like to delete all records for contacts older than 3 weeks.
For example, the table is:
create table recent_contact {
recent_contact_id int identity (1,1) primary key,
contact_text nvarchar(4000),
created datetime
}
create index createdIndex
on recent_contact (created)
All inserts to this table will happen via a stored procedure that just does an INSERT statement.
My question is about cleanup. I would like to delete all items older than 3 weeks. So far I have thought of 2 ways to accomplish cleanup.
have a background database job run periodically (e.g. every 5 hours) that will scan the above table and delete anything older than 3 weeks.
In the insert() stored procedure call, add the logic to clear out old data. This should only add constant time overhead since the table is indexed on [created] and each item is inserted once and deleted only once. So on average this sproc will do 1 insert and 1 delete.
// insert
insert into recent_contacts (contact_text, created)
values (#text, #createDate)
declare #threeWeeksAgo datetime
set #threeWeeksAgo = DATEADD(DAY, -21, GETDATE())
// remove old items
delete from recent_contacts
where created < #threeWeeksAgo
Of the two options, I went with option 2) because I felt it was a more elegant solution and wouldn't require a separate cleanup job. My coworker told me that this was bad practice and that retention policy should always be in a separate job that runs periodically. I.e. he thought option 1) was the better option.
I'm wondering what other people think? Generally speaking, what are the best practices for enforcing data retention policies?

Do 1). Option 2) is a misguided idea. There is no reason to shun the periodic job, but there are plenty of reasons to avoid punishing every single insert with the cost of looking up stale entries, and even more punishing for INSERTs to randomly hit a spike in response time because it was the unlucky winner of the lottery ticket to clean up some entries. A scheduled job on the other hand can be scheduled at convenient hours. And, last but not least, consider that your 'clever' design requires an INSERT in order for maintenance to occur.
In time you will learn that due to index tipping point issues the cleanup of data post retention period is actually a very tricky problem and many developer bodies pave that road. You will also discover that time series like a clustered index by the time column, not the least because of obsolete data cleanup issues.

I’d go with 1) because:
It is best to have a dedicated process for cleaning out old data. With 2), you have two processes intertwined in one routine, and if (when) one process changes, you’ll have to only modify that part of the code without messing up the other part.
Similar, what happens if it somehow breaks? With two processes, if something goes back, you’ve potentially doubled the necessary trouble-shooting effort.
What happens if, for whatever reason (outages, holidays, slow season) no one INSERTS a new row? Your data falls outside of the retention window, yet remains on the system.
Depending on size of code base and overall volume of data (which I’d guess is pretty small), these are more quibbles than anything else (unless volume grows significantly over time...) Even so, using "safer" tactics now builds good habits and practices, so that if some day you do have to work with high-volume systems, you're more likely to produce suitably robust code on the first pass.

Related

Oracle - Failover table or query manipulation

In a DWH environment for performance reasons I need to materialize a view into a table with approx. 100 columns and 50.000.000 records. Daily ~ 60.000 new records are inserted and ~80.000 updates on existing records are performed. By decision I am not allowed to use materialized views because the architect claims this leads to performance issues. I can't argue the case anymore, it's an irrevocable decision and I have to accept.
So I would like to make a daily full load in the night e.g. truncate and insert. But if the job fails the table may not be empty but must contain the data from the last successful population.
Therefore I thought about something like a failover table, that will be used instead if anything wents wrong:
IF v_load_job_failed THEN failover_table
ELSE regular_table
Is there something like a failover table that will be used instead of another table depending on a predefined condition? Something like a trigger that rewrites or manipulates a select-query before execution?
I know that is somewhat of a dirty workaround.
If you have space for (brief) period of time of double storage, I'd recommend
1) Clone existing table (all indexes, grants, etc) but name with _TMP
2) Load _TMP
3) Rename base table to _BKP
4) Rename _TMP to match Base table
5) Rename _BKP to _TMP
6) Truncate _TMP
ETA: #1 would be "one time"; 2-6 would be part of daily script.
This all assumes the performance of (1) detecting all new records and all updated records and (2) using MERGE (INSERT+UPDATE) to integrate those changed records into base table is "on par" with full load.
(Personally, I lean toward the full load approach anyway; on the day somebody tweaks a referential value that's incorporated into the view def and changes the value for all records, you'll find yourself waiting on a week-long update of 50,000,000 records. Such concerns are completely eliminated with full-load approach)
All that said, it should be noted that if MV is defined correctly, the MV-refresh approach is identical to this approach in every way, except:
1) Simpler / less moving pieces
2) More transparent (SQL of view def is attached to MV, not buried in some PL/SQL package or .sql script somewhere)
3) Will not have "blip" of time, between table renames, where queries / processes may not see table and fail.
ETA: It's possible to pull this off with "partition magic" in a couple of ways that avoid a "blip" of time where data or table is missing.
You can, for instance, have an even-day and odd-day partition. On odd-days, insert data (no commit), then truncate even-day (which simultaneously drops old day and exposes new). But is it worth the complexity? You need to add a column to partition by, and deal with complexity of reruns - if you're logic isn't tight, you'll wind up truncating the data you just loaded. This does, however, prevent a blip
One method that does avoid any "blip" and is a little less "whoops" prone:
1) Add "DUMMY" column that always has value 1.
2) Create _TMP table (also with "DUMMY" column) and partition by DUMMY column (so all rows go to same partition)
-- Daily script --
3) Load _TMP table
4) Exchange partition of _TMP table with main base table WITHOUT VALIDATION INCLUDING INDEXES
It bears repeating: all of these methods are equivalent if resource usage to MV-refresh; they're just more complex and tend to make developers feel "savvy" for solving problems that have already been solved.
Final note - addressing David Aldridge - first and foremost, daily refresh tables SHOULD NOT have logging enabled. In recovery scenario, just make sure you have step to run refresh scripts once base tables are restored.
Performance-wise, mileage is going to vary on this; but in my experience, the complexity of identifying and modifying changed/inserted rows can get very sticky (at some point, somebody will do something to base data that your script did not take into account; either yielding incorrect results or performance obstacles). DWH environments tend to be geared to accommodate processes like this with little problem. Unless/until the full refresh proves to have overhead above&beyond what the system can tolerate, it's generally the simplest "set-it-and-forget-it" approach.
On that note, if data can be logically separated into "live rows which might be updated" vs "historic rows that will never be updated", you can come up with a partitioning scheme and process that only truncates/reloads the "live" data on a daily basis.
A materialized view is just a set of metadata with an underlying table, and there's no reason why you cannot maintain a table in a manner similar to a materialized view's internal mechanisms.
I'd suggest using a MERGE statement as a single query rather than a truncate/insert. It will either succeed in its entirety or rollback to leave the previous data intact. 60,000 new records and 80,000 modified records is not much.
I think that you cannot go far wrong if you at least start with a simple, single SQL statement and then see how that works for you. If you do decide to go with a multistep process then ensure that it automatically recovers itself at any stage where it might go wrong part way through -- that might turn out to be the tricky bit.

T-SQL Optimize DELETE of many records

I have a table that can grew to millions records (50 millions for example). On each 20 minutes records that are older than 20 minutes are deleted.
The problems is that if the table has so many records such deletion can take a lot of time and I want to make it faster.
I can not do "truncate table" because I want to remove only records that are older than 20 minutes. I suppose that when doing the "delete" and filtering the information that need to be delete, the server is creating log file or something and this take much time?
Am I right? Is there a way to stop any flag or option to optimize the delete, and then to turn on the stopped option?
To expand on the batch delete suggestion, i'd suggest you do this far more regularly (every 20 seconds perhaps) - batch deletions are easy:
WHILE 1 = 1
BEGIN
DELETE TOP ( 4000 )
FROM YOURTABLE
WHERE YourIndexedDateColumn < DATEADD(MINUTE, -20, GETDATE())
IF ##ROWCOUNT = 0
BREAK
END
Your inserts may lag slightly whilst they wait for the locks to release but they should insert rather than error.
In regards to your table though, a table with this much traffic i'd expect to see on a very fast raid 10 array / perhaps even partitioned - are your disks up to it? Are your transaction logs on different disks to your data files? - they should be
EDIT 1 - Response to your comment
TO put a database into SIMPLE recovery:
ALTER DATABASE Database Name SET RECOVERY='SIMPLE'
This basically turns off transaction logging on the given database. Meaning in the event of data loss you would need loose all data since your last full backup. If you're OK with that, well this should save a lot of time when running large transactions. (NOTE that as the transaction is running, the logging still takes place in SIMPLE - to enable the rolling back of the transaction).
If there are tables within your database where you cant afford to loose data you'll need to leave your database in FULL recovery mode (i.e. any transaction gets logged (and hopefully flushed to *.trn files by your servers maintenance plans). As i stated in my question though, there is nothing stopping you having two databases, 1 in FULL and 1 in SIMPLE. the FULL database would be fore tables where you cant afford to loose any data (i.e. you could apply the transaction logs to restore data to a specific time) and the SIMPLE database would be for these massive high-traffic tables that you can allow data loss on in the event of a failure.
All of this is relevant assuming your creating full (*.bak) files every night & flushing your log files to *.trn files every half hour or so).
In regards to your index question, it's imperative your date column is indexed, if you check your execution plan and see any "TABLE SCAN" - that would be an indicator of a missing index.
Your date column i presume is DATETIME with a constraint setting the DEFAULT to getdate()?
You may find that you get better performance by replacing that with a BIGINT YYYYMMDDHHMMSS and then apply a CLUSTERED index to that column - note however that you can only have 1 clustered index per table, so if that table already has one you'll need to use a Non-Clustered index. (in case you didnt know, a clustered index basically tells SQL to store the information in that order, meaning that when you delete rows > 20 minutes SQL can literally delete stuff sequentially rather than hopping from page to page.
The log problem is probably due to the number of records deleted in the trasaction, to make things worse the engine may be requesting a lock per record (or by page wich is not so bad)
The one big thing here is how you determine the records to be deleted, i'm assuming you use a datetime field, if so make sure you have an index on the column otherwise it's a sequential scan of the table that will really penalize your process.
There are two things you may do depending of the concurrency of users an the time of the delete
If you can guarantee that no one is going to read or write when you delete, you can lock the table in exclusive mode and delete (this takes only one lock from the engine) and release the lock
You can use batch deletes, you would make a script with a cursor that provides the rows you want to delete, and you begin transtaction and commit every X records (ideally 5000), so you can keep the transactions shorts and not take that many locks
Take a look at the query plan for the delete process, and see what it shows, a sequential scan of a big table its never good.
Unfortunately for the purpose of this question and fortunately for the sake of consistency and recoverability of the databases in SQL server, putting a database into Simple recovery mode DOES NOT disable logging.
Every transaction still gets logged before committing it to the data file(s), the only difference would be that the space in the log would get released (in most cases) right after the transaction is either rolled back or committed in the Simple recovery mode, but this is not going to affect the performance of the DELETE statement in one way or another.
I had a similar problem when I needed to delete more than 70% of the rows from a big table with 3 indexes and a lot of foreign keys.
For this scenario, I saved the rows I wanted in a temp table, truncated the original table and reinserted the rows, something like:
SELECT * INTO #tempuser FROM [User] WHERE [Status] >= 600;
TRUNCATE TABLE [User];
INSERT [User] SELECT * FROM #tempuser;
I learned this technique with this link that explains:
DELETE is a a fully logged operation , and can be rolled back if something goes wrong
TRUNCATE Removes all rows from a table without logging the individual row deletions
In the article you can explore other strategies to resolve the delay in deleting many records, that one worked to me

SQL Server - On-Insert Trigger vs. Per-Minute Job

My question is mainly concerned with "what's best for performance", but kinda "philosophically" speaking as well (if it makes a difference)... so let's jump right in.
[TableA].[ColumnB] stores a value that needs to exist in [TableC].[ColumnD]. Right off the bat, no answers involving Foreign-keys - just assume that they're "not allowed" in this environment for whatever reason.
But due to "circumstances x,y,z", [TableA].[ColumnB] sometimes gets values that do not exist in [TableC].[ColumnD], because, let's say, [TableA] gets populated from an object that exists in running code as a "serialized blob", an in-memory representation of the data, and the [ColumnB] values got populated before those values were deleted from [TableC].[ColumnD] by some other process. ANYWAY, this is for example's sake, so don't get bogged down in the "why does this condition happen", just accept that it does.
To "fix" the problem, which method is best of these two: 1. make a Trigger that fires on-INSERT on [TableA], to Update [ColumnB] to the value that it should be (and assume I have a "mapping" of bad-to-good values). Or, 2. run a scheduled-Job every hour/minute/whatever that runs Update queries to change all possible "bad" values to their corresponding "good" values.
To put it more generally, what's better for performance and/or what is best practice: a Trigger, or a periodic Scheduled-Job? In context, let's say [TableA] is typically on the order of hundreds of thousands of rows, with Inserts happening 10-100 records at-a-time, as frequently as every few minutes to as rarely as a few times per day.
On-insert.
Doing triggers is like callbacks- They're more logically sound, and they spread any lag into every query. Doing continual checks (called polling or cron-jobs), you end up with more severe moments of lag every now and then. In almost all cases, using triggers/callbacks are the better way to go as having 1ms of lag added to every query is better than 100ms of lag at seemingly random intervals.
Use of triggers is generally discouraged, but your load is light and your case seems to be a natural trigger case. Consider using instead-of trigger to avoid two operations on the same row (one insert instead of insert and update). It may be the simplest and most reliable solution (as long as you have written reliable code in the trigger that won't cause the whole operation to crash).
Since you are considering a batch job, you are not concerned with timing issues. I.e it's OK with your application that tables may be out of sync for 1 minute or even 1 hour. That's the major difference with the trigger approach, which will guarantee that tables are in sync all the time. Potential timing issues would make me uncomfortable. On the plus side, you won't be at risk of crashing the original insert operation with your trigger.
If you go this route, please consider Change Tracking feature. Change tracking will indicate which rows have been inserted since the last time you checked, so you won't have to scan the whole table for new records. Alternatively, if your TableA has an INDENITY primary or unique key, you can implement similar design without change tracking functionality.
Triggers are both best performance and practice, as they maintain referential integrity as well as allowing the server to optimise for performance.
You didn't say what version of SQL Server you were using, but if it's 2008+, you can use Change Data Capture to keep track of data changes to your "primary" table. Then, periodically, you can run a batch over the change table and do whatever processing is required over that small set.

Deleting rows from a contended table

I have a DB table in which each row has a randomly generated primary key, a message and a user. Each user has about 10-100 messages but there are 10k-50k users.
I write the messages daily for each user in one go. I want to throw away the old messages for each user before writing the new ones to keep the table as small as possible.
Right now I effectively do this:
delete from table where user='mk'
Then write all the messages for that user. I'm seeing a lot of contention because I have lots of threads doing this at the same time.
I do have an additional requirement to retain the most recent set of messages for each user.
I don't have access to the DB directly. I'm trying to guess at the problem based on some second hand feedback. The reason I'm focusing on this scenario is that the delete query is showing a lot of wait time (again - to the best of my knowledge) plus it's a newly added bit of functionality.
Can anyone offer any advice?
Would it be better to:
select key from table where user='mk'
Then delete individual rows from there? I'm thinking that might lead to less brutal locking.
If you do this everyday for every user, why not just delete every record from the table in a single statement? Or even
truncate table whatever reuse storage
/
edit
The reason why I suggest this approach is that the process looks like a daily batch upload of user messages preceded by a clearing out of the old messages. That is, the business rules seems to me to be "the table will hold only one day's worth of messages for any given user". If this process is done for every user then a single operation would be the most efficient.
However, if users do not get a fresh set of messages each day and there is a subsidiary rule which requires us to retain the most recent set of messages for each user then zapping the entire table would be wrong.
No, it is always better to perform a single SQL statement on a set of rows than a series of "row-by-row" (or what Tom Kyte calls "slow-by-slow") operations. When you say you are "seeing a lot of contention", what are you seeing exactly? An obvious question: is column USER indexed?
(Of course, the column name can't really be USER in an Oracle database, since it is a reserved word!)
EDIT: You have said that column USER is not indexed. This means that each delete will involve a full table scan of up to 50K*100 = 5 million rows (or at best 10K * 10 = 100,000 rows) to delete a mere 10-100 rows. Adding an index on USER may solve your problems.
Are you sure you're seeing lock contention? It seems more likely that you're seeing disk contention due to too many concurrent (but unrelated updates). The solution to that is simply to reduce the number of threads you're using: Less disk contention will mean higher total throughput.
I think you need to define your requirements a bit clearer...
For instance. If you know all of the users who you want to write messages for, insert the IDs into a temp table, index it on ID and batch delete. Then the threads you are firing off are doing two things. Write the ID of the user to a temp table, Write the message to another temp table. Then when the threads have finished executing, the main thread should
DELETE * FROM Messages INNER JOIN TEMP_MEMBERS ON ID = TEMP_ID
INSERT INTO MESSAGES SELECT * FROM TEMP_messges
im not familiar with Oracle syntax, but that is the way i would approach it IF the users messages are all done in rapid succession.
Hope this helps
TALK TO YOUR DBA
He is there to help you. When we DBAs take access away from the developers for something such as this, it is assumed we will provide the support for you for that task. If your code is taking too long to complete and that time appears to be tied up in the database, your DBA will be able to look at exactly what is going on and offer suggestions or possibly even solve the problem without you changing anything.
Just glancing over your problem statement, it doesn't appear you'd be looking at contention issues, but I don't know anything about your underlying structure.
Really, talk to your DBA. He will probably enjoy looking at something fun instead of planning the latest CPU deployment.
This might speed things up:
Create a lookup table:
create table rowid_table (row_id ROWID ,user VARCHAR2(100));
create index rowid_table_ix1 on rowid_table (user);
Run a nightly job:
truncate table rowid_table;
insert /*+ append */ into rowid_table
select ROWID row_id , user
from table;
dbms_stats.gather_table_stats('SCHEMAOWNER','ROWID_TABLE');
Then when deleting the records:
delete from table
where ROWID IN (select row_id
from rowid_table
where user = 'mk');
Your own suggestion seems very sensible. Locking in small batches has two advantages:
the transactions will be smaller
locking will be limited to only a few rows at a time
Locking in batches should be a big improvement.

Physical vs. logical (hard vs. soft) delete of database record? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed last year.
Improve this question
What is the advantage of doing a logical/soft delete of a record (i.e. setting a flag stating that the record is deleted) as opposed to actually or physically deleting the record?
Is this common practice?
Is this secure?
Advantages are that you keep the history (good for auditing) and you don't have to worry about cascading a delete through various other tables in the database that reference the row you are deleting. Disadvantage is that you have to code any reporting/display methods to take the flag into account.
As far as if it is a common practice - I would say yes, but as with anything whether you use it depends on your business needs.
EDIT: Thought of another disadvantange - If you have unique indexes on the table, deleted records will still take up the "one" record, so you have to code around that possibility too (for example, a User table that has a unique index on username; A deleted record would still block the deleted users username for new records. Working around this you could tack on a GUID to the deleted username column, but it's a very hacky workaround that I wouldn't recommend. Probably in that circumstance it would be better to just have a rule that once a username is used, it can never be replaced.)
Are logical deletes common practice? Yes I have seen this in many places. Are they secure? That really depends are they any less secure then the data was before you deleted it?
When I was a Tech Lead, I demanded that our team keep every piece of data, I knew at the time that we would be using all that data to build various BI applications, although at the time we didn't know what the requirements would be. While this was good from the standpoint of auditing, troubleshooting, and reporting (This was an e-commerce / tools site for B2B transactions, and if someone used a tool, we wanted to record it even if their account was later turned off), it did have several downsides.
The downsides include (not including others already mentioned):
Performance Implications of keeping all that data, We to develop various archiving strategies. For example one area of the application was getting close to generating around 1Gb of data a week.
Cost of keeping the data does grow over time, while disk space is cheap, the amount of infrastructure to keep and manage terabytes of data both online and off line is a lot. It takes a lot of disk for redundancy, and people's time to ensure backups are moving swiftly etc.
When deciding to use logical, physical deletes, or archiving I would ask myself these questions:
Is this data that might need to be re-inserted into the table. For example User Accounts fit this category as you might activate or deactivate a user account. If this is the case a logical delete makes the most sense.
Is there any intrinsic value in storing the data? If so how much data will be generated. Depending on this I would either go with a logical delete, or implement an archiving strategy. Keep in mind you can always archive logically deleted records.
It might be a little late but I suggest everyone to check Pinal Dave's blog post about logical/soft delete:
I just do not like this kind of design [soft delete] at all. I am firm believer of the architecture where only necessary data should be in single table and the useless data should be moved to an archived table. Instead of following the isDeleted column, I suggest the usage of two different tables: one with orders and another with deleted orders. In that case, you will have to maintain both the table, but in reality, it is very easy to maintain. When you write UPDATE statement to the isDeleted column, write INSERT INTO another table and DELETE it from original table. If the situation is of rollback, write another INSERT INTO and DELETE in reverse order. If you are worried about a failed transaction, wrap this code in TRANSACTION.
What are the advantages of the smaller table verses larger table in above described situations?
A smaller table is easy to maintain
Index Rebuild operations are much faster
Moving the archive data to another filegroup will reduce the load of primary filegroup (considering that all filegroups are on different system) – this will also speed up the backup as well.
Statistics will be frequently updated due to smaller size and this will be less resource intensive.
Size of the index will be smaller
Performance of the table will improve with a smaller table size.
I'm a NoSQL developer, and on my last job, I worked with data that was always critical for someone, and if it was deleted by accident in the same day that was created, I were not able to find it in the last backup from yesterday! In that situation, soft deletion always saved the day.
I did soft-deletion using timestamps, registering the date the document was deleted:
IsDeleted = 20150310 //yyyyMMdd
Every Sunday, a process walked on the database and checked the IsDeleted field. If the difference between the current date and the timestamp was greater than N days, the document was hard deleted. Considering the document still be available on some backup, it was safe to do it.
EDIT: This NoSQL use case is about big documents created in the database, tens or hundreds of them every day, but not thousands or millions. By general, they were documents with the status, data and attachments of workflow processes. That was the reason why there was the possibility of a user deletes an important document. This user could be someone with Admin privileges, or maybe the document's owner, just to name a few.
TL;DR My use case was not Big Data. In that case, you will need a different approach.
One pattern I have used is to create a mirror table and attach a trigger on the primary table, so all deletes (and updates if desired) are recorded in the mirror table.
This allows you to "reconstruct" deleted/changed records, and you can still hard delete in the primary table and keep it "clean" - it also allows the creation of an "undo" function, and you can also record the date, time, and user who did the action in the mirror table (invaluable in witch hunt situations).
The other advantage is there is no chance of accidentally including deleted records when querying off the primary unless you deliberately go to the trouble of including records from the mirror table (you may want to show live and deleted records).
Another advantage is that the mirror table can be independently purged, as it should not have any actual foreign key references, making this a relatively simple operation in comparison to purging from a primary table that uses soft deletes but still has referential connections to other tables.
What other advantages? - great if you have a bunch of coders working on the project, doing reads on the database with mixed skill and attention to detail levels, you don't have to stay up nights hoping that one of them didn’t forget to not include deleted records (lol, Not Include Deleted Records = True), which results in things like overstating say the clients available cash position which they then go buy some shares with (i.e., as in a trading system), when you work with trading systems, you will find out very quickly the value of robust solutions, even though they may have a little bit more initial "overhead".
Exceptions:
- as a guide, use soft deletes for "reference" data such as user, category, etc, and hard deletes to a mirror table for "fact" type data, i.e., transaction history.
I used to do soft-delete, just to keep old records. I realized that users don't bother to view old records as often as I thought. If users want to view old records, they can just view from archive or audit table, right? So, what's the advantage of soft-delete? It only leads to more complex query statement, etc.
Following are the things i've implemented, before I decided to not-soft-delete anymore:
implement audit, to record all activities (add,edit,delete). Ensure that there's no foreign key linked to audit, and ensure this table is secured and nobody can delete except administrators.
identify which tables are considered "transactional table", which very likely that it will be kept for long time, and very likely user may want to view the past records or reports. For example; purchase transaction. This table should not just keep the id of master table (such as dept-id), but also keep the additional info such as the name as reference (such as dept-name), or any other necessary fields for reporting.
Implement "active/inactive" or "enable/disable" or "hide/show" record of master table. So, instead of deleting record, the user can disable/inactive the master record. It is much safer this way.
Just my two cents opinion.
I'm a big fan of the logical delete, especially for a Line of Business application, or in the context of user accounts. My reasons are simple: often times I don't want a user to be able to use the system anymore (so the account get's marked as deleted), but if we deleted the user, we'd lose all their work and such.
Another common scenario is that the users might get re-created a while after having been delete. It's a much nicer experience for the user to have all their data present as it was before they were deleted, rather than have to re-create it.
I usually think of deleting users more as "suspending" them indefinitely. You never know when they'll legitimately need to be back.
I commonly use logical deletions - I find they work well when you also intermittently archive off the 'deleted' data to an archived table (which can be searched if needed) thus having no chance of affecting the performance of the application.
It works well because you still have the data if you're ever audited. If you delete it physically, it's gone!
I almost always soft delete and here's why:
you can restore deleted data if a customer asks you to do so. More happy customers with soft deletes. Restoring specific data from backups is complex
checking for isdeleted everywhere is not an issue, you have to check for userid anyway (if the database contains data from multiple users). You can enforce the check by code, by placing those two checks on a separate function (or use views)
graceful delete. Users or processes dealing with deleted content will continue to "see" it until they hit the next refresh. This is a very desirable feature if a process is processing some data which is suddenly deleted
synchronization: if you need to design a synchronization mechanism between a database and mobile apps, you'll find soft deletes much easier to implement
Re: "Is this secure?" - that depends on what you mean.
If you mean that by doing physical delete, you'll prevent anyone from ever finding the deleted data, then yes, that's more or less true; you're safer in physically deleting the sensitive data that needs to be erased, because that means it's permanently gone from the database. (However, realize that there may be other copies of the data in question, such as in a backup, or the transaction log, or a recorded version from in transit, e.g. a packet sniffer - just because you delete from your database doesn't guarantee it wasn't saved somewhere else.)
If you mean that by doing logical delete, your data is more secure because you'll never lose any data, that's also true. This is good for audit scenarios; I tend to design this way because it admits the basic fact that once data is generated, it'll never really go away (especially if it ever had the capability of being, say, cached by an internet search engine). Of course, a real audit scenario requires that not only are deletes logical, but that updates are also logged, along with the time of the change and the actor who made the change.
If you mean that the data won't fall into the hands of anyone who isn't supposed to see it, then that's totally up to your application and its security structure. In that respect, logical delete is no more or less secure than anything else in your database.
Logical deletions if are hard on referential integrity.
It is the right think to do when there is a temporal aspect of the table data (are valid FROM_DATE - TO_DATE).
Otherwise move the data to an Auditing Table and delete the record.
On the plus side:
It is the easier way to rollback (if at all possible).
It is easy to see what was the state at a specific point in time.
I strongly disagree with logical delete because you are exposed to many errors.
First of all queries, each query must take care the IsDeleted field and the possibility of error becomes higher with complex queries.
Second the performance: imagine a table with 100000 recs with only 3 active, now multiply this number for the tables of your database; another performance problem is a possible conflict with new records with old (deleted records).
The only advantage I see is the history of records, but there are other methods to achieve this result, for example you can create a logging table where you can save info: TableName,OldValues,NewValues,Date,User,[..] where *Values ​​can be varchar and write the details in this form fieldname : value; [..] or store the info as xml.
All this can be achieved via code or Triggers but you are only ONE table with all your history.
Another options is to see if the specified database engine are native support for tracking change, for example on SQL Server database there are SQL Track Data Change.
It's fairly standard in cases where you'd like to keep a history of something (e.g. user accounts as #Jon Dewees mentions). And it's certainly a great idea if there's a strong chance of users asking for un-deletions.
If you're concerned about the logic of filtering out the deleted records from your queries getting messy and just complicating your queries, you can just build views that do the filtering for you and use queries against that. It'll prevent leakage of these records in reporting solutions and such.
There are requirements beyond system design which need to be answered. What is the legal or statutory requirement in the record retention? Depending on what the rows are related to, there may be a legal requirement that the data be kept for a certain period of time after it is 'suspended'.
On the other hand, the requirement may be that once the record is 'deleted', it is truly and irrevocably deleted. Before you make a decision, talk to your stakeholders.
Mobile apps that depend on synchronisation might impose the use of logical rather than physical delete: a server must be able to indicate to the client that a record has been (marked as) deleted, and this might not be possible if records were physically deleted.
I just wanted to expand on the mentioned unique constraint problem.
Suppose I have a table with two columns: id and my_column. To support soft-deletes I need to update my table definition to this:
create table mytable (
id serial primary key,
my_column varchar unique not null,
deleted_at datetime
)
But if a row is soft-deleted, I want my_column constraint to be ignored, because deleted data should not interfere with non-deleted data. My original model will not work.
I would need to update my data definition to this:
create table mytable (
id serial primary key,
my_column varchar not null,
my_column_repetitions integer not null default 0,
deleted_at datetime,
unique (my_column, my_column_repetitions),
check (deleted_at is not null and my_column_repetitions > 0 or deleted_at is null and my_column_repetitions = 0)
)
And apply this logic: when a row is current, i.e. not deleted, my_column_repetitions should hold the default value 0 and when the row is soft-deleted its my_column_repetitions needs to be updated to (max. number of repetitions on soft-deleted rows) + 1.
The latter logic must be implemented programmatically with a trigger or handled in my application code and there is no check that I could set.
Repeat this is for every unique column!
I think this solution is really hacky and would favor a separate archive table to store deleted rows.
They don't let the database perform as it should rendering such things as the cascade functionality useless.
For simple things such as inserts, in the case of re-inserting, then the code behind it doubles.
You can't just simply insert, instead you have to check for an existence and insert if it doesn't exist before or update the deletion flag if it does whilst also updating all other columns to the new values. This is seen as an update to the database transaction log and not a fresh insert causing inaccurate audit logs.
They cause performance issues because tables are getting glogged with redundant data. It plays havock with indexing especially with uniqueness.
I'm not a big fan of logical deletes.
To reply to Tohid's comment, we faced same problem where we wanted to persist history of records and also we were not sure whether we wanted is_deleted column or not.
I am talking about our python implementation and a similar use-case we hit.
We encountered https://github.com/kvesteri/sqlalchemy-continuum which is an easy way to get versioning table for your corresponding table. Minimum lines of code and captures history for add, delete and update.
This serves more than just is_deleted column. You can always backref version table to check what happened with this entry. Whether entry got deleted, updated or added.
This way we didn't need to have is_deleted column at all and our delete function was pretty trivial. This way we also don't need to remember to mark is_deleted=False in any of our api's.
Soft Delete is a programming practice that being followed in most of the application when data is more relevant. Consider a case of financial application where a delete by the mistake of the end user can be fatal.
That is the case when soft delete becomes relevant. In soft delete the user is not actually deleting the data from the record instead its being flagged as IsDeleted to true (By normal convention).
In EF 6.x or EF 7 onward Softdelete is Added as an attribute but we have to create a custom attribute for the time being now.
I strongly recommend SoftDelete In a database design and its a good convention for the programming practice.
Most of time softdeleting is used because you don't want to expose some data but you have to keep it for historical reasons (A product could become discontinued, so you don't want any new transaction with it but you still need to work with the history of sale transaction). By the way, some are copying the product information value in the sale transaction data instead of making a reference to the product to handle this.
In fact it looks more like a rewording for a visible/hidden or active/inactive feature. Because that's the meaning of "delete" in business world. I'd like to say that Terminators may delete people but boss just fire them.
This practice is pretty common pattern and used by a lot of application for a lot of reasons. As It's not the only way to achieve this, so you will have thousand of people saying that's great or bullshit and both have pretty good arguments.
From a point of view of security, SoftDelete won't replace the job of Audit and it won't replace the job of backup too. If you are afraid of "the insert/delete between two backup case", you should read about Full or Bulk recovery Models. I admit that SoftDelete could make the recovery process more trivial.
Up to you to know your requirement.
To give an alternative, we have users using remote devices updating via MobiLink. If we delete records in the server database, those records never get marked deleted in the client databases.
So we do both. We work with our clients to determine how long they wish to be able to recover data. For example, generally customers and products are active until our client say they should be deleted, but history of sales is only retained for 13 months and then deletes automatically. The client may want to keep deleted customers and products for two months but retain history for six months.
So we run a script overnight that marks things logically deleted according to these parameters and then two/six months later, anything marked logically deleted today will be hard deleted.
We're less about data security than about having enormous databases on a client device with limited memory, such as a smartphone. A client who orders 200 products twice a week for four years will have over 81,000 lines of history, of which 75% the client doesn't care if he sees.
It all depends on the use case of the system and its data.
For example, if you are talking about a government regulated system (e.g. a system at a pharmaceutical company that is considered a part of the quality system and must follow FDA guidelines for electronic records), then you darned well better not do hard deletes! An auditor from the FDA can come in and ask for all records in the system relating to product number ABC-123, and all data better be available. If your business process owner says the system shouldn't allow anyone to use product number ABC-123 on new records going forward, use the soft-delete method instead to make it "inactive" within the system, while still preserving historical data.
However, maybe your system and its data has a use case such as "tracking the weather at the North Pole". Maybe you take temperature readings once every hour, and at the end of the day aggregate a daily average. Maybe the hourly data will no longer ever be used after aggregation, and you'd hard-delete the hourly readings after creating the aggregate. (This is a made-up, trivial example.)
The point is, it all depends on the use case of the system and its data, and not a decision to be made purely from a technological standpoint.
Well! As everyone said, it depends on the situation.
If you have an index on a column like UserName or EmailID - and you never expect the same UserName or EmailID to be used again; you can go with a soft delete.
That said, always check if your SELECT operation uses the primary key. If your SELECT statement uses a primary key, adding a flag with the WHERE clause wouldn't make much difference. Let's take an example (Pseudo):
Table Users (UserID [primary key], EmailID, IsDeleted)
SELECT * FROM Users where UserID = 123456 and IsDeleted = 0
This query won't make any difference in terms of performance since the UserID column has a primary key. Initially, it will scan the table based on PK and then execute the next condition.
Cases where soft deletes cannot work at all:
Sign-up in majorly all websites take EmailID as your unique identification. We know very well, once an EmailID is used on a website like facebook, G+, it cannot be used by anyone else.
There comes a day when the user wants to delete his/her profile from the website. Now, if you make a logical delete, that user won't be able to register ever again. Also, registering again using the same EmailID wouldn't mean to restore the entire history. Everyone knows, deletion means deletion. In such scenarios, we have to make a physical delete. But in order to maintain the entire history of the account, we should always archive such records in either archive tables or deleted tables.
Yes, in situations where we have lots of foreign tables, handling is quite cumbersome.
Also keep in mind that soft/logical deletes will increase your table size, so the index size.
I have already answered in another post.
However, I think my answer more fit to the question here.
My practical solution for soft-delete is archiving by creating a new
table with following columns: original_id, table_name, payload,
(and an optional primary key `id).
Where original_id is the original id of deleted record, table_name
is the table name of the deleted record ("user" in your case),
payload is JSON-stringified string from all columns of the deleted
record.
I also suggest making an index on the column original_id for latter
data retrievement.
By this way of archiving data. You will have these advantages
Keep track of all data in history
Have only one place to archive records from any table, regardless of the deleted record's table structure
No worry of unique index in the original table
No worry of checking foreign index in the original table
No more WHERE clause in every query to check for deletion
The is already a discussion
here explaining why
soft-deletion is not a good idea in practice. Soft-delete introduces
some potential troubles in future such as counting records, ...
It depends on the case, consider the below:
Usually, you don't need to "soft-delete" a record.
Keep it simple and fast.
e.g. Deleting a product no longer available, so you don't have to check the product isn't soft-deleted all over your app (count, product list, recommended products, etc.).
Yet, you might consider the "soft-delete" in a data warehouse model. e.g. You are viewing an old receipt on a deleted product.*
Advantages are data preservation/perpetuation. A disadvantage would be a decrease in performance when querying or retrieving data from tables with significant number of soft deletes.
In our case we use a combination of both: as others have mentioned in previous answers, we soft-delete users/clients/customers for example, and hard-delete on items/products/merchandise tables where there are duplicated records that don't need to be kept.