Checking Unenforced Conditions before Insert - sql

I've seen a few questions where developers were asking about checking if a record already exists before insert to avoid a primary key constraint violation... this is not a big problem IMHO. But, what if the condition I want to check is not to avoid a DB error, because no DB error will be raised if the condition exists during insert.
For example, a large de-normalized table that has several state values, but I only want to allow inserting of a new row that only changes the state of a single state value column at a time.
The columns could be something like:
id
animal_id
health_problem_id
treatment_plan
treatment_approved
treatment_scheduled
treatment_in_progress
treatment_complete
who_updated
when_updated
When a user updates a state field in their application view, the other unchanged information will be used as part of the insert if it matches what they currently see, otherwise they will be notified, and then they can review the current information and try again.
For example, this would prevent two different users scheduling something twice with no immediate notification to one of the users.
Now let's pretend this table is going to be used heavily and concurrently (my example may not the best for this...). For those with astounding imagination let's also assume that there is a reasonable likelihood people will be working on the same animal at the same time. Let's also assume data integrity is important.
Obviously, if it's going to be used concurrently and heavily locking the whole table is NOT a good solution, but certainly would insure data integrity.
Is there a performant solution to this problem?
Is bad database design the culprit here (ie. de-normalization of the data.)? If that is the case what design would solve this problem?
Is this design ok if it's not going to be used heavily, and there is a low likelihood of people working on the same animal? (Note: data integrity is still extremely important.)
Same question as 3, but with the requirement of data integrity being less important.
Note: My example, is an example this is not something I'm doing with data on animals :)
Opinions and/or details on how to solve solve this generically with existing relational databases is certainly preferred.

Related

Why would all tables not be temporal tables by default?

I'm creating a new database and plan to use temporal tables to log all changes. The data stored will be updated daily but will not be more than 5000 records per table
Is there any reason I shouldn't just make all tables temporal?
Ps. I am aware of the space usage of temporal tables, this is not as far as I understand a problem
I am aware of the space usage of temporal tables, this is not as far as I understand a problem
On the contrary - it's pretty big problem - and there are many other downsides too.
When you use Temporal Tables (at least in SQL Server), every UPDATE operation (even if the data is unchanged) results in a copy being made in the History table (granted, under-the-hood this may be a COW-optimized copy, but it's still another conceptual entity instance).
Secondly - from my personal experience working with LoB applications: most changes to databases are not important enough to justify creating an entire copy of a row, for example, imagine a table with 4 columns ( CREATE TABLE People ( FirstName nvarchar(50), LastName nvarchar(50), Address nvarchar(200), Biography nvarchar(max): whenever a typo in FirstName is fixed then all of the data in the other columns is copied-over, even if Biography contains a 4GB worth of text data - even if this is COW-optimized it's still creating copies for every user action that results in a change.
Is there any reason I shouldn't just make all tables temporal?
The main reason, in my experience, is that it makes changing your table schema much harder because the schemas (aka "table design") of the Active and History tables must be identical: so if you have a table with a NULL column that you want to change to a NOT NULL column and you have NULL values in your History table then you're stuck - at least until you write a data transformation step that will supply the History table with valid data - it's basically creating more work for yourself with little to gain.
Also, don't confuse Temporal Tables with Immutable, Append-only data-stores (like the Bitcoin Blockchain) - while they share similar design objectives (except true immutability) they exist to solve different problems - and if you consider the size requirements and scaling issues of the Ethereum block-chain (over a terabyte by now) then that should give you another idea why it's probably not a good idea.
Finally, even if Temporal Tables didn't have these issues - you still need to go through the effort to write your main software such that it can natively handle temporal data - and things like Entity Framework still don't have built-in support for querying Temporal Data.
...and even with all the historical records you've managed to save in the History table, what do you want it for? Do you really need to track every corrected typo and small, inconsequential change? How will your users react to needing to manually audit the changes to determine what's meaningful or not?
In short:
If your table design probably won't change much in the future...
AND small updates happen infrequently...
OR large updates happen regularly AND you need an audit record
...then go ahead and use Temporal Tables wherever you can.
if not, then you're just creating more future work for yourself with little to gain.
"log all changes" is not a good use case for the temporal features in SQL.
The use case for the SYSTEM TIME temporal feature is when there is a pressing requirement obligating you/the user to be able to easily and quickly reconstruct (should be in scare quotes actually) the state that your database was in at a given moment in time. That is difficult and error-prone and expensive if all you have is a true log of past changes. But if you can suffice with keeping just a log of the changes, then do that (it will be difficult and error-prone and expensive to recreate past database states from the current state and your log, but that's not a pressing problem if there's no pressing need).
Also note that the SQL temporal features encompass also the notion of BUSINESS TIME, which is a different time dimension than the SYSTEM TIME one. Business time is targeted to keeping a history of the world situation, system time is targeted at keeping a history of your database itself, that is, a history of your records of the world situation.

Delete records from a database or simply hide them during Reads?

I'm wondering if someone can provide various rationales/solutions for knowing when to delete records from a database vs. simplying hiding them during read operations via a field value, e.g., is_hidden=1.
My application is a social network/e-commerce web application. I tend to favor the is_hidden strategy but as one's site grows I can see this leading to a really badly performing site.
Here's my list. What items on the list am I missing? Is the list's prioritization good?
Delete:
rationale: Reduce table size/improve database performance
rationale: Useful if data is trivial to create
solution: SQL DELETE
is_hidden:
rationale: allow users/DBA to restore data/useful for sensitive & hard to CREATE data
rationale: can DELETE it later if necessary
solution: SQL SELECT ... WHERE is_hidden!=1
Thoughts?
The major reason you might want to do a soft-delete is where an audit trail requires it. For example we might have an invoice table along with an voided column and we might normally just omit voided invoices. This preserves an audit trail so we know what invoices were entered and which ones were voided.
There are many fields (particularly in finance) where soft deletes are preferred for this reason. Typically the number of deletes are small compared to the data set, and you don't want to really delete because actually doing so might allow someone to cover for theft of money or real-world goods. The "deleted" data can then be shown for those queries which require it.
A good non-db example would be as follows: "When writing in your general journal or general ledger, write with a pen and if you make an error that you spot right away, cross it out with a single line so that the original data is still legible, and write correct values underneath. If you find out later, either write in an adjustment entry or write in a reversal and a new one." In that case, your principle reason is to see what was changed when so that you can audit those changes if there is ever a question.
The people typically needing to see such information are likely to be financial or other auditors.
You've already said everthing in your question:
DELETE will entirely delete the entry and
is_hidden=1 will hide it.
So: If there's the possibility that you will need the data in the future you should use the hiding method. If you are sure that the data will never ever be used again: Use delete.
Concerning performance:
You can use two tables:
1 for visible items
1 for the hidden ones
Or even three tables:
1 for visible
1 for hidden
1 as an archive, where you move all the hidden data that's older than 3 years or something
Or:
1 for visible and hidden ones (using the is_hidden flag)
1 as an archive for old entries
It's all up to you. But if you look at facebook or google: They will never ever delete anything! Data == Money == Power ;)
As far as performance and ease of development, it may be possible on your platform to have filtered indexes, indexed views etc which would mean that keeping the soft-deleted data around has little impact on your system.

SQL Server Auditing Data in the Same Table

A project I'm working on requires that a record be digitally "signed" and after that any modifications would create a new "version" of the row. The "signed" record can't be modified for regulatory reasons and new versions shouldn't be modified very often. In the past, done so by creating a separate logging table with the same schema as the main table with some extra columns for tracking who modified it and when.
However, after doing some work with SharePoint where ALL data (including different versions) is put into the same table I thought of a different approach which I can't find any examples of people doing: I could put the new version of the row right in the same table and increment the version number. Then add the version number to the PK.
PROS:
Implementation is easy, just create an "Instead of update" trigger which performs an insert instead of an update of the row is "signed". I could easily add a IsCurrentVersion column to be updated in the trigger.
Querying for older versions is easy, just get all the records with
the ID I want let the user choose from the list.
A trigger is nice because it guarantees that a row CAN'T be updated if signed (for regulatory and audit purposes).
Schema changes to the table don't have to be replicated to the mirror "logging" table.
CONS:
The table could get a bit larger but most of the time the record won't be changed after "signing" it. The client estimated around 100,000 rows/year max at current usage levels. SQL Server can handle hundreds of millions of rows so this doesn't seem too bad.
Indexing and performance could be an issue. SharePoint adds a tp_CalculatedVersion int to the PK where the calculated number is always 0 for the latest version. I could do the same and calculate it based off the Version number. Would that help performance?
There is an extra step in querying the data to make sure you get the latest version but that could be handled in a SP.
What other cons are there in this scenario. Am I missing anything??
I've seen this pattern used in an enterprise system before,and IMO it wasn't successful.
You are Mixing two different concerns here, viz storage of live and audit data. Queries to this table will always need to keep in mind whether they are seeking leaf or audit data (e.g. reports) - new team members may find this non intuitive. You would likely need to encapsulate this complexity with views etc.
As you mentioned performance will always be a concern. Inserting a new record will also need to update the previous record to mark it as inactive.You may also need to consider changing your clustered index to keep all versions on the same page.
Foreign keys to this table are going to be problematic. Do you
reference an exact version record? Do you then fix up the foreign
keys to point to the new live leaf record?
The one benefit I can think of doing this is that the audit table DDL will always be in synch with the live table - often with the 2 table strategy changes are made to the live, and the audit and trigger DDL isn't updated accordingly.
Overall, I would still recommend keeping your audit table separate.
If the requirement is that the signed data not be changed, then you should move it to another table. In fact, I might suggest moving it to another database/schema, where the only operation allowed on the table is inserting and reading records. You can use both permissions and triggers, if you really want to prevent updates.
You don't want to mess around with regulatory requirements. A complex schema that uses a combination of primary key with version, along with triggers, is a sign that there might be a simpler way.
The historical records can affect performance of the current records. If you end up in a situation where every record has changed 100 times, then keeping them in the same table is just going to slow down queries. Of course, you can embark on more complexity, in the form of partitioning the data. In the end, the solution is simpler: keep the data that cannot be changed in another table where it cannot be changed. You don't want to have to upgrade the hardware just because lots of history has accumulated.
I would also suggest including an effective and end date in the history records. This will allow you to reconstruct all the data as of a particular date, something that users might find useful in the future.
That's right. Audit trails can stay in an application for internal reporting/audit but infosec best practice mandates getting audit logs off the system where they are generated into your log management / SIEM solution.

Is this INSERT likely to cause any locking/concurrency issues?

In an effort to avoid auto sequence numbers and the like for one reason or another in this particular database, I wondered if anyone could see any problems with this:
INSERT INTO user (label, username, password, user_id)
SELECT 'Test', 'test', 'test', COALESCE(MAX(user_id)+1, 1) FROM user;
I'm using PostgreSQL (but also trying to be as database agnostic as possible)..
EDIT:
There's two reasons for me wanting to do this.
Keeping dependency on any particular RDBMS low.
Not having to worry about updating sequences if the data is batch-updated to a central database.
Insert performance is not an issue as the only tables where this will be needed are set-up tables.
EDIT-2:
The idea I'm playing with is that each table in the database have a human-generated SiteCode as part of their key, so we always have a compound key. This effectively partitions the data on SiteCode and would allow taking the data from a particular site and putting it somewhere else (obviously on the same database structure). For instance, this would allow backing up of various operational sites onto one central database, but also allow that central database to have operational sites using it.
I could still use sequences, but it seems messy. The actual INSERT would look more like this:
INSERT INTO user (sitecode, label, username, password, user_id)
SELECT 'SITE001', 'Test', 'test', 'test', COALESCE(MAX(user_id)+1, 1)
FROM user
WHERE sitecode='SITE001';
If that makes sense..
I've done something similar before and it worked fine, however the central database in that case was never operational (it was more of a way of centrally viewing data / analyzing) so it did not need to generate ids.
EDIT-3:
I'm starting to think it'd be simpler to only ever allow the centralised database to be either active-only or backup-only, thus avoiding the problem completely and allowing a more simple design.
Oh well back to the drawing board!
There are a couple of points:
Postgres uses Multi-Version Concurrency Control (MVCC) so Readers are never waiting on writers and vice versa. But there is of course a serialization that happens upon each write. If you are going to load a bulk of data into the system, then look at the COPY command. It is much faster than running a large swab of INSERT statements.
The MAX(user_id) can be answered with an index, and probably is, if there is an index on the user_id column. But the real problem is that if two transactions start at the same time, they will see the same MAX(user_id) value. It leads me to the next point:
The canonical way of handling numbers like user_id's is by using SEQUENCE's. These essentially are a place where you can draw the next user id from. If you are really worried about performance on generating the next sequence number, you can generate a batch of them per thread and then only request a new batch when it is exhausted (sometimes called a HiLo sequence).
You may be wanting to have user_id's packed up nice and tight as increasing numbers, but I think you should try to get rid of that. The reason is that deleting a user_id will create a hole anyway. So i'd not worry too much if the sequences were not strictly increasing.
Yes, I can see a huge problem. Don't do it.
Multiple connections can get the EXACT SAME id at the same time. I was going to add "under load" but it doesn't even need to be - just need the right timing between two queries.
To avoid it, you can use transactions or locking mechanisms or isolation levels specific to each DB, but once we get to that stage, you might as well use the dbms-specific sequence/identity/autonumber etc.
EDIT
For question edit2, there is no reason to fear gaps in the user_id, so you have one sequence across all sites. If gaps are ok, some options are
use guaranteed update statements, such as (in SQL Server)
update tblsitesequenceno set #nextnum = nextnum = nextnum + 1
Multiple callers to this statement are each guaranteed to get a unique number.
use a single table that produces the identity/sequence/autonumber (db specific)
If you cannot have gaps at all, consider using a transaction mechanism that will restrict access while you are running the max() query. Either that or use a proliferation of (sequences/tables with identity columns/tables with autonumber) that you manipulate using dynamic SQL using the same technique for a single sequence.
By all means use a sequence to generate unique numbers. They are fast, transaction safe and reliable.
Any self-written implemention of a "sequence generator" is either not scalable for a multi-user environment (because you need to do heavy locking) or simply not correct.
If you do need to be DBMS independent, then create an abstraction layer that uses sequences for those DBMS that support them (Posgres, Oracle, Firebird, DB2, Ingres, Informix, ...) and a self written generator on those that don't.
Trying to create a system than is DBMS independent, simply means it will run equally slow on all systems if you don't exploit the advantages of each DBMS.
Your goal is a good one. Avoiding IDENTITY and AUTOINCREMENT columns means avoiding a whole plethora of administration problems. Here is just one example of the many.
However most responders at SO will not appreciate it, the popular (as opposed to technical) response is "always stick an Id AUTOINCREMENT column on everything that moves".
A next-sequential number is fine, all vendors have optimised it.
As long as this code is inside a Transaction, as it should be, two users will not get the same MAX()+1 value. There is a concept called Isolation Level which needs to be understood when coding Transactions.
Getting away from user_id and onto a more meaningful key such as ShortName or State plus UserNo is even better (the former spreads the contention, latter avoids the next-sequential contention altogether, relevant for high volume systems).
What MVCC promises, and what it actually delivers, are two different things. Just surf the net or search SO to view the hundreds of problems re PostcreSQL/MVCC. In the realm of computers, the laws of physics applies, nothing is free. MVCC stores private copies of all rows touched, and resolves collisions at the end of the Transaction, resulting in far more Rollbacks. Whereas 2PL blocks at the beginning of the Transaction, and waits, without the massive storage of copies.
most people with actual experience of MVCC do not recommend it for high contention, high volume systems.
The first example code block is fine.
As per Comments, this item no longer applies: The second example code block has an issue. "SITE001" is not a compound key, it is a compounded column. Do not do that, separate "SITE" and "001" into two discrete columns. And if "SITE" is a fixed, repeatingvalue, it can be eliminated.
Different users can have the same user_id, concurrent SELECT-statements will see the same MAX(user_id).
If you don't want to use a SEQUENCE, you have to use an extra table with a single record and update this single record every time you need a new unique id:
CREATE TABLE my_sequence(id INT);
BEGIN;
UPDATE my_sequence SET id = COALESCE(id, 0) + 1;
INSERT INTO
user (label, username, password, user_id)
SELECT 'Test', 'test', 'test', id FROM my_sequence;
COMMIT;
I agree with maksymko, but not because I dislike sequences or autoincrementing numbers, as they have their place. If you need a value to be unique throughout your "various operational sites" i.e. not only within the confines of the single database instance, a globally unique identifier is a robust, simple solution.

Physical vs. logical (hard vs. soft) delete of database record? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed last year.
Improve this question
What is the advantage of doing a logical/soft delete of a record (i.e. setting a flag stating that the record is deleted) as opposed to actually or physically deleting the record?
Is this common practice?
Is this secure?
Advantages are that you keep the history (good for auditing) and you don't have to worry about cascading a delete through various other tables in the database that reference the row you are deleting. Disadvantage is that you have to code any reporting/display methods to take the flag into account.
As far as if it is a common practice - I would say yes, but as with anything whether you use it depends on your business needs.
EDIT: Thought of another disadvantange - If you have unique indexes on the table, deleted records will still take up the "one" record, so you have to code around that possibility too (for example, a User table that has a unique index on username; A deleted record would still block the deleted users username for new records. Working around this you could tack on a GUID to the deleted username column, but it's a very hacky workaround that I wouldn't recommend. Probably in that circumstance it would be better to just have a rule that once a username is used, it can never be replaced.)
Are logical deletes common practice? Yes I have seen this in many places. Are they secure? That really depends are they any less secure then the data was before you deleted it?
When I was a Tech Lead, I demanded that our team keep every piece of data, I knew at the time that we would be using all that data to build various BI applications, although at the time we didn't know what the requirements would be. While this was good from the standpoint of auditing, troubleshooting, and reporting (This was an e-commerce / tools site for B2B transactions, and if someone used a tool, we wanted to record it even if their account was later turned off), it did have several downsides.
The downsides include (not including others already mentioned):
Performance Implications of keeping all that data, We to develop various archiving strategies. For example one area of the application was getting close to generating around 1Gb of data a week.
Cost of keeping the data does grow over time, while disk space is cheap, the amount of infrastructure to keep and manage terabytes of data both online and off line is a lot. It takes a lot of disk for redundancy, and people's time to ensure backups are moving swiftly etc.
When deciding to use logical, physical deletes, or archiving I would ask myself these questions:
Is this data that might need to be re-inserted into the table. For example User Accounts fit this category as you might activate or deactivate a user account. If this is the case a logical delete makes the most sense.
Is there any intrinsic value in storing the data? If so how much data will be generated. Depending on this I would either go with a logical delete, or implement an archiving strategy. Keep in mind you can always archive logically deleted records.
It might be a little late but I suggest everyone to check Pinal Dave's blog post about logical/soft delete:
I just do not like this kind of design [soft delete] at all. I am firm believer of the architecture where only necessary data should be in single table and the useless data should be moved to an archived table. Instead of following the isDeleted column, I suggest the usage of two different tables: one with orders and another with deleted orders. In that case, you will have to maintain both the table, but in reality, it is very easy to maintain. When you write UPDATE statement to the isDeleted column, write INSERT INTO another table and DELETE it from original table. If the situation is of rollback, write another INSERT INTO and DELETE in reverse order. If you are worried about a failed transaction, wrap this code in TRANSACTION.
What are the advantages of the smaller table verses larger table in above described situations?
A smaller table is easy to maintain
Index Rebuild operations are much faster
Moving the archive data to another filegroup will reduce the load of primary filegroup (considering that all filegroups are on different system) – this will also speed up the backup as well.
Statistics will be frequently updated due to smaller size and this will be less resource intensive.
Size of the index will be smaller
Performance of the table will improve with a smaller table size.
I'm a NoSQL developer, and on my last job, I worked with data that was always critical for someone, and if it was deleted by accident in the same day that was created, I were not able to find it in the last backup from yesterday! In that situation, soft deletion always saved the day.
I did soft-deletion using timestamps, registering the date the document was deleted:
IsDeleted = 20150310 //yyyyMMdd
Every Sunday, a process walked on the database and checked the IsDeleted field. If the difference between the current date and the timestamp was greater than N days, the document was hard deleted. Considering the document still be available on some backup, it was safe to do it.
EDIT: This NoSQL use case is about big documents created in the database, tens or hundreds of them every day, but not thousands or millions. By general, they were documents with the status, data and attachments of workflow processes. That was the reason why there was the possibility of a user deletes an important document. This user could be someone with Admin privileges, or maybe the document's owner, just to name a few.
TL;DR My use case was not Big Data. In that case, you will need a different approach.
One pattern I have used is to create a mirror table and attach a trigger on the primary table, so all deletes (and updates if desired) are recorded in the mirror table.
This allows you to "reconstruct" deleted/changed records, and you can still hard delete in the primary table and keep it "clean" - it also allows the creation of an "undo" function, and you can also record the date, time, and user who did the action in the mirror table (invaluable in witch hunt situations).
The other advantage is there is no chance of accidentally including deleted records when querying off the primary unless you deliberately go to the trouble of including records from the mirror table (you may want to show live and deleted records).
Another advantage is that the mirror table can be independently purged, as it should not have any actual foreign key references, making this a relatively simple operation in comparison to purging from a primary table that uses soft deletes but still has referential connections to other tables.
What other advantages? - great if you have a bunch of coders working on the project, doing reads on the database with mixed skill and attention to detail levels, you don't have to stay up nights hoping that one of them didn’t forget to not include deleted records (lol, Not Include Deleted Records = True), which results in things like overstating say the clients available cash position which they then go buy some shares with (i.e., as in a trading system), when you work with trading systems, you will find out very quickly the value of robust solutions, even though they may have a little bit more initial "overhead".
Exceptions:
- as a guide, use soft deletes for "reference" data such as user, category, etc, and hard deletes to a mirror table for "fact" type data, i.e., transaction history.
I used to do soft-delete, just to keep old records. I realized that users don't bother to view old records as often as I thought. If users want to view old records, they can just view from archive or audit table, right? So, what's the advantage of soft-delete? It only leads to more complex query statement, etc.
Following are the things i've implemented, before I decided to not-soft-delete anymore:
implement audit, to record all activities (add,edit,delete). Ensure that there's no foreign key linked to audit, and ensure this table is secured and nobody can delete except administrators.
identify which tables are considered "transactional table", which very likely that it will be kept for long time, and very likely user may want to view the past records or reports. For example; purchase transaction. This table should not just keep the id of master table (such as dept-id), but also keep the additional info such as the name as reference (such as dept-name), or any other necessary fields for reporting.
Implement "active/inactive" or "enable/disable" or "hide/show" record of master table. So, instead of deleting record, the user can disable/inactive the master record. It is much safer this way.
Just my two cents opinion.
I'm a big fan of the logical delete, especially for a Line of Business application, or in the context of user accounts. My reasons are simple: often times I don't want a user to be able to use the system anymore (so the account get's marked as deleted), but if we deleted the user, we'd lose all their work and such.
Another common scenario is that the users might get re-created a while after having been delete. It's a much nicer experience for the user to have all their data present as it was before they were deleted, rather than have to re-create it.
I usually think of deleting users more as "suspending" them indefinitely. You never know when they'll legitimately need to be back.
I commonly use logical deletions - I find they work well when you also intermittently archive off the 'deleted' data to an archived table (which can be searched if needed) thus having no chance of affecting the performance of the application.
It works well because you still have the data if you're ever audited. If you delete it physically, it's gone!
I almost always soft delete and here's why:
you can restore deleted data if a customer asks you to do so. More happy customers with soft deletes. Restoring specific data from backups is complex
checking for isdeleted everywhere is not an issue, you have to check for userid anyway (if the database contains data from multiple users). You can enforce the check by code, by placing those two checks on a separate function (or use views)
graceful delete. Users or processes dealing with deleted content will continue to "see" it until they hit the next refresh. This is a very desirable feature if a process is processing some data which is suddenly deleted
synchronization: if you need to design a synchronization mechanism between a database and mobile apps, you'll find soft deletes much easier to implement
Re: "Is this secure?" - that depends on what you mean.
If you mean that by doing physical delete, you'll prevent anyone from ever finding the deleted data, then yes, that's more or less true; you're safer in physically deleting the sensitive data that needs to be erased, because that means it's permanently gone from the database. (However, realize that there may be other copies of the data in question, such as in a backup, or the transaction log, or a recorded version from in transit, e.g. a packet sniffer - just because you delete from your database doesn't guarantee it wasn't saved somewhere else.)
If you mean that by doing logical delete, your data is more secure because you'll never lose any data, that's also true. This is good for audit scenarios; I tend to design this way because it admits the basic fact that once data is generated, it'll never really go away (especially if it ever had the capability of being, say, cached by an internet search engine). Of course, a real audit scenario requires that not only are deletes logical, but that updates are also logged, along with the time of the change and the actor who made the change.
If you mean that the data won't fall into the hands of anyone who isn't supposed to see it, then that's totally up to your application and its security structure. In that respect, logical delete is no more or less secure than anything else in your database.
Logical deletions if are hard on referential integrity.
It is the right think to do when there is a temporal aspect of the table data (are valid FROM_DATE - TO_DATE).
Otherwise move the data to an Auditing Table and delete the record.
On the plus side:
It is the easier way to rollback (if at all possible).
It is easy to see what was the state at a specific point in time.
I strongly disagree with logical delete because you are exposed to many errors.
First of all queries, each query must take care the IsDeleted field and the possibility of error becomes higher with complex queries.
Second the performance: imagine a table with 100000 recs with only 3 active, now multiply this number for the tables of your database; another performance problem is a possible conflict with new records with old (deleted records).
The only advantage I see is the history of records, but there are other methods to achieve this result, for example you can create a logging table where you can save info: TableName,OldValues,NewValues,Date,User,[..] where *Values ​​can be varchar and write the details in this form fieldname : value; [..] or store the info as xml.
All this can be achieved via code or Triggers but you are only ONE table with all your history.
Another options is to see if the specified database engine are native support for tracking change, for example on SQL Server database there are SQL Track Data Change.
It's fairly standard in cases where you'd like to keep a history of something (e.g. user accounts as #Jon Dewees mentions). And it's certainly a great idea if there's a strong chance of users asking for un-deletions.
If you're concerned about the logic of filtering out the deleted records from your queries getting messy and just complicating your queries, you can just build views that do the filtering for you and use queries against that. It'll prevent leakage of these records in reporting solutions and such.
There are requirements beyond system design which need to be answered. What is the legal or statutory requirement in the record retention? Depending on what the rows are related to, there may be a legal requirement that the data be kept for a certain period of time after it is 'suspended'.
On the other hand, the requirement may be that once the record is 'deleted', it is truly and irrevocably deleted. Before you make a decision, talk to your stakeholders.
Mobile apps that depend on synchronisation might impose the use of logical rather than physical delete: a server must be able to indicate to the client that a record has been (marked as) deleted, and this might not be possible if records were physically deleted.
I just wanted to expand on the mentioned unique constraint problem.
Suppose I have a table with two columns: id and my_column. To support soft-deletes I need to update my table definition to this:
create table mytable (
id serial primary key,
my_column varchar unique not null,
deleted_at datetime
)
But if a row is soft-deleted, I want my_column constraint to be ignored, because deleted data should not interfere with non-deleted data. My original model will not work.
I would need to update my data definition to this:
create table mytable (
id serial primary key,
my_column varchar not null,
my_column_repetitions integer not null default 0,
deleted_at datetime,
unique (my_column, my_column_repetitions),
check (deleted_at is not null and my_column_repetitions > 0 or deleted_at is null and my_column_repetitions = 0)
)
And apply this logic: when a row is current, i.e. not deleted, my_column_repetitions should hold the default value 0 and when the row is soft-deleted its my_column_repetitions needs to be updated to (max. number of repetitions on soft-deleted rows) + 1.
The latter logic must be implemented programmatically with a trigger or handled in my application code and there is no check that I could set.
Repeat this is for every unique column!
I think this solution is really hacky and would favor a separate archive table to store deleted rows.
They don't let the database perform as it should rendering such things as the cascade functionality useless.
For simple things such as inserts, in the case of re-inserting, then the code behind it doubles.
You can't just simply insert, instead you have to check for an existence and insert if it doesn't exist before or update the deletion flag if it does whilst also updating all other columns to the new values. This is seen as an update to the database transaction log and not a fresh insert causing inaccurate audit logs.
They cause performance issues because tables are getting glogged with redundant data. It plays havock with indexing especially with uniqueness.
I'm not a big fan of logical deletes.
To reply to Tohid's comment, we faced same problem where we wanted to persist history of records and also we were not sure whether we wanted is_deleted column or not.
I am talking about our python implementation and a similar use-case we hit.
We encountered https://github.com/kvesteri/sqlalchemy-continuum which is an easy way to get versioning table for your corresponding table. Minimum lines of code and captures history for add, delete and update.
This serves more than just is_deleted column. You can always backref version table to check what happened with this entry. Whether entry got deleted, updated or added.
This way we didn't need to have is_deleted column at all and our delete function was pretty trivial. This way we also don't need to remember to mark is_deleted=False in any of our api's.
Soft Delete is a programming practice that being followed in most of the application when data is more relevant. Consider a case of financial application where a delete by the mistake of the end user can be fatal.
That is the case when soft delete becomes relevant. In soft delete the user is not actually deleting the data from the record instead its being flagged as IsDeleted to true (By normal convention).
In EF 6.x or EF 7 onward Softdelete is Added as an attribute but we have to create a custom attribute for the time being now.
I strongly recommend SoftDelete In a database design and its a good convention for the programming practice.
Most of time softdeleting is used because you don't want to expose some data but you have to keep it for historical reasons (A product could become discontinued, so you don't want any new transaction with it but you still need to work with the history of sale transaction). By the way, some are copying the product information value in the sale transaction data instead of making a reference to the product to handle this.
In fact it looks more like a rewording for a visible/hidden or active/inactive feature. Because that's the meaning of "delete" in business world. I'd like to say that Terminators may delete people but boss just fire them.
This practice is pretty common pattern and used by a lot of application for a lot of reasons. As It's not the only way to achieve this, so you will have thousand of people saying that's great or bullshit and both have pretty good arguments.
From a point of view of security, SoftDelete won't replace the job of Audit and it won't replace the job of backup too. If you are afraid of "the insert/delete between two backup case", you should read about Full or Bulk recovery Models. I admit that SoftDelete could make the recovery process more trivial.
Up to you to know your requirement.
To give an alternative, we have users using remote devices updating via MobiLink. If we delete records in the server database, those records never get marked deleted in the client databases.
So we do both. We work with our clients to determine how long they wish to be able to recover data. For example, generally customers and products are active until our client say they should be deleted, but history of sales is only retained for 13 months and then deletes automatically. The client may want to keep deleted customers and products for two months but retain history for six months.
So we run a script overnight that marks things logically deleted according to these parameters and then two/six months later, anything marked logically deleted today will be hard deleted.
We're less about data security than about having enormous databases on a client device with limited memory, such as a smartphone. A client who orders 200 products twice a week for four years will have over 81,000 lines of history, of which 75% the client doesn't care if he sees.
It all depends on the use case of the system and its data.
For example, if you are talking about a government regulated system (e.g. a system at a pharmaceutical company that is considered a part of the quality system and must follow FDA guidelines for electronic records), then you darned well better not do hard deletes! An auditor from the FDA can come in and ask for all records in the system relating to product number ABC-123, and all data better be available. If your business process owner says the system shouldn't allow anyone to use product number ABC-123 on new records going forward, use the soft-delete method instead to make it "inactive" within the system, while still preserving historical data.
However, maybe your system and its data has a use case such as "tracking the weather at the North Pole". Maybe you take temperature readings once every hour, and at the end of the day aggregate a daily average. Maybe the hourly data will no longer ever be used after aggregation, and you'd hard-delete the hourly readings after creating the aggregate. (This is a made-up, trivial example.)
The point is, it all depends on the use case of the system and its data, and not a decision to be made purely from a technological standpoint.
Well! As everyone said, it depends on the situation.
If you have an index on a column like UserName or EmailID - and you never expect the same UserName or EmailID to be used again; you can go with a soft delete.
That said, always check if your SELECT operation uses the primary key. If your SELECT statement uses a primary key, adding a flag with the WHERE clause wouldn't make much difference. Let's take an example (Pseudo):
Table Users (UserID [primary key], EmailID, IsDeleted)
SELECT * FROM Users where UserID = 123456 and IsDeleted = 0
This query won't make any difference in terms of performance since the UserID column has a primary key. Initially, it will scan the table based on PK and then execute the next condition.
Cases where soft deletes cannot work at all:
Sign-up in majorly all websites take EmailID as your unique identification. We know very well, once an EmailID is used on a website like facebook, G+, it cannot be used by anyone else.
There comes a day when the user wants to delete his/her profile from the website. Now, if you make a logical delete, that user won't be able to register ever again. Also, registering again using the same EmailID wouldn't mean to restore the entire history. Everyone knows, deletion means deletion. In such scenarios, we have to make a physical delete. But in order to maintain the entire history of the account, we should always archive such records in either archive tables or deleted tables.
Yes, in situations where we have lots of foreign tables, handling is quite cumbersome.
Also keep in mind that soft/logical deletes will increase your table size, so the index size.
I have already answered in another post.
However, I think my answer more fit to the question here.
My practical solution for soft-delete is archiving by creating a new
table with following columns: original_id, table_name, payload,
(and an optional primary key `id).
Where original_id is the original id of deleted record, table_name
is the table name of the deleted record ("user" in your case),
payload is JSON-stringified string from all columns of the deleted
record.
I also suggest making an index on the column original_id for latter
data retrievement.
By this way of archiving data. You will have these advantages
Keep track of all data in history
Have only one place to archive records from any table, regardless of the deleted record's table structure
No worry of unique index in the original table
No worry of checking foreign index in the original table
No more WHERE clause in every query to check for deletion
The is already a discussion
here explaining why
soft-deletion is not a good idea in practice. Soft-delete introduces
some potential troubles in future such as counting records, ...
It depends on the case, consider the below:
Usually, you don't need to "soft-delete" a record.
Keep it simple and fast.
e.g. Deleting a product no longer available, so you don't have to check the product isn't soft-deleted all over your app (count, product list, recommended products, etc.).
Yet, you might consider the "soft-delete" in a data warehouse model. e.g. You are viewing an old receipt on a deleted product.*
Advantages are data preservation/perpetuation. A disadvantage would be a decrease in performance when querying or retrieving data from tables with significant number of soft deletes.
In our case we use a combination of both: as others have mentioned in previous answers, we soft-delete users/clients/customers for example, and hard-delete on items/products/merchandise tables where there are duplicated records that don't need to be kept.