Some sort of “different auto-increment indexes” per a primary key values - sql

I have got a table which has an id (primary key with auto increment), uid (key refering to users id for example) and something else which for my question won’t matter.
I want to make, lets call it, different auto-increment keys on id for each uid entry.
So, I will add an entry with uid 10, and the id field for this entry will have a 1 because there were no previous entries with a value of 10 in uid. I will add a new one with uid 4 and its id will be 3 because I there were already two entried with uid 4.
...Very obvious explanation, but I am trying to be as explainative an clear as I can to demonstrate the idea... clearly.
What SQL engine can provide such a functionality natively? (non Microsoft/Oracle based)
If there is none, how could I best replicate it? Triggers perhaps?
Does this functionality have a more suitable name?
In case you know about a non SQL database engine providing such a functioality, name it anyway, I am curious.
Thanks.

MySQL's MyISAM engine can do this. See their manual, in section Using AUTO_INCREMENT:
For MyISAM tables you can specify AUTO_INCREMENT on a secondary column in a multiple-column index. In this case, the generated value for the AUTO_INCREMENT column is calculated as MAX(auto_increment_column) + 1 WHERE prefix=given-prefix. This is useful when you want to put data into ordered groups.
The docs go on after that paragraph, showing an example.
The InnoDB engine in MySQL does not support this feature, which is unfortunate because it's better to use InnoDB in almost all cases.
You can't emulate this behavior using triggers (or any SQL statements limited to transaction scope) without locking tables on INSERT. Consider this sequence of actions:
Mario starts transaction and inserts a new row for user 4.
Bill starts transaction and inserts a new row for user 4.
Mario's session fires a trigger to computes MAX(id)+1 for user 4. You get 3.
Bill's session fires a trigger to compute MAX(id). I get 3.
Bill's session finishes his INSERT and commits.
Mario's session tries to finish his INSERT, but the row with (userid=4, id=3) now exists, so Mario gets a primary key conflict.
In general, you can't control the order of execution of these steps without some kind of synchronization.
The solutions to this are either:
Get an exclusive table lock. Before trying an INSERT, lock the table. This is necessary to prevent concurrent INSERTs from creating a race condition like in the example above. It's necessary to lock the whole table, since you're trying to restrict INSERT there's no specific row to lock (if you were trying to govern access to a given row with UPDATE, you could lock just the specific row). But locking the table causes access to the table to become serial, which limits your throughput.
Do it outside transaction scope. Generate the id number in a way that won't be hidden from two concurrent transactions. By the way, this is what AUTO_INCREMENT does. Two concurrent sessions will each get a unique id value, regardless of their order of execution or order of commit. But tracking the last generated id per userid requires access to the database, or a duplicate data store. For example, a memcached key per userid, which can be incremented atomically.
It's relatively easy to ensure that inserts get unique values. But it's hard to ensure they will get consecutive ordinal values. Also consider:
What happens if you INSERT in a transaction but then roll back? You've allocated id value 3 in that transaction, and then I allocated value 4, so if you roll back and I commit, now there's a gap.
What happens if an INSERT fails because of other constraints on the table (e.g. another column is NOT NULL)? You could get gaps this way too.
If you ever DELETE a row, do you need to renumber all the following rows for the same userid? What does that do to your memcached entries if you use that solution?

SQL Server should allow you to do this. If you can't implement this using a computed column (probably not - there are some restrictions), surely you can implement it in a trigger.
MySQL also would allow you to implement this via triggers.

In a comment you ask the question about efficiency. Unless you are dealing with extreme volumes, storing an 8 byte DATETIME isn't much of an overhead compared to using, for example, a 4 byte INT.
It also massively simplifies your data inserts, as well as being able to cope with records being deleted without creating 'holes' in your sequence.
If you DO need this, be careful with the field names. If you have uid and id in a table, I'd expect id to be unique in that table, and uid to refer to something else. Perhaps, instead, use the field names property_id and amendment_id.
In terms of implementation, there are generally two options.
1). A trigger
Implementations vary, but the logic remains the same. As you don't specify an RDBMS (other than NOT MS/Oracle) the general logic is simple...
Start a transaction (often this is Implicitly already started inside triggers)
Find the MAX(amendment_id) for the property_id being inserted
Update the newly inserted value with MAX(amendment_id) + 1
Commit the transaction
Things to be aware of are...
- multiple records being inserted at the same time
- records being inserted with amendment_id being already populated
- updates altering existing records
2). A Stored Procedure
If you use a stored procedure to control writes to the table, you gain a lot more control.
Implicitly, you know you're only dealing with one record.
You simply don't provide a parameter for DEFAULT fields.
You know what updates / deletes can and can't happen.
You can implement all the business logic you like without hidden triggers
I personally recommend the Stored Procedure route, but triggers do work.

It is important to get your data types right.
What you are describing is a multi-part key. So use a multi-part key. Don't try to encode everything into a magic integer, you will poison the rest of your code.
If a record is identified by (entity_id,version_number) then embrace that description and use it directly instead of mangling the meaning of your keys. You will have to write queries which constrain the version number but that's OK. Databases are good at this sort of thing.
version_number could be a timestamp, as a_horse_with_no_name suggests. This is quite a good idea. There is no meaningful performance disadvantage to using timestamps instead of plain integers. What you gain is meaning, which is more important.
You could maintain a "latest version" table which contains, for each entity_id, only the record with the most-recent version_number. This will be more work for you, so only do it if you really need the performance.

Related

DB schema for updating downstream sources?

I want a table to be sync-able by a web API.
For example,
GET /projects?sequence_latest=2113&limit=10
[{"state":"updated", "id":12,"sequence":2116},
{"state":"deleted" "id":511,"sequence":2115}
{"state":"created", "id":601,"sequence":2114}]
What is a good schema to achieve this?
I intend this for Postgresql with Django ORM, which uses surrogate keys. Presence of an ORM may kill answers like unions.
I can come up with only half-solutions.
I could have a modified_time column, but we cannot convey deletions.
I could have a table for storing deleted IDs, when returning 10 new/updated rows, I could return all the deleted rows between them. But this works only when the latest change is an insert/update and there are a moderate number of deleted rows.
I could set a deleted flag on the row and null the rest, but its kinda bad schema design to set all columns nullable.
I could have another table that stores ID, modification sequence number and state(new, updated, deleted), but its another table to maintain and setting sequence numbers cause contentions; imagine n concurrent requests querying for latest ID.
If you're using an ORM you want simple(ish) and if you're serving the data via an API you want quick.
To go through your suggested options:
Correct, so this doesn't help you. You could have a deleted flag in your main table though.
This seems quite a random way of doing it and breaks your insistence that there be no UNION queries.
Not sure why you would need to NULL the rest of the column here? What benefit does this bring?
I would strongly advise against having a table that has a modification sequence number. Either this means that you're performing a lot of analytic queries in order to find out the most recent state or you're updating the same rows multiple times and maintaining a table with the same PK as your normal one. At that point you might as well have a deleted flag in your main table.
Essentially the design of your API gives you one easy option; you should have everything in the same table because all data is being returned through the same method. I would follow your point 2 and Wolph's suggestion, have a deleted_on column in your table; making it look like:
create table my_table (
id ... primary key
, <other_columns>
, created_on date
, modified_on date
, deleted_on date
);
I wouldn't even bother updating all the other columns to be NULL. If you want to ensure that you return no data create a view on top of your table that nulls data where the deleted_on column has data in it. Then, your API only accesses the table through the view.
If you are really, really worried about space and the volume of records and will perform regular database maintenance to ensure that both are controlled then maybe go with option 4. Create a second table that has the state of each ID in your main table and actually delete the data from your main table. You then can do a LEFT OUTER JOIN to the main table to get the data. When there is no data that ID has been deleted. Honestly, this is overkill until you know whether you will definitely require it.
You don't mention why you're using an web API for data-transfers; but, if you're going to be transferring a lot of data or using this for internal systems only it might be worth using a lower-level transfer mechanism.

SQL unique field: concurrency bugs? [duplicate]

This question already has answers here:
Only inserting a row if it's not already there
(7 answers)
Closed 9 years ago.
I have a DB table with a field that must be unique. Let's say the table is called "Table1" and the unique field is called "Field1".
I plan on implementing this by performing a SELECT to see if any Table1 records exist where Field1 = #valueForField1, and only updating or inserting if no such records exist.
The problem is, how do I know there isn't a race condition here? If two users both click Save on the form that writes to Table1 (at almost the exact same time), and they have identical values for Field1, isn't it possible that the following would happen?
User1 makes a SQL call, which performs the select operation and determines there are no existing records where Field1 = #valueForField1. User1's process is preempted by User2's process, which also finds no records where Field1 = #valueForField1, and performs an insert. User1's process is allowed to run again, and inserts a second record where Field1 = #valueForField1, violating the requirement that Field1 be unique.
How can I prevent this? I'm told that transactions are atomic, but then why do we need table locks too? I've never used a lock before and I don't know whether or not I need one in this case. What happens if a process tries to write to a locked table? Will it block and try again?
I'm using MS SQL 2008R2.
Add a unique constraint on the field. That way you won't have to SELECT. You will only have to insert. The first user will succeed the second will fail.
On top of that you may make the field autoincremented, so you won't have to care on filling it, or you may add a default value, again not caring on filling it.
Some options would be an autoincremented INT field, or a unique identifier.
You can add a add a unique constraint. Example from http://www.w3schools.com/sql/sql_unique.asp:
CREATE TABLE Persons
(
P_Id int NOT NULL UNIQUE
)
EDIT: Please also read Martin Smith's comment below.
jyparask has a good answer on how you can tackle this specific problem. However, I would like to elaborate on your confusion over locks, transactions, blocking, and retries. For the sake of simplicity, I'm going to assume transaction isolation level serializable.
Transactions are atomic. The database guarantees that if you have two transactions, then all operations in one transaction occur completely before the next one starts, no matter what kind of race conditions there are. Even if two users access the same row at the same time (multiple cores), there is no chance of a race condition, because the database will ensure that one of them will fail.
How does the database do this? With locks. When you select a row, SQL Server will lock the row, so that all other clients will block when requesting that row. Block means that their query is paused until that row is unlocked.
The database actually has a couple of things it can lock. It can lock the row, or the table, or somewhere in between. The database decides what it thinks is best, and it's usually pretty good at it.
There is never any retrying. The database will never retry a query for you. You need to explicitly tell it to retry a query. The reason is because the correct behavior is hard to define. Should a query retry with the exact same parameters? Or should something be modified? Is it still safe to retry the query? It's much safer for the database to simply throw an exception and let you handle it.
Let's address your example. Assuming you use transactions correctly and do the right query (Martin Smith linked to a few good solutions), then the database will create the right locks so that the race condition disappears. One user will succeed, and the other will fail. In this case, there is no blocking, and no retrying.
In the general case with transactions, however, there will be blocking, and you get to implement the retrying.

Postgresql wrong auto-increment for serial

I have a problem on postgresql which I think there is a bug in the postgresql, I wrongly implement something.
There is a table including colmn1(primary key), colmn2(unique), colmn3, ...
After an insertion of a row, if I try another insertion with an existing colmn2 value I am getting a duplicate value error as I expected. But after this unsuccesful try, colmn1's next value is
incremented by 1 although there is no insertion so i am getting rows with id sequences like , 1,2,4,6,9.(3,5,6,7,8 goes for unsuccessful trials).
I need help from the ones who can explain this weird behaviour.
This information may be useful: I used "create unique index on tableName (lower(column1)) " query to set unique constraint.
See the PostgreSQL sequence FAQ:
Sequences are intended for generating unique identifiers — not
necessarily identifiers that are strictly sequential. If two
concurrent database clients both attempt to get a value from a
sequence (using nextval()), each client will get a different sequence
value. If one of those clients subsequently aborts their transaction,
the sequence value that was generated for that client will be unused,
creating a gap in the sequence.
This can't easily be fixed without incurring a significant performance
penalty. For more information, see Elein Mustein's "Gapless Sequences for Primary Keys" in the General Bits Newsletter.
From the manual:
Important: Because sequences are non-transactional, changes made by
setval are not undone if the transaction rolls back.
In other words, it's normal to have gaps. If you don't want gaps, don't use a sequence.

Is it safe to use ROWID to locate a Row/Record in Oracle?

I'm looking at a client application which retrieves several columns including ROWID, and
later uses ROWID to identify rows it needs to update:
update some_table t set col1=value1
where t.rowid = :selected_rowid
Is it safe to do so? As the table is being modified, can ROWID of a row change?
"From Oracle 8 the ROWID format and size changed from 8 to 10 bytes. Note that ROWID's will change when you reorganize or export/import a table. In case of a partitioned table, it also changes if the row migrates from a partition to another one during an UPDATE."
http://www.orafaq.com/wiki/ROWID
I'd say no. This could be safe if for instance the application stores ROWID temporarily(say generating a list of select-able items, each identified with ROWID, but the list is routinely regenerated and not stored). But if ROWID is used in any persistent way it's not safe.
Assuming that you are using the ROWID a short period of time after you SELECT it, that the table is a standard heap-organized table, and that the DBA isn't doing something to the table (which is a reasonably safe assumption if the application is online), the ROWID will be stable. It would be preferable to use the primary key but when the primary key isn't available, plenty of Oracle-developed tools and frameworks will use the ROWID for short periods of time. It would not be safe if you intended to use the ROWID a long period of time after you SELECT it-- for example, if you allow users to edit data locally and then synchronize with the master database some arbitrary length of time later.
The ROWID is just a physical location of a row so anything that causes that location to change will change the ROWID.
If you are using index-organized tables or partitioned tables, updates to the row can change where the row is physically located which will change the ROWID.
If a row is deleted from a heap-organized table, a subsequent INSERT might insert data with completely different data that happens to use the same ROWID the deleted row previously had.
Various administrative tasks can cause the ROWID to change. Exporting and importing the table will change the ROWID for example, but so will doing something like the new-ish online shrink command. These administrative tasks will not normally be done while the application is up, however, and will almost certainly not be done during the day. But it could lead to problems if the application isn't shut down when a DBA does this sort of thing or if the application persists the data.
Over time, it has become more and more common for new features to introduce new possibilities for ROWIDs to change. Index-organized tables and the online shrink option, for example, are relatively new features. In the future, it is likely that there will be more features that will involve the potential at least for a ROWID to change.
Of course, if we're being pedantic, it's also not safe to rely on the primary key. It is perfectly possible that some other session comes along and updates the primary key of the row after you read it or that some other session deletes the row after you select it and inserts a new row with the same data and a different primary key. In either case, it helps to have some local knowledge about what the applications using the database are actually supposed to be doing. It would be extremely uncommon, for example, to allow updates to primary keys or to reuse primary keys so you can generally determine that it's safe to use a primary key. Similarly, it is relatively common to conclude that given the way you're using partitioning or given the way you've defined the index in your index-organized table that updates won't actually change the ROWID. If you know that the table is partitioned by the LOAD_DATE, for example, and that you never update the LOAD_DATE, you won't actually experience changes to the ROWID because of an update. If you know that the table is index-organized but that you're not updating a column that is part of that index, the ROWID won't change on an UPDATE.
I do not think it is safe to do so, in theory it will not change, that is of course until someone "accidentally" deletes something on the actual DB...
I would just use the PK makes a lot more sense.

Physical vs. logical (hard vs. soft) delete of database record? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed last year.
Improve this question
What is the advantage of doing a logical/soft delete of a record (i.e. setting a flag stating that the record is deleted) as opposed to actually or physically deleting the record?
Is this common practice?
Is this secure?
Advantages are that you keep the history (good for auditing) and you don't have to worry about cascading a delete through various other tables in the database that reference the row you are deleting. Disadvantage is that you have to code any reporting/display methods to take the flag into account.
As far as if it is a common practice - I would say yes, but as with anything whether you use it depends on your business needs.
EDIT: Thought of another disadvantange - If you have unique indexes on the table, deleted records will still take up the "one" record, so you have to code around that possibility too (for example, a User table that has a unique index on username; A deleted record would still block the deleted users username for new records. Working around this you could tack on a GUID to the deleted username column, but it's a very hacky workaround that I wouldn't recommend. Probably in that circumstance it would be better to just have a rule that once a username is used, it can never be replaced.)
Are logical deletes common practice? Yes I have seen this in many places. Are they secure? That really depends are they any less secure then the data was before you deleted it?
When I was a Tech Lead, I demanded that our team keep every piece of data, I knew at the time that we would be using all that data to build various BI applications, although at the time we didn't know what the requirements would be. While this was good from the standpoint of auditing, troubleshooting, and reporting (This was an e-commerce / tools site for B2B transactions, and if someone used a tool, we wanted to record it even if their account was later turned off), it did have several downsides.
The downsides include (not including others already mentioned):
Performance Implications of keeping all that data, We to develop various archiving strategies. For example one area of the application was getting close to generating around 1Gb of data a week.
Cost of keeping the data does grow over time, while disk space is cheap, the amount of infrastructure to keep and manage terabytes of data both online and off line is a lot. It takes a lot of disk for redundancy, and people's time to ensure backups are moving swiftly etc.
When deciding to use logical, physical deletes, or archiving I would ask myself these questions:
Is this data that might need to be re-inserted into the table. For example User Accounts fit this category as you might activate or deactivate a user account. If this is the case a logical delete makes the most sense.
Is there any intrinsic value in storing the data? If so how much data will be generated. Depending on this I would either go with a logical delete, or implement an archiving strategy. Keep in mind you can always archive logically deleted records.
It might be a little late but I suggest everyone to check Pinal Dave's blog post about logical/soft delete:
I just do not like this kind of design [soft delete] at all. I am firm believer of the architecture where only necessary data should be in single table and the useless data should be moved to an archived table. Instead of following the isDeleted column, I suggest the usage of two different tables: one with orders and another with deleted orders. In that case, you will have to maintain both the table, but in reality, it is very easy to maintain. When you write UPDATE statement to the isDeleted column, write INSERT INTO another table and DELETE it from original table. If the situation is of rollback, write another INSERT INTO and DELETE in reverse order. If you are worried about a failed transaction, wrap this code in TRANSACTION.
What are the advantages of the smaller table verses larger table in above described situations?
A smaller table is easy to maintain
Index Rebuild operations are much faster
Moving the archive data to another filegroup will reduce the load of primary filegroup (considering that all filegroups are on different system) – this will also speed up the backup as well.
Statistics will be frequently updated due to smaller size and this will be less resource intensive.
Size of the index will be smaller
Performance of the table will improve with a smaller table size.
I'm a NoSQL developer, and on my last job, I worked with data that was always critical for someone, and if it was deleted by accident in the same day that was created, I were not able to find it in the last backup from yesterday! In that situation, soft deletion always saved the day.
I did soft-deletion using timestamps, registering the date the document was deleted:
IsDeleted = 20150310 //yyyyMMdd
Every Sunday, a process walked on the database and checked the IsDeleted field. If the difference between the current date and the timestamp was greater than N days, the document was hard deleted. Considering the document still be available on some backup, it was safe to do it.
EDIT: This NoSQL use case is about big documents created in the database, tens or hundreds of them every day, but not thousands or millions. By general, they were documents with the status, data and attachments of workflow processes. That was the reason why there was the possibility of a user deletes an important document. This user could be someone with Admin privileges, or maybe the document's owner, just to name a few.
TL;DR My use case was not Big Data. In that case, you will need a different approach.
One pattern I have used is to create a mirror table and attach a trigger on the primary table, so all deletes (and updates if desired) are recorded in the mirror table.
This allows you to "reconstruct" deleted/changed records, and you can still hard delete in the primary table and keep it "clean" - it also allows the creation of an "undo" function, and you can also record the date, time, and user who did the action in the mirror table (invaluable in witch hunt situations).
The other advantage is there is no chance of accidentally including deleted records when querying off the primary unless you deliberately go to the trouble of including records from the mirror table (you may want to show live and deleted records).
Another advantage is that the mirror table can be independently purged, as it should not have any actual foreign key references, making this a relatively simple operation in comparison to purging from a primary table that uses soft deletes but still has referential connections to other tables.
What other advantages? - great if you have a bunch of coders working on the project, doing reads on the database with mixed skill and attention to detail levels, you don't have to stay up nights hoping that one of them didn’t forget to not include deleted records (lol, Not Include Deleted Records = True), which results in things like overstating say the clients available cash position which they then go buy some shares with (i.e., as in a trading system), when you work with trading systems, you will find out very quickly the value of robust solutions, even though they may have a little bit more initial "overhead".
Exceptions:
- as a guide, use soft deletes for "reference" data such as user, category, etc, and hard deletes to a mirror table for "fact" type data, i.e., transaction history.
I used to do soft-delete, just to keep old records. I realized that users don't bother to view old records as often as I thought. If users want to view old records, they can just view from archive or audit table, right? So, what's the advantage of soft-delete? It only leads to more complex query statement, etc.
Following are the things i've implemented, before I decided to not-soft-delete anymore:
implement audit, to record all activities (add,edit,delete). Ensure that there's no foreign key linked to audit, and ensure this table is secured and nobody can delete except administrators.
identify which tables are considered "transactional table", which very likely that it will be kept for long time, and very likely user may want to view the past records or reports. For example; purchase transaction. This table should not just keep the id of master table (such as dept-id), but also keep the additional info such as the name as reference (such as dept-name), or any other necessary fields for reporting.
Implement "active/inactive" or "enable/disable" or "hide/show" record of master table. So, instead of deleting record, the user can disable/inactive the master record. It is much safer this way.
Just my two cents opinion.
I'm a big fan of the logical delete, especially for a Line of Business application, or in the context of user accounts. My reasons are simple: often times I don't want a user to be able to use the system anymore (so the account get's marked as deleted), but if we deleted the user, we'd lose all their work and such.
Another common scenario is that the users might get re-created a while after having been delete. It's a much nicer experience for the user to have all their data present as it was before they were deleted, rather than have to re-create it.
I usually think of deleting users more as "suspending" them indefinitely. You never know when they'll legitimately need to be back.
I commonly use logical deletions - I find they work well when you also intermittently archive off the 'deleted' data to an archived table (which can be searched if needed) thus having no chance of affecting the performance of the application.
It works well because you still have the data if you're ever audited. If you delete it physically, it's gone!
I almost always soft delete and here's why:
you can restore deleted data if a customer asks you to do so. More happy customers with soft deletes. Restoring specific data from backups is complex
checking for isdeleted everywhere is not an issue, you have to check for userid anyway (if the database contains data from multiple users). You can enforce the check by code, by placing those two checks on a separate function (or use views)
graceful delete. Users or processes dealing with deleted content will continue to "see" it until they hit the next refresh. This is a very desirable feature if a process is processing some data which is suddenly deleted
synchronization: if you need to design a synchronization mechanism between a database and mobile apps, you'll find soft deletes much easier to implement
Re: "Is this secure?" - that depends on what you mean.
If you mean that by doing physical delete, you'll prevent anyone from ever finding the deleted data, then yes, that's more or less true; you're safer in physically deleting the sensitive data that needs to be erased, because that means it's permanently gone from the database. (However, realize that there may be other copies of the data in question, such as in a backup, or the transaction log, or a recorded version from in transit, e.g. a packet sniffer - just because you delete from your database doesn't guarantee it wasn't saved somewhere else.)
If you mean that by doing logical delete, your data is more secure because you'll never lose any data, that's also true. This is good for audit scenarios; I tend to design this way because it admits the basic fact that once data is generated, it'll never really go away (especially if it ever had the capability of being, say, cached by an internet search engine). Of course, a real audit scenario requires that not only are deletes logical, but that updates are also logged, along with the time of the change and the actor who made the change.
If you mean that the data won't fall into the hands of anyone who isn't supposed to see it, then that's totally up to your application and its security structure. In that respect, logical delete is no more or less secure than anything else in your database.
Logical deletions if are hard on referential integrity.
It is the right think to do when there is a temporal aspect of the table data (are valid FROM_DATE - TO_DATE).
Otherwise move the data to an Auditing Table and delete the record.
On the plus side:
It is the easier way to rollback (if at all possible).
It is easy to see what was the state at a specific point in time.
I strongly disagree with logical delete because you are exposed to many errors.
First of all queries, each query must take care the IsDeleted field and the possibility of error becomes higher with complex queries.
Second the performance: imagine a table with 100000 recs with only 3 active, now multiply this number for the tables of your database; another performance problem is a possible conflict with new records with old (deleted records).
The only advantage I see is the history of records, but there are other methods to achieve this result, for example you can create a logging table where you can save info: TableName,OldValues,NewValues,Date,User,[..] where *Values ​​can be varchar and write the details in this form fieldname : value; [..] or store the info as xml.
All this can be achieved via code or Triggers but you are only ONE table with all your history.
Another options is to see if the specified database engine are native support for tracking change, for example on SQL Server database there are SQL Track Data Change.
It's fairly standard in cases where you'd like to keep a history of something (e.g. user accounts as #Jon Dewees mentions). And it's certainly a great idea if there's a strong chance of users asking for un-deletions.
If you're concerned about the logic of filtering out the deleted records from your queries getting messy and just complicating your queries, you can just build views that do the filtering for you and use queries against that. It'll prevent leakage of these records in reporting solutions and such.
There are requirements beyond system design which need to be answered. What is the legal or statutory requirement in the record retention? Depending on what the rows are related to, there may be a legal requirement that the data be kept for a certain period of time after it is 'suspended'.
On the other hand, the requirement may be that once the record is 'deleted', it is truly and irrevocably deleted. Before you make a decision, talk to your stakeholders.
Mobile apps that depend on synchronisation might impose the use of logical rather than physical delete: a server must be able to indicate to the client that a record has been (marked as) deleted, and this might not be possible if records were physically deleted.
I just wanted to expand on the mentioned unique constraint problem.
Suppose I have a table with two columns: id and my_column. To support soft-deletes I need to update my table definition to this:
create table mytable (
id serial primary key,
my_column varchar unique not null,
deleted_at datetime
)
But if a row is soft-deleted, I want my_column constraint to be ignored, because deleted data should not interfere with non-deleted data. My original model will not work.
I would need to update my data definition to this:
create table mytable (
id serial primary key,
my_column varchar not null,
my_column_repetitions integer not null default 0,
deleted_at datetime,
unique (my_column, my_column_repetitions),
check (deleted_at is not null and my_column_repetitions > 0 or deleted_at is null and my_column_repetitions = 0)
)
And apply this logic: when a row is current, i.e. not deleted, my_column_repetitions should hold the default value 0 and when the row is soft-deleted its my_column_repetitions needs to be updated to (max. number of repetitions on soft-deleted rows) + 1.
The latter logic must be implemented programmatically with a trigger or handled in my application code and there is no check that I could set.
Repeat this is for every unique column!
I think this solution is really hacky and would favor a separate archive table to store deleted rows.
They don't let the database perform as it should rendering such things as the cascade functionality useless.
For simple things such as inserts, in the case of re-inserting, then the code behind it doubles.
You can't just simply insert, instead you have to check for an existence and insert if it doesn't exist before or update the deletion flag if it does whilst also updating all other columns to the new values. This is seen as an update to the database transaction log and not a fresh insert causing inaccurate audit logs.
They cause performance issues because tables are getting glogged with redundant data. It plays havock with indexing especially with uniqueness.
I'm not a big fan of logical deletes.
To reply to Tohid's comment, we faced same problem where we wanted to persist history of records and also we were not sure whether we wanted is_deleted column or not.
I am talking about our python implementation and a similar use-case we hit.
We encountered https://github.com/kvesteri/sqlalchemy-continuum which is an easy way to get versioning table for your corresponding table. Minimum lines of code and captures history for add, delete and update.
This serves more than just is_deleted column. You can always backref version table to check what happened with this entry. Whether entry got deleted, updated or added.
This way we didn't need to have is_deleted column at all and our delete function was pretty trivial. This way we also don't need to remember to mark is_deleted=False in any of our api's.
Soft Delete is a programming practice that being followed in most of the application when data is more relevant. Consider a case of financial application where a delete by the mistake of the end user can be fatal.
That is the case when soft delete becomes relevant. In soft delete the user is not actually deleting the data from the record instead its being flagged as IsDeleted to true (By normal convention).
In EF 6.x or EF 7 onward Softdelete is Added as an attribute but we have to create a custom attribute for the time being now.
I strongly recommend SoftDelete In a database design and its a good convention for the programming practice.
Most of time softdeleting is used because you don't want to expose some data but you have to keep it for historical reasons (A product could become discontinued, so you don't want any new transaction with it but you still need to work with the history of sale transaction). By the way, some are copying the product information value in the sale transaction data instead of making a reference to the product to handle this.
In fact it looks more like a rewording for a visible/hidden or active/inactive feature. Because that's the meaning of "delete" in business world. I'd like to say that Terminators may delete people but boss just fire them.
This practice is pretty common pattern and used by a lot of application for a lot of reasons. As It's not the only way to achieve this, so you will have thousand of people saying that's great or bullshit and both have pretty good arguments.
From a point of view of security, SoftDelete won't replace the job of Audit and it won't replace the job of backup too. If you are afraid of "the insert/delete between two backup case", you should read about Full or Bulk recovery Models. I admit that SoftDelete could make the recovery process more trivial.
Up to you to know your requirement.
To give an alternative, we have users using remote devices updating via MobiLink. If we delete records in the server database, those records never get marked deleted in the client databases.
So we do both. We work with our clients to determine how long they wish to be able to recover data. For example, generally customers and products are active until our client say they should be deleted, but history of sales is only retained for 13 months and then deletes automatically. The client may want to keep deleted customers and products for two months but retain history for six months.
So we run a script overnight that marks things logically deleted according to these parameters and then two/six months later, anything marked logically deleted today will be hard deleted.
We're less about data security than about having enormous databases on a client device with limited memory, such as a smartphone. A client who orders 200 products twice a week for four years will have over 81,000 lines of history, of which 75% the client doesn't care if he sees.
It all depends on the use case of the system and its data.
For example, if you are talking about a government regulated system (e.g. a system at a pharmaceutical company that is considered a part of the quality system and must follow FDA guidelines for electronic records), then you darned well better not do hard deletes! An auditor from the FDA can come in and ask for all records in the system relating to product number ABC-123, and all data better be available. If your business process owner says the system shouldn't allow anyone to use product number ABC-123 on new records going forward, use the soft-delete method instead to make it "inactive" within the system, while still preserving historical data.
However, maybe your system and its data has a use case such as "tracking the weather at the North Pole". Maybe you take temperature readings once every hour, and at the end of the day aggregate a daily average. Maybe the hourly data will no longer ever be used after aggregation, and you'd hard-delete the hourly readings after creating the aggregate. (This is a made-up, trivial example.)
The point is, it all depends on the use case of the system and its data, and not a decision to be made purely from a technological standpoint.
Well! As everyone said, it depends on the situation.
If you have an index on a column like UserName or EmailID - and you never expect the same UserName or EmailID to be used again; you can go with a soft delete.
That said, always check if your SELECT operation uses the primary key. If your SELECT statement uses a primary key, adding a flag with the WHERE clause wouldn't make much difference. Let's take an example (Pseudo):
Table Users (UserID [primary key], EmailID, IsDeleted)
SELECT * FROM Users where UserID = 123456 and IsDeleted = 0
This query won't make any difference in terms of performance since the UserID column has a primary key. Initially, it will scan the table based on PK and then execute the next condition.
Cases where soft deletes cannot work at all:
Sign-up in majorly all websites take EmailID as your unique identification. We know very well, once an EmailID is used on a website like facebook, G+, it cannot be used by anyone else.
There comes a day when the user wants to delete his/her profile from the website. Now, if you make a logical delete, that user won't be able to register ever again. Also, registering again using the same EmailID wouldn't mean to restore the entire history. Everyone knows, deletion means deletion. In such scenarios, we have to make a physical delete. But in order to maintain the entire history of the account, we should always archive such records in either archive tables or deleted tables.
Yes, in situations where we have lots of foreign tables, handling is quite cumbersome.
Also keep in mind that soft/logical deletes will increase your table size, so the index size.
I have already answered in another post.
However, I think my answer more fit to the question here.
My practical solution for soft-delete is archiving by creating a new
table with following columns: original_id, table_name, payload,
(and an optional primary key `id).
Where original_id is the original id of deleted record, table_name
is the table name of the deleted record ("user" in your case),
payload is JSON-stringified string from all columns of the deleted
record.
I also suggest making an index on the column original_id for latter
data retrievement.
By this way of archiving data. You will have these advantages
Keep track of all data in history
Have only one place to archive records from any table, regardless of the deleted record's table structure
No worry of unique index in the original table
No worry of checking foreign index in the original table
No more WHERE clause in every query to check for deletion
The is already a discussion
here explaining why
soft-deletion is not a good idea in practice. Soft-delete introduces
some potential troubles in future such as counting records, ...
It depends on the case, consider the below:
Usually, you don't need to "soft-delete" a record.
Keep it simple and fast.
e.g. Deleting a product no longer available, so you don't have to check the product isn't soft-deleted all over your app (count, product list, recommended products, etc.).
Yet, you might consider the "soft-delete" in a data warehouse model. e.g. You are viewing an old receipt on a deleted product.*
Advantages are data preservation/perpetuation. A disadvantage would be a decrease in performance when querying or retrieving data from tables with significant number of soft deletes.
In our case we use a combination of both: as others have mentioned in previous answers, we soft-delete users/clients/customers for example, and hard-delete on items/products/merchandise tables where there are duplicated records that don't need to be kept.