The lead developer on a project I'm involved in says it's bad practice to rely on cascades to delete related rows.
I don't see how this is bad, but I would like to know your thoughts on if/why it is.
I'll preface this by saying that I rarely delete rows, period. Generally you want to keep most data; you simply mark it as deleted so it isn't shown to users (i.e. to them it appears deleted). Of course it depends on the data, and for some things (e.g. shopping cart contents) actually deleting the records when the user empties his or her cart is fine.
I can only assume the concern is that you may unintentionally delete records you don't actually want to delete. Referential integrity should prevent this, however, so I can't really see a reason against cascades other than the case for being explicit.
I would say that you follow the principle of least surprise.
Cascading deletes should not cause unexpected loss of data. If a delete requires related records to be deleted, and the user needs to know that those records are going to go away, then cascading deletes should not be used. Instead, the user should be required to explicitly delete the related records, or be provided a notification.
On the other hand, if the table relates to another table that is temporary in nature, or that contains records that will never be needed once the parent entity is gone, then cascading deletes may be OK.
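As a minimal sketch of what that looks like in plain SQL DDL, using invented shopping-cart tables as the "temporary data" case (none of these names come from the question):

-- Parent table: the cart itself.
CREATE TABLE cart (
    id INT PRIMARY KEY
);

-- Cart contents are throwaway data: when a cart row goes, its items go too.
CREATE TABLE cart_item (
    cart_id    INT NOT NULL,
    product_id INT NOT NULL,
    quantity   INT NOT NULL DEFAULT 1,
    CONSTRAINT fk_cart_item_cart
        FOREIGN KEY (cart_id) REFERENCES cart (id)
        ON DELETE CASCADE
);

-- This single statement now also removes the matching cart_item rows.
DELETE FROM cart WHERE id = 42;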
That said, I prefer to state my intentions explicitly by deleting the related records in code, rather than relying on cascading deletes. In fact, I've never actually used a cascading delete to implicitly delete related records. Also, I use soft deletion a lot, as described by cletus.
I never use cascading deletes. Why? Because it is too easy to make a mistake. Much safer to require client applications to explicitly delete (and meet the conditions for deletion, such as deleting FK referred records.)
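For example, the explicit version is just an ordinary transaction; the order/order_item table names below are invented for the sketch:

-- Delete the children first, then the parent, in one transaction.
BEGIN;
DELETE FROM order_item WHERE order_id = 42;
DELETE FROM orders     WHERE id = 42;
COMMIT;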
In fact, deletions per se can be avoided by marking records as deleted or moving into archival/history tables.
In the case of marking records as deleted, it depends on the relative proportion of data marked as deleted. Since SELECTs will have to filter on 'isDeleted = false', an index on that flag will only be used if the filtered fraction is small enough (roughly under 10%, depending on the RDBMS).
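One common mitigation is a partial index that excludes the soft-deleted rows entirely. A sketch, assuming PostgreSQL and an orders table with customer_id and isDeleted columns (all invented names):

-- Deleted rows are not stored in this index at all, so queries such as
-- WHERE customer_id = ? AND isDeleted = false can use it regardless of
-- how large the deleted fraction grows.
CREATE INDEX idx_orders_live
    ON orders (customer_id)
    WHERE isDeleted = false;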
Which of these 2 scenarios would you prefer:
Developer comes to you, says "Hey, this delete won't work". You both look into it and find that he was accidentally trying to delete the entire table's contents. You both have a laugh, and go back to what you were doing.
Developer comes to you, and sheepishly asks "Do we have backups?"
There's another great reason to not use cascading UPDATES or DELETES: they hold a serializable lock. Holding a serializable lock can kill performance.
Another huge reason to avoid cascading deletes is performance. They seem like a good idea until you need to delete 10,000 records from the main table which in turn have millions of records in child tables. Given the size of this delete, it is likely to completely lock down all of the affected tables for hours, maybe even days. Why would you ever risk this? For the convenience of spending ten minutes less writing the extra delete statements for single-record deletes?
Further, the error you get when you try to delete a record that has a child record is often a good thing. It tells you that you don't want to delete this record because there is data you need that you would lose if you did so. Cascade delete would just go ahead and delete the child records, resulting in loss of information about orders, for instance, if you deleted a customer who had orders in the past. This sort of thing can thoroughly mess up your financial records.
I was likewise told that cascading deletes were bad practice... and thus never used them until I came across a client who used them. I really didn't know why I was not supposed to use them, but found them very convenient, since I didn't have to write code to delete all the FK records as well.
Thus I decided to research why they were so "bad", and from what I've found so far there doesn't appear to be anything problematic about them. In fact, the only good argument I've seen so far is what HLGLEM stated above about performance. But as I am usually not deleting that number of records, I think in most cases using them should be fine. I would like to hear of any other arguments others may have against using them, to make sure I've considered all options.
I'd add that ON DELETE CASCADE makes it difficult to maintain a copy of the data in a data warehouse using binlog replication which is how most commercial ETL tools work. Explicit deletion from each table maintains a full log record and is much easier on the data team :)
I actually agree with most of the answers here, YET not all scenarios are the same; it depends on the situation at hand and on what the entropy of that decision would be. For example:
If you have a deletion command for an entity that has many-to-many relationships with a large number of other entities, then each time you call that deletion process you would also need to remember to delete all the corresponding FK rows from each relational pivot the entity has relationships with.
Whereas with a cascade on delete, you write it once as part of your schema, and it will ONLY delete the corresponding FK rows and clean up the pivots of relations that are no longer needed. Imagine 24 relations for one entity, plus other entities that also have a large number of relations on top of that. Again, it really depends on your setup and what YOU feel comfortable with. In any case, just FYI, in an Illuminate migration schema file you would write it like this:
// Drop the existing foreign key, then recreate it with cascading deletes.
$table->dropForeign(['permission_id']);
$table->foreign('permission_id')
      ->references('id')
      ->on('permission')
      ->onDelete('cascade');
We've got a table with two columns: USER and MESSAGE
A USER can have more than one message.
The table is frequently updated with more USER-MESSAGE pairs.
I want to frequently retrieve the top X users that sent the most messages. What would be the optimal solution for it (DX- and performance-wise)?
The solutions I see myself:
I could GROUP BY and COUNT; however, it doesn't seem like the most performant or cleanest solution.
I could keep an additional table that'd keep count of every user's messages. On every message insertion into the main table, I could also update the relevant row here. Could the update be done automatically? Perhaps I could write a procedure for it?
For the main table, I could create a VIEW that'd have an additional "calculated" column - it'd GROUP BY and COUNT, but again, it's probably not the most performant solution. I'd query the view instead.
Please tell me whatever you think might be the best solution.
Some databases have incrementally updated views, where you create a view like in your example 3, and it automatically keeps it updated like in your example 2. PostgreSQL does not have this feature.
For your option 1, it seems pretty darn clean to me. Hard to get much simpler than that. Yes, it could have performance problems, but how fast do you really need it to be? You should make sure you actually have a problem before worrying about solving it.
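For reference, option 1 is just a query like the following. The table name (messages), the column name (sender, since USER is a reserved word in most SQL dialects) and X = 10 are all assumptions for the sketch:

-- Top 10 users by number of messages sent.
SELECT sender, COUNT(*) AS message_count
FROM messages
GROUP BY sender
ORDER BY message_count DESC
LIMIT 10;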
For your option 2, what you are looking for is a trigger. For each insertion, it would increment a count in the user table. If you ever delete, you would also need to decrease the count. Also, if you ever update an existing entry to change its user, the trigger would need to decrease the count of the old user and increase that of the new user. This will decrease concurrency: if two processes try to insert messages from the same user at the same time, one will block until the other finishes. This may not matter much to you. Also, the mere existence of triggers imposes some CPU overhead, plus whatever the trigger itself actually does. But unless your server is already overloaded, this might not matter.
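A rough sketch of such a trigger in PostgreSQL (11+ syntax), assuming a users table with user_name and message_count columns and a messages table with a sender column; all of these names are invented for the example:

-- Keep users.message_count in step with the messages table.
CREATE FUNCTION maintain_message_count() RETURNS trigger AS $$
BEGIN
    IF TG_OP IN ('INSERT', 'UPDATE') THEN
        UPDATE users SET message_count = message_count + 1
        WHERE user_name = NEW.sender;
    END IF;
    IF TG_OP IN ('DELETE', 'UPDATE') THEN
        UPDATE users SET message_count = message_count - 1
        WHERE user_name = OLD.sender;
    END IF;
    RETURN NULL;  -- return value is ignored for AFTER row triggers
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_message_count
AFTER INSERT OR UPDATE OF sender OR DELETE ON messages
FOR EACH ROW EXECUTE FUNCTION maintain_message_count();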
Your option 3 doesn't make much sense to me, at least not in PostgreSQL. There is no performance benefit, and it would act to obscure rather than clarify what is going on. Anyone who can't understand a GROUP BY is probably going to have even more problems understanding a view which exists only to do a GROUP BY.
Another option is a materialized view. But you will see stale data from them between refreshes. For some uses that is acceptable, for some it is not.
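A sketch of that, again in PostgreSQL and reusing the same invented messages/sender names:

-- Precomputed counts; the data is only as fresh as the last refresh.
CREATE MATERIALIZED VIEW user_message_counts AS
SELECT sender, COUNT(*) AS message_count
FROM messages
GROUP BY sender;

-- Rerun this periodically, e.g. from cron.
REFRESH MATERIALIZED VIEW user_message_counts;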
The first and third solutions are essentially the same, since a view is nothing but a “crystallized” query.
The second solution would definitely make for faster queries, but at the price of storing redundant data. The disadvantages of such an approach are:
You run the risk of inconsistent data. You can reduce that risk somewhat by using triggers that automatically keep the data synchronized.
The performance of modifications to message will be worse, because the trigger will have to be executed, and each modification will also modify users (which is the natural place to keep such a count).
The decision should be based on the question whether the GROUP BY query will be fast enough for your purposes. If yes, use it and avoid the above disadvantages. If not, consider storing the extra count.
I need to use one Access (2007) database in 2 offline locations and then get all the data back into one database. Some have advised me to use SharePoint, but after some trial and frustration I wonder if it's really the best way.
Is it possible to manage this in an automated way, with update queries or so?
I have 26 tables, but only 14 need to be updated frequently. I use autonumber to create the parentkey and use cascade updating for the linked tables.
If your data can handle it, it's probably better to use a more natural key for the tables that require frequent updating. I.e. ideally you can uniquely identify a record by some combination of the columns in that record. Autonumbers in two databases can, and very likely will, step on each other; then, when you do merge, any records based on an old autonumber need to be mapped properly. That can be done but is kind of a pain. It'd be nicer to avoid it all from the start.
As for using SharePoint (I assume the suggestion is to replace your tables with lists, not to just put your accdb on SP), it has a lot of limitations in terms of the kinds of indices that can be created and relationships you can establish. Maybe your data are simple enough to live with this. I have yet to be able to justify the move.
Ultimately, the answer to your question is YES, it is possible to manage the synchronization with insert/update queries and very likely some VBA (possibly lots, depending on how complicated your table hierarchy is). You'll need to be vigilant about two people updating a single record and come up with some means to resolve the conflict.
As a programmer, adding a reference to an object is pretty safe, but adding a foreign key relationship (I think) is pretty dangerous. By adding a FK relationship, ALL the queries that delete a row from this foreign table have to be updated to properly delete the rows tied to it by the foreign key before actually deleting the row. How do you find all the queries that delete a row from this foreign table? These queries can lie buried in code and in stored procedures. Is this a real-life example of a maintenance nightmare? Is there a solution to this problem?
You should never design a relational database without foreign keys from the very beginning. That is a guarantee of poor data integrity over time.
You can add the code and use cascade delete as others have suggested, but that too is often the wrong answer. There are times when you genuinely want the delete stopped because you have child records. For instance, suppose you have customers and orders. If you delete a customer who has an order, then you lose the financial record of the order, which is a disaster. Instead, you would want the application to get an error saying an order exists for this customer. Further, cascade delete could suddenly have you deleting millions of child records, thus locking up your database while a huge transaction happens. It is a dangerous practice that should rarely, if ever, be used in a production database.
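A plain foreign key with no cascade already gives you that guard. A sketch with invented customer/order tables, using the default referential action:

CREATE TABLE customers (
    id INT PRIMARY KEY
);

CREATE TABLE orders (
    id          INT PRIMARY KEY,
    customer_id INT NOT NULL REFERENCES customers (id)  -- no cascade: default NO ACTION
);

-- Fails with a foreign key violation if the customer still has orders,
-- which is exactly the warning you want before losing financial history.
DELETE FROM customers WHERE id = 42;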
Add the FK (if you have the relationships, it is needed) and then search for the code that deletes from that table and adjust it appropriately. Consider whether a soft delete isn't a better option. This is where you mark a record as deleted or inactive, so it no longer shows up as a data entry option, but you can still see the existing records. Again, you may need to adjust your database code fairly severely to implement this correctly. There is no easy fix for having a database that was badly designed from the start.
The soft delete is also a good choice if you think you will have many child records and actually do want to delete them. This way you can mark the records so they no longer show in the application and use a job that runs during non-peak hours to batch delete records.
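The purge job itself can be a simple loop that deletes in small batches so locks stay short. A sketch in SQL Server syntax; the orders table, IsDeleted flag and batch size are assumptions:

-- Remove soft-deleted rows a few thousand at a time during off-peak hours.
DECLARE @batch INT = 5000;
WHILE 1 = 1
BEGIN
    DELETE TOP (@batch) FROM orders WHERE IsDeleted = 1;
    IF @@ROWCOUNT < @batch BREAK;
END;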
If you are adding a new table and adding an FK, it is certainly easier to deal with because you would create the table before writing any code against it.
Your statement is simply not true. When establishing a foreign key relationship, you can set the cascading property to cascade delete. Once that's done, the child records will be deleted when the parent is deleted, ensuring that no records are orphaned.
If you use a proper ORM solution, configure FK's and PK's correctly, and enable cascading deletes, you shouldn't have any problems.
I wouldn't say so (to confirm what others mentioned) - that is usually taken care of with cascading deletes, provided you want it that way - or with careful procedures that clean up behind you.
The bigger the system gets, the more you see of the 'procedures' and the less of the 'automation' (i.e. cascade deletes). For larger setups, DBAs usually prefer to deal with that during the database maintenance phase. Quite often, records are not allowed to be deleted through middle-ware application code - they are simply marked as 'deleted' or inactive and dealt with later on, according to the database routines and procedures in place in the organization (archived etc.).
And unless you have a very large code base, that's not a huge issue. Also, usually, most of the Db code goes through some DAL layer which can be easily traversed. Or you can query the system tables for all the relationships and 'dependencies'; many routines have been written for that kind of code maintenance (on both sides of the 'fence'). It's not that it's not an 'issue', it's just nothing much different from normal Db work - and there are worse things than that.
So, I wouldn't lose my sleep over that. There are other issues around using 'too many' referential integrity constraints (performance, maintenance) - but that is often a very controversial issue among DBAs (and Db professionals in general), so I won't get into that :)
I have an issue I am working with an existing SQL Server 2008 database: I need to occasionally change the primary key value for some existing records in a table. Unfortunately, there are about 30 other tables with foreign key references to this table.
What is the most elegant way to change a primary key and related foreign keys?
I am not in a situation where I can change the existing key structure, so this is not an option. Additionally, as the system is expanded, more tables will be related to this table, so maintainability is very important. I am looking for the most elegant and maintainable solution, and any help is greatly appreciated. I so far have thought about using Stored Procedures or Triggers, but I wanted some advice before heading in the wrong direction.
Thanks!
When you say "I am not in a situation where I can change the existing key structure" are you able to add the ON UPDATE CASCADE option to the foreign keys? That is the easiest way to handle this situation — no programming required.
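In SQL Server that is a property of each foreign key, so each existing FK would be dropped and re-added with the option. A sketch with invented constraint, table and column names:

-- Recreate the FK with ON UPDATE CASCADE so key changes propagate automatically.
ALTER TABLE orders DROP CONSTRAINT fk_orders_customers;
ALTER TABLE orders
    ADD CONSTRAINT fk_orders_customers
    FOREIGN KEY (customer_code) REFERENCES customers (customer_code)
    ON UPDATE CASCADE;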
As Larry said, ON UPDATE CASCADE will work; however, it can cause major problems in a production database, and most DBAs are not too thrilled about letting you use it. For instance, suppose you have a customer who changes his company name (and that is the PK) and there are two million related records in various tables. ON UPDATE CASCADE will do all the updates in one transaction, which could lock up your major tables for several hours. This is one reason why it is a very bad idea to have a PK that will need to be changed. A trigger would be just as bad, and if incorrectly written, it could be much worse.
If you do the changes in a stored proc you can put each part in a separate transaction, so at least you aren't locking everything up. You can also update records in batches, so that if you have a million records to update in a table, you can do them in smaller batches which will run faster and hold fewer locks. The best way to do this is to create a new record in the primary table with the new PK, then move the old records to the new one in batches, and delete the old record once all related records are moved. If you do this sort of thing, it is best to have audit tables so you can easily revert the data if there is a problem, since you will want to do this in multiple transactions to avoid locking the whole database. Now this is harder to maintain; you have to remember to add to the proc when you add an FK (but you would have to remember to add ON UPDATE CASCADE as well). On the other hand, if it breaks due to a problem with a new FK, it is an easy fix: you know right away what the problem is and can push a change to prod relatively quickly.
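A sketch of the batched portion for one child table (SQL Server syntax; the names, values and batch size are invented, and the audit-table and new-parent-row steps are left out):

DECLARE @old_code VARCHAR(20) = 'OLD-KEY',
        @new_code VARCHAR(20) = 'NEW-KEY';

-- Re-point child rows in small batches so each transaction stays short.
WHILE 1 = 1
BEGIN
    UPDATE TOP (10000) orders
    SET customer_code = @new_code
    WHERE customer_code = @old_code;

    IF @@ROWCOUNT = 0 BREAK;
END;
-- Once every child table has been migrated, delete the old parent row.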
There are no easy solutions to this problem because the basic problem is poor design. You'll have to look over the pros and cons of all solutions (I would throw out the trigger idea as Cascade Update will perform better and be less subject to bugs) and decide what works best in your case. Remember data integrity and performance are critical to enterprise databases and may be more important than maintainability (heresy, I know).
If you have to update your primary key regularly then something is wrong there. :)
I think the simplest way to do it is to add another column and make it the primary key. This would allow you to change the values easily and also keep the foreign keys related. Besides, I do not understand why you cannot change the existing key structure.
But, as you pointed out in the question (and Larry Lustig commented), you cannot change the existing structure. But I am afraid that if it is a column which requires frequent updates, then the use of triggers could adversely affect performance. And you also say that as the system expands, more tables will be related to this table, so maintainability is very important. But a quick fix now will only worsen the problem.
Possible Duplicate:
What to do when I want to use database constraints but only mark as deleted instead of deleting?
Is it more appropriate to set some kind of "IsDeleted" flag in a heavily used database to simply mark records for deletion (and then delete them later), or should they be deleted directly?
I like the IsDeleted flag approach because it gives an easy option to restore data in case something went terribly wrong, and I could even provide some kind of "Undo" function to the user. The data I'm dealing with is fairly important.
I don't like IsDeleted because it really messes with data retrieval queries, as I'd have to filter by the state of the IsDeleted flag in addition to the regular query. Queries use no more than one index so I'd also assume that this would slow down things tremendously unless I create composite indexes.
So, what is more appropriate? Is there a better "middle way" to get the benefits of both, and what are you using & why?
As a rule of thumb, I never delete any data. In the type of business I am in, there are always questions such as 'Of the customers that cancelled, how many of them had a widget of size 4?' If I had deleted the customer, how could I answer that? Or, more likely, if I had deleted a widget of size 4 from the widget table, this would cause a problem with referential integrity. An 'Active' bit flag seems to work for me, and with indexing there is no big performance hit.
I would be driven by business requirements. If the client expects you to restore deleted data instantly, and undeleting data is part of the business logic and/or use cases, then an isDeleted flag makes sense.
Otherwise, by leaving deleted data in the database, you address problems that are more suitable to be addressed by database backups and maintenance procedures.
The mechanism for doing this has been discussed several times before.
Question 771197
Question 68323
My personal favourite, a deleted_at column, is documented in Question 771197.
The answer is, it depends on the scenario.
Does it require undo-delete?
What is the frequency of users doing that?
How many records will it result in over time?
If it is required, you can create tables with an identical structure, named with a _DELETED suffix, e.g.
Customers_DELETED.
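Moving a row there can then be done in one transaction. A sketch in SQL Server syntax; it assumes the archive table has exactly the same columns and no IDENTITY property (otherwise you would list the columns explicitly and use SET IDENTITY_INSERT):

-- Archive the row, then remove it, so it is never in both places after commit.
DECLARE @id INT = 42;
BEGIN TRANSACTION;
    INSERT INTO Customers_DELETED
    SELECT * FROM Customers WHERE CustomerId = @id;

    DELETE FROM Customers WHERE CustomerId = @id;
COMMIT;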
I think you should also consider what happens if there is a conflict when you undelete the record and some other user has entered a record with similar content.
I have learnt that deleting data rarely makes sense, as there's always some report that needs the data, or, more often, someone deletes something by mistake and needs it back. Personally I move all "deleted" items to an archive version of the database. This is then backed up separately, and reports can use it. The main DB size is kept lean and restoring the data is not too much of an issue.
But like others have said, it depends on your business requirements and scale / size of DB. An archived / deleted field may be enough.