Database safety: Intermediary "to_be_deleted" column/table? - sql

Everyone has accidentally forgotten the WHERE clause on a DELETE query and blasted some un-backed up data once or twice. I was pondering that problem, and I was wondering if the solution I came up with is practical.
What if, in place of actual DELETE queries, the application and maintenance scripts did something like:
UPDATE foo SET to_be_deleted=1 WHERE blah = 50;
And then a cron job was set to go through and actually delete everything with the flag? The downside would be that pretty much every other query would need to have WHERE to_be_deleted != 1 appended to it, but the upside would be that you'd never mistakenly lose data again. You could see "2,349,325 rows affected" and say, "Hmm, looks like I forgot the WHERE clause," and reset the flags. You could even make the to_be_deleted field a DATE column, so the cron job would check to see if a row's time had come yet.
Also, you could remove DELETE permission from the production database user, so even if someone managed to inject some SQL into your site, they wouldn't be able to remove anything.
So, my question is: Is this a good idea, or are there pitfalls I'm not seeing?

That is fine if you want to do that, but it seems like a lot of work. How many people are manually changing the database? It should be very few, especially if your users have an app to work with.
When I work on the production db I put EVERYTHING I do in a transaction so if I mess up I can rollback. Just having a standard practice like that for me has helped me.
I don't see anything really wrong with that, though, other than that every single point of data manipulation in each application will have to be aware of this functionality and not just the data it wants.

This would be fine as long as your application does not require the data to be deleted immediately, since you have to wait for the next run of the cron job.
I think a better solution and the more common practice is to use a development server and a production server. If your development database gets blown out, simply reload it. No harm done. If you're testing code on your production database, you deserve anything bad that happens.

A lot of people have a delete flag or a row status flag. But if someone is doing a change through the back end (and they will be doing it since often people need batch changes done that can't be accomplished through the front end) and they make a mistake they will still often go for delete. Ultimately this is no substitute for testing the script before applying it to a production environment.
Also, what happens if "UPDATE foo SET to_be_deleted=1" gets executed because someone left off the WHERE clause? Unless you have auditing columns with a timestamp, how do you know which rows were already flagged for deletion and which ones were flagged in error? And even if you have auditing columns with a timestamp, if the auditing is done via a stored procedure or programmer convention, these back-end queries may not supply the information that tells you they were just applied.

Too complicated. The standard approach to this is to do all your work inside a transaction, so if you screw up and forget a WHERE clause, then you simply roll back when you see the "2,349,325 rows affected" result.
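A minimal sketch of that habit, assuming SQLite (through Python's sqlite3 module) as a stand-in for the production database. The table foo, the helper delete_with_check, and the expected_max threshold are made up for illustration; the point is only that the row count is checked before anything is committed.

```python
import sqlite3

def delete_with_check(conn, expected_max):
    """Run a DELETE in an explicit transaction; roll back if it hits too many rows."""
    conn.execute("BEGIN")
    cur = conn.execute("DELETE FROM foo")  # oops: the WHERE clause is missing
    if cur.rowcount > expected_max:
        # The "2,349,325 rows affected" moment: nothing is committed yet.
        conn.execute("ROLLBACK")
        return False
    conn.execute("COMMIT")
    return True

# isolation_level=None puts the connection in autocommit mode, so the
# BEGIN / COMMIT / ROLLBACK statements above are the only transaction control.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE foo (id INTEGER PRIMARY KEY, blah INTEGER)")
conn.executemany("INSERT INTO foo (blah) VALUES (?)", [(i % 100,) for i in range(1000)])

committed = delete_with_check(conn, expected_max=20)
remaining = conn.execute("SELECT COUNT(*) FROM foo").fetchone()[0]
print(committed, remaining)
```

When working by hand in a SQL console the same thing applies: look at the reported row count, then type COMMIT or ROLLBACK yourself.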

It may be easier to create a parallel table for deleted rows. A DELETE trigger (and UPDATE too if you want to undo changes as well) on the original table could copy the affected rows to the parallel table. Adding a datetime column to the parallel table to record the date & time of the change would let you permanently remove rows past a certain age using your cron job.
That way, you'd use normal DELETE statements on the original table, so there's no chance you'll forget to run your special "DELETE" statement. You also sidestep the to_be_deleted != 1 expression, which is just a bug waiting to happen when someone inevitably forgets.
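A rough sketch of the parallel-table idea, again assuming SQLite via Python's sqlite3 as a stand-in. The names foo_deleted, foo_archive_delete and deleted_at, and the 30-day retention period, are hypothetical; trigger syntax differs between database engines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE foo (id INTEGER PRIMARY KEY, blah INTEGER);

-- Parallel table: same columns plus the time of deletion.
CREATE TABLE foo_deleted (id INTEGER, blah INTEGER, deleted_at TEXT);

-- Copy every row into the parallel table just before it is deleted.
CREATE TRIGGER foo_archive_delete BEFORE DELETE ON foo
BEGIN
    INSERT INTO foo_deleted (id, blah, deleted_at)
    VALUES (OLD.id, OLD.blah, datetime('now'));
END;
""")
conn.executemany("INSERT INTO foo (id, blah) VALUES (?, ?)", [(1, 50), (2, 50), (3, 7)])

# A normal DELETE with the WHERE clause forgotten.
conn.execute("DELETE FROM foo")
archived = conn.execute("SELECT COUNT(*) FROM foo_deleted").fetchone()[0]

# Undo the mistake by copying the rows back from the parallel table.
conn.execute("INSERT INTO foo (id, blah) SELECT id, blah FROM foo_deleted")
restored = conn.execute("SELECT COUNT(*) FROM foo").fetchone()[0]

# What the cron job would run: permanently drop archived rows past a certain age.
purged = conn.execute(
    "DELETE FROM foo_deleted WHERE deleted_at < datetime('now', '-30 days')"
).rowcount
print(archived, restored, purged)
```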

It looks like you're describing three cases here.
Case 1 - maintenance scripts. Risk can be minimized by developing them and testing them in an environment other than your production box. For quick maintenance, do the maintenance in a single transaction, and check everything before committing. If you made a mistake, issue the rollback command. For more serious maintenance that you can't necessarily wait around for, or do in a single transaction, consider taking a backup directly before running the maintenance job, so that you can always restore back to the point before you ran your script if you encounter serious problems.
Case 2 - SQL Injection. This is an architecture issue. Your application shouldn't pass SQL into the database; access should be controlled through packages / stored procedures / functions, and values that come from the UI and are used in a DML statement should be applied using bind variables, rather than by building dynamic SQL through string concatenation.
Case 3 - Regular batch jobs. These should have been tested before being deployed to production. If you delete too much, you have a bug, and are going to have to rely on your backup strategy.
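To make the bind-variable point in Case 2 concrete, here is a small sketch using Python's sqlite3 module. The table foo, both helper functions, and the malicious input are invented for the example; the ? placeholder is sqlite3's bind syntax, and other drivers use different markers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO foo (name) VALUES (?)", [("alice",), ("bob",), ("carol",)])

def delete_by_name_unsafe(conn, name):
    # Dynamic SQL built by appending strings: the input becomes part of the statement.
    return conn.execute("DELETE FROM foo WHERE name = '" + name + "'").rowcount

def delete_by_name_bound(conn, name):
    # Bind variable: the input is only ever treated as a value to compare against.
    return conn.execute("DELETE FROM foo WHERE name = ?", (name,)).rowcount

malicious = "x' OR '1'='1"

bound_deleted = delete_by_name_bound(conn, malicious)    # matches no row
after_bound = conn.execute("SELECT COUNT(*) FROM foo").fetchone()[0]

unsafe_deleted = delete_by_name_unsafe(conn, malicious)  # WHERE clause is now always true
after_unsafe = conn.execute("SELECT COUNT(*) FROM foo").fetchone()[0]
print(bound_deleted, after_bound, unsafe_deleted, after_unsafe)
```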

"Everyone has accidentally forgotten the WHERE clause on a DELETE query and blasted some un-backed up data once or twice."
No. I always prototype my DELETEs as SELECTs, and only once the SELECT returns exactly the rows I want to delete do I change the part of the statement before WHERE to a DELETE. This lets me inspect, in as much detail as needed, the rows I am about to affect before doing anything.
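One way to picture this workflow, as a sketch in Python's sqlite3 with a hypothetical table foo: the WHERE clause is written once and reused, so the DELETE touches exactly the rows the SELECT showed (assuming nothing else changes the table in between).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (id INTEGER PRIMARY KEY, blah INTEGER)")
conn.executemany("INSERT INTO foo (blah) VALUES (?)", [(50,), (50,), (7,)])

# The filter is written once; only the part before WHERE changes.
where = "WHERE blah = ?"
params = (50,)

# Step 1: prototype as a SELECT and inspect the rows that would be affected.
preview = conn.execute(f"SELECT id, blah FROM foo {where} ORDER BY id", params).fetchall()
print(preview)

# Step 2: only after the preview looks right, run the DELETE with the same filter.
deleted = conn.execute(f"DELETE FROM foo {where}", params).rowcount
remaining = conn.execute("SELECT COUNT(*) FROM foo").fetchone()[0]
print(deleted, remaining)
```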

You could set up a view on that table that selects WHERE to_be_deleted != 1, and all of your normal selects are done on that view - that avoids having to put the WHERE on all of your queries.
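A small sketch of the view approach, using SQLite through Python's sqlite3. The view name foo_live is made up; to_be_deleted is the flag column from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE foo (
    id INTEGER PRIMARY KEY,
    blah INTEGER,
    to_be_deleted INTEGER NOT NULL DEFAULT 0
);

-- Normal reads go through the view, so the flag check lives in one place.
CREATE VIEW foo_live AS
    SELECT id, blah FROM foo WHERE to_be_deleted != 1;
""")
conn.executemany("INSERT INTO foo (id, blah) VALUES (?, ?)", [(1, 50), (2, 50), (3, 7)])

# The "soft" delete from the question.
conn.execute("UPDATE foo SET to_be_deleted = 1 WHERE blah = 50")

live = conn.execute("SELECT id, blah FROM foo_live ORDER BY id").fetchall()
total = conn.execute("SELECT COUNT(*) FROM foo").fetchone()[0]

# What the cron job would eventually run against the base table.
purged = conn.execute("DELETE FROM foo WHERE to_be_deleted = 1").rowcount
print(live, total, purged)
```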

The pitfall is that it's unnecessarily complicated and someone will inadvertently forget to check the flag in their query. There's also the issue of potentially needing to delete something immediately instead of waiting for the scheduled job to run.

To avoid the to_be_deleted WHERE clause you could create a trigger before the delete command fires off to insert the deleted rows into a separate table. This table could be cleared out when you're sure everything in it really needs to be deleted, or you could keep it around for archive purposes.

You also get a "soft delete" feature, so you can give (certain) end-users the power of "undo". There would have to be a pretty strong downside in the mix to cancel the benefits of soft deleting.

The "WHERE to_be_deleted <> 1" on every other query is a huge one. Another is that once you've run your accidentally rogue query, how will you determine which of the 2,349,325 rows were previously marked as deleted?
I think the practical solution is regular backups, and failing that, perhaps a delete trigger that captures the tuples to be axed.

The other option would be to create a delete trigger on each table. When anything is deleted, it would insert that "to be deleted" record into another table, ideally named TABLENAME_deleted.
The downside would be that the db would have twice as many tables.
I don't recommend triggers in general, but it might be what you are looking for.

This is why, whenever you are editing data by hand, you should BEGIN TRAN, edit your data, check that it looks good (for instance, that you didn't delete more data than you were expecting) and then COMMIT (or ROLLBACK if it doesn't look right). If you're using Postgres then you want to create lots of savepoints as well so that a typo doesn't wipe out your intermediate work.
But that said, in many applications it does make sense to have software mark records as invalid rather than deleting them. Add a last_modified date that is automatically updated, and you are all prepared to set up incremental updates into a data warehouse. Even if you don't have a data warehouse now, it never hurts to prepare for the future when preparing is cheap. Plus in the event of manual mistakes you still have the data, and can just find all of the records that got "deleted" when you made your mistake and fix them. (You should still use transactions though.)

Related

When a SQL DELETE query times out, what happens with the data?

If I were to run a DELETE FROM some_table, and that were to timeout, what happens to the data?
The way I see it, one of two things might happen:
The data is deleted up to the point where the query times out, so if there were 1,000,000 entries in the database and the first 500,000 were deleted, they'd stay deleted. The database now contains half as many as it did before the query was run.
The data is deleted, the query times out, the data is rolled back (I would guess from the logs made by DELETE?). The database now contains the exact same data it started with.
Both seem logical. Would one happen 100% of the time? Or is this dependent on some settings I'm unaware of? Note that I'm not asking about the viability of the DELETE, I realize that TRUNCATE would likely be opportune. This is purely out of curiosity of how timeout functions with DELETE.
Oracle, SQL Server, MySQL, and PostgreSQL all follow the ACID properties. Hence, whenever a DELETE statement times out, it must be rolled back.
You can get an overview of ACID from this link.
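The atomicity part of that answer can be illustrated with a sketch. It does not reproduce a real timeout: as an assumption, a trigger that raises an error partway through the DELETE stands in for the interruption, and SQLite (via Python's sqlite3) stands in for the databases named above. The trigger name fail_midway is made up.

```python
import sqlite3

# Autocommit mode: the failing DELETE below runs as its own transaction.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.executescript("""
CREATE TABLE some_table (id INTEGER PRIMARY KEY);

-- Stand-in for a timeout: abort the statement when it reaches row 500.
CREATE TRIGGER fail_midway BEFORE DELETE ON some_table
WHEN OLD.id = 500
BEGIN
    SELECT RAISE(ABORT, 'simulated failure partway through the DELETE');
END;
""")
conn.executemany("INSERT INTO some_table (id) VALUES (?)", [(i,) for i in range(1, 1001)])

try:
    conn.execute("DELETE FROM some_table")
    failed = False
except sqlite3.DatabaseError:
    failed = True

# The statement is all-or-nothing: rows removed before the failure are back.
remaining = conn.execute("SELECT COUNT(*) FROM some_table").fetchone()[0]
print(failed, remaining)
```

How a particular client library or server setting reports and handles an actual timeout is a separate matter and is worth checking for the database in use.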

Fire SQL Trigger only when a particular user updates the row

There is a trigger in postgres that gets called whenever a particular table is updated.
It is used to send updates to another API.
Is there a way one can control the firing of this trigger?
Sometimes when I update the table I don't want the trigger to be fired. How do I do this?
Is there a silence trigger sql syntax?
If not
Can I fire triggers when a row is updated by PG user X and when PG user Y updates the table no trigger should be fired?
In recent Postgres versions, there is a when clause that you can use to conditionally fire the trigger. You could use it like:
... when (old.* is distinct from new.*) ...
I'm not 100% sure this one will work (can't test atm):
... when (current_user = 'foo') ...
(If not, try placing it in an if block in your plpgsql.)
http://www.postgresql.org/docs/current/static/sql-createtrigger.html
(There also is the [before|after] update of [col_name] syntax, but I tend to find it less useful because it'll fire even if the column's value remains the same.)
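For a feel of what a conditional trigger does, here is a runnable analogue in SQLite (via Python's sqlite3), not Postgres. SQLite also accepts a WHEN clause on triggers, and its IS NOT operator is used here as a NULL-safe inequality on a single column in place of "old.* is distinct from new.*". The tables account and outbox are made up, and the current_user variant is not covered because SQLite has no database users.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE account (id INTEGER PRIMARY KEY, email TEXT);

-- Stand-in for "send the update to another API": queue a row instead.
CREATE TABLE outbox (account_id INTEGER, email TEXT);

-- The trigger body only runs when the WHEN condition is true.
CREATE TRIGGER account_changed AFTER UPDATE ON account
WHEN OLD.email IS NOT NEW.email
BEGIN
    INSERT INTO outbox (account_id, email) VALUES (NEW.id, NEW.email);
END;
""")
conn.execute("INSERT INTO account (id, email) VALUES (1, 'a@example.com')")

# Same value written again: the condition is false, the trigger stays silent.
conn.execute("UPDATE account SET email = 'a@example.com' WHERE id = 1")
# Real change: the trigger fires once.
conn.execute("UPDATE account SET email = 'b@example.com' WHERE id = 1")

outbox = conn.execute("SELECT account_id, email FROM outbox").fetchall()
print(outbox)
```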
Adding this extra note, seeing that @CraigRinger's answer highlights what you're up to...
Trying to set up master-master replication between Salesforce and Postgres using conditional triggers is, I think, a pipe dream. Just forget it... There's going to be a lot more to it than that: you'll need to lock data as appropriate on both ends (which won't necessarily be feasible in a reasonable way), manage the resulting deadlocks (which might not automatically get detected), and deal with conflicting data.
Your odds of successfully pulling this off with a tiny team are about zero -- especially if your Postgres skills are at the level where investing time in reading the manual would answer your own questions. You can safely bet that someone much more competent at Salesforce or some major SQL shop (e.g. like the one Craig works for) considered the same, and either miserably failed or ruled it out.
Moreover, I'd stress that implementing efficient, synchronous, multi-master replication is not a solved problem. You read that right: not solved. Just a few years ago, doing it at all wasn't well solved enough to make it in the Postgres core. So you've no prior art that works well to base your work on and iterate upon.
This seems to be the same problem as this post a few minutes ago, approaching it from a different direction.
If so, while you can indeed do as Denis suggests, don't attempt to reinvent this wheel. Use an established tool like Slony-I or Bucardo if you are attempting two-way (multi-master) replication. You also need to understand the major limitations involved in multi-master when dealing with conflicting updates.
In general, there are a few ways to control trigger firing:
Let the trigger fire, then put logic in the PL/PgSQL trigger body to cause it to take no action if a certain condition is met. This is often the only option when the rules are complex.
As Denis points out, use a trigger WHEN clause to conditionally fire the trigger
Use session_replication_role to control the firing of all triggers
Directly enable/disable triggers.
In particular, if your application shares a single SQL-level user ID for all database access and does its own user management above the SQL level, and you want to control trigger firing on a per-user basis, the only way to do it will be with in-trigger logic. You might find this prior answer about getting user IDs within triggers useful:
Passing user id to PostgreSQL triggers

How to continuously deliver a SQL-based app?

I'm looking to apply continuous delivery concepts to a web app we are building, and wondering if there is any solution for protecting the database from an accidental erroneous commit. For example, a bug that erases a whole table instead of a single record.
How can the impact of this issue be limited according to continuous delivery doctrine, where the application is deployed gradually over segments of infrastructure?
Any ideas?
Well, first, you cannot tell just from looking what is a bad SQL statement. You might have wanted to delete the entire contents of the table. Therefore it is not physically possible to have an automated tool that detects intent.
So to protect your database, first make sure you are in full recovery (not simple) mode and have full backups nightly and transaction log backups every 15 minutes or so. Now you cannot lose much information no matter how badly the process breaks. Your dbas should be trained to be able to recover to a point in time. If you don't have any dbas, I'd suggest the best thing you can do to protect your data is hire some. This is a non-negotiable in any non-trivial database environment and it is terribly risky not to have trained, experienced dbas if your data is critical to the business.
Next, you need to treat SQL like any other code, it should be in source control in scripts. If you are terribly concerned about accidental deletions, then write the scripts for deletes to copy all deletes to a staging table and delete the content of the staging table once a week or so. Enforce this convention in the code reviews. Or better yet set up an auditing process that runs through triggers. Once all records are audited, it is much easier to get back the 150 accidental deletions without having to restore a database. I would never consider having any enterprise application without auditing.
All SQL scripts, without exception, should be code-reviewed just like other code. All SQL scripts should be tested on QA and passed before moving to production. This will greatly reduce the possibility of error. No developer should have write rights to production; only dbas should have that. Therefore each script should be written so that it can just be run, not run one chunk at a time where you could accidentally forget to highlight the WHERE clause. Train your developers to use transactions correctly in the scripts as well.
Your concern is bad data happening to the database. The solution is to use full logging of all transactions so you can back out of transactions that you want to. This would usually be used in a context of full backups/incremental backups/full logging.
SQL Server, for instance, allows you to restore to a point in time (http://msdn.microsoft.com/en-us/library/ms190982(v=sql.105).aspx), assuming you have full logging.
If you are creating and dropping tables, this could be an expensive solution, in terms of space needed for the log. However, it might meet your needs for development.
You may find that full-logging is too expensive for such an application. In that case, you might want to make periodic backups (daily? hourly?) and just keep these around. For this purpose, I've found LightSpeed to be a good product for fast and efficient backups.
One of the strategies that is commonly adopted is to log the incremental SQL statements rather than a collective schema generation, so you can control the change at a much more granular level:
Example:
change 1:
    UP: Add column
    DOWN: Remove column
change 2:
    UP: Add trigger
    DOWN: Remove trigger
Once the changes are captured incrementally like this, you can have a simple but efficient script that upgrades (UP) or downgrades (DOWN) from any version to any version without having to worry about which changes are involved. When the change numbers are linked to builds, it becomes even more effective: when you deploy a build, the database is also automatically upgraded (UP) or downgraded (DOWN) to that specific build.
We have a pipeline app which does that at CloudMunch.
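A toy version of such an UP/DOWN runner, sketched with Python's sqlite3. It is not how any particular product works. The two changes (a table and an index, rather than the column and trigger in the example above) and the use of SQLite's PRAGMA user_version to remember the current version are assumptions made to keep the sketch short.

```python
import sqlite3

# Each change is an (UP, DOWN) pair; list position + 1 is its change number.
MIGRATIONS = [
    ("CREATE TABLE note (id INTEGER PRIMARY KEY, body TEXT)",
     "DROP TABLE note"),
    ("CREATE INDEX note_body_idx ON note (body)",
     "DROP INDEX note_body_idx"),
]

def current_version(conn):
    return conn.execute("PRAGMA user_version").fetchone()[0]

def migrate(conn, target):
    """Move the schema from its current version to `target`, one change at a time."""
    version = current_version(conn)
    while version < target:                        # upgrade: apply UP scripts in order
        conn.execute(MIGRATIONS[version][0])
        version += 1
        conn.execute(f"PRAGMA user_version = {version}")
    while version > target:                        # downgrade: apply DOWN scripts in reverse
        conn.execute(MIGRATIONS[version - 1][1])
        version -= 1
        conn.execute(f"PRAGMA user_version = {version}")

def schema_objects(conn):
    return sorted(name for (name,) in conn.execute("SELECT name FROM sqlite_master"))

conn = sqlite3.connect(":memory:", isolation_level=None)
migrate(conn, 2)
at_v2 = schema_objects(conn)
migrate(conn, 1)
at_v1 = schema_objects(conn)
migrate(conn, 0)
at_v0 = schema_objects(conn)
print(at_v2, at_v1, at_v0)
```

Note that a DOWN script that drops a table or column also drops whatever data it held, so being able to downgrade is not a substitute for backups.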

Versioning data in SQL Server so user can take a certain cut of the data

I have a requirement that in a SQL Server backed website which is essentially a large CRUD application, the user should be able to 'go back in time' and be able to export the data as it was at a given point in time.
My question is what is the best strategy for this problem? Is there a systematic approach I can take and apply it across all tables?
Depending on what exactly you need, this can be relatively easy or hell.
Easy: Make a history table for every table, copy data there pre update or post insert/update (i.e. new stuff is there too). Never delete from the original table, make logical deletes.
Hard: There is an fdb version counting up on every change, every data item is correlated to start and end. This requires very fancy primary key mangling.
Just add a little comment to previous answers. If you need to go back for all users you can use snapshots.
The simplest solution is to save a copy of each row whenever it changes. This can be done most easily with a trigger. Then your UI must provide search abilities to go back and find the data.
This does produce an explosion of data, which gets worse when tables are updated frequently, so the next step is usually some kind of data-based purge of older data.
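A minimal sketch of the copy-on-change idea, using SQLite via Python's sqlite3. The tables item and item_history, the triggers, and the as_of helper are hypothetical. To keep the result deterministic, the "point in time" here is a revision number (rev); a real system would more likely query on the changed_at timestamp.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item (id INTEGER PRIMARY KEY, price INTEGER);

-- One history row per change; rev increases with every change.
CREATE TABLE item_history (
    rev INTEGER PRIMARY KEY AUTOINCREMENT,
    id INTEGER,
    price INTEGER,
    changed_at TEXT DEFAULT (datetime('now'))
);

CREATE TRIGGER item_ins AFTER INSERT ON item
BEGIN
    INSERT INTO item_history (id, price) VALUES (NEW.id, NEW.price);
END;

CREATE TRIGGER item_upd AFTER UPDATE ON item
BEGIN
    INSERT INTO item_history (id, price) VALUES (NEW.id, NEW.price);
END;
""")

def as_of(conn, rev):
    """Return every item as it looked at revision `rev`."""
    return conn.execute("""
        SELECT h.id, h.price
        FROM item_history AS h
        WHERE h.rev = (SELECT MAX(rev) FROM item_history
                       WHERE id = h.id AND rev <= ?)
        ORDER BY h.id
    """, (rev,)).fetchall()

conn.execute("INSERT INTO item (id, price) VALUES (1, 100)")   # rev 1
conn.execute("INSERT INTO item (id, price) VALUES (2, 200)")   # rev 2
conn.execute("UPDATE item SET price = 150 WHERE id = 1")       # rev 3

before_update = as_of(conn, 2)
after_update = as_of(conn, 3)
print(before_update, after_update)
```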
An implementation you could look at is Team Foundation Server. It has the ability to perform historical queries (using the WIQL keyword ASOF). The backend is SQL Server, so there might be some clues there.

How to rollback a database deployment without losing new data?

My company uses virtual machines for our web/app servers. This allows for very easy rollbacks of a deployment if something goes wrong. However, if an app server deployment also requires a database deployment and we have to rollback I'm kind of at a loss. How can you rollback database schema changes without losing data? The only thing that I can think of is to write a script that will drop/revert tables/columns back to their original state. Is this really the best way?
But if you do drop columns then you will lose data since those columns/tables (supposedly) will contain some data. And since I'd assume that any rollbacks often are temporary in that a bug is found, a rollback is made to get it going while that's fixed and then more or less the same changes are re-installed, the users could get quite upset if you lost that data and they had to re-enter it when the system was fixed.
I'd suggest that you only allow additions of tables and columns, no alterations or deletions. Then you can roll back just the code and leave the data as is. If you have a lot of rollbacks you might end up with some unused columns, but it shouldn't happen often that someone adds a table or column by mistake, and in that case the DBA can remove them manually.
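A small sketch of why additive-only changes make a code rollback painless, using Python's sqlite3. The customer table, the phone column, and the two "old code" functions are invented, and the sketch assumes the old code names its columns explicitly rather than relying on SELECT * or positional INSERTs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")

# "Old" application code: names its columns explicitly.
def old_code_insert(conn, name):
    conn.execute("INSERT INTO customer (name) VALUES (?)", (name,))

def old_code_read(conn):
    return conn.execute("SELECT id, name FROM customer ORDER BY id").fetchall()

old_code_insert(conn, "before deployment")

# The deployment only adds a nullable column; nothing is altered or dropped.
conn.execute("ALTER TABLE customer ADD COLUMN phone TEXT")
conn.execute("INSERT INTO customer (name, phone) VALUES (?, ?)",
             ("written by new code", "555-0100"))

# Roll back the code only: the old code keeps working against the newer schema,
# and the data the new code wrote is still there for the next deployment.
old_code_insert(conn, "after rollback")
rows = old_code_read(conn)
phones = [p for (p,) in conn.execute("SELECT phone FROM customer ORDER BY id")]
print(rows, phones)
```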
Generally speaking you can not do this.
However assuming that such a rollback makes sense it implies that the data you are trying to retain is independent from the schema changes you'd like to revert.
One way to deal with it would be to:
backup only data (script),
revert the schema to the old one and
restore the data
The above would work well if schema changes would not invalidate the created script (for example changing number of columns would be tricky).
This question has details on tools available in MS SQL for generating scripts.