How do I securely delete a row from a database? - sql

For compliance reasons, when I delete a user's personal information from the database in my current project, the relevant rows need to be really, irrecoverably deleted.
The database we are using is postgres 8.x,
Is there anything I can do, beyond running COMPACT/VACUUM regularly?
Thankfully, our backups will be held by others, and they are allowed to keep the deleted information.

"Irrecoverable deletion" is harder than it sounds, and extends beyond your database. For example, are you planning on going back to all previous instances of your database on tape/backup where this row also exists, and deleting it there too?
Consider a regular deletion and the periodic VACUUMing that you mentioned before.

To accomplish the "D" in ACID, relational databases use a transaction log type system for changes to the database. When a delete is made that delete is made to a memory copy of the data (buffer cache) and then written to a transaction log file in synchronous mode. If the database were to crash the transaction log would be replayed to bring the system back to the correct state. So a delete exists in multiple locations where it would have to be removed. Only at some later time is the record "deleted" from the actual data file on disk (and any indexes). This amount of time varies depending on the database.

Do you back up your database? - If Yes, make sure you delete it from Back ups too.
Is that because of security risk? In that case, I'd change the data in the row and then delete the row.

Perhaps I'm off on a tangent, but do you really want to delete users like that? Most identity & access management approaches recommend keeping users around but in a flagged-as-deleted state, in order not to lose auditing ability (what has this user been up to in the previous five years)?
Deleting user information might be needed for integrity compliance reasons, or for nefarious black-hat purposes. In neither case is there a deletion method which guarantees that no traces could be left of the user's existence, as has been noted in other posts.
Perhaps you should elaborate as to why such an irrevocable delete is desirable...?

This is not something that you can do on the software side. Its a hardware issue to really delete it you need to physically destroy the drive.

How about overwriting the record with random characters/dates/numbers etc?

Related

Oracle drop user check

DB: Oracle11gR2
OS: Linux
I want to drop USER1 Oracle user which is already locked for few weeks now.
I can run "drop user USER1 cascade;" to drop user but before dropping want to confirm nobody else is using or used objects after user was locked.
How to verify in Oracle that nobody is accessing or have accessed USER1 objects in last month or so?
Is there a db query/view available which we can use to make sure it's safe to run DROP command?
Thanks
Ideally, you would have enabled auditing of accesses on the various objects when you locked the account and left that in place for however long you would need to feel comfortable. A month may be sufficient but there may be quarterly or annual processes as well that you need to consider.
Assuming that you didn't enable auditing at the time and don't want to enable auditing now and wait another month, there are less complete approaches that you may be able to use (with the understanding that those approaches are going to provide less certainty).
You can query v$segment_statistics joined to v$statname to look at a variety of statistics about the table segments. "db block gets" and "consistent gets", for example, would show you how many times some process did a current or a consistent read on a block in a table. But it won't tell you what did the reads-- the background job that gathers statistics, for example, might read the data from the table. Those tables should accumulate data since the database was last restarted which may be significantly longer or shorter than the time period that you're interested in. You can get a list of the available statistics in the Oracle documentation to fine-tune exactly what you want to look for.
You can query dba_hist_seg_stat rather than v$segment_statistics. That will break out the statistics by time period so it will tell you when reads and writes happened. But it won't tell you who did them. That also requires that you be licensed to use the AWR (otherwise querying the table may violate your license and create an issue if you're ever audited).
You can look at dba_dependencies to see if any objects depend on objects owned by the user in question. But that will only work for stored objects (views, procedures, etc.). It won't capture information about SQL statements that are submitted from applications or ad hoc queries issued by users.
If you don't want to enable auditing and wait an appropriate period, you may be better served revoking privileges on the user1 objects from whatever roles/ users have them rather than dropping the objects outright. That way if something blows up from lack of privileges, it's relatively easy to restore the privilege without getting the object(s) back from backup. You could also create a trigger on a permission denied error that told you where the request was coming from.

Should I create separate SQL Server database for each user?

I am working on Asp.Net MVC web application, back-end is SQL Server 2012.
This application will provide billing, accounting, and inventory management. The user will create an account by signup. just like http://www.quickbooks.in. Each user will create some masters and various transactions. There is no limit, user can make unlimited records in the database.
I want to keep stable database performance, after heavy data load. I am maintaining proper indexing and primary keys in it, but there would be a heavy load on the database, per user.
So, should I create a separate database for each user, or should maintain one database with UserID. Add UserID in each table and making a partition based on UserID?
I am not an expert in SQL Server, so please provide suggestions with clear specifications.
Please inform me if there is any lack of information.
A DB per user is what happens when customers need to be able pack up and leave taking the actual database with them. Think of a self hosted wordpress website. Or if there are incredible risks to one user accidentally seeing another user's data, so it's safer to rely on the servers security model than to rely on remembering to add the UserId filter to all your queries. I can't imagine a scenario like that, but who knows-- maybe if the privacy laws allowed for jail time, I would rather data partitioned by security rules rather than carefully writing WHERE clauses.
If you did do user-per-database, creating a new user will be 10x more effort. While INSERT, UPDATE and so on stay the same from version to version, with each upgrade the syntax for database, user creation, permission granting and so on will evolve enough to break those scripts each SQL version upgrade.
Also, this will multiply your migration headaches by the number of users. Let's say you have 5000 users and you need to add some new columns, change a columns data type, update a trigger, and so on. Instead of needing to run that change script 1x, you need to run it 5000 times.
Per user Dbs also probably wastes disk space. Each of those databases is going to have a transaction log, sitting idle taking up the minimum log space.
As for load, if collectively your 5000 users are doing 1 billion inserts, updates and so on per day, my intuition tells me that it's going to be faster on one database, unless there is some sort of contension issue (everyone reading and writing to the same table at the same time and the same pages of the same table). Each database has machine resources (probably threads and memory) per database doing housekeeping, so these extra DBs can't be free.
Anyhow, the best thing to do is to simulate the two architectures and use a random data generator to simulate load and see how they perform.
It's not an easy answer to give.
First, there is logical design to be considered. Then you have integrity, security, management and performance (in this very order).
A database is a logical unit of data, self contained. Ideally, you should be able to take a database, move it to another instance, probably change the connection strings and be running again.
All the constraints are database-level. No foreign keys can exist referencing some object outside the database.
So, try thinking in these terms first.
How would you reliably prevent one user messing up the other user's data? Keep in mind that it's just a matter of time before someone opens an excel sheet and fire up queries on the database bypassing your application. Row level security in SQL Server is something you don't want to deal with.
Multiple databases mean that all management tasks should be scripted out and executed on all databases. Yes, there is some overhead to it, but once you set it up it's just the matter of monitoring. If a database goes suspect, it's a single customer down, not all of them. You can even have different versions for different customes if each customer have it's own database. Additionally, if you roll an upgrade, you can do it per customer, so the inpact will be much less.
Performance is the least relevant factor here. Of course, it really depends on how many customers and how much data, but proper indexing will solve these issues. Scale-out is much easier with multiple databases.
BTW, partitioning, as you mentioned it, is never a performance booster, it's simply a management feature, allowing for faster loading and evicting of data from a table.
I'd probably put each customer in separate database, but it's up to you eventually to make a decision for yourself. Hope I've helped some with this.

question about frequency of updating access

i have a table in an access database
this access database is used on a regular basis, basically from 9-5
someone else has a copy of this exact table. sometimes records are added, sometimes deleted, and sometimes data within the records is updated.
i need to update the access database table with the offsite table every hour or so. what is the best algorithm of updating the data? there are about 5000 records.
would it severely lock up the table for a few seconds every hour?
i would like to publicly apologize for my rude comment to david fenton
My impression is that this question ties together pieces you've been exploring with your previous questions:
a file "listener" to detect the presence of a new file and do something with it when found
list files with some extension in a folder
DoCmd.TransferText to pull file data into your database
Insert, Update, Delete records in a table based on an imported set of records
Maybe it's time to give us a more detailed picture of what you're dealing with.
Tony asked if both sites are on the same WAN (Wide Area Network). You replied they are on Windows. Elsewhere you said you're using a network. Please tell us about the network.
I'm still unsure whether you need a one-way or two-way data exchange. You've talked about importing changes from the remote table into the local master table. Do you need to do the same type of operation at the remote site: import changes made to the table at the master site?
Tell us what needs to happen regarding the issue James raised. Can local and remote users ever edit the same record? If they can, how will you resolve the conflict? Similarly, what should happen if a remote user updates a record and a local user deletes their copy of that record?
Based on what you've told us so far, this sounds like a real challenge for Access, made more challenging by the rate of record changes (5,000 per hour). I like the outline Kevin suggested. However your challenge will be more complicated since you also need to account for record deletions at both sites.
It seems like you may have to create something which duplicates Access' Replication feature. Maybe you should look at the Jet Replication Wiki to see if you can modify your design to take advantage of Replication. I can't help you there, and unfortunately you appear to have frustrated David Fenton who is a leading authority on Jet Replication.
If a few seconds performance is critical, you'd rather move to a better database engine (like Sqlite, MySQL, MS SQL server). If you want a single file, then Sqlite is the best for you. All these use by-single-record locks, so you can read and write simultaneously.
If you stay with access, you will probably have to implement a timer to update only a few records at a time.
Before you do anything else you need to establish the "rules" as far as collisions go.
If a row in the local copy is updated and the same row in the remote copy is updated which one is the "correct" version? Ditto for deletions, inserts are even more of a pain as you can have the "same" set of values but perhaps a different key.
After you have worked out how to handle each of these cases you can then go on to thinking about the implementation.
As other posters have suggested the way to completely avoid these issues is switch to SQLServer or any other "proper" database which can be updated over the network by all users and where concurrency issues are handled by the DBMS when the updates are applied.
Other users have already suggested switching to a server based database i.e. SQL server etc. I would echo this and say it is the best way to go however if you are stuck with access and have no choice then I would suggest you add a field (with an index) along the lines of “Last Updated”. You could then export all records that have been modified within a particular time frame. Export this file as a CSV, ship it over to the remote site and import it into the “master” access database. With a bit of scripting you could automate this process.

How to rollback a database deployment without losing new data?

My company uses virtual machines for our web/app servers. This allows for very easy rollbacks of a deployment if something goes wrong. However, if an app server deployment also requires a database deployment and we have to rollback I'm kind of at a loss. How can you rollback database schema changes without losing data? The only thing that I can think of is to write a script that will drop/revert tables/columns back to their original state. Is this really the best way?
But if you do drop columns then you will lose data since those columns/tables (supposedly) will contain some data. And since I'd assume that any rollbacks often are temporary in that a bug is found, a rollback is made to get it going while that's fixed and then more or less the same changes are re-installed, the users could get quite upset if you lost that data and they had to re-enter it when the system was fixed.
I'd suggest that you should only allow additions of tables and columns, no alterations or deletions, then you can rollback just the code and leave the data as is, if you have a lot of rollbacks you might end up with some unused columns, but that shouldn't happen that often that someone added a table/column by mistake and in that case the DBA can remove them manually.
Generally speaking you can not do this.
However assuming that such a rollback makes sense it implies that the data you are trying to retain is independent from the schema changes you'd like to revert.
One way to deal with it would be to:
backup only data (script),
revert the schema to the old one and
restore the data
The above would work well if schema changes would not invalidate the created script (for example changing number of columns would be tricky).
This question has details on tools available in MS SQL for generating scripts.

Does SQL's DELETE statement truly delete data?

Story: today one of our customers asked us if all the data he deleted in the program was not recoverable.
Aside scheduled backups, we shrink the log file once a day, and we use the DELETE command to remove records inside our tables where needed.
Though, just for the sake of it, I opened the .mdf file with an editor (used PSPad), and searched for a particular unique piece of data -I was sure- was inside one of tables.
Problem: I tracked it in the file, then executed the DELETE command, and it was still there.
Question:
Is there a particular command we are not aware of to delete the records physically form the disk?
Note: we know there are particular techniques to recover lost data from the hard drives, but here I am talking about a notepad-wannabe!
The text may still be there, but SQL Server has no concept of that data having any structure or being available.
The "freed space" is simply deallocated: not removed, compacted or zeroed.
The "Instant File Initialization" feature relies on this too (not zeroing the entire MDF file) and previous disk data is still available eben for a brand new database:
Because the deleted disk content is overwritten only as new data is written to the files, the deleted content might be accessed by an unauthorized principal.
Edit: To reclaim space:
ALTER INDEX...WITH REBUILD is the best way
DBCC SHRINKFILE using NOTRUNCATE can compact pages into gaps caused by deallocated pages, but won't reclaim space in a page for deleted row
SQL Server just marks the space of deleted rows as available, but does not reorganize the database and does not zero out the freed up space. Try to "Shrink" the database, and the deleted rows should no longer be found.
Thanks, gbn, for your correction. A page is the allocation unit of the database, and shrinking a database only eliminates pages, but does not compact them. You'd have to delete all rows in a page in order to see them disappear after shrinking.
If your client is concerned about data security it should use Transparent Database Encryption. Even if you obliterate information from the table, the record is still in the log. Even when log is recycled, the info is still in the backups.
You could update the record with dummy values before issuing the delete, thereby overwriting the data on disk before the database marks it as free. (Whether this also works with LOB fields would warrant investigation, though).
And of course, you'd still have the problem of logs and backups, but I take it you already solved those.