Maintaining consistency when doing logical delete - SQL

I'm performing a logical delete when an item should be deleted from the database.
I have added an additional DateTime column to every table on which we need to perform logical deletes. So when deleting, you just update the field like...
UPDATE Client
SET deleted = GETDATE()
WHERE Client.CID = #cid
Later if it should be recovered then...
UPDATE Client
SET deleted = NULL
WHERE Client.CID = #cid
A typical select statement would then look like...
SELECT *
FROM client
WHERE CID = #cid AND deleted IS NULL
But the problem is how to handle dependencies to maintain the consistency of the database in this approach. For example, before deleting (actually updating) an employee I have to do several checks, such as whether there are any related attendance / bank account / wages / history records, etc., in related tables pertaining to the employee being deleted.
So what's the normal practice for doing such things? Do I need to check everything in
IF EXISTS (SELECT...)
statements?
EDIT:
If I want to prevent the update when it has related records, I could do something like this using UNION...
IF NOT EXISTS (SELECT emp_id FROM BankAccount WHERE emp_id = '100'
               UNION
               SELECT EID FROM Attendance WHERE EID = '100'
               UNION
               SELECT employee_id FROM SalaryTrans WHERE employee_id = '100')
    UPDATE Employee SET Employee.deleted = GETDATE() WHERE emp_id = '100'
Would this be an acceptable solution?

But the problem is how to handle dependencies to maintain the consistency of the database in this approach. For example, before deleting (actually updating) an employee I have to do several checks, such as whether there are any related attendance / bank account / wages / history records in related tables pertaining to the employee being deleted.
So what's the normal practice for doing such things?
It depends entirely on your application.
Some companies might require all the pending wages, accumulated vacation days and sick days, etc., to be "handled" before deleting a person. Handled might mean converting all those things to money, which is added to a final paycheck. Other companies might allow deleting at any time, knowing that a logical delete doesn't affect any of the related rows in other tables. Application code would be expected to know how to deal with cutting a final check to a deleted person.
Other applications might not deal with anything as important as wages and taxes. They might allow a logical delete at any time, and just not worry about the trivial consequences.

Look into triggers; they might be helpful here.
You could define a trigger on your employee table that checked to see if your logical delete would cause problems for other tables. It involves manually keeping track of what tables need access to employees, so it isn't as robust as allowing foreign key constraints to track that for you, but it can work. I'd set it up as an "AFTER UPDATE" trigger and roll back the transaction (within the trigger) if it found another table referencing the employee. They'd get a rollback anyway if they tried to actually delete an employee used in a FK constraint, so that's not that different.
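For illustration, a minimal sketch of such a trigger, borrowing the table and column names from the question (the trigger name and the exact set of referencing tables checked are made up):
-- Hypothetical sketch: refuse a logical delete while related rows exist.
CREATE TRIGGER trg_Employee_BlockLogicalDelete
ON Employee
AFTER UPDATE
AS
BEGIN
    -- react only to rows whose deleted flag just went from NULL to a date
    IF EXISTS (
        SELECT 1
        FROM inserted i
        JOIN deleted d ON d.emp_id = i.emp_id
        WHERE i.deleted IS NOT NULL
          AND d.deleted IS NULL
          AND (   EXISTS (SELECT 1 FROM BankAccount b WHERE b.emp_id = i.emp_id)
               OR EXISTS (SELECT 1 FROM Attendance a WHERE a.EID = i.emp_id)
               OR EXISTS (SELECT 1 FROM SalaryTrans s WHERE s.employee_id = i.emp_id))
    )
    BEGIN
        RAISERROR('Employee has related records; logical delete refused.', 16, 1);
        ROLLBACK TRANSACTION;
    END
END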
Another approach is to use an AFTER DELETE trigger to copy deleted employees to a "deleted_employees" table, that way you're still hanging on to them, but any tables that reference that employee via FK will error and roll back the transaction before your trigger even has a chance to run.
I have to use similar logic to what you proposed (just check every time you use it) in some of my stuff, and mostly I include a bit field "IsDead" that I set when I kill a record, and then I have to reference that EVERY time I use the table. But I mostly build views, because my schema is complex, and it's trivial to include IsDead = 0 in the WHERE clause of the view. I don't know how IsDead = 0 would compare to DelDate IS NULL; if you have a large database you might test that out.
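As a sketch, such a view might look like this (names are hypothetical; swap the predicate for deleted IS NULL if you keep the DateTime column instead):
-- Hypothetical sketch: hide dead rows behind a view so queries stay simple.
CREATE VIEW LiveEmployee
AS
SELECT e.*
FROM Employee e
WHERE e.IsDead = 0;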


Sql Server, Building a Non-cyclical Parent Child relationship with a Self Referencing Foreign Key

Please Note: I'm a Software Developer with limited knowledge of Database Programming/Administration.
I'm trying to build a structure where companies can have parent-child relationships with one another. The basic idea being that you might have different branches of a company operating in different parts of the world which share some data but not all data (Mailing Address, and Local Contacts for instance).
A stripped-down version of the table would look something like this:
CREATE TABLE COMPANY (
COMPANY_ID INT NOT NULL IDENTITY(1,1) PRIMARY KEY,
PARENT_COMPANY INT FOREIGN KEY REFERENCES COMPANY(COMPANY_ID),
NAME NVARCHAR(256)
);
The problem I'm having is that this structure allows cyclical relationships where a parent company could become the child of one of its descendants. The sql below would cause such a situation to occur.
INSERT INTO COMPANY
(COMPANY_ID, PARENT_COMPANY, NAME)
VALUES
(1, null, 'Company1'),
(2, 1, 'Company2'),
(3, 2, 'Company3');
UPDATE COMPANY
SET PARENT_COMPANY = 3
WHERE COMPANY_ID = 1;
Because this kind of relationship could cause infinite loops, I want to prevent the situation from ever occurring.
The best idea I could come up with was run a trigger on the COMPANY table, that would check to make sure any value updated in the PARENT_COMPANY column didn't cause a cyclical relationship. However Sql Server doesn't have a BEFORE UPDATE trigger; only AFTER and INSTEAD OF triggers, both of which run after the table has already been updated. This means the cyclical relationship would already be created before I got a chance to check for it.
At this point I could potentially rebuild the "before" version of the table from the trigger's DELETED temporary table and the Company table itself, then search that table for a cyclical relationship; but that seems very cumbersome and inefficient.
Is there any other way I could check for cyclical relationships in a self-referencing Sql Server structure?
PS. I was planning on using something like this to search for a cyclical relationship in the trigger:
DECLARE @CompanyId int = 1 -- ID of company that's been changed
;WITH cte AS
(
SELECT a.COMPANY_ID, a.PARENT_COMPANY, a.NAME
FROM COMPANY a
WHERE COMPANY_ID = @CompanyId
UNION ALL
SELECT a.COMPANY_ID, a.PARENT_COMPANY, a.NAME
FROM COMPANY a JOIN cte c ON a.PARENT_COMPANY = c.COMPANY_ID
)
SELECT COMPANY_ID, PARENT_COMPANY, NAME
FROM cte
ORDER BY NAME
First of all I want to correct some of your statements:
AFTER triggers:
A trigger is part of the batch operation that caused it to fire (e.g. an UPDATE), and therefore the record does not become committed (and visible) unless the trigger succeeds.
.... an AFTER trigger fires before the implicit transaction is
committed. A rollback in a trigger will roll back the statement that
fired the trigger and abort the entire batch as well
https://social.msdn.microsoft.com/Forums/sqlserver/en-US/73862414-d770-46bb-97fb-249a3fb38680/does-a-insert-trigger-fire-when-the-insert-is-committed?forum=transactsql
The only way to bypass this is to run SELECT queries with explicit READ UNCOMMITTED hints, which is generally never done, except for very specific purposes.
INSTEAD OF triggers:
They run, as the name implies, instead of the regular insert/update operation. In other words, if an INSTEAD OF trigger is defined for an operation (e.g. UPDATE), it overrides that operation's behaviour.
See more: https://technet.microsoft.com/en-us/library/ms179288(v=sql.105).aspx
Summary:
In your case I would recommend creating a simple AFTER UPDATE trigger in which you check your data and, if problems are found, throw a meaningful error and execute a ROLLBACK TRANSACTION to "cancel" the change. This way you achieve behaviour similar to how constraints etc. work.
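A sketch of what that trigger could look like, reusing your recursive CTE idea (the trigger name is invented; note that with the default MAXRECURSION limit of 100, very deep but legitimate hierarchies would need extra handling):
-- Hypothetical sketch: detect a parent/child cycle after the change and roll back.
CREATE TRIGGER trg_Company_NoCycles
ON COMPANY
AFTER INSERT, UPDATE
AS
BEGIN
    DECLARE @cycles int;

    WITH chain AS
    (
        -- start from each changed row and walk up through its ancestors
        SELECT i.COMPANY_ID AS start_id, i.PARENT_COMPANY AS ancestor
        FROM inserted i
        WHERE i.PARENT_COMPANY IS NOT NULL
        UNION ALL
        -- stop climbing once we reach a root or arrive back at the start
        SELECT c.start_id, p.PARENT_COMPANY
        FROM chain c
        JOIN COMPANY p ON p.COMPANY_ID = c.ancestor
        WHERE p.PARENT_COMPANY IS NOT NULL
          AND c.ancestor <> c.start_id
    )
    SELECT @cycles = COUNT(*)
    FROM chain
    WHERE ancestor = start_id;

    IF @cycles > 0
    BEGIN
        RAISERROR('Update would create a cyclical parent/child relationship.', 16, 1);
        ROLLBACK TRANSACTION;
    END
END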
I suggest you also take a look at this comprehensive post about storing hierarchical relationships in a database: What are the options for storing hierarchical data in a relational database?

Re-assigning IDs in a non-IDENTITY type field in SQL Server database

WARNING: This tale of woe contains examples of code smells, poor design decisions, and technical debt.
If you are conversant with SOLID principles, practice TDD and unit test your work, DO NOT READ ON. Unless you want a good giggle at someone's misfortune and gloat in your own awesomeness knowing that you would never leave behind such a monumental pile of crap for your successors.
So, if you're sitting comfortably then I'll begin.
In this app that I have inherited and been supporting and bug fixing for the last 7 months I have been left with a DOOZY of a balls up by a developer that left 6 and a half months ago. Yes, 2 weeks after I started.
Anyway. In this app we have clients, employees and visits tables.
There is also a table called AppNewRef (or something similar) that ... wait for it ... contains the next record ID to use for each of the other tables. So, may contain data such as :-
TypeID  Description  NextRef
1       Employees    804
2       Clients      1708
3       Visits       56783
When the application creates new rows for Employees, it looks in the AppNewRef table, gets the value, uses that value for the ID, and then updates the NextRef column. Same thing for Clients and Visits, and all the other tables whose next ID is stored in here.
Yes, I know, no auto-numbering IDENTITY columns on this database. All under the excuse of "when it was an Access app". These ID's are held in the (VB6) code as longs. So, up to 2 billion 147 million records possible. OK, that seems to work fairly well. (apart from the fact that the app is updating and taking care of locking / updating, etc., and not the database)
So, our users are quite happily creating Employees, Clients, Visits etc. The Visits ID is steady increasing a few dozen at a time. Then the problems happen. Our clients are causing database corruptions while creating batches of visits because the server is beavering away nicely, and the app becomes unresponsive. So they kill the app using task manager instead of being patient and waiting. Granted the app does seem to lock up.
Roll on to earlier this year and developer Tim (real name. No protecting the guilty here) starts to modify the code to do the batch updates in stages, so that the UI remains 'responsive'. Then April comes along, and he's working his notice (you can picture the scene now, can't you ?) and he's beavering away to finish the updates.
End of April, and beginning of May we update some of our clients. Over the next few months we update more and more of them.
Unseen by Tim (real name, remember) and me (who started two weeks before Tim left) and the other new developer that started a week after, the ID's in the visit table start to take huge leaps upwards. By huge, I mean 10000, 20000, 30000 at a time. Sometimes a few hundred thousand.
Here's a graph that illustrates the rapid increase in IDs used.
Roll on November. Customer phones Tech Support and reports that he's getting an error. I look at the error message and ask for the database so I can debug the code. I find that the value is too large for a long. I do some queries, pull the information, drop it into Excel and graph it.
I don't think making the code handle anything longer than a long for the ID's is the right approach, as this app passes that ID into other DLL's and OCX's and breaking the interface on those just seems like a whole world of hurt that I don't want to encounter right now.
One potential idea that I'm investigating is to try to modify the IDs to bring them back down to lower values, essentially filling the gaps, using the ROW_NUMBER function.
What I'm thinking of doing is adding a new column to each of the tables that have a Foreign Key reference to these Visit ID's (not a proper foreign key mind, those constraints don't exist in this database). This new column could store the old (current) value of the Visit ID (oh, just to confuse things; on some tables it's called EventID, and on some it's called VisitID).
Then, for each of the other tables that refer to that VisitID, update to the new value.
Ideas ? Suggestions ? Snippets of T-SQL to help all gratefully received.
Option one:
Explicitly constrain all of your foreign key relationships, and set them to be ON UPDATE CASCADE.
This will mean that whenever you change the ID, the foreign keys will automatically be updated.
Then you just run something like this...
WITH
resequenced AS
(
SELECT
ROW_NUMBER() OVER (ORDER BY id) AS newID,
*
FROM
yourTable
)
UPDATE
resequenced
SET
id = newID
I haven't done this in ages, so I forget if it causes problems mid-update by having two records with the same id value. If it does, you could do something like this first...
UPDATE yourTable SET id = -id
Option two:
Ensure that none of your foreign key relationships are explicitly defined. If they are, note them down and remove them.
Then do something like...
CREATE TABLE temp (
newID INT IDENTITY (1,1),
oldID INT
)
INSERT INTO temp (oldID) SELECT id FROM yourTable
/* Do this once for the table you are re-identifying */
/* Repeat this for all fact tables holding that ID as a foreign key */
UPDATE
factTable
SET
foreignID = temp.newID
FROM
temp
WHERE
foreignID = temp.oldID
Then re-apply any existing foreign key relationships.
This is a pretty scary option. If you forget to update a table, you just borked your data. But, you can give that temp table a much nicer name and KEEP it.
Good luck. And may the lord have mercy on your soul. And Tim's if you ever meet him in a dark alley.
I would create a numbers table that just holds a sequence from 1, in increments of 1, up to the maximum value of a long, and then change the logic that gets the max ID for VisitID (and maybe the others) to do a right join between the numbers table and the visits table. Then you can just look for the min of the numbers not yet taken:
select min(number) from visits right join numbers on visits.id = numbers.number where visits.id is null
That way you get all the gaps filled in without having to change any of the other tables.
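In case it helps, here's one way you might populate that numbers table (a sketch; the range and names are arbitrary, and a cross-join tally would be faster for very large ranges):
-- Hypothetical sketch: build a numbers table holding 1..1000000.
CREATE TABLE numbers (number int NOT NULL PRIMARY KEY);

WITH seq AS
(
    SELECT 1 AS n
    UNION ALL
    SELECT n + 1 FROM seq WHERE n < 1000000
)
INSERT INTO numbers (number)
SELECT n FROM seq
OPTION (MAXRECURSION 0);  -- lift the default recursion limit of 100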
But I would just redo the whole database.

Is it sensible to have a table that does not reference any other in a database design?

I'd like to get some advice on database design. Specifically, consider the following (hypothetical) scenario:
Employees - table holding all employee details
Users - table holding employees that have username and password to access software
UserLog - table to track when users log in and log out, and to calculate time on software
In this scenario, if an employee leaves the company I also want to make sure I delete them from the Users table so that they can no longer access the software. I can achieve this using ON DELETE CASCADE as part of the FK relationship between EmployeeID in Employees and Users.
However, I don't want to delete their details from the UserLog as I am interested in collating data on how long people spend on the software and the fact that they no longer work at the company does not mean their user behaviour is no longer relevant.
What I am left with is a table UserLog that has no relationships with any other tables in my database. Is this a sensible idea?
Having looked through books etc / googled online I haven't come across any DB schemas with tables that have no relationships with others and so my gut instinct here is saying that my approach is not robust...
I'd appreciate some guidance please.
My personal preference in this case would be to "soft delete" an employee by adding a "DeletedDate" column to the Employees table. This will allow you to maintain referential integrity with your UserLog table and all details for all employees, past and present, remain available in the database.
The downside to this approach is that you need to add application logic to check for active employees.
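A minimal sketch of what that might look like (the table and view names are assumed, not from the question):
-- Hypothetical sketch: soft-delete column plus a view for the common case.
ALTER TABLE Employees ADD DeletedDate datetime NULL;
GO
-- most application queries can then target active employees only
CREATE VIEW ActiveEmployees
AS
SELECT *
FROM Employees
WHERE DeletedDate IS NULL;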
Yes, this is perfectly sensible. The log is just a raw audit of data that should never change. It doesn't need to be normalized (and shouldn't be) and/or linked to other tables.
Ideally, I would put write-heavy audit logging in a different database entirely than the read-heavy transactional day-to-day stuff. They may grow differently over time. But starting small it's fine to keep them in the same database as long as you understand the fundamental differences between them.
On a side note, I would recommend not deleting the users from the tables. Maybe have some kind of IsActive or IsDeleted bit on them that would effectively blind them from the application, but deleting should be avoided if possible.
The problem you have here is that it's perfectly possible to insert UserLog data for users that have never existed as there's no link to the table that defines valid users.
I would say that perhaps the better course of action would be to mark the users as invalid and remove all their personal details when they leave rather than delete the record entirely.
That's not to say there aren't situations where it is valid to have a table (or tables) on the database that don't reference others.
Is this a sensible idea?
The problem is this. Since the data isn't linked, you can delete something from the Employees table and still have references to it in the UserLog. After the employee information is deleted, you have no way of knowing what the log data ties back to. Is this OK? Technically, yes. There is nothing preventing you from doing it, but then why are you keeping the data in the first place? You also have no guarantee that the data in the table is actually about an employee. Someone could accidentally enter a wrong EmployeeID that doesn't belong to anyone. Keys help prevent data corruption. It's always better to have extra data than it is to have bad data.
What I've found is that you never want to delete data when possible. Space is cheap, and you can add flags etc. to show that a record isn't active. Yes, this does cause more work (quickly remedied by creating a view which only shows active employees), and saying that you should never delete data is far-fetched; but once you start linking data together, deleting becomes very difficult. If you are leaving out a FK just so you can delete records, it's a telltale sign you need to rethink your strategy.
Relying on cascade delete can be very dangerous too. The model you are describing means that any time you don't want data deleted, you have to know not to add a FK linking that table back to Users. It doesn't take long for someone to forget this.
What you can do is use logical deletion or disabling a user by adding a bool value Deleted or Disabled to the Users table.
Or replace the EmployeeId with the name of the employee in the UserLog.
An alternative to using the soft delete process, is to store all the historical details you would want about the user at the time the log record is created rather than store the employee id. So you might have username, logintime, logouttime, sessionlength in your table.
Sensible? Sure, as in it makes sense as you've described your need to keep those users indefinitely. The problem you'll run into is maintaining the tables. Instead of doing a cascading update once, you'll have to use at least two updates in order to insert a new user.
I think a table as you are suggesting is perfectly fine. I frequently encounter log tables that do not have explicit relationships with other tables. Just because a database is "relational" doesn't mean everything has to relate haha.
One thing that I do notice though is that you are using EmployeeID in the log, but not using it as a foreign key to your Employee table. I understand why you don't want that, since you will be dropping employees. But, if you are dropping them completely, then the EmployeeID column is meaningless.
A solution to this would be to keep a flag for employees, such as active, that tracks if they are active or not. That way, the log data is meaningful.
IANADBA, but it's generally considered very bad practice indeed to delete almost anything from a DB, ever. It would be far better here to have some kind of locked flag / "deleted" datestamp on your Users table and preserve your FK.

Fixing DB Inconsistencies - ID Fields

I've inherited a (Microsoft?) SQL database that wasn't very pristine in its original state. There are still some very strange things in it that I'm trying to fix - one of them is inconsistent ID entries.
In the accounts table, each entry has a number called accountID, which is referenced in several other tables (notes, equipment, etc.). The problem is that the numbers, for some random reason, range from about -100000 to +2000000 when there are only about 7000 entries.
Is there any good way to re-number them while changing the corresponding numbers in the other tables? At my disposal I also have ColdFusion, so anything that works with SQL and/or ColdFusion I'll accept.
Surrogate keys are meant to be meaningless, so unless you actually have a database integrity issue (like there being no properly defined foreign key constraints) or your identity is approaching the maximum for its datatype, I would leave them alone and go after some other low-hanging fruit that would have more impact.
In this instance, it sounds like "why" is a better question than "how". The OP notes that there is a strange problem that needs to be fixed but doesn't say why it is a problem. Is it causing problems? What positive impact would changing these numbers have? Unless you originally programmed the system and understand precisely why the numbers are in their current state, you are taking quite a risk making changes like this.
I would talk to an accountant (or at least your financial people) before messing in any way with the numbers in the accounts tables if this is a financial app. The table of accounts is very critical to how finances are reported. These IDs may have meaning you don't understand. No one puts in a negative ID unless they had a reason. I would under no circumstances change that unless I understood why it was negative to begin with. You could truly screw up your tax reporting or some other thing by making an unneeded change.
You could probably disable the foreign key relationships (if you're able to take it offline temporarily) and then update the primary keys using a script. I've used this update script before to change values, and you could pretty easily wrap this code in a cursor to go through the key values in question, one by one, and update the arbitrary value to an incrementing value you're keeping track of.
Check out the script here: http://vyaskn.tripod.com/sql_server_search_and_replace.htm
If you just have a list of tables that use the primary key, you could set up a series of UPDATE statements that run inside your cursor, and then you wouldn't need to use this script (which can be a little slow).
It's worth asking, though, why these values appear out of whack. Does this database have values added and deleted constantly? Are the primary key values really arbitrary, or do they just appear to be, but actually have meaning? Though I'm all for consolidating, you'd have to ensure that there's no purpose to those values.
With ColdFusion this shouldn't be a herculean task, but it will be messy and you'll have to be careful. One method you could use would be to script the database and then generate a brand new, blank table schema. Set the accountID as an identity field in the new database.
Then, using ColdFusion, write a query that will pull all of the old account data and insert it into the new database one row at a time. For each row, let the new database assign a new ID. After each insert, pull the new ID (using either @@IDENTITY or MAX(accountID)) and store the new ID and the old ID together in a temporary table so you know which old IDs belong to which new IDs.
Next, repeat the process with each of the child tables. For each old ID, pull its child entries and re-insert them into the new database using the new IDs. If the primary keys on the child tables are fine, you can insert them as-is or let the server assign new ones if they don't matter.
Assigning new IDs in place by disabling relationships temporarily may work, but you might run into conflicts if one of the entries is assigned an ID that is already in use by the old data.
Create a new column in the accounts table for your new ID, and new column in each of your related tables to reference the new ID column.
ALTER TABLE accounts
ADD new_accountID int IDENTITY
ALTER TABLE notes
ADD new_accountID int
ALTER TABLE equipment
ADD new_accountID int
Then you can map the new_accountID column on each of your referencing tables to the accounts table.
UPDATE notes
SET new_accountID = accounts.new_accountID
FROM accounts
INNER JOIN notes ON (notes.accountID = accounts.accountID)
UPDATE equipment
SET new_accountID = accounts.new_accountID
FROM accounts
INNER JOIN equipment ON (equipment.accountID = accounts.accountID)
At this point, each table has both accountID with the old keys, and new_accountID with the new keys. From here it should be pretty straightforward.
Break all of the foreign keys on accountID.
On each table, UPDATE [table] SET accountID = new_accountID.
Re-add the foreign keys for accountID.
Drop new_accountID from all of the tables, as it's no longer needed.

Database Design for Revisions?

We have a requirement in a project to store all revisions (change history) for the entities in the database. Currently we have 2 design proposals for this:
e.g. for "Employee" Entity
Design 1:
-- Holds Employee Entity
"Employees (EmployeeId, FirstName, LastName, DepartmentId, .., ..)"
-- Holds the Employee Revisions in Xml. The RevisionXML will contain
-- all data of that particular EmployeeId
"EmployeeHistories (EmployeeId, DateModified, RevisionXML)"
Design 2:
-- Holds Employee Entity
"Employees (EmployeeId, FirstName, LastName, DepartmentId, .., ..)"
-- In this approach we have basically duplicated all the fields on Employees
-- in the EmployeeHistories and storing the revision data.
"EmployeeHistories (EmployeeId, RevisionId, DateModified, FirstName,
LastName, DepartmentId, .., ..)"
Is there any other way of doing this thing?
The problem with "Design 1" is that we have to parse the XML each time we need to access the data. This will slow the process and also adds limitations, like not being able to join on the revision data fields.
And the problem with "Design 2" is that we have to duplicate each and every field on all entities (we have around 70-80 entities for which we want to maintain revisions).
I think the key question to ask here is 'Who / What is going to be using the history'?
If it's going to be mostly for reporting / human readable history, we've implemented this scheme in the past...
Create a table called 'AuditTrail' or something that has the following fields...
[ID] [int] IDENTITY(1,1) NOT NULL,
[UserID] [int] NULL,
[EventDate] [datetime] NOT NULL,
[TableName] [varchar](50) NOT NULL,
[RecordID] [varchar](20) NOT NULL,
[FieldName] [varchar](50) NULL,
[OldValue] [varchar](5000) NULL,
[NewValue] [varchar](5000) NULL
You can then add a 'LastUpdatedByUserID' column to all of your tables which should be set every time you do an update / insert on the table.
You can then add a trigger to every table to catch any insert / update that happens and creates an entry in this table for each field that's changed. Because the table is also being supplied with the 'LastUpdateByUserID' for each update / insert, you can access this value in the trigger and use it when adding to the audit table.
We use the RecordID field to store the value of the key field of the table being updated. If it's a combined key, we just do a string concatenation with a '~' between the fields.
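As an illustration, a trigger along these lines for a single audited column might look like this (the Employees table, its columns, and the trigger name are hypothetical; in practice the per-column INSERTs would be repeated or generated, as described below):
-- Hypothetical sketch: per-field audit trigger covering the LastName column only.
CREATE TRIGGER trg_Employees_Audit
ON Employees
AFTER INSERT, UPDATE
AS
BEGIN
    INSERT INTO AuditTrail (UserID, EventDate, TableName, RecordID, FieldName, OldValue, NewValue)
    SELECT i.LastUpdatedByUserID, GETDATE(), 'Employees',
           CONVERT(varchar(20), i.EmployeeId), 'LastName',
           d.LastName, i.LastName
    FROM inserted i
    LEFT JOIN deleted d ON d.EmployeeId = i.EmployeeId   -- no deleted row on INSERT
    WHERE ISNULL(d.LastName, '') <> ISNULL(i.LastName, '');
END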
I'm sure this system may have drawbacks - for heavily updated databases the performance may be hit, but for my web-app, we get many more reads than writes and it seems to be performing pretty well. We even wrote a little VB.NET utility to automatically write the triggers based on the table definitions.
Just a thought!
Do not put it all in one table with an IsCurrent discriminator attribute. This just causes problems down the line: it requires surrogate keys and brings all sorts of other problems.
Design 2 does have problems with schema changes. If you change the Employees table, you have to change the EmployeeHistories table and all the related sprocs that go with it. Potentially doubles your schema change effort.
Design 1 works well and, if done properly, does not cost much in terms of a performance hit. You could use an XML schema and even indexes to get over possible performance problems. Your comment about parsing the XML is valid, but you could easily create a view using XQuery - which you can include in queries and join to. Something like this...
CREATE VIEW EmployeeHistory
AS
SELECT EmployeeId,
       RevisionXML.value('(/employee/FirstName)[1]', 'varchar(50)') AS FirstName,
       RevisionXML.value('(/employee/LastName)[1]', 'varchar(100)') AS LastName,
       RevisionXML.value('(/employee/DepartmentId)[1]', 'int') AS DepartmentId
FROM EmployeeHistories
The History Tables article in the Database Programmer blog might be useful - covers some of the points raised here and discusses the storage of deltas.
Edit
In the History Tables essay, the author (Kenneth Downs), recommends maintaining a history table of at least seven columns:
Timestamp of the change,
User that made the change,
A token to identify the record that was changed (where the history is maintained separately from the current state),
Whether the change was an insert, update, or delete,
The old value,
The new value,
The delta (for changes to numerical values).
Columns which never change, or whose history is not required, should not be tracked in the history table to avoid bloat. Storing the delta for numerical values can make subsequent queries easier, even though it can be derived from the old and new values.
The history table must be secure, with non-system users prevented from inserting, updating or deleting rows. Only periodic purging should be supported to reduce overall size (and if permitted by the use case).
Avoid Design 1; it is not very handy once you need to, for example, roll back to old versions of the records - either automatically or "manually" using an administrator's console.
I don't really see disadvantages of Design 2. I think the second, History table should contain all columns present in the first, Records table. E.g. in MySQL you can easily create a table with the same structure as another table (CREATE TABLE X LIKE Y). And when you are about to change the structure of the Records table in your live database, you have to use ALTER TABLE commands anyway - and it's no big effort to run those commands on your History table as well.
Notes
Records table contains only the latest revision;
History table contains all previous revisions of records in Records table;
History table's primary key is the primary key of the Records table with an added RevisionId column;
Think about additional auxiliary fields like ModifiedBy - the user who created a particular revision. You may also want a DeletedBy field to track who deleted a particular revision.
Think about what DateModified should mean - either it means when this particular revision was created, or when this particular revision was replaced by another one. The former requires the field to also be in the Records table and seems more intuitive at first sight; the latter however seems more practical for deleted records (the date when this particular revision was deleted). If you go for the first solution, you would probably need a second field DateDeleted (only if you need it, of course). It depends on you and what you actually want to record.
Operations in Design 2 are very trivial:
Modify
copy the record from Records table to History table, give it new RevisionId (if it is not already present in Records table), handle DateModified (depends on how you interpret it, see notes above)
go on with normal update of the record in Records table
Delete
do exactly the same as in the first step of Modify operation. Handle DateModified/DateDeleted accordingly, depending on the interpretation you have chosen.
Undelete (or rollback)
take highest (or some particular?) revision from History table and copy it to the Records table
List revision history for particular record
select from History table and Records table
think what exactly you expect from this operation; it will probably determine what information you require from DateModified/DateDeleted fields (see notes above)
If you go for Design 2, all the SQL commands needed to do this will be very, very easy, as will maintenance! Maybe it will be much, much easier if you use the auxiliary columns (RevisionId, DateModified) in the Records table too - keeping both tables at exactly the same structure (except for unique keys)! This will allow for simple SQL commands that are tolerant to any data structure change:
insert into EmployeeHistory select * from Employee where ID = XX
Don't forget to use transactions!
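For example, the Modify operation could be as small as this (a sketch assuming both tables share the same structure, with @EmployeeId and @FirstName standing in for your parameters):
BEGIN TRANSACTION;

-- 1) archive the current row before touching it
INSERT INTO EmployeeHistories
SELECT * FROM Employees WHERE EmployeeId = @EmployeeId;

-- 2) go on with the normal update
UPDATE Employees
SET FirstName = @FirstName
WHERE EmployeeId = @EmployeeId;

COMMIT TRANSACTION;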
As for scaling, this solution is very efficient, since you don't transform any data from XML back and forth - you just copy whole table rows. Very simple queries, using indices - very efficient!
We have implemented a solution very similar to the solution that Chris Roberts suggests, and that works pretty well for us.
The only difference is that we store only the new value. The old value is, after all, stored in the previous history row:
[ID] [int] IDENTITY(1,1) NOT NULL,
[UserID] [int] NULL,
[EventDate] [datetime] NOT NULL,
[TableName] [varchar](50) NOT NULL,
[RecordID] [varchar](20) NOT NULL,
[FieldName] [varchar](50) NULL,
[NewValue] [varchar](5000) NULL
Let's say you have a table with 20 columns. This way you only have to store the exact column that changed, instead of having to store the entire row.
If you have to store history, make a shadow table with the same schema as the table you are tracking and a 'Revision Date' and 'Revision Type' column (e.g. 'delete', 'update'). Write (or generate - see below) a set of triggers to populate the audit table.
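A sketch of the shadow table and its trigger for a hypothetical Employees table (names and columns are illustrative only):
-- Hypothetical sketch: shadow table plus one trigger to populate it.
CREATE TABLE EmployeesShadow
(
    EmployeeId   int          NOT NULL,
    FirstName    nvarchar(50) NULL,
    LastName     nvarchar(50) NULL,
    RevisionDate datetime     NOT NULL DEFAULT GETDATE(),
    RevisionType varchar(10)  NOT NULL   -- 'update' or 'delete'
);
GO
CREATE TRIGGER trg_Employees_Shadow
ON Employees
AFTER UPDATE, DELETE
AS
BEGIN
    -- the deleted pseudo-table holds the pre-change rows in both cases
    INSERT INTO EmployeesShadow (EmployeeId, FirstName, LastName, RevisionDate, RevisionType)
    SELECT d.EmployeeId, d.FirstName, d.LastName, GETDATE(),
           CASE WHEN EXISTS (SELECT 1 FROM inserted) THEN 'update' ELSE 'delete' END
    FROM deleted d;
END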
It's fairly straightforward to make a tool that will read the system data dictionary for a table and generate a script that creates the shadow table and a set of triggers to populate it.
Don't try to use XML for this; XML storage is a lot less efficient than the native database table storage this type of trigger uses.
Ramesh, I was involved in the development of a system based on the first approach.
It turned out that storing revisions as XML leads to huge database growth and slows things down significantly.
My approach would be to have one table per entity:
Employee (Id, Name, ... , IsActive)
where IsActive is a sign of the latest version
If you want to associate some additional info with revisions, you can create a separate table containing that info and link it to the entity tables using a PK/FK relation.
This way you can store all version of employees in one table.
Pros of this approach:
Simple data base structure
No conflicts since table becomes append-only
You can rollback to previous version by simply changing IsActive flag
No need for joins to get object history
Note that you should allow the primary key to be non-unique.
The way that I've seen this done in the past is have
Employees (EmployeeId, DateModified, < Employee Fields > , boolean isCurrent );
You never "update" on this table (except to change the valid of isCurrent), just insert new rows. For any given EmployeeId, only 1 row can have isCurrent == 1.
The complexity of maintaining this can be hidden by views and "instead of" triggers (in Oracle; I presume similar things exist in other RDBMSs). You can even go to materialized views if the tables are too big and can't be handled by indexes.
This method is ok, but you can end up with some complex queries.
Personally, I'm pretty fond of your Design 2 way of doing it, which is how I've done it in the past as well. It's simple to understand, simple to implement and simple to maintain.
It also creates very little overhead for the database and application, especially when performing read queries, which is likely what you'll be doing 99% of the time.
It would also be quite easy to automate the creation of the history tables and the triggers that maintain them (assuming it would be done via triggers).
Revisions of data is an aspect of the 'valid-time' concept of a Temporal Database. Much research has gone into this, and many patterns and guidelines have emerged. I wrote a lengthy reply with a bunch of references to this question for those interested.
I'm going to share my design with you, and it's different from both of your designs in that it requires one table per entity type. I found the best way to describe any database design is through an ERD; here's mine:
In this example we have an entity named employee. The user table holds your users' records, and entity and entity_revision are two tables which hold revision history for all the entity types that you will have in your system. Here's how this design works:
The two fields of entity_id and revision_id
Each entity in your system will have a unique entity ID of its own. Your entity might go through revisions, but its entity_id will remain the same. You need to keep this entity ID in your employee table (as a foreign key). You should also store the type of your entity in the entity table (e.g. 'employee'). Now as for the revision_id: as its name shows, it keeps track of your entity revisions. The best way I found for this is to use the employee_id as your revision_id. This means you will have duplicate revision IDs for different types of entities, but this is not a problem for me (I'm not sure about your case). The only important note is that the combination of entity_id and revision_id must be unique.
There's also a state field within the entity_revision table which indicates the state of the revision. It can have one of three states: latest, obsolete or deleted (not relying on the date of revisions helps a great deal to boost your queries).
One last note on revision_id: I didn't create a foreign key connecting employee_id to revision_id, because we don't want to alter the entity_revision table for each entity type that we might add in future.
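To make the design concrete, here is a rough sketch of the three tables (column names and types are my assumptions from the description above):
-- Hypothetical sketch of the three tables in this design.
CREATE TABLE entity
(
    entity_id   int         NOT NULL PRIMARY KEY,
    entity_type varchar(50) NOT NULL              -- e.g. 'employee'
);

CREATE TABLE entity_revision
(
    entity_id   int         NOT NULL REFERENCES entity (entity_id),
    revision_id int         NOT NULL,             -- the employee_id of that version; deliberately no FK
    state       varchar(10) NOT NULL,             -- 'latest', 'obsolete' or 'deleted'
    created_by  int         NOT NULL,             -- points at your user table
    created_at  datetime    NOT NULL,
    PRIMARY KEY (entity_id, revision_id)          -- the combination must be unique
);

CREATE TABLE employee
(
    employee_id int           NOT NULL PRIMARY KEY,
    entity_id   int           NOT NULL REFERENCES entity (entity_id),
    name        nvarchar(100) NULL
);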
INSERTION
For each employee that you want to insert into database, you will also add a record to entity and entity_revision. These last two records will help you keep track of by whom and when a record has been inserted into database.
UPDATE
Each update for an existing employee record will be implemented as two inserts, one in employee table and one in entity_revision. The second one will help you to know by whom and when the record has been updated.
DELETION
For deleting an employee, a record is inserted into entity_revision stating the deletion and done.
As you can see in this design no data is ever altered or removed from database and more importantly each entity type requires only one table. Personally I find this design really flexible and easy to work with. But I'm not sure about you as your needs might be different.
[UPDATE]
Now that the newer MySQL versions support partitions, I believe my design also comes with one of the best performances. The entity table can be partitioned using the type field, while entity_revision can be partitioned using its state field. This will boost SELECT queries by far while keeping the design simple and clean.
If indeed an audit trail is all you need, I'd lean toward the audit table solution (complete with denormalized copies of the important column on other tables, e.g., UserName). Keep in mind, though, that bitter experience indicates that a single audit table will be a huge bottleneck down the road; it's probably worth the effort to create individual audit tables for all your audited tables.
If you need to track the actual historical (and/or future) versions, then the standard solution is to track the same entity with multiple rows using some combination of start, end, and duration values. You can use a view to make accessing current values convenient. If this is the approach you take, you can run into problems if your versioned data references mutable but unversioned data.
If you want to go with the first design, you might want to use XML for the Employees table too. Most newer databases allow you to query into XML fields, so this is not always a problem. And it might be simpler to have one way to access employee data, regardless of whether it's the latest version or an earlier one.
I would try the second approach though. You could simplify this by having just one Employees table with a DateModified field. The EmployeeId + DateModified would be the primary key and you can store a new revision by just adding a row. This way archiving older versions and restoring versions from archive is easier too.
Another way to do this could be the datavault model by Dan Linstedt. I did a project for the Dutch statistics bureau that used this model and it works quite well. But I don't think it's directly useful for day to day database use. You might get some ideas from reading his papers though.
How about:
EmployeeID
DateModified (and/or a revision number, depending on how you want to track it)
ModifiedByUserId
plus any other information you want to track
Employee fields
You make the primary key (EmployeeId, DateModified), and to get the "current" record(s) you just select MAX(DateModified) for each employeeid. Storing an IsCurrent is a very bad idea, because first of all, it can be calculated, and secondly, it is far too easy for data to get out of sync.
You can also make a view that lists only the latest records, and mostly use that while working in your app. The nice thing about this approach is that you don't have duplicates of data, and you don't have to gather data from two different places (current in Employees, archived in EmployeesHistory) to get all the history, roll back, etc.
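Such a view might be sketched like this (table and column names assumed from the answer):
-- Hypothetical sketch: a "latest records" view over the versioned table.
CREATE VIEW CurrentEmployees
AS
SELECT e.*
FROM Employees e
WHERE e.DateModified = (SELECT MAX(e2.DateModified)
                        FROM Employees e2
                        WHERE e2.EmployeeId = e.EmployeeId);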
If you want to rely on history data (for reporting reasons) you should use structure something like this:
// Holds Employee Entity
"Employees (EmployeeId, FirstName, LastName, DepartmentId, .., ..)"
// Holds the Employee revisions in rows.
"EmployeeHistories (HistoryId, EmployeeId, DateModified, OldValue, NewValue, FieldName)"
Or global solution for application:
// Holds Employee Entity
"Employees (EmployeeId, FirstName, LastName, DepartmentId, .., ..)"
// Holds all entities revisions in rows.
"EntityChanges (EntityName, EntityId, DateModified, OldValue, NewValue, FieldName)"
You can also save your revisions in XML; then you have only one record per revision. It would look like this:
// Holds Employee Entity
"Employees (EmployeeId, FirstName, LastName, DepartmentId, .., ..)"
// Holds all entities revisions in rows.
"EntityChanges (EntityName, EntityId, DateModified, XMLChanges)"
We have had similar requirements, and what we found was that often times the user just wants to see what has been changed, not necessarily roll back any changes.
I'm not sure what your use case is, but what we have done was create an Audit table that is automatically updated with changes to a business entity, including the friendly names of any foreign key references and enumerations.
Whenever the user saves their changes we reload the old object, run a comparison, record the changes, and save the entity (all are done in a single database transaction in case there are any problems).
This seems to work very well for our users and saves us the headache of having a completely separate audit table with the same fields as our business entity.
It sounds like you want to track changes to specific entities over time, e.g. ID 3, "bob", "123 main street", then another ID 3, "bob" "234 elm st", and so on, in essence being able to puke out a revision history showing every address "bob" has been at.
The best way to do this is to have an "is current" field on each record, and (probably) a timestamp or FK to a date/time table.
Inserts then have to set the "is current" flag and also unset it on the previous "is current" record. Queries have to specify "is current", unless you want all of the history.
There are further tweaks to this if it's a very large table, or a large number of revisions are expected, but this is a fairly standard approach.