Create DB2 History Table Trigger - sql

I want to create a history table to track field changes across a number of tables in DB2.
I know history is usually done by copying an entire table's structure and giving it a suffixed name (e.g. user --> user_history). Then you can use a pretty simple trigger to copy the old record into the history table on an UPDATE.
However, for my application this would use too much space. It doesn't seem like a good idea (to me at least) to copy an entire record to another table every time a field changes. So I thought I could have a generic 'history' table which would track individual field changes:
CREATE TABLE history
(
history_id BIGINT NOT NULL GENERATED ALWAYS AS IDENTITY,
record_id INTEGER NOT NULL,
table_name VARCHAR(32) NOT NULL,
field_name VARCHAR(64) NOT NULL,
field_value VARCHAR(1024),
change_time TIMESTAMP,
PRIMARY KEY (history_id)
);
OK, so every table that I want to track has a single, auto-generated id field as the primary key, which would be put into the 'record_id' field. And the maximum VARCHAR size in the tables is 1024. Obviously if a non-VARCHAR field changes, it would have to be converted into a VARCHAR before inserting the record into the history table.
Now, this could be a completely misguided way to do things (hey, let me know why if it is), but I think it's a good way of tracking changes that need to be pulled up rarely and need to be stored for a significant amount of time.
Anyway, I need help with writing the trigger to add records to the history table on an update. Let's for example take a hypothetical user table:
CREATE TABLE user
(
user_id INTEGER GENERATED ALWAYS AS IDENTITY,
username VARCHAR(32) NOT NULL,
first_name VARCHAR(64) NOT NULL,
last_name VARCHAR(64) NOT NULL,
email_address VARCHAR(256) NOT NULL,
PRIMARY KEY(user_id)
);
So, can anyone help me with a trigger on an update of the user table to insert the changes into the history table? My guess is that some procedural SQL will need to be used to loop through the fields in the old record, compare them with the fields in the new record and if they don't match, then add a new entry into the history table.
It'd be preferable to use the same trigger action SQL for every table, regardless of its fields, if it's possible.
Thanks!
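For reference, here is a rough, untested sketch of the kind of trigger being asked for, written against the user and history tables above in DB2 inline SQL PL. The trigger name is made up, and each tracked column needs its own IF block, since there is no generic way to loop over columns inside a trigger body. Note that USER is a reserved word in DB2, so the table may need quoting or a schema qualifier, and you'll need an alternate statement terminator (or the ;-- trick shown in the trigger example further down) to create it:
CREATE TRIGGER trg_user_history
AFTER UPDATE ON user
REFERENCING OLD AS o NEW AS n
FOR EACH ROW MODE DB2SQL
BEGIN ATOMIC
-- one IF block per tracked column; this stores the OLD value, as described above
IF COALESCE(o.username, '') <> COALESCE(n.username, '') THEN
INSERT INTO history (record_id, table_name, field_name, field_value, change_time)
VALUES (o.user_id, 'user', 'username', o.username, CURRENT TIMESTAMP);
END IF;
IF COALESCE(o.first_name, '') <> COALESCE(n.first_name, '') THEN
INSERT INTO history (record_id, table_name, field_name, field_value, change_time)
VALUES (o.user_id, 'user', 'first_name', o.first_name, CURRENT TIMESTAMP);
END IF;
-- repeat the same IF block for last_name and email_address;
-- non-VARCHAR columns would need a CAST to VARCHAR here
END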

I don't think this is a good idea: with a big table where more than one value changes per update, you generate even more overhead per value than you would by copying the whole row. But that depends on your application.
Furthermore, you should consider the practical value of such a history table. You have to pull a lot of rows together to get even a glimpse of context for a changed value, and it requires you to code another application that implements just this complex history logic for an end user. And for a DB admin it would be cumbersome to restore values out of the history.
This may sound a bit harsh, but that is not the intent. An experienced programmer in our shop had a similar idea with table journaling. He got it up and running, but it ate disk space like there's no tomorrow.
Just think about what your history table should really accomplish.

Have you considered doing this as a two step process? Implement a simple trigger that records the original and changed version of the entire row. Then write a separate program that runs once a day to extract the changed fields as you describe above.
This makes the trigger simpler, safer and faster, and you have more choices for how to implement the post-processing step.
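A rough sketch of the capture step, using the user table from the question and a made-up shadow table user_changes (DB2 syntax, untested); the daily job can then diff each OLD/NEW pair and write per-field rows into the generic history table:
CREATE TABLE user_changes
(
user_id INTEGER NOT NULL,
username VARCHAR(32) NOT NULL,
first_name VARCHAR(64) NOT NULL,
last_name VARCHAR(64) NOT NULL,
email_address VARCHAR(256) NOT NULL,
row_version CHAR(3) NOT NULL, -- 'OLD' or 'NEW'
change_time TIMESTAMP NOT NULL
);
CREATE TRIGGER trg_user_capture
AFTER UPDATE ON user
REFERENCING OLD AS o NEW AS n
FOR EACH ROW MODE DB2SQL
BEGIN ATOMIC
-- record both versions of the whole row; no per-column logic in the trigger itself
INSERT INTO user_changes VALUES (o.user_id, o.username, o.first_name, o.last_name, o.email_address, 'OLD', CURRENT TIMESTAMP);
INSERT INTO user_changes VALUES (n.user_id, n.username, n.first_name, n.last_name, n.email_address, 'NEW', CURRENT TIMESTAMP);
END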

We do something similar on our SQL Server database, but the audit tables are per individual audited table (one central table would be huge, as our database is many, many gigabytes in size).
One thing you need to do is make sure you also record who made the change. You should also record the old and new value together (makes it easier to put data back if you need to) and the change type (insert, update, delete). You don't mention recording deletes from the table, but we find deletes are among the things we use the audit table for most frequently.
We use dynamic SQL to generate the code that creates the audit tables (driven by the table that stores the system information), and all audit tables have the exact same structure (which makes it easier to get data back out).
When you create the code to store the data in your history table, create the code to restore the data as well, if need be. This will save tons of time down the road when something needs to be restored and you are under pressure from senior management to get it done now.
Now I don't know if you were planning to be able to restore data from your history table, but once you have one, I can guarantee that management will want it used that way.
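As a rough illustration of the restore side, assuming a per-table audit table (user_audit here, a made-up name) that mirrors the audited user table plus an audit_date column; untested T-SQL:
-- put user 123 back to the values captured in its most recent audit row
UPDATE u
SET u.first_name = a.first_name,
u.last_name = a.last_name,
u.email_address = a.email_address
FROM [user] u
JOIN user_audit a ON a.user_id = u.user_id
WHERE u.user_id = 123
AND a.audit_date = (SELECT MAX(audit_date) FROM user_audit WHERE user_id = 123);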

CREATE TABLE HIST.TB_HISTORY (
HIST_ID BIGINT GENERATED ALWAYS AS IDENTITY (START WITH 0, INCREMENT BY 1, NO CACHE) NOT NULL,
HIST_COLUMNNAME VARCHAR(128) NOT NULL,
HIST_OLDVALUE VARCHAR(255),
HIST_NEWVALUE VARCHAR(255),
HIST_CHANGEDDATE TIMESTAMP NOT NULL,
PRIMARY KEY(HIST_ID)
)
GO
CREATE TRIGGER COMMON.TG_BANKCODE AFTER
UPDATE OF FRD_BANKCODE ON COMMON.TB_MAINTENANCE
REFERENCING OLD AS oldcol NEW AS newcol FOR EACH ROW MODE DB2SQL
WHEN(COALESCE(newcol.FRD_BANKCODE,'#null#') <> COALESCE(oldcol.FRD_BANKCODE,'#null#'))
BEGIN ATOMIC
CALL FB_CHECKING.SP_FRAUDHISTORY_ON_DATACHANGED(
newcol.FRD_FRAUDID,
'FRD_BANKCODE',
oldcol.FRD_BANKCODE,
newcol.FRD_BANKCODE,
newcol.FRD_UPDATEDBY
);--
INSERT INTO FB_CHECKING.TB_FRAUDMAINHISTORY(
HIST_COLUMNNAME,
HIST_OLDVALUE,
HIST_NEWVALUE,
HIST_CHANGEDDATE
)
VALUES (
'FRD_BANKCODE',
oldcol.FRD_BANKCODE,
newcol.FRD_BANKCODE,
CURRENT TIMESTAMP
);--
END

Related

SQL Database Deleted Flag and DeletedBy DeletedOn

I've added a history table to my database. Originally I added a bit column called Deleted, intending to update it to 1 if a row was ever deleted; otherwise each row represents an update.
Then I was informed we need to log who deleted what, and when. So I added nullable [DeletedBy] and [DeletedOn] fields.
At this point I was wondering if this made my Deleted bit redundant. You could simply query the table, checking where DeletedBy is not null, if you want to see which rows are deleted.
I intended to ask in this question which is better practice:
Having the extra Bit Column
Using the nullable columns that are already there, to Identify Deleted Rows
But I'm starting to think this is a preference thing. So instead my question is, which is more efficient? If this table gets massive, is there a performance gain to running:
Select * from MyTable where [Deleted] = 1
over
Select * from MyTable where [DeletedBy] is not null
This is more of a preference thing. Technically the datetime field is larger than a bit field, but since you are required to store it anyway it does not really matter. Performance-wise you can index either and get the same results. I personally think the bit field is redundant and use the nullable datetime.
If you added the Deleted bit a while ago, and there are already records in your live database that are 'deleted', then you need to keep the bit field, as you don't have the information to fill in DeletedBy for those rows at this stage (I imagine).
Well, you do need to know who deleted what, so the DeletedBy column MUST stay. That makes the main question: should you keep the bit column or not?
The answer is simple: no :)
I know it is just a bit column and it doesn't occupy much, but a bit multiplied by a lot of rows is a lot of bits. It probably won't impact your storage, of course, but there is no reason to keep redundant data in this case.
Regarding the Deleted = 1 rows you may already have, just update DeletedBy to something like 'system', or anything that tells you the record was deleted before the new field was implemented.
You are basically creating an audit trail, and it's simple to do. First, create all of your audit tables with some standard fields for audit information. For example:
[audit_id] [int] IDENTITY(1,1) NOT NULL,
[audit_action] [varchar](16) NOT NULL,
[audit_date] [datetime] NOT NULL,
[audit_user_name] [varchar](128) NOT NULL,
--<your fields from the table being audited>
Default the audit_date to a value of getdate(). If you are using Active Directory security, default audit_user_name to a value of suser_sname(), otherwise you'll have to provide this in your query.
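For example, the standard columns with those defaults in place might look like this (the constraint names are placeholders):
CREATE TABLE dbo.my_audit_trail_table (
[audit_id] [int] IDENTITY(1,1) NOT NULL,
[audit_action] [varchar](16) NOT NULL,
[audit_date] [datetime] NOT NULL CONSTRAINT df_audit_date DEFAULT (getdate()),
[audit_user_name] [varchar](128) NOT NULL CONSTRAINT df_audit_user_name DEFAULT (suser_sname())
--<your fields from the table being audited>
)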
Now, create a trigger for INSERT, UPDATE, and DELETE for the table to be audited. You'll write the values into your audit table. Here is an example for DELETE:
CREATE TRIGGER [dbo].[tr_my_table_being_audited_delete]
ON [dbo].[my_table_being_audited]
AFTER DELETE
AS
BEGIN
SET NOCOUNT ON;
INSERT INTO dbo.my_audit_trail_table (audit_action /*, <your fields from the table being audited> */)
SELECT 'DELETE' /*, <your fields from the table being audited> */
FROM deleted
END
For massive tables I really don't like using soft deletes; I prefer archiving, but I understand all projects are different.
I would probably just keep the Deleted flag on the primary table, since it's a little less overhead, and create a DeletionLog table with DeletedBy and a timestamp for auditing.
This would be especially beneficial in a high-read environment.
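A minimal sketch of that idea in T-SQL (the MyTable key column Id and the value 42 are illustrative):
CREATE TABLE DeletionLog (
RecordId int NOT NULL,
DeletedBy varchar(128) NOT NULL,
DeletedOn datetime NOT NULL DEFAULT (getdate())
)
-- soft-delete row 42: flag it on the main table, log the who/when separately
UPDATE MyTable SET [Deleted] = 1 WHERE Id = 42
INSERT INTO DeletionLog (RecordId, DeletedBy) VALUES (42, SUSER_SNAME())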

Avoiding a two step insert in SQL

Let's say I have a table defined as follows:
CREATE TABLE SomeTable
(
P_Id int PRIMARY KEY IDENTITY,
CompoundKey varchar(255) NOT NULL,
)
CompoundKey is a string with the primary key P_Id concatenated to the end, like Foo00000001, which comes from "Foo" + 00000001. At the moment, inserts into this table happen in 2 steps.
Insert a dummy record with a placeholder string for CompoundKey.
Update the CompoundKey column with the generated compound key.
I'm looking for a way to avoid the 2nd update entirely and do it all with one insert statement. Is this possible? I'm using MS SQL Server 2005.
p.s. I agree that this is not the most sensible schema in the world, and this schema will be refactored (and properly normalized) but I'm unable to make changes to the schema for now.
You could use a computed column; change the schema to read:
CREATE TABLE SomeTable
(
P_Id int PRIMARY KEY IDENTITY,
CompoundKeyPrefix varchar(255) NOT NULL,
CompoundKey AS CompoundKeyPrefix + CAST(P_Id AS VARCHAR(10))
)
This way, SQL Server will automagically give you your compound key in a new column, and will automatically maintain it for you. You may also want to look into the PERSISTED keyword for computed columns, which will cause SQL Server to materialise the value in the data files rather than having to compute it on the fly. You can also add an index against the column should you so wish.
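If you also want the zero-padded form from the question (Foo00000001) and want the value materialised, the same idea with PERSISTED might look like this (a sketch, untested against your data):
CREATE TABLE SomeTable
(
P_Id int PRIMARY KEY IDENTITY,
CompoundKeyPrefix varchar(255) NOT NULL,
CompoundKey AS CompoundKeyPrefix + RIGHT('00000000' + CAST(P_Id AS varchar(10)), 8) PERSISTED
)
-- inserting just the prefix then yields e.g. Foo00000001
INSERT INTO SomeTable (CompoundKeyPrefix) VALUES ('Foo')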
A trigger would easily accomplish this
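For instance (a sketch, assuming the row is inserted with just the prefix in CompoundKey; as the next answer points out, this still performs an UPDATE under the covers):
CREATE TRIGGER tr_SomeTable_CompoundKey
ON SomeTable
AFTER INSERT
AS
BEGIN
SET NOCOUNT ON;
-- append the zero-padded identity value to the prefix that was just inserted
UPDATE s
SET CompoundKey = s.CompoundKey + RIGHT('00000000' + CAST(s.P_Id AS varchar(10)), 8)
FROM SomeTable s
INNER JOIN inserted i ON i.P_Id = s.P_Id;
END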
This is simply not possible.
The "next ID" doesn't exist and thus cannot be read to fulfill the UPDATE until the row is inserted.
Now, if you were sourcing your autonumbers from somwhere else you could, but I don't think that's a good answer to your question.
Even if you want to use triggers, an UPDATE is still executed even if you don't manually execute it.
You can obscure the population of the CompoundKey, but at the end of the day it's still going to be an UPDATE
I think your safest bet is just to make sure the UPDATE is in the same transaction as the INSERT or use a trigger. But, for the academic argument of it, an UPDATE still occurs.
Two things:
1) If you end up using the two-step insert, you must use a transaction! Otherwise other processes may see the database in an inconsistent state (i.e. see a record without its CompoundKey).
2) I would refrain from trying to paste the Id onto the end of CompoundKey in a transaction, trigger, etc. It is much cleaner to do it at output time if you need it, e.g. in queries (select concat(CompoundKey, Id) as CompoundKeyId ...). If you need it as a foreign key in other tables, just use the pair (CompoundKey, Id).

RODBC sqlSave() stopping insert query when PK violated

I have developed an online survey that stores my data in a Microsoft SQL 2005 database. I have written a set of outlier checks on my data in R. The general workflow for these scripts is:
Read data from SQL database with sqlQuery()
Perform outlier analysis
Write offending respondents back to database in separate table using sqlSave()
The table I am writing back to has the structure:
CREATE TABLE outliers2(
modelid int
, password varchar(50)
, reason varchar(50),
Constraint PK_outliers2 PRIMARY KEY(modelid, reason)
)
GO
As you can see, I've set the primary key to be modelid and reason. The same respondent may be an outlier for multiple checks, but I do not want to insert the same modelid and reason combo for any respondent.
Since we are still collecting data, I would like to be able to update these scripts on a daily / weekly basis as I develop the models I am estimating on the data. Here is the general form of the sqlSave() command I'm using:
sqlSave(db, db.insert, "outliers2", append = TRUE, fast = FALSE, rownames = FALSE)
where db is a valid ODBC Connection and db.insert has the form
> head(db.insert)
modelid password reason
1 873 abkd WRONG DIRECTION
2 875 ab9d WRONG DIRECTION
3 890 akdw WRONG DIRECTION
4 905 pqjd WRONG DIRECTION
5 941 ymne WRONG DIRECTION
6 944 okyt WRONG DIRECTION
sqlSave() chokes when it tries to insert a row that violates the primary key constraint and does not continue with the other records for the insert. I would have thought that setting fast = FALSE would have alleviated this problem, but it doesn't.
Any ideas on how to get around this problem? I could always drop the table at the beginning of the first script, but that seems pretty heavy handed and will undoubtedly lead to problems down the road.
In this case, everything is working as expected. You are uploading everything as a batch, and SQL Server is stopping the batch as soon as it finds an error. Unfortunately, I don't know of a graceful built-in solution. But I think it is possible to build a system in the database to handle this more efficiently. I like doing data storage/management in databases rather than within R, so my solution is very database-heavy. Others may offer you a solution that is more R-oriented.
First, create a simple table, without constraints, to hold your new rows and adjust your sqlSave statement accordingly. This is where R will upload the information to.
CREATE TABLE tblTemp(
modelid int
, password varchar(50)
, reason varchar(50)
, duplicate int
)
GO
Your query to put information into this table should assume 'No' for the column 'duplicate'. I use a pattern where 1=Y & 5=N. You could also only mark those that are outliers but I tend to prefer to be explicit with my logic.
You will also need a place to dump all rows which violate the PK in outliers2.
CREATE TABLE tblDuplicates(
modelid int
, password varchar(50)
, reason varchar(50)
)
GO
OK. Now all you need to do is to create a trigger to move the new rows from tblTemp to outliers2. This trigger will move all duplicate rows to tblDuplicates for later handling, deletion, whatever.
CREATE TRIGGER FindDups
ON tblTemp
AFTER INSERT
AS
I'm not going to go through and write the entire trigger. I don't have a SQL Server 2005 instance to test it against, I would probably make a syntax error, and I don't want to give you bad code, but here's what the trigger needs to do (a rough, untested sketch follows the steps below):
Identify all rows in tblTemp that would violate the PK in outliers2. Where duplicates are found, change the duplicates to 1. This would be done with an UPDATE statement.
Copy all rows where duplicate=1 to tblDuplicates. You would do this with an INSERT INTO tblDuplicates ......
Now copy the non-duplicate rows to outliers2 with an INSERT INTO statement that looks almost exactly like the one used in step 2.
Delete all rows from tblTemp, to clear it out for your next batch of updates. This step is important.
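With those caveats in mind, the whole trigger might look roughly like this:
CREATE TRIGGER FindDups
ON tblTemp
AFTER INSERT
AS
BEGIN
SET NOCOUNT ON;
-- 1. flag rows that would violate the PK on outliers2 (1 = duplicate, 5 = not)
UPDATE t
SET duplicate = 1
FROM tblTemp t
INNER JOIN outliers2 o ON o.modelid = t.modelid AND o.reason = t.reason;
-- 2. park the duplicates for later handling
INSERT INTO tblDuplicates (modelid, password, reason)
SELECT modelid, password, reason FROM tblTemp WHERE duplicate = 1;
-- 3. move the clean rows into outliers2
INSERT INTO outliers2 (modelid, password, reason)
SELECT modelid, password, reason FROM tblTemp WHERE duplicate = 5;
-- 4. clear the staging table for the next batch
DELETE FROM tblTemp;
END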
The nice part about doing it this way is sqlSave() won't error out just because you have a violation of your PK and you can deal with the matches at a later time, like tomorrow. :-)

Fixing DB Inconsistencies - ID Fields

I've inherited a (Microsoft?) SQL database that wasn't very pristine in its original state. There are still some very strange things in it that I'm trying to fix - one of them is inconsistent ID entries.
In the accounts table, each entry has a number called accountID, which is referenced in several other tables (notes, equipment, etc.). The problem is that the numbers, for some random reason, range from about -100000 to +2000000 when there are only about 7000 entries.
Is there any good way to re-number them while changing the corresponding numbers in the other tables? At my disposal I also have ColdFusion, so anything that works with SQL and/or ColdFusion I'll accept.
For surrogate keys, they are meant to be meaningless, so unless you actually had a database integrity issue (like there were no foreign key constraints properly defined) or your identity was approaching the maximum for its datatype, I would leave them alone and go after some other low-hanging fruit that would have more impact.
In this instance, it sounds like "why" is a better question than "how". The OP notes that there is a strange problem that needs to be fixed but doesn't say why it is a problem. Is it causing problems? What positive impact would changing these numbers have? Unless you originally programmed the system and understand precisely why the numbers are in their current state, you are taking quite a risk making changes like this.
I would talk to an accountant (or at least your financial people) before messing in any way with the numbers in the accounts tables if this is a financial app. The table of accounts is very critical to how finances are reported. These IDs may have meaning you don't understand. No one puts in a negative id unless they had a reason. I would under no circumstances change that unless I understood why it was negative to begin with. You could truly screw up your tax reporting or some other thing by making an unneeded change.
You could probably disable the foreign key relationships (if you're able to take it offline temporarily) and then update the primary keys using a script. I've used this update script before to change values, and you could pretty easily wrap this code in a cursor to go through the key values in question, one by one, and update the arbitrary value to an incrementing value you're keeping track of.
Check out the script here: http://vyaskn.tripod.com/sql_server_search_and_replace.htm
If you just have a list of tables that use the primary key, you could set up a series of UPDATE statements that run inside your cursor, and then you wouldn't need to use this script (which can be a little slow).
It's worth asking, though, why these values appear out of whack. Does this database have values added and deleted constantly? Are the primary key values really arbitrary, or do they just appear to be, but they really have meaning? Though I'm all for consolidating, you'd have to ensure that there's no purpose to those values.
With ColdFusion this shouldn't be a herculean task, but it will be messy and you'll have to be careful. One method you could use would be to script the database and then generate a brand new, blank table schema. Set the accountID as an identity field in the new database.
Then, using ColdFusion, write a query that will pull all of the old account data and insert it into the new database one row at a time. For each row, let the new database assign a new ID. After each insert, pull the new ID (using either @@IDENTITY or MAX(accountID)) and store the new ID and the old ID together in a temporary table so you know which old IDs belong to which new IDs.
Next, repeat the process with each of the child tables. For each old ID, pull its child entries and re-insert them into the new database using the new IDs. If the primary keys on the child tables are fine, you can insert them as-is or let the server assign new ones if they don't matter.
Assigning new IDs in place by disabling relationships temporarily may work, but you might run into conflicts if one of the entries is assigned an ID that is already in use by the old data.
Create a new column in the accounts table for your new ID, and new column in each of your related tables to reference the new ID column.
ALTER TABLE accounts
ADD new_accountID int IDENTITY
ALTER TABLE notes
ADD new_accountID int
ALTER TABLE equipment
ADD new_accountID int
Then you can map the new_accountID column on each of your referencing tables to the accounts table.
UPDATE notes
SET new_accountID = accounts.new_accountID
FROM accounts
INNER JOIN notes ON (notes.accountID = accounts.accountID)
UPDATE equipment
SET new_accountID = accounts.new_accountID
FROM accounts
INNER JOIN equipment ON (equipment.accountID = accounts.accountID)
At this point, each table has both accountID with the old keys, and new_accountID with the new keys. From here it should be pretty straightforward (a sketch for one referencing table follows the steps below).
Break all of the foreign keys on accountID.
On each table, UPDATE [table] SET accountID = new_accountID.
Re-add the foreign keys for accountID.
Drop new_accountID from all of the tables, as it's no longer needed.
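For one referencing table, the mechanical version of those steps might look like this (the constraint name is hypothetical, and this assumes accountID is not itself an IDENTITY column, since identity columns can't be updated directly):
-- 1. break the foreign key on accountID
ALTER TABLE notes DROP CONSTRAINT FK_notes_accounts
-- 2. swap in the new keys
UPDATE accounts SET accountID = new_accountID
UPDATE notes SET accountID = new_accountID
-- 3. re-add the foreign key
ALTER TABLE notes ADD CONSTRAINT FK_notes_accounts
FOREIGN KEY (accountID) REFERENCES accounts (accountID)
-- 4. drop the helper columns
ALTER TABLE notes DROP COLUMN new_accountID
ALTER TABLE accounts DROP COLUMN new_accountID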

Database Design for Revisions?

We have a requirement in our project to store all revisions (change history) of the entities in the database. Currently we have 2 proposed designs for this:
e.g. for "Employee" Entity
Design 1:
-- Holds Employee Entity
"Employees (EmployeeId, FirstName, LastName, DepartmentId, .., ..)"
-- Holds the Employee Revisions in Xml. The RevisionXML will contain
-- all data of that particular EmployeeId
"EmployeeHistories (EmployeeId, DateModified, RevisionXML)"
Design 2:
-- Holds Employee Entity
"Employees (EmployeeId, FirstName, LastName, DepartmentId, .., ..)"
-- In this approach we have basically duplicated all the fields on Employees
-- in the EmployeeHistories and storing the revision data.
"EmployeeHistories (EmployeeId, RevisionId, DateModified, FirstName,
LastName, DepartmentId, .., ..)"
Is there any other way of doing this thing?
The problem with Design 1 is that we have to parse the XML each time we need to access the data. This slows the process down and also adds limitations, such as not being able to join on the revision data fields.
And the problem with Design 2 is that we have to duplicate each and every field of all entities (we have around 70-80 entities for which we want to maintain revisions).
I think the key question to ask here is 'Who / What is going to be using the history'?
If it's going to be mostly for reporting / human readable history, we've implemented this scheme in the past...
Create a table called 'AuditTrail' or something that has the following fields...
[ID] [int] IDENTITY(1,1) NOT NULL,
[UserID] [int] NULL,
[EventDate] [datetime] NOT NULL,
[TableName] [varchar](50) NOT NULL,
[RecordID] [varchar](20) NOT NULL,
[FieldName] [varchar](50) NULL,
[OldValue] [varchar](5000) NULL,
[NewValue] [varchar](5000) NULL
You can then add a 'LastUpdatedByUserID' column to all of your tables which should be set every time you do an update / insert on the table.
You can then add a trigger to every table to catch any insert / update that happens and creates an entry in this table for each field that's changed. Because the table is also being supplied with the 'LastUpdateByUserID' for each update / insert, you can access this value in the trigger and use it when adding to the audit table.
We use the RecordID field to store the value of the key field of the table being updated. If it's a combined key, we just do a string concatenation with a '~' between the fields.
I'm sure this system may have drawbacks - for heavily updated databases the performance may be hit, but for my web-app, we get many more reads than writes and it seems to be performing pretty well. We even wrote a little VB.NET utility to automatically write the triggers based on the table definitions.
Just a thought!
Do not put it all in one table with an IsCurrent discriminator attribute. This just causes problems down the line, requires surrogate keys and all sorts of other problems.
Design 2 does have problems with schema changes. If you change the Employees table you have to change the EmployeeHistories table and all the related sprocs that go with it. Potentially doubles your schema change effort.
Design 1 works well and, if done properly, does not cost much in terms of a performance hit. You could use an XML schema and even indexes to get over possible performance problems. Your comment about parsing the XML is valid, but you could easily create a view using XQuery, which you can include in queries and join to. Something like this...
CREATE VIEW EmployeeHistory
AS
SELECT EmployeeId,
RevisionXML.value('(/employee/FirstName)[1]', 'varchar(50)') AS FirstName,
RevisionXML.value('(/employee/LastName)[1]', 'varchar(100)') AS LastName,
RevisionXML.value('(/employee/DepartmentId)[1]', 'int') AS DepartmentId
FROM EmployeeHistories
The History Tables article in the Database Programmer blog might be useful - covers some of the points raised here and discusses the storage of deltas.
Edit
In the History Tables essay, the author (Kenneth Downs), recommends maintaining a history table of at least seven columns:
Timestamp of the change,
User that made the change,
A token to identify the record that was changed (where the history is maintained separately from the current state),
Whether the change was an insert, update, or delete,
The old value,
The new value,
The delta (for changes to numerical values).
Columns which never change, or whose history is not required, should not be tracked in the history table to avoid bloat. Storing the delta for numerical values can make subsequent queries easier, even though it can be derived from the old and new values.
The history table must be secure, with non-system users prevented from inserting, updating or deleting rows. Only periodic purging should be supported to reduce overall size (and if permitted by the use case).
Avoid Design 1; it is not very handy once you need to, for example, roll back to old versions of the records, either automatically or "manually" using an administrator's console.
I don't really see disadvantages to Design 2. I think the second (History) table should contain all columns present in the first (Records) table. E.g. in MySQL you can easily create a table with the same structure as another table (CREATE TABLE X LIKE Y). And when you are about to change the structure of the Records table in your live database, you have to use ALTER TABLE commands anyway, and there is no big effort in running those commands for your History table as well.
Notes
Records table contains only the latest revision;
History table contains all previous revisions of records in Records table;
History table's primary key is a primary key of the Records table with added RevisionId column;
Think about additional auxiliary fields like ModifiedBy, the user who created a particular revision. You may also want a DeletedBy field to track who deleted a particular revision.
Think about what DateModified should mean: either it means when this particular revision was created, or when this particular revision was replaced by another one. The former requires the field to also be in the Records table and seems more intuitive at first sight; the latter, however, seems more practical for deleted records (the date when this particular revision was deleted). If you go for the first solution, you would probably also need a DateDeleted field (only if you need it, of course). It depends on what you actually want to record.
Operations in Design 2 are very trivial:
Modify
copy the record from the Records table to the History table, give it a new RevisionId (if it is not already present in the Records table), and handle DateModified (depends on how you interpret it, see notes above)
go on with normal update of the record in Records table
Delete
do exactly the same as in the first step of the Modify operation. Handle DateModified/DateDeleted accordingly, depending on the interpretation you have chosen.
Undelete (or rollback)
take the highest (or some particular) revision from the History table and copy it to the Records table
List revision history for particular record
select from History table and Records table
think what exactly you expect from this operation; it will probably determine what information you require from DateModified/DateDeleted fields (see notes above)
If you go for Design 2, all SQL commands needed to do that will be very very easy, as well as maintenance! Maybe, it will be much much easier if you use the auxiliary columns (RevisionId, DateModified) also in the Records table - to keep both tables at exactly the same structure (except for unique keys)! This will allow for simple SQL commands, which will be tolerant to any data structure change:
insert into EmployeeHistories select * from Employees where EmployeeId = XX
Don't forget to use transactions!
As for the scaling, this solution is very efficient, since you don't transform any data from XML back and forth, just copying whole table rows - very simple queries, using indices - very efficient!
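A sketch of the Modify operation under that scheme, assuming (as suggested above) that Employees and EmployeeHistories share exactly the same columns, including RevisionId and DateModified; XX and the new name are placeholders:
START TRANSACTION;
-- 1. archive the current version of the record
insert into EmployeeHistories select * from Employees where EmployeeId = XX;
-- 2. apply the change and stamp the new revision
update Employees
set FirstName = 'NewFirstName', RevisionId = RevisionId + 1, DateModified = CURRENT_TIMESTAMP
where EmployeeId = XX;
COMMIT;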
We have implemented a solution very similar to the solution that Chris Roberts suggests, and that works pretty well for us.
The only difference is that we store only the new value; the old value is, after all, stored in the previous history row.
[ID] [int] IDENTITY(1,1) NOT NULL,
[UserID] [int] NULL,
[EventDate] [datetime] NOT NULL,
[TableName] [varchar](50) NOT NULL,
[RecordID] [varchar](20) NOT NULL,
[FieldName] [varchar](50) NULL,
[NewValue] [varchar](5000) NULL
Let's say you have a table with 20 columns. This way you only have to store the exact column that has changed instead of having to store the entire row.
If you have to store history, make a shadow table with the same schema as the table you are tracking and a 'Revision Date' and 'Revision Type' column (e.g. 'delete', 'update'). Write (or generate - see below) a set of triggers to populate the audit table.
It's fairly straightforward to make a tool that will read the system data dictionary for a table and generate a script that creates the shadow table and a set of triggers to populate it.
Don't try to use XML for this, XML storage is a lot less efficient than the native database table storage that this type of trigger uses.
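As a rough illustration of the generation step mentioned above, the column list such a generator needs can be read from the catalog; the exact views vary by product (INFORMATION_SCHEMA.COLUMNS on SQL Server/MySQL, SYSCAT.COLUMNS on DB2), e.g.:
SELECT COLUMN_NAME, DATA_TYPE, CHARACTER_MAXIMUM_LENGTH, IS_NULLABLE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'Employees'
ORDER BY ORDINAL_POSITION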
Ramesh, I was involved in the development of a system based on the first approach.
It turned out that storing revisions as XML leads to huge database growth and significantly slows things down.
My approach would be to have one table per entity:
Employee (Id, Name, ... , IsActive)
where IsActive is a sign of the latest version
If you want to associate some additional info with revisions, you can create a separate table
containing that info and link it to the entity tables using a PK/FK relation.
This way you can store all version of employees in one table.
Pros of this approach:
Simple database structure
No conflicts since table becomes append-only
You can rollback to previous version by simply changing IsActive flag
No need for joins to get object history
Note that you have to allow the Id to be non-unique, so it cannot serve as the primary key by itself.
The way that I've seen this done in the past is have
Employees (EmployeeId, DateModified, < Employee Fields > , boolean isCurrent );
You never "update" on this table (except to change the valid of isCurrent), just insert new rows. For any given EmployeeId, only 1 row can have isCurrent == 1.
The complexity of maintaining this can be hidden by views and "instead of" triggers (in oracle, I presume similar things other RDBMS), you can even go to materialized views if the tables are too big and can't be handled by indexes).
This method is ok, but you can end up with some complex queries.
Personally, I'm pretty fond of your Design 2 way of doing it, which is how I've done it in the past as well. It's simple to understand, simple to implement and simple to maintain.
It also creates very little overhead for the database and application, especially when performing read queries, which is likely what you'll be doing 99% of the time.
It would also be quite easy to automate the creation of the history tables and the triggers that maintain them (assuming it would be done via triggers).
Revisions of data is an aspect of the 'valid-time' concept of a Temporal Database. Much research has gone into this, and many patterns and guidelines have emerged. I wrote a lengthy reply with a bunch of references to this question for those interested.
I'm going to share my design with you, and it's different from both of your designs in that it requires one table per entity type. I found the best way to describe any database design is through an ERD; here's mine:
In this example we have an entity named employee. The user table holds your users' records, and entity and entity_revision are two tables which hold the revision history for all the entity types you will have in your system. Here's how this design works:
The two fields of entity_id and revision_id
Each entity in your system will have a unique entity id of its own. Your entity might go through revisions, but its entity_id will remain the same. You need to keep this entity id in your employee table (as a foreign key). You should also store the type of your entity in the entity table (e.g. 'employee'). Now as for the revision_id: as its name shows, it keeps track of your entity revisions. The best way I found for this is to use the employee_id as your revision_id. This means you will have duplicate revision ids for different types of entities, but this is not a problem for me (I'm not sure about your case). The only important note is that the combination of entity_id and revision_id should be unique.
There's also a state field within the entity_revision table which indicates the state of the revision. It can have one of three states: latest, obsolete or deleted (not relying on the date of revisions helps a great deal to boost your queries).
One last note on revision_id: I didn't create a foreign key connecting employee_id to revision_id, because we don't want to alter the entity_revision table for each entity type that we might add in the future.
INSERTION
For each employee that you want to insert into the database, you will also add a record to entity and entity_revision. These last two records will help you keep track of by whom and when a record was inserted into the database.
UPDATE
Each update for an existing employee record will be implemented as two inserts, one in employee table and one in entity_revision. The second one will help you to know by whom and when the record has been updated.
DELETION
For deleting an employee, a record is inserted into entity_revision stating the deletion, and you're done.
As you can see in this design no data is ever altered or removed from database and more importantly each entity type requires only one table. Personally I find this design really flexible and easy to work with. But I'm not sure about you as your needs might be different.
[UPDATE]
Now that newer MySQL versions support partitioning, I believe my design also offers some of the best performance. One can partition the entity table using the type field and partition entity_revision using its state field. This will boost SELECT queries considerably while keeping the design simple and clean.
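A rough DDL sketch of the design described above (the column names are my reading of the description, not the author's exact schema):
CREATE TABLE entity (
entity_id INT NOT NULL PRIMARY KEY,
entity_type VARCHAR(50) NOT NULL, -- e.g. 'employee'
created_by INT NOT NULL, -- FK to user
created_at TIMESTAMP NOT NULL
);
CREATE TABLE entity_revision (
entity_id INT NOT NULL, -- FK to entity
revision_id INT NOT NULL, -- e.g. the employee_id of that version
state VARCHAR(10) NOT NULL, -- 'latest', 'obsolete' or 'deleted'
changed_by INT NOT NULL, -- FK to user
changed_at TIMESTAMP NOT NULL,
PRIMARY KEY (entity_id, revision_id)
);
CREATE TABLE employee (
employee_id INT NOT NULL PRIMARY KEY, -- doubles as the revision_id
entity_id INT NOT NULL, -- FK to entity; stays constant across revisions
first_name VARCHAR(50),
last_name VARCHAR(50)
);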
If indeed an audit trail is all you need, I'd lean toward the audit table solution (complete with denormalized copies of the important column on other tables, e.g., UserName). Keep in mind, though, that bitter experience indicates that a single audit table will be a huge bottleneck down the road; it's probably worth the effort to create individual audit tables for all your audited tables.
If you need to track the actual historical (and/or future) versions, then the standard solution is to track the same entity with multiple rows using some combination of start, end, and duration values. You can use a view to make accessing current values convenient. If this is the approach you take, you can run into problems if your versioned data references mutable but unversioned data.
If you want to do the first one you might want to use XML for the Employees table too. Most newer databases allow you to query into XML fields so this is not always a problem. And it might be simpler to have one way to access employee data regardless if it's the latest version or an earlier version.
I would try the second approach though. You could simplify this by having just one Employees table with a DateModified field. The EmployeeId + DateModified would be the primary key and you can store a new revision by just adding a row. This way archiving older versions and restoring versions from archive is easier too.
Another way to do this could be the Data Vault model by Dan Linstedt. I did a project for the Dutch statistics bureau that used this model, and it works quite well. But I don't think it's directly useful for day-to-day database use. You might get some ideas from reading his papers though.
How about:
EmployeeID
DateModified
and/or revision number, depending on how you want to track it
ModifiedByUserId
plus any other information you want to track
Employee fields
You make the primary key (EmployeeId, DateModified), and to get the "current" record(s) you just select MAX(DateModified) for each employeeid. Storing an IsCurrent is a very bad idea, because first of all, it can be calculated, and secondly, it is far too easy for data to get out of sync.
You can also make a view that lists only the latest records, and mostly use that while working in your app. The nice thing about this approach is that you don't have duplicates of data, and you don't have to gather data from two different places (current in Employees, archived in EmployeesHistory) to get all the history, roll back, etc.
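The "latest records" view mentioned above could be as simple as this sketch:
CREATE VIEW CurrentEmployees AS
SELECT e.*
FROM Employees e
WHERE e.DateModified = (SELECT MAX(h.DateModified)
FROM Employees h
WHERE h.EmployeeId = e.EmployeeId)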
If you want to rely on history data (for reporting reasons) you should use a structure something like this:
// Holds Employee Entity
"Employees (EmployeeId, FirstName, LastName, DepartmentId, .., ..)"
// Holds the Employee revisions in rows.
"EmployeeHistories (HistoryId, EmployeeId, DateModified, OldValue, NewValue, FieldName)"
Or global solution for application:
// Holds Employee Entity
"Employees (EmployeeId, FirstName, LastName, DepartmentId, .., ..)"
// Holds all entities revisions in rows.
"EntityChanges (EntityName, EntityId, DateModified, OldValue, NewValue, FieldName)"
You can also save your revisions in XML; then you have only one record per revision. That would look like this:
// Holds Employee Entity
"Employees (EmployeeId, FirstName, LastName, DepartmentId, .., ..)"
// Holds all entities revisions in rows.
"EntityChanges (EntityName, EntityId, DateModified, XMLChanges)"
We have had similar requirements, and what we found was that often times the user just wants to see what has been changed, not necessarily roll back any changes.
I'm not sure what your use case is, but what we have done was create an Audit table that is automatically updated with changes to a business entity, including the friendly name of any foreign key references and enumerations.
Whenever the user saves their changes we reload the old object, run a comparison, record the changes, and save the entity (all are done in a single database transaction in case there are any problems).
This seems to work very well for our users and saves us the headache of having a completely separate audit table with the same fields as our business entity.
It sounds like you want to track changes to specific entities over time, e.g. ID 3, "bob", "123 main street", then another ID 3, "bob" "234 elm st", and so on, in essence being able to puke out a revision history showing every address "bob" has been at.
The best way to do this is to have an "is current" field on each record, and (probably) a timestamp or FK to a date/time table.
Inserts have to then set the "is current" and also unset the "is current" on the previous "is current" record. Queries have to specify the "is current", unless you want all of the history.
There are further tweaks to this if it's a very large table, or a large number of revisions are expected, but this is a fairly standard approach.