What is the best architecture for tracking field changes on objects? - sql

We have a web application that is built on top of a SQL database. Several different types of objects can have comments added to them, and some of these objects need field-level tracking, similar to how field changes are tracked in most issue-tracking systems (such as status, assignment, priority). We'd like to show who made each change, what the previous value was, and what the new value is.
At a pure design level, it would be most straightforward to track each change from any object in a generic table, with columns for the object type, object primary key, primary key of the user that made the change, the field name, and the old and new values. In our case, these would also optionally have a comment ID if the user entered a comment when making the changes.
However, with how quickly this data can grow, is this the best architecture? What are some methods commonly employed to add this type of functionality to an already large-scale application?
[Edit] I'm starting a bounty on this question mainly because I'd like to find out in particular what is the best architecture in terms of handling scale very well. Tom H.'s answer is informative, but the recommended solution seems to be fairly size-inefficient (a new row for every new state of an object, even if many columns did not change) and not possible given the requirement that we must be able to track changes to user-created fields as well. In particular, I'm likely to accept an answer that can explain how a common issue-tracking system (JIRA or similar) has implemented this.

There are several options available to you for this. You could have audit tables which basically mirror the base tables but also include a change date/time, change type and user. These can be updated through a trigger. This solution is typically better for behind the scenes auditing (IMO) though, rather than to solve an application-specific requirement.
The second option is as you've described. You can have a generic table that holds each individual change with a type code to show which attribute was changed. I personally don't like this solution as it prevents the use of check constraints on the columns and can also prevent foreign key constraints.
The third option (which would be my initial choice with the information given) would be to have a separate historical change table which is updated through the application and includes the PK for each table as well as the column(s) which you would be tracking. It's slightly different from the first option in that the application would be responsible for updating the table as needed. I prefer this over the first option in your case because you really have a business requirement that you're trying to solve, not a back-end technical requirement like auditing. By putting the logic in the application you have a bit more flexibility. Maybe some changes you don't want to track because they're maintenance updates, etc.
With the third option you can either keep the "current" data in the base table or store each historically tracked column only in the history table, in which case you look at the latest history row to get the current state of the object. I prefer the latter because it avoids both duplicating data in your database and having to look at multiple tables for the same piece of information.
So, you might have:
Problem_Ticket (ticket_id, ticket_name)
Problem_Ticket_History (ticket_id, change_datetime, description, comment, username)
Alternatively, you could use:
Problem_Ticket (ticket_id, ticket_name)
Problem_Ticket_Comments (ticket_id, change_datetime, comment, username)
Problem_Ticket_Statuses (ticket_id, change_datetime, status_id, username)
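As a rough illustration of the first layout, a T-SQL sketch might look like the following (the column types, the GETDATE() default and the sample values are assumptions, not part of the original design):
CREATE TABLE Problem_Ticket (
    ticket_id   INT          NOT NULL PRIMARY KEY,
    ticket_name VARCHAR(100) NOT NULL
);
CREATE TABLE Problem_Ticket_History (
    ticket_id        INT          NOT NULL REFERENCES Problem_Ticket (ticket_id),
    change_datetime  DATETIME     NOT NULL DEFAULT GETDATE(),
    description      VARCHAR(255) NULL,   -- e.g. 'Status changed from Open to Closed'
    comment          VARCHAR(MAX) NULL,   -- optional user comment attached to the change
    username         VARCHAR(50)  NOT NULL,
    PRIMARY KEY (ticket_id, change_datetime)
);
-- The application (not a trigger) records each tracked change explicitly:
INSERT INTO Problem_Ticket_History (ticket_id, description, comment, username)
VALUES (42, 'Status changed from Open to Closed', 'Fixed in build 1.2', 'jsmith');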

I'm not sure about the "issue tracking"-specific approach, but I wouldn't say there is one ultimate way to do this. There are a number of options to accomplish it, and each has its benefits and drawbacks, as illustrated here.
I personally would just create one table that has some metadata columns about the change and a column that stores XML of the serialized version of the old object, or whatever you care about. That way, if you want to show the history of the object, you just fetch all the old versions, re-hydrate them, and you're done. One table to rule them all.
One often overlooked solution would be to use Change Data Capture. This might give you more space savings/performance if you really are concerned.
Good luck.

Here is the solution I would recommend to attain your objective.
Design your auditing model as shown below.
AuditEventType (1) ---- (*) AuditEvent
AuditEvent (1) ---- (0,1) AuditEventComment
AuditEvent (1) ---- (+) AuditDataTable
AuditDataTable (1) ---- (+) AuditDataRow
AuditDataRow (1) ---- (+) AuditDataColumn
AuditEventType
Contains a list of all possible event types in the system and a generic description of each.
AuditEvent
Contains information about the particular event that triggered this action.
AuditEventComment
Contains an optional custom user comment about the audit event. Comments can be really large, so they are better stored in a CLOB.
AuditDataTable
Contains a list of one or more tables that were impacted by the respective AuditEvent.
AuditDataRow
Contains a list of one or more identifying rows in the respective AuditDataTable that were impacted by the respective AuditEvent.
AuditDataColumn
Contains a list of zero or more columns of the respective AuditDataRow whose values were changed, along with their previous and current values.
AuditBuilder
Implement an AuditBuilder (Builder pattern). Instantiate it at the beginning of the event and make it available in the request context, or pass it along with your other DTOs. Each time you change data anywhere in your code, invoke the appropriate call on the AuditBuilder to notify it about the change. At the end, invoke build() on the AuditBuilder to form the structure above and then persist it to the database.
Make sure all your activity for the event is in a single DB transaction along with persistence of audit data.
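A minimal DDL sketch of that model, assuming SQL Server-style syntax (column names and types beyond the entities named above are illustrative guesses):
CREATE TABLE AuditEventType (
    AuditEventTypeId INT          PRIMARY KEY,
    Description      VARCHAR(255) NOT NULL
);
CREATE TABLE AuditEvent (
    AuditEventId     INT IDENTITY PRIMARY KEY,
    AuditEventTypeId INT          NOT NULL REFERENCES AuditEventType (AuditEventTypeId),
    EventDateTime    DATETIME     NOT NULL,
    Username         VARCHAR(50)  NOT NULL
);
CREATE TABLE AuditEventComment (
    AuditEventId INT          PRIMARY KEY REFERENCES AuditEvent (AuditEventId),
    Comment      VARCHAR(MAX) NOT NULL    -- the CLOB; use the large-text type your engine provides
);
CREATE TABLE AuditDataTable (
    AuditDataTableId INT IDENTITY PRIMARY KEY,
    AuditEventId     INT          NOT NULL REFERENCES AuditEvent (AuditEventId),
    TableName        VARCHAR(128) NOT NULL
);
CREATE TABLE AuditDataRow (
    AuditDataRowId   INT IDENTITY PRIMARY KEY,
    AuditDataTableId INT          NOT NULL REFERENCES AuditDataTable (AuditDataTableId),
    RowKey           VARCHAR(128) NOT NULL -- primary key of the affected row, stored as text
);
CREATE TABLE AuditDataColumn (
    AuditDataRowId INT          NOT NULL REFERENCES AuditDataRow (AuditDataRowId),
    ColumnName     VARCHAR(128) NOT NULL,
    OldValue       VARCHAR(MAX) NULL,
    NewValue       VARCHAR(MAX) NULL,
    PRIMARY KEY (AuditDataRowId, ColumnName)
);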

It depends on your exact requirements, and this might not be for you, but for general auditing in the database with triggers (so front-end and even the SP interface layer don't matter), we use AutoAudit, and it works very well.

I don't understand the actual usage scenarios for the audited data, though... do you need to just keep track of the changes? Will you need to "roll back" some of the changes? How frequent (and flexible) do you want the audit log report/lookup to be?
Personally I'd investigate something like this:
Create AuditTable. This has an ID, a version number, a user id and a clob field.
When Object #768795 is created, add a row in AuditTable, with values:
Id=#768795
Version:0
User: (Id of the user who created the new object)
clob: an xml representation of the whole object. (if space is a problem, and access to this table is not frequent, you could use a blob and "zip" the xml representation on the fly).
Every time you change something, create a new version, and store the whole object serialized as XML.
In case you need to create an audit log you have all you need, and can use simple "text compare" tools or libraries to see what changed in time (a bit like Wikipedia does).
If you want to track only a subset of fields either because the rest is immutable, non significant or you are desperate for speed/space, just serialize the subset you care about.
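A minimal sketch of that audit table, assuming SQL Server and an XML column for the serialized object (all names and types here are illustrative):
CREATE TABLE AuditTable (
    ObjectId      INT      NOT NULL,  -- id of the audited object, e.g. 768795
    VersionNumber INT      NOT NULL,  -- 0 on creation, incremented on every change
    UserId        INT      NOT NULL,  -- who made the change
    ObjectXml     XML      NOT NULL,  -- serialized representation of the whole object (or just the tracked subset)
    ChangedAt     DATETIME NOT NULL DEFAULT GETDATE(),
    PRIMARY KEY (ObjectId, VersionNumber)
);
-- Version 0 is written when the object is created; later versions on every change:
INSERT INTO AuditTable (ObjectId, VersionNumber, UserId, ObjectXml)
VALUES (768795, 0, 1, '<object><name>Example</name></object>');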

I know this question is very old, but another possibility, built into SQL Server, is Change Data Capture (CDC).
You can find more information at this link:
Introduction to Change Data Capture (CDC) in SQL Server 2008
http://www.simple-talk.com/sql/learn-sql-server/introduction-to-change-data-capture-(cdc)-in-sql-server-2008/
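For reference, enabling CDC on SQL Server 2008+ looks roughly like this (the dbo.Problem_Ticket table name is just a placeholder):
-- Enable CDC for the database, then for each table you want tracked
EXEC sys.sp_cdc_enable_db;
EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'Problem_Ticket',
    @role_name     = NULL;   -- NULL = no gating role required to read the change data
-- Changes are then read through generated functions such as
-- cdc.fn_cdc_get_all_changes_dbo_Problem_Ticket(@from_lsn, @to_lsn, N'all').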

I think Observer is an ideal pattern in this scenario.

Related

Normalization of SQL tables

I am creating some tables for a project and just realized that many of the tables have the same structure (Id, Name), but are used for different things. How far should I go with normalization? Should I build them all into one table or keep them apart for better understanding? How does it affect performance?
Example 1:
TableObjectType (used for types of objects in the log)
Id Name
1 User
2 MobileDevice
3 SIMcard
TableAction (used for types of actions in a log)
Id Name
1 Create
2 Edit
3 Delete
TableStatus (used for a status a device can have)
Id Name
1 Stock
2 Lost
3 Repair
4 Locked
Example 2:
TableConstants
Id Name
1 User
2 MobileDevice
3 SIMcard
4 Create
5 Edit
6 Delete
7 Stock
8 Lost
9 Repair
10 Locked
Ignore the naming, as my tables have other names, but I am using these for clarification.
The downside of using one table for all constants is that if I want to add more later on, they don't really come in "groups"; but on the other hand, in SQL I should never rely on a specific order when I use the data.
Just because a table has a similar structure to another doesn't mean it stores the data describing identical entities.
There are some obvious reasons not to go with example 2.
Firstly, you may want to limit the values in your ObjectTypeID column to values that are valid object types. The obvious way to do this is to create a foreign key relationship to the ObjectType table. Creating a similar check on TableConstants would be much harder (in most database engines, you can't use a foreign key constraint in this way).
Secondly, it makes the database self describing - someone who is inspecting the schema will understand that "object type" is a meaningful concept in your business domain. This is important for long-lived applications, or applications with large development teams.
Thirdly, you often get specific business logic with those references - for instance, "status" often requires some logic to say "you can't modify a record in status LOCKED". This business logic often requires storing additional data attributes - that's not really possible with a "Constants" table.
Fourthly - "constants" have to be managed. If you have a large schema, very quickly people start to re-use constants to reflect slightly different concepts. Your "create" constant might get applied to a table storing business requests as well as your log events. This becomes almost unintelligible - and if the business decides log events don't refer to "create" but "write", your business transactions all start to look wrong.
What you could do is to use an ENUM (many database engines support this) to model attributes that don't have much logic beyond storing a name. This removes risks 1, 2 and 4, but does mean your logic is encoded in the database schema - adding a new object type is a schema change, not a data insertion.
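For example, in MySQL (one engine with a native ENUM type) a status attribute could be declared like this; the Device table is only for illustration:
CREATE TABLE Device (
    DeviceId INT PRIMARY KEY,
    Status   ENUM('Stock', 'Lost', 'Repair', 'Locked') NOT NULL
);
-- Engines without ENUM can approximate it with a CHECK constraint:
-- Status VARCHAR(10) NOT NULL CHECK (Status IN ('Stock', 'Lost', 'Repair', 'Locked'))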
I think that generally it is better to keep the tables apart (it helps documentation too). In some particular cases (the choice is yours) you could "merge" all similar tables into one (of course adding another column, such as TAB_TYPE, to distinguish them): this could give you some advantage in developing apps and reduce the overall number of tables (if this is a problem for you).
If they are all relatively small tables (with not many records), you should not have performance problems.

Best practices for referencing natural and/or surrogate key values in code

I'm modifying some stored procedures that manage status changes when records are updated.
For example, if I have these two tables
Request(RequestID, StatusID)
Status(StatusID, StatusName)
I'm trying to determine the best way to handle calling out the statuses in code.
Do I use StatusID or StatusName?
It's not guaranteed that StatusID will match between environments (DEV, PRE, PROD, etc).
Also, StatusName could be changed and I wouldn't want to have to alter code because I needed to change a StatusName.
I could create a 2nd unique column, which would sort of closely resemble StatusID.
I'd make sure this column was matched between regions, but that doesn't seem that clean either and sort of repetitive.
Can anyone suggest a cleaner, simpler way?
The difficulty of matching code to data can only partially be handled with a second column. When someone adds an item, what does this mean? If they re-use a known constant, what does it mean if you don't require this column to be unique?
Often times we will have user modifiable lookup tables, but they will have to be associated with a number of other flags indicating how to interpret the status - "IsTreatedAsExpired", "IsTreatedAsActive" or perhaps other tables which hold the statuses which are treated as certain things.
I think you really need to figure out the scope of what you want to allow with this table first. Because if you have a LOT of code references, you would be better off using a natural key which is in sync with your code on all installations. A possibility to handle this is to use negative numbers for unmovable codes (identity insert to add new unmovable codes) and then have your sequence only add positive ones. But again, this doesn't address the semantics of how your program would handle or use the user-entered extensions.
Again, it's hard to say without getting the full scope sorted out here.
From the information you've given, StatusID may have different values in different databases, presumably because your keys are generated automatically and are not specified by you. If so then obviously it's impossible to use StatusID consistently in your code anyway (without standardizing the values). Therefore the question becomes "is it acceptable/practical/desirable to hard-code StatusName values in my code?"
The obvious answer is yes, what's the alternative? If you have a certain status that represents 'ready' and you want to reference that in code then you must put something in your code that identifies the status unambiguously.
If you add a second key of some kind (as Carlos suggested) you still have the same basic problem that changing a natural key value is changing the identity of the status and therefore changes the meaning of your code. If you change the 'real' natural key (READY) without changing the second key (RDY) then your code will become more confusing and difficult to maintain.
If you do something more complex like extracting 'constants' or 'configuration parameters' into a configuration file or table or even writing a custom preprocessor to insert key values into your scripts at deployment time, you add lots of complexity for very little gain (unless you have other good reasons for doing it). I've seen this approach used, and it was a huge, unmaintainable mess.
In practice, StatusName is most likely to change because a) someone thinks another name would be 'more accurate' or 'look better', or b) you discover that it doesn't correctly represent your requirements. If you're forced to spend time on a) then just change the display name in your front end or reports and leave the database and code alone. If b) comes up then by definition your current data model and code are inaccurate and must be revised and possibly modified anyway. And when b) does happen, it often results in adding a new code, not changing the existing one (e.g. because someone defined a new process step that there is no existing code for).
And if you are open to changing your development and deployment practices there are other ways to look at this issue too, as others have suggested. Can you make your StatusID values the same everywhere? Technically it's possible, so what are the organizational reasons not to? Can you reduce the probability and impact of StatusName changes through change management and code reviews? Can you improve your requirements process to capture certain information more effectively?
Write a user-defined function that accepts a status name and returns the status id, and use it wherever you refer to the status id:
select * from resources where statusid = dbo.getStatusId('COMPLETED');
This makes sure that resolving the status id always happens within the function you have defined.
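A sketch of such a function in T-SQL, assuming the Status(StatusID, StatusName) table from the question:
CREATE FUNCTION dbo.getStatusId (@statusName VARCHAR(50))
RETURNS INT
AS
BEGIN
    -- The id-from-name lookup lives in exactly one place
    RETURN (SELECT StatusID FROM dbo.Status WHERE StatusName = @statusName);
END;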
As a rule of thumb, when you have id/value tables (Status, Result, Area, etc.) I usually add a third field that is the record's mnemonic value and always use that, rather than the name or the id.
The mnemonic value is like a business key (well, it is a business key) in the sense that it's a business value and does not depend on the database (for the id) or the way it is displayed (the description). So, for example, for your status table you may have:
StatusID, StatusName, StatusMnemo
1       , COMPLETED , COM
2       , REJECTED  , REJ
and so forth.
And in your queries you always join to the status table by StatusID, but add a clause that filters the status table by StatusMnemo. This is a value that's independent across environments and remains constant.
Also in inserts, you always use statusid.
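So a typical lookup against the Request table from the question might look like this (StatusMnemo being the suggested third column):
SELECT r.*
FROM Request r
INNER JOIN Status s ON s.StatusID = r.StatusID
WHERE s.StatusMnemo = 'COM';   -- the mnemonic is the same in DEV, PRE and PROD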
If you have statusID values that need special treatment then they should be the same across environments.
Why would you introduce a statusID that needs special treatment in Prod that has not gone through Pre and Dev?
What I often do is start the identity at 100 and use that range for generic statuses that don't need special treatment.
Then DEV owns the space under 100 for special treatment, using IDENTITY INSERT ON.
When you deploy from DEV to PRE, insert any records under 100.
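In SQL Server terms, that split could be set up roughly like this (the Status table and values are only illustrative):
-- Ordinary rows get generated ids starting at 100
CREATE TABLE dbo.Status (
    StatusID   INT IDENTITY(100, 1) PRIMARY KEY,
    StatusName VARCHAR(50) NOT NULL
);
-- Special-treatment codes are inserted explicitly, with the same id in every environment
SET IDENTITY_INSERT dbo.Status ON;
INSERT INTO dbo.Status (StatusID, StatusName) VALUES (1, 'READY');
SET IDENTITY_INSERT dbo.Status OFF;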

Db design for data update approval

I'm working on a project where we need to have data entered or updated by some users go through a pending status before being added into 'live data'.
Whilst preparing the data the user can save incomplete records. Whilst the data is in the pending status we don't want it to affect the rules imposed on users editing the live data, e.g. a user working on the live data should not run up against a unique constraint when entering the same data that is already in the pending status.
I envisage that sets of data updates will be grouped into a 'data submission' and the data will be re-validated and corrected/rejected/approved when someone quality-controls the submission.
I've thought about two scenarios with regard to storing the data:
1) Keeping the pending-status data in the same table as the live data, but adding a flag to indicate its status. I could see issues here with having to remove constraints or make required fields nullable to support the 'incomplete' status data. Then there is the issue of how to handle updating existing data: you would have to add a new row for an update and link it back to the existing 'live' row. This seems a bit messy to me.
2) Add new tables that mirror the live tables and store the data in there until it has been approved. This would allow me to keep full control over the existing live tables while the 'pending' tables can be abused with whatever the user feels he wants to put in there. The downside of this is that I will end up with a lot of extra tables/SPs in the db. Another issue I was thinking about was how might a user link between two records, whereby the record linked to might be a record in the live table or one in the pending table, but I suppose in this situation you could always take a copy of the linked record and treat it as an update?
Neither solutions seem perfect, but the second one seems like the better option to me - is there a third solution?
Your option 2 very much sounds like the best idea. If you want to use referential integrity and all the nice things you get with a DBMS, you can't have the pending data in the same table. But there is no need for the data to be unstructured; pending data is still structured, and presumably you want the db to play its part in enforcing rules even on this data. Even if you didn't, pending data fits well into a standard table structure.
A separate set of tables sounds the right answer. You can bring the primary key of the row being changed into the pending table so you know what item is being edited, or what item is being linked to.
I don't know your situation exactly so this might not be appropriate, but an idea would be to have a separate table for storing the batch of edits that are being made, because then you can quality control a batch, or submit a batch to live. Each pending table could have a batch key so you know what batch it is part of. You'll have to find a way to control multiple pending edits to the same rows (if you want to) but that doesn't seem too tricky a problem to solve.
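A rough sketch of that shape, assuming a live Event table keyed by EventID (all names and columns here are illustrative):
CREATE TABLE SubmissionBatch (
    BatchID     INT IDENTITY PRIMARY KEY,
    SubmittedBy VARCHAR(50) NOT NULL,
    BatchStatus VARCHAR(20) NOT NULL          -- e.g. Pending / Approved / Rejected
);
CREATE TABLE PendingEvent (
    PendingEventID INT IDENTITY PRIMARY KEY,
    BatchID        INT NOT NULL REFERENCES SubmissionBatch (BatchID),
    LiveEventID    INT NULL,                  -- key of the live row being edited; NULL for brand-new records
    EventName      VARCHAR(100) NULL,         -- live columns repeated here, but nullable so incomplete data is allowed
    EventDate      DATETIME NULL
);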
I'm not sure if this fits but it might be worth looking into 'Master Data Management' tools such as SQL Server's Master Data Services.
'Unit of work' is a good name for 'data submission'.
You could serialize it to a different place, like a (non-relational) document-oriented database, and only save it to the relational DB on approval.
It depends on how many of the live data constraints still need to apply to the unapproved data.
I think the second option is better. To manage this, you can use a view that contains both tables and work with this structure through the view.
Another good approach is to use an XML column in a separate table to store the necessary data (because of the unknown quantity/names of columns). You can create just one table with an XML column and a "Type" column to determine which table the document is related to.
The first scenario seems good.
Add a Status column to the table. There is no need to remove the nullable constraint; just add a function to check the required fields based on the flag, e.g. if the flag is 1 (incomplete), NULL is allowed, otherwise it is not.
Regarding the second doubt: do you want to append the data or update the whole record?

How do you clone and compare tables in NHibernate?

I have an application where I want to take a snapshot of all entities, creating cloned tables that represent a particular point in time. I then want to be able to compare the differences between these snapshots to see how the data evolves over time.
How would you accomplish this in NHibernate? It seems like NH isn't designed for this type of data manipulation, and I'm unsure if I'm abusing my database, NH, or both.
(P.S. Due to database engine restrictions I am unable to use views or stored procs.)
Do you really need to save the entirety of each entity in this snapshot? If so, maybe a collection of tables with names like type_snapshot would help. You could save your entities to this table (only inserting, never updating). You could store the original item's identifier, and generate a new identifier for the snapshot itself. And you could save the timestamp with each snapshot. Your item_snapshot table would look something like:
id | snapshot_date | item_id | item_prop1 | item_prop2 ...
123 | 7/16/10 | 15 | "item desc" | "item name" ...
Within your domain, maybe you could work with Snapshot instances (snapshot containing the id and the snapshot date, along with an instance of T)
It may not be ideal, as it'll introduce a second set of mappings, but it is a way to get where you're going. It seems like you might be better off doing something closer to the database engine, but without knowing what you have in mind for these snapshots (from an application perspective) it's hard to say.
I wound up augmenting my entities with a snapshot id column and copying the entries in place in the table. Combined with a filter, I can select from any given snapshot. Had to make some patches to legacy code, but it basically works.
We wound up creating duplicate tables with an extra timestamp column for snapshots. This kept the indexes on the main table smaller; we had 10 million+ rows, so adding versions in the same table would have created many more records. We also put the version tables in a different tablespace (db file on MSSQL).

Designing archive in database. Some patterns maybe?

We are currently working on a web application, one piece of functionality of which is to let users create Events. Those events can later be deleted by the user or an administrator. However, the client requires that the event is not really physically deleted from the database, but just marked as deleted. Users should only see non-deleted events, but administrators should be able to browse through deleted ones as well. That's really all the functionality there is.
Now I suggested that we should simply add one more extra column called "status", which would have couple of valid values: ACTIVE and DELETED. This way we can distinguish between normal(active) and deleted events and create really simple queries (SELECT * FROM EVENTS WHERE STATUS = 'ACTIVE').
My colleague however disagreed. He pointed out that, regardless of the fact that right now active events and deleted events share the same information (thus they can be stored in the same table), in the future the requirements may change and the client may, for example, need to store some additional information about a deleted Event (like the date of deletion, who deleted it, why they did it - a sort of comment). He said that to fulfil those requirements in the future we would have to add additional columns to the EVENTS table that would hold data specific to deleted Events and not to active events. He proposed a solution where an additional table is created (like DELETED_EVENTS) with the same schema as the EVENTS table. Every deleted event would be physically deleted from the EVENTS table and moved to the DELETED_EVENTS table.
I strongly disagreed with his idea. Not only would it make the SQL queries more complex and less efficient, but it is also totally against YAGNI. I also disagreed that my idea would force us to create additional (non-nullable) columns in the EVENTS table if the requirements changed in the future. In my scenario I would simply create a new table like DELETED_EVENTS_DATA (that would hold those additional, archive data) and add a reference key in the EVENTS table to maintain a one-to-one relationship between the EVENTS and DELETED_EVENTS_DATA tables.
Nevertheless I was struck by the fact that two developers who commonly share a similar view on software and database design could have such radically different opinions about how this requirement should be designed at the database level. I thought that maybe we are both going in the wrong direction and there is another (third) solution? Or are there more than just these alternatives?
How do you design this sort of requirement? Are there any patterns or guidelines on how to do it properly? Any help will be deeply appreciated.
Don't use a status column.
At minimum you should have DateDeleted and DeletedBy columns. Just knowing something was removed isn't helpful; even if the client isn't asking for it right now, the very first time they go to look at the deleted events they will want to know who deleted them in order to discern why.
If the events table is likely to grow pretty large in size it is common to move the deleted / archived data into a different table entirely. Usually you will allocate those tables to a different database file. That file usually lives on a different drive in order to keep performance up. I'm not saying a whole new database, just a different database file.
If you keep it in the same table, all of your queries should have a where clause on (DateDeleted is null). Obviously you don't have that requirement if the information is moved to a different table, which is why I recommend that way of doing things.
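A minimal sketch of the same-table variant, assuming an Events table with an EventId key (the view name and columns are illustrative):
ALTER TABLE Events ADD DateDeleted DATETIME NULL, DeletedBy INT NULL;
GO
-- Ordinary users query a view that hides deleted rows
CREATE VIEW ActiveEvents AS
    SELECT EventId, EventName
    FROM Events
    WHERE DateDeleted IS NULL;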
I found that taking snapshots of an object with every event (creation, update, etc.) and storing those snapshots (along with dates and user info) in another table allows you to meet all kinds of historical tracking needs in the lifetime of an application. You can then present the snapshots to the user, present chronological changes to the user, deduce the state of an object on a given date, etc..
I'm sure there are official design patterns out there - this is just one that I've refined over time and it works well. It's not efficient with disk space however.
EDIT: Also, when a user deletes an object, I would flag the record as deleted and take a final snapshot for the history table. You could hide the object from the interface indefinitely or you could choose to show it - depends on usage needs.
OK the way we handle it is as follows.
We have an extra column on every table called 'Deleted'; this is a bit field. Then, as you rightly said, your queries are quite simple, as it's just a where clause to filter deleted records out or leave them in. The only thing you need to make sure of is that any reporting or stats you generate filter out the deleted records.
Then, for the extra info you are talking about wanting to capture, this extra info would go in a separate 'audit'-like table. In our case we have made this extra table quite generic and it can hold audit info for any table... see below how it works...
Event
EventId EventName ... Deleted
1 Dinner 0
2 Supper 1
3 Lunch 0
4 Lunch 1
Audit
AuditId EntityTypeId EntityId ActionTypeId ActionDateTime ... etc
1 1 (Event) 2 (EventId) 1 (Deleted) 2/1/2010 12:00:00
1 1 (Event) 4 (EventId) 1 (Deleted) 3/1/2010 12:00:00
Now if you have other entities you want to capture (like Location - where Location is a table) as well it would look like this...
Audit
AuditId EntityTypeId EntityId ActionTypeId ActionDateTime ... etc
1 1 (Event) 2 (EventId) 1 (Deleted) 1/1/2010 12:00:00
1 1 (Event) 4 (EventId) 1 (Deleted) 2/1/2010 12:00:00
1 2 (Location) 2 (LocationId) 1 (Deleted) 3/1/2010 12:00:00
1 2 (Location) 8 (LocationId) 1 (Deleted) 4/1/2010 12:00:00
1 2 (Location) 9 (LocationId) 1 (Deleted) 5/1/2010 12:00:00
Then, when you want to get out the extra audit data you are talking about, it's quite simple. The query would look something like this:
SELECT *
FROM Event E
INNER JOIN Audit A
ON E.EventId = A.EntityId
WHERE E.Deleted = 1
AND A.EntityTypeId = 1 -- Where 1 stands for events
Also, this audit table can capture other events, not just deletes... This is done via the ActionTypeId column. At the moment it just has 1 (which is delete), but you could have others as well.
Hope this helps
EDIT:
On top of this if we have strong Audit requirements we do the following... None of the above changes but we create a second database called 'xyz_Audit' which captures the pre and post for every action that happens within the database. This second database has the same schema as the first database (without the Audit table) except that every table has 2 extra columns.
The first extra column is a PrePostFlag and the second column is the AuditId. Hence the primary key now goes across 3 columns, 'xyzId', 'PrePostFlag' and 'AuditId'.
By doing this we can give the admins full power to know who did what when, the data that changed and how it changed and to undelete a record we just need to change the deleted flag in the primary database.
Also, by having this data in a different database, it allows us to have different optimization, storage and management plans from the main transactional database.
I would add the flag field for now, and only bother to plan the rest when you actively know what you will have to do, and the system has also accumulated real-world data and user experiences, so you have some data to base your performance/complexity design decisions on.
It's often a judgement call in situations like this. Not knowing any more than you told me, I would tend to go with your solution though, which is to just have a virtual delete. I believe your application of YAGNI is good. If the user does in the future give requirements for logging stages in the events life, it's likely that at this time you guys will not correctly guess exactly what those requirements will be. This is especially true if the logic for dealing with events in the DB is well encapsulated (easy to change later).
However, if you know this client well, and you know of similar types of historical-type requirements they have had, and the functionality won't be well encapsulated, maybe your colleague is making a good guess. The key here is that whichever one of you is correct, it's not by much. Both sides have merit.
By the way, it will be better to have a boolean (yes/no) IsDeleted column, with an index beginning with that column. That will be quicker, though it perhaps would not make a big enough difference to matter.
A lot of this depends on the size of the tables and whether you really need additional information about the deletion.
In most cases, the deleted flag field is all you need. Then you create a view that selects the records where the record has not been deleted. Use the view for all queries for the users instead of directly accessing the tables.
If you have auditing, you already know who marked the record as deleted and when.
If not, you should add those fields to your table.
Periodically, I might move deleted records to an archive table in order to improve query performance on the main table. Say, move all records that were deleted more than 6 months ago. Then have another view that combines both the normal table and the archive table for the admins to query.
This combination of both approaches, in conjunction with using views, gets you the best of both worlds: your table stays relatively small for querying, everyone sees just the records they need to see, and it is relatively easy to undelete something deleted by accident. Archiving old records can happen at a low-usage period of the day rather than when the records are marked for deletion.
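A sketch of the admin view combining the live and archive tables, assuming they share the same columns (EventsArchive is a hypothetical name):
CREATE VIEW AllEvents AS
    SELECT EventId, EventName, Deleted FROM Events
    UNION ALL
    SELECT EventId, EventName, Deleted FROM EventsArchive;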
When a user creates, modifies or deletes an event, create a new transaction object. Store everything about the change to the event in the transaction, and add it to a table with a reference to the event. That way you have an audit log of everything the user has done. This adds minimal complexity but also allows for extension. You could even add an undo feature later on with minimal, if any, change to your data model.
So if the user is viewing the logs, you can retrieve every log without a DELETE transaction associated with it, though administrators would be able to see everything.
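A hedged sketch of what that transaction log could look like in SQL (all names and types here are assumptions):
CREATE TABLE EventTransaction (
    TransactionID   INT IDENTITY PRIMARY KEY,
    EventID         INT NOT NULL REFERENCES Event (EventID),
    TransactionType VARCHAR(10) NOT NULL,          -- CREATE / UPDATE / DELETE
    ChangedBy       VARCHAR(50) NOT NULL,
    ChangedAt       DATETIME NOT NULL DEFAULT GETDATE(),
    Details         XML NULL                       -- everything about the change, usable for display or undo
);
-- Regular users see only events with no DELETE transaction; admins query Event directly
SELECT e.*
FROM Event e
WHERE NOT EXISTS (
    SELECT 1
    FROM EventTransaction t
    WHERE t.EventID = e.EventID
      AND t.TransactionType = 'DELETE'
);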