How to manage multiple versions of the same record - sql

I am doing short-term contract work for a company that is trying to implement a check-in/check-out type of workflow for their database records.
Here's how it should work...
A user creates a new entity within the application. There are about 20 related tables that will be populated in addition to the main entity table.
Once the entity is created the user will mark it as the master.
Another user can make changes to the master only by "checking out" the entity. Multiple users can check out the entity at the same time.
Once the user has made all the necessary changes to the entity, they put it in a "needs approval" status.
After an authorized user reviews the entity, they can promote it to master which will put the original record in a tombstoned status.
The way they are currently accomplishing the "check out" is by duplicating the entity records in all the tables. The primary keys include EntityID + EntityDate, so they duplicate the entity records in all related tables with the same EntityID and an updated EntityDate and give it a status of "checked out". When the record is put into the next state (needs approval), the duplication occurs again. Eventually it will be promoted to master at which time the final record is marked as master and the original master is marked as dead.
This design seems hideous to me, but I understand why they've done it. When someone looks up an entity from within the application, they need to see all current versions of that entity. This was a very straightforward way for making that happen. But the fact that they are representing the same entity multiple times within the same table(s) doesn't sit well with me, nor does the fact that they are duplicating EVERY piece of data rather than only storing deltas.
I would be interested in hearing your reaction to the design, whether positive or negative.
I would also be grateful for any resources you can point me to that might be useful for seeing how someone else has implemented such a mechanism.
Thanks!
Darvis

I've worked on a system like this which supported the static data for trading at a very large bank. The static data in this case is things like the details of counterparties, standard settlement instructions, currencies (not FX rates) etc. Every entity in the database was versioned, and changing an entity involved creating a new version, changing that version and getting the version approved. They did not however let multiple people create versions at the same time.
This led to a horribly complex database, with every join having to take version and approval state into account. In fact the software I wrote for them was middleware that abstracted this complex, versioned data into something that end-user applications could actually use.
The only thing that could have made it any worse was to store deltas instead of complete versioned objects. So the point of this answer is - don't try to implement deltas!

This looks like an example of a temporal database schema. Often, in cases like that, there is a distinction made between an entity's key (EntityID, in your case) and the row primary key in the database (in your case, {EntityID, date}, but often a simple integer). You have to accept that the same entity is represented multiple times in the database, at different points in its history. Every database row still has a unique ID; it's just that your database is tracking versions, rather than entities.
You can manage data like that, and it can be very good at tracking changes to data, and providing accountability, if that is required, but it makes all of your queries quite a bit more complex.
You can read about the rationale behind temporal databases, and their design, on Wikipedia.
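To make the entity key versus row key distinction concrete, here is a minimal sketch using Python's sqlite3 module. The table name entity_version, the surrogate version_id column and the status values are my own assumptions for illustration, not the schema from the question.

```python
import sqlite3


def build_demo():
    """Create an in-memory table where each row is one version of an entity."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE entity_version (
            version_id  INTEGER PRIMARY KEY,  -- surrogate row key (assumed)
            entity_id   INTEGER NOT NULL,     -- entity key shared by all versions
            entity_date TEXT    NOT NULL,
            status      TEXT    NOT NULL,     -- hypothetical status values
            name        TEXT    NOT NULL,
            UNIQUE (entity_id, entity_date)
        )
        """
    )
    conn.executemany(
        "INSERT INTO entity_version (entity_id, entity_date, status, name) "
        "VALUES (?, ?, ?, ?)",
        [
            (1, "2020-01-01", "tombstoned", "Acme"),
            (1, "2020-02-01", "master", "Acme Ltd"),
            (1, "2020-03-01", "checked_out", "Acme Limited"),
            (2, "2020-01-15", "master", "Globex"),
        ],
    )
    conn.commit()
    return conn


def current_master(conn, entity_id):
    """Return the name on the master version, or None if there is none."""
    row = conn.execute(
        "SELECT name FROM entity_version "
        "WHERE entity_id = ? AND status = 'master'",
        (entity_id,),
    ).fetchone()
    return row[0] if row else None


def live_versions(conn, entity_id):
    """Every version a user should see: all rows that are not tombstoned."""
    return conn.execute(
        "SELECT status, name FROM entity_version "
        "WHERE entity_id = ? AND status <> 'tombstoned' "
        "ORDER BY entity_date",
        (entity_id,),
    ).fetchall()
```

Note that even this tiny example needs a status filter in every query, which is the extra query complexity mentioned above.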

You are describing a homebrew Content Management System which was probably hacked together over time, is - for the reasons you state - redundant and inefficient, and given the nature of such systems in firms is unlikely to be displaced without massive organizational effort.

Related

Create trigger upon each table creation in SQL Server 2008 R2

I need to create an Audit table that is going to track the actions (insert, update, delete) of my tables in the database and add new row with date, row id, table name and a few more details, so I will know what action happened and when.
So basically from my understanding I need a trigger for each table which is going to track insert/update/delete and a trigger on the database which is going to track new table creation.
My main problem is understanding how to connect between those things so when a new table is being created a trigger will be created for that table which is going to track the actions and add new rows for the Audit table as needed.
Is it possible to make a DDL trigger for create_table and inside of it another trigger for insert / update / delete ?
What you're hoping for is not possible. And I'd strongly advise that you'd be better off thinking about what you really want to achieve at a business level with auditing. It will yield a much simpler and more practical solution.
First up
...trigger on the database which is going to track new table creation.
I cannot stress enough how terrible this idea is. Who exactly has such unfettered access to your database that they can create tables without going through code review and QA? Both should of course be on the gated pathway towards production. Once you realise that schema changes should not happen ad hoc, it's patently obvious that you don't need triggers (which are by their very nature reactive) to do something because the schema changed.
Even if you could write such triggers: it's at a meta-programming level that simply isn't worth the effort of trying to foresee all possible permutations.
Better options include:
Requirements assessment and acceptance: This is new information in the system. What are the audit requirements?
Design review: New table; does it need auditing?
Test design: How to test the audit requirements?
Code Review: You've added a new table. Does it need auditing?
Not to mention features provided by tools such as:
Source Control.
Db deployment utilities (whether home-grown or third party).
Part two
... a trigger will be created for that table which is going to track the actions and add new rows for the Audit table as needed.
I've already pointed out why doing the above automatically is a terrible idea. Now I'm going a step further to point out that doing the above at all is also a bad idea.
It's a popular approach, and I'm sure to get some flak from people who've nicely compartmentalised their particular flavour of it, swearing blind how much time it "saves" them. (There may even be claims to it being a "business requirement", which I can assure you is more likely a misstated version of the real requirement.)
There are fundamental problems with this approach:
It's reactive instead of proactive. So it usually lacks context.
You'll struggle to audit attempted changes that get rolled back. (Which can be a nightmare for debugging and usually violates real business audit requirements.)
Interpreting audit will be a nightmare because it's just raw data. The information is lost in the detail.
As columns are added/renamed/deleted your audit data loses cohesion. (This is usually the least of problems though.)
These extra tables that always get updated as part of other updates can wreak havoc on performance.
Usually this style of auditing involves: every time a column is added to the "base" table, it's also added to the "audit" table. (This ultimately makes the "audit" table very much like a poorly architected persistent transaction log.)
Most people following this approach overlook the significance of NULLable columns in the "base" tables.
I can tell you from first-hand experience, interpreting such audit trails in any but the simplest of cases is not easy. The amount of time wasted is ridiculous: investigating issues, training others to be able to interpret them correctly, writing utilities to try to make working with these audit trails less painful, painstakingly documenting findings (because the information is not immediately apparent in the raw data).
If you have any sense of self-preservation you'll heed my advice.
Make it great
(Sorry, couldn't resist.)
A better approach is to proactively plan for what needs auditing. Push for specific business requirements. Note that different cases may need different auditing techniques:
If user performs action X, record A details about the action for legal traceability.
If user attempts to do Y but is prevented by system rules, record B details to track rule system integrity.
If user fails to log in, record C details for security purposes.
If system is upgraded, record D details for troubleshooting.
If certain system events occur, record E details ...
The important thing is that once you know the real business requirements, you won't be saying: "Uh, let's just track everything. It might be useful." Instead you'll:
Be able to produce a cleaner more appropriate and reliable design for each distinct kind of auditing.
Be able to test that it behaves as required!
Be able to use the audit data more easily whenever it's needed.
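As a rough illustration of one such requirement (recording an attempt that system rules prevent), here is a small sketch with Python's sqlite3 module. The account and audit_event tables, the withdraw function and the event names are all made up for the example; real requirements would dictate what gets recorded.

```python
import sqlite3
from datetime import datetime, timezone


def open_db():
    """In-memory database with one account and an empty audit table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE audit_event ("
        "id INTEGER PRIMARY KEY, occurred_at TEXT, actor TEXT, kind TEXT, detail TEXT)"
    )
    conn.execute("INSERT INTO account (id, balance) VALUES (1, 100)")
    conn.commit()
    return conn


def record_event(conn, actor, kind, detail):
    """Write one audit row; the caller decides when to commit."""
    conn.execute(
        "INSERT INTO audit_event (occurred_at, actor, kind, detail) "
        "VALUES (?, ?, ?, ?)",
        (datetime.now(timezone.utc).isoformat(), actor, kind, detail),
    )


def withdraw(conn, actor, account_id, amount):
    """Apply a withdrawal, or record why it was refused. Returns True on success."""
    conn.execute(
        "UPDATE account SET balance = balance - ? WHERE id = ?",
        (amount, account_id),
    )
    (balance,) = conn.execute(
        "SELECT balance FROM account WHERE id = ?", (account_id,)
    ).fetchone()
    detail = f"account={account_id} amount={amount}"
    if balance < 0:
        conn.rollback()  # undo the data change ...
        record_event(conn, actor, "withdrawal_denied", detail)
        conn.commit()    # ... but keep the record of the attempt
        return False
    record_event(conn, actor, "withdrawal", detail)
    conn.commit()
    return True
```

The point of the sketch is that the refused attempt is recorded with its context even though the data change itself was rolled back, something a trigger on the base table would never see.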

Ever ok to remove referential integrity on database design for 'right to be forgotten' deletion of user records?

I'm currently reviewing a database design that, in order to meet requirements such as the DPA and the EU GDPR Right to be Forgotten, proposes not to enforce referential integrity between the user record and 'related' tables such as Transaction, Communication Event, etc. The idea is that the user record can be deleted when requested while records in the related tables (which use a non-identifying key/sequence number) remain intact.
So, before I push back on this and open up the 'discussion' that will follow, I wondered whether anyone thought it was ever acceptable to remove, or do without, referential integrity in cases such as this, or should other methods be used - such as masking the user details, or changing the user record to a placeholder record to show that the transaction relates to a redacted user.
All thoughts welcome...
This is a complicated topic that goes beyond referential integrity constraints.
My understanding of the EU privacy restrictions (and I stress that I am not a lawyer) is that they relate to personally identifiable information, not to business related "anonymous" relationships. For instance, I think you can still count a removed user as "active" for the period when they were active; you just can't know who they are.
My approach would be to put all PII data into a single table/database. When a user wishes to be forgotten, I would update the record to remove the PII. All the foreign key relationships are then fine. You are just missing the name, address, email address, and whatever else is deemed PII.
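A minimal sketch of that idea, using Python's sqlite3 module: the PII lives only in the user table, forgetting a user blanks those columns, and the foreign key from the related table stays enforced. The table and column names (app_user, purchase, forgotten) are assumptions for the example.

```python
import sqlite3


def open_db():
    """In-memory database with one user and two purchases that reference it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
    conn.execute(
        "CREATE TABLE app_user ("
        "id INTEGER PRIMARY KEY, name TEXT, email TEXT, "
        "forgotten INTEGER NOT NULL DEFAULT 0)"
    )
    conn.execute(
        "CREATE TABLE purchase ("
        "id INTEGER PRIMARY KEY, "
        "user_id INTEGER NOT NULL REFERENCES app_user(id), "
        "amount INTEGER NOT NULL)"
    )
    conn.execute(
        "INSERT INTO app_user (id, name, email) "
        "VALUES (1, 'Ann Example', 'ann@example.com')"
    )
    conn.executemany(
        "INSERT INTO purchase (user_id, amount) VALUES (?, ?)", [(1, 25), (1, 40)]
    )
    conn.commit()
    return conn


def forget_user(conn, user_id):
    """Remove the PII but keep the row, so related records still resolve."""
    conn.execute(
        "UPDATE app_user SET name = NULL, email = NULL, forgotten = 1 WHERE id = ?",
        (user_id,),
    )
    conn.commit()


def delete_user(conn, user_id):
    """Hard delete; returns False when a foreign key still references the row."""
    try:
        conn.execute("DELETE FROM app_user WHERE id = ?", (user_id,))
        conn.commit()
        return True
    except sqlite3.IntegrityError:
        conn.rollback()
        return False
```

Here delete_user fails for a user with purchases because the constraint is still in place, while forget_user succeeds and the purchase totals still add up.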
Just identifying the PII is very tricky, because email addresses and user names and so on can be embedded in the most unusual places (URLs are one obvious place to look but there can be others).
I don't recommend actually removing all traces of the person from all databases. You will then be in a situation where your reports no longer balance . . . Oh, our reports said we had 1,000,000 customers then, but we can only find 999,900 of them. Let's waste a bunch of people's efforts to figure out what happened.
My suggestion: Be careful. This is a long process, so set expectations in your organization accordingly.
Please have a look at retention laws for your industry. People have the right to be forgotten, but businesses also have a legal obligation to retain certain records for a period of time.
At this point it's unclear to me which regulation overrules the other, so my advice is to bring in a legal expert who can clear this up.
From a technical perspective, your application might require business data related to private data, so a good approach is to flag the records as forgotten and replace private data with generated data. This way, your application keeps behaving the same way, but the private information is gone.
This is a simple approach that can be applied on many legacy applications, even automated as a process.
The only thing you must watch out for is backups, as your changes might be reverted if data has to be restored from one. Keep a separate table with keys pointing to the records that are required to be forgotten, so that if a restore overwrites the latest changes you can run your automation script again to remove those who want to be forgotten.
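Here is a rough sketch of such a re-runnable script in Python with sqlite3. It assumes the list of forgotten keys is kept somewhere the restore does not overwrite (modelled here as a second database); the customer and forget_request tables and the generated placeholder values are invented for the example.

```python
import sqlite3


def new_data_db(rows):
    """Stand-in for the application database (or a copy restored from backup)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customer ("
        "id INTEGER PRIMARY KEY, name TEXT, email TEXT, "
        "forgotten INTEGER NOT NULL DEFAULT 0)"
    )
    conn.executemany("INSERT INTO customer (id, name, email) VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def new_forget_db():
    """Separate store holding only the keys of customers to be forgotten."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE forget_request (customer_id INTEGER PRIMARY KEY)")
    conn.commit()
    return conn


def request_forget(forget_db, customer_id):
    """Remember the key; asking twice for the same customer is harmless."""
    forget_db.execute(
        "INSERT OR IGNORE INTO forget_request (customer_id) VALUES (?)",
        (customer_id,),
    )
    forget_db.commit()


def apply_forget_list(data_db, forget_db):
    """Scrub every listed customer. Safe to run repeatedly.

    Returns the number of rows scrubbed during this run.
    """
    ids = [r[0] for r in forget_db.execute("SELECT customer_id FROM forget_request")]
    scrubbed = 0
    for cid in ids:
        cur = data_db.execute(
            "UPDATE customer SET name = ?, email = ?, forgotten = 1 "
            "WHERE id = ? AND forgotten = 0",
            (f"forgotten-{cid}", f"forgotten-{cid}@invalid.example", cid),
        )
        scrubbed += cur.rowcount
    data_db.commit()
    return scrubbed
```

Running apply_forget_list against a freshly restored copy scrubs the same customers again, and running it twice against the same database changes nothing the second time.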

Table with multiple foreign keys -- only one not null

I'm trying to design a system where an administrator will have to approve changes to the data and other various administrative tasks -- add a user, add an admin etc.
My idea is to have a notification table that contains these notifications, but the problem is that a notification can be any of the previously mentioned types, i.e. its data is stored in one of many tables. Here is a picture to describe my current plan -- note I'm sure that it's not a proper ER diagram.
Also, the data goes into a pending table, that reflects the table it will eventually wind up in, provided the data is approved -- it's a staging ground of sorts. So, a pending_user is a user that is not in the user table. And as you can see the user table, amongst others, is not shown here, but one can use their imagination.
I'm concerned that the multiple null values in the pending table will have adverse effects that I'm not totally aware of, such as increased space usage and possibly increased query time. Also, I'm not sure how I'll implement the retrieval of these notifications. My naive approach is to select the first X notifications, analyze the rows to find the non-null column, retrieve the appropriate data and then load all the data in a response.
Is there a more straightforward pattern for this type of problem?
Thanks in advance for any help.
I think the traditional way is to provide various levels of access/read/write rights to users. These access rights define what actions a user can and can't perform. In this traditional approach, if a user has access to a certain function, they can use it without further approval.
Also, traditionally there are some kind of audit logs that contain a trace of all important changes to the data. With such logs it would be possible to know who made a change (and when).
If you need to build a two-stage system, where a change has to go through an approval, I'd add a flag column to each important table that would indicate that values in the given row are not final and have to be approved. The table would store all historical changes to the data and with the help of this flag the system would know which variant is the latest approved version and which variant is pending and waiting for approval.
I would not try to make a single universal table that would hold data related to changes in many different tables. Each table is different and approval process for each table is likely to be different. I doubt that you'll have more than a dozen entities that are important enough to go through this approval process.
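A minimal sketch of the flag-column idea for one table, in Python with sqlite3. The user_account table, its approved column and the helper names are my own choices for illustration.

```python
import sqlite3


def open_db():
    """One table holds every proposed and approved variant of a row."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE user_account (
            row_id   INTEGER PRIMARY KEY,
            username TEXT NOT NULL,
            email    TEXT NOT NULL,
            approved INTEGER NOT NULL DEFAULT 0  -- 0 = pending, 1 = approved
        )
        """
    )
    conn.commit()
    return conn


def propose(conn, username, email):
    """Store a change as a new pending row and return its row id."""
    cur = conn.execute(
        "INSERT INTO user_account (username, email) VALUES (?, ?)",
        (username, email),
    )
    conn.commit()
    return cur.lastrowid


def approve(conn, row_id):
    """Mark one pending row as approved."""
    conn.execute("UPDATE user_account SET approved = 1 WHERE row_id = ?", (row_id,))
    conn.commit()


def current_email(conn, username):
    """Latest approved value; pending rows are ignored."""
    row = conn.execute(
        "SELECT email FROM user_account "
        "WHERE username = ? AND approved = 1 "
        "ORDER BY row_id DESC LIMIT 1",
        (username,),
    ).fetchone()
    return row[0] if row else None


def pending(conn):
    """Rows still waiting for an administrator, oldest first."""
    return conn.execute(
        "SELECT row_id, username, email FROM user_account "
        "WHERE approved = 0 ORDER BY row_id"
    ).fetchall()
```

With this layout the list of things awaiting approval is just a query per table (pending above), so no separate notification table with many nullable foreign keys is needed.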

How to store complex records for referencing historical revisions?

I have a table on my database that outlines complex processes in a work breakdown structure (similar to what's used to create Gantt charts). There are multiple rows for a particular process, each row outlining a hierarchical step of a particular process.
I then have a table with some product types, each being linked to a particular process. When an order for a particular product is placed - it is to be manufactured with the associated process.
In my situation, the processes can be dynamic (steps added or removed, for example).
I'm curious as to what the best way to capture current and historical revisions of each process is, such that even though a process may have evolved over time - I can historically go back to a particular order and determine what the process looked like at that time.
I'm sure there are multiple ways to go about this, using logging or triggers with a new history table - but I've had no experience doing something like this and I'd like to know what worked well for others.

Audit Logging Strategies

I am trying to decide on the best method for audit logging within my application. The main reason for the log is reporting the sequence of events (changes).
I have a hierarchy of objects, and I need to create reports at a later date when something changes on any part of that hierarchy.
I think that I have three options:
Have a log for each table, thereby matching the hierarchy of objects, then create a view for the report.
Flatten the hierarchy and de-normalise the table, making reporting easier - simple select statement.
Have one log table and have a record for each change making reporting harder but more flexible to changes.
I am currently leaning towards option 1.
I have to speak to this subject even though it's old.
It is usually a poor idea to have only one audit table as you will create locking problems in the database as everything hits that table. Use separate audit tables for each table.
It is also a poor idea to have the application do the auditing. Auditing must be done at the database level or you risk losing some of the information. In most databases, data does not change only through applications; no one is going to change the prices of all their products one at a time from the user interface when you need a 10% increase to all 10,000,000 of them. Auditing should capture all changes, not just some of them. This should be done in a trigger in most databases (SQL Server 2008 has a built-in auditing feature). Some of the worst possible changes (employees committing fraud or wanting to maliciously destroy data) also frequently come from places other than the application, especially if you allow table-level access to users (which you should not do in any financial database or one that contains personal information). Auditing from the application won't catch this. Developers often forget that in protecting their data, outside sources are not the only threat.
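To illustrate, here is a small sketch of a per-table audit trigger, written against SQLite through Python's sqlite3 module so it can be run as is. The product and product_audit tables are invented for the example. The trigger syntax is SQLite's row-level form with OLD and NEW; in SQL Server, DML triggers fire once per statement and read the inserted and deleted pseudo-tables instead, so this would need to be rewritten there.

```python
import sqlite3


def open_db():
    """Products plus a dedicated audit table filled by a trigger."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE product (id INTEGER PRIMARY KEY, price INTEGER NOT NULL);

        CREATE TABLE product_audit (
            audit_id   INTEGER PRIMARY KEY,
            product_id INTEGER NOT NULL,
            old_price  INTEGER,
            new_price  INTEGER,
            changed_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
        );

        -- Fires for every updated row, whoever issued the UPDATE.
        CREATE TRIGGER product_price_audit
        AFTER UPDATE OF price ON product
        FOR EACH ROW
        BEGIN
            INSERT INTO product_audit (product_id, old_price, new_price)
            VALUES (OLD.id, OLD.price, NEW.price);
        END;

        INSERT INTO product (id, price) VALUES (1, 100), (2, 200), (3, 300);
        """
    )
    conn.commit()
    return conn


def raise_all_prices(conn, percent):
    """A bulk change that bypasses any per-row application code path."""
    conn.execute("UPDATE product SET price = price * (100 + ?) / 100", (percent,))
    conn.commit()


def audit_rows(conn):
    """Audit entries as (product_id, old_price, new_price), ordered by product."""
    return conn.execute(
        "SELECT product_id, old_price, new_price "
        "FROM product_audit ORDER BY product_id"
    ).fetchall()
```

A single bulk UPDATE produces one audit row per product without the application taking part, which is the case application-level auditing tends to miss.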
An audit log is basically a chronological list of events that occurred, who performed these events, and what the events were.
I think a flat view would be better as it can be easily ordered and queried. So I'm leaning more towards your option #2/#3.
Include things like the transaction type, the time, the user id, a description of what's changed, and other pertinent information related to your product.
You can also add things to your product over time and you won't need to continually modify your audit log module.
If it's for auditing purposes I'd use a true append-only medium rather than a table/tables in the same db.
You suggest it's for change history purposes - in which case I would restructure your application/db to record the actual events in the first place rather than just the current state.
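A very small sketch of what recording the events instead of the current state could look like, in plain Python. The Event and ChangeLog names are hypothetical, and an in-memory list stands in for whatever append-only medium you would really use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    seq: int      # position in the log, starting at 1
    field: str    # which attribute changed
    value: str    # the value it changed to


class ChangeLog:
    """Append-only list of events; state is derived by replaying them."""

    def __init__(self):
        self._events = []

    def append(self, field, value):
        """Record a change and return its sequence number."""
        event = Event(len(self._events) + 1, field, value)
        self._events.append(event)
        return event.seq

    def state_at(self, seq):
        """Rebuild the object as it looked after event number `seq`."""
        state = {}
        for event in self._events:
            if event.seq > seq:
                break
            state[event.field] = event.value
        return state

    def history(self):
        """The chronological report: every event, oldest first."""
        return [(e.seq, e.field, e.value) for e in self._events]
```

The chronological report then falls out of the log directly (history above), and the state at any earlier point can be rebuilt by replaying events up to that point.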
I would go with (2) and (3): create a single table for all Audit entries.
A flat view is good, provided the extra work flattening does not impact performance.
You could look into an AOP framework to help with this. It would allow you to inject logging functionality at the beginning or end of any/all methods. If you go down this road, it might help define what would make sense for storing the log data.