Versioning in relational database

I'm having trouble introducing good versioning into my database design.
Let's take an easy example: a small rental service.
You have a table Person (P_ID, Name), a table Computer (C_ID, Type) and a table Rent (R_ID, P_ID, C_ID, FromData, ToData).
I want to be able to change, say, a person's name, create a new version, and still have the old data at hand if I need it.
My goal is to have some kind of system on my websites which makes it easy to version some of the records in a table.
More information:
I have business logic that demands that I can release a record for a version. I also have to be able to roll back to older versions. The reason is that I want exports for different versions of the data.

Before jumping into a solution, it might be a good idea to ask what behaviour you want to achieve. Do you need versioning for auditing purposes, so that users can roll back changes, for some business rule, or is there another reason?
Once you know this, the answer should pretty much jump out at you. For example, if auditing is your purpose, you could add database triggers and store the old and new values in a separate [Audit] table.

You have made a statement (that you want versioning), but not asked a question (exactly what your problem is). Without a question, it's hard to provide an answer.
In general, you could provide versioning by:
Identifying what entity needs to be versioned. In this case it sounds like you may want to be versioning a "deal" or "rental agreement".
Add a PK column, version number column, and "originalID" column to the table at the top of the model for that entity.
To create a version, copy the top-level record to a new PK, placing the original PK in the "originalID" column and incrementing the version number column. Copy the rows in the related tables, changing the FK in those rows to match the PK of the new record. Then allow the user to modify the records pertaining to the new-PK version of the record.
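The copy-on-version steps above can be sketched roughly as follows. This is only an illustration using Python's built-in sqlite3 module; the tables rental_agreement and rental_item, their columns, and the helper create_new_version are all made-up names, and the sketch assumes you always copy from the latest version of a record.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE rental_agreement (              -- hypothetical top-level entity
    id          INTEGER PRIMARY KEY,
    original_id INTEGER REFERENCES rental_agreement(id),  -- NULL for version 1
    version     INTEGER NOT NULL DEFAULT 1,
    person_name TEXT NOT NULL
);
CREATE TABLE rental_item (                   -- hypothetical related table
    id           INTEGER PRIMARY KEY,
    agreement_id INTEGER NOT NULL REFERENCES rental_agreement(id),
    computer     TEXT NOT NULL
);
""")

def create_new_version(conn, agreement_id):
    """Copy an agreement and its child rows; return the PK of the copy."""
    with conn:  # commit on success, roll back on error
        original_id, version, person_name = conn.execute(
            "SELECT COALESCE(original_id, id), version, person_name "
            "FROM rental_agreement WHERE id = ?", (agreement_id,)).fetchone()
        cur = conn.execute(
            "INSERT INTO rental_agreement (original_id, version, person_name) "
            "VALUES (?, ?, ?)", (original_id, version + 1, person_name))
        new_id = cur.lastrowid
        # Copy the related rows, pointing their FK at the new PK.
        conn.execute(
            "INSERT INTO rental_item (agreement_id, computer) "
            "SELECT ?, computer FROM rental_item WHERE agreement_id = ?",
            (new_id, agreement_id))
    return new_id

conn.execute("INSERT INTO rental_agreement (person_name) VALUES ('Alice')")
conn.execute("INSERT INTO rental_item (agreement_id, computer) VALUES (1, 'Laptop')")
v2 = create_new_version(conn, 1)
# Edits now go to the new version; version 1 stays untouched.
conn.execute("UPDATE rental_agreement SET person_name = 'Alice Smith' WHERE id = ?", (v2,))
```

Here original_id always points at the first version, so all versions of one record can be found with a single lookup; that is a design choice of this sketch, not something the steps above require.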

You could use triggers:
http://weblogs.asp.net/jgalloway/archive/2008/01/27/adding-simple-trigger-based-auditing-to-your-sql-server-database.aspx

You could create an Archive table, populated via a stored procedure or trigger with a copy of all the fields of a row in the primary table after every update or insert. The archive table would have its own PK and timestamps for when the changes were made.
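A rough sketch of the archive-table idea, using SQLite through Python's sqlite3 module purely for illustration. The table person_archive, its columns and the trigger names are assumptions; trigger syntax differs between database products.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (p_id INTEGER PRIMARY KEY, name TEXT NOT NULL);

CREATE TABLE person_archive (                -- hypothetical archive table
    archive_id  INTEGER PRIMARY KEY,         -- the archive's own PK
    p_id        INTEGER NOT NULL,
    name        TEXT NOT NULL,
    archived_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);

-- Copy the full row into the archive after every insert and update.
CREATE TRIGGER person_after_insert AFTER INSERT ON person
BEGIN
    INSERT INTO person_archive (p_id, name) VALUES (NEW.p_id, NEW.name);
END;

CREATE TRIGGER person_after_update AFTER UPDATE ON person
BEGIN
    INSERT INTO person_archive (p_id, name) VALUES (NEW.p_id, NEW.name);
END;
""")

conn.execute("INSERT INTO person (name) VALUES ('Alice')")
conn.execute("UPDATE person SET name = 'Alice Smith' WHERE p_id = 1")
history = conn.execute(
    "SELECT p_id, name FROM person_archive ORDER BY archive_id").fetchall()
```

After these two statements, history holds one archived row per state the person row has been in.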

Related

What is the best method of logging data changes and user activity in an SQL database?

I'm starting a new application and was wondering what the best method of logging is. Some tables in the database will need to have every change recorded, and the user that made the change. Other tables may just need to have the last modified time recorded.
In previous applications I've used different methods to do this but want to hear what others have done.
I've tried the following:
1) Add a "modified" date-time field to the table to record the last time it was edited.
2) Add a secondary table just for recording changes in a primary table. Each row in the secondary table represents a changed field in the primary table. So one record update in the primary could create several records in the secondary table.
3) Add a table similar to no. 2, but one that records edits across three or four tables, referencing the table each edit relates to in an additional field.
What methods do you use and would you recommend?
Also, what is the best way to record deleted data? I never like the idea that a user can permanently delete a record from the DB, so usually I have a boolean field 'deleted' which is set to true when the record is deleted, and the record is then filtered out of all queries at model level. Any other suggestions on this?
Last one: what is the best method for recording user activity? At the moment I have a table which records logins, logouts, password changes etc. and assigns each action a code (1, 2, 3 and so on).
Hope I haven't crammed too much into this question. thanks.

I know it's a very old question, but I wanted to add a more detailed answer, as this is the first link I got when googling about DB logging.
There are basically two ways to log data changes:
on application server layer
on database layer.
If you can, just log on the application server side. It is much clearer and more flexible.
If you need to log at the database layer you can use triggers, as @StanislavL said. But triggers can slow down your database and limit you to storing the change log in the same database.
You can also look at transaction log monitoring.
For example, in PostgreSQL you can use the logical replication mechanism to stream changes in JSON format from your database to anywhere.
In a separate service you can then receive, handle and log the changes in any form and in any database (for example, just put the JSON you receive into Mongo).

You can add triggers to any tracked table to listen for insert/update/delete. In the triggers, just check the NEW and OLD values and write them to a special table with these columns:
table_name
entity_id
modification_time
previous_value
new_value
user
It's hard to figure out which user made a change from inside a trigger, but it is possible if you add a changed_by column to the table you are listening on.
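As a rough illustration of such a log table, here is a sketch using SQLite via Python's sqlite3 module. The customer table and the trigger name are invented for the example, only updates to one column are logged, and the application is assumed to fill changed_by on every write.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (                      -- hypothetical tracked table
    id         INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    changed_by TEXT NOT NULL                 -- set by the application on each write
);
CREATE TABLE change_log (
    table_name        TEXT NOT NULL,
    entity_id         INTEGER NOT NULL,
    modification_time TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    previous_value    TEXT,
    new_value         TEXT,
    "user"            TEXT                   -- quoted: reserved word in some databases
);
-- Log only real changes to the name column.
CREATE TRIGGER customer_log_update AFTER UPDATE OF name ON customer
WHEN OLD.name IS NOT NEW.name
BEGIN
    INSERT INTO change_log (table_name, entity_id, previous_value, new_value, "user")
    VALUES ('customer', NEW.id, OLD.name, NEW.name, NEW.changed_by);
END;
""")

conn.execute("INSERT INTO customer (name, changed_by) VALUES ('Acme', 'alice')")
conn.execute("UPDATE customer SET name = 'Acme Ltd', changed_by = 'bob' WHERE id = 1")
log = conn.execute(
    'SELECT table_name, entity_id, previous_value, new_value, "user" FROM change_log'
).fetchall()
```

Insert and delete triggers would follow the same pattern, with previous_value or new_value left NULL respectively.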

Db design for data update approval

I'm working on a project where we need to have data entered or updated by some users go through a pending status before being added into 'live data'.
Whilst preparing the data the user can save incomplete records. Whilst the data is in the pending status we don't want it to affect the rules imposed on users editing the live data, e.g. a user working on the live data should not run up against a unique constraint when entering the same data that is already in the pending status.
I envisage that sets of data updates will be grouped into a 'data submission', and that the data will be re-validated and corrected/rejected/approved when someone quality-controls the submission.
I've thought about two scenarios with regards to storing the data:
1) Keeping the pending status data in the same table as the live data, but adding a flag to indicate its status. I could see issues here with having to remove constraints or make required fields nullable to support the 'incomplete' status data. Then there is the issue of how to handle updates to existing data: you would have to add a new row for an update and link it back to the existing 'live' row. This seems a bit messy to me.
2) Add new tables that mirror the live tables and store the data in there until it has been approved. This would allow me to keep full control over the existing live tables while the 'pending' tables can be abused with whatever the user feels he wants to put in there. The downside of this is that I will end up with a lot of extra tables/SPs in the db. Another issue I was thinking about was how might a user link between two records, whereby the record linked to might be a record in the live table or one in the pending table, but I suppose in this situation you could always take a copy of the linked record and treat it as an update?
Neither solution seems perfect, but the second one seems like the better option to me. Is there a third solution?

Your option 2 very much sounds like the best idea. If you want to use referential integrity and all the nice things you get with a DBMS, you can't have the pending data in the same table. But there is no need for it to be unstructured data: pending data is still structured, and presumably you want the DB to play its part in enforcing rules even on this data. Even if you didn't, pending data fits well into a standard table structure.
A separate set of tables sounds like the right answer. You can bring the primary key of the row being changed into the pending table so you know what item is being edited, or what item is being linked to.
I don't know your situation exactly so this might not be appropriate, but an idea would be to have a separate table for storing the batch of edits that are being made, because then you can quality control a batch, or submit a batch to live. Each pending table could have a batch key so you know what batch it is part of. You'll have to find a way to control multiple pending edits to the same rows (if you want to) but that doesn't seem too tricky a problem to solve.
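To make the pending-table-plus-batch idea concrete, here is a loose sketch using SQLite via Python's sqlite3 module. Everything in it (the product, submission and product_pending tables and the approve helper) is a made-up example rather than a prescribed design, and it ignores conflicting pending edits to the same row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (                       -- hypothetical live table
    id   INTEGER PRIMARY KEY,
    code TEXT NOT NULL UNIQUE,
    name TEXT NOT NULL
);
CREATE TABLE submission (                    -- the batch of edits
    id     INTEGER PRIMARY KEY,
    status TEXT NOT NULL DEFAULT 'pending'
);
CREATE TABLE product_pending (               -- mirrors product, with looser rules
    id            INTEGER PRIMARY KEY,
    submission_id INTEGER NOT NULL REFERENCES submission(id),  -- batch key
    product_id    INTEGER REFERENCES product(id),  -- NULL means a brand-new row
    code          TEXT,                      -- nullable, not UNIQUE: drafts may be incomplete
    name          TEXT
);
""")

def approve(conn, submission_id):
    """Apply one submission to the live table in a single transaction."""
    with conn:  # any live-table constraint failure rolls the whole batch back
        rows = conn.execute(
            "SELECT product_id, code, name FROM product_pending "
            "WHERE submission_id = ? ORDER BY id", (submission_id,)).fetchall()
        for product_id, code, name in rows:
            if product_id is None:
                conn.execute("INSERT INTO product (code, name) VALUES (?, ?)",
                             (code, name))
            else:
                conn.execute("UPDATE product SET code = ?, name = ? WHERE id = ?",
                             (code, name, product_id))
        conn.execute("UPDATE submission SET status = 'approved' WHERE id = ?",
                     (submission_id,))

conn.execute("INSERT INTO product (code, name) VALUES ('A1', 'Widget')")
conn.execute("INSERT INTO submission DEFAULT VALUES")
conn.execute("INSERT INTO product_pending (submission_id, product_id, code, name) "
             "VALUES (1, 1, 'A1', 'Widget v2')")   # edit to an existing live row
conn.execute("INSERT INTO product_pending (submission_id, code, name) "
             "VALUES (1, 'B2', 'Gadget')")         # new row
conn.commit()
approve(conn, 1)
live = conn.execute("SELECT id, code, name FROM product ORDER BY id").fetchall()
```

Because the pending table has no UNIQUE or NOT NULL rules on the data columns, drafts never collide with live data; the live constraints are only enforced at approval time.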
I'm not sure if this fits but it might be worth looking into 'Master Data Management' tools such as SQL Server's Master Data Services.

'Unit of work' is a good name for 'data submission'.
You could serialize it to a different place, like a (non-relational) document-oriented database, and only save it to the relational DB on approval.
It depends on how many of the live-data constraints still need to apply to the unapproved data.

I think the second option is better. To manage this, you can create a view that combines both tables and work with the structure through that view.
Another good approach is to use an XML column in a separate table to store the necessary data (because the number and names of the columns are not known in advance). You can create just one table with an XML column and a "Type" column to determine which table each document relates to.

The first scenario seems good.
Add a Status column to the table. There is no need to remove the NOT NULL constraints; just add a check on the required fields based on the flag, e.g. if the flag is 1 (incomplete), NULL is allowed, otherwise it is not.
Regarding your second doubt: do you want to append to the data or update the whole row?

Is it sensible to have a table that does not reference any other in a database design?

I'd like to get some advice on database design. Specifically, consider the following (hypothetical) scenario:
Employees - table holding all employee details
Users - table holding employees that have username and password to access software
UserLog - table to track when users log in and log out and to calculate time on software
In this scenario, if an employee leaves the company I also want to make sure I delete them from the Users table so that they can no longer access the software. I can achieve this using ON DELETE CASCADE as part of the FK relationship between EmployeeID in Employees and Users.
However, I don't want to delete their details from the UserLog as I am interested in collating data on how long people spend on the software and the fact that they no longer work at the company does not mean their user behaviour is no longer relevant.
What I am left with is a table UserLog that has no relationships with any other tables in my database. Is this a sensible idea?
Having looked through books etc / googled online I haven't come across any DB schemas with tables that have no relationships with others and so my gut instinct here is saying that my approach is not robust...
I'd appreciate some guidance please.

My personal preference in this case would be to "soft delete" an employee by adding a "DeletedDate" column to the Employees table. This will allow you to maintain referential integrity with your UserLog table and all details for all employees, past and present, remain available in the database.
The downside to this approach is that you need to add application logic to check for active employees.
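A small sketch of the soft-delete approach, using SQLite via Python's sqlite3 module. The column lists and the ActiveEmployees view are assumptions added for illustration; the view is one way to keep the "is this employee active?" check in a single place.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")     # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE Employees (
    EmployeeID  INTEGER PRIMARY KEY,
    Name        TEXT NOT NULL,
    DeletedDate TEXT                         -- NULL while the employee is active
);
CREATE TABLE UserLog (
    LogID      INTEGER PRIMARY KEY,
    EmployeeID INTEGER NOT NULL REFERENCES Employees(EmployeeID),
    LoginTime  TEXT NOT NULL,
    LogoutTime TEXT
);
-- Hypothetical view so application queries only see active employees.
CREATE VIEW ActiveEmployees AS
    SELECT EmployeeID, Name FROM Employees WHERE DeletedDate IS NULL;
""")

conn.execute("INSERT INTO Employees (Name) VALUES ('Alice'), ('Bob')")
conn.execute("INSERT INTO UserLog (EmployeeID, LoginTime) VALUES (2, '2024-01-01 09:00')")

# Bob leaves: mark the row instead of deleting it, so his log rows stay linked.
conn.execute("UPDATE Employees SET DeletedDate = CURRENT_TIMESTAMP WHERE EmployeeID = 2")

active = conn.execute("SELECT Name FROM ActiveEmployees").fetchall()
bob_log_rows = conn.execute(
    "SELECT COUNT(*) FROM UserLog WHERE EmployeeID = 2").fetchone()[0]
```

With the foreign key still in place, the database also rejects log rows for an EmployeeID that never existed.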

Yes, this is perfectly sensible. The log is just a raw audit of data that should never change. It doesn't need to be normalized (and shouldn't be) and/or linked to other tables.
Ideally, I would put write-heavy audit logging in a different database entirely than the read-heavy transactional day-to-day stuff. They may grow differently over time. But starting small it's fine to keep them in the same database as long as you understand the fundamental differences between them.
On a side note, I would recommend not deleting the users from the tables. Maybe have some kind of IsActive or IsDeleted bit on them that would effectively blind them from the application, but deleting should be avoided if possible.

The problem you have here is that it's perfectly possible to insert UserLog data for users that have never existed as there's no link to the table that defines valid users.
I would say that perhaps the better course of action would be to mark the users as invalid and remove all their personal details when they leave rather than delete the record entirely.
That's not to say there aren't situations where it is valid to have a table (or tables) on the database that don't reference others.

Is this a sensible idea?
The problem is this. Since the data isn't linked, you can delete something from the employee table and still have references to it in the UserLog. After the employee information is deleted, you have no way of knowing what the log data ties back to. Is this OK? Technically yes. There is nothing preventing you from doing it, but then why are you keeping the data in the first place? You also have no guarantee that the data in the table actually is about an employee. Someone could accidentally enter a wrong EmployeeID in the table that doesn't belong to anyone. Keys help prevent data corruption. It's always better to have extra data than it is to have bad data.
What I've found is that you want to avoid deleting data whenever possible. Space is cheap, and you can add flags etc. to show that a record isn't active. Yes, this does cause more work (which can be quickly remedied by creating a view that only shows active employees), and saying that you should never delete data is far-fetched, but once you start linking data together, deleting becomes very difficult. If you are leaving out a FK just so you can delete records, it's a telltale sign that you need to rethink your strategy.
Relying on Cascade Delete can be very dangerous too. The model you are stating is that anytime you don't want data deleted you have to know not to add a FK to that table which links it back to users. It doesn't take long for someone to forget this.

What you can do is use logical deletion or disabling a user by adding a bool value Deleted or Disabled to the Users table.
Or replace the EmployeeId with the name of the employee in the UserLog.

An alternative to using the soft delete process, is to store all the historical details you would want about the user at the time the log record is created rather than store the employee id. So you might have username, logintime, logouttime, sessionlength in your table.

Sensible? Sure, as in it makes sense as you've described your need to keep those users indefinitely. The problem you'll run into is maintaining the tables. Instead of doing a cascading update once, you'll have to use at least two updates in order to insert a new user.

I think a table like the one you are suggesting is perfectly fine. I frequently encounter log tables that do not have explicit relationships with other tables. Just because a database is "relational" doesn't mean everything has to relate, haha.
One thing that I do notice though is that you are using EmployeeID in the log, but not using it as a foreign key to your Employee table. I understand why you don't want that, since you will be dropping employees. But, if you are dropping them completely, then the EmployeeID column is meaningless.
A solution to this would be to keep a flag for employees, such as active, that tracks if they are active or not. That way, the log data is meaningful.

IANADBA, but it's generally considered very bad practice indeed to delete almost anything from a DB, ever. It would be far better here to have some kind of locked flag or "deleted" datestamp on your users table and preserve your FK.

SQL - Table Design - DateCreated and DateUpdated columns

For my application there are several entity classes, User, Customer, Post, and so on
I'm about to design the database and I want to store the date when the entities were created and updated. This is where it gets tricky. Sure, one option is to add created_timestamp and update_timestamp columns to each of the entity tables, but isn't that redundant?
Another possibility could be to create a log table that stores this information, and it could be made to keep track of updates for any entity.
Any thoughts? I'm leaning on implementing the latter.

The single-log-table-for-all-tables approach has two main problems that I can think of:
The design of the log table will (probably) constrain the design of all the other tables. Most likely the log table would have one column named TableName and then another column named PKValue (which would store the primary key value for the record you're logging). If some of your tables have compound primary keys (i.e. more than one column), then the design of your log table would have to account for this (probably by having columns like PKValue1, PKValue2 etc.).
If this is a web application of some sort, then the user identity that would be available from a trigger would be the application's account, instead of the ID of the web app user (which is most likely what you really want to store in your CreatedBy field). This would only help you distinguish between records created by your web app code and records created otherwise.
CreatedDate and ModifiedDate columns aren't redundant just because they're defined in each table. I would stick with that approach and put insert and update triggers on each table to populate those columns. If I also needed to record the end-user who made the change, I would skip the triggers and populate the timestamp and user fields from my application code.
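For illustration, a sketch of the per-table timestamp columns in SQLite via Python's sqlite3 module. The post table and trigger name are invented, and column defaults stand in for the insert trigger here; other databases would need their own trigger syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE post (                          -- hypothetical entity table
    id           INTEGER PRIMARY KEY,
    title        TEXT NOT NULL,
    CreatedDate  TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    ModifiedDate TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
-- Refresh ModifiedDate on every update. SQLite does not re-fire this
-- trigger for its own UPDATE because recursive triggers are off by default.
CREATE TRIGGER post_set_modified AFTER UPDATE ON post
BEGIN
    UPDATE post SET ModifiedDate = CURRENT_TIMESTAMP WHERE id = NEW.id;
END;
""")

conn.execute("INSERT INTO post (title) VALUES ('Hello')")
conn.execute("UPDATE post SET title = 'Hello again' WHERE id = 1")
created, modified = conn.execute(
    "SELECT CreatedDate, ModifiedDate FROM post WHERE id = 1").fetchone()
```

If the end user also has to be recorded, the same two columns (plus a user column) can simply be set by the application in its INSERT and UPDATE statements instead of by a trigger.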

I do the latter, with a "log" or "events" table. In my experience, the "updated" timestamp becomes frustrating pretty quick, because a lot of the time you find yourself in a fix where you want not just the very latest update time.

How often will you need to include the created/updated timestamps in your presentation layer? If the answer is anything more than "once in a great great while", I think you would be better served by having those columns in each table.

On a project I worked on a couple of years ago, we implemented triggers which updated what we called an audit table (it stored basic information about the changes being made, one audit table per table). This included modified date (and last modified).
They were only applied to key tables (not joins or reference data tables).
This removed a lot of the normal frustration of having to account for LastCreated & LastModified fields, but introduced the annoyance of keeping the triggers up to date.
In the end the trigger/audit table design worked well and all we had to remember was to remove and reapply the triggers before ETL(!).

It's for a web based CMS I work on. The creation and last updated dates will be displayed on most pages and there will be lists for the last created (and updated) pages. The admin interface will also use this information.

updating primary key of master and child tables for large tables

I have a fairly huge database with a master table with a single column GUID (custom GUID like algorithm) as primary key and 8 child tables that have foreign key relationships with this GUID column. All the tables have approximately 3-8 million records. None of these tables have any BLOB/CLOB/TEXT or any other fancy data types just normal numbers, varchars, dates, and timestamps (about 15-45 columns in each table). No partitions or other indexes other than the primary and foreign keys.
Now, the custom GUID algorithm has changed and though there are no collisions I would like to migrate all the old data to use GUIDs generated using the new algorithm. No other columns need to be changed. Number one priority is data integrity and performance is secondary.
Some of the possible solutions that I could think of were (as you will probably notice they all revolve around one idea only)
1) add new column ngu_id and populate with new gu_id; disable constraints; update child tables with ngu_id as gu_id; rename ngu_id->gu_id; re-enable constraints
2) read one master record and its dependent child records from child tables; insert into the same table with new gu_id; remove all records with old gu_ids
3) drop constraints; add a trigger to the master table such that all the child tables are updated; start updating old gu_ids with new gu_ids; re-enable constraints
4) add a trigger to the master table such that all the child tables are updated; start updating old gu_ids with new gu_ids
5) create new column ngu_id on all master and child tables; create foreign key constraints on ngu_id columns; add update trigger to the master table to cascade values to child tables; insert new gu_id values into ngu_id column; remove old foreign key constraints based on gu_id; remove gu_id column and rename ngu_id to gu_id; recreate constraints if necessary
6) use on update cascade if available?
My questions are:
Is there a better way? (Can't burrow my head in the sand, gotta do this)
What is the most suitable way to do this? (I've to do this in Oracle, SQL server and mysql4 so, vendor-specific hacks are welcome)
What are the typical points of failure for such an exercise and how to minimize them?
If you are with me so far, thank you and hope you can help :)

Your ideas should work. The first is probably the one I would use. Some cautions and things to think about when doing this:
Do not do this unless you have a current backup.
I would leave both values in the main table. That way if you ever have to figure out from some old paperwork which record you need to access, you can do it.
Take the database down for maintenance while you do this and put it in single user mode. The very last thing you need while doing something like this is a user attempting to make changes while you are in midstream. Of course, the first action once in single user mode is the above-mentioned backup. You probably should schedule the downtime for some time when the usage is lightest.
Test on dev first! This should also give you an idea as to how long you will need to close production for. Also, you can try several methods to see which is the fastest.
Be sure to communicate in advance to users that the database will be going down at the scheduled time for maintenance and when they can expect to have it be available again. Make sure the timing is ok. It really makes people mad when they plan to stay late to run the quarterly reports and the database is not available and they didn't know it.
Since there are a fairly large number of records, you might want to run the updates of the child tables in batches (one reason not to use cascading updates). This can be faster than trying to update 5 million records with one update. However, don't try to update one record at a time or you will still be here next year doing this task.
Drop indexes on the GUID field in all the tables and recreate after you are done. This should improve the performance of the change.

Create a new table with the old and the new pk values in it. Place unique constraints on both columns to ensure you haven't broken anything so far.
Disable constraints.
Run updates against all the tables to change the old value to the new value.
Enable the PK, then enable the FK's.
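The mapping-table steps above could look roughly like this. It is a toy sketch in SQLite via Python's sqlite3 module, where toggling PRAGMA foreign_keys stands in for disabling and re-enabling constraints; the table names master, child and id_map and the sample IDs are invented, and Oracle, SQL Server and MySQL each have their own syntax for these steps.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE master (gu_id TEXT PRIMARY KEY, payload TEXT);
CREATE TABLE child (
    id    INTEGER PRIMARY KEY,
    gu_id TEXT NOT NULL REFERENCES master(gu_id)
);
-- Mapping table: UNIQUE on both columns catches duplicates on either side.
CREATE TABLE id_map (
    old_id TEXT NOT NULL UNIQUE,
    new_id TEXT NOT NULL UNIQUE
);
INSERT INTO master VALUES ('old-1', 'a'), ('old-2', 'b');
INSERT INTO child (gu_id) VALUES ('old-1'), ('old-1'), ('old-2');
INSERT INTO id_map VALUES ('old-1', 'new-1'), ('old-2', 'new-2');
""")

def migrate(conn):
    """Rewrite old IDs to new IDs in every table; return any FK violations."""
    conn.execute("PRAGMA foreign_keys = OFF")        # "disable constraints"
    with conn:                                       # all updates in one transaction
        for table in ("master", "child"):
            conn.execute(f"""
                UPDATE {table}
                SET gu_id = (SELECT new_id FROM id_map
                             WHERE old_id = {table}.gu_id)
                WHERE gu_id IN (SELECT old_id FROM id_map)""")
    conn.execute("PRAGMA foreign_keys = ON")         # "enable constraints"
    # An empty result means every child row still points at a master row.
    return conn.execute("PRAGMA foreign_key_check").fetchall()

violations = migrate(conn)
master_ids = [r[0] for r in conn.execute("SELECT gu_id FROM master ORDER BY gu_id")]
child_ids = [r[0] for r in conn.execute("SELECT gu_id FROM child ORDER BY id")]
```

Keeping id_map around afterwards also gives you a lookup from old paperwork to the new IDs.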

It's difficult to say what the "best" or "most suitable" approach is as you have not described what you are looking for in a solution. For example, do the tables need to be available for query while you are migrating to new IDs? Do they need to be available for concurrent modification? Is it important to complete the migration as fast as possible? Is it important to minimize the space used for migration?
Having said that, I would prefer #1 over your other ideas, assuming they all met your requirements.
Anything that involves a trigger to update the child tables seems error-prone and over complicated and likely will not perform as well as #1.
Is it safe to assume that new IDs will never collide with old IDs? If not, solutions based on updating the IDs one at a time will have to worry about collisions -- this will get messy in a hurry.
Have you considered using CREATE TABLE AS SELECT (CTAS) to populate new tables with the new IDs? You'll be making a copy of your existing tables and this will require additional space, however it is likely to be faster than updating the existing tables in place. The idea is: (i) use CTAS to create new tables with new IDs in place of the old, (ii) create indexes and constraints as appropriate on the new tables, (iii) drop the old tables, (iv) rename the new tables to the old names.

In fact, it depends on your RDBMS.
Using Oracle, the simplest choice is to make all of the foreign key constraints "deferred" (checked on commit), perform the updates in a single transaction, then commit.
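The deferred-constraint idea can be tried out without an Oracle instance. The sketch below uses SQLite via Python's sqlite3 module, which also supports DEFERRABLE INITIALLY DEFERRED foreign keys; it is not Oracle syntax, and the table names and IDs are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")     # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE master (gu_id TEXT PRIMARY KEY);
CREATE TABLE child (
    id    INTEGER PRIMARY KEY,
    gu_id TEXT NOT NULL
          REFERENCES master(gu_id) DEFERRABLE INITIALLY DEFERRED
);
INSERT INTO master VALUES ('old-1');
INSERT INTO child (gu_id) VALUES ('old-1');
""")

with conn:  # one transaction; the FK is only checked when it commits
    # Between these two statements the child row points at a missing parent,
    # which a deferred constraint tolerates until COMMIT.
    conn.execute("UPDATE master SET gu_id = 'new-1' WHERE gu_id = 'old-1'")
    conn.execute("UPDATE child  SET gu_id = 'new-1' WHERE gu_id = 'old-1'")

child_ids = [row[0] for row in conn.execute("SELECT gu_id FROM child")]
```

If the child update were forgotten, the commit itself would fail, so the constraint still protects integrity at the end of the transaction.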
