Entity Revisioning Schema Design

Entity Revisioning Schema Design - sql-server-2005

SQL Server 2005.
In our application, we have an entity with a parent table as well as several child tables. We would like to track revisions made to this entity. After going back and forth, we've narrowed it down to two approaches to choose from.
Have one history table for the entity. Before a sproc updates the table, retrieve the entire current state of the entity from the parent table and all child tables. XMLize it and stick it into the history table as the XML data type. Include some columns to query by, as well as a revision number/created date.
For each table, create a matching history table with the same columns. Also have a revision number/created date. Before a sproc updates a single table, retrieve the existing state of the record for that one table, and copy it into the history table. So, it's a little bit like SVN. If I want to get an entity at revision Y, I need to get the history record in each table with the maximum revision number that is not greater than Y. An entity might have 50 revision records in one table, but only 3 revision records in a child table, etc. I would probably want to persist the revision counter for the entire entity somewhere.
Both approaches seem to have their headaches, but I still prefer solution #2 to solution #1. This is a database that's already huge, and already suffers from performance issues. Bloating it with XML blobs on every revision (and there will be plenty) seems like a horrible way to go. Creating history tables for everything is a cost I'm willing to eat, as long as there's not a better way to do this.
Any suggestions?
Thanks,
Tedderz

Number 2 is almost certainly the way to go, and I do something like this with my history tables, though I use an "events" table as well to correlate the changes with one another instead of using a timestamp. I guess this is what you mean by a "revision counter". My "events" table contains a unique ID, a timestamp (of course), the application user responsible for the change, and an "action" designator which represents the application-level action that the user made which caused the change to happen.
Why #2? Because you can more easily partition the table to archive or roll-off old entries. Because it's easier to index. Because it's a WHOLE lot easier to query. Because it has less overhead than XML and is a lot smaller.
Also, consider using triggers instead of coding a stored procedure to do all of this. Triggers are almost always to be avoided, but for things like this, they're a fairly lightweight and robust way to perform this kind of thing.

Related

SQL schema pattern for keeping history of changes

Consider a database that maintains a list of persons and their contact information, including addresses and such.
Sometimes, the contact information changes. Instead of simply updating the single person record to the new values, I like to keep a history of the changes.
I like to keep the history in a way that when I look at a person's record, I can quickly determine that there are older recordings of that person's data as well. However, I also like to avoid having to build very complicated SQL queries for retrieving only the latest version of each person's records (while this may be easy with a single table, it quickly gets difficult once the table is connected to other tables).
I've come up with a few ways, which I'll add below as answers, but I wonder if there are better ways (While I'm a seasoned code writer, I'm rather new to DB design, so I lack the experience and already ran into a few dead ends).
Which DB? I am currently using sqlite but plan to move to a server based DB engine eventually, probably Postgres. However, I meant this question asked in a more general form, not specific to any particular engine, though suggestions how to solve this in certain engines are appreciated, too, in the general interest.

This is generally referred to as Slowly Changing Dimension and linked Wikipedia page offers several approaches to make this thing work.
Martin Fowler has a list of Temporal Patterns that are not exactly DB-specific, but offer a good starting point.
And finally, Microsoft SQL Server offers Change Data Capture and Change Tracking.

Must you keep structured history information?
Quite often, the history of changes does not have to be structured, because the history is needed for auditing purposes only, and there is no actual need to be able to perform queries against the historical data.
So, what quite often suffices is to simply log each modification that is made to the database, for which you only need a log table with a date-time field and some variable length text field into which you can format human-readable messages as to who changed what, and what the old value was, and what the new value is.
Nothing needs to be added to the actual data tables, and no additional complexity needs to be added to the queries.
If you must keep structured history information:
If you need to able to execute queries against historical data, then you must keep the historical data in the database. Some people recommend separate historical tables; I consider this misguided. Instead, I recommend using views.
Rename each table from "NAME" to "NAME_HISTORY" and then create a view called "NAME" which presents to you only the latest records.
Views are a feature which exists in most RDBMSes. A view looks like a table, so you can query it as if it was a table, but it is read-only, and it can be created by simply defining a query on existing tables (and views.)
So, with a query which orders the rows by history-date, groups by all fields except history-date, selects all fields except history-date, and picks only the first row, you can create a view that looks exactly like the original table before historicity was added.
Any existing code which just performs queries and does not need to be aware of history will continue working as before.
Code that performs queries against historical data, and code that modifies tables, will now need to start using "NAME_HISTORY" instead of "NAME".
It is okay if code which modifies the table is burdened by having to refer to the table as "NAME_HISTORY" instead of "NAME", because that code will also have to take into account the fact that it is not just updating the table, it is appending new historical records to it.
As a matter of fact, since views are read-only, the use of views will prevent you from accidentally modifying a table without taking care of historicity, and that's a good thing.

We use what we call Verity-Block pattern.
The verity contains the periodicity, the block contains immutable data.
In the case of personal data we have the Identity verity that has a validity period, and the IdentificationBlock that contains the data such as Name, LastName, BirthDate
Block are immutable, so whenever we change something the application makes sure to create a new block.
So in case your last name changes on 01/01/2015 from Smits to Johnson then we have a verity Identity valid from [mindate] to 31/12/2014 that is linked to an IdentificationBlock where Lastname = Smits and an Identity that is valid from 01/01/2014 to [maxdate] linked to an IdentificationBlock where LastName = Johnson.
So in the database we have tables:
Identification
ID_Identification [PK]
Identity
ID_Identity [PK]
ID_Identification [FK]
ID_IdentificationBlock [FK]
ValidFrom
ValidTo
IdentificationBlock
ID_IdentificationBlock [PK]
ID_Identification [FK]
FirstName
LastName
BirthDate
A typical query to get the current name would be
Select idb.Name, idb.LastName from IdentificationBlock idb
join Identity i on idb.ID_Identification = i.ID_Identification
where getDate() between i.ValidFrom and i.ValidTo

Add an "active" flag or add a "version" number.
Using a flag requires adding a condition such as active=1 to every query's WHERE clause involving the table.
Using a version number requires adding a subquery such as:
version = (SELECT MAX(version) FROM MyTable t2 WHERE MyTable.id = t2.id)
Pros:
Keeps the database design simple.
Detection of history entries is easy - just remove the extra condition from the queries.
Cons:
Updating data requires setting the active or version values accordingly. (Though this might be handled with SQL triggers, I guess.)
Complicates queries. While this may not affect the performance, it's getting more difficult to write and maintain such queries by hand the more complex the queries get, especially when involving joined queries.
Foreign keys into this table cannot use the rowid to refer to a person because updates to the person create a new entry in the table, thereby effectively changing the rowid of the latest data for the person.
Maintainig a FTS (Full Text Search) table in sqlite only for the most recent versions of data is slightly more difficult due to the triggers for automatic updates to the FTS need to take the active or version values into account in order to make sure that only the latest data is stored, while outdated data gets removed.

Move older versions into a separate "history" table.
By using SQL triggers the old data is automatically written to the "history" table.
Pros:
Queries that ask for only the latest data remain simple.
By using triggers, updating data doesn't need to be concerned with maintaining the history.
Maintainig a FTS (Full Text Search) table in sqlite only for the most recent versions of data is easy because the triggers would be attached only to the "current" (non-history) table, thereby avoiding storing of obsolete data.
Cons:
Detection of history entries requires parsing a separate table (that's not a big issue, though). This may also be alleviated by adding a backlink column as a foreign key to the history table.
Every table that shall maintain a history needs a duplicate table for the history. Makes writing the schema tedious unless program code is written to create such "history" tables dynamically.

We use a history integer column. New rows are always inserted with a history of 0, and any previous rows for that entry have the history incremented by 1.
Depending on how often the historical data is to be used, it might be wise to store history rows in a separate table. A simple view could be used if the combined data is desired, and it should speed things up if you usually just need the current rows.

How Can I Maintain a Unique Identifier Amongst Multiple Database Tables?

I have been tasked with creating history tables for an Oracle 11g database. I have proposed something very much like the record based solution in the first answer of this post What is the best way to keep changes history to database fields?
Then my boss suggested that due to the fact that some tables are clustered i.e Some data from table 1 is related to table 2 (think of this as the format the tables were in before they were normalised), he would like there to be a version number which is maintained between all the tables at this cluster level. The suggested way to generate the version number is by using a SYS_GUID http://docs.oracle.com/cd/B12037_01/server.101/b10759/functions153.htm.
I thought about doing this with triggers so when one of this tables is updated, the other tables version numbers are subsequently updated, but I can see some issues with this such as the following:
How can I stop the trigger from one table, in turn firing the trigger for the other table?(We would end up calling triggers forever here)
How can I stop the race conditions? (i.e When table 1 and 2 are updated at the same time, how do I know which is the latest version number?)
I am pretty new to Oracle database development so some suggestions about whether or not this is a good idea/if there is a better way of doing this would be great.

I think the thing you're looking for is sequence: http://docs.oracle.com/cd/B28359_01/server.111/b28286/statements_6015.htm#SQLRF01314
The tables could take the numbers from defined sequence independently, so no race conditions or triggers on your side should occur

Short answer to your first question is "No, you cannot.". The reason for this is that there's no way that users can stop a stated trigger. The only method I can imagine is some store of locking table, for example you create a intermediate table, and select the same row for update among your clustered tables. But this is really a bad way, as you've already mentioned in your second question. It will cause dreadful concurrency issue.
For your second question, you are very right. Different triggers for different original tables to update the same audit table will cause serious contention. It's wise to bear in mind the way triggers work that is they are committed when the rest of transaction commit. So if all related tables will update the same audit table, especially for the same row, simultaneously will render the rational paradigm unused. One benefit of the normalization is performance gain, as when you update different table will not content each other. But in this case if you want synchronize different table's operations in audit table. It will finally work like a flat file. So my suggestion would be trying your best to persuade your boss to use your original proposal.
But if your application always updates these clustered table in a transaction and write one audit information to audit table. You may write a stored procedure to update the entities first and write an audit at end of the transaction. Then you can use sequence to generate the id of audit table. It won't be any contention.

Dynamically creating tables as a means of partitioning: OK or bad practice?

Is it reasonable for an application to create database tables dynamically as a means of partitioning?
For example, say I have a large table "widgets" with a "userID" column identifying the owner of each row. If this table tended to grow extremely large, would it make sense to instead have the application create a new table called "widgets_{username}" for each new user? Assume that the application will only ever have to query for widgets belonging to a single user at a time (i.e. no need to try and join any of these user widget tables together).
Doing this would break up the one large table into more easily-managed chunks, but this doesn't seem like an elegant solution. In my mind, the database schema should be defined when the application is written, and any runtime data is stored as rows, not as additional tables.
As a more general question, is modifying the database schema at runtime ever ok?
Edit: This question is mostly hypothetical; I had a pretty good feeling that creating tables at runtime didn't make sense. That being said, we do have a table with millions of rows in our application. SELECTs perform fine, but things like deleting all rows owned by a particular user can take a while. Basically I'm looking for some solid reasoning why just dynamically creating a table for each user doesn't make sense for when I'm asked.

NO, NO, NO!! Now repeat after me, I will not do this because it will create many headaches and problems in the future! Databases are made to handle large amounts of information. they use indexes to quickly find what you are after. think phone book how effective is the index? would it be better to have a different book for each last name?
This will not give you anything performance wise. Keep a single table, but be sure to index on UserID and you'll be able to get the data fast. however if you split the table up, it becomes impossible/really really hard to get any info that spans multiple users, like search all users for a certain widget, count of all widgets of a certain type, etc. you need to have every query be built dynamically.
If deleting rows is slow, look into that. How many rows at one time are we talking about 10, 1000, 100000? What is your clustered index on this table? Could you use a "soft delete", where you have a status column that you UPDATE to "D" to mark the row as deleted. Can you delete the rows at a later time, with less database activity. is the delete slow because it is being blocked by other activity. look into those before you break up the table.

No, that would be a bad idea. However some DBMSs (e.g. Oracle) allow a single table to be partitioned on values of a column, which would achieve the objective without creating new tables at run time. Having said that, it is not "the norm" to partition tables like this: it is only usually done in very large databases.

Using an index on userID should result nearly in the same performance.
In my opinion, changing the database schema at runtime is bad practice.
Consider, for example, security issues...

Is it reasonable for an application to create database tables
dynamically as a means of partitioning?
No. (smile)

When to Create, When to Modify a Table?

I wanted to know, what should i consider while deciding if i should create a new table or modify an existing table for a sql db. i use both mysql and sqlite.
-Edit- I always thought if i can put a column into a table where it makes sense and can be used by every row then i would always modify it. However at work if its a different 'release' we put it in a different table.

You can modify existing tables, as long as
you are keeping the database Normalized
you are not breaking code that uses the table
You can create new tables even if 1. and 2. are true for the following reasons:
Performance reasons
Clarity in your schema logic.

Not sure if I'm understanding your question correctly, but one thing I always try to consider is the impact on existing data.
Taking the case of an application which relies on a database...
When you update the application (including database schema updates), it is important to ensure that any existing, in-use databases will be either backwards compatible with the application, or there is way to migrate and update the existing database.

Generally if the data is in a one-to-one relationship with the existing data in the table and if the table row size is not too large already and if there aren't too many records in the table, then I usually alter the table to accept the new column.
However, suppose I want to add a column with a default value to a table where it doesn't exist. Adding it to the table with 50 million records might not be so speedy a process and it might lock up the table on production when we move the change up. In this case, putting it into a separate table and adding the records to it may work out better. In general, I wouldn't do this unless my testing has shown that adding and populating the column will take an unacceptably long time. I would prefer to keep the record together where possible.
Same thing with the overall record size. SQL server has a byte limit to the number of bytes that can be in a record, it will allow you to create a structure that is potentially larger than that, but it will not alow you to put more than the byte limit into a specific record. Further, less wide tables tend to be faster to access due to how they are stored. Frequently, people will create a table that has a one-to-one relationship (we call them extended tables in our structure) for additional columns that are not as frequnetly used. If the fields from both tables will be frequently used, often they still create two tables but have a view that will pickout all the columns needed.
And of course if the data is in a one to many relationship, you need a related table not just a new column.
Incidentally, you should always do an alter table through a script and the SSMS GUI as it is more efficient and easier to move to prod.

`active' flag or not?

OK, so practically every database based application has to deal with "non-active" records. Either, soft-deletions or marking something as "to be ignored". I'm curious as to whether there are any radical alternatives thoughts on an `active' column (or a status column).
For example, if I had a list of people
CREATE TABLE people (
id INTEGER PRIMARY KEY,
name VARCHAR(100),
active BOOLEAN,
...
);
That means to get a list of active people, you need to use
SELECT * FROM people WHERE active=True;
Does anyone suggest that non active records would be moved off to a separate table and where appropiate a UNION is done to join the two?
Curiosity striking...
EDIT: I should make clear, I'm coming at this from a purist perspective. I can see how data archiving might be necessary for large amounts of data, but that is not where I'm coming from. If you do a SELECT * FROM people it would make sense to me that those entries are in a sense "active"
Thanks

You partition the table on the active flag, so that active records are in one partition, and inactive records are in the other partition. Then you create an active view for each table which automatically has the active filter on it. The database query engine automatically restricts the query to the partition that has the active records in it, which is much faster than even using an index on that flag.
Here is an example of how to create a partitioned table in Oracle. Oracle doesn't have boolean column types, so I've modified your table structure for Oracle purposes.
CREATE TABLE people
(
id NUMBER(10),
name VARCHAR2(100),
active NUMBER(1)
)
PARTITION BY LIST(active)
(
PARTITION active_records VALUES (0)
PARTITION inactive_records VALUES (1)
);
If you wanted to you could put each partition in different tablespaces. You can also partition your indexes as well.
Incidentally, this seems a repeat of this question, as a newbie I need to ask, what's the procedure on dealing with unintended duplicates?
Edit: As requested in comments, provided an example for creating a partitioned table in Oracle

Well, to ensure that you only draw active records in most situations, you could create views that only contain the active records. That way it's much easier to not leave out the active part.

We use an enum('ACTIVE','INACTIVE','DELETED') in most tables so we actually have a 3-way flag. I find it works well for us in different situations. Your mileage may vary.

Moving inactive stuff is usually a stupid idea. It's a lot of overhead with lots of potential for bugs, everything becomes more complicated, like unarchiving the stuff etc. What do you do with related data? If you move all that, too, you have to modify every single query. If you don't move it, what advantage were you hoping to get?
That leads to the next point: WHY would you move it? A properly indexed table requires one additional lookup when the size doubles. Any performance improvement is bound to be negligible. And why would you even think about it until the distant future time when you actually have performance problems?

I think looking at it strictly as a piece of data then the way that is shown in the original post is proper. The active flag piece of data is directly dependent upon the primary key and should be in the table.
That table holds data on people, irrespective of the current status of their data.

The active flag is sort of ugly, but it is simple and works well.
You could move them to another table as you suggested. I'd suggest looking at the percentage of active / inactive records. If you have over 20 or 30 % inactive records, then you might consider moving them elsewhere. Otherwise, it's not a big deal.

Yes, we would. We currently have the "active='T/F'" column in many of our tables, mainly to show the 'latest' row. When a new row is inserted, the previous T row is marked F to keep it for audit purposes.
Now, we're moving to a 2-table approach, when a new row is inserted, the previous row is moved to an history table. This give us better performance for the majority of cases - looking at the current data.
The cost is slightly more than the old method, previously you had to update and insert, now you have to insert and update (ie instead of inserting a new T row, you modify the existing row with all the new data), so the cost is just that of passing in a whole row of data instead of passing in just the changes. That's hardly going to make any effect.
The performance benefit is that your main table's index is significantly smaller, and you can optimise your tablespaces better (they won't grow quite so much!)

Binary flags like this in your schema are a BAD idea. Consider the query
SELECT count(*) FROM users WHERE active=1
Looks simple enough. But what happens when you have a large number of users, so many that adding an index to this table would be required. Again, it looks straight forward
ALTER TABLE users ADD INDEX index_users_on_active (active)
EXCEPT!! This index is useless because the cardinality on this column is exactly two! Any database query optimiser will ignore this index because of it's low cardinality and do a table scan.
Before filling up your schema with helpful flags consider how you are going to access that data.
https://stackoverflow.com/questions/108503/mysql-advisable-number-of-rows

We use active flags quite often. If your database is going to be very large, I could see the value in migrating inactive values to a separate table, though.
You would then only require a union of the tables when someone wants to see all records, active or inactive.

In most cases a binary field indicating deletion is sufficient. Often there is a clean up mechanism that will remove those deleted records after a certain amount of time, so you may wish to start the schema with a deleted timestamp.

Moving off to a separate table and bringing them back up takes time. Depending on how many records go offline and how often you need to bring them back, it might or might not be a good idea.
If the mostly dont come back once they are buried, and are only used for summaries/reports/whatever, then it will make your main table smaller, queries simpler and probably faster.

We use both methods for dealing with inactive records. The method we use is dependent upon the situation. For records that are essentially lookup values, we use the Active bit field. This allows us to deactivate entries so they wont be used, but also allows us to maintain data integrity with relations.
We use the "move to separation table" method where the data is no longer needed and the data is not part of a relation.

The situation really dictates the solution, methinks:
If the table contains users, then several "flag" fields could be used. One for Deleted, Disabled etc. Or if space is an issue, then a flag for disabled would suffice, and then actually deleting the row if they have been deleted.
It also depends on policies for storing data. If there are policies for keeping data archived, then a separate table would most likely be necessary after any great length of time.

No - this is a pretty common thing - couple of variations depending on specific requirements (but you already covered them):
1) If you expect to have a whole BUNCH of data - like multiple terabytes or more - not a bad idea to archive deleted records immediately - though you might use a combination approach of marking as deleted then copying to archive tables.
2) Of course the option to hard delete a record still exists - though us developers tend to be data pack-rats - I suggest that you should look at the business process and decide if there is now any need to even keep the data - if there is - do so... if there isn't - you should probably feel free just to throw the stuff away.....again, according to the specific business scenario.

From a 'purist perspective' the realtional model doesn't differentiate between a view and a table - both are relations. So that use of a view that uses the discriminator is perfectly meaningful and valid provided the entities are correctly named e.g. Person/ActivePerson.
Also, from a 'purist perspective' the table should be named person, not people as the name of the relation reflects a tuple, not the entire set.

Regarding indexing the boolean, why not:
ALTER TABLE users ADD INDEX index_users_on_active (id, active) ;
Would that not improve the search?
However I don't know how much of that answer depends on the platform.

This is an old question but for those search for low cardinality/selectivity indexes, I'd like to propose the following approach that avoids partitioning, secondary tables, etc.:
The trick is to use "dateInactivated" column that stores the timestamp of when the record is inactivated/deleted. As the name implies, the value is NULL while the record is active, but once inactivated, write in the system datetime. Thus, an index on that column ends up having high selectivity as the number of "deleted" records grows since each record will have a unique (not strictly speaking) value.
Then your query becomes:
SELECT * FROM people WHERE dateInactivated is NULL;
The index will pull in just the right set of rows that you care about.

Filtering data on a bit flag for big tables is not really good in terms of performance. In case when 'active' determinate virtual deletion you can create 'TableName_delted' table with the same structure and move deleted data there using delete trigger.
That solution will help with performance and simplifies data queries.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas