I have been tasked with creating history tables for an Oracle 11g database. I have proposed something very much like the record-based solution in the first answer of this post: What is the best way to keep changes history to database fields?
Then my boss suggested that, because some tables are clustered (i.e. some data in table 1 is related to table 2; think of this as the format the tables were in before they were normalised), he would like a version number that is maintained across all the tables at this cluster level. The suggested way to generate the version number is with SYS_GUID http://docs.oracle.com/cd/B12037_01/server.101/b10759/functions153.htm.
I thought about doing this with triggers, so that when one of these tables is updated the other tables' version numbers are updated as well, but I can see some issues with this, such as the following:
How can I stop the trigger on one table from, in turn, firing the trigger on the other table? (Otherwise we would end up firing triggers forever.)
How can I avoid race conditions? (i.e. when tables 1 and 2 are updated at the same time, how do I know which is the latest version number?)
I am pretty new to Oracle database development so some suggestions about whether or not this is a good idea/if there is a better way of doing this would be great.
I think the thing you're looking for is a sequence: http://docs.oracle.com/cd/B28359_01/server.111/b28286/statements_6015.htm#SQLRF01314
The tables could take their numbers from the defined sequence independently, so no race conditions should occur and no triggers are needed on your side.
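For illustration, a minimal sketch of that idea; the sequence, table and column names here (cluster_version_seq, table1, version_no, some_col, id) are assumptions, not anything from your schema:

CREATE SEQUENCE cluster_version_seq;

-- Whenever a row in any of the clustered tables changes, stamp it with the next value:
UPDATE table1
   SET some_col   = :new_value,
       version_no = cluster_version_seq.NEXTVAL
 WHERE id = :id;

Since every table draws from the same sequence, the values are unique across the cluster without any cross-table triggers.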
The short answer to your first question is "No, you cannot." The reason is that there is no way for users to stop a trigger once it has started firing. The only method I can imagine is some kind of locking table: for example, you create an intermediate table and SELECT the same row FOR UPDATE from it in each of your clustered tables' triggers. But this is really a bad approach, as you've already hinted at in your second question: it will cause dreadful concurrency issues.
For your second question, you are quite right. Having different triggers on different source tables all update the same audit table will cause serious contention. It's also wise to bear in mind how triggers work: their changes are committed when the rest of the transaction commits. So if all the related tables update the same audit table, especially the same row, at the same time, you lose the benefit of the relational design. One benefit of normalization is a performance gain, because updates to different tables do not contend with each other; but if you synchronize the different tables' operations through one audit table, it ends up behaving like a flat file. So my suggestion would be to try your best to persuade your boss to use your original proposal.
However, if your application always updates these clustered tables within a single transaction and writes one audit record to the audit table, you could write a stored procedure that updates the entities first and writes the audit row at the end of the transaction. You can then use a sequence to generate the id of the audit row, and there won't be any contention.
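A rough PL/SQL sketch of that shape; the procedure, sequence, table and column names (update_cluster, audit_seq, audit_log, table1, table2, ...) are all made up for illustration:

CREATE OR REPLACE PROCEDURE update_cluster (
    p_id     IN NUMBER,
    p_value1 IN VARCHAR2,
    p_value2 IN VARCHAR2
) AS
    l_version NUMBER;
BEGIN
    -- one version number for the whole logical change
    SELECT audit_seq.NEXTVAL INTO l_version FROM dual;

    UPDATE table1 SET col1 = p_value1, version_no = l_version WHERE id = p_id;
    UPDATE table2 SET col2 = p_value2, version_no = l_version WHERE table1_id = p_id;

    -- single audit row written as part of the same transaction
    INSERT INTO audit_log (version_no, changed_id, changed_on)
    VALUES (l_version, p_id, SYSTIMESTAMP);
END update_cluster;
/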
Consider a database that maintains a list of persons and their contact information, including addresses and such.
Sometimes, the contact information changes. Instead of simply updating the single person record with the new values, I'd like to keep a history of the changes.
I'd like to keep the history in such a way that when I look at a person's record, I can quickly tell that there are older versions of that person's data as well. However, I'd also like to avoid having to build very complicated SQL queries to retrieve only the latest version of each person's records (while this may be easy with a single table, it quickly gets difficult once the table is connected to other tables).
I've come up with a few ways, which I'll add below as answers, but I wonder if there are better ways (while I'm a seasoned code writer, I'm rather new to DB design, so I lack the experience and have already run into a few dead ends).
Which DB? I am currently using sqlite but plan to move to a server-based DB engine eventually, probably Postgres. However, I mean this question in a more general form, not specific to any particular engine, though suggestions for how to solve this in specific engines are appreciated too, in the general interest.
This is generally referred to as a Slowly Changing Dimension, and the linked Wikipedia page offers several approaches to making this work.
Martin Fowler has a list of Temporal Patterns that are not exactly DB-specific, but offer a good starting point.
And finally, Microsoft SQL Server offers Change Data Capture and Change Tracking.
Must you keep structured history information?
Quite often, the history of changes does not have to be structured, because the history is needed for auditing purposes only, and there is no actual need to be able to perform queries against the historical data.
So, what quite often suffices is to simply log each modification made to the database. For that you only need a log table with a date-time field and a variable-length text field into which you can format human-readable messages describing who changed what, what the old value was, and what the new value is.
Nothing needs to be added to the actual data tables, and no additional complexity needs to be added to the queries.
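As a sketch (the table and column names are made up, and the dialect is generic SQL):

CREATE TABLE change_log (
    logged_at TIMESTAMP     NOT NULL,
    logged_by VARCHAR(100)  NOT NULL,
    message   VARCHAR(4000) NOT NULL  -- e.g. 'person 42: phone changed from 555-1234 to 555-9876'
);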
If you must keep structured history information:
If you need to able to execute queries against historical data, then you must keep the historical data in the database. Some people recommend separate historical tables; I consider this misguided. Instead, I recommend using views.
Rename each table from "NAME" to "NAME_HISTORY" and then create a view called "NAME" which presents to you only the latest records.
Views are a feature which exists in most RDBMSes. A view looks like a table, so you can query it as if it was a table, but it is read-only, and it can be created by simply defining a query on existing tables (and views.)
So, with a query which, for each record key, selects all fields except the history-date and keeps only the row with the most recent history-date, you can create a view that looks exactly like the original table before historicity was added.
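As a sketch, assuming a hypothetical person table keyed on id that has been renamed to person_history and given a history_date column (the other column names are invented too):

-- "person" becomes a read-only view exposing only the latest version of each record
CREATE VIEW person AS
SELECT p.id, p.name, p.address
  FROM person_history p
 WHERE p.history_date = (SELECT MAX(h.history_date)
                           FROM person_history h
                          WHERE h.id = p.id);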
Any existing code which just performs queries and does not need to be aware of history will continue working as before.
Code that performs queries against historical data, and code that modifies tables, will now need to start using "NAME_HISTORY" instead of "NAME".
It is okay if code which modifies the table is burdened by having to refer to the table as "NAME_HISTORY" instead of "NAME", because that code will also have to take into account the fact that it is not just updating the table, it is appending new historical records to it.
As a matter of fact, since views are read-only, the use of views will prevent you from accidentally modifying a table without taking care of historicity, and that's a good thing.
We use what we call the Verity-Block pattern.
The verity carries the validity period; the block contains the immutable data.
In the case of personal data we have the Identity verity, which has a validity period, and the IdentificationBlock, which contains the data such as FirstName, LastName and BirthDate.
Blocks are immutable, so whenever we change something the application creates a new block.
So if your last name changes on 01/01/2015 from Smits to Johnson, we have an Identity verity valid from [mindate] to 31/12/2014 that is linked to an IdentificationBlock where LastName = Smits, and an Identity valid from 01/01/2015 to [maxdate] linked to an IdentificationBlock where LastName = Johnson.
So in the database we have these tables:
Identification
    ID_Identification [PK]
Identity
    ID_Identity [PK]
    ID_Identification [FK]
    ID_IdentificationBlock [FK]
    ValidFrom
    ValidTo
IdentificationBlock
    ID_IdentificationBlock [PK]
    ID_Identification [FK]
    FirstName
    LastName
    BirthDate
A typical query to get the current name would be
Select idb.FirstName, idb.LastName
from IdentificationBlock idb
join Identity i on i.ID_IdentificationBlock = idb.ID_IdentificationBlock
where getDate() between i.ValidFrom and i.ValidTo
Add an "active" flag or add a "version" number.
Using a flag requires adding a condition such as active=1 to every query's WHERE clause involving the table.
Using a version number requires adding a subquery such as:
version = (SELECT MAX(version) FROM MyTable t2 WHERE MyTable.id = t2.id)
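For example, restricted to a single table (the names person, id and version are placeholders), the full query would look something like:

SELECT *
  FROM person
 WHERE version = (SELECT MAX(t2.version)
                    FROM person t2
                   WHERE t2.id = person.id);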
Pros:
Keeps the database design simple.
Detection of history entries is easy - just remove the extra condition from the queries.
Cons:
Updating data requires setting the active or version values accordingly. (Though this might be handled with SQL triggers, I guess.)
Complicates queries. While this may not affect performance, such queries become harder to write and maintain by hand as they grow more complex, especially when joins are involved.
Foreign keys into this table cannot use the rowid to refer to a person because updates to the person create a new entry in the table, thereby effectively changing the rowid of the latest data for the person.
Maintaining an FTS (Full Text Search) table in sqlite for only the most recent versions of the data is slightly more difficult, because the triggers that automatically update the FTS table need to take the active or version values into account to make sure that only the latest data is stored and outdated data gets removed.
Move older versions into a separate "history" table.
By using SQL triggers the old data is automatically written to the "history" table.
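A rough sqlite sketch of such a trigger; the table and column names (person, person_history, ...) are made up:

CREATE TRIGGER person_keep_history
BEFORE UPDATE ON person
FOR EACH ROW
BEGIN
    -- copy the outgoing version of the row into the history table
    INSERT INTO person_history (person_id, name, address, archived_at)
    VALUES (OLD.id, OLD.name, OLD.address, datetime('now'));
END;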
Pros:
Queries that ask for only the latest data remain simple.
By using triggers, updating data doesn't need to be concerned with maintaining the history.
Maintaining an FTS (Full Text Search) table in sqlite for only the most recent versions of the data is easy, because the triggers would be attached only to the "current" (non-history) table, thereby avoiding storage of obsolete data.
Cons:
Detection of history entries requires parsing a separate table (that's not a big issue, though). This may also be alleviated by adding a backlink column as a foreign key to the history table.
Every table that shall maintain a history needs a duplicate history table. This makes writing the schema tedious unless program code is written to create such "history" tables dynamically.
We use a history integer column. New rows are always inserted with a history of 0, and any previous rows for that entry have the history incremented by 1.
Depending on how often the historical data is to be used, it might be wise to store history rows in a separate table. A simple view could be used if the combined data is desired, and it should speed things up if you usually just need the current rows.
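Sketched in generic SQL (person, entry_id and the bind variables are placeholders), an update under the history-column scheme looks like:

-- push all existing versions of the entry one step back in history
UPDATE person SET history = history + 1 WHERE entry_id = :id;

-- insert the new data as the current version (history = 0)
INSERT INTO person (entry_id, name, history) VALUES (:id, :new_name, 0);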
Scenario:
Entity 1 can have 0 or more Entity 2.
What I'm trying to do:
When a field in Entity 1 is updated, a corresponding field in Entity 2 is updated as well.
What I'm doing:
Update the field in Entity 1 with an UPDATE statement, then query the related Entity 2 records (using SELECT ATTR FROM ENTITY2 WHERE ENTITY1.ID = ENTITY2.ENT1_ID) just to get the old value of the ENTITY2 attribute before updating those records. The type of update (e.g. subtract or add) on the ENTITY2 records is based on the value updated in ENTITY1.
Alternative :
Using triggers to consecutively update these related records.
(I still plan to study how to implement triggers, but I am not sure if it is worth it.
Any help or links on this would also be appreciated.)
Is it better to use triggers, or should I just stick to my current solution (which I think is quite slow due to the number of SQL executions, but easier to trace)?
There are people, such as Tom Kyte who believe triggers should be used as little as possible, if at all.
There are others, such as Toon Koppelaars who believe they should be used, if their use is considered carefully.
I am of the second camp and believe triggers may be used. However, this use should not be to 'automagically' cause cascade actions such as you are suggesting. Instead these triggers may be used to enforce integrity constraints that cannot be declared using the standard mechanism of a table constraint clause i.e. the triggers themselves do no DML other than SELECT from tables.
(Note: there are other mechanisms by which these constraints may be enforced, including materialized views or the introduction of additional columns and the use of specific indexing strategies.) Therefore, I would suggest another alternative. Create triggers - or use these alternative mechanisms - to ensure no data that breaks your integrity constraints can be committed. Then create APIs, using PL/SQL, that encapsulate the multi-table data amendments required to keep the integrity constraints intact, and use these as your update path.
In this way you can be assured that no invalid data exists in the database, but also that the actual DML required to achieve this is not hidden across the database in multiple program units and triggers but is stated explicitly in one place.
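As a bare-bones sketch of that split (the package, table and column names are all invented):

CREATE OR REPLACE PACKAGE entity_api AS
    PROCEDURE apply_change(p_entity1_id IN NUMBER, p_delta IN NUMBER);
END entity_api;
/
CREATE OR REPLACE PACKAGE BODY entity_api AS
    PROCEDURE apply_change(p_entity1_id IN NUMBER, p_delta IN NUMBER) IS
    BEGIN
        -- all the related DML is stated explicitly in one place
        UPDATE entity1 SET total = total + p_delta WHERE id = p_entity1_id;
        UPDATE entity2 SET attr  = attr  + p_delta WHERE ent1_id = p_entity1_id;
    END apply_change;
END entity_api;
/

Callers then use entity_api.apply_change instead of issuing their own UPDATE statements.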
Tom Kyte is brilliant. But he is, at heart, still just a DBA. Always keep that in mind when considering his advice on table design.
Can triggers be overused? Of course. But here's the rub: anything can be overused. I lean toward triggers because there is just no way to guarantee that all data manipulation will go through your app or any single channel. Or, if possible, define a foreign key relationship and let "cascade update" take care of everything. Tricky, I admit, and could be problematic, but don't reject any solution out of hand.
Having said that, I don't know if a trigger for this purpose is called for. I don't know why you are duplicating the data to a field in a different table. Without knowing your overall design and what you are trying to accomplish, there is no way to judge. But consider keeping the data in one field in one table and then using a view to expose that field as part of a second "table." Change the data where it resides and, voila, it is now changed wherever it appears.
Performance hit? Yes. But keeping duplicate data in different places and keeping them synchronized is a data integrity hit. Only you know (or are in a position to find out) which way this balance tilts.
Oh, can views be overused? Of course. But there's always that rub I mentioned; and besides, views are so chronically underused in most databases that overuse would be a long way away.
SQL Server 2005.
In our application, we have an entity with a parent table as well as several child tables. We would like to track revisions made to this entity. After going back and forth, we've narrowed it down to two approaches to choose from.
Have one history table for the entity. Before a sproc updates the table, retrieve the entire current state of the entity from the parent table and all child tables. XMLize it and stick it into the history table as the XML data type. Include some columns to query by, as well as a revision number/created date.
For each table, create a matching history table with the same columns. Also have a revision number/created date. Before a sproc updates a single table, retrieve the existing state of the record for that one table, and copy it into the history table. So, it's a little bit like SVN. If I want to get an entity at revision Y, I need to get the history record in each table with the maximum revision number that is not greater than Y. An entity might have 50 revision records in one table, but only 3 revision records in a child table, etc. I would probably want to persist the revision counter for the entire entity somewhere.
Both approaches seem to have their headaches, but I still prefer solution #2 to solution #1. This is a database that's already huge, and already suffers from performance issues. Bloating it with XML blobs on every revision (and there will be plenty) seems like a horrible way to go. Creating history tables for everything is a cost I'm willing to eat, as long as there's not a better way to do this.
Any suggestions?
Thanks,
Tedderz
Number 2 is almost certainly the way to go, and I do something like this with my history tables, though I use an "events" table as well to correlate the changes with one another instead of using a timestamp. I guess this is what you mean by a "revision counter". My "events" table contains a unique ID, a timestamp (of course), the application user responsible for the change, and an "action" designator which represents the application-level action that the user made which caused the change to happen.
Why #2? Because you can more easily partition the table to archive or roll-off old entries. Because it's easier to index. Because it's a WHOLE lot easier to query. Because it has less overhead than XML and is a lot smaller.
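For concreteness, a rough SQL Server sketch of that layout; the table and column names are invented:

CREATE TABLE EntityEvents (
    EventId    INT IDENTITY PRIMARY KEY,
    EventTime  DATETIME    NOT NULL,
    AppUser    VARCHAR(50) NOT NULL,
    ActionName VARCHAR(50) NOT NULL  -- application-level action that caused the change
);

CREATE TABLE ParentHistory (
    HistoryId INT IDENTITY PRIMARY KEY,
    EventId   INT NOT NULL REFERENCES EntityEvents (EventId),
    ParentId  INT NOT NULL,
    Col1      VARCHAR(100) NULL  -- ... same columns as the parent table ...
);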
Also, consider using triggers instead of coding a stored procedure to do all of this. Triggers are almost always to be avoided, but for things like this, they're a fairly lightweight and robust way to do it.
I wanted to know what I should consider when deciding whether to create a new table or modify an existing table in a SQL DB. I use both MySQL and SQLite.
-Edit- I always thought that if I can put a column into a table where it makes sense and can be used by every row, then I would always modify the existing table. However, at work, if it's for a different 'release' we put it in a different table.
You can modify existing tables, as long as:
1. you are keeping the database normalized
2. you are not breaking code that uses the table
You can create new tables even if 1. and 2. are true, for the following reasons:
Performance reasons
Clarity in your schema logic.
Not sure if I'm understanding your question correctly, but one thing I always try to consider is the impact on existing data.
Taking the case of an application which relies on a database...
When you update the application (including database schema updates), it is important to ensure that any existing, in-use databases will be either backwards compatible with the application, or there is way to migrate and update the existing database.
Generally if the data is in a one-to-one relationship with the existing data in the table and if the table row size is not too large already and if there aren't too many records in the table, then I usually alter the table to accept the new column.
However, suppose I want to add a column with a default value to a table where it doesn't exist. Adding it to the table with 50 million records might not be so speedy a process and it might lock up the table on production when we move the change up. In this case, putting it into a separate table and adding the records to it may work out better. In general, I wouldn't do this unless my testing has shown that adding and populating the column will take an unacceptably long time. I would prefer to keep the record together where possible.
Same thing with the overall record size. SQL Server has a limit on the number of bytes that can be in a record; it will allow you to create a structure that is potentially larger than that, but it will not allow you to put more than the byte limit into a specific record. Further, narrower tables tend to be faster to access due to how they are stored. Frequently, people will create a table that has a one-to-one relationship (we call them extended tables in our structure) for additional columns that are not as frequently used. If the fields from both tables will be frequently used, often they still create two tables but have a view that will pick out all the columns needed.
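A sketch of that extended-table arrangement (Customer, CustomerExtended, CustomerFull and their columns are made-up names):

-- one-to-one "extended" table for less frequently used columns
CREATE TABLE CustomerExtended (
    CustomerId INT PRIMARY KEY REFERENCES Customer (CustomerId),
    Notes      VARCHAR(1000) NULL,
    LegacyCode VARCHAR(20)   NULL
);

-- view that picks out the columns needed from both tables
CREATE VIEW CustomerFull AS
SELECT c.CustomerId, c.Name, e.Notes, e.LegacyCode
  FROM Customer c
  LEFT JOIN CustomerExtended e ON e.CustomerId = c.CustomerId;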
And of course if the data is in a one to many relationship, you need a related table not just a new column.
Incidentally, you should always do an ALTER TABLE through a script rather than the SSMS GUI, as it is more efficient and easier to move to prod.
We are working on a project at the moment and have to implement soft deletion for the majority of users (user roles). We decided to add an is_deleted field (default '0') to each table in the database and set it to '1' if particular user roles hit a delete button on a specific record.
For future maintenance, each SELECT query will need to ensure it does not include records where is_deleted='1'.
Is there a better solution for implementing soft deletion?
Update: I should also note that we have an Audit database that tracks changes (field, old value, new value, time, user, ip) to all tables/fields within the Application database.
I would lean towards a deleted_at column that contains the datetime of when the deletion took place. Then you get a little bit of free metadata about the deletion. For your SELECT just get rows WHERE deleted_at IS NULL
You could perform all of your queries against a view that contains the WHERE IS_DELETED='0' clause.
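A minimal sketch of such a view, with made-up table and view names:

CREATE VIEW active_orders AS
SELECT *
  FROM orders
 WHERE is_deleted = '0';

Application code then selects from active_orders and never has to repeat the condition.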
Having an is_deleted column is a reasonably good approach.
If it is in Oracle, to further increase performance I'd recommend partitioning the table by creating a list partition on the is_deleted column.
Then deleted and non-deleted rows will physically be in different partitions, though for you it'll be transparent.
As a result, if you type a query like
SELECT * FROM table_name WHERE is_deleted = 1
then Oracle will perform 'partition pruning' and only look into the appropriate partition. Internally each partition is stored like a separate table, but this is transparent to you as a user: you'll be able to select across the entire table whether it is partitioned or not, while Oracle queries ONLY the partition it needs. For example, assume you have 1000 rows with is_deleted = 0 and 100000 rows with is_deleted = 1, and you partition the table on is_deleted. Now if you include the condition
WHERE ... AND IS_DELETED=0
then Oracle will ONLY scan the partition with 1000 rows. If the table weren't partitioned, it would have to scan 101000 rows (both partitions).
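A sketch of the corresponding Oracle DDL; apart from is_deleted, the table and column names are invented:

CREATE TABLE orders (
    id          NUMBER PRIMARY KEY,
    customer_id NUMBER,
    is_deleted  NUMBER(1) DEFAULT 0 NOT NULL
)
PARTITION BY LIST (is_deleted) (
    PARTITION p_active  VALUES (0),
    PARTITION p_deleted VALUES (1)
);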
The best response, sadly, depends on what you're trying to accomplish with your soft deletions and the database you are implementing this within.
In SQL Server, the best solution would be to use a deleted_on/deleted_at column with a type of SMALLDATETIME or DATETIME (depending on the necessary granularity) and to make that column nullable. In SQL Server, the row header data contains a NULL bitmask for each of the columns in the table so it's marginally faster to perform an IS NULL or IS NOT NULL than it is to check the value stored in a column.
If you have a large volume of data, you will want to look into partitioning your data, either through the database itself or through two separate tables (e.g. Products and ProductHistory) or through an indexed view.
I typically avoid flag fields like is_deleted, is_archive, etc because they only carry one piece of meaning. A nullable deleted_at, archived_at field provides an additional level of meaning to yourself and to whoever inherits your application. And I avoid bitmask fields like the plague since they require an understanding of how the bitmask was built in order to grasp any meaning.
If the table is large and performance is an issue, you can always move 'deleted' records to another table, which has additional info like the time of deletion, who deleted the record, etc.
That way you don't have to add another column to your primary table.
That depends on what information you need and what workflows you want to support.
Do you want to be able to:
know what information was there (before it was deleted)?
know when it was deleted?
know who deleted it?
know in what capacity they were acting when they deleted it?
be able to un-delete the record?
be able to tell when it was un-deleted?
etc.
If the record was deleted and un-deleted four times, is it sufficient for you to know that it is currently in an un-deleted state, or do you want to be able to tell what happened in the interim (including any edits between successive deletions!)?
Careful of soft-deleted records causing uniqueness constraint violations.
If your DB has columns with unique constraints then be careful that the prior soft-deleted records don’t prevent you from recreating the record.
Think of the cycle:
create user (login=JOE)
soft-delete (set deleted column to non-null.)
(re) create user (login=JOE). ERROR. LOGIN=JOE is already taken
Second create results in a constraint violation because login=JOE is already in the soft-deleted row.
Some techniques:
1. Move the deleted record to a new table.
2. Make your uniqueness constraint across the login and deleted_at timestamp column
My own opinion is +1 for moving to a new table. It takes lots of discipline to maintain the *AND deleted_at IS NULL* across all your queries (for all of your developers).
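A sketch of the second technique, assuming a users table with login and deleted_at columns; note that how NULLs are treated inside unique constraints differs between engines, so check the behaviour on yours:

-- active rows have deleted_at NULL; soft-deleted rows keep their deletion timestamp,
-- so a new row with login = 'JOE' no longer collides with the old, deleted one
ALTER TABLE users
    ADD CONSTRAINT uq_users_login_deleted UNIQUE (login, deleted_at);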
You will definitely have better performance if you move your deleted data to another table like Jim said, as well as having a record of when it was deleted, why, and by whom.
Adding where deleted=0 to all your queries will slow them down significantly and hinder the usage of any indexes you may have on the table. Avoid having "flags" in your tables whenever possible.
You don't mention which product, but SQL Server 2008 and PostgreSQL (and others, I'm sure) allow you to create filtered indexes, so you could create a covering index where is_deleted=0, mitigating some of the negatives of this particular approach.
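For example (orders and customer_id are placeholder names; the syntax below works in both SQL Server and PostgreSQL):

-- filtered / partial index that only covers the non-deleted rows
CREATE INDEX ix_orders_active
    ON orders (customer_id)
    WHERE is_deleted = 0;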
Something that I use on projects is a statusInd tinyint not null default 0 column
Using statusInd as a bitmask allows me to perform data management (delete, archive, replicate, restore, etc.). Using this in views, I can then do the data distribution, publishing, etc. for the consuming applications. If performance is a concern regarding views, use small fact tables to support this information; dropping the fact drops the relation and allows for scaled deletes.
This scales well and is data-centric, keeping the data footprint pretty small - key for 350GB+ DBs with real-time concerns. Using alternatives such as tables and triggers has some overhead that, depending on the need, may or may not work for you.
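As an illustration of the bitmask idea only (the bit assignments and table name below are invented, not the poster's actual values):

-- hypothetical bit assignments: 1 = deleted, 2 = archived, 4 = replicated
SELECT *
  FROM orders
 WHERE (statusInd & 1) = 0;  -- rows not flagged as deleted

-- soft-delete: set the 'deleted' bit without touching the other flags
UPDATE orders SET statusInd = statusInd | 1 WHERE id = 42;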
SOX-related audits may require more than a single field in your case, but this may help.
Enjoy
Use a view, function, or procedure that checks is_deleted = 0; i.e. don't select directly on the table in case the table needs to change later for other reasons.
And index the is_deleted column for larger tables.
Since you already have an audit trail, tracking the deletion date is redundant.
I prefer to keep a status column, so I can use it for several different configs, e.g. published, private, deleted, needsApproval...
Create another schema and grant it ALL on your data schema.
Implement VPD on your new schema so that every query automatically has a predicate appended that allows selection of the non-deleted rows only.
http://download.oracle.com/docs/cd/E11882_01/server.112/e16508/cmntopc.htm#CNCPT62345
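A rough sketch of such a VPD policy; the schema, table and function names here are placeholders:

-- policy function returning the predicate to append to every query
CREATE OR REPLACE FUNCTION not_deleted_predicate (
    p_schema IN VARCHAR2,
    p_table  IN VARCHAR2
) RETURN VARCHAR2 IS
BEGIN
    RETURN 'is_deleted = 0';
END;
/

BEGIN
    DBMS_RLS.ADD_POLICY(
        object_schema   => 'APP_DATA',
        object_name     => 'ORDERS',
        policy_name     => 'hide_deleted',
        function_schema => 'APP_SEC',
        policy_function => 'not_deleted_predicate',
        statement_types => 'SELECT');
END;
/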
@AdditionalCriteria("this.status <> 'deleted'")
Put this on top of your @Entity.
http://wiki.eclipse.org/EclipseLink/Examples/JPA/SoftDelete