Database history for client usage - vb.net

I'm trying to figure out what would be the best way to have a history on a database, to track any Insert/Delete/Update that is done. The history data will need to be coded into the front-end since it will be used by the users. Creating "history tables" (a copy of each table used to store history) is not a good way to do this, since the data is spread across multiple tables.
At this point, my best idea is to create a few history tables shaped to match the output I want to show to the users. Whenever a change is made to specific tables, I would update these history tables with the data as well.
I'm trying to figure out what the best way to go about would be. Any suggestions will be appreciated.
I am using Oracle + VB.NET

I have used very successfully a model where every table has an audit copy - the same table with a few additional fields (time stamp, user id, operation type), and 3 triggers on the first table for insert/update/delete.
I think this is a very good way of handling this, because tables and triggers can be generated from a model and there is little overhead from a management perspective.
The application can use the tables to show an audit history to the user (read-only).
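A minimal sketch of the audit-copy model described above, using Python's sqlite3 module so it runs anywhere. The table and trigger names (customer, customer_audit, app_session) are made up for illustration, and SQLite trigger syntax differs from Oracle's, so treat this as the shape of the idea rather than Oracle code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT);

-- Audit copy: same columns plus time stamp, user id and operation type.
CREATE TABLE customer_audit (
    audit_id   INTEGER PRIMARY KEY AUTOINCREMENT,
    audit_ts   TEXT DEFAULT CURRENT_TIMESTAMP,
    audit_user TEXT,
    operation  TEXT,
    id INTEGER, name TEXT, city TEXT
);

-- Assumption: SQLite has no session user, so a one-row table stands in for it.
CREATE TABLE app_session (user_id TEXT);
INSERT INTO app_session VALUES ('alice');

CREATE TRIGGER customer_ai AFTER INSERT ON customer BEGIN
    INSERT INTO customer_audit (audit_user, operation, id, name, city)
    VALUES ((SELECT user_id FROM app_session), 'I', NEW.id, NEW.name, NEW.city);
END;
CREATE TRIGGER customer_au AFTER UPDATE ON customer BEGIN
    INSERT INTO customer_audit (audit_user, operation, id, name, city)
    VALUES ((SELECT user_id FROM app_session), 'U', NEW.id, NEW.name, NEW.city);
END;
CREATE TRIGGER customer_ad AFTER DELETE ON customer BEGIN
    INSERT INTO customer_audit (audit_user, operation, id, name, city)
    VALUES ((SELECT user_id FROM app_session), 'D', OLD.id, OLD.name, OLD.city);
END;
""")

conn.execute("INSERT INTO customer VALUES (1, 'Acme', 'Oslo')")
conn.execute("UPDATE customer SET city = 'Bergen' WHERE id = 1")
conn.execute("DELETE FROM customer WHERE id = 1")

# Read-only history view for the front-end.
history = conn.execute(
    "SELECT operation, audit_user, name, city FROM customer_audit ORDER BY audit_id"
).fetchall()
```

Since the three triggers differ only in the operation code and in OLD versus NEW, they are easy to generate from the table definition.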

We've got that requirement in our systems. We added two tables, one header, one detail called AuditRow and AuditField. The AuditRow contains one row per row changed in any other table, and the AuditField contains one row per column changed with old value and new value.
We have a trigger on every table that writes a header row (AuditRow) and the needed detail rows (one per changed column) on each insert/update/delete. This system does rely on the fact that we have a guid on every table that can uniquely represent the row. Doesn't have to be the "business" or "primary" key, but it's a unique identifier for that row so we can identify it in the audit tables. Works like a champ. Overkill? Perhaps, but we've never had a problem with auditors. :-)
And yes, the Audit tables are by far the largest tables in the system.
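Here is a rough sketch of the AuditRow/AuditField layout. Note that it writes the audit rows from application code rather than from a trigger, purely to keep the example short and portable; the column names and the record_change helper are assumptions, not the poster's actual schema.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE AuditRow (
    audit_row_id INTEGER PRIMARY KEY AUTOINCREMENT,
    table_name   TEXT NOT NULL,
    row_guid     TEXT NOT NULL,
    operation    TEXT NOT NULL,
    changed_at   TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE AuditField (
    audit_row_id INTEGER NOT NULL REFERENCES AuditRow(audit_row_id),
    column_name  TEXT NOT NULL,
    old_value    TEXT,
    new_value    TEXT
);
""")

def record_change(conn, table_name, row_guid, operation, old, new):
    """Write one AuditRow header plus one AuditField row per changed column."""
    cur = conn.execute(
        "INSERT INTO AuditRow (table_name, row_guid, operation) VALUES (?, ?, ?)",
        (table_name, row_guid, operation),
    )
    audit_row_id = cur.lastrowid
    for column in sorted(set(old) | set(new)):
        before, after = old.get(column), new.get(column)
        if before != after:
            conn.execute(
                "INSERT INTO AuditField VALUES (?, ?, ?, ?)",
                (audit_row_id, column, before, after),
            )
    return audit_row_id

guid = str(uuid.uuid4())
record_change(conn, "customer", guid, "I", {}, {"name": "Acme", "city": "Oslo"})
record_change(conn, "customer", guid, "U",
              {"name": "Acme", "city": "Oslo"}, {"name": "Acme", "city": "Bergen"})

fields = conn.execute(
    "SELECT audit_row_id, column_name, old_value, new_value "
    "FROM AuditField ORDER BY audit_row_id, column_name"
).fetchall()
```

Because every value is stored as text in one generic pair of tables, the same two tables serve the whole schema, which is also why they grow so large.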

If you are lucky enough to be on Oracle 11g, you could also use the Flashback Data Archive.

Personally, I would stay away from triggers. They can be a nightmare when it comes to debugging and not necessarily the best if you are looking to scale out.
If you are using a PL/SQL API to do the INSERT/UPDATE/DELETEs you could manage this with a simple shift in design without the need (up front) for history tables.
All you need are 2 extra columns, DATE_FROM and DATE_THRU. When a record is INSERTed, the DATE_THRU is left NULL. If that record is UPDATEd or DELETEd, just "end date" the record by making DATE_THRU the current date/time (SYSDATE); for an UPDATE, the new values then go in as a fresh record with a NULL DATE_THRU. Showing the history is as simple as selecting from the table; the one record where DATE_THRU is NULL will be your current or active record.
Now if you expect a high volume of changes, writing off the old record to a history table would be preferable, but I still wouldn't manage it with triggers, I'd do it with the API.
Hope that helps.
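A small sketch of the DATE_FROM/DATE_THRU approach, with plain Python functions standing in for the PL/SQL API. The table layout and function names are assumptions for illustration, and the timestamp is passed in as a parameter where Oracle code would use SYSDATE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE customer (
    row_id    INTEGER PRIMARY KEY AUTOINCREMENT,
    cust_id   INTEGER NOT NULL,
    city      TEXT,
    date_from TEXT NOT NULL,
    date_thru TEXT            -- NULL marks the current version
)""")

def insert_customer(conn, cust_id, city, now):
    conn.execute(
        "INSERT INTO customer (cust_id, city, date_from) VALUES (?, ?, ?)",
        (cust_id, city, now),
    )

def delete_customer(conn, cust_id, now):
    # "End date" the active version instead of removing it.
    conn.execute(
        "UPDATE customer SET date_thru = ? WHERE cust_id = ? AND date_thru IS NULL",
        (now, cust_id),
    )

def update_customer(conn, cust_id, city, now):
    # Close the active version, then open a new one, in one transaction.
    with conn:
        delete_customer(conn, cust_id, now)
        insert_customer(conn, cust_id, city, now)

insert_customer(conn, 1, "Oslo", "2024-01-01 09:00:00")
update_customer(conn, 1, "Bergen", "2024-03-15 14:30:00")

current = conn.execute(
    "SELECT city FROM customer WHERE cust_id = 1 AND date_thru IS NULL"
).fetchall()
history = conn.execute(
    "SELECT city, date_from, date_thru FROM customer WHERE cust_id = 1 ORDER BY date_from"
).fetchall()
```

One consequence of this design is that the business key (cust_id here) is no longer unique on its own, so every query for live data needs the DATE_THRU IS NULL filter.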

Related

Is it practical to record table creation/modification events in a separate table?

I'm about to begin designing a database for an MVC web app and am considering using a table called 'changes' or 'events' to keep track of all changes made to the database. It would have a field for the table name, record id, user making the change, the timestamp and whether it was a creation or modification event.
My question is whether this is a poor design practice compared to having 'created by', 'created on', 'modified by', 'modified on' fields in each table. I was thinking that in the parent Model class, I would use a before-save function that records every change. One pitfall I can see is that if many records were updated at once, it might be difficult to get the function to save the changes properly.
This is a matter of weighing up the benefit of having the granular info against the overhead of writing to this table every time something in the database changes and the additional storage required.
If you are concerned about using a function on the model class then an alternative would be database triggers. This would be quite robust, but would be more work to set up as you would need to define triggers for each table unless the database you are using has a facility to log DML changes generically.
Finally, I would also advise considering archiving as this table has the potential to get very big very quickly.
I think your approach is fine. One advantage of having these events in a separate table is you can capture multiple edits. If you just have ModifiedDate/ModifiedBy columns you only see the last edit.
The main thing to be aware of is table size since audit tables can get VERY big. You may also decide to split into multiple audit tables (e.g. use the same table name with an _audit suffix) to improve query performance.
It depends on what you need. The reason for creating and maintaining an events table is to maintain an audit trail of changes.
For some applications, the created / updated fields at the end of a row are a sufficient audit trail.
For more secure applications, you need an events table. You also need to include the actual change (before / after) in your events table.
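As a rough illustration of an events table that also captures the before/after values, here is a sketch of a save function that logs the change in the same transaction. The article table, the JSON snapshot columns and the save_article helper are all hypothetical; a real model layer would hook this into its own before-save callback.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE article (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE events (
    event_id    INTEGER PRIMARY KEY AUTOINCREMENT,
    table_name  TEXT NOT NULL,
    record_id   INTEGER NOT NULL,
    user_name   TEXT NOT NULL,
    event_type  TEXT NOT NULL,      -- 'create' or 'modify'
    before_json TEXT,               -- NULL on create
    after_json  TEXT NOT NULL,
    created_at  TEXT DEFAULT CURRENT_TIMESTAMP
);
""")

def save_article(conn, user_name, record_id, title):
    """Save a row and log the change in the same transaction."""
    with conn:
        row = conn.execute(
            "SELECT title FROM article WHERE id = ?", (record_id,)
        ).fetchone()
        before = None if row is None else json.dumps({"title": row[0]})
        event_type = "create" if row is None else "modify"
        conn.execute(
            "INSERT OR REPLACE INTO article (id, title) VALUES (?, ?)",
            (record_id, title),
        )
        conn.execute(
            "INSERT INTO events "
            "(table_name, record_id, user_name, event_type, before_json, after_json) "
            "VALUES ('article', ?, ?, ?, ?, ?)",
            (record_id, user_name, event_type, before, json.dumps({"title": title})),
        )

save_article(conn, "alice", 1, "Draft")
save_article(conn, "bob", 1, "Final")

events = conn.execute(
    "SELECT user_name, event_type, before_json, after_json FROM events ORDER BY event_id"
).fetchall()
```

Unlike ModifiedDate/ModifiedBy columns, this keeps every edit, at the cost of one extra insert per save.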
Also consider a time-related table, where each record has a start and end date, along with a created by. Any change actually sets the end date for the previous record and creates a new record, with a NULL end date.
Your current record is the one with a NULL end date.
Basically, every record has a life-span.

SQL Server Auditing Data in the Same Table

A project I'm working on requires that a record be digitally "signed" and after that any modifications would create a new "version" of the row. The "signed" record can't be modified for regulatory reasons and new versions shouldn't be modified very often. In the past, I've done so by creating a separate logging table with the same schema as the main table with some extra columns for tracking who modified it and when.
However, after doing some work with SharePoint where ALL data (including different versions) is put into the same table I thought of a different approach which I can't find any examples of people doing: I could put the new version of the row right in the same table and increment the version number. Then add the version number to the PK.
PROS:
Implementation is easy, just create an "Instead of update" trigger which performs an insert instead of an update if the row is "signed". I could easily add an IsCurrentVersion column to be updated in the trigger.
Querying for older versions is easy, just get all the records with the ID I want and let the user choose from the list.
A trigger is nice because it guarantees that a row CAN'T be updated if signed (for regulatory and audit purposes).
Schema changes to the table don't have to be replicated to the mirror "logging" table.
CONS:
The table could get a bit larger but most of the time the record won't be changed after "signing" it. The client estimated around 100,000 rows/year max at current usage levels. SQL Server can handle hundreds of millions of rows so this doesn't seem too bad.
Indexing and performance could be an issue. SharePoint adds a tp_CalculatedVersion int to the PK where the calculated number is always 0 for the latest version. I could do the same and calculate it based off the Version number. Would that help performance?
There is an extra step in querying the data to make sure you get the latest version but that could be handled in a SP.
What other cons are there in this scenario? Am I missing anything?
I've seen this pattern used in an enterprise system before, and IMO it wasn't successful.
You are mixing two different concerns here, viz. storage of live and audit data. Queries to this table will always need to keep in mind whether they are seeking leaf or audit data (e.g. reports) - new team members may find this non-intuitive. You would likely need to encapsulate this complexity with views etc.
As you mentioned, performance will always be a concern. Inserting a new record will also need to update the previous record to mark it as inactive. You may also need to consider changing your clustered index to keep all versions on the same page.
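To make the same-table versioning idea concrete, here is a sketch in SQLite via Python. SQLite only allows INSTEAD OF triggers on views, so the sketch routes updates through a view, whereas SQL Server can put the trigger on the table itself. All names (record, record_current, the triggers) are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE record (
    id         INTEGER NOT NULL,
    version    INTEGER NOT NULL,
    body       TEXT,
    signed     INTEGER NOT NULL DEFAULT 0,
    is_current INTEGER NOT NULL DEFAULT 1,
    PRIMARY KEY (id, version)
);
CREATE VIEW record_current AS
    SELECT id, version, body, signed FROM record WHERE is_current = 1;

-- Unsigned rows are edited in place.
CREATE TRIGGER record_edit_unsigned INSTEAD OF UPDATE ON record_current
WHEN OLD.signed = 0 BEGIN
    UPDATE record SET body = NEW.body, signed = NEW.signed
    WHERE id = OLD.id AND version = OLD.version;
END;

-- Signed content is left alone: the row is retired and the next version inserted.
CREATE TRIGGER record_edit_signed INSTEAD OF UPDATE ON record_current
WHEN OLD.signed = 1 BEGIN
    UPDATE record SET is_current = 0
    WHERE id = OLD.id AND version = OLD.version;
    INSERT INTO record (id, version, body) VALUES (OLD.id, OLD.version + 1, NEW.body);
END;

-- Guard against direct edits of signed content on the base table.
CREATE TRIGGER record_guard BEFORE UPDATE OF body, signed ON record
WHEN OLD.signed = 1 BEGIN
    SELECT RAISE(ABORT, 'signed rows are immutable');
END;
""")

conn.execute("INSERT INTO record (id, version, body) VALUES (1, 1, 'draft')")
conn.execute("UPDATE record_current SET body = 'final', signed = 1 WHERE id = 1")
conn.execute("UPDATE record_current SET body = 'amended' WHERE id = 1")

try:
    conn.execute("UPDATE record SET body = 'tampered' WHERE id = 1 AND version = 1")
    blocked = False
except sqlite3.DatabaseError:
    blocked = True

versions = conn.execute(
    "SELECT version, body, signed, is_current FROM record WHERE id = 1 ORDER BY version"
).fetchall()
```

The view also takes care of the "extra step" of filtering for the latest version, since readers of live data query record_current instead of the table.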
Foreign keys to this table are going to be problematic. Do you reference an exact version record? Do you then fix up the foreign keys to point to the new live leaf record?
The one benefit I can think of doing this is that the audit table DDL will always be in synch with the live table - often with the 2 table strategy changes are made to the live, and the audit and trigger DDL isn't updated accordingly.
Overall, I would still recommend keeping your audit table separate.
If the requirement is that the signed data not be changed, then you should move it to another table. In fact, I might suggest moving it to another database/schema, where the only operation allowed on the table is inserting and reading records. You can use both permissions and triggers, if you really want to prevent updates.
You don't want to mess around with regulatory requirements. A complex schema that uses a combination of primary key with version, along with triggers, is a sign that there might be a simpler way.
The historical records can affect performance of the current records. If you end up in a situation where every record has changed 100 times, then keeping them in the same table is just going to slow down queries. Of course, you can embark on more complexity, in the form of partitioning the data. In the end, the solution is simpler: keep the data that cannot be changed in another table where it cannot be changed. You don't want to have to upgrade the hardware just because lots of history has accumulated.
I would also suggest including an effective and end date in the history records. This will allow you to reconstruct all the data as of a particular date, something that users might find useful in the future.
That's right. Audit trails can stay in an application for internal reporting/audit but infosec best practice mandates getting audit logs off the system where they are generated into your log management / SIEM solution.

Finding changed records in a database table

I have a problem that I haven't been able to come up with a solution for yet. I have a database (actually thousands of them at customer sites) that I want to extract data from periodically. I'd like to do a full data extract one time (select * from table) then after that only get rows that have changed.
The challenge is that there aren't any updated date columns in most of the tables that could be used to constrain the SQL query. I can't use a trigger based approach nor change the application that writes to the database since it's another group that develops the app and they are way backed up already.
I may be able to write to the database tables when doing the data extract, but would prefer not to do that. Does anyone have any ideas for how we might be able to do this?
You will have to programmatically mark the records. I see suggestions of an auto-incrementing field but that will only get newly inserted records. How will you track updated or deleted records?
If you only want newly inserted records then an auto-incrementing field will do the job; in subsequent data dumps grab everything since the last value of the auto-increment field and then record the current value.
If you want updates the minimum I can see is to have a last_update field and probably a trigger to populate it. If the last_update is later than the last data dump, grab that record. This will get inserts and updates but not deletes.
You could try something like an 'instead of delete' trigger if your RDBMS supports it and NULL the last_update field. On subsequent data dumps grab all records where this field is NULL and then delete them. But there would be problems with this (e.g. how to stop the app seeing them between the logical and physical delete).
The most foolproof method I can see is a set of history (audit) tables where each change gets written to them. Then you select your data dump from there.
By the way, do you only care about knowing that updates have happened? What if 2 (or more) updates have happened? The history table is the only way that I can see you capturing this scenario.
This should isolate rows that have changed since your last backup. Assuming DestinationTable is a copy of SourceTable even on the key fields; if not you could list out the important fields.
SELECT * FROM SourceTable
EXCEPT
SELECT * FROM DestinationTable
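A quick runnable check of the EXCEPT approach, using SQLite through Python with made-up sample rows. It assumes DestinationTable is the copy taken at the previous extract. The second query is an addition to the answer above: EXCEPT in one direction only finds new and changed rows, so deleted rows need the comparison reversed on the key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE SourceTable      (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE DestinationTable (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
INSERT INTO DestinationTable VALUES (1, 'Acme', 'Oslo'), (2, 'Bolt', 'Paris'), (3, 'Cord', 'Rome');
INSERT INTO SourceTable      VALUES (1, 'Acme', 'Oslo'), (2, 'Bolt', 'Lyon'),  (4, 'Dune', 'Bonn');
""")

# New or modified rows: present in the source, absent (as-is) from the copy.
changed = conn.execute(
    "SELECT * FROM SourceTable EXCEPT SELECT * FROM DestinationTable ORDER BY id"
).fetchall()

# Deleted rows: keys in the copy that no longer exist in the source.
deleted = conn.execute(
    "SELECT id FROM DestinationTable EXCEPT SELECT id FROM SourceTable ORDER BY id"
).fetchall()
```

This needs no triggers and no schema changes, but it does compare every row on every run, so the cost grows with table size rather than with the number of changes.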

Best practice for auditing data in SQL Server and retrieving point in time data

I've been doing history tables for some time now in databases, but never put too much effort or thought into it. I wonder what is the best practice out there.
My main goal is to record any changes to a record for a particular day. If more than one change happens in a day then only one history record will exist. I need to record the date the record was changed, and when I retrieve data I need to pull the correct record from history as it was at a particular time. So for example I have a customers table and want to pull out what their address was for a particular date. My sprocs, like "get customer details", will take in an optional date, and if no date is passed in they return the most recent record.
So here's what I was looking for advice on:
Do I keep the history table in the same table and use a logical delete flag to hide the historical ones? I normally don't do this as some tables can change a lot and have lots of records.
Do I use a separate table that mirrors the main table? I usually do this.
Should I only put change records into the history table and not the current one?
What is the most efficient way given a date to pull out the right record at a point in time: get every record for a customer <= date passed in, and then sort by most recent date and take the top?
Thanks for all the help... regards M
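For reference, the "records <= date, newest first, take the top" lookup from the question can be sketched like this. It assumes a history table that holds every version including the current one, keyed by customer and day so that a second change on the same day overwrites the first; the table and function names are hypothetical, and SQL Server would use TOP 1 where SQLite uses LIMIT 1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_history (
    cust_id    INTEGER NOT NULL,
    changed_on TEXT NOT NULL,          -- ISO date, one row per customer per day
    address    TEXT,
    PRIMARY KEY (cust_id, changed_on)
);
""")

def record_address(conn, cust_id, changed_on, address):
    # INSERT OR REPLACE keeps only the last change made on a given day.
    conn.execute(
        "INSERT OR REPLACE INTO customer_history VALUES (?, ?, ?)",
        (cust_id, changed_on, address),
    )

def address_as_of(conn, cust_id, as_of=None):
    """Return the address in effect on as_of, or the latest one if as_of is None."""
    row = conn.execute(
        "SELECT address FROM customer_history "
        "WHERE cust_id = ? AND (? IS NULL OR changed_on <= ?) "
        "ORDER BY changed_on DESC LIMIT 1",
        (cust_id, as_of, as_of),
    ).fetchone()
    return row[0] if row else None

record_address(conn, 1, "2024-01-10", "1 Main St")
record_address(conn, 1, "2024-01-10", "2 Main St")   # same day: replaces the row above
record_address(conn, 1, "2024-06-01", "9 Oak Ave")

in_march = address_as_of(conn, 1, "2024-03-01")
latest = address_as_of(conn, 1)
```

With the primary key on (cust_id, changed_on), the lookup is a short index range scan rather than a sort over all of a customer's history.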
Suggestion is to use trigger based auditing and create triggers for all tables you need to audit.
With triggers you can accomplish the requirement for not storing more than one record update per day.
I'd suggest you check out ApexSQL Audit, which generates triggers for you, and try to reverse engineer what triggers they use, what the storage tables look like and such.
This will give you a good start and you can work from there.
Disclaimer: not affiliated with ApexSQL but I do use their tools on a daily basis.
I'm no expert in the field but a good SQL consultant once told me that a good approach is generally to use the same table if all data can be changed. Otherwise have the original table contain only core non-changeable data and the historical table contain only stuff that can be changed.
You should definitely read this article on managing bitemporal data. The nice thing about this approach is it enables an auditable way of correcting historical data.
I believe this will address your concerns about modifying the history data.
I've always used a modified version of the audit table described in this article. While it does require you to pivot data so that it resembles your table's native structure, it is resilient against changes to the schema.
You can create a UDF that returns a table and accepts a table name (varchar) and point in time (datetime) as parameters. The UDF should rebuild the table using the audit (historical values) and give you the effective values at that date & time.

How to keep track of which rows have been imported in SQL?

Let's say I want to import all the customers (or all the rows in some other specific table) to some external system. Not all at once but every one after they have been created in database. To do that I have to keep record of all the rows that have already been reported because I want to find only the ones that have not been reported yet. Is it generally better to add a column to do that or to create some kind of a batchlog table?
I'm using MS SQL Server if that is relevant
A Simplified example:
select * from Customer where reportedToExternalSystem is null
or
select * from Customer where cus_id not in (select cus_id from integrationBatchLog)
Or are there other ways to do this that might be even better? This is the first time I've done something like this, so I don't know the best practice yet.
The simple solution is to add a column that marks the row as imported. A status int (0/1) or if you want to keep track of when it was imported an imported date. This solution does have some limitations:
You can only import the row once. Do you need to import the customer again when the record is updated? Are you going to clear the update field when the customer is updated?
It causes the row to be locked when you update the row status. Are you sure the application that inserts the customer record will be happy with your code locking the records?
On some systems it causes the entire row to be written to the log system for recovery. Depending on the size of the row this can be a lot of log writing for just one field.
In a highly parallel import system you can have a lot of contention for resources. If one import program is locking the table, think how bad it would be if many import programs are locking the table at the same time.
If the customer data is updated several times between your import polling interval, you will only see the latest data and will skip over the intermediate updates. This is only an issue if you care about the intermediate updates. For customers you might not care, for order statuses you might care a lot.
You have to modify the table structure. This might not be allowed by the source application due to data/support/political issues.
Besides putting a status column in the table, one technique that works well is to put a trigger on the table and mirror the import data to a second table. You would then 'consume' the data in the second table. This has several advantages:
It keeps the locking issues contained to the second table.
It allows you to process every update to the main table.
You can add an index to the second table that is used to keep track of the update statuses without the issues of changing the main table.
If you delete the rows from the second table (either immediately as they are consumed or after a short audit period) the size of the table/index will be kept to a minimum.
When I use this technique in SQL Server I put the second table in a separate schema. Since most apps store their tables in dbo, you can end up with dbo.Customers and Import.Customers. This can help you to keep track of which tables you are importing and keeps you from having to come up with new names for your import tables.
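A sketch of the mirror-table technique, again in SQLite via Python. SQLite has no schemas, so Import_Customers stands in for Import.Customers; the trigger names and the consume function are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customers (cus_id INTEGER PRIMARY KEY, name TEXT);

-- Stand-in for Import.Customers: one queued row per insert or update.
CREATE TABLE Import_Customers (
    queue_id INTEGER PRIMARY KEY AUTOINCREMENT,
    cus_id   INTEGER NOT NULL,
    name     TEXT
);

CREATE TRIGGER Customers_ai AFTER INSERT ON Customers BEGIN
    INSERT INTO Import_Customers (cus_id, name) VALUES (NEW.cus_id, NEW.name);
END;
CREATE TRIGGER Customers_au AFTER UPDATE ON Customers BEGIN
    INSERT INTO Import_Customers (cus_id, name) VALUES (NEW.cus_id, NEW.name);
END;
""")

def consume(conn, send):
    """Send every queued row to the external system, then remove it from the queue."""
    with conn:
        rows = conn.execute(
            "SELECT queue_id, cus_id, name FROM Import_Customers ORDER BY queue_id"
        ).fetchall()
        for queue_id, cus_id, name in rows:
            send(cus_id, name)
            conn.execute("DELETE FROM Import_Customers WHERE queue_id = ?", (queue_id,))
    return len(rows)

conn.execute("INSERT INTO Customers VALUES (1, 'Acme')")
conn.execute("UPDATE Customers SET name = 'Acme Ltd' WHERE cus_id = 1")

sent = []
first_run = consume(conn, lambda cus_id, name: sent.append((cus_id, name)))
second_run = consume(conn, lambda cus_id, name: sent.append((cus_id, name)))
```

If send raises, the transaction rolls back and the rows stay queued, so the external system may see a row more than once but should not miss one.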
Unless you have to complicate the implementation, go with the simplest solution possible. One important thing you should consider is how hard it would be to refactor this simple solution into a more general one, in case you need to.
In your case I see only one problem in upgrading from a column to a table: you might need a history of imports. Solution: make the reportedToExternalSystem column a DateTime (or Timestamp) type.
I would use a separate table indicating, say, import date cross-referenced to the key of the record in the table you're tracking. In other words, a table with 3 columns: auto-increment key, record-id-from-other-table, import-date. Something like that. This also allows the case if a record is ever re-imported later. You'd have track of all the imports by date.
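Something like the following, as a sketch of the three-column log table. Customer, cus_id and integrationBatchLog come from the question; the other column names and the helper functions are assumptions. It uses NOT EXISTS instead of the question's NOT IN, which behaves badly if the subquery ever returns a NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customer (cus_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE integrationBatchLog (
    log_id      INTEGER PRIMARY KEY AUTOINCREMENT,
    cus_id      INTEGER NOT NULL REFERENCES Customer(cus_id),
    import_date TEXT NOT NULL
);
INSERT INTO Customer VALUES (1, 'Acme'), (2, 'Bolt');
""")

def pending(conn):
    """Customers that have never been reported to the external system."""
    return conn.execute(
        "SELECT c.cus_id FROM Customer c WHERE NOT EXISTS "
        "(SELECT 1 FROM integrationBatchLog l WHERE l.cus_id = c.cus_id) "
        "ORDER BY c.cus_id"
    ).fetchall()

def mark_imported(conn, cus_id, import_date):
    conn.execute(
        "INSERT INTO integrationBatchLog (cus_id, import_date) VALUES (?, ?)",
        (cus_id, import_date),
    )

mark_imported(conn, 1, "2024-05-01")
mark_imported(conn, 1, "2024-05-08")   # a later re-import is just another log row

still_pending = pending(conn)
imports_of_1 = conn.execute(
    "SELECT import_date FROM integrationBatchLog WHERE cus_id = 1 ORDER BY log_id"
).fetchall()
```

An index on integrationBatchLog(cus_id) would keep the NOT EXISTS probe cheap as the log grows.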
I prefer having a column for the import status. Maintaining a separate log becomes more time-consuming as the table grows. I only have a conceptual understanding of SQL Server, but it seems this would work.