Suggested techniques for storing multiple versions of SQL row data - sql

I am developing an application that is required to store previous versions of database table rows to maintain a history of changes. I am recording the history in the same table, but I need the most current data to be accessible by a unique identifier that doesn't change with new versions. I have a few ideas on how this could be done and am looking for input on the best approach, or on whether there is any reason not to use one of these:
1. Create a new row for each row version, with a field to indicate which row is the current one. The drawback is that the new version has a different primary key, so any references to the old version will not return the current version.
2. When data is updated, duplicate the old row version into a new row, and have the new version replace the old row. The current row can then be accessed by the same primary key.
3. Add a second table with only a primary key, and add a column to the original table that is a foreign key to the new table's primary key. Store multiple versions as described in option 1, and create a view that finds the current version by using the new table's primary key.

PeopleSoft uses (used?) "effective dated records". It took a little while to get the hang of it, but it served its purpose. The business key is always extended by an EFFDT column (effective date). So if you had a table EMPLOYEE[EMPLOYEE_ID, SALARY] it would become EMPLOYEE[EMPLOYEE_ID, EFFDT, SALARY].
To retrieve the employee's salary:
SELECT e.salary
FROM employee e
WHERE employee_id = :x
AND effdt = (SELECT MAX(effdt)
             FROM employee
             WHERE employee_id = :x
             AND effdt <= SYSDATE)
An interesting application was future-dating records: you could give every employee a 10% increase effective Jan 1 next year, and pre-populate the table a few months beforehand. When SYSDATE crosses Jan 1, the new salary comes into effect. It was also good for running historical reports: instead of using SYSDATE, you plug in a date from the past in order to see the salaries (or exchange rates or whatever) as they would have been reported if run at that time in the past.
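For example, an as-of-date report is just the query above with a bind variable in place of SYSDATE (a sketch in the same Oracle-style syntax; :as_of_date is a hypothetical bind variable):
-- Salary as it would have been reported on :as_of_date
SELECT e.salary
FROM employee e
WHERE e.employee_id = :x
AND e.effdt = (SELECT MAX(effdt)
               FROM employee
               WHERE employee_id = :x
               AND effdt <= :as_of_date)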
In this case, records are never updated or deleted, you just keep adding records with new effective dates. Makes for more verbose queries, but it works and starts becoming (dare I say) normal. There are lots of pages on this, for example: http://peoplesoft.wikidot.com/effective-dates-sequence-status

#3 is probably best, but if you wanted to keep the data in one table, I suppose you could add a datetime column that has a now() value populated for each new row and then you could at least sort by date desc limit 1.
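A minimal sketch of that idea (table and column names are hypothetical; written in PostgreSQL-style syntax to match the now()/limit wording):
-- every write is an INSERT; nothing is ever updated in place
create table item_version (
    item_id    int not null,                       -- the identifier that never changes
    data       text,
    created_at timestamp not null default now()
);

-- current version of one item: newest row wins
select *
from item_version
where item_id = 42
order by created_at desc
limit 1;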
Overall, though, storing multiple versions needs more detail about what you want to do with them, functionally as much as programmatically.

Have you considered using AutoAudit?
AutoAudit is a SQL Server (2005, 2008) code-gen utility that creates audit trail triggers with:
- Created, CreatedBy, Modified, ModifiedBy, and RowVersion (incrementing INT) columns added to the table
- Insert events logged to the Audit table
- Updates logging old and new values to the Audit table
- Deletes logging all final values to the Audit table
- A view to reconstruct deleted rows
- A UDF to reconstruct row history
- A schema audit trigger to track schema changes
- Re-code-gen of the triggers when ALTER TABLE changes the table

For me, history tables are always separate, so I would definitely go with that. Why create some complex versioning scheme where you need to look at the current production record? In reporting, that results in nasty unions that are really unnecessary.
Table has a primary key and who cares what else.
TableHist has these columns: an incrementing int/bigint primary key; the date/time the history row was written; who wrote it; a record type (I, U, or D for insert, update, delete); the PK from Table as an FK on TableHist; plus all of Table's remaining columns, with the same names.
If you create this history table structure and populate it via triggers on Table, you will have all versions of every row in the tables you care about and can easily determine the original record, every change, and the deletion records as well. AND if you are reporting, you only need to use your historical tables to get all of the information you'd like.
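A hedged sketch of that structure for a hypothetical dbo.Customer table (table, column, and trigger names are made up; the FK constraint back to the base table is omitted here so history for deleted rows can be kept):
-- Hypothetical base table
CREATE TABLE dbo.Customer (
    CustomerId int IDENTITY(1,1) PRIMARY KEY,
    Name       varchar(100) NOT NULL,
    Email      varchar(255) NULL
);

-- History table: one row per version of every Customer row
CREATE TABLE dbo.CustomerHist (
    CustomerHistId bigint IDENTITY(1,1) PRIMARY KEY,
    HistWrittenOn  datetime2 NOT NULL DEFAULT SYSUTCDATETIME(),
    HistWrittenBy  sysname NOT NULL DEFAULT SUSER_SNAME(),
    RecordType     char(1) NOT NULL,      -- I, U or D
    CustomerId     int NOT NULL,          -- the PK value from the base table
    Name           varchar(100) NULL,
    Email          varchar(255) NULL
);
GO
CREATE TRIGGER dbo.trCustomer_Hist ON dbo.Customer
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;

    -- inserts and the new image of updates
    INSERT INTO dbo.CustomerHist (RecordType, CustomerId, Name, Email)
    SELECT CASE WHEN EXISTS (SELECT 1 FROM deleted) THEN 'U' ELSE 'I' END,
           i.CustomerId, i.Name, i.Email
    FROM inserted i;

    -- deletes: capture the final image
    INSERT INTO dbo.CustomerHist (RecordType, CustomerId, Name, Email)
    SELECT 'D', d.CustomerId, d.Name, d.Email
    FROM deleted d
    WHERE NOT EXISTS (SELECT 1 FROM inserted);
END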

create table table1 (
Id int identity(1,1) primary key,
[Key] varchar(max),
Data varchar(max)
)
go
create view view1 as
with q as (
select [Key], Data, row_number() over (partition by [Key] order by Id desc) as 'r'
from table1
)
select [Key], Data from q where r=1
go
create trigger trigger1 on view1 instead of update, insert as begin
insert into table1
select [Key], Data
from (select distinct [Key], Data from inserted) a
end
go
insert into view1 values
('key1', 'foo')
,('key1', 'bar')
select * from view1
update view1
set Data='updated'
where [Key]='key1'
select * from view1
select * from table1
drop trigger trigger1
drop table table1
drop view view1
Results:
First SELECT from view1:
Key Data
key1 foo
SELECT from view1 after the UPDATE:
Key Data
key1 updated
SELECT from table1:
Id Key Data
1 key1 bar
2 key1 foo
3 key1 updated
I'm not sure if the distinct is needed.

Related

How do I update a SQL table with daily records?

I have a SQL database where some of my tables are updated daily. I want to create another table which is updated daily with records of what tables (table name, modified/updated date) were updated. I also do not want this table to get too big, so I want this table to only keep records for the last 31 days. How would I write the code for this?
I have already created a table (tUpdatedTables), but I would like this table to be updated daily and to keep these records for 31 days.
This is how I created the table
Select *
Into tUpdatedTables
from sys.tables
order by modify_date desc
I have tried writing an UPDATE statement to update the table, but I get an error:
update tUpdatedTables
set [name]
,[object_id]
,[principal_id]
,[schema_id]
,[parent_object_id]
,[type]
,[type_desc]
,[create_date]
,[modify_date]
,[is_ms_shipped]
,[is_published]
,[is_schema_published]
,[lob_data_space_id]
,[filestream_data_space_id]
,[max_column_id_used]
,[lock_on_bulk_load]
,[uses_ansi_nulls]
,[is_replicated]
,[has_replication_filter]
,[is_merge_published]
,[is_sync_tran_subscribed]
,[has_unchecked_assembly_data]
,[text_in_row_limit]
,[large_value_types_out_of_row]
,[is_tracked_by_cdc]
,[lock_escalation]
,[lock_escalation_desc]
,[is_filetable]
,[is_memory_optimized]
,[durability]
,[durability_desc]
,[temporal_type]
,[temporal_type_desc]
,[history_table_id]
,[is_remote_data_archive_enabled]
,[is_external]
--Into tUpdatedTables
from sys.tables
where modify_date >= GETDATE()
order by modify_date desc
Msg 2714, Level 16, State 6, Line 4 There is already an object named
'tUpdatedTables' in the database.
I want to create another table which is updated daily with records of what tables (table name, modified/updated date) were updated.
If this is all you want, I would suggest instead simply doing daily backups. You should be doing that anyway.
Beyond that, what you're looking for is an audit log. Most languages and frameworks have libraries to do this for you. For example, paper_trail.
If you want to do this yourself, follow the basic pattern of paper_trail.
- id: an autoincrementing primary key
- item_type: which would be the table, or perhaps something more abstract
- item_id: the primary key of the item
- event: are you storing a create, an update, or a delete?
- bywho: identify who made the change
- object: a JSON field containing a dump of the data
- created_at: when this happened (use a default)
Using JSON is key to making this table generic. Rather than trying to store every possible column of every possible table, and having to keep that up to date as the tables change, you store a JSON dump of the row using FOR JSON. This means the audit table doesn't need to change as other tables change. And it will save a lot of disk space as it avoids the audit table having a lot of unused columns.
For example, here's how you'd record creating ID 5 of some_table by user 23. (I might be a bit off as I don't use SQL Server).
insert into audit_log (item_type, item_id, event, bywho, object)
values(
'some_table', 5, 'create', 23, (
select * from some_table where id = 5 for json auto
)
)
Because the audit table doesn't care about the structure of the thing being recorded, you use insert, update, and delete triggers on each table to record their changes in the audit log. Just change the item_type.
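A hedged sketch of what such a trigger might look like in T-SQL (assumes some_table has an int id primary key as in the example above; the audit_log column types and the SUSER_SNAME() call for bywho are guesses, not part of the original answer):
-- audit_log roughly as described above
CREATE TABLE audit_log (
    id         int IDENTITY(1,1) PRIMARY KEY,
    item_type  varchar(100)  NOT NULL,
    item_id    int           NOT NULL,
    event      varchar(10)   NOT NULL,      -- 'create', 'update', 'delete'
    bywho      varchar(100)  NULL,
    object     nvarchar(max) NULL,          -- JSON dump of the row
    created_at datetime2     NOT NULL DEFAULT SYSUTCDATETIME()
);
GO
CREATE TRIGGER trg_some_table_audit ON some_table
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;

    -- creates and updates: snapshot the new state of each affected row as JSON
    INSERT INTO audit_log (item_type, item_id, event, bywho, object)
    SELECT 'some_table',
           i.id,
           CASE WHEN EXISTS (SELECT 1 FROM deleted) THEN 'update' ELSE 'create' END,
           SUSER_SNAME(),    -- or however your application identifies the acting user
           (SELECT i2.* FROM inserted i2 WHERE i2.id = i.id
            FOR JSON PATH, WITHOUT_ARRAY_WRAPPER)
    FROM inserted i;

    -- deletes: snapshot the last known state
    INSERT INTO audit_log (item_type, item_id, event, bywho, object)
    SELECT 'some_table', d.id, 'delete', SUSER_SNAME(),
           (SELECT d2.* FROM deleted d2 WHERE d2.id = d.id
            FOR JSON PATH, WITHOUT_ARRAY_WRAPPER)
    FROM deleted d
    WHERE NOT EXISTS (SELECT 1 FROM inserted);
END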
As for not getting too big, don't worry about it until it's a problem. Proper indexing means it won't be a problem: a composite index on (item_type, item_id) will make listing the changes to a particular thing fast. Indexing bywho will make searches for changes made by a particular thing fast. You shouldn't be referencing this thing in production. If you are, that probably requires a different design.
Partitioning the table by month could also stave off scaling issues.
And if it does get too big, you can backup the table and use created_at to delete old entries.
delete from audit_log
where created_at < dateadd(day, -31, getdate())

How to Never Retrieve Different Rows in a Changing Table

I have a table of millions of rows that is constantly changing (new rows are inserted, updated, and some are deleted). I'd like to query 100 new rows every minute, but these rows can't be ones I've queried before. The table has about two dozen columns and a primary key.
Happy to answer any questions or provide clarification.
A simple solution is to have a separate table with just one row to store the last ID you fetched.
Let's say that's your "table of millions of rows":
-- That's your table with million of rows
CREATE TABLE test_table (
id serial unique,
col1 text,
col2 timestamp
);
-- Data sample
INSERT INTO test_table (col1, col2)
SELECT 'test', generate_series
FROM generate_series(now() - interval '1 year', now(), '1 day');
You can create the following table to store an ID:
-- Table to keep last id
CREATE TABLE last_query (
last_quey_id int references test_table (id)
);
-- Initial row
INSERT INTO last_query (last_quey_id) VALUES (1);
Then, with the following query, you will always fetch 100 rows that have never been fetched from the original table, while maintaining the pointer in last_query:
WITH last_id as (
SELECT last_quey_id FROM last_query
), new_rows as (
SELECT *
FROM test_table
WHERE id > (SELECT last_quey_id FROM last_id)
ORDER BY id
LIMIT 100
), update_last_id as (
UPDATE last_query SET last_quey_id = (SELECT MAX(id) FROM new_rows)
)
SELECT * FROM new_rows;
Rows will be fetched by order of new IDs (oldest rows first).
You basically need a unique, sequential value that is assigned to each record in this table. That allows you to search for the next X records where the value of this field is greater than the last one you got from the previous page.
Easiest way would be to have an identity column as your PK, and simply start from the beginning and include a "where id > #last_id" filter on your query. This is a fairly straightforward way to page through data, regardless of underlying updates. However, if you already have millions of rows and you are constantly creating and updating, an ordinary integer identity is eventually going to run out of numbers (a bigint identity column is unlikely to run out of numbers in your great-grandchildren's lifetimes, but not all DBs support anything but a 32-bit identity).
You can do the same thing with a "CreatedDate" datetime column, but as these dates aren't 100% guaranteed to be unique, depending on how this date is set you might have more than one row with the same creation timestamp, and if those records cross a "page boundary", you'll miss any occurring beyond the end of your current page.
Some SQL systems' GUID generators are guaranteed to be not only unique but sequential. You'll have to look into whether PostgreSQL's GUIDs work this way; if they're true V4 GUIDs, they'll be totally random except for the version identifier and you're SOL. If you do have access to sequential GUIDs, you can filter just like with an integer identity column, only with many more possible key values.
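For what it's worth, SQL Server's sequential generator is NEWSEQUENTIALID(), which can only be used as a column default; whether your database offers an equivalent is exactly the thing to check. A sketch of the SQL Server flavor (table and column names are hypothetical):
-- Sequential GUIDs in SQL Server: only available via a column DEFAULT
CREATE TABLE paged_items (
    row_key uniqueidentifier NOT NULL
        CONSTRAINT DF_paged_items_row_key DEFAULT NEWSEQUENTIALID(),
    payload varchar(200) NULL,
    CONSTRAINT PK_paged_items PRIMARY KEY (row_key)
);

-- Page through by key, remembering the last key you handed out
DECLARE @last_row_key uniqueidentifier = '00000000-0000-0000-0000-000000000000';

SELECT TOP (100) row_key, payload
FROM paged_items
WHERE row_key > @last_row_key
ORDER BY row_key;

-- Caveat: the sequence can restart from a lower range after a server restart,
-- so newer rows are not strictly guaranteed to sort after older ones.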

What is the best way to query deleted records with SQL Server 2016 temporal tables?

I'm looking at SQL Server 2016 temporal tables and can't find any efficient way to query for all historical records that are now deleted.
I would prefer not to soft-delete or move to a 'deleted items' table, as I feel that with temporal tables this is redundant.
Can this be achieved with temporal tables in an efficient way?
Temporal tables are intended to give you a point-in-time view of your data, not a state view - it doesn't actually understand state. Nothing is exposed to users to determine how a row arrived in the temporal history table.
If you did not temporarily pause/stop system versioning on your temporal table then you just need to find the delta between the history table and the active table. All remaining rows in the history table that don't have a corresponding row in the active table are deleted rows.
For example, if you have tblCustCalls and it's enabled for temporal with a tblCustCallsHistory, something like SELECT * FROM tblCustCallsHistory WHERE ID NOT IN (SELECT ID FROM tblCustCalls). In this example, ID is the primary key. You can optimize the TSQL if the tables are very large but the base concept doesn't change.
There is a way to detect it via the ValidTo column of your temporal table.
The latest ValidTo for the record will be less than the current date.
Or another way to look at it, an undeleted record will have a ValidTo that equals '9999-12-31 18:59:59.9900000'. I don't trust this value enough to hard code looking for it, so I just look for ValidTo > current date.
Don't forget it's UTC.
I write in the last updated by user id on the record before I delete it so that essentially becomes a snapshot of who deleted it and when.
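A hedged sketch of that check against the tblCustCalls example above (assumes ID is the primary key and ValidTo is the end-of-period column; FOR SYSTEM_TIME ALL unions the current and history tables):
-- Rows whose newest version ended in the past (UTC) have been deleted
SELECT ID, MAX(ValidTo) AS DeletedAtUtc
FROM tblCustCalls FOR SYSTEM_TIME ALL
GROUP BY ID
HAVING MAX(ValidTo) < SYSUTCDATETIME();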
You could also add an [Action] column containing the action. This results in the following process:
- Adding a new row: just add the row with [Action] = 'Inserted'
- Updating an existing row: just update the row with [Action] = 'Updated'
- Deleting a row: First update the row with [Action] = 'Deleted' and delete the row
This way you can easily find the unchanged rows in your base table (where [Action] = 'Inserted') and the deleted rows in your history table (where [Action] = 'Deleted').
Be aware that deleting this way creates two rows in the history table (one from the UPDATE and one from the DELETE statement).
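A minimal sketch of the delete step under that scheme (table, key, and column names are hypothetical):
-- Stamp the action first so the history table captures a 'Deleted' version,
-- then remove the row; this is what produces the two history rows noted above.
DECLARE @OrderId int = 42;   -- hypothetical key

BEGIN TRANSACTION;
    UPDATE dbo.Orders SET [Action] = 'Deleted' WHERE OrderId = @OrderId;
    DELETE FROM dbo.Orders WHERE OrderId = @OrderId;
COMMIT;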

Select on Row Version

Can I select rows on row version?
I am querying a database table periodically for new rows.
I want to store the last row version and then read all rows from the previously stored row version.
I cannot add anything to the table, the PK is not generated sequentially, and there is no date field.
Is there any other way to get all the rows that are new since the last query?
I am creating a new table that contains all the primary keys of the rows that have been processed and will join on that table to get new rows, but I would like to know if there is a better way.
EDIT
This is the table structure:
Everything except product_id and stock_code are fields describing the product.
You can cast the rowversion to a bigint; then, when you read the rows again, you cast the column to bigint and compare against your previously stored value. The problem with this approach is the table scan each time you select based on the cast of the rowversion, which could be slow if your source table is large.
I haven't tried a persisted computed column of this, I'd be interested to know if it works well.
Sample code (Tested in SQL Server 2008R2):
DECLARE @Table TABLE
(
    Id INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    Data VARCHAR(10) NOT NULL,
    LastChanged ROWVERSION NOT NULL
)

INSERT INTO @Table(Data)
VALUES('Hello'), ('World')

SELECT
    Id,
    Data,
    LastChanged,
    CAST(LastChanged AS BIGINT)
FROM
    @Table

DECLARE @Latest BIGINT = (SELECT MAX(CAST(LastChanged AS BIGINT)) FROM @Table)

SELECT * FROM @Table WHERE CAST(LastChanged AS BIGINT) >= @Latest
EDIT: It seems I've misunderstood, and you don't actually have a ROWVERSION column, you just mentioned row version as a concept. In that case, SQL Server Change Data Capture would be the only thing left I could think of that fits the bill: http://technet.microsoft.com/en-us/library/bb500353(v=sql.105).aspx
Not sure if that fits your needs, as you'd need to be able to store the LSN of "the last time you looked" so you can query the CDC tables properly. It lends itself more to data loads than to typical queries.
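For reference, the rough shape of the CDC route looks like the following sketch (assumes a dbo.Products source table and the default capture instance name; in practice you would persist the LSN you read up to and use it as @from_lsn on the next run):
-- One-time setup (requires SQL Server Agent and appropriate permissions)
EXEC sys.sp_cdc_enable_db;
EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'Products',
    @role_name     = NULL;

-- Each polling run: read changes between the last LSN you stored and now
DECLARE @from_lsn binary(10) = sys.fn_cdc_get_min_lsn('dbo_Products');  -- replace with your stored LSN
DECLARE @to_lsn   binary(10) = sys.fn_cdc_get_max_lsn();

SELECT *
FROM cdc.fn_cdc_get_all_changes_dbo_Products(@from_lsn, @to_lsn, N'all');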
Assuming you can create a temporary table, the EXCEPT command seems to be what you need:
1. Copy your table into a temporary table.
2. The next time you look, select everything from your table EXCEPT everything from the temporary table, and extract the keys you need from the result.
3. Make sure your temporary table is up to date again.
Note that your temporary table only needs to contain the keys you need. If this is just one column, you can go for a NOT IN rather than EXCEPT.
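A sketch of that flow, using the product_id/stock_code columns mentioned in the question; dbo.Products stands in for your source table, and a persistent snapshot table is used rather than a true #temp table, since the snapshot has to survive between runs:
-- First run: remember every key you have already seen
SELECT product_id, stock_code
INTO dbo.SeenKeys
FROM dbo.Products;

-- Each later run: keys present now that were not there last time
SELECT product_id, stock_code
INTO #new_keys
FROM (
    SELECT product_id, stock_code FROM dbo.Products
    EXCEPT
    SELECT product_id, stock_code FROM dbo.SeenKeys
) AS new_rows;

SELECT * FROM #new_keys;     -- the not-yet-processed keys

-- Bring the snapshot up to date for the next run
INSERT INTO dbo.SeenKeys (product_id, stock_code)
SELECT product_id, stock_code FROM #new_keys;

DROP TABLE #new_keys;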

Audit Triggers: Use INSERTED or DELETED system tables

The topic of how to audit tables has recently sprung up in our discussions... so I'd like your opinion on what's the best way to approach this. We have a mix of both approaches (which is not good) in our database, as each previous DBA did what he/she believed was the right way. So we need to change them to follow a single model.
CREATE TABLE dbo.Sample(
Name VARCHAR(20),
...
...
Created_By VARCHAR(20),
Created_On DATETIME,
Modified_By VARCHAR(20),
Modified_On DATETIME
)
CREATE TABLE dbo.Audit_Sample(
Name VARCHAR(20),
...
...
Created_By VARCHAR(20),
Created_On DATETIME,
Modified_By VARCHAR(20),
Modified_On DATETIME,
Audit_Type VARCHAR(1) NOT NULL,
Audited_Created_On DATETIME,
Audit_Created_By VARCHAR(50)
)
Approach 1: Store, in the audit table, only those records that are replaced/deleted from the main table (using the DELETED system table). So for each UPDATE and DELETE in the main table, the record that is being replaced is INSERTED into the audit table with the 'Audit_Type' column set to either 'U' (for UPDATE) or 'D' (for DELETE).
INSERTs are not audited. For the current version of any record you always query the main table, and for history you query the audit table.
Pros: Seems intuitive to store the previous versions of records.
Cons: If you need to know the full history of a particular record, you need to join the audit table with the main table.
Approach 2: Store, in the audit table, every record that goes into the main table (using the INSERTED system table).
Each record that is INSERTED/UPDATED/DELETED in the main table is also stored in the audit table. So when you insert a new record, it is also inserted into the audit table. When updated, the new version (from the INSERTED table) is stored in the audit table. When deleted, the old version (from the DELETED table) is stored in the audit table.
Pros: If you need to know the history of a particular record, you have everything in one location.
Though I did not list all of them here, each approach has its pros and cons. Which would you go with?
I'd go with:
Approach 2: Store, in the audit table, every record that goes into the main table (using the INSERTED system table).
Is one more row per item really going to kill the DB? This way you have the complete history together.
If you purge out rows (a range all older than X days), you can still tell whether something has changed or not:
- if an audit row exists (not purged), you can see if the row in question changed.
- if no audit rows exist for the item (all were purged), nothing changed (since any change writes to the audit table, including completely new items).
If you go with Approach 1 and purge out a range, it will be hard (you need to remember the purge date) to tell new inserts from rows whose audit records were all purged.
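A sketch of what the Approach 2 trigger could look like for the Sample/Audit_Sample tables above (only the columns shown in the question are handled; the elided ones would be listed the same way, and GETDATE()/SUSER_SNAME() are just one way to stamp the audit columns):
CREATE TRIGGER dbo.tr_Sample_Audit ON dbo.Sample
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;

    -- INSERTED holds new rows for inserts and the new image for updates
    INSERT INTO dbo.Audit_Sample
        (Name, Created_By, Created_On, Modified_By, Modified_On,
         Audit_Type, Audited_Created_On, Audit_Created_By)
    SELECT i.Name, i.Created_By, i.Created_On, i.Modified_By, i.Modified_On,
           CASE WHEN EXISTS (SELECT 1 FROM deleted) THEN 'U' ELSE 'I' END,
           GETDATE(), SUSER_SNAME()
    FROM inserted i;

    -- DELETED holds the final image of removed rows
    INSERT INTO dbo.Audit_Sample
        (Name, Created_By, Created_On, Modified_By, Modified_On,
         Audit_Type, Audited_Created_On, Audit_Created_By)
    SELECT d.Name, d.Created_By, d.Created_On, d.Modified_By, d.Modified_On,
           'D', GETDATE(), SUSER_SNAME()
    FROM deleted d
    WHERE NOT EXISTS (SELECT 1 FROM inserted);
END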
A third approach we use a lot is to audit only the interesting columns, saving both the 'old' and 'new' value on each row.
So if you have your "name" column, the audit table would have "name_old" and "name_new".
In the INSERT trigger, "name_old" is set to blank/null (depending on your preference) and "name_new" is set from INSERTED.
In the UPDATE trigger, "name_old" is set from DELETED and "name_new" from INSERTED.
In the DELETE trigger, "name_old" is set from DELETED and "name_new" to blank/null.
(or you use a FULL join and one trigger for all cases)
For VARCHAR fields, this might not look like such a good idea, but for INTEGER, DATETIME, etc it provides the benefit that it's very easy to see the difference of the update.
I.e. if you have a quantity field in your real table and update it from 5 to 7, you'd have in the audit table:
quantity_old quantity_new
5 7
You can easily see that the quantity was increased by 2 at that specific time.
If you have separate rows in the audit table, you have to join each row with "the next one" to calculate the difference, which can be tricky in some cases...
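A hedged sketch of the single-trigger FULL JOIN variant for a hypothetical OrderLine table with a quantity column (all names are illustrative):
CREATE TABLE dbo.OrderLine (
    OrderLineId int IDENTITY(1,1) PRIMARY KEY,
    quantity    int NOT NULL
);

CREATE TABLE dbo.OrderLine_Audit (
    AuditId      int IDENTITY(1,1) PRIMARY KEY,
    OrderLineId  int NOT NULL,
    quantity_old int NULL,          -- NULL for inserts
    quantity_new int NULL,          -- NULL for deletes
    AuditedOn    datetime NOT NULL DEFAULT GETDATE()
);
GO
CREATE TRIGGER dbo.tr_OrderLine_Audit ON dbo.OrderLine
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;

    -- FULL JOIN covers all three cases in one statement:
    --   insert: row only in inserted  -> quantity_old is NULL
    --   update: row in both           -> old and new populated (e.g. 5 -> 7)
    --   delete: row only in deleted   -> quantity_new is NULL
    INSERT INTO dbo.OrderLine_Audit (OrderLineId, quantity_old, quantity_new)
    SELECT COALESCE(i.OrderLineId, d.OrderLineId),
           d.quantity,
           i.quantity
    FROM inserted i
    FULL JOIN deleted d ON d.OrderLineId = i.OrderLineId;
END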