Finding changed records in a database table - SQL

I have a problem that I haven't been able to come up with a solution for yet. I have a database (actually thousands of them at customer sites) that I want to extract data from periodically. I'd like to do a full data extract one time (select * from table) then after that only get rows that have changed.
The challenge is that there aren't any updated-date columns in most of the tables that could be used to constrain the SQL query. I can't use a trigger-based approach, nor can I change the application that writes to the database, since another group develops the app and they are way backed up already.
I may be able to write to the database tables when doing the data extract, but would prefer not to do that. Does anyone have any ideas for how we might be able to do this?

You will have to programmatically mark the records. I see suggestions of an auto-incrementing field, but that will only get newly inserted records. How will you track updated or deleted records?
If you only want newly inserted records, then an auto-incrementing field will do the job; in subsequent data dumps, grab everything since the last value of the auto-increment field and then record the current value.
If you want updates, the minimum I can see is to have a last_update field and probably a trigger to populate it. If the last_update is later than the last data dump, grab that record. This will get inserts and updates but not deletes.
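A minimal SQL Server-style sketch of such a trigger, assuming a hypothetical table my_table with an id key and a last_update column (with the default RECURSIVE_TRIGGERS setting it will not re-fire itself):
CREATE TRIGGER trg_my_table_touch
ON my_table
AFTER INSERT, UPDATE
AS
BEGIN
    -- stamp every affected row with the current time
    UPDATE t
    SET last_update = GETDATE()
    FROM my_table t
    INNER JOIN inserted i ON t.id = i.id;
END;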
You could try something like an 'instead of delete' trigger, if your RDBMS supports it, and NULL the last_update field. On subsequent data dumps, grab all records where this field is NULL and then delete them. But there would be problems with this (e.g. how to stop the app seeing them between the logical and physical delete).
The most foolproof method I can see is a set of history (audit) tables to which each change gets written. Then you select your data dump from there.
By the way, do you only care about knowing that updates have happened? What if 2 (or more) updates have happened between dumps? The history table is the only way I can see of capturing this scenario.

This should isolate rows that have changed since your last backup, assuming DestinationTable is a copy of SourceTable including the key fields; if not, you could list out the important fields.
SELECT * FROM SourceTable
EXCEPT
SELECT * FROM DestinationTable
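Note that this only returns new rows and the new versions of changed rows. One way to pick up deletions is to run the same EXCEPT in the other direction (it will also return the old versions of updated rows):
SELECT * FROM DestinationTable
EXCEPT
SELECT * FROM SourceTable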

Related

Passing Existing Record For Update Again In Update Table Component In Abinitio

What happens if an already existing record is sent, unchanged, in an update query to an Update Table component? Does it go unused?
I have an Ab Initio output file which has records to be updated (not inserted). I need to collect only those records which are actually updated. So how can we separate the records which are exactly the same as before in the DB (not updated) from those which have at least one field updated?
This is an ETL question, as DB2 will do the update whether the row has changes or not. I do not know Ab Initio in detail, but you have to do change detection upstream of the DB2 update.
Usually ETL tools have some kind of "Change Capture" / "Compare" / "Difference" function to detect changes.
You can try to play with the Unused port in the Update Table component. Also look at the ActionRequired flag.
An easy way to determine whether an update will occur (or better yet, to feed in only actual updates) is to read from the database every record that your update file would touch, and join that data with the update file using all the fields in the record as the key. The records from the update file that come out on the unused port are the ones that will actually perform an update.
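In SQL terms the same idea looks roughly like this, assuming the update file has been loaded into a hypothetical staging table update_stage with columns id, col1 and col2:
-- rows in the update file that do not match the database on every field
-- are the ones that will actually change something; note that nullable
-- columns need extra care, since NULL = NULL does not evaluate to true
SELECT u.*
FROM update_stage u
LEFT JOIN target_table t
  ON t.id = u.id
 AND t.col1 = u.col1
 AND t.col2 = u.col2
WHERE t.id IS NULL;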
My first approach was the same as Alex suggested, but it seems a join with the DB using all fields as the key will take more time and resources. A better approach is to compare the existing values and the new values to be updated, field by field, in a Reformat select parameter or a Filter by Expression. This will give only those records which will actually be updated.
Also, Michael is right: DB2 will update irrespective of whether it is an actual update or not. So the unused port will not give the records which are not updated.

SQL - When was my table last change?

I want to find when the last INSERT, UPDATE or DELETE statement was performed on a table in an Oracle database (just one table for now; in the future I want to do this for multiple tables).
I created a table and then updated one of its rows. Now I have the following query:
SELECT SCN_TO_TIMESTAMP(ora_rowscn) FROM test_table;
This query returns a timestamp for each row, giving the time when each row was first created.
But the row that I've updated has the same timestamp as the others. Why? Shouldn't its timestamp have been updated?
ORA_ROWSCN is not the right solution for this. It is not necessarily reliable at the row level: by default the SCN is tracked per block rather than per row, unless the table was created with ROWDEPENDENCIES. Moreover, it's not going to be useful at all for deleted rows.
If you have a real need to know when DML changes were made to a table, you should look at Oracle's auditing feature.
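With traditional auditing, for example, a single statement covers the table (this assumes the AUDIT_TRAIL initialization parameter is set so the trail is actually recorded):
-- record every INSERT, UPDATE and DELETE against the table
AUDIT INSERT, UPDATE, DELETE ON test_table BY ACCESS;
-- timestamps can then be read back from DBA_AUDIT_TRAIL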
An alternative is to use triggers to record when changes are made to the table. Since you say you only care about the time of the most recent change, you can just create a single-column table to record the time, and write a trigger that fires on any DML statement to maintain it. If you're doing this in a production environment or even just in one where more than one session might be modifying the table, you'd want to think about how it should work when concurrent changes are made. You could force the table to have at most one row, but that would serialize every change to the table. You could allow each session to insert a separate row and take the max value when querying it, but then you probably want to think about clearing out old rows from time to time.
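A minimal sketch of the trigger approach, using a hypothetical one-column log table and the insert-per-session variant described above:
CREATE TABLE test_table_changes (change_time TIMESTAMP);

CREATE OR REPLACE TRIGGER trg_test_table_dml
AFTER INSERT OR UPDATE OR DELETE ON test_table
BEGIN
  -- statement-level trigger: one row per DML statement, no serialization
  INSERT INTO test_table_changes VALUES (SYSTIMESTAMP);
END;
/

-- the time of the most recent change:
SELECT MAX(change_time) FROM test_table_changes;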

DB schema for updating downstream sources?

I want a table to be sync-able by a web API.
For example,
GET /projects?sequence_latest=2113&limit=10
[{"state":"updated", "id":12,"sequence":2116},
{"state":"deleted" "id":511,"sequence":2115}
{"state":"created", "id":601,"sequence":2114}]
What is a good schema to achieve this?
I intend this for PostgreSQL with the Django ORM, which uses surrogate keys. The presence of an ORM may rule out answers that rely on UNIONs.
I can come up with only half-solutions:
1. I could have a modified_time column, but that cannot convey deletions.
2. I could have a table for storing deleted IDs; when returning 10 new/updated rows, I could also return all the deleted rows between them. But this works only when the latest change is an insert/update and there are a moderate number of deleted rows.
3. I could set a deleted flag on the row and NULL the rest, but it's bad schema design to make every column nullable.
4. I could have another table that stores the ID, a modification sequence number and a state (new, updated, deleted), but it's another table to maintain, and setting sequence numbers causes contention; imagine n concurrent requests querying for the latest ID.
If you're using an ORM you want simple(ish) and if you're serving the data via an API you want quick.
To go through your suggested options:
1. Correct, so this doesn't help you. You could have a deleted flag in your main table though.
2. This seems quite a random way of doing it and breaks your insistence that there be no UNION queries.
3. Not sure why you would need to NULL the rest of the columns here. What benefit does that bring?
4. I would strongly advise against a table with a modification sequence number. Either you end up performing a lot of analytic queries to find the most recent state, or you are updating the same rows multiple times and maintaining a table with the same PK as your normal one. At that point you might as well have a deleted flag in your main table.
Essentially the design of your API gives you one easy option: you should have everything in the same table, because all data is being returned through the same method. I would follow your point 2 and Wolph's suggestion and have a deleted_on column in your table, making it look like:
create table my_table (
id ... primary key
, <other_columns>
, created_on date
, modified_on date
, deleted_on date
);
I wouldn't even bother updating all the other columns to be NULL. If you want to ensure that you return no data for deleted rows, create a view on top of your table that nulls out the data where the deleted_on column is set. Then your API only accesses the table through the view.
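A sketch of such a view, with a single hypothetical payload column some_column standing in for the real ones:
create view my_table_api as
select id
     , case when deleted_on is null then some_column end as some_column
     , created_on
     , modified_on
     , deleted_on
  from my_table;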
If you are really, really worried about space and the volume of records, and will perform regular database maintenance to ensure that both are controlled, then maybe go with option 4: create a second table that holds the state of each ID in your main table, and actually delete the data from your main table. You can then do a LEFT OUTER JOIN to the main table to get the data; where there is no data, that ID has been deleted. Honestly, this is overkill until you know whether you will definitely require it.
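That lookup could look something like this (hypothetical table names; Postgres syntax, since that is the target):
select s.id, s.sequence, s.state, t.*
from my_table_state s
left outer join my_table t on t.id = s.id
where s.sequence > 2113   -- the client's sequence_latest
order by s.sequence
limit 10;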
You don't mention why you're using a web API for data transfers, but if you're going to be transferring a lot of data, or this is for internal systems only, it might be worth using a lower-level transfer mechanism.

What is the best way to track when table(s) are updated in SQL?

In the past I have just added a field to each table and updated it with GETDATE() on every update/insert. The problem is that now I have to keep track of deletes too. I was thinking of just having a table that I would update when anything changed, and adding a trigger to all of the other tables. Ideas??? Thanks!
If you have a history table (a table with the same columns as the original table, plus an auto-increment ID column), you can track everything about changes to the original table: inserts, deletes, and every update. Use triggers for insert, update, and delete to put a row into the history table. If you don't need all these options, then use only those that you do need.
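A minimal T-SQL sketch of that, assuming a hypothetical table myTable(id, name):
CREATE TABLE myTable_History (
    HistoryID INT IDENTITY(1,1) PRIMARY KEY,
    id INT,
    name VARCHAR(100),
    Operation CHAR(1),               -- 'I', 'U' or 'D'
    ChangedAt DATETIME DEFAULT GETDATE()
);

CREATE TRIGGER trg_myTable_History
ON myTable
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    -- new values for inserts and updates
    INSERT INTO myTable_History (id, name, Operation)
    SELECT i.id, i.name,
           CASE WHEN EXISTS (SELECT 1 FROM deleted) THEN 'U' ELSE 'I' END
    FROM inserted i;

    -- old values for deletes
    INSERT INTO myTable_History (id, name, Operation)
    SELECT d.id, d.name, 'D'
    FROM deleted d
    WHERE NOT EXISTS (SELECT 1 FROM inserted);
END;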
If you choose to use an IsDeleted flag in the original table, it complicates every query, and leaves your active table with lots of unneeded rows. But that can work, depending on your needs.
I've seen tables designed with a bit field named IsDeleted, whose default value is of course false. When an item is deleted, this value is set to true. All queries would then need to take this into account:
SELECT blah FROM myTable WHERE IsDeleted=0
This way if you "accidentally" deleted a row, you should be able to bring it back. You could also purge records on say a weekly / monthly / yearly basis.
That is just an idea for you.
If you are using SQL Server 2008, you can take advantage of the new auditing features.
Flag the records with deleted=1 and do not delete them. Use an INSTEAD OF DELETE trigger that performs an update instead...
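Something along these lines (a T-SQL sketch, assuming a hypothetical myTable with an id key and an IsDeleted flag):
CREATE TRIGGER trg_myTable_SoftDelete
ON myTable
INSTEAD OF DELETE
AS
BEGIN
    -- turn the physical delete into a soft delete
    UPDATE t
    SET IsDeleted = 1
    FROM myTable t
    INNER JOIN deleted d ON t.id = d.id;
END;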
I've also seen a duplicate table with a standardized prefix added to the name. All the deleted rows are moved to the duplicate table. This removes the overhead of keeping but ignoring the rows in the original table.
All actions (insert - update - delete) should be logged in a journalling table. I always log the action, the timestamp, and the user who triggered the action. Adding an IsDeleted column to the original table is bad practice.
If you are using SQL Server 2008 then you can use CDC (Change Data Capture) for the tracking.
The link below gives the full details. If you enable CDC for a particular table, deleted data will be captured automatically.
http://www.simple-talk.com/sql/learn-sql-server/introduction-to-change-data-capture-%28cdc%29-in-sql-server-2008/
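Enabling it looks like this (a sketch assuming a hypothetical dbo.MyTable; CDC also requires SQL Server Agent to be running to harvest the log):
-- enable CDC at the database level, then for the table
EXEC sys.sp_cdc_enable_db;
EXEC sys.sp_cdc_enable_table
     @source_schema = N'dbo',
     @source_name   = N'MyTable',
     @role_name     = NULL;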

Database history for client usage

I'm trying to figure out what would be the best way to have a history on a database, to track any Insert/Delete/Update that is done. The history data will need to be coded into the front-end since it will be used by the users. Creating "history tables" (a copy of each table used to store history) is not a good way to do this, since the data is spread across multiple tables.
At this point in time, my best idea is to create a few history tables that reflect the output I want to show to the users. Whenever a change is made to the relevant tables, I would update these history tables with the data as well.
I'm trying to figure out the best way to go about this. Any suggestions will be appreciated.
I am using Oracle + VB.NET
I have used, very successfully, a model where every table has an audit copy - the same table with a few additional fields (timestamp, user id, operation type) - and 3 triggers on the first table for insert/update/delete.
I think this is a very good way of handling this, because tables and triggers can be generated from a model and there is little overhead from a management perspective.
The application can use the tables to show an audit history to the user (read-only).
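A sketch of one such audit copy and trigger, assuming a hypothetical my_table(id, some_col):
create table my_table_aud (
  id         number,
  some_col   varchar2(100),
  audit_ts   timestamp,
  audit_user varchar2(30),
  audit_op   varchar2(1)   -- 'I', 'U' or 'D'
);

create or replace trigger trg_my_table_aud
after insert or update or delete on my_table
for each row
begin
  if deleting then
    insert into my_table_aud
    values (:old.id, :old.some_col, systimestamp, user, 'D');
  else
    insert into my_table_aud
    values (:new.id, :new.some_col, systimestamp, user,
            case when inserting then 'I' else 'U' end);
  end if;
end;
/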
We've got that requirement in our systems. We added two tables, one header and one detail, called AuditRow and AuditField. AuditRow contains one row per row changed in any other table, and AuditField contains one row per column changed, with the old value and the new value.
We have a trigger on every table that writes a header row (AuditRow) and the needed detail rows (one per changed column) on each insert/update/delete. This system relies on the fact that we have a GUID on every table that can uniquely represent the row. It doesn't have to be the "business" or "primary" key, but it's a unique identifier for that row, so we can identify it in the audit tables. Works like a champ. Overkill? Perhaps, but we've never had a problem with auditors. :-)
And yes, the Audit tables are by far the largest tables in the system.
If you are lucky enough to be on Oracle 11g, you could also use the Flashback Data Archive.
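Setting it up is fairly compact (hypothetical archive name and tablespace; the retention period is your choice):
create flashback archive fda_hist
  tablespace users
  retention 1 year;

alter table my_table flashback archive fda_hist;

-- history can then be queried with flashback syntax:
select *
from my_table
as of timestamp (systimestamp - interval '1' day);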
Personally, I would stay away from triggers. They can be a nightmare when it comes to debugging and not necessarily the best if you are looking to scale out.
If you are using a PL/SQL API to do the INSERTs/UPDATEs/DELETEs, you could manage this with a simple shift in design, without the need (up front) for history tables.
All you need are 2 extra columns, DATE_FROM and DATE_THRU. When a record is INSERTed, DATE_THRU is left NULL. If that record is UPDATEd or DELETEd, just "end date" it by setting DATE_THRU to the current date/time (SYSDATE). Showing the history is as simple as selecting from the table; the one record where DATE_THRU is NULL will be your current or active record.
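Inside the API, an update then becomes an end-date plus an insert (a sketch with hypothetical names; it assumes id is not the sole primary key, since multiple versions of a row share it):
-- close off the current version of the row
update my_table
   set date_thru = sysdate
 where id = p_id
   and date_thru is null;

-- write the new version (omit this step for a delete)
insert into my_table (id, some_col, date_from, date_thru)
values (p_id, p_new_value, sysdate, null);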
Now if you expect a high volume of changes, writing the old records off to a history table would be preferable, but I still wouldn't manage it with triggers; I'd do it with the API.
Hope that helps.