Which Job inserted this record / row into the table? - sql

I have dozens of different SQL Jobs calling different Sprocs, which insert rows into a common table.
Is there any way, given a row in the table, to retrieve the job which triggered the insert?
Input: Row ID, TableName, DBName
Output: Job ID which inserted Row

Not generally, as far as I'm aware. You could have the insert query include that data. Or you could get it from a log, maybe based on the primary key or another unique key, if your inserts are unique. You might be able to turn on some SQL server equivalent of the general log; but that's devastating to high volume performance and you'd still have to pull it from a log file. I recommend you consider whether you can diagnose your components from their logs in addition to their effects in the database.
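To make the "have the insert query include that data" idea concrete, here is a minimal sketch. It uses SQLite through Python only so that it runs anywhere; the table common_table, the column inserted_by_job and the job names are all made up for illustration, and in SQL Server each sproc would take the job identifier as a parameter instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE common_table (
        id              INTEGER PRIMARY KEY,
        payload         TEXT NOT NULL,
        inserted_by_job TEXT NOT NULL   -- hypothetical column: who inserted the row
    )
""")

def insert_row(conn, payload, job_id):
    """Stand-in for a sproc: every insert carries the calling job's identifier."""
    cur = conn.execute(
        "INSERT INTO common_table (payload, inserted_by_job) VALUES (?, ?)",
        (payload, job_id),
    )
    return cur.lastrowid

def job_for_row(conn, row_id):
    """Given a row id, return the job recorded at insert time (or None)."""
    row = conn.execute(
        "SELECT inserted_by_job FROM common_table WHERE id = ?", (row_id,)
    ).fetchone()
    return row[0] if row else None

first = insert_row(conn, "order 1", "nightly_load")
second = insert_row(conn, "order 2", "hourly_sync")
```

With this in place, the lookup asked for in the question becomes a plain SELECT on the row id, but only for rows inserted after the column was added.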

Related

Keeping BigQuery table data up-to-date

This is probably not the intended use case for BigQuery, but I have the following problem: I need to periodically update a BigQuery table. The update should be "atomic" in the sense that clients reading the data should see either only the old version of the data or the completely new version. The only solution I have now is to use date partitions. The problem with this solution is that clients that just need to read up-to-date data have to know about the partitions and read only from certain ones. Every time I want to run a query, I first have to figure out which partition to use and only then select from the table. Is there any way to improve this? Ideally I would like a solution that is easy and transparent for clients who read the data.
You didn't mention the size of your update, so I can only give some general guidelines.
Most BigQuery updates, including a single DML statement (INSERT/UPDATE/DELETE/MERGE) and a single load job, are atomic. Your reader reads either the old data or the new data.
Lacking multi-statement transactions right now, if you have updates that don't fit into a single load job, the solution is:
Load the update into a staging table and wait until all loads have finished
Use a single INSERT or MERGE to merge the updates from the staging table into the primary data table
The drawback: scanning the staging table is not free.
Update: since you have multiple tables to update atomically, there is a small trick which may be helpful.
Assuming each table that needs an update has an ActivePartition column as its partition key, you can keep a separate table with only one row.
CREATE TABLE ActivePartition (active DATE);
Each time after loading, you set ActivePartition.active to the new active date; your users then run a script:
DECLARE active DATE DEFAULT (SELECT active FROM ActivePartition);
-- Actual query
SELECT ... FROM dataTable WHERE ActivePartition = active
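The pointer-table trick can be tried out locally. The sketch below is not BigQuery code: it uses SQLite through Python as a stand-in, stores dates as text, and the sample values are invented. It only illustrates that readers keep seeing the old partition while new data is loaded, and switch the moment the single-row pointer is updated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ActivePartition (active TEXT NOT NULL);
    CREATE TABLE dataTable (ActivePartition TEXT NOT NULL, value INTEGER NOT NULL);
    INSERT INTO ActivePartition VALUES ('2024-01-01');
    INSERT INTO dataTable VALUES ('2024-01-01', 1), ('2024-01-01', 2);
""")

def read_current(conn):
    """What a client does: read only the partition the pointer names."""
    rows = conn.execute("""
        SELECT value FROM dataTable
        WHERE ActivePartition = (SELECT active FROM ActivePartition)
        ORDER BY value
    """)
    return [r[0] for r in rows]

before = read_current(conn)

# Load the new partition. Readers are unaffected because the pointer is unchanged.
conn.executemany(
    "INSERT INTO dataTable VALUES (?, ?)",
    [("2024-01-02", 10), ("2024-01-02", 20), ("2024-01-02", 30)],
)
during = read_current(conn)

# Flip the pointer with one single-row UPDATE; this is the "atomic" switch.
conn.execute("UPDATE ActivePartition SET active = '2024-01-02'")
after = read_current(conn)
```

The same pointer row can serve several data tables, which is what makes the switch consistent across them.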

Capture log of fetched table rows

I am using db2 (if you have a solution with another database, I am still interested), and am trying to identify every row that is fetched from a specific table. The solution needs to be at the database level, because I do not have access to the actual SELECT statements that cause the fetch. I would at a minimum like to capture one or more column values into a log/table for every row that is fetched from a specific table.
Here's an example:
Table1 structure
CustNo (primary key)
CustName
Table 1 (two rows)
12345, Joe's Crab Shack
98765, Morton's The Steakhouse
Process
1) Before select, log file is empty
2) Execute: SELECT CustName from Table1 where CustNo=12345
3) After select, log file contains:
LogFile1
---------
12345
4) Execute: SELECT * from Table1
5) After select, log file contains:
LogFile1
---------
12345
12345
98765
Thank you for any advice/recommendations....
If you're willing to call a SP to log this info, you might want simply to add a *READ trigger. It's rarely a good idea to try to get some function to run whenever any record is read from a file, but a *READ trigger is possibly the most efficient way possible.
ADDPFTRG FILE(X) TRGTIME(*AFTER) TRGEVENT(*READ) PGM(Y)
Use that form of the command to add your "read-only" trigger program (Y) to a file (X). Program Y should probably do something fast like push the relevant data items onto a data queue. Then have multiple batch instances of a program that pulls entries off the queue and writes them to a log file. You really don't want a read-only trigger doing any more work than necessary, and database I/O should be off the list.
Expect performance to suffer some.
You can get the operations on the database via db2audit, but you cannot get the values used. Using the values for audit or logging will compromise sensitive data.
http://pic.dhe.ibm.com/infocenter/db2luw/v10r5/topic/com.ibm.db2.luw.admin.cmd.doc/doc/r0002072.html
Actually, if you put the ID of all read rows for a given table, it is like copying the ID column several times in the log table. At the same time, it does not give any order context, because the order in which the rows are inserted is not the same order as the rows will be stored or retrieved.
You have to rethink your logging strategy, because just inserting the 'fetched' ID is not enough. You also have to insert some context information, like who (user), when (date), from where (machine) in order to exploit that data.
Another thing you can do is wrap the select in a stored procedure and insert the ID values into the log table before returning the open cursor to the caller.
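As a rough illustration of the wrapper idea, here is a sketch in Python with SQLite rather than a DB2 stored procedure. The FetchLog table, its columns and the fetch_customers function are invented for the example; it logs who read each row and when, as suggested above, and it only helps if callers go through the wrapper instead of issuing their own SELECT.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (CustNo INTEGER PRIMARY KEY, CustName TEXT NOT NULL);
    CREATE TABLE FetchLog (
        CustNo     INTEGER NOT NULL,
        fetched_by TEXT NOT NULL,   -- who
        fetched_at TEXT NOT NULL    -- when (UTC, ISO-8601)
    );
    INSERT INTO Table1 VALUES
        (12345, 'Joe''s Crab Shack'),
        (98765, 'Morton''s The Steakhouse');
""")

def fetch_customers(conn, user, cust_no=None):
    """Run the select, log every fetched key with context, then return the rows."""
    if cust_no is None:
        rows = conn.execute(
            "SELECT CustNo, CustName FROM Table1 ORDER BY CustNo"
        ).fetchall()
    else:
        rows = conn.execute(
            "SELECT CustNo, CustName FROM Table1 WHERE CustNo = ?", (cust_no,)
        ).fetchall()
    now = datetime.now(timezone.utc).isoformat()
    conn.executemany(
        "INSERT INTO FetchLog VALUES (?, ?, ?)",
        [(row[0], user, now) for row in rows],
    )
    return rows

fetch_customers(conn, "alice", 12345)   # step 2 of the question's example
fetch_customers(conn, "bob")            # step 4: SELECT * equivalent
logged = [r[0] for r in conn.execute("SELECT CustNo FROM FetchLog ORDER BY rowid")]
```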

Reverting a database insertion with log files?

I am working on a program that is supposed to insert hundreds of rows into the database per run.
The problem is: once the inserted data turns out to be wrong, how can we recover from that run? Currently I only have a log file (in a format I created), which records the raw data that gets inserted (no metadata or primary keys). Is there a way to create a log that the database can understand, so that when we want to undo the insertion we feed that log file to the database?
Or, if there is an alternative mechanism for undoing an operation from a program, kindly let me know. Thanks.
The fact that this is only hundreds of rows makes it susceptible to the great-grandmother of all undo mechanisms:
have a table importruns with a row for each run you do. I assume it has an integer auto-increment PK
add a field to your data table that carries the PK of the import run
for insert-only runs, you just need to DELETE FROM sometable WHERE importid=$whatever
If you also have replace/update imports, go one step further:
for each data table, have a corresponding table that has one more field: superseededby
for each row you update/replace, place an original copy of the row in this table, plus the import id in superseededby
to revert, you now also have to restore the originals: remove the replaced rows, then INSERT INTO originaltable SELECT ... FROM superseededtable WHERE superseededby=$whatever (list the original columns explicitly, because superseededtable has the extra superseededby field)
You can clean up superseededtable for known-good imports to make sure storage doesn't grow without limit.
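For the insert-only case, the scheme can be sketched as follows, using SQLite through Python. The table names importruns and sometable and the importid column come from the answer; the other columns and the helper functions run_import and revert_import are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE importruns (
        id         INTEGER PRIMARY KEY AUTOINCREMENT,
        started_at TEXT DEFAULT CURRENT_TIMESTAMP
    );
    CREATE TABLE sometable (
        id       INTEGER PRIMARY KEY,
        value    TEXT NOT NULL,
        importid INTEGER NOT NULL REFERENCES importruns(id)
    );
""")

def run_import(conn, values):
    """Register a run, then tag every inserted row with that run's id."""
    with conn:  # one transaction for the whole run
        run_id = conn.execute("INSERT INTO importruns DEFAULT VALUES").lastrowid
        conn.executemany(
            "INSERT INTO sometable (value, importid) VALUES (?, ?)",
            [(v, run_id) for v in values],
        )
    return run_id

def revert_import(conn, run_id):
    """Undo an insert-only run; returns the number of rows removed."""
    with conn:
        cur = conn.execute("DELETE FROM sometable WHERE importid = ?", (run_id,))
        return cur.rowcount

good_run = run_import(conn, ["a", "b"])
bad_run = run_import(conn, ["x", "y", "z"])
removed = revert_import(conn, bad_run)
remaining = [r[0] for r in conn.execute("SELECT value FROM sometable ORDER BY id")]
```

The row in importruns is kept after the revert, so there is still a record that the run happened.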
You have several options, depending on when you notice the error.
If you find out about the error while the run is still in progress, then you can use the transactions API to roll back the changes of the current transaction.
If you only find out about the error later, then you can create your own log. Create an identifier for each transaction, and add a field to the relevant table where that id is inserted. This allows you to identify exactly which transaction a row came from. You can also create a stored procedure that deletes rows according to a given transaction id.

Finding changed records in a database table

I have a problem that I haven't been able to come up with a solution for yet. I have a database (actually thousands of them at customer sites) that I want to extract data from periodically. I'd like to do a full data extract one time (select * from table) then after that only get rows that have changed.
The challenge is that there aren't any updated date columns in most of the tables that could be used to constrain the SQL query. I can't use a trigger based approach nor change the application that writes to the database since it's another group that develops the app and they are way backed up already.
I may be able to write to the database tables when doing the data extract, but would prefer not to do that. Does anyone have any ideas for how we might be able to do this?
You will have to programmatically mark the records. I see suggestions of an auto-incrementing field, but that will only catch newly inserted records. How will you track updated or deleted records?
If you only want newly inserted records, then an auto-incrementing field will do the job: in subsequent data dumps, grab everything since the last value of the auto-increment field and then record the current value.
If you want updates, the minimum I can see is to have a last_update field and probably a trigger to populate it. If last_update is later than the last data dump, grab that record. This will get inserts and updates but not deletes.
You could try something like an 'instead of delete' trigger, if your RDBMS supports it, and NULL the last_update field. On subsequent data dumps, grab all records where this field is NULL and then delete them. But there would be problems with this (e.g. how to stop the app seeing them between the logical and physical delete).
The most foolproof method I can see is a set of history (audit) tables, with each change written to them. Then you select your data dump from there.
By the way, do you only care about knowing that an update has happened? What if two (or more) updates have happened to the same record? The history table is the only way I can see of capturing this scenario.
This should isolate rows that have changed since your last backup, assuming DestinationTable is a copy of SourceTable, including the key fields; if not, you could list out the important fields.
SELECT * FROM SourceTable
EXCEPT
SELECT * FROM DestinationTable
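Here is a small runnable check of that query, using SQLite through Python with invented sample rows. Note that the EXCEPT query returns new and updated rows from the source; rows deleted from the source need the reverse comparison on the key, which is shown as a second query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE SourceTable      (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE DestinationTable (id INTEGER PRIMARY KEY, name TEXT);

    -- DestinationTable is the copy taken at the last extract.
    INSERT INTO DestinationTable VALUES (1, 'alpha'), (2, 'beta'), (3, 'gamma');

    -- Since then: row 2 was updated, row 3 was deleted, row 4 was inserted.
    INSERT INTO SourceTable VALUES (1, 'alpha'), (2, 'BETA'), (4, 'delta');
""")

# New or changed rows: present in the source but not identical in the copy.
changed = conn.execute("""
    SELECT * FROM SourceTable
    EXCEPT
    SELECT * FROM DestinationTable
    ORDER BY id
""").fetchall()

# Deleted rows: keys that exist in the copy but no longer in the source.
deleted = conn.execute("""
    SELECT id FROM DestinationTable
    EXCEPT
    SELECT id FROM SourceTable
    ORDER BY id
""").fetchall()
```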

Some sort of “different auto-increment indexes” per a primary key values

I have a table with an id (primary key with auto-increment), a uid (a key referring to a user's id, for example), and some other columns that don't matter for my question.
I want to have, let's call it, a separate auto-increment sequence on id for each uid value.
So, if I add an entry with uid 10, the id field for this entry will be 1, because there were no previous entries with uid 10. If I then add a new one with uid 4, its id will be 3, because there were already two entries with uid 4.
...A very obvious explanation, but I am trying to be as explanatory and clear as I can to demonstrate the idea... clearly.
What SQL engine can provide such functionality natively? (non-Microsoft/Oracle based)
If there is none, how could I best replicate it? Triggers, perhaps?
Does this functionality have a more suitable name?
In case you know of a non-SQL database engine providing such functionality, name it anyway; I am curious.
Thanks.
MySQL's MyISAM engine can do this. See their manual, in section Using AUTO_INCREMENT:
For MyISAM tables you can specify AUTO_INCREMENT on a secondary column in a multiple-column index. In this case, the generated value for the AUTO_INCREMENT column is calculated as MAX(auto_increment_column) + 1 WHERE prefix=given-prefix. This is useful when you want to put data into ordered groups.
The docs go on after that paragraph, showing an example.
The InnoDB engine in MySQL does not support this feature, which is unfortunate because it's better to use InnoDB in almost all cases.
You can't emulate this behavior using triggers (or any SQL statements limited to transaction scope) without locking tables on INSERT. Consider this sequence of actions:
Mario starts transaction and inserts a new row for user 4.
Bill starts transaction and inserts a new row for user 4.
Mario's session fires a trigger to compute MAX(id)+1 for user 4. He gets 3.
Bill's session fires a trigger to compute MAX(id)+1 for user 4. He also gets 3.
Bill's session finishes his INSERT and commits.
Mario's session tries to finish his INSERT, but the row with (userid=4, id=3) now exists, so Mario gets a primary key conflict.
In general, you can't control the order of execution of these steps without some kind of synchronization.
The solutions to this are either:
Get an exclusive table lock. Before trying an INSERT, lock the table. This is necessary to prevent concurrent INSERTs from creating a race condition like in the example above. It's necessary to lock the whole table, since you're trying to restrict INSERT there's no specific row to lock (if you were trying to govern access to a given row with UPDATE, you could lock just the specific row). But locking the table causes access to the table to become serial, which limits your throughput.
Do it outside transaction scope. Generate the id number in a way that won't be hidden from two concurrent transactions. By the way, this is what AUTO_INCREMENT does. Two concurrent sessions will each get a unique id value, regardless of their order of execution or order of commit. But tracking the last generated id per userid requires access to the database, or a duplicate data store. For example, a memcached key per userid, which can be incremented atomically.
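The second option can be illustrated with a tiny in-process counter. This is only a sketch: the PerUserCounter class is made up, and a threading.Lock in one Python process stands in for an atomic increment in a shared store such as memcached. The point is that the id is handed out outside any database transaction, so concurrent callers never see the same value.

```python
import threading
from collections import defaultdict

class PerUserCounter:
    """Hands out 1, 2, 3, ... independently for each uid, safely across threads."""

    def __init__(self):
        self._lock = threading.Lock()
        self._last = defaultdict(int)

    def next_id(self, uid):
        # Read-increment-return happens under the lock, so it is atomic.
        with self._lock:
            self._last[uid] += 1
            return self._last[uid]

counter = PerUserCounter()
allocated = []
allocated_lock = threading.Lock()

def worker():
    # Each "session" requests 100 ids for user 4.
    for _ in range(100):
        value = counter.next_id(4)
        with allocated_lock:
            allocated.append(value)

threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Every value is unique, but if a caller takes an id and then fails to insert, that number is simply lost.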
It's relatively easy to ensure that inserts get unique values. But it's hard to ensure they will get consecutive ordinal values. Also consider:
What happens if you INSERT in a transaction but then roll back? You've allocated id value 3 in that transaction, and then I allocated value 4, so if you roll back and I commit, now there's a gap.
What happens if an INSERT fails because of other constraints on the table (e.g. another column is NOT NULL)? You could get gaps this way too.
If you ever DELETE a row, do you need to renumber all the following rows for the same userid? What does that do to your memcached entries if you use that solution?
SQL Server should allow you to do this. If you can't implement this using a computed column (probably not - there are some restrictions), surely you can implement it in a trigger.
MySQL also would allow you to implement this via triggers.
In a comment you ask the question about efficiency. Unless you are dealing with extreme volumes, storing an 8 byte DATETIME isn't much of an overhead compared to using, for example, a 4 byte INT.
It also massively simplifies your data inserts, as well as being able to cope with records being deleted without creating 'holes' in your sequence.
If you DO need this, be careful with the field names. If you have uid and id in a table, I'd expect id to be unique in that table, and uid to refer to something else. Perhaps, instead, use the field names property_id and amendment_id.
In terms of implementation, there are generally two options.
1). A trigger
Implementations vary, but the logic remains the same. As you don't specify an RDBMS (other than NOT MS/Oracle) the general logic is simple...
Start a transaction (often this is Implicitly already started inside triggers)
Find the MAX(amendment_id) for the property_id being inserted
Update the newly inserted value with MAX(amendment_id) + 1
Commit the transaction
Things to be aware of are...
- multiple records being inserted at the same time
- records being inserted with amendment_id being already populated
- updates altering existing records
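The trigger logic above might look like this in SQLite, driven from Python. Treat it as a sketch: the amendments table and trigger name are invented, the syntax differs in other engines, and SQLite allows only one writer at a time, so the concurrency problems discussed elsewhere in this thread do not show up here. The WHEN clause covers the case of rows arriving with amendment_id already populated; the unique index rejects duplicates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE amendments (
        property_id  INTEGER NOT NULL,
        amendment_id INTEGER,          -- filled in by the trigger when left NULL
        note         TEXT
    );
    CREATE UNIQUE INDEX amendments_key ON amendments (property_id, amendment_id);

    CREATE TRIGGER amendments_number AFTER INSERT ON amendments
    WHEN NEW.amendment_id IS NULL
    BEGIN
        UPDATE amendments
        SET amendment_id = (
            SELECT COALESCE(MAX(amendment_id), 0) + 1
            FROM amendments
            WHERE property_id = NEW.property_id
        )
        WHERE rowid = NEW.rowid;
    END;
""")

# The trigger fires once per inserted row, so a batch is numbered row by row.
conn.executemany(
    "INSERT INTO amendments (property_id, note) VALUES (?, ?)",
    [(10, "a"), (4, "b"), (4, "c"), (4, "d"), (10, "e")],
)
numbered = conn.execute(
    "SELECT property_id, amendment_id, note FROM amendments ORDER BY rowid"
).fetchall()
```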
2). A Stored Procedure
If you use a stored procedure to control writes to the table, you gain a lot more control.
Implicitly, you know you're only dealing with one record.
You simply don't provide a parameter for DEFAULT fields.
You know what updates / deletes can and can't happen.
You can implement all the business logic you like without hidden triggers
I personally recommend the Stored Procedure route, but triggers do work.
It is important to get your data types right.
What you are describing is a multi-part key. So use a multi-part key. Don't try to encode everything into a magic integer, you will poison the rest of your code.
If a record is identified by (entity_id,version_number) then embrace that description and use it directly instead of mangling the meaning of your keys. You will have to write queries which constrain the version number but that's OK. Databases are good at this sort of thing.
version_number could be a timestamp, as a_horse_with_no_name suggests. This is quite a good idea. There is no meaningful performance disadvantage to using timestamps instead of plain integers. What you gain is meaning, which is more important.
You could maintain a "latest version" table which contains, for each entity_id, only the record with the most-recent version_number. This will be more work for you, so only do it if you really need the performance.
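A minimal sketch of the multi-part key, using SQLite through Python: the entity_versions table, its body column and the sample rows are invented, and version_number is stored as an ISO-8601 timestamp string so that text ordering matches time ordering. The query shows the "constrain the version number" step for fetching the latest version of each entity without a separate "latest version" table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entity_versions (
        entity_id      INTEGER NOT NULL,
        version_number TEXT    NOT NULL,   -- ISO-8601 timestamp
        body           TEXT    NOT NULL,
        PRIMARY KEY (entity_id, version_number)
    );
""")
conn.executemany(
    "INSERT INTO entity_versions VALUES (?, ?, ?)",
    [
        (1, "2024-01-01T00:00:00", "first draft"),
        (1, "2024-02-01T00:00:00", "second draft"),
        (2, "2024-01-15T00:00:00", "only version"),
    ],
)

# Latest version per entity: join each row against its entity's MAX(version_number).
latest = conn.execute("""
    SELECT v.entity_id, v.version_number, v.body
    FROM entity_versions AS v
    JOIN (
        SELECT entity_id, MAX(version_number) AS version_number
        FROM entity_versions
        GROUP BY entity_id
    ) AS m
      ON m.entity_id = v.entity_id
     AND m.version_number = v.version_number
    ORDER BY v.entity_id
""").fetchall()
```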