Capture log of fetched table rows - sql

I am using DB2 (if you have a solution for another database, I am still interested), and am trying to identify every row that is fetched from a specific table. The solution needs to be at the database level, because I do not have access to the actual SELECT statements that cause the fetch. At a minimum, I would like to capture one or more column values into a log/table for every row that is fetched from a specific table.
Here's an example:
Table1 structure
CustNo (primary key)
CustName
Table1 contents (two rows)
12345, Joe's Crab Shack
98765, Morton's The Steakhouse
Process
1) Before select, log file is empty
2) Execute: SELECT CustName from Table1 where CustNo=12345
3) After select, log file contains:
LogFile1
---------
12345
4) Execute: SELECT * from Table1
5) After select, log file contains:
LogFile1
---------
12345
12345
98765
Thank you for any advice/recommendations....

If you're willing to call a SP to log this info, you might simply add a *READ trigger. It's rarely a good idea to have some function run whenever any record is read from a file, but a *READ trigger is probably the most efficient way to do it.
ADDPFTRG FILE(X) TRGTIME(*AFTER) TRGEVENT(*READ) PGM(Y)
Use that form of the command to add your "read-only" trigger program (Y) to a file (X). Program Y should do something fast, like pushing the relevant data items onto a data queue. Then have multiple batch instances of a program that pulls entries off the queue and writes them to a log file. You really don't want a read-only trigger doing any more work than necessary, and database I/O should be off the list.
Expect performance to suffer some.

You can capture the operations performed on the database via db2audit, but not the values used; exposing those values in an audit log could compromise sensitive data.
http://pic.dhe.ibm.com/infocenter/db2luw/v10r5/topic/com.ibm.db2.luw.admin.cmd.doc/doc/r0002072.html
Actually, if you log the ID of every row read from a given table, you are effectively copying the ID column into the log table over and over. It also gives you no ordering context, because the order in which the log rows are inserted is not the order in which the source rows are stored or retrieved.
You have to rethink your logging strategy, because just inserting the 'fetched' ID is not enough. You also have to record some context, such as who (user), when (timestamp), and from where (machine), in order to make that data usable.
Another option is to wrap the SELECT in a stored procedure that inserts the ID values into the log table before opening the cursor it returns to the caller.
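For example, here is a minimal DB2 SQL PL sketch of that idea (the procedure name, log table, and context columns are assumptions built on the Table1 example above):
-- Hypothetical wrapper: log the fetched key plus some context, then
-- return the open cursor to the caller.
CREATE OR REPLACE PROCEDURE GET_CUST (IN p_custno INTEGER)
  DYNAMIC RESULT SETS 1
  LANGUAGE SQL
BEGIN
  DECLARE c1 CURSOR WITH RETURN TO CALLER FOR
    SELECT CustName FROM Table1 WHERE CustNo = p_custno;
  -- Who read what, and when (the extra context suggested above).
  INSERT INTO ReadLog (CustNo, ReadBy, ReadAt)
    VALUES (p_custno, SESSION_USER, CURRENT TIMESTAMP);
  OPEN c1;
END
This only helps, of course, if callers can be made to go through the procedure instead of querying Table1 directly.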

Related

Get exact ID within thousands of concurrent transactions

I have a very crowded table with thousands of log rows. Every time a new log is inserted into the database, I need to update some tables using the ID of the new log. So I get the last ID using these two lines of code:
objcon.execute "insert into logs (member,refer) values (12,12345)"
objcon.execute "select top 1 id from logs order by id desc"
I am afraid the second line might get a different ID from a more recent insert, because thousands of new logs arrive every second.
This is a sample scenario, and I know there are built-in methods to get the ID of the most recently inserted row. But my exact question is whether there is a logical order of transactions on the server (both IIS and SQL Server), or whether it is possible that a new transaction finishes before an old one, so that the second line gets the ID of another log?
It is definitely possible that your second query will get an id from another transaction. I strongly suggest you use SCOPE_IDENTITY(). Methods like this are provided by the DBMS for exactly this scenario, where you insert a row and then select the last row from that table, but between the two operations other connections might have inserted new rows.
Yes. Concurrent transactions can cause problems with what you are trying to do.
The right solution is the output clause. The code looks like this:
declare @ids table (id int);
insert into logs (member, refer)
output inserted.id into @ids
values (12, 12345);
select *
from @ids;
You can find multiple discussions on the web about why OUTPUT is better. Here are some reasons:
You can return multiple columns, not just the identity.
It is session- and table-safe.
It handles multiple rows.
It is the same syntax for SELECT, UPDATE, and DELETE.
If you don't put a WHERE clause on the SELECT, you would need to execute both queries in a single transaction under the SNAPSHOT isolation level before committing the changes. That way the SELECT sees the current transaction's own insert, but not rows committed by other transactions after it began.
It would be better to use SCOPE_IDENTITY() to return the last identity value generated in the current scope of the current connection. This differs from @@IDENTITY in that the value is not affected by triggers that might also generate identity values.
objcon.execute "insert into logs (member,refer) values (12,12345)"
objcon.execute "select SCOPE_IDENTITY() AS id;"

Which Job inserted this record / row into the table?

I have dozens of different SQL Jobs calling different Sprocs, which insert rows into a common table.
Is there any way, given a row in the table, to retrieve the job which triggered the insert?
Input: Row ID, TableName, DBName
Output: Job ID which inserted Row
Not in general, as far as I'm aware. You could have the insert query include that data itself. Or you could recover it from a log, perhaps matching on the primary key or another unique key, if your inserts are unique. You might be able to turn on some SQL Server equivalent of the general log, but that is devastating to performance at high volume, and you would still have to pull the information out of a log file. I recommend considering whether you can diagnose your components from their own logs in addition to their effects in the database.
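One hedged sketch of the first suggestion, stamping each row with its originating job (the table, column, and parameter names here are all assumptions):
-- Add a column to the common table for the originating job.
ALTER TABLE CommonTable ADD InsertedByJob sysname NULL;

-- Each sproc takes the job name as a parameter and stamps its rows;
-- the job step passes its own name when calling the sproc.
CREATE PROCEDURE InsertCommonRow
    @someValue int,
    @jobName sysname
AS
INSERT INTO CommonTable (SomeColumn, InsertedByJob)
VALUES (@someValue, @jobName);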

Reverting a database insertion with log files?

I am working on a program that is supposed to insert hundreds of rows into the database per run.
The problem is: if the inserted data turns out to be wrong, how can we recover from that run? Currently I only have a log file (in a format I created) that records the raw data that gets inserted (no metadata, no primary keys). Is there a way to create a log that the database can understand, so that when we want to undo the insertion we can feed the database that log file?
Or, if there is an alternative mechanism for undoing an operation from a program, kindly let me know. Thanks.
The fact that this is only hundreds of rows makes it susceptible to the great-grandmother of all undo mechanisms:
have a table importruns with a row for each run you do. I assume it has an integer auto-increment PK
add a field to your data table that carries the PK of the import run
for insert-only runs, you then just need to DELETE FROM sometable WHERE importid=$whatever (sketched below)
If you also have replace/update imports, go one step further:
for each data table, have a corresponding table with one extra field: superseededby
for each row you update/replace, place an original copy of the row in this table, with the import id in superseededby
to revert, you additionally run INSERT INTO originaltable SELECT * FROM superseededtable WHERE superseededby=$whatever (after removing the superseding rows)
You can clean up superseededtable for known-good imports, to make sure storage doesn't grow without bound.
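A minimal SQL sketch of the insert-only variant (MySQL-flavored; all names are assumptions):
-- One row per import run.
CREATE TABLE importruns (
  id         INT AUTO_INCREMENT PRIMARY KEY,
  started_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Tag every imported row with the run that produced it.
ALTER TABLE sometable ADD COLUMN importid INT;

-- Reverting a bad run is then a single statement:
DELETE FROM sometable WHERE importid = 42;  -- 42 = id of the bad run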
You have several options, depending on when you notice the error.
If you know the data is wrong while the run is still in progress, you can use your database's transaction API and roll back the changes of the current transaction.
If you only discover the error later, you can create your own log: add an identifier for each run/transaction, and a field on the relevant table where that id is inserted. That lets you identify exactly which transaction each row came from, and you can write a stored procedure that deletes rows for a given transaction id.
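For the rollback option, a minimal sketch (the table and column names are hypothetical):
START TRANSACTION;
INSERT INTO import_data (raw_value) VALUES ('row 1');
INSERT INTO import_data (raw_value) VALUES ('row 2');
-- If validation fails before the commit, the whole run disappears:
ROLLBACK;
-- otherwise: COMMIT;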

Finding changed records in a database table

I have a problem that I haven't been able to come up with a solution for yet. I have a database (actually thousands of them at customer sites) that I want to extract data from periodically. I'd like to do a full data extract one time (select * from table) then after that only get rows that have changed.
The challenge is that there aren't any updated-date columns in most of the tables that could be used to constrain the SQL query. I can't use a trigger-based approach, nor change the application that writes to the database, since another group develops the app and they are already way backed up.
I may be able to write to the database tables when doing the data extract, but would prefer not to do that. Does anyone have any ideas for how we might be able to do this?
You will have to programmatically mark the records. I see suggestions of an auto-incrementing field, but that will only catch newly inserted records. How will you track updated or deleted records?
If you only want newly inserted records, then an auto-incrementing field will do the job: in subsequent data dumps, grab everything since the last value of the auto-increment field, then record the current value.
If you want updates, the minimum I can see is a last_update field, and probably a trigger to populate it. If the last_update is later than the last data dump, grab that record. This will get inserts and updates, but not deletes.
You could try something like an 'instead of delete' trigger, if your RDBMS supports it, that NULLs the last_update field rather than deleting. On subsequent data dumps, grab all records where this field is NULL and then delete them. But there would be problems with this (e.g. how to stop the app from seeing them between the logical and the physical delete).
The most foolproof method I can see is a set of history (audit) tables to which each change gets written. Then you select your data dump from there.
By the way, do you only care about knowing that updates have happened? What if two (or more) updates have happened to the same record? The history tables are the only way I can see of capturing that scenario.
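A hedged T-SQL sketch of the history-table idea (all names are assumptions, and the source table is assumed to have an integer key Id):
-- One history row per change: what kind of change, when, and to which row.
CREATE TABLE SourceTableHistory (
    HistId     INT IDENTITY(1,1) PRIMARY KEY,
    ChangeType CHAR(1)  NOT NULL,  -- 'I'nsert, 'U'pdate or 'D'elete
    ChangedAt  DATETIME NOT NULL DEFAULT GETDATE(),
    Id         INT      NOT NULL   -- key of the affected row
);
GO
CREATE TRIGGER trg_SourceTable_Audit
ON SourceTable AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;
    -- Rows only in inserted are inserts, rows only in deleted are deletes,
    -- and rows in both sets are updates.
    INSERT INTO SourceTableHistory (ChangeType, Id)
    SELECT CASE WHEN d.Id IS NULL THEN 'I'
                WHEN i.Id IS NULL THEN 'D'
                ELSE 'U' END,
           COALESCE(i.Id, d.Id)
    FROM inserted i
    FULL OUTER JOIN deleted d ON i.Id = d.Id;
END
(The original poster ruled out triggers, but this is what the history tables would look like if that constraint can be lifted.)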
This should isolate rows that have changed since your last extract. It assumes DestinationTable is a copy of SourceTable, including the key fields; if not, you could list out the important fields.
SELECT * FROM SourceTable
EXCEPT
SELECT * FROM DestinationTable

collecting mysql statistics

What would be the easiest way to count the new records that are inserted into a database? Is it possible to include a count query in with the load query?
Or is something more complex needed, such as recording the existing last record and counting everything added after it?
edit:
I have a cron job that uses LOAD DATA INFILE in a script passed directly to mysql. This data is used by a PHP web application. As part of that application, I need to generate weekly reports, including how many records were inserted in the last week.
I am unable to patch MySQL or drastically change the database schema/structure, but I am able to add new tables or fields. I would prefer not to count records from the CSV file and store the result in a text file or something; instead, I would prefer to do everything from within PHP with queries.
Assuming you're using MySQL 5 or greater, you could create a trigger that fires upon inserting into a specific table. Note that an INSERT trigger also fires for the LOAD DATA command.
Using a trigger requires you to persist the count information in a separate table. Basically, you'd create a new table with a single row and column to hold the count, and the trigger would increment that value as rows are loaded.
Here's the MySQL manual page on triggers, the syntax is fairly straight forward. http://dev.mysql.com/doc/refman/5.0/en/create-trigger.html
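A minimal sketch of that setup (the table names are assumptions):
-- Single-row counter table.
CREATE TABLE insert_counter (n INT NOT NULL);
INSERT INTO insert_counter VALUES (0);

-- Fires once per row, including rows added by LOAD DATA INFILE.
CREATE TRIGGER count_loads AFTER INSERT ON your_table
FOR EACH ROW
  UPDATE insert_counter SET n = n + 1;
The weekly report then just reads the counter (and can reset it, if you prefer a per-week count).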
edit
Alternatively, if you don't want to persist the count within the database, you could perform your load operations within a stored procedure. This would allow you to do a SELECT COUNT(*) on the table before the load begins and after it completes; subtracting the two values tells you how many rows were inserted. (Note, though, that MySQL does not permit LOAD DATA INFILE inside stored routines, so in practice you may have to run the two counts around the load from your script instead.)
Here's the MySQL manual page on procedures.
http://dev.mysql.com/doc/refman/5.0/en/create-procedure.html
That probably depends on what counts as new. Is it entries added to the database in the last five or ten minutes, or any record past a certain auto-increment ID?
If you are looking at a time-based method of determining what's new, you can have a field (probably of type DATETIME) that records when each record was inserted; to get the number, you simply do a...
select count(*) from table where currentTime > 'time-you-consider-to-be-new'
If you don't want to record times, you can use an auto-increment key and simply keep track of the last inserted ID, counting everything that comes after it in any given window. So if one hour ago the last ID was 10000, count all records with an ID greater than 10000, then record the new last ID and repeat whenever needed.
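For example (the table name and the remembered ID are assumptions):
select count(*) from your_table where id > 10000;  -- 10000 = last recorded ID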
If you are not looking at a specific table, you can use the following:
show global status like "Com_%";
This will show you statistics for every type of query. These numbers just keep on counting, so if you want to use them, record the initial number when starting to track the queries, and subtract this from your final number (but yea, that's a given).
If you are looking for pure statistics, I can recommend using Munin with the MySQL plugins.
Where do you load the data from? You might consider counting the records before you insert them into the database. If it's a SQL script, you could write a quick-and-dirty bash script (with grep or something similar) to count the rows.
You say you can't change the structure. Does that mean you can't change the table you are inserting into, or that you can't change the database at all? If you can add a table, just create one with two columns: a timestamp and the key of the table you are loading. Before you load your CSV file, create another CSV file with just those two columns, and load that CSV after your main one.
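A hedged sketch of that idea (the file paths and table names are assumptions):
-- Side table recording when each key was loaded.
CREATE TABLE load_log (
  loaded_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  row_key   INT
);

LOAD DATA INFILE '/tmp/main.csv' INTO TABLE main_table;
-- keys.csv holds just the key column; loaded_at falls back to its default.
LOAD DATA INFILE '/tmp/keys.csv' INTO TABLE load_log (row_key);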
This might be simpler than you want, but what about a Nagios monitor to track the row count? (Also consider asking around on serverfault.com; this stuff is totally up their alley.)
Perhaps you could write a small shell script that queries the database for the number of rows. You could then have a cron job that runs every minute/hour/day etc. and outputs the COUNT to a log file. Over time, you could review the log file and see the rate at which the database is growing. If you also put a date in the log file, you could review it more easily over longer periods.
See if this is the kind of MySQL data collection you're interested in: http://code.google.com/p/google-mysql-tools/wiki/UserTableMonitoring.
If that is the case, Google offers a MySQL patch (to apply to a clean mysql directory source) at http://google-mysql-tools.googlecode.com/svn/trunk/mysql-patches/all.v4-mysql-5.0.37.patch.gz. You can read more about the patch at http://code.google.com/p/google-mysql-tools/wiki/Mysql5Patches.
If this is not what you're looking for, I suggest you explain a little more so that we can help you better.
Could you use a trigger on the table that inserts into a table you created, one whose structure includes a timestamp?
You could then use a date calculation over the relevant period to find the information you need.
I don't know what version of MySQL you are using, but here is a link to the trigger-creation syntax for version 5.0: http://dev.mysql.com/doc/refman/5.0/en/create-trigger.html
Good luck,
Matt
Well, if you need exhaustive information about which rows were inserted, updated, or deleted, it might make sense to create an additional audit table that stores those events with a timestamp. You could populate it with triggers. I would also write a stored procedure that runs as a scheduled event and erases old entries (whatever you consider old).
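A hedged sketch of the cleanup part (MySQL 5.1+ events; the table name and retention window are assumptions, and the Event Scheduler must be enabled):
CREATE EVENT purge_audit_log
ON SCHEDULE EVERY 1 DAY
DO
  DELETE FROM audit_log WHERE changed_at < NOW() - INTERVAL 30 DAY;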
Refer to the link posted by Lima on how to create triggers in MySQL.
Refer to page 655 of "MySQL Cookbook" by Paul Dubois (2nd Edition) or page 158 of "SQL for smarties" by Joe Celko.
So will the 'load' only insert new data into the table, or rewrite the whole table?
If it only loads new data, then you can do a select count(*) from yourtable
once before the load and once after; the difference tells you how many new records were inserted.
If, on the other hand, it rewrites the whole table and you want to find the records that differ from the previous version, you would need a completely different approach.
Which one is it?
Your question is a bit ambiguous, but the MySQL C API provides a function, mysql_affected_rows, that you can call after each query to get the number of affected rows. For an INSERT it returns the number of rows inserted. Be aware that for UPDATEs it returns the number of rows changed, not the number of rows that matched the WHERE clause.
If you are performing a number of queries and need to know how many were inserted the most reliable way would probably be doing a count before and after the queries.
As noted in sobbayi's answer adding a "created at" timestamp to your tables would allow you to query for records created after (or before) a given time.
UPDATE:
OK here is what you need to do to get a count before and after:
create a table for the counts:
create table row_counts (ts timestamp not null, row_count integer not null);
in your script, add the following before and after your LOAD DATA INFILE statement:
insert into row_counts (ts,row_count) select now(),count(0) from YOUR_TABLE;
-- your LOAD DATA INFILE statement runs here
insert into row_counts (ts,row_count) select now(),count(0) from YOUR_TABLE;
the row_counts table will now have the count before and after your load.
show global status like 'Com_insert';
FLUSH STATUS followed by SHOW SESSION STATUS LIKE 'Com_insert' will give you the count for just the current connection. (Note that Com_insert counts INSERT statements, not rows, and LOAD DATA INFILE shows up under Com_load instead.)
see http://dev.mysql.com/doc/refman/5.1/en/server-status-variables.html#statvar_Com_xxx
Since you asked for the easiest way, I would suggest using a trigger on insert. You could use a single-column, single-row table as a counter and update it from the trigger.