How can I monitor an SQL Server database for changes to a table without using triggers or modifying the structure of the database in any way? My preferred programming environment is .NET and C#.
I'd like to be able to support any SQL Server 2000 SP4 or newer. My application is a bolt-on data visualization for another company's product. Our customer base is in the thousands, so I don't want to have to put in requirements that we modify the third-party vendor's table at every installation.
By "changes to a table" I mean changes to table data, not changes to table structure.
Ultimately, I would like the change to trigger an event in my application, instead of having to check for changes at an interval.
The best course of action given my requirements (no triggers or schema modification, SQL Server 2000 and 2005) seems to be to use the BINARY_CHECKSUM function in T-SQL. The way I plan to implement it is this:
Every X seconds run the following query:
SELECT CHECKSUM_AGG(BINARY_CHECKSUM(*))
FROM sample_table
WITH (NOLOCK);
And compare that against the stored value. If the value has changed, go through the table row by row using the query:
SELECT row_id, BINARY_CHECKSUM(*)
FROM sample_table
WITH (NOLOCK);
And compare the returned checksums against stored values.
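The row-by-row comparison step could look roughly like the sketch below. It is not SQL Server code: it uses Python with the standard-library sqlite3 module as a stand-in database and SHA-256 in place of BINARY_CHECKSUM, and the helper names row_digests and diff_digests are made up for illustration.

```python
import hashlib
import sqlite3


def row_digests(conn, table, key_column):
    """Return {key: digest} for every row; the digest covers all columns."""
    digests = {}
    # Table and column names are interpolated, so they must come from trusted code.
    query = f"SELECT {key_column}, * FROM {table}"
    for row in conn.execute(query):
        key, values = row[0], row[1:]
        digests[key] = hashlib.sha256(repr(values).encode()).hexdigest()
    return digests


def diff_digests(old, new):
    """Compare two digest maps and return (inserted, updated, deleted) keys."""
    inserted = sorted(new.keys() - old.keys())
    deleted = sorted(old.keys() - new.keys())
    updated = sorted(k for k in new.keys() & old.keys() if new[k] != old[k])
    return inserted, updated, deleted
```

The application would keep the previous digest map in memory (or on disk) and replace it with the new one after each comparison.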
Take a look at the CHECKSUM command:
SELECT CHECKSUM_AGG(BINARY_CHECKSUM(*)) FROM sample_table WITH (NOLOCK);
That will return the same number each time it's run as long as the table contents haven't changed. See my post on this for more information:
CHECKSUM
Here's how I used it to rebuild cache dependencies when tables changed:
ASP.NET 1.1 database cache dependency (without triggers)
Unfortunately CHECKSUM does not always work properly to detect changes.
It is only a primitive checksum, not a cyclic redundancy check (CRC) calculation.
Therefore you can't use it to detect all changes; e.g. symmetrical changes result in the same CHECKSUM!
E.g. the solution with CHECKSUM_AGG(BINARY_CHECKSUM(*)) will always deliver 0 for all 3 of the following tables, even though their content differs:
SELECT CHECKSUM_AGG(BINARY_CHECKSUM(*)) FROM
(
SELECT 1 as numA, 1 as numB
UNION ALL
SELECT 1 as numA, 1 as numB
) q
-- delivers 0!
SELECT CHECKSUM_AGG(BINARY_CHECKSUM(*)) FROM
(
SELECT 1 as numA, 2 as numB
UNION ALL
SELECT 1 as numA, 2 as numB
) q
-- delivers 0!
SELECT CHECKSUM_AGG(BINARY_CHECKSUM(*)) FROM
(
SELECT 0 as numA, 0 as numB
UNION ALL
SELECT 0 as numA, 0 as numB
) q
-- delivers 0!
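The zeros above are what you would expect if the aggregate combines the row checksums with XOR, because identical rows then cancel in pairs. That XOR behaviour is an assumption here, not something taken from the SQL Server documentation. A small Python sketch of the effect, with CRC-32 standing in for BINARY_CHECKSUM and an order-insensitive SHA-256 digest as one possible alternative (all function names are illustrative):

```python
import hashlib
import zlib


def row_checksum(row):
    """32-bit checksum of one row (CRC-32 here, standing in for BINARY_CHECKSUM)."""
    return zlib.crc32(repr(row).encode())


def xor_aggregate(rows):
    """XOR all row checksums together; identical rows cancel in pairs."""
    total = 0
    for row in rows:
        total ^= row_checksum(row)
    return total


def table_digest(rows):
    """Order-insensitive SHA-256 over the sorted row representations."""
    h = hashlib.sha256()
    for text in sorted(repr(row) for row in rows):
        h.update(text.encode())
        h.update(b"\n")
    return h.hexdigest()
```

With the three two-row tables from the example, xor_aggregate gives 0 each time, while table_digest gives three different values.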
Why don't you want to use triggers? They are a good thing if you use them correctly. If you use them as a way to enforce referential integrity that is when they go from good to bad. But if you use them for monitoring, they are not really considered taboo.
How often do you need to check for changes and how large (in terms of row size) are the tables in the database? If you use the CHECKSUM_AGG(BINARY_CHECKSUM(*)) method suggested by John, it will scan every row of the specified table. The NOLOCK hint helps, but on a large database, you are still hitting every row. You will also need to store the checksum for every row so that you can tell which one has changed.
Have you considered going at this from a different angle? If you do not want to modify the schema to add triggers (which makes sense, it's not your database), have you considered working with the application vendor that does make the database?
They could implement an API that provides a mechanism for notifying accessory apps that data has changed. It could be as simple as writing to a notification table that lists what table and which row were modified. That could be implemented through triggers or application code. From your side, it wouldn't matter; your only concern would be scanning the notification table on a periodic basis. The performance hit on the database would be far less than scanning every row for changes.
The hard part would be convincing the application vendor to implement this feature. Since this can be handled entirely through SQL via triggers, you could do the bulk of the work for them by writing and testing the triggers and then bringing the code to the application vendor. Having the vendor support the triggers prevents the situation where a trigger you add inadvertently replaces a trigger supplied by the vendor.
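To make the notification-table idea concrete, here is a minimal sketch in Python using the standard-library sqlite3 module as a stand-in for the vendor's database. The orders table, the change_log table, the trigger names and the poll_changes helper are all hypothetical; on SQL Server the triggers would be written in T-SQL instead.

```python
import sqlite3

SCHEMA = """
CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT);
CREATE TABLE change_log (
    change_id INTEGER PRIMARY KEY AUTOINCREMENT,
    table_name TEXT NOT NULL,
    row_id INTEGER NOT NULL,
    action TEXT NOT NULL
);
CREATE TRIGGER orders_ins AFTER INSERT ON orders BEGIN
    INSERT INTO change_log (table_name, row_id, action) VALUES ('orders', NEW.id, 'insert');
END;
CREATE TRIGGER orders_upd AFTER UPDATE ON orders BEGIN
    INSERT INTO change_log (table_name, row_id, action) VALUES ('orders', NEW.id, 'update');
END;
CREATE TRIGGER orders_del AFTER DELETE ON orders BEGIN
    INSERT INTO change_log (table_name, row_id, action) VALUES ('orders', OLD.id, 'delete');
END;
"""


def open_demo_db():
    """Create an in-memory database with the monitored table and its triggers."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def poll_changes(conn, last_seen):
    """Return (new_last_seen, changes) for log entries after last_seen."""
    rows = conn.execute(
        "SELECT change_id, table_name, row_id, action FROM change_log "
        "WHERE change_id > ? ORDER BY change_id",
        (last_seen,),
    ).fetchall()
    new_last = rows[-1][0] if rows else last_seen
    return new_last, [r[1:] for r in rows]
```

The accessory app only remembers the last change_id it has seen, so each poll reads a handful of log rows instead of scanning the monitored table.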
Unfortunately, I do not think that there is a clean way to do this in SQL Server 2000. If you narrow your requirements to SQL Server 2005 (and later), then you are in business. You can use the SqlDependency class in System.Data.SqlClient. See Query Notifications in SQL Server (ADO.NET).
Have a DTS job (or a job that is started by a windows service) that runs at a given interval. Each time it is run, it gets information about the given table by using the system INFORMATION_SCHEMA tables, and records this data in the data repository. Compare the data returned regarding the structure of the table with the data returned the previous time. If it is different, then you know that the structure has changed.
Example query to return information regarding all of the columns in table ABC (ideally listing out just the columns from the INFORMATION_SCHEMA table that you want, instead of using select * like I do here):
select * from INFORMATION_SCHEMA.COLUMNS where TABLE_NAME = 'ABC'
You would monitor different columns and INFORMATION_SCHEMA views depending on how exactly you define "changes to a table".
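A rough sketch of the snapshot-and-compare step, written in Python against SQLite because it ships with the standard library. PRAGMA table_info plays the role of INFORMATION_SCHEMA.COLUMNS here, and the helper names column_snapshot and diff_columns are invented for the example.

```python
import sqlite3


def column_snapshot(conn, table):
    """Return {column_name: declared_type} for one table."""
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    return {row[1]: row[2] for row in conn.execute(f"PRAGMA table_info({table})")}


def diff_columns(previous, current):
    """Compare two column snapshots of the same table."""
    added = sorted(current.keys() - previous.keys())
    removed = sorted(previous.keys() - current.keys())
    retyped = sorted(
        name for name in current.keys() & previous.keys()
        if current[name] != previous[name]
    )
    return {"added": added, "removed": removed, "retyped": retyped}
```

The scheduled job would store the previous snapshot and report whenever any of the three lists is non-empty.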
Wild guess here: If you don't want to modify the third party's tables, Can you create a view and then put a trigger on that view?
Check the last commit date. Every database has a history of when each commit is made. I believe it's a standard of ACID compliance.
We face the following situation (Teradata):
Business layer frequently executes long-running queries on Table X_Past UNION ALL Table X_Today.
Table X_Today gets updated frequently, say once every 10 minutes. X_Past only once after midnight (per full-load).
Writing process should not block reading process.
Writing should happen as soon as new data is available.
Proposed approach:
2 "Today" and a "past" table, plus a UNION ALL view that selects from one of them based on the value in a load status table.
X_Today_1
X_Today_0
X_Past
the loading process will load into X_Today_1 and set the active_table value in the load status table to "X_Today_1"
next time it will load X_Today_0 and set the active_table value to "X_Today_0"
etc.
The view that is used to select on the table will be built as follows:
select *
from X_PAST
UNION ALL
select td1.*
from X_Today_1 td1
, ( select active_table from LOAD_STATUS ) active_tab1
where active_tab1.active_table = 'X_Today_1'
UNION ALL
select td0.*
from X_Today_0 td0
, ( select active_table from LOAD_STATUS ) active_tab0
where active_tab0.active_table = 'X_Today_0'
my main questions:
when executing the select, will there be a lock on ALL tables, or only on those that are actually accessed for data? Because of the where clause, data from one of the Today_1/0 tables will always be ignored, and that table should be available for loading;
do we need any form of locking, or is the default locking mechanism what we want (which I suspect it is)?
will this work, or am I overlooking something?
It is important that the loading process will wait in case the reading process takes longer than 20 minutes and the loader is about to refresh the second table again. The reading process should never really be blocked, except maybe by itself.
Any input is much appreciated...
thank you for your help.
A few comments to your questions:
Depending on the query structure, the Optimizer will try to get the default locks (in this case a READ lock) at different levels -- most likely table or row-hash locks. For example, if you do a SELECT * FROM my_table WHERE PI_column = 'value', you should get a row-hash lock and not a table lock.
Try running an EXPLAIN on your SELECT and see if it gives you any locking info. The Optimizer might be smart enough to determine there are 0 rows in one of the joined tables and reduce the lock requests. If it still locks both tables, see the end of this post for an alternative approach.
Your query written as-is will result in READ locks, which would block any WRITE requests on the tables. If you are worried about locking issues / concurrency, have you thought about using an explicit ACCESS lock? This would allow your SELECT to run without ever having to wait for your write queries to complete. This is called a "dirty read", since there could be other requests still modifying the tables while they are being read, so it may or may not be appropriate depending on your requirements.
Your approach seems feasible. You could also do something similar, but instead of having two UNIONs, have a single "X_Today" view that points to the "active" table. After your load process completes, you could re-point the view to the appropriate table as needed via a MACRO call:
-- macros (switch between active / loading)
REPLACE MACRO switch_to_today_table_0 AS (
REPLACE VIEW X_Today AS SELECT * FROM X_Today_0;
);
REPLACE MACRO switch_to_today_table_1 AS (
REPLACE VIEW X_Today AS SELECT * FROM X_Today_1;
);
-- SELECT query
SELECT * FROM X_PAST UNION ALL SELECT * FROM X_Today;
-- Write request
MERGE INTO x_today_0...;
-- Switch active "today" table to the most recently loaded one
EXEC switch_to_today_table_0;
You'd have to manage which table to write to (or possibly do that using a view too) and which "switch" macro to call within your application.
One thing to think about is that having two physical tables that logically represent the same table (i.e. should have the same data) may potentially allow for situations where one table is missing data and needs to be manually synced.
Also, if you haven't looked at them already, a few ideas to optimize your SELECT queries to run faster: row partitioning, indexes, compression, statistics, primary index selection.
I want a table to be sync-able by a web API.
For example,
GET /projects?sequence_latest=2113&limit=10
[{"state":"updated", "id":12,"sequence":2116},
{"state":"deleted", "id":511,"sequence":2115},
{"state":"created", "id":601,"sequence":2114}]
What is a good schema to achieve this?
I intend this for Postgresql with Django ORM, which uses surrogate keys. Presence of an ORM may kill answers like unions.
I can come up with only half-solutions.
I could have a modified_time column, but we cannot convey deletions.
I could have a table for storing deleted IDs, when returning 10 new/updated rows, I could return all the deleted rows between them. But this works only when the latest change is an insert/update and there are a moderate number of deleted rows.
I could set a deleted flag on the row and null the rest, but it's kinda bad schema design to make all columns nullable.
I could have another table that stores ID, modification sequence number and state (new, updated, deleted), but it's another table to maintain, and setting sequence numbers causes contention; imagine n concurrent requests querying for the latest ID.
If you're using an ORM you want simple(ish) and if you're serving the data via an API you want quick.
To go through your suggested options:
Correct, so this doesn't help you. You could have a deleted flag in your main table though.
This seems quite a random way of doing it and breaks your insistence that there be no UNION queries.
Not sure why you would need to NULL the rest of the columns here? What benefit does this bring?
I would strongly advise against having a table that has a modification sequence number. Either this means that you're performing a lot of analytic queries in order to find out the most recent state or you're updating the same rows multiple times and maintaining a table with the same PK as your normal one. At that point you might as well have a deleted flag in your main table.
Essentially the design of your API gives you one easy option; you should have everything in the same table because all data is being returned through the same method. I would follow your point 2 and Wolph's suggestion, have a deleted_on column in your table; making it look like:
create table my_table (
id ... primary key
, <other_columns>
, created_on date
, modified_on date
, deleted_on date
);
I wouldn't even bother updating all the other columns to be NULL. If you want to ensure that you return no data create a view on top of your table that nulls data where the deleted_on column has data in it. Then, your API only accesses the table through the view.
If you are really, really worried about space and the volume of records and will perform regular database maintenance to ensure that both are controlled then maybe go with option 4. Create a second table that has the state of each ID in your main table and actually delete the data from your main table. You then can do a LEFT OUTER JOIN to the main table to get the data. When there is no data that ID has been deleted. Honestly, this is overkill until you know whether you will definitely require it.
You don't mention why you're using a web API for data transfers; but if you're going to be transferring a lot of data, or using this for internal systems only, it might be worth using a lower-level transfer mechanism.
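For what it's worth, a single table with created_on, modified_on and deleted_on is enough to answer a "what changed since X" request. The sketch below is only an illustration in plain Python with SQLite, not Django ORM code; it assumes the three columns hold monotonically increasing integers (for example epoch milliseconds), and the projects table and changes_since helper are hypothetical.

```python
import sqlite3


def create_schema(conn):
    """One table; deleted rows stay in place with deleted_on filled in."""
    conn.execute("""
        CREATE TABLE projects (
            id INTEGER PRIMARY KEY,
            name TEXT,
            created_on INTEGER NOT NULL,
            modified_on INTEGER NOT NULL,
            deleted_on INTEGER
        )
    """)


def changes_since(conn, since, limit=10):
    """Rows touched after `since`, oldest change first, with a derived state."""
    rows = conn.execute(
        """
        SELECT id, created_on, deleted_on,
               MAX(modified_on, COALESCE(deleted_on, 0)) AS changed_on
        FROM projects
        WHERE modified_on > :since OR deleted_on > :since
        ORDER BY changed_on, id
        LIMIT :limit
        """,
        {"since": since, "limit": limit},
    ).fetchall()
    result = []
    for row_id, created_on, deleted_on, changed_on in rows:
        if deleted_on is not None:
            state = "deleted"
        elif created_on > since:
            state = "created"
        else:
            state = "updated"
        result.append({"state": state, "id": row_id, "changed_on": changed_on})
    return result
```

The client would pass the largest changed_on it has received as the next `since` value.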
This is a design/algorithm question.
Here's the outline of my scenario:
I have a large table (say, 5 mil. rows) of data which I'll call Cars
Then I have an application, which performs a SELECT * on this Cars table, taking all the data and packaging it into a single data file (which is then uploaded somewhere.)
This data file generated by my application represents a snapshot, what the table looked like at an instant in time.
The table Cars, however, is updated sporadically by another process, regardless of whether the application is currently generating a package from the table or not. (There currently is no synchronization.)
My problem:
This table Cars is becoming too big to do a single SELECT * against. When my application retrieves all this data at once, it quickly overwhelms the memory capacity for my machine (let's say, 2GB.) Also, simply performing chained SELECTs with LIMIT or OFFSET fails the condition of synchronization: the table is frequently updated and I can't have the data change between SELECT calls.
What I'm looking for:
A way to pull the entirety of this table into an application whose memory capacity is smaller than the data, assuming the data size could approach infinity. Particularly, how do I achieve a pagination/segmented effect for my SQL selects? i.e. Make recurring calls with a page number to retrieve the next segment of data. The ideal solution allows for scalability in data size.
(For the sake of simplifying my scenario, we can assume that when given a segment of data, the application can process/write it then free up the memory used before requesting the next segment.)
Any suggestions you may be able to provide would be most helpful. Thanks!
EDIT: By request, my implementation uses C#.NET 4.0 & MSSQL 2008.
EDIT #2: This is not a SQL command question. This is design-pattern related question: what is the strategy to perform paginated SELECTs against a large table? (Especially when said table receives consistent updates.)
What database are you using? In MySQL, for example, the following would select 20 rows beginning from row 40, but this is a MySQL-only clause (edit: it seems Postgres also allows this)
select * from cars limit 20 offset 40
If you want a "snapshot" effect you have to copy the data into a holding table where it will not get updated. You can accomplish some nice things with various types of change-tracking, but that's not what you stated you wanted. If you need a snapshot of the exact table state, then take the snapshot, write it to a separate table, and use limit and offset (or whatever) to create pages.
And at 5 million rows, I think it is likely the design requirement that might need to be modified...if you have 2000 clients all taking 5 million-row snapshots you are going to start having some size issues if you don't watch out.
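A minimal sketch of the holding-table approach, in Python with SQLite as a stand-in (the question uses C# and MSSQL 2008, where the copy would be something like SELECT ... INTO and the paging syntax differs). The table names and both helper functions are assumptions for the example.

```python
import sqlite3


def take_snapshot(conn, source="cars", snapshot="cars_snapshot"):
    """Copy the source table into a holding table that no writer touches."""
    with conn:
        conn.execute(f"DROP TABLE IF EXISTS {snapshot}")
        conn.execute(f"CREATE TABLE {snapshot} AS SELECT * FROM {source}")


def snapshot_pages(conn, page_size, snapshot="cars_snapshot"):
    """Yield the snapshot page by page using LIMIT/OFFSET."""
    offset = 0
    while True:
        page = conn.execute(
            f"SELECT * FROM {snapshot} ORDER BY rowid LIMIT ? OFFSET ?",
            (page_size, offset),
        ).fetchall()
        if not page:
            return
        yield page
        offset += page_size
```

Only one page is held in memory at a time, and updates to the live table after take_snapshot do not affect the pages still to be read.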
You should provide details of the format of the resultant data file. Depending on the format this could be possible directly in your database, with no app code involved eg for mysql:
SELECT * INTO OUTFILE "c:/mydata.csv"
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY "\n"
FROM my_table;
For oracle there would be export, for sqlserver/sybase it would be BCP, etc.
Or alternatively achievable by streaming the data, without holding it all in memory, this would vary depending on the app language.
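As an illustration of the streaming option, here is a Python sketch that writes a query result to CSV in fixed-size batches, so only one batch is in memory at a time. It uses the standard-library sqlite3 and csv modules; export_to_csv is a made-up helper, and a C# version would use a data reader in much the same way.

```python
import csv
import sqlite3


def export_to_csv(conn, query, out_file, batch_size=1000):
    """Write query results to out_file in batches; returns the row count."""
    cursor = conn.execute(query)
    writer = csv.writer(out_file)
    # First line of the file: column names taken from the cursor metadata.
    writer.writerow([col[0] for col in cursor.description])
    total = 0
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        writer.writerows(batch)
        total += len(batch)
    return total
```

When writing to a real file, open it with newline="" as the csv module documentation recommends.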
In terms of paging, the easy option is to just use the limit clause (if mysql) or the equivalent in whatever rdbms you are using, but this is a last resort:
select * from myTable order by ID LIMIT 0,1000
select * from myTable order by ID LIMIT 1000,1000
select * from myTable order by ID LIMIT 2000,1000
...
This selects the data in 1000 row chunks.
Look at this post on using limit and offset to create paginated results from your sql query.
http://www.petefreitag.com/item/451.cfm
You would have to first:
SELECT * from Cars Limit 10
and then
SELECT * from Cars limit 10 offset 10
And so on. You will have to figure out the best pagination for this.
Need to query a database for 12 million rows, process this data and then insert the filtered data into another database.
I can't just do a SELECT * from the database for obvious reasons - far too much data would be returned for my program to handle, and also this is a live database (customer order details) and I can't have the database crawl to a halt for 10 minutes while it runs my query.
I'm looking for inspiration on how to write this program. I have to process each row. I was thinking it might be best to get a count on the rows. Then grab X at a time, wait for Y seconds, and repeat, until the dataset is complete. This way I'm not overloading the database, and since X will be sufficiently small, each batch will fit nicely in memory.
Other suggestions or feedback ?
I'd recommend you read the doc about SELECT...INTO OUTFILE and LOAD DATA INFILE.
These are very fast ways of dumping data to a flat file and then importing it to another database.
You could dump into the flat file, and then run an offline script to process your rows, and then once that's done import the result to the new database.
See also:
http://dev.mysql.com/doc/refman/5.1/en/select.html (search for "INTO OUTFILE")
http://dev.mysql.com/doc/refman/5.1/en/load-data.html
Spreading the load over time seems the only practicable solution. Exactly how to do it depends to some extent on your schema, how records change over time in the "live database", and what consistency semantics your processing must have.
In the worst case -- any record can be changed at any time, there is nothing in the schema that lets you easily and speedily check for "recently modified, inserted, or deleted records", and you nevertheless need to be consistent in what you process -- the task is simply unfeasible, unless you can count on some special support from your relational engine and/or OS (such as volume or filesystem "snapshots", like in Linux's LVM, that let you cheaply and speedily "freeze in time" a copy of the volumes on which the DB resides, for later leisurely fetching with another, read-only, database configured to read from the snapshot volume).
But presumably you do have some constraints, something in the schema that helps with the issue, or else, one can hope, you can afford some inconsistency generated by changes in the DB happening at the same time as your processing -- some lines processed twice, some not processed, some processed in older versions and others in newer versions... unfortunately, you have told us next to nothing about any of these issues, making it essentially unfeasible to offer much more help. If you edit your question to provide a LOT more information on platform, schema, and DB usage patterns, maybe more help can be offered.
A flat file or a snapshot are both ideal.
If a flat file does not suit or you do not have access to snapshots, then you could use a sequential id field or create a sequential id in a temp table and then iterate using that.
Something like
max_id = 0
while exists (select * from table where seq_id > max_id)
    select top n * from table where seq_id > max_id order by seq_id
    ... process ...
    set max_id = the highest seq_id in the batch just processed
end
If there is no sequential id then you can create a temp table that holds the order like
insert into some_temp_table
select unique_id from table order by your_ordering_scheme
then process like this
... do something with top n from table join some_temp_table on unique_id ...
delete top n from some_temp_table
this way temp_table holds the record identifiers that still need to be processed.
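The sequential-id loop above could be written roughly as follows. This is a sketch in Python with SQLite, not tied to whatever database the question uses; the orders table, the seq_id column and the iter_batches helper are placeholders, and it assumes seq_id values are positive and unique.

```python
import sqlite3


def iter_batches(conn, batch_size, table="orders", key="seq_id"):
    """Yield batches ordered by key, resuming after the last key seen."""
    max_id = 0  # assumes all key values are greater than zero
    while True:
        # The key is selected first so the loop can read it by position.
        batch = conn.execute(
            f"SELECT {key}, * FROM {table} WHERE {key} > ? ORDER BY {key} LIMIT ?",
            (max_id, batch_size),
        ).fetchall()
        if not batch:
            return
        yield [row[1:] for row in batch]
        max_id = batch[-1][0]
```

Because each query starts from the last key rather than an offset, every batch is an index range scan, and a pause between batches can be added in the calling loop if the live database needs breathing room.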
You don't mention which db you are using, but I doubt any db that can hold 12 million rows would actually try to return all the data to your program at once. Your program essentially streams the data in small blocks (say 1000 rows) something that is usually handled by the database driver.
RDBMSs have different transaction levels which can be used to reduce the effort the database spends maintaining consistency guarantees, which will avoid locking up the table.
Databases can also create snapshots of tables to a file for later analysis.
In your position, I would try the simplest thing first, and see how that scales (on a development copy of the db with simulated user access.)
What would be the easiest way to count the new records that are inserted into a database? Is it possible to include a count query in with the load query?
Or is something more complex needed, such as recording the existing last record and counting everything added after it?
edit:
I have a cron job, that uses LOAD DATA INFILE in a script that is passed directly to mysql. This data is used with a php web application. As part of the php web application, I need to generate weekly reports, including how many records were inserted in the last week.
I am unable to patch mysql, or drastically change the database schema/structure, but I am able to add in new tables or fields. I would prefer not to count records from the csv file and store this result in a textfile or something. Instead, I would prefer to do everything from within PHP with queries.
Assuming you're using MySQL 5 or greater, you could create a trigger which would fire upon inserting into a specific table. Note that an "insert" trigger also fires with the "LOAD" command.
Using a trigger would require you to persist the count information in a separate table. Basically you'd need to create a new table with 1 row/column to hold the count. The trigger would then update that value with the amount of data loaded.
Here's the MySQL manual page on triggers; the syntax is fairly straightforward. http://dev.mysql.com/doc/refman/5.0/en/create-trigger.html
edit
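Alternatively, if you don't want to persist the data within the database you could perform your "Load" operations within a stored procedure, provided your MySQL version permits it (MySQL lists LOAD DATA among the statements not allowed in stored routines; in that case the same two counts can be issued from the calling script instead). This would allow you to perform a select count(*) on the table before you begin the Load and after the Load is complete. You would just need to subtract the resulting values to determine how many rows were inserted during the Load.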
Here's the MySQL manual page on procedures.
http://dev.mysql.com/doc/refman/5.0/en/create-procedure.html
That would probably depend on what is determined as being new. Is it entries entered into the database in the last five minutes or 10 minutes etc? Or is it any record past a certain Auto ID?
If you are looking at time based method of determining what's new, you can have a field (probably of type datetime) that records the time when the record was inserted and to get the number, you simply do a...
select count(*) from table where currentTime > 'time-you-consider-to-be-new'
If you don't want to go by recording the time, you can use an auto increment key and simply keep track of the last inserted ID and count the ones that come after that at any given time window. so if one hour ago the ID was 10000 then a number of records have been inserted since then. You will need to count all records greater than 10000 and keep track of the last insert ID and repeat whenever needed.
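The auto-increment variant comes down to one query. Here is a small sketch in Python with SQLite standing in for MySQL; the records table and the count_new_rows helper are hypothetical, and the same SQL would work from PHP.

```python
import sqlite3


def count_new_rows(conn, last_seen_id, table="records"):
    """Return (count, newest_id) for rows with id above last_seen_id."""
    count, newest = conn.execute(
        f"SELECT COUNT(*), MAX(id) FROM {table} WHERE id > ?", (last_seen_id,)
    ).fetchone()
    # MAX(id) is NULL when nothing is new; keep the old marker in that case.
    return count, (newest if newest is not None else last_seen_id)
```

The returned newest_id is what you store for the next reporting period.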
If you are not looking at a specific table, you can use the following:
show global status like "Com_%";
This will show you statistics for every type of query. These numbers just keep on counting, so if you want to use them, record the initial number when starting to track the queries, and subtract this from your final number (but yea, that's a given).
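The subtraction step is trivial, but for completeness here is a sketch in Python. It assumes you have already read the counters into dictionaries (for example by running the SHOW GLOBAL STATUS query through whatever MySQL driver you use); counter_delta is an illustrative name.

```python
def counter_delta(before, after):
    """Per-counter increase between two snapshots of cumulative counters."""
    delta = {}
    for name, value in after.items():
        increase = value - before.get(name, 0)
        if increase != 0:
            delta[name] = increase  # only report counters that moved
    return delta
```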
If you are looking for pure statistics, I can recommend using Munin with the MySQL plugins.
From where do you load the data? You might consider counting them before you insert them into the database. If it's an SQL script you might write a quick and dirty bash script (with grep or something similar) to count the fields.
You say you can't change the structure. Does that mean you can't change the table you are inserting into, or you can't change the database at all? If you can add a table, then just create a table with 2 columns - a timestamp and the key of the table you are loading. Before you load your csv file, create another csv file with just those two columns, and load that csv after your main one.
This might be simpler than you want, but what about a Nagios monitor to track the row count? (Also consider asking around on serverfault.com; this stuff is totally up their alley.)
Perhaps you could write a small shell script that queries the database for the number of rows. You could then have a Cron job that runs every minute/hour/day etc and outputs the COUNT to a log file. Over time, you could review the log file and see the rate at which the database is growing. If you also put a date in the log file, you could review it easier over longer periods.
See if this is the kind of MySQL data collection you're interested in: http://code.google.com/p/google-mysql-tools/wiki/UserTableMonitoring.
If that is the case, Google offers a MySQL patch (to apply to a clean mysql directory source) at http://google-mysql-tools.googlecode.com/svn/trunk/mysql-patches/all.v4-mysql-5.0.37.patch.gz. You can read more about the patch at http://code.google.com/p/google-mysql-tools/wiki/Mysql5Patches.
If this is not what you're looking for, I suggest you explain yourself a little more in order for us to help you better.
Could you use a trigger on the table which will insert into a table you created, which in the structure has a timestamp?
You could then use a date calculation on a period range to find the information needed.
I don't know what version of mysql you are using, but here is a link to the syntax for trigger creation in version 5.0: http://dev.mysql.com/doc/refman/5.0/en/create-trigger.html
Good luck,
Matt
Well, if you need exhaustive information: which rows were inserted, updated or deleted, it might make sense to create an additional audit table to store those things with a timestamp. You could do this with triggers. I would also write a stored procedure which would execute as event and erase old entries (whatever you consider old).
Refer to the link posted by Lima on how to create triggers in MySQL.
Refer to page 655 of "MySQL Cookbook" by Paul Dubois (2nd Edition) or page 158 of "SQL for smarties" by Joe Celko.
so the 'load' will only insert new data in the table ? or rewrite the whole table ?
If it will load new data, then you can do a select count(*) from yourtable
once before the loading and once after the loading ... the difference will show you how many new records were inserted..
If on the other hand you rewrite the whole table and want to find the different records from the previous version .. then you would need a completely different approach..
Which one is it ?
Your question is a bit ambiguous, but the MySQL C API provides a function "mysql_affected_rows" that you can call after each query to get the number of affected rows. For an insert it returns the number of rows inserted. Be aware that for updates it returns the number of rows changed, not the number of rows that matched the where clause.
If you are performing a number of queries and need to know how many were inserted the most reliable way would probably be doing a count before and after the queries.
As noted in sobbayi's answer adding a "created at" timestamp to your tables would allow you to query for records created after (or before) a given time.
UPDATE:
OK here is what you need to do to get a count before and after:
create a table for the counts:
create table row_counts (ts timestamp not null, row_count integer not null);
in your script add the following before and after your LOAD DATA INFILE query:
insert into row_counts (ts,row_count) select now(),count(0) from YOUR_TABLE;
LOAD DATA INFILE ......
insert into row_counts (ts,row_count) select now(),count(0) from YOUR_TABLE;
the row_counts table will now have the count before and after your load.
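Reading the result back for the weekly report could look like the sketch below, again in Python with SQLite rather than PHP and MySQL. The your_table name, record_count and last_load_size are placeholders, and CURRENT_TIMESTAMP replaces now() because SQLite has no now() function.

```python
import sqlite3


def record_count(conn, table="your_table"):
    """Append the table's current row count to row_counts."""
    conn.execute(
        "INSERT INTO row_counts (ts, row_count) "
        f"SELECT CURRENT_TIMESTAMP, COUNT(*) FROM {table}"
    )


def last_load_size(conn):
    """Difference between the two most recent recorded counts."""
    rows = conn.execute(
        "SELECT row_count FROM row_counts ORDER BY rowid DESC LIMIT 2"
    ).fetchall()
    if len(rows) < 2:
        return None
    return rows[0][0] - rows[1][0]
```

Ordering by rowid rather than ts avoids ties when both counts are recorded within the same second.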
show global status like 'Com_insert';
flush status and show session status... will work for just the current connection.
see http://dev.mysql.com/doc/refman/5.1/en/server-status-variables.html#statvar_Com_xxx
Since you asked for the easiest way, I would suggest you to use a trigger on insert. You could use a single column, single row table as a counter and update it with the trigger.
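A minimal sketch of such a counter, using Python and SQLite so it can be run anywhere; MySQL trigger syntax differs slightly (it needs FOR EACH ROW), and the imports and insert_counter tables are invented for the example.

```python
import sqlite3

SCHEMA = """
CREATE TABLE imports (id INTEGER PRIMARY KEY, line TEXT);
CREATE TABLE insert_counter (total INTEGER NOT NULL);
INSERT INTO insert_counter (total) VALUES (0);
CREATE TRIGGER imports_count AFTER INSERT ON imports
BEGIN
    UPDATE insert_counter SET total = total + 1;
END;
"""


def open_counter_db():
    """Create the data table, the one-row counter table, and the trigger."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def inserted_total(conn):
    """Read the running number of inserted rows."""
    return conn.execute("SELECT total FROM insert_counter").fetchone()[0]
```

For a weekly figure, the report would store the total at the start of the week and subtract it from the current value. Deletes do not change the counter, so it counts insertions rather than the current table size.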