I work as part of a two man DBA team running SQL Server 2008 R2 with me being somewhat of an accidental DBA. We recently had an issue where a small table we hardly ever use ended up getting truncated. Both of us swear we didn't do it, but it happened nonetheless.
To avoid the situation in the future, we're interested in implementing change tracking. It's not really necessary for us to preserve the data that was changed so we decided against using change data capture.
With that said, the things I'm reading about change tracking seem to be more about using it to synchronize data with an application rather than simply recording all the changes. Can I use change tracking to simply keep a list of all the changes made in the last 6 months or something? Once I enable it for each database in the SQL Server GUI, where is the info stored? Any other info you may have on implementing this correctly would be great.
Thanks!
From the documentation for change tracking:
The values of the primary key column is only information from the
tracked table that is recorded with the change information. These
values identify the rows that have been changed. To obtain the latest
data for those rows, an application can use the primary key column
values to join the source table with the tracked table.
So, if you're going to try to use this as a mechanism to somehow recover accidental deletions, I think you'll find it lacking. But don't take my word for it. Set up the following test:
In a test environment, set up a dummy table and enable change tracking on it.
Insert some data into the dummy table.
Delete the data
Recover the data using change tracking.
Despite having already dismissed CDC as an option, it sounds more in line with what you're after. CDC does keep track of non-primary key columns so if someone does a data modification accidentally, CDC will keep track of all of the values in the row(s) affected. It has the added benefit of not allowing table truncation because of how it's implemented (it uses the replication log reader).
Additionally, you can configure the CDC cleanup job to automatically purge data after any any amount of time you want (it sounds like 6 months is your retention period, which is completely doable).
Related
We are replicating a very large database to a data warehouse server (DELETE is off) via publisher/subscriber. The data warehouse team want to change some of the data at their subscriber side by wiping to NULL a few non-key data columns (name, address etc.).
I have tried searching via Google etc. but can't find any information on whether this is possible without breaking replication or causing unknown issues.
We do NOT want these changes propagated back to the publisher.
Anybody know what impact this will have?
I could change the Sp_MS* stored procedures but I reckon on reinitialize of the replication they could be overwritten by the MS standard ones again – I don’t really want to do this as it seems a messy solution.
I can test the idea out but was more concerned even if it did work with no issues that some unknown factor would cause an issue at some point.
I want to stream some time series data into BigQuery with insertAll but only retain the last 3 months (say) to avoid unbounded storage costs. The usual answer is to save each day of data into a separate table but AFAICT this would require each such table to be created in advance. I intend to stream data directly from unsecured clients authorized with a token that only has bigquery.insertdata scope, so they wouldn't be able to create the daily tables themselves. The only solution I can think of would be to run a secure daily cron job to create the tables -- not ideal, especially since if it misfires data will be dropped until the table is created.
Another approach would be to stream data into a single table and use table decorators to control query costs as the table grows. (I expect all queries to be for specific time ranges so the decorators should be pretty effective here.) However, there's no way to delete old data from the table, so storage costs will become unsustainable after a while. I can't figure out any way to "copy and truncate" the table atomically either, so that I can partition old data into daily tables without losing rows being streamed at that time.
Any ideas on how to solve this? Bonus points if your solution lets me re-aggregate old data into temporally coarser rows to retain more history for the same storage cost. Thanks.
Edit: just realized this is a partial duplicate of Bigquery event streaming and table creation.
If you look at the streaming API discovery document, there's a curious new experimental field called "templateSuffix", with a very relevant description.
I'd also point out that no official documentation has been released, so special care should probably go into using this field -- especially in a production setting. Experimental fields could possibly have bugs etc. Things I could think to be careful of off the top of my head are:
Modifying the schema of the base table in non-backwards-compatible ways.
Modifying the schema of a created table directly in a way that is incompatible with the base table.
Streaming to a created table directly and via this suffix -- row insert ids might not apply across boundaries.
Performing operations on the created table while it's actively being streamed to.
And I'm sure other things. Anyway, just thought I'd point that out. I'm sure official documentation will be much more thorough.
Most of us are doing the same thing as you described.
But we don't use a cron, as we create tables advance for 1 year or on some project for 5 years in advance. You may wonder why we do so, and when.
We do this when the schema is changed by us, by the developers. We do a deploy and we run a script that takes care of the schema changes for old/existing tables, and the script deletes all those empty tables from the future and simply recreates them. We didn't complicated our life with a cron, as we know the exact moment the schema changes, that's the deploy and there is no disadvantage to create tables in advance for such a long period. We do this based on tenants too on SaaS based system when the user is created or they close their accounts.
This way we don't need a cron, we just to know that the deploy needs to do this additional step when the schema changed.
As regarding don't lose streaming inserts while I do some maintenance on your tables, you need to address in your business logic at the application level. You probably have some sort of message queue, like Beanstalkd to queue all the rows into a tube and later a worker pushes to BigQuery. You may have this to cover the issue when BigQuery API responds with error and you need to retry. It's easy to do this with a simple message queue. So you would relly on this retry phase when you stop or rename some table for a while. The streaming insert will fail, most probably because the table is not ready for streaming insert eg: have been temporary renamed to do some ETL work.
If you don't have this retry phase you should consider adding it, as it not just helps retrying for BigQuery failed calls, but also allows you do have some maintenance window.
you've already solved it by partitioning. if table creation is an issue have an hourly cron in appengine that verifies today and tomorrow tables are always created.
very likely the appengine wont go over the free quotas and it has 99.95% SLO for uptime. the cron will never go down.
I want to stream some time series data into BigQuery with insertAll but only retain the last 3 months (say) to avoid unbounded storage costs. The usual answer is to save each day of data into a separate table but AFAICT this would require each such table to be created in advance. I intend to stream data directly from unsecured clients authorized with a token that only has bigquery.insertdata scope, so they wouldn't be able to create the daily tables themselves. The only solution I can think of would be to run a secure daily cron job to create the tables -- not ideal, especially since if it misfires data will be dropped until the table is created.
Another approach would be to stream data into a single table and use table decorators to control query costs as the table grows. (I expect all queries to be for specific time ranges so the decorators should be pretty effective here.) However, there's no way to delete old data from the table, so storage costs will become unsustainable after a while. I can't figure out any way to "copy and truncate" the table atomically either, so that I can partition old data into daily tables without losing rows being streamed at that time.
Any ideas on how to solve this? Bonus points if your solution lets me re-aggregate old data into temporally coarser rows to retain more history for the same storage cost. Thanks.
Edit: just realized this is a partial duplicate of Bigquery event streaming and table creation.
If you look at the streaming API discovery document, there's a curious new experimental field called "templateSuffix", with a very relevant description.
I'd also point out that no official documentation has been released, so special care should probably go into using this field -- especially in a production setting. Experimental fields could possibly have bugs etc. Things I could think to be careful of off the top of my head are:
Modifying the schema of the base table in non-backwards-compatible ways.
Modifying the schema of a created table directly in a way that is incompatible with the base table.
Streaming to a created table directly and via this suffix -- row insert ids might not apply across boundaries.
Performing operations on the created table while it's actively being streamed to.
And I'm sure other things. Anyway, just thought I'd point that out. I'm sure official documentation will be much more thorough.
Most of us are doing the same thing as you described.
But we don't use a cron, as we create tables advance for 1 year or on some project for 5 years in advance. You may wonder why we do so, and when.
We do this when the schema is changed by us, by the developers. We do a deploy and we run a script that takes care of the schema changes for old/existing tables, and the script deletes all those empty tables from the future and simply recreates them. We didn't complicated our life with a cron, as we know the exact moment the schema changes, that's the deploy and there is no disadvantage to create tables in advance for such a long period. We do this based on tenants too on SaaS based system when the user is created or they close their accounts.
This way we don't need a cron, we just to know that the deploy needs to do this additional step when the schema changed.
As regarding don't lose streaming inserts while I do some maintenance on your tables, you need to address in your business logic at the application level. You probably have some sort of message queue, like Beanstalkd to queue all the rows into a tube and later a worker pushes to BigQuery. You may have this to cover the issue when BigQuery API responds with error and you need to retry. It's easy to do this with a simple message queue. So you would relly on this retry phase when you stop or rename some table for a while. The streaming insert will fail, most probably because the table is not ready for streaming insert eg: have been temporary renamed to do some ETL work.
If you don't have this retry phase you should consider adding it, as it not just helps retrying for BigQuery failed calls, but also allows you do have some maintenance window.
you've already solved it by partitioning. if table creation is an issue have an hourly cron in appengine that verifies today and tomorrow tables are always created.
very likely the appengine wont go over the free quotas and it has 99.95% SLO for uptime. the cron will never go down.
Say there is a database with 100+ tables and a major feature is added, which requires 20 of existing tables to be modified and 30 more added. The changes were done over a long time (6 months) by multiple developers on the development database. Let's assume the changes do not make any existing production data invalid (e.g. there are default values/nulls allowed on added columns, there are no new relations or constraints that could not be fulfilled).
What is the easiest way to publish these changes in schema to the production database? Preferably, without shutting the database down for an extended amount of time.
Write a T-SQL script that performs the needed changes. Test it on a copy of your production database (restore from a recent backup to get the copy). Fix the inevitable mistakes that the test will discover. Repeat until script works perfectly.
Then, when it's time for the actual migration: lock the DB so only admins can log in. Take a backup. Run the script. Verify results. Put DB back online.
The longest part will be the backup, but you'd be crazy not to do it. You should know how long backups take, the overall process won't take much longer than that, so that's how long your downtime will need to be. The middle of the night works well for most businesses.
There is no generic answer on how to make 'changes' without downtime. The answer really depends from case to case, based on exactly what are the changes. Some changes have no impact on down time (eg. adding new tables), some changes have minimal impact (eg. adding columns to existing tables with no data size change, like a new nullable column that doe snot increase the null bitmap size) and other changes will wreck havoc on down time (any operation that will change data size will force and index rebuild and lock the table for the duration). Some changes are impossible to apply without *significant * downtime. I know of cases when the changes were applies in parallel: a copy of the database is created, replication is set up to keep it current, then the copy is changed and kept in sync, finally operations are moved to the modified copy that becomes the master database. There is a presentation at PASS 2009 given by Michelle Ufford that mentions how godaddy gone through such a change that lasted weeks.
But, at a lesser scale, you must apply the changes through a well tested script, and measure the impact on the test evaluation.
But the real question is: is this going to be the last changes you ever make to the schema? Finally, you have discovered the perfect schema for the application and the production database will never change? Congratulation, once you pull this off, you can go to rest. But realistically, you will face the very same problem in 6 months. the real problem is your development process, with developers and making changes from SSMS or from VS Server Explored straight into the database. Your development process must make a conscious effort to adopt a schema change strategy based on versioning and T-SQL scripts, like the one described in Version Control and your Database.
Use a tool to create a diff script and run it during a maintenance window. I use RedGate SQL Compare for this and have been very happy with it.
I've been using dbdeploy successfully for a couple of years now. It allows your devs to create small sql change deltas which can then be applied against your database. The changes are tracked by a changelog table within database so that it knows what to apply.
http://dbdeploy.com/
i have a table in an access database
this access database is used on a regular basis, basically from 9-5
someone else has a copy of this exact table. sometimes records are added, sometimes deleted, and sometimes data within the records is updated.
i need to update the access database table with the offsite table every hour or so. what is the best algorithm of updating the data? there are about 5000 records.
would it severely lock up the table for a few seconds every hour?
i would like to publicly apologize for my rude comment to david fenton
My impression is that this question ties together pieces you've been exploring with your previous questions:
a file "listener" to detect the presence of a new file and do something with it when found
list files with some extension in a folder
DoCmd.TransferText to pull file data into your database
Insert, Update, Delete records in a table based on an imported set of records
Maybe it's time to give us a more detailed picture of what you're dealing with.
Tony asked if both sites are on the same WAN (Wide Area Network). You replied they are on Windows. Elsewhere you said you're using a network. Please tell us about the network.
I'm still unsure whether you need a one-way or two-way data exchange. You've talked about importing changes from the remote table into the local master table. Do you need to do the same type of operation at the remote site: import changes made to the table at the master site?
Tell us what needs to happen regarding the issue James raised. Can local and remote users ever edit the same record? If they can, how will you resolve the conflict? Similarly, what should happen if a remote user updates a record and a local user deletes their copy of that record?
Based on what you've told us so far, this sounds like a real challenge for Access, made more challenging by the rate of record changes (5,000 per hour). I like the outline Kevin suggested. However your challenge will be more complicated since you also need to account for record deletions at both sites.
It seems like you may have to create something which duplicates Access' Replication feature. Maybe you should look at the Jet Replication Wiki to see if you can modify your design to take advantage of Replication. I can't help you there, and unfortunately you appear to have frustrated David Fenton who is a leading authority on Jet Replication.
If a few seconds performance is critical, you'd rather move to a better database engine (like Sqlite, MySQL, MS SQL server). If you want a single file, then Sqlite is the best for you. All these use by-single-record locks, so you can read and write simultaneously.
If you stay with access, you will probably have to implement a timer to update only a few records at a time.
Before you do anything else you need to establish the "rules" as far as collisions go.
If a row in the local copy is updated and the same row in the remote copy is updated which one is the "correct" version? Ditto for deletions, inserts are even more of a pain as you can have the "same" set of values but perhaps a different key.
After you have worked out how to handle each of these cases you can then go on to thinking about the implementation.
As other posters have suggested the way to completely avoid these issues is switch to SQLServer or any other "proper" database which can be updated over the network by all users and where concurrency issues are handled by the DBMS when the updates are applied.
Other users have already suggested switching to a server based database i.e. SQL server etc. I would echo this and say it is the best way to go however if you are stuck with access and have no choice then I would suggest you add a field (with an index) along the lines of “Last Updated”. You could then export all records that have been modified within a particular time frame. Export this file as a CSV, ship it over to the remote site and import it into the “master” access database. With a bit of scripting you could automate this process.