Best way to archive/backup tables and changes in a large database - sql

I have an interesting issue and requirement for a large multi-schema database.
- The database is around 130 GB in size.
- It is a multi-schema database; each customer has a schema.
- We currently have 102,247 tables in the system.
- Microsoft SQL Server 2008 R2.
This is due to customisation requirements of customers, all using a single defined front end.
The issue we have is that our database backups become astronomical, and getting a database restore done to retrieve lost/missing/incorrect data is a nightmare. The initial product did not have defined audit trails, and we don't store 'changes' to data; we simply have one version of the data.
Getting lost data back basically means restoring a full 130 GB backup and loading differentials/transaction logs to get the data.
We want to introduce a 'changeset' for each important table within each schema, essentially holding a snapshot of the data, then any modified/different data as it is saved, every X number of minutes. This will have to be a SQL job initially, but I want to know what would be the best method.
Essentially I would run a script to insert the 'backup' tables into each schema for the tables we wish to keep backed up.
Then run a job every X minutes to cycle through each schema and insert the current data, then any new/changed data as it spots a change (based on the ModifiedDate of the row). It will then retain this changelog for around a month before overwriting itself.
We still have our larger backups, but we won't need to keep as long a retention period. My point is: what is the best and most efficient method of checking for changed data and performing the insert?
My gut feeling would be :
INSERT INTO BACKUP_table (UniqueID, col1, col2, col3)
SELECT UniqueID, col1, col2, col3
FROM [table]
WHERE ModifiedDate > DATEADD(mi, -90, CURRENT_TIMESTAMP)
*rough SQL
This would have to run in a loop to go through all the schemas. A number of tables won't have changed data.
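Roughly, I imagine the loop itself would be dynamic SQL over sys.schemas, something like this (ImportantTable and BACKUP_ImportantTable are just placeholder names; each schema would have its own set of tables):

DECLARE @schema sysname, @sql nvarchar(max);

DECLARE schema_cursor CURSOR FOR
    SELECT name
    FROM sys.schemas
    WHERE schema_id > 4 AND schema_id < 16384;   -- skip built-in schemas and database roles

OPEN schema_cursor;
FETCH NEXT FROM schema_cursor INTO @schema;

WHILE @@FETCH_STATUS = 0
BEGIN
    SET @sql = N'INSERT INTO ' + QUOTENAME(@schema) + N'.BACKUP_ImportantTable (UniqueID, col1, col2, col3, CapturedAt) '
             + N'SELECT UniqueID, col1, col2, col3, CURRENT_TIMESTAMP '
             + N'FROM ' + QUOTENAME(@schema) + N'.ImportantTable '
             + N'WHERE ModifiedDate > DATEADD(mi, -90, CURRENT_TIMESTAMP);';
    EXEC sys.sp_executesql @sql;

    FETCH NEXT FROM schema_cursor INTO @schema;
END

CLOSE schema_cursor;
DEALLOCATE schema_cursor;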
Is this even a good method?
What does SO think?

My first response would be to consider keeping each customer in their own database instead of their own schema within a massive database. The key benefits to doing this are:
- much less stress on the metadata for a single database
- you can perform backups for each customer on whatever schedule you like
- when a certain customer has high activity you can move them easily
I managed such a system for several years at my previous job, and managing 500 databases was no more complex than managing 10; the only difference to your applications is the database part of the connection string (which is actually easier to adapt queries to than a schema prefix).
If you're really committed to keeping everyone in a single database, then what you can consider doing is storing the important tables inside each schema in their own filegroup, and moving everything out of the primary filegroup. Now you can back up those filegroups independently and, based solely on the full primary backup and a piecemeal restore of the individual filegroup backup, you can bring just that customer's schema online in another location and retrieve the data you're after (perhaps copying it over to the primary database using import/export, BCP, or simple DML queries), without having to restore the entire database. Moving all user data out of the primary filegroup minimizes the time it takes to restore that initial backup and get on to restoring the specific customer's filegroup. While this makes your backup/recovery strategy a little more complex, I believe it achieves what you're after.
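A rough sketch of the filegroup side of this, where every database, schema, filegroup, index and path name is purely illustrative:

ALTER DATABASE BigDb ADD FILEGROUP Customer42_FG;
ALTER DATABASE BigDb ADD FILE
    (NAME = N'Customer42_FG_Data', FILENAME = N'X:\Data\BigDb_Customer42.ndf')
    TO FILEGROUP Customer42_FG;

-- Rebuilding the clustered index onto the new filegroup is one common way to relocate an
-- existing table's data; check it against your own keys/constraints first.
CREATE UNIQUE CLUSTERED INDEX PK_ImportantTable
    ON Customer42.ImportantTable (UniqueID)
    WITH (DROP_EXISTING = ON)
    ON Customer42_FG;

-- The customer's filegroup can then be backed up on its own.
BACKUP DATABASE BigDb
    FILEGROUP = N'Customer42_FG'
    TO DISK = N'X:\Backups\BigDb_Customer42_FG.bak';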
Another option is to use a custom log shipping implementation with an intentional delay. We did this for a while by shipping our logs to a reporting server, but waiting 12 hours before applying them. This gave us protection from customers shooting themselves in the foot and then requiring a restore - if they contacted us within 12 hours of their mistake, we likely already had the "before-screw-up" data online on the reporting server, making it trivial to fix it on the primary server. It also doubled as a reporting server for reports looking at data older than 12 hours, taking substantial load away from the primary server.
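The core of that delayed apply is nothing exotic; it is an ordinary RESTORE LOG that a job runs only for log backups older than the chosen delay (names and paths below are purely illustrative):

RESTORE LOG SalesDb_Report
    FROM DISK = N'X:\LogShip\SalesDb_20120101_0600.trn'
    WITH STANDBY = N'X:\LogShip\SalesDb_Report_undo.dat';
-- WITH STANDBY keeps the reporting copy readable between restores; the job simply skips
-- any .trn file whose backup finished less than 12 hours ago.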
You can also consider change data capture, but you will obviously need to test the performance and the impact on the rest of your workload. This solution will also depend on the edition of SQL Server you're using, since it is not available in Standard, Web, Workgroup, etc.
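For reference, enabling change data capture is just a pair of system stored procedures; the schema and table names below are illustrative:

EXEC sys.sp_cdc_enable_db;

EXEC sys.sp_cdc_enable_table
    @source_schema = N'Customer42',
    @source_name   = N'ImportantTable',
    @role_name     = NULL;
-- Changes are then exposed through the generated cdc.Customer42_ImportantTable_CT table
-- and the cdc.fn_cdc_get_all_changes_Customer42_ImportantTable function.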

Related

What is the best approach to pull "Delta" data into Analytics DB from a highly transactional DB?

What is the best approach to load only the Delta into the analytics DB from a highly transactional DB?
Note:
We have a highly transactional system and we are building an analytics database from it. At present, we are wiping all the fact and dimension tables from the analytics DB and loading the entire "processed" data set at midnight. The problem with this approach is that we are loading the same data again and again every time, along with the few rows that got added/updated on that particular day. We need to load the "Delta" alone (rows which are inserted newly & the old rows which got updated). Any efficient way to do this?
It is difficult to say much without knowing the details, e.g. the database schema, the database engine... However, the most natural approach for me is to use timestamps. This solution assumes that the entities (a single record in a table, or a group of related records) that are loaded/migrated from the transactional DB into the analytic one have a timestamp.
This timestamp says when a given entity was created or last updated. While loading/migrating data you should take into account only those entities whose timestamp is greater than the date of the last migration. The advantage of this approach is that it is quite simple and does not require any specific tool. The question is whether you already have timestamps in your DB.
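If the timestamps are there, the load itself can be a watermark plus an upsert. A minimal sketch in T-SQL, with hypothetical table, column and watermark names:

DECLARE @last_load datetime2;
SELECT @last_load = LastLoadTime FROM etl.Watermark WHERE TableName = N'Orders';

MERGE analytics.Orders AS tgt
USING (SELECT OrderId, CustomerId, Amount, LastModified
       FROM source.Orders
       WHERE LastModified > @last_load) AS src
    ON tgt.OrderId = src.OrderId
WHEN MATCHED THEN
    UPDATE SET tgt.CustomerId   = src.CustomerId,
               tgt.Amount       = src.Amount,
               tgt.LastModified = src.LastModified
WHEN NOT MATCHED THEN
    INSERT (OrderId, CustomerId, Amount, LastModified)
    VALUES (src.OrderId, src.CustomerId, src.Amount, src.LastModified);

-- Advance the watermark to the newest timestamp actually seen in the source.
UPDATE etl.Watermark
SET LastLoadTime = ISNULL((SELECT MAX(LastModified) FROM source.Orders), LastLoadTime)
WHERE TableName = N'Orders';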
Another approach might be to utilize some kind of change tracking mechanism. For example, MS SQL Server has something like that (see this article). However, I have to admit that I've never used it, so I'm not sure whether it is suitable in this case. If your database doesn't support change tracking, you can try to build it on your own based on triggers, but in general that is not an easy thing to do.
We need to load the "Delta" alone (rows which are inserted newly & the old rows which got updated). Any efficient way to do this?
You forgot rows that got deleted, and that is the crux of the problem. Having an updated_at field on every table and polling for rows with updated_at > #last_poll_time works, more or less, but polling like this does not give you a transactionally consistent image, because each table is polled at a different moment. Tracking deleted rows induces complications at the app/data-model layer, as rows have to be either logically deleted (is_deleted) or moved to an archive table (for each table!).
Another solution is to write triggers in the database: attach a trigger to each table and have it write the changes that occur into a table_history table. Again, for each table. These solutions are notoriously difficult to maintain long term in the presence of schema changes (columns added or modified, tables dropped, etc.).
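To make that maintenance cost concrete, here is roughly what one such trigger looks like for a single, hypothetical Orders table (and you would need one of these per tracked table):

CREATE TABLE dbo.Orders_History
(
    OrderId    int            NOT NULL,
    CustomerId int            NULL,
    Amount     decimal(18, 2) NULL,
    ChangeType char(1)        NOT NULL,                      -- 'I', 'U' or 'D'
    ChangedAt  datetime2      NOT NULL DEFAULT SYSDATETIME()
);
GO
CREATE TRIGGER dbo.trg_Orders_History
ON dbo.Orders
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;

    -- inserted holds the new values for INSERT/UPDATE rows
    INSERT INTO dbo.Orders_History (OrderId, CustomerId, Amount, ChangeType)
    SELECT i.OrderId, i.CustomerId, i.Amount,
           CASE WHEN EXISTS (SELECT 1 FROM deleted) THEN 'U' ELSE 'I' END
    FROM inserted AS i;

    -- deleted holds the old values; record them only for pure DELETEs
    INSERT INTO dbo.Orders_History (OrderId, CustomerId, Amount, ChangeType)
    SELECT d.OrderId, d.CustomerId, d.Amount, 'D'
    FROM deleted AS d
    WHERE NOT EXISTS (SELECT 1 FROM inserted);
END;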
But there are database-specific solutions that can help. For instance, SQL Server has Change Tracking and Change Data Capture. These can be leveraged to build an ETL pipeline that maintains an analytical data warehouse. Database schema changes are still a pain, though.
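A minimal sketch of the Change Tracking flavour (database and table names are illustrative); unlike Change Data Capture, it is available outside Enterprise edition:

ALTER DATABASE SourceDb
SET CHANGE_TRACKING = ON (CHANGE_RETENTION = 7 DAYS, AUTO_CLEANUP = ON);

ALTER TABLE dbo.Orders ENABLE CHANGE_TRACKING;

-- In the ETL, pull everything changed since the last synced version...
DECLARE @last_sync bigint = 0;                      -- persisted from the previous run
SELECT ct.OrderId, ct.SYS_CHANGE_OPERATION          -- I, U or D
FROM CHANGETABLE(CHANGES dbo.Orders, @last_sync) AS ct;

-- ...and remember the current version as the next run's starting point.
SELECT CHANGE_TRACKING_CURRENT_VERSION();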
There is no silver bullet, no pixie dust.

How can I generate reports from Azure SQL without slowing the system down?

I hope someone can give me advice or point me to some reading on this. I generate business reports for my team. We host a subscription website, so we need to track several things, sometimes on a daily basis. A lot of SQL queries are involved. The problem is that querying a large volume of information from the live database slows down our website or causes timeouts.
My current solution requires me to run bcp scripts that copy new rows to a backup database (which I use purely for reports) daily. Then I use an application I made to generate reports from there. The output is ultimately an Excel file, or several (for the benefit of the business teams; it's easier for them to read). There are several problems with my temporary solution, though:
- It only adds new rows; updates to previous rows are not copied.
- It doesn't seem very efficient.
Is there another way to do this? My main concern is that the generation or the querying should not slow down our site.
I can think of three options for you, each of which could have various implementation methods. The first one is Azure SQL Data Sync Services, the second is the AS COPY OF operation, and the third rides on top of a backup.
The Sync Services are a good option if you need more real-time reporting capability; meaning if you need to run your reports multiple times a day, at just about any time, and you need your data as current as you can get it. Sync Services could have a performance impact on your primary database because it runs off of triggers, but with this option you can choose what to sync; in other words, you can replicate a filtered set of data, which minimizes the performance impact. Then you can report against the synced database. Another important shortcoming of this approach is that you end up maintaining a sync service; if your primary database schema changes, you may need to recreate some or all of the sync configuration.
The second option, AS COPY OF, is simply a database copy operation which essentially gives you a clone of your primary database. Depending on the size of the database, this could take some time, so testing is key. However, if you are producing a morning report for yesterday's activities and having the latest data is not as important, then you could run the AS COPY OF operation on a schedule after hours (or when the activity on your database is lowest) and run your report against the secondary database. You may need to build a small script, or use third-party tools, to help you automate this. There would be little to no performance impact on your primary database. In addition, the AS COPY OF operation provides transactional consistency, if this is important to you.
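The copy itself is a one-liner run against the master database of the destination server (names are illustrative):

CREATE DATABASE SalesDb_Reporting AS COPY OF myserver.SalesDb;

-- The copy runs asynchronously; it is ready once its state reaches ONLINE.
SELECT name, state_desc
FROM sys.databases
WHERE name = N'SalesDb_Reporting';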
The third option could be to use a backup mechanism (such as Azure Export, or the Azure backup tools) and restore the latest backup before running your reports. This has the advantage of leveraging your backup strategy without much additional effort.

Temporary Tables Quick Guide

I have a structured database and software to handle it and I wanted to setup a demo version based off of a simple template version. I'm reading through some resources on temporary tables but I have questions.
What is the best way to go about cloning a "temporary" database while keeping a clean list of databases?
From what I've seen, there are two ways to do this - temporary local versions that are terminated at the end of the session, and tables that are stored in the database until deleted by the client or me.
I think I would prefer the second option, because I would like to be able to see what they do with it. However, I do not want to add a ton of throw-away databases and clutter my system.
How can I a) schedule these for deletion after, say, 30 days and b) if possible, keep these all under one umbrella; in other words, is there a way to keep them out of my main list of databases and grouped by themselves?
I've thought about having one database and then serving up the information by using a unique ID for the user and 'faux indexes' so that it appears as 1,2,3 instead of 556,557,558 to solve B. I'm unsure how I could solve A, other than adding a date and protected columns and having a script that runs daily and deletes if over 30 days and not protected.
I apologize for the open-ended question, but the resources I've found are a bit ambiguous.
These aren't true temp tables in the sense that your DBMS knows them. What you're looking for is a way to have a demo copy of your database, probably with a cut-down data set. It's really no different from having any other non-production copy of your database.
Don't do this on your production database server.
Do not do this on your production database server.
Script the creation of your database schema. Depending on the DBMS you're using, this may be pretty easy. If you've got a good development/deployment/maintenance process for your system, this should already exist.
Create your database on the non-production server using the script(s) generated in the previous step. Use an easily-identifiable naming convention, like starting the database name with demo.
Load any data required into the tables.
Point the demo version of your app (that's running on your non-production servers) at this new database.
Create a script/process/job which looks at your database server and drops any databases that match your demo DB naming convention and were created more than 30 days ago (a rough sketch follows below).
Without details about your actual environment, people can't give concrete examples/sample code/instructions.
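That said, a rough, generic sketch of the cleanup job, assuming SQL Server and the demo naming convention described above, might look like this:

DECLARE @name sysname, @sql nvarchar(max);

DECLARE demo_cursor CURSOR FOR
    SELECT name
    FROM sys.databases
    WHERE name LIKE N'demo%'
      AND create_date < DATEADD(DAY, -30, SYSDATETIME());

OPEN demo_cursor;
FETCH NEXT FROM demo_cursor INTO @name;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- Kick out any lingering connections, then drop the expired demo database.
    SET @sql = N'ALTER DATABASE ' + QUOTENAME(@name) + N' SET SINGLE_USER WITH ROLLBACK IMMEDIATE; '
             + N'DROP DATABASE ' + QUOTENAME(@name) + N';';
    EXEC sys.sp_executesql @sql;

    FETCH NEXT FROM demo_cursor INTO @name;
END

CLOSE demo_cursor;
DEALLOCATE demo_cursor;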
If you cannot run a second, independent database server for these demos, then you will have to make do with your production server. This is still a bad idea because of potential security exposures and performance impact on your production database (constrained resources).
Create a complete copy of your database (or at least the schema, with a reduced data set) for each demo.
Create a unique set of credentials for each of these demo databases. This account should have access to only its demo database.
Configure the demo instance(s) of your application to connect to the demo database.
Here's why I'm pushing so hard for separate databases: If you keep copying your "demo" tables within the database, you will have to update your application code to point at those tables each time you do a new demo. Once you start doing this, you're taking a big risk with your demos - the code you keep changing isn't really the application you're running in production anymore. And if you miss one of those changes, you'll get unexpected results at best, and mangling of your production data at worst.

Database Filegroups - Only restore 1 filegroup on new server

Is there a way to back up certain tables in a SQL Server database? I know I can move certain tables into different filegroups and perform a backup on these filegroups. The only issue with this is that I believe you need a backup of all the filegroups and transaction logs to restore the database on a different server.
The reason why I need to restore the backup on a different server is that these are backups of customers' databases. For example, we may have a remote customer and need to get a copy of their 4 GB database. 90% of this space is taken up by two tables; we don't need these tables as they only store images. Currently we have to take a copy of the database and upload it to an FTP site… With larger databases this can take a lot of time, so we need to reduce the database size.
The other way I can think of doing this would be to take a full backup of the DB and restore it on the client's SQL Server. Then connect to the new temp DB and drop the two tables. Once this is done we could take a backup of that DB. The only issue with this solution is that it could use a lot of system resources while it runs, so it's less than ideal.
So my idea was to use two filegroups. The primary filegroup would host all of the tables except the two tables which would be in the second filegroup. Then when we need a copy of the database we just take a backup of the primary filegroup.
I have done some testing but have been unable to get it working. Any suggestions? Thanks
Basically your approach using 2 filegroups seems reasonable.
I suppose you're working with SQL Server on both ends, but you should clarify for each end whether that is truly the case, as well as which editions (Enterprise, Standard, Express, etc.) and which releases (2000, 2005, 2008, 2012?).
Table backup in SQL Server is a dead horse that still gets a good whippin' now and again. Basically, it's not a feature of the built-in backup feature-set. As you rightly point out, the partial backup feature can be used as a workaround. Also, if you just want to transfer a snapshot of a subset of tables to another server over FTP, you might try working with the bcp utility as suggested by one of the answers in the above linked post, or the export/import data wizards. To round out the list of table backup solutions and workarounds for SQL Server, there is this (and possibly other?) third-party software that claims to allow individual recovery of table objects, but unfortunately doesn't seem to offer individual object backup: "Object Level Recovery Native" by Red Gate (I have no affiliation with, or experience using, this particular tool).
As per your more specific concern about restore from partial database backups:
I believe you need a backup of all the filegroups and transaction logs
to restore the database on a different server
1) You might have some difficulties your first time trying to get it to work, but you can perform restores from partial backups as far back as SQL Server 2000 (as a reference, see here).
2) From 2005 onward you have the option of partially restoring today, and if you need to you can later restore the remainder of your database. You don't need to include all filegroups; you always include the primary filegroup, and if your database is in simple recovery mode you need to add all read-write filegroups.
3) You need to apply log backups only if your DB is in bulk-logged or full recovery mode and you are restoring changes to a read-only filegroup that has become read-write since the last restore. Since you are expecting changes to these tables, you will likely not be concerned about read-only filegroups, and so not concerned about shipping and applying log backups.
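Putting it together for your two-filegroup layout, the backup on the customer's side and the restore on yours might look roughly like this (all names, logical file names and paths are illustrative, and this assumes the simple recovery model; under full or bulk-logged recovery you would restore WITH NORECOVERY and apply log backups before recovering):

-- On the customer's server: back up only the primary filegroup.
BACKUP DATABASE CustomerDb
    FILEGROUP = N'PRIMARY'
    TO DISK = N'X:\Backups\CustomerDb_primary.bak';

-- On your server: piecemeal restore of just that filegroup.
RESTORE DATABASE CustomerDb_Copy
    FILEGROUP = N'PRIMARY'
    FROM DISK = N'X:\Backups\CustomerDb_primary.bak'
    WITH PARTIAL, RECOVERY,
         MOVE N'CustomerDb'     TO N'X:\Data\CustomerDb_Copy.mdf',
         MOVE N'CustomerDb_log' TO N'X:\Data\CustomerDb_Copy.ldf';

-- The filegroup holding the two image tables simply stays offline (RECOVERY_PENDING);
-- everything that lives in PRIMARY is immediately usable.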
You might also investigate at some point whether any of the other SQL Server features, such as merge replication, or those mentioned above (bcp, the import/export wizards), might provide a solution that is simpler or meets your needs more adequately.

Synchronizing databases

I am developing an Adobe AIR application which stores data locally using a SQLite database.
At any time, I want the end user to synchronize his/her local data to a central MySQL database.
Any tips, advice for getting this right?
Performance and stability are key (besides security ;))
I can think of a couple of ways:
Periodically, dump your MySQL database and create a new SQLite database from the dump. You can then serve the SQLite database (SQLite databases are contained in a single file) for your users' clients to download and replace their current database.
Create a diff script that generates the necessary statements to bring the current database up to speed (various INSERT, UPDATE and DELETE statements). To do this, you must record the time of each change continuously in your database (the time of creation and update for each row, and keep a history of deleted rows).
The user's client will download the diff file (a text file of the various statements) and apply it to the local database.
Both approaches have their own pros and cons. By dumping the entire database, you make sure all the data gets through. It is also much easier than creating the diff; however, it might put more load on the server, depending on how often the database gets updated between dumps.
On the other hand, diffing between the databases will give you just the data that changed (hopefully), but it is more open to logical errors. It will also incur additional overhead on the client, since it will have to create/update all the necessary records instead of just copying a file.
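To make the diff approach a little more concrete, the server-side queries behind it could be as simple as the following, assuming (hypothetically) that every synced table carries an updated_at column and that deletions are logged to a deleted_rows table:

-- New and changed rows since the last successful sync (substitute the real timestamp).
SELECT *
FROM orders
WHERE updated_at > '2012-01-01 00:00:00';

-- Rows the client should remove locally.
SELECT table_name, row_id
FROM deleted_rows
WHERE deleted_at > '2012-01-01 00:00:00';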
If you're just sync'ing from the server to client, Eran's solution should work.
If you're just sync'ing from the client to the server, just reverse it.
If you're sync'ing both ways, have fun. You'll at minimum probably want to keep change logs, and you'll need to figure out how to deal with conflicts.