How can I generate reports from Azure SQL without slowing the system down?

I hope someone can give me advice or point me to some reading on this. I generate business reports for my team. We host a subscription website, so we need to track several things, sometimes on a daily basis, and a lot of SQL queries are involved. The problem is that querying a large volume of information from the live database will slow down our website or cause timeouts.
My current solution is to run bcp scripts daily that copy new rows to a backup database that I use purely for reports. Then I use an application I wrote to generate reports from there. The output is ultimately one or more Excel files (for the benefit of the business teams; it's easier for them to read). There are several problems with this temporary solution, though:
It only adds new rows; updates to existing rows are not copied.
It doesn't seem very efficient.
Is there another way to do this? My main concern is that the generation or the querying should not slow down our site.

I can think of three options for you, each of which could have various implementation methods. The first is Azure SQL Data Sync Services, the second is the AS COPY OF operation, and the third rides on top of a backup.
The Sync Services are a good option if you need more real-time reporting capability, meaning you need to run your reports multiple times a day, at just about any time, and you need your data to be as close to real time as you can get. Sync Services could have a performance impact on your primary database because it is trigger-based, but with this option you can choose what to sync; in other words, you can replicate a filtered set of data, which minimizes the performance impact. You can then report off the synced database. Another important shortcoming of this approach is that you end up maintaining a sync service; if your primary database schema changes, you may need to recreate some or all of the sync configuration.
The second option, AS COPY OF, is simply a database copy operation that essentially gives you a clone of your primary database. Depending on the size of the database, this could take some time, so testing is key. However, if you are producing a morning report on yesterday's activity and having the very latest data is not critical, you could run the AS COPY OF operation on a schedule after hours (or whenever activity on your database is lowest) and run your reports against the secondary database. You may need to build a small script, or use third-party tools, to help you automate this. There would be little to no performance impact on your primary database. In addition, the AS COPY OF operation provides transactional consistency, if that is important to you.
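For reference, the copy itself is a single T-SQL statement run in the master database of the destination server; the database and server names below are placeholders:

-- ReportsCopy and MyLiveDb are hypothetical names.
CREATE DATABASE ReportsCopy AS COPY OF MyLiveDb;

-- Copying from a different logical server (server name is hypothetical):
-- CREATE DATABASE ReportsCopy AS COPY OF myliveserver.MyLiveDb;

-- The copy runs asynchronously; you can poll sys.dm_database_copies in master to track progress.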
The third option is to use a backup mechanism (such as Azure Export or the Azure backup tools) and restore the latest backup before running your reports. This has the advantage of leveraging your existing backup strategy without much additional effort.

Related

How can I Snapshot a database without losing undeleted data?

We have a shop floor database OPERATION that replicates selected data to a database BUSINESS that is used for reporting. The data in OPERATION is deleted daily by the third-party shop floor application, so in order to retain the data in BUSINESS I've set the article property for the DELETE delivery format to 'Do not replicate DELETE statements'.
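For reference, the same article property can be set with T-SQL using sp_changearticle; the publication and article names below are placeholders, and depending on the change the snapshot may need to be invalidated:

EXEC sp_changearticle
    @publication = N'ShopFloorPub',   -- hypothetical publication name
    @article     = N'SomeTable',      -- hypothetical article name
    @property    = N'del_cmd',
    @value       = N'NONE',           -- do not replicate DELETE statements
    @force_invalidate_snapshot = 1;   -- may be required depending on the change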
This works well, but occasionally somebody wants something extra or different replicated. Depending on the nature of the change to the publication, it may prompt for reinitialization of the snapshot, which would of course blow away the data in BUSINESS (as I sadly did one day).
What's the best way around this?
I would suggest you implement an ETL process instead of replication.
You can use SSIS to extract data from the OPERATION database and copy it to the BUSINESS database. In the SSIS package you have full control over the logic. For example, you can append the data to the existing data in BUSINESS, or you can use MERGE to insert new records and update existing ones (that way it is safe to run repeatedly, since unchanged data is not overwritten).
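A minimal sketch of such a MERGE, with hypothetical database, table, and column names:

-- OPERATION.dbo.Orders and BUSINESS.dbo.Orders are placeholder names.
MERGE BUSINESS.dbo.Orders AS target
USING OPERATION.dbo.Orders AS source
    ON target.OrderID = source.OrderID
WHEN MATCHED AND (target.Status <> source.Status OR target.Amount <> source.Amount) THEN
    UPDATE SET target.Status = source.Status,
               target.Amount = source.Amount
WHEN NOT MATCHED BY TARGET THEN
    INSERT (OrderID, Status, Amount)
    VALUES (source.OrderID, source.Status, source.Amount);
-- No WHEN NOT MATCHED BY SOURCE clause, so rows already copied to BUSINESS
-- are never deleted when the shop floor application purges OPERATION.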
If someone requests additional data, you can just write a new SSIS package to transfer it without affecting your main process.
SSIS packages can be scheduled to run from a SQL Agent job (using dtexec, for example).

Progress DB: back up, restore, and query individual tables

Here is the use case: we need to back up some of the tables from a client's server, copy them to our servers, restore them, and then run some queries using ODBC.
I managed to do this for the entire database by using probkup for backup, prorest for restore, and proserve to make it accessible to SQL queries.
However, some of the databases are big (> 8 GB), so we are looking for a way to back up only the tables we need. I couldn't find anything in the probkup documentation about how this can be done.
Progress only supports full database backups.
To get the effect that you are looking for, you could dump (export) the tables that you want and then load them into an empty database.
"proutil dump" and "proutil load" are where you want to start digging.
The details will vary depending on exactly what you want to do and what resources and capabilities you have available to you.
Another option would be to replicate the tables in question to a partial database. Progress has a product called "pro2" that can help with that. It is usually pointed at SQL targets but you could also point it at a Progress database.
Or, if you have programming skills, you could put together a solution using replication triggers (under the covers that's what pro2 does...)
probkup and prorest are block-level programs and can't do a backup or restore by table.
To do what you're asking for, you'll need to dump the data from the source DB's tables and then load it into the target DB.
If your objective is simply to maintain a copy of the DB, you might also try incremental backups. Depending upon your situation, that might speed things up a bit.
Other options include various forms of DB replication, which allow you to keep real- or near-real-time copies of your database.
OpenEdge Replication. With the correct license, you can do query-only access on the replication target, which is good for reporting and analysis.
Third-party replication products. These can be more flexible in terms of both target DBs and limiting the tables to be replicated.
Home-grown replication (by copying and applying AI files). This is not terribly complicated, but you have to factor in the cost of doing the work and maintaining the system. There are some scripts out there that can get you started.
Or, as Tom said, you can get clever with replication via triggers.

Data warehouse data security?

I started at a company as a junior SQL developer on a data warehouse. Ever since, I have been going through the code and learning the dimensional models, etc. I struggle to see any security measures beyond the rights a developer has on the environment.
But if someone were to write code that changes the data in the warehouse in a significant way (updating to the wrong values, inserting false data, deleting records that should be there) and commits it, wouldn't there be a massive impact on the business intelligence side of the warehouse? If they were to pull data to create statistics and there is bad data, then they will have bad statistics.
We have about 7 billion records, and changes made in this way would be really hard to pick up, if they could be spotted at all.
Maybe this is a simple question, but I can't really find an answer, since in a data warehouse you don't have the rigorous relational constraints to check data validity, especially when you move large volumes of data around and the database administrators drop the triggers and indexes as well. The transactional side we get the source data from also doesn't keep history (that's our job).
Any views and suggestions on this subject will be highly appreciated, thank you.
When working with databases or writing code in general, mistakes happen. That is why you ALWAYS separate your development environment from your production environment. Most of us also have an intermediate test environment, where new code is tested and data is validated, before the code is deployed to production.
Furthermore, before any deployment, a full backup is taken. That way, if an error is discovered after deployment, a restore of the backup can be made.
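That pre-deployment backup can be as simple as a one-line script; the database name and path here are placeholders:

-- COPY_ONLY keeps this backup out of the regular backup chain.
BACKUP DATABASE DWH
TO DISK = N'D:\Backups\DWH_PreDeploy.bak'
WITH COPY_ONLY, COMPRESSION, CHECKSUM;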
Preferably, your development and production environments run on separate, but identical, servers. If that is not possible, at least keep the data in separate databases, and use your database server's security to ensure that no one can make changes to the production database unless a deployment is happening.
Now for the deployment itself: make sure you have a checklist to go over every time you make a deployment. The first step on the checklist should be to back up the existing production environment. Write scripts to automate parts of the deployment whenever possible. Use tools such as SQL Schema Compare to identify differences between the development and production databases, and so on. Ideally, deployment should be a matter of pressing one button, everything deploys automagically, and you can go back to developing without worrying.

Best way to archive/backup tables and changes in a large database

I have an interesting issue and requirement for a large multi-schema database.
-The database is around 130 GB in size.
-It is a multi-schema database; each customer has a schema.
-We currently have 102,247 tables in the system.
-Microsoft SQL Server 2008 R2
This is due to customisation requirements of customers, all using a single defined front end.
The issue we have is that our database backups become astronomical, and getting a database restore done to retrieve lost/missing/incorrect data is a nightmare. The initial product did not have defined audit trails, and we don't store 'changes' to data; we simply have one version of the data.
Getting lost data back basically means restoring a full 130 GB backup and applying differentials/transaction log files to get at the data.
We want to introduce a 'changeset' for each important table within each schema, essentially holding a baseline set of the data, then any modified/different data as it is saved, every X minutes. This will have to be a SQL job initially, but I want to know what the best method would be.
Essentially I would run a script to create the 'backup' tables in each schema for the tables we wish to keep backed up.
Then a job would run every X minutes, cycle through each schema, and insert the current data, then any new/changed data as it spots a change (based on the ModifiedDate of the row). It would retain this changelog for around a month before overwriting itself.
We would still have our larger backups, but we wouldn't need as long a retention period for them. My point is: what is the best and most efficient method of checking for changed data and performing the insert?
My gut feeling would be:
INSERT INTO BACKUP_table (UniqueID, col1, col2, col3)
SELECT UniqueID, col1, col2, col3 FROM source_table WHERE ModifiedDate >= DATEADD(mi, -90, CURRENT_TIMESTAMP)
*rough SQL
This would have to run in a loop that goes through all the schemas. A number of tables won't have changed data.
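A rough sketch of what that loop might look like with dynamic SQL (the table and column names are placeholders):

DECLARE @schema sysname, @sql nvarchar(max);
DECLARE schema_cursor CURSOR FOR
    SELECT s.name
    FROM sys.schemas AS s
    WHERE EXISTS (SELECT 1 FROM sys.tables AS t
                  WHERE t.schema_id = s.schema_id AND t.name = N'ImportantTable');
OPEN schema_cursor;
FETCH NEXT FROM schema_cursor INTO @schema;
WHILE @@FETCH_STATUS = 0
BEGIN
    -- ImportantTable / BACKUP_ImportantTable are hypothetical names
    SET @sql = N'INSERT INTO ' + QUOTENAME(@schema) + N'.BACKUP_ImportantTable (UniqueID, col1, col2, col3)
                 SELECT UniqueID, col1, col2, col3
                 FROM ' + QUOTENAME(@schema) + N'.ImportantTable
                 WHERE ModifiedDate >= DATEADD(mi, -90, CURRENT_TIMESTAMP);';
    EXEC sys.sp_executesql @sql;
    FETCH NEXT FROM schema_cursor INTO @schema;
END
CLOSE schema_cursor;
DEALLOCATE schema_cursor;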
Is this even a good method?
What does SO think?
My first response would be to consider keeping each customer in their own database instead of their own schema within a massive database. The key benefits to doing this are:
much less stress on the metadata for a single database
you can perform backups for each customer on whatever schedule you like
when a certain customer has high activity you can move them easily
I managed such a system for several years at my previous job and managing 500 databases was no more complex than managing 10, and the only difference to your applications is the database part of the connection string (which is actually easier to make queries adapt to than a schema prefix).
If you're really committed to keeping everyone in a single database, then what you can consider doing is storing each schema's important tables in their own filegroup, and moving everything out of the primary filegroup. Now you can back up those filegroups independently and, based solely on the full primary backup and a piecemeal restore of the individual filegroup backup, you can bring just that customer's schema online in another location and retrieve the data you're after (perhaps copying it over to the primary database using import/export, BCP, or simple DML queries), without having to restore the entire database. Moving all user data out of the primary filegroup minimizes the time it takes to restore that initial backup and get on to restoring the specific customer's filegroup. While this makes your backup/recovery strategy a little more complex, it does achieve what you're after, I believe.
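A rough sketch of that approach for a single customer filegroup (database, filegroup, and path names are hypothetical):

-- Back up one customer's filegroup on its own schedule.
BACKUP DATABASE MultiTenantDb
    FILEGROUP = N'Customer042_FG'
    TO DISK = N'D:\Backups\Customer042_FG.bak';

-- On another server: piecemeal restore of PRIMARY only...
RESTORE DATABASE MultiTenantDb_Copy
    FILEGROUP = N'PRIMARY'
    FROM DISK = N'D:\Backups\MultiTenantDb_Full.bak'
    WITH PARTIAL, NORECOVERY;

-- ...then bring just that customer's filegroup back.
RESTORE DATABASE MultiTenantDb_Copy
    FILEGROUP = N'Customer042_FG'
    FROM DISK = N'D:\Backups\Customer042_FG.bak'
    WITH NORECOVERY;

-- Under the full recovery model you would apply the relevant log backups here,
-- then finish with:
RESTORE DATABASE MultiTenantDb_Copy WITH RECOVERY;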
Another option is to use a custom log shipping implementation with an intentional delay. We did this for a while by shipping our logs to a reporting server, but waiting 12 hours before applying them. This gave us protection from customers shooting themselves in the foot and then requiring a restore - if they contacted us within 12 hours of their mistake, we likely already had the "before-screw-up" data online on the reporting server, making it trivial to fix it on the primary server. It also doubled as a reporting server for reports looking at data older than 12 hours, taking substantial load away from the primary server.
You can also consider change data capture, but you will obviously need to test the performance and the impact on the rest of your workload. This solution will also depend on the edition of SQL Server you're using, since it is not available in Standard, Web, Workgroup, etc.
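If you do evaluate change data capture, enabling it is a couple of system procedure calls per table; the schema and table names below are placeholders:

-- Enable CDC at the database level, then per table.
EXEC sys.sp_cdc_enable_db;

EXEC sys.sp_cdc_enable_table
    @source_schema = N'Customer042',     -- hypothetical schema
    @source_name   = N'ImportantTable',  -- hypothetical table
    @role_name     = NULL;               -- no gating role in this sketch

-- Changes then accumulate in cdc.Customer042_ImportantTable_CT and can be read
-- with the generated cdc.fn_cdc_get_all_changes_Customer042_ImportantTable function.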

Synchronizing databases

I am developing an Adobe AIR application which stores data locally using a SQLite database.
At any time, I want the end user to synchronize his/her local data to a central MySQL database.
Any tips, advice for getting this right?
Performance and stability are key (besides security ;))
I can think of a couple of ways:
Periodically, dump your MySQL database and create a new SQLite database from the dump. You can then serve the SQLite database (SQLite databases are contained in a single file) for your user's client to download and replace the current local database.
Create a diff script that generates the statements needed to bring the current database up to date (various INSERT, UPDATE, and DELETE statements). To do this, you must continuously record the time of each change in your database (the creation and update time of each row, plus a history of deleted rows).
The user's client downloads the diff file (a text file of the various statements) and applies it to the local database.
Both approaches have their own pros and cons. By dumping the entire database, you make sure all the data gets through, and it is much easier than creating the diff; however, it might put more load on the server, depending on how often the database gets updated between dumps.
On the other hand, diffing between the databases will give you just the data that changed (hopefully), but it is more open to logical errors. It also incurs additional overhead on the client, since it has to create/update all the necessary records instead of just copying a file.
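A minimal sketch of the bookkeeping the diff approach relies on, with hypothetical table names (the :last_sync parameter would be supplied by your sync code):

-- Every synced table carries change timestamps.
CREATE TABLE items (
    id          INTEGER PRIMARY KEY,
    payload     TEXT,
    created_at  TIMESTAMP NOT NULL,
    updated_at  TIMESTAMP NOT NULL
);

-- Deletes are recorded instead of being lost.
CREATE TABLE items_deleted (
    id          INTEGER PRIMARY KEY,
    deleted_at  TIMESTAMP NOT NULL
);

-- Building the diff since the client's last successful sync:
SELECT * FROM items          WHERE updated_at > :last_sync;  -- rows to INSERT/UPDATE locally
SELECT id FROM items_deleted WHERE deleted_at > :last_sync;  -- rows to DELETE locally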
If you're just sync'ing from the server to client, Eran's solution should work.
If you're just sync'ing from the client to the server, just reverse it.
If you're sync'ing both ways, have fun. You'll at minimum probably want to keep change logs, and you'll need to figure out how to deal with conflicts.