I want to understand the best approach for SQL Server architecture on production environment.
Here is my problem:
I have database which has on average around 20,000 records being inserted every second in various tables.
We have reports also implemented for the same, now what's happening is whenever reports is searched by user, performance of other application steeps down.
We have implemented
Table Partitioning
Indexing
And all other required things.
My question is: can anyone suggest an architecture that have different SQL Server databases for reports and application, and they can sync themselves online every time when new data is entered in master SQL Server?
Some what like Master and Slave Architecture. I understand Master and Slave architecture, however need to get more idea around it.
Our main tables are having around 40 millions rows (table partitioning done)
In SQL Server 2008R2 you have database mirroring and replication available, which will keep two databases in sync.
A schema which is efficient for OLTP is unlikely to be efficient for large volume reporting. The 'live' and 'reporting' databases should have different schema with an ETL process moving data from one to the other. I'd would like to negotiate with the business just how synchronised the reporting database needs to be. If the reports are processing large amounts of data they will take some time to run so a lag in data replication will not be noticed, I would suggest. In extremis you could construct a solution using Service Broker to move the data and processing on the reporting server to distribute it amonst the reporting tables.
The numbers you quote (20,000 inserts per second, 40 millions rows in largest table) suggests a record doesn't reside in the DB for long. You would have a significant load performing DELETEs. Optimising these out of peak hours could be sufficient to solve your problems.
Related
Typically on an on-premise SQL server ETL workflow via SSIS, we load data from anywhere into staging tables and then apply validation and transformations to load/merge them into downstream data warehouse tables.
My question is if we should do something similar on Azure, where we have set of staging tables and downstream tables in azure SQL database or use azure storage area as staging and move data from there into final downstream tables via ADF.
As wild is it may seem, we also have a proposal to have separate staging database and downstream database, between which we move using ADF.
There are different models for doing data movement pipelines and no single one is perfect. I'll make a few comments on the common patterns I see in case that will help you make decisions on your application.
For many data warehouses where you are trying to stage in data and create dimensions, there is often a process where you load the raw source data into some other database/tables as raw data and then process it into the format you want to insert into your fact and dimension tables. That process is complicated by the fact that you may have data arrive late or data that is corrected on a later day, so often these systems are designed using partitioned tables on the target fact tables to allow re-processing of a partition worth of data (e.g. a day) without having to reprocess the whole fact table. Furthermore, the transformation process on that staging table may be intensive if the data itself is coming in a form far away from how you want to represent it in your DW. Often in on-premises systems, these are handled in a separate database (potentially on the same SQL Server) to isolate it from the production system. Furthermore, it is sometimes the case that these staging tables are re-creatable from original source data (CSV files or similar), so it is not the store of record for that source material. This allows you to consider using simple recovery mode on that database (which reduces the Log IO requirements and recovery time compared to full recovery). While not every DW uses full recovery mode for the processed DW data (some do dual load to a second machine instead since the pipeline is there), the ability to use full recovery plus physical log replication (AlwaysOn Availability Groups) in SQL Server gives you the flexibility to create a disaster recovery copy of the database in a different region of the world. (You can also do query read scale-out on that server if you would like). There are variations on this basic model, but a lot of on-premises systems have something like this.
When you look at SQL Azure, there are some similarities and some differences that matter when considering how to set up an equivalent model:
You have full recovery on all user databases (but tempdb is in simple recovery). You also have quorum-commit of your changes to N replicas (like in Availability Groups) when using v-core or premium dbs which matters a fair amount because you often have a more generic network topology in public cloud systems vs. a custom system you build yourself. In other words, log commit times may be slower than your current system. For batch systems it does not necessarily matter too much, but you need to be careful to use large enough batch sizes so that you are not waiting on the network all the time in your application. Given that your staging table may also be a SQL Azure database, you need to be aware that it also has quorum commit so you may want to consider which data is going to stay around day-over-day (stays in SQL Azure DB) vs. which can go into tempdb for lower latencies and be re-created if lost.
There is no intra-db resource governance model today in SQL Azure (other than elastic pools which is partial and is targeting a different use case than DW). So, having a separate staging database is a good idea since it isolates your production workload from the processing in the staging database. You avoid noisy neighbor issues with your primary production workload being impacted by the processing of the day's data you want to load.
When you provision machines for on-premises DW, you often buy a sufficiently large storage array/SAN that you can host your workload and potentially many others (consolidation scenarios). The premium/v-core DBs in SQL Azure are set up with local SSDs (with Hyperscale being the new addition where it gives you some cross-machine scale-out model that is a bit like a SAN in some regards). So, you would want to think through the IOPS required for your production system and your staging/loading process. You have the ability to choose to scale up/down each of these to better manage your workload and costs (unlike a CAPEX purchase of a large storage array which is made up front and then you tune workloads to fit into it).
Finally, there is also a SQL DW offering that works a bit differently than SQL Azure - it is optimized for larger DW workloads and has scale-out compute with the ability to scale that up/down as well. Depending on your workload needs, you may want to consider that as your eventual DW target if that is a better fit.
To get to your original question - can you run a data load pipeline on SQL Azure? Yes you can. There are a few caveats compared to your existing experiences on-premises, but it will work. To be fair, there are also people who just load from CSV files or similar directly without using a staging table. Often they don't do as many transformations, so YMMV based on your needs.
Hope that helps.
I'm building a monitoring tool which analyzes information in some sql tables and creates some charts and alerts based on some configurable criterias. However the underlying application is now getting some errors. I think it's because my queries are rather intensive on the tables which causes them to be locked for some amount of time and my idea of a work around is to synchronize the tables to a monitoring database and do my operations there.
Do you have any other ideas? And if I do the sync, whats the best way of syncing tables in SQL server? I prefer if the sync is as close to real-time as possible.
If you are running SQL Server 2008 R2 or above, Transactional Replication is usually a good fit for this type of scenarios and can support near real time synchronization. Here are few links to get familiar with Replication
Overview of replication
Use of replication for data warehousing and reporting applications.
The other solution is to Log ship the transactional database to reporting database.But
Log shipping is asynchronous operation, so the state of data in reporting database will be behind that of the data in transactional database.
You need to log ship the entire database even if you end up using only couple of tables.
The reporting database is not available when it is restoring from the transactional database.
so that would not match to your requirements.
In our company we using many DB servers in different cities. Sometimes data in one server should be synchronized with another. For example, in table "Monitor" values "status" and "date" may be updated very often. My problem is when theese values updated in server A, they also should be updated in server B:
Update Monitor set(date='2013-06-13')
and then
'Update Monitor set(status=4)'
in server A udating of both values is sucsessfull, but in server B (usualy with highest loading) somtimes, in approx. 0.03% cases updated only value date and status is stil old. Can anybody explain, is it possible in DB server with high loading?
It's hard to explain without looking at the boxes, logs and workload each is doing; there are a thousand things that would cause server "B" to miss data, including table and row locks, requests dropped by the network, unfinished transactions and the like. To find out exactly, you'd have to turn on the logging and compare the requests on "A" versus "B". The first thing I'd do, however, would be to look for errors in the SQL logs.
But in general keeping database synchronized across regions is do-able using existing technologies available in MS and Oracle. One scenario involves using a master, central db to receive all requests. It then distributes inserts, updates, delete and queries out to the regional DBs using SSIS or regular DB connectivity over a WAN.
Here's a high-level guide to the technology solution available in SQL Server.
http://msdn.microsoft.com/en-us/library/hh868047.aspx
You were probably looking for a simple answer, but I don't think there is one.
I have a online Database which will be updated Daily from various Sources.
I need to have a local Database with some tables from Server Database which have to check for any changes or new rows in tables in server and update the local Database for particular Intervals of Time. How can I Achieve this???
You may want to look into SQL Server Replication.
Replication will manage the data synchronization between the two copies of your database. You can configure replication for any tables in the database, including all tables. Replication will take care of checking for updates, adds and deletes from the Server Database and transfer the changes to the local database.
You can setup replication to update the local database at near-real-time or you can schedule periodic updates.
Replication is a high-maintenance solution. It's designed to maintain two copies of the same database with significant reliability. This makes replication a good solution when you must avoid data problems or recover from problems with little to no data loss.
If you don't require the high-maintenance solution, then SQL Server Integration Services (SSIS) may be a good alternative. With SSIS, you develop the data transfer and data management solution. Along with managing data problems, you design the solution to identify data adds, deletes and updates.
I've got a database with a table that is updated whose updates I need to push to an archive database so the two tables are identical.
Do I use replication? What is the most efficient choice given I have roughly 14,000 rows (increases daily to become roughly 30,000 rows every two months or so)
Personally, I'd go with transactional replication. Your other options would be to do this in code with a synchronous trigger or asynchronously via service broker, which IMHO, is a much bigger pain to set up than replication.