Parallel Execution of SSIS (ETL) across multiple SQL databases

Parallel Execution of SSIS (ETL) across multiple SQL databases - sql

I have multitenancy database (one DB per customer). And at present we have set the SQL memory to 60 percent of RAM and running ETL across one site at a time.
Is it possible to run ETLs across multiple sites/database at the same time?
Note - During ETL execution other operation would be slow as ETL uses maximum RAM, hence wanted to know can two ETLs be run at a same across different database.

SSIS Packages can be executed in parallel. However, as your Operational databases are on the same server, and the database that you are importing to is also on the same server, I am failing to see how that would speed up the ETL, unless the databases are on separate disks which can help parallelize the reads and writes.
Also it may be more dangerous, as if you cannot afford running ETL at the point in time (like when the businesses open), you will have to stop all packages, and then restart all of them

Related

Staging tables in DB vs storage area

Typically on an on-premise SQL server ETL workflow via SSIS, we load data from anywhere into staging tables and then apply validation and transformations to load/merge them into downstream data warehouse tables.
My question is if we should do something similar on Azure, where we have set of staging tables and downstream tables in azure SQL database or use azure storage area as staging and move data from there into final downstream tables via ADF.
As wild is it may seem, we also have a proposal to have separate staging database and downstream database, between which we move using ADF.

There are different models for doing data movement pipelines and no single one is perfect. I'll make a few comments on the common patterns I see in case that will help you make decisions on your application.
For many data warehouses where you are trying to stage in data and create dimensions, there is often a process where you load the raw source data into some other database/tables as raw data and then process it into the format you want to insert into your fact and dimension tables. That process is complicated by the fact that you may have data arrive late or data that is corrected on a later day, so often these systems are designed using partitioned tables on the target fact tables to allow re-processing of a partition worth of data (e.g. a day) without having to reprocess the whole fact table. Furthermore, the transformation process on that staging table may be intensive if the data itself is coming in a form far away from how you want to represent it in your DW. Often in on-premises systems, these are handled in a separate database (potentially on the same SQL Server) to isolate it from the production system. Furthermore, it is sometimes the case that these staging tables are re-creatable from original source data (CSV files or similar), so it is not the store of record for that source material. This allows you to consider using simple recovery mode on that database (which reduces the Log IO requirements and recovery time compared to full recovery). While not every DW uses full recovery mode for the processed DW data (some do dual load to a second machine instead since the pipeline is there), the ability to use full recovery plus physical log replication (AlwaysOn Availability Groups) in SQL Server gives you the flexibility to create a disaster recovery copy of the database in a different region of the world. (You can also do query read scale-out on that server if you would like). There are variations on this basic model, but a lot of on-premises systems have something like this.
When you look at SQL Azure, there are some similarities and some differences that matter when considering how to set up an equivalent model:
You have full recovery on all user databases (but tempdb is in simple recovery). You also have quorum-commit of your changes to N replicas (like in Availability Groups) when using v-core or premium dbs which matters a fair amount because you often have a more generic network topology in public cloud systems vs. a custom system you build yourself. In other words, log commit times may be slower than your current system. For batch systems it does not necessarily matter too much, but you need to be careful to use large enough batch sizes so that you are not waiting on the network all the time in your application. Given that your staging table may also be a SQL Azure database, you need to be aware that it also has quorum commit so you may want to consider which data is going to stay around day-over-day (stays in SQL Azure DB) vs. which can go into tempdb for lower latencies and be re-created if lost.
There is no intra-db resource governance model today in SQL Azure (other than elastic pools which is partial and is targeting a different use case than DW). So, having a separate staging database is a good idea since it isolates your production workload from the processing in the staging database. You avoid noisy neighbor issues with your primary production workload being impacted by the processing of the day's data you want to load.
When you provision machines for on-premises DW, you often buy a sufficiently large storage array/SAN that you can host your workload and potentially many others (consolidation scenarios). The premium/v-core DBs in SQL Azure are set up with local SSDs (with Hyperscale being the new addition where it gives you some cross-machine scale-out model that is a bit like a SAN in some regards). So, you would want to think through the IOPS required for your production system and your staging/loading process. You have the ability to choose to scale up/down each of these to better manage your workload and costs (unlike a CAPEX purchase of a large storage array which is made up front and then you tune workloads to fit into it).
Finally, there is also a SQL DW offering that works a bit differently than SQL Azure - it is optimized for larger DW workloads and has scale-out compute with the ability to scale that up/down as well. Depending on your workload needs, you may want to consider that as your eventual DW target if that is a better fit.
To get to your original question - can you run a data load pipeline on SQL Azure? Yes you can. There are a few caveats compared to your existing experiences on-premises, but it will work. To be fair, there are also people who just load from CSV files or similar directly without using a staging table. Often they don't do as many transformations, so YMMV based on your needs.
Hope that helps.

Azure SQL Database vs. MS SQL Server on Dedicated Machine

I'm currently running an instance of MS SQL Server 2014 (12.1.4100.1) on a dedicated machine I rent for $270/month with the following specs:
Intel Xeon E5-1660 processor (six physical 3.3ghz cores +
hyperthreading + turbo->3.9ghz)
64 GB registered DDR3 ECC memory
240GB Intel SSD
45000 GB of bandwidth transfer
I've been toying around with Azure SQL Database for a bit now, and have been entertaining the idea of switching over to their platform. I fired up an Azure SQL Database using their P2 Premium pricing tier on a V12 server (just to test things out), and loaded a copy of my existing database (from the dedicated machine).
I ran several sets of queries side-by-side, one against the database on the dedicated machine, and one against the P2 Azure SQL Database. The results were sort of shocking: my dedicated machine outperformed (in terms of execution time) the Azure db by a huge margin each time. Typically, the dedicated db instance would finish in under 1/2 to 1/3 of the time that it took the Azure db to execute.
Now, I understand the many benefits of the Azure platform. It's managed vs. my non-managed setup on the dedicated machine, they have point-in-time restore better than what I have, the firewall is easily configured, there's geo-replication, etc., etc. But I have a database with hundreds of tables with tens to hundreds of millions of records in each table, and sometimes need to query across multiple joins, etc., so performance in terms of execution time really matters. I just find it shocking that a ~$930/month service performs that poorly next to a $270/month dedicated machine rental. I'm still pretty new to SQL as a whole, and very new to servers/etc., but does this not add up to anyone else? Does anyone perhaps have some insight into something I'm missing here, or are those other, "managed" features of Azure SQL Database supposed to make up the difference in price?
Bottom line is I'm beginning to outgrow even my dedicated machine's capabilities, and I had really been hoping that Azure's SQL Database would be a nice, next stepping stone, but unless I'm missing something, it's not. I'm too small of a business still to go out and spend hundreds of thousands on some other platform.
Anyone have any advice on if I'm missing something, or is the performance I'm seeing in line with what you would expect? Do I have any other options that can produce better performance than the dedicated machine I'm running currently, but don't cost in the tens of thousand/month? Is there something I can do (configuration/setting) for my Azure SQL Database that would boost execution time? Again, any help is appreciated.
EDIT: Let me revise my question to maybe make it a little more clear: is what I'm seeing in terms of sheer execution time performance to be expected, where a dedicated server # $270/month is well outperforming Microsoft's Azure SQL DB P2 tier # $930/month? Ignore the other "perks" like managed vs. unmanaged, ignore intended use like Azure being meant for production, etc. I just need to know if I'm missing something with Azure SQL DB, or if I really am supposed to get MUCH better performance out of a single dedicated machine.

(Disclaimer: I work for Microsoft, though not on Azure or SQL Server).
"Azure SQL" isn't equivalent to "SQL Server" - and I personally wish that we did offer a kind of "hosted SQL Server" instead of Azure SQL.
On the surface the two are the same: they're both relational database systems with the power of T-SQL to query them (well, they both, under-the-hood use the same DBMS).
Azure SQL is different in that the idea is that you have two databases: a development database using a local SQL Server (ideally 2012 or later) and a production database on Azure SQL. You (should) never modify the Azure SQL database directly, and indeed you'll find that SSMS does not offer design tools (Table Designer, View Designer, etc) for Azure SQL. Instead, you design and work with your local SQL Server database and create "DACPAC" files (or special "change" XML files, which can be generated by SSDT) which then modify your Azure DB such that it copies your dev DB, a kind of "design replication" system.
Otherwise, as you noticed, Azure SQL offers built-in resiliency, backups, simplified administration, etc.
As for performance, is it possible you were missing indexes or other optimizations? You also might notice slightly higher latency with Azure SQL compared to a local SQL Server, I've seen ping times (from an Azure VM to an Azure SQL host) around 5-10ms, which means you should design your application to be less-chatty or to parallelise data retrieval operations in order to reduce page load times (assuming this is a web-application you're building).

Perf and availability aside, there are several other important factors to consider:
Total cost: your $270 rental cost is only one of many cost factors. Space, power and hvac are other physical costs. Then there's the cost of administration. Think work you have to do each patch Tuesday and when either Windows or SQL Server ships a service pack or cumulative update. Even if you don't test them before rolling out, it still takes time and effort. If you do test, then there's a second machine and duplicating the product instance and workload for test.
Security: there is a LOT written about how bad and dangerous and risky it is to store any data you care about in the cloud. Personally, I've seen way worse implementations and processes on security with local servers (even in banks and federal agencies) than I've seen with any of the major cloud providers (Microsoft, Amazon, Google). It's a lot of work getting things right then even more work keeping them right. Also, you can see and audit their security SLAs (See Azure's at http://azure.microsoft.com/en-us/support/trust-center/).
Scalability: not just raw scalability but the cost and effort to scale. Azure SQL DB recently released the huge P11 edition which has 7x the compute capacity of the P2 you tested with. Scaling up and down is not instantaneous but really easy and reasonably quick. Best part is (for me anyway), it can be bumped to some higher edition when I run large queries or reindex operations then back down again for "normal" loads. This is hard to do with a regular SQL Server on bare metal - either rent/buy a really big box that sits idle 90% of the time or take downtime to move. Slightly easier if in a VM; you can increase memory online but still need to bounce the instance to increase CPU; your Azure SQL DB stays online during scale up/down operations.

There is an alternative from Microsoft to Azure SQL DB:
“Provision a SQL Server virtual machine in Azure”
https://azure.microsoft.com/en-us/documentation/articles/virtual-machines-provision-sql-server/
A detailed explanation of the differences between the two offerings: “Understanding Azure SQL Database and SQL Server in Azure VMs”
https://azure.microsoft.com/en-us/documentation/articles/data-management-azure-sql-database-and-sql-server-iaas/
One significant difference between your stand alone SQL Server and Azure SQL DB is that with SQL DB you are paying for high levels of availability, which is achieved by running multiple instances on different machines. This would be like renting 4 of your dedicated machines and running them in an AlwaysOn Availability Group, which would change both your cost and performance. However, as you never mentioned availability, I'm guessing this isn't a concern in your scenario. SQL Server in a VM may better match your needs.

SQL DB has built in availability (which can impact performance), point in time restore capability and DR features. You have the option to scale up / down your DB based on your usage to reduce the cost. You can improve your query performance using Global query (shard data). SQl DB manages auto upgrades and patching and greatly improves the manageability story. You may need to pay a little premium for that. Application level caching / evenly distributing the load, downgrading when cold etc. may help improve your database performance and optimize the cost.

SISS and SQL Server CPU Performence

I'm running a 32 core SQL Server Box. Which also runs a SISS Server, where SISS packages are stored and run.
The load on the database is very low, nightly updates, a handfull are tables updated during daytime, and otherwise is just a lot of select statements. Typical DW with a frontend that caches data.
My issues is that when I run a SISS package, in the studio, then it executes within a hour. But if I run it on the SISS Server it runs for hours. The package basically aggregates data from various tables, and places it in one table. Other package of other types, also run very slow when run from SISS Server.
CPU usage is never above 7%, memory is at 29gb of 32 on the server.
Is there a way that I can prioritize CPU away from the SQL Server and over to SISS Server?
I believe CPU priotization is the issue, but I might be wrong.

When SSIS runs out of memory it will typically slow down a lot before it fails, as it will start writing and reading temp files for some operations. I suspect this is your problem.
To avoid this I would reduce the maximum Memory allocated to the SQL Server Database Engine, e.g. to 8GB. Obviously you need to consider potential impacts on other SQL Server Database Engine operations, but it usually does surprisingly well with less Memory.
PS: you are admirably consistent with your mis-spelling of SSIS ... :-)

SQL Server Architecture on Production Environment

I want to understand the best approach for SQL Server architecture on production environment.
Here is my problem:
I have database which has on average around 20,000 records being inserted every second in various tables.
We have reports also implemented for the same, now what's happening is whenever reports is searched by user, performance of other application steeps down.
We have implemented
Table Partitioning
Indexing
And all other required things.
My question is: can anyone suggest an architecture that have different SQL Server databases for reports and application, and they can sync themselves online every time when new data is entered in master SQL Server?
Some what like Master and Slave Architecture. I understand Master and Slave architecture, however need to get more idea around it.
Our main tables are having around 40 millions rows (table partitioning done)

In SQL Server 2008R2 you have database mirroring and replication available, which will keep two databases in sync.
A schema which is efficient for OLTP is unlikely to be efficient for large volume reporting. The 'live' and 'reporting' databases should have different schema with an ETL process moving data from one to the other. I'd would like to negotiate with the business just how synchronised the reporting database needs to be. If the reports are processing large amounts of data they will take some time to run so a lag in data replication will not be noticed, I would suggest. In extremis you could construct a solution using Service Broker to move the data and processing on the reporting server to distribute it amonst the reporting tables.
The numbers you quote (20,000 inserts per second, 40 millions rows in largest table) suggests a record doesn't reside in the DB for long. You would have a significant load performing DELETEs. Optimising these out of peak hours could be sufficient to solve your problems.

How SSIS Dataflow really works?

I have an SSIS package which transfers the data from one database to another.
The SSIS package runs on an application server.
I am thinking of moving one of the two databases to another data server. Will there be an impact in performance? How is the data flows in SSIS i.e. does all the data go in the application server where the SSIS runs and then to the destination database?

SSIS is a client-side process, so if it is running on a server other than the machine running the DBMS the traffic will be going over the network. Your question is not very clearly worded, but I think you want to know whether moving a DB will affect performance given that the SSIS package is already running on a separate machine.
If the SSIS job is already running on an application server that is a physically separate machine to the DB server then moving one of the databases will probably not affect the performance unless it has a radically slower network connection than the other.

I recently came across the same situation and we upgraded our source system to a better configuration box. I didn't have to do anything on my part, but the data load times from the source to SQL box were cut down from approximately 40 minutes to under 12 minutes on average. To answer your question, you will only see any performance variance depending on 1) Your new systems resources and 2) If you make changes to the box hosting your SQL Server.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas