Azure SQL Database or SQL Data Warehouse

I am working on a solution architecture and am having a hard time choosing between Azure SQL DB and SQL DW.
The current scope involves developing a real-time BI reporting solution based on multiple sources, but in the long run the solution may be extended into a full-fledged EDW with marts.
I initially thought of using SQL DW so that its MPP capabilities could be used for the future scope. But when I spoke to a mate who recently used SQL DW, he explained that development in SQL DW is not similar to SQL DB.
I have previously worked on real-time reporting with no scope for an EDW, and we successfully used SQL DB. With SQL DB we can also create facts, dimensions, and marts.
Is there a strong case where I should be choosing SQL DW over SQL DB?

I think the two most important data points you can have here are the volume of data you're processing and the number of concurrent queries that you need to support. When talking about processing large volumes of data, and by large I mean more than 3 TB (which is not even really large, but large enough), Azure SQL Data Warehouse becomes a juggernaut. The parallel processing is simply amazing (it's amazing at smaller volumes too, but then you're paying a lot of money for overkill). However, the one issue can be the simultaneous query limit. It currently has a limit of 128 concurrent queries with a limit of 1,000 queries queued (read more here). If you're using the Data Warehouse as a data warehouse, to process large amounts of data and then feed them into data marts where the majority of the querying takes place, this isn't a big deal. If you're planning to open it up to large-volume querying, it quickly becomes problematic.
Answer those two questions, query volume and data volume, and you can more easily decide between the two.
Additional factors can include the subset of T-SQL currently supported, which is smaller than in traditional SQL Server. Again, for most purposes around data warehousing this is not an issue; for a full-blown reporting server, it might be.
Most people successfully implementing Azure SQL Data Warehouse are using a combination of the warehouse for processing and storage and Azure SQL Database for data marts. There are exceptions when dealing with very large data volumes that need the parallel processing, but don't require lots of queries.

The 4 TB size limit of Azure SQL Database may be an important factor to consider when choosing between the two options. Queries can be faster with Azure SQL Data Warehouse since it is an MPP solution. You can pause Azure SQL DW to save costs; with Azure SQL Database you can scale down to the Basic tier (when possible).
Azure SQL DB can support up to 6,400 concurrent queries and 32K active connections, whereas Azure SQL DW can only support up to 32 concurrent queries and 1,024 active connections. So SQL DB is a much better solution if you are building something like a dashboard with thousands of users.
As for development, Azure SQL Database supports Entity Framework, but Azure SQL DW does not.
I also want to give you a quick glimpse of how the two compare in terms of performance: 1 DWU is approximately 7.5 DTUs (Database Throughput Units, used to express the horsepower of an OLTP Azure SQL Database) in capacity, although the two are not exactly comparable. More information about this comparison here.
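As a rough back-of-envelope illustration of that ratio (my own sketch, not an official equivalence; the numbers are only as good as the approximation above):

```sql
-- Hypothetical conversion using the ~7.5 DTUs-per-DWU rule of thumb above.
DECLARE @dwu int = 400;                    -- e.g. a DW400 warehouse
SELECT @dwu * 7.5 AS approx_dtu_capacity;  -- ~3,000 DTUs of raw capacity
```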

Thanks for your responses, Grant and Alberto. They have cleared a lot of the air around making a choice.
Since the data would be subject to dashboarding and ad-hoc querying, I am tilting towards SQL Database instead of SQL DW.
Thanks again.

Related

Azure SQL Data Warehouse - Max concurrent queries

I have to decide between an Azure SQL Data Warehouse and a data warehouse based on Microsoft SQL Server virtualized on a VM.
The thing I do not understand is the limit of 32 max concurrent queries. The same limit for Azure SQL Database is 6,400.
To be honest, if I want to use Azure SQL Data Warehouse in an enterprise environment, 32 concurrent queries seems laughable, or I am misunderstanding something.
Let's assume a company with 10,000 employees worldwide, and I set up an Azure SQL Data Warehouse for reporting purposes where, say, 250 employees are permanently querying it and another 250 employees are working with a business app that uses data from the data warehouse. How should this work without severe performance problems?
This isn't the issue that you might think.
First, the limit is now 128. (https://learn.microsoft.com/en-us/azure/sql-data-warehouse/memory-and-concurrency-limits#gen2-1)
Second, this is well above the concurrency of the next most concurrent single-cluster warehouse. I've often wondered what marketing mistake Microsoft made such that concurrency is seen as a limitation of ASDW but is rarely mentioned for less concurrent competitors.
Third, the best way to serve thousands of concurrent query users (i.e., dashboards) is through Power BI hybrid queries and (potentially) Azure Analysis Services. This gives extremely high concurrency and interactivity.
Perhaps the best evidence I can give is that I work with Azure SQL Data Warehouse customers on a daily basis. I often get questions like this when a customer is first exposed to ASDW, but I never get questions about concurrency by the time they're in production. In other words, the issue of "concurrency" just isn't important for most customers.
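If you want to see whether concurrency is actually a bottleneck in practice, you can watch the request queue on the warehouse itself. A minimal sketch, using the sys.dm_pdw_exec_requests DMV that Azure SQL Data Warehouse exposes:

```sql
-- Count running vs. queued (suspended) requests; a persistently large
-- 'Suspended' bucket suggests you are hitting the concurrency limit.
SELECT [status], COUNT(*) AS request_count
FROM sys.dm_pdw_exec_requests
WHERE [status] IN ('Running', 'Suspended')
GROUP BY [status];
```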

Staging tables in DB vs storage area

In a typical on-premises SQL Server ETL workflow via SSIS, we load data from anywhere into staging tables and then apply validation and transformations to load/merge it into downstream data warehouse tables.
My question is whether we should do something similar on Azure, where we have a set of staging tables and downstream tables in an Azure SQL database, or use Azure storage as the staging area and move data from there into the final downstream tables via ADF.
As wild as it may seem, we also have a proposal to have a separate staging database and downstream database, moving data between them using ADF.
There are different models for doing data movement pipelines and no single one is perfect. I'll make a few comments on the common patterns I see in case that will help you make decisions on your application.
For many data warehouses where you are trying to stage in data and create dimensions, there is often a process where you load the raw source data into some other database/tables as raw data and then process it into the format you want to insert into your fact and dimension tables. That process is complicated by the fact that you may have data arrive late or data that is corrected on a later day, so often these systems are designed using partitioned tables on the target fact tables to allow re-processing of a partition's worth of data (e.g. a day) without having to reprocess the whole fact table. Furthermore, the transformation process on that staging table may be intensive if the data itself is coming in a form far away from how you want to represent it in your DW.

Often in on-premises systems, these are handled in a separate database (potentially on the same SQL Server) to isolate it from the production system. Furthermore, it is sometimes the case that these staging tables are re-creatable from original source data (CSV files or similar), so it is not the store of record for that source material. This allows you to consider using simple recovery mode on that database (which reduces the log IO requirements and recovery time compared to full recovery).

While not every DW uses full recovery mode for the processed DW data (some do a dual load to a second machine instead, since the pipeline is there), the ability to use full recovery plus physical log replication (AlwaysOn Availability Groups) in SQL Server gives you the flexibility to create a disaster recovery copy of the database in a different region of the world. (You can also do query read scale-out on that server if you would like.) There are variations on this basic model, but a lot of on-premises systems have something like this.
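As a concrete illustration of that partition-based re-processing pattern, here is a minimal T-SQL sketch; the table, column, and partition-function names (FactSales, StageSales, pfDaily) are hypothetical:

```sql
-- Re-process one day of FactSales without touching the rest of the table.
-- Assumes FactSales is partitioned by day via partition function pfDaily,
-- FactSales_Old is an empty table with an identical schema, and StageSales
-- holds the reloaded day with a matching check constraint on the date.
BEGIN TRAN;

-- Swap the stale day's partition out into the empty archive table...
ALTER TABLE dbo.FactSales
    SWITCH PARTITION $PARTITION.pfDaily('2018-06-01') TO dbo.FactSales_Old;

-- ...and swap the freshly processed day back in from staging.
ALTER TABLE dbo.StageSales
    SWITCH TO dbo.FactSales PARTITION $PARTITION.pfDaily('2018-06-01');

COMMIT;
```

Because SWITCH is a metadata-only operation, the swap is nearly instantaneous regardless of how large the day's data is.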
When you look at SQL Azure, there are some similarities and some differences that matter when considering how to set up an equivalent model:
You have full recovery on all user databases (but tempdb is in simple recovery). You also have quorum commit of your changes to N replicas (as in Availability Groups) when using v-core or Premium DBs, which matters a fair amount because you often have a more generic network topology in public cloud systems vs. a custom system you build yourself. In other words, log commit times may be slower than on your current system. For batch systems this does not necessarily matter too much, but you need to be careful to use large enough batch sizes so that you are not waiting on the network all the time in your application. Given that your staging table may also be a SQL Azure database, you need to be aware that it also has quorum commit, so you may want to consider which data is going to stay around day-over-day (stays in a SQL Azure DB) vs. which can go into tempdb for lower latencies and be re-created if lost.
There is no intra-DB resource governance model today in SQL Azure (other than elastic pools, which are partial and target a different use case than DW). So having a separate staging database is a good idea, since it isolates your production workload from the processing in the staging database. You avoid noisy-neighbor issues where your primary production workload is impacted by the processing of the day's data you want to load.
When you provision machines for an on-premises DW, you often buy a sufficiently large storage array/SAN that you can host your workload and potentially many others (consolidation scenarios). The Premium/v-core DBs in SQL Azure are set up with local SSDs (with Hyperscale being the new addition, which gives you a cross-machine scale-out model that is a bit like a SAN in some regards). So you would want to think through the IOPS required for your production system and your staging/loading process. You have the ability to scale each of these up or down to better manage your workload and costs (unlike a CAPEX purchase of a large storage array, which is made up front, after which you tune workloads to fit into it).
Finally, there is also a SQL DW offering that works a bit differently than SQL Azure - it is optimized for larger DW workloads and has scale-out compute with the ability to scale that up/down as well. Depending on your workload needs, you may want to consider that as your eventual DW target if that is a better fit.
To get to your original question - can you run a data load pipeline on SQL Azure? Yes, you can. There are a few caveats compared to your existing experiences on-premises, but it will work. To be fair, there are also people who just load from CSV files or similar directly, without using a staging table. Often they don't do as many transformations, so YMMV based on your needs.
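For that load-straight-from-CSV variant, a minimal sketch (the table and path are hypothetical; on Azure SQL DB you would typically point this at an external data source in Blob storage rather than a local path):

```sql
-- Load a CSV directly into a staging/target table, skipping the header row.
BULK INSERT dbo.StageSales
FROM 'C:\loads\sales_20180601.csv'
WITH (FIELDTERMINATOR = ',',
      ROWTERMINATOR = '\n',
      FIRSTROW = 2);
```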
Hope that helps.

Directly query databases with 1b rows of data using Tableau or PowerBI

I occasionally see people or companies showcasing querying a db/cube/etc. from Tableau or Power BI with less than 5 s of response time, sometimes even less than 1 s. How do they do this? Is the data optimized to the gills? Are they using a massive DB?
On a related note, I've been experimenting with analysing a much smaller dataset (100M rows) with Tableau against SQL DW, and it still takes nearly a minute to calculate. Should I try some other tech? Perhaps Analysis Services or a big data technology?
These are usually one-off data analysis assignments so I do not have to worry about data growth.
Live connections in Tableau will only be as fast as the underlying data source. If you look at your log (C:\Users\username\Documents\My Tableau Repository\Logs\log.txt), you will see the SQL Tableau issued to the database. Run that query on the server itself; it should take about the same amount of time. Side note: Tableau has a new data engine coming with the next release, called 'Hyper'. It should allow you to create an extract from 2B rows with very good performance. You can download the beta now; more info here.
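One way to make that comparison fairly is to time the captured query on the server with execution statistics turned on (shown here for SQL Server/Azure SQL DB; SQL DW has its own request-timing DMVs instead). The pasted query below is a hypothetical stand-in for whatever you find in Tableau's log:

```sql
-- Report server-side CPU/elapsed time and IO for the query Tableau issued.
SET STATISTICS TIME ON;
SET STATISTICS IO ON;

-- Paste the query from Tableau's log here; placeholder example:
SELECT COUNT(*) FROM dbo.FactSales;

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;
```

If the server-side time roughly matches what Tableau reports, the bottleneck is the data source, not Tableau.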

Reduce SQL Azure replication cost

Here is the situation: we have intense request demand on our SQL Azure database, so we decided to replicate the database, one copy for reads (SQL queries) and one for writes (updates). Given that we have to move to the Premium service, the database cost increases from roughly 300 dollars to 2,000 dollars, which dramatically affects the project's profitability. Is there any other solution to optimize that cost?
Thank you
Are any of these options viable for you?
Running traditional SQL Server in an Azure VM (not limited to the Azure SQL DTU model)
Using a message queue to move writes to an eventually consistent model with competing consumers
Creating de-normalized/pre-computed tables (i.e. SELECT INTO) for the read operations during low-use times
Using materialized and/or indexed views for reads (reduced CPU/DTU usage; see the sketch below)
Pushing data into Azure Table storage or DocumentDB for reads
Just ideas, but I feel your pain. The DTU thing is both a blessing and a curse.
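As a sketch of the indexed-view idea (the schema and names are hypothetical; indexed views require SCHEMABINDING, a COUNT_BIG(*) column when grouping, and non-nullable columns under SUM):

```sql
-- Materialize a per-customer aggregate so dashboards read the
-- precomputed rows instead of scanning dbo.Orders on every query.
CREATE VIEW dbo.vOrderTotals
WITH SCHEMABINDING
AS
SELECT CustomerId,
       COUNT_BIG(*)     AS OrderCount,
       SUM(TotalAmount) AS TotalSpent   -- TotalAmount assumed NOT NULL
FROM dbo.Orders
GROUP BY CustomerId;
GO

-- The unique clustered index is what actually persists the view's rows.
CREATE UNIQUE CLUSTERED INDEX IX_vOrderTotals
    ON dbo.vOrderTotals (CustomerId);
```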

Azure SQL Database vs. MS SQL Server on Dedicated Machine

I'm currently running an instance of MS SQL Server 2014 (12.1.4100.1) on a dedicated machine I rent for $270/month with the following specs:
Intel Xeon E5-1660 processor (six physical 3.3 GHz cores + hyperthreading, turbo to 3.9 GHz)
64 GB registered DDR3 ECC memory
240 GB Intel SSD
45,000 GB of bandwidth transfer
I've been toying around with Azure SQL Database for a bit now, and have been entertaining the idea of switching over to their platform. I fired up an Azure SQL Database using their P2 Premium pricing tier on a V12 server (just to test things out), and loaded a copy of my existing database (from the dedicated machine).
I ran several sets of queries side-by-side, one against the database on the dedicated machine, and one against the P2 Azure SQL Database. The results were sort of shocking: my dedicated machine outperformed (in terms of execution time) the Azure DB by a huge margin each time. Typically, the dedicated instance would finish in one half to one third of the time it took the Azure DB to execute.
Now, I understand the many benefits of the Azure platform. It's managed vs. my non-managed setup on the dedicated machine, they have point-in-time restore better than what I have, the firewall is easily configured, there's geo-replication, etc., etc. But I have a database with hundreds of tables with tens to hundreds of millions of records in each table, and sometimes need to query across multiple joins, etc., so performance in terms of execution time really matters. I just find it shocking that a ~$930/month service performs that poorly next to a $270/month dedicated machine rental. I'm still pretty new to SQL as a whole, and very new to servers/etc., but does this not add up to anyone else? Does anyone perhaps have some insight into something I'm missing here, or are those other, "managed" features of Azure SQL Database supposed to make up the difference in price?
Bottom line is I'm beginning to outgrow even my dedicated machine's capabilities, and I had really been hoping that Azure's SQL Database would be a nice, next stepping stone, but unless I'm missing something, it's not. I'm too small of a business still to go out and spend hundreds of thousands on some other platform.
Anyone have any advice on whether I'm missing something, or is the performance I'm seeing in line with what you would expect? Do I have any other options that can produce better performance than the dedicated machine I'm running currently, but don't cost tens of thousands per month? Is there something I can do (configuration/setting) for my Azure SQL Database that would boost execution time? Again, any help is appreciated.
EDIT: Let me revise my question to maybe make it a little more clear: is what I'm seeing in terms of sheer execution time performance to be expected, where a dedicated server at $270/month is well outperforming Microsoft's Azure SQL DB P2 tier at $930/month? Ignore the other "perks" like managed vs. unmanaged, ignore intended use like Azure being meant for production, etc. I just need to know if I'm missing something with Azure SQL DB, or if I really am supposed to get MUCH better performance out of a single dedicated machine.
(Disclaimer: I work for Microsoft, though not on Azure or SQL Server).
"Azure SQL" isn't equivalent to "SQL Server" - and I personally wish that we did offer a kind of "hosted SQL Server" instead of Azure SQL.
On the surface the two are the same: they're both relational database systems with the power of T-SQL to query them (indeed, under the hood they both use the same DBMS).
Azure SQL is different in that the intended model is to have two databases: a development database on a local SQL Server (ideally 2012 or later) and a production database on Azure SQL. You (should) never modify the Azure SQL database directly; indeed, you'll find that SSMS does not offer design tools (Table Designer, View Designer, etc.) for Azure SQL. Instead, you design and work with your local SQL Server database and create "DACPAC" files (or special "change" XML files, which can be generated by SSDT) which then modify your Azure DB so that it matches your dev DB, a kind of "design replication" system.
Otherwise, as you noticed, Azure SQL offers built-in resiliency, backups, simplified administration, etc.
As for performance, is it possible you were missing indexes or other optimizations? You may also notice slightly higher latency with Azure SQL compared to a local SQL Server; I've seen ping times (from an Azure VM to an Azure SQL host) around 5-10 ms, which means you should design your application to be less chatty or to parallelise data retrieval operations in order to reduce page load times (assuming this is a web application you're building).
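A minimal sketch of the "less chatty" idea: bundle related lookups into a single procedure so one round trip pays the 5-10 ms latency once instead of several times (all names here are hypothetical):

```sql
-- One call returns two result sets; the app reads both from one round trip.
CREATE PROCEDURE dbo.GetCustomerPageData
    @CustomerId int
AS
BEGIN
    SET NOCOUNT ON;

    -- Result set 1: customer header
    SELECT Name, Email
    FROM dbo.Customers
    WHERE CustomerId = @CustomerId;

    -- Result set 2: recent orders
    SELECT OrderId, OrderDate, TotalAmount
    FROM dbo.Orders
    WHERE CustomerId = @CustomerId;
END
```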
Perf and availability aside, there are several other important factors to consider:
Total cost: your $270 rental is only one of many cost factors. Space, power, and HVAC are other physical costs. Then there's the cost of administration. Think of the work you have to do each Patch Tuesday, and whenever Windows or SQL Server ships a service pack or cumulative update. Even if you don't test them before rolling out, it still takes time and effort. If you do test, then there's a second machine and duplicating the product instance and workload for test.
Security: there is a LOT written about how bad and dangerous and risky it is to store any data you care about in the cloud. Personally, I've seen way worse implementations and processes on security with local servers (even in banks and federal agencies) than I've seen with any of the major cloud providers (Microsoft, Amazon, Google). It's a lot of work getting things right then even more work keeping them right. Also, you can see and audit their security SLAs (See Azure's at http://azure.microsoft.com/en-us/support/trust-center/).
Scalability: not just raw scalability but the cost and effort to scale. Azure SQL DB recently released the huge P11 edition, which has 7x the compute capacity of the P2 you tested with. Scaling up and down is not instantaneous, but it is really easy and reasonably quick. The best part (for me, anyway) is that it can be bumped to a higher edition when I run large queries or reindex operations, then back down again for "normal" loads. This is hard to do with a regular SQL Server on bare metal: either rent/buy a really big box that sits idle 90% of the time, or take downtime to move. It's slightly easier in a VM; you can increase memory online but still need to bounce the instance to increase CPU. Your Azure SQL DB stays online during scale-up/down operations.
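That scale-up/down dance can be done in plain T-SQL; a sketch with a hypothetical database name (the same change can also be made from the portal or PowerShell):

```sql
-- Bump the database up before heavy reindexing or reporting runs...
ALTER DATABASE MyAppDb MODIFY (SERVICE_OBJECTIVE = 'P11');

-- ...and scale it back down for the normal daily load.
ALTER DATABASE MyAppDb MODIFY (SERVICE_OBJECTIVE = 'P2');
```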
There is an alternative from Microsoft to Azure SQL DB:
“Provision a SQL Server virtual machine in Azure”
https://azure.microsoft.com/en-us/documentation/articles/virtual-machines-provision-sql-server/
A detailed explanation of the differences between the two offerings: “Understanding Azure SQL Database and SQL Server in Azure VMs”
https://azure.microsoft.com/en-us/documentation/articles/data-management-azure-sql-database-and-sql-server-iaas/
One significant difference between your stand-alone SQL Server and Azure SQL DB is that with SQL DB you are paying for high levels of availability, which is achieved by running multiple instances on different machines. This would be like renting four of your dedicated machines and running them in an AlwaysOn Availability Group, which would change both your cost and your performance. However, since you never mentioned availability, I'm guessing this isn't a concern in your scenario. SQL Server in a VM may better match your needs.
SQL DB has built-in availability (which can impact performance), point-in-time restore capability, and DR features. You have the option to scale your DB up or down based on your usage to reduce cost. You can improve your query performance using global query (sharded data). SQL DB manages upgrades and patching automatically, which greatly improves the manageability story; you may need to pay a little premium for that. Application-level caching, evenly distributing the load, downgrading when cold, etc. may help improve your database performance and optimize the cost.