Staging tables in DB vs storage area

Typically, in an on-premises SQL Server ETL workflow via SSIS, we load data from anywhere into staging tables and then apply validation and transformations to load/merge them into downstream data warehouse tables.
My question is whether we should do something similar on Azure, with a set of staging tables and downstream tables in Azure SQL Database, or use Azure storage as the staging area and move data from there into the final downstream tables via ADF.
As wild as it may seem, we also have a proposal to have a separate staging database and downstream database, between which we move data using ADF.

There are different models for doing data movement pipelines and no single one is perfect. I'll make a few comments on the common patterns I see in case that will help you make decisions on your application.
For many data warehouses where you are trying to stage in data and create dimensions, there is often a process where you load the raw source data into some other database/tables as raw data and then process it into the format you want to insert into your fact and dimension tables. That process is complicated by the fact that you may have data arrive late or data that is corrected on a later day, so often these systems are designed using partitioned tables on the target fact tables to allow re-processing of a partition's worth of data (e.g. a day) without having to reprocess the whole fact table. Furthermore, the transformation process on that staging table may be intensive if the data itself is coming in a form far away from how you want to represent it in your DW.
Often in on-premises systems, these are handled in a separate database (potentially on the same SQL Server) to isolate it from the production system. Furthermore, it is sometimes the case that these staging tables are re-creatable from original source data (CSV files or similar), so it is not the store of record for that source material. This allows you to consider using the simple recovery model on that database (which reduces the log IO requirements and recovery time compared to full recovery).
While not every DW uses the full recovery model for the processed DW data (some do dual load to a second machine instead since the pipeline is there), the ability to use full recovery plus physical log replication (AlwaysOn Availability Groups) in SQL Server gives you the flexibility to create a disaster recovery copy of the database in a different region of the world. (You can also do read scale-out of queries on that server if you would like.) There are variations on this basic model, but a lot of on-premises systems have something like this.
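To make the staging-then-reprocess idea concrete, here is a minimal sketch rather than a definitive implementation. It uses Python's built-in sqlite3 as a stand-in for SQL Server, and the table names (stg_sales, fact_sales), columns, and validation rules are hypothetical. A real fact table would use partition switching; the sketch approximates "reprocess one partition" by deleting and reloading a single day inside one transaction.

```python
import sqlite3

def reload_day(conn, day):
    """Rebuild one day's slice of the fact table from the staging table."""
    with conn:  # single transaction: the day's rows are replaced atomically
        conn.execute("DELETE FROM fact_sales WHERE sale_date = ?", (day,))
        conn.execute(
            """
            INSERT INTO fact_sales (sale_date, product, amount)
            SELECT sale_date, UPPER(TRIM(product)), CAST(amount AS REAL)
            FROM stg_sales
            WHERE sale_date = ? AND amount IS NOT NULL AND TRIM(amount) <> ''
            """,
            (day,),
        )

conn = sqlite3.connect(":memory:")
# Staging keeps everything as raw text; the fact table is typed.
conn.execute("CREATE TABLE stg_sales (sale_date TEXT, product TEXT, amount TEXT)")
conn.execute("CREATE TABLE fact_sales (sale_date TEXT, product TEXT, amount REAL)")

# First load: raw rows land in staging untouched, including one invalid row.
conn.executemany(
    "INSERT INTO stg_sales VALUES (?, ?, ?)",
    [("2024-01-01", " widget ", "10.5"),
     ("2024-01-01", "gadget", ""),
     ("2024-01-02", "widget", "7")],
)
reload_day(conn, "2024-01-01")
reload_day(conn, "2024-01-02")

# A late-arriving row for 2024-01-01: only that day is reprocessed.
conn.execute("INSERT INTO stg_sales VALUES ('2024-01-01', 'gizmo', '3')")
reload_day(conn, "2024-01-01")

rows = conn.execute(
    "SELECT sale_date, product, amount FROM fact_sales ORDER BY sale_date, product"
).fetchall()
```

Because staging can be rebuilt from the source files, losing it costs only a reload, which is why a cheaper recovery model is acceptable there.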
When you look at SQL Azure, there are some similarities and some differences that matter when considering how to set up an equivalent model:
You have full recovery on all user databases (but tempdb is in simple recovery). You also have quorum-commit of your changes to N replicas (like in Availability Groups) when using vCore or Premium databases, which matters a fair amount because you often have a more generic network topology in public cloud systems than in a custom system you build yourself. In other words, log commit times may be slower than in your current system. For batch systems it does not necessarily matter too much, but you need to be careful to use large enough batch sizes so that you are not waiting on the network all the time in your application. Given that your staging table may also be a SQL Azure database, you need to be aware that it also has quorum commit, so you may want to consider which data is going to stay around day-over-day (stays in SQL Azure DB) vs. which can go into tempdb for lower latencies and be re-created if lost.
There is no intra-db resource governance model today in SQL Azure (other than elastic pools, which are partial and target a different use case than DW). So, having a separate staging database is a good idea since it isolates your production workload from the processing in the staging database. You avoid noisy neighbor issues where your primary production workload is impacted by the processing of the day's data you want to load.
When you provision machines for an on-premises DW, you often buy a sufficiently large storage array/SAN that you can host your workload and potentially many others (consolidation scenarios). The Premium/vCore databases in SQL Azure are set up with local SSDs (Hyperscale is the newer addition; it gives you a cross-machine scale-out model that is a bit like a SAN in some regards). So, you would want to think through the IOPS required for your production system and your staging/loading process. You have the ability to scale each of these up or down to better manage your workload and costs (unlike a CAPEX purchase of a large storage array, which is made up front and then you tune workloads to fit into it).
Finally, there is also a SQL DW offering that works a bit differently than SQL Azure - it is optimized for larger DW workloads and has scale-out compute with the ability to scale that up/down as well. Depending on your workload needs, you may want to consider that as your eventual DW target if that is a better fit.
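The point above about using large enough batch sizes can be sketched as follows. This is only an illustration: sqlite3 stands in for the real database connection, and the table name stg_events, the row shape, and the batch size are hypothetical. The idea is that each commit pays the (possibly slow) log-commit latency once, so committing per batch rather than per row reduces the number of round trips spent waiting.

```python
import sqlite3
from itertools import islice

def batches(rows, size):
    """Yield successive lists of at most `size` rows from any iterable."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def load_in_batches(conn, rows, batch_size):
    """Insert rows with one commit per batch; return the number of commits."""
    commits = 0
    for chunk in batches(rows, batch_size):
        with conn:  # commits once when the block exits
            conn.executemany(
                "INSERT INTO stg_events (id, payload) VALUES (?, ?)", chunk
            )
        commits += 1
    return commits

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stg_events (id INTEGER, payload TEXT)")

# The source is a generator, so the whole data set is never held in memory.
source = ((i, f"row-{i}") for i in range(10_000))
commits = load_in_batches(conn, source, batch_size=2_500)
total = conn.execute("SELECT COUNT(*) FROM stg_events").fetchone()[0]
```

Here 10,000 rows are loaded with 4 commits instead of 10,000; the right batch size for a real workload would have to be measured.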
To get to your original question - can you run a data load pipeline on SQL Azure? Yes you can. There are a few caveats compared to your existing experiences on-premises, but it will work. To be fair, there are also people who just load from CSV files or similar directly without using a staging table. Often they don't do as many transformations, so YMMV based on your needs.
Hope that helps.

Related

Creating a Datawarehouse

Currently our team is having a major database management/data management issue where hundreds of databases are being built and used for minor, one-off applications where the app should really be pulling from an already existing database.
Since our security is so tight, the owners of these systems of authority will not allow others to pull data from them at a consistent (app-necessary) rate; rather, they allow a single app to do a weekly pull, and that data is then given to the org.
I am being asked to compile all of those publicly available (weekly snapshots) into a single data warehouse for end users to go to. We realistically are talking 30-40 databases each with hundreds of thousands of records.
What is the best way to turn this into a data warehouse? Create a SQL server and treat each one as its own DB on the server? As far as the individual app connections I am less worried, I really want to know what is the best practice to house all of the data for consumption.

What you're describing is more of a simple data lake. If all you're being asked for is a single place for the existing data to live as-is, then sure, directly pulling all 30-40 databases to a new server will get that done. One thing to note is that if they're creating Database Snapshots, those wouldn't be helpful here. With actual database backups, it would be easy to build a process that would copy and restore those to your new server. This is assuming all of the sources are on SQL Server.
"Data warehouse" implies a certain level of organization beyond that, to facilitate reporting on an aggregate of the data across the multiple sources. Generally you'd identify any concepts that are shared between the databases and create a unified table for each concept, then create an ETL (extract, transform, load) process to standardize the data from each source and move it into those unified tables. This would be a large lift for one person to build. There are plenty of resources that you could read to get you started--Ralph Kimball's The Data Warehouse Toolkit is a comprehensive guide.
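As a rough illustration of "one unified table per shared concept", the sketch below standardizes a customer record from two sources into a single table. Everything in it is hypothetical: the source layouts, the field names, the dim_customer table, and the cleanup rules. sqlite3 is used only so the example runs on its own.

```python
import sqlite3

# Hypothetical extracts: each source names and formats "customer" differently.
SOURCE_A = [{"cust_id": 1, "full_name": "Ada Lovelace", "country": "uk"}]
SOURCE_B = [{"id": "B-7", "first": "Alan", "last": "Turing", "ctry": "UK "}]

def from_a(row):
    """Map a source A record onto the unified (system, key, name, country) shape."""
    return ("A", str(row["cust_id"]), row["full_name"].strip(),
            row["country"].strip().upper())

def from_b(row):
    """Map a source B record onto the same unified shape."""
    return ("B", row["id"], f'{row["first"]} {row["last"]}'.strip(),
            row["ctry"].strip().upper())

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE dim_customer (
           source_system TEXT, source_key TEXT, name TEXT, country TEXT,
           PRIMARY KEY (source_system, source_key))"""
)

unified = [from_a(r) for r in SOURCE_A] + [from_b(r) for r in SOURCE_B]
with conn:
    # Keyed on (source_system, source_key), so re-running a weekly load
    # replaces existing rows instead of duplicating them.
    conn.executemany(
        "INSERT OR REPLACE INTO dim_customer VALUES (?, ?, ?, ?)", unified
    )

customers = conn.execute(
    "SELECT * FROM dim_customer ORDER BY source_system"
).fetchall()
```

Keeping the originating system and key on every row makes it possible to trace a unified record back to its source.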
In either case, a tool you might want to look into is SSIS. It's good for copying data across servers and has drivers for multiple different RDBMS platforms. You can schedule SSIS packages from SQL Agent. It has other features that could help for data warehousing as well.

SAP HANA SDI ECC Source vs HANA table delta

In our current system, we have a lot of ECC tables replicated to SAP HANA with SDI (Smart Data Integration). Replication tasks can be real-time or on demand, but sometimes a replication task comes too late and the data in the replicated table is very different from the source table.
What would be the best approach in SAP HANA to check these delta values?
- The ERP system uses a DB2 database
- DB2LogReaderAdapter is used to read the DB2 database tables
- The remote source is created in the cloud (virtual table)
- There are about 260 replication tasks
- Each replication task contains only one object
- Replication tasks are based on virtual tables
The biggest issue faced right now is latency in the remote source tables (delta values)

There is no easy/straightforward way to "check" delta values here.
The 260 replication tasks are processed independently of each other, regardless of how changes were grouped into transactions in the source system.
That means that if tables A and B are updated in the same transaction but replicated in separate tasks to HANA, the data will be written to HANA in separate transactions. The data in HANA will be lagging behind the source system.
Usually, this difference should only last a relatively short time (maybe a few seconds), but, of course, if you do aggregation queries and want to see currently valid sums etc., this leads to wrong data.
One way to deal with this is to implement the queries in a way that takes this into account, by e.g. filtering on data that has been changed half an hour ago (or longer), and to exclude newer data.
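A minimal sketch of that filtering idea, in Python rather than HANA SQL. It assumes each replicated row carries a change timestamp; the field name changed_at, the sample rows, and the 30-minute window are all illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone

def settled_rows(rows, now, min_age=timedelta(minutes=30)):
    """Keep only rows whose last change is at least `min_age` old."""
    cutoff = now - min_age
    return [r for r in rows if r["changed_at"] <= cutoff]

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
rows = [
    {"order_id": 1, "amount": 100.0, "changed_at": now - timedelta(hours=2)},
    {"order_id": 2, "amount": 50.0, "changed_at": now - timedelta(minutes=30)},
    {"order_id": 3, "amount": 25.0, "changed_at": now - timedelta(minutes=5)},
]

# Aggregate only over rows old enough that all replication tasks
# have most likely caught up with them.
stable = settled_rows(rows, now)
stable_total = sum(r["amount"] for r in stable)
```

The sum deliberately excludes the five-minute-old row, trading freshness for consistency across tables.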
Note that as the replication via LogReader is de-coupled from the source system's transaction processing, this problem of "lagging data" is built in conceptually and cannot be generally avoided.
All one can do is to reduce the extent of the lag and cope with the differences in the upstream processing.
This very issue is one of the reasons why remote data access is usually preferred over replication for cases like operational reporting.
And if you do need data loading (e.g. to avoid additional load on the source system), then an ETL/ELT approach into data stores (DWH/BW-like) makes the situation a lot better structured.
In fact, the current S/4 HANA & BW/4 HANA setups usually use a combination of scheduled data loads and ad-hoc fetching of new data via operational delta queues from the source system.

Lars,
If we need to replicate data from ECC on Oracle to a HANA instance, should we use SLT (because of cluster tables, for example), or does SDI already cover all the functionality SLT provides?
Regards, Chris

Azure SQL Database or SQL Data Warehouse

I am working on a solution architecture and am having a hard time choosing between Azure SQL DB and SQL DW.
The current scope revolves around developing a real-time BI reporting solution based on multiple sources. But in the long run the solution may be extended into a full-fledged EDW and marts.
I initially thought of using SQL DW so that the MPP capabilities could be used for the future scope. But when I spoke to a mate who recently used SQL DW, he explained that development in SQL DW is not similar to SQL DB.
I have worked previously on Real Time reporting with no scope for EDW and we successfully used SQL DB. With this as well we can create Facts and Dimension and Marts.
Is there a strong case where I should be choosing SQL DW over SQL DB?

I think the two most important data points you can have here are the volume of data you're processing and the number of concurrent queries that you need to support. When talking about processing large volumes of data, and by large I mean more than 3 TB (which is not even really large, but large enough), then Azure SQL Data Warehouse becomes a juggernaut. The parallel processing is simply amazing (it's amazing at smaller volumes too, but you're paying a lot of money for overkill). However, the one issue can be the simultaneous query limit. It currently has a limit of 128 concurrent queries with a limit of 1,000 queries queued (read more here). If you're using the Data Warehouse as a data warehouse to process large amounts of data and then feed them into data marts where the majority of the querying takes place, this isn't a big deal. If you're planning to open this to large volume querying, it quickly becomes problematic.
Answer those two questions, query volume and data volume, and you can more easily decide between the two.
Additional factors can include the issues around the T-SQL currently supported. The supported surface is smaller than in traditional SQL Server. Again, for most purposes around data warehousing, this is not an issue. For a full-blown reporting server, it might be.
Most people successfully implementing Azure SQL Data Warehouse are using a combination of the warehouse for processing and storage and Azure SQL Database for data marts. There are exceptions when dealing with very large data volumes that need the parallel processing, but don't require lots of queries.

The 4 TB limit of Azure SQL Database may be an important factor to consider when choosing between the two options. Queries can be faster with Azure SQL Data Warehouse since it is an MPP solution. You can pause Azure SQL DW to save costs; with Azure SQL Database you can scale down to the Basic tier (when possible).
Azure SQL DB can support up to 6,400 concurrent queries and 32k active connections, whereas Azure SQL DW can only support up to 32 concurrent queries and 1,024 active connections. So SQL DB is a much better solution if you are using something like a dashboard with thousands of users.
About developing for them, Azure SQL Database supports Entity Framework but Azure SQL DW does not support it.
I also want to give you a quick glimpse of how the two compare in terms of performance: 1 DWU is approximately 7.5 DTU (Database Throughput Unit, used to express the horsepower of an OLTP Azure SQL Database) in capacity, although they are not exactly comparable. More information about this comparison here.

Thanks for your responses, Grant and Alberto. They have cleared the air a lot and helped me make a choice.
Since the data would be subject to dashboarding and querying, I am tilting towards SQL Database instead of SQL DW.
Thanks again.

How to trigger SPLIT's and DROP's when sharding in SQL Azure

I am setting up a system running on Windows Azure for which I expect high volume of data and high traffic. In order to handle it, I am designing a Federated database. I am interested in having the application itself SPLIT (or DROP) federated databases when needed. There are 2 reasons that should trigger these operations to happen: 1) The size of the database is reaching the limit allowed in Windows Azure, and 2) The amount of traffic in the server is too high, and a SPLIT operation will improve performance, keeping the response time low (runs fast). (the inverse operations are based on similar reasoning).
My question is: How can I detect these 2 conditions programmatically?

You can use the SQL Azure Dynamic Management Views to programmatically monitor SQL Azure databases. Note that you will not be able to monitor the entire federated database at once, but rather each of its individual members.
Using the Dynamic Management Views to check for condition 1), the one related to size, should be straightforward. Detecting condition 2), the one related to traffic/performance, is a bit more difficult since you will first need to identify the exact metrics that make sense and their threshold values.
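For the size condition, a sketch of the decision logic might look like the following. The query string uses the sys.dm_db_partition_stats view, which reports reserved pages of 8 KB each; it is shown only as a string here and is not executed. The member names, the 51,200 MB size limit, and the 80% threshold are hypothetical values, and in practice you would run the query against each federation member and feed the results into the function.

```python
# T-SQL to run against each federation member (not executed in this sketch).
# reserved_page_count is in 8 KB pages, so * 8 / 1024 gives megabytes.
SIZE_QUERY = """
SELECT SUM(reserved_page_count) * 8.0 / 1024 AS size_mb
FROM sys.dm_db_partition_stats;
"""

def needs_split(size_mb, max_size_mb, threshold=0.8):
    """True once a member has used `threshold` of its allowed size."""
    return size_mb >= max_size_mb * threshold

def members_to_split(sizes_mb, max_size_mb, threshold=0.8):
    """Return the names of members that should be split, sorted by name."""
    return sorted(
        name for name, size in sizes_mb.items()
        if needs_split(size, max_size_mb, threshold)
    )

# Hypothetical measurements, one per federation member.
sizes = {"member_low": 10_000.0, "member_mid": 41_500.0, "member_high": 48_000.0}
to_split = members_to_split(sizes, max_size_mb=51_200)
```

Splitting at a threshold below 100% leaves headroom, since the member keeps growing while the split runs.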
One very important thing to keep in mind is that the SPLIT and DROP operations behave very differently. A SPLIT is an online operation (it does not involve any downtime) through which a federation member is divided into two databases. The data is automatically split between the two. This behavior means that splits can indeed be triggered from an automated scaling process.
The DROP however is quite different. When dropping a federation member, Sql Azure will move its range of key values to the lower or upper neighbor federation member, but the data itself is simply deleted. You can get a more detailed description in this article (search for "Scaling down" inside it). Basically you will have to manually export the data from the dropped database and manually merge it into the destination database. Technically speaking you might be able to automate the merge operation through the command line version of the Sql Azure Migration Wizard, but it's risky. It would require a lot of testing before putting it into production.
Microsoft is planning to implement automated merge on federation member drops, but that will happen in a future release. As it is at the moment, automated scaling down is not something I would recommend.
Update
For those interested, you can vote for the MERGE operation on federated SQL Azure databases here.

Where is the bottleneck / what are the gotchas when selecting records from a remote (linked) SQL server?

I'm in a satellite office that needs to pull some data from our main office for display on our intranet. We use MS SQL Server in both locations and we're planning to create a linked server in our satellite office pointing to the main office. The connection between the two is a VPN tunnel I believe (does that sound right? What do I know, I'm a programmer!)
I'm concerned about generating a lot of traffic across a potentially slow connection. We will be getting access to a SQL view on the main office's server. It's not a lot of data (~500 records) once the select query has run, but the view is huge (~30000 records) without a query.
I assume running a query on a linked server will bring back only the results over the wire (and not the entire view to be queried locally). In that case the major bottleneck is most likely the connection itself assuming the view is indexed, etc. Are there any other gotchas or potential bottlenecks (maybe based on the way I structure queries) that I should be aware of?

From what you explained your connection is likely to be the bottleneck.
You might also consider caching data at the satellite location.
The decision will depend on the following:
- how many rows and how often data are updated in the main database
- how often you need to load the same data set at satellite location
Two edge examples:
- Data is static or relatively static (inserts only in the main DB), and users at the satellite location often query the same data again and again. In this case it makes sense to cache the data locally at the satellite location.
- Data is volatile, with a lot of updates and/or deletes. Users at the satellite location rarely query data, and when they do, it is always with a different WHERE condition. In this case it doesn't make sense to cache. If the connection is slow and changes are frequent, you might end up never being in sync with the main DB.
Another advantage of caching is that you can implement data compression, which will alleviate the bad effect of a slow connection.
If you choose to cache at the local location there are a lot of options, but I believe that would be another topic.
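As one of many possible caching options, here is a minimal time-based cache sketch for the "relatively static data, repeated queries" case. The class name, the fetch function, and the 300-second lifetime are hypothetical; fetch_from_main_office stands in for whatever actually runs the query over the slow link.

```python
import time

class TTLCache:
    """Keep query results locally; refetch once they are older than ttl_seconds."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (fetched_at, value)

    def get(self, key):
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # fresh enough: no trip over the slow link
        value = self._fetch(key)
        self._entries[key] = (now, value)
        return value

remote_calls = []

def fetch_from_main_office(region):
    """Stand-in for the real remote query; records each call for inspection."""
    remote_calls.append(region)
    return [f"{region}-row-{i}" for i in range(3)]

# A controllable clock makes the expiry behaviour easy to demonstrate.
fake_now = [0.0]
cache = TTLCache(fetch_from_main_office, ttl_seconds=300, clock=lambda: fake_now[0])

first = cache.get("north")
second = cache.get("north")   # served from the local cache
fake_now[0] = 301.0
third = cache.get("north")    # entry expired, fetched again
```

The lifetime is the knob that trades staleness against traffic: the more volatile the data, the shorter it has to be, up to the point where caching stops paying off.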
[Edit]
About compression: you can use compressed transaction log shipping. In SQL Server 2008, compression is supported in Enterprise edition only. In SQL Server 2008 R2, it is available starting with Standard edition. http://msdn.microsoft.com/en-us/library/bb964719.aspx
You can implement custom compression before you ship transaction logs, using any compression library you like.
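A custom compression step could look roughly like this sketch, which uses Python's standard zlib module on an in-memory byte string. The sample payload is made up; a real transaction log backup is a binary file, and how well it compresses depends on its contents.

```python
import zlib

def pack(data: bytes, level: int = 6) -> bytes:
    """Compress a payload before sending it over the slow link."""
    return zlib.compress(data, level)

def unpack(blob: bytes) -> bytes:
    """Restore the original bytes on the receiving side."""
    return zlib.decompress(blob)

# Made-up, highly repetitive payload standing in for a file to ship.
log_chunk = b"INSERT INTO orders VALUES (42, 'widget', 10.5);\n" * 2000

packed = pack(log_chunk)
restored = unpack(packed)
ratio = len(packed) / len(log_chunk)
```

Higher compression levels cost more CPU for smaller output, so the best setting depends on whether the link or the server is the scarcer resource.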
