We have an application that reads from a message bus and upserts into Azure Synapse Analytics. There are multiple instances of the application and each instance is multi-threaded. Each upsert could have an upwards of a few hundred entries and the target tables could have a few million entries.
Synapse does not support MERGE operations. We are seeking a solution better than two-step DELETE - INSERT. We considered solutions involving table joins and followed by table renames such as this.
However, all these solutions have performance and/or concurrency issues.
Could you please recommend a solution to upsert into Azure Synapse Analytics from concurrent data streams.
Many thanks!
Related
Documentation from Microsoft and others strongly emphasizes the separation between storage and compute in Azure Synapse Analytics.
In the case of a Serverless SQL pool, it is clearly explained that the data is stored in an Azure Data Lake DSL Gen2.
However, in the case of a Dedicated SQL Pool, the documentation is not explicit enough on data storage.
In a book that deals with Azure Synapse, it is stated that in the case of Dedicated SQL Pool, data is stored in Storage Nodes which are completely separate from Compute Nodes.
Since this claim is not in Microsoft's documentation, I dare not trust it.
So, is there an official resource that sheds light on this question?
This is a question that has been on my mind for a long time as well. However, I have come to the conclusion that data is actually stored in Dedicated SQL Pools.
Let me explain why I believe this.
Take a look at the documentation given here,
https://learn.microsoft.com/en-us/azure/synapse-analytics/quickstart-copy-activity-load-sql-pool
Notice that it is about loading data into a Dedicated SQL Pool. Further, to quote part of the documentation,
A dedicated SQL pool offers T-SQL based compute and storage
capabilities. After creating a dedicated SQL pool in your Synapse
workspace, data can be loaded, modeled, processed, and delivered for
faster analytic insight.
It is said that Dedicated SQL Pools provide both compute and storage capabilities.
Furthermore, with Dedicated SQL Pools, you may already know that it is possible to create traditional tables. We can organize these tables into something along the lines of a star or snowflake schema to model our data warehouses.
Creation of such tables, however, is not possible with Serverless SQL Pools. Only the creation of metadata objects, i.e. views or external tables are allowed. This is explained here,
https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/on-demand-workspace-overview
To quote the relevant passage of the article,
Serverless SQL pool has no local storage, only metadata objects are
stored in databases. Therefore, T-SQL related to the following
concepts isn't supported:
Tables Triggers Materialized views DDL statements other than ones
related to views and security DML statements
To me, the fact that tables can actually be created in Dedicated SQL Pools is further proof that the data is physically stored in them.
My final argument is around the idea of distributions. The concept is explained here,
https://learn.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/massively-parallel-processing-mpp-architecture
This talks about how data is divided up among the compute nodes and how queries are executed in parallel on the distributions in these nodes. It would not be possible to implement this if the data was not actually stored in these nodes.
In my humble opinion, how I believe Azure Storage comes into the picture (at least, when it comes to Dedicated SQL Pools) is with regards to storing data as files in a data lake and then ingesting them into the pool for analysis.
An explanation can be found here,
https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/overview-architecture
Yet another quote,
Serverless SQL pool allows you to query your data lake files, while
dedicated SQL pool allows you to query and ingest data from your data
lake files. When data is ingested into dedicated SQL pool, the data is
sharded into distributions to optimize the performance of the system.
This is where Polybase comes into play. You can define various data loading patterns (into Dedicated SQL Pools) using Polybase as explained here,
https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/load-data-overview
The Microsoft documentation on Design tables using dedicated SQL pool in Azure Synapse Analytics, found at https://learn.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-overview, states the following:
Table persistence: Tables store data either permanently in Azure
Storage, temporarily in Azure Storage, or in a data store external to
dedicated SQL pool.
Regular table A regular table stores data in Azure Storage as part of
dedicated SQL pool...
I have N databases, for example 10 databases.
Every database has the same schema, but different data.
Now i would like to take every data of each database from the table "Table1" and insert them in a common table in a new database "DWHDatabase" in a table named Table1Common.
so it's an insert like n to 1.
How i can do that? i'm trying to solve my issues with the elastic queries but seems it's a 1 to 1 stuff
Use Azure Data Factory with Linked Services to each database. Use the Copy activity to load the data.
You can also paramaterize the solution.
Parameterize linked services
Parameters in Azure Data Factory by Catherine Wilhemsen
Elastic query is best suited for reporting scenarios in which the majority of the processing (filtering, aggregation) may be done on the external source side. It is unsuitable for ETL procedures involving significant amounts of data transfer from a distant database (s). Consider Azure Synapse Analytics for large reporting workloads or data warehousing applications with more sophisticated queries.
You may use the Copy activity to copy data across on-premises and
cloud-based data storage. After you've copied the data, you may use
other actions to alter and analyse it. The Copy activity may also be
used to publish transformation and analysis findings for use in
business intelligence (BI) and application consumption.
MSFT Copy Activity Overview: Here.
I have an azure synapse workspace that contains a number of pipelines & external tables in the serverless sql pool. all associated with one particular project.
There are another 2-3 completely separate projects on the way that will require a synapse toolset.
Should i create a new workspace, or allow them all to share this one?
What is the best criteria to use to decide?
This is probably a bit of an opinion question which don't tend to do that well on StackOverflow, but that said, I tend to think of Synapse Workspaces as similar to an instance of SQL Server, so historically, why would you have used the same SQL instance?
Generally this was where projects have things have in common, eg same data, similar permissions (AAD) groups, similar HADR requirements etc, so ask yourself those questions.
Bear in mind you can have multiple databases (dedicated and serverless) within a workspace but cross database queries for tables in a dedicated sql pool are only possible via Spark Pools1. This could work in your favour if you require separation. Also bear in mind you can connect multiple storage accounts to the workspace. There is no cost overhead to having multiple workspaces, but there is an admin overhead and there would be a cost implication to duplicating any of your data across multiple lakes, storage accounts and databases.
One example - we're using workspaces for environments for example where there aren't separate dev, test, uat Azure subscriptions.
So a few things to consider.
1 import the two tables as dataframes then join them in a Synapse notebook as per this example
Background: We currently have our data split between two relational databases (Oracle and Postgres). There is a need to run ad-hoc queries that involve tables in both databases. Currently we are doing this in one of two ways:
ETL from one database to another. This requires a lot of developer
time.
Oracle foreign data wrapper on our
Postgres server. This is working, but the queries run extremely
slowly.
We already use Google Cloud Platform (for the project that uses the Postgres server). We are familiar with Google BigQuery (BQ).
What we want to do:
We want most of our tables from both these databases (as-is) available at a single location, so querying them is easy and fast. We are thinking of copying over the data from both DB servers into BQ, without doing any transformations.
It looks like we need to take full dumps of our tables on a periodic basis (daily) and update BQ since BQ is append-only. The recent availability of DML in BQ seems to be very limited.
We are aware that loading the tables as is to BQ is not an optimal solution and we need to denormalize for efficiency, but this is a problem we have to solve after analyzing the feasibility.
My question is whether BQ is a good solution for us, and if yes, how to efficiently keep BQ in sync with our DB data, or whether we should look at something else (like say, Redshift)?
WePay has been publishing a series of articles on how they solve these problems. Check out https://wecode.wepay.com/posts/streaming-databases-in-realtime-with-mysql-debezium-kafka.
To keep everything synchronized they:
The flow of data starts with each microservice’s MySQL database. These
databases run in Google Cloud as CloudSQL MySQL instances with GTIDs
enabled. We’ve set up a downstream MySQL cluster specifically for
Debezium. Each CloudSQL instance replicates its data into the Debezium
cluster, which consists of two MySQL machines: a primary (active)
server and secondary (passive) server. This single Debezium cluster is
an operational trick to make it easier for us to operate Debezium.
Rather than having Debezium connect to dozens of microservice
databases directly, we can connect to just a single database. This
also isolates Debezium from impacting the production OLTP workload
that the master CloudSQL instances are handling.
And then:
The Debezium connectors feed the MySQL messages into Kafka (and add
their schemas to the Confluent schema registry), where downstream
systems can consume them. We use our Kafka connect BigQuery connector
to load the MySQL data into BigQuery using BigQuery’s streaming API.
This gives us a data warehouse in BigQuery that is usually less than
30 seconds behind the data that’s in production. Other microservices,
stream processors, and data infrastructure consume the feeds as well.
1.I am new in azure, I want to know can we have same replication mechanism provided by on premise sql on azure sql db?
2 .Issue we are facing is, few of the tables are growing fast, daily insert around 10k records, so we are planning to keep only few months say 6 data on main DB and copy all data to other DB using replication (not sure if feasible).
We need to read data from backup as well in application for some reports.
Please suggest on this if replication will work or any other solution.
Geo-replication uses a version of AlwaysOn with async replicas under the hood. It is very similar to a distributed Availability Group in SQL 2016, but you cannot control it, you can only turn it on or off.
Replication will work for that, but it would replicate all the data in the DB, not just the tables you want.
Link to Azure Documentation: https://azure.microsoft.com/en-us/documentation/articles/sql-database-geo-replication-overview/