Best methods for processing a rule hierarchy/tree - azure-synapse

I'm using Azure SQL Pools/Synapse/SQL DW and have a rule hierarchy that I need to process. At each level a parent can specify if all (AND) or any (OR) children are required in order for the rule to be satisfied. Each level in the hierarchy can specify a different condition to the parent (so you could have an AND condition that contains an OR etc.)
In pure SQL this can be implemented as a loop that starts from leaf level and parses each level by left joining the hierarchy onto the data to be evaluated. Any data that does not match the condition is pruned from the dataset. AND conditions are processed by counting the distinct number of children that exist and the distinct number of children that match.
This creates a lot of complex SQL to maintain, as well as using a less efficient loop. I suspect that the graph functionality may be a better structure here, but cannot see any inbuilt functionality that would actually help with the processing. Likewise hierarchyid sounds appropriate for this however I don't believe it exists in Azure Synapse/Pools/DW

Azure Synapse Analytics dedicated SQL pools do not support either the graph tables or the hierarchyId available in SQL Server box product and Azure SQL DB. Therefore your best option is to probably use a nearby Azure SQL DB to do this processing. Use Azure Data Factory (ADF) or Synapse Pipelines to move data between them.
Alternately, I've done a few question answers which I think give good coverage on using graph or hierarchical data in Synapse and some of the approaches
which include: using Azure SQL DB, using WHILE loops and using Azure Synapse Notebooks and the GraphFrames library:
Recursive Query in Azure Synapse Analytics for Dates
This was where someone thought they needed a recursive query but did not:
Recursive Query in Azure Synapse Analytics for Dates
Synapse top level parent hierarchy coverage and examples of the SQL loops and GraphFrames option: https://stackoverflow.com/a/67065509/1527504
The second question in particular is quite thorough.

Related

Where is data physically stored in Azure Synapse Dedicated SQL Pool?

Documentation from Microsoft and others strongly emphasizes the separation between storage and compute in Azure Synapse Analytics.
In the case of a Serverless SQL pool, it is clearly explained that the data is stored in an Azure Data Lake DSL Gen2.
However, in the case of a Dedicated SQL Pool, the documentation is not explicit enough on data storage.
In a book that deals with Azure Synapse, it is stated that in the case of Dedicated SQL Pool, data is stored in Storage Nodes which are completely separate from Compute Nodes.
Since this claim is not in Microsoft's documentation, I dare not trust it.
So, is there an official resource that sheds light on this question?
This is a question that has been on my mind for a long time as well. However, I have come to the conclusion that data is actually stored in Dedicated SQL Pools.
Let me explain why I believe this.
Take a look at the documentation given here,
https://learn.microsoft.com/en-us/azure/synapse-analytics/quickstart-copy-activity-load-sql-pool
Notice that it is about loading data into a Dedicated SQL Pool. Further, to quote part of the documentation,
A dedicated SQL pool offers T-SQL based compute and storage
capabilities. After creating a dedicated SQL pool in your Synapse
workspace, data can be loaded, modeled, processed, and delivered for
faster analytic insight.
It is said that Dedicated SQL Pools provide both compute and storage capabilities.
Furthermore, with Dedicated SQL Pools, you may already know that it is possible to create traditional tables. We can organize these tables into something along the lines of a star or snowflake schema to model our data warehouses.
Creation of such tables, however, is not possible with Serverless SQL Pools. Only the creation of metadata objects, i.e. views or external tables are allowed. This is explained here,
https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/on-demand-workspace-overview
To quote the relevant passage of the article,
Serverless SQL pool has no local storage, only metadata objects are
stored in databases. Therefore, T-SQL related to the following
concepts isn't supported:
Tables Triggers Materialized views DDL statements other than ones
related to views and security DML statements
To me, the fact that tables can actually be created in Dedicated SQL Pools is further proof that the data is physically stored in them.
My final argument is around the idea of distributions. The concept is explained here,
https://learn.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/massively-parallel-processing-mpp-architecture
This talks about how data is divided up among the compute nodes and how queries are executed in parallel on the distributions in these nodes. It would not be possible to implement this if the data was not actually stored in these nodes.
In my humble opinion, how I believe Azure Storage comes into the picture (at least, when it comes to Dedicated SQL Pools) is with regards to storing data as files in a data lake and then ingesting them into the pool for analysis.
An explanation can be found here,
https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/overview-architecture
Yet another quote,
Serverless SQL pool allows you to query your data lake files, while
dedicated SQL pool allows you to query and ingest data from your data
lake files. When data is ingested into dedicated SQL pool, the data is
sharded into distributions to optimize the performance of the system.
This is where Polybase comes into play. You can define various data loading patterns (into Dedicated SQL Pools) using Polybase as explained here,
https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/load-data-overview
The Microsoft documentation on Design tables using dedicated SQL pool in Azure Synapse Analytics, found at https://learn.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-overview, states the following:
Table persistence: Tables store data either permanently in Azure
Storage, temporarily in Azure Storage, or in a data store external to
dedicated SQL pool.
Regular table A regular table stores data in Azure Storage as part of
dedicated SQL pool...

Sql Azure - Cross database queries

I have N databases, for example 10 databases.
Every database has the same schema, but different data.
Now i would like to take every data of each database from the table "Table1" and insert them in a common table in a new database "DWHDatabase" in a table named Table1Common.
so it's an insert like n to 1.
How i can do that? i'm trying to solve my issues with the elastic queries but seems it's a 1 to 1 stuff
Use Azure Data Factory with Linked Services to each database. Use the Copy activity to load the data.
You can also paramaterize the solution.
Parameterize linked services
Parameters in Azure Data Factory by Catherine Wilhemsen
Elastic query is best suited for reporting scenarios in which the majority of the processing (filtering, aggregation) may be done on the external source side. It is unsuitable for ETL procedures involving significant amounts of data transfer from a distant database (s). Consider Azure Synapse Analytics for large reporting workloads or data warehousing applications with more sophisticated queries.
You may use the Copy activity to copy data across on-premises and
cloud-based data storage. After you've copied the data, you may use
other actions to alter and analyse it. The Copy activity may also be
used to publish transformation and analysis findings for use in
business intelligence (BI) and application consumption.
MSFT Copy Activity Overview: Here.

"Cursor" and "FOR XML" clause in Azure data warehouse

While creating stored procedures in Azure data warehouse, I have got some error on "Cursor" and "FOR XML". So wanted to know if they are supported by Azure data warehouse or not. If not then what are the alternatives.
sample code with error msg pictures are attached herein.
Neither FOR XML or cursors are supported in Azure Synapse dedicated SQL pools (formerly known as Azure SQL Data Warehouse) as per the documentation. For cursors, either convert them to use a WHILE loop which is supported or refactor the code to use a set-based approach. Another alternative is to use something external, like Azure Data Factory or Synapse Pipelines and use a For Each loop. Another alternative is to use a nearby Azure SQL DB to do some pre-processing. You should be aware the the MPP architecture of Azure Synapse Analytics does not lend itself well to this kind of row-based processing and you should remember it's a big data platform meant for large volumes of data, millions, billions of rows and set-based approaches should be preferred.
If you are just using FOR XML to do that sleazy string concatenation trick then you should use STRING_AGG instead which is fully supported in Synapse. See this answer for a recent example. If you are actually producing XML then you will need to find an alternative method, eg a nearby Azure SQL DB.

Transformation in Snowflake or Azure data Factory?

I'm very new to Snowflake, so forgive me if the answer is obvious.
I am loading the data from on-prem into Azure using Data Factory, and then ingesting into Snowflake using COPY INTO. However, I need to enable access for some of the transformed data to other platforms, meaning that if I perform transformation in Snowflake, I'll need to create an external table in Azure (essentially pushing this data back to Azure so other platforms can access it).
As we don't particularly want to introduce a new tool, I have two options for our fairly basic transformation:
do the transformation in ADF
do the transformation in Snowflake in SQL scripts and then create an external table so other teams can access the data using other tools (these platforms don't integrate with Snowflake)
Are there any major drawbacks to option 2 apart from increased storage costs?
I'm trying to weigh up the following: maintenance effort (our team's skills lie in SQL not ADF), cost, and performance.
Any advice would be appreciated.
As stated in the question, there are many possible answers for this scenario - with my favorite being the second one ("do the transformation in Snowflake in SQL scripts and then create an external table so other teams can access the data using other tools").
If you need to make the results of these transformations available on Azure storage, Azure Data Factory supports this natively:
Copy data from Snowflake that utilizes Snowflake's COPY into [location] command to achieve the best performance. https://learn.microsoft.com/en-us/azure/data-factory/connector-snowflake#supported-capabilities
Or you could manage this inside Snowflake using the same COPY INTO that ADF uses.
Let me add a couple screenshots from the Snowflake webinar "Data Warehouse or Data Lake? How You Can Have Both in a Single Platform":
https://resources.snowflake.com/webinars-thought-leadership/data-warehouse-or-data-lake-how-you-can-have-both-in-a-single-platform-3

How to create a triggered update of cloud SQL instance export into SQL dump file in cloud storage? [duplicate]

I am designing a solution in which Google Cloud SQL will be used to store all data from the regular functioning of the app(kind of OLTP data). The data is expected to grow over time into pretty large size. The data itself is relational in nature and hence we have chosen Cloud SQL instead of Cloud Datastore.
This data needs to be fed into Big Query for analytics and this needs to be near real-time analytics (as the best case), although realistically some lag can be expected. But I am trying to design a solution which reduces this lag to minimum possible.
My question has 3 parts -
Should I use Cloud SQL for storing data and then move it to BigQuery or change the basic design itself and use BigQuery for storing the data initially as well? Is BigQuery suitable for use for regular, low-latency OLTP workloads?(I don't think so - is my assumption correct?)
What is the recommended/best practice for loading Cloud SQL data into BigQuery and have this integration work near real-time?
Is Cloud Dataflow a good option? If I connect Cloud SQL to Cloud DataFlow and further to BigQuery - will it work? Or is there any other way to achieve this which is better(as asked in question 2)?
Take a look at how WePay does this:
https://wecode.wepay.com/posts/bigquery-wepay
The MySQL to GCS operator executes a SELECT query against a MySQL
table. The SELECT pulls all data greater than (or equal to) the last
high watermark. The high watermark is either the primary key of the
table (if the table is append-only), or a modification timestamp
column (if the table receives updates). Again, the SELECT statement
also goes back a bit in time (or rows) to catch potentially dropped
rows from the last query (due to the issues mentioned above).
With Airflow they manage to keep BigQuery synchronized to their MySQL database every 15 minutes.
BigQuery supports Cloud SQL federated queries which lets you directly query Cloud SQL database from BigQuery. To keep Cloud SQL table in sync with BigQuery, you can write a simple script with following query to sync two tables every hour.
INSERT
demo.customers (column1)
SELECT
*
FROM
EXTERNAL_QUERY(
"project.us.connection",
"SELECT column1 FROM mysql_table WHERE timestamp > ${timestamp};");
Just remember replace the ${timestamp} with the current timestamp - 1 hour.
Another method would be to split the write process to CloudSQL and to Cloud Pub/Sub and then have a Dataflow reader to stream into BigQuery. This works well when you have materially different target schema for your BigQuery tables - which is common when denormalizing your relational data.
The upside is that you can reduce overall latency to say a few seconds; however, the main downside is that if your transactional data is highly mutating you will have to create a versioning scheme to track changes.
Google has provided a reference article on this subject related to using a change data capture tool to identify the changed data and only pushing that.
This makes some assumptions that may not work for you:
willingness to learn debezium
willingness to let GCP connect to your source MySQL database
If those work for your situation it seems like a good solution.
I think you can use federated queries as one possible solution:
A federated query is a way to send a query statement to an external database and get the result back as a temporary table. Federated queries use the BigQuery Connection API to establish a connection with the external database. In your standard SQL query, you use the EXTERNAL_QUERY function to send a query statement to the external database, using that database's SQL dialect. The results are converted to BigQuery standard SQL data types.
You can use federated queries with the following external databases:
Cloud Spanner
Cloud SQL
After the initial one-time setup, you can write a query with the EXTERNAL_QUERY SQL function.
I leave you the documentation so you can implement it on your project:
https://cloud.google.com/bigquery/docs/federated-queries-intro