Azure Synapse: Performant way to promote data from staging schema to prod schema? - azure-synapse

I've used an ADF/Synapse pipeline to extract my fact data from the source into a staging-schema table in a dedicated SQL pool.
My task now is to enrich the fact data with a surrogate key, which comes via a lookup against a small dimension table already in the prod schema.
I've been using an ADF/Synapse data flow for this, but it seems inefficient because it moves the data back out through ADF. I could do it in a Spark notebook, but that also seems like unnecessary data movement.
So I think the best approach is a stored procedure on the dedicated SQL pool to perform the enrichment. My concern is to ensure the proc is coded in a performant way (not a row-by-row insert).
There are lots of proc examples that create a table from nothing (CTAS), but I haven't found examples that do the enrich/append action in a scalable way. What are good SQL code practices for this?

I think the best practice here is to distribute your small dimension as REPLICATE and use CTAS to create a HASH-distributed fact table on a suitable column. Consider partition switching in addition if you have enough volume. A simplified example:
CREATE TABLE fact.yourBigTable
WITH
(
    CLUSTERED COLUMNSTORE INDEX,
    DISTRIBUTION = HASH( someColumn )
)
AS
SELECT ...
FROM staging. ... ;

CREATE TABLE dim.yourSmallTable (
    ...
)
WITH
(
    CLUSTERED INDEX ( someColumn ),
    DISTRIBUTION = REPLICATE
);
CTAS is optimised for the MPP architecture of dedicated SQL pools. If you do not feel CTAS is appropriate, look at a straightforward INSERT instead. Dedicated SQL pools now support MERGE (in preview), so that may also be worth a look.
I would agree with you about not using ADF or Data Flows for this, as there's nothing faster than a bit of SQL running directly on the server. I reserve those tools for things you can't do with SQL, e.g. orchestration, running tasks in parallel, advanced transforms (e.g. with notebooks), and so on.
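To make the enrich/append step concrete, here is a minimal sketch of the kind of set-based stored procedure described above. All object and column names (etl.LoadFactSales, staging.FactSales, dim.Customer, CustomerBK, etc.) are assumptions for illustration; the point is the single INSERT ... SELECT with a join to the replicated dimension, rather than any row-by-row loop:

```sql
-- Hypothetical example: set-based enrichment of staged fact rows
-- with a surrogate key looked up from a replicated dimension.
CREATE PROCEDURE etl.LoadFactSales
AS
BEGIN
    INSERT INTO fact.Sales (CustomerSK, SaleDate, Amount)
    SELECT
        d.CustomerSK,   -- surrogate key from the replicated dimension
        s.SaleDate,
        s.Amount
    FROM staging.FactSales AS s
    JOIN dim.Customer AS d
        ON d.CustomerBK = s.CustomerBK;   -- business-key lookup
END;
```

Because the dimension is REPLICATE-distributed, the join requires no data movement between distributions, so the insert scales with the pool's MPP layout.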

Related

How to Delete First Row of All Tables in Azure Synapse Serverless Pool

I have created a number of tables in Azure Synapse Analytics and I would like to remove the first row from each of them.
Can someone assist with code that will remove the first row from each table?
I tried the following:
DELETE TOP (1)
FROM [dbo].[MyTable]
I got the error:
DML Operations are not supported with external tables.
External tables are read-only; they are just another abstraction layer over the lake.
You can't perform DML operations on those files from Synapse.
What you could do instead:
Use REJECT_TYPE and REJECT_VALUE in the OPTIONS clause while creating the external table.
Use serverless views, so that you can filter some data out.
Re-stage the data into a distributed (dedicated pool) table using CTAS and apply the filter there.
The general rule of thumb is: only read what you need; deletion is usually slow.
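As an illustration of the view-based option, here is a hedged sketch. The names dbo.MyExternalTable, dbo.MyTableFiltered, and the SomeKey ordering column are assumptions, since a table has no inherent order and the original post doesn't say what "first row" means:

```sql
-- Hypothetical serverless view that hides the "first" row
-- (first by SomeKey, since rows have no inherent order).
CREATE VIEW dbo.MyTableFiltered
AS
SELECT t.*
FROM (
    SELECT *,
           ROW_NUMBER() OVER (ORDER BY SomeKey) AS rn
    FROM dbo.MyExternalTable
) AS t
WHERE t.rn > 1;
```

Queries then read from the view instead of the external table, so the unwanted row is filtered out without any DML.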
Best,
Onur

ADF - How should I copy table data from source Azure SQL Database to 6 other Azure SQL Databases?

We curate data in the "Dev" Azure SQL Database and then currently use RedGate's Data Compare tool to push it up to six higher Azure SQL Databases. I am trying to migrate that manual process to ADFv2 and would like to avoid copy/pasting the 10+ Copy Data activities for each database (x6), to keep it more maintainable for future changes. The static tables have some customization in the Copy Data activity, but the basic idea follows this post to perform an upsert.
How can the implementation described above be done in Azure Data Factory?
I was imagining something like the following:
Using one parameterized link service that has the server name & database name configurable to generate a dynamic connection to Azure SQL Database.
Creating a pipeline for each table's copy data activity.
Creating a master pipeline to then nest each table's pipeline in.
Using variables to loop over the different connections and passing those to the sub-pipelines' parameters.
Not sure if that is the most efficient plan or even works yet. Other ideas/suggestions?
We can't tell you whether that's the most efficient plan, but I think it is sound. Just make it work first.
As you said in the comment:
We can use dynamic pipelines - copy multiple tables in bulk with 'Lookup' & 'ForEach'. We can perform dynamic copies of your data table lists in bulk within a single pipeline. Lookup returns either the list of data or the first row of data. ForEach - @activity('Azure SQL Table lists').output.value ; @concat(item().TABLE_SCHEMA,'.',item().TABLE_NAME,'.csv'). This is efficient and cost optimized, since we are using a smaller number of activities and datasets.
Usually we would choose the same solution as you: dynamic parameters/pipelines with Lookup + ForEach activities to achieve the scenario. In a word, keep the pipeline logic strong, simple, and efficient.
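For the Lookup activity, a common pattern (a sketch, assuming the table list is read from the source database's metadata rather than a dedicated control table) is to query INFORMATION_SCHEMA and feed the result set to the ForEach:

```sql
-- Hypothetical Lookup-activity query: return the list of tables
-- to copy, one row per table, for the ForEach to iterate over.
SELECT TABLE_SCHEMA, TABLE_NAME
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_TYPE = 'BASE TABLE'
  AND TABLE_SCHEMA = 'dbo';   -- assumed schema filter
```

Each ForEach iteration can then pass item().TABLE_SCHEMA and item().TABLE_NAME to the parameterized dataset and linked service.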

SSIS : Huge Data Transfer from Source (SQL Server) to Destination (SQL Server)

Requirement :
Transfer millions of records from source (SQL Server) to destination (SQL Server).
Structure of source tables is different from destination tables.
Refresh data once per week in destination server.
Minimum amount of time for the processing.
I am looking for optimized approach using SSIS.
I was considering these options:
Create a SQL dump from the source server and import that dump into the destination server.
Copy the tables directly from the source server to the destination server.
There are lots of issues to consider here, such as whether the servers are in the same domain, on the same network, etc.
Most of the time you will not want to move millions of records as a single large chunk, but in smaller batches. An SSIS package can handle that batching logic for you, and breaking the load into iterations also makes reprocessing changes easier. Sometimes this is a reason to push changes more often rather than waiting an entire week, as smaller syncs are easier to manage with less downtime.
Another consideration is to be sure you understand your deltas and that you capture ALL of the changes. For this reason I would generally suggest using a staging table on the destination server. By moving changes to staging and then loading to the final table, you can more easily ensure that changes are applied correctly. Think of scenarios like an identity value arriving out of order, datetimes ordered incorrectly, or one chunk failing. With a staging table you don't have to rely solely on the id/date, and you can join on primary keys to look for changes.
Linked servers, as proposed by Alex K., can be a great fit, but you will need to pay close attention to a couple of things. Always run the transfer from the destination server so that it is a PULL, not a push. Linked servers are fast at querying data but poor at bulk updates/inserts. An XML column cannot be in the table at all, and you may need to set some specific properties for distributed transactions.
I have done this task both ways, and I would say SSIS gives a bit of an advantage over linked servers because of its robust error handling, threading logic, and choice of adapters (OLE DB, ODBC, etc.; they have different performance characteristics - do a search and you will find some results). But the key to minimizing processing time is to work in smaller chunks from a staging table, and if you can run more often, each run has less impact. E.g. daily loads are already ~1/7th the size of weekly ones, assuming an even daily distribution of changes.
Take 10,000,000 records changed per week:
Once weekly = 10 million
Once daily ≈ 1.4 million
Once hourly ≈ 60K records
Once every 5 minutes ≈ 5K records
And if it has to be once a week, still consider doing it in small chunks so that each insert has minimal effect on your transaction logs, lock time on the production table, etc. Be sure you never allow loading of partially staged/transferred data, otherwise identifying deltas can get messed up and you could end up missing changes.
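The chunked load described above can be sketched roughly as follows. The table names, key column, and batch size are all assumptions for illustration:

```sql
-- Hypothetical batched load from a staging table into the final table,
-- keeping each transaction (and its log/lock footprint) small.
DECLARE @BatchSize INT = 50000;
DECLARE @Moved INT = 1;

WHILE @Moved > 0
BEGIN
    -- Move one batch of not-yet-loaded rows, keyed on ID.
    INSERT INTO dbo.FinalTable (ID, Payload)
    SELECT TOP (@BatchSize) s.ID, s.Payload
    FROM staging.FinalTable AS s
    WHERE NOT EXISTS (SELECT 1 FROM dbo.FinalTable AS f WHERE f.ID = s.ID);

    SET @Moved = @@ROWCOUNT;   -- loop ends when no rows remain
END;
```

Each iteration commits independently, so a failure part-way through leaves the already-loaded batches intact and the remaining rows still in staging.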
One other thought: if this is something like a reporting instance and you have enough server resources, you could bring your entire table from production into a staging copy at the destination, then simply drop the current table and rename the staging table. This is an extreme approach and not one I generally like, but it is possible, and the actual impact on users would be very nominal.
I think SSIS is good at transferring data. My approach here:
1. Create a package with one Data Flow Task to transfer the data. If the structures of the two tables are different, that's okay - just map the columns.
2. Create a SQL Server Agent job to run your package every weekend.
Also, the Track Data Changes (SQL Server) feature is worth a look. You can configure when you want to sync data, and it performs well too.
With SQL Server versions after 2005, it has been my experience that dumping to a file and importing it is equal to or slower than transferring data directly from table to table with SSIS.
That said, and in addition to the excellent points @Matt makes, this is the usual pattern I follow for this sort of transfer.
Create a set of tables in your destination database that have the same table schemas as the tables in your source system.
I typically put these into their own database schema so their purpose is clear.
I also typically use the SSIS OLE DB Destination package's "New" button to create the tables.
Mind the square brackets on [Schema].[TableName] when editing the CREATE TABLE statement it provides.
Use SSIS Data Flow tasks to pull the data from the source to the replica tables in the destination.
This can be one package or many, depending on how many tables you're pulling over.
Create stored procedures in your destination database to transform the data into the shape it needs to be in the final tables.
Using SSIS data transformations is, almost without exception, less efficient than using server side SQL processing.
Use SSIS Execute SQL tasks to call the stored procedures.
Use parallel processing via Sequence Containers where possible to save time.
This can be one package or many, depending on how many tables you're transforming.
(Optional) If the transformations are complex, requiring intermediate data sets, you may want to create a separate Staging database schema for this step.
You will have to decide whether you want the stored procedures to land the data in your ultimate destination tables, or to have them write to intermediate tables and then move the transformed data into the final tables. Using intermediate tables minimizes downtime on the final tables, but if your transformations are simple or very fast, this may not be an issue for you.
If you use intermediate tables, you will need a package or packages to manage the final data load into the destination tables.
Depending on the number of packages all of this takes, you may want to create a Master SSIS package that will call the extraction package(s), then the transformation package(s), and then, if you use intermediate processing tables, the final load package(s).
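The stored-procedure transformation step above might look something like this sketch, called from an SSIS Execute SQL Task; all object and column names are hypothetical:

```sql
-- Hypothetical transform procedure: reshape rows from the replica
-- schema into an intermediate table in the destination's structure.
CREATE PROCEDURE etl.TransformCustomer
AS
BEGIN
    TRUNCATE TABLE dbo.Customer_Intermediate;

    INSERT INTO dbo.Customer_Intermediate (CustomerID, FullName, Region)
    SELECT
        r.cust_id,
        CONCAT(r.first_name, ' ', r.last_name),  -- reshape to the new structure
        r.region_code
    FROM replica.customers AS r;
END;
```

A final load package (or another procedure) would then move the rows from dbo.Customer_Intermediate into the destination table, keeping lock time on the final table short.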

How to manage Schema Drift while streaming to BigQuery sharded table

We are new to BigQuery and are trying to figure out the best way to use it for real-time analytics. We are sending a stream of logs from our back-end services to Kafka, and we want to stream those into BigQuery using streaming inserts. For queryability we are both partitioning by time and sharding tables by event type (for use with wildcard queries). We put a view on top of the family of tables so that they look like one table, and use the _TABLE_SUFFIX (well, once they roll out the feature; for now, UNION ALL) and _PARTITIONTIME columns to reduce the set of rows scanned by queries. So far so good.
What we are unsure of how to handle properly is schema changes. The schema of our log messages changes frequently. Having a manual process to keep BigQuery in sync is not tenable. Ideally our streaming pipeline would detect the change and apply the schema update (for adding columns) or table creation (for adding an event type) as necessary. We have tooling up-stream so that we know all schema updates will be backwards compatible.
My understanding is that all of the shards must have the same schema. How do we apply the schema update in such a fashion that:
We don't break queries that are run during the update.
We don't break streaming inserts.
Is #1 possible? I don't believe we can atomically change the schema of all the sharded tables.
For #2 I presume we have to stop our streaming pipelines while the schema update process is occurring.
Thanks,
--Ben
Wildcard tables with _TABLE_SUFFIX are available (https://cloud.google.com/bigquery/docs/querying-wildcard-tables), and you can use them even if the schemas of the tables are different; they just need to have compatible schemas. With UNION ALL, all the tables must have the same schema, so it will not work if you're updating schemas at the same time.
Streaming inserts will also work if you only specify a subset of fields. However, you cannot add new fields as part of a streaming insert; you'll have to update the table first and then insert the data with the new schema.
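A minimal sketch of the wildcard pattern described above, with a hypothetical project/dataset and event-table family (events_login, events_purchase, etc.); the date literal is just for illustration:

```sql
-- Hypothetical wildcard query over event-type-sharded tables;
-- _TABLE_SUFFIX prunes which shards are scanned, and
-- _PARTITIONTIME prunes partitions within each shard.
SELECT
  _TABLE_SUFFIX AS event_type,
  COUNT(*) AS event_count
FROM `my-project.logs.events_*`
WHERE _TABLE_SUFFIX IN ('login', 'purchase')
  AND _PARTITIONTIME >= TIMESTAMP('2024-01-01')
GROUP BY event_type;
```

Because the wildcard only requires compatible (not identical) schemas, a query like this keeps working while new columns are being rolled out across shards.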

SSIS delete duplicate data

I have a problem making an SSIS package. Can anyone help?
Here is the case:
I have two tables: A & B, and the structure is the same. But they are stored on different servers.
I have already made an SSIS package to transfer data from A to B (about one million rows at a time, which takes between one and two minutes).
After that, I want to delete from table A the data that has already been transferred to B. My SSIS package uses a Merge Join and a Conditional Split to select the matching rows.
It then uses an OLE DB Command to delete table A's data (just the SQL command "Delete RE_FormTo Where ID=?"). It works, but it is too slow! It took about one hour to delete the duplicate data. Does anyone know a more efficient way of doing this?
The execution is bound to be slow because of the SSIS package design.
Kindly refer to the document Best Practices of SSIS Design.
Let me explain the mistakes in your package:
1. You are using a blocking transformation (the Sort component). These transformations don't reuse the input buffer; they create a new buffer for their output, and they are mostly slower than synchronous components such as Lookup, Derived Column, etc., which reuse the input buffer.
As per MSDN:
Do not sort within Integration Services unless it is absolutely necessary. In order to perform a sort, Integration Services allocates the memory space of the entire data set that needs to be transformed. If possible, presort the data before it goes into the pipeline. If you must sort data, try your best to sort only small data sets in the pipeline. Instead of using Integration Services for sorting, use an SQL statement with ORDER BY to sort large data sets in the database - mark the output as sorted by changing the Integration Services pipeline metadata on the data source.
2. Merge Join is a semi-blocking transformation, which also hampers performance, though much less than a fully blocking transformation.
There are two ways you can solve the issue:
Use a Lookup.
Use an Execute SQL Task and write a MERGE statement:
DECLARE @T TABLE(ID INT);

-- Delete rows from #TableA that match #TableB, capturing the deleted IDs.
MERGE #TableA AS target
USING #TableB AS source
    ON target.ID = source.ID
WHEN MATCHED THEN
    DELETE
OUTPUT source.ID INTO @T;

-- @T now holds the IDs of the deleted rows, if you need them for logging.
After joining both tables, just insert a Sort transformation; it can remove the duplicates:
http://sqlblog.com/blogs/jamie_thomson/archive/2009/11/12/sort-transform-arbitration-ssis.aspx