Can we join tables in an on-premises SQL Server database to Delta tables in Azure Data Lake? What are my options? - sql

I am archiving rows that are older than a year into ADLS Gen2 as Delta tables. When there is a need to report on that data, I need to join the archived data with some tables that exist in the on-premises database. Is there a way we can do the join without rehydrating data from, or hydrating data to, the cloud?

Yes, you can achieve this task by using Azure Data Factory.
Azure Data Factory (ADF) is a fully managed, serverless data integration service. Visually integrate data sources with more than 90 built-in, maintenance-free connectors at no added cost. Easily construct ETL and ELT processes code-free in an intuitive environment, or write your own code.
Firstly, you need to install the Self-hosted Integration Runtime on your local machine so that ADF can access the on-premises SQL Server. To accomplish this, refer to Connect to On-premises Data in Azure Data Factory with the Self-hosted Integration Runtime.
As you have archived the data in ADLS, check the access tier of that container: data in the Cool tier can be read directly, but blobs in the Archive tier must be rehydrated to the Cool or Hot tier before ADF can retrieve them.
Later, create a Linked Service using the Self-hosted IR you created, and create a Dataset on top of this Linked Service to access the on-premises database.
Similarly, create a Linked Service using the default Azure IR, and create a Dataset on top of it to access the data in ADLS.
Now, you also require a destination database where you will store the data after the join. If you are storing it in the same on-premises database, you can reuse the existing Linked Service, but you need to create a new Dataset pointing at the destination table.
Once all this configuration is done, create a pipeline with a Data Flow activity in ADF.
Mapping data flows are visually designed data transformations in Azure Data Factory. Data flows allow data engineers to develop data transformation logic without writing code. The resulting data flows are executed as activities within Azure Data Factory pipelines that use scaled-out Apache Spark clusters.
Learn more about Mapping data flow here.
Finally, in the Data Flow activity, your sources will be the on-premises dataset and the ADLS dataset which you created above. You will use the join transformation in the mapping data flow to combine the data from the two sources; the output stream will include all columns from both sources, matched based on a join condition.
The sink transformation will take your destination dataset, where the joined data will be stored as the output.
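For reference, the join the data flow performs is logically equivalent to the T-SQL sketch below. The schema, table and column names (dbo.Customers, archive.Orders, CustomerId and so on) are hypothetical placeholders, not part of the original setup; the archived side is simply whatever the ADLS/Delta dataset exposes to the data flow.

```sql
-- Sketch only: the logical join that the mapping data flow performs.
-- dbo.Customers  = live table from the on-premises dataset (hypothetical name)
-- archive.Orders = archived rows exposed by the ADLS/Delta dataset (hypothetical name)
SELECT  c.CustomerId,
        c.CustomerName,
        o.OrderId,
        o.OrderDate,
        o.TotalAmount
FROM    archive.Orders AS o
JOIN    dbo.Customers  AS c
        ON c.CustomerId = o.CustomerId;  -- the join condition configured in the join transformation
```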

Related

Load multiple files using Azure Data factory or Synapse

I am moving from SSIS to Azure.
We have hundreds of files and MSSQL tables that we want to push into a Gen2 data lake,
using 3 zones, then SQL Data Lake.
The zones being Raw, Staging & Presentation (change the names as you wish).
What is the best process to automate this as much as possible?
For example, build a table with the files / folders / tables to bring into the Raw zone,
then have Synapse bring these objects in either a full or incremental load,
then process them into the next 2 zones - I guess with more custom code as we progress.
Your requirement can be accomplished using multiple activities in Azure Data Factory.
To migrate the SSIS packages, you need to use the Azure-SSIS Integration Runtime (IR). ADF supports SSIS integration, which can be configured by creating a new SSIS Integration Runtime. To create one, click on Configure SSIS Integration, provide the basic details, and create the new runtime.
Refer to this third-party tutorial by SQLShack to Move local SSIS packages to Azure Data Factory.
Now, copy the data to the different zones using the Copy activity. You can make as many copies of your data as you need with Copy activities. Refer to Copy data between Azure data stores using Azure Data Factory.
ADF also supports incrementally loading data using Change Data Capture (CDC).
Note: Both Azure SQL MI and SQL Server support the Change Data Capture technology.
A tumbling window trigger and the CDC window parameters need to be configured to automate the incremental load. Check this official tutorial.
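As a rough sketch of the SQL Server side of that setup, the T-SQL below enables CDC on a hypothetical dbo.Orders source table and shows the kind of per-window change query an ADF tumbling-window pipeline would issue. The table name and window boundaries are placeholders, not part of the original answer.

```sql
-- Enable CDC for the database (run once; requires sysadmin).
EXEC sys.sp_cdc_enable_db;

-- Enable CDC for a hypothetical source table (requires db_owner).
EXEC sys.sp_cdc_enable_table
     @source_schema = N'dbo',
     @source_name   = N'Orders',
     @role_name     = NULL;   -- no gating role

-- Per tumbling window, read only the rows that changed inside that window
-- (the window start/end would come from the trigger's parameters).
DECLARE @from_lsn binary(10) = sys.fn_cdc_map_time_to_lsn('smallest greater than or equal', '2024-01-01 00:00');
DECLARE @to_lsn   binary(10) = sys.fn_cdc_map_time_to_lsn('largest less than or equal',     '2024-01-01 01:00');

SELECT *
FROM cdc.fn_cdc_get_all_changes_dbo_Orders(@from_lsn, @to_lsn, N'all');
```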
The last part:
then process them into the next 2 zones
This you need to manage programmatically, as there is no feature in ADF that will update the other copies of the data based on CDC. You need to either set up a separate CDC feed for those zones or handle it in your own transformation logic.

Excel into Azure Data Factory into SQL

I read a few threads on this but noticed most are outdated, with Excel becoming a supported integration in 2020.
I have a few Excel files stored in Dropbox. I would like to automate the extraction of that data into Azure Data Factory, perform some ETL functions with data coming from other sources, and finally push the final, complete table to Azure SQL.
I would like to ask what is the most efficient way of doing so?
Would it be on the basis of automating a Logic App to extract the xlsx files into Azure Blob, using Data Factory for the ETL and joins with other SQL tables, and finally pushing the final table to Azure SQL?
Appreciate it!
Before using a Logic App to extract the Excel files, review the known issues and limitations of the Excel connectors.
If you are importing large files using a Logic App, then depending on the size of the files, also consider this thread - logic apps vs azure functions for large files.
Just to summarize the approach, I have mentioned the steps below:
Step 1: Use an Azure Logic App to upload the Excel files from Dropbox to Blob storage.
Step 2: Create a Data Factory pipeline with a Copy data activity.
Step 3: Use the Blob storage service as the source dataset.
Step 4: Create the SQL database with the required schema (a minimal sketch is shown after these steps).
Step 5: Do the schema mapping.
Step 6: Finally, use the SQL database table as the sink.
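For Step 4, the destination schema is simply whatever your Excel columns map to. Purely as an illustration (the table and column names are invented, not taken from the question), the sink table might look like this:

```sql
-- Hypothetical sink table for the copied Excel data; adjust names and types
-- to match the columns in your workbooks and the other sources you join in.
CREATE TABLE dbo.SalesFromExcel
(
    RowId        int IDENTITY(1,1) PRIMARY KEY,
    SourceFile   nvarchar(260)  NOT NULL,   -- which xlsx the row came from
    SaleDate     date           NOT NULL,
    ProductCode  nvarchar(50)   NOT NULL,
    Quantity     int            NOT NULL,
    UnitPrice    decimal(18, 2) NOT NULL,
    LoadedAtUtc  datetime2      NOT NULL DEFAULT sysutcdatetime()
);
```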

Options for ingesting and processing data in Azure SQL

I need expert opinion on a project I am working on. We currently get data files that we load into our Azure SQL database using a local script that calls stored procedures. I am planning on replacing the script with SSIS jobs to load the data into Azure SQL, but I'm wondering if that's a good option given our needs. I am open to different suggestions too. The process we go through is to load the data file to staging tables and validate it before making updates to the live tables. The validation and updates are done by calling stored procedures, so the SSIS package will just load the data and make calls to those stored procedures. I have looked at ADF IR and Databricks, but they seem overkill; I am open to hearing from people with experience using those as well. I am currently running the SSIS package locally as well. Any suggestion on a better architecture or tools for this scenario? Thanks!
I would definitely have a look at Azure Data Factory Mapping Data Flows. With these you can easily build your ETL pipelines in the Azure Data Factory GUI.
As an example, two text files from a Blob Storage are read, joined, a surrogate key is added, and finally the data is loaded into Azure Synapse Analytics (it would be the same for Azure SQL).
You finally put this Mapping Data Flow into a pipeline and can trigger it, e.g. when new data arrives.
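The logical result of that example data flow can be pictured as the T-SQL below. The staging table names and columns are made-up placeholders; in the data flow itself you would use the Source, Join, Surrogate Key and Sink transformations rather than writing SQL.

```sql
-- Sketch of what the example data flow produces, expressed as a query.
-- staging.FileA / staging.FileB stand in for the two ingested text files.
SELECT  ROW_NUMBER() OVER (ORDER BY a.BusinessKey) AS SurrogateKey,  -- Surrogate Key transformation
        a.BusinessKey,
        a.ColumnFromFileA,
        b.ColumnFromFileB
FROM    staging.FileA AS a
JOIN    staging.FileB AS b
        ON b.BusinessKey = a.BusinessKey;   -- Join transformation
-- The Sink transformation then writes this result set to Azure Synapse Analytics or Azure SQL.
```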
You can just BULK INSERT data from Azure Blob Store:
https://learn.microsoft.com/en-us/sql/relational-databases/import-export/examples-of-bulk-access-to-data-in-azure-blob-storage?view=sql-server-ver15#accessing-data-in-a-csv-file-referencing-an-azure-blob-storage-location
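A condensed version of what that documentation walks through is shown below; the storage account, container, SAS token, file and table names are all placeholders you would replace with your own.

```sql
-- A database master key must already exist before creating the credential.
-- Credential holding a SAS token for the storage account (placeholder secret).
CREATE DATABASE SCOPED CREDENTIAL BlobCredential
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
     SECRET   = '<SAS token without the leading ?>';

-- External data source pointing at the container holding the staged files.
CREATE EXTERNAL DATA SOURCE StagingBlobStore
WITH (TYPE = BLOB_STORAGE,
      LOCATION = 'https://<storageaccount>.blob.core.windows.net/<container>',
      CREDENTIAL = BlobCredential);

-- Load a CSV straight into the staging table, then call the existing
-- validation stored procedures exactly as the local script does today.
BULK INSERT dbo.StagingTable
FROM 'incoming/datafile.csv'
WITH (DATA_SOURCE = 'StagingBlobStore',
      FORMAT = 'CSV',
      FIRSTROW = 2);
```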
Then you can use ADF (no IR) or Databricks or Azure Batch or Azure Elastic Jobs to schedule the execution.

Creating Feeds between local SQL servers and Azure SQL servers?

We want to use Azure servers to run our Power Apps applications; however, we have local SQL servers which contain our data warehouse. We want only certain tables to be on Azure, and we want to create data feeds between the two, with information going from one to the other.
Does anyone have any insight into how I can achieve this?
I have googled but there doesn't appear to be a wealth of information on this topic.
It depends on how quickly after a change in your source (the on-premises SQL Server) you need that change reflected in your sink (Azure SQL).
If you can tolerate a few minutes, or even only need to update once a day, I would suggest a basic Data Factory pipeline (search for "data factory upsert"); how exactly you achieve this depends on your data.
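Whatever the pipeline looks like, the upsert it performs on the Azure SQL side is logically a MERGE like the sketch below. The dbo.Customers and stage.Customers names and columns are hypothetical; in ADF you would normally let the copy activity or data flow sink generate the equivalent statement for you.

```sql
-- Upsert rows staged by the pipeline into the target table.
-- stage.Customers = rows extracted from the on-premises source (placeholder name).
MERGE dbo.Customers AS target
USING stage.Customers AS source
      ON target.CustomerId = source.CustomerId
WHEN MATCHED THEN
    UPDATE SET target.CustomerName = source.CustomerName,
               target.Email        = source.Email,
               target.ModifiedUtc  = source.ModifiedUtc
WHEN NOT MATCHED BY TARGET THEN
    INSERT (CustomerId, CustomerName, Email, ModifiedUtc)
    VALUES (source.CustomerId, source.CustomerName, source.Email, source.ModifiedUtc);
```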
If you need it faster, or it is impossible to extract an incremental update from your source, you would need to either use triggers to write the changes from one database to the other, or use a tool that does change data capture for you.
It looks like you just want to sync the data in some tables between the local SQL Server and an Azure SQL database.
You can use the Azure SQL Data Sync.
Summary:
SQL Data Sync is a service built on Azure SQL Database that lets you synchronize the data you select bi-directionally across multiple SQL databases and SQL Server instances.
With Data Sync, you can keep data synchronized between your on-premises databases and Azure SQL databases to enable hybrid applications.
A Sync Group has the following properties:
The Sync Schema describes which data is being synchronized.
The Sync Direction can be bi-directional or can flow in only one direction. That is, the Sync Direction can be Hub to Member, or Member to Hub, or both.
The Sync Interval describes how often synchronization occurs.
The Conflict Resolution Policy is a group-level policy, which can be Hub wins or Member wins.
Next, you need to learn how to configure Data Sync. Please reference this Azure document: Tutorial: Set up SQL Data Sync between Azure SQL Database and SQL Server on-premises.
In this tutorial, you learn how to set up Azure SQL Data Sync by creating a sync group that contains both Azure SQL Database and SQL Server instances. The sync group is custom configured and synchronizes on the schedule you set.
Hope this helps.
The most robust solution here is Transactional Replication. You can also use SSIS or Azure Data Factory for copying tables to/from Azure SQL Database. And Azure SQL Data Sync also exists.
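Very roughly, and assuming distribution is already configured on the on-premises publisher, publishing a table to an Azure SQL Database push subscriber looks something like the T-SQL below. The database, publication, table and server names are placeholders, and the full setup (snapshot agent, log reader agent, push agent) is covered in the transactional replication documentation.

```sql
-- On the on-premises publisher, with distribution already configured.
USE WarehouseDB;   -- placeholder source database
EXEC sp_replicationdboption @dbname = N'WarehouseDB', @optname = N'publish', @value = N'true';

-- Create the publication and add the table(s) that should flow to Azure.
EXEC sp_addpublication @publication = N'AzureFeed', @status = N'active';
EXEC sp_addarticle @publication   = N'AzureFeed',
                   @article       = N'DimCustomer',
                   @source_owner  = N'dbo',
                   @source_object = N'DimCustomer';

-- Azure SQL Database can only be a push subscriber.
EXEC sp_addsubscription @publication       = N'AzureFeed',
                        @subscriber        = N'yourserver.database.windows.net',
                        @destination_db    = N'AzureTargetDb',
                        @subscription_type = N'push';
```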

Real-time data synchronization from Azure database to Azure SQL Data Warehouse?

I've done a fair bit of reading, and it seems like there are a couple of off-the-shelf products that replicate/sync data from an on-premises database to Azure SQL Data Warehouse, but I've found nothing that syncs using an Azure database as the source. Azure Data Factory holds some promise; however, it looks more suited to one-off loads.
Anyone know of a way? (An SSIS package is not really an option, as I want the transfer to occur wholly inside the cloud.)
Azure Data Factory can run continuous loads from SQL Database to SQL Data Warehouse. You'll want to look into the frequency and interval parameters for the pipeline.
The documentation is here https://azure.microsoft.com/en-us/documentation/articles/data-factory-create-datasets/.