Access Azure Key Vault when using pandas to read/write Azure Data Lake Storage Gen2 data in a serverless Apache Spark pool in Synapse Analytics

Recently, Microsoft released a way for pandas to read/write Azure Data Lake Storage Gen2 data in a serverless Apache Spark pool in Synapse Analytics, as per the link below:
https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/tutorial-use-pandas-spark-pool
If I want to use the same strategy with PySpark in Azure Databricks, how can I use the data lake secret (stored in Azure Key Vault) that contains the account key so that pandas can access the data lake smoothly? That way, I don't have to expose the secret value in the Databricks notebook.

For Azure Databricks, you just need to create a secret scope backed by the Azure Key Vault; you can then use the dbutils.secrets.get function to retrieve a secret from the secret scope or inject the secrets into the Spark configuration.
Please note that you will need to set the correct Spark configuration to use that storage account key; refer to the documentation for details (Blob Storage, ADLS Gen2).
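For illustration, here is a minimal sketch of what that can look like in a Databricks notebook. The secret scope name (kv-scope), secret name (datalake-account-key), storage account, and container are all placeholders, and the plain-pandas calls assume the fsspec and adlfs packages are installed on the cluster:

```python
# Databricks notebook (Python) - minimal sketch; all names below are placeholders.
import pandas as pd

# Retrieve the storage account key from the Key Vault-backed secret scope,
# so the secret value never has to be hard-coded in the notebook.
account_key = dbutils.secrets.get(scope="kv-scope", key="datalake-account-key")

storage_account = "mystorageaccount"   # placeholder ADLS Gen2 account name
container = "mycontainer"              # placeholder container / filesystem name

# 1) Make the key available to Spark for abfss:// reads and writes.
spark.conf.set(
    f"fs.azure.account.key.{storage_account}.dfs.core.windows.net",
    account_key,
)
spark_df = spark.read.csv(
    f"abfss://{container}@{storage_account}.dfs.core.windows.net/input/data.csv",
    header=True,
)

# 2) For plain pandas, pass the key via storage_options (requires fsspec + adlfs).
pdf = pd.read_csv(
    f"abfs://{container}@{storage_account}.dfs.core.windows.net/input/data.csv",
    storage_options={"account_name": storage_account, "account_key": account_key},
)
pdf.to_parquet(
    f"abfs://{container}@{storage_account}.dfs.core.windows.net/output/data.parquet",
    storage_options={"account_name": storage_account, "account_key": account_key},
)
```

Because the key is retrieved with dbutils.secrets.get, its value is redacted in notebook output and never appears in the notebook source.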

Related

Getting data from on-prem to Azure Synapse - "unexpected metadata of Synapse Link was detected in the source database"

I'm trying to move data from table A in our on-premises database into an equivalent table A in an Azure Synapse (ASA) dedicated pool. I've set up the integration runtime and have selected my on-premises table from ASA. However, when I run a link connection I see the following error:
Failed to enable Synapse Link on the source due to 'Unexpected metadata of Synapse Link was detected in the source database.'.
Failed to disable Synapse Link on the source due to 'Failed to drop the link topic in the source database: Failed to enable Synapse Link on the source due to 'Unexpected metadata of Synapse Link was detected in the source database.'.'.
Continuous run ID: e52df111-9947-401e-97cb-4ef3f4532934
I'm expecting Table A in ASA to be populated with data from on-prem.
What does this mean? I'm very new to ASA, so I might have overlooked some setup.
I tried to reproduce the error you got and ended up with a similar error.
The main cause of the error is that your Azure Synapse workspace managed identity has no permissions to access the Azure Data Lake Storage Gen2 storage account.
As per the official Microsoft documentation:
Make sure that you've granted your Azure Synapse workspace managed identity permissions to the Azure Data Lake Storage Gen2 storage account
To grant the managed identity of the Azure Synapse workspace access to the Azure Data Lake Storage Gen2 storage account, follow the steps below (a scripted equivalent is sketched after them):
Go to your Azure Data Lake Storage Gen2 account >> Access Control (IAM) >> Add >> Add role assignment.
Then pick Storage Blob Data Contributor as the role.
Under Members, for "Assign access to", choose Managed identity.
Then select your Azure Synapse workspace as the member.
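If you prefer to script the role assignment rather than click through the portal, a rough sketch with the Azure SDK for Python (azure-identity plus a recent azure-mgmt-authorization) could look like the following; the subscription ID, resource group, storage account, and the workspace managed identity object ID are placeholders you would substitute:

```python
# Rough sketch: assign "Storage Blob Data Contributor" on the storage account
# to the Synapse workspace managed identity. All IDs and names are placeholders.
import uuid

from azure.identity import DefaultAzureCredential
from azure.mgmt.authorization import AuthorizationManagementClient
from azure.mgmt.authorization.models import RoleAssignmentCreateParameters

subscription_id = "<subscription-id>"
resource_group = "<resource-group>"
storage_account = "<adls-gen2-account-name>"
synapse_mi_object_id = "<synapse-workspace-managed-identity-object-id>"

# Scope of the assignment: the ADLS Gen2 storage account.
scope = (
    f"/subscriptions/{subscription_id}/resourceGroups/{resource_group}"
    f"/providers/Microsoft.Storage/storageAccounts/{storage_account}"
)

# Well-known built-in role definition ID for "Storage Blob Data Contributor".
role_definition_id = (
    f"/subscriptions/{subscription_id}/providers/Microsoft.Authorization/"
    "roleDefinitions/ba92f5b4-2d11-453d-a403-e96b0029c9fe"
)

client = AuthorizationManagementClient(DefaultAzureCredential(), subscription_id)
client.role_assignments.create(
    scope=scope,
    role_assignment_name=str(uuid.uuid4()),  # role assignment names are GUIDs
    parameters=RoleAssignmentCreateParameters(
        role_definition_id=role_definition_id,
        principal_id=synapse_mi_object_id,
        principal_type="ServicePrincipal",  # managed identities use this principal type
    ),
)
```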

What is the best method to sync medical images between my client PCs and my Azure Blob storage through a cloud-based web application?

What is the best method to sync medical images between my client PCs and my Azure Blob storage through a cloud-based web application? I tried to use the MS Azure Blob SDK v18, but it is not that fast. I'm looking for something like Dropbox: fast, resumable, and with efficient parallel uploading.
Solution 1:
AzCopy is a command-line tool for copying data to or from Azure Blob storage, Azure Files, and Azure Table storage, using simple commands. The commands are designed for optimal performance. Using AzCopy, you can copy data between a file system and a storage account, or between storage accounts. AzCopy can also be used to copy data from local (on-premises) storage to a storage account.
You can also create a scheduled task or cron job that runs an AzCopy command script. The script identifies and uploads new on-premises data to cloud storage at a specific time interval.
For more details, refer to this document.
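As a rough sketch of that scheduled-script idea, the snippet below shells out to AzCopy from Python so it can be wired into cron or Task Scheduler; it assumes azcopy is installed and on PATH, and the local path, container URL, and SAS token are placeholders:

```python
# Rough sketch: sync a local image folder to a blob container with AzCopy.
# Assumes azcopy is installed and on PATH; path, URL, and SAS token are placeholders.
import subprocess

LOCAL_DIR = r"C:\data\medical-images"
CONTAINER_URL = "https://<storage-account>.blob.core.windows.net/<container>"
SAS_TOKEN = "<sas-token>"  # generate a SAS with write/list permissions

def sync_to_blob() -> None:
    """Upload new or changed files only; azcopy sync skips unchanged blobs."""
    cmd = [
        "azcopy", "sync",
        LOCAL_DIR,
        f"{CONTAINER_URL}?{SAS_TOKEN}",
        "--recursive=true",
    ]
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    sync_to_blob()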
Solution 2:
Azure Data Factory is a fully managed, cloud-based, data-integration ETL service that automates the movement and transformation of data.
By using Azure Data Factory, you can create data-driven workflows to move data between on-premises and cloud data stores. And you can process and transform data with Data Flows. ADF also supports external compute engines for hand-coded transformations by using compute services such as Azure HDInsight, Azure Databricks, and the SQL Server Integration Services (SSIS) integration runtime.
Create an Azure Data Factory pipeline to transfer files between an on-premises machine and Azure Blob Storage.
For more details, refer to this thread.

How to create a linked service from Azure Analysis Services to an Azure Synapse SQL pool

How can I pull data from a cube hosted in Azure Analysis Services and load the data into the SQL pools of Synapse?
One solution is to use Azure Data Factory for data movement.
There's no built-in connector for Azure Analysis Services in Data Factory. But since Azure Analysis Services uses Azure Blob Storage to persist its storage, you can use the connector for Azure Blob Storage.
In Data Factory, use a Copy Activity with Blob Storage as source and Azure Synapse Analytics as sink.
More on Azure Data Factory here: https://learn.microsoft.com/en-us/azure/data-factory/
Available connectors in Data Factory: https://learn.microsoft.com/en-us/azure/data-factory/connector-overview

An easy-to-use tool to copy data from Amazon S3 to Azure Blob/ADLS Gen2 via Azure Data Factory

Is there any simple tool to help me copy data from Amazon S3 to Azure Blob or Azure Data Lake Gen2?
The Azure Data Factory team recently built a Storage Explorer extension, which can be used to copy data from Amazon S3 to Azure Blob or Azure Data Lake Storage Gen2 with simple drag and drop.
Check it here:
https://github.com/Azure/Azure-DataFactory/blob/main/StorageExplorerExtension/storage-explorer-plugin.md
Demo: https://www.youtube.com/watch?reload=9&v=GacGa5T0flk

Connecting storage securely to Azure Data Lake Analytics or Data Factory

I am setting up a new Azure Data Lake Analytics (ADLA) PaaS service to run U-SQL against some existing data sets in blob storage. The blob storage is firewalled for security, and when I try to add the storage account to the data sources in ADLA I get the following error. Something similar happens for Data Factory.
InvalidArgument: The Storage account '' or its accessKey is invalid.
If I disable the firewall, the storage account can be added successfully. I have tried adding the relevant Azure data center IP address ranges, but the connection still fails. I have also ticked the "Allow trusted Microsoft services" box, but this does not seem to include Data Lake or Data Factory. How do I access my storage account from ADLA while keeping it secured?
You could install a self-hosted IR to access your blob storage. Whitelist the IP of the machine hosting your self-hosted IR.