Azure Synapse Serverless - SQL query to return rows in directory for each file - azure-data-lake

I have an Azure Data Lake Gen2 container in which I have several JSON files. I would like to write a query that returns a record for each file. I am not interested in parsing the files, I just want to know what files are there and have this returned in a view. Does anyone have any tips on how I might do this? Everything I have found is about how to parse/read the files... I am going to let Power BI do that, since the JSON format is not standard. In this case I just need a listing of files. Thanks!

You can use the filepath() and filename() functions in Azure Synapse Analytics serverless SQL pools to return exactly that. You can even GROUP BY them to return aggregated results. A simple example:
SELECT
    [result].filepath() AS filepath,
    [result].filename() AS filename,
    COUNT(*) AS records
FROM OPENROWSET(
        BULK 'https://azureopendatastorage.blob.core.windows.net/nyctlc/yellow/puYear=2019/puMonth=4/*.parquet',
        FORMAT = 'PARQUET'
     ) AS [result]
GROUP BY [result].filepath(), [result].filename()
See the documentation for further examples.
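If you want that listing exposed as a view, a rough sketch of one way to do it is below, created in a user database on the serverless pool. The storage account, container, and column name are placeholders, and it leans on the documented trick of reading each JSON file as a single text value (FORMAT = 'CSV' with 0x0b terminators), so the content is never actually parsed:
CREATE VIEW dbo.JsonFileListing AS
SELECT
    [result].filepath() AS filepath,   -- full URL of each file
    [result].filename() AS filename    -- file name only
FROM OPENROWSET(
        BULK 'https://<storageaccount>.dfs.core.windows.net/<container>/**',  -- ** recurses into subfolders
        FORMAT = 'CSV',
        FIELDTERMINATOR = '0x0b',      -- 0x0b terminators: read each file as one untyped value
        FIELDQUOTE = '0x0b',
        ROWTERMINATOR = '0x0b'
     ) WITH (jsonDoc NVARCHAR(MAX)) AS [result]
GROUP BY [result].filepath(), [result].filename();
Because each file comes back as a single row, the GROUP BY collapses the result to exactly one record per file.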

Related

Query data lake from Azure SQL database

I'm just finding my way around Azure, trying to build a modern data warehouse. One thing I haven't been able to figure out is how to query my data lake from an Azure SQL database.
Something similar to the following works in Azure Synapse (note the long-term plan is to remove Synapse due to cost reasons):
SELECT TOP 100 *
FROM OPENROWSET(
        BULK 'https://storageaccount.blob.core.windows.net/container/folder/2022/09/03/filename.parquet',
        SINGLE_BLOB
     ) AS [result]
But I get the following error running this from an Azure SQL database (in the Azure portal, using the query editor):
Failed to execute the query. Error: Cannot bulk load because the file "https://storageaccount.blob.core.windows.net/container/folder/2022/09/03/filename.parquet" could not be opened. Operating system error code 6(The handle is invalid.).
I also tried the code below after searching on the Internet:
CREATE EXTERNAL DATA SOURCE pocBlobStorage
WITH ( TYPE = BLOB_STORAGE,
LOCATION = 'https://storageaccount.blob.core.windows.net/container/folder/2022/09/03',
CREDENTIAL= sqlblob);
-- Query remote file
SELECT *
FROM OPENROWSET(BULK 'filename.parquet',
DATA_SOURCE = 'pocBlobStorage',
SINGLE_CLOB
--FORMATFILE='currency.fmt',
--FIRSTROW=2
--, FORMATFILE_DATA_SOURCE = 'pocBlobStorage'
) as D
I tried various combinations of the formatting options, but couldn't get anything to work.
The current error I'm getting is: Failed to execute query. Error: Referenced external data source "pocBlobStorage" not found.
I'm wondering if I need to do something to enable the Azure SQL database to access my data lake. For example, I haven't configured any credential called 'sqlblob' as per my last code segment, but I'm not sure where to do this (for example, something similar to creating a linked service in Azure Data Factory).
So how do I query my data lake directly from my Azure SQL database? Is the issue in my query, or do I need to configure access first, and if so, how?
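For reference, the "Referenced external data source ... not found" error means the data source (and the credential it points at) has to exist in the same database the query runs in. A minimal, hedged sketch of the missing credential piece, assuming SAS-based access (the password and SAS token are placeholders):
-- One-time setup in the Azure SQL database itself (not in master).
-- A master key must exist before a database-scoped credential can be created.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password>';

-- The credential name must match what CREATE EXTERNAL DATA SOURCE references (sqlblob above).
CREATE DATABASE SCOPED CREDENTIAL sqlblob
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
     SECRET = '<SAS token generated on the storage account, without the leading ?>';
With that in place, the CREATE EXTERNAL DATA SOURCE statement from the question can be re-run so the data source actually exists for OPENROWSET to reference.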

SQL Server Serverless - Openrowset - Metadata functions?

I am using Openrowset with OpenJson to query files in a data lake using a Synapse SQL Serverless database.
I have found the filename() and filepath() file metadata functions.
https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/query-specific-files#filepath
Are there more functions or is that it?
I would really like to be able to get the file modified timestamp.
File metadata such as the modified timestamp is not exposed through the serverless SQL pool functions.
You can use a Synapse Spark notebook to query the blob metadata instead. You can find more information in the BlobProperties class documentation.

Loading 50GB CSV File from Azure Blob to Azure SQL DB in Less Time - Performance

I am loading a 50GB CSV file from Azure Blob to Azure SQL DB using OPENROWSET.
It takes 7 hours to load this file.
Can you please help me with possible ways to reduce this time?
The easiest option IMHO is to just use BULK INSERT. Move the CSV file into a Blob Store and then import it directly using BULK INSERT from Azure SQL. Make sure the Azure Blob storage and Azure SQL are in the same Azure region.
To make it as fast as possible:
split the CSV into more than one file (for example using a CSV splitter; https://www.erdconcepts.com/dbtoolbox.html looks nice - never tried it, it just came up in a simple search, but it looks good)
run more than one BULK INSERT in parallel using the TABLOCK option (https://learn.microsoft.com/en-us/sql/t-sql/statements/bulk-insert-transact-sql?view=sql-server-2017#arguments). If the target table is empty, this allows multiple concurrent bulk operations in parallel; see the sketch after this list.
make sure you are using a higher SKU for the duration of the operation. Depending on the SLO (Service Level Objective) you're using (S4? P1? vCore?) you will get a different amount of log throughput, up to close to 100 MB/sec. That's the maximum speed you can actually achieve. (https://learn.microsoft.com/en-us/azure/sql-database/sql-database-resource-limits-database-server)
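A hedged sketch of what one of those parallel BULK INSERT statements might look like; the table, external data source, and file name are placeholders, and each split file would get its own statement running in its own session:
-- Assumes an external data source over the blob container (backed by a SAS-based
-- database-scoped credential) has already been created; 'ds_csv_blob' is a placeholder name.
BULK INSERT dbo.TargetTable
FROM 'splits/part_01.csv'               -- path relative to the data source LOCATION
WITH (
    DATA_SOURCE     = 'ds_csv_blob',
    FORMAT          = 'CSV',
    FIELDTERMINATOR = ',',
    ROWTERMINATOR   = '0x0a',
    FIRSTROW        = 2,                -- skip the header row
    BATCHSIZE       = 100000,
    TABLOCK                             -- needed for parallel, minimally logged loads into an empty heap
);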
Please try using Azure Data Factory.
First create the destination table on Azure SQL Database, let's call it USDJPY. After that upload the CSV to an Azure Storage Account. Now create your Azure Data Factory instance and choose Copy Data.
Next, choose "Run once now" to copy your CSV files.
Choose "Azure Blob Storage" as your "source data store", specify your Azure Storage which you stored CSV files.
Provide information about Azure Storage account.
Choose your CSV files from your Azure Storage.
Choose "Comma" as your CSV files delimiter and input "Skip line count" number if your CSV file has headers
Choose "Azure SQL Database" as your "destination data store".
Type your Azure SQL Database information.
Select your table from your SQL Database instance.
Verify the data mapping.
Execute the data copy from the CSV files to the SQL Database by confirming the remaining wizard steps.

large amount of data in Azure SQL > U-SQL

I am trying to find the best way to process, in U-SQL, a query that produces 450 million rows. I use Data Factory V2 (pipeline) with U-SQL for the transformation.
The problem: it takes an eternity to extract the data into a CSV file and to inject it back into the Azure SQL DB.
I wanted to know if it is possible to query the Azure database directly with U-SQL and afterwards inject the result directly into the Azure DB?
It would be faster than generating a CSV as input and a CSV as output...
Thanks

copy blob data into on-premise sql table

My problem statement is that I have a CSV blob and I need to import that blob into a SQL table. Is there a utility to do that?
I was thinking of one approach: first copy the blob to the on-premise SQL Server using the AzCopy utility and then import that file into the SQL table using the bcp utility. Is this the right approach? And I am looking for a 1-step solution to copy the blob into a SQL table.
Regarding your question about the availability of a utility which will import data from blob storage to a SQL Server, AFAIK there's none. You would need to write one.
Your approach seems OK to me, though you may want to write a batch file or something like that to automate the whole process. In this batch file, you would first download the file to your computer and then run the BCP utility to import the CSV into SQL Server. Other alternatives to writing a batch file are:
Do the whole thing in PowerShell.
Write some C# code that uses the storage client library to download the blob and, once the blob is downloaded, start the BCP process from your code.
To pull a blob file into an Azure SQL Server, you can use this example syntax (this actually works, I use it):
BULK INSERT MyTable
FROM 'container/folder/folder/file'
WITH ( DATA_SOURCE = 'ds_blob', BATCHSIZE = 10000, FIRSTROW = 2 );
MyTable has to have columns identical to those in the file (or it can be a view against a table that yields identical columns)
In this example, ds_blob is an external data source which needs to be created beforehand (https://learn.microsoft.com/en-us/sql/t-sql/statements/create-external-data-source-transact-sql)
The external data source needs to use a database-scoped credential, which uses a SAS key that you need to generate beforehand from blob storage (https://learn.microsoft.com/en-us/sql/t-sql/statements/create-database-scoped-credential-transact-sql).
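For illustration, a hedged sketch of that one-time setup with placeholder names and secrets; the LOCATION here points at the storage account root, which matches the FROM clause above starting with the container name (alternatively LOCATION can include the container, and the FROM path is then relative to it):
-- A master key is required once per database before a scoped credential can be created.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password>';

-- SAS-based credential; paste the SAS token without the leading '?'.
CREATE DATABASE SCOPED CREDENTIAL cred_blob_sas
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
     SECRET = '<SAS token generated on the storage account>';

-- External data source referenced by the BULK INSERT above.
CREATE EXTERNAL DATA SOURCE ds_blob
WITH ( TYPE = BLOB_STORAGE,
       LOCATION = 'https://storageaccount.blob.core.windows.net',
       CREDENTIAL = cred_blob_sas );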
The only downside to this method is that you have to know the filename beforehand - there's no way to enumerate the files from inside SQL Server.
I get around this by running PowerShell inside Azure Automation that enumerates the blobs and writes them into a queue table beforehand.