Azure Data Factory creates .CSV that's incompatible with Power Query - azure-data-factory-2

I have a pipeline that creates a dataset from a stored procedure on Azure SQL Server.
I want to then manipulate it in a power query step within the factory, but it fails to load in the power query editor with this error.
It opens up the JSON file (to correct it, I assume) but I can't see anything wrong with it.
If I download the extract from blob and upload it again as a .csv then it works fine.
The only difference I can find is that if I upload a blob direct to storage then the file information for the blob looks like this:
If I just let ADF create the .csv in blob storage the file info looks like this:
So my assumption is that somewhere in the process in ADF that creates the .csv file it's getting something wrong, and the Power Query module can't recognise it as a valid file.
All the other parts of the pipeline (Data Flows, other datasets) recognise it fine, and the 'preview data' brings it up correctly. It's just PQ that won't read it.
Any thoughts? TIA

I reproduced the same issue. When data is copied from the SQL database to Blob as a csv file, Power Query is unable to read it. (Power Query also doesn't support json files.) But when I downloaded the csv file and re-uploaded it, it worked.
Below are the steps to overcome this issue.
When I uploaded the file to Blob storage manually and created a dataset for it to use in Power Query, the schema was imported from the connection/store. The dataset forces you to import the schema either from the connection/store or from a sample file; there is no "None" option here.
When data is copied from the SQL database to Azure Blob by the pipeline, the dataset that points at the blob file does not have a schema imported by default.
Once I imported the schema on that dataset, the Power Query activity ran successfully.
Output before importing schema in dataset
After importing schema in dataset

Related

Same parquet file, different float values when read using an AzureML Dataset vs directly from Azure Storage using a bytestream

A little background: we are exploring a dataset provided by a third-party vendor, which they've exposed to us through their Snowflake instance. I exported this dataset into our Azure Blob Storage as parquet files, then created an AzureML dataset that encapsulates the parquet files stored in our blob containers.
As I was exploring the data, I flagged what looked like erroneous longitude and latitude values, which they denied. I sent them screenshots of the suspect records, and when we queried the same records in Snowflake, the values were indeed correct!
I was baffled, and upon investigation I narrowed it down to this: the AzureML dataset returns incorrect data, whereas reading the parquet file / blob directly via a stream into a pandas dataframe returns the correct values.
All text columns were identical; the incorrect values only show up in the float columns.
I checked the datatypes when defining the schema of the AzureML dataset, and they are correct.
Honestly, I'm stumped. Any ideas, or has anyone encountered this issue and can explain what is happening? Thank you.
Screenshots below:
Here is when read using the AzureML dataset
Here is the same file, same record, read directly from blob storage using bytestream
Downloaded the parquet file to my local machine and viewed it

Trouble loading data into Snowflake using Azure Data Factory

I am trying to import a small table of data from Azure SQL into Snowflake using Azure Data Factory.
Normally I do not have any issues using this approach:
https://learn.microsoft.com/en-us/azure/data-factory/connector-snowflake?tabs=data-factory#staged-copy-to-snowflake
But now I have an issue with a source table that looks like this:
There are two columns, SLA_Processing_start_time and SLA_Processing_end_time, that have the datatype TIME.
Somehow, while writing the data to the staging area, the values are changed to something like 0:08:00:00.0000000 and 0:17:00:00.0000000, which causes an error like:
Time '0:08:00:00.0000000' is not recognized File
The mapping looks like this:
I have tried adding a TIME_FORMAT property like 'HH24:MI:SS.FF' but that did not help.
Any ideas as to why 08:00:00 becomes 0:08:00:00.0000000, and how to avoid it?
Finally, I was able to recreate your case in my environment.
I got the same error; a leading zero appears ahead of the time (0:08:00:00.0000000).
I even grabbed the files it creates on BlobStorage and the zeros are already there.
This activity creates CSV text files without any error handling (double quotes, escape characters etc.).
And on the Snowflake side, it creates a temporary Stage and loads these files.
Unfortunately, it does not clean up after itself and leaves empty directories on BlobStorage. Additionally, you can't use ADLS Gen2. :(
This connector in ADF is not very good; I even had problems using it with an AWS-hosted Snowflake environment and had to set up a Snowflake account in Azure.
I've tried a few workarounds, and it seems you have two options:
Simple solution:
Change the data type on both sides to DateTime and then transform this attribute on the Snowflake side. If you cannot change the type on the source side, you can just use the "query" option and write a SELECT using the CAST / CONVERT function, as sketched below.
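As a rough illustration of that source-side "query" option, here is a minimal sketch, assuming a hypothetical source table dbo.SLA containing the two TIME columns from the question:

-- Hypothetical source query for the Copy activity's "query" option:
-- casting the TIME columns to DATETIME avoids the 0:08:00:00.0000000 serialization in the staged file.
SELECT
    CAST(SLA_Processing_start_time AS DATETIME) AS SLA_Processing_start_time,
    CAST(SLA_Processing_end_time AS DATETIME) AS SLA_Processing_end_time
FROM dbo.SLA;
-- On the Snowflake side, convert back afterwards, e.g. with TO_TIME() or CAST(... AS TIME).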
Recommended solution:
Use the Copy data activity to land your data in BlobStorage / ADLS (the connector does this anyway), preferably in the parquet file format and with a self-designed structure (see "Best practices for using Azure Data Lake Storage").
Create a permanent Snowflake Stage for your BlobStorage / ADLS.
Add a Lookup activity and load the data into a table from the files there; you can use a regular query or write a stored procedure and call it. A sketch of the Snowflake side follows below.
Thanks to this, you will have more control over what is happening and you will build a DataLake solution for your organization.
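To make that pattern more concrete, here is a rough sketch of the Snowflake side, with hypothetical names (my_azure_stage, MY_TABLE) and a placeholder SAS token standing in for your own:

-- Hypothetical one-time setup: a permanent stage over the BlobStorage / ADLS location.
CREATE OR REPLACE STAGE my_azure_stage
  URL = 'azure://mystorageaccount.blob.core.windows.net/mycontainer/datalake/'
  CREDENTIALS = (AZURE_SAS_TOKEN = '<sas-token>')
  FILE_FORMAT = (TYPE = PARQUET);

-- The load itself, run from a Lookup activity or wrapped in a stored procedure you call from ADF.
COPY INTO MY_TABLE
  FROM @my_azure_stage
  FILE_FORMAT = (TYPE = PARQUET)
  MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;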
My own solution is pretty close to the accepted answer, but I still believe there is a bug in the built-in direct-to-Snowflake copy feature.
Since I could not figure out how to control the intermediate blob file that is created on a direct-to-Snowflake copy, I ended up writing a plain file into blob storage and reading it again to load into Snowflake.
So instead of having it all in one step, I manually split it up into two actions:
One action that takes the data from Azure SQL and saves it as a plain text file in blob storage.
And a second action that reads the file and loads it into Snowflake.
This works, and it is supposed to be basically the same thing the direct copy to Snowflake does, hence the bug assumption.

Loading 50GB CSV File Azure Blob to Azure SQL DB in Less time- Performance

I am loading 50GB CSV file From Azure Blob to Azure SQL DB using OPENROWSET.
It takes 7 hours to load this file.
Can you please help me with possible ways to reduce this time?
The easiest option IMHO is to just use BULK INSERT. Move the csv file into Blob Storage and then import it directly using BULK INSERT from Azure SQL. Make sure Azure Blob Storage and Azure SQL are in the same Azure region.
To make it as fast as possible:
split the CSV into more than one file (for example using something like a CSV splitter; https://www.erdconcepts.com/dbtoolbox.html looks nice. I've never tried it, it just came up in a quick search, but it looks good)
run multiple BULK INSERTs in parallel using the TABLOCK option (https://learn.microsoft.com/en-us/sql/t-sql/statements/bulk-insert-transact-sql?view=sql-server-2017#arguments). If the target table is empty, this allows multiple concurrent bulk operations to run in parallel; see the sketch after this list.
make sure you are using a higher SKU for the duration of the operation. Depending on the SLO (Service Level Objective) you're using (S4? P1? vCore?) you will get a different amount of log throughput, up to close to 100 MB/sec. That's the maximum speed you can actually achieve. (https://learn.microsoft.com/en-us/azure/sql-database/sql-database-resource-limits-database-server)
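A minimal sketch of one such parallel load, assuming hypothetical names (dbo.MyBigTable, split files part_01.csv, part_02.csv, ..., and an external data source eds_blob pointing at the storage account) and a header row in each file:

-- One of several BULK INSERT sessions run in parallel; eds_blob is an external data source
-- created beforehand (CREATE EXTERNAL DATA SOURCE with TYPE = BLOB_STORAGE).
BULK INSERT dbo.MyBigTable
FROM 'input/part_01.csv'
WITH (
    DATA_SOURCE = 'eds_blob',
    FORMAT = 'CSV',
    FIRSTROW = 2,            -- skip the header row
    FIELDTERMINATOR = ',',
    TABLOCK                  -- on an empty target table this allows concurrent bulk loads
);
-- Run the same statement for part_02.csv, part_03.csv, ... in separate sessions.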
Please try using Azure Data Factory.
First, create the destination table in Azure SQL Database; let's call it USDJPY. After that, upload the CSV to an Azure Storage account. Now create your Azure Data Factory instance and choose Copy Data.
Next, choose "Run once now" to copy your CSV files.
Choose "Azure Blob Storage" as your "source data store", specify your Azure Storage which you stored CSV files.
Provide information about Azure Storage account.
Choose your CSV files from your Azure Storage.
Choose "Comma" as your CSV files delimiter and input "Skip line count" number if your CSV file has headers
Choose "Azure SQL Database" as your "destination data store".
Type your Azure SQL Database information.
Select your table from your SQL Database instance.
Verify the data mapping.
Execute the data copy from the CSV files to the SQL Database by confirming the remaining wizard steps.

Getting files and folders in the datalake while reading from datafactory

I am reading Azure SQL table data (which actually consists of paths to directories) from Azure Data Factory. Using those paths, how can I dynamically get the files from the data lake?
Can anyone tell me what I should give in the dataset?
Screenshot
You could use a Lookup activity to read the data from Azure SQL, followed by a ForEach activity. Then pass @item(). to your dataset parameter k1.

copy blob data into on-premise sql table

My problem statement is that I have a csv blob and I need to import that blob into a sql table. Is there a utility to do that?
I was thinking of one approach: first copy the blob to the on-premise sql server using the AzCopy utility, and then import that file into the sql table using the bcp utility. Is this the right approach? I am also looking for a 1-step solution to copy the blob into a sql table.
Regarding your question about the availability of a utility which will import data from blob storage to a SQL Server, AFAIK there's none. You would need to write one.
Your approach seems OK to me, though you may want to write a batch file or something like that to automate the whole process. In this batch file, you would first download the file to your computer and then run the BCP utility to import the CSV into SQL Server. Other alternatives to writing a batch file are:
Do this thing completely in PowerShell.
Write some C# code which makes use of the storage client library to download the blob and, once the blob is downloaded, start the BCP process in your code.
To pull a blob file into an Azure SQL Server, you can use this example syntax (this actually works, I use it):
-- ds_blob is an external data source created beforehand (see below); FIRSTROW = 2 skips the header row.
BULK INSERT MyTable
FROM 'container/folder/folder/file'
WITH (DATA_SOURCE = 'ds_blob', BATCHSIZE = 10000, FIRSTROW = 2);
MyTable has to have identical columns (or it can be a view against a table that yields identical columns)
In this example, ds_blob is an external data source which needs to be created beforehand (https://learn.microsoft.com/en-us/sql/t-sql/statements/create-external-data-source-transact-sql)
The external data source needs to use a database scoped credential, which uses a SAS key that you need to generate beforehand from blob storage (https://learn.microsoft.com/en-us/sql/t-sql/statements/create-database-scoped-credential-transact-sql).
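For reference, a minimal sketch of that one-time setup, with hypothetical names (BlobCredential, and ds_blob to match the example above) and placeholder secrets:

-- A database master key is required before a scoped credential can be created (skip if one already exists).
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong-password>';

-- The SAS token goes in SECRET, without the leading '?'.
CREATE DATABASE SCOPED CREDENTIAL BlobCredential
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
     SECRET = '<sas-token>';

-- LOCATION is the storage account URL; the container and path go in the BULK INSERT FROM clause.
CREATE EXTERNAL DATA SOURCE ds_blob
WITH (
    TYPE = BLOB_STORAGE,
    LOCATION = 'https://mystorageaccount.blob.core.windows.net',
    CREDENTIAL = BlobCredential
);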
The only downside to this method is that you have to know the filename beforehand; there's no way to enumerate the blobs from inside SQL Server.
I get around this by running PowerShell inside Azure Automation that enumerates the blobs and writes them into a queue table beforehand.