Same Parquet file, different float values when read via an AzureML Dataset vs. directly from Azure Storage as a byte stream (pandas)

A little background: we are exploring a dataset from a third-party vendor, which they have exposed to us through their Snowflake instance. I exported this dataset to our Azure Blob Storage as Parquet files, then created an AzureML dataset that wraps the Parquet files stored in our blob containers.
As I was exploring the data, I told the vendor that I saw erroneous longitude and latitude values, which they denied. I sent screenshots of the erroneous data, but when we queried the same records in Snowflake, the values were indeed correct!
I was baffled. On investigation, I narrowed it down to this: the AzureML dataset somehow returns INCORRECT data, whereas reading the Parquet file / blob directly via a stream into a pandas DataFrame returns the correct values.
Also, all text data were identical; the incorrect values only show up in FLOAT columns.
I checked the datatypes when defining the schema of the AzureML dataset, and they are correct.
Honestly, I'm stumped. Has anyone encountered this issue, and can you explain what is happening? Thank you.
SCREENSHOTS BELOW
Here is when read using the AzureML dataset
Here is the same file, same record, read directly from blob storage using bytestream
Downloaded the parquet file to my local machine and viewed it
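One way to pin down which float cells differ between the two reads is to compare them record by record. The following is only a minimal sketch, not part of the original investigation: the helper name float_mismatches and the sample records are made up for illustration, and it assumes both reads return the rows in the same order. With pandas, the records could be obtained with df.to_dict("records").

```python
import math


def float_mismatches(rows_a, rows_b, rel_tol=1e-9, abs_tol=0.0):
    """Return (row_index, column, value_a, value_b) for float cells that differ.

    Hypothetical helper: both inputs are lists of dicts in the same row order.
    """
    mismatches = []
    for index, (row_a, row_b) in enumerate(zip(rows_a, rows_b)):
        for column, value_a in row_a.items():
            value_b = row_b.get(column)
            if isinstance(value_a, float) and isinstance(value_b, float):
                # NaN never compares equal, so treat a pair of NaNs as a match.
                if math.isnan(value_a) and math.isnan(value_b):
                    continue
                if not math.isclose(value_a, value_b, rel_tol=rel_tol, abs_tol=abs_tol):
                    mismatches.append((index, column, value_a, value_b))
    return mismatches


# Made-up records standing in for the two reads of the same file.
via_dataset = [{"name": "A", "latitude": 14.5995, "longitude": 120.9851}]
via_stream = [{"name": "A", "latitude": 14.5995, "longitude": 120.9842}]

diffs = float_mismatches(via_dataset, via_stream)
print(diffs)  # [(0, 'longitude', 120.9851, 120.9842)]
```

A list of differing cells like this makes it easier to see whether the problem is limited to specific columns or rows.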

Related

Azure Data Factory creates .CSV that's incompatible with Power Query

I have a pipeline that creates a dataset from a stored procedure on Azure SQL Server.
I want to then manipulate it in a power query step within the factory, but it fails to load in the power query editor with this error.
It opens up the JSON file (to correct it, I assume) but I can't see anything wrong with it.
If I download the extract from blob and upload it again as a .csv then it works fine.
The only difference I can find is that if I upload a blob direct to storage then the file information for the blob looks like this:
If I just let ADF create the .csv in blob storage the file info looks like this:
So my assumption is that somewhere in the process in ADF that creates the .csv file it's getting something wrong, and the Power Query module can't recognise it as a valid file.
All the other parts of the pipeline (Data Flows, other datasets) recognise it fine, and the 'preview data' brings it up correctly. It's just PQ that won't read it.
Any thoughts? TIA
I reproduced the same behaviour. When data is copied from a SQL database to Blob storage as a CSV file, Power Query is unable to read it. Also, Power Query doesn't support JSON files. But when I downloaded the CSV file and re-uploaded it, it worked.
Below are the steps to overcome this issue.
When I uploaded the file to Blob storage and created the dataset for that file in Power Query, the schema was imported from connection/store. It forces us to import the schema either from connection/store or from a sample file; there is no "None" option here.
When data is copied from the SQL database to Azure Blob storage, the dataset that uses the blob storage does not have a schema imported by default.
Once the schema was imported, the Power Query activity ran successfully.
Output before importing schema in dataset
After importing schema in dataset

Connecting Tranco Google BigQuery with Metabase

I am trying to connect a third-party ranking system (https://tranco-list.eu/) with Metabase. Tranco gives us the option to view its records on Google BigQuery, but when I try to connect Tranco to Metabase, it asks for a dataset from my Google Cloud console project. Since Tranco is an external data source, I don't have access to its dataset ID.
If you want to get the Tranco results in Google BigQuery, run the query below.
select * from `tranco.daily.daily` where domain ='google.com' limit 10
When I search for Tranco among the public datasets, I don't find it there either. Does anyone know how to add a third-party dataset to our Google Cloud project?
Thanks in advance.
Unfortunately, you can’t read the Tranco dataset directly from BigQuery; what you can do is load the CSV data from Tranco into a Cloud Storage bucket and then read your bucket in BigQuery.
When you load data from Cloud Storage into a BigQuery table, the dataset that contains the table must be in the same regional or multi-regional location as the Cloud Storage bucket.
Note that this approach has the following limitations:
CSV files do not support nested or repeated data.
Remove byte order mark (BOM) characters. They might cause unexpected issues.
If you use gzip compression, BigQuery cannot read the data in parallel. Loading compressed CSV data into BigQuery is slower than loading uncompressed data.
You cannot include both compressed and uncompressed files in the same load job.
The maximum size for a gzip file is 4 GB.
When you load CSV or JSON data, values in DATE columns must use the dash (-) separator and the date must be in the following format: YYYY-MM-DD (year-month-day).
When you load JSON or CSV data, values in TIMESTAMP columns must use a dash (-) separator for the date portion of the timestamp, and the date must be in the following format: YYYY-MM-DD (year-month-day). The hh:mm:ss (hour-minute-second) portion of the timestamp must use a colon (:) separator.
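Two of the limitations above (the BOM and the DATE format) can be checked before uploading a file to the bucket. The following is a minimal sketch using only the Python standard library; the function names and the sample bytes are assumptions made for illustration and are not part of BigQuery or Tranco.

```python
import codecs
import re
from datetime import date

# Shape required for DATE values: YYYY-MM-DD with dash separators.
DATE_RE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def strip_utf8_bom(raw: bytes) -> bytes:
    """Return the bytes without a leading UTF-8 byte order mark, if present."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw


def is_dash_separated_date(value: str) -> bool:
    """Check that a value looks like YYYY-MM-DD and is a real calendar date."""
    if not DATE_RE.fullmatch(value):
        return False
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True


# Made-up sample: a CSV line that starts with a BOM.
cleaned = strip_utf8_bom(b"\xef\xbb\xbf1,google.com\n")
print(cleaned)                               # b'1,google.com\n'
print(is_dash_separated_date("2023-01-31"))  # True
print(is_dash_separated_date("2023/01/31"))  # False
```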
Also, see this documentation if you don’t know how to upload and read your CSV data.
The next link is a step-by-step guide on how you can create / select the bucket you will use.

Trouble loading data into Snowflake using Azure Data Factory

I am trying to import a small table of data from Azure SQL into Snowflake using Azure Data Factory.
Normally I do not have any issues using this approach:
https://learn.microsoft.com/en-us/azure/data-factory/connector-snowflake?tabs=data-factory#staged-copy-to-snowflake
But now I have an issue, with a source table that looks like this:
There are two columns, SLA_Processing_start_time and SLA_Processing_end_time, that have the datatype TIME.
Somehow, while the data is written to the staging area, it is changed to something like 0:08:00:00.0000000,0:17:00:00.0000000, which causes an error like:
Time '0:08:00:00.0000000' is not recognized File
The mapping looks like this:
I have tried adding a TIME_FORMAT property like 'HH24:MI:SS.FF' but that did not help.
Any ideas as to why 08:00:00 becomes 0:08:00:00.0000000, and how to avoid it?
Finally, I was able to recreate your case in my environment.
I got the same error: a leading zero appears ahead of the time (0:08:00:00.0000000).
I even grabbed the files it creates on BlobStorage and the zeros are already there.
This activity creates CSV text files without any error handling (double quotes, escape characters etc.).
And on the Snowflake side, it creates a temporary Stage and loads these files.
Unfortunately, it does not clean up after itself and leaves empty directories on BlobStorage. Additionally, you can't use ADLS Gen2. :(
This connector in ADF is not very good; I even had problems using it with an AWS environment and had to set up a Snowflake account in Azure.
I've tried a few workarounds, and it seems you have two options:
Simple solution:
Change the data type on both sides to DateTime and then transform this attribute on the Snowflake side. If you cannot change the type on the source side, you can just use the "query" option and write SELECT using the CAST / CONVERT function.
Recommended solution:
Use the Copy data activity to insert your data on BlobStorage / ADLS (this activity did it anyway) preferably in the parquet file format and a self-designed structure (Best practices for using Azure Data Lake Storage).
Create a permanent Snowflake Stage for your BlobStorage / ADLS.
Add a Lookup activity and load the data into a table from the files there; you can use a regular query or write a stored procedure and call it.
This gives you more control over what is happening, and you will build a data lake solution for your organization.
My own solution is pretty close to the accepted answer, but I still believe there is a bug in the built-in direct-to-Snowflake copy feature.
Since I could not figure out how to control the intermediate blob file that is created during a direct-to-Snowflake copy, I ended up writing a plain file to blob storage and reading it back to load it into Snowflake.
So instead of having it all in one step, I manually split it into two actions:
One action that takes the data from Azure SQL and saves it as a plain text file in blob storage.
A second action that reads the file and loads it into Snowflake.
This works, and it is supposed to be basically the same thing the direct copy to Snowflake does, hence the bug assumption.
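If the plain text file is handled in two separate steps anyway, the staged values could in principle be normalized between the steps. The sketch below is only an illustration under an assumption: it reads 0:08:00:00.0000000 as days:hours:minutes:seconds.fraction, which matches the shape of the value but is not confirmed by the thread. The function name normalize_staged_time is hypothetical.

```python
import re

# Assumed layout of the staged value: days:hours:minutes:seconds[.fraction]
STAGED_TIME_RE = re.compile(r"(\d+):(\d{2}):(\d{2}):(\d{2})(?:\.(\d+))?")


def normalize_staged_time(value: str) -> str:
    """Turn '0:08:00:00.0000000' into '08:00:00'; leave other strings unchanged."""
    match = STAGED_TIME_RE.fullmatch(value.strip())
    if match is None:
        return value
    days, hours, minutes, seconds, fraction = match.groups()
    if int(days) != 0:
        # A non-zero leading field cannot be expressed as a time of day.
        raise ValueError(f"cannot express {value!r} as a time of day")
    result = f"{hours}:{minutes}:{seconds}"
    # Keep the fractional seconds only when they carry information.
    if fraction and fraction.strip("0"):
        result += "." + fraction.rstrip("0")
    return result


print(normalize_staged_time("0:08:00:00.0000000"))  # 08:00:00
print(normalize_staged_time("0:17:00:00.0000000"))  # 17:00:00
print(normalize_staged_time("08:00:00"))            # 08:00:00
```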

How can I load data into snowflake from S3 whilst specifying data types

I'm aware that it's possible to load data from files in S3 (e.g. csv, parquet or json) into Snowflake by creating an external stage with file format type csv and then loading it into a table with 1 column of type VARIANT. But this requires a manual step to cast the data into the correct types and create a view that can be used for analysis.
Is there a way to automate this loading process from S3 so that the table column data types are either inferred from the CSV file or specified elsewhere by some other means? (Similar to how a table can be created in Google BigQuery from CSV files in GCS with an inferred table schema.)
As of today, the single Variant column solution you are adopting is the closest you can get with Snowflake out-of-the-box tools to achieve your goal which, as I understand from your question, is to let the loading process infer the source file structure.
In fact, the COPY command needs to know the structure of the expected file that it is going to load data from, through FILE_FORMAT.
More details: https://docs.snowflake.com/en/user-guide/data-load-s3-copy.html#loading-your-data
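Since the structure has to be supplied from outside, one possible workaround is to infer it from a sample of the CSV before loading. The sketch below is not a Snowflake feature; it is a naive, standard-library illustration with hypothetical function names and made-up sample data. It only distinguishes NUMBER, FLOAT and VARCHAR, and does not quote or validate column names.

```python
import csv
import io


def infer_type(values):
    """Guess a column type from sample string values (naive heuristic)."""
    non_empty = [value for value in values if value != ""]
    if not non_empty:
        return "VARCHAR"

    def all_parse(cast):
        try:
            for value in non_empty:
                cast(value)
        except ValueError:
            return False
        return True

    if all_parse(int):
        return "NUMBER"
    if all_parse(float):
        return "FLOAT"
    return "VARCHAR"


def create_table_sql(table, csv_text):
    """Build a CREATE TABLE statement from a CSV sample with a header row."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    columns = []
    for position, name in enumerate(header):
        column_type = infer_type([row[position] for row in data])
        columns.append(f"{name} {column_type}")
    return f"CREATE TABLE {table} ({', '.join(columns)})"


# Made-up sample data for illustration.
sample = "id,price,city\n1,9.5,Oslo\n2,12,Bergen\n"
ddl = create_table_sql("demo", sample)
print(ddl)  # CREATE TABLE demo (id NUMBER, price FLOAT, city VARCHAR)
```

The generated statement would still need to be reviewed, since a small sample can easily suggest the wrong type for a column.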

How do I read Athena-created Parquet tables into python

I created a table using Athena CTAS statements. Per Glue, I see that the table is stored on my s3 bucket. I further confirmed that there are files in the expected place in my s3 bucket.
These files, however, do not look like parquet files (they are extension-less). When I try to read them into python using pd.read_parquet, I get the error "Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.". A similar error occurs when I try to query the table and read the csv output using pd.read_csv. There, the error is "'utf-8' codec can't decode byte 0xee in position 0: invalid continuation byte". I tried using awswrangler and got the same errors.
I'm pretty sure these errors are related to the SSE_S3 encryption I put on the bucket. However, I'm at a loss as to how I can actually interact with these files outside of Athena.
The resolution is that the default Athena workgroup had CSE_KMS encryption turned on. I couldn't quickly figure out how to pass these options via awswrangler, so I took the shortcut of recreating the table using another workgroup that didn't have encryption.
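The "magic bytes" in the pandas error refer to the 4-byte marker PAR1 that a Parquet file carries at its start and end. A quick way to tell whether a downloaded object is a readable Parquet file, or something opaque such as a client-side encrypted blob, is to look at those bytes. This is a minimal sketch: the helper name has_parquet_magic and the two temporary files are made up for illustration, and the check only looks at the markers without validating the file.

```python
import os
import tempfile

PARQUET_MAGIC = b"PAR1"


def has_parquet_magic(path):
    """Return True if the file starts and ends with the Parquet magic bytes."""
    if os.path.getsize(path) < 2 * len(PARQUET_MAGIC):
        return False
    with open(path, "rb") as handle:
        head = handle.read(len(PARQUET_MAGIC))
        handle.seek(-len(PARQUET_MAGIC), os.SEEK_END)
        tail = handle.read(len(PARQUET_MAGIC))
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC


# Demo with two made-up files: one carrying the markers, one opaque blob.
with tempfile.TemporaryDirectory() as tmp:
    plain = os.path.join(tmp, "plain")
    opaque = os.path.join(tmp, "opaque")
    with open(plain, "wb") as handle:
        handle.write(b"PAR1" + b"\x00" * 16 + b"PAR1")
    with open(opaque, "wb") as handle:
        handle.write(b"\xee" + b"\x00" * 23)
    plain_ok = has_parquet_magic(plain)
    opaque_ok = has_parquet_magic(opaque)

print(plain_ok, opaque_ok)  # True False
```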