How can I load data into Snowflake from S3 whilst specifying data types

I'm aware that it's possible to load data from files in S3 (e.g. CSV, Parquet or JSON) into Snowflake by creating an external stage with a file format of type CSV and then loading it into a table with one column of type VARIANT. But this requires a manual step to cast the data into the correct types and create a view that can be used for analysis.
Is there a way to automate this loading process from S3 so the table column data types are either inferred from the CSV file or specified by some other means? (Similar to how a table can be created in Google BigQuery from CSV files in GCS with an inferred table schema.)

As of today, the single VARIANT column solution you are adopting is the closest you can get with Snowflake's out-of-the-box tools to achieving your goal, which, as I understand your question, is to let the loading process infer the source file structure.
In fact, the COPY command needs to know the structure of the file it is going to load, which it learns through the FILE_FORMAT option.
More details: https://docs.snowflake.com/en/user-guide/data-load-s3-copy.html#loading-your-data
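To make that concrete, here is a minimal sketch of the VARIANT-column pattern driven from Python with the snowflake-connector-python package. All connection details and object names (my_s3_stage, raw_events, events_v, the v:... field paths) are placeholders rather than anything from your setup, and the sketch assumes JSON files on the stage:

# Minimal sketch: land semi-structured S3 data in a single VARIANT column,
# then expose typed columns through a view. All names are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password",
    warehouse="my_wh", database="my_db", schema="public",
)
cur = conn.cursor()

# One VARIANT column receives whatever the file format parses out.
cur.execute("CREATE TABLE IF NOT EXISTS raw_events (v VARIANT)")

# COPY still needs a FILE_FORMAT so it knows how to read the staged files.
cur.execute("""
    COPY INTO raw_events
    FROM @my_s3_stage
    FILE_FORMAT = (TYPE = JSON)
""")

# The manual typing step: cast VARIANT paths into proper column types in a view.
cur.execute("""
    CREATE OR REPLACE VIEW events_v AS
    SELECT v:id::NUMBER                AS id,
           v:created_at::TIMESTAMP_NTZ AS created_at,
           v:payload::STRING           AS payload
    FROM raw_events
""")
cur.close()
conn.close()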

Related

bq load command to load a parquet file from GCS to BigQuery when column names start with a number

I am loading a parquet file into BigQuery using the bq load command. My parquet file contains column names that start with a number (e.g. 00_abc, 01_xyz). Since BigQuery doesn't support column names starting with a number, I have created columns in BigQuery such as _00_abc, _01_xyz.
But I am unable to load the parquet file into BigQuery using the bq load command.
Is there any way to tell the bq load command that the source column 00_abc (from the parquet file) should load into the target column _00_abc (in BigQuery)?
Thanks in advance.
Regards,
Gouranga Basak
It's general best practice to not start a Parquet column name with a number. You will experience compatibility issues with more than just bq load. For example, many Parquet readers use the parquet-avro library, and Avro's documentation says:
The name portion of a fullname, record field names, and enum symbols must:
start with [A-Za-z_]
subsequently contain only [A-Za-z0-9_]
The solution here is to rename the column in the Parquet file. Depending on how much control you have over the Parquet file's creation, you may need to write a Cloud Function to rename the columns (Pandas Dataframes won't complain about your column names).
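If you can add a preprocessing step, the rename itself is only a few lines of pandas; the file paths below are illustrative, and a Cloud Function would read from and write back to Cloud Storage instead:

# Illustrative sketch: rewrite a Parquet file with Avro-safe column names
# by prefixing an underscore when a name starts with a digit.
import re
import pandas as pd

df = pd.read_parquet("input.parquet")          # e.g. columns 00_abc, 01_xyz
df = df.rename(columns=lambda c: "_" + c if re.match(r"\d", c) else c)
df.to_parquet("renamed.parquet", index=False)  # now _00_abc, _01_xyz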

BigQuery fails on parsing dates in M/D/YYYY format from CSV file

Problem
I'm attempting to create a BigQuery table from a CSV file in Google Cloud Storage.
I'm explicitly defining the schema for the load job (below) and set header rows to skip = 1.
Data
$ cat date_formatting_test.csv
id,shipped,name
0,1/10/2019,ryan
1,2/1/2019,blah
2,10/1/2013,asdf
Schema
id:INTEGER,
shipped:DATE,
name:STRING
Error
BigQuery produces the following error:
Error while reading data, error message: Could not parse '1/10/2019' as date for field shipped (position 1) starting at location 17
Questions
I understand that this date isn't in ISO format (2019-01-10), which I'm assuming will work.
However, I'm trying to define a more flexible input configuration whereby BigQuery will correctly load any date that the average American would consider valid.
Is there a way to specify the expected date format(s)?
Is there a separate configuration / setting to allow me to successfully load the provided CSV in with the schema defined as-is?
According to the listed limitations:
When you load CSV or JSON data, values in DATE columns must use the dash (-) separator and the date must be in the following format: YYYY-MM-DD (year-month-day).
So this leaves us with 2 options:
Option 1: ETL
Place new CSV files in Google Cloud Storage
That in turn triggers a Google Cloud Function or Google Cloud Composer job to:
Edit the date column in all the CSV files (a small sketch of this step follows the list)
Save the edited files back to Google Cloud Storage
Load the modified CSV files into Google BigQuery
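The date-editing step could be as small as the following pandas sketch; the local paths are illustrative, and a Cloud Function or Composer task would read from and write back to Cloud Storage instead:

# Sketch of the "edit the date column" step: rewrite M/D/YYYY as YYYY-MM-DD.
# Paths are illustrative; the column name comes from the CSV shown above.
import pandas as pd

df = pd.read_csv("date_formatting_test.csv")
df["shipped"] = pd.to_datetime(df["shipped"], format="%m/%d/%Y").dt.strftime("%Y-%m-%d")
df.to_csv("date_formatting_test_fixed.csv", index=False)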
Option 2: ELT
Load the CSV file as-is to BigQuery (i.e. your schema should be modified to shipped:STRING)
Create a BigQuery view that transforms the shipped field from a string into a recognised date format, e.g. SELECT id, PARSE_DATE('%m/%d/%Y', shipped) AS shipped, name (a fuller sketch follows below)
Use that view for your analysis
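For reference, creating that view through the google-cloud-bigquery Python client could look like the sketch below; the project, dataset and table names are placeholders:

# Sketch: create the transforming view with the BigQuery Python client.
# Project, dataset and table names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()
view = bigquery.Table("my-project.my_dataset.shipments_view")
view.view_query = """
    SELECT id,
           PARSE_DATE('%m/%d/%Y', shipped) AS shipped,
           name
    FROM `my-project.my_dataset.shipments_raw`
"""
client.create_table(view)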
I'm not sure, from your description, if this is a once-off job or recurring. If it's once-off, I'd go with Option 2 as it requires the least effort. Option 1 requires a bit more effort, and would only be worth it for recurring jobs.

How to auto detect schema from file in GCS and load to BigQuery?

I'm trying to load a file from GCS into BigQuery with the schema auto-detected from the file in GCS. I'm using Apache Airflow to do this. The problem I'm having is that with schema auto-detection, BigQuery builds the schema from roughly the first 100 values.
For example, in my case there is a column, say X; the values in X are mostly integers, but some values are strings, so bq load fails with a schema mismatch. In such a scenario the data type needs to be changed to STRING.
What I could do is manually create a new table by generating the schema on my own. Or I could set the max_bad_records value to, say, 50, but that doesn't seem like a good solution. An ideal solution would be like this:
Try to load the file from GCS to BigQuery, if the table was created successfully in BQ without any data mismatch, then I don't need to do anything.
Otherwise I need to be able to update the schema dynamically and complete the table creation.
As you cannot change a column's type in BigQuery (see this link):
BigQuery natively supports the following schema modifications:
* Adding columns to a schema definition
* Relaxing a column's mode from REQUIRED to NULLABLE
All other schema modifications are unsupported and require manual workarounds
So, as a workaround, I suggest:
Use --max_rows_per_request = 1 in your script.
Use the one line that best suits your case, with the optimal field types.
This will create the table with the correct schema and one row, and from there you can load the rest of the data.
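If you would rather not rely on the one-row trick, the same outcome (a table with the field types you want) can be reached by creating the table with an explicit schema and loading with auto-detection off, for example via the google-cloud-bigquery Python client; every name and the GCS URI below are placeholders:

# Sketch: create the table with an explicit schema, then load the file into it
# without schema auto-detection. Names and the GCS URI are placeholders.
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.my_table"

schema = [
    bigquery.SchemaField("X", "STRING"),       # the mixed integer/string column
    bigquery.SchemaField("other", "INTEGER"),
]
client.create_table(bigquery.Table(table_id, schema=schema), exists_ok=True)

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    schema=schema,          # explicit schema instead of autodetect
)
client.load_table_from_uri(
    "gs://my-bucket/my_file.csv", table_id, job_config=job_config
).result()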

Databricks - save changes back to Data Lake (ADLS Gen2)

I have legacy data stored as CSV in an Azure Data Lake Gen2 storage account. I'm able to connect to this and interrogate it using Databricks. I have a requirement to remove certain records once their retention period expires, or when a GDPR "right to be forgotten" request needs applying to the data.
Using Delta I can load a CSV into a Delta table and use SQL to locate and delete the required rows, but what is the best way to save these changes? Ideally back to the original file, so that the data is removed from the original. I've used the LOCATION option when creating the Delta table to persist the generated Parquet files to the Data Lake, but it would be nice to keep it in the original CSV format.
Any advice appreciated.
I'd be careful here. Right to be forgotten means you need to delete the data. Delta doesn't actually delete it from the original file (initially at least) - this will only happen once the data is vacuumed.
The safest way to delete data is to read all the data into a dataframe, filter out the records you do not want, and then write it back using overwrite. This ensures the data is removed and the same structure is re-written.
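A minimal PySpark sketch of that read-filter-rewrite cycle; the paths and the filter predicate are placeholders:

# Sketch: read the legacy CSV, drop the records to be forgotten, write the
# filtered data out again. Paths and the predicate are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

src = "abfss://container@account.dfs.core.windows.net/legacy/data"        # placeholder
dst = "abfss://container@account.dfs.core.windows.net/legacy/data_clean"  # placeholder

df = spark.read.option("header", "true").csv(src)
kept = df.filter(df["customer_id"] != "12345")  # placeholder predicate

# Writing to a new folder and swapping it in afterwards avoids overwriting a
# path that is still being read; in Databricks the swap can be done with dbutils.fs.
kept.write.mode("overwrite").option("header", "true").csv(dst)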
Convert Parquet to CSV in ADF
The versioned parquet files created in the ADLS Gen2 location can be converted to CSV using the Copy Data task in an Azure Data Factory pipeline.
So, you could read the CSV data into a Delta table (with LOCATION pointing to a Data Lake folder), perform the required changes using SQL, and then convert the parquet files to CSV format using ADF.
I have tried this and it works. The only hurdle might be detecting the column headers while reading the CSV file to Delta. You could read it to a dataframe and create a Delta table from it.
If you are running the delete operations periodically, then it is costly to keep the data in CSV: every time you have to read the file, convert the dataframe to Delta, query it, and finally, after filtering the records, save it back to CSV and delete the Delta table.
So my suggestion here would be: transform the CSV to Delta once, perform the deletes periodically, and generate CSV only when it's needed.
The advantage is that Delta internally stores data in Parquet, a binary format that allows better compression and encoding/decoding of the data.
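Sketched with PySpark and the Delta Lake Python API, that flow might look like the following; all paths and the delete predicate are placeholders:

# Sketch of the "convert once, delete periodically, export CSV on demand" flow.
# All paths and the delete predicate are placeholders.
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

csv_path   = "abfss://container@account.dfs.core.windows.net/legacy/data"
delta_path = "abfss://container@account.dfs.core.windows.net/legacy/data_delta"

# 1. One-off conversion from CSV to Delta.
(spark.read.option("header", "true").csv(csv_path)
      .write.format("delta").mode("overwrite").save(delta_path))

# 2. Periodic deletes run directly against the Delta table. Remember that the
#    old parquet files keep the deleted rows until the table is vacuumed.
DeltaTable.forPath(spark, delta_path).delete("customer_id = '12345'")

# 3. Export back to CSV only when a CSV copy is actually needed.
(spark.read.format("delta").load(delta_path)
      .write.mode("overwrite").option("header", "true").csv(csv_path + "_export"))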

How to load multiple CSV files into multiple tables

I have multiple CSV files in a folder.
Example:
Member.csv
Leader.csv
I need to load them into database tables.
I have worked on it using a ForEach Loop Container, Data Flow Task, Excel Source and OLE DB Destination.
We can do it by using Expressions and Precedence Constraints, but how can I do it using a Script Task if I have more than 10 files? I got stuck with this one.
We have a similar issue; our solution is a mixture of the suggestions above.
We have a number of files types sent from our client on a daily basis.
These have a specific filename pattern (e.g. SalesTransaction20160218.csv, Product20160218.csv)
Each of these file types has a staging "landing" table of the structure you expect
We then have a .net script task that takes the filename pattern and loads that data into a landing table.
There are also various checks done within the CSV parser (matching the number of columns, some basic data validation) before loading into the landing table
We are not good enough .NET programmers to be able to dynamically parse an unknown file structure, create a SQL table and then load the data in. I expect it is feasible; after all, that is what the SSIS Import/Export Wizard does (with some manual intervention)
As an alternative to this (the process is quite delicate), we are experimenting with an HDFS data landing area, which allows us to use analytic tools like R to parse the data within HDFS. After that we utilise Pig to load the data into SQL.
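Outside SSIS, the core of that routing pattern (match a filename prefix to its landing table, run a couple of structural checks, then append the rows) can be sketched in a few lines of Python; the connection string, table names and expected column counts below are all placeholders:

# Illustrative sketch of the filename-pattern -> landing-table routing described
# above. Connection string, table names and column counts are placeholders.
import glob
import os
import re
import pandas as pd
from sqlalchemy import create_engine

LANDING = {            # filename prefix -> (staging table, expected column count)
    "SalesTransaction": ("stg_sales_transaction", 12),
    "Product":          ("stg_product", 7),
}

engine = create_engine(
    "mssql+pyodbc://user:pass@server/db?driver=ODBC+Driver+17+for+SQL+Server"
)

for path in glob.glob(r"C:\incoming\*.csv"):
    match = re.match(r"([A-Za-z]+)\d{8}\.csv$", os.path.basename(path))
    if not match or match.group(1) not in LANDING:
        continue                                   # unknown file type: skip or quarantine
    table, expected_cols = LANDING[match.group(1)]

    df = pd.read_csv(path)
    if len(df.columns) != expected_cols:           # basic structural validation
        raise ValueError(f"{path}: expected {expected_cols} columns, got {len(df.columns)}")

    df.to_sql(table, engine, if_exists="append", index=False)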