BigQuery fails on parsing dates in M/D/YYYY format from CSV file

Problem
I'm attempting to create a BigQuery table from a CSV file in Google Cloud Storage.
I'm explicitly defining the schema for the load job (below) and setting the number of header rows to skip to 1.
Data
$ cat date_formatting_test.csv
id,shipped,name
0,1/10/2019,ryan
1,2/1/2019,blah
2,10/1/2013,asdf
Schema
id:INTEGER,
shipped:DATE,
name:STRING
Error
BigQuery produces the following error:
Error while reading data, error message: Could not parse '1/10/2019' as date for field shipped (position 1) starting at location 17
Questions
I understand that this date isn't in ISO format (2019-01-10), which I'm assuming will work.
However, I'm trying to define a more flexible input configuration whereby BigQuery will correctly load any date that the average American would consider valid.
Is there a way to specify the expected date format(s)?
Is there a separate configuration / setting to allow me to successfully load the provided CSV in with the schema defined as-is?

According to the listed limitations:
When you load CSV or JSON data, values in DATE columns must use the dash (-) separator and the date must be in the following format: YYYY-MM-DD (year-month-day).
So this leaves us with 2 options:
Option 1: ETL
Place new CSV files in Google Cloud Storage
That in turn triggers a Google Cloud Function or Google Cloud Composer job to:
Edit the date column in all the CSV files
Save the edited files back to Google Cloud Storage
Load the modified CSV files into Google BigQuery
Option 2: ELT
Load the CSV file as-is to BigQuery (i.e. your schema should be modified to shipped:STRING)
Create a BigQuery view that transforms the shipped field from a string to a recognised date format, using SELECT id, PARSE_DATE('%m/%d/%Y', shipped) AS shipped, name (a fuller sketch follows below)
Use that view for your analysis
I'm not sure, from your description, if this is a once-off job or recurring. If it's once-off, I'd go with Option 2 as it requires the least effort. Option 1 requires a bit more effort, and would only be worth it for recurring jobs.
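For Option 2, here is a minimal sketch of that view, assuming the CSV was loaded into a table named mydataset.shipments_raw with shipped declared as STRING (the dataset and table names are placeholders):
-- View that parses the M/D/YYYY strings into proper DATE values
CREATE OR REPLACE VIEW mydataset.shipments AS
SELECT
  id,
  PARSE_DATE('%m/%d/%Y', shipped) AS shipped,  -- e.g. '1/10/2019' becomes 2019-01-10
  name
FROM mydataset.shipments_raw;
Any analysis can then select from mydataset.shipments as if shipped were a native DATE column.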

Related

Connecting Tranco Google BigQuery with Metabase

I am trying to connect a third-party ranking management system (https://tranco-list.eu/) with Metabase. Tranco offers the option to view its records on Google BigQuery, but when I try to connect Tranco to Metabase it asks for a dataset from my Google Cloud console project. Tranco is an external data source, and I don't have access to its dataset ID.
If you want to get the Tranco results in Google BigQuery, run the query below:
select * from `tranco.daily.daily` where domain ='google.com' limit 10
When I search for Tranco in the public datasets, I can't find it there either. Is anyone aware of how to add a third-party dataset to a Google Cloud project?
Thanks in advance.
Unfortunately, you can't read the Tranco dataset directly from BigQuery; what you can do is load the CSV data from Tranco into a Cloud Storage bucket and then read that bucket from BigQuery.
When you load data from Cloud Storage into a BigQuery table, the dataset that contains the table must be in the same regional or multi-regional location as the Cloud Storage bucket.
Note that it has the following limitations:
CSV files do not support nested or repeated data.
Remove byte order mark (BOM) characters. They might cause unexpected issues.
If you use gzip compression, BigQuery cannot read the data in parallel. Loading compressed CSV data into BigQuery is slower than loading uncompressed data.
You cannot include both compressed and uncompressed files in the same load job.
The maximum size for a gzip file is 4 GB.
When you load CSV or JSON data, values in DATE columns must use the dash (-) separator and the date must be in the following format: YYYY-MM-DD (year-month-day).
When you load JSON or CSV data, values in TIMESTAMP columns must use a dash (-) separator for the date portion of the timestamp, and the date must be in the following format: YYYY-MM-DD (year-month-day). The hh:mm:ss (hour-minute-second) portion of the timestamp must use a colon (:) separator.
Also, you can check the documentation if you don't know how to upload and read your CSV data.
The next link is a step-by-step guide on how you can create or select the bucket you will use.
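Once the Tranco CSV is in your bucket, one way to read it from BigQuery is an external table over the bucket. The sketch below assumes a bucket path of gs://my-tranco-bucket/top-1m.csv and the usual Tranco columns of rank and domain (adjust the names and paths to your setup):
-- External table that reads the CSV directly from Cloud Storage
CREATE EXTERNAL TABLE mydataset.tranco_list (
  rank INT64,
  domain STRING
)
OPTIONS (
  format = 'CSV',
  uris = ['gs://my-tranco-bucket/top-1m.csv']
);
-- The external table can then be queried like any other table
SELECT rank, domain FROM mydataset.tranco_list WHERE domain = 'google.com';
Alternatively, you can run a load job from the bucket into a native table, subject to the limitations quoted above.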

Trying to upload a CSV into Snowflake and the date column isn't recognized

The column contains dates in the following format:
dd/mm/yyyy
I've tried DATE, TIMESTAMP, etc., but whatever I do, I can't seem to upload the file.
In the classic UI you can click on the Load Table button and follow the dialogs to upload your file. It is a bit hard to find: click on Databases to the right of the big Snowflake icon, then select a database and a table, and you should see the button. In the wizard there will be a step for defining the 'File Format'. There, you have to scroll down to define the date format (see the classic Snowflake UI).
If you don't use the classic UI, you have to install SnowSQL on your device first (https://docs.snowflake.com/en/user-guide/snowsql-install-config.html).
Start SnowSQL and apply the following steps:
Use the database you want to upload the file to. You need various privileges for creating a stage, a file format, and a table. E.g. USE TEST_DB;
Create the file format you want to use for uploading your CSV file. E.g.
CREATE FILE FORMAT "TEST_DB"."PUBLIC".MY_FILE_FORMAT TYPE = 'CSV' DATE_FORMAT = 'dd/mm/yyyy';
Create a stage using this file format:
CREATE STAGE MY_STAGE file_format = "TEST_DB"."PUBLIC".MY_FILE_FORMAT;
Now you can put your file into this stage:
PUT file://<file_path>/file.csv @MY_STAGE;
You can check the upload with
SELECT d.$1, ..., d.$N FROM @MY_STAGE/file.csv d;
Then, create your table.
Copy the content from your stage to your table. If you want to transform your data at this point, you have to use an inner select (a sketch follows below). If not, the following command is enough:
COPY INTO mycsvtable FROM @MY_STAGE/file.csv;
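If you do want to transform during the copy, here is a minimal sketch of the inner-select variant, assuming the file has three columns in the order id, date (dd/mm/yyyy), name:
COPY INTO mycsvtable FROM (
  SELECT
    d.$1,                          -- id
    TO_DATE(d.$2, 'dd/mm/yyyy'),   -- parse the date column while loading
    d.$3                           -- name
  FROM @MY_STAGE/file.csv d
);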
You can find documentation for configuring the fileupload at https://docs.snowflake.com/en/sql-reference/sql/create-file-format.html
You can find documentation for configuring the stage at https://docs.snowflake.com/en/sql-reference/sql/create-stage.html
You can find documentation for copying the staged file into a table https://docs.snowflake.com/en/sql-reference/sql/copy-into-table.html
I recommend that you upload your file with automatic date detection disabled, or that your initial table has a string column instead of a date column. IMHO, it is easier to transform your upload afterwards using the TRY_TO_DATE function, which makes it much easier to handle possible parsing errors.
e.g. SELECT try_to_date(date_column, 'dd/mm/yyyy') as MY_DATE, IFNULL(MY_DATE, date_column) as MY_DATE_NOT_PARSABLE FROM upload;
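Building on that, a minimal sketch of materializing the cleaned result into a typed table (the table and column names are placeholders):
CREATE OR REPLACE TABLE upload_typed AS
SELECT
  TRY_TO_DATE(date_column, 'dd/mm/yyyy') AS my_date,  -- NULL where the string could not be parsed
  date_column AS my_date_raw                          -- keep the original string for inspection
FROM upload;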
You can see that there is quite a lot to do to load a simple CSV file into Snowflake. It becomes even more complicated when you take into account that every step can cause specific failures and that your file might contain erroneous lines. This is why my team and I are working at Datameer to make these types of tasks easier. We aim for a simple drag-and-drop solution that does most of the work for you. We would be happy if you would try it out here: https://www.datameer.com/upload-csv-to-snowflake/

How can I load data into snowflake from S3 whilst specifying data types

I'm aware that it's possible to load data from files in S3 (e.g. CSV, Parquet, or JSON) into Snowflake by creating an external stage with file format type CSV and then loading it into a table with one column of type VARIANT. But this needs a manual step to cast the data into the correct types in order to create a view which can be used for analysis.
Is there a way to automate this loading process from S3 so the table column data types is either inferred from the CSV file or specified elsewhere by some other means? (similar to how a table can be created in Google BigQuery from csv files in GCS with inferred table schema)
As of today, the single VARIANT column solution you are adopting is the closest you can get with Snowflake's out-of-the-box tools to achieve your goal, which, as I understand from your question, is to let the loading process infer the source file structure.
In fact, the COPY command needs to know, through a FILE_FORMAT, the structure of the file it is going to load data from.
More details: https://docs.snowflake.com/en/user-guide/data-load-s3-copy.html#loading-your-data
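For completeness, here is a minimal sketch of the explicit route, where the column types come from the target table you define and from the FILE_FORMAT rather than from inference (the stage, table, column names, and credentials are placeholder assumptions):
-- CSV file format and an external stage pointing at the S3 location
CREATE FILE FORMAT my_csv_format TYPE = 'CSV' SKIP_HEADER = 1;
CREATE STAGE my_s3_stage
  URL = 's3://my-bucket/path/'
  CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...')  -- placeholder credentials
  FILE_FORMAT = (FORMAT_NAME = 'my_csv_format');
-- The target table carries the desired column types
CREATE TABLE my_table (id INTEGER, shipped DATE, name STRING);
-- COPY loads the staged files into the typed columns
COPY INTO my_table FROM @my_s3_stage;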

Automatic ETL data before loading to Bigquery

I have CSV files added to a GCS bucket daily or weekly, and each file name contains a date and a specific parameter.
The files contain the schema columns (id + name), and we need to automatically load/ingest these files into a BigQuery table so that the final table has 4 columns (id, name, date, specific parameter).
We have tried Dataflow templates, but we couldn't get the date and specific parameter from the file name into the Dataflow pipeline.
We also tried a Cloud Function (we can get the date and specific parameter values from the file name), but we couldn't add them as columns during ingestion.
Any suggestions?
Disclaimer: I have authored an article about this kind of problem using Cloud Workflows, for when you want to extract parts of the filename to use in the table definition later.
We will create a Cloud Workflow to load data from Google Storage into BigQuery. This linked article is a complete guide on how to work with workflows, connecting any Google Cloud APIs, working with subworkflows, arrays, extracting segments, and calling BigQuery load jobs.
Let’s assume we have all our source files in Google Storage. Files are organized in buckets, folders, and could be versioned.
Our workflow definition will have multiple steps.
(1) We will start by using the GCS API to list files in a bucket, by using a folder as a filter.
(2) For each file, we will then use parts of the filename in the generated BigQuery table name.
(3) The workflow’s last step will be to load the GCS file into the indicated BigQuery table.
We are going to use BigQuery query syntax to parse and extract the segments from the URL and return them as a single row result. This way we will have an intermediate lesson on how to query from BigQuery and process the results.
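As an illustration of that step, here is a minimal sketch of extracting the two segments with BigQuery SQL, assuming a hypothetical filename layout like gs://my-bucket/report_20230115_paramA.csv (base name, date, and parameter separated by underscores):
SELECT
  parts[OFFSET(1)] AS file_date,           -- '20230115'
  parts[OFFSET(2)] AS specific_parameter   -- 'paramA'
FROM (
  SELECT SPLIT(REGEXP_EXTRACT('gs://my-bucket/report_20230115_paramA.csv', r'([^/]+)\.csv$'), '_') AS parts
);
The workflow passes the real object name into a query like this and reads the two values back from the single-row result.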
Full article with lots of Code Samples is here: Using Cloud Workflows to load Cloud Storage files into BigQuery

How to export AVRO files from a BigQuery table with a DATE column and load it again to BigQuery

For moving data from a BigQuery (BQ) table that resides in the US, I want to export the table to a Cloud Storage (GCS) bucket in the US, copy it to an EU bucket, and from there import it again.
The problem is that AVRO does not support DATE types, but it is crucial to us as we are using the new partitioning feature that is not relying on ingestion time, but a column in the table itself.
The AVRO files contain the DATE column as a STRING, and therefore a "Field date has changed type from DATE to STRING" error is thrown when trying to load the files via bq load.
There has been a similar question, but it is about timestamps - in my case it absolutely needs to be a DATE as dates don't carry timezone information and timestamps are always interpreted in UTC by BQ.
It works when using NEWLINE_DELIMITED_JSON, but is it possible to make this work with AVRO files?
As @ElliottBrossard pointed out in the comments, there's a public feature request regarding this where it's possible to sign up for the whitelist.