How to export AVRO files from a BigQuery table with a DATE column and load it again to BigQuery

For moving data from a BigQuery (BQ) table that resides in the US, I want to export the table to a Cloud Storage (GCS) bucket in the US, copy it to an EU bucket, and from there import it again.
The problem is that AVRO does not support DATE types, but the column is crucial to us, as we are using the new partitioning feature that relies not on ingestion time but on a column in the table itself.
The AVRO files contain the DATE column as a STRING, and therefore a
Field date has changed type from DATE to STRING
error is thrown when trying to load the files via bq load.
There has been a similar question, but it is about timestamps - in my case it absolutely needs to be a DATE as dates don't carry timezone information and timestamps are always interpreted in UTC by BQ.
It works when using NEWLINE_DELIMITED_JSON, but is it possible to make this work with AVRO files?

As @ElliottBrossard pointed out in the comments, there's a public feature request regarding this, where it's possible to sign up for the whitelist.
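Until that feature request lands, one workaround is the NEWLINE_DELIMITED_JSON route mentioned above, done with the Python client instead of bq. This is only a minimal sketch: the project, dataset, table and bucket names and the schema are placeholder assumptions, and it presumes the destination table should be partitioned on the DATE column.

from google.cloud import bigquery

client = bigquery.Client()

# Export the US table as newline-delimited JSON; the DATE values survive as plain strings.
extract_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON,
    compression=bigquery.Compression.GZIP,
)
client.extract_table(
    "my-project.us_dataset.my_table",
    "gs://my-us-bucket/export/my_table-*.json.gz",
    job_config=extract_config,
).result()

# ...copy the files from the US bucket to the EU bucket (e.g. gsutil cp or Storage Transfer)...

# Load into the EU dataset with an explicit schema so the column comes back as DATE
# and can drive column-based partitioning.
load_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    schema=[
        bigquery.SchemaField("id", "INTEGER"),
        bigquery.SchemaField("date", "DATE"),
    ],
    time_partitioning=bigquery.TimePartitioning(field="date"),
)
client.load_table_from_uri(
    "gs://my-eu-bucket/export/my_table-*.json.gz",
    "my-project.eu_dataset.my_table",
    job_config=load_config,
).result()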

Related

Connecting Tranco Google BigQuery with Metabase

I am trying to connect a third-party ranking management system (https://tranco-list.eu/) with Metabase. Tranco gives us the option to view the records in Google BigQuery, but when I try to connect Tranco with Metabase, it asks for a dataset from my Google Cloud console project. Tranco is an external database source, so I don't have access to a dataset ID for it.
If you want to get Tranco results in Google BigQuery, you can run the query below:
select * from `tranco.daily.daily` where domain ='google.com' limit 10
When I search for Tranco in the public datasets, I can't find it there either. Is anyone aware of how to add this third-party dataset to our Google Cloud project?
Thanks in advance.
Unfortunately, you can't read the Tranco dataset directly from BigQuery; what you can do instead is load the CSV data from Tranco into a Cloud Storage bucket and then read your bucket in BigQuery.
When you load data from Cloud Storage into a BigQuery table, the dataset that contains the table must be in the same regional or multi-regional location as the Cloud Storage bucket.
Note that it has the following limitations:
CSV files do not support nested or repeated data.
Remove byte order mark (BOM) characters; they might cause unexpected issues.
If you use gzip compression, BigQuery cannot read the data in parallel. Loading compressed CSV data into BigQuery is slower than loading uncompressed data.
You cannot include both compressed and uncompressed files in the same load job.
The maximum size for a gzip file is 4 GB.
When you load CSV or JSON data, values in DATE columns must use the dash (-) separator and the date must be in the following format: YYYY-MM-DD (year-month-day).
When you load JSON or CSV data, values in TIMESTAMP columns must use a dash (-) separator for the date portion of the timestamp, and the date must be in the following format: YYYY-MM-DD (year-month-day). The hh:mm:ss (hour-minute-second) portion of the timestamp must use a colon (:) separator.
Also, you can check the documentation if you don't know how to upload and read your CSV data.
There is also a step-by-step guide on how you can create or select the bucket you will use.
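As an illustration of that load, here is a minimal sketch with the BigQuery Python client; the bucket, dataset and table names are placeholders, and schema autodetection is assumed since the exact Tranco CSV columns may differ.

from google.cloud import bigquery

client = bigquery.Client()

# The dataset must be in the same regional or multi-regional location as the bucket.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,  # skip the header row, if the file has one
    autodetect=True,      # let BigQuery infer the column types
)
load_job = client.load_table_from_uri(
    "gs://my-bucket/tranco/top-1m.csv",
    "my-project.my_dataset.tranco_list",
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish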

How can I load data into snowflake from S3 whilst specifying data types

I'm aware that it's possible to load data from files in S3 (e.g. CSV, Parquet or JSON) into Snowflake by creating an external stage with file format type CSV and then loading it into a table with one column of type VARIANT. But this needs a manual step to cast the data into the correct types to create a view that can be used for analysis.
Is there a way to automate this loading process from S3 so the table column data types are either inferred from the CSV file or specified elsewhere by some other means? (Similar to how a table can be created in Google BigQuery from CSV files in GCS with an inferred table schema.)
As of today, the single VARIANT column solution you are adopting is the closest you can get with Snowflake's out-of-the-box tools to achieve your goal, which, as I understand from your question, is to let the loading process infer the source file structure.
In fact, the COPY command needs to know the structure of the file it is going to load data from, which it is told through FILE_FORMAT.
More details: https://docs.snowflake.com/en/user-guide/data-load-s3-copy.html#loading-your-data
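For comparison, here is a hedged sketch of the non-inferring route, where the column types are specified up front rather than inferred, using the Snowflake Python connector; the connection parameters, stage, table and column names are all assumptions.

import snowflake.connector

# Connection parameters are placeholders.
conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
    warehouse="my_wh", database="my_db", schema="public",
)
cur = conn.cursor()

# Describe the CSV layout once in a named file format...
cur.execute("""
    CREATE OR REPLACE FILE FORMAT my_csv_format
    TYPE = CSV SKIP_HEADER = 1 FIELD_OPTIONALLY_ENCLOSED_BY = '"'
""")

# ...create the target table with explicit column types...
cur.execute("""
    CREATE OR REPLACE TABLE my_table (
        id INTEGER,
        created_at TIMESTAMP_NTZ,
        name VARCHAR
    )
""")

# ...then COPY from the external stage that points at S3.
cur.execute("""
    COPY INTO my_table
    FROM @my_s3_stage/path/
    FILE_FORMAT = (FORMAT_NAME = 'my_csv_format')
""")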

Google BigQuery: Importing DATETIME fields using Avro format

I have a script that downloads data from an Oracle database and uploads it to Google BigQuery. This is done by writing to an Avro file, which is then uploaded directly using BQ's Python framework. The BigQuery tables I'm uploading the data to have predefined schemas, some of which contain DATETIME fields.
As BigQuery now has support for Avro Logical fields, import of timestamp data is no longer a problem. However, I'm still not able to import datetime fields. I tried using string, but then I got the following error:
Field CHANGED has incompatible types. Configured schema: datetime; Avro file: string.
I also tried to convert the field data to timestamps on export, but that produced an internal error in BigQuery:
An internal error occurred and the request could not be completed. Error: 3144498
Is it even possible to import datetime fields using Avro?
In Avro, logical data types must include the attribute logicalType; it is possible that this attribute is missing from your schema definition.
Here there are a couple of examples like the following one. For a date, the underlying type is int, and the logicalType attribute belongs on the type itself rather than on the field:
{
  "name": "DateField",
  "type": {"type": "int", "logicalType": "date"}
}
Once the logical data type is set, try again. The documentation does indicate it should work: the Avro logical type date is converted to the BigQuery data type DATE.
In case you get an error, it would be helpful to check the schema of your Avro file; you can use this command to obtain its details:
java -jar avro-tools-1.9.2.jar getschema my-avro-file.avro
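To make the date logical type concrete, here is a small sketch that writes such a file with the fastavro package (an assumption; your script may use a different Avro writer) and loads it with logical types enabled; the table name is a placeholder.

import datetime
from fastavro import writer, parse_schema
from google.cloud import bigquery

# Note that logicalType sits on the type itself, as in the schema above.
schema = parse_schema({
    "type": "record",
    "name": "Row",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "DateField", "type": {"type": "int", "logicalType": "date"}},
    ],
})

records = [{"id": 1, "DateField": datetime.date(2020, 1, 31)}]
with open("rows.avro", "wb") as out:
    writer(out, schema, records)  # fastavro stores the date as days since the epoch

# Load into BigQuery, asking it to honour the Avro logical types.
client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.AVRO,
    use_avro_logical_types=True,
)
with open("rows.avro", "rb") as source:
    client.load_table_from_file(
        source, "my-project.my_dataset.my_table", job_config=job_config
    ).result()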
UPDATE
For cases where DATE alone doesn't work, please consider that a TIMESTAMP can store the date and time as a number of milliseconds or microseconds from the Unix epoch, 1 January 1970 00:00:00.000000 UTC (UTC seems to be the default for Avro). Additionally, the values stored in an Avro file (of type DATE or TIMESTAMP) are independent of a particular time zone; in this sense, they are very similar to the BigQuery TIMESTAMP data type.

BigQuery fails on parsing dates in M/D/YYYY format from CSV file

Problem
I'm attempting to create a BigQuery table from a CSV file in Google Cloud Storage.
I'm explicitly defining the schema for the load job (below) and set header rows to skip = 1.
Data
$ cat date_formatting_test.csv
id,shipped,name
0,1/10/2019,ryan
1,2/1/2019,blah
2,10/1/2013,asdf
Schema
id:INTEGER,
shipped:DATE,
name:STRING
Error
BigQuery produces the following error:
Error while reading data, error message: Could not parse '1/10/2019' as date for field shipped (position 1) starting at location 17
Questions
I understand that this date isn't in ISO format (2019-01-10), which I'm assuming will work.
However, I'm trying to define a more flexible input configuration whereby BigQuery will correctly load any date that the average American would consider valid.
Is there a way to specify the expected date format(s)?
Is there a separate configuration / setting to allow me to successfully load the provided CSV in with the schema defined as-is?
According to the listed limitations:
When you load CSV or JSON data, values in DATE columns must use the dash (-) separator and the date must be in the following format: YYYY-MM-DD (year-month-day).
So this leaves us with 2 options:
Option 1: ETL
Place new CSV files in Google Cloud Storage
That in turn triggers a Google Cloud Function or Google Cloud Composer job to:
Edit the date column in all the CSV files
Save the edited files back to Google Cloud Storage
Load the modified CSV files into Google BigQuery
Option 2: ELT
Load the CSV file as-is to BigQuery (i.e. your schema should be modified to shipped:STRING)
Create a BigQuery view that transforms the shipped field from a string to a recognised date format, e.g. SELECT id, PARSE_DATE('%m/%d/%Y', shipped) AS shipped, name (see the sketch after this answer)
Use that view for your analysis
I'm not sure, from your description, if this is a once-off job or recurring. If it's once-off, I'd go with Option 2 as it requires the least effort. Option 1 requires a bit more effort, and would only be worth it for recurring jobs.
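If it helps, here is a brief sketch of Option 2 with the Python client; the project, dataset, table and bucket names are placeholders.

from google.cloud import bigquery

client = bigquery.Client()

# Load the CSV as-is, with shipped as STRING.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    schema=[
        bigquery.SchemaField("id", "INTEGER"),
        bigquery.SchemaField("shipped", "STRING"),
        bigquery.SchemaField("name", "STRING"),
    ],
)
client.load_table_from_uri(
    "gs://my-bucket/date_formatting_test.csv",
    "my-project.my_dataset.shipments_raw",
    job_config=job_config,
).result()

# Create a view that parses the M/D/YYYY string into a proper DATE.
client.query("""
    CREATE OR REPLACE VIEW `my-project.my_dataset.shipments` AS
    SELECT id, PARSE_DATE('%m/%d/%Y', shipped) AS shipped, name
    FROM `my-project.my_dataset.shipments_raw`
""").result()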

TIMESTAMP from CSV via API

Does the API support importing a CSV to a new table when there is a TIMESTAMP field?
If I manually (using the BigQuery web interface) upload a CSV file containing timestamp data, and specify the field to be a TIMESTAMP via the schema, it works just fine. The data is loaded. The timestamp data is interpreted as timestamp data and imported into the timestamp field just fine.
However, when I use the API to do the same thing with the same file, I get this error:
"Illegal CSV schema type: TIMESTAMP"
More specifically, I'm using Google Apps Script to connect to the BigQuery API, but the response seems to be coming from the BigQuery API itself, which suggests this is not a feature of the API.
I know I can import as STRING, then convert to TIMESTAMP in my queries, but I was hoping to ultimately end up with a table schema with a timestamp field... populated from a CSV file... using the API... preferably through Apps Script for simplicity.
It looks like TIMESTAMP is missing from the 'inline' schema parser. The fix should be in next week's build. In the meantime, if you pass the schema via the 'schema' field rather than the schemaInline field, it should work for you.
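For reference, here is a minimal sketch of a load job resource that passes the schema through the structured 'schema' field instead of schemaInline; the project, dataset, table, bucket and column names are placeholders, and the same JSON body can be submitted from Apps Script via the BigQuery advanced service's Jobs.insert.

# Job resource for the BigQuery REST API's jobs.insert, shown as a Python dict.
job_body = {
    "configuration": {
        "load": {
            "sourceUris": ["gs://my-bucket/data.csv"],
            "destinationTable": {
                "projectId": "my-project",
                "datasetId": "my_dataset",
                "tableId": "my_table",
            },
            "sourceFormat": "CSV",
            "skipLeadingRows": 1,
            # Structured schema, rather than the schemaInline string:
            "schema": {
                "fields": [
                    {"name": "event_id", "type": "INTEGER"},
                    {"name": "created_at", "type": "TIMESTAMP"},
                    {"name": "payload", "type": "STRING"},
                ]
            },
        }
    }
}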