Partitioning a BigQuery table loaded from AVRO - google-bigquery

I have a BigQuery table whose data is loaded from AVRO files on GCS. This is NOT an external table.
One of the fields in every AVRO object is created (a date stored as a long), and I'd like to use this field to partition the table.
What is the best way to do this?
Thanks

Two issues prevent you from using created as a partition column:
The AVRO file defines the schema at load time. The only partitioning option available at this step is Partition by Ingestion Time; however, that would most likely mean relying on another field for this purpose.
The field created is a long that appears to contain a datetime. If it were a plain integer you could use integer-range partitioned tables, but in this case you need to convert the long value into a DATE/TIMESTAMP in order to use date/timestamp-partitioned tables.
So, in my opinion, you can try:
Importing the data as-is into a first table.
Creating a second, empty table partitioned by created with type TIMESTAMP.
Executing a query that reads from the first table and applies a timestamp function such as TIMESTAMP_SECONDS (or TIMESTAMP_MILLIS) to created to transform the value into a TIMESTAMP, so each row you insert lands in its partition.
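A minimal SQL sketch of those steps (table names are hypothetical, and it assumes created stores milliseconds since the epoch; use TIMESTAMP_SECONDS instead if it stores seconds). A CREATE TABLE ... AS SELECT collapses steps 2 and 3 into a single statement:
-- Rewrite the loaded table into a timestamp-partitioned copy.
CREATE TABLE mydataset.partitioned_table
PARTITION BY DATE(created_ts)
AS
SELECT
  * EXCEPT (created),
  TIMESTAMP_MILLIS(created) AS created_ts  -- convert the long value to a TIMESTAMP
FROM mydataset.raw_table;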

Related

How to partition a datetime column when creating Athena table

I have some log files in S3 in the following CSV format (sample data in parentheses):
userid (15678),
datetime (2017-09-14T00:21:10),
tag1 (some random text),
tag2 (some random text)
I want to load these into Athena tables and partition the data based on datetime in a day/month/year format. Is there a way to split the datetime at table creation, or do I need to run a job beforehand to separate the columns and then import?
Athena supports only Hive external tables. To partition the data in an external table, your data must be in different folders.
There are two ways in which you can do that. Both are mentioned here.
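As a rough illustration (bucket, table, and partition key names are hypothetical), once the files have been copied into one folder per day, the table and a partition could be declared along these lines:
CREATE EXTERNAL TABLE logs (
  userid BIGINT,
  `datetime` STRING,
  tag1 STRING,
  tag2 STRING
)
PARTITIONED BY (year STRING, month STRING, day STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/logs/';

-- Register one day's folder as a partition (repeat per day, or run MSCK REPAIR TABLE logs).
ALTER TABLE logs ADD PARTITION (year='2017', month='09', day='14')
LOCATION 's3://my-bucket/logs/year=2017/month=09/day=14/';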

Change datatype of a date partitioned table

I have a date partitioned table with around 400 partitions.
Unfortunately, one of the columns' datatypes has changed and needs to be converted from INT to STRING.
I can change the datatype as follows:
SELECT
  CAST(change_var AS STRING) AS change_var,
  <rest of columns>
FROM dataset.table_name
and overwrite the table, but the date partitioning is then lost.
Is there any way to keep the partitioning and change a column's datatype?
Option 1.
Export the table partition by partition. I created a simple library to help with this: https://github.com/rdtr/bq-partition-porter
Then create a new table with the correct type and load the data into the new table again, partition by partition. Be careful about the quota (1,000 exports per day); 400 should be okay.
Option 2.
Using Cloud Dataflow, you can read the whole table and then use DynamicDestinations to write the data into BigQuery partition by partition. If the number of partitions is too large for Option 1, this approach can satisfy the requirement.
I expect the bq load command to offer some way to specify a partition key field name (it is already described in the bq load help). Until then, you need to follow one of these options.
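As a variation on Option 1 that avoids exporting files, a hedged sketch of a per-partition rewrite (everything except change_var and dataset.table_name is hypothetical, and it assumes the source is an ingestion-time partitioned table so _PARTITIONTIME is available): run the CAST query one day at a time and write each result to the matching $YYYYMMDD partition of the new table, for example new_table$20180101:
SELECT
  CAST(change_var AS STRING) AS change_var,
  * EXCEPT (change_var)
FROM dataset.table_name
WHERE _PARTITIONTIME = TIMESTAMP('2018-01-01');  -- one partition per run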

Partitioning based on column data?

When creating a partitioned table using bq mk --time_partitioning_type=DAY are the partitions created based on the load time of the data, not a date key within the table data itself?
To create partitions based on dates within the data, is the current approach to manually create sharded tables and load them based on date, as in this post from 2012?
Yes, partitions are created based on the data load time, not based on the data itself.
You can use a partition decorator (mydataset.mytable1$20160810) if you want to load data into a specific partition.
Per my understanding, partitioning by column is something we should expect to be supported at some point, but not now.
Good news: BigQuery now supports two types of data partitioning, including partitioning by column. Please check here.
I like this feature: an individual operation can commit data into up to 2,000 distinct partitions.
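For reference, a minimal sketch of a column-partitioned table created with standard SQL DDL (names are hypothetical); bq mk --time_partitioning_field gives the same result from the command line:
CREATE TABLE mydataset.events
(
  event_id STRING,
  event_ts TIMESTAMP
)
PARTITION BY DATE(event_ts);  -- partitions follow the column value, not the load time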

Streaming into BQ partitioned tables

I'm trying to use Dataflow to stream into a BQ partitioned table.
The documentation says that:
Data in the streaming buffer has a NULL value for the _PARTITIONTIME column.
I can see that's the case when inserting rows into a date partitioned table.
Is there a way to be able to set the partition time of the rows I want to insert so that BigQuery can infer the correct partition?
So far I've tried doing: tableRow.set("_PARTITIONTIME", milliessinceepoch);
but I get hit with a no such field exception.
As of a month or so ago, you can stream into a specific partition of a date-partitioned table. For example, to insert into the partition for date 20160501 in table T, you can call insertAll with the table name T$20160501.
AFAIK, as of writing, BigQuery does not allow specifying the partition manually per row - it is inferred from the time of insertion.
However, as an alternative to BigQuery's built-in partitioned tables feature, you can use Dataflow's feature for streaming to multiple BigQuery tables at the same time: see Sharding BigQuery output tables.

Import CSV to partitioned table on BigQuery using specific timestamp column?

I want to import a large CSV into a BigQuery partitioned table that has a timestamp-type column holding the date of some transaction. The problem is that when I load the data, it imports everything into a single partition for today's date.
Is it possible to use my own timestamp value to partition it? How can I do that?
In BigQuery, partitioning based on a specific column is currently not supported, even if that column is date-related (a timestamp).
You either rely on the time of insertion, so the BigQuery engine inserts rows into the respective partition, or you explicitly specify which partition you want to insert your data into.
See more in Creating and Updating Date-Partitioned Tables.
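A hedged sketch of the second approach (table and column names are hypothetical): load the CSV into a non-partitioned staging table first, then run one query per day whose destination is the matching partition, for example mydataset.target$20170914:
SELECT *
FROM mydataset.staging
WHERE DATE(transaction_ts) = '2017-09-14';  -- only the rows belonging to that day's partition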
The best way to do that today is by using Google Dataflow [1]. You can develop a streaming pipeline which will read the file from a Google Cloud Storage bucket and insert the rows into a BigQuery table.
You will need to create the partitioned table manually [2] before running the pipeline, because Dataflow currently doesn't support creating partitioned tables.
There are multiple examples available at [3].
[1] https://cloud.google.com/dataflow/docs/
[2] https://cloud.google.com/bigquery/docs/creating-partitioned-tables
[3] https://cloud.google.com/dataflow/examples/all-examples