Import CSV to partitioned table on BigQuery using specific timestamp column? - google-bigquery

I want to import a large CSV into a BigQuery partitioned table. The table has a TIMESTAMP column that holds the date of each transaction. The problem is that when I load the data, everything is imported into a single partition for today's date.
Is it possible to use my own timestamp value to partition it? How can I do that?

In BigQuery, partitioning based on a specific column is currently not supported,
even if that column is date related (timestamp).
You either rely on the time of insertion, so that the BigQuery engine inserts into the respective partition, or you specify exactly which partition you want to insert your data into.
See Creating and Updating Date-Partitioned Tables for more.
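To illustrate the second option (naming the exact partition), here is a minimal sketch that groups CSV rows by the date in their timestamp column and derives the `table$YYYYMMDD` partition decorator each group would be loaded into. The column name `transaction_ts`, the table name `mydataset.transactions`, and the timestamp format are assumptions for illustration, and the load job for each group is not shown.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime


def group_rows_by_partition(csv_text, table, ts_column, ts_format="%Y-%m-%d %H:%M:%S"):
    """Group CSV rows under the partition decorator for their timestamp's date."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Derive the YYYYMMDD suffix from the row's own timestamp value.
        day = datetime.strptime(row[ts_column], ts_format).strftime("%Y%m%d")
        groups[f"{table}${day}"].append(row)
    return dict(groups)


# Hypothetical sample data; a real file would be read from disk or GCS.
sample = (
    "id,transaction_ts\n"
    "1,2016-05-01 09:30:00\n"
    "2,2016-05-01 17:45:00\n"
    "3,2016-05-02 08:00:00\n"
)
partitions = group_rows_by_partition(sample, "mydataset.transactions", "transaction_ts")
for destination, rows in sorted(partitions.items()):
    print(destination, len(rows))
```

Each key could then serve as the destination of a separate load job, one per day of data.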

The best way to do that today is by using Google Dataflow [1]. You can develop a streaming pipeline which will read the file from Google Cloud Storage bucket and insert the rows into BigQuery's table.
You will need to create the partitioned table manually [2] before running the pipeline, because Dataflow currently doesn't support creating partitioned tables.
There are multiple examples available at [3]
[1] https://cloud.google.com/dataflow/docs/
[2] https://cloud.google.com/bigquery/docs/creating-partitioned-tables
[3] https://cloud.google.com/dataflow/examples/all-examples

Related

Google Dataflow store to specific Partition using BigQuery Storage Write API

I want to store data to BigQuery by using specific partitions. The partitions are ingestion-time based. I want to use a range of partitions spanning over two years. I use the partition alias destination project-id:data-set.table-id$partition-date.
I get failures because it does not recognise the destination as an alias, but treats it as an actual table.
Is it supported?
When you ingest data into BigQuery, it will land automatically in the corresponding partition. If you choose a daily ingestion time as partition column, that means that every new day will be a new partition. To be able to "backfill" partitions, you need to choose some other column for the partition (e.g. a column in the table with the ingestion date). When you write data from Dataflow (from anywhere actually), the data will be stored in the partition corresponding to the value of that column for each record.
Direct writes to ingestion-time partitions are not supported using the Write API.
Using the streaming API is also not supported once a window of 31 days has passed.
From the documentation:
When streaming using a partition decorator, you can stream to partitions within the last 31 days in the past and 16 days in the future relative to the current date, based on current UTC time.
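As a rough sketch of the window described in that quote, the check below decides whether a given partition date can still be targeted by streaming. Treating both bounds as inclusive is an assumption, and the function name is made up for illustration.

```python
from datetime import date, timedelta

PAST_DAYS = 31    # days in the past, from the documentation quoted above
FUTURE_DAYS = 16  # days in the future, from the documentation quoted above


def can_stream_to_partition(partition_day, today_utc):
    """Return True if partition_day falls inside the streaming window.

    Assumption: both ends of the window are inclusive.
    """
    earliest = today_utc - timedelta(days=PAST_DAYS)
    latest = today_utc + timedelta(days=FUTURE_DAYS)
    return earliest <= partition_day <= latest


today = date(2022, 11, 15)  # stand-in for the current UTC date
print(can_stream_to_partition(date(2022, 11, 1), today))   # two weeks back
print(can_stream_to_partition(date(2020, 11, 15), today))  # two years back
```

A two-year backfill falls almost entirely outside this window, which is why streaming to a partition decorator does not cover that case.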
The solution that works is to use BigQuery load jobs to insert data. This can handle this scenario.
Because this operation involves lots of I/O (files are created on GCS), it can be lengthy, costly and resource intensive, depending on the data.
One approach is to split the big table into smaller table shards so that the Storage Read and Write APIs can be used. Load jobs from the sharded tables into the partitioned table then require fewer resources, because the problem is already divided.
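The shard-then-load idea can be sketched as a plan that pairs each daily shard with the partition it would be loaded into. The `table_YYYYMMDD` shard naming and the `dataset.events` table are assumptions for illustration; the load jobs themselves are not shown.

```python
from datetime import date, timedelta


def plan_shard_loads(table, start, end):
    """Yield (shard_table, partition_destination) for each day in [start, end]."""
    day = start
    while day <= end:
        suffix = day.strftime("%Y%m%d")
        # Assumed convention: shards are named table_YYYYMMDD,
        # partitions are addressed as table$YYYYMMDD.
        yield f"{table}_{suffix}", f"{table}${suffix}"
        day += timedelta(days=1)


plan = list(plan_shard_loads("dataset.events", date(2021, 12, 30), date(2022, 1, 2)))
for shard, destination in plan:
    print(shard, "->", destination)
```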

Partitioning based on column data?

When creating a partitioned table using bq mk --time_partitioning_type=DAY are the partitions created based on the load time of the data, not a date key within the table data itself?
To create partitions based on dates within the data, is the current approach to manually create sharded tables and load them based on date, as in this post from 2012?
Yes, partitions are created based on the data load time, not on the data itself.
You can use a partition decorator (mydataset.mytable1$20160810) if you want to load data into a specific partition.
Per my understanding, partition by column is something that we should expect to be supported at some point - but not now
Good news: BigQuery now supports two types of data partitioning, including partitioning by column. Please check here.
I like the feature: An individual operation can commit data into up to 2,000 distinct partitions.
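If a load spans more dates than that limit, one option is to split the distinct dates into batches before issuing operations. The sketch below assumes the 2,000 figure quoted above, which may differ from the current quota; the function name is hypothetical.

```python
from datetime import date, timedelta

MAX_PARTITIONS_PER_OPERATION = 2000  # limit quoted in the answer above


def batch_partition_days(days, limit=MAX_PARTITIONS_PER_OPERATION):
    """Split the distinct days into sorted batches of at most `limit` partitions."""
    unique_days = sorted(set(days))
    return [unique_days[i:i + limit] for i in range(0, len(unique_days), limit)]


# Hypothetical input: 4,500 consecutive transaction dates.
days = [date(2010, 1, 1) + timedelta(days=n) for n in range(4500)]
batches = batch_partition_days(days)
print([len(batch) for batch in batches])  # [2000, 2000, 500]
```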

Streaming into BQ partitioned tables

I'm trying to use dataflow to stream into BQ partitioned table.
The documentation says that:
Data in the streaming buffer has a NULL value for the _PARTITIONTIME column.
I can see that's the case when inserting rows into a date partitioned table.
Is there a way to be able to set the partition time of the rows I want to insert so that BigQuery can infer the correct partition?
So far I've tried doing: tableRow.set("_PARTITIONTIME", milliessinceepoch);
but I get hit with a no such field exception.
As of a month or so ago, you can stream into a specific partition of a date-partitioned table. For example, to insert into partition for date 20160501 in table T, you can call insertall with table name T$20160501
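Since the question starts from a milliseconds-since-epoch value, a small helper can turn that value into the decorated table name to pass to insertAll. This is a minimal sketch that assumes partitions follow UTC dates; the function name is made up for illustration.

```python
from datetime import datetime, timezone


def partition_table_id(table_id, millis_since_epoch):
    """Return table_id with a $YYYYMMDD decorator for the UTC date of the timestamp."""
    day = datetime.fromtimestamp(millis_since_epoch / 1000, tz=timezone.utc)
    return f"{table_id}${day:%Y%m%d}"


# 1462060800000 ms is 2016-05-01 00:00:00 UTC.
print(partition_table_id("T", 1462060800000))  # T$20160501
```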
AFAIK, as of writing, BigQuery does not allow specifying the partition manually per row - it is inferred from the time of insertion.
However, as an alternative to BigQuery's built-in partitioned tables feature, you can use Dataflow's feature for streaming to multiple BigQuery tables at the same time: see Sharding BigQuery output tables.

Date-partitioned template tables in BigQuery?

I am trying to create date-partitioned + template tables in BigQuery:
Create base table using bq mk --time_partitioning_type=DAY myapp.customer
Call API insertAll with "tableId": "customer", "templateSuffix": "_activated"
The resulting customer_activated table inherits the schema of the customer table, but has no timePartitioning.
How can I ensure template tables inherit the time partitioning of the base table?
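For reference, the request described in the steps above can be sketched as follows. The helper builds only the insertAll request body; the row field `customer_id` is hypothetical, and sending the request (authentication, HTTP call) is left out.

```python
import json
import uuid


def build_insert_all_body(rows, template_suffix=None):
    """Build a tabledata.insertAll request body, optionally with a templateSuffix."""
    body = {
        "kind": "bigquery#tableDataInsertAllRequest",
        # insertId lets BigQuery de-duplicate retried rows on a best-effort basis.
        "rows": [{"insertId": str(uuid.uuid4()), "json": row} for row in rows],
    }
    if template_suffix is not None:
        body["templateSuffix"] = template_suffix
    return body


base_table_id = "customer"  # goes in the request path, not the body
body = build_insert_all_body([{"customer_id": 42}], template_suffix="_activated")
# The table BigQuery creates is named base table id + suffix.
target_table_id = base_table_id + body["templateSuffix"]
print(target_table_id)  # customer_activated
print(json.dumps(body, indent=2))
```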
For people coming here in the future, the accepted answer is outdated. BigQuery Streaming APIs support date-partition tables now, both to the table and to a specific partition
Link to docs
Streaming APIs do not yet support date-partitioning
Your option is to use a load job with the partition as the destination for the initial population, and then stream directly to the table (without using partitions) and let BigQuery infer the partition timestamp.
Otherwise you should wait until streaming supports date-partitioning, which the Google team mentioned will happen in the near future.
Update:
Since around mid-2017 BigQuery supports Streaming into partitioned tables
Just FYI, as of November 2022, it is indeed possible to stream data into already existing partitioned tables, however, tables created automatically using a template table do NOT inherit the time partitioning configuration of the parent table, which is what OP was asking in the first place.

Streaming data to a specific BigQuery Time Partition

I would like to know if there is any way to stream data to a specific time partition of a BigQuery table. The documentation says that you must use table decorators:
Loading data using partition decorators
Partition decorators enable you to load data into a specific
partition. To adjust for timezones, use a partition decorator to load
data into a partition based on your preferred timezone. For example,
if you are on Pacific Standard Time (PST), load all data generated on
May 1, 2016 PST into the partition for that date by using the
corresponding partition decorator:
[TABLE_NAME]$20160501
Source: https://cloud.google.com/bigquery/docs/partitioned-tables#dealing_with_timezone_issues
And:
Restating data in a partition
To update data in a specific partition, append a partition decorator
to the name of the partitioned table when loading data into the table.
A partition decorator represents a specific date and takes the form:
$YYYYMMDD
Source: https://cloud.google.com/bigquery/docs/creating-partitioned-tables#creating_a_partitioned_table
But if I try to use them when streaming data, I get the following error: Table decorators cannot be used with streaming insert.
Thanks in advance!
Sorry for the inconvenience. We are considering providing support for this in the near future. Please stay tuned for more updates.
Possible workarounds that might work in many cases:
If you have most of the data available (which is sometimes the case when restating data for an old partition), you can use a load job with the partition as the destination.
Another option is to stream to a temporary table and, after the data has been flushed from the streaming buffer, use bq cp.
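The second workaround could be scripted along these lines. The sketch only assembles the `bq cp` command; the table names are hypothetical, and using `-a` to append to the destination partition is an assumption about the desired behavior.

```python
import shlex


def build_bq_copy_command(source_table, dest_table, partition_day, append=True):
    """Build the argument list for copying a temp table into one date partition."""
    args = ["bq", "cp"]
    if append:
        args.append("-a")  # append to the destination instead of overwriting it
    args += [source_table, f"{dest_table}${partition_day}"]
    return args


cmd = build_bq_copy_command("mydataset.tmp_stream", "mydataset.mytable", "20160501")
# shlex.join quotes the decorated name so a shell does not expand the "$".
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd)` avoids shell quoting altogether; the quoted string is only needed when pasting the command into a terminal.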
This feature was recently released and you can now stream directly into a decorated date partition within the last 30 days historically and 5 days into the future.
https://cloud.google.com/bigquery/streaming-data-into-bigquery