Partitioning based on column data? - google-bigquery

When creating a partitioned table using bq mk --time_partitioning_type=DAY, are the partitions created based on the load time of the data, rather than a date key within the table data itself?
To create partitions based on dates within the data, is the current approach to manually create sharded tables and load them based on date, as in this post from 2012?

Yes, partitions are created based on data load time, not based on the data itself.
You can use a partition decorator (mydataset.mytable1$20160810) if you want to load data into a specific partition.
As I understand it, partitioning by column is something we should expect to be supported at some point, but not now.
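As a minimal sketch of the decorator format mentioned above, the helper below builds the table$YYYYMMDD string from a date. The function name partition_decorator is hypothetical and only does string formatting; it does not call BigQuery.

```python
from datetime import date


def partition_decorator(table: str, day: date) -> str:
    """Return the decorated table name 'table$YYYYMMDD' for one daily partition.

    Hypothetical helper: it only formats the string and never contacts BigQuery.
    """
    return f"{table}${day.strftime('%Y%m%d')}"


# Reproduces the decorator used in the answer above.
print(partition_decorator("mydataset.mytable1", date(2016, 8, 10)))
# -> mydataset.mytable1$20160810
```

When passing such a name on a shell command line, it usually needs single quotes so that the shell does not try to expand the $ part.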

Good news: BigQuery now supports two types of data partitioning, including partitioning by column. Please check here.
I like the feature: An individual operation can commit data into up to 2,000 distinct partitions.

Related

Google Dataflow store to specific Partition using BigQuery Storage Write API

I want to store data to BigQuery by using specific partitions. The partitions are ingestion-time based. I want to use a range of partitions spanning over two years. I use the partition alias destination project-id:data-set.table-id$partition-date.
I get failures since it does not recognise the destination as an alias but treats it as an actual table.
Is it supported?
When you ingest data into BigQuery, it will land automatically in the corresponding partition. If you choose a daily ingestion time as partition column, that means that every new day will be a new partition. To be able to "backfill" partitions, you need to choose some other column for the partition (e.g. a column in the table with the ingestion date). When you write data from Dataflow (from anywhere actually), the data will be stored in the partition corresponding to the value of that column for each record.
Direct writes to ingestion-time partitions are not supported using the Write API.
Also, using the streaming API is not supported once the 31-day window has passed.
From the documentation:
When streaming using a partition decorator, you can stream to partitions within the last 31 days in the past and 16 days in the future relative to the current date, based on current UTC time.
The solution that works is to use BigQuery load jobs to insert the data. These can handle this scenario.
Because this operation involves lots of I/O (files get created on GCS), it can be lengthy, costly and resource intensive, depending on the data.
One approach is to create table shards and split the big table into smaller ones, so that the Storage Read API and the Write API can be used. Load jobs from the sharded tables into the partitioned table would then require fewer resources, since the problem is already divided.
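To illustrate the window quoted from the documentation above, here is a rough sketch that splits a list of partition dates into those that could be streamed with a decorator and those that would need load jobs. The function names are hypothetical, and treating both window boundaries as inclusive is an assumption; the current UTC date is passed in explicitly so the example is reproducible.

```python
from datetime import date, timedelta

# Window sizes taken from the documentation quote above.
PAST_DAYS = 31
FUTURE_DAYS = 16


def can_stream(partition_day: date, today_utc: date) -> bool:
    """True if the partition falls inside the streaming window (bounds assumed inclusive)."""
    earliest = today_utc - timedelta(days=PAST_DAYS)
    latest = today_utc + timedelta(days=FUTURE_DAYS)
    return earliest <= partition_day <= latest


def split_by_write_path(days, today_utc):
    """Split partition dates into (streamable, needs_load_job) lists."""
    stream, load = [], []
    for day in days:
        (stream if can_stream(day, today_utc) else load).append(day)
    return stream, load


today = date(2024, 3, 1)  # fixed "current UTC date" for the example
days = [date(2022, 3, 1), date(2024, 2, 1), date(2024, 3, 10), date(2024, 4, 1)]
stream, load = split_by_write_path(days, today)
print("stream:", stream)
print("load jobs:", load)
```

With a backfill spanning two years, almost every partition ends up in the load-job list, which matches the conclusion above.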

Change datatype of a date partitioned table

I have a date partitioned table with around 400 partitions.
Unfortunately, the data type of one of the columns has changed, and the column needs to be converted from INT to STR.
I can change the datatype as follows:
SELECT
  CAST(change_var AS STRING) AS change_var,
  <rest of columns>
FROM dataset.table_name
and overwrite the table, but the date partitioning is then lost.
Is there any way to keep the partitioning and change a column's data type?
Option 1.
Export table by partition. I created a simple library to achieve it. https://github.com/rdtr/bq-partition-porter
Then create a new table with the correct type and load the data into the new table again, by partition. Be careful about the quota (1000 exports per day). 400 should be okay.
Option 2.
By using Cloud Dataflow, you can export the whole table and then use DynamicDestinations to import the data into BQ by partition. If the number of partitions is too large, this would meet the requirement.
I expect the bq load command to gain some way to specify a partition key field name (it's already described in bq load help). Until then, you need to follow either of these options.

Streaming into BQ partitioned tables

I'm trying to use Dataflow to stream into a BQ partitioned table.
The documentation says that:
Data in the streaming buffer has a NULL value for the _PARTITIONTIME column.
I can see that's the case when inserting rows into a date partitioned table.
Is there a way to be able to set the partition time of the rows I want to insert so that BigQuery can infer the correct partition?
So far I've tried doing: tableRow.set("_PARTITIONTIME", milliessinceepoch);
but I get hit with a no such field exception.
As of a month or so ago, you can stream into a specific partition of a date-partitioned table. For example, to insert into the partition for date 20160501 in table T, you can call insertAll with table name T$20160501.
AFAIK, as of writing, BigQuery does not allow specifying the partition manually per row - it is inferred from the time of insertion.
However, as an alternative to BigQuery's built-in partitioned tables feature, you can use Dataflow's feature for streaming to multiple BigQuery tables at the same time: see Sharding BigQuery output tables.
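Both suggestions above come down to choosing the destination table name per record. The sketch below derives that name from an epoch-milliseconds timestamp, either as a partition decorator (T$20160501) or as a date-suffixed shard. The function names and the T_YYYYMMDD shard naming are assumptions for illustration, and the date is computed in UTC.

```python
from datetime import datetime, timezone


def partition_for_millis(millis: int) -> str:
    """Convert milliseconds since the epoch to a YYYYMMDD string in UTC."""
    moment = datetime.fromtimestamp(millis / 1000, tz=timezone.utc)
    return moment.strftime("%Y%m%d")


def streaming_target(table: str, millis: int, sharded: bool = False) -> str:
    """Return 'T$YYYYMMDD' (partition decorator) or 'T_YYYYMMDD' (assumed shard naming)."""
    suffix = partition_for_millis(millis)
    return f"{table}_{suffix}" if sharded else f"{table}${suffix}"


# An event at noon UTC on 2016-05-01, expressed as epoch milliseconds.
event_millis = int(datetime(2016, 5, 1, 12, tzinfo=timezone.utc).timestamp() * 1000)
print(streaming_target("T", event_millis))                # decorator form
print(streaming_target("T", event_millis, sharded=True))  # shard form
```

The resulting string would be used as the table name of the insert call, rather than setting _PARTITIONTIME on the row itself.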

Import CSV to partitioned table on BigQuery using specific timestamp column?

I want to import a large CSV into a BigQuery partitioned table that has a timestamp column holding the date of some transaction. The problem is that when I load the data, everything is imported into a single partition for today's date.
Is it possible to use my own timestamp value to partition it? How can I do that?
Currently, BigQuery does not support partitioning based on a specific column, even if that column is date related (timestamp).
You either rely on the time of insertion, so that the BigQuery engine inserts into the respective partition, or you specify exactly which partition you want to insert your data into.
See more about Creating and Updating Date-Partitioned Tables
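One way to act on the second option is to split the CSV by the date of its timestamp column first, so that each group can then be loaded into its own partition. The sketch below only does the grouping, with the standard library; the column name transaction_ts, the timestamp format, the table name mytable and the function name are all assumptions for illustration.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime


def group_rows_by_partition(csv_text, ts_column, ts_format="%Y-%m-%d %H:%M:%S"):
    """Group CSV rows by the YYYYMMDD date of their timestamp column."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        day = datetime.strptime(row[ts_column], ts_format).strftime("%Y%m%d")
        groups[day].append(row)
    return dict(groups)


# Tiny made-up sample; a real file would be read from disk instead.
sample = """id,transaction_ts
1,2016-08-09 23:59:00
2,2016-08-10 00:01:00
3,2016-08-10 12:30:00
"""

groups = group_rows_by_partition(sample, "transaction_ts")
for day, rows in sorted(groups.items()):
    # Each group would be loaded into the partition named by this decorator.
    print(f"mytable${day}: {len(rows)} row(s)")
```

For a large file, writing each group out to its own file as rows are read would avoid holding everything in memory.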
The best way to do that today is by using Google Dataflow [1]. You can develop a streaming pipeline that reads the file from a Google Cloud Storage bucket and inserts the rows into the BigQuery table.
You will need to create the partitioned table manually [2] before running the pipeline, because Dataflow currently doesn't support creating partitioned tables.
There are multiple examples available at [3].
[1] https://cloud.google.com/dataflow/docs/
[2] https://cloud.google.com/bigquery/docs/creating-partitioned-tables
[3] https://cloud.google.com/dataflow/examples/all-examples

Streaming data to a specific BigQuery Time Partition

I would like to know if there is any way to stream data to a specific time partition of a BigQuery table. The documentation says that you must use table decorators:
Loading data using partition decorators
Partition decorators enable you to load data into a specific partition. To adjust for timezones, use a partition decorator to load data into a partition based on your preferred timezone. For example, if you are on Pacific Standard Time (PST), load all data generated on May 1, 2016 PST into the partition for that date by using the corresponding partition decorator:
[TABLE_NAME]$20160501
Source: https://cloud.google.com/bigquery/docs/partitioned-tables#dealing_with_timezone_issues
And:
Restating data in a partition
To update data in a specific partition, append a partition decorator to the name of the partitioned table when loading data into the table. A partition decorator represents a specific date and takes the form:
$YYYYMMDD
Source: https://cloud.google.com/bigquery/docs/creating-partitioned-tables#creating_a_partitioned_table
But if I try to use them when streaming data, I get the following error: Table decorators cannot be used with streaming insert.
Thanks in advance!
Sorry for the inconvenience. We are considering providing support for this in the near future. Please stay tuned for more updates.
Possible workarounds that might work in many cases:
If you have most of the data available (which is sometimes the case when restating data for an old partition), you can use a load job with the partition as the destination.
Another option is to stream to a temporary table and, after the data has been flushed from the streaming buffer, use bq cp.
This feature was recently released and you can now stream directly into a decorated date partition within the last 30 days historically and 5 days into the future.
https://cloud.google.com/bigquery/streaming-data-into-bigquery