I would like to know if there is any way to stream data to a specific time partition of a BigQuery table. The documentation says that you must use table decorators:
Loading data using partition decorators
Partition decorators enable you to load data into a specific
partition. To adjust for timezones, use a partition decorator to load
data into a partition based on your preferred timezone. For example,
if you are on Pacific Standard Time (PST), load all data generated on
May 1, 2016 PST into the partition for that date by using the
corresponding partition decorator:
[TABLE_NAME]$20160501
Source: https://cloud.google.com/bigquery/docs/partitioned-tables#dealing_with_timezone_issues
And:
Restating data in a partition
To update data in a specific partition, append a partition decorator
to the name of the partitioned table when loading data into the table.
A partition decorator represents a specific date and takes the form:
$YYYYMMDD
Source: https://cloud.google.com/bigquery/docs/creating-partitioned-tables#creating_a_partitioned_table
But if I try to use them when streaming data, I get the following error: Table decorators cannot be used with streaming insert.
Thanks in advance!
Sorry for the inconvenience. We are considering providing support for this in the near future. Please stay tuned for more updates.
Possible workarounds that might work in many cases:
If you have most of the data available (which is sometimes the case when restating data for an old partition), you can use a load job with the partition as the destination.
Another option is to stream to a temporary table and, after the data has been flushed from the streaming buffer, use bq cp to copy it into the target partition.
This feature was recently released and you can now stream directly into a decorated date partition within the last 30 days historically and 5 days into the future.
https://cloud.google.com/bigquery/streaming-data-into-bigquery
Related
I want to store data to BigQuery by using specific partitions. The partitions are ingestion-time based. I want to use a range of partitions spanning over two years. I use the partition alias destination project-id:data-set.table-id$partition-date.
I get failures because it does not recognise the destination as a partition alias but treats it as an actual table.
Is it supported?
When you ingest data into BigQuery, it will land automatically in the corresponding partition. If you choose a daily ingestion time as partition column, that means that every new day will be a new partition. To be able to "backfill" partitions, you need to choose some other column for the partition (e.g. a column in the table with the ingestion date). When you write data from Dataflow (from anywhere actually), the data will be stored in the partition corresponding to the value of that column for each record.
Direct writes to ingestion-time partitions are not supported by the Write API.
Streaming with a partition decorator is also not supported for partitions more than 31 days in the past.
From the documentation:
When streaming using a partition decorator, you can stream to partitions within the last 31 days in the past and 16 days in the future relative to the current date, based on current UTC time.
The solution that works is to use BigQuery load jobs to insert data. This can handle this scenario.
Because this operation involves lots of I/O (files are created on GCS), it can be lengthy, costly and resource-intensive, depending on the data.
One approach is to create table shards, splitting the big table into smaller ones so that the Storage Read and Write APIs can be used. Load jobs from the sharded tables into the partitioned table then require fewer resources, because the problem is already divided.
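The streaming window quoted above can be checked before deciding how to write a given partition. The following is only a sketch: the function names and return values are made up for illustration, the limits are the 31/16-day figures from the documentation quote, and treating both boundaries as inclusive is an assumption.

```python
from datetime import date, datetime, timedelta, timezone

# Limits taken from the documentation quote above; adjust if they change.
MAX_DAYS_PAST = 31
MAX_DAYS_FUTURE = 16


def can_stream_to_partition(partition_date, today_utc=None):
    """Return True if partition_date lies inside the streaming window.

    Boundaries are treated as inclusive, which is an assumption.
    """
    if today_utc is None:
        # The window is relative to the current date in UTC.
        today_utc = datetime.now(timezone.utc).date()
    earliest = today_utc - timedelta(days=MAX_DAYS_PAST)
    latest = today_utc + timedelta(days=MAX_DAYS_FUTURE)
    return earliest <= partition_date <= latest


def choose_write_method(partition_date, today_utc=None):
    """Pick streaming when the window allows it, otherwise a load job."""
    if can_stream_to_partition(partition_date, today_utc):
        return "streaming_insert"
    return "load_job"
```

For a two-year backfill, almost every partition would fall outside the window, so nearly all of them would end up on the load-job path.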
I have extensive experience working with Hive Partitioned tables. I use Hive 2.X. I was interviewing for a Big Data Solution Architect role and I was asked the below question.
Question: How would you ingest streaming data into a Hive table partitioned on Date? The streaming data is first stored in an S3 bucket and then loaded into Hive. Although the S3 bucket names have a date identifier such as S3_ingest_YYYYMMDD, the content could have data for more than 1 date.
My Answer: Since the content could have more than 1 Date, creating external table might not be possible since we want to read the file and distribute the file based on the date. I suggested we first load the S3 bucket in an external staging table with no partitions and then Load/Insert the final date partition table using Dynamic Partition settings which will dynamically distribute the data to the correct partition directory.
The interviewer said my answer was not correct and I was curious to know what the correct answer was, but ran out of time.
The only caveat in my answer is that, over time, the partitioned date directories will accumulate many small files, which can lead to the small-files problem; this can be handled by a batch maintenance process.
What are the other/correct options to handle this scenario?
Thanks.
It depends on the requirements.
As I understand it, if one file or folder named S3_ingest_YYYYMMDD can contain more than one date, then some events are loaded the next day or even later. This is a rather common scenario.
Ingestion date and event date are two different dates. Put the ingested files into a table partitioned by ingestion date (LZ). This lets you keep track of the initial data. If reprocessing is possible, use ingestion_date as a bookmark for reprocessing the LZ table.
Then schedule a process that takes the last two or more ingestion dates and loads them into a table partitioned by event_date. The last day will always be incomplete, and you may need to increase the look-back period to 3 or even more ingestion days (using an ingestion_date >= current_date - 2 days filter); it depends on how far back the event dates in a given ingestion can go. In this process you use dynamic partitioning by event_date, apply some logic (cleaning, etc.) and load into the ODS or DM.
This approach is very similar to what you proposed. The difference is in the first table: it should be partitioned so that you can process data incrementally and easily restate it if you need to change the logic, or if upstream data was also restated and reloaded into the LZ.
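The two steps (look-back filter on ingestion_date, then dynamic partitioning by event_date) can be sketched in plain Python. This is only an in-memory illustration of the logic, not Hive code: the row layout, function names and sample values are all hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta


def rows_in_lookback(landing_rows, today, lookback_days=2):
    """Keep rows whose ingestion_date >= today - lookback_days."""
    cutoff = today - timedelta(days=lookback_days)
    return [r for r in landing_rows if r["ingestion_date"] >= cutoff]


def partition_by_event_date(rows):
    """Group rows by event_date, mimicking dynamic partitioning."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row["event_date"]].append(row)
    return dict(partitions)


# Hypothetical landing-zone rows: one late event and one old ingestion.
landing = [
    {"ingestion_date": date(2020, 1, 10), "event_date": date(2020, 1, 9), "value": 1},
    {"ingestion_date": date(2020, 1, 10), "event_date": date(2020, 1, 10), "value": 2},
    {"ingestion_date": date(2020, 1, 5), "event_date": date(2020, 1, 5), "value": 3},
]
recent = rows_in_lookback(landing, today=date(2020, 1, 11))
by_event_date = partition_by_event_date(recent)
```

Here the row ingested on 2020-01-05 falls outside the 2-day look-back and is skipped, while the late event for 2020-01-09 still lands in its own event_date group.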
When creating a partitioned table using bq mk --time_partitioning_type=DAY are the partitions created based on the load time of the data, not a date key within the table data itself?
To create partitions based on dates within the data, is the current approach to manually create sharded tables, and load them based on date, as in this post from 2012?
Yes, partitions are created based on the data load time, not on the data itself.
You can use partition decorator (mydataset.mytable1$20160810) if you want to load data into specific partition
Per my understanding, partition by column is something that we should expect to be supported at some point - but not now
Good news: BigQuery now supports two types of partitioning, including partitioning by column. Please check here.
I like the feature: An individual operation can commit data into up to 2,000 distinct partitions.
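For the column-based partitioning mentioned above, the bq mk command from the question can be extended with a partitioning field. Below is a rough sketch that only builds the command line. The table name, schema and column are hypothetical, and the --time_partitioning_field flag is used as I understand it, so please verify it against the current bq reference before relying on it.

```python
import shlex


def bq_mk_column_partitioned(table, schema, partition_column):
    """Build a bq mk command for a table partitioned by a column."""
    return [
        "bq", "mk", "--table",
        "--time_partitioning_type=DAY",
        f"--time_partitioning_field={partition_column}",
        table,
        schema,
    ]


cmd = bq_mk_column_partitioned(
    "mydataset.transactions",                 # hypothetical table
    "transaction_ts:TIMESTAMP,amount:FLOAT",  # hypothetical inline schema
    "transaction_ts",                         # hypothetical partition column
)
print(shlex.join(cmd))
```

The list form can be passed to subprocess.run as-is; shlex.join is used here only to show the command as it would be typed in a shell.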
I'm trying to use Dataflow to stream into a BQ partitioned table.
The documentation says that:
Data in the streaming buffer has a NULL value for the _PARTITIONTIME column.
I can see that's the case when inserting rows into a date partitioned table.
Is there a way to be able to set the partition time of the rows I want to insert so that BigQuery can infer the correct partition?
So far I've tried doing: tableRow.set("_PARTITIONTIME", milliessinceepoch);
but I get hit with a no such field exception.
As of a month or so ago, you can stream into a specific partition of a date-partitioned table. For example, to insert into the partition for date 20160501 in table T, you can call insertAll with table name T$20160501.
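A small helper for building the decorated table name might look like this. It is just a sketch: decorated_table is a made-up name, and the result is only the string you would pass as the table name.

```python
from datetime import date


def decorated_table(table_name, partition_date):
    """Return table_name with a $YYYYMMDD partition decorator appended."""
    return f"{table_name}${partition_date:%Y%m%d}"


print(decorated_table("T", date(2016, 5, 1)))
```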
AFAIK, as of writing, BigQuery does not allow specifying the partition manually per row - it is inferred from the time of insertion.
However, as an alternative to BigQuery's built-in partitioned tables feature, you can use Dataflow's feature for streaming to multiple BigQuery tables at the same time: see Sharding BigQuery output tables.
I want to import a large CSV into a BigQuery partitioned table. It has a timestamp-type column that is actually the date of some transaction. The problem is that when I load the data, everything is imported into one partition for today's date.
Is it possible to use my own timestamp value to partition it? How can I do that?
In BigQuery, partitioning based on a specific column is currently not supported,
even if this column is date-related (timestamp).
You either rely on the time of insertion, so that the BigQuery engine inserts into the respective partition, or you specify exactly which partition you want to insert your data into.
See more about Creating and Updating Date-Partitioned Tables
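If you go the "specify the partition yourself" route, one way is to split the CSV by the day of the timestamp column first, and then load each group into the matching decorated partition. This is a minimal sketch using only the standard library: the column name, timestamp format and table name are assumptions, and a really large file would need to be streamed to per-day files rather than held in memory.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime


def split_csv_by_day(csv_text, ts_column, ts_format="%Y-%m-%d %H:%M:%S"):
    """Group CSV rows by the calendar day of ts_column.

    Returns {"YYYYMMDD": [row_dict, ...]}; each key is a partition suffix.
    """
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        day = datetime.strptime(row[ts_column], ts_format).strftime("%Y%m%d")
        groups[day].append(row)
    return dict(groups)


# Hypothetical sample data with a created_at timestamp column.
sample = (
    "id,created_at\n"
    "1,2016-05-01 09:00:00\n"
    "2,2016-05-02 10:30:00\n"
    "3,2016-05-01 23:59:59\n"
)
groups = split_csv_by_day(sample, "created_at")
for day, rows in sorted(groups.items()):
    print(f"mytable${day}: {len(rows)} row(s)")
```

Each group would then be written out and loaded with the corresponding decorator as the destination (mytable$20160501, mytable$20160502, and so on).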
The best way to do that today is by using Google Dataflow [1]. You can develop a streaming pipeline which will read the file from Google Cloud Storage bucket and insert the rows into BigQuery's table.
You will need to create the partitioned table manually [2] before running the pipeline, because Dataflow currently doesn't support creating partitioned tables.
There are multiple examples available at [3]
[1] https://cloud.google.com/dataflow/docs/
[2] https://cloud.google.com/bigquery/docs/creating-partitioned-tables
[3] https://cloud.google.com/dataflow/examples/all-examples