I am trying to create date-partitioned + template tables in BigQuery:
Create base table using bq mk --time_partitioning_type=DAY myapp.customer
Call API insertAll with "tableId": "customer", "templateSuffix": "_activated"
The resulting customer_activated table inherits the schema of the customer table, but has no timePartitioning.
How can I ensure template tables inherit the time partitioning of the base table?
For people coming here in the future: the accepted answer is outdated. The BigQuery streaming APIs now support date-partitioned tables, both streaming to the table as a whole and to a specific partition.
Link to docs
Streaming APIs do not yet support date-partitioning.
Your option is to use a load job with the partition as the destination for the initial population, and then stream directly to the table (without using partition decorators) and let BigQuery infer the partition timestamp.
Otherwise you will have to wait until streaming supports date-partitioning, which the Google team has said will happen in the near future.
Update:
Since around mid-2017 BigQuery supports Streaming into partitioned tables
Just FYI, as of November 2022, it is indeed possible to stream data into already existing partitioned tables, however, tables created automatically using a template table do NOT inherit the time partitioning configuration of the parent table, which is what OP was asking in the first place.
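Given that template-created tables do not pick up the partitioning, one possible workaround (an assumption on my part, not something confirmed in this thread) is to create each suffixed table explicitly with `timePartitioning` before streaming to it by its full name. The sketch below only builds the request body for the REST `tables.insert` call; the function name and the example schema are hypothetical.

```python
def partitioned_table_resource(project_id, dataset_id, base_table, suffix, schema_fields):
    """Build a tables.insert request body for base_table + suffix.

    The body sets daily time partitioning explicitly, because tables
    created through templateSuffix do not inherit it from the base table.
    """
    return {
        "tableReference": {
            "projectId": project_id,
            "datasetId": dataset_id,
            "tableId": base_table + suffix,
        },
        # schema_fields is assumed to mirror the base table's schema.
        "schema": {"fields": schema_fields},
        "timePartitioning": {"type": "DAY"},
    }


body = partitioned_table_resource(
    "my-project", "myapp", "customer", "_activated",
    [{"name": "customer_id", "type": "STRING"}],  # hypothetical schema
)
```

Sending this body is left out on purpose; any HTTP client or the official client library could be used for that step.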
Related
Our BigQuery schema is heavily nested/repeated and constantly changes. For example, a new page, form, or user-info field on the website corresponds to new columns in BigQuery. Also, if we stop using a certain form, the corresponding deprecated columns will be there forever because you can't delete columns in BigQuery.
So we're eventually going to end up with tables with hundreds of columns, many of which are deprecated, which doesn't seem like a good solution.
The primary alternative I'm looking into is to store everything as JSON (for example, each BigQuery table would have just two columns, one for the timestamp and another for the JSON data). Batch jobs that we run every 10 minutes would then perform joins/queries and write to aggregated tables. But with this method, I'm concerned about increasing query-job costs.
Some background info:
Our data comes in as protobuf, and we update our BigQuery schema whenever the protobuf schema changes.
I know one obvious solution is to not use BigQuery and use a document store instead, but we use BigQuery both as a data lake and as a data warehouse for BI and for building Tableau reports. So we have jobs that aggregate raw data into tables that serve Tableau.
The top answer here doesn't work that well for us because the data we get can be heavily nested with repeats: BigQuery: Create column of JSON datatype
You are already well prepared; you lay out several options in your question.
You could go with the JSON table, and to keep costs low:
you can use a partitioned table
you can cluster your table
So instead of having just the two timestamp + JSON columns, I would add 1 partition column and up to 4 cluster columns (the maximum BigQuery allows) as well. Eventually you could even use yearly suffixed tables. This way you have at least 5 dimensions along which to limit the number of rows scanned for rematerialization.
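As a rough illustration of the partitioned and clustered JSON table, here is a small helper that builds a DDL statement. The table name, column names, and the choice of STRING for every cluster column are assumptions for the example; BigQuery currently caps clustering at four columns, which the helper enforces.

```python
MAX_CLUSTER_COLUMNS = 4  # BigQuery's limit on clustering columns


def json_table_ddl(table, cluster_columns, ts_column="ts", json_column="payload"):
    """Return a CREATE TABLE statement for a timestamp + JSON-string table,
    partitioned by day on ts_column and clustered on cluster_columns."""
    if len(cluster_columns) > MAX_CLUSTER_COLUMNS:
        raise ValueError("BigQuery allows at most 4 clustering columns")
    columns = [f"{ts_column} TIMESTAMP"]
    columns += [f"{name} STRING" for name in cluster_columns]
    columns.append(f"{json_column} STRING")
    ddl = f"CREATE TABLE `{table}` ({', '.join(columns)}) PARTITION BY DATE({ts_column})"
    if cluster_columns:
        ddl += " CLUSTER BY " + ", ".join(cluster_columns)
    return ddl


print(json_table_ddl("myapp.raw_events", ["event_type", "page"]))
```

Queries that filter on the timestamp and on the cluster columns should then scan far less than the whole table, which is what keeps the 10-minute batch jobs cheap.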
The other option would be to change your model and add an event-processing middle layer. You could first send all your events to Pub/Sub or Dataflow, process them there, and write them to BigQuery under a new schema. This pipeline would be able to create tables on the fly with the schema you code in your engine.
By the way, you can remove columns; that's rematerialization: you rewrite the same table with a query. You can rematerialize to remove duplicate rows as well.
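A minimal sketch of what such a rematerialization query could look like, generated from Python. The helper name and the example table and columns are made up; it only handles top-level columns, and any partitioning or clustering would likely need to be restated in the statement, so treat it as a starting point.

```python
def drop_columns_sql(table, columns):
    """Return a query that rewrites `table` in place without `columns`.

    Uses standard SQL SELECT * EXCEPT, so it only removes top-level columns.
    """
    if not columns:
        raise ValueError("at least one column is required")
    except_list = ", ".join(f"`{name}`" for name in columns)
    return (
        f"CREATE OR REPLACE TABLE `{table}` AS "
        f"SELECT * EXCEPT ({except_list}) FROM `{table}`"
    )


print(drop_columns_sql("myapp.events", ["old_form", "legacy_id"]))
```

Keep in mind that the rewrite is billed as a full scan of the table.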
I think this use case can be implemented using Dataflow (or Apache Beam) with its Dynamic Destinations feature. The steps of the pipeline would be:
Read the event/JSON from Pub/Sub.
Flatten the events and filter down to the columns you want to insert into the BQ table.
With Dynamic Destinations you will be able to insert the data into the respective tables (if you have events of various types). In Dynamic Destinations you can specify the schema on the fly based on the fields in your JSON.
Get the failed insert records from the Dynamic Destinations write and write them to a file per event type, following some windowing based on your use case (how frequently you observe such issues).
Read the file, update the schema once, and load the file into that BQ table.
I have implemented this logic in my use case and it is working perfectly fine.
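The flatten-and-filter step could look roughly like the plain-Python sketch below. It is independent of Beam, and the function names, the underscore separator, and the sample event are assumptions for illustration. Repeated fields (lists) are passed through untouched here.

```python
def flatten(event, parent="", sep="_"):
    """Flatten nested dicts into one level, joining keys with sep."""
    flat = {}
    for key, value in event.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value  # scalars and lists are kept as they are
    return flat


def to_row(event, allowed_columns):
    """Flatten an event and keep only the columns destined for BigQuery."""
    return {k: v for k, v in flatten(event).items() if k in allowed_columns}


event = {"user": {"id": 7, "geo": {"country": "DE"}}, "page": "home"}
print(to_row(event, {"user_id", "page"}))
```

In a Beam pipeline a function like `to_row` would typically be applied with a map step before the BigQuery write.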
I need to move BigQuery datasets with many tables (both partitioned and unpartitioned) from the US to the EU.
If the source table is unpartitioned, the documented way of bq extracting the data to GCS and bq loading it in another region works fine, so far so good.
If however the source table is partitioned, during the load step the mapping between data and partition is lost and I'll end up having all data within one partition.
Is there a good (automated) way of exporting and importing partitioned tables in BQ? Any pointers would be greatly appreciated!
There are a few ways to do this, but I would personally use Cloud Dataflow to solve it. You'll have to pay a little bit more for Dataflow, but you'll save a lot of time and scripting in the long run.
High level:
Spin up a Dataflow pipeline
Read partitioned table in US (possibly aliasing the _PARTITIONTIME to make it easier later)
Write results back to BigQuery using same partition.
It's basically the same as what was talked about here.
Another solution is to use DML to insert the data instead of a load job: https://cloud.google.com/bigquery/docs/using-dml-with-partitioned-tables. Since you have a timestamp column in the table from which to infer the partition, you can use the following (the timestamp is truncated to the day, since _PARTITIONTIME values must fall on a day boundary):
INSERT INTO PROJECT_ID.DATASET.mytable (_PARTITIONTIME, field1, field2) SELECT TIMESTAMP_TRUNC(timestamp_column, DAY), 1, "one" FROM PROJECT_ID.DATASET.federated_table
You can define a permanent federated table, or a temporary one, https://cloud.google.com/bigquery/external-data-cloud-storage#permanent-tables. You'll need to pay for DML though, while load is free.
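If you would rather stay with extract and load, a different approach from the two above is to move one partition at a time using partition decorators, so that each day's data lands back in the same partition. The sketch below only generates the shell commands; the table names, bucket names, and the Avro format are assumptions for illustration.

```python
from datetime import date, timedelta


def partition_decorators(table, start, end):
    """Yield table$YYYYMMDD for every day from start to end, inclusive."""
    day = start
    while day <= end:
        yield f"{table}${day:%Y%m%d}"
        day += timedelta(days=1)


def transfer_commands(src_table, dst_table, src_bucket, dst_bucket, start, end):
    """Return extract, copy, and load commands for each daily partition."""
    commands = []
    for src in partition_decorators(src_table, start, end):
        suffix = src.rsplit("$", 1)[1]
        dst = f"{dst_table}${suffix}"
        src_uri = f"gs://{src_bucket}/{suffix}/*.avro"
        dst_uri = f"gs://{dst_bucket}/{suffix}/*.avro"
        # Single quotes stop the shell from expanding the $ in the decorator.
        commands.append(f"bq extract --destination_format=AVRO '{src}' '{src_uri}'")
        commands.append(f"gsutil -m cp '{src_uri}' 'gs://{dst_bucket}/{suffix}/'")
        commands.append(f"bq load --source_format=AVRO '{dst}' '{dst_uri}'")
    return commands


for cmd in transfer_commands("us_ds.t", "eu_ds.t", "us-bucket", "eu-bucket",
                             date(2016, 5, 1), date(2016, 5, 2)):
    print(cmd)
```

The destination table is assumed to already exist as a partitioned table in the EU dataset, and the two buckets are assumed to live in the matching regions.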
I saw in the documentation for partitioning that you can partition a table based on a timestamp field in the schema, rather than on the data's insertion time. I was hoping to experiment with this by taking one of our existing tables, exporting its data, and then creating a new table with the same schema and with partitioning on the timestamp field, but when I try it I get:
"Field based partitioning support is not yet available for this project"
Is this something I have to ask to be set up for my project, or is it experimental? If the latter, is there an ETA for it being rolled out?
The situation is that I have terabytes of data stored in non-partitioned tables, and it seems that not only will the conversion process be painful (I've read Migrating from non-partitioned to Partitioned tables), but my Dataflow pipeline going forward will have to do ugly things to write new data into the correct partition, because 'time of insertion' won't be accurate compared to the timestamps in the actual data.
I guess you read about the new feature in the API references. We're preparing to alpha this feature so have enabled it in the API and client. You can track the feature progress at https://issuetracker.google.com/issues/65440943. Thanks!
I would like to know if there is any way to stream data to a specific time partition of a BigQuery table. The documentation says that you must use table decorators:
Loading data using partition decorators
Partition decorators enable you to load data into a specific
partition. To adjust for timezones, use a partition decorator to load
data into a partition based on your preferred timezone. For example,
if you are on Pacific Standard Time (PST), load all data generated on
May 1, 2016 PST into the partition for that date by using the
corresponding partition decorator:
[TABLE_NAME]$20160501
Source: https://cloud.google.com/bigquery/docs/partitioned-tables#dealing_with_timezone_issues
And:
Restating data in a partition
To update data in a specific partition, append a partition decorator
to the name of the partitioned table when loading data into the table.
A partition decorator represents a specific date and takes the form:
$YYYYMMDD
Source: https://cloud.google.com/bigquery/docs/creating-partitioned-tables#creating_a_partitioned_table
But if I try to use them when streaming data, I get the following error: Table decorators cannot be used with streaming insert.
Thanks in advance!
Sorry for the inconvenience. We are considering providing support for this in the near future. Please stay tuned for more updates.
Possible workarounds that might work in many cases:
If you have most of the data available (which is sometimes the case when restating data for an old partition), you can use a load job with the partition as the destination.
Another option is to stream to a temporary table and, after the data has been flushed from the streaming buffer, use bq cp to copy it into the target partition.
This feature was recently released, and you can now stream directly into a decorated date partition that lies within the last 30 days or up to 5 days in the future.
https://cloud.google.com/bigquery/streaming-data-into-bigquery
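A small sketch of how a client might check that window before building the decorated table name. The 30-day and 5-day bounds are taken from the answer above and are kept as parameters because the limits may have changed; the function name is hypothetical.

```python
from datetime import date, timedelta


def streaming_partition(table, target, today, past_days=30, future_days=5):
    """Return table$YYYYMMDD if target lies inside the streaming window.

    Raises ValueError when the partition date is too old or too far ahead.
    """
    earliest = today - timedelta(days=past_days)
    latest = today + timedelta(days=future_days)
    if not earliest <= target <= latest:
        raise ValueError(f"{target} is outside the window {earliest}..{latest}")
    return f"{table}${target:%Y%m%d}"


print(streaming_partition("events", date(2016, 5, 1), today=date(2016, 5, 10)))
```

Rows that fall outside the window would still need a load job with the partition as the destination.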
As part of our BigQuery solution we have a cron job which checks the latest table created in a dataset and creates more if this table is out of date. This check is done with the following query:
SELECT table_id FROM [dataset.__TABLES_SUMMARY__] WHERE table_id LIKE 'table_root%' ORDER BY creation_time DESC LIMIT 1
Our integration tests have recently been throwing errors because this query is hitting BigQuery's internal cache, even though running the query against the underlying table would give a different result. The caching also occurs if I run the query in the web interface of the Google Cloud console.
If I specify for the query not to cache using the
queryRequest.setUseQueryCache(false)
flag in the code then the tests pass correctly.
My understanding was that BigQuery's automatic caching would not occur if running the query against the underlying table would give a different result. Am I wrong in this assumption (in which case, when does caching occur?), or is this a bug?
Well, the answer to your question is that the approach is conceptually wrong. You always need to set the no-cache parameter if you do not want cached data. Even in the web UI there is an option you need to set. The default is to use the cached version.
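For reference, the same setting expressed as a raw `jobs.query` request body might look like the sketch below. The helper name is hypothetical; `useLegacySql` is set because the query in the question uses legacy SQL table syntax.

```python
def latest_table_query_request(dataset, table_root, use_cache=False):
    """Build a jobs.query request body that looks up the newest table.

    useQueryCache defaults to False here so the result is always fresh.
    """
    query = (
        f"SELECT table_id FROM [{dataset}.__TABLES_SUMMARY__] "
        f"WHERE table_id LIKE '{table_root}%' "
        "ORDER BY creation_time DESC LIMIT 1"
    )
    return {"query": query, "useLegacySql": True, "useQueryCache": use_cache}


request = latest_table_query_request("dataset", "table_root")
print(request["useQueryCache"])
```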
But, fundamentally you need to change the process and use the recent features:
Automatic table creation using template tables
A common usage pattern for streaming data into BigQuery is to split a logical table into many smaller tables, either for creating smaller sets of data (e.g., by date or by user ID) or for scalability (e.g., streaming more than the current limit of 100,000 rows per second). To split a table into many smaller tables without adding complex client-side code, use the BigQuery template tables feature to let BigQuery create the tables for you.
To use a template table via the BigQuery API, add a templateSuffix parameter to your insertAll request
By using a template table, you avoid the overhead of creating each table individually and specifying the schema for each table. You need only create a single template, and supply different suffixes so that BigQuery can create the new tables for you. BigQuery places the tables in the same project and dataset. Templates also make it easier to update the schema because you need only update the template table.
Tables created via template tables are usually available within a few seconds.
This way you don't need to have a cron, as it will automatically create the missing tables.
Read more here: https://cloud.google.com/bigquery/streaming-data-into-bigquery#template-tables
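To make the templateSuffix parameter concrete, here is a sketch of an insertAll request body built in Python. The helper name, the example suffix, and the use of random UUIDs as insertId values are assumptions for illustration.

```python
import uuid


def insert_all_body(rows, template_suffix):
    """Build a tabledata.insertAll request body that targets
    <template table id> + template_suffix, created on demand by BigQuery."""
    return {
        "kind": "bigquery#tableDataInsertAllRequest",
        "templateSuffix": template_suffix,
        "rows": [
            # insertId lets BigQuery de-duplicate retried rows on a best-effort basis.
            {"insertId": str(uuid.uuid4()), "json": row}
            for row in rows
        ],
    }


body = insert_all_body([{"user": "alice", "action": "login"}], "_20160501")
print(body["templateSuffix"])
```

The request itself is sent against the template table's ID; BigQuery then creates the suffixed table if it does not exist yet.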