The limitation of partition table updates in BigQuery - google-bigquery

The Quotas & Limits document says that "partition table updates" are subject to the two limitations below.
Daily limit: 2,000 partition updates per table, per day
Rate limit: 50 partition updates every 10 seconds
My question is whether these limitations apply to a single partitioned table or to all partitioned tables in the dataset.
For example, is it possible to have thousands of day-partitioned tables and perform streaming inserts into each table every day?

Related

Is it possible to set expiration time for records in BigQuery

Is it possible to set a time to live for a column in BigQuery?
If a table has two columns, payment_details and timestamp, the data in the BigQuery table should be deleted automatically once the current time minus the timestamp is greater than 90 days.
Solution 1:
BigQuery has a partition expiration feature. You can leverage that for your use case.
Essentially you need to create a partitioned table, and set the partition_expiration_days option to 90 days.
CREATE TABLE mydataset.newtable (transaction_id INT64, transaction_date DATE)
PARTITION BY transaction_date
OPTIONS (
  partition_expiration_days = 90
)
or if you have a table partitioned already by the right column
ALTER TABLE mydataset.mytable
SET OPTIONS (
  -- Sets partition expiration to 90 days
  partition_expiration_days = 90
)
When a partition expires, BigQuery deletes the data in that partition.
Solution 2:
You can set up a Scheduled Query that runs hourly or daily and prunes data older than 90 days. By writing a DELETE query you have more control and can combine other business logic, such as deleting only duplicate rows while keeping the most recent entry even if it is older than 90 days.
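A minimal sketch of such a scheduled pruning query, assuming a table mydataset.payments with the payment_details and timestamp columns from the question (the table name is a placeholder), could look like this:

-- A scheduled DELETE that prunes rows older than 90 days.
-- mydataset.payments is a placeholder table name; `timestamp` and
-- payment_details are the columns from the question. Extra business logic
-- (e.g. keeping the newest duplicate per payment_details) can be added to
-- the WHERE clause.
DELETE FROM mydataset.payments
WHERE `timestamp` < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY);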
Solution 3:
If you have a larger business process that does the 90-day pruning based on other external factors, such as an API response or conditional evaluation, you can leverage Cloud Workflows to build and invoke a workflow regularly to automate the pruning of your data. See the article Automate the execution of BigQuery queries with Cloud Workflows, which can guide you through this.

BigQuery - Create view with Partition but base table doesn't have

This may sound crazy, but I want to implement something like having a view with a partition.
Background:
I had a table with a date partition on a column, and the table is really huge in size. We are running data ingestion into this table at a 2-minute interval. All the data loads are append-only, and every load inserts 10k+ rows. After some time, we encountered the partition limitation issue.
message: "Quota exceeded: Your table exceeded quota for Number of partition modifications to a column partitioned table. For more information, see https://cloud.google.com/bigquery/troubleshooting-errors"
Root cause (from the GCP support team):
The root cause under the hood was that due to your partitioned tables
have pretty granular partition for instance by minutes, hours or date,
when the loaded data cover a wide range of partition period, the
number of partition get modified will be high and above 4000. As per
internal documentation, it was suggested the user who ran into this
issue to consider making a less granular partition for instance change
a date/hour/minute based partitioned table to a week based partitioned
table. Alternatively split the load to multiple and hence limit the
data range to cover less number of partitions that would be affected.
This is the best recommendation I could have now.
So I'm planning to keep this table un-partitioned and create a view (we need a view for eliminating the duplicates), and the view should have a partition. Is this possible? Or is there any other alternative solution for this?
You can't partition a view; it isn't physically materialized. Partitioning by day can be limiting with the 4,000 limit. Would year work? Then you can use an integer partition:
create or replace table BI.test
PARTITION BY RANGE_BUCKET(Year, GENERATE_ARRAY(2000, 3000, 1)) as
select 2000 as Year, 1 as value
union all
select 2001 as Year, 1 as value
union all
select 2002 as Year, 1 as value
Alternatively, I've used month (YYYYMM) or week (YYYYWW) as the integer to partition by, which gets you around 40 years:
RANGE_BUCKET(monthasintegerfield, GENERATE_ARRAY(201612, 205712, 1))

Designing a Cloud BigTable: Millions of Rows X Millions of Columns?

I'm wondering if the following table design for BigTable is legit. From what I read, having millions of sparse columns should work, but would it work well?
The idea is to keep time-based "samples" in columns (each is a few KB). I expect to have millions of rows, where each would have a limited number of entries (~10-50) as values in the table. Each column in the table represents a timespan of (say) 10 seconds, and since there are roughly 2.6 million seconds in a month, a year would take about 3M columns. I intend to use row scans to fetch rows by prefix - usually just a handful of rows per fetch.
so, to sum:
the table will contain (million rows X 50 samples per row, each a few kb): 50M items
but the table's dimensions are (million rows X million columns): a trillion cells.
Now, I know that empty cells don't take space and the whole "table" metaphor isn't really apt for Bigtable, but I'm still wondering: does the above represent a valid use case for Bigtable?
Based on the Google docs, Cloud Bigtable is a sparsely populated table that can scale to billions of rows and thousands of columns. Regarding the limits on Cloud Bigtable rows and columns: rows can be big but are not infinite. A row can contain ~100 column families and millions of columns, but the recommendation is to keep rows under 100 MB and individual column values under 10 MB.
Therefore, in Bigtable the limit on the data within a table is based on data size rather than the number of columns or rows (except for "column families per table"). I believe your use case is valid, and you could have millions of rows and columns as long as the values stay within the hard limits. As a best practice, design your schema to keep your row and value sizes within these recommendations.

Does deleting and creating a table in BigQuery renew the daily quota limit?

I am creating a data pipeline which writes data into a BigQuery table every minute and eventually exceeds the quota limit. Will deleting the table after a few hours and then creating it again renew the quota limit for that table?
I'm using the Python API of BigQuery to achieve this task.
I need to update the same table in BigQuery without exceeding the quota limit.
As per the BigQuery documentation, BigQuery imposes an upper-bound limit of 1,000 updates per table per day.
I think you have to "engineer" ways to get around your frequency of updates to a table. There are some very obvious ways around this (which are also pretty standard industry practices) and then there are some tricks. Here is what I can think of off the top of my head:
You can choose to update your target table (overwrite) less frequently.
You can compose a new table name that is valid only for updates coming in during a certain time interval of the day (for example: between 2 and 3 AM, let your pipeline write query results to the table mydataset.my_table_[date]_02_03). Then, at the time of querying, you can just use wildcard statements like:
select count(*) as cnt from `mydataset.my_table_[date]_*`
Which is equivalent to:
select count(*) as cnt from (
select * from (
select * from `mydataset.my_table_[date]_00_01`
)
union all
select * from (
select * from `mydataset.my_table_[date]_01_02`
)
union all
....
)
With this approach, however, make sure you are always "appending" (not overwriting) data to the table corresponding to the hour of the day. Also, don't forget that you can always take advantage of BQ's date-partitioned tables to achieve similar results.
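As a rough sketch of that date-partitioned alternative (the table and column names below are made up for illustration), a single partitioned table can take the per-run appends and still let you read one day cheaply:

-- One date-partitioned table instead of a sharded table per hour/day.
CREATE TABLE IF NOT EXISTS mydataset.my_table
(event_ts TIMESTAMP, payload STRING)
PARTITION BY DATE(event_ts);

-- Each pipeline run appends (does not overwrite) into the current day's partition.
INSERT INTO mydataset.my_table (event_ts, payload)
VALUES (CURRENT_TIMESTAMP(), 'example row');

-- Reading a single day only scans that day's partition.
SELECT COUNT(*) AS cnt
FROM mydataset.my_table
WHERE DATE(event_ts) = CURRENT_DATE();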
Hope this helps.

Wildcard on day tables vs time partition

I'm trying to understand whether there is a difference in BigQuery (in cost or in what can be queried, for example) between:
Creating one table per day (like my_table_2018_02_06)
Creating a time-partitioned table (my-table with a time partition by day).
Thanks!
Short explanation: querying multiple tables using Wildcard Tables was the proposed alternative for when BigQuery did not have a partition mechanism available. The natural evolution was to include the Partitioned Table feature, and currently there is an alpha release consisting of column-based time partitioning, i.e. letting the user define which column (having a DATE or TIMESTAMP data type) will be used for the partitioning.
So, since BigQuery engineers are currently working on adding new features to table partitioning rather than to the legacy Wildcard Tables methodology, I'd suggest that you work with Partitioned Tables.
Long explanation: you are comparing two approaches that are in fact used for the same purpose, but which have different implications:
Wildcard Tables: some time ago, when table partitioning was not a feature supported by Big Query, Wildcard Tables was the way to query multiple tables using concise SQL queries. A Wildcard Table represents the union of all the tables that match the wildcard expression specified in the SQL statement. However, Wildcard Tables have some limitations, such as:
Do not support views.
Do not support cached results (queries containing wildcard tables are billed every time they are run, even if the "cached results" option is checked).
Only work with native BigQuery storage (cannot work with external tables [Bigtable, Storage or Drive]).
Only available in standard SQL.
Partitioned Tables: these are unique tables that are divided into segments, split by date. There is a lot of documentation regarding how to work with Partitioned Tables. Regarding the pricing, each partition in a Partitioned Table is considered an independent entity, so if a partition has not been updated for the last 90 days, its data is considered long-term storage and is therefore billed with the appropriate discount (as would happen with a normal table). Finally, Partitioned Tables are here to stay, so there are more incoming features for them, such as column-based partitioning, which is currently in alpha, and you can follow its status in this Public Issue Tracker post. On the other hand, there are also some current limitations to be considered:
Maximum of 2500 partitions per Partitioned Table.
Maximum of 2000 partition updates per table per day.
Maximum of 50 partition updates every 10 seconds.
So in general, it would be advisable to work with Partitioned Tables over multiple tables using Wildcard Tables. However, you should always consider your use case and see which one of the possibilities meets your requirements better.
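As a minimal illustration of the two approaches (the table names below are placeholders), the same daily slice can be read either from sharded day tables through a wildcard or from a single ingestion-time partitioned table through its partition pseudo-column:

-- Sharded day tables, filtered through the wildcard suffix.
SELECT COUNT(*) AS cnt
FROM `mydataset.my_table_*`
WHERE _TABLE_SUFFIX = '20180206';

-- A single ingestion-time partitioned table, filtered through _PARTITIONTIME.
SELECT COUNT(*) AS cnt
FROM `mydataset.my_partitioned_table`
WHERE _PARTITIONTIME = TIMESTAMP('2018-02-06');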
One thing to add to your decision criteria here is caching and usage of legacy vs standard SQL.
Since the syntax in standard SQL for selecting multiple tables uses a wildcard, there is no way for the query result to be cached.
Interestingly, the query result would have been cached if legacy SQL was used. Just converting the query to standard SQL would disable caching.
This may be important to consider, at least in some cases more than others.
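For reference, the two forms being compared might look roughly like this (the table prefix is a placeholder, and the sharded tables are assumed to be named my_table_YYYYMMDD):

-- Legacy SQL: table wildcard functions such as TABLE_DATE_RANGE; results can be cached.
SELECT COUNT(*) AS cnt
FROM (TABLE_DATE_RANGE([mydataset.my_table_],
                       TIMESTAMP('2018-02-01'),
                       TIMESTAMP('2018-02-06')))

-- Standard SQL: wildcard table with _TABLE_SUFFIX; results are not cached.
SELECT COUNT(*) AS cnt
FROM `mydataset.my_table_*`
WHERE _TABLE_SUFFIX BETWEEN '20180201' AND '20180206';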
Thank you,
Hazem
Not exactly a time partition, but one can benefit from both worlds - wildcard "partitions" and real partitions - to slice the data even further. Below is an example where we first use the date suffix to select only the table holding data from that particular date, then we use actual partitioning within the table to limit the amount of data scanned even further.
Create the first partitioned table with a date suffix
CREATE TABLE `test_2021-01-05` (x INT64, y INT64)
PARTITION BY RANGE_BUCKET(y, GENERATE_ARRAY(0, 500, 1));
insert `test_2021-01-05` (x,y) values (5,1);
insert `test_2021-01-05` (x,y) values (5,2);
insert `test_2021-01-05` (x,y) values (5,3);
Create the second partitioned table with a date suffix
CREATE TABLE `test_2021-01-04` (x INT64, y INT64)
PARTITION BY RANGE_BUCKET(y, GENERATE_ARRAY(0, 500, 1));
insert `test_2021-01-04` (x,y) values (4,1);
insert `test_2021-01-04` (x,y) values (4,2);
Select all the data from both tables using wildcard notation; 80B of data is the whole test set
select * from `test_*`
-- 80B, all the data
Just select data from one table, which is like partitioning on date
select * from `test_*`
where _TABLE_SUFFIX = "2021-01-05"
-- 48B
Select data both from one table (where I am interested in one date) and only from one partition
select * from `test_*`
where _TABLE_SUFFIX = "2021-01-05"
and y = 1
-- 16B, that was the goal
Select data just from one partition from all the tables
select * from `test_*`
where y = 1
-- 32B, only one partition from both tables
The ultimate goal was to limit the data scanned when reading, thus reducing the cost and increasing performance.