When does BigQuery flush the streaming output buffer - google-bigquery

I know this question has been asked in a different form a while ago. But now that BQ allows DML on partitioned tables, it's more important to understand when the streaming buffer is flushed so that we can perform DML on tables for maintenance.
This is very important now: I have 1500 partitioned tables, and each table has at least 200 partitions. I now have to update all the tables, since we are applying some form of hashing for GDPR. If I can't run the DML, I have to restate the 200 * 1500 partitions by joining with a reference table. If I can run the DML, I just have to run 1500 UPDATE statements.
I have stopped streaming and have been waiting for more than 90 minutes, yet I still get the same error saying I can't run DML because the table has a streaming buffer. Any response based on your own experience would be highly appreciated.

Answer is "it depends" and mostly based on size of data you stream to buffer - but it also based on algorithmic tuning on BQ side. As of now - there is no definite time you can somehow calculate before data will flush. And there is no mechanism to invoke flush of buffer manually.

So apparently BigQuery now allows UPDATE on older partitions of partitioned tables that have a streaming buffer, but not on the streaming buffer itself.
For example:
update `dataset.table_name`
set column = 'value'
where _PARTITIONTIME = '2018-05-01'
Works beautifully.
But
update `dataset.table_name`
set column = 'value'
where _PARTITIONTIME is null
doesn't work and fails with the error below:
UPDATE or DELETE statement over table dataset.table_name would affect rows in the streaming buffer, which is not supported
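Given that behavior, one hedged sketch for the 1500-table GDPR case in the question above is to loop over the tables and run a partition-scoped UPDATE per table, excluding buffer rows via the _PARTITIONTIME filter (per the observation above, rows in the buffer have a NULL _PARTITIONTIME). The table list, column name, and hashing expression are all placeholders:

from google.cloud import bigquery

client = bigquery.Client()

# Placeholder list; in practice it could come from listing the dataset.
tables = ["dataset.table_0001", "dataset.table_0002"]

for t in tables:
    sql = """
    UPDATE `{}`
    SET column = TO_HEX(SHA256(column))  -- placeholder GDPR hashing
    WHERE _PARTITIONTIME IS NOT NULL     -- skips rows still in the buffer
    """.format(t)
    client.query(sql).result()  # block until each DML job finishes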

Related

Clarification about streaming buffer in big query

I just started exploring BigQuery in GCP for learning purposes, so I created two tables and tried insert, delete, and update queries using the Python API.
I'm able to update a table called table_1 at any time using the query below:
UPDATE *****.*****.table_1 SET col_1 = 'value_1', col_2 = 'value_2' WHERE col3 = 'value_3'
and it returns: This statement modified 2 rows in ****:****.Projects.
But when I try to update the table called table_2 with a query written the same way, it returns:
UPDATE or DELETE statement over table ***.***.table_2 would affect rows in the streaming buffer, which is not supported
I created the tables and performed the operations in the same way, so my question is why I'm getting this error only for table_2.
Thank you
Streamed data is available for real-time analysis within a few seconds of the first streaming insertion into a table, but it can take up to 90 minutes to become available for copy/export and other operations like UPDATE or DELETE. You probably have to wait up to 90 minutes so that all buffers are persisted on the cluster. You can check the tables.get response for a section named streamingBuffer to see whether the table has a streaming buffer or not.
If you had created the table with a load job alone, it would not have a streaming buffer, so you probably streamed some values into it.
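For completeness, that check is easy from the same Python API the question already uses; a minimal sketch with placeholder table IDs:

from google.cloud import bigquery

client = bigquery.Client()
for table_id in ("project.dataset.table_1", "project.dataset.table_2"):
    table = client.get_table(table_id)  # a tables.get call under the hood
    print(table_id, "streaming buffer:", table.streaming_buffer)  # None when flushed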
You can also refer to this documentation [1] for more information
[1] https://cloud.google.com/bigquery/streaming-data-into-bigquery

BigQuery standard SQL: DELETE not working?

I cannot delete the range defined by the WHERE clause.
My query:
delete from `dataset.events1` as t where t.group='error';
Result:
Error: UPDATE or DELETE statement over table dataset.events1 would affect rows in the streaming buffer, which is not supported.
According to the BQ docs:
Rows that were written to a table recently via streaming (using the tabledata.insertall method) cannot be modified using UPDATE, DELETE, or MERGE statements. Recent writes are typically those that occur within the last 30 minutes. Note that all other rows in the table remain modifiable by using UPDATE, DELETE, or MERGE statements.
This looks like the error you're facing.
You can check if your table has a streaming buffer attached through the BigQuery API.
This error message is expected behavior when running DML against rows that were recently streamed into the table; it exists to maintain data consistency. Because of this, you have to wait until the buffer is flushed, which can take up to 90 minutes, before the rows become available for copy/export and other operations; otherwise you will get the same error.
To validate if the table has an active streaming buffer process, you can check the tables.get response and verify if it contains a section named streamingBuffer.
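If simply waiting is acceptable, the DELETE can also be retried until the buffer clears. A rough sketch, assuming the streaming-buffer error surfaces as a BadRequest in the Python client (the retry interval is an arbitrary placeholder):

import time
from google.api_core.exceptions import BadRequest
from google.cloud import bigquery

client = bigquery.Client()
sql = "DELETE FROM `dataset.events1` AS t WHERE t.group = 'error'"

while True:
    try:
        client.query(sql).result()
        break  # delete succeeded
    except BadRequest as e:
        if "streaming buffer" not in str(e):
            raise  # unrelated failure; don't keep retrying
        time.sleep(600)  # wait ten minutes, then try again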

How to manage Schema Drift while streaming to BigQuery sharded table

We are new to BigQuery and are trying to figure out the best way to use it for real-time analytics. We are sending a stream of logs from our back-end services to Kafka, and we want to stream those into BigQuery using streaming inserts. For queryability we are both partitioning by time and sharding tables by event type (for use with wildcard queries). We put a view on top of the family of tables created so that they look like one table, and use the _TABLE_SUFFIX (well, once they roll out the feature; for now using UNION ALL) and _PARTITIONTIME columns to reduce the set of rows scanned for queries. So far so good.
What we are unsure of how to handle properly is schema changes. The schema of our log messages changes frequently. Having a manual process to keep BigQuery in sync is not tenable. Ideally our streaming pipeline would detect the change and apply the schema update (for adding columns) or table creation (for adding an event type) as necessary. We have tooling up-stream so that we know all schema updates will be backwards compatible.
My understanding is that all of the shards must have the same schema. How do we apply the schema update in such a fashion that:
We don't break queries that are run during the update.
We don't break streaming inserts.
Is #1 possible? I don't believe we can atomically change the schema of all the sharded tables.
For #2 I presume we have to stop our streaming pipelines while the schema update process is occurring.
Thanks,
--Ben
Wildcard tables with _TABLE_SUFFIX are available (https://cloud.google.com/bigquery/docs/querying-wildcard-tables), and you can use them even if the schemas of the tables are different; they just need to have compatible schemas. With UNION ALL, you need all the tables to have the same schema, so it will not work if you're updating schemas at the same time.
Streaming inserts will also work if you only specify a subset of the fields. However, you cannot add new fields as part of a streaming insert; you'll have to update the table first and then insert the data with the new schema.
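For the "update table first, then insert" step, here is a rough sketch with the google-cloud-bigquery Python client; the table and field names are placeholders. Columns added to an existing table must be NULLABLE or REPEATED, and in practice the streaming backend can take a little while to pick up the new column.

from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("project.dataset.events_login_20180501")  # one shard

# Append the new column; added fields must be NULLABLE or REPEATED.
table.schema = list(table.schema) + [
    bigquery.SchemaField("new_field", "STRING", mode="NULLABLE"),
]
table = client.update_table(table, ["schema"])

# Rows that include the new field can now be streamed.
errors = client.insert_rows_json(table, [{"new_field": "value"}])
assert not errors, errors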

Result of a BigQuery job running on a table in which data is loaded via the streaming API

I have a BQ wildcard query that merges a couple of tables with the same schema (company_*) into a new, single table (all_companies). (all_companies will be exported later into Google Cloud Storage)
I'm running this query using the BQ CLI with all_companies as the destination table, and this generates a BQ job (runtime: 20+ minutes).
The company_* tables are populated constantly using the streaming API.
I've read about BigQuery jobs, but I can't find any information about streaming behavior.
If I start the BQ CLI query at T0, the streamingAPI adds data to company_* tables at T0+1min and the BQ CLI query finishes at T0+20min, will the data added at T0+1min be present in my destination table or not?
As described here, the query engine will look at both the columnar storage and the streaming buffer, so potentially the query should see the streamed data.
It depends on what you mean by a runtime of 20+ minutes. If the query is run 20 minutes after you create the job, then all data in the streaming buffer by T0+20min will be included.
If on the other hand the job starts immediately and takes 20 minutes to complete, you will only see data that is in the streaming buffer at the moment the table is queried.

Update or Delete tables with streaming buffer in BigQuery?

I'm getting this following error when trying to delete records from a table created through GCP Console and updated with GCP BigQuery Node.js table insert function.
UPDATE or DELETE DML statements are not supported over table stackdriver-360-150317:my_dataset.users with streaming buffer
The table was created without streaming features, and from what I'm reading in the documentation: "Tables that have been written to recently via BigQuery Streaming (tabledata.insertall) cannot be modified using UPDATE or DELETE statements."
Does it mean that once a record has been inserted into a table with this function, there's no way to delete records? At all? If that's the case, does it mean the table needs to be deleted and recreated from scratch? If that's not the case, can you please suggest a workaround to avoid this issue?
Thanks!
Including new error message for SEO: "UPDATE or DELETE statement over table ... would affect rows in the streaming buffer, which is not supported" -- Fh
To check whether the table has a streaming buffer, check the tables.get response for a section named streamingBuffer. Alternatively, when streaming to a partitioned table, data in the streaming buffer has a NULL value for the _PARTITIONTIME pseudo column, so the buffer can be checked even with a simple WHERE query.
Streamed data is available for real-time analysis within a few seconds of the first streaming insertion into a table, but it can take up to 90 minutes to become available for copy/export and other operations. You probably have to wait up to 90 minutes so that the whole buffer is persisted on the cluster. You can use queries to see whether the streaming buffer is empty or not, as you mentioned.
If you use a load job to create the table, it won't have a streaming buffer, but you probably streamed some values to it.
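A quick sketch of that WHERE-based check for an ingestion-time partitioned table, with a placeholder table name:

from google.cloud import bigquery

client = bigquery.Client()
sql = """
SELECT COUNT(*) AS buffered_rows
FROM `project.dataset.users`
WHERE _PARTITIONTIME IS NULL  -- rows still sitting in the streaming buffer
"""
buffered = list(client.query(sql).result())[0].buffered_rows
print("rows in streaming buffer:", buffered)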
Note the answer below for working with tables that have ongoing streaming buffers: just use a WHERE clause to filter out the most recent minutes of data and your queries will work. -- Fh
Make sure to change your filters so they don't include data that could be in the current streaming buffer.
For example, this query fails while I'm streaming to this table:
DELETE FROM `project.dataset.table`
WHERE id LIKE '%-%'
Error: UPDATE or DELETE statement over table project.dataset.table would affect rows in the streaming buffer, which is not supported
You can fix it by only deleting older records:
DELETE FROM `project.dataset.table`
WHERE id LIKE '%-%'
AND ts < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 40 MINUTE)
4282 rows affected.