Update or Delete tables with streaming buffer in BigQuery? - google-bigquery

I'm getting the following error when trying to delete records from a table that was created through the GCP Console and updated with the GCP BigQuery Node.js table insert function.
UPDATE or DELETE DML statements are not supported over table stackdriver-360-150317:my_dataset.users with streaming buffer
The table was created without streaming features, and from what I'm reading in the documentation: "Tables that have been written to recently via BigQuery Streaming (tabledata.insertall) cannot be modified using UPDATE or DELETE statements."
Does it mean that once a record has been inserted into a table with this function, there's no way to delete records? At all? If that's the case, does it mean the table needs to be deleted and recreated from scratch? If that's not the case, can you please suggest a workaround to avoid this issue?
Thanks!
Including new error message for SEO: "UPDATE or DELETE statement over table ... would affect rows in the streaming buffer, which is not supported" -- Fh

To check if the table has a streaming buffer, check the tables.get response for a section named streamingBuffer. Also, when streaming to a partitioned table, data in the streaming buffer has a NULL value for the _PARTITIONTIME pseudo column, so this can be checked even with a simple WHERE query.
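For example, here is a minimal sketch of both checks using the Python client library (assuming the google-cloud-bigquery package; the project, dataset, and table names are placeholders):
from google.cloud import bigquery

client = bigquery.Client()

# tables.get: the returned Table object exposes the streamingBuffer section
# as the streaming_buffer property (None when no buffer is attached).
table = client.get_table("my_project.my_dataset.users")  # placeholder table
if table.streaming_buffer:
    print("Streaming buffer present, ~%s rows" % table.streaming_buffer.estimated_rows)
else:
    print("No streaming buffer attached")

# For a partitioned table, rows still in the buffer have a NULL _PARTITIONTIME.
query = """
    SELECT COUNT(*) AS buffered_rows
    FROM `my_project.my_dataset.users`
    WHERE _PARTITIONTIME IS NULL
"""
buffered = list(client.query(query).result())[0].buffered_rows
print("Rows still in the streaming buffer:", buffered)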
Streamed data is available for real-time analysis within a few seconds of the first streaming insertion into a table, but it can take up to 90 minutes to become available for copy/export and other operations. You probably have to wait up to 90 minutes so that the whole buffer is persisted on the cluster. You can use queries to see if the streaming buffer is empty or not, like you mentioned.
If you use a load job to create the table, you won't have a streaming buffer, but you probably streamed some values to it.
See the answer below for working with tables that have an ongoing streaming buffer: just use a WHERE clause to filter out the latest minutes of data and your queries will work. -- Fh

Make sure to change your filters so they don't include data that could be in the current streaming buffer.
For example, this query fails while I'm streaming to this table:
DELETE FROM `project.dataset.table`
WHERE id LIKE '%-%'
Error: UPDATE or DELETE statement over table project.dataset.table would affect rows in the streaming buffer, which is not supported
You can fix it by only deleting older records:
DELETE FROM `project.dataset.table`
WHERE id LIKE '%-%'
AND ts < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 40 MINUTE)
4282 rows affected.
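If you are issuing the DML from code, here is a minimal sketch with the Python client (assuming the google-cloud-bigquery package; the project, dataset, and column names are placeholders, as in the query above):
from google.cloud import bigquery

client = bigquery.Client()

# Delete only rows old enough to have left the streaming buffer.
sql = """
    DELETE FROM `project.dataset.table`
    WHERE id LIKE '%-%'
      AND ts < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 40 MINUTE)
"""
job = client.query(sql)
job.result()  # wait for the DML job to finish
print(job.num_dml_affected_rows, "rows affected")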

Related

How to write a trigger to insert data from Aurora to Redshift

I have some data in an Aurora MySQL DB, and I would like to do two things:
HISTORICAL DATA:
Read the data from Aurora (say TABLE A), do some processing, and update some columns of a table in Redshift (say TABLE B).
ALSO,
LATEST DAILY LOAD:
Have a trigger-like condition where, whenever a new row is inserted in Aurora table A, a trigger updates the columns in Redshift table B with some processing.
What would be the best approach to handle such a situation? Please understand I don't have a simple read-and-insert situation; I also have to perform some processing between read and write.
Not sure if you have already solved the issue; if so, please share the details.
We are looking at the following approach:
A cron job writes the daily data batch into S3 (say 1 month of orders)
Upon arrival in S3, load that file into Redshift via the COPY command (https://docs.aws.amazon.com/redshift/latest/dg/tutorial-loading-run-copy.html)
Looking for more ideas/thoughts for sure.
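For step 2, a rough sketch with psycopg2 (the cluster endpoint, credentials, target table, bucket path, and IAM role are all placeholders):
import psycopg2

# Connect to the Redshift cluster (placeholder credentials).
conn = psycopg2.connect(
    host="my-cluster.example.redshift.amazonaws.com",
    port=5439,
    dbname="mydb",
    user="myuser",
    password="mypassword",
)

copy_sql = """
    COPY my_schema.table_b
    FROM 's3://my-bucket/daily-batch/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
    FORMAT AS CSV
    IGNOREHEADER 1;
"""

with conn, conn.cursor() as cur:
    cur.execute(copy_sql)  # load the S3 batch into the Redshift table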

Clarification about streaming buffer in BigQuery

I just started exploring BigQuery in GCP for learning purposes, so I created two tables and tried insert, delete, and update queries using the Python API.
I'm able to update a table called table_1 at any time using the below query:
UPDATE *****.*****.table_1 SET col_1 = 'value_1', col_2 = 'value_2' WHERE col3 = 'value_3'
and it returns: This statement modified 2 rows in ****:****.Projects.
But when I try to update the table called table_2 with a query in the same way, it returns:
UPDATE or DELETE statement over table ***.***.table_2 would affect rows in the streaming buffer, which is not supported
I created both tables and performed the operations in the same way, so my question is why I'm getting this error only for table_2.
Thank you
Streamed data is available for real-time analysis within a few seconds of the first streaming insertion into a table, but it can take up to 90 minutes to become available for copy/export and other operations like UPDATE or DELETE. You probably have to wait up to 90 minutes so that all buffers are persisted on the cluster. You can check the tables.get response for a section named streamingBuffer to see whether the table has a streaming buffer or not.
If you used a load job to create the table, you won't have a streaming buffer, but you probably streamed some values to it.
You can also refer to this documentation [1] for more information
[1] https://cloud.google.com/bigquery/streaming-data-into-bigquery
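For completeness, here is a minimal sketch of appending rows with a load job instead of streaming, using the Python client (the table name and row contents are placeholders); rows loaded this way never pass through the streaming buffer:
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my_project.my_dataset.table_2"  # placeholder

rows = [
    {"col_1": "value_1", "col_2": "value_2", "col3": "value_3"},
]

# load_table_from_json runs a load job, so the rows skip the streaming
# buffer and can be updated or deleted right away.
job_config = bigquery.LoadJobConfig(
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)
job = client.load_table_from_json(rows, table_id, job_config=job_config)
job.result()  # wait for the load job to complete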

Google BigQuery streaming sometimes fails if I delete and recreate the table before streaming

I am streaming data into a BigQuery table.
Delete the old table
Create a new table with same name and same schema
Stream data into new table
I had done this quite a few times before and it worked fine, but recently this approach stopped working.
After streaming is done (no error reported), I query the table: sometimes it works, sometimes I get an empty table. (Same script, same data, run many times; the results differ. Sometimes it works, sometimes not.)
And to add to the mystery, when I streamed a large amount of data it seemed to work most of the time, but when I streamed a small amount of data it failed most of the time.
But if I just do
Create a new table
Stream data into the new table
It always works.
I tried this both in Google Apps Script and the PHP Google Cloud Client Library for BigQuery. I had the same problems.
So I tried this in Google Apps Script
Delete the old table
Sleep 10 seconds, so the delete job should be done
Create a new table with same name and same schema
Sleep 10 seconds, so the create job should be done
Stream data into new table
It still gave me the same problems.
But there are no errors reported or logged.
Additional Information:
I tried again.
If I wait until the stream buffer is empty and then run the script, the results are always correct. The new data is streamed into the new table successfully.
But if I run the script right after the previous run, the results are empty. The data is not streamed into the new table.
So the error seems to happen when I "delete the old table and create the new table" while the stream buffer is not empty.
But according to the answer from this thread, "BigQuery Stream and Delete while streaming buffer is not empty?",
the old table and the new table (even though they have the same name and same schema) have two different "object ids"; they are actually two different tables. After I delete the old table, the old records in the stream buffer should be dropped too. Whether the stream buffer is empty or not should not affect my next steps: create a new table and stream new data into the new table.
On the other hand, if I try to "truncate the old table" instead of "delete the old table and create a new table" while there might still be data in the stream buffer, then a DML statement cannot modify data still in the stream buffer, so the truncate would fail.
In simple words, in this use case:
I cannot truncate the old table, because the stream buffer may not be empty.
I am supposed to "delete the old table and create a new table, then stream data to the new table". But it seems this is the root of my current problems: my new data cannot be streamed to the new table (even though the new table has a new object id and should not be affected by the fact that I just deleted an old table).
Avoid truncating and recreating tables while streaming.
From the official docs:
https://cloud.google.com/bigquery/troubleshooting-errors#streaming
Table Creation/Deletion - Streaming to a nonexistent table will return a variation of a notFound response. Creating the table in response may not immediately be recognized by subsequent streaming inserts. Similarly, deleting and/or recreating a table may create a period of time where streaming inserts are effectively delivered to the old table and will not be present in the newly created table.
Table Truncation - Truncating a table's data (e.g. via a query job that uses writeDisposition of WRITE_TRUNCATE) may similarly cause subsequent inserts during the consistency period to be dropped.
To avoid losing data: Create a new table with a different name.
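A minimal sketch of that pattern with the Python client (the table prefix, schema, and sample row are placeholders): generate a fresh table name for each run, create it, then stream into it, so the inserts can never land in a just-deleted table.
import time
from google.cloud import bigquery

client = bigquery.Client()

# A unique name per run avoids the delete/recreate consistency window.
table_id = "my_project.my_dataset.events_%d" % int(time.time())
schema = [
    bigquery.SchemaField("id", "STRING"),
    bigquery.SchemaField("ts", "TIMESTAMP"),
]
client.create_table(bigquery.Table(table_id, schema=schema))

rows = [{"id": "abc-123", "ts": "2020-01-01T00:00:00Z"}]
errors = client.insert_rows_json(table_id, rows)  # streaming insert
if errors:
    raise RuntimeError(errors)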
I posted in another thread of mine regarding streaming into BigQuery. Now, as a rule, I try to avoid streaming if I can:
Load the data to Cloud Storage
Then load data from Cloud Storage to BigQuery
This will solve many streaming-related issues.
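A minimal sketch of that load path with the Python client (the bucket, object, local file, and table names are placeholders):
from google.cloud import bigquery
from google.cloud import storage

# 1. Load the data file to Cloud Storage (placeholder bucket/object names).
storage.Client().bucket("my-bucket").blob("batches/events.json").upload_from_filename("events.json")

# 2. Load it from Cloud Storage into BigQuery with a load job, so no
#    streaming buffer is involved.
bq = bigquery.Client()
job = bq.load_table_from_uri(
    "gs://my-bucket/batches/events.json",
    "my_project.my_dataset.events",
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    ),
)
job.result()  # wait for the load job to finish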

BigQuery standard SQL not deleting?

I cannot delete the rows matched by the WHERE clause.
My query:
delete from `dataset.events1` as t where t.group='error';
Result:
Error: UPDATE or DELETE statement over table dataset.events1 would affect rows in the streaming buffer, which is not supported.
According to the BQ docs:
Rows that were written to a table recently via streaming (using the tabledata.insertall method) cannot be modified using UPDATE, DELETE, or MERGE statements. Recent writes are typically those that occur within the last 30 minutes. Note that all other rows in the table remain modifiable by using UPDATE, DELETE, or MERGE statements.
This looks like the error you're facing.
You can check if your table has a streaming buffer attached through the BigQuery API.
This error message is expected behavior when modifying rows that were recently streamed into the table, in order to maintain data consistency. Because of this, you have to wait until the buffer is flushed, which can take up to 90 minutes, before the rows become available for copy/export and other operations; otherwise you would get the same error.
To validate whether the table has an active streaming buffer, you can check the tables.get response and verify if it contains a section named streamingBuffer.

When does BigQuery flush the streaming output buffer

I know this question has been asked in a different form a while ago, but now that BQ allows DML on partitioned tables, it's more important to understand when the streaming buffer is flushed so that we can perform DML on tables for maintenance.
This is very important now since:
I have 1500 partitioned tables.
Each table has at least 200 partitions.
Now I have to update all the tables since we are performing some sort of hashing for GDPR.
If I can't run the DML, then I have to restate the 200 * 1500 partitions by joining with a reference table.
If I can run the DML, then I just have to run 1500 update statements.
I have stopped the streaming and have been waiting for more than 90 minutes, yet I still get the same error that I can't run DML since the table has a streaming buffer. Any response based on your own experience would be highly appreciated.
The answer is "it depends", based mostly on the size of the data you stream to the buffer, but also on algorithmic tuning on the BQ side. As of now there is no definite time you can calculate before the data will flush, and there is no mechanism to invoke a flush of the buffer manually.
So apparently BigQuery now allows updates on older partitions of partitioned tables that have a streaming buffer, but not on the streaming buffer itself.
For example:
update `dataset.table_name`
set column = 'value'
where _PARTITIONTIME = '2018-05-01'
Works beautifully.
But
update `dataset.table_name`
set column = 'value'
where _PARTITIONTIME is null
Doesn't work and fails with the below error:
UPDATE or DELETE statement over table dataset.table_name would affect rows in the streaming buffer, which is not supported
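For what it's worth, here is a rough sketch of scripting those 1500 updates with the Python client while leaving buffered rows alone (the table list, column name, and hashing expression are placeholders I made up):
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder list standing in for the ~1500 partitioned tables.
tables = ["my_project.my_dataset.table_001", "my_project.my_dataset.table_002"]

# Touch only partitions that are already out of the streaming buffer;
# buffered rows (NULL _PARTITIONTIME) are excluded by the filter.
sql_template = """
    UPDATE `{table}`
    SET email = TO_HEX(SHA256(email))
    WHERE _PARTITIONTIME IS NOT NULL
"""

for table in tables:
    job = client.query(sql_template.format(table=table))
    job.result()  # wait for the DML job
    print(table, job.num_dml_affected_rows, "rows updated")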