Clarification about streaming buffer in big query - google-bigquery

I recently started exploring BigQuery on GCP for learning purposes, so I created two tables and tried INSERT, DELETE and UPDATE queries through the Python API.
I'm able to update a table called table_1 at any time using the query below:
UPDATE *****.*****.table_1 SET col_1 = 'value_1', col_2 = 'value_2' WHERE col3 = 'value_3'
and it returns This statement modified 2 rows in ****:****.Projects.
But when I try to update the table called table_2 with the same kind of query, it returns
UPDATE or DELETE statement over table ***.***.table_2 would affect rows in the streaming buffer, which is not supported
I created both tables and performed the operations in the same way, so why am I getting this error only for table_2?
Thank you

Streamed data is available for real-time analysis within a few seconds of the first streaming insertion into a table but it can take up to 90 minutes to become available for copy/export and other operations like UPDATE or DELETE. You probably have to wait up to 90 minutes so all buffers are persisted on the cluster. You can check the ‘tables.get’ response for a section named ‘streamingBuffer’ to check whether the table has a streaming buffer or not.
If the table had been populated only by a load job, it would not have a streaming buffer, so you probably streamed some values into it.
You can also refer to this documentation [1] for more information
[1] https://cloud.google.com/bigquery/streaming-data-into-bigquery
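
As a rough illustration of the tables.get check described above, here is a minimal sketch that inspects an already-parsed tables.get response. The function name and the sample responses are hypothetical. The field names inside streamingBuffer (estimatedRows, estimatedBytes, oldestEntryTime) are what I understand the REST table resource to contain, so treat them as an assumption and verify them against the API reference.

```python
from datetime import datetime, timezone
from typing import Optional


def streaming_buffer_info(table_resource: dict) -> Optional[dict]:
    """Summarize the streaming buffer of a tables.get response.

    Returns None when the response has no streamingBuffer section.
    """
    buffer = table_resource.get("streamingBuffer")
    if buffer is None:
        return None
    # Assumption: the REST API encodes these numbers as strings, and
    # oldestEntryTime is milliseconds since the Unix epoch.
    oldest_ms = buffer.get("oldestEntryTime")
    oldest = (
        datetime.fromtimestamp(int(oldest_ms) / 1000, tz=timezone.utc)
        if oldest_ms is not None
        else None
    )
    return {
        "estimated_rows": int(buffer.get("estimatedRows", 0)),
        "estimated_bytes": int(buffer.get("estimatedBytes", 0)),
        "oldest_entry": oldest,
    }


# Hypothetical, trimmed tables.get responses, for illustration only.
loaded_table = {"id": "my-project:my_dataset.table_1"}
streamed_table = {
    "id": "my-project:my_dataset.table_2",
    "streamingBuffer": {
        "estimatedRows": "2",
        "estimatedBytes": "128",
        "oldestEntryTime": "1525132800000",
    },
}

print(streaming_buffer_info(loaded_table))
print(streaming_buffer_info(streamed_table))
```

If you use the google-cloud-bigquery client instead of raw REST, the Table object returned by client.get_table() exposes a streaming_buffer attribute that is None when there is no buffer, which should give you the same information.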

Related

how to write a Trigger to insert data from aurora to redshift

I have some data in an Aurora MySQL database, and I would like to do two things:
HISTORICAL DATA:
Read the data from Aurora (say TABLE A), do some processing, and update some columns of a table in Redshift (say TABLE B).
ALSO,
LATEST DAILY LOAD
Have a trigger-like mechanism so that whenever a new row is inserted into Aurora table A, the columns in Redshift table B are updated after some processing.
What would be the best approach to handle this situation? Please note that this is not a simple read-and-insert case; I also have to do some processing between the read and the write.
I'm not sure whether you have already solved this; if so, please share the details.
We are looking at the following approach:
A cron job writes the daily data batch into S3 (say 1 month or order)
Upon arrival in S3, that file is loaded into Redshift via the COPY command (https://docs.aws.amazon.com/redshift/latest/dg/tutorial-loading-run-copy.html)
We are also looking for more ideas and thoughts.
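
To make the second step of that approach a bit more concrete, here is a minimal sketch that only builds the COPY statement for one daily file. The bucket, prefix, file naming scheme, table name and IAM role are all hypothetical placeholders, and the file is assumed to be CSV. Actually running the statement against Redshift, and any processing before the load, is left out.

```python
from datetime import date


def daily_copy_sql(table: str, bucket: str, prefix: str, day: date, iam_role: str) -> str:
    """Build a Redshift COPY statement for one daily batch file in S3."""
    # Assumption: the cron job names each batch file after its ISO date.
    s3_uri = f"s3://{bucket}/{prefix}/{day.isoformat()}.csv"
    return (
        f"COPY {table}\n"
        f"FROM '{s3_uri}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        f"FORMAT AS CSV;"
    )


# Hypothetical names, for illustration only.
sql = daily_copy_sql(
    table="table_b_staging",
    bucket="my-batch-bucket",
    prefix="aurora/table_a",
    day=date(2020, 1, 15),
    iam_role="arn:aws:iam::123456789012:role/my-redshift-copy-role",
)
print(sql)
```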

BigQuery standard sql not deleted?

I cannot delete the rows selected by the WHERE clause.
My query:
delete from `dataset.events1` as t where t.group='error';
Result:
Error: UPDATE or DELETE statement over table dataset.events1 would affect rows in the streaming buffer, which is not supported.
According to the BQ docs:
Rows that were written to a table recently via streaming (using the tabledata.insertall method) cannot be modified using UPDATE, DELETE, or MERGE statements. Recent writes are typically those that occur within the last 30 minutes. Note that all other rows in the table remain modifiable by using UPDATE, DELETE, or MERGE statements.
This looks like the error you're facing.
You can check if your table has a streaming buffer attached through the BigQuery API.
This error is expected behavior when a statement touches rows that were recently streamed into the table; it exists to maintain data consistency. You therefore have to wait until the buffer is flushed, which can take up to 90 minutes, before the data becomes available for copy/export and for operations such as UPDATE or DELETE. Until then you will get the same error.
To validate if the table has an active streaming buffer process, you can check the tables.get response and verify if it contains a section named streamingBuffer.
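
If you want to automate the waiting, one option is to poll until the streamingBuffer section disappears. Below is a minimal sketch under some assumptions: fetch_table is a hypothetical callable you supply that returns the parsed tables.get response, and the 90-minute default timeout is simply the figure quoted above, not a guarantee.

```python
import time
from typing import Callable


def wait_for_buffer_flush(
    fetch_table: Callable[[], dict],
    timeout_s: float = 90 * 60,
    poll_interval_s: float = 60,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until the table has no streamingBuffer section.

    Returns True once the buffer is gone, False if the timeout expires first.
    """
    deadline = clock() + timeout_s
    while True:
        if "streamingBuffer" not in fetch_table():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)


# Demo with canned responses instead of real API calls (no actual sleeping).
responses = iter([{"streamingBuffer": {}}, {"streamingBuffer": {}}, {}])
flushed = wait_for_buffer_flush(lambda: next(responses), sleep=lambda s: None)
print(flushed)
```

The sleep and clock parameters are only there so the loop can be exercised without real delays; in normal use you would leave them at their defaults.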

When does BigQuery flush the streaming output buffer

I know this question was asked in a different form a while ago. But now that BQ allows DML on partitioned tables, it's more important to understand when the streaming buffer is flushed so that we can perform DML on tables for maintenance.
This is very important now because:
I have 1500 partitioned tables.
Each table has at least 200 partitions.
Now I have to update all the tables since we are performing some sort of hashing for GDPR.
If I can't run the DML, then I have to restate the 200 * 1500 partitions by joining with a reference table.
If I can run the DML, then I just have to run 1500 UPDATE statements.
I have stopped the streaming and have been waiting for more than 90 minutes, yet I still get the same error saying that I can't run DML because the table has a streaming buffer. Any response based on your own experience would be highly appreciated.
The answer is "it depends": mostly on the size of the data you stream to the buffer, but also on algorithmic tuning on the BQ side. As of now, there is no definite time you can calculate before the data is flushed, and there is no mechanism to flush the buffer manually.
So apparently BigQuery now allows updates on older partitions of partitioned tables that have a streaming buffer, but not on the streaming buffer itself.
For example:
update
`dataset.table_name`
set column = 'value'
where _PARTITIONTIME = '2018-05-01'
Works beautifully.
But
update
`dataset.table_name`
set column = 'value'
where _PARTITIONTIME is null
Doesn't work and fails with the below error:
UPDATE or DELETE statement over table dataset.table_name would affect rows in the streaming buffer, which is not supported
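
For many tables, the per-partition form that works could be generated in a loop. Here is a minimal sketch that only builds the statements. The table names are hypothetical, and @value is a named query parameter that the caller is assumed to bind when submitting the query.

```python
from datetime import date
from typing import List


def partition_update_sql(table: str, column: str, partition_day: date) -> str:
    """Build an UPDATE restricted to one already-flushed partition."""
    return (
        f"UPDATE `{table}`\n"
        # @value is a named parameter; binding it is left to the caller.
        f"SET {column} = @value\n"
        f"WHERE _PARTITIONTIME = '{partition_day.isoformat()}'"
    )


# Hypothetical table names, for illustration only.
tables: List[str] = ["dataset.table_a", "dataset.table_b"]
statements = [partition_update_sql(t, "column", date(2018, 5, 1)) for t in tables]

for statement in statements:
    print(statement)
    print()
```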

Update or Delete tables with streaming buffer in BigQuery?

I'm getting the following error when trying to delete records from a table that was created through the GCP Console and updated with the GCP BigQuery Node.js table insert function.
UPDATE or DELETE DML statements are not supported over table stackdriver-360-150317:my_dataset.users with streaming buffer
The table was created without streaming features. From what I'm reading in the documentation: "Tables that have been written to recently via BigQuery Streaming (tabledata.insertall) cannot be modified using UPDATE or DELETE statements."
Does it mean that once a record has been inserted into a table with this function, there's no way to delete records? At all? If that's the case, does it mean that the table needs to be deleted and recreated from scratch? If that's not the case, can you please suggest a workaround to avoid this issue?
Thanks!
Including new error message for SEO: "UPDATE or DELETE statement over table ... would affect rows in the streaming buffer, which is not supported" -- Fh
To check whether the table has a streaming buffer, look in the tables.get response for a section named streamingBuffer. Alternatively, when streaming to a partitioned table, data in the streaming buffer has a NULL value for the _PARTITIONTIME pseudo column, so a simple query with a WHERE clause on that column can check it as well.
Streamed data is available for real-time analysis within a few seconds of the first streaming insertion into a table, but it can take up to 90 minutes to become available for copy/export and other operations. You probably have to wait up to 90 minutes until the whole buffer is persisted on the cluster. As you mentioned, you can use queries to see whether the streaming buffer is empty.
If you use a load job to create the table, you won't have a streaming buffer, but you probably streamed some values to it.
See the answer below for working with tables that have an ongoing streaming buffer: just use a WHERE clause to filter out the most recent minutes of data and your queries will work. -- Fh
Make sure to change your filters so they don't include data that could be in the current streaming buffer.
For example, this query fails while I'm streaming to this table:
DELETE FROM `project.dataset.table`
WHERE id LIKE '%-%'
Error: UPDATE or DELETE statement over table project.dataset.table would affect rows in the streaming buffer, which is not supported
You can fix it by only deleting older records:
DELETE FROM `project.dataset.table`
WHERE id LIKE '%-%'
AND ts < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 40 MINUTE)
4282 rows affected.
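
To make the effect of the extra predicate concrete, here is a small local sketch (plain Python, no BigQuery involved) that mirrors the same two conditions on made-up rows. The 40-minute margin is just the value from the query above, not a documented flush time.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Tuple


def split_by_cutoff(
    rows: List[dict],
    now: datetime,
    margin: timedelta = timedelta(minutes=40),
) -> Tuple[List[dict], List[dict]]:
    """Mirror `id LIKE '%-%' AND ts < now - margin` on in-memory rows.

    Returns (rows the DELETE would target, matching rows it leaves alone).
    """
    cutoff = now - margin
    matching = [r for r in rows if "-" in r["id"]]
    targeted = [r for r in matching if r["ts"] < cutoff]
    left_alone = [r for r in matching if r["ts"] >= cutoff]
    return targeted, left_alone


# Made-up rows, for illustration only.
now = datetime(2018, 5, 1, 12, 0, tzinfo=timezone.utc)
rows = [
    {"id": "a-1", "ts": now - timedelta(hours=3)},     # old enough to delete
    {"id": "b-2", "ts": now - timedelta(minutes=5)},   # possibly still buffered
    {"id": "c3", "ts": now - timedelta(hours=3)},      # does not match LIKE '%-%'
]

targeted, left_alone = split_by_cutoff(rows, now)
print([r["id"] for r in targeted])
print([r["id"] for r in left_alone])
```

The rows left alone can be deleted by a later run once they are older than the margin.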

Either insert or update database records via Apache NiFi flow

I am trying to transfer data between two databases with similar structure of tables using NiFi. Example of data structure:
User: {varchar name, integer id}.
There are no "Maximum-value Columns", so it is impossible to determine whether there is new data or not. So each time I create a "snapshot" of the full table content. The problem is that it is unclear whether a particular record should be inserted or updated in the target database.
I created two branches of processors: one with inserts and one with updates. Insert works only for new records and update only for existing ones. But (!) the PutSQL processor works with a batch of flow files.
For example, the batch size is 100 and the processors run once a day. Assume there were 98 records yesterday. They will be inserted. Today there are 200 records (98 from yesterday and 102 new). In this flow, if NiFi tries to update the first 100 records and also insert them, both actions will fail: the first 98 records should be updated while the last 2 should be inserted.
How can I solve this issue? I know it is possible to use a batch size of 1, but that works too slowly.
I recommend solving this in your SQL statements, since NiFi will not know the prior status of the records. A MERGE statement would be ideal if your database supports it (Oracle and SQL Server have MERGE; MySQL offers INSERT ... ON DUPLICATE KEY UPDATE instead). Otherwise, you can craft both an INSERT and an UPDATE for each record in the source table, making them conditional on the user existing in the table.
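
To illustrate the idea of a single insert-or-update statement, here is a minimal sketch using SQLite's upsert syntax as a stand-in for MERGE. The exact syntax differs per database, and this form needs SQLite 3.24 or newer. The table layout follows the User example from the question, and the data is made up to match the 98 then 200 records scenario.

```python
import sqlite3
from typing import Iterable, Tuple


def upsert_users(conn: sqlite3.Connection, users: Iterable[Tuple[int, str]]) -> None:
    """Insert new users and update existing ones in a single batch."""
    conn.executemany(
        # SQLite upsert, used here as a stand-in for MERGE.
        "INSERT INTO users (id, name) VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET name = excluded.name",
        users,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name VARCHAR)")

# Day 1: 98 records, all new.
upsert_users(conn, [(i, f"user{i}") for i in range(98)])

# Day 2: full snapshot of 200 records (98 existing, 102 new) in one batch.
upsert_users(conn, [(i, f"user{i}-v2") for i in range(200)])

count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
first_name = conn.execute("SELECT name FROM users WHERE id = 0").fetchone()[0]
print(count, first_name)
```

Because each statement decides for itself whether to insert or update, the batch size no longer matters, so it can stay at 100.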