Tables that have been written to recently via BigQuery Streaming
(tabledata.insertAll) cannot be modified using UPDATE or DELETE
statements. To check if the table has a streaming buffer, check the
tables.get response for a section named streamingBuffer. If it is
absent, the table can be modified using UPDATE or DELETE statements.
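For example, with the Python client library you can inspect the streaming buffer on the table resource. This is a minimal sketch; the project, dataset and table names are placeholders:

from google.cloud import bigquery

client = bigquery.Client()  # uses default credentials and project

# Hypothetical table identifier; replace with your own.
table = client.get_table("my_project.table_dataset.table1")

if table.streaming_buffer is None:
    print("No streaming buffer - UPDATE/DELETE DML should be allowed.")
else:
    buf = table.streaming_buffer
    print("Streaming buffer present:", buf.estimated_rows, "rows,",
          buf.estimated_bytes, "bytes, oldest entry:", buf.oldest_entry_time)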
When I try to modify my table (rows were recently inserted, and the table was created a few days ago) with
delete table_dataset.table1 where true
I get the following error: "Error: UPDATE or DELETE DML statements are not supported over table with streaming buffer". However, I was eventually able to delete all these records, apparently after some delay.
What is the streaming buffer? When exactly can I modify my table? If I use a job that creates the table or loads data exported from another source, can I run UPDATE/DELETE DML statements?
Streamed data is available for real-time analysis within a few seconds of the first streaming insertion into a table, but it can take up to 90 minutes to become available for copy/export and other operations. You probably have to wait up to 90 minutes so that the whole buffer is persisted on the cluster. You can check whether the streaming buffer is empty or not, as you mentioned (via the streamingBuffer section of the tables.get response).
If you use a load job to create the table, there won't be a streaming buffer.
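As a hedged sketch (the bucket, file and table names are made up), a load job with the Python client looks like this and leaves no streaming buffer behind:

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # or supply an explicit schema instead
)

# Hypothetical GCS path and destination table.
load_job = client.load_table_from_uri(
    "gs://my-bucket/export.json",
    "my_project.table_dataset.table1",
    job_config=job_config,
)
load_job.result()  # wait for completion; loaded rows can be modified with DML right away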
I am using MonetDB (MDB) for OLAP queries. I store the source data in PostgreSQL (PGSQL) and sync it to MonetDB in batches using a Python script.
In PGSQL there is a wide table with an ID column (non-unique) and a few other columns. Every few seconds the Python script takes a batch of 10k records that changed in PGSQL and uploads them to MDB.
The process of upload to MDB is as follows:
Create a staging table in MDB.
Use the COPY command to upload the 10k records into the staging table.
DELETE from the destination table all IDs that are in the staging table.
INSERT into the destination table all rows from the staging table.
So it is basically a DELETE & INSERT. I cannot use a MERGE statement because I do not have a PK: one ID can have multiple rows in the destination. So I need to do a delete and a full insert for all IDs currently being synced.
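A rough sketch of that sync step using pymonetdb (the table names, file path and COPY/DDL details are invented and may need adjusting for your MonetDB version and setup):

import pymonetdb

conn = pymonetdb.connect(database="olap", hostname="localhost",
                         username="monetdb", password="monetdb")
cur = conn.cursor()

# 1. (Re)create the staging table with the same layout as the destination.
cur.execute("DROP TABLE IF EXISTS staging")
cur.execute("CREATE TABLE staging AS SELECT * FROM destination WITH NO DATA")

# 2. Bulk-load the 10k changed rows (server-side CSV path assumed here).
cur.execute("COPY INTO staging FROM '/tmp/batch.csv' USING DELIMITERS ',', E'\\n'")

# 3. Delete every ID present in the staging table from the destination ...
cur.execute("DELETE FROM destination WHERE id IN (SELECT id FROM staging)")

# 4. ... and insert the fresh rows.
cur.execute("INSERT INTO destination SELECT * FROM staging")

conn.commit()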
Now to the problem: the DELETE is slow.
When I do a DELETE on the destination table, deleting the 10k records from a table of 25M rows, it takes about 500 ms.
However! If I first run a simple SELECT * FROM destination WHERE id = 1 and then do the DELETE, it takes 2 ms.
I think that it has something to do with automatic creation of auxiliary indices. But this is where my knowledge ends.
I tried to solve this by "pre-heating", doing the lookup myself, and it works - but only for the first DELETE after the pre-heat.
Once I do a DELETE and INSERT, the next DELETE is slow again. And pre-heating before each DELETE does not make sense, because the pre-heat itself takes 500 ms.
Is there any way to sync data into MDB without invalidating the auxiliary indices that were already built? Or to make the DELETE faster without the pre-heat? Or should I use a different technique to sync data into MDB without a PK (does MERGE have the same problem)?
Thanks!
I am writing an archival script (in Python, using psycopg2) that needs to pull a very large amount of data out of a PostgreSQL database (9.4), process it, upload it and then delete it from the database.
I start a transaction, execute a select statement to create a named cursor, fetch N rows at a time from the cursor, and process and upload the parts (using S3 multipart upload). Once the cursor is depleted and no errors occurred, I finalize the upload and execute a delete statement using the same conditions as I did in the select. If the delete succeeds, I commit the transaction.
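A condensed sketch of that flow with psycopg2 and boto3 (the bucket, key, table and WHERE condition are placeholders; error handling, aborting the multipart upload, and the 5 MB minimum part size are ignored here):

import boto3
import psycopg2

conn = psycopg2.connect("dbname=logs")          # connection details assumed
s3 = boto3.client("s3")

mpu = s3.create_multipart_upload(Bucket="archive-bucket", Key="logs/2016-01.csv")
parts, part_no = [], 1

with conn:                                      # one transaction around everything
    with conn.cursor(name="archive_cur") as cur:    # server-side (named) cursor
        cur.execute("SELECT * FROM log WHERE created < %s", ("2016-02-01",))
        while True:
            rows = cur.fetchmany(10000)         # N rows at a time
            if not rows:
                break
            body = ("\n".join(",".join(map(str, r)) for r in rows) + "\n").encode()
            part = s3.upload_part(Bucket="archive-bucket", Key="logs/2016-01.csv",
                                  PartNumber=part_no, UploadId=mpu["UploadId"],
                                  Body=body)
            parts.append({"ETag": part["ETag"], "PartNumber": part_no})
            part_no += 1

    s3.complete_multipart_upload(Bucket="archive-bucket", Key="logs/2016-01.csv",
                                 UploadId=mpu["UploadId"],
                                 MultipartUpload={"Parts": parts})

    with conn.cursor() as cur:                  # same condition, same transaction
        cur.execute("DELETE FROM log WHERE created < %s", ("2016-02-01",))
# leaving the `with conn:` block commits; an exception inside rolls back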
The database is being actively written to and it is important that both the same rows get archived and deleted and that reads and writes to the database (including the table being archived) continue uninterrupted. That said, the tables being archived contain logs, so existing records are never modified, only new records are added.
So the questions I have are:
What level of isolation should I use to ensure that the same rows get archived and deleted?
What impact will these operations have on database read/write ability? Does anything get write or read locked in the process I described above?
You have two good options:
Get the data with
SELECT ... FOR UPDATE
so that the rows get locked. Then they are guaranteed to still be there when you delete them.
Use
DELETE FROM ... RETURNING *
Then insert the returned rows into your archive.
The second solution is better, because you need only one statement.
Nothing bad can happen. If the transaction fails for whatever reason, no row will be deleted.
You can use the default READ COMMITTED isolation level for both solutions.
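A sketch of the second option with psycopg2 (the table, condition and archive() step are placeholders). Because the delete and the read happen in one statement, READ COMMITTED is enough; note that the RETURNING result set is held client-side, so for very large deletes you may want to work in smaller batches:

import psycopg2

conn = psycopg2.connect("dbname=logs")   # connection details assumed

with conn:                               # commit on success, rollback on error
    with conn.cursor() as cur:
        cur.execute(
            "DELETE FROM log WHERE created < %s RETURNING *",
            ("2016-02-01",),
        )
        for row in cur:                  # the deleted rows, ready to archive
            archive(row)                 # hypothetical upload/processing step
# nothing is actually deleted unless the whole block (including archive()) succeeds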
Thanks in advance for any help. Here is the scenario that I am trying to recreate in Mulesoft.
There are 1,500,000 records in a table. Here is the current process that we use:
Start a transaction.
Delete all records from the table.
Reload the table from a flat file.
Commit the transaction.
In the end we need the table in a good state, hence the use of the transaction. If there is any failure, the data in the table will be rolled back to the initial valid state.
I was able to get the speed that we needed (< 10 minutes) by using the Batch element, but it appears that transactions are not supported around the whole batch flow.
Any ideas how I could get this to work in Mulesoft?
Thanks again.
A slightly different workflow, but how about:
Load a temp table from the flat file.
If successful, drop the original table.
Rename the temp table to the original table name.
You can keep your Mule batch processing workflow to load the temp table and forget about rolling back.
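A sketch of the swap step in Python, assuming the Mule batch flow has already loaded a hypothetical records_tmp table and assuming PostgreSQL-style rename syntax (use sp_rename or the equivalent for your database):

import psycopg2

conn = psycopg2.connect("dbname=mydb")   # connection details assumed

with conn:                                # single transaction for the swap
    with conn.cursor() as cur:
        # records_tmp was filled from the flat file by the Mule batch job.
        cur.execute("DROP TABLE records")
        cur.execute("ALTER TABLE records_tmp RENAME TO records")
# If the load into records_tmp failed earlier, the swap never runs,
# so the original table stays intact and no 1.5M-row rollback is needed.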
For this you might try the following:
Use XA transactions (since more than one connector will be used, regardless of whether the same transport is used or not).
Enlist in the transaction the resource used in the custom Java code.
This also can be applied within the same transport (e.g. JDBC on the Mule configuration and also on the Java component), so it's not restricted to the case demonstrated in the PoC, which is only given as a reference.
Please refer to this article https://dzone.com/articles/passing-java-arrays-in-oracle-stored-procedure-fro
Poll records from the temp table. You can construct an array with any number of records; with a batch size of 100K, 1.5M records will involve only 15 round trips in total.
To identify error records you can insert them into an error table, but that has to be implemented in the database procedure.
I'm trying to run multiple simultaneous jobs in order to load around 700K records into a single BigQuery table. My code (Java) creates the schema from the records of each job and updates the BigQuery schema if needed.
Workflow is as follows:
A single job creates the table and sets the (initial) schema.
For each load job we create the schema from the records of the job. Then we pull the existing table schema from BigQuery, and if it is not a superset of the schema associated with the job, we update the schema with the new merged schema. The last part (starting from pulling the existing schema) is synchronized (using a lock): only one job performs it at a time. The schema update uses the update method, and the lock is released only after the client update method returns.
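For reference, a simplified sketch of that synchronized update step using the Python client (the lock, the merge logic and the ensure_schema name are stand-ins for whatever your Java code does, and it only merges top-level field names):

import threading
from google.cloud import bigquery

client = bigquery.Client()
schema_lock = threading.Lock()

def ensure_schema(table_id, job_fields):
    """Make sure the table schema is a superset of job_fields before loading."""
    with schema_lock:                       # only one job touches the schema at a time
        table = client.get_table(table_id)  # pull the current schema
        existing = {f.name for f in table.schema}
        missing = [f for f in job_fields if f.name not in existing]
        if missing:                         # existing schema is not a superset
            table.schema = list(table.schema) + missing   # merged schema
            client.update_table(table, ["schema"])        # tables.update call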
I was expecting to avoid schema update errors with this workflow. I'm assuming that once the client update call returns, the table is updated, and that jobs that are in flight can't be hurt by the schema update.
Nevertheless, I still get schema update errors from time to time. Is the update method atomic? How do I know when a schema was actually updated?
Updates in BigQuery are atomic, but they are applied at the end of the job. When a job completes, it makes sure that the schemas are equivalent. If there was a schema update while the job was running, this check will fail.
We should probably make sure that the schemas are compatible instead of equivalent. If you do an append with a compatible schema (i.e. you have a subset of the table schema) that should succeed, but currently BigQuery doesn't allow this. I'll file a bug.
I have a table with 4 million images. This table participates in merge replication. I have to update these 4 million rows to set the image binary to NULL, as the images have been moved to a new table. The moment I start the update query, the merge replication triggers will fire, they will consider that data for merge replication to the subscribers, and 4 million image rows will be transferred over the wire. I cannot disable the merge triggers, as this poses a data inconsistency issue.
I want a way to keep the merge triggers from firing for this operation. Is there something like BULK INSERT for updates as well?
You can use the sp_mergearticlecolumn stored procedure to drop that specific column from your subscription (temporarily if need be).
More information here: http://msdn.microsoft.com/en-us/library/ms188063.aspx
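For illustration, a hedged sketch of temporarily dropping the image column from the merge article with pyodbc (the publication, article and column names are made up; verify the exact parameters against the linked documentation before running anything like this):

import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;"
    "DATABASE=MyPublishedDb;Trusted_Connection=yes;",
    autocommit=True,
)
cur = conn.cursor()

# Temporarily drop the image column from the merge article so the bulk
# UPDATE is not picked up for replication, then add it back afterwards.
cur.execute(
    "EXEC sp_mergearticlecolumn @publication = ?, @article = ?, "
    "@column = ?, @operation = 'drop', @force_invalidate_snapshot = 1, "
    "@force_reinvalidate_subscription = 1",
    ("MyPublication", "ImagesTable", "ImageBinary"),
)

# ... run the UPDATE that sets the image binary to NULL here ...

cur.execute(
    "EXEC sp_mergearticlecolumn @publication = ?, @article = ?, "
    "@column = ?, @operation = 'add', @force_invalidate_snapshot = 1, "
    "@force_reinvalidate_subscription = 1",
    ("MyPublication", "ImagesTable", "ImageBinary"),
)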