I'm using the query API from com.google.cloud.bigquery.BigQuery to perform update operations.
Is there a way to perform a bulk update on a BigQuery table, similar to insertAll or writeJsonStream?
As a workaround, you could call query() in a loop, once per update statement, until all the updates you need have executed successfully.
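The loop workaround above could look roughly like the sketch below. It is written in Python rather than Java to keep it short, and `run_all_updates`, `run_query` and `max_rounds` are hypothetical names, not part of any BigQuery client. `run_query` stands for whatever submits one statement and raises on failure. Re-running failed statements is only safe if they are idempotent.

```python
def run_all_updates(statements, run_query, max_rounds=3):
    """Run each statement in order, re-running failures in later rounds.

    Returns the statements that still failed after max_rounds
    (an empty list means everything succeeded).
    """
    pending = list(statements)
    for _ in range(max_rounds):
        failed = []
        for sql in pending:
            try:
                run_query(sql)  # hypothetical callable: submits one statement
            except Exception:
                # Remember the statement so the next round can try it again.
                failed.append(sql)
        if not failed:
            return []
        pending = failed
    return pending
```

Because the statements run one after another, this also avoids having two DML statements in flight against the same table at once.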
I am looking to replace a table in a single transaction. I am using the BigQueryOperator with write_disposition='WRITE_TRUNCATE', and my SQL is just select * from my_table. Will this happen in a single transaction or in two separate transactions? If it is two transactions, is there any way I can replace my BigQuery table with select * from my_table in a single transaction?
Airflow submits the query using the Jobs API. The BigQuery documentation states, for both createDisposition and writeDisposition, that:
Each action is atomic and only occurs if BigQuery is able to complete the job successfully. Creation, truncation and append actions occur as one atomic update upon job completion.
As a side note, BigQueryOperator is deprecated. You should use BigQueryInsertJobOperator instead.
I've written SQL that updates all values in one column:
UPDATE `Blackout_POC2.measurements_2020`
SET visitor.customerId_enc = enc.encrypted
FROM `Blackout_POC2.encrypted` AS enc
WHERE dateAmsterdam="2020-01-05"
AND session.visitId = enc.visitId
AND visitor.customerId = enc.plain
where dateAmsterdam is the partition key of the measurements_2020 table, and encrypted is a non-partitioned table that holds visitId, plain and encrypted fields. The code sets all values in the customerId_enc column with values from the encrypted table.
The code works perfectly fine when I run it one day at a time, but when I run days in parallel, I occasionally (1% or so) get a stack trace from my bq query <sql> (see below).
I thought that I could modify a partitioned table in parallel as long as each statement touches a different partition, but that occasionally seems not to be the case. Could someone point me to where this is documented, and preferably how to avoid it?
I can probably just rerun that query again, since it is idempotent, but I would like to know why this happens.
Thanks
Bart van Deenen, data-engineer Bol.com
Error in query string: Error processing job 'bolcom-dev-blackout-339:bqjob_r131fa5b3dfd24829_0000016faec5e5da_1': domain: "cloud.helix.ErrorDomain"
code: "QUERY_ERROR" argument: "Could not serialize access to table bolcom-dev-blackout-339:Blackout_POC2.measurements_2020 due to concurrent update"
debug_info: "[CONCURRENT_UPDATE] Table modified by concurrent UPDATE/DELETE/MERGE DML or truncation at 1579185217979. Storage set job_uuid:
03d3d5ec-2118-4e43-9fec-1eae99402c86:20200106, instance_id: ClonedTable-1579183484786, Reason: code=CONCURRENT_UPDATE message=Could not serialize
access to table bolcom-dev-blackout-339:Blackout_POC2.measurements_2020 due to concurrent update debug=Table modified by concurrent UPDATE/DELETE/MERGE
DML or truncation at 1579185217979. Storage set job_uuid: 03d3d5ec-2118-4e43-9fec-1eae99402c86:20200106, instance_id: ClonedTable-1579183484786
errorProto=domain: \"cloud.helix.ErrorDomain\"\ncode: \"QUERY_ERROR\"\nargument: \"Could not serialize access to table bolcom-dev-
blackout-339:Blackout_POC2.measurements_2020 due to concurrent update\"\ndebug_info: \"Table modified by concurrent UPDATE/DELETE/MERGE DML or
truncation at 1579185217979. Storage set job_uuid: 03d3d5ec-2118-4e43-9fec-1eae99402c86:20200106, instance_id: ClonedTable-1579183484786\"\n\n\tat
com.google.cloud.helix.common.Exceptions$Public.concurrentUpdate(Exceptions.java:381)\n\tat
com.google.cloud.helix.common.Exceptions$Public.concurrentUpdate(Exceptions.java:373)\n\tat
com.google.cloud.helix.server.metadata.StorageTrackerData.verifyStorageSetUpdate(StorageTrackerData.java:224)\n\tat
com.google.cloud.helix.server.metadata.AtomicStorageTrackerSpanner.validateUpdates(AtomicStorageTrackerSpanner.java:1133)\n\tat
com.google.cloud.helix.server.metadata.AtomicStorageTrackerSpanner.updateStorageSets(AtomicStorageTrackerSpanner.java:1310)\n\tat
com.google.cloud.helix.server.metadata.AtomicStorageTrackerSpanner.updateStorageSets(AtomicStorageTrackerSpanner.java:1293)\n\tat
com.google.cloud.helix.server.metadata.MetaTableTracker.updateStorageSets(MetaTableTracker.java:2274)\n\tat
com.google.cloud.helix.server.job.StorageSideEffects$1.update(StorageSideEffects.java:1123)\n\tat
com.google.cloud.helix.server.job.StorageSideEffects$1.update(StorageSideEffects.java:976)\n\tat
com.google.cloud.helix.server.metadata.MetaTableTracker$1.update(MetaTableTracker.java:2510)\n\tat
com.google.cloud.helix.server.metadata.StorageTrackerSpanner.lambda$atomicUpdate$7(StorageTrackerSpanner.java:165)\n\tat
com.google.cloud.helix.server.metadata.AtomicStorageTrackerSpanner$Factory$1.run(AtomicStorageTrackerSpanner.java:3775)\n\tat
com.google.cloud.helix.server.metadata.AtomicStorageTrackerSpanner$Factory.lambda$performJobWithCommitResult$0(AtomicStorageTrackerSpanner.java:3792)\n\tat
com.google.cloud.helix.server.metadata.persistence.SpannerTransactionContext$RetryCountingWork.run(SpannerTransactionContext.java:1002)\n\tat
com.google.cloud.helix.server.metadata.persistence.SpannerTransactionContext$Factory.executeWithResultInternal(SpannerTransactionContext.java:840)\n\tat
com.google.cloud.helix.server.metadata.persistence.SpannerTransactionContext$Factory.executeOptimisticWithResultInternal(SpannerTransactionContext.java:722)\n\tat
com.google.cloud.helix.server.metadata.persistence.SpannerTransactionContext$Factory.lambda$executeOptimisticWithResult$1(SpannerTransactionContext.java:716)\n\tat
com.google.cloud.helix.server.metadata.persistence.SpannerTransactionContext$Factory.executeWithMonitoring(SpannerTransactionContext.java:942)\n\tat
com.google.cloud.helix.server.metadata.persistence.SpannerTransactionContext$Factory.executeOptimisticWithResult(SpannerTransactionContext.java:715)\n\tat
com.google.cloud.helix.server.metadata.AtomicStorageTrackerSpanner$Factory.performJobWithCommitResult(AtomicStorageTrackerSpanner.java:3792)\n\tat
com.google.cloud.helix.server.metadata.AtomicStorageTrackerSpanner$Factory.performJobWithCommitResult(AtomicStorageTrackerSpanner.java:3720)\n\tat
com.google.cloud.helix.server.metadata.StorageTrackerSpanner.atomicUpdate(StorageTrackerSpanner.java:159)\n\tat
com.google.cloud.helix.server.metadata.MetaTableTracker.atomicUpdate(MetaTableTracker.java:2521)\n\tat
com.google.cloud.helix.server.metadata.StatsRequestLoggingTrackers$LoggingStorageTracker.lambda$atomicUpdate$8(StatsRequestLoggingTrackers.java:494)\n\tat
com.google.cloud.helix.server.metadata.StatsRequestLoggingTrackers$StatsRecorder.record(StatsRequestLoggingTrackers.java:181)\n\tat
com.google.cloud.helix.server.metadata.StatsRequestLoggingTrackers$StatsRecorder.record(StatsRequestLoggingTrackers.java:158)\n\tat
com.google.cloud.helix.server.metadata.StatsRequestLoggingTrackers$StatsRecorder.access$500(StatsRequestLoggingTrackers.java:123)\n\tat
com.google.cloud.helix.server.metadata.StatsRequestLoggingTrackers$LoggingStorageTracker.atomicUpdate(StatsRequestLoggingTrackers.java:493)\n\tat
com.google.cloud.helix.server.job.StorageSideEffects.apply(StorageSideEffects.java:1238)\n\tat
com.google.cloud.helix.server.rosy.MergeStorageImpl.commitChanges(MergeStorageImpl.java:936)\n\tat
com.google.cloud.helix.server.rosy.MergeStorageImpl.merge(MergeStorageImpl.java:729)\n\tat
com.google.cloud.helix.server.rosy.StorageStubby.mergeStorage(StorageStubby.java:937)\n\tat
com.google.cloud.helix.proto2.Storage$ServiceParameters$21.handleBlockingRequest(Storage.java:2100)\n\tat
com.google.cloud.helix.proto2.Storage$ServiceParameters$21.handleBlockingRequest(Storage.java:2098)\n\tat
com.google.net.rpc3.impl.server.RpcBlockingApplicationHandler.handleRequest(RpcBlockingApplicationHandler.java:28)\n\tat
....
At the time of this answer, BigQuery DML does not support multi-statement transactions; nevertheless, the following combinations of statements can run concurrently:
UPDATE and INSERT
DELETE and INSERT
INSERT and INSERT
For example, if you execute two UPDATE statements simultaneously against the same table, only one of them will succeed.
Keeping this in mind: since UPDATE and INSERT statements can run concurrently, the likely cause of your error is that multiple UPDATE statements are executing simultaneously.
You could try using the Scripting feature to manage the execution flow to prevent DML concurrency.
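Since the question notes that the UPDATE is idempotent, another option is simply to retry when the conflict occurs. Below is a minimal Python sketch under stated assumptions: `run_update` is a hypothetical callable that submits the UPDATE for one day and raises an exception whose message contains the "due to concurrent update" text seen in the error above. The helper names, the backoff values and the string match are all illustrative choices, not a BigQuery API.

```python
import random
import time


def is_concurrent_update_error(exc):
    """Heuristic: match the message text from the error shown in the question."""
    return "due to concurrent update" in str(exc)


def run_with_retry(run_update, day, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call run_update(day), retrying with exponential backoff on conflicts.

    Any other error, or a conflict on the last attempt, is re-raised.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return run_update(day)
        except Exception as exc:
            if not is_concurrent_update_error(exc) or attempt == max_attempts:
                raise
            # Back off 1x, 2x, 4x ... base_delay, with jitter so that
            # parallel workers do not all retry at the same moment.
            delay = base_delay * 2 ** (attempt - 1) * (1 + random.random())
            sleep(delay)
```

The `sleep` parameter exists only so the delay can be replaced in tests. Retrying like this is only appropriate because the statement is idempotent.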
I've created a BQ table and need to schedule a series of DML statements on it (inserts & merge). I am trying to replicate the Oracle PL/SQL functionality where you can group DML statements into a single procedure that can be scheduled.
So, the goal is (i) group a series of DML statements into a script, and (ii) schedule the script for execution. Thank you in advance for any help.
Scripting is now supported in scheduled queries. However, a scheduled script does not currently support setting a destination table, so you still need to use DDL/DML to change an existing table. For example:
CREATE OR REPLACE TABLE destinationTable AS
SELECT *
FROM sourceTable
WHERE date >= maxDate
I need to know about how the trigger execution happens for the below scenario.
I have 20 records in a table and I have an AFTER INSERT, UPDATE trigger on that table. When I'm updating all the records in that table using a MERGE or batch update statement, how will the trigger execute?
Does it execute row by row?
Or does it execute once per batch (once for all 20 records)?
If it executes once per batch, do we need to write a loop inside the trigger to perform a task for each row?
Triggers in SQL Server always execute once per statement (that is, once for the whole batch of affected rows); there is no "for each row" option for triggers in SQL Server.
When you mass-update your table, the trigger receives all the updated rows at once in the inserted and deleted pseudo tables and needs to handle them accordingly, as a set of data rather than a single row.
I need to load data from a different, remote database into our own database. I have written a single "complex" query using a WITH clause. It returns around 18 million rows of data.
What is most efficient way to do the insert?
using cursor insert one by one
using INSERT INTO
or is there any other way?
The fastest way to do anything should be to use a single SQL statement. The next most efficient approach is to use a cursor doing BULK COLLECT operations to minimize context shifts between the SQL and PL/SQL engines. The least efficient approach is to use a cursor and process the data row-by-row.
As Justin wrote, the most efficient approach is to use a single SQL statement (insert into ... select ...). Additionally, you can take advantage of direct-path insert.
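To make the difference in shape between the two approaches concrete, here is a small Python sketch. It uses an in-memory SQLite database purely as a stand-in so that it runs anywhere; it says nothing about Oracle performance, and the table names `source` and `target` are made up for illustration.

```python
import sqlite3


def make_db(n):
    """Create an in-memory database with n rows in a hypothetical source table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE source (id INTEGER, val TEXT)")
    conn.execute("CREATE TABLE target (id INTEGER, val TEXT)")
    conn.executemany(
        "INSERT INTO source VALUES (?, ?)", [(i, f"v{i}") for i in range(n)]
    )
    conn.commit()
    return conn


def copy_set_based(conn):
    """One statement: the database engine moves all rows itself."""
    conn.execute("INSERT INTO target (id, val) SELECT id, val FROM source")
    conn.commit()


def copy_row_by_row(conn):
    """Cursor style: one INSERT per row, issued from the client."""
    rows = conn.execute("SELECT id, val FROM source").fetchall()
    for row in rows:
        conn.execute("INSERT INTO target (id, val) VALUES (?, ?)", row)
    conn.commit()
```

Both functions leave `target` with the same rows; the set-based version just does it in a single statement instead of one round trip per row.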
18 million rows will require quite a bit of rollback for your single insert statement scenario. A cursor for loop would be much slower, but you'd be able to commit every x rows.
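The "commit every x rows" idea can be sketched as follows, again in Python with an in-memory SQLite database as a stand-in rather than Oracle. The `source` and `target` tables and the `copy_in_batches` helper are hypothetical, and the sketch assumes `id` is unique and non-negative so it can page through the rows by key.

```python
import sqlite3


def copy_in_batches(conn, batch_size):
    """Copy source to target in chunks, committing after each chunk.

    Returns the number of commits performed.
    """
    last_id = -1  # assumes ids are unique and >= 0
    commits = 0
    while True:
        # Fetch the next chunk of rows after the last id already copied.
        rows = conn.execute(
            "SELECT id, val FROM source WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            return commits
        conn.executemany("INSERT INTO target (id, val) VALUES (?, ?)", rows)
        conn.commit()  # each chunk is its own transaction
        commits += 1
        last_id = rows[-1][0]


# Small demonstration with 10 rows and a batch size of 4.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source (id INTEGER, val TEXT)")
conn.execute("CREATE TABLE target (id INTEGER, val TEXT)")
conn.executemany("INSERT INTO source VALUES (?, ?)", [(i, f"v{i}") for i in range(10)])
conn.commit()
commit_count = copy_in_batches(conn, 4)
```

The trade-off is the one described above: each transaction stays small, but a failure part-way through leaves the target partially loaded, so the load has to be restartable.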
Personally, I'd go old school and dump the data out to a file and load it via sqlldr or Data Pump, especially as this is across databases.
You could use Data Synchronisation Studio and change the select statement to take 1 million rows at a time (I think 18 million at once would probably overload your machine).