If configuration.load.writeDisposition is set to WRITE_TRUNCATE during a load job, is there a period of time when querying the table would raise an error?
The whole period when the job is marked as PENDING and/or RUNNING?
A small moment when the table is replaced at the end of the load job?
What would be the error? status.errors[].reason => "notFound"?
The WRITE_TRUNCATE is atomic and gets applied at the end of the load job. So any queries that happen during that time will see either only the old data or all of the new data. There should be no cases where you'd get an error querying the table.
If the load failed, then there should be no change to the table, and if it succeeded, all of the data should appear at once in the table.
If the table didn't already exist and the load job specified CREATE_IF_NEEDED, then querying the table would return a "not found" error until the load job completed.
We're working on a doc rewrite that will make this more clear.
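For reference, this is where the field in question lives in a load job's REST configuration; a minimal job body might look like the following (project, dataset, table, and bucket names are placeholders):

```json
{
  "configuration": {
    "load": {
      "sourceUris": ["gs://my-bucket/data.csv"],
      "destinationTable": {
        "projectId": "my-project",
        "datasetId": "my_dataset",
        "tableId": "my_table"
      },
      "writeDisposition": "WRITE_TRUNCATE",
      "createDisposition": "CREATE_IF_NEEDED"
    }
  }
}
```

Per the answer above, the truncate-and-append implied by WRITE_TRUNCATE becomes visible atomically when this job completes.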
Related
I have a job that inserts data in BigQuery using WRITE_TRUNCATE disposition. This will truncate all data already in the table and replace it with the results of the job.
However, in some cases the table I want to insert data into has Row Level Security (RLS) rules. In this case, TRUNCATE removes the RLS as well (which I don't want).
As said here, regarding write dispositions:
Each action is atomic and only occurs if BigQuery is able to complete the job successfully. Creation, truncation and append actions occur as one atomic update upon job completion.
I am looking for a way to remove rows from my table without removing RLS, occurring as one atomic update with the append action upon job completion.
I have seen that DELETE FROM [TABLE] WHERE TRUE does not remove RLS, but I can't find a way to use it instead of TRUNCATE through the BigQuery framework. Is there a way to do it?
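Not from the original thread, but one avenue worth checking: BigQuery supports multi-statement transactions, so a DELETE plus an INSERT from a staging table can commit as one unit. A sketch, assuming the new data was first loaded into a hypothetical staging table, and that RLS on the target still permits DML in your setup (verify both before relying on this):

```sql
-- Sketch only: table names are placeholders.
BEGIN TRANSACTION;

-- DELETE, unlike TRUNCATE, leaves row-level access policies in place.
DELETE FROM `my_dataset.target` WHERE TRUE;

INSERT INTO `my_dataset.target`
SELECT * FROM `my_dataset.staging`;

COMMIT TRANSACTION;
```

Both statements become visible together at COMMIT, which approximates the "one atomic update upon job completion" behavior of a WRITE_TRUNCATE load.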
Thanks in advance for any help.
Here is the scenario that I am trying to recreate in Mulesoft.
1,500,000 records in a table. Here is the current process that we use.
Start a transaction.
Delete all records from the table.
Reload the table from a flat file.
Commit the transaction.
In the end we need the table in a good state, hence the use of the transaction. If there is any failure, the data in the table will be rolled back to the initial valid state.
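The delete-and-reload-in-one-transaction pattern itself is simple; here is a minimal sketch using SQLite purely to illustrate the rollback behavior (not Mulesoft- or SQL Server-specific):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,)])
conn.commit()  # initial valid state: rows 1 and 2

def reload_table(rows, fail=False):
    """Delete and reload inside one transaction; roll back on any failure."""
    try:
        conn.execute("DELETE FROM t")
        conn.executemany("INSERT INTO t VALUES (?)", rows)
        if fail:
            raise RuntimeError("simulated failure mid-load")
        conn.commit()
    except Exception:
        conn.rollback()  # table returns to its initial valid state

reload_table([(10,), (20,)], fail=True)
print([r[0] for r in conn.execute("SELECT id FROM t ORDER BY id")])  # [1, 2]

reload_table([(10,), (20,)])
print([r[0] for r in conn.execute("SELECT id FROM t ORDER BY id")])  # [10, 20]
```

The hard part in the question is that Mule's batch scope doesn't let you hold a transaction open across the whole flow, which is what the answers below work around.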
I was able to get the speed that we needed by using the Batch element (under 10 minutes), but it appears that transactions are not supported around the whole batch flow.
Any ideas how I could get this to work in Mulesoft?
Thanks again.
A little different workflow but how about:
Load a temp table from the flat file
If successful, drop the original table
Rename the temp table to the original table name
You can keep your Mule batch processing workflow to load the temp table and forget about rolling back.
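A minimal sketch of that swap, using SQLite just to show the shape (your database's rename syntax and locking behavior may differ):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER)")
conn.executemany("INSERT INTO target VALUES (?)", [(1,), (2,)])

# 1. Load the temp table first; the live table is untouched if this fails.
conn.execute("CREATE TABLE target_tmp (id INTEGER)")
conn.executemany("INSERT INTO target_tmp VALUES (?)", [(10,), (20,), (30,)])

# 2-3. Swap: drop the original and rename the temp into place.
conn.execute("DROP TABLE target")
conn.execute("ALTER TABLE target_tmp RENAME TO target")

rows = [r[0] for r in conn.execute("SELECT id FROM target ORDER BY id")]
print(rows)  # [10, 20, 30]
```

If the batch load into the temp table fails, you simply drop the temp table; there is nothing to roll back in the live table.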
For this you might try the following:
Use XA transactions (since more than one connector will be used, regardless of whether or not they share the same transport).
Enlist in the transaction the resource used in the custom Java code.
This can also be applied within the same transport (e.g. JDBC in the Mule configuration and also in the Java component), so it's not restricted to the case demonstrated in the PoC, which is only given as a reference.
Please refer to this article https://dzone.com/articles/passing-java-arrays-in-oracle-stored-procedure-fro
Poll records from the temp table. You can construct an array with any number of records; with a batch size of 100K, loading 1.5M records takes only 15 round trips in total.
To determine which records are in error, you can insert them into an error table, but that has to be implemented in a database procedure.
We have recently had a test failing, and it brings up a question about BigQuery's consistency model: after we create a table, should other operations immediately see that table?
Background:
Our test creates a table in BigQuery with some data, waits for the job to complete, and then checks whether the table exists.
gbq.write_gbq(df, dataset_id, table_name, project_id=project_id, block=True)
assert table_name in gbq.list_tables(dataset_id, project_id=project_id) # fails
FYI block=True runs wait_for_job, so it waits for the job to complete.
Yes, the table should be ready for usage just after creation.
But I suspect that the issue is not with BigQuery.
Notice that in the docs, the tables.list() operation has a nextPageToken field. You will probably have to use it in order to retrieve all the tables in your dataset.
This code has an example of how to use it. Basically, while a pageToken is defined, not all tables have been listed yet.
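The loop looks roughly like this; the pager below is a made-up stand-in for tables.list(), but the token-handling shape is the point:

```python
# Toy pager imitating tables.list(): each call returns one page of table
# names plus a nextPageToken, until the last page returns no token.
PAGES = {None: (["t1", "t2"], "tok1"), "tok1": (["t3"], None)}

def list_tables_page(page_token=None):
    return PAGES[page_token]

def list_all_tables():
    tables, token = [], None
    while True:
        page, token = list_tables_page(token)
        tables.extend(page)
        if token is None:  # no nextPageToken means this was the last page
            return tables

print(list_all_tables())  # ['t1', 't2', 't3']
```

If the assertion in the test only checks the first page, a freshly created table can appear to be "missing" even though BigQuery listed it correctly on a later page.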
I have a staging table in a SQL Server job which dumps data into another table, and once the data is dumped to the transaction table I truncate the staging table.
Now, the problem occurs if the job fails: the data in the transaction table is rolled back and all the data is placed back in the staging table. So the staging table already contains data, and if I rerun the job it merges the new data with the existing data in the staging table.
I want my staging table to be empty when the job runs.
Can I make use of temp table in this scenario?
This is a common scenario in data warehousing projects, and the answer is logical rather than technical. You have two approaches to deal with this scenario:
If your data is important, then first check whether staging is empty or not. If the table is not empty, it means the last job failed; in this case, instead of inserting into staging, do an insert-or-update (upsert) operation and then continue with the job steps. If the table is empty, it means the last job was successful, and the new data will be insert-only.
If you can afford to lose the data from the last job, then make it a habit to truncate the table before running your package.
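A small sketch of the first approach, using SQLite only to show the logic (SQL Server would use MERGE instead of SQLite's ON CONFLICT syntax; table and column names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (id INTEGER PRIMARY KEY, val TEXT)")

# Leftover row from a failed previous run:
conn.execute("INSERT INTO staging VALUES (1, 'old')")

new_batch = [(1, "new"), (2, "new")]

(count,) = conn.execute("SELECT COUNT(*) FROM staging").fetchone()
if count > 0:
    # Last run failed: merge (upsert) instead of a plain insert.
    conn.executemany(
        "INSERT INTO staging VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET val = excluded.val",
        new_batch,
    )
else:
    # Last run succeeded and staging is clean: plain insert.
    conn.executemany("INSERT INTO staging VALUES (?, ?)", new_batch)

print(sorted(conn.execute("SELECT * FROM staging")))  # [(1, 'new'), (2, 'new')]
```

Either way, no duplicate rows survive a rerun after a failed job.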
I'm trying to run multiple simultaneous jobs in order to load around 700K records into a single BigQuery table. My code (Java) creates the schema from the records of each job and updates the BigQuery schema if needed.
Workflow is as follows:
A single job creates the table and sets the (initial) schema.
For each load job we create the schema from the records of the job. Then we pull the existing table schema from BigQuery, and if it's not a superset of the schema associated with the job, we update the schema with the new merged schema. The last part (starting from pulling the existing schema) is synced (using a lock) - only one job performs it at a time. The update of the schema is using the UPDATE method, and the lock is released only after the client update method returns.
I was expecting this workflow to avoid schema update errors. I'm assuming that once the client returns from the update call, the table is updated, and that jobs that are in progress can't be hurt by the schema update.
Nevertheless, I still get schema update errors from time to time. Is the update method atomic? How do I know when a schema was actually updated?
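The superset check and merge described in the workflow above can be sketched with plain dicts mapping field name to type (real BigQuery schemas also carry mode and nested fields, which this ignores):

```python
# Toy version of the questioner's merge step, not BigQuery's own logic.
def is_superset(table_schema, job_schema):
    """True if every field of the job's schema exists in the table with the same type."""
    return all(table_schema.get(f) == t for f, t in job_schema.items())

def merge(table_schema, job_schema):
    """Union of the two schemas (assumes no type conflicts, for simplicity)."""
    merged = dict(table_schema)
    merged.update(job_schema)
    return merged

table = {"id": "INTEGER", "name": "STRING"}
job = {"id": "INTEGER", "email": "STRING"}

print(is_superset(table, job))  # False -> an update is needed
print(merge(table, job))  # {'id': 'INTEGER', 'name': 'STRING', 'email': 'STRING'}
```

As the answer below explains, even a correctly serialized update can still race with a load job that started under the old schema.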
Updates in BigQuery are atomic, but they are applied at the end of the job. When a job completes, it makes sure that the schemas are equivalent. If there was a schema update while the job was running, this check will fail.
We should probably make sure that the schemas are compatible instead of equivalent. If you do an append with a compatible schema (i.e. you have a subset of the table schema) that should succeed, but currently BigQuery doesn't allow this. I'll file a bug.