We recently had a test fail, and it raises a question about BigQuery's consistency model: after we create a table, should other operations immediately see that table?
Background:
Our test creates a table in BigQuery with some data, waits for the job to complete, and then checks whether the table exists.
gbq.write_gbq(df, dataset_id, table_name, project_id=project_id, block=True)
assert table_name in gbq.list_tables(dataset_id, project_id=project_id) # fails
FYI, block=True runs wait_for_job, so it waits for the job to complete.
Yes, the table should be ready for usage just after creation.
But I suspect that the issue is not with BigQuery.
Notice that in the docs, the tables.list() operation has a nextPageToken field. You will probably have to use it in order to retrieve all the tables in your dataset.
This code shows one example of how to use it: basically, as long as a pageToken is defined, not all tables have been listed yet.
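A minimal sketch of that pagination loop, using the google-api-python-client discovery client for the BigQuery v2 REST API; the helper name list_all_tables and the credential setup are my own assumptions:

from googleapiclient.discovery import build  # assumes google-api-python-client is installed

def list_all_tables(project_id, dataset_id):
    # List every table in a dataset, following nextPageToken until exhausted.
    service = build("bigquery", "v2")  # credentials resolved from the environment
    tables = []
    page_token = None
    while True:
        response = service.tables().list(
            projectId=project_id,
            datasetId=dataset_id,
            pageToken=page_token,
        ).execute()
        tables.extend(t["tableReference"]["tableId"] for t in response.get("tables", []))
        page_token = response.get("nextPageToken")
        if not page_token:  # no more pages left
            break
    return tables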
Related
We are working on a data warehouse using IBM DB2 and we wanted to load data by partition exchange. That means we prepare a temporary table with the data we want to load into the target table and then use that entire table as a data partition in the target table. If there was previous data we just discard the old partition.
Basically you just do "ALTER TABLE target_table ATTACH PARTITION pname [starting and ending clauses] FROM temp_table".
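As an illustration only, a rough sketch of a single exchange driven from Python with the ibm_db driver; the connection string, table names, partition name, and range bounds are all placeholders, and the SET INTEGRITY step is the usual follow-up to make an attached partition visible:

import ibm_db  # assumes the ibm_db driver; connection string and names are placeholders

conn = ibm_db.connect("DATABASE=dwh;HOSTNAME=dbhost;PORT=50000;UID=etl;PWD=secret", "", "")

# Attach the staged data as a new partition of the target table.
ibm_db.exec_immediate(conn, """
    ALTER TABLE target_table
        ATTACH PARTITION p_2024_01
        STARTING FROM ('2024-01-01') ENDING AT ('2024-01-31')
        FROM temp_table
""")
# An attached partition stays invisible until SET INTEGRITY has been run.
ibm_db.exec_immediate(conn, "SET INTEGRITY FOR target_table IMMEDIATE CHECKED")
# Replacing old data would be a DETACH of the previous partition, e.g.
# ALTER TABLE target_table DETACH PARTITION p_2023_12 INTO discarded_2023_12
ibm_db.close(conn)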
It works wonderfully, but only for one operation at a time. If we do multiple loads in parallel or try to attach multiple partitions to the same table it's raining deadlock errors from the database.
From what I understand, the problem isn't necessarily with parallel access to the target table itself (locking it changes nothing), but accesses to system catalog tables in the background.
I have combed through the DB2 documentation, but the only reference I found to concurrent DDL statements at all was advice to avoid them. Surely the answer can't simply be to not attempt it?
Does anyone know a way to deal with this problem?
I tried using a single global synchronization table that must be locked before attaching any partition, but it didn't help either. Either I'm missing something (implicit commits somewhere?) or some of the catalog updates even happen asynchronously, which makes the whole problem much worse. If that is the case, is there any chance at all to query whether an attach is safe to perform at any given moment?
Hi, I am running ETL via Python.
I have a simple SQL file that I run from Python, like:
truncate table foo_stg;
insert into foo_stg
(
select blah,blah .... from tables
);
truncate table foo;
insert into foo
(
select * from foo_stg
);
This query sometimes takes a lock on a table which it does not release.
Because of this, other processes get queued.
Currently I check which table holds the lock and kill the process that caused it.
I want to know what changes I can make in my code to mitigate such issues.
Thanks in Advance!!!
The TRUNCATE is probably breaking your transaction logic. Recommend doing all truncates upfront. I'd also recommend adding some processing logic to ensure that each instance of the ETL process either: A) has exclusive access to the staging tables or B) uses a separate set of staging tables.
TRUNCATE in Redshift (and many other DBs) does an implicit COMMIT.
…be aware that TRUNCATE commits the transaction in which it is run.
Redshift tries to make this clear by returning the following INFO message to confirm success: TRUNCATE TABLE and COMMIT TRANSACTION. However, this INFO message may not be displayed by your SQL client tool. Run the SQL in psql to see it.
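Applied to the SQL from the question, doing the truncates upfront might look roughly like the following sketch (driven through psycopg2 purely for illustration; the connection string is a placeholder and the first SELECT is elided as in the original):

import psycopg2  # assumes a Redshift connection via the psycopg2 driver

conn = psycopg2.connect("dbname=analytics host=redshift-host user=etl password=secret")
cur = conn.cursor()

# TRUNCATE commits implicitly, so run both truncates before any transactional work.
cur.execute("TRUNCATE TABLE foo_stg;")
cur.execute("TRUNCATE TABLE foo;")
conn.commit()

# Both loads now run inside one explicit transaction and commit together.
cur.execute("INSERT INTO foo_stg (SELECT blah, blah FROM tables);")
cur.execute("INSERT INTO foo (SELECT * FROM foo_stg);")
conn.commit()

cur.close()
conn.close()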
In my case, I created the table for the first time and tried to load it from the stage table using insert into a table from select c1,c2,c3 from stage; I am running this using a Python script.
The table locks and does not load the data. Another interesting scenario: when I run the same insert SQL from the editor, it loads, and after that my Python script loads the same table without any locks. But the first time, only the table lock happens. Not sure what the issue is.
Thanks in advance for any help. Here is the scenario that I am trying to recreate in Mulesoft.
1,500,000 records in a table. Here is the current process that we use.
Start a transaction.
Delete all records from the table.
Reload the table from a flat file.
Commit the transaction.
In the end we need the table in a good state, thus the use of the transaction. If there is any failure, the data in the table will be rolled back to its initial valid state.
I was able to get the speed we needed by using the Batch element (< 10 minutes), but it appears that transactions are not supported around the whole batch flow.
Any ideas how I could get this to work in Mulesoft?
Thanks again.
A little different workflow but how about:
Load temp table from flat file
If successful drop original table
Rename temp table to original table name
You can keep your Mule batch processing workflow to load the temp table and forget about rolling back.
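A sketch of that swap as it might be issued from the database connector (driver-agnostic DB-API code; the helper name and table names are placeholders, and the exact rename syntax varies by vendor):

def swap_in_new_table(conn, target="target_table", temp="temp_table"):
    # Replace the target table with the freshly loaded temp table.
    # Assumes a DB-API connection and a database that supports
    # DROP TABLE plus ALTER TABLE ... RENAME TO (syntax varies by vendor).
    cur = conn.cursor()
    cur.execute("DROP TABLE " + target)
    cur.execute("ALTER TABLE " + temp + " RENAME TO " + target)
    conn.commit()
    cur.close()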
For this you might try the following:
Use XA transactions (since more than one connector will be used, regardless of whether the same transport is used or not).
Enlist the resource used in the custom Java code in the transaction.
This can also be applied within the same transport (e.g. JDBC on the Mule configuration and also on the Java component), so it's not restricted to the case demonstrated in the PoC, which is only given as a reference.
Please refer to this article https://dzone.com/articles/passing-java-arrays-in-oracle-stored-procedure-fro
Poll records from the temp table. You can construct an array with any number of records; with a batch size of 100K, the load would only involve 15 round trips in total.
To identify error records you can insert them into an error table, but that has to be implemented in the database procedure.
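The linked article shows the Java/Mule side; as a rough Python illustration of the same round-trip arithmetic (1,500,000 rows in chunks of 100,000 means 15 calls), batched inserts through cx_Oracle might look like this; connection details, table, and the flat-file loader are placeholders:

import cx_Oracle  # assumes the cx_Oracle driver is installed

conn = cx_Oracle.connect("etl", "secret", "dbhost/ORCLPDB1")
cur = conn.cursor()

BATCH = 100_000  # 1,500,000 rows / 100,000 per call = 15 round trips

rows = load_rows_from_flat_file()  # hypothetical loader returning a list of tuples
for start in range(0, len(rows), BATCH):
    chunk = rows[start:start + BATCH]
    # One executemany call ships the whole chunk in a single round trip.
    cur.executemany("INSERT INTO target_table (c1, c2, c3) VALUES (:1, :2, :3)", chunk)
conn.commit()

cur.close()
conn.close()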
I'm trying to run multiple simultaneous jobs in order to load around 700K records into a single BigQuery table. My code (Java) creates the schema from the records of its job, and updates the BigQuery schema if needed.
Workflow is as follows:
A single job creates the table and sets the (initial) schema.
For each load job we create the schema from the records of the job. Then we pull the existing table schema from BigQuery, and if it's not a superset of the schema associated with the job, we update the table with the new merged schema. The last part (starting from pulling the existing schema) is synchronized using a lock: only one job performs it at a time. The schema update uses the UPDATE method, and the lock is released only after the client's update method returns.
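For reference, a rough sketch of that synchronized check-and-merge step; the original code is Java, so this is only a translation of the idea to the Python client (google-cloud-bigquery), with the project, table ID, and job_fields input all assumed:

import threading
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project
schema_lock = threading.Lock()

def ensure_schema(table_id, job_fields):
    # Merge the fields derived from one load job into the live table schema.
    with schema_lock:  # only one job inspects/updates the schema at a time
        table = client.get_table(table_id)  # e.g. "my_dataset.my_table"
        existing = {field.name for field in table.schema}
        missing = [f for f in job_fields if f.name not in existing]
        if missing:  # existing schema is not a superset, so push the merged schema
            table.schema = list(table.schema) + missing
            client.update_table(table, ["schema"])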
I was expecting to avoid schema update errors with this workflow. I'm assuming that once the client returns from the update call, the table is updated, and that jobs already in progress can't be hurt by the schema update.
Nevertheless, I still get schema update errors from time to time. Is the update method atomic? How do I know when a schema was actually updated?
Updates in BigQuery are atomic, but they are applied at the end of the job. When a job completes, it makes sure that the schemas are equivalent. If there was a schema update while the job was running, this check will fail.
We should probably make sure that the schemas are compatible instead of equivalent. If you do an append with a compatible schema (i.e. you have a subset of the table schema) that should succeed, but currently BigQuery doesn't allow this. I'll file a bug.
If configuration.load.writeDisposition is set to WRITE_TRUNCATE during a load job, is there a period of time when querying the table would raise an error?
The whole period when the job is marked as PENDING and/or RUNNING?
A small moment when the table is replaced at the end of the load job?
What would be the error? status.errors[].reason => "notFound"?
The WRITE_TRUNCATE is atomic and gets applied at the end of the load job. So any queries that happen during that time will see either only the old data or all of the new data. There should be no cases where you'd get an error querying the table.
If the load failed, then there should be no change to the table, and if it succeeded, all of the data should appear at once in the table.
If the table didn't already exist, and a load job specified CREATE_IF_NEEDED, then querying the table would give not found until the load job completed.
We're working on a doc rewrite that will make this more clear.
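For reference, a minimal sketch of a load job configured this way with the Python client (google-cloud-bigquery); the bucket URI, project, and table names are placeholders:

from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,    # old data replaced atomically at job end
    create_disposition=bigquery.CreateDisposition.CREATE_IF_NEEDED,
)

# Queries that run while the job is PENDING/RUNNING still see only the old data;
# the new data appears all at once when the job succeeds.
job = client.load_table_from_uri(
    "gs://my-bucket/export-*.json", "my_dataset.my_table", job_config=job_config
)
job.result()  # block until the load completes (raises on failure)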