I'm trying to run multiple simultaneous jobs in order to load around 700K records into a single BigQuery table. My code (Java) creates the schema from the records of its job and updates the BigQuery schema, if needed.
Workflow is as follows:
A single job creates the table and sets the (initial) schema.
For each load job we create the schema from the records of the job. Then we pull the existing table schema from BigQuery, and if it is not a superset of the schema associated with the job, we update the table with the new merged schema. The last part (starting from pulling the existing schema) is synchronized with a lock, so only one job performs it at a time. The schema update uses the UPDATE method, and the lock is released only after the client's update call returns.
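For reference, here is a minimal sketch of that synchronized merge-and-update step using the google-cloud-bigquery Java client; the dataset and table names are placeholders, and the merge only handles top-level fields.

import com.google.cloud.bigquery.*;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the locked merge-and-update step described above; error handling omitted.
public class SchemaUpdater {
    private static final Object SCHEMA_LOCK = new Object();
    private final BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    // Ensures the live table schema is a superset of the schema built from this job's records.
    public void ensureSchemaCovers(TableId tableId, Schema jobSchema) {
        synchronized (SCHEMA_LOCK) {
            Table table = bigquery.getTable(tableId);
            Schema existing = table.getDefinition().getSchema();
            Schema merged = merge(existing, jobSchema);
            if (!merged.equals(existing)) {
                // The lock is held until this update call returns.
                bigquery.update(table.toBuilder()
                        .setDefinition(StandardTableDefinition.of(merged))
                        .build());
            }
        }
    }

    // Adds any top-level fields present in jobSchema but missing from the existing schema.
    private static Schema merge(Schema existing, Schema jobSchema) {
        Map<String, Field> fields = new LinkedHashMap<>();
        existing.getFields().forEach(f -> fields.put(f.getName(), f));
        jobSchema.getFields().forEach(f -> fields.putIfAbsent(f.getName(), f));
        return Schema.of(fields.values().toArray(new Field[0]));
    }
}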
I was expecting this workflow to avoid schema update errors. I'm assuming that once the client returns from the update call, the table is updated, and that jobs already in progress can't be affected by the schema update.
Nevertheless, I still get schema update errors from time to time. Is the update method atomic? How do I know when a schema was actually updated?
Updates in BigQuery are atomic, but they are applied at the end of the job. When a job completes, it makes sure that the schemas are equivalent. If there was a schema update while the job was running, this check will fail.
We should probably make sure that the schemas are compatible instead of equivalent. If you do an append with a compatible schema (i.e. you have a subset of the table schema) that should succeed, but currently BigQuery doesn't allow this. I'll file a bug.
We are working on a data warehouse using IBM DB2 and we wanted to load data by partition exchange. That means we prepare a temporary table with the data we want to load into the target table and then use that entire table as a data partition in the target table. If there was previous data we just discard the old partition.
Basically you just do "ALTER TABLE target_table ATTACH PARTITION pname [starting and ending clauses] FROM temp_table".
It works wonderfully, but only for one operation at a time. If we do multiple loads in parallel or try to attach multiple partitions to the same table it's raining deadlock errors from the database.
From what I understand, the problem isn't necessarily with parallel access to the target table itself (locking it changes nothing), but rather with accesses to the system catalog tables in the background.
I have combed through the DB2 documentation, but the only reference to concurrent DDL statements I found at all was advice to avoid them. Surely the answer to this question can't simply be to not attempt it?
Does anyone know a way to deal with this problem?
I tried using a single, global synchronization table that has to be locked before attaching any partition, but it didn't help either. Either I'm missing something (implicit commits somewhere?) or some of the data catalog updates happen asynchronously, which makes the whole problem much worse. If that is the case, is there any chance at all to query whether an attach is safe to perform at a given moment?
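Roughly, the serialization I tried looks like the sketch below (connection details, table names, and the partition range are simplified placeholders): lock the sync table in the same transaction as the attach and commit only afterwards.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Sketch of the serialized-attach attempt described above; all names are placeholders.
public class SerializedAttach {
    public static void attachPartition(String partitionName, String tempTable) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:db2://dbhost:50000/DWH", "etl_user", "secret")) {
            conn.setAutoCommit(false);
            try (Statement stmt = conn.createStatement()) {
                // Global serialization point: only one session can hold this lock at a time.
                stmt.execute("LOCK TABLE etl.attach_sync IN EXCLUSIVE MODE");
                stmt.execute("ALTER TABLE dwh.target_table"
                        + " ATTACH PARTITION " + partitionName
                        + " STARTING ('2015-07-01') ENDING ('2015-07-31')"
                        + " FROM " + tempTable);
            }
            conn.commit(); // the lock on etl.attach_sync is only released here
        }
    }
}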
I have a job that inserts data in BigQuery using WRITE_TRUNCATE disposition. This will truncate all data already in the table and replace it with the results of the job.
However, in some cases the table I want to insert data into has Row Level Security (RLS) rules. In this case, TRUNCATE removes the RLS as well (which I don't want).
As stated here about write dispositions:
Each action is atomic and only occurs if BigQuery is able to complete the job successfully. Creation, truncation and append actions occur as one atomic update upon job completion.
I am looking for a way to remove rows from my table without removing RLS, occurring as one atomic update with the append action upon job completion.
I have seen that DELETE FROM [TABLE] WHERE TRUE does not remove RLS, but I can't find a way to use it instead of TRUNCATE through the BigQuery framework. Is there a way to do it?
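The closest non-atomic workaround I can sketch with the Java client is to issue that DELETE as its own query job and then run the load with WRITE_APPEND; this keeps RLS, but it is two separate jobs rather than one atomic update. Project, dataset, table, and source URI below are placeholders.

import com.google.cloud.bigquery.*;

// Sketch of the two-step workaround mentioned above: an explicit DELETE (which keeps RLS)
// followed by a WRITE_APPEND load. Note this is two jobs, not one atomic update.
public class DeleteThenAppend {
    public static void main(String[] args) throws Exception {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
        TableId tableId = TableId.of("my_dataset", "my_table");

        // Step 1: remove existing rows without dropping the row level security rules.
        bigquery.query(QueryJobConfiguration.newBuilder(
                "DELETE FROM `my_project.my_dataset.my_table` WHERE TRUE").build());

        // Step 2: append the new data instead of truncating.
        LoadJobConfiguration load = LoadJobConfiguration
                .newBuilder(tableId, "gs://my_bucket/export/*.json")
                .setFormatOptions(FormatOptions.json())
                .setWriteDisposition(JobInfo.WriteDisposition.WRITE_APPEND)
                .build();
        Job job = bigquery.create(JobInfo.of(load)).waitFor();
        if (job.getStatus().getError() != null) {
            throw new RuntimeException(job.getStatus().getError().toString());
        }
    }
}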
According to http://blogs.msdn.com/b/sqlcat/archive/2011/10/17/updating-a-database-snapshot.aspx I should be able to successfully execute an INSERT, UPDATE and DELETE against a Database Snapshot.
The idea is to create a view of a table before you create the snapshot, then create the snapshot and update the view through the snapshot.
I have tried this on my SQL Server 2014 (v12.0.2269) and I still get the error
Failed to update database "Snapshot2015_07" because the database is read-only.
The reason I am keen for this to work is that financials need to be frozen at a particular date, but need to be updated if errors are found in the snapshot.
Has anyone had success recently doing this?
I know there are alternatives like AutoAudit, but it is a lot of work to implement for 1-2 updates/deletes on a database with multiple tables of 5 million+ rows.
The view has to specify the database name (which is the original database name, not the snapshot database name), along with the schema and table name. Ensure the view you created specifies those three parts of the fully qualified object name.
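Something along these lines, as a rough sketch (database, view, and column names are made up; the key point is the three-part name inside the view definition):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Rough sketch of the view trick from the linked article; all names are made up.
public class SnapshotViewUpdate {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:sqlserver://localhost;databaseName=Financials2015;integratedSecurity=true");
             Statement stmt = conn.createStatement()) {

            // 1. In the source database, before the snapshot is created: the view must
            //    reference Database.Schema.Table, not just the table name.
            stmt.execute("CREATE VIEW dbo.vw_Ledger AS "
                    + "SELECT EntryId, Amount FROM Financials2015.dbo.Ledger");

            // 2. Once the snapshot exists, corrections are issued through the snapshot's
            //    copy of the view; the three-part name resolves back to the writable database.
            stmt.execute("UPDATE Snapshot2015_07.dbo.vw_Ledger "
                    + "SET Amount = 0 WHERE EntryId = 42");
        }
    }
}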
I'm trying to figure out if there's a method for copying the contents of a table in a main schema into a table in another schema, and then somehow updating or "refreshing" that copy as the main table gets updated.
For example:
schema "BBLEARN", has table users
SELECT * INTO SIS_temp_data.dbo.bb_users FROM BBLEARN.dbo.users
This selects and inserts 23k rows into the table bb_course_users in my placeholder schema SIS_temp_data.
Thing is, the users table in the BBLEARN schema gets updated on a constant basis, whether new users are added, accounts are updated, or users are enabled or disabled, etc. The main reason for copying the table into a temp table is for data integration purposes and is unrelated to the question at hand.
So, is there a method in SQL Server that will allow me to "update" my new table in the spare schema based on when the data in the main schema gets updated? Or do I just need to run a scheduled task that does a SELECT * INTO every few hours?
Thank you.
You could create a trigger which updates the spare table whenever an update or insert is performed on the main table; see http://msdn.microsoft.com/en-us/library/ms190227.aspx
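As a rough sketch (assuming a user_id key column and an identical column layout in both tables), the trigger could re-copy any inserted or updated rows into the spare table:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Rough sketch of the trigger approach: re-copy inserted or updated rows from
// BBLEARN.dbo.users into SIS_temp_data.dbo.bb_users. The user_id key column and the
// matching column layout of both tables are assumptions.
public class CreateUserSyncTrigger {
    public static void main(String[] args) throws Exception {
        String ddl =
            "CREATE TRIGGER dbo.trg_users_sync ON dbo.users AFTER INSERT, UPDATE AS\n"
          + "BEGIN\n"
          + "    SET NOCOUNT ON;\n"
          + "    DELETE t FROM SIS_temp_data.dbo.bb_users AS t\n"
          + "        INNER JOIN inserted AS i ON i.user_id = t.user_id;\n"
          + "    INSERT INTO SIS_temp_data.dbo.bb_users SELECT * FROM inserted;\n"
          + "END";
        try (Connection conn = DriverManager.getConnection(
                "jdbc:sqlserver://localhost;databaseName=BBLEARN;integratedSecurity=true");
             Statement stmt = conn.createStatement()) {
            stmt.execute(ddl); // run once against the BBLEARN database
        }
    }
}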
If configuration.load.writeDisposition is set to WRITE_TRUNCATE during a load job, is there a period of time when querying the table would raise an error?
The whole period when the job is marked as PENDING and/or RUNNING?
A small moment when the table is replaced at the end of the load job?
What would be the error? status.errors[].reason => "notFound"?
WRITE_TRUNCATE is atomic and gets applied at the end of the load job, so any queries that run during that time will see either only the old data or all of the new data. There should be no case where you'd get an error querying the table.
If the load failed, then there should be no change to the table, and if it succeeded, all of the data should appear at once in the table.
If the table didn't already exist and the load job specified CREATE_IF_NEEDED, then querying the table would return a "not found" error until the load job completed.
We're working on a doc rewrite that will make this more clear.
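To illustrate, a minimal WRITE_TRUNCATE load with the Java client might look like the sketch below (dataset, table, and source URI are placeholders); if the job fails, the error appears on the job status and the destination table is left unchanged.

import com.google.cloud.bigquery.*;

// Minimal sketch of a truncating load. Until waitFor() reports success, queries see the
// old table contents (or "not found" if the table is being created by this job).
public class TruncatingLoad {
    public static void main(String[] args) throws Exception {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
        LoadJobConfiguration config = LoadJobConfiguration
                .newBuilder(TableId.of("my_dataset", "my_table"), "gs://my_bucket/data/*.csv")
                .setFormatOptions(FormatOptions.csv())
                .setCreateDisposition(JobInfo.CreateDisposition.CREATE_IF_NEEDED)
                .setWriteDisposition(JobInfo.WriteDisposition.WRITE_TRUNCATE)
                .build();

        Job job = bigquery.create(JobInfo.of(config)).waitFor();
        if (job.getStatus().getError() != null) {
            // The load failed; the destination table is left as it was.
            System.err.println("Load failed: " + job.getStatus().getError());
        }
    }
}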