Streaming Inserts during BigQuery Overwrite Job

Streaming Inserts during BigQuery Overwrite Job - google-bigquery

I have a BigQuery table and I want to use a job with writeDisposition WRITE_TRUNCATE to overwrite the table with a subset of its rows. I am doing this because I'm trying to mimic a DELETE FROM … WHERE … operation.
Suppose while the job is running, I am simultaneously trying to stream rows into the table. Is it possible for rows to be inserted while the job is running and so be overwritten when the job completes? Or is there a locking mechanism that will prevent the rows from being inserted until the job finishes?

In this case you need to stop the streaming jobs until you do your operation. And resume once you are done with it. There is no locking.
Also you should allow some cooling down period after you stop streaming inserts, as they are processed in background and you need to let the system to finish.

Because of the table metadata caching layer in streaming system, it currently needs about 10 minutes to realize that a table has been truncated. During this ~10min, all streamed data will be dropped (because they are considered as part of truncated data).
As Pentium10 suggested, it's recommended to pause the streaming requests if you are doing a WRITE_TRUNCATE, and resume it ~10min after truncation is done.

Related

Understanding locks and query status in Snowflake (multiple updates to a single table)

While using the python connector for snowflake with queries of the form
UPDATE X.TABLEY SET STATUS = %(status)s, STATUS_DETAILS = %(status_details)s WHERE ID = %(entry_id)s
, sometimes I get the following message:
(snowflake.connector.errors.ProgrammingError) 000625 (57014): Statement 'X' has locked table 'XX' in transaction 1588294931722 and this lock has not yet been released.
and soon after that
Your statement X' was aborted because the number of waiters for this lock exceeds the 20 statements limit
This usually happens when multiple queries are trying to update a single table. What I don't understand is that when I see the query history in Snowflake, it says the query finished successfully (Succeded Status) but in reality, the Update never happened, because the table did not alter.
So according to https://community.snowflake.com/s/article/how-to-resolve-blocked-queries I used
SELECT SYSTEM$ABORT_TRANSACTION(<transaction_id>);
to release the lock, but still, nothing happened and even with the succeed status the query seems to not have executed at all. So my question is, how does this really work and how can a lock be released without losing the execution of the query (also, what happens to the other 20+ queries that are queued because of the lock, sometimes it seems that when the lock is released the next one takes the lock and have to be aborted as well).
I would appreciate it if you could help me. Thanks!

Not sure if Sergio got an answer to this. The problem in this case is not with the table. Based on my experience with snowflake below is my understanding.
In snowflake, every table operations also involves a change in the meta table which keeps track of micro partitions, min and max. This meta table supports only 20 concurrent DML statements by default. So if a table is continuously getting updated and getting hit at the same partition, there is a chance that this limit will exceed. In this case, we should look at redesigning the table updation/insertion logic. In one of our use cases, we increased the limit to 50 after speaking to snowflake support team

UPDATE, DELETE, and MERGE cannot run concurrently on a single table; they will be serialized as only one can take a lock on a table at at a time. Others will queue up in the "blocked" state until it is their turn to take the lock. There is a limit on the number of queries that can be waiting on a single lock.
If you see an update finish successfully but don't see the updated data in the table, then you are most likely not COMMITting your transactions. Make sure you run COMMIT after an update so that the new data is committed to the table and the lock is released.
Alternatively, you can make sure AUTOCOMMIT is enabled so that DML will commit automatically after completion. You can enable it with ALTER SESSION SET AUTOCOMMIT=TRUE; in any sessions that are going to run an UPDATE.

Can we check if table in bigquery is in locked or DML operation is being performed

There is a BQ table which has multiple data load/update/delete jobs scheduled in. Since this is automated jobs many of it are failing due to concurrent update issue.
I need to know if we have a provision in BigQuery to check if the table is already locked by DML operation and can we serialize the queries so that no job fails

You could use the job ID generated by client code to keep track of the job status and only begin the next query when that job is done. This and this describe that process.
Alternatively you could try exponential backoff to retry the query a certain amount of times to prevent automatic failure of the query due to locked tables.

Why is an implicit table lock being released prior to end of transaction in RedShift?

I have an ETL process that is building dimension tables incrementally in RedShift. It performs actions in the following order:
Begins transaction
Creates a table staging_foo like foo
Copies data from external source into staging_foo
Performs mass insert/update/delete on foo so that it matches staging_foo
Drop staging_foo
Commit transaction
Individually this process works, but in order to achieve continuous streaming refreshes to foo and redundancy in the event of failure, I have several instances of the process running at the same time. And when that happens I occasionally get concurrent serialization errors. This is because both processes are replaying some of the same changes to foo from foo_staging in overlapping transactions.
What happens is that the first process creates the staging_foo table, and the second process is blocked when it attempts to create a table with the same name (this is what I want). When the first process commits its transaction (which can take several seconds) I find that the second process gets unblocked before the commit is complete. So it appears to be getting a snapshot of the foo table before the commit is in place, which causes the inserts/updates/deletes (some of which may be redundant) to fail.
I am theorizing based on the documentation http://docs.aws.amazon.com/redshift/latest/dg/c_serial_isolation.html where it says:
Concurrent transactions are invisible to each other; they cannot detect each other's changes. Each concurrent transaction will create a snapshot of the database at the beginning of the transaction. A database snapshot is created within a transaction on the first occurrence of most SELECT statements, DML commands such as COPY, DELETE, INSERT, UPDATE, and TRUNCATE, and the following DDL commands :
ALTER TABLE (to add or drop columns)
CREATE TABLE
DROP TABLE
TRUNCATE TABLE
The documentation quoted above is somewhat confusing to me because it first says a snapshot will be created at the beginning of a transaction, but subsequently says a snapshot will be created only at the first occurrence of some specific DML/DDL operations.
I do not want to do a deep copy where I replace foo instead of incrementally updating it. I have other processes that continually query this table so there is never a time when I can replace it without interruption. Another question asks a similar question for deep copy but it will not work for me: How can I ensure synchronous DDL operations on a table that is being replaced?
Is there a way for me to perform my operations in a way that I can avoid concurrent serialization errors? I need to ensure that read access is available for foo so I can't LOCK that table.

OK, Postgres (and therefore Redshift [more or less]) uses MVCC (Multi Version Concurrency Control) for transaction isolation instead of a db/table/row/page locking model (as seen in SQL Server, MySQL, etc.). Simplistically every transaction operates on the data as it existed when the transaction started.
So your comment "I have several instances of the process running at the same time" explains the problem. If Process 2 starts while Process 1 is running then Process 2 has no visibility of the results from Process 1.

How to stop oracle undo process?

I want to know how to stop undo process of oracle ? I tried to delete millions of rows of a big table and in the middle of process I killed session but It started to undo delete and for a bout two hours database got dramatically slow. I didn't want the undo process to be continued. Is there any way to stop it ?

You can't stop the process of rolling back the transaction because doing so would leave the database in an inconsistent state.
When you are executing a long-running delete process, Oracle will likely be writing the changed blocks to your data files before you decide whether to commit or rollback the transaction. If you interrupted the process in the middle of executing the transaction, there will be some changed blocks on disk, some changed blocks in memory, and some unchanged blocks. Rolling back the transaction is the only way to return the database to the state it was in before you started executing the DELETE statement.

Row-by-row delete processes can, as you've found, be exceedingly slow. If the deletions are all done in a single transaction, as appears to be the case here, they can become even slower. You might want to consider the following options:
If you're deleting all the rows in the table you might want to consider using the TRUNCATE TABLE statement.
If you're not deleting all the rows in the table you should probably change your procedure to COMMIT after a certain number of rows are deleted.
In the meantime you're going to have to wait until that rollback process completes.
Share and enjoy.

and when you try truncating the table while it's still deleting you'll be seeing an ORA-00054 "resource busy and acquire with NOWAIT specified or timeout expired"

How to issue select query while performing batch insert

I have a question regarding how to handle sql queries to a table while performing batch inserts to the same table.
I have an ASP.NET web application that creates lots of objects (perhaps 50000) that are inserted in a batch fashion to a table using nHibernate. Even with the Nhibernate optimizations in place this takes up to two minutes. I perform this in a database transaction with isolation level set to read commited.
During the batch insert clients in the web application must be able to read previously created data in this table. However, they should not be able to read uncommited data. My problem is that if I use isolation level "read committed" on the select queries they time out because they are waiting for the batch insert job to finish.
Is there any way to query the database in such a way so that the query runs fast and returns all of the committed rows in the table without waiting on the batch insert job to finish? I do not want to return any uncommitted data.
I have tested setting the isolation level to "snapshot" and that seems to solve my problem, but is it the best approach?
Best regards Whimsical

SNAPSHOT isolation returns data that existed prior to the beginning of the transaction, and it doesn't a lock on the table so it doesn't block. It also ignores other locking transactions, so in your scenario, it sounds like the best fit for you. What it does mean is that since your data is being inserted in a batch, no data from that batch will be available to the SELECT statement until the batch completes (i.e.)
Time 1: Dataset A exists in Table
Time 2: Batch starts inserting dataset B into table (but doesn't commit).
Time 3: App takes snapshot, and reads in dataset A.
Time 4: App finishes returing dataset A (and only dataset A).
Time 5: Batch finishes writing dataset B; Dataset A and DataSet B are
both available in table.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas