We have a DAG whose first task aggregates a table (A) into a staging table (B).
After that, a second task reads from the staging table (B) and writes to another table (C).
However, the second task reads from the aggregated table (B) before it has been fully updated, which causes table C to contain stale data or sometimes to be empty. Airflow still logs everything as successful.
Updating table B is done as (pseudo):
delete from table_b;   -- delete all rows
insert into table_b
select xxxx from table_a;
Task concurrency is set to 10
pool size: 5
max_overflow: 10
We are using the LocalExecutor.
Redshift seems to have a commit queue. Could it be that Redshift tells Airflow it has committed while the commit is in fact still sitting in the queue, so the next task reads before the real commit takes place?
We have tried wrapping the update of table B in a transaction as (pseudo):
begin;
delete from table_b;
insert into table_b
select xxxx from table_a;
commit;
But even that does not work. For some reason Airflow manages to start the second task before the first task has fully committed.
UPDATE
It turned out there was a mistake in the dependencies: downstream tasks were waiting for the wrong task to finish.
For future reference: never assume you have checked everything. Check and recheck the whole flow.
You can achieve this goal by setting wait_for_downstream to True.
From https://airflow.apache.org/docs/stable/_api/airflow/operators/index.html :
when set to true, an instance of task X will wait for tasks
immediately downstream of the previous instance of task X to finish
successfully before it runs.
You can set this parameter at the default_dag_args level or at the tasks (operators) level.
default_dag_args = {
'wait_for_downstream': True,
}
Related
I have a system that's set up as a series of jobs and tasks. Each job is made up of several tasks, and there's another task_progress table that does nothing but record the current state of each task. (This information is kept separate from the main tasks table for business reasons that are not relevant to this issue.)
The jobs table has an overall status column, which needs to get updated to completed when all of the job's tasks reach a terminal state (ok or error). This is handled by a trigger function:
CREATE OR REPLACE FUNCTION update_job_status_when_progress_changes()
RETURNS trigger AS $$
DECLARE
current_job_id jobs.id%TYPE;
pending integer;
BEGIN
SELECT tasks.job_id INTO current_job_id
FROM tasks WHERE tasks.id = NEW.task_id;
SELECT COUNT(*) INTO pending
FROM task_progress
JOIN tasks ON task_progress.task_id = tasks.id
WHERE tasks.job_id = current_job_id
AND task_progress.status NOT IN ('ok', 'error');
IF pending = 0 THEN
UPDATE jobs
SET status = 'completed', updated_at = NOW() AT TIME ZONE 'utc'
WHERE jobs.id = current_job_id;
END IF;
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER task_progress_update_job_status
AFTER UPDATE OR DELETE
ON task_progress
FOR EACH ROW
EXECUTE PROCEDURE update_job_status_when_progress_changes();
It's been almost entirely fine. But sometimes – like, maybe once every few hundred jobs – a job will fail to flip over to completed status. The progress rows are all correct; the business logic that displays a % complete based on the contents of the task_progress table hits 100%, but the status of the job stays at processing. We've been unable to reproduce it reliably; it's just something that happens now and then. But it's frustrating, and I'd like to nail it down.
There are no transactions involved; each task progress is updated atomically by the process that completes the task.
Is it possible to hit a situation where, e.g., the last two tasks in a job complete almost simultaneously, causing the trigger for Task A to see that Task B is still pending, and vice versa? I thought FOR EACH ROW was supposed to prevent race conditions like this, but I can't explain what I'm seeing otherwise.
What's my best option here?
Yes, there is a race condition. If the last two tasks complete at about the same time, the trigger functions can run concurrently. Since the trigger runs as part of the transaction, and neither transaction has committed yet, neither trigger function can see the data modifications made by the other transaction. So each believes there is still a task open.
You could use an advisory lock to make sure that that cannot happen: right before the SELECT count(*) ..., add
SELECT pg_advisory_xact_lock(42);
That makes sure that no session will run the count query while another session that has already run it has not yet committed, because the lock is held until the end of the transaction.
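For illustration, this is how the function above might look with the lock added (a sketch only; inside PL/pgSQL a bare SELECT is written as PERFORM, and the key 42 is just an arbitrary constant the application reserves for this purpose):
CREATE OR REPLACE FUNCTION update_job_status_when_progress_changes()
RETURNS trigger AS $$
DECLARE
  current_job_id jobs.id%TYPE;
  pending integer;
BEGIN
  SELECT tasks.job_id INTO current_job_id
  FROM tasks WHERE tasks.id = NEW.task_id;

  -- Serialize the check below: the lock is held until the transaction commits,
  -- so two nearly simultaneous task completions cannot both see each other's
  -- row as still pending.
  PERFORM pg_advisory_xact_lock(42);

  SELECT COUNT(*) INTO pending
  FROM task_progress
  JOIN tasks ON task_progress.task_id = tasks.id
  WHERE tasks.job_id = current_job_id
    AND task_progress.status NOT IN ('ok', 'error');

  IF pending = 0 THEN
    UPDATE jobs
    SET status = 'completed', updated_at = NOW() AT TIME ZONE 'utc'
    WHERE jobs.id = current_job_id;
  END IF;

  RETURN NEW;
END;
$$ LANGUAGE plpgsql;
If serializing every job behind the single key 42 turns out to be too coarse, the job id itself could be passed as the lock key instead, so that only completions within the same job queue behind each other.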
In my application, which is implemented in SoftwareAG Webmethods, I'm using JMS to process more than 2 million records in parallel. Each JMS thread gets a batch of records (let's say 1,000) to process and insert into a database table (call it table A), and after inserting its batch each thread sends a result message on JMS, which I aggregate later to update the process status.
The problem I'm facing is that a thread will process its batch, insert it, and put the result message on the JMS queue, but the insert transactions get queued inside the MSSQL database. The application itself does not wait for them; it considers the work done and continues with the next line of logic.
Therefore the process on each thread completes and the main process is marked as completed while many records are still waiting to be inserted into the database.
So my question is: is there any trigger in MSSQL that can be used to detect when the queued transactions on a table have finished?
Instead of plain INSERT batches, I suggest batches that run a job with two steps: the first step inserts the data, and the second inserts a "batch complete" record into a results table. After that you can check the results table.
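A minimal T-SQL sketch of that idea (the table, column names, and expected batch count are invented for illustration): each batch writes its "complete" row in the same transaction as its inserts into table A, so the row only becomes visible once the batch's data has actually committed, and the coordinator checks the results table before declaring the process finished.
-- Hypothetical results table: every batch's final step inserts one row here,
-- inside the same transaction as its inserts into table A.
create table batch_results (
    batch_id     int       not null primary key,
    rows_loaded  int       not null,
    completed_at datetime2 not null default SYSUTCDATETIME()
);

-- Completion check run by the coordinating process (2000 = expected number of batches,
-- e.g. 2 million records at 1,000 records per batch).
select case when COUNT(*) >= 2000 then 'all batches committed'
            else 'still loading' end as load_status
from batch_results;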
I have an ETL process that is building dimension tables incrementally in RedShift. It performs actions in the following order:
Begins transaction
Creates a table staging_foo like foo
Copies data from external source into staging_foo
Performs mass insert/update/delete on foo so that it matches staging_foo
Drops staging_foo
Commits the transaction
Individually this process works, but in order to achieve continuous streaming refreshes to foo and redundancy in the event of failure, I have several instances of the process running at the same time. And when that happens I occasionally get concurrent serialization errors. This is because both processes are replaying some of the same changes to foo from foo_staging in overlapping transactions.
What happens is that the first process creates the staging_foo table, and the second process is blocked when it attempts to create a table with the same name (this is what I want). When the first process commits its transaction (which can take several seconds) I find that the second process gets unblocked before the commit is complete. So it appears to be getting a snapshot of the foo table before the commit is in place, which causes the inserts/updates/deletes (some of which may be redundant) to fail.
I am theorizing based on the documentation http://docs.aws.amazon.com/redshift/latest/dg/c_serial_isolation.html where it says:
Concurrent transactions are invisible to each other; they cannot detect each other's changes. Each concurrent transaction will create a snapshot of the database at the beginning of the transaction. A database snapshot is created within a transaction on the first occurrence of most SELECT statements, DML commands such as COPY, DELETE, INSERT, UPDATE, and TRUNCATE, and the following DDL commands:
ALTER TABLE (to add or drop columns)
CREATE TABLE
DROP TABLE
TRUNCATE TABLE
The documentation quoted above is somewhat confusing to me because it first says a snapshot will be created at the beginning of a transaction, but subsequently says a snapshot will be created only at the first occurrence of some specific DML/DDL operations.
I do not want to do a deep copy where I replace foo instead of incrementally updating it. I have other processes that continually query this table, so there is never a time when I can replace it without interruption. Another question covers a similar situation for a deep copy, but that approach will not work for me: How can I ensure synchronous DDL operations on a table that is being replaced?
Is there a way for me to perform my operations in a way that I can avoid concurrent serialization errors? I need to ensure that read access is available for foo so I can't LOCK that table.
OK, Postgres (and therefore Redshift [more or less]) uses MVCC (Multi-Version Concurrency Control) for transaction isolation instead of a db/table/row/page locking model (as seen in SQL Server, MySQL, etc.). Simplistically, every transaction operates on the data as it existed when the transaction started.
So your comment "I have several instances of the process running at the same time" explains the problem. If Process 2 starts while Process 1 is running then Process 2 has no visibility of the results from Process 1.
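The answer above stops at the diagnosis. As a sketch of one common pattern that follows from that MVCC explanation (not something the answer itself proposes; the etl_mutex table name is invented here), the writers can be made to queue behind one another by taking an explicit lock on a small dedicated table at the very start of each ETL transaction. Readers of foo are unaffected because only the ETL processes ever touch the mutex table:
-- One-time setup: an empty table used purely as a mutex by the ETL writers
CREATE TABLE etl_mutex (dummy int);

-- Each ETL run:
BEGIN;
LOCK etl_mutex;                        -- a second writer blocks here until the first commits
CREATE TABLE staging_foo (LIKE foo);
-- COPY into staging_foo, then apply the insert/update/delete merge into foo ...
DROP TABLE staging_foo;
COMMIT;                                -- releases the lock; the waiting writer has not yet
                                       -- read anything, so its snapshot is taken after this
                                       -- commit and includes these changes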
I'm going to make up some sql here. What I want is something like the following:
select ... for update priority 2; // Session 2
So when I run in another session
select ... for update priority 1; // Session 1
It immediately returns, and throws an error in session 2 (and hence does a rollback), and locks the row in session 1.
Then, whilst session 1 holds the lock, running the following in session 2:
select ... for update priority 2; // Session 2
Will wait until session 1 releases the lock.
How could I implement such a scheme? The priority x is just something I've made up; I only need two priority levels.
Also, I'm happy to hide all my logic in PL/SQL procedures, I don't need this to work for generic SQL statements.
I'm using Oracle 10g if that makes any difference.
I'm not aware of a way to interrupt an atomic process in Oracle like you're suggesting. I think the only thing you could do would be to programmatically break your larger processes down into smaller ones and poll some type of sentinel table. So instead of doing a single update of 1 million rows, you could write a proc that updates 1,000 rows, checks a jobs table (or something similar) to see if a higher-priority process is running, and if one is, pauses its own execution in a wait loop (see the sketch below). This is the only thing I can think of that would keep your session alive during this process.
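A rough PL/SQL sketch of that polling idea (the jobs table with its priority and status columns, the big_table/processed_flag names, and the chunk size are all invented for illustration; DBMS_LOCK.SLEEP requires EXECUTE on DBMS_LOCK):
DECLARE
  l_higher_running  PLS_INTEGER;
BEGIN
  LOOP
    -- Work in small chunks so there is a frequent point at which we can yield
    UPDATE big_table
       SET processed_flag = 'Y'
     WHERE processed_flag = 'N'
       AND ROWNUM <= 1000;
    EXIT WHEN SQL%ROWCOUNT = 0;
    COMMIT;

    -- Wait loop: pause while a higher-priority job is registered as running
    LOOP
      SELECT COUNT(*)
        INTO l_higher_running
        FROM jobs
       WHERE priority = 1
         AND status = 'RUNNING';
      EXIT WHEN l_higher_running = 0;
      DBMS_LOCK.SLEEP(5);   -- back off for 5 seconds, then re-check
    END LOOP;
  END LOOP;
END;
/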
If you truly want to abort the progress of your currently running, lower-priority thread, and losing your session is acceptable, then I would again suggest a jobs table that records the SQL being run and the session ID it runs on. A higher-priority statement should check the jobs table and then issue a kill command against the low-priority session (http://www.oracle-base.com/articles/misc/KillingOracleSessions.php), along with inserting a record into the jobs table to note that it was killed. When a higher-priority process finishes, it could check the jobs table to see whether it was responsible for killing anything and, if so, reissue it.
That's what resource manager was implemented for.
We have a number of local Subscriptions that a vendor uses to push data to us every morning. We're looking to have more information about when this happens, and specifically when it finishes, using T-SQL.
I tried this:
exec sp_replmonitorsubscriptionpendingcmds 'SQL03', 'RSSPA_Common', 'RSSPA_Common_ORA_Tran',
'FBHBGISSQL01', 'RSSPA_Fish', 0
but get this message:
Msg 21482, Level 16, State 1, Procedure sp_replmonitorsubscriptionpendingcmds, Line 32
sp_replmonitorsubscriptionpendingcmds can only be executed in the "distribution" database.
How can I tell when this Subscription is being used?
Since the monitoring is handled at the distributor (which you don't seem to have access to), you could try a couple of work-arounds. The first assumes that you have DDL rights on the replicated database.
Add a trigger to one of the replicated tables, such as the last one to finish updating, which is likely the largest one. This incurs overhead, so keep the trigger simple. Design the trigger to update a timestamp in a simple table, and use a SQL Agent job to monitor that table; when the timestamp has gone stale for something like an hour, kick off another process or send a notification (see the job-step sketch after the trigger code below).
create table Repl_Monitor (
LastUpdate datetime not null
);
GO
insert into Repl_Monitor(LastUpdate)
select GETDATE(); --seed record
GO
create trigger trg_Repl_Monitor
on dbo.[<replicated table to be monitored>]
for update, insert, delete
as
update Repl_Monitor
set LastUpdate = GETDATE()
GO
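The SQL Agent side could then be a job scheduled every few minutes with a step along these lines (the one-hour threshold is the example from above; sp_send_dbmail assumes Database Mail is configured, the mail profile and recipient are placeholders, and suppressing repeat notifications is left out of the sketch):
-- If the last replicated change is more than an hour old, assume the morning
-- push has finished and send a notification.
if exists (
    select 1
    from Repl_Monitor
    where LastUpdate < DATEADD(HOUR, -1, GETDATE())
)
begin
    exec msdb.dbo.sp_send_dbmail
        @profile_name = 'ReplicationAlerts',   -- placeholder mail profile
        @recipients   = 'dba@example.com',
        @subject      = 'Vendor replication push appears to have finished';
end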
If the daily push consists of a lot of inserted/deleted records, another work-around would be to monitor the rows count in sysindexes every minute and notify once the count stops fluctuating after a period of time (a rough polling sketch follows the query below).
select top 1 rows from sysindexes
where id = OBJECT_ID('tableName')
and rows > 0
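A sketch of the polling side for that approach (the history table is invented; the job step would run every minute):
-- Helper table holding one row-count observation per run of the job step
create table Repl_RowCount_History (
    SampledAt datetime not null,
    RowCnt    bigint   not null
);
GO

-- Job step: record the current count, then check whether it has stopped changing.
insert into Repl_RowCount_History (SampledAt, RowCnt)
select top 1 GETDATE(), rows from sysindexes
where id = OBJECT_ID('tableName')
and rows > 0;

-- "Stopped fluctuating": the last 10 samples all report the same count.
select case when COUNT(distinct RowCnt) = 1 then 'stable - push likely finished'
            else 'still changing' end as push_status
from (select top 10 RowCnt
      from Repl_RowCount_History
      order by SampledAt desc) as recent;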
This has the benefit of a negligible overhead but isn't as accurate as the trigger.
Cheers!