Airflow: trigger a DAG dynamically multiple times

I have a DAG where data elements are generated and stored in a Redis database.
When all elements have been stored, I want to trigger a further processing DAG.
Right now, I take the top entry from Redis and pass its key as an argument to my secondary DAG. The secondary DAG then processes this entry and triggers itself again if there are further entries in Redis.
trigger_local_sync = TriggerDagRunOperator(
    task_id="trigger_local_sync",
    trigger_dag_id="local_sync",
    conf={
        "entry_id": "{{ task_instance.xcom_pull(key='entry_id') }}"
    },
)
parse_term_entries >> get_top_entry_id >> trigger_local_sync
However, is it also possible to dynamically trigger the local_sync DAG once per entry in the Redis database, so that those runs can then execute in parallel?
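One possible approach (a sketch under assumptions, not code from the question) is Airflow's dynamic task mapping, available from Airflow 2.3: replace get_top_entry_id with a task that returns one conf dict per pending Redis key, then expand the TriggerDagRunOperator over that list so each entry gets its own local_sync run. The get_all_entry_confs task, the "entries" list name, and the default Redis connection below are assumptions for illustration.
from airflow.decorators import task
from airflow.operators.trigger_dagrun import TriggerDagRunOperator
import redis

@task
def get_all_entry_confs():
    # Hypothetical helper: build one conf dict per pending key in Redis.
    # The "entries" list name and default connection settings are assumptions.
    client = redis.Redis()
    return [{"entry_id": key.decode()} for key in client.lrange("entries", 0, -1)]

# Dynamic task mapping (Airflow 2.3+): one trigger task instance is created
# per conf dict, so the triggered local_sync runs can proceed in parallel.
trigger_local_sync = TriggerDagRunOperator.partial(
    task_id="trigger_local_sync",
    trigger_dag_id="local_sync",
).expand(conf=get_all_entry_confs())
The XCom returned by get_all_entry_confs creates the dependency to the trigger automatically, so parse_term_entries would simply be wired upstream of that task instead of the explicit get_top_entry_id >> trigger_local_sync chain.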

Related

Task in same airflow DAG starts before previous task is committed

We have a DAG whose first task aggregates a table (A) into a staging table (B).
After that there is a task that reads from the staging table (B), and writes to another table (C).
However, the second task reads from the aggregated table (B) before it has been fully updated, which leaves table C with stale data, or sometimes empty. Airflow still logs everything as successful.
Updating table B is done as (pseudo):
DELETE FROM b;
INSERT INTO b
SELECT xxxx FROM a;
Task concurrency is set to 10
pool size: 5
max_overflow: 10
Using the LocalExecutor
Redshift seems to have a commit queue. Could it be that Redshift tells Airflow the commit has completed while it is in fact still queued, so that the next task reads before the real commit takes place?
We have tried wrapping the update of table B in a transaction as (pseudo):
BEGIN;
DELETE FROM b;
INSERT INTO b
SELECT xxxx FROM a;
COMMIT;
But even that does not work. For some reason Airflow manages to start the second task before the first task has fully committed.
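For reference, a minimal sketch of running the refresh as a single explicit transaction from Python, so the commit has completed before the task function returns (psycopg2 is assumed here, and xxxx is kept as the question's placeholder column list):
import psycopg2

def refresh_staging_table(conn_params):
    conn = psycopg2.connect(**conn_params)
    try:
        # The connection context manager commits on success and rolls back on
        # error, so both statements land in one transaction.
        with conn:
            with conn.cursor() as cur:
                cur.execute("DELETE FROM b;")
                cur.execute("INSERT INTO b SELECT xxxx FROM a;")
    finally:
        conn.close()
    # Only after this returns can the Airflow task be marked successful,
    # so a correctly wired downstream task sees committed data.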
UPDATE
It turned out there was a mistake in the dependencies. Downstream tasks were waiting for the wrong task to finish.
For future reference: never assume you have checked everything. Check and recheck the whole flow.
You can achieve this goal by setting wait_for_downstream to True.
From https://airflow.apache.org/docs/stable/_api/airflow/operators/index.html :
when set to true, an instance of task X will wait for tasks
immediately downstream of the previous instance of task X to finish
successfully before it runs.
You can set this parameter in default_dag_args or directly on individual tasks (operators); both are shown below.
default_dag_args = {
    'wait_for_downstream': True,
}
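At the task level, the same flag can be passed directly to an operator (a sketch; the task_id and callable names are placeholders, not from the question):
from airflow.operators.python import PythonOperator

aggregate_into_staging = PythonOperator(
    task_id="aggregate_into_staging",     # placeholder task name
    python_callable=build_staging_table,  # placeholder callable
    # Wait for the downstream tasks of the *previous* run's instance of this
    # task to succeed before this instance starts.
    wait_for_downstream=True,
)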

Can we check if a table in BigQuery is locked or a DML operation is being performed?

There is a BQ table which has multiple data load/update/delete jobs scheduled against it. Since these are automated jobs, many of them fail due to concurrent update issues.
I need to know whether BigQuery provides a way to check if the table is already locked by a DML operation, and whether we can serialize the queries so that no job fails.
You could use the job ID generated by client code to keep track of the job status and only begin the next query when that job is done. This and this describe that process.
Alternatively you could try exponential backoff to retry the query a certain number of times to prevent automatic failure of the query due to locked tables.
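A rough sketch of both suggestions with the google-cloud-bigquery client; the table name, the backoff schedule, and the exact exception caught are assumptions:
import time
from google.cloud import bigquery
from google.api_core.exceptions import GoogleAPICallError

client = bigquery.Client()

def run_serialized(sql, max_attempts=5):
    """Run one DML statement, wait for its job to finish, and retry with
    exponential backoff if it fails (e.g. due to a concurrent update)."""
    for attempt in range(max_attempts):
        job = client.query(sql)           # returns a QueryJob with a job_id
        try:
            job.result()                  # block until this job is done
            return job.job_id             # only now should the next query start
        except GoogleAPICallError:
            # Concurrent-update failures typically surface as 400-level errors;
            # back off and try again instead of failing the whole pipeline.
            time.sleep(2 ** attempt)
    raise RuntimeError(f"query still failing after {max_attempts} attempts")

run_serialized("UPDATE `my_dataset.my_table` SET col = 1 WHERE id = 10")  # hypothetical table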

Parallel execution of stored procedures in job (SQL Server)

Brief:
I have five stored procedures and none of them has any dependencies. The common part is that each pulls data from one of five different servers. We are just collating the data and feeding it to our server.
Question:
We have scheduled all five in a single job with five different steps. I want to execute them in parallel instead of in sequential order.
Extra:
I can actually collate all five stored procs into one if it's possible to run them in parallel at the stored procedure level (if not possible at the job level).
Thanks
The job steps are always executed in succession (hence 'steps').
To parallelize it, create five jobs with the same schedule.
There are several options, two of which are:
Create an SSIS package to run the 5 stored procedures in parallel.
Create 5 different SQL Server Agent jobs scheduled at the same time.
It's also possible to create a main job and add steps to call child jobs (each calling its own stored procedure) asynchronously, as suggested at https://dba.stackexchange.com/questions/31104/calling-a-sql-server-job-within-another-job/31105#31105
However, for more complicated flows it's better to use SSIS packages, which are designed to handle different workflows.
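Outside of Agent jobs or SSIS, another illustrative route (not from the answers above) is to fan the five calls out from a small script; pyodbc, the connection strings, and the procedure names below are assumptions:
from concurrent.futures import ThreadPoolExecutor
import pyodbc

# Hypothetical connection strings and procedure names, one per source server.
SOURCES = [
    ("DRIVER={ODBC Driver 17 for SQL Server};SERVER=srv1;DATABASE=etl;Trusted_Connection=yes", "dbo.usp_pull_srv1"),
    # ... four more entries (srv2..srv5)
]

def run_proc(conn_str, proc_name):
    # A separate connection per call lets the procedures genuinely run in parallel.
    conn = pyodbc.connect(conn_str, autocommit=True)
    try:
        conn.cursor().execute(f"EXEC {proc_name};")
    finally:
        conn.close()

with ThreadPoolExecutor(max_workers=len(SOURCES)) as pool:
    futures = [pool.submit(run_proc, cs, proc) for cs, proc in SOURCES]
    for f in futures:
        f.result()   # re-raise any failure from an individual procedure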

Why is an implicit table lock being released prior to end of transaction in RedShift?

I have an ETL process that is building dimension tables incrementally in RedShift. It performs actions in the following order:
Begins transaction
Creates a table staging_foo like foo
Copies data from external source into staging_foo
Performs mass insert/update/delete on foo so that it matches staging_foo
Drops staging_foo
Commits the transaction
Individually this process works, but in order to achieve continuous streaming refreshes of foo and redundancy in the event of failure, I have several instances of the process running at the same time. When that happens I occasionally get concurrent serialization errors, because both processes are replaying some of the same changes to foo from staging_foo in overlapping transactions.
What happens is that the first process creates the staging_foo table, and the second process is blocked when it attempts to create a table with the same name (this is what I want). When the first process commits its transaction (which can take several seconds) I find that the second process gets unblocked before the commit is complete. So it appears to be getting a snapshot of the foo table before the commit is in place, which causes the inserts/updates/deletes (some of which may be redundant) to fail.
I am theorizing based on the documentation http://docs.aws.amazon.com/redshift/latest/dg/c_serial_isolation.html where it says:
Concurrent transactions are invisible to each other; they cannot detect each other's changes. Each concurrent transaction will create a snapshot of the database at the beginning of the transaction. A database snapshot is created within a transaction on the first occurrence of most SELECT statements, DML commands such as COPY, DELETE, INSERT, UPDATE, and TRUNCATE, and the following DDL commands :
ALTER TABLE (to add or drop columns)
CREATE TABLE
DROP TABLE
TRUNCATE TABLE
The documentation quoted above is somewhat confusing to me because it first says a snapshot will be created at the beginning of a transaction, but subsequently says a snapshot will be created only at the first occurrence of some specific DML/DDL operations.
I do not want to do a deep copy where I replace foo instead of incrementally updating it. I have other processes that continually query this table so there is never a time when I can replace it without interruption. Another question asks a similar question for deep copy but it will not work for me: How can I ensure synchronous DDL operations on a table that is being replaced?
Is there a way for me to perform my operations in a way that I can avoid concurrent serialization errors? I need to ensure that read access is available for foo so I can't LOCK that table.
OK, Postgres (and therefore Redshift, more or less) uses MVCC (Multi-Version Concurrency Control) for transaction isolation instead of a db/table/row/page locking model (as seen in SQL Server, MySQL, etc.). Simplistically, every transaction operates on the data as it existed when the transaction started.
So your comment "I have several instances of the process running at the same time" explains the problem. If Process 2 starts while Process 1 is running then Process 2 has no visibility of the results from Process 1.
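Given that MVCC behaviour, a common mitigation (a sketch, not part of the answer above; psycopg2 and the error-1023 text check are assumptions) is to catch the serializable isolation violation and replay the whole transaction:
import time
import psycopg2

def refresh_foo_with_retry(conn_params, statements, max_attempts=5):
    """Run the staged refresh in one transaction, retrying from scratch if
    Redshift reports a serializable isolation violation (error 1023)."""
    for attempt in range(max_attempts):
        conn = psycopg2.connect(**conn_params)
        try:
            with conn:                       # one transaction; commit on success
                with conn.cursor() as cur:
                    for sql in statements:   # e.g. CREATE TABLE staging_foo ..., COPY, merge, DROP
                        cur.execute(sql)
            return
        except psycopg2.Error as exc:
            # Redshift reports the serialization failure with error 1023;
            # this string check is an assumption, adjust to your driver's details.
            if "1023" not in str(exc) and "serializable" not in str(exc).lower():
                raise
            time.sleep(2 ** attempt)         # back off, then replay the transaction
        finally:
            conn.close()
    raise RuntimeError("refresh still failing after retries")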

Running pg_dump while insert queries are still running?

If I run pg_dump to dump a table into a SQL file, does it take a snapshot of the last row in the table, and dump all the rows up to this row?
Or does it keep dumping all the rows, even those that were inserted after pg_dump was run?
A secondary question is: Is it a good idea to stop all insert queries before running pg_dump?
It will obtain a shared lock on your tables when you run pg_dump. Any transactions completed after you start the dump will not be included. So when the dump is finished, if there are transactions in progress that haven't been committed, they won't be included in the dump.
There is another pg_dump option with which it can be run:
--lock-wait-timeout=timeout
Do not wait forever to acquire shared table locks at the beginning of
the dump. Instead fail if unable to lock a table within the specified
timeout. The timeout may be specified in any of the formats accepted
by SET statement_timeout. (Allowed formats vary depending on the
server version you are dumping from, but an integer number of
milliseconds is accepted by all versions.)
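For example, invoked from Python (a sketch; the 30-second timeout, the table, and the database name are arbitrary placeholders):
import subprocess

# Fail fast instead of waiting indefinitely for the shared table locks.
subprocess.run(
    [
        "pg_dump",
        "--lock-wait-timeout=30000",   # 30 000 ms, accepted by all server versions
        "--table=my_table",            # hypothetical table name
        "--file=my_table_dump.sql",
        "my_database",                 # hypothetical database name
    ],
    check=True,                        # raise if pg_dump exits with an error
)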