Airflow DAG Steps Dependencies - amazon-emr

I have an Airflow DAG written as below:
with DAG(
    dag_id='dag',
    default_args={
        'owner': 'airflow',
        'depends_on_past': False,
        'email': ['airflow@example.com'],
        'email_on_failure': False,
        'email_on_retry': False,
    },
    dagrun_timeout=timedelta(hours=2),
    start_date=datetime(2021, 9, 28, 11),
    schedule_interval='10 * * * *'
) as dag:
    create_job_flow = EmrCreateJobFlowOperator(
        task_id='create_job_flow',
        job_flow_overrides=JOB_FLOW_OVERRIDES,
        aws_conn_id='aws_default',
        emr_conn_id='emr_default',
    )
    job_step = EmrAddStepsOperator(
        task_id='job_step',
        job_flow_id=create_job_flow.output,
        aws_conn_id='aws_default',
        steps=JOB_SETP,
    )
    job_step_sensor = EmrStepSensor(
        task_id='job_step_sensor',
        job_flow_id=create_job_flow.output,
        step_id="{{ task_instance.xcom_pull(task_ids='job_step', key='return_value')[0] }}",
        aws_conn_id='aws_default',
    )
    read_file = PythonOperator(
        task_id="read_file",
        python_callable=get_date_information
    )
    alter_partitions = PythonOperator(
        task_id="alter_partitions",
        python_callable=update_partitions
    )
    remove_cluster = EmrTerminateJobFlowOperator(
        task_id='remove_cluster',
        job_flow_id=create_job_flow.output,
        aws_conn_id='aws_default',
    )
    create_job_flow.set_downstream(job_step)
    job_step.set_downstream(job_step_sensor)
    job_step_sensor.set_downstream(read_file)
    read_file.set_downstream(alter_partitions)
    alter_partitions.set_downstream(remove_cluster)
So this basically creates an EMR cluster, adds a step to it, and senses that step. It then executes some Python functions and finally terminates the cluster. The view of the DAG in the Airflow UI is as below:
Here, create_job_flow also points to remove_cluster (maybe because job_flow_id has a reference to create_job_flow), whereas I only set remove_cluster downstream of alter_partitions. Could it happen that the cluster is removed before job_step is reached? In that case the cluster would already be deleted before job_step executes, which is the problem. Is there any way to remove the link between create_job_flow and remove_cluster? Or will it wait for alter_partitions to finish and only then execute remove_cluster?

The "remove_cluster" task will wait until the "alter_partitions" task is completed. The extra edge between "create_job_flow" and "remove_cluster" (as well as between "create_job_flow" and "job_step_sensor") is a feature of the TaskFlow API and the XComArg concept, namely the use of an operator's .output property. (Check out this documentation for another example.)
In both the "remove_cluster" and "job_step_sensor" tasks, job_flow_id=create_job_flow.output is an input arg. Behind the scenes, when an operator's .output is used in a templated field as an input of another task, a dependency is automatically created. This feature makes explicit what were previously implicit dependencies between tasks that use other tasks' XComs.
This pipeline will execute serially as written and desired (assuming the trigger_rule is "all_success" which is the default). The "remove_cluster" task won't execute until both the "create_job_flow" and "alter_partitions" tasks are complete (which is effectively a serial execution).
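As a minimal, made-up illustration of that behavior (not the DAG above; the task ids and the BashOperator/PythonOperator pairing are placeholders), passing an operator's .output into any templated field is enough to create the edge, with no explicit set_downstream call:
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def consume_value(value):
    # Receives the rendered XCom value at runtime.
    print(f"upstream returned {value}")

with DAG(
    dag_id="xcomarg_edge_demo",
    start_date=datetime(2021, 9, 28),
    schedule_interval=None,
) as dag:
    # BashOperator pushes the last line of stdout to XCom ("return_value").
    produce = BashOperator(task_id="produce", bash_command="echo 42")

    # Using produce.output (an XComArg) in a templated field such as
    # op_kwargs automatically adds a produce -> consume edge -- the same
    # mechanism that draws create_job_flow -> remove_cluster above.
    consume = PythonOperator(
        task_id="consume",
        python_callable=consume_value,
        op_kwargs={"value": produce.output},
    )
The graph will show produce feeding into consume even though no >> or set_downstream was written.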

Related

Sensor file from gcs to bigquery. If success load it to GCS bucket, else load it to another bucket

I want to build a DAG like this:
Description:
I have a file named population_YYYYMMDD.csv locally. Then I load it into a GCS bucket, folder A, using GCSObjectExistenceSensor => Done
Then I transform it using DataflowTemplatedJobStartOperator, transforming things like column names, data types, ...
Based on whether the population_YYYYMMDD transformation succeeded or failed:
If it succeeded, I want to load it into BigQuery (dataset A, table named population_YYYYMMDD), and the CSV file will move to another folder, a Success folder (the same or a different bucket is also OK).
If it failed, the CSV file will move to a Failure folder.
You can make use of the BranchPythonOperator in Airflow to implement the conditional block in your case. This operator calls a Python callable which in turn returns the task_id of the next task to be executed.
Since you've not shared any code, here is a simple example DAG you can follow. Please feel free to replace the DummyOperator with your requirement-specific operators.
from airflow import DAG
from airflow.operators.python import BranchPythonOperator
from airflow.operators.dummy import DummyOperator
from datetime import datetime
from airflow.models.taskinstance import TaskInstance

default_args = {
    'start_date': datetime(2020, 1, 1)
}

def _choose_best_model(**kwargs):
    dag_instance = kwargs['dag']
    execution_date = kwargs['execution_date']
    operator_instance = dag_instance.get_task("Dataflow-operator")
    task_status = TaskInstance(operator_instance, execution_date).current_state()
    if task_status == 'success':
        return 'upload-to-BQ'
    else:
        return 'upload-to-GCS'

with DAG('branching', schedule_interval=None, default_args=default_args, catchup=False) as dag:
    task1 = DummyOperator(task_id='Local-to-GCS')
    task2 = DummyOperator(task_id='Dataflow-operator')
    branching = BranchPythonOperator(
        task_id='conditional-task',
        python_callable=_choose_best_model,
        provide_context=True
    )
    accurate = DummyOperator(task_id='upload-to-BQ')
    inaccurate = DummyOperator(task_id='upload-to-GCS')

    task1 >> task2 >> branching
    branching >> [accurate, inaccurate]
Here the _choose_best_model function is called by the BranchPythonOperator. It determines the state of Task2 (which in your case would be the DataflowTemplatedJobStartOperator) and, if that task succeeded, returns the one task to run, skipping the other.

Airflow/SQLAlchemy Error - Loading context has changed within a load/refresh handler

I am attempting to use clairvoyant's db-cleanup DAG to clear metadata in our XCom table, but when I run it, I receive the following warning, printed thousands of times before I manually stop the job in order to not take down our MySQL instance:
SAWarning: Loading context for <BaseXCom at 0x7f26f789b370> has changed within a load/refresh handler, suggesting a row refresh operation took place. If this event handler is expected to be emitting row refresh operations within an existing load or refresh operation, set restore_load_context=True when establishing the listener to ensure the context remains unchanged when the event handler completes.
The other cleanup tasks work fine, but it is the xcom table in particular I am having trouble with. We have hundreds/thousands of active dags and so the xcom table is constantly being written to nearly every second or two. I think that is what is causing this error, the fact that the data is continually changing while it is being queried.
I have been unable to find the cause of this or any examples of how this can be resolved. I tried adding a "restore_load_context":True line as per SQLAlchemy docs but it did not work.
Here are the snippets I attempted to add to the database object and the cleanup task:
{
    "airflow_db_model": XCom,
    "age_check_column": XCom.execution_date,
    "keep_last": False,
    "keep_last_filters": None,
    "keep_last_group_by": None,
    "restore_load_context": True
},
....
def cleanup_function(**context):
    logging.info("Retrieving max_execution_date from XCom")
    max_date = context["ti"].xcom_pull(
        task_ids=print_configuration.task_id, key="max_date"
    )
    max_date = dateutil.parser.parse(max_date)  # stored as iso8601 str in xcom
    airflow_db_model = context["params"].get("airflow_db_model")
    state = context["params"].get("state")
    age_check_column = context["params"].get("age_check_column")
    keep_last = context["params"].get("keep_last")
    keep_last_filters = context["params"].get("keep_last_filters")
    keep_last_group_by = context["params"].get("keep_last_group_by")
    restore_load_context = context["params"].get("restore_load_context")
To avoid pasting too much code here: everything else is the same as the db-cleanup DAG. Has anyone encountered this and found a way to resolve it?
I am very inexperienced with SQLAlchemy and am entirely unsure where else to place this setting or how to go about it.
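(For what it's worth, in SQLAlchemy itself restore_load_context is an option of the event-listener registration, i.e. the load/refresh instance events, rather than a key read from the db-cleanup DAG's params dict, so a sketch of what the SQLAlchemy docs describe would look roughly like the following; the handler name and body are purely illustrative.)
from sqlalchemy import event
from airflow.models import XCom

# Illustrative listener only; the relevant part is restore_load_context=True
# being passed where the listener is registered, per the SAWarning's advice.
@event.listens_for(XCom, "load", restore_load_context=True)
def _on_xcom_load(target, context):
    # Any row refresh triggered in here no longer clobbers the outer
    # loading context once restore_load_context=True is set.
    pass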

Azure ADF - how to proceed with pipeline activities only after new files arrival?

I have written a generic data-file arrival-checking routine using Databricks notebooks, which accepts filenames and a time that specifies the acceptable freshness of the files. Many pipelines use this notebook, passing tuples of filenames, and at the end the notebook returns True or False to indicate whether the next workflow activity can start. So far so good.
Now my question is: how do I use this in an Azure ADF pipeline so that, if it fails, the pipeline waits 30 minutes or so and then checks again by re-running the notebook?
The notebook should run first, so that if the new files are already there it does not wait.
Since you are talking about the Notebook activity, you can add a Wait activity on its "on failure" path and set the wait time. After the Wait, add an Execute Pipeline activity. That Execute Pipeline activity should point to a pipeline containing another Execute Pipeline activity, which points back to the main pipeline that has the Notebook activity. Basically this is just a cycle, but it will only execute when you have a failure.

getting running job id from BigQueryOperator using xcom

I want to get the BigQuery job id from the BigQueryOperator.
In the bigquery_operator.py file I saw the following line:
context['task_instance'].xcom_push(key='job_id', value=job_id)
I don't know if this is Airflow's job id or the BigQuery job id. If it's the BigQuery job id, how can I get it via XCom from a downstream task?
I tried to do the following in a downstream PythonOperator:
def write_statistics(**kwargs):
    job_id = kwargs['templates_dict']['job_id']
    print('tamir')
    print(kwargs['ti'].xcom_pull(task_ids='create_tmp_big_query_table', key='job_id'))
    print(kwargs['ti'])
    print(job_id)

t3 = BigQueryOperator(
    task_id='create_tmp_big_query_table',
    bigquery_conn_id='bigquery_default',
    destination_dataset_table=DATASET_TABLE_NAME,
    use_legacy_sql=False,
    write_disposition='WRITE_TRUNCATE',
    sql="""
        #standardSQL...
The UI is great for checking whether an XCom was written to or not, which I'd recommend you do even before you try to reference it in a separate task so you don't need to worry about whether you're fetching it correctly or not. Click your create_tmp_big_query_table task -> Task Instance Details -> XCom. It'll look something like the following:
In your case, the code looks right to me, but I'm guessing your version of Airflow doesn't have the change that added saving the job id into an XCom. This feature was added in https://github.com/apache/airflow/pull/5195, which is currently only on master and not yet part of the most recent stable release (1.10.3). See for yourself in the 1.10.3 version of the BigQueryOperator.
Your options are to wait for it to land in a release (which sometimes takes a while), run off a version of master with that change, or temporarily copy the newer version of the operator as a custom operator. In the last case, I'd suggest naming it something like BigQueryOperatorWithXcom, with a note to replace it with the built-in operator once it's released.
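If you go the custom-operator route, a rough sketch (assuming Airflow 1.10.x, where the hook's cursor keeps the last submitted job id in running_job_id; attribute names can differ between versions, so treat this as an illustration rather than the shipped operator) might look like:
from airflow.contrib.operators.bigquery_operator import BigQueryOperator

class BigQueryOperatorWithXcom(BigQueryOperator):
    """Temporary stand-in until the built-in operator pushes job_id itself."""

    def execute(self, context):
        # Run the query as usual; the base execute() sets up self.bq_cursor.
        super().execute(context)
        # The cursor records the id of the last submitted job after run_query.
        job_id = self.bq_cursor.running_job_id
        context['task_instance'].xcom_push(key='job_id', value=job_id)
        return job_id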
The job id within bigquery_operator.py is the BigQuery job id. You can see that by looking at the preceding lines:
if isinstance(self.sql, str):
    job_id = self.bq_cursor.run_query(
        sql=self.sql,
        destination_dataset_table=self.destination_dataset_table,
        write_disposition=self.write_disposition,
        allow_large_results=self.allow_large_results,
        flatten_results=self.flatten_results,
        udf_config=self.udf_config,
        maximum_billing_tier=self.maximum_billing_tier,
        maximum_bytes_billed=self.maximum_bytes_billed,
        create_disposition=self.create_disposition,
        query_params=self.query_params,
        labels=self.labels,
        schema_update_options=self.schema_update_options,
        priority=self.priority,
        time_partitioning=self.time_partitioning,
        api_resource_configs=self.api_resource_configs,
        cluster_fields=self.cluster_fields,
        encryption_configuration=self.encryption_configuration
    )
elif isinstance(self.sql, Iterable):
    job_id = [
        self.bq_cursor.run_query(
            sql=s,
            destination_dataset_table=self.destination_dataset_table,
            write_disposition=self.write_disposition,
            allow_large_results=self.allow_large_results,
            flatten_results=self.flatten_results,
            udf_config=self.udf_config,
            maximum_billing_tier=self.maximum_billing_tier,
            maximum_bytes_billed=self.maximum_bytes_billed,
            create_disposition=self.create_disposition,
            query_params=self.query_params,
            labels=self.labels,
            schema_update_options=self.schema_update_options,
            priority=self.priority,
            time_partitioning=self.time_partitioning,
            api_resource_configs=self.api_resource_configs,
            cluster_fields=self.cluster_fields,
            encryption_configuration=self.encryption_configuration
        )
        for s in self.sql]
Eventually, the run_with_configuration method returns self.running_job_id from BigQuery.

how should I use the right owner task in airflow?

I don't understand the "owner" field in Airflow. The comment for owner says "the owner of the task, using the unix username is recommended".
I wrote the following code.
default_args = {
    'owner': 'max',
    'depends_on_past': False,
    'start_date': datetime(2016, 7, 14),
    'email': ['max@test.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG('dmp-annalect', default_args=default_args,
          schedule_interval='30 0 * * *')

pigjob_basedata = """{local_dir}/src/basedata/basedata.sh >
{local_dir}/log/basedata/run_log &
""".format(local_dir=WORKSPACE)

task1_pigjob_basedata = BashOperator(
    task_id='task1_pigjob_basedata',
    owner='max',
    bash_command=pigjob_basedata,
    dag=dag)
But when I ran the command "airflow test dagid taskid 2016-07-20", I got an error:
...
{bash_operator.py:77} INFO - put: Permission denied: user=airflow,
....
I thought that my job would run as the 'max' user, but apparently the test ran as the 'airflow' user.
How can I make my task run as the 'max' user?
I figured out this issue: because I set AIRFLOW_HOME to /home/airflow/, only the airflow user can access that directory.
I've mitigated this by adding user airflow and all other users who own tasks into a group, then giving the entire group permission to read, write and execute files within airflow's home. Not sure if this is best practice, but it works and makes the owner field more useful than setting airflow as the owner of every DAG.
You can filter the list of DAGs in the webserver by owner name (when authentication is turned on) by setting webserver:filter_by_owner in your config. With this, a user will see only the DAGs of which it is the owner, unless it is a superuser.
[webserver]
filter_by_owner = True