How should I use the right owner for a task in Airflow? - owner

I don't understand the "owner" in Airflow. The comment for owner says "the owner of the task, using the unix username is recommended".
I wrote the following code:
default_args = {
    'owner': 'max',
    'depends_on_past': False,
    'start_date': datetime(2016, 7, 14),
    'email': ['max@test.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG('dmp-annalect', default_args=default_args,
          schedule_interval='30 0 * * *')

pigjob_basedata = """{local_dir}/src/basedata/basedata.sh >
{local_dir}/log/basedata/run_log &
""".format(local_dir=WORKSPACE)

task1_pigjob_basedata = BashOperator(
    task_id='task1_pigjob_basedata', owner='max',
    bash_command=pigjob_basedata,
    dag=dag)
But when I ran the command "airflow test dagid taskid 2016-07-20", I got this error:
...
{bash_operator.py:77} INFO - put: Permission denied: user=airflow,
....
I thought my job would run as the "max" user, but apparently the test ran as the 'airflow' user.
How can I get my task to run as the 'max' user?

I figured out this issue.
Because I set AIRFLOW_HOME to /home/airflow/, only the airflow user can access that directory.

I've mitigated this by adding the airflow user and all other users who own tasks to a group, then giving the entire group permission to read, write and execute files within airflow's home. Not sure if this is best practice, but it works and makes the owner field more useful than setting airflow as the owner of every DAG.

You can filter the list of DAGs in the webserver by owner name when authentication is turned on, by setting webserver:filter_by_owner in your config. With this, a user will see only the DAGs they own, unless they are a superuser.
[webserver]
filter_by_owner = True

Related

Sensor file from GCS to BigQuery: if success, load it to a GCS bucket, else load it to another bucket

I want to build a DAG like this:
Description:
I have a file named population_YYYYMMDD.csv locally. I load it into a GCS bucket - folder A - using GCSObjectExistenceSensor => Done
Then I transform it using DataflowTemplatedJobStartOperator (transformations such as column names, data types, ...).
Based on whether the population_YYYYMMDD job succeeds or fails:
If it succeeds, I want to load it into BigQuery - dataset A, a table named population_YYYYMMDD - and move the CSV file to another folder, a Success folder (the same or a different bucket is also fine).
If it fails, the CSV file will move to a Failure folder.
You can make use of the BranchPythonOperator in Airflow to implement the conditional block in your case. This operator calls a Python callable which in turn returns the task_id of the next task to be executed.
Since you've not shared any code, here is a simple example DAG you can follow. Please feel free to replace the DummyOperator with your requirement-specific operators.
from airflow import DAG
from airflow.operators.python import BranchPythonOperator
from airflow.operators.dummy import DummyOperator
from datetime import datetime
from airflow.models.taskinstance import TaskInstance

default_args = {
    'start_date': datetime(2020, 1, 1)
}

def _choose_best_model(**kwargs):
    dag_instance = kwargs['dag']
    execution_date = kwargs['execution_date']
    operator_instance = dag_instance.get_task("Dataflow-operator")
    task_status = TaskInstance(operator_instance, execution_date).current_state()
    if task_status == 'success':
        return 'upload-to-BQ'
    else:
        return 'upload-to-GCS'

with DAG('branching', schedule_interval=None, default_args=default_args, catchup=False) as dag:
    task1 = DummyOperator(task_id='Local-to-GCS')
    task2 = DummyOperator(task_id='Dataflow-operator')
    branching = BranchPythonOperator(
        task_id='conditional-task',
        python_callable=_choose_best_model,
        provide_context=True
    )
    accurate = DummyOperator(task_id='upload-to-BQ')
    inaccurate = DummyOperator(task_id='upload-to-GCS')

    task1 >> task2
    branching >> [accurate, inaccurate]
Here the _choose_best_model function is called by the BranchPythonOperator. It determines the state of Task2 (which in your case would be the DataflowTemplatedJobStartOperator) and, if that task succeeded, returns the one task_id, skipping the other.
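One thing to double-check when adapting this: as written, nothing forces conditional-task to run after Dataflow-operator, so the state check could happen before that task has finished. If you want the branch evaluated only once the Dataflow task is complete, replace the two dependency lines with a single chain inside the DAG block:

    # ensure the branch runs only after the Dataflow task has finished
    task1 >> task2 >> branching >> [accurate, inaccurate]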

Airflow DAG Steps Dependencies

I have an Airflow DAG written as below:
with DAG(
    dag_id='dag',
    default_args={
        'owner': 'airflow',
        'depends_on_past': False,
        'email': ['airflow@example.com'],
        'email_on_failure': False,
        'email_on_retry': False,
    },
    dagrun_timeout=timedelta(hours=2),
    start_date=datetime(2021, 9, 28, 11),
    schedule_interval='10 * * * *'
) as dag:
    create_job_flow = EmrCreateJobFlowOperator(
        task_id='create_job_flow',
        job_flow_overrides=JOB_FLOW_OVERRIDES,
        aws_conn_id='aws_default',
        emr_conn_id='emr_default',
    )
    job_step = EmrAddStepsOperator(
        task_id='job_step',
        job_flow_id=create_job_flow.output,
        aws_conn_id='aws_default',
        steps=JOB_SETP,
    )
    job_step_sensor = EmrStepSensor(
        task_id='job_step_sensor',
        job_flow_id=create_job_flow.output,
        step_id="{{ task_instance.xcom_pull(task_ids='job_step', key='return_value')[0] }}",
        aws_conn_id='aws_default',
    )
    read_file = PythonOperator(
        task_id="read_file",
        python_callable=get_date_information
    )
    alter_partitions = PythonOperator(
        task_id="alter_partitions",
        python_callable=update_partitions
    )
    remove_cluster = EmrTerminateJobFlowOperator(
        task_id='remove_cluster',
        job_flow_id=create_job_flow.output,
        aws_conn_id='aws_default',
    )

    create_job_flow.set_downstream(job_step)
    job_step.set_downstream(job_step_sensor)
    job_step_sensor.set_downstream(read_file)
    read_file.set_downstream(alter_partitions)
    alter_partitions.set_downstream(remove_cluster)
So this basically creates an EMR cluster, starts a step in it and senses that step, then executes some Python functions and finally terminates the cluster. The view of the DAG in the Airflow UI is as below:
Here, create_job_flow also points to remove_cluster (maybe because job_flow_id holds a reference to create_job_flow), whereas I only set remove_cluster downstream of alter_partitions. Could it happen that the cluster is removed before reaching job_step, so that the cluster is already deleted before the job step is executed, and hence the problem? Is there any way to remove the link between create_job_flow and remove_cluster? Or will it wait for alter_partitions to finish and only then execute remove_cluster?
The "remove_cluster" task will wait until the "alter_partitions" task is completed. The extra edge between "create_job_flow" and "remove_cluster" (as well as between "create_job_flow" and "job_step_sensor") is a feature of the TaskFlow API and the XComArg concept, namely the use of an operator's .output property. (Check out this documentation for another example.)
In both the "remove_cluster" and "job_step_sensor" tasks, job_flow_id=create_job_flow.output is an input arg. Behind the scenes, when an operator's .output is used in a templated field as an input of another task, a dependency is automatically created. This feature ensures what were previously implicit task dependencies between tasks using other tasks' XComs are now explicit.
This pipeline will execute serially as written and desired (assuming the trigger_rule is "all_success" which is the default). The "remove_cluster" task won't execute until both the "create_job_flow" and "alter_partitions" tasks are complete (which is effectively a serial execution).
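To see this in isolation, here is a minimal sketch (generic PythonOperators, not your EMR operators) showing that simply referencing one operator's .output in a templated field of another is enough to create the edge, with no explicit set_downstream call:

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG('xcomarg_edge_demo', start_date=datetime(2021, 1, 1),
         schedule_interval=None, catchup=False) as dag:

    produce = PythonOperator(
        task_id='produce',
        python_callable=lambda: 'job-flow-id-placeholder',  # return value is pushed to XCom
    )

    consume = PythonOperator(
        task_id='consume',
        python_callable=lambda value: print(value),
        # op_args is a templated field, so passing produce.output (an XComArg) here
        # implicitly creates the produce >> consume edge, just like
        # job_flow_id=create_job_flow.output does in your DAG.
        op_args=[produce.output],
    )
    # No explicit produce >> consume is needed; the graph view already shows the edge.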

Submitting Spark Job to Livy (in EMR) from Airflow (using airflow Livy operator)

I am trying to schedule a job on EMR using the Airflow Livy operator. Here is the example code I followed. The issue is that the Livy connection string (host name & port) is specified nowhere. How do I provide the Livy server host name & port to the operator?
Also, the operator has a parameter livy_conn_id, which in the example is set to a value of livy_conn_default. Is that the right value, or do I have to set some other value?
You should have livy_conn_default under Connections in the Admin tab of your Airflow dashboard. If that's set up correctly then yes, you can use it. Otherwise, you can change it or create another connection id and use that in livy_conn_id.
There are 2 APIs we can use to connect Livy and Airflow:
Using LivyBatchOperator
Using LivyOperator
In the following example, I will cover the LivyOperator API.
LivyOperator
Step1: Update the Livy connection:
Log in to the Airflow UI --> click on the Admin tab --> Connections --> search for livy. Click on the edit button and update the Host and Port parameters.
Step2: Install the apache-airflow-providers-apache-livy package:
pip install apache-airflow-providers-apache-livy
Step3: Create the DAG file under the $AIRFLOW_HOME/dags directory.
vi $AIRFLOW_HOME/dags/livy_operator_sparkpi_dag.py
from datetime import timedelta, datetime
from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.providers.apache.livy.operators.livy import LivyOperator

default_args = {
    'owner': 'RangaReddy',
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
}

# Initiate DAG
livy_operator_sparkpi_dag = DAG(
    dag_id="livy_operator_sparkpi_dag",
    default_args=default_args,
    schedule_interval='@once',
    start_date=datetime(2022, 3, 2),
    tags=['example', 'spark', 'livy']
)

# define livy task with LivyOperator
livy_sparkpi_submit_task = LivyOperator(
    file="/root/spark-3.2.1-bin-hadoop3.2/examples/jars/spark-examples_2.12-3.2.1.jar",
    class_name="org.apache.spark.examples.SparkPi",
    driver_memory="1g",
    driver_cores=1,
    executor_memory="1g",
    executor_cores=2,
    num_executors=1,
    name="LivyOperator SparkPi",
    task_id="livy_sparkpi_submit_task",
    dag=livy_operator_sparkpi_dag,
)

begin_task = DummyOperator(task_id="begin_task")
end_task = DummyOperator(task_id="end_task")

begin_task >> livy_sparkpi_submit_task >> end_task
Step4: Trigger the DAG and verify the job output via the Livy REST API (batch id 0 here):
LIVY_HOST=192.168.0.1
curl http://${LIVY_HOST}:8998/batches/0/log | python3 -m json.tool
Output:
"Pi is roughly 3.14144103141441"

Dataflow job fails and tries to create temp_dataset on BigQuery

I'm running a simple Dataflow job to read data from one table and write back to another.
The job fails with the error:
Workflow failed. Causes: S01:ReadFromBQ+WriteToBigQuery/WriteToBigQuery/NativeWrite failed., BigQuery creating dataset "_dataflow_temp_dataset_18172136482196219053" in project "[my project]" failed., BigQuery execution failed., Error:
Message: Access Denied: Project [my project]: User does not have bigquery.datasets.create permission in project [my project].
I'm not trying to create any dataset though; it's basically trying to create a temp_dataset because the job fails. But I don't get any information on the real error behind the scenes.
The reading isn't the issue, it's really the writing step that fails. I don't think it's related to permissions; my question is more about how to get the real error rather than this one.
Any idea how to work around this issue?
Here's the code:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, GoogleCloudOptions, StandardOptions, WorkerOptions
from sys import argv
options = PipelineOptions(flags=argv)
google_cloud_options = options.view_as(GoogleCloudOptions)
google_cloud_options.project = "prj"
google_cloud_options.job_name = 'test'
google_cloud_options.service_account_email = "mysa"
google_cloud_options.staging_location = 'gs://'
google_cloud_options.temp_location = 'gs://'
options.view_as(StandardOptions).runner = 'DataflowRunner'
worker_options = options.view_as(WorkerOptions)
worker_options.subnetwork = 'subnet'
with beam.Pipeline(options=options) as p:
    query = "SELECT ..."
    bq_source = beam.io.BigQuerySource(query=query, use_standard_sql=True)
    bq_data = p | "ReadFromBQ" >> beam.io.Read(bq_source)
    table_schema = ...
    bq_data | beam.io.WriteToBigQuery(
        project="prj",
        dataset="test",
        table="test",
        schema=table_schema,
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND
    )
When using BigQuerySource, the SDK creates a temporary dataset and stores the output of the query in a temporary table. It then issues an export from that temporary table to read the results.
So creating this temp_dataset is expected behavior, which means it is probably not hiding another error.
This is not very well documented but can be seen in the implementation of the BigQuerySource by following the read call: BigQuerySource.reader() --> BigQueryReader() --> BigQueryReader().__iter__() --> BigQueryWrapper.run_query() --> BigQueryWrapper._start_query_job().
You can specify the dataset to use. That way the process doesn't create a temp dataset.
Example:
TypedRead<TableRow> read = BigQueryIO.readTableRowsWithSchema()
.fromQuery("selectQuery").withQueryTempDataset("existingDataset")
.usingStandardSql().withMethod(TypedRead.Method.DEFAULT);
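The snippet above uses the Java SDK. For the Python SDK (as in the question), a rough equivalent is sketched below, assuming a recent Apache Beam release in which ReadFromBigQuery accepts a temp_dataset argument; "existingDataset" is a placeholder for a dataset the job's service account can already use, and options is the same PipelineOptions object built in the question:

import apache_beam as beam
from apache_beam.io.gcp.internal.clients import bigquery as bq

with beam.Pipeline(options=options) as p:
    bq_data = p | "ReadFromBQ" >> beam.io.ReadFromBigQuery(
        query="SELECT ...",
        use_standard_sql=True,
        # Store the query results in an existing dataset instead of letting the SDK
        # create a _dataflow_temp_dataset_* one, so bigquery.datasets.create is not needed.
        temp_dataset=bq.DatasetReference(projectId="prj", datasetId="existingDataset"),
    )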

Unable to load data into a BigQuery table with BigQueryOperator in Composer

I have created a DAG to load data from one BigQuery table into another BigQuery table. I used the BigQueryOperator in Composer, but it is not working as expected and I'm not able to see the error. Can anyone please help me resolve this issue?
I also manually created an empty table, and the data is still not loading into it. Please find the code below and let me know if I missed anything.
from typing import Any
from datetime import datetime, timedelta
import airflow
from airflow import models
from airflow.operators import bash_operator
from airflow.operators.python_operator import PythonOperator
from airflow.contrib.operators.bigquery_operator import BigQueryOperator

sql = """SELECT * FROM `project_id.dataset_name.source_table`"""

DEFAULT_ARGUMENTS = {
    "owner": "Airflow",
    "depends_on_past": False,
    "start_date": datetime(2019, 8, 7),
    "schedule_interval": '0 6 * * *',
    "retries": 10
}

dag = models.DAG(
    dag_id='Bq_to_bq',
    default_args=DEFAULT_ARGUMENTS
)

LOAD_TABLE_TRUNC = BigQueryOperator(
    task_id='load_bq_table_truncate',
    dag=dag,
    bql=sql,
    destination_proect_dataset_table='project-id.dataset-name.table_name',
    write_disposition='WRITE_TRUNCATE',
    create_disposition='CREATE_IF_NEEDED',
    allow_large_results='true',
    use_legacy_sql=False,
)

LOAD_TABLE_APPEND = BigQueryOperator(
    task_id='load_bq_table_append',
    dag=dag,
    bql=sql,
    destination_proect_dataset_table='project-id.dataset-name.table_name',
    write_disposition='WRITE_APPEND',
    create_disposition='CREATE_IF_NEEDED',
    allow_large_results='true',
    use_legacy_sql=False,
)

LOAD_TABLE_TRUNC.set_downstream(LOAD_TABLE_APPEND)
This is how to find the error specific to the DAG's failure.
You can find the error in two ways:
Web interface:
Go to the DAG and select Graph view.
Select the task and click on View Log.
Stackdriver logging:
Go to the URL https://console.cloud.google.com/logs/viewer?project=project_id.
Select 'Cloud Composer Environment' from the first dropdown, followed by the location and DAG name.
Select Error from the Log level dropdown.
As Josh mentioned in his comment on your post, the value for allow_large_results should be True without the quotes. Additionally, I see that you have a typo in your spelling of destination_proect_dataset_table. You're missing a 'j': destination_project_dataset_table
BVSKanth's recommendations for finding DAG errors are also spot-on and worth considering for future debugging.
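Putting both fixes together, the truncate task from the question would look roughly like this (BigQueryOperator, dag and sql as defined in the question's DAG file; only the two arguments mentioned above are changed):

LOAD_TABLE_TRUNC = BigQueryOperator(
    task_id='load_bq_table_truncate',
    dag=dag,
    bql=sql,
    destination_project_dataset_table='project-id.dataset-name.table_name',  # missing 'j' added
    write_disposition='WRITE_TRUNCATE',
    create_disposition='CREATE_IF_NEEDED',
    allow_large_results=True,  # boolean True, not the string 'true'
    use_legacy_sql=False,
)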