My task code is the following.
from airflow.models import DAG
from airflow.operators import BashOperator
from datetime import datetime, timedelta
rootdir = "/tmp/airflow"
default_args = {
'owner': 'max',
'depends_on_past': False,
'start_date': datetime.now(),
'email': ['max@test.com'],
'email_on_failure': False,
'email_on_retry': False,
'retries': 1,
'retry_delay': timedelta(minutes=5),
}
dag = DAG('test3', default_args=default_args,
schedule_interval='*/2 * * * *')
t1 = BashOperator(
task_id='test3-task1',
bash_command='date >> {rootdir}/test3-task1.out'.format(rootdir=rootdir),
owner='max',
dag=dag)
t2 = BashOperator(
task_id='test3-task2',
bash_command='whoami',
retries=3,
owner='max',
dag=dag)
Then I run the command "airflow test test3 test3-task2 2016-07-25" as the Linux user 'airflow'. The output of "whoami" is "airflow".
But I expected the output to be the "owner" of the task.
What am I doing wrong?
Thanks
The following is the output:
[2016-07-25 11:22:37,716] {bash_operator.py:64} INFO - Temporary script location :/tmp/airflowtmpoYNJE8//tmp/airflowtmpoYNJE8/test3-task2U1lpom
[2016-07-25 11:22:37,716] {bash_operator.py:65} INFO - Running command: whoami
[2016-07-25 11:22:37,722] {bash_operator.py:73} INFO - Output:
[2016-07-25 11:22:37,725] {bash_operator.py:77} INFO - airflow
[2016-07-25 11:22:37,725] {bash_operator.py:80} INFO - Command exited with return code 0
You can use the "run_as_user" parameter under default_args:
default_args = {
'owner': 'max',
'depends_on_past': False,
'start_date': datetime.now(),
'email': ['max@test.com'],
'email_on_failure': False,
'email_on_retry': False,
'retries': 1,
'retry_delay': timedelta(minutes=5),
'run_as_user': 'max'
}
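run_as_user can also be set on an individual operator rather than in default_args (a sketch, not from the original answer; note that task-level impersonation generally requires the worker's OS user to be able to sudo to the target user):
t2 = BashOperator(
    task_id='test3-task2',
    bash_command='whoami',
    run_as_user='max',  # per-task impersonation; assumes sudo is configured for the airflow user
    retries=3,
    dag=dag)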
It doesn't look like what you're trying to do is supported. Looking at the source code for both the bash_operator and the BaseOperator, neither makes any attempt to change users before executing the task, unfortunately.
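If run_as_user is not available in your version, one workaround (a sketch, assuming the 'airflow' OS user has passwordless sudo to 'max') is to switch users inside the command itself:
t2 = BashOperator(
    task_id='test3-task2',
    bash_command='sudo -u max whoami',  # assumes passwordless sudo from the airflow user to max
    retries=3,
    dag=dag)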
I'm trying to export scraped items to a json file. Here is the beginning of the code:
import scrapy
from datetime import datetime, timedelta
from os.path import expanduser

class GetId(scrapy.Spider):
name = "get_id"
path = expanduser("~").replace('\\', '/') + '/dox/Getaround/'
days = [0, 1, 3, 7, 14, 21, 26, 31]
dates = []
previous_cars_list = []
previous_cars_id_list = []
crawled_date = datetime.today().date()
for day in days:
market_date = crawled_date + timedelta(days=day)
dates.append(market_date)
# Settings
custom_settings = {
'ROBOTSTXT_OBEY' : False,
'DOWNLOAD_DELAY' : 5,
'CONCURRENT_REQUESTS' : 1,
'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
'AUTOTHROTTLE_ENABLED' : True,
'AUTOTHROTTLE_START_DELAY' : 5,
'LOG_STDOUT' : True,
'LOG_FILE' : path + 'log_get_id.txt',
'FEED_FORMAT': 'json',
'FEED_URI': path + 'cars_id.json',
}
I did this two years ago without any issues. Now, when I run "scrapy crawl get_id" in the Anaconda console, only the log file is exported, not the JSON with the data. The log file contains the following error:
2022-08-25 15:14:48 [scrapy.extensions.feedexport] ERROR: Unknown feed storage scheme: c
Any clue how to deal with this? Thanks
I'm not sure in which version it was introduced, but I always use the FEEDS setting, either in the settings.py file or via the custom_settings class attribute as in your example.
For example:
# Settings
custom_settings = {
'ROBOTSTXT_OBEY' : False,
'DOWNLOAD_DELAY' : 5,
'CONCURRENT_REQUESTS' : 1,
'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
'AUTOTHROTTLE_ENABLED' : True,
'AUTOTHROTTLE_START_DELAY' : 5,
'LOG_STDOUT' : True,
'LOG_FILE' : path + 'log_get_id.txt',
'FEEDS': { # <-- added this
path + 'cars_id.json':{
'format': 'json',
'encoding': 'utf-8'
}
}
}
You can find out all of the possible fields that can be set at https://docs.scrapy.org/en/latest/topics/feed-exports.html#feeds
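The same configuration can also live in the project's settings.py instead of custom_settings (a sketch, assuming the same output filename; a plain relative path is used here, so adjust it to the desired location):
# settings.py
FEEDS = {
    'cars_id.json': {  # relative to the directory scrapy is run from
        'format': 'json',
        'encoding': 'utf-8',
    },
}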
Scrapy 2.1.0 (2020-04-24)
The FEED_FORMAT and FEED_URI settings have been deprecated in favor of
the new FEEDS setting (issue 1336, issue 3858, issue 4507).
https://docs.scrapy.org/en/latest/news.html?highlight=FEED_URI#id30
In this case, there are two options. Either use an older version of Scrapy, e.g.:
pip uninstall Scrapy
pip install Scrapy==2.0
Or use FEEDS; you can read more about it here https://docs.scrapy.org/en/latest/topics/feed-exports.html#feeds
When using BigQueryInsertJobOperator with the configuration set to perform a dry run on a faulty .sql file or a hardcoded query, the task succeeds even though it should fail. The same error is properly raised as a task failure when running with dryRun set to false in the configuration. Below is the code used for testing in Composer (Airflow).
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator
from airflow import DAG
from datetime import datetime
default_args = {
'depends_on_past': False,
}
dag = DAG(dag_id='bq_script_tester',
default_args=default_args,
schedule_interval='@once',
start_date=datetime(2021, 1, 1),
tags=['bq_script_caller']
)
with dag:
job_id = BigQueryInsertJobOperator(
task_id="bq_validator",
configuration={
"query": {
"query": "INSERT INTO `bigquery-public-data.stackoverflow.stackoverflow_posts` values('this is cool');",
"useLegacySql": False,
},
"dryRun": True
},
location="US"
)
How can a BigQuery query be validated using the dryRun option in Composer? Is there an alternative approach in Composer to achieve the same functionality? The alternative should be an operator capable of accepting SQL scripts that contain procedures as well as simple SQL, with support for templating.
Airflow version: 2.1.4
Composer version: 1.17.7
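One possible workaround, not from the original post, is to run the dry run yourself with the google-cloud-bigquery client inside a PythonOperator and let the exception fail the task (a sketch, assuming the client library is available in the Composer environment):
from airflow.operators.python import PythonOperator
from google.cloud import bigquery

def validate_sql(sql, **_):
    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    # client.query() raises google.api_core.exceptions.BadRequest for invalid SQL,
    # which marks the Airflow task as failed.
    client.query(sql, job_config=job_config)

with dag:
    bq_dry_run_validator = PythonOperator(
        task_id="bq_dry_run_validator",
        python_callable=validate_sql,
        op_kwargs={"sql": "SELECT 1"},  # replace with the script or query to validate
    )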
I am trying to use the Ansible find module to delete files matching a given pattern. Before executing the delete part, I want to list the files that will be deleted, showing only the filenames including the path. The default debug output prints a lot of information:
- name: Ansible delete old files from pathslist
find:
paths: "{{ pathslist }}"
patterns:
- "authlog.*"
- "server.log.*"
register: var_log_files_to_delete
- name : get the complete path
set_fact:
files_found_path: "{{ var_log_files_to_delete.files }}"
- debug:
var: files_found_path
This outputs entries like the one below:
{
"atime": 1607759761.7751443,
"ctime": 1615192802.0948966,
"dev": 66308,
"gid": 0,
"gr_name": "root",
"inode": 158570,
"isblk": false,
"ischr": false,
"isdir": false,
"isfifo": false,
"isgid": false,
"islnk": false,
"isreg": true,
"issock": false,
"isuid": false,
"mode": "0640",
"mtime": 1607675101.0750349,
"nlink": 1,
"path": "/var/log/authlog.87",
"pw_name": "root",
"rgrp": true,
"roth": false,
"rusr": true,
"size": 335501,
"uid": 0,
"wgrp": false,
"woth": false,
"wusr": true,
"xgrp": false,
"xoth": false,
"xusr": false
}
I tried files_found_path: "{{ var_log_files_to_delete.files['path'] }}" but it generates an error.
How can I print only the paths?
Thank you
The Jinja2 map filter accepts an attribute parameter to transform a list of dicts into a list of one specific attribute of each element (https://jinja.palletsprojects.com/en/2.11.x/templates/#map):
- name : get the complete path
set_fact:
files_found_path: "{{ var_log_files_to_delete.files | map(attribute='path') | list }}"
For more complex data extraction, there is the json_query filter (https://docs.ansible.com/ansible/latest/user_guide/playbooks_filters.html#selecting-json-data-json-queries)
I am trying to bulk-write a DataFrame to a MySQL database over JDBC. I am using Databricks/pyspark.sql to write the DataFrames to the table. The table has a column that accepts JSON data (binary data). I transformed the JSON object to a StructType with the following structure.
JSON object structure and conversion to DataFrame:
from pyspark.sql.types import StructType, StructField, IntegerType, TimestampType

schema_dict = {'fields': [
{'metadata': {}, 'name': 'dict', 'nullable': True, 'type': {"containsNull": True, "elementType":{'fields': [
{'metadata': {}, 'name': 'y1', 'nullable': True, 'type': 'integer'},
{'metadata': {}, 'name': 'y2', 'nullable': True, 'type': 'integer'}
],"type": 'struct'}, "type": 'array'}}
], 'type': 'struct'}
cSchema = StructType([
    StructField("x1", IntegerType()), StructField("x2", IntegerType()), StructField("x3", IntegerType()),
    StructField("x4", TimestampType()), StructField("x5", IntegerType()), StructField("x6", IntegerType()),
    StructField("x7", IntegerType()), StructField("x8", TimestampType()), StructField("x9", IntegerType()),
    StructField("x10", StructType.fromJson(schema_dict))])
df = spark.createDataFrame(parsedList,schema=cSchema)
The output DataFrame schema:
df:pyspark.sql.dataframe.DataFrame
x1:integer
x2:integer
x3:integer
x4:timestamp
x5:integer
x6:integer
x7:integer
x8:timestamp
x9:integer
x10:struct
dict:array
element:struct
y1:integer
y2:integer
Now I am trying to write this DataFrame to the MySQL table over JDBC:
import urllib
from pyspark.sql import SQLContext
from pyspark.sql.functions import regexp_replace, col
sqlContext = SQLContext(sc)
sqlContext
driver = "org.mariadb.jdbc.Driver"
url = "jdbc:mysql://dburl?rewriteBatchedStatements=true"
trial = "dbname.tablename"
user = "dbuser"
password = "dbpassword"
properties = {
"user": user,
"password": password,
"driver": driver
}
df.write.jdbc(url=url, table=trial, mode="append", properties = properties)
I am getting this error:
An error occurred while calling o2118.jdbc.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 15 in stage 176.0 failed 4 times, most recent failure: Lost task 15.3 in stage 176.0 (TID 9528, 10.168.231.82, executor 5): java.lang.IllegalArgumentException: Can't get JDBC type for struct<dict:array<struct<y1:int,y2:int>>>
Any ideas on how to write a DataFrame that has a JSON column to a MySQL table, or how to solve this issue?
I am using Databricks 5.5 LTS (includes Apache Spark 2.4.3, Scala 2.11)
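One common workaround, not part of the original post, is to serialize the struct column to a JSON string before the JDBC write, since Spark's JDBC writer has no type mapping for struct columns (a sketch; the target MySQL column then needs to accept a string/JSON value):
from pyspark.sql.functions import to_json, col

# x10 becomes a plain string column containing the JSON representation of the struct
df_jdbc = df.withColumn("x10", to_json(col("x10")))
df_jdbc.write.jdbc(url=url, table=trial, mode="append", properties=properties)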
I've been looking through various answers on this topic but haven't been able to get a working solution.
I have Airflow set up to log to S3, but the UI seems to use only the file-based task handler instead of the S3 one specified.
I have the S3 connection set up as follows:
Conn_id = my_conn_S3
Conn_type = S3
Extra = {"region_name": "us-east-1"}
(the ECS instance uses a role that has full S3 permissions)
I have also created a log_config file with the following settings:
remote_log_conn_id = my_conn_S3
encrypt_s3_logs = False
logging_config_class = log_config.LOGGING_CONFIG
task_log_reader = s3.task
And in my log config I have the following setup
LOG_LEVEL = conf.get('core', 'LOGGING_LEVEL').upper()
LOG_FORMAT = conf.get('core', 'log_format')
BASE_LOG_FOLDER = conf.get('core', 'BASE_LOG_FOLDER')
PROCESSOR_LOG_FOLDER = conf.get('scheduler', 'child_process_log_directory')
FILENAME_TEMPLATE = '{{ ti.dag_id }}/{{ ti.task_id }}/{{ ts }}/{{ try_number }}.log'
PROCESSOR_FILENAME_TEMPLATE = '{{ filename }}.log'
S3_LOG_FOLDER = 's3://data-team-airflow-logs/airflow-master-tester/'
LOGGING_CONFIG = {
'version': 1,
'disable_existing_loggers': False,
'formatters': {
'airflow.task': {
'format': LOG_FORMAT,
},
'airflow.processor': {
'format': LOG_FORMAT,
},
},
'handlers': {
'console': {
'class': 'logging.StreamHandler',
'formatter': 'airflow.task',
'stream': 'ext://sys.stdout'
},
'file.processor': {
'class': 'airflow.utils.log.file_processor_handler.FileProcessorHandler',
'formatter': 'airflow.processor',
'base_log_folder': os.path.expanduser(PROCESSOR_LOG_FOLDER),
'filename_template': PROCESSOR_FILENAME_TEMPLATE,
},
# When using s3 or gcs, provide a customized LOGGING_CONFIG
# in airflow_local_settings within your PYTHONPATH, see UPDATING.md
# for details
's3.task': {
'class': 'airflow.utils.log.s3_task_handler.S3TaskHandler',
'formatter': 'airflow.task',
'base_log_folder': os.path.expanduser(BASE_LOG_FOLDER),
's3_log_folder': S3_LOG_FOLDER,
'filename_template': FILENAME_TEMPLATE,
},
},
'loggers': {
'': {
'handlers': ['console'],
'level': LOG_LEVEL
},
'airflow': {
'handlers': ['console'],
'level': LOG_LEVEL,
'propagate': False,
},
'airflow.processor': {
'handlers': ['file.processor'],
'level': LOG_LEVEL,
'propagate': True,
},
'airflow.task': {
'handlers': ['s3.task'],
'level': LOG_LEVEL,
'propagate': False,
},
'airflow.task_runner': {
'handlers': ['s3.task'],
'level': LOG_LEVEL,
'propagate': True,
},
}
}
I can see the logs on S3, but when I navigate to the logs in the UI all I get is:
*** Log file isn't local.
*** Fetching here: http://1eb84d89b723:8793/log/hermes_pull_double_click_click/hermes_pull_double_click_click/2018-02-26T11:22:00/1.log
*** Failed to fetch log file from worker. HTTPConnectionPool(host='1eb84d89b723', port=8793): Max retries exceeded with url: /log/hermes_pull_double_click_click/hermes_pull_double_click_click/2018-02-26T11:22:00/1.log (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fe6940fc048>: Failed to establish a new connection: [Errno -2] Name or service not known',))
I can see in the logs that it's successfully importing log_config.py (I included an __init__.py as well).
I can't see why it's using the FileTaskHandler here instead of the S3 one.
Any help would be great, thanks.
In my scenario it wasn't airflow that was at fault here.
I was able to go to the gitter channel and talk to the guys there.
After putting print statements into the Python code that was running, I was able to catch an exception on this line of code:
https://github.com/apache/incubator-airflow/blob/4ce4faaeae7a76d97defcf9a9d3304ac9d78b9bd/airflow/utils/log/s3_task_handler.py#L119
The exception was a recursion max depth issue on the SSLContext, which after looking around on the web seemed to be coming from using some combination of gevent with gunicorn.
https://github.com/gevent/gevent/issues/903
I switched this back to sync and had to change the AWS ELB listener to TCP, but after that the logs were working fine through the UI.
Hope this helps others.
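For reference, the change described above would typically be made in airflow.cfg (a sketch; exact values depend on your deployment):
[webserver]
# was: worker_class = gevent
worker_class = sync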