BigQueryInsertJobOperator dryRun is returning success instead of failure on composer (airflow) - google-bigquery

When using BigQueryInsertJobOperator with the configuration set to perform a dry run on a faulty .sql file or a hardcoded query, the task succeeds even though it should fail. The same error is properly raised as a task failure when running with dryRun set to false in the configuration. Below is the code used for testing in Composer (Airflow):
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

default_args = {
    'depends_on_past': False,
}

dag = DAG(dag_id='bq_script_tester',
          default_args=default_args,
          schedule_interval='@once',
          start_date=datetime(2021, 1, 1),
          tags=['bq_script_caller']
          )

with dag:
    job_id = BigQueryInsertJobOperator(
        task_id="bq_validator",
        configuration={
            "query": {
                "query": "INSERT INTO `bigquery-public-data.stackoverflow.stackoverflow_posts` values('this is cool');",
                "useLegacySql": False,
            },
            "dryRun": True
        },
        location="US"
    )
How can a BigQuery query be validated using the dryRun option in Composer? Is there an alternative approach in Composer to achieve the same functionality? The alternative should be an operator capable of accepting SQL scripts that contain procedures as well as simple SQL, with support for templating.
Airflow version: 2.1.4
Composer version: 1.17.7
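One possible workaround (not taken from an existing answer): run the dry run yourself with the google-cloud-bigquery client inside a PythonOperator and let it raise on invalid SQL, so the task fails when validation fails. This is a minimal sketch; the validate_sql callable, the task id, and the placeholder query are made up for illustration, and it assumes the google-cloud-bigquery package and BigQuery credentials are available in the Composer environment:
from airflow.operators.python import PythonOperator
from google.cloud import bigquery


def validate_sql(sql: str) -> None:
    """Dry-run the query; the client raises (e.g. BadRequest) on invalid SQL."""
    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=job_config)  # fails the task on syntax/reference errors
    print(f"Query is valid, would process {job.total_bytes_processed} bytes")


with dag:
    bq_dry_run_check = PythonOperator(
        task_id="bq_dry_run_check",
        python_callable=validate_sql,
        op_kwargs={"sql": "SELECT 1"},  # replace with the query or rendered .sql to validate
    )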

Related

How to run Playwright in headless mode?

I created a new Vue app using npm init vue@latest and selected Playwright for e2e tests. I removed firefox and webkit from projects in the playwright.config.ts file, so it will only use chromium.
Running npm run test:e2e works fine, and the process exits with a success code.
When forcing the tests to fail by modifying the ./e2e/vue.spec.ts file, the output is
but the process does not exit with an error code; it still opened browser windows, so CI environments would freeze.
I searched the docs for a specific flag, e.g. "headless", and tried --max-failures -x, but that didn't help.
How can I tell Playwright to run in headless mode and exit with an error code when something failed?
Since playwright.config.ts already makes use of process.env.CI I thought about replacing reporter: "html", with reporter: [["html", { open: !process.env.CI ? "on-failure" : "never" }]],
but which arguments should I add to the script "test:e2e:ci": "playwright test", to ensure process.env.CI is set?
Update
I tried to run the script inside my CI environment and it seems to work out of the box (I don't know how it sets the CI environment flag, but the pipeline did not freeze):
- name: Install Playwright Browsers
  run: npx playwright install --with-deps
- name: Check if e2e tests are passing
  run: npm run test:e2e
If any test fails, it exits with an error code.
It's serving the HTML report and asking you to press 'Ctrl+C' to quit. You can disable this using the configuration below.
// playwright.config.ts
import { PlaywrightTestConfig } from '@playwright/test';

const config: PlaywrightTestConfig = {
  reporter: [ ['html', { open: 'never' }] ],
};

export default config;
Refer - Report Doc
Issue - https://github.com/microsoft/playwright/issues/9702
To add to the answer above, you can set headless: true in the 'use' block of the config which is above the projects block. Anything set at that level will apply to all projects unless you specifically override the setting inside a project specific area:
// playwright.config.ts
import { PlaywrightTestConfig } from '@playwright/test';

const config: PlaywrightTestConfig = {
  reporter: [ ['html', { open: 'never' }] ],
  use: {
    headless: true,
  },
  projects: [
    {
      name: 'chromium',
      use: {
        browserName: 'chromium',
      },
    },
  ],
};

export default config;

Airflow BigQueryInsertJobOperator configuration

I'm having some issues converting from the deprecated BigQueryOperator to BigQueryInsertJobOperator. I have the task below:
bq_extract = BigQueryInsertJobOperator(
    dag="big_query_task",
    task_id='bq_query',
    gcp_conn_id='google_cloud_default',
    params={'data': Utils().querycontext},
    configuration={
        "query": {
            "query": "{% include 'sql/bigquery.sql' %}",
            "useLegacySql": False,
            "writeDisposition": "WRITE_TRUNCATE",
            "destinationTable": {"datasetId": bq_dataset}
        }
    })
This line in my bigquery_extract.sql query is throwing the error:
{% for field in data.bq_fields %}
I want to use 'data' from params, which calls a method that reads from a .json file:
import json
from os import path

from airflow import configuration as conf
from airflow.models import Variable


class Utils():
    bucket = Variable.get('s3_bucket')
    _qcontext = None

    @property
    def querycontext(self):
        if self._qcontext is None:
            self.load_querycontext()
        return self._qcontext

    def load_querycontext(self):
        with open(path.join(conf.get("core", "dags"), 'traffic/bq_query.json')) as f:
            self._qcontext = json.load(f)
The bq_query.json has this format, and I need to use the nested bq_fields list values:
{
    "bq_fields": [
        { "name": "CONCAT(ID, '-', CAST(ID AS STRING))", "alias": "new_id" },
        { "name": "TIMESTAMP(CAST(visitStartTime * 1000 AS INT64))", "alias": "new_timestamp" },
        { "name": "TO_JSON_STRING(hits.experiment)", "alias": "hit_experiment" }
    ]
}
This file has a list which I want to use in the above-mentioned query line, but it's throwing this error:
jinja2.exceptions.UndefinedError: 'data' is undefined
There are two issues with your code.
First "params" is not a supported field in BigQueryInsertJobOperator. See this post where I post how to pass params to sql file when using BigQueryInsertJobOperator. How do you pass variables with BigQueryInsertJobOperator in Airflow
Second, if you happen to get an error that your file cannot be found, make sure you set the full path of your file. I have had to do this when migrating from local testing to the cloud, even though the file is in the same directory. You can set the path in the DAG config as in the example below (replace the path with your path):
with DAG(
    ...
    template_searchpath='/opt/airflow/dags',
    ...
) as dag:

Databricks spark_jar_task failed when submitted via API

I am using the Databricks API to submit a sample spark_jar_task.
My sample spark_jar_task request to calculate Pi:
"libraries": [
{
"jar": "dbfs:/mnt/test-prd-foundational-projects1/spark-examples_2.11-2.4.5.jar"
}
],
"spark_jar_task": {
"main_class_name": "org.apache.spark.examples.SparkPi"
}
Databricks sysout logs, where it prints the Pi value as expected:
....
(This session will block until Rserve is shut down) Spark package found in SPARK_HOME: /databricks/spark DATABRICKS_STDOUT_END-19fc0fbc-b643-4801-b87c-9d22b9e01cd2-1589148096455
Executing command, time = 1589148103046.
Executing command, time = 1589148115170.
Pi is roughly 3.1370956854784273
Heap
.....
Although the spark_jar_task prints the Pi value in the log, the job got terminated with a FAILED status without stating the error. Below is the response of the API /api/2.0/jobs/runs/list/?job_id=23.
{
    "runs": [
        {
            "job_id": 23,
            "run_id": 23,
            "number_in_job": 1,
            "state": {
                "life_cycle_state": "TERMINATED",
                "result_state": "FAILED",
                "state_message": ""
            },
            "task": {
                "spark_jar_task": {
                    "jar_uri": "",
                    "main_class_name": "org.apache.spark.examples.SparkPi",
                    "run_as_repl": true
                }
            },
            "cluster_spec": {
                "new_cluster": {
                    "spark_version": "6.4.x-scala2.11",
......
.......
Why did the job fail here? Any suggestions will be appreciated!
EDIT :
The error log says:
20/05/11 18:24:15 INFO ProgressReporter$: Removed result fetcher for 740457789401555410_9000204515761834296_job-34-run-1-action-34
20/05/11 18:24:15 WARN ScalaDriverWrapper: Spark is detected to be down after running a command
20/05/11 18:24:15 WARN ScalaDriverWrapper: Fatal exception (spark down) in ReplId-a46a2-6fb47-361d2
com.databricks.backend.common.rpc.SparkStoppedException: Spark down:
at com.databricks.backend.daemon.driver.DriverWrapper.getCommandOutputAndError(DriverWrapper.scala:493)
at com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:597)
at com.databricks.backend.daemon.driver.DriverWrapper.runInnerLoop(DriverWrapper.scala:390)
at com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:337)
at com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:219)
at java.lang.Thread.run(Thread.java:748)
20/05/11 18:24:17 INFO ShutdownHookManager: Shutdown hook called
I found the answer from this post: https://github.com/dotnet/spark/issues/126
It looks like we shouldn't deliberately call
spark.stop()
when running as a jar in Databricks.
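For illustration, the same pitfall applies to a PySpark job submitted this way. A minimal sketch (the module name and the Pi estimation are made up) that leaves the session for Databricks to shut down instead of calling spark.stop():
# pi_job.py - rough PySpark equivalent of the SparkPi example
import random

from pyspark.sql import SparkSession


def main():
    # Reuse the session Databricks already created for the job.
    spark = SparkSession.builder.getOrCreate()
    n = 100000
    count = (
        spark.sparkContext.parallelize(range(n))
        .filter(lambda _: random.random() ** 2 + random.random() ** 2 <= 1)
        .count()
    )
    print(f"Pi is roughly {4.0 * count / n}")
    # Deliberately NOT calling spark.stop() here; per the linked issue,
    # stopping the session yourself can cause the run to be reported as failed.


if __name__ == "__main__":
    main()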

Airflow won't write logs to s3

I tried different ways to configure Airflow 1.9 to write logs to S3, but it just ignores it. I found a lot of people having problems reading the logs after doing so, but my problem is that the logs remain local. I can read them without a problem, but they are not in the specified S3 bucket.
What I tried first was to write into the airflow.cfg file:
# Airflow can store logs remotely in AWS S3 or Google Cloud Storage. Users
# must supply an Airflow connection id that provides access to the storage
# location.
remote_base_log_folder = s3://bucketname/logs
remote_log_conn_id = aws
encrypt_s3_logs = False
Then I tried to set environment variables
AIRFLOW__CORE__REMOTE_BASE_LOG_FOLDER=s3://bucketname/logs
AIRFLOW__CORE__REMOTE_LOG_CONN_ID=aws
AIRFLOW__CORE__ENCRYPT_S3_LOGS=False
However, it gets ignored and the log files remain local.
I run Airflow from a container; I adapted https://github.com/puckel/docker-airflow to my case, but it won't write logs to S3. I use the aws connection to write to buckets in DAGs and this works, but the logs just remain local, no matter whether I run it on EC2 or locally on my machine.
I finally found an answer using this StackOverflow answer, which covers most of the work; I then had to add one more step. I reproduce that answer here and adapt it a bit to the way I did it:
Some things to check:
Make sure you have the log_config.py file and it is in the correct dir: ./config/log_config.py.
Make sure you didn't forget the __init__.py file in that dir.
Make sure you defined the s3.task handler and set its formatter to airflow.task
Make sure you set airflow.task and airflow.task_runner handlers to s3.task
Set task_log_reader = s3.task in airflow.cfg
Pass the S3_LOG_FOLDER to log_config. I did that using a variable and retrieving it as in the following log_config.py.
Here is a log_config.py that works:
import os

from airflow import configuration as conf

LOG_LEVEL = conf.get('core', 'LOGGING_LEVEL').upper()
LOG_FORMAT = conf.get('core', 'log_format')

BASE_LOG_FOLDER = conf.get('core', 'BASE_LOG_FOLDER')
PROCESSOR_LOG_FOLDER = conf.get('scheduler', 'child_process_log_directory')

FILENAME_TEMPLATE = '{{ ti.dag_id }}/{{ ti.task_id }}/{{ ts }}/{{ try_number }}.log'
PROCESSOR_FILENAME_TEMPLATE = '{{ filename }}.log'

S3_LOG_FOLDER = conf.get('core', 'S3_LOG_FOLDER')

LOGGING_CONFIG = {
    'version': 1,
    'disable_existing_loggers': False,
    'formatters': {
        'airflow.task': {
            'format': LOG_FORMAT,
        },
        'airflow.processor': {
            'format': LOG_FORMAT,
        },
    },
    'handlers': {
        'console': {
            'class': 'logging.StreamHandler',
            'formatter': 'airflow.task',
            'stream': 'ext://sys.stdout'
        },
        'file.task': {
            'class': 'airflow.utils.log.file_task_handler.FileTaskHandler',
            'formatter': 'airflow.task',
            'base_log_folder': os.path.expanduser(BASE_LOG_FOLDER),
            'filename_template': FILENAME_TEMPLATE,
        },
        'file.processor': {
            'class': 'airflow.utils.log.file_processor_handler.FileProcessorHandler',
            'formatter': 'airflow.processor',
            'base_log_folder': os.path.expanduser(PROCESSOR_LOG_FOLDER),
            'filename_template': PROCESSOR_FILENAME_TEMPLATE,
        },
        's3.task': {
            'class': 'airflow.utils.log.s3_task_handler.S3TaskHandler',
            'formatter': 'airflow.task',
            'base_log_folder': os.path.expanduser(BASE_LOG_FOLDER),
            's3_log_folder': S3_LOG_FOLDER,
            'filename_template': FILENAME_TEMPLATE,
        },
    },
    'loggers': {
        '': {
            'handlers': ['console'],
            'level': LOG_LEVEL
        },
        'airflow': {
            'handlers': ['console'],
            'level': LOG_LEVEL,
            'propagate': False,
        },
        'airflow.processor': {
            'handlers': ['file.processor'],
            'level': LOG_LEVEL,
            'propagate': True,
        },
        'airflow.task': {
            'handlers': ['s3.task'],
            'level': LOG_LEVEL,
            'propagate': False,
        },
        'airflow.task_runner': {
            'handlers': ['s3.task'],
            'level': LOG_LEVEL,
            'propagate': True,
        },
    }
}
Note that this way S3_LOG_FOLDER can be specified in airflow.cfg or as the environment variable AIRFLOW__CORE__S3_LOG_FOLDER.
One more thing that leads to this behavior (Airflow 1.10):
If you look at airflow.utils.log.s3_task_handler.S3TaskHandler, you'll notice that there are a few conditions under which the logs, silently, will not be written to S3:
1) The logger instance is already close()d (not sure how this happens in practice)
2) The log file does not exist on the local disk (this is how I got to this point)
You'll also notice that the logger runs in a multiprocessing/multithreading environment, and that Airflow's S3TaskHandler and FileTaskHandler do some very no-no things with the filesystem. If the assumptions about log files on disk are not met, S3 log files will not be written, and nothing is logged or thrown about this event. If you have specific, well-defined logging needs, it might be a good idea to implement all your own logging Handlers (see the Python logging docs) and disable all Airflow log handlers (see Airflow UPDATING.md).
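To make that last suggestion concrete, here is a minimal sketch of a self-contained handler built on the standard logging module; the class name, bucket, and key are made up, and it assumes boto3 is installed. It buffers records in memory and uploads them to S3 when the handler is closed, independently of Airflow's own handlers:
import io
import logging

import boto3


class SimpleS3Handler(logging.Handler):
    """Buffer log records in memory and upload them to S3 on close()."""

    def __init__(self, bucket, key):
        super().__init__()
        self.bucket = bucket
        self.key = key
        self.buffer = io.StringIO()

    def emit(self, record):
        self.buffer.write(self.format(record) + "\n")

    def close(self):
        body = self.buffer.getvalue()
        if body:
            boto3.client("s3").put_object(Bucket=self.bucket, Key=self.key, Body=body)
        super().close()


# Example usage: attach it to a logger of your choice.
log = logging.getLogger("my_dag")
log.addHandler(SimpleS3Handler("bucketname", "logs/my_dag.log"))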
One more thing that may lead to this behaviour: botocore may not be installed.
Make sure when installing Airflow to include the s3 extra: pip install apache-airflow[s3]
In case this helps someone else, here is what worked for me, answered in a similar post: https://stackoverflow.com/a/73652781/4187360

protractor could not find protractor/selenium/chromedriver.exe at codeship

I'm trying to configure the integration to run Protractor tests.
I'm using the grunt-protractor-runner task
with the following configuration:
protractor: {
  options: {
    configFile: "protractor.conf.js", // your protractor config file
    keepAlive: true, // If false, the grunt process stops when the test fails.
    noColor: false, // If true, protractor will not use colors in its output.
    args: {
      // Arguments passed to the command
    }
  },
  run: {},
  chrome: {
    options: {
      args: {
        browser: "chrome"
      }
    }
  }
}
And here is the grunt task which I use for running Protractor after the server is running:
grunt.registerTask('prot', [
  'connect:test',
  'replace:includemocks', // for uncommenting angular-mocks reference
  'protractor:run',
  'replace:removemocks', // for commenting out angular-mocks reference
]);
It is running well on my local machine, but at Codeship I'm getting the following error:
Error: Could not find chromedriver at /home/rof/src/bitbucket.org/myrepo/myFirstRepo/node_modules/grunt-protractor-runner/node_modules/protractor/selenium/chromedriver.exe
Which, I guess, is a result of not having this "chromedriver.exe" at this path.
How can I solve it in the Codeship environment?
Thanks in advance
Add a postinstall script to your package.json file, and that way npm install will take care of placing the binaries for you ahead of time:
"scripts": {
"postinstall": "echo -n $NODE_ENV | \
grep -v 'production' && \
./node_modules/protractor/bin/webdriver-manager update || \
echo 'will skip the webdriver install/update in production'",
...
},
And don't forget to set NODE_ENV ... not setting it at all will result in the echo 'will skip the webdriver install/update in production' piece running. Setting it to dev or staging will get the desired results.
Short answer (pulkitsinghal gave the original solution):
./node_modules/grunt-protractor-runner/node_modules/protractor/bin/webdriver-manager update
I'm one of the founders at Codeship.
The error seems to be because you are trying to use the exe file, but we're on Linux on our system. Did you hardcode that executable?
Could you send us an in-app support request so we have a link to look at and can help you fix this?