"chromedriver unexpectedly exited" when importing selenium through a DAG on local airflow deployment - selenium

I am trying to orchestrate an ETL pipeline with Airflow running on my local machine. I am using the "standard" docker-compose.yaml file from the Apache Airflow documentation (this one: https://airflow.apache.org/docs/apache-airflow/2.4.3/docker-compose.yaml); my only alterations are mounting parts of my local file system into Docker, and using a custom image so that some Python libraries (like selenium) can be installed. This setup works fine for some of my pipelines, but I have one involving web scraping with selenium that I cannot get to work.
I get a DAG import error:
Broken DAG: [/opt/airflow/dags/brand_delta/my_dags/amazon_italy_dag.py] Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.7/site-packages/selenium/webdriver/common/service.py", line 106, in start
    self.assert_process_still_running()
  File "/home/airflow/.local/lib/python3.7/site-packages/selenium/webdriver/common/service.py", line 119, in assert_process_still_running
    raise WebDriverException(f"Service {self.path} unexpectedly exited. Status code was: {return_code}")
selenium.common.exceptions.WebDriverException: Message: Service /opt/airflow/chromedriver unexpectedly exited. Status code was: 127
The DAG imports a separate script, where the driver is initialized like this:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

def init_chrome_browser(chrome_driver_path, url):
    options = Options()
    options.add_argument('--no-sandbox')
    options.add_argument('--headless')
    options.add_argument('--disable-dev-shm-usage')
    options.add_argument('--start-maximized')
    options.add_argument('window-size=2560,1440')
    browser = webdriver.Chrome(service=Service(chrome_driver_path), options=options)
    browser.get(url)
    return browser
For some reason chromedriver keeps "unexpectedly exiting". I have tried both installing chromedriver on my local machine and mounting its location into the docker-compose image, and installing chromedriver directly inside the Docker container of the airflow-worker, but in both cases I get this error.
I have also tried supplementing chromedriver with packages such as "libglib2.0..." inside the worker, and I do get chromedriver to start when I run it from the worker's terminal. But it still gives me the same error when I try to run it through Airflow.
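Exit status 127 usually means the binary could not actually be executed, most often because a shared library it links against is missing inside the container. As a sanity check (a minimal sketch, using the driver path from the traceback above), you can launch the driver from Python the same way Selenium's Service does and surface its output:

import subprocess

# Run the driver binary directly and capture its output so that
# missing-shared-library errors become visible instead of a bare 127.
result = subprocess.run(
    ['/opt/airflow/chromedriver', '--version'],
    capture_output=True,
    text=True,
)
print('exit code:', result.returncode)
print(result.stdout or result.stderr)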

Are you sure you have chromedriver installed in all of the running Airflow Docker containers? Are you able to run your web-scraping Python code on the worker and/or scheduler container outside of Airflow?

Related

Getting Chromedriver logs from standalone Selenium Chrome instance

Summary: How do I extract Chromedriver logs when running a Selenium standalone Chrome instance, i.e. interacting via the Selenium API, commonly on port 4444?
Details:
We are using Protractor to connect to a container running the Docker image selenium/standalone-chrome Selenium "grid". Connection info is specified via the HUB_PORT_4444_TCP_ADDR environment variable, and the connection URL ends up being http://localhost:4444/wd/hub. This works fine and our tests run successfully in Jenkins.
For completeness I'd like to extract the Chromedriver logs and attach them to the build in case we need more info for debugging test failures. How can that be done?
This question seemed like a close match, but they are running Chromedriver directly; I need to ask Selenium to provide the logs somehow.
Log properties of the standalone Chrome container can be configured using JAVA_OPTS. You can add a JAVA_OPTS environment variable to the standalone Chrome container:
name: JAVA_OPTS
value: "-Dwebdriver.chrome.logfile=<Path to log file, with file name>"
We had a shared volume mounted and pointed the log file path at a folder on that volume. We used a YAML file to create the container template, hence the format above; if you are using the CLI to launch the container, the same variable can be passed there too.
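For example, the equivalent launch from Python with the Docker SDK (a hedged sketch: it assumes the docker package is installed, and the host path for the shared volume is a placeholder) would pass the same variable like this:

import docker

client = docker.from_env()
# Start selenium/standalone-chrome with JAVA_OPTS pointing the
# Chromedriver log at a file on the shared volume.
client.containers.run(
    'selenium/standalone-chrome',
    detach=True,
    ports={'4444/tcp': 4444},
    environment={'JAVA_OPTS': '-Dwebdriver.chrome.logfile=/logs/chromedriver.log'},
    volumes={'/path/on/host': {'bind': '/logs', 'mode': 'rw'}},
)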

Python Selenium Remote Webdriver(Chrome Webdriver via Selenium Grid), created but does not open browser

I have the following setup:
A Selenium server hub running at "http://localhost:hubPortNum" (a service with the Jar file selenium-server-standalone-3.141.5.jar with parameter -role hub).
A Selenium node running at "http://localhost:nodePortNum" (the service with the Jar file with parameters: -Dwebdriver.chrome.driver=ChromeWebdriverPath -role node -port :nodePortNum).
I checked the URL for the hub and node instances to be sure they are working.
Whenever I try to create a Remote WebDriver via a Python script:
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
desiredCapabilities = DesiredCapabilities.CHROME.copy()
chromeOptionsRemote = webdriver.ChromeOptions()
chromeOptionsRemote.add_argument("--start-maximized")
chromeOptionsRemote.add_argument("--disable-session-crashed-bubble")
initRemoteDriver = webdriver.Remote(options=chromeOptionsRemote, command_executor='http://127.0.0.1:<nodePortNum>/wd/hub', desired_capabilities=desiredCapabilities)
print(initRemoteDriver.current_url)
The last line does print the current URL (which is "data:,"), which means the WebDriver session was created.
But the browser does not open on my local machine; it is running in the background, and I don't know how to make it visible, although this has worked in the past.
The troubleshooting steps I have taken:
Reinstalled the latest selenium Python package.
Re-downloaded the latest Selenium server jar file.
Updated Chrome.
Added chromeOptionsRemote.add_argument("--no-sandbox").
Made sure the local WebDriver does open, i.e. that the line:
self.localDriver = webdriver.Chrome(options=chromeOptionsLocal,
                                    desired_capabilities=desiredCapabilities)
does open the browser locally (chromedriver is in the PATH).
After these troubleshooting steps, I tried the same configuration on a remote server and got the same result (browser not visible), so I think this is probably by design.
What configuration should I create for the browser to be visible?
Any help would be appreciated.
I was running the jar file with AlwaysUp: https://www.coretechnologies.com/products/AlwaysUp/
The problem was related to Session 0 isolation: https://stackoverflow.com/a/26752251/2710840
To avoid running the application under Session 0, I enabled the Autologon feature, set the application to run under my own user account, and restarted it from the context menu with the "Restart in this session" option.

Setting up S3 logging in Airflow

This is driving me nuts.
I'm setting up airflow in a cloud environment. I have one server running the scheduler and the webserver and one server as a celery worker, and I'm using airflow 1.8.0.
Running jobs works fine. What refuses to work is logging.
I've set up the correct path in airflow.cfg on both servers:
remote_base_log_folder = s3://my-bucket/airflow_logs/
remote_log_conn_id = s3_logging_conn
I've set up s3_logging_conn in the airflow UI, with the access key and the secret key as described here.
I checked the connection using:
import airflow.hooks
s3 = airflow.hooks.S3Hook('s3_logging_conn')
s3.load_string('test', 'test', bucket_name='my-bucket')
This works on both servers. So the connection is properly set up. Yet all I get whenever I run a task is
*** Log file isn't local.
*** Fetching here: http://*******
*** Failed to fetch log file from worker.
*** Reading remote logs...
Could not read logs from s3://my-bucket/airflow_logs/my-dag/my-task/2018-02-15T21:46:47.577537
I tried manually uploading a log file following the expected conventions, and the webserver still can't pick it up, so the problem is on both ends. I'm at a loss as to what to do; everything I've read so far tells me this should be working. I'm close to just installing 1.9.0, which I hear changes logging, to see if I'm luckier.
UPDATE: I made a clean install of Airflow 1.9 and followed the specific instructions here.
Webserver won't even start now with the following error:
airflow.exceptions.AirflowConfigException: section/key [core/remote_logging] not found in config
There is an explicit reference to this section in this config template.
So I tried removing that check and loading the S3 handler unconditionally, and got the following error message instead:
Unable to load the config, contains a configuration error.
Traceback (most recent call last):
  File "/usr/lib64/python3.6/logging/config.py", line 384, in resolve
    self.importer(used)
ModuleNotFoundError: No module named 'airflow.utils.log.logging_mixin.RedirectStdHandler'; 'airflow.utils.log.logging_mixin' is not a package
I get the feeling that this shouldn't be this hard.
Any help would be much appreciated, cheers
Solved:
upgraded to 1.9
ran the steps described in this comment
added
[core]
remote_logging = True
to airflow.cfg
ran
pip install --upgrade airflow[log]
Everything's working fine now.
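For reference, with the settings from above the relevant airflow.cfg section ends up looking like this (bucket name and connection id are the ones used earlier in the question):

[core]
remote_logging = True
remote_base_log_folder = s3://my-bucket/airflow_logs/
remote_log_conn_id = s3_logging_conn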

headless-selenium-for-win install issues

I am trying to download and install https://github.com/kybu/headless-selenium-for-win to use headless Chrome and Firefox on Windows, since the headless modes built into Firefox and Chrome do not support extensions.
I keep getting:
C:\Users\Dan >pip install -U git+https://github.com/kybu/headless-selenium-for-win.git
Collecting git+https://github.com/kybu/headless-selenium-for-win.git
Cloning https://github.com/kybu/headless-selenium-for-win.git to c:\users\Dan\appdata\local\temp\pip-6wiag0j8-build
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\Dan\Anaconda3\lib\tokenize.py", line 452, in open
    buffer = _builtin_open(filename, 'rb')
FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\Dan\\AppData\\Local\\Temp\\pip-6wiag0j8-build\\setup.py'
----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in C:\Users\Dan\AppData\Local\Temp\pip-6wiag0j8-build\
My question is, why is this not installing correctly? Also, is it likely this method will allow extensions in Chrome, or is that just one of the downsides of using headless browsers?
You are trying to install a C++ project with pip. That is not going to work. You can download the compiled exe from their releases page:
https://github.com/kybu/headless-selenium-for-win/releases
Extract headless_ie_selenium.exe from the archive above into a directory on your system PATH. Then run something like the following to drive Firefox:
import os
# Tell headless_ie_selenium.exe which real driver to forward to.
os.environ["HEADLESS_DRIVER"] = "geckodriver.exe"

from selenium import webdriver
# Point Selenium at the headless wrapper instead of geckodriver directly.
driver = webdriver.Firefox(executable_path="headless_ie_selenium.exe")
PS: Since I don't have Windows, I can't test the above code, but this is what the documentation says:
Selenium uses "drivers" to control web browsers. They are standalone executables driving browsers. headless_ie_selenium.exe by default looks for the IE driver in PATH, but it can be instructed to use other drivers as well. All command line arguments are forwarded to the driver, so the HEADLESS_DRIVER environment variable is used to specify the driver. Put the driver in one of the PATH directories.
Set the HEADLESS_DRIVER environment variable to geckodriver.exe for headless Firefox.

Selenium+Jenkins+Chromedriver = WebDriverException: Message: unknown error: Chrome failed to start: exited abnormally

I have problem with running my Selenium tests in Jenkins.
A result of execution is always:
WebDriverException: Message: unknown error: Chrome failed to start:
exited abnormally
My tests are written in Robot Framework and are using Chromium webdriver.
I'm setting needed paths in my command, which looks like this:
export PATH=$PATH:/usr/lib/chromium-browser; export PATH=$PATH:/usr/lib/chromium-browser/chromedriver; . /home/michal/robot_env/bin/activate; robot -L TRACE /home/michal/project_robot/tests
And when I run this command manually in terminal IT WORKS fine (Chromium starts automatically and the test goes on).
So the problem seems to be in Jenkins. I have installed the Xvfb plugin, but it didn't help.
Additionally, in /etc/init.d/jenkins I put these lines:
/usr/bin/X :0 vt7 -ac
export DISPLAY=:0
xhost +
And once again - nothing changed. What else should I set or check?
I got stuck the same way.
The problem is that Jenkins runs under its own user, called jenkins, and that user cannot open the browser.
If you run "su jenkins" and then "chromium-browser", you get the display error.
That is why you hit this issue: the problem is not the webdriver, the problem is the user.
I removed the jenkins user created by Jenkins and created a normal user called jenkins before installing Jenkins.
Then I installed Jenkins.
Now the jenkins user can run the tests (because it can open the browser), but Jenkins itself will not load anymore.
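For what it's worth, one way to sidestep the display problem entirely (a sketch of an alternative, not part of the answer above, and assuming chromedriver is on the PATH) is to run the browser headless so no X display is needed at all:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')    # no X display required
options.add_argument('--no-sandbox')  # often needed when running as a CI service user
driver = webdriver.Chrome(options=options)
driver.get('https://example.com')
print(driver.title)
driver.quit()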