Selenium gets blocked quickly by websites - SSL handshake error

I'm using Selenium and Selenium Wire in my project.
I'm writing flows that log in to the AWS and GCP portals.
My flows work well, but when I enter the AWS/GCP portal I get errors and see a blank page.
aws portal
link: https://us-east-1.console.aws.amazon.com/console/home?region=us-east-1#
gcp portal
selenium driver
`
from seleniumwire import webdriver
from seleniumwire.webdriver import ChromeOptions
def test_aws_flow():
    options = ChromeOptions()
    options.add_experimental_option("detach", True)
    options.add_argument('--no-sandbox')
    options.add_argument('--single-process')
    options.add_argument('--disable-dev-shm-usage')
    options.add_argument("--start-maximized")
    options.add_argument('--auto-open-devtools-for-tabs')
    options.add_argument('--log-level=2')
    options.add_argument('--disable-features=IsolateOrigins,site-per-process')
    options.add_argument("--ignore_ssl")
    options.add_argument('--ignore-ssl-errors')
    options.add_argument('--ignore-certificate-errors')
    options.add_argument("--disable-extensions")
    options.add_argument("--disable-setuid-sandbox")
    options.add_argument("--dns-prefetch-disable")
    options.add_argument('ignore-certificate-errors')
    options.add_argument('disable-web-security')
    options.add_argument('--allow-insecure-localhost')
    driver = webdriver.Chrome(options=options)
    driver.get('....any-hidden-url')
    # more flow actions - then it open aws portal
`
I found a related issue in the selenium-wire GitHub repository, but the suggestions there did not work for me:
https://github.com/wkeeling/selenium-wire/issues/566
They recommended using undetected-chromedriver. I tried it, but I still get the same issue.
Update:
I added the following openssl.cnf and ran my test locally from PyCharm:
openssl_conf = openssl_init
[openssl_init]
ssl_conf = ssl_sect
[ssl_sect]
system_default = system_default_sect
[system_default_sect]
Options = UnsafeLegacyRenegotiation
With this configuration, the login to both GCP and AWS succeeds. Why is that? And how can I be sure the problem will not happen in the prod environment when I deploy to AWS Lambda?
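One way to make the same setting travel with a deployment is to ship the config file and point OpenSSL at it through the `OPENSSL_CONF` environment variable. The following is a minimal sketch, not a verified Lambda setup. The helper names `write_openssl_conf` and `env_with_openssl_conf` are hypothetical, and the sketch assumes the variable is set before the process that performs the TLS handshake initializes OpenSSL.

```python
import os
import tempfile
from pathlib import Path

# Same content as the openssl.cnf shown in the question.
OPENSSL_CNF = """\
openssl_conf = openssl_init

[openssl_init]
ssl_conf = ssl_sect

[ssl_sect]
system_default = system_default_sect

[system_default_sect]
Options = UnsafeLegacyRenegotiation
"""


def write_openssl_conf(directory):
    """Write the config into `directory` and return its path (hypothetical helper)."""
    path = Path(directory) / "openssl.cnf"
    path.write_text(OPENSSL_CNF)
    return path


def env_with_openssl_conf(conf_path, base_env=None):
    """Return a copy of the environment with OPENSSL_CONF pointing at the file."""
    env = dict(os.environ if base_env is None else base_env)
    env["OPENSSL_CONF"] = str(conf_path)
    return env


if __name__ == "__main__":
    # A temp directory is used here as an assumption about where the
    # deployed code is allowed to write.
    conf = write_openssl_conf(tempfile.gettempdir())
    print(env_with_openssl_conf(conf)["OPENSSL_CONF"])
```

The returned dictionary can be passed as `env=` to `subprocess.run`, or the variable can be set in the deployment's environment configuration so that the local and deployed runs use the same OpenSSL options.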

If your page does not open, please let me know. I believe they are
detecting bots. You might want to try using a fake user agent, as shown in
the code below.
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver import ActionChains
from selenium.webdriver.chrome.options import Options
import time
options = Options()
user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.517 Safari/537.36'
options.add_argument('user-agent={0}'.format(user_agent))
driver = webdriver.Chrome(options=options)
wait = WebDriverWait(driver, 20)
action = ActionChains(driver)
driver.get("https://us-east-1.console.aws.amazon.com/console/home?region=us-east-1#")
time.sleep(20)
driver.quit()
Note: please remove any extra code that you don't need. Thank you!

Related

How to interact using Selenium with already opened browser?

Until a few days ago, the following worked perfectly:
Open the browser with:
"C:\Program Files\BraveSoftware\Brave-Browser\Application\brave.exe" --remote-debugging-port=9222
Then, in Python, I check the response status code (it should be 200) with a GET request to http://localhost:9222.
Then I attach Selenium:
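import socket
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
options = Options()
# Raw string, so that "\b" in "\brave.exe" is not read as a backspace escape
options.binary_location = r"C:\Program Files\BraveSoftware\Brave-Browser\Application\brave.exe"
options.add_argument("disable-popup-blocking")
# debuggerAddress expects "host:port", so resolve only the host name
options.add_experimental_option("debuggerAddress", socket.gethostbyname("localhost") + ":9222")
driver = webdriver.Chrome(ChromeDriverManager().install(), options = options)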
However, this setup is not working anymore, as nothing can be accessed through http://localhost:9222 now with new updates.
Any idea how to achieve the same?
Try this code, it's working:
Run the below command in command prompt:
"C:\\Program Files\\BraveSoftware\\Brave-Browser\\Application\\brave.exe" --remote-debugging-port=9222 --user-data-dir="C:\\Temp\\BraveData"
Brave browser will be launched, then use the below code:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
from webdriver_manager.core.utils import ChromeType
options = Options()
options.add_experimental_option("debuggerAddress", "localhost:9222")
driver = webdriver.Chrome(service=Service(ChromeDriverManager(chrome_type=ChromeType.BRAVE).install()), options = options)
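The status check described in the question can be made a little more specific by querying the DevTools HTTP endpoint `/json/version`, which Chromium-based browsers serve on the remote debugging port. This is a minimal sketch using only the standard library. The function name `devtools_version` is hypothetical, and it is an assumption that Brave answers on this endpoint the same way Chrome does.

```python
import json
import urllib.error
import urllib.request


def devtools_version(host="localhost", port=9222, timeout=2.0):
    """Return the parsed /json/version payload, or None if the port is not serving DevTools."""
    url = f"http://{host}:{port}/json/version"
    # An empty ProxyHandler keeps a system-wide HTTP proxy from intercepting local requests.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as resp:
            if resp.status != 200:
                return None
            return json.loads(resp.read().decode("utf-8"))
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error, or a non-JSON body.
        return None


if __name__ == "__main__":
    info = devtools_version()
    if info is None:
        print("Nothing is listening on the debugging port")
    else:
        print("Browser reports:", info.get("Browser"))
```

Running this before attaching Selenium helps separate "the browser never opened the debugging port" from "the driver could not attach".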

Chrome Headless in AWS Lambda returns empty page

I am using headless Chrome (with the Serverless framework) to run my Selenium scraping script in an AWS Lambda function.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from time import sleep
def main(event, context):
    options = Options()
    options.binary_location = '/opt/headless-chromium'
    options.add_argument('--headless')
    options.add_argument('--no-sandbox')
    options.add_argument('--single-process')
    options.add_argument('--disable-dev-shm-usage')
    driver = webdriver.Chrome('/opt/chromedriver',chrome_options=options)
    driver.get('https://www.linkedin.com/in/williamhgates')
    sleep(2)
    body = f"Headless Chrome Initialized, Page : {driver.page_source}"
    driver.close()
    driver.quit()
    response = {
        "statusCode": 200,
        "body": body
    }
    return response
The same script works perfectly on my local Linux machine and returns the expected page source.
But when I run it through AWS Lambda, it returns an empty page with the following source code:
<html xmlns=\"http://www.w3.org/1999/xhtml\"><head></head><body></body></html>
Do you have any ideas? Thank you in advance.
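This seems to be an issue with the SSL certificate.
Set the desired capabilities to ignore it:
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
desired_capabilities = DesiredCapabilities.CHROME.copy()
desired_capabilities['acceptInsecureCerts'] = True
# Pass the capabilities when creating the driver (Selenium 3 signature, as in the question)
driver = webdriver.Chrome('/opt/chromedriver', chrome_options=options, desired_capabilities=desired_capabilities)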
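On Selenium versions where the `desired_capabilities` argument is not available, the same capability can be set on the options object instead. The sketch below is an illustration under assumptions, not the answerer's code: it assumes Selenium 4's `Service` class, reuses the paths and flags from the question, and the names `LAMBDA_CHROME_ARGS`, `build_lambda_options`, and `start_driver` are hypothetical.

```python
# Flags copied from the question's Lambda handler.
LAMBDA_CHROME_ARGS = [
    "--headless",
    "--no-sandbox",
    "--single-process",
    "--disable-dev-shm-usage",
]


def build_lambda_options(binary_location="/opt/headless-chromium"):
    """Build Chrome options that also accept insecure certificates."""
    # Imported lazily so this module can be loaded without Selenium installed.
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.binary_location = binary_location
    for arg in LAMBDA_CHROME_ARGS:
        options.add_argument(arg)
    # Standard WebDriver capability, same key as in the answer above.
    options.set_capability("acceptInsecureCerts", True)
    return options


def start_driver(driver_path="/opt/chromedriver"):
    """Start Chrome with the options above (assumes the Selenium 4 Service API)."""
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    return webdriver.Chrome(service=Service(driver_path), options=build_lambda_options())
```

Whether ignoring certificate errors actually fixes the empty page depends on the real cause, so it is worth logging `driver.title` and `driver.current_url` alongside the page source when testing this.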

Download file through Google Chrome RemoteWebDriver- headless mode in Linux using Selenium Java [duplicate]

My code works fine with ChromeDriver in 'normal' mode. When I switch to headless mode, it doesn't download the file. I have already tried the code I found around the internet, but it didn't work.
chrome_options = Options()
chrome_options.add_argument("--headless")
self.driver = webdriver.Chrome(chrome_options=chrome_options, executable_path=r'{}/chromedriver'.format(os.getcwd()))
self.driver.set_window_size(1024, 768)
self.driver.command_executor._commands["send_command"] = ("POST", '/session/$sessionId/chromium/send_command')
params = {'cmd': 'Page.setDownloadBehavior', 'params': {'behavior': 'allow', 'downloadPath': os.getcwd()}}
self.driver.execute("send_command", params)
Does anyone have any idea how to solve this problem?
PS: I don't necessarily need to use ChromeDriver. If it works with another driver, that's fine for me.
First the solution
Minimum Prerequisites:
Selenium client version: Selenium v3.141.59
Chrome version: Chrome v77.0
ChromeDriver version: ChromeDriver v77.0
To download the file by clicking the element with the text Download Data on this website, you can use the following solution:
Code Block:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
options = Options()
options.add_argument("--headless")
options.add_argument("--window-size=1920,1080")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(chrome_options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe', service_args=["--log-path=./Logs/DubiousDan.log"])
print ("Headless Chrome Initialized")
params = {'behavior': 'allow', 'downloadPath': r'C:\Users\Debanjan.B\Downloads'}
driver.execute_cdp_cmd('Page.setDownloadBehavior', params)
driver.get("https://www.mockaroo.com/")
driver.execute_script("scroll(0, 250)");
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button#download"))).click()
print ("Download button clicked")
#driver.quit()
Console Output:
Headless Chrome Initialized
Download button clicked
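Because the click returns before the file is fully written, a script that quits the driver right away may lose the download. The following is a minimal polling sketch, assuming the download directory starts out empty and relying on Chrome's `.crdownload` suffix for files that are still in progress. The function name `wait_for_download` is hypothetical.

```python
import time
from pathlib import Path


def wait_for_download(directory, timeout=30.0, poll=0.5):
    """Return the finished files in `directory`, or [] if none complete in time."""
    directory = Path(directory)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        files = [p for p in directory.iterdir() if p.is_file()]
        # Chrome keeps the ".crdownload" suffix until the download completes.
        partial = [p for p in files if p.suffix == ".crdownload"]
        finished = sorted(p for p in files if p.suffix != ".crdownload")
        if finished and not partial:
            return finished
        time.sleep(poll)
    return []
```

In the code block above, this would be called with the same directory that was passed as `downloadPath`, after the click and before `driver.quit()`.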
Details
Downloading files through Headless Chromium has been one of the most sought-after features since Headless Chrome was introduced.
Since then, various contributors have published workarounds, including:
Downloading with chrome headless and selenium
Python equivalent of a given wget command
Now, the good news is that the Chromium team has officially announced the arrival of the functionality: Downloading file through Headless Chromium.
In the discussion Headless mode doesn't save file downloads, @eseckler mentioned:
Downloads in headless work a little differently. There's the Page.setDownloadBehavior devtools command to set a download folder. We're working on a way to use DevTools network interception to stream the downloaded file via DevTools as well.
A detailed discussion can be found at Issue 696481: Headless mode doesn't save file downloads
Finally, @bugdroid's revision seems to have nailed the issue for us.
[ChromeDriver] Added support for headless mode to download files
Previously, Chromedriver running in headless mode would not properly download files due to the fact it sparsely parses the preference file given to it. Engineers from the headless chrome team recommended using DevTools's "Page.setDownloadBehavior" to fix this. This changelist implements this fix. Downloaded files default to the current directory and can be set using download_dir when instantiating a chromedriver instance. Also added tests to ensure proper download functionality.
Here is the revision and commit
From ChromeDriver v77.0.3865.40 (2019-08-20) release notes:
Resolved issue 2454: Headless mode doesn't save file downloads [Pri-2]
Solution
Update ChromeDriver to latest ChromeDriver v77.0 level.
Update Chrome to Chrome Version 77.0 level. (as per ChromeDriver v76.0 release notes)
Note: Chrome v77.0 is yet to be GAed/pushed for release so till then you can download and install a development build and test either from:
Chrome Canary
Latest build from the Dev Channel
Outro
However, Mac OS X users have to wait for their slice of the pie, as reported in: On Chromedriver, headless chrome crashes after sending Page.setDownloadBehavior on MacOSX.
ChromeDriver version: 95.0.4638.54
Chrome version: 95.0.4638.69
from selenium.webdriver.chrome.options import Options
options = Options()
options.add_argument("--headless")
options.add_argument("--start-maximized")
options.add_argument("--no-sandbox")
options.add_argument("--disable-extensions")
options.add_argument('--disable-dev-shm-usage')
options.add_argument("--disable-gpu")
options.add_argument('--disable-software-rasterizer')
options.add_argument("user-agent=Mozilla/5.0 (Windows Phone 10.0; Android 4.2.1; Microsoft; Lumia 640 XL LTE) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Mobile Safari/537.36 Edge/12.10166")
options.add_argument("--disable-notifications")
options.add_experimental_option("prefs", {
    "download.default_directory": "C:\\link\\to\\folder",
    "download.prompt_for_download": False,
    "download.directory_upgrade": True,
    "safebrowsing_for_trusted_sources_enabled": False,
    "safebrowsing.enabled": False
})
What seemed to work was that I used "\\" instead of "/" for the address. The latter approach didn't throw any error, but didn't download any documents either. But, using double back slashes did the job.
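To avoid depending on how the path was typed, the download directory can be normalized to backslashes before it goes into the prefs. This is a small sketch using the standard library's `PureWindowsPath`. The function name `chrome_download_prefs` is hypothetical, and only the download-related keys from the snippet above are included.

```python
from pathlib import PureWindowsPath


def chrome_download_prefs(download_dir):
    """Build download prefs with a Windows-style (backslash) directory path."""
    # PureWindowsPath renders with backslashes regardless of the host OS.
    normalized = str(PureWindowsPath(download_dir))
    return {
        "download.default_directory": normalized,
        "download.prompt_for_download": False,
        "download.directory_upgrade": True,
    }


if __name__ == "__main__":
    prefs = chrome_download_prefs("C:/link/to/folder")
    print(prefs["download.default_directory"])  # prints C:\link\to\folder
```

The resulting dictionary can be passed to `options.add_experimental_option("prefs", ...)` in place of the hand-written one.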
For javascript use below code:
const webdriver = require('selenium-webdriver');
const chrome = require('selenium-webdriver/chrome');
let options = new chrome.Options();
// Each switch must be passed as its own argument
options.addArguments('--headless', '--window-size=1500,1200');
options.setUserPreferences({
  'plugins.always_open_pdf_externally': true,
  'profile.default_content_settings.popups': 0,
  'download.default_directory': Download_File_Path
});
driver = await new webdriver.Builder().setChromeOptions(options).forBrowser('chrome').build();
Then switch tabs as soon as you click the download button:
await driver.sleep(1000);
var Handle = await driver.getAllWindowHandles();
await driver.switchTo().window(Handle[1]);
This C# works for me
Note the new headless option https://www.selenium.dev/blog/2023/headless-is-going-away/
private IWebDriver StartBrowserChromeHeadlessDriver()
{
    var chromeOptions = new ChromeOptions();
    chromeOptions.AddArgument("--headless=new");
    chromeOptions.AddArgument("--window-size=1920,1080");
    chromeOptions.AddUserProfilePreference("download.default_directory", downloadFolder);
    var chromeDownload = new Dictionary<string, object>
    {
        { "behavior", "allow" },
        { "downloadPath", downloadFolder }
    };
    var driver = new ChromeDriver(driverFolder, chromeOptions, TimeSpan.FromSeconds(timeoutSecs));
    driver.ExecuteCdpCommand("Browser.setDownloadBehavior", chromeDownload);
    return driver;
}
import pathlib
from selenium.webdriver import Chrome
driver = Chrome()
driver.execute_cdp_cmd("Page.setDownloadBehavior", {
    "behavior": "allow",
    "downloadPath": str(pathlib.Path.home().joinpath("Downloads"))
})
I don't think you should be using the browser for downloading content; leave that to Chrome developers/testers.
I believe you should rather get the href attribute of the element you want to download and fetch it using the requests library.
If your site requires authentication, you could fetch cookies from the browser instance and pass them to requests.Session.
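The cookie hand-off described above could look roughly like this. It is a sketch under assumptions, not a tested recipe for any particular site: the function names are hypothetical, `requests` is a third-party package that must be installed, and cookie attributes such as domain and path are dropped for brevity.

```python
def cookies_for_requests(selenium_cookies):
    """Map the list of dicts from driver.get_cookies() to a name -> value dict."""
    return {cookie["name"]: cookie["value"] for cookie in selenium_cookies}


def download_with_session(driver, url, target_path, timeout=30):
    """Download `url` with the browser's cookies and save it to `target_path`."""
    import requests  # third-party; imported lazily

    session = requests.Session()
    session.cookies.update(cookies_for_requests(driver.get_cookies()))
    # Reuse the browser's user agent so the server sees a consistent client.
    session.headers["User-Agent"] = driver.execute_script("return navigator.userAgent")
    response = session.get(url, timeout=timeout, stream=True)
    response.raise_for_status()
    with open(target_path, "wb") as fh:
        for chunk in response.iter_content(chunk_size=65536):
            fh.write(chunk)
    return target_path
```

The `url` would typically come from `element.get_attribute("href")` on the download link. Sites that tie the session to more than cookies (for example request headers or tokens set by JavaScript) may still reject the request.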

Selenium returns different html source than viewed in browser

I'm trying to use Selenium to load the next page of results by clicking the Load More button on this site.
However, the HTML source of the page loaded by Selenium does not show (load) the actual products that one can see when browsing.
Here is my code:
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import os
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
#browser = webdriver.Firefox()#Chrome('./chromedriver.exe')
URL = "https://thekrazycouponlady.com/coupons-for/costco"
PATIENCE_TIME = 60
LOAD_MORE_BUTTON_XPATH = '//button[@class = "kcl-btn ng-scope"]/span'
caps = DesiredCapabilities.PHANTOMJS
# driver = webdriver.Chrome(r'C:\Python3\selenium\webdriver\chromedriver_win32\chromedriver.exe')
caps["phantomjs.page.settings.userAgent"] = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36"
driver = webdriver.PhantomJS(r'C:\Python3\selenium\webdriver\phantomjs-2.1.1-windows\bin\phantomjs.exe',service_log_path=os.path.devnull,desired_capabilities=caps)
driver.get(URL)
while True:
    try:
        time.sleep(20)
        html = driver.page_source.encode('utf-8')
        print(html)
        loadMoreButton = driver.find_element_by_xpath(LOAD_MORE_BUTTON_XPATH)
        loadMoreButton.click()
    except Exception as e:
        print (e)
        break
print ("Complete")
driver.quit()
I'm not sure if I can attach a sample HTML file here for reference.
Anyway, what is the problem, and how do I load exactly the same page with Selenium as I do via the browser?
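It might be due to the use of PhantomJS, which is no longer maintained and has been deprecated since Selenium 3.8.1. Use headless Chrome instead.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
options = Options()
options.headless = True
# CHROMEDRIVER_PATH is a placeholder for the path to your chromedriver binary
driver = webdriver.Chrome(CHROMEDRIVER_PATH, chrome_options=options)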

splinter: how to add chrome options?

I'm using splinter (v0.7.3) for web testing under Linux, but with Chrome the default sample code does not run:
from splinter import Browser
from pyvirtualdisplay import Display
d = Display(visible=0, size=(800, 600))
d.start()
b = Browser('chrome')
b.visit('http://www.google.com')
b.quit()
d.stop()
While running, I got the exception like this:
selenium.common.exceptions.WebDriverException: Message: chrome not reachable
Then I tested the same thing in Selenium with a Chrome option added:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from pyvirtualdisplay import Display
d = Display(visible=0, size=(800, 600))
d.start()
opt = Options()
opt.add_argument('--disable-setuid-sandbox')
b = webdriver.Chrome(chrome_options=opt)
b.get('http://www.google.com')
b.quit()
d.stop()
This works fine. The difference is the --disable-setuid-sandbox option added to the Chrome driver; without it, a zombie chrome-sandbox process is left under chromium-browser.
The problem is that I don't know how to pass a chrome.options.Options instance to splinter.Browser(). I browsed the implementation in splinter/driver/webdriver/chrome.py, and there seems to be no way to pass such an instance to splinter.Browser(). Is there some other way to pass options to the Chrome driver?
Create a new instance of BaseWebDriver and set .driver with an instance of the Chrome driver. This example starts Chrome maximized:
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.options import Options
from splinter.driver.webdriver import BaseWebDriver, WebDriverElement
options = Options()
options.add_argument('--start-maximized')
browser = BaseWebDriver()
browser.driver = Chrome(chrome_options=options)
browser.visit('https://www.google.com')
The only way I could ever do this was by using the add_argument method with selenium.webdriver.ChromeOptions like so:
from selenium.webdriver import ChromeOptions
from splinter import Browser
chrome_options = ChromeOptions()
chrome_options.add_argument(your_argument)
b=Browser("chrome", options=chrome_options)
b.visit('http://www.google.com')
b.quit()
So in your code it would be:
from splinter import Browser
from selenium.webdriver import ChromeOptions
from pyvirtualdisplay import Display #I'm not certain what this is...
d = Display(visible=0, size=(800, 600))
d.start()
chrome_options = ChromeOptions()
chrome_options.add_argument('disable-setuid-sandbox')
b = Browser('chrome', options=chrome_options)
b.visit('http://www.google.com')
b.quit()
d.stop()
Note: I was unable to test this with your argument specifically because I recently broke my GRUB so I am stuck in windows, and the disable-setuid-sandbox option is linux-only. However, I have been using this method with the headless argument for a while.
I am not 100% sure that this will work, but I just looked at the docs for splinter and they say:
You can also pass additional arguments that correspond to Selenium DesiredCapabilities arguments.
Looking into the source code of Splinter, calling Browser can take some arguments. These arguments are then passed on to create an instance of the Chrome WebDriver. So I went to the Selenium source code and saw that the constructor looks like this:
def __init__(self, executable_path="chromedriver", port=0,
             chrome_options=None, service_args=None,
             desired_capabilities=None, service_log_path=None):
There is a parameter for chrome_options so it should be possible to pass it using this parameter. So if I'm correct this should work fine:
opt = Options()
opt.add_argument('--disable-setuid-sandbox')
b = Browser(browser='chrome', chrome_options=opt)
Edit
Alternatively, you could pass the options as desired capabilities as well:
opt = Options()
opt.add_argument('--disable-setuid-sandbox')
dc = opt.to_capabilities()
b = Browser(browser='chrome', desired_capabilities=dc)
I've been working on a fork of splinter for the past couple weeks, you can check out my dev branch if you want. I have added this and other features.
Options can be passed as a list to the chrome_options parameter
from splinter import Browser
options = ['--start-maximized', '--disable-setuid-sandbox']
with Browser('chrome', chrome_options=options) as browser:
    browser.visit('http://www.google.com')
Edit:
So it turns out this was possible with splinter all along just using **kwargs which passes the various available options to the selenium driver(s). For example:
from splinter import Browser
options = {'chrome_options':['--start-maximized', '--disable-setuid-sandbox']}
with Browser('chrome', **options) as browser:
    browser.visit('http://www.google.com')