Selenium WebDriver/Browser with Python

I need to build a Python scraper to scrape data from a website where content is only displayed after a user clicks a link bound to a JavaScript onclick function, and the page is not reloaded. I've looked into Selenium in order to do this and played around with it a bit, and it seems Selenium opens a new Firefox browser every time I instantiate a driver:
>>> driver = webdriver.Firefox()
Is this open browser required, or is there a way to get rid of it? I'm asking because the scraper is potentially part of a web app, and I'm afraid if multiple users start using it, I will have a bunch of browser windows open on my server.

Yes — Selenium automates web browsers, so the open browser is required.
You can add this at the bottom of your Python code to make sure the browser is closed at the end:
driver.quit()
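If the scraper can raise exceptions partway through, a try/finally block (a standard Python pattern, nothing Selenium-specific) guarantees the browser is closed either way — a minimal sketch:
from selenium import webdriver

driver = webdriver.Firefox()
try:
    driver.get("http://example.com")  # your scraping logic goes here
finally:
    driver.quit()  # closes the browser even if the scrape fails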


How can we automate a real browser instead of using a Selenium browser instance

I am trying to scrape a website, but it is not loading in Selenium. When I browse that website in my "real" Chrome browser, everything works fine. Is there any way I can use my real browser with Python to automate stuff, instead of using Selenium?
Thanks
Using Selenium we can automate real browsers.
If the website is not loading via Selenium, you can check whether adding desired capabilities helps.
These let you set a proxy, disable extensions, and so on; there are many options available.
https://chromedriver.chromium.org/capabilities
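For example, a minimal sketch using Chrome options (the proxy address is a placeholder, and depending on your Selenium version the keyword argument is options or chrome_options):
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--disable-extensions")
options.add_argument("--proxy-server=http://127.0.0.1:8080")  # placeholder proxy
driver = webdriver.Chrome(options=options)
driver.get("http://example.com")  # the site that fails to load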
Also, if you can share what kind of error is displayed, that would be helpful.

Selenium with headless chrome fails to get url when switching tabs

I'm currently running Selenium with Specflow.
One of my tests clicks on a button which triggers the download of a pdf file.
That file is automatically opened in a new tab where the test then grabs the url and downloads the referenced file directly to the selenium project.
This whole process works perfectly when chrome driver is run normally but fails on a headless browser with the following error:
The HTTP request to the remote WebDriver server for URL http://localhost:59658/session/c72cd9679ae5f713a6c857b80c3515e4/url timed out after 60 seconds. -> The request was aborted: The operation has timed out.
This error occurs when attempting to run driver.Url
driver.Url calls work elsewhere in the code. It only fails after the headless browser switches tabs. (Yes, I am switching windows using the driver)
For reference, I cannot get this url without clicking the button on the first page and switching tabs as the url is auto-generated after the button is clicked.
I believe you are only passing the "--headless" argument. You should set a window size too: with an inappropriate window size, the headless browser may not detect the elements you are looking for. Try the code below, or just add the window-size line.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options  # needed for Options

chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--window-size=1920,1080")  # width,height
driver = webdriver.Chrome(options=chrome_options)
Don't forget to add any other arguments you need to the options.

Selenium Golang binding without server

There are many Selenium WebDriver binding packages for Go.
However, I don't want to control the browser through a server.
How can I control a browser with Go and Selenium without a Selenium server?
You can try github.com/fedesog/webdriver which says in its documentation:
This is a pure go library and doesn't require a running Selenium driver.
I would characterize the Selenium webdriver as a client rather than a server. Caveat: I have used the Selenium webdriver (Chrome version) from .Net and I am assuming it is similar for Go.
The way Selenium works is that you launch an instance of it from within code, and it creates a live version of the selected browser (e.g. Chrome) which your program retains control over. Then you write code to tell the browser to navigate to a page, inspect the response, and interact with the browser by filling out form data, clicking buttons, etc. You can see what is happening in the browser as the code runs, so it is easy to troubleshoot when the interaction doesn't go as planned.
I have used Selenium to upload tens of thousands of records to a website that has no API and only a graphical user interface. Give it a chance.

Can Selenium interpret JavaScript on Mac?

I'm trying to make a web crawler that clicks on ads (yes, I know). It's very sophisticated, but I realise that Google Ads aren't shown when JavaScript is disabled. Today I use Mechanize, and it doesn't "accept" JavaScript.
I heard Selenium uses another system to crawl the net.
The only thing I want to do is access my page and click on the ad (generated by JavaScript).
Can Selenium do it?
Selenium is a browser automation tool. You can basically automate everything you can do in your browser. Start with going through the Getting Started section of the documentation.
Example:
from selenium import webdriver
driver = webdriver.Firefox()
driver.get("http://www.python.org")
print(driver.title)
driver.close()
Besides automating common browsers like Chrome, Firefox, Safari or Internet Explorer, you can also use the PhantomJS headless browser.
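For the specific case of clicking an element that only exists after JavaScript runs, a minimal sketch could look like this (the ".ad-link" CSS selector is a placeholder for your actual ad element):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("http://example.com")  # your page

# Wait until the JavaScript-generated element becomes clickable;
# ".ad-link" is a hypothetical selector.
ad = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.CSS_SELECTOR, ".ad-link"))
)
ad.click()
driver.quit()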

HTMLUNIT with Headless Selenium

I am trying to scrape a website that contains images using headless Selenium.
Initially, the website populates 50 images. If you scroll down, more and more images are loaded.
Windows 7 x64
python 2.7
recent install of selenium
[1] Non-Headless
Navigating to the website with selenium as follows:
from selenium import webdriver
browser = webdriver.Firefox()
browser.get(url)
browser.execute_script('window.scrollBy(0, 10000)')
browser.page_source
This works (if anyone has a better suggestion, please let me know).
I can continue to scrollBy() until I reach the end and then pull the page source, as sketched below.
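Continuing from the snippet above, a minimal sketch of that loop (the scrollHeight check is an assumption about how this site signals the end of its content):
import time

last_height = browser.execute_script('return document.body.scrollHeight')
while True:
    browser.execute_script('window.scrollBy(0, 10000)')
    time.sleep(2)  # give newly loaded images time to appear
    new_height = browser.execute_script('return document.body.scrollHeight')
    if new_height == last_height:  # no new content was added
        break
    last_height = new_height
html = browser.page_source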
[2] Headless with HTMLUNIT
from selenium import webdriver
driver = webdriver.Remote(desired_capabilities=webdriver.DesiredCapabilities.HTMLUNIT)
driver.get(url)
I cannot use scrollBy() in this headless environment.
Any suggestions on how to scrape this kind of page?
Thanks
One option is to study the JavaScript to see how it calculates what to load next. Then implement that logic in your scraping client instead. Once you have done that, you can use faster scraping tools like Perl's WWW::Mechanize.
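For example, if the browser's developer tools show the page fetching more images from a JSON endpoint, you could call that endpoint directly. Everything below (the URL, the parameters, the "items" key) is hypothetical and would come from inspecting the site's own requests:
import requests

images = []
for page in range(1, 5):
    resp = requests.get("http://example.com/api/images",  # hypothetical endpoint
                        params={"page": page, "per_page": 50})
    resp.raise_for_status()
    images.extend(resp.json()["items"])  # "items" key is an assumption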
You need to enable JavaScript explicitly when using the HtmlUnit Driver:
driver.setJavascriptEnabled(true);
According to the docs (http://code.google.com/p/selenium/wiki/HtmlUnitDriver), it should emulate IE's JavaScript handling by default.
When I tried the same method, I got error messages saying Selenium crashed while connecting to Java to simulate JavaScript.
When I moved the script into the execute_script method, the code worked well.
I guess the communication between Selenium and the Java server side is not configured properly.
Enabling JavaScript with DesiredCapabilities.HTMLUNITWITHJS is possible and quick ;)
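A sketch of that, assuming a Selenium server is running locally and that your (older) Selenium release still ships the HtmlUnit capabilities:
from selenium import webdriver

# HTMLUNITWITHJS is the HtmlUnit capability with JavaScript enabled.
driver = webdriver.Remote(
    command_executor='http://127.0.0.1:4444/wd/hub',
    desired_capabilities=webdriver.DesiredCapabilities.HTMLUNITWITHJS,
)
driver.get("http://example.com")  # the image page
driver.execute_script('window.scrollBy(0, 10000)')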