I am trying to scrape a website that contains images using a headless Selenium.
Initially, the website populates 50 images. If you scroll down more and more images are loaded.
Windows 7 x64
python 2.7
recent install of selenium
[1] Non-Headless
Navigating to the website with selenium as follows:
from selenium import webdriver
browser = webdriver.Firefox()
browser.get(url)
browser.execute_script('window.scrollBy(0, 10000)')
browser.page_source
This works (if anyone has a better suggestion please let me know).
I can continue to scrollBy() until I reach the end and then pull the source page.
[2] Headless with HTMLUNIT
from selenium import webdriver
driver = webdriver.Remote(desired_capabilities=webdriver.DesiredCapabilities.HTMLUNIT)
driver.get(url)
I cannot use scrollBy() in this headless environment.
Any suggestions on how to scrape this kind of page?
Thanks
One option is to study the JavaScript to see how it calculates what to load next. Then implement that logic in your scraping client instead. Once you have done that, you can use faster scraping tools like Perl's WWW::Mechanize.
You need to enable JavaScript explicitly when using the HtmlUnit Driver:
driver.setJavascriptEnabled(true);
According to [http://code.google.com/p/selenium/wiki/HtmlUnitDriver](the docs), it should emulate IE's JavaScript handling by default.
When I tried the same method, I got error messages that selenium crashed while connecting java to simulate javascript.
I wrote the script into execute_script method then the code works well.
I guess the communication between selenium and java server part is not configured properly.
Enabling the javascript with HTMLUNITDRIVERWITHJS is possible and quick ;)
Related
There are many selenium webdriver binding package of Golang.
However, I don't want to control browser throught server.
How can I control browser with Golang and selenium without selenium server?
You can try github.com/fedesog/webdriver which says in its documentation:
This is a pure go library and doesn't require a running Selenium driver.
I would characterize the Selenium webdriver as a client rather than a server. Caveat: I have used the Selenium webdriver (Chrome version) from .Net and I am assuming it is similar for Go.
The way Selenium works is that you will launch an instance of it from within code, and it creates a live version of the selected browser (i.e. Chrome) and your program retains control over it. Then you write code to tell the browser to navigate to a page, inspect the response, and interact with the browser by filling out form data, clicking on buttons, etc. You can see what is happening on the browser as the code runs, so it is easy to troubleshoot when the interaction doesn't go as planned.
I have used Selenium to upload tens of thousands of records to a website that has no API and only a graphical user interface. Give it a chance.
I'm trying to make a web crawler that click on ads (yes, i know), it's very sophisticated, but, I realise that Google Ads aren't showed when javascript is disabled. Today, i use Mechanize, and it doesn't "accept" javasript.
I heard selenium use another system to crawl the net.
The only thing I want to do is access my page, and click on the ad (generated by javascript).
Can Selenium do it ?
Selenium is a browser automation tool. You can basically automate everything you can do in your browser. Start with going through the Getting Started section of the documentation.
Example:
from selenium import webdriver
driver = webdriver.Firefox()
driver.get("http://www.python.org")
print driver.title
driver.close()
Besides automating common browsers, like Chrome, Firefox, Safari or Internet Explorer, you can also use PhantomJS headless browser.
I have successfully tested applications in FireFox using Selenium WebDriver, but one of our applications runs in a custom browser made using QTWebKit. Is it possible to use WebDriver to automate testing in a custom browser like this?
There is a webdriver for QtWebkit here: https://github.com/cisco-open-source/qtwebdriver
It also can automate qtwidget and QML aplications.
You should be able to use it to automate you custom browser, provided you preserved the QtWebkit APIs.
If QTWebKit has released a WebDriver executable for the browser, then yes.
Otherwise - no.
I need to build a Python scraper to scrape data from a website where content is only displayed after a user clicks a link bound with a Javascript onclick function, and the page is not reloaded. I've looked into Selenium in order to do this and played around with it a bit, and it seems Selenium opens a new Firefox web browser everytime I instantiate a driver:
>>> driver = webdriver.Firefox()
Is this open browser required, or is there a way to get rid of it? I'm asking because the scraper is potentially part of a web app, and I'm afraid if multiple users start using it, I will have a bunch of browser windows open on my server.
Yes, selenium automates web browsers.
You can add this at the bottom of your python code to make sure the browser is closed at the end:
driver.quit()
I'm not sure I quite understand the difference. WebDriver API also directly controls the browser of choice. When should you use selenium remote control (selenium RC) instead ?
Right now, my current situation is I am testing a web application by writing a suite with Selenium WebDriver API and letting it run on my computer. The tests are taking longer and longer to complete, so I have been searching for ways to run the tests on a Linux server.
If I use Selenium Remote Control, does this mean I have to rewrite everything I wrote with WebDriver API?
I am getting confused with Selenium Grid, Hudson, Selenium RC. I found a Selenium Grid plugin for Hudson, but not sure if this includes Selenium RC.
Am I taking the correct route? I envision the following architecture:
Hudson running on few Ubuntu dedicated servers.
Hudson running with Xvnc & Selenium Grid plugin. (Do I need to install Firefox separately ?)
Selenium grid running selenium RC test suites.
I think this is far more time efficient than running test on my current working desktop computer with WebDriver API.
WebDriver is now Selenium 2. The Selenium and WebDriver code bases are being merged. WebDriver gets over a number of issues that Selenium has and Selenium gets over a number of issues that Webdriver has.
If you have written your tests in Selenium one you don't have to rewrite them to work with Selenium 2. We, the core developers, have written it so that you create a browser instance and inject that into Selenium and your Selenium 1 tests will work in Selenium 2. I have put an example below for you.
// You may use any WebDriver implementation. Firefox is used here as an example
WebDriver driver = new FirefoxDriver();
// A "base url", used by selenium to resolve relative URLs
String baseUrl = "http://www.google.com";
// Create the Selenium implementation
Selenium selenium = new WebDriverBackedSelenium(driver, baseUrl);
// Perform actions with selenium
selenium.open("http://www.google.com");
selenium.type("name=q", "cheese");
selenium.click("name=btnG");
Selenium 2 unfortunately has not been put into Selenium 2 but it shouldn't be too long until it has been added since we are hoping to reach beta in the next couple of months.
As far as I understand, Webdriver implementation started little later than Selenium RC. From my point of view, WebDriver is more flexible solution, which fixed some annoying problems of SeleniumRC.
WebDriver provides standard interface for testing web GUI. There are several implementations of this interface (HTTP, browser-specific and based on Selenium). Since you already have some WebDriver tests, you must be familiar with basic docs like this
The tests are getting longer and longer to complete, so I have been searching for ways to run the tests on a linux server.
Did you try to find actual bottlenecks? I'm not sure, that elimination of WebDriver layer will help. I think, most time is spent on Selenium commands sending and HTTP requests to system-under-test.
If I use sleneium remote control, does
this mean I have to rewrite everything
I wrote with WebDriver API ?
Generally, yes. If you did not implement some additional layer between tests code and WebDriver.
As for Selenium Grid:
You may start several Selenium RC instances on several different [virtual] nodes, then register them in Selenium Grid. Your tests connect to Selenium Grid, and it redirects all commands to SeleniumRC instances, coordinating them in accordance with required browsers.
For details of hudson plugin you may find more info here