I'm trying to make a web crawler that clicks on ads (yes, I know). It's fairly sophisticated, but I realized that Google Ads aren't shown when JavaScript is disabled. Today I use Mechanize, and it doesn't support JavaScript.
I've heard that Selenium uses a different approach to crawl the web.
The only thing I want to do is access my page and click on the ad (which is generated by JavaScript).
Can Selenium do it?
Selenium is a browser automation tool. You can automate essentially anything you can do in your browser. Start by going through the Getting Started section of the documentation.
Example:
from selenium import webdriver
driver = webdriver.Firefox()
driver.get("http://www.python.org")
print(driver.title)
driver.close()
Besides automating common browsers like Chrome, Firefox, Safari, or Internet Explorer, you can also use the PhantomJS headless browser.
Related
With Robot Framework and its SeleniumLibrary, I need to open Firefox specifically, go to a repository on GitLab, and download a certain file. Please don't question the tool choice: I was asked to do this with Robot on Firefox, and I have to do it with Robot on Firefox. Nothing crazy on the surface, actually, but I found out that GitLab runs a "check" on the browser, and apparently Selenium gets stuck there.
I've searched for solutions, but they all apply to Selenium with Java, Python, etc., and most of them are about Chrome. Only a handful of unclear ones mention Firefox, and none cover Robot Framework with SeleniumLibrary. I've tried to adapt some of them, such as the following:
SeleniumLibrary.Open Browser browser=firefox
... ff_profile_dir=set_preference("dom.webdriver.enabled","false");set_preference("useAutomationExtension","false")
SeleniumLibrary.Go To ${UrlGitLab}
But it doesn't work; it's still stuck on the page:
Checking your browser before accessing gitlab.com.
This process is automatic. Your browser will redirect to your
requested content shortly.
Please allow up to 5 seconds…
It looks like the page tries to reload a couple of times, but it never gets any further.
Is there a solution, or a workaround that doesn't completely ditch Robot + Selenium with Firefox?
How do I launch a browser with all the user data (history, cookies, etc.) in Python Selenium WebDriver?
You can load an existing browser profile when opening a Selenium WebDriver. For Chrome see here; for Firefox see similar solutions.
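For Firefox, one common approach is to point `webdriver.FirefoxProfile` at your existing profile directory. Firefox lists its profiles in a `profiles.ini` file (under `~/.mozilla/firefox` on Linux), which you can parse with the standard library to locate that directory. This is a sketch; the helper name is my own, not a Selenium API:

```python
import configparser
from pathlib import Path

def firefox_profile_dirs(ini_path):
    """Return the profile directories listed in Firefox's profiles.ini.

    Each [ProfileN] section has a Path entry, which is relative to the
    ini file's own directory when IsRelative=1.
    """
    ini_path = Path(ini_path)
    config = configparser.ConfigParser()
    config.read(ini_path)
    dirs = []
    for section in config.sections():
        if section.startswith("Profile"):
            path = config[section]["Path"]
            if config[section].get("IsRelative", "0") == "1":
                path = str(ini_path.parent / path)
            dirs.append(path)
    return dirs

# With a profile directory in hand, you would hand it to Selenium, e.g.:
#   profile = webdriver.FirefoxProfile(profile_dir)
#   driver = webdriver.Firefox(firefox_profile=profile)
```

Note that the driver may copy the profile rather than use it in place, so changes made during the session are not necessarily written back.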
How to get the "reader mode" version of a web page using Selenium and chromedriver?
You can enable "reader mode" by opening chrome://flags/#enable-reader-mode in Google Chrome. Then you can toggle "reader mode" while browsing a webpage.
Not really the answer you probably want, but you can enable the "reader mode" feature by using ChromeOptions:
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument("--enable-features=ReaderMode")
driver = webdriver.Chrome(options=options)
driver.get(url)
Then you have to toggle it somehow. The Selenium documentation is quite clear that it is not for testing browser functionality (see https://stackoverflow.com/a/49801432/839338), so you are left with using another automation tool like AutoIt (https://www.autoitscript.com/site/) or Sikuli (http://doc.sikuli.org/) to find and click the "reader mode" menu item. I'm not sure how you would go about that using Scrapy.
I need to build a Python scraper to scrape data from a website where content is only displayed after a user clicks a link bound to a JavaScript onclick function, and the page is not reloaded. I've looked into Selenium for this and played around with it a bit, and it seems Selenium opens a new Firefox browser window every time I instantiate a driver:
>>> driver = webdriver.Firefox()
Is this open browser required, or is there a way to get rid of it? I'm asking because the scraper may become part of a web app, and I'm afraid that if multiple users start using it, I'll end up with a bunch of browser windows open on my server.
Yes, Selenium automates web browsers.
You can add this at the end of your Python code to make sure the browser is closed when you're done:
driver.quit()
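To guarantee the window is closed even when the scrape raises an exception partway through, you can wrap the driver in a small context manager. This is a generic sketch; `managed_driver` is my own helper name, not a Selenium API:

```python
from contextlib import contextmanager

@contextmanager
def managed_driver(driver):
    # Yield the driver to the caller and call quit() no matter what,
    # so stray browser windows don't pile up on the server.
    try:
        yield driver
    finally:
        driver.quit()

# Usage:
#   with managed_driver(webdriver.Firefox()) as driver:
#       driver.get("http://example.com")
```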
I am trying to scrape a website that contains images, using headless Selenium.
Initially, the website shows 50 images. As you scroll down, more and more images are loaded.
Windows 7 x64
python 2.7
recent install of selenium
[1] Non-Headless
Navigating to the website with selenium as follows:
from selenium import webdriver
browser = webdriver.Firefox()
browser.get(url)
browser.execute_script('window.scrollBy(0, 10000)')
browser.page_source
This works (if anyone has a better suggestion please let me know).
I can continue to scrollBy() until I reach the end and then pull the source page.
[2] Headless with HTMLUNIT
from selenium import webdriver
driver = webdriver.Remote(desired_capabilities=webdriver.DesiredCapabilities.HTMLUNIT)
driver.get(url)
I cannot use scrollBy() in this headless environment.
Any suggestions on how to scrape this kind of page?
Thanks
One option is to study the JavaScript to see how it decides what to load next, then implement that logic in your scraping client instead. Once you have done that, you can use faster scraping tools such as Perl's WWW::Mechanize.
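For example, many such galleries fetch each batch of images from an XHR endpoint that takes offset/limit parameters; once you find the real endpoint in your browser's network tab, you can call it directly with no browser at all. The endpoint and parameter names below are hypothetical placeholders:

```python
from urllib.parse import urlencode

def batch_url(base, batch, per_batch=50):
    """Build the URL the page's JavaScript would request for a given batch.

    The "offset"/"limit" parameter names are hypothetical; inspect the
    site's actual XHR requests to find the real ones.
    """
    query = urlencode({"offset": batch * per_batch, "limit": per_batch})
    return f"{base}?{query}"
```

Each URL can then be fetched with urllib or requests and parsed (often as JSON), which is far faster than driving a browser.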
You need to enable JavaScript explicitly when using the HtmlUnit driver (this is its Java API):
driver.setJavascriptEnabled(true);
According to [the docs](http://code.google.com/p/selenium/wiki/HtmlUnitDriver), it should emulate IE's JavaScript handling by default.
When I tried the same method, I got errors: Selenium crashed while connecting to the Java side that simulates JavaScript.
When I moved the script into the execute_script method instead, the code worked fine.
I guess the communication between Selenium and the Java server side was not configured properly.
Enabling JavaScript with DesiredCapabilities.HTMLUNITWITHJS is possible and quick ;)