Restricting Selenium/Webdriver/HtmlUnit to a certain domain - selenium

While using Selenium/WebDriver for web scraping, I realized that the target site runs a Google Analytics script. Is there a way to restrict Selenium/WebDriver/HtmlUnit so that it avoids certain URLs/domains?
Thanks,

I don't think Selenium itself can do this, because Selenium is only an adapter over several browser implementations, so it cannot stop Firefox or Chrome from loading particular scripts. You may be able to accomplish it through the driver-specific API instead (a Firefox profile, or the HtmlUnit configuration).
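As one example of the driver-specific route for Chrome, here is a minimal sketch that builds a `--host-resolver-rules` command-line switch so the listed hosts resolve to the local machine instead of the real server. This is not from the answer above: the function name and the choice of 127.0.0.1 as the target are my own assumptions, and it only helps if nothing is listening on that address.

```python
def host_resolver_rules(blocked_hosts, target="127.0.0.1"):
    """Build a Chrome --host-resolver-rules switch for the given hosts.

    Each blocked host is mapped to `target` (an assumption: an address
    where no web server answers), so requests to it fail quickly.
    """
    rules = ", ".join(f"MAP {host} {target}" for host in blocked_hosts)
    return f"--host-resolver-rules={rules}"


# Example: keep the analytics host from being reached.
switch = host_resolver_rules(["www.google-analytics.com"])
print(switch)
```

With the Python bindings, the resulting string could be passed to Chrome through `ChromeOptions().add_argument(switch)` before creating the driver.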

Related

How can we automate a real browser instead of using a Selenium browser instance?

I am trying to scrape a website, but it does not load in Selenium. When I browse that website in my "real" Chrome browser, everything works fine. Is there any way I can use my real browser with Python to automate stuff, instead of using Selenium?
Thanks
Selenium does automate real browsers.
If the website is not loading via Selenium, check whether adding desired capabilities helps.
With these you can set a proxy, disable extensions, and so on. There are many options available:
https://chromedriver.chromium.org/capabilities
Also, if you can share what kind of error is displayed, that would be helpful.
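To make the capabilities idea concrete, here is a rough sketch that assembles a capabilities dictionary in the `goog:chromeOptions` shape described on the page linked above. The helper name and its parameters are hypothetical; `--disable-extensions` and `--proxy-server` are ordinary Chrome command-line switches.

```python
def chrome_capabilities(proxy=None, disable_extensions=True):
    """Return a capabilities dict with Chrome switches under goog:chromeOptions.

    `proxy` is an optional "host:port" string (hypothetical parameter).
    """
    args = []
    if disable_extensions:
        args.append("--disable-extensions")
    if proxy:
        args.append(f"--proxy-server={proxy}")
    return {"browserName": "chrome", "goog:chromeOptions": {"args": args}}


print(chrome_capabilities(proxy="127.0.0.1:8080"))
```

Whether any of these switches fixes the loading problem depends on the actual error, so treat this as a starting point for experiments rather than a fix.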

Can using Selenium WebDriver for automated web crawling be dangerous?

I'd like to crawl a set of random websites received from a URL generator, using Selenium's ChromeDriver with Crawljax to do static code analysis on the captured DOM states.
Is this potentially unsafe for the machine doing the crawling?
My concern is that one of the randomly generated sites is malicious and that execution of JavaScript from ChromeDriver (which is used to capture the new DOM states) infects the machine running the test somehow. Should I be running this in some kind of sandboxed environment?
--edit--
If it matters, the crawler is implemented entirely in Java.
Simple answer: no. Only if you're afraid of cookies, and even if you are, your machine isn't.
It is hard to call this fully secure; be aware that nothing on a network is absolutely secure. A Chrome RCE was published recently. Details:
SSD Advisory – Chrome Turbofan Remote Code Execution – SecuriTeam Blogs
This may also affect Selenium's ChromeDriver.
However, you can harden your system, for example by switching your firewall to whitelist mode and allowing only your Python script and Selenium to access the internet on ports 80 and 443.
Even if your system is compromised through an RCE, the malicious code still cannot access the internet unless it injects itself into your Python process (which I think is very hard to do from JavaScript in a browser RCE).
Another option: install a HIPS. If your Python script tries to do anything other than crawl web pages (such as starting another process), or to read or write other files, you will be notified and can decide what to do.
In my opinion, you should run the crawl in a VM, tighten the firewall (Windows Firewall or Linux iptables), and shut down unneeded services on Windows. That is enough.
In short, it is difficult to find the balance between security and convenience, and you should not believe your system is unbreakable.
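As a small complement to the firewall rule (not a replacement for it, and not something the answer itself proposes), the crawler could also refuse generated URLs that fall outside plain HTTP/HTTPS on ports 80 and 443 before handing them to the driver. A minimal sketch, with a hypothetical function name:

```python
from urllib.parse import urlsplit

ALLOWED_PORTS = {80, 443}
DEFAULT_PORTS = {"http": 80, "https": 443}


def is_crawlable(url):
    """Return True only for http(s) URLs whose effective port is 80 or 443."""
    try:
        parts = urlsplit(url)
        port = parts.port  # raises ValueError for malformed or out-of-range ports
    except ValueError:
        return False
    if parts.scheme not in DEFAULT_PORTS or not parts.hostname:
        return False
    effective_port = port if port is not None else DEFAULT_PORTS[parts.scheme]
    return effective_port in ALLOWED_PORTS


for candidate in ["https://example.com/page", "http://example.com:8080/", "file:///etc/passwd"]:
    print(candidate, is_crawlable(candidate))
```

This only filters the URLs you request yourself; pages can still trigger other requests, which is why the firewall and VM advice above still matters.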

Selenium RemoteWebDriver and Windows Authentication Dialogs

I've seen this question has been asked a few times, and lots of solutions get suggested - but none of them seem to work for the RemoteWebDriver (ie: using Selenium Grid). They're usually centered around using the local ChromeDriver/FirefoxDriver/IEDriver classes.
I am using the .NET bindings, by the way :).
What I want to do is fairly simple (in terms of requirement). I have a Selenium Server setup, and am currently using the RemoteWebDriver to perform automated UI tests on various sites. This setup is working fine.
However, some sites use NTLM/Windows Authentication, and we need to start writing automated tests for these. As far as I can tell, there is no solution for this.
I have seen the following "solutions", but - unless someone can correct me - they either don't work consistently, or will not work using RemoteWebDriver:
Using the IAlert functionality (like here). However, this isn't implemented in the .NET bindings, and doesn't work for all browsers as far as I can tell.
Using the Robot API to interact with the popup (like here). But this is for running on your local machine, and not supported by RemoteWebDriver.
Using AutoIt to do a similar thing to the Robot API. However, this won't work using RemoteWebDriver.
Passing the credentials in the URL (eg: http://username:password@example.com). However, this doesn't work for Windows Authentication - just normal HTTP Basic Authentication.
I can't actually see any other solutions, unless anyone else can help?
A workaround currently is to log onto the Selenium server, go to the sites in each browser, and save the credentials. But this isn't ideal, and adds a level of manual interaction to each test.
Any help would be appreciated :).
It appears I have found my own solution - use a proxy which adds the NTLM negotiation/authorisation automatically. Pretty simple to set up :).
http://cntlm.sourceforge.net/
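For anyone wiring this up, here is a rough sketch of the W3C `proxy` capability that would point the remote browser at such a proxy. It is written in Python only for brevity (the question uses the .NET bindings), the helper name is hypothetical, and it assumes the proxy listens on 127.0.0.1:3128 as seen from the Grid node, which you should check against your own cntlm configuration.

```python
def ntlm_proxy_capability(host="127.0.0.1", port=3128, no_proxy=()):
    """Build a W3C WebDriver 'proxy' capability for a manually configured proxy.

    host/port are assumptions about where the NTLM proxy listens,
    as seen from the machine that actually runs the browser.
    """
    address = f"{host}:{port}"
    proxy = {"proxyType": "manual", "httpProxy": address, "sslProxy": address}
    if no_proxy:
        proxy["noProxy"] = list(no_proxy)  # hosts that should bypass the proxy
    return {"proxy": proxy}


print(ntlm_proxy_capability(no_proxy=["localhost"]))
```

The returned dictionary would be merged into the capabilities sent to the Selenium server when the remote session is created.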

How to specify Selenium Webdriver to use current browser settings for internet

Using Selenium's WebDriver, with PhantomJSDriver, I am trying to do headless browser testing. It works fine when connected to internet WITHOUT a proxy. But when the connection to internet is via an authenticated proxy, it fails. I want to deploy this program to multiple user sites, which might be connected to internet with or without proxy, and in case of proxy, it might be authenticated or unauthenticated.
Is there a way to tell Selenium Webdriver to use the "current" browser's internet connection settings? Please note I am using phantomjs.
Thanks,
abbas
There is a simpler but very effective solution that I used when battling similar issues a year ago.
Do you have these issues when using other *Drivers? If not, my proposal is to use any other *Driver implementation that works fine and, after it passes authentication, just cast it to PhantomJSDriver. Please note that this is only a possible workaround, and only if your test framework hierarchy is built to support such an action.
Also consider that when I used this kind of polymorphism, the difference was speed. Between FirefoxDriver and PhantomJSDriver it wasn't much of a pain, and if you use the other driver only for authentication it will not slow you down noticeably.
I'm not sure that I can help you with my solution, but it will not hurt to try it.

Does HtmlUnit create browser instances on the machine where it is running?

I am using HtmlUnit for web scraping: logging in to a website on behalf of the users, setting something in their profile, and then coming back.
I am using just pure HtmlUnit, with no Selenium framework.
Now my question:
WebClient webClient = new WebClient(BrowserVersion.INTERNET_EXPLORER_11);
Does this statement create a browser instance on the machine where I am executing the code, or what does it do?
I am using BrowserVersion.INTERNET_EXPLORER_11 as this is an accepted browser at that website.
How is Selenium different from HtmlUnit? I know we can use HtmlUnit as a WebDriver in Selenium. Does Selenium need a native browser instance on the machine where the code is executed? Does Selenium create browser instances?
My use case: multiple users will be accessing this application. I know WebClient in HtmlUnit is not thread-safe (so I have to define it as a Spring prototype bean).
Are there any suggestions regarding this?
Any help is greatly appreciated.
HTMLUnit is a headless browser. So no window will be created if used with Selenium either. Setting the BrowserVersion will just tell HTMLUnit to present itself to the server as if it were a given browser (AFAIK, it will just change the User-Agent but might perform additional internal processing depending on the version). I guess this answers most of the questions but the last one.
As for suggestions on how to implement this, I would try to avoid logging in to a website that way. If the website does not provide an API for this, then it is likely against the Terms of Service. Assuming it is not, you will have to create a new WebClient instance for each user each time the data needs to be extracted from the other site.
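The "new instance per user, per extraction" pattern can be sketched as follows. This is Python with a stub class standing in for a non-thread-safe client such as HtmlUnit's WebClient; `StubClient`, `update_profile`, and the helper functions are all hypothetical names, and the real Java code would construct and close a WebClient at the same points.

```python
from concurrent.futures import ThreadPoolExecutor


class StubClient:
    """Hypothetical stand-in for a non-thread-safe client."""

    def __init__(self):
        self.closed = False

    def update_profile(self, user):
        return f"updated profile for {user}"

    def close(self):
        self.closed = True


def run_for_user(user, client_factory=StubClient):
    # A fresh client per call, so no instance is ever shared between threads.
    client = client_factory()
    try:
        return client.update_profile(user)
    finally:
        client.close()  # always release the client, even on failure


def run_for_users(users, max_workers=4):
    # Each worker thread calls run_for_user, which builds its own client.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_for_user, users))


print(run_for_users(["alice", "bob"]))
```

The point of the design is that the client's lifetime is bounded by a single job, which is roughly what a prototype-scoped bean gives you as long as it is not cached in a shared singleton.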