Can using Selenium WebDriver for automated web crawling be dangerous?

I'd like to crawl a set of random websites received from a URL generator, using Selenium's ChromeDriver with Crawljax to do static code analysis on the captured DOM states.
Is this potentially unsafe for the machine doing the crawling?
My concern is that one of the randomly generated sites is malicious and that execution of JavaScript from ChromeDriver (which is used to capture the new DOM states) infects the machine running the test somehow. Should I be running this in some kind of sandboxed environment?
--edit--
If it matters, the crawler is implemented entirely in Java.

Simple answer: no. Only if you're afraid of cookies, and even if you are, your machine isn't.

It's hard to call this fully secure; bear in mind that nothing on a network is absolutely secure. A Chrome RCE was published recently; details:
SSD Advisory – Chrome Turbofan Remote Code Execution – SecuriTeam Blogs
This may also affect the Chrome instance driven by Selenium's ChromeDriver.
You can harden your system, though. For example, switch your firewall to whitelist mode and allow only your crawler process and Selenium to reach the internet on ports 80 and 443.
Then, even if your system is compromised through an RCE, the malicious code still can't reach the internet unless it injects itself into your crawler process (which I think is very hard to do from JavaScript in a browser RCE).
Another option: install a HIPS. If your crawler tries to do anything other than crawl web pages (such as starting another process) or reads/writes unrelated files, you will be alerted and can decide what to do.
In my opinion, running the crawl in a VM, tightening the firewall (Windows Firewall or Linux iptables), and shutting down unneeded Windows services is enough.
In short, it's difficult to balance security and convenience, and you should never assume your system is unbreakable.
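The port restriction suggested above can also be mirrored inside the crawler itself, as an extra layer on top of the firewall and VM rather than a replacement for them. The following is a minimal sketch, assuming the generated URLs arrive as plain strings; the class name CrawlTargetFilter and the method isAllowed are hypothetical and not part of Selenium or Crawljax. It rejects anything that is not http on port 80 or https on port 443 before the URL is ever handed to ChromeDriver.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

/** Hypothetical pre-crawl filter; the names are illustrative, not part of Selenium or Crawljax. */
final class CrawlTargetFilter {

    /** Returns true only for http on port 80 or https on port 443 with a non-empty host. */
    static boolean isAllowed(String url) {
        final URI uri;
        try {
            uri = new URI(url);
        } catch (URISyntaxException e) {
            return false; // unparsable input is rejected outright
        }
        String scheme = uri.getScheme();
        String host = uri.getHost();
        if (scheme == null || host == null || host.isEmpty()) {
            return false; // relative, opaque, or host-less URIs (file:, javascript:, ...)
        }
        scheme = scheme.toLowerCase(Locale.ROOT);
        int port = uri.getPort(); // -1 means no explicit port was given
        if (scheme.equals("http")) {
            return port == -1 || port == 80;
        }
        if (scheme.equals("https")) {
            return port == -1 || port == 443;
        }
        return false; // any other scheme is outside the allowlist
    }

    public static void main(String[] args) {
        String[] samples = {
            "https://example.com/page",
            "http://example.com:8080/admin",
            "file:///etc/passwd"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + isAllowed(s));
        }
    }
}
```

A filter like this only limits what the crawler asks the browser to open; redirects and sub-resources loaded by the page are still governed by the firewall rules.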

Related

Load Testing with Selenium? What are the alternatives for my situation

Currently I'm trying to run a load test which walks through a uniquely created URL. I know JMeter is often used for load testing, but I was specifically asked to do it with something like Selenium, which uses real browsers to create the URL, open it, and complete the steps within it. I have created a Selenium script that can easily do this, but I need to run it 100 times concurrently and can't find a good way to do so.
Is there a way to do this? I've looked into Selenium Grid but I'm not sure if I even have enough nodes to run 100 browsers concurrently. Please if you have recommendations for software or a different method of doing this I would love to hear it. Thank you!

JMeter can be integrated with Selenium using the WebDriver Sampler, so you can reuse your code and rely on JMeter's multithreading capabilities.
If one machine isn't powerful enough to launch 100 browsers, consider Distributed Testing.
In general, be aware that browsers don't do any magic: they send HTTP requests, wait for responses, and render them. The one thing JMeter cannot do is render the page, but if you need to load test the backend, it can closely mimic a browser's network footprint, provided you configure JMeter to behave like a real browser.
JavaScript execution time and page rendering speed can be checked using either a single WebDriver Sampler or a separate tool such as Lighthouse.
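If you would rather drive the concurrency directly from Java instead of through JMeter, a fixed thread pool is one way to start N sessions at once. This is only a sketch under assumptions: the class ConcurrentSessions and the method runAll are hypothetical names, and the Callable stands in for your existing Selenium script (a real one would create a WebDriver, walk the URL, and quit the driver). Whether one machine can actually sustain 100 real browsers is a separate question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical runner; the scenario passed in is a placeholder for a real Selenium script. */
final class ConcurrentSessions {

    /** Runs `sessions` copies of `scenario` in parallel and returns how many finished without throwing. */
    static int runAll(int sessions, Callable<Void> scenario) {
        ExecutorService pool = Executors.newFixedThreadPool(sessions);
        int succeeded = 0;
        try {
            List<Callable<Void>> tasks = new ArrayList<>();
            for (int i = 0; i < sessions; i++) {
                tasks.add(scenario);
            }
            // invokeAll blocks until every task has completed or failed.
            for (Future<Void> result : pool.invokeAll(tasks)) {
                try {
                    result.get();
                    succeeded++;
                } catch (ExecutionException e) {
                    // This session threw; only successes are counted.
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        } finally {
            pool.shutdown();
        }
        return succeeded;
    }

    public static void main(String[] args) {
        // Placeholder scenario: sleeps briefly instead of driving a browser.
        Callable<Void> scenario = () -> {
            Thread.sleep(50);
            return null;
        };
        System.out.println("completed sessions: " + runAll(100, scenario));
    }
}
```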

Security Considerations - ChromeDriver - Webdriver for Chrome

I was wondering if anyone has more information on the specific risks of using ChromeDriver that this statement refers to:
"If possible, run ChromeDriver with a test account that has no access to sensitive local or network data. ChromeDriver should never be run with a privileged account."
I'd like to know what the specific risks are when using a privileged account and what, if any, preventative measures can be taken against them.
Thank you in advance!

How Google Chrome Browser Works
In the article Chrome Browser Security, Stephanie Crawford mentions that Google has leveraged its power as a search engine to create its Safe Browsing technology, which automatically warns you if Chrome detects that a site you're visiting contains malware or phishing.
Chrome complements this with a security feature called sandboxing. Sandboxing means isolating each process in its own restricted space. Chrome handles its workload as a series of multiple processes rather than as one large browser process. Each time you open a web page, Chrome launches one or more new processes to run the scripts on that page. Each Chrome extension and app also runs in its own process. Chrome implements sandboxing through this multi-process architecture.
The security advantage of sandboxing comes from Chrome being able to control the access token of each process. The access token gives a process access to important information about your system, such as its files and registry keys. Chrome intercepts the access token of each process launched from the browser and modifies it to limit that access. In this way, sandboxing helps block web pages that try to install malware, capture your personal information, or obtain data from your hard drive.
The drawback of sandboxing is that it can't catch everything. A sandboxed process might still be able to access less secure file systems. It may also fail to protect registry keys and files managed by third-party software, such as a game or chat program that isn't native to the system.
WebDriver driven Chrome
When launching a WebDriver-controlled Chrome browsing context with Selenium, we have recently been recommending a certain command-line argument:
--no-sandbox: Disables the sandbox for all process types that are normally sandboxed.
See:
WebDriverException: unknown error: DevToolsActivePort file doesn't exist while trying to initiate Chrome Browser
How to configure ChromeDriver to initiate Chrome browser in Headless mode through Selenium?
unknown error: session deleted because of page crash from unknown error: cannot determine loading status from tab crashed with ChromeDriver Selenium
No Sandbox
There are several more sandbox-related flags. One of them, --allow-no-sandbox-job, lets sandboxed processes run without a job object assigned to them. This flag is required to allow Chrome to run in RemoteApps or Citrix. It can reduce the security of the sandboxed processes and allow them to make certain API calls, such as shutting down Windows or accessing the clipboard. We also lose the chance to kill some processes until the outer job that owns them finishes.
--allow-no-sandbox-job: Disables usage of sandbox job.
--allow-sandbox-debugging: Allows debugging of sandboxed processes.
--disable-gpu-sandbox: Disables the GPU process sandbox.
--disable-namespace-sandbox: Disables usage of the namespace sandbox.
--disable-seccomp-filter-sandbox: Disables the seccomp filter sandbox (seccomp-bpf) (Linux only).
--disable-setuid-sandbox: Disables the setuid sandbox (Linux only).
--disable-win32k-lockdown: Disables the Win32K process mitigation policy for child processes.
--enable-audio-service-sandbox: Enables the audio service sandbox.
--gpu-sandbox-allow-sysv-shm: Allows shmat() system call in the GPU sandbox.
--gpu-sandbox-failures-fatal: Makes GPU sandbox failures fatal.
--no-sandbox-and-elevated: Disables the sandbox and gives the process elevated privileges (Windows only).
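Because switches like these are often copied into a test setup from troubleshooting answers, a crawler that visits untrusted sites may want to check its argument list before launching Chrome. The sketch below is one hedged way to do that: the class SandboxFlagAudit and the method findWeakening are hypothetical names, and the input is assumed to be the same list of strings you would pass to ChromeOptions.addArguments. The flagged set is limited to the switches above that disable or weaken a sandbox layer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical pre-launch check; not part of Selenium or ChromeDriver. */
final class SandboxFlagAudit {

    // Switches from the list above that disable or weaken a sandbox layer.
    private static final List<String> WEAKENING = Arrays.asList(
        "--no-sandbox",
        "--no-sandbox-and-elevated",
        "--allow-no-sandbox-job",
        "--allow-sandbox-debugging",
        "--disable-gpu-sandbox",
        "--disable-namespace-sandbox",
        "--disable-seccomp-filter-sandbox",
        "--disable-setuid-sandbox",
        "--disable-win32k-lockdown"
    );

    /** Returns the arguments that match a weakening switch, ignoring any "=value" suffix. */
    static List<String> findWeakening(List<String> chromeArgs) {
        List<String> hits = new ArrayList<>();
        for (String arg : chromeArgs) {
            String name = arg.trim();
            int eq = name.indexOf('=');
            if (eq >= 0) {
                name = name.substring(0, eq);
            }
            if (WEAKENING.contains(name)) {
                hits.add(arg);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> planned = Arrays.asList("--headless", "--no-sandbox", "--window-size=1280,800");
        List<String> hits = findWeakening(planned);
        if (!hits.isEmpty()) {
            System.out.println("Refusing to crawl untrusted sites with: " + hits);
        }
    }
}
```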
Sandbox
Sandbox leverages the OS-provided security to allow code execution that cannot make persistent changes to the computer or access information that is confidential. The architecture and exact assurances that the sandbox provides are dependent on the operating system.
Windows implementation principles:
Do not re-invent the wheel: It is tempting to extend the OS kernel with a better security model. Don't. Let the operating system apply its security to the objects it controls. On the other hand, it is OK to create application-level objects (abstractions) that have a custom security model.
Principle of least privilege: This should be applied both to the sandboxed code and to the code that controls the sandbox. In other words, the sandbox should work even if the user cannot elevate to super-user.
Assume sandboxed code is malicious code: For threat-modeling purposes, we consider the sandbox compromised (that is, running malicious code) once the execution path reaches past a few early calls in the main() function. In practice, it could happen as soon as the first external input is accepted, or right before the main loop is entered.
Be nimble: Non-malicious code does not try to access resources it cannot obtain. In this case the sandbox should impose near-zero performance impact. It's ok to have performance penalties for exceptional cases when a sensitive resource needs to be touched once in a controlled manner. This is usually the case if the OS security is used properly.
Emulation is not security: Emulation and virtual machine solutions do not by themselves provide security. The sandbox should not rely on code emulation, code translation, or patching to provide security.
Linux implementation
macOS implementation

Targeting different platforms (Browser, OS) using WebDriver?

I am a newbie to automated testing with WebDriver, so I have a few questions to clear some things up in my head. On a few pages, I saw samples that run WebDriver tests on different platforms simply by setting the browser or OS in the capabilities.
capability= DesiredCapabilities.firefox();
capability.setBrowserName("firefox");
capability.setPlatform(org.openqa.selenium.Platform.ANY);
or
capability= DesiredCapabilities.internetExplorer();
capability.setBrowserName("iexplore");
capability.setPlatform(org.openqa.selenium.Platform.WINDOWS);
As mentioned in:
Executing tests Concurrently on different OS and Browsers with WebDriver using Java and TestNG
So, if I understand correctly, it is actually possible to run and verify tests on different OSes and browsers just by using the libraries provided by Selenium?
If so, how accurate are these tests for typical cross browser/platform html/JavaScript issues?
Thank you

This is a great question. I'm going to try to break this down into smaller packets of information so that it hopefully makes sense for old pros and newbies alike.
Without Selenium Grid:
For starters, it is possible to use individual drivers for all the different browser/OS combinations you wish to run the tests on. The drawback is you have to make some (though usually minimal) code adjustments for each browser's driver. This also means breaking the DRY principle. To learn more about writing these kinds of tests check out this documentation. (Also note that if you wanted to run these tests on each build via CI on something like Jenkins you need to have the actual browsers running on a slave on your own hardware, but these are more the DevOps concerns.)
Using Selenium Grid:
More commonly used for the sort of goals you mentioned (and referenced in the other post you linked to), Selenium Grid is a server that allows multiple instances of tests to run in different web browsers on remote machines. The more intro oriented docs for this are here and more forward looking docs are here.
Running Local or in the Cloud:
With Selenium Grid you are going to go one of two ways.
Run on your own hardware locally (or wherever your company has machines to remote into)
Use an online service like Sauce Labs or Testing Bot
A nice "what this might look like in Java" for having an online service provide the browsers is shown in this Sauce Labs page and for Testing Bot here.
Selenium Can be Written in a Ton of Languages:
Selenium follows the WebDriver API, so you can write test scripts in C#, Java, Perl, PHP, Python, Ruby, JavaScript (Node), or other languages (bindings for some of these are official, while others are community driven) and still have the tests run solidly in all modern browsers.
Concerning Mobile Devices
There is some good discussion over here about how "close to the real thing" you want your mobile browser tests to be, since the iPhoneDriver and AndroidDriver are largely based on WebView, which is less close to the real thing. They are now being replaced by ios-driver, Selendroid, and Appium.
To Sum It Up
So to answer what I think you’re getting at with,
... is possible to run tests and verify those on the different OS and Browsers just
by using the libraries provided by the Selenium
the answer is that you can use Selenium Grid and an online service or you will have to use base Selenium/Selenium Server along with a number of other libraries to test all modern browser and OS combinations, but I'm sure many shops do just that because they have the experience and expertise to pull it off.
Alternate (Non-Selenium) Option to Write Once and Test Across Browsers:
If you have a team with JavaScript experience and you're looking to hit the same goal of testing across browsers without the overhead of Selenium, Automates JavaScript Unit Testing with Sauce Labs (formerly Browser Swarm) would be a good option.

Does anyone know browser emulators?

For over two years I have tested web applications with the Selenium framework. I know the best setup is to run the tests in a VM.
The only downside is that testing is very slow. Why?
Each browser only gets so much memory if you run several instances.
The site can be very slow.
Connections can be very slow.
It would be great if there were a framework that emulated the browser (engine/core) correctly and provided an API for navigating pages and getting results.
I don't mean simulating just one browser in different versions (like IE). I mean simulating all popular browsers in their newest versions.
Does anyone know a framework/tool that can do this?
Thank you.

You can try PhantomJS for example.
From their page:
PhantomJS is a headless WebKit with JavaScript API. It has fast and
native support for various web standards: DOM handling, CSS selector,
JSON, Canvas, and SVG.
You can use it in combination with Jasmine (as well as several other frameworks) for testing.
However, the selection of available engines is limited to WebKit. I doubt that Selenium will be easy to replace. By the way, it looks like Selenium will probably become a W3C standard over the next few years.

You can also run Selenium with Xvfb. I use it to execute tests on a remote server, and it works very well.

How to stress test simulating heavy load using Selenium

I have a system to test, which is a video ads distribution technology. I need to play every video for about 1-2 minutes so that the ads are served. The videos are played in a Flash client and streamed as FLV streams, as on YouTube.
The reason I need to test it only via browsers (no other method will work) is to stress test both the video streaming servers and the ad servers simultaneously while ads are displayed in real time.
I have used Selenium, WatiN, Automation Anywhere, and many other automation tools. However, when I try to start about 10,000 browsers on my machine (32GB RAM, 16-core CPU), none of them is able to do the job.
With Selenium I have been able to start the most Firefox instances so far, but that's still too few: half of the instances don't run the test.
Any suggestions for doing this with Selenium?

You aren't going to run 10,000 browsers on your machine. That would give roughly 3.2MB of physical memory per browser instance, and I'm pretty sure Firefox just won't like that.
You could create a JMeter script that hits your server with many threads. It won't interact with the UI but would simulate the load of many clients hitting whatever URLs you tell it. I believe it also includes the ability to record a session and play it back for easy setup of your sessions.
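The memory estimate in this answer is simple division, and the same arithmetic can be turned around to ask how many browsers would fit. The sketch below treats 1 GB as 1000 MB to match the 3.2MB figure; the 300 MB per-browser footprint used in main is purely an assumed number for illustration, not a measurement, and the class and method names are hypothetical.

```java
/** Back-of-the-envelope sizing; the 300 MB footprint used in main is an assumed figure. */
final class BrowserMemoryBudget {

    /** Memory available per instance in MB, treating 1 GB as 1000 MB. */
    static double megabytesPerInstance(double totalGigabytes, int instances) {
        return totalGigabytes * 1000.0 / instances;
    }

    /** How many instances fit if each needs megabytesEach MB (RAM only; ignores the OS, CPU, and bandwidth). */
    static long maxInstances(double totalGigabytes, double megabytesEach) {
        return (long) Math.floor(totalGigabytes * 1000.0 / megabytesEach);
    }

    public static void main(String[] args) {
        System.out.println(megabytesPerInstance(32, 10_000) + " MB per browser at 10,000 instances");
        System.out.println(maxInstances(32, 300) + " browsers at an assumed 300 MB each");
    }
}
```

Even this optimistic count ignores CPU and bandwidth, which is why a protocol-level tool or a distributed grid is the usual recommendation at this scale.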

Selenium isn't really optimized for load/stress testing, especially if you're running your browsers locally. Running 1000+ browsers is going to choke even the beefiest server. Though RAM is an obvious bottleneck, you also have limited CPU resources and bandwidth. Bandwidth is a primary concern if you are loading videos.
Not to mention you'd be testing from a single IP with 10k browsers, so load balancing may not kick in properly, as well as the actual distribution of video ads to specific virtual users.
If you want to stick with existing Selenium tests, I've had good experiences with BrowserMob. They basically have a huge grid to do real browser load-testing, distributed across AWS.
Another recommendation would be an actual performance testing tool. I'd recommend Soasta CloudTest. They have a free version that runs 100 users so you can see if it will be a good fit for you. I have found that scripting for CloudTest is relatively simple.
Disclaimer: My experiences with both companies have been as a paying customer and I have never worked for either.

If you are using a Windows machine, in my experience there is a limit on the number of browser windows that can be open. When I last tested, the limit was somewhere between 100 and 150 browser windows.
I would recommend using a headless browser, which doesn't require opening a browser window. I think the latest version of Selenium has that capability. But since simulating 10,000+ user instances looks more like a load test, I would recommend a load testing tool such as JMeter or LoadRunner.

It looks to me that you are trying to verify what the client will see based on high traffic, no?
In that case, Joel is quite correct. If you absolutely have to see what the client sees, you could use threaded hits and just dump the results in a database. That'll show you anything the client would see anyway, and it's a lot easier to sort through than thousands of browser instances.
Either way, your client will not see errors if there are no errors present on the server side. If you're testing functionality in bandwidth-restricted, CPU-intensive, or memory-intensive environments, those conditions are much easier to achieve by other means than by running thousands of browser instances.

Your post smells of some form of ad-based fraud to me, but either way: have you considered using a web browser other than Firefox? PhantomJS is a headless WebKit-based browser that is compatible with Selenium. It supports all the core browser features like DOM handling, CSS selectors, JavaScript, and Canvas. I do not know if it supports Flash.
This post has a decent list of other headless and automatable web-browsers that you might consider.
Also, if each browser instance is instantiating a Flash plugin, don't neglect the possibility that the issue could be with Flash and not Firefox. Alternatively, why instantiate several different Firefox processes? Can you accomplish what you want through the use of tabs instead?

The in-house way to do this with Selenium is to use BrowserMob Proxy and multiple browser agents to recreate the experience of different users. Changing the IP is more difficult because it requires changing your home network.
Here is a good example
