Using webkit for headless browsing

I am using webkit-based tools to build a headless browser for crawling webpages (I need this because I want to evaluate the JavaScript found on pages and fetch the final rendered page). The problem is that both systems I have implemented so far exhibit very poor performance. Both use webkit as the backend:
Using Google Chrome: I start Google Chrome and communicate with each tab over the WebSocket that Chrome exposes for remote debugging (debugging over the wire). This way I can control each tab, load a new page, and once the page is loaded fetch the DOM of the rendered webpage (a rough sketch of this flow follows the next item).
Using PhantomJS: PhantomJS uses webkit to load pages and provides a headless browsing option. As shown in the PhantomJS examples, I use page.open to open a new URL and then fetch the DOM once the page is loaded by evaluating JavaScript on the page.
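For illustration, a minimal sketch of the Chrome remote-debugging flow described above, assuming Chrome was started with --remote-debugging-port=9222 and that the requests and websocket-client Python packages are installed; the tab selection and load-waiting logic are deliberately simplified:

    import json
    import requests                          # assumed installed
    from websocket import create_connection  # from the websocket-client package (assumed installed)

    # /json lists the open tabs; each entry exposes a webSocketDebuggerUrl.
    tabs = requests.get("http://localhost:9222/json").json()
    ws = create_connection(tabs[0]["webSocketDebuggerUrl"])  # just picks the first tab

    msg_id = 0
    def send(method, params=None):
        """Send one DevTools protocol command over the WebSocket."""
        global msg_id
        msg_id += 1
        ws.send(json.dumps({"id": msg_id, "method": method, "params": params or {}}))
        return msg_id

    send("Page.enable")
    send("Page.navigate", {"url": "http://example.com"})

    # Crude wait: read protocol messages until the page's load event fires.
    while json.loads(ws.recv()).get("method") != "Page.loadEventFired":
        pass

    # Evaluate JavaScript in the tab to pull back the rendered DOM.
    eval_id = send("Runtime.evaluate",
                   {"expression": "document.documentElement.outerHTML",
                    "returnByValue": True})
    while True:
        reply = json.loads(ws.recv())
        if reply.get("id") == eval_id:
            print(reply["result"]["result"]["value"][:200])
            break

    ws.close()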
My goal is to crawl pages as fast as I can and if the page does not load in the first 10 seconds, declare it failed and move on. I understand that each page takes a while to load, so to increase the number of pages I load per second, I open many tabs in Chrome or start multiple parallel processes using phantomjs. The following is the performance that I observe:
If I open more than 20 tabs in Chrome / 20 phantomjs instances, the CPU usage rockets up.
Due to the high CPU usage, many pages take more than 10 seconds to load, so I see a high failure rate (~80% of page load requests fail).
If I want to keep failures below 5% of total requests, I cannot load more than about 1 URL per second.
After trying both webkit-based systems, it feels like the performance bottleneck is the webkit rendering engine itself, so I would like to hear from other users how many URLs per second I can reasonably expect to crawl. My hardware configuration is:
Processor: Intel® Core™ i7-2635QM (1 processor, 4 cores)
Graphics card: AMD Radeon HD 6490M (256MB)
Memory: 4GB
Network bandwidth is sufficient to load pages faster than the rates I am currently observing.
The question I am trying to ask here is: does anyone have experience using webkit to crawl a random set of URLs (say, 10k URLs picked from the Twitter stream)? How many URLs can I reasonably expect to crawl per second?
Thanks

This question is actually more related to hardware than software, but let me point you in some better directions anyway.
First, understand that each page is itself spawning multiple threads. It will download the page, and then start spawning off new download threads for elements on the page such as javascript files, css files and images. [Ref: http://blog.marcchung.com/2008/09/05/chromes-process-model-explained.html ]
So depending on how the page is structured, you could end up with a fair number of threads running at the same time for a single page; add on top of that the fact that you're trying to do too many loads at once, and you have a problem.
The Stack Overflow thread at Optimal number of threads per core gives further information on the situation you are experiencing. You're overloading your CPU.
Your processor has 4 physical / 8 logical cores. I would recommend spawning no more than 4 connections at one time, leaving the remaining logical cores to handle some of the threading. You may find that you need to reduce this number further, but 4 is a good starting point. By rendering pages 4 at a time instead of overloading your whole system trying to render 20, you will actually increase your overall speed, since you end up with far less cache swapping. Start by timing your runs against a few easily measured sites, then try with fewer and more connections; there will be a sweet spot. Note that PhantomJS running headless is likely going to be better for you, as in that mode it probably won't download the images (a plus).
Your best overall option here, though, is to do partial page rendering yourself using the webkit source at http://www.webkit.org/, since all you seem to need to render is the HTML and the JavaScript. This reduces your number of connections and lets you control your threads with far greater efficiency. In that case you could create an event queue and spool all your primary URLs into it. Spawn off 4 worker threads that all work off the queue; as they process a page and need to download further resources, they can add those downloads back onto the queue. Once all files for a particular URL are downloaded into memory (or to disk if you're worried about RAM), you can add an item to the event queue to render the page and then parse it for whatever you need. A rough sketch of the worker-queue idea follows.
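A minimal sketch of that worker-queue idea in Python, using only the standard library; render_page() is a hypothetical placeholder for whatever WebKit-based rendering backend you wire up:

    import queue
    import threading

    NUM_WORKERS = 4          # matches the "no more than 4 at a time" advice above
    TIMEOUT_SECONDS = 10     # per-page budget from the original question

    url_queue = queue.Queue()
    results = {}

    def render_page(url, timeout):
        """Hypothetical placeholder: drive whatever WebKit backend you chose
        (a Chrome tab, a PhantomJS instance, or your own WebKit build) and
        return the rendered DOM, raising on failure or timeout."""
        raise NotImplementedError

    def worker():
        while True:
            url = url_queue.get()
            if url is None:              # sentinel: no more work for this thread
                url_queue.task_done()
                break
            try:
                results[url] = render_page(url, TIMEOUT_SECONDS)
            except Exception:
                results[url] = None      # count as a failed load and move on
            finally:
                url_queue.task_done()

    threads = [threading.Thread(target=worker, daemon=True) for _ in range(NUM_WORKERS)]
    for t in threads:
        t.start()

    for url in ["http://example.com", "http://example.org"]:   # spool the primary URLs
        url_queue.put(url)
    for _ in threads:
        url_queue.put(None)                                    # one sentinel per worker

    url_queue.join()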

It depends on what data you are trying to parse. If you only care about the JavaScript and the HTML, then Hypertext Query Language would offer an immense speedup, http://htql.net/ , or you can look at setting up something in the cloud, such as http://watirmelon.com/2011/08/29/running-your-watir-webdriver-tests-in-the-cloud-for-free/

Related

Setting Disk Cache size in Selenium, while webscraping multiple websites? [closed]

From the available information I understand that setting the disk cache size in Selenium helps pages load faster when we are scraping or otherwise working against a single website. But my question is: what good does setting the disk cache size do when dealing with multiple websites?
Or is it in fact bad to set the disk cache size when scraping multiple websites, i.e. could it give the websites a way to detect that we are scraping?
Disk cache is cache memory used to speed up storing and accessing data on the host machine's hard disk. It enables faster processing during reading/writing, issuing commands, and other I/O operations between the hard disk, the memory, and other computing components. A disk cache is also referred to as a disk buffer or cache buffer.
Chromium disk cache
The disk cache stores resources fetched from the web so that they can be accessed quickly at a later time if needed. The main characteristics are:
The cache should not grow unbounded so there must be an algorithm for deciding when to remove old entries.
While it is not critical to lose some data from the cache, having to discard the whole cache should be minimized. The current design should be able to gracefully handle application crashes, no matter what is going on at that time, only discarding the resources that were open at that time. However, if the whole computer crashes while we are updating the cache, everything on the cache probably will be discarded.
Access to previously stored data should be reasonably efficient, and it should be possible to use synchronous or asynchronous operations.
We should be able to avoid conflicts that prevent us from storing two given resources simultaneously. In other words, the design should avoid cache trashing.
It should be possible to remove a given entry from the cache, and keep working with a given entry while at the same time making it inaccessible to other requests (as if it was never stored).
The cache should not be using explicit multithread synchronization because it will always be called from the same thread. However, callbacks should avoid reentrancy problems so they must be issued through the thread's message loop.
Conclusion
To conclude, by default google-chrome is configured with a default disk cache size, which users can adjust to suit their respective use cases.
Changing Chrome Cache size on Windows 10
One straightforward method to set and limit Google Chrome's cache size is via a command-line switch on its shortcut.
Launch Google Chrome.
Right-click on the icon for Google Chrome on the taskbar and again right-click on the entry labeled as Google Chrome.
Now click on Properties. It will open the Google Chrome Properties window.
Navigate to the tab labeled as Shortcut.
In the field called Target, type in the following after the whole address:
--disk-cache-size=<size in bytes>
As an example, to configure it as --disk-cache-size=2147483648:
"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe" --disk-cache-size=2147483648
Here 2147483648 is the size of the cache in bytes which is equal to 2 Gigabytes.
Click on Apply and then click on OK for the limit to be set.
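If you are driving Chrome through Selenium rather than a desktop shortcut, the same switch can be passed programmatically. A minimal sketch, assuming Python with Selenium and ChromeDriver available on the PATH; the cache directory path is just an example:

    from selenium import webdriver

    options = webdriver.ChromeOptions()
    # Same switch as in the shortcut above: cap the cache at 2147483648 bytes (2 GB).
    options.add_argument("--disk-cache-size=2147483648")
    # Optionally keep the cache in a dedicated directory (example path).
    options.add_argument(r"--disk-cache-dir=C:\chrome-scrape-cache")

    driver = webdriver.Chrome(options=options)   # older Selenium versions use chrome_options=
    driver.get("http://example.com")
    print(driver.title)
    driver.quit()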

How to reduce PhantomJS's CPU and memory usage?

I'm using PhantomJS via Python's webdriver lib. It eats lots of RAM and CPU, which is an issue because I'd like to run as many instances as possible.
Some google'ing didn't give me anything helpful. So I'll ask directly:
Does the window size matter? If I set driver.set_window_size(1280, 1024), will it eat more memory than 1024x768?
Is there any option in the source code that can be turned off without real issues and that leads to a significant reduction in memory usage? I still need images, CSS and JS to load and be applied, but I could give up some other features... For example, I can turn off caching (and load all media files every time). I do need to make it faster and less greedy, and I'm ready to re-compile it... Any ideas here?
Thanks a lot!
I assume you call phantomjs once for every rendering job, which creates a new phantomjs process every time. You could try batching as many jobs as you can into one js script and calling phantomjs once for the whole batch.
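Since the question uses PhantomJS through Python's webdriver bindings, one way to apply that batching advice without writing a dedicated PhantomJS script is to reuse a single driver for a whole batch of URLs instead of spawning one process per page. A rough sketch, assuming the (now deprecated) PhantomJS driver in Selenium 2/3; the switches shown are standard PhantomJS command-line options:

    from selenium import webdriver

    urls = ["http://example.com", "http://example.org", "http://example.net"]

    # One PhantomJS process for the whole batch; the switches are normal
    # PhantomJS command-line options passed through service_args.
    driver = webdriver.PhantomJS(service_args=["--disk-cache=false",
                                               "--ignore-ssl-errors=true"])
    driver.set_window_size(1024, 768)        # smaller viewport: slightly less to paint
    driver.set_page_load_timeout(30)

    try:
        for url in urls:
            try:
                driver.get(url)
                html = driver.page_source    # rendered DOM after JavaScript has run
                print(url, len(html))
            except Exception as exc:
                print(url, "failed:", exc)
    finally:
        driver.quit()                        # otherwise the phantomjs process lingers

Keeping one process alive amortizes PhantomJS's startup cost across the whole batch instead of paying it for every page.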

Ways to reduce CPU usage while using PhantomJS/CasperJS

I am using PhantomJS and CasperJS for screen scraping. The issue I am facing is that they use a lot of CPU, which makes me worry that this approach might not be scalable. Are there any ways to reduce the CPU usage? Some that I can think of are:
1) Disable image loading
2) Disable js loading
Also, I want to know whether Python is lighter (in terms of CPU usage) than PhantomJS for scraping purposes.
Why CasperJS / PhantomJS only? Are you scraping websites that load content with JavaScript? Any tool that doesn't run a full webkit browser will be more lightweight than one that does.
As mentioned in the comments, you can use wget or curl on linux systems to dump webpages to files / stdout. There are many libraries that can handle & parse raw HTML such as cheerio for NodeJS.
Still want some form of scripting? Because you mentioned Python, there is a tool called Mechanize that does just that without running webkit. It's not as powerful as Casper / Phantom, but it allows you to do a lot of the same things (filling out forms, clicking links, etc.) with a much smaller footprint. A short sketch follows.
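A minimal sketch of that Mechanize approach, assuming the mechanize package is installed; the URL and form field name are hypothetical, and note that no JavaScript is executed, which is exactly why the footprint is so small:

    import mechanize

    br = mechanize.Browser()
    br.set_handle_robots(False)    # assumption: ignoring robots.txt is acceptable for your use case
    br.addheaders = [("User-agent", "Mozilla/5.0")]

    resp = br.open("http://example.com/search")   # hypothetical URL
    html = resp.read()                            # raw HTML only; no JavaScript runs

    # Fill in and submit the first form on the page (field name "q" is hypothetical).
    br.select_form(nr=0)
    br["q"] = "headless browsing"
    result = br.submit()
    print(result.geturl(), len(result.read()))

    # Links can be followed by their text, much like clicking them:
    # br.follow_link(text="Next")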
After 5 and a half years I don't think you are having this issue anymore, but if anyone else stumbles across this problem, here's the solution.
After finishing scraping, quit the browser by calling browser.quit(), where browser is the name of the webdriver variable you created.

w3wp.exe consuming large amount of RAM

I noticed the other day that all the w3wp.exe processes running on my server seem to be consuming way more RAM than I would expect. So I created a test web application with a single aspx page and a vanilla global.asax (default methods, no additional code). I then deployed that site to IIS6 with a target framework of 4.0, built in release mode with debug set to "false". The site is also set up under its own application pool. I then used iisapp.vbs to figure out which w3wp.exe this test site was running under. I was surprised to see that the single site with 1 page was using almost 40 MB of RAM.
40 MB of RAM seems like a lot for a single-page website. Is this normal, and if so, why? Is there anything I can do to reduce the memory footprint?
I also noticed that each time the default page was refreshed, the w3wp.exe process grew a little bit more. Is IIS6 caching the same page over and over?
40 MB for a single AppDomain (irrespective of how many pages you have) is alright. Consider that .NET is a managed environment and the app pool contains a large part of the logic and heap for serving requests. In many cases it will also be less eager to free up RAM if the RAM is not under contention (i.e. demanded by other parts of the system).
You can share app pools amongst different websites, and the "overhead" of the sandbox becomes less proportionate to your perception of what is acceptable. However, I don't think this is bad at all.
You're trying to optimize the case where you have no load and no clients. That really isn't sensible. Optimize for a realistic use case.
You don't build a factory and then try to get it to make one item as efficiently as possible. You try to get it to crank them out efficiently by the dozens.
This is not really an issue, since it is quite normal: the .NET Framework is being loaded into the worker process, hence the size. When a real issue happens, you will have to use post-production or profiling tools to get it fixed. Here are some articles on memory and post-production debugging in case you need them in the future.
http://msdn.microsoft.com/en-us/magazine/cc188781.aspx
http://aspalliance.com/1350_Post_Production_Debugging_for_ASPNET_Applications__Part_1.all
http://blogs.msdn.com/b/tess/archive/2006/09/06/net-memory-usage-a-restaurant-analogy.aspx
http://blogs.msdn.com/b/tess/archive/2008/09/12/asp-net-memory-issues-high-memory-usage-with-ajaxpro.aspx

How To Simulate Lower CPU Processor Machines For Browser Testing

We have some users who are on machines with less CPU power, and they're encountering slow response times using our web application. Is there any way for me to do testing so that I can simulate lower CPU speeds?
For example, I have 2.3 Ghz computing power, can I lower it to 1.6 Ghz or lower so that I may be able to test it?
BTW, our customers are using Windows. I have to simulate low computing power with Internet Explorer as the browser.
The multiplier on most new CPUs can easily be lowered (Intel: SpeedStep, AMD: PowerNow!). This is normally used to save power. With RMClock you can manually adjust your multiplier and thus lower your frequency and make your PC slower. I use this tool myself, so I can tell you that it works.
http://cpu.rightmark.org/products/rmclock.shtml
The virtual machine Bochs (pronounced "boxes") allows you to set an instructions-per-second directive. It's probably the slowest emulator out there as it is, though...
Create some virtual machines.
You can use VirtualPC or VirtualBox both are free.
I would recommend starting something in the background that eats up your processor cycles, e.g. a program that searches for prime numbers or something similar. A throwaway sketch follows.
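A throwaway sketch of such a CPU burner in Python; the prime search is deliberately wasteful, and one process per core is just a starting point:

    import multiprocessing

    def burn_cpu():
        """Busy-loop forever doing trial-division prime search; exists only to eat cycles."""
        primes = []
        n = 2
        while True:
            if all(n % p for p in primes):
                primes.append(n)
            n += 1

    if __name__ == "__main__":
        # One burner per core; stop the script (Ctrl+C) to free the CPU again.
        workers = [multiprocessing.Process(target=burn_cpu)
                   for _ in range(multiprocessing.cpu_count())]
        for w in workers:
            w.start()
        for w in workers:
            w.join()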
Another slight option in addition to those above is to boot Windows in a lower-resource configuration. Go to the Start menu, select Run and type MSCONFIG. On the Boot tab, click Advanced Options and limit the memory and number of processors. It's not as robust as the options above, but it does give you another choice.
Lowering the CPU clock doesn't always give expected results.
Newer CPUs feature architecture improvements which make them more efficient on an equivalent clock basis than older chips. Incidentally, because of this, virtual machines are a bad way of testing performance for "older" tech as well.
Your best bet is to simply buy a couple of older machines, with RAM (type and amount), processor, motherboard chipset, hard drive, and video card similar to what your clients use. All of these feed into the total performance of the machine itself.
I bring the other components up because changing just one of them can have an impact even on browser performance. A prime example is memory: if your clients are constrained to something like 512 MB of RAM, their machines could be hitting the hard drive heavily for virtual-memory swapping, even just to run the browser. In that situation, downgrading your processor's clock speed while still keeping your (assumed) 2 GB of RAM would still not perform anywhere near the same, even if everything else were equal.
Isak Savo's answer works, but can be a bit finicky, as the modern tpl is going to try to limit CPU load as much as possible. When I tested it out, it was hard (though possible with some tweaking) to consistently get the CPU usage levels I wanted.
Then I remembered http://www.cpukiller.com/, which does this already. Highly recommended. As an aside, I found this utility while playing old 90s games on modern machines; back then frame rates were pegged to the CPU clock, which makes those games run way too fast on modern computers. Great utility.
Another big difference between high-performance and low-performance CPUs is the number of cores available. This can realistically differ by a factor of 4, way more than the difference in clock frequency you're likely to encounter.
You can address this by setting the thread affinity. Even IE6 will use 13 threads just to show google.com, which means it benefits from a multi-core CPU. But if you set the thread affinity to one core only, all 13 IE threads will have to share that one core. A short scripted sketch follows.
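If you would rather script that than click through Task Manager, here is a minimal sketch using the psutil package (assumed installed); the process name is an example, and changing another process's affinity may require administrator rights:

    import psutil   # assumed installed

    # Pin every running iexplore.exe process (example name) to core 0 only.
    for proc in psutil.process_iter(["name"]):
        if (proc.info["name"] or "").lower() == "iexplore.exe":
            proc.cpu_affinity([0])        # all of its threads now share a single core
            print("pinned PID", proc.pid)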
I understand that this question is pretty old, but here are some recipes I personally use (not only for web development):
BES. I'm getting some weird results while using it.
Go to Control Panel\All Control Panel Items\Power Options\Edit Plan Settings\Change Advanced Power Settings, then go to the "Processor" section and set its maximum state to 5% (or something else). It works only if your processor supports dynamic multiplier changes and the ACPI driver is installed correctly.
Run Task Manager and set the processor affinity to a single core (or whatever number of cores you want) for your browser's (or any other) process. Not ideal for browsers, because JavaScript implementations are usually single-threaded, but, as far as I can see, modern browsers actually DO use multiple cores.
There are a few different methods to accomplish this.
If you're using VirtualBox, go into the Settings for the VM you want to slow the CPU speed for. Go to System > Processor, then set the Execution Cap. The percentage controls how slow it will go: lower values are slower relative to the regular speed. In practice, I've noticed the results to be choppy, although it does technically work.
It is also possible to set the CPU speed for the whole system. In the Windows 10 Settings app, go to System > Power & Sleep. Then click Additional Power Settings on the right hand side. Go to Change Plan Settings for the currently selected plan, then click Change Advanced Power Plan Settings. Scroll down to Processor Power Management and set the Maximum Processor State. Again, this is a percentage. Although this does work, I find that in practice, it doesn't have a big impact even when the percentage is set very low.
If you're dealing with a videogame that uses DirectX or OpenGL and doesn't have a framerate cap, another common method is to force Vsync on in your graphics driver settings. This will usually slow the rendering to about 60 FPS which may be enough to play at a reasonable rate. However, it will only work for applications using 3D hardware rendering specifically.
Finally: if you'd rather not use a VM, and don't want to change a system global setting, but would rather simulate an old CPU for one specific process only, then I have my own program to do that called Old CPU Simulator.
The main brain of the operation is a command line tool written in C++, but there is also a GUI wrapper written in C#. The GUI requires .NET Framework 4.0. The default settings should be fine in most cases - just select the CPU you'd like to simulate under Target Rate, then hit New and browse for the program you'd like to run.
https://github.com/tomysshadow/OldCPUSimulator (click the Releases tab on the right for binaries.)
The concept is to suspend and resume the process at a precise rate; because it happens so quickly, the process will appear to just be running slowly. For example, by suspending a process for 3 milliseconds, then resuming it for 1 millisecond, it will appear to be running at 25% speed. By controlling the ratio of time suspended vs. time resumed, it is possible to simulate different speeds. This is completely API agnostic (it doesn't hook DirectX, OpenGL, etc.; it'll work with a command line program if you want). A rough sketch of the idea follows.
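As an illustration of the concept only (not how Old CPU Simulator itself is implemented), here is a rough Python sketch using psutil's suspend()/resume() with the 3 ms / 1 ms duty cycle from the 25% example; plain time.sleep() will not reach the precision that timeBeginPeriod gives the real tool:

    import sys
    import time
    import psutil   # assumed installed; wraps the platform's suspend/resume calls

    SUSPENDED_MS = 3.0   # per the example above: suspended for 3 ms...
    RESUMED_MS = 1.0     # ...then allowed to run for 1 ms => roughly 25% speed

    def throttle(pid):
        proc = psutil.Process(pid)
        try:
            while proc.is_running():
                proc.suspend()
                time.sleep(SUSPENDED_MS / 1000.0)
                proc.resume()
                time.sleep(RESUMED_MS / 1000.0)
        except psutil.NoSuchProcess:
            pass                         # target exited; nothing left to throttle
        finally:
            if proc.is_running():
                proc.resume()            # never leave the target frozen

    if __name__ == "__main__":
        throttle(int(sys.argv[1]))       # usage: python throttle.py <pid>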
Old CPU Simulator does not ask for a percentage, but rather, the clock speed to simulate (which it calls the Target Rate.) It then automatically determines, based on your CPU's real clock speed, the percentage to use. Although clock speed is not the only factor that has improved computer performance over time (there are also SSDs, faster GPUs, more RAM, multithreaded performance, etc.) it's a good enough approximation to get fairly consistent results across machines given the same Target Rate. It also supports other options that may help with consistency, such as setting the process affinity to one.
It implements three different methods of suspending and resuming a process and will use the best available: NtSuspendProcess, NtQuerySystemInformation, or Toolhelp Snapshots. It also uses timeBeginPeriod and timeEndPeriod to achieve high precision timing without busy looping. Note that this is not an emulator; the binary still runs natively. If you like, you can view the source to see how it's implemented - it's not a large project. On my machine, Old CPU Simulator uses less than 1% CPU and less than 1 MB of memory, so the program itself is quite efficient (unlike running intensive programs to intentionally slow the CPU.)