I am using PhantomJS and CasperJS for screen scraping. The issue I am facing is that it uses too much CPU, which makes me think it might not scale well. Are there any ways to reduce the CPU usage? Some that I can think of are:
1) Disable image loading
2) Disable js loading
Also, I want to know whether Python is lighter (in terms of CPU usage) than PhantomJS for scraping.
Why CasperJS / PhantomJS only? Are you scraping websites that load content with JavaScript? Any tool that doesn't run a full webkit browser will be more lightweight than one that does.
As mentioned in the comments, you can use wget or curl on linux systems to dump webpages to files / stdout. There are many libraries that can handle & parse raw HTML such as cheerio for NodeJS.
Still want some form of scripting? Because you mentioned python, there is a tool called Mechanize that does just that without running webkit. It's not as powerful as Casper / Phantom, but it allows you to do a lot of the same things (filling out forms, clicking links, etc) with a much smaller footprint.
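To illustrate the "no browser at all" route, here is a minimal sketch that parses raw HTML with Python's standard library only. It assumes the page has already been saved (for example with wget or curl); SAMPLE_HTML, LinkExtractor and extract_links are made-up names for this example, and it will not see content that is generated by JavaScript.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return all link targets found in the given HTML string."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Stand-in for a page dumped by wget/curl; a real run would read the saved file.
SAMPLE_HTML = (
    '<html><body><a href="/a">A</a><p>text</p>'
    '<a href="https://example.com/b">B</a></body></html>'
)

if __name__ == "__main__":
    print(extract_links(SAMPLE_HTML))
```

No rendering engine is started here, which is where the CPU saving comes from.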
After 5 and a half years I don't think you are having this issue anymore, but if anyone else stumbles across this problem, here's the solution.
After you finish scraping, quit the browser by calling browser.quit(), where browser is the name of the variable that holds your browser instance.
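One way to make sure quit() is always called, even when the scraping code raises, is to wrap the browser in a small context manager. This is only a sketch: quitting is a helper name made up here, and FakeBrowser is a hypothetical stand-in so the example runs without Selenium installed. With a real driver you would pass your own browser object instead.

```python
from contextlib import contextmanager


@contextmanager
def quitting(browser):
    """Yield the browser and always call browser.quit() afterwards."""
    try:
        yield browser
    finally:
        browser.quit()


class FakeBrowser:
    """Hypothetical stand-in for a real webdriver instance."""

    def __init__(self):
        self.quit_called = False

    def get(self, url):
        # Simulate a scrape that blows up halfway through.
        raise RuntimeError("simulated scraping failure for " + url)

    def quit(self):
        self.quit_called = True


def demo():
    """Return True if quit() ran even though the scrape failed."""
    browser = FakeBrowser()
    try:
        with quitting(browser) as b:
            b.get("http://example.com")
    except RuntimeError:
        pass
    return browser.quit_called


if __name__ == "__main__":
    print(demo())
```

Without this, a crashed script can leave the browser process behind, still holding on to memory and CPU.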
I've got no idea how virtual machines work. Would pyautogui code that moves the mouse to a certain pixel on screen still work in a virtual machine? I want the code that clicks on my screen to join a Google Meet call to be able to run without my PC being on.
Answering whether this will work or not is a non-trivial question.
Below are some of the gotchas I had to deal with when I was running pyautogui tests against a test VM.
If your VM isn't rendering a UI then pyautogui won't run correctly. Generally, to save resources a VM will only render the UI when it needs to display the UI.
If your tests were developed on a system with a different resolution than the VM there are a whole host of bugs that can be introduced. Changes in resolution can result in target images no longer being valid due to UI scaling and changes in layout. Any hard coded positions or calculated pixel offsets can also be broken if the resolution changes. Best practice is to develop your code in as close an environment to the one it will be run in as possible.
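As a rough illustration of the hard-coded position problem, the sketch below rescales a coordinate recorded at one resolution to another. scale_point is a helper name made up here. It assumes the UI scales uniformly with the screen, which is often not true (layouts reflow), and it does nothing for image matching, so treat it as a starting point only. On a real system the target size could come from pyautogui.size().

```python
def scale_point(x, y, dev_size, target_size):
    """Map a point recorded at dev_size onto a screen of target_size.

    Both sizes are (width, height) tuples in pixels.
    """
    dev_w, dev_h = dev_size
    target_w, target_h = target_size
    return round(x * target_w / dev_w), round(y * target_h / dev_h)


if __name__ == "__main__":
    # A click recorded at the centre of a 1920x1080 screen,
    # replayed on a VM running at 1280x720.
    print(scale_point(960, 540, (1920, 1080), (1280, 720)))
```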
I would consider pyautogui an automation solution of last resort. If you've exhausted other automation options, then the best way to know if it will work is to start small with a proof of concept against your environment. Then slowly expand automation capabilities as you work through the many quirks associated with its testing paradigm.
I have been making video tutorials for my friends on how to program. On my old computer I always ran SimpleScreenRecorder and it recorded fine. Recently I got a new computer and did a fresh install of Arch Linux on it, then set up the environment with everything I needed to make another video. I installed SimpleScreenRecorder using yaourt and started recording. I recorded a two-hour session without knowing that it was glitching out. The glitches do not appear on my screen while recording; they only show up in the final rendered video. I think it might be a rendering error, or I do not have the right codecs. After an hour or two of searching the web I could find no forum posts on the codec. I considered several things that could be wrong. FPS was my first guess, but recordings at 25 and even 50 fps still glitched. My next idea was that H.264 was the wrong codec, but searching turned up no solution for that either. Then I thought I might be encoding at too high a speed (23), but that proved wrong as well. So now I am stuck on how to find the answer.
Settings screenshot:
Video Link:
https://www.youtube.com/watch?v=zfyIZiJCDa4
The glitches are often related to the rendering backend of the window compositor you are using.
Solution 1 - Change the rendering backend of the window compositor
@thouliha reported having issues with compton. In my case I had glitches with OpenGL (2.0 and 3.1) and resolved the issue by switching to XRender for recording.
On KDE you can easily change the rendering backend of the window compositor in the settings.
Solution 2 - Change the Tearing Prevention method
To keep using OpenGL, for example for better performance, you can also tweak the tearing prevention method.
In my case, switching from Automatic to Never allowed me to record video with the OpenGL compositor without glitches.
Solution 3 - Intel iGPU specific issues
Intel iGPUs (Intel graphics) have rendering issues with some CPUs.
See the Troubleshooting section of the Arch Linux wiki for those.
Examples of features that can cause tearing or flickering:
SNA
VSYNC
Panel Self Refresh (PSR)
Also check /etc/X11/xorg.conf.d/20-intel.conf in case your system has tweaks in there.
I'm not exactly sure what you mean by glitching out, especially since the video is down now, but I've found that the video is choppy when using compton, so I had to turn that off.
I'm using PhantomJS via Python's webdriver lib. It eats lots of RAM and CPU, which is a problem because I'd like to run as many instances as possible.
Some Googling didn't give me anything helpful, so I'll ask directly:
Does the size matter? If I set driver.set_window_size(1280, 1024), will it eat more memory than 1024x768?
Is there any option in the source code that can be turned off without real issues and that would lead to a significant reduction in memory usage? Yes, I still need images, CSS and JS to be loaded and applied, but I can get rid of some other features. For example, I can turn off caching (and load all media files every time). Yes, I do need to speed it up and make it less greedy, and I'm ready to recompile it. Any ideas here?
Thanks a lot!
I assume you call phantomjs once for every rendering job. This creates a new phantomjs process every time. You could try batching as many jobs as you can into one JS script and calling phantomjs once for the whole batch.
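A rough sketch of the batching idea on the Python side: group the URLs and build one phantomjs command line per group rather than one per URL. render_batch.js is a hypothetical script name (it would have to read its URLs from the command-line arguments and render them in turn), and the batch size is an arbitrary assumption. The commands are only built here, not executed.

```python
def batches(items, size):
    """Yield consecutive slices of items, each with at most size elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_commands(urls, batch_size, script="render_batch.js"):
    """Return one phantomjs argument list per batch of URLs.

    script is a hypothetical PhantomJS script that processes every URL
    passed to it, so one process handles the whole batch.
    """
    return [["phantomjs", script, *batch] for batch in batches(urls, batch_size)]


if __name__ == "__main__":
    urls = ["http://example.com/%d" % i for i in range(5)]
    for command in build_commands(urls, 2):
        # Each list could be handed to subprocess.run(command).
        print(command)
```

Five URLs with a batch size of 2 give three process launches instead of five.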
I am using webkit-based tools to build a headless browser for crawling webpages (I need this because I would like to evaluate the javascript found on pages and fetch the final rendered page). But the two systems I have implemented so far both exhibit very poor performance. Both use webkit as the backend:
Using Google Chrome: I would start Google Chrome and communicate with each tab using webSockets exposed by Chrome for remote debugging (debugging over wire). This way I can control each tab, load a new page and once the page is loaded I fetch the DOM of the loaded webpage.
Using phantomjs: phantomjs uses webkit to load pages and provides a headless browsing option. As explained in the examples of phantomjs, I use page.open to open a new URL and then fetch the dom once the page is loaded by evaluating javascript on the page.
My goal is to crawl pages as fast as I can and if the page does not load in the first 10 seconds, declare it failed and move on. I understand that each page takes a while to load, so to increase the number of pages I load per second, I open many tabs in Chrome or start multiple parallel processes using phantomjs. The following is the performance that I observe:
If I open more than 20 tabs in Chrome / 20 phantomjs instances, the CPU usage rockets up.
Due to the high CPU usage, a lot of pages take more than 10 seconds to load, and hence I have a higher failure rate (~80% of page load requests failing).
If I intend to keep the fails to less than 5% of the total requests, I cannot load more than 1 URL per second.
After trying out both webkit-based systems, it feels like the performance bottleneck is the webkit rendering engine, so I would like to learn from other users here how many URLs per second I can expect to crawl. My hardware configuration is:
Processor: Intel® Core™ i7-2635QM (1 processor, 4 cores)
Graphics card: AMD Radeon HD 6490M (256MB)
Memory: 4GB
Network bandwidth is good enough to load pages faster than the rate I am observing
The question I am trying to ask this mailing list is: does anyone have experience using webkit to crawl web pages for a random set of URLs (say, picking 10k URLs from the Twitter stream), and how many URLs can I reasonably expect to crawl per second?
Thanks
This question is actually more related to hardware than to software, but let me point you in some better directions anyway.
First, understand that each page is itself spawning multiple threads. It will download the page, and then start spawning off new download threads for elements on the page such as javascript files, css files and images. [Ref: http://blog.marcchung.com/2008/09/05/chromes-process-model-explained.html ]
So depending on how the page is structured, you could end up with a fair number of threads running at the same time just for one page. Add on top of that the fact that you're trying to do too many loads at once, and you have a problem.
The Stack Overflow thread at Optimal number of threads per core gives further information on the situation you are experiencing. You're overloading your CPU.
Your processor has 4 physical and 8 logical cores. I would recommend spawning no more than 4 connections at one time, leaving the secondary logical cores to handle some of the threading. You may find that you need to reduce this number even further, but 4 is a good starting point. By rendering pages 4 at a time instead of overloading your whole system trying to render 20, you will actually increase your overall speed, since you end up with far less cache swapping. Start by timing your crawl against several easily timed locations. Then try with fewer and with more connections. There will be a sweet spot. Note that the headless browser version of PhantomJS is likely going to be better for you, as in headless mode it probably won't download the images (a plus).
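To make the "4 at a time" suggestion concrete, here is a minimal sketch using a fixed-size thread pool. fake_load is a hypothetical stand-in for whatever actually drives the browser (a Chrome tab or a phantomjs process); the real loader is assumed to enforce the 10-second limit itself and raise on timeout. The worker count is a parameter so you can time runs with fewer and more workers to find the sweet spot.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_load(url, timeout):
    """Hypothetical page loader: returns HTML or raises on timeout."""
    if "slow" in url:
        raise TimeoutError("%s did not load within %s seconds" % (url, timeout))
    return "<html>%s</html>" % url


def crawl(urls, load=fake_load, workers=4, timeout=10.0):
    """Load pages with at most `workers` loads in flight at once.

    Returns (results, failures): a dict of url -> page, and a list of
    URLs that failed or timed out.
    """
    results, failures = {}, []

    def attempt(url):
        # Catch errors inside the worker so one bad page
        # does not abort the whole crawl.
        try:
            return url, load(url, timeout), None
        except Exception as exc:
            return url, None, exc

    with ThreadPoolExecutor(max_workers=workers) as pool:
        for url, page, error in pool.map(attempt, urls):
            if error is None:
                results[url] = page
            else:
                failures.append(url)
    return results, failures


if __name__ == "__main__":
    pages, failed = crawl(
        ["http://a.example", "http://slow.example", "http://b.example"]
    )
    print(sorted(pages), failed)
```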
Your best overall option here, though, is to do a partial page rendering yourself using the webkit source at http://www.webkit.org/, since it seems all you need to render is the HTML and the JavaScript. This reduces your number of connections and allows you to control your threads with far greater efficiency. In that case you could create an event queue and spool all your primary URLs into it. Spawn off 4 worker threads that all work off of the queue; as they process a page and need to download further sources, they can add those downloads to the queue. Once all files for a particular URL are downloaded into memory (or onto disk if you're worried about RAM), you can then add an item to the event queue to render the page and then parse it for whatever you need.
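The queue layout described above could be sketched roughly as follows. Everything network- and webkit-related is faked: FAKE_PAGES is a hypothetical map from page URL to the resources it references, and "downloading" and "rendering" are placeholders. The point is only the control flow: 4 workers, one shared queue, follow-up downloads added by the workers themselves, and a render task queued once the last file for a page has arrived.

```python
import queue
import threading

# Hypothetical site map: page URL -> resources the page references.
FAKE_PAGES = {
    "http://example.com/": ["http://example.com/app.js", "http://example.com/style.css"],
    "http://example.org/": [],
}


def run(pages, num_workers=4):
    """Process all pages with a shared task queue; return rendered page URLs."""
    tasks = queue.Queue()
    lock = threading.Lock()
    pending = {}   # page URL -> number of files still to download
    rendered = []

    def finish_file(page):
        # One file for this page is done; queue the render when none remain.
        with lock:
            pending[page] -= 1
            done = pending[page] == 0
        if done:
            tasks.put(("render", page, None))

    def worker():
        while True:
            task = tasks.get()
            if task is None:          # sentinel: shut this worker down
                tasks.task_done()
                return
            kind, page, url = task
            if kind == "page":
                resources = pages[page]   # placeholder for download + HTML scan
                with lock:
                    pending[page] = 1 + len(resources)
                for res in resources:
                    tasks.put(("resource", page, res))
                finish_file(page)
            elif kind == "resource":
                finish_file(page)         # placeholder for downloading `url`
            elif kind == "render":
                with lock:
                    rendered.append(page)  # placeholder for render + parse
            tasks.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for page in pages:
        tasks.put(("page", page, None))
    tasks.join()                      # wait until every queued task is done
    for _ in threads:
        tasks.put(None)
    for t in threads:
        t.join()
    return sorted(rendered)


if __name__ == "__main__":
    print(run(FAKE_PAGES))
```

Because follow-up tasks are queued before the current task is marked done, the join() call cannot return while a page still has work outstanding.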
It depends on what data you are trying to parse. If you only care about the JavaScript and the HTML, then Hypertext Query Language would offer an immense speedup: http://htql.net/. Alternatively, you can look into setting up something in the cloud, such as http://watirmelon.com/2011/08/29/running-your-watir-webdriver-tests-in-the-cloud-for-free/
How can I test webpage/app rendering over a slow connection?
Use Fiddler web debugger - it has a feature to simulate slow modem speeds.
What you're concerned with is what is actually going over the wire. A tool like Fiddler will help you understand how much data is being transmitted. You can then work to whittle it down to as little as possible.
Firebug's net console will tell you what each page is downloading, how much, and how long it's taking. Tools like YSlow and Google Page Speed will give you suggestions on how to speed up both the loading and the rendering of the page.