I have a single crawler written in scrapy using the splash browser via the scrapy-splash python package. I am using the aquarium python package to load balance the parallel scrapy requests to a splash docker cluster.
The scraper uses a long list of urls as the start_urls list. There is no "crawling" from page to page via hrefs or pagination.
I am running six Splash Docker containers with 5 slots per Splash instance as the load-balanced browser cluster, and Scrapy at six concurrent requests.
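For reference, the relevant parts of my settings look roughly like this (a sketch only; the middleware entries and priorities are the standard wiring from the scrapy-splash README, and the endpoint/concurrency values are from the setup described above):

```python
# settings.py (sketch; values match the setup described above)

# Aquarium's load-balancer endpoint in front of the six Splash containers
SPLASH_URL = 'http://localhost:9050'

# Matched to the six Splash instances
CONCURRENT_REQUESTS = 6

# Standard scrapy-splash wiring, as documented in its README
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
```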
The dev machine is a MacBook Pro with a dual-core 2.4GHz CPU and 16GB of RAM.
When the spider starts up, the Aquarium stdout shows fast request/responses, the onboard fan spins up, and the system is running at about 90% used with 10% idle, so I am not overloading the system resources. Memory/swap is not exhausted either.
At this point I get a very slow ~30 pages/minute. After a few minutes, the fans spin down, the system resources free up considerably (>60% idle), and the scrapy log shows every request failing with a 503 timeout.
When I look at the stdout of the aquarium cluster, there are requests being processed, albeit very slowly compared to when the spider is first invoked.
If I go to localhost:9050, I do get the Splash page after 10 seconds or so, so the load balancer/Splash is online.
If I stop the spider and restart it, it starts up normally, so this does not seem to be throttling by the target site; a restarted spider would also be throttled, but it is not.
I appreciate any insight that the community can offer.
Thanks.
I have a project with about 30 spiders, all scheduled via cron jobs. Whenever I want to deploy the project, I git push to production, where a hook puts the files in place.
Now I came across scrapyd, which seems to do both in a more sophisticated way by eggifying the scraper and deploying it to the production environment. Looking at the code, it seems that this project came to a halt about 3 years ago. I am wondering if there is an advantage to switching to scrapyd, and what the reason is for this code being so old and no longer under development. Scrapy itself receives regular updates, in contrast.
Would you advise using scrapyd, and if yes, why?
I've been using scrapyd for about 2 years, and I prefer it over just launching jobs with scrapy crawl:
You can set the number of scrapers that can run at the same time using `max_proc_per_cpu`. Any scrapers you launch once the max is reached are put in a queue and launched when a spot becomes available.
You have a minimalistic GUI in which you can check the queues & read the logs.
Scheduling spiders is easily done with the API calls (see the sketch after this list). The same goes for listing spiders, cancelling spiders, ...
You can use the HTTP cache even when running multiple spiders at the same time
You can deploy on multiple servers at once if you want to spread out your crawls over different servers
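For example, scheduling a run, listing jobs and cancelling a job are just HTTP calls against scrapyd's JSON API (a rough sketch using requests; port 6800 is scrapyd's default, and the project/spider names are placeholders for your own). The `max_proc_per_cpu` setting mentioned above lives in scrapyd.conf.

```python
import requests

SCRAPYD = 'http://localhost:6800'  # scrapyd's default HTTP port

# Schedule a spider run (project/spider names are placeholders)
resp = requests.post(f'{SCRAPYD}/schedule.json',
                     data={'project': 'myproject', 'spider': 'myspider'})
job_id = resp.json()['jobid']

# List pending/running/finished jobs for the project
jobs = requests.get(f'{SCRAPYD}/listjobs.json',
                    params={'project': 'myproject'}).json()

# Cancel a running job
requests.post(f'{SCRAPYD}/cancel.json',
              data={'project': 'myproject', 'job': job_id})
```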
I have a WAMP stack for development and a lot of sites are loading slowly, but I have a really big issue with PrestaShop, where the load time is 1 minute on average.
Although the content is loaded, the main request responds very slowly, and Chrome's waterfall shows that the delay is caused by Content Downloading, even though all assets are already downloaded (local storage) or cached.
I noticed that when I enable the Xdebug listener (in VS Code), the site responds as it should, i.e. within milliseconds.
Any idea what might be happening?
I am running multiple data validation tests on a Selenium Grid (Chrome browsers only) on a CentOS stack. I notice that initially the tests complete very quickly; however, over time the execution slows down considerably.
I am trying to validate data from a CSV file against data in a web application. I have around 100K records in the CSV file. For each record, these are the steps:
launch a remote driver (Chrome) instance
open the web application and log in
search for the keyword from the CSV file in the application and validate the results (output in the CSV vs. output in the web application)
close the remote driver instance
I have configured 7 nodes using CentOS and each node has 10 browser instances.
Also, I am using a ThreadPoolExecutor to submit each task, so at any given time I will have 70 threads running, each driving a WebDriver instance.
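To show the shape of the code, here is a rough sketch (the hub and application URLs, the CSV column names and the page interaction are placeholders for my actual implementation; written against the Selenium 4 Python bindings):

```python
import csv
from concurrent.futures import ThreadPoolExecutor
from selenium import webdriver

HUB_URL = 'http://grid-hub:4444/wd/hub'   # placeholder grid hub address
APP_URL = 'https://app.example.com'       # placeholder application URL

def process_record(record):
    """One CSV record: launch a remote Chrome, log in, search, compare, quit."""
    driver = webdriver.Remote(command_executor=HUB_URL,
                              options=webdriver.ChromeOptions())
    try:
        driver.get(APP_URL)
        # ... application-specific login and keyword search go here ...
        on_screen_value = ...                         # value read from the results page
        return record['expected'] == on_screen_value  # CSV output vs. app output
    finally:
        driver.quit()                                 # always release the grid slot

with open('records.csv', newline='') as f:
    records = list(csv.DictReader(f))

# 70 workers ~ 7 nodes x 10 browser instances each
with ThreadPoolExecutor(max_workers=70) as pool:
    results = list(pool.map(process_record, records))
```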
I am not sure if this is a code-level issue or an infrastructure issue. Can someone point me in the right direction on how to find the root cause of this slowness and rectify it?
I have tried monitoring system resources on one of the nodes and see that the Java process takes around 55% CPU and 10% memory, while each browser takes 10% CPU and 4% memory.
Selenium Grid slows down as time goes on because it runs on the JVM and gradually occupies more memory. Many factors affect browser performance: the number of browsers per node, the node configuration, the grid configuration and your web server's performance. For better grid performance, restart the grid hub and nodes once in a while.
HTTP requests for resources randomly - roughly 1-5% of the time (per resource, not per page load) - take extremely long to be delivered to the browser (~20 seconds), and not uncommonly even hang indefinitely. (Server details are listed at the bottom.)
This results in about every 5th request to any page appearing to hang, due to a JavaScript resource stalling within the <head> tag.
The resources are CSS, JS and small image files, served directly by Apache (no scripting language involved), although page loads (involving PHP or Rails) also hang on rare occasions, with the same odds as any other resource (1-5% of the time), so this seems to be an Apache request handling issue.
Additional information:
I've checked the idle workers on server-status and, as expected, I still have 98% of my workers idle. This may still be relevant, though, since the hangs affect static resources, which are not served by FastCGI.
I am not the only one with this problem. Someone else is also having the same problem, and from a different IP address.
This happens in both Google Chrome and Firefox as HTTP clients.
I have tried constantly force refreshing the same JS file in a new tab. It eventually led to the same kind of hanging.
The Timing tab in Google Chrome reports 34ms of waiting and 19.27s of receiving for one of these hanging requests. Would that mean Apache already had the file contents ready to deliver, and only had trouble delivering them in a sensible amount of time?
error.log doesn't show any related errors. There are some expected 404 and 500 entries in error.log, but those aren't related to the hanging; they are genuine errors for nonexistent pages and PHP fatal errors.
I get some suspicious 206 Partial Content responses, mostly for static content, although the hanging happens more often than those partial contents. I mostly get 200 OK responses everywhere, and I can confirm indefinitely hanging resources that were reported as 200 OK in the Apache access.log.
I do have mod_passenger installed for Redmine. I don't know if that is relevant, but suspiciously this server has it installed, unlike all the other servers I have worked with. Then again, mod_passenger shouldn't affect static content, especially not within a non-Ruby project folder, should it?
The server is using Apache 2.4 Event MPM on Ubuntu 13.10, hosted on Digital Ocean.
What may be causing these hangs, and how could I fix this?
I had the same problem, so after reading this thread I tried setting KeepAlive Off in my Apache config, which seems to have helped: all resources have the expected waiting times now.
Not a great "fix", but at least I am one step closer to figuring out the cause and pages aren't taking 15s to fully load in the mean time.
I am running Unicorn on Heroku. I notice that when we scale web dynos up, the new dynos receive traffic right after they are spawned. In the logs we are getting:
"Request timeout" errors with a 30 second limit (i.e. service=30000ms)
As soon as a dyno starts, traffic is sent to it. If Unicorn is still spinning up child processes as these requests arrive, they will likely time out, especially if the app takes a while to boot.
Is there a way to spawn Unicorn child processes without them receiving requests until the app is fully loaded?
This is actually something else and something I've encountered before.
Essentially, when you scale you're seeing your dyno get created and your slug being deployed onto it. At this point your slug is initialised. The routing mesh will start sending requests through the moment this has happened, as it sees the dyno as up and ready to rock.
However (and this is where I found the problem), it takes time for your application to spin up enough to respond to the request (Unicorn is up, but Rails is still initialising), and you get the 30 second timeout.
The only way I could fix this was to get my app to start up in under 30 seconds consistently, which I finally achieved by updating to a current version of Rails. I've also seen some improvement by moving to Ruby 1.9.3.