How to pause one Scrapy spider and let the others keep scraping? - scrapy

I am facing a problem with my custom retry middleware in Scrapy. I have a project made of 6 spiders, launched by a small script containing a CrawlerProcess() and crawling 6 different websites. They should run simultaneously, and here is the problem: I have implemented a retry middleware to resend requests that were rejected with status code 429 (Too Many Requests), like this:
def process_response(self, request, response, spider):
    if response.status == 429:
        # pausing the engine pauses the whole CrawlerProcess, not just this spider
        self.crawler.engine.pause()
        time.sleep(self.RETRY_AFTER.get(spider.name) * (random.random() + 0.5))
        self.crawler.engine.unpause()
        reason = response_status_message(response.status)
        return self._retry(request, reason, spider) or response
    return response
But of course, because of time.sleep, when I run all the spiders simultaneously with a CrawlerProcess(), every spider is paused (because the whole process is paused).
Is there a better way to pause only one of them, retry the request after a delay, and let the other spiders continue to crawl? My application has to scrape several sites at the same time, but each spider should resend a request whenever it receives a 429, so how can I replace that time.sleep() with something that stops only the spider entering that method?
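A possible non-blocking alternative (a sketch, not from the original post, assuming Scrapy 2.x coroutine support for downloader middlewares and the default Twisted reactor; the spider names and delays below are made up): subclass the built-in RetryMiddleware, declare process_response as a coroutine, and await a deferLater instead of calling time.sleep(), so only the request being retried waits while the reactor, and therefore the other spiders, keep running.

import random
from twisted.internet import reactor
from twisted.internet.task import deferLater
from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message

class TooManyRequestsRetryMiddleware(RetryMiddleware):
    # hypothetical per-spider base delays, in seconds
    RETRY_AFTER = {"spider_one": 10, "spider_two": 5}

    async def process_response(self, request, response, spider):
        if response.status == 429:
            delay = self.RETRY_AFTER.get(spider.name, 10) * (random.random() + 0.5)
            # wait without blocking the reactor: the other spiders keep crawling
            await deferLater(reactor, delay, lambda: None)
            reason = response_status_message(response.status)
            return self._retry(request, reason, spider) or response
        return response

With the asyncio reactor installed, awaiting a raw Deferred needs adapting (for example awaiting asyncio.sleep(delay) instead), but the idea stays the same: never block the reactor thread.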

Related

Report scrapy ResponseNeverReceived

I would like to send a status report to a database after a Scrapy crawl. I already have a system in place that listens for the spider_closed signal and sends the reason for closing the spider to the database.
My problem is that I get a twisted.web._newclient.ResponseNeverReceived, which ends the spider with CloseSpider('finished'). This reason is not enough for me to deduce the actual cause of the failure.
How would you close the spider with a different reason when ResponseNeverReceived occurs? There don't seem to be any signals for that.
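One possible approach (a sketch with hypothetical names, not taken from the original thread): attach an errback to your requests, check the failure type there, and close the spider yourself with a more specific reason, which your spider_closed listener will then receive.

import scrapy
from twisted.web._newclient import ResponseNeverReceived

class ReportingSpider(scrapy.Spider):
    name = "reporting"  # hypothetical spider
    start_urls = ["https://example.com"]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, errback=self.on_error)

    def parse(self, response):
        yield {"url": response.url, "status": response.status}

    def on_error(self, failure):
        if failure.check(ResponseNeverReceived):
            # close with a custom reason so the spider_closed handler can
            # report something more useful than 'finished' to the database
            self.crawler.engine.close_spider(self, reason="response_never_received")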

Yielding more requests if scraper was idle more than 20s

I would like to yield more requests at the end of a CrawlSpider that uses Rules.
I noticed I was not able to feed more requests by doing this in the spider_closed method:
self.crawler.engine.crawl(r, self)
I noticed that this technique works in the spider_idle method, but I would like to wait to be sure that the crawl is finished before feeding more requests.
I set the setting CLOSESPIDER_TIMEOUT = 30.
What would be the code to wait for 20 seconds of idle time before triggering the process of feeding more requests?
Is there a better way?
If it is really important that the previous crawling has completely finished before the new crawling starts, consider running either two separate spiders or the same spider twice in a row with different arguments that determine which URLs it crawls. See Run Scrapy from a script.
If you don’t really need the previous crawl to finish first, and you simply have URLs that should have a higher priority than others for some reason, consider using request priorities instead. See the priority parameter of the Request class constructor.
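To illustrate the first suggestion: the Run Scrapy from a script approach with a CrawlerRunner chains the crawls, so the second run only starts once the first has completely finished. A minimal sketch, assuming a hypothetical spider name and an argument that selects which URLs to crawl:

from twisted.internet import defer, reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

configure_logging()
runner = CrawlerRunner(get_project_settings())

@defer.inlineCallbacks
def crawl():
    # "myspider" and url_group are hypothetical; url_group would decide
    # which URLs the spider crawls on each run
    yield runner.crawl("myspider", url_group="first")
    yield runner.crawl("myspider", url_group="second")
    reactor.stop()

crawl()
reactor.run()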

Scrapy distributed connection count

Let's say I had a couple of servers each running multiple Scrapy spider instances at once. Each spider is limited to 4 concurrent requests with CONCURRENT_REQUESTS = 4. For concreteness, let's say there are 10 spider instances at once so I never expect more than 40 requests max at once.
If I need to know at any given time how many concurrent requests are active across all 10 spiders, I might think of storing that integer on a central redis server under some "connection_count" key.
My idea was then to write some downloader middleware that schematically looks like this:
class countMW(object):
    def process_request(self, request, spider):
        # Increment the redis key
        pass

    def process_response(self, request, response, spider):
        # Decrement the redis key
        return response

    def process_exception(self, request, exception, spider):
        # Decrement the redis key
        pass
However, with this approach it seems the connection count under the central key can be more than 40. I even get more than 4 for a single running spider (when the network is under load), and even when the redis store is replaced with storing the count as an attribute on the spider instance itself, to rule out lag in the remote redis updates as the cause.
My theory for why this doesn't work is that even though request concurrency per spider is capped at 4, Scrapy still creates and queues more than 4 requests in the meantime, and those extra requests call process_request, incrementing the count long before they are fetched.
Firstly, is this theory correct? Secondly, if it is, is there a way to increment the redis count only when a true fetch is occurring (when the request becomes active), and to decrement it similarly?
In my opinion it is better to customize the scheduler, as that fits the Scrapy architecture better and gives you full control of the request-emitting process:
Scheduler
The Scheduler receives requests from the engine and enqueues them for feeding them later (also to the engine) when the engine requests them.
https://doc.scrapy.org/en/latest/topics/architecture.html?highlight=scheduler#component-scheduler
For example you can find some inspiration ideas about how to customize scheduler here: https://github.com/rolando/scrapy-redis
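To make that concrete, a rough sketch (with assumed names and a local Redis instance, not a drop-in solution): subclass the default scheduler and increment the shared counter only in next_request(), i.e. at the moment a request actually leaves the queue for the engine; the matching decrement would still live in a downloader middleware's process_response/process_exception.

import redis
from scrapy.core.scheduler import Scheduler

class CountingScheduler(Scheduler):
    # assumes a local Redis instance and a shared "connection_count" key
    redis_client = redis.Redis()

    def next_request(self):
        request = super().next_request()
        if request is not None:
            # the request is now being handed to the engine/downloader,
            # so count it as active only at this point
            self.redis_client.incr("connection_count")
        return request

The custom class would be enabled with SCHEDULER = "myproject.scheduler.CountingScheduler" in settings.py (hypothetical path).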
Your theory is partially correct. Usually requests are made much faster than they are fulfilled and the engine will give, not some, but ALL of these requests to the scheduler. But these queued requests are not processed and thus will not call process_request until they are fetched.
There is a slight lag between when the scheduler releases a request and when the downloader begins to fetch it, and this allows for the scenario you observe, where more than CONCURRENT_REQUESTS requests appear active at the same time. Since Scrapy processes requests asynchronously, this bit of over-counting is baked in; so, how do you deal with it? I'm sure you don't want to run synchronously.
So the question becomes: what is the motivation behind this? Are you just curious about the inner workings of Scrapy, or do you have, for example, some ISP bandwidth cost limitation to deal with? Because we have to define what we really mean by concurrency here.
When does a request become "active"?
When the scheduler releases it?
When the downloader begins to fetch it?
When the underlying Twisted deferred is created?
When the first TCP packet is sent?
When the first TCP packet is received?
Perhaps you could add your own scheduler middleware for finer-grained control, taking inspiration from Downloader.fetch.
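Taking inspiration from the downloader internals rather than from counting middleware calls, one rough sketch (it relies on the undocumented Downloader.slots and Slot.transferring attributes and an assumed local Redis instance, so treat it as best-effort) is to publish how many requests are actually being transferred right now:

import redis

class InFlightCountMiddleware:
    """Downloader middleware sketch: report the number of requests the
    downloader is actually transferring, instead of counting calls."""

    def __init__(self, crawler):
        self.crawler = crawler
        self.redis = redis.Redis()  # assumed local Redis instance

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_response(self, request, response, spider):
        downloader = self.crawler.engine.downloader
        # Slot.transferring holds the requests currently being downloaded;
        # it is an internal attribute, so this may change between versions
        in_flight = sum(len(slot.transferring) for slot in downloader.slots.values())
        self.redis.set(f"connection_count:{spider.name}", in_flight)
        return response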

How to know when scrapy-redis has finished

When I use scrapy-redis, the spider is kept from closing (DontCloseSpider).
How can I know when the Scrapy crawl has finished?
crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed) is not working.
Interesting.
I see this comment:
# Max idle time to prevent the spider from being closed when distributed crawling.
# This only works if queue class is SpiderQueue or SpiderStack,
# and may also block the same time when your spider start at the first time (because the queue is empty).
SCHEDULER_IDLE_BEFORE_CLOSE = 10
If you followed the setup instructions properly and it doesn't work, I guess you would at least have to share some data that allows reproducing your setup, e.g. your settings.py or any interesting spiders/pipelines you have.
The spider_closed signal should indeed fire, just quite a few seconds after the spider runs out of URLs in the queue. If the queue is not empty, the spider won't close, obviously.
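If the signal wiring itself is the issue, here is a minimal extension sketch (with a hypothetical module path) of the usual way to hook spider_closed and report the finish reason:

from scrapy import signals

class CrawlFinishedReporter:
    # enable with EXTENSIONS = {"myproject.extensions.CrawlFinishedReporter": 500}
    # in settings.py (hypothetical path)

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_closed(self, spider, reason):
        # fires once the spider is finally allowed to close, i.e. some time
        # after the scrapy-redis queue has run out of URLs
        spider.logger.info("Crawl finished: %s (%s)", spider.name, reason)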

Handling long running webrequest

I have a web request that can take 30-90 seconds to complete in some cases (most of the time it completes in 2-3). Currently, the software looks like it has hung if the request takes this long.
I was thinking that I could use a BackgroundWorker to process the web request in a separate thread. However, the software must wait for the request to finish before continuing. I know how to set up the BackgroundWorker. What I am unsure about is how to handle waiting for the request to finish.
Do I need to create a timer to check for the results until the request times out or is processed?
I wouldn't use a timer. Rather, when the web request completes on the worker thread, use a call to the Invoke method on a control in the UI to update the UI with the result (or send some sort of notification).