How to callback a Scrapy spider method upon receiving SIGINT or Scrapyd's cancel.json call? - scrapy

In Scrapy, when we press CTRL+C we see "Received SIGINT, shutting down gracefully. Send again to force" in the logs, which comes from this code.
Likewise, when we call Scrapyd's cancel.json, the following code is executed.
I want to catch those signals in my spider, so that whenever either of them fires, one of my spider's callback methods is also called.
I have searched a lot about this but have not found any help.

We can use the signal module to catch CTRL+C (SIGINT) or the TERM signal sent by Scrapyd:
import os
import signal

# inside your Spider class:
def custom_terminate_spider(self, sig, frame):
    self.logger.info(self.crawler.stats.get_stats())  # print the stats if you want
    # dangerous line: it kills the running Scrapy process immediately
    os.kill(os.getpid(), signal.SIGKILL)

def start_requests(self):
    signal.signal(signal.SIGINT, self.custom_terminate_spider)   # CTRL+C
    signal.signal(signal.SIGTERM, self.custom_terminate_spider)  # sent by Scrapyd
    # ... then yield your requests as usual
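If what you want is a spider callback that runs on a graceful shutdown (the first CTRL+C, or Scrapyd's cancel.json, which sends a termination signal), a complementary option is to connect one of the spider's own methods to Scrapy's spider_closed signal. A minimal sketch, with the spider name and method name chosen purely for illustration:

import scrapy
from scrapy import signals

class MySpider(scrapy.Spider):
    name = "my_spider"  # illustrative name

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # spider_closed fires when the crawl ends, including graceful shutdowns
        crawler.signals.connect(spider.on_shutdown, signal=signals.spider_closed)
        return spider

    def on_shutdown(self, spider, reason):
        # on a graceful shutdown the reason is typically "shutdown"
        self.logger.info("closing (%s): %s", reason, self.crawler.stats.get_stats())

Unlike the SIGKILL approach above, this keeps Scrapy's normal shutdown sequence (pipelines, stats, item flushing) intact.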

Related

Report scrapy ResponseNeverReceived

I would like to send a status report to a database after a Scrapy crawl. I already have a system in place that listens to the spider closed event and sends the reason for closing the spider to the database.
My problem is that I get a twisted.web._newclient.ResponseNeverReceived, which ends the spider with CloseSpider('finished'). This reason is not enough for me to deduce the actual cause of the failure.
How would you close the spider with a different reason when ResponseNeverReceived occurs? There don't seem to be any signals for that.
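One hedged sketch (the spider name, URLs and the on_error method are placeholders, not from the question): attach an errback to your requests and, when the failure is a ResponseNeverReceived, close the spider yourself with a reason your reporting code can recognize:

import scrapy
from twisted.web._newclient import ResponseNeverReceived

class ReportSpider(scrapy.Spider):
    name = "report_spider"
    start_urls = ["https://example.com"]  # placeholder

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, errback=self.on_error)

    def parse(self, response):
        pass  # normal parsing

    def on_error(self, failure):
        if failure.check(ResponseNeverReceived):
            # close with a distinguishable reason instead of the generic one
            self.crawler.engine.close_spider(self, "response_never_received")
        else:
            self.logger.error(repr(failure))

The errback only sees the failure after the built-in retry middleware has given up on the request, so the custom reason should reflect the final outcome.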

How to pause a scrapy spider and make the others keep on scraping?

I am facing a problem with my custom retry middleware in Scrapy. I have a project made of 6 spiders, launched by a small script containing a CrawlerProcess(), crawling 6 different websites. They should work simultaneously, and here is the problem: I have implemented a retry middleware to resend requests that got no usable response (status code 429), like this:
if response.status == 429:
    self.crawler.engine.pause()
    time.sleep(self.RETRY_AFTER.get(spider.name) * (random.random() + 0.5))
    self.crawler.engine.unpause()
    reason = response_status_message(response.status)
    return self._retry(request, reason, spider) or response
But of course, because time.sleep() blocks, when I execute all the spiders simultaneously using a CrawlerProcess(), all the spiders are paused (the whole process sleeps).
Is there a better way to pause only one of them, retry the request after a delay, and let the other spiders keep crawling? My application has to scrape several sites at the same time, but each spider should resend a request whenever it receives status code 429, so how can I replace that time.sleep() with something that stops only the spider entering that method?
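A hedged sketch of one non-blocking alternative (I have only seen this pattern work on recent Scrapy 2.x, where a downloader-middleware method may return a Deferred; the RETRY_AFTER values are placeholders): return a Deferred from process_response that fires the retried request after the delay, so only this request waits while the other spiders in the CrawlerProcess keep running:

import random
from twisted.internet import reactor
from twisted.internet.task import deferLater
from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message

class DelayedRetryMiddleware(RetryMiddleware):
    RETRY_AFTER = {"spider_a": 30, "spider_b": 60}  # placeholder per-spider delays

    def process_response(self, request, response, spider):
        if response.status == 429:
            delay = self.RETRY_AFTER.get(spider.name, 30) * (random.random() + 0.5)
            reason = response_status_message(response.status)
            # schedule the retry on the reactor instead of blocking the whole process
            return deferLater(reactor, delay,
                              lambda: self._retry(request, reason, spider) or response)
        return super().process_response(request, response, spider)

Enable it in DOWNLOADER_MIDDLEWARES in place of the blocking version; on an older Scrapy that rejects Deferreds returned from middleware methods, this approach will not work as-is.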

Using asyncio to talk to daemon

I have a Python module that uses telnetlib to open a connection to a daemon, and it is able to send commands and parse the responses. This is used for testing purposes. However, sometimes the daemon will send asynchronous messages too. My current solution is to have a "wait_for_message" method that starts a thread to listen on the telnet socket. Meanwhile, in the main thread, I send a certain command that I know will trigger the daemon to send the particular async message. Then I simply do thread.join() and wait for it to finish.
import telnetlib

class Client:
    def __init__(self, host, port):
        self.connection = telnetlib.Telnet(host, port)

    def get_info(self):
        self.connection.write(b'getstate\r')
        # expect() returns the index of the first matching pattern
        idx, _, _ = self.connection.expect([b'good', b'bad', b'ugly'], timeout=1)
        return idx

    def wait_for_unexpected_message(self):
        idx, _, _ = self.connection.expect([b'error'])
Then when I write tests, I can use the module as part of the automation system.
client = Client('192.168.0.45', 6000)
if client.get_info() != 0:
    # do stuff
    ...
else:
    # do other stuff
    ...
It works really well until I want to handle async messages coming in. I've been reading about Python's asyncio library, but I haven't quite figured out how to do what I need, or even whether my use case would benefit from asyncio.
So my question: is there a better way to handle this with asyncio? I like using telnetlib because of its expect() feature; a plain TCP socket does not give me that.
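A hedged sketch of how this might look with plain asyncio streams instead of telnetlib (every name here, including AsyncClient and the command strings, is illustrative, not from the question): a single long-lived reader task dispatches every incoming line, so expected replies and unexpected daemon messages are both handled without extra threads:

import asyncio

class AsyncClient:
    def __init__(self, host, port):
        self.host, self.port = host, port
        self._waiters = []  # (patterns, future) pairs registered by expect()

    async def connect(self):
        self.reader, self.writer = await asyncio.open_connection(self.host, self.port)
        self._reader_task = asyncio.create_task(self._read_loop())

    async def _read_loop(self):
        # the only place that reads from the socket; matches lines against waiters
        async for line in self.reader:
            for patterns, fut in list(self._waiters):
                for i, pat in enumerate(patterns):
                    if pat in line and not fut.done():
                        fut.set_result(i)
            self._waiters = [(p, f) for p, f in self._waiters if not f.done()]

    async def send(self, command):
        self.writer.write(command.encode() + b"\r")
        await self.writer.drain()

    async def expect(self, patterns, timeout=None):
        # expect()-like helper: resolves with the index of the first matching pattern
        fut = asyncio.get_running_loop().create_future()
        self._waiters.append((patterns, fut))
        return await asyncio.wait_for(fut, timeout)

async def main():
    client = AsyncClient("192.168.0.45", 6000)
    await client.connect()
    # register interest in the async "error" message without blocking the main flow
    error_waiter = asyncio.create_task(client.expect([b"error"]))
    await client.send("getstate")
    idx = await client.expect([b"good", b"bad", b"ugly"], timeout=1)
    # ... do stuff based on idx; error_waiter resolves whenever "error" arrives
    error_waiter.cancel()

asyncio.run(main())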

How to know when a scrapy-redis crawl is finished

When I use scrapy-redis, it sets the spider to DontCloseSpider.
How can I know when the Scrapy crawl is finished?
crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed) is not working.
Interesting.
I see this comment:
# Max idle time to prevent the spider from being closed when distributed crawling.
# This only works if queue class is SpiderQueue or SpiderStack,
# and may also block the same time when your spider start at the first time (because the queue is empty).
SCHEDULER_IDLE_BEFORE_CLOSE = 10
If you follow the setup instructions properly and it still doesn't work, then at the very least you would have to share some data that allows reproducing your setup, e.g. your settings.py or any interesting spiders/pipelines you have.
The spider_closed signal should indeed fire, just quite a few seconds after the queue runs out of URLs. If the queue is not empty, the spider won't close, obviously.
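As a hedged illustration of the wiring this answer implies (the extension and module names are placeholders): with the scrapy-redis scheduler configured and SCHEDULER_IDLE_BEFORE_CLOSE set, a plain Scrapy extension connected to spider_closed should be notified once the crawl actually ends:

# settings.py (excerpt)
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
SCHEDULER_IDLE_BEFORE_CLOSE = 10
EXTENSIONS = {"myproject.extensions.FinishNotifier": 500}

# myproject/extensions.py
from scrapy import signals

class FinishNotifier:
    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_closed(self, spider, reason):
        spider.logger.info("crawl finished: %s (%s)", spider.name, reason)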

How to ACK celery tasks with parallel code in reactor?

I have a celery task that, when called, simply ignites the execution of some parallel code inside a twisted reactor. Here's some sample (not runnable) code to illustrate:
def run_task_in_reactor():
    # this takes a while to run
    do_something()
    do_something_more()

@celery.task
def run_task():
    print("Started reactor")
    reactor.callFromThread(run_task_in_reactor)
(For the sake of simplicity, please assume that the reactor is already running when the task is received by the worker; I used the @worker_process_init.connect signal to start my reactor in another thread as soon as the worker comes up.)
When I call run_task.delay(), the task finishes pretty quickly (since it does not wait for run_task_in_reactor() to finish, it only schedules its execution in the reactor). And when run_task_in_reactor() finally runs, do_something() or do_something_more() can throw an exception, which will go unnoticed.
Using pika to consume from my queue, I could place an ACK inside do_something_more() to make the worker signal the correct completion of the task, for instance. However, inside Celery this does not seem to be possible (or, at least, I don't know how to accomplish the same effect).
Also, I cannot remove the reactor, since it is a requirement of some third-party code I'm using. Other ways to achieve the same result are appreciated as well.
Use twisted.internet.threads.blockingCallFromThread(reactor, ...) instead of reactor.callFromThread.
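A hedged sketch of how that might look in the task from the question (the Celery app setup is a placeholder; do_something() and do_something_more() are the question's own stubs): blockingCallFromThread runs the callable in the reactor thread and blocks the worker thread until it finishes, so the task only completes afterwards and any exception raised in the reactor propagates to Celery as a normal task failure:

from celery import Celery
from twisted.internet import reactor
from twisted.internet.threads import blockingCallFromThread

app = Celery("tasks")  # placeholder app configuration

def run_task_in_reactor():
    do_something()        # the question's stubs
    do_something_more()

@app.task
def run_task():
    # blocks this worker thread until run_task_in_reactor has finished in the reactor;
    # exceptions raised there are re-raised here, so Celery sees the real outcome
    blockingCallFromThread(reactor, run_task_in_reactor)

Combined with acks_late=True on the task, the message would then only be acknowledged after the reactor-side work has completed.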