Report scrapy ResponseNeverReceived - scrapy

I would like to send a status report to a database after a scrapy crawl. I already have a system in place that listens for the spider_closed event and sends the reason for closing the spider to the database.
My problem is that I get a twisted.web._newclient.ResponseNeverReceived, which ends the spider with CloseSpider('finished'). This reason is not enough for me to deduce the actual cause of the failure.
How would you close the spider with a different reason when ResponseNeverReceived occurs? There doesn't seem to be a signal for that.
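One possible approach (a sketch, not something verified end to end): catch the failure in a downloader middleware's process_exception and close the spider there with a reason of your own, so the spider_closed handler receives it instead of 'finished'. The middleware name and the reason string below are mine; this assumes the ResponseNeverReceived instance reaches process_exception unchanged (it subclasses twisted's ResponseFailed).

from twisted.web._newclient import ResponseNeverReceived


class CloseOnNeverReceived:
    # Hypothetical downloader middleware; enable it in DOWNLOADER_MIDDLEWARES.

    def __init__(self, crawler):
        self.crawler = crawler

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_exception(self, request, exception, spider):
        if isinstance(exception, ResponseNeverReceived):
            # Close with a custom reason that spider_closed will report
            # instead of the generic 'finished'.
            self.crawler.engine.close_spider(spider, reason="response_never_received")
        # Returning None lets the remaining middlewares (e.g. retries) run as usual.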

Related

How to pause a scrapy spider and make the others keep on scraping?

I am facing a problem with my custom retry middleware in scrapy. I have a project made of 6 spiders, launched by a small script containing a CrawlerProcess(), crawling 6 different websites. They should work simultaneously, and here is the problem: I have implemented a retry middleware to resend requests that came back with status code 429, like this:
def process_response(self, request, response, spider):
    if response.status == 429:
        # Pausing the engine blocks every spider running in this CrawlerProcess
        self.crawler.engine.pause()
        time.sleep(self.RETRY_AFTER.get(spider.name) * (random.random() + 0.5))
        self.crawler.engine.unpause()
        reason = response_status_message(response.status)
        return self._retry(request, reason, spider) or response
    return response
But of course, because of time.sleep, when I execute all the spiders simultaneously using a CrawlerProcess(), all the spiders are paused (because the whole process is blocked).
Is there a better way to pause only one of them, retry the request after a delay, and let the other spiders continue crawling? My application should scrape several sites at the same time, but each spider should resend a request whenever it receives status code 429, so how can I replace that time.sleep() with something that stops only the spider entering that method?
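A non-blocking alternative that is sometimes suggested (a sketch, under the assumption that your Scrapy version's middleware chain accepts a Deferred returned from process_response - recent versions do, since the chain is built on Twisted Deferreds): schedule the retried request with deferLater instead of sleeping, so only that request waits and the other spiders keep crawling. RETRY_AFTER is the same mapping as in the snippet above; the 60-second fallback is arbitrary.

import random

from scrapy.utils.response import response_status_message
from twisted.internet import reactor
from twisted.internet.task import deferLater


def process_response(self, request, response, spider):
    if response.status == 429:
        reason = response_status_message(response.status)
        retried = self._retry(request, reason, spider)
        if retried is None:
            return response  # retries exhausted, pass the 429 through
        delay = self.RETRY_AFTER.get(spider.name, 60) * (random.random() + 0.5)
        # The Deferred fires with the retried request after the delay;
        # the reactor keeps running, so nothing else is paused.
        return deferLater(reactor, delay, lambda: retried)
    return response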

Scrapy - Is spider_idle called when a DOWNLOAD_DELAY is specified?

I'm writing a spider to scrape data about cars from the carsharing website https://fr.be.getaround.com/. The objective is to divide my spider into two parts. First, it scrapes data for available cars and keeps unavailable cars aside. Second, once all information about available cars is scraped, thus at the end of the process, the spider scrapes additional information for the unavailable cars. For this second part, I've added the spider_idle method to my spider. Doing so, it should be called once no available cars remain in the waiting list. However, I've added a DOWNLOAD_DELAY (5 seconds) and AutoThrottle is enabled. I was wondering, will spider_idle be called during the waiting time between requests (within the 5 seconds)?
No.
The spider_idle signal is only sent when there are no further requests to process. It will not be sent just because no request happens to be in progress while the next request waits for the download delay to pass.
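For the second phase itself, a common pattern is to connect spider_idle, schedule the pending requests from the handler, and raise DontCloseSpider while there is still work to queue. A sketch (unavailable_urls and parse_unavailable are placeholders for your own bookkeeping; on older Scrapy versions engine.crawl also takes the spider as a second argument):

from scrapy import Request, Spider, signals
from scrapy.exceptions import DontCloseSpider


class CarsSpider(Spider):
    name = "cars"  # hypothetical spider, standing in for the getaround spider

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.unavailable_urls = []  # filled during the first phase

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.handle_idle, signal=signals.spider_idle)
        return spider

    def handle_idle(self, spider):
        # Fires only when the scheduler is empty, not during DOWNLOAD_DELAY waits.
        if self.unavailable_urls:
            for url in self.unavailable_urls:
                self.crawler.engine.crawl(Request(url, callback=self.parse_unavailable))
            self.unavailable_urls = []
            raise DontCloseSpider  # keep the spider open while phase two runs

    def parse_unavailable(self, response):
        ...  # second-phase parsing goes here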

Spider stops after one hour of processing one item

I have a scrapy spider running in a (non-free) scrapinghub account that sometimes has to OCR a PDF (via Tesseract) - which depending on the number of units can take quite some time.
What I see in the log is something like this:
2019-07-07 22:51:50 WARNING [tools.textraction] PDF contains only images - running OCR.
2019-07-08 00:00:03 INFO [scrapy.crawler] Received SIGTERM, shutting down gracefully. Send again to force
The SIGTERM always arrives about one hour after the message saying the OCR has started, so I'm assuming there's a mechanism that kills everything if there's no new request or item for one hour.
How can I hook into that and prevent the shutdown? Is this an example of signal.spider_idle?
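To see whether the idle machinery is involved at all, one option is a small extension that logs when spider_idle fires and raises DontCloseSpider to keep the crawl alive. This is only a diagnostic sketch (the extension name is mine), and a SIGTERM sent by the platform from outside the process would not pass through this signal.

import logging

from scrapy import signals
from scrapy.exceptions import DontCloseSpider

logger = logging.getLogger(__name__)


class IdleProbe:
    # Hypothetical extension; add it to EXTENSIONS in settings to enable it.

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        crawler.signals.connect(ext.spider_idle, signal=signals.spider_idle)
        return ext

    def spider_idle(self, spider):
        # Log so the job shows activity and so we can see when idle fires.
        logger.info("spider_idle fired for %s", spider.name)
        # Keeps the spider open; remove or guard this once the OCR run is done,
        # otherwise the spider never closes on its own.
        raise DontCloseSpider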

Requests are blocked waiting for the first call to finish

I have a serious problem: I cannot launch any query by clicking on a link in my website until the first request has finished.
I have a musical website in which I generate a waveform when a track is clicked. The problem is that generating the waveform takes some time (about 12 seconds) to finish, so if I click on another track, its request does not start until the generation (the call to the web service) for the first one has finished.
I have looked at the front-end calls, but the problem seems to be in the backend, because even the Apache logs show that the first query was called at time x, and only when it finishes does Apache write down the second call.
The strange thing is that the time of the second call written by Apache is x plus a few microseconds, as if Apache had also been blocked waiting for the first request to finish.
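To make the symptom measurable, a small script that fires two calls concurrently and prints when each starts and finishes can confirm whether the backend serializes them. This is only a sketch; the URLs and credentials are placeholders.

import threading
import time

import requests  # assumes the requests library is installed

URLS = [
    "https://example.com/waveform?track=1",  # placeholder endpoints
    "https://example.com/waveform?track=2",
]

t0 = time.time()

def timed_get(url):
    start = time.time() - t0
    requests.get(url, auth=("user", "password"))  # the site is password-protected
    print(f"{url}: started at +{start:.1f}s, finished at +{time.time() - t0:.1f}s")

threads = [threading.Thread(target=timed_get, args=(u,)) for u in URLS]
for t in threads:
    t.start()
for t in threads:
    t.join()

If both calls start immediately but the second one only returns roughly two generation times after t0, the serialization happens on the server side rather than in the browser.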
Please help me.
BTW: the website is secured by a password, so you would need to contact me to see the problem.
Ask me whatever you want for more details.
Kind regards.

How to know when scrapy-redis has finished

When I use scrapy-redis, it sets the spider to DontCloseSpider, so the spider never closes on its own.
How can I know when the scrapy crawl has finished?
crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed) is not working.
Interesting.
I see this comment:
# Max idle time to prevent the spider from being closed when distributed crawling.
# This only works if queue class is SpiderQueue or SpiderStack,
# and may also block the same time when your spider start at the first time (because the queue is empty).
SCHEDULER_IDLE_BEFORE_CLOSE = 10
If you follow the setup instructions properly and it doesn't work, I guess you would at least have to provide some data that allows reproducing your setup, e.g. your settings.py or any interesting spiders/pipelines you have.
The spider_closed signal should indeed fire - just quite a few seconds after the queue runs out of URLs. If the queue is not empty, the spider won't close - obviously.
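For what it's worth, a minimal extension to confirm whether and when spider_closed fires could look like the sketch below (the extension name and module path are mine; enable it via the EXTENSIONS setting):

import logging

from scrapy import signals

logger = logging.getLogger(__name__)


class ClosedReporter:
    # Hypothetical extension; enable it with something like
    # EXTENSIONS = {"myproject.extensions.ClosedReporter": 500}.

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_closed(self, spider, reason):
        logger.info("Spider %s closed, reason: %s", spider.name, reason)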