I have a Scrapy spider running in a (non-free) Scrapinghub account that sometimes has to OCR a PDF (via Tesseract), which, depending on the number of units, can take quite some time.
What I see in the log is something like this:
2019-07-07 22:51:50 WARNING [tools.textraction] PDF contains only images - running OCR.
2019-07-08 00:00:03 INFO [scrapy.crawler] Received SIGTERM, shutting down gracefully. Send again to force
The SIGTERM always arrives about one hour after the message saying the OCR has started, so I'm assuming there's a mechanism that kills everything if there's no new request or item for one hour.
How can I hook into that and prevent the shutdown? Is this a case for the spider_idle signal?
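From what I understand, hooking into spider_idle would look roughly like this - a minimal sketch, where ocr_in_progress is a hypothetical flag my OCR code would set, and which only covers Scrapy's own idle handling, not whatever the platform itself does:

from scrapy import Spider, signals
from scrapy.exceptions import DontCloseSpider

class PdfSpider(Spider):
    name = "pdf_spider"

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        spider.ocr_in_progress = False  # hypothetical flag, set around the OCR call
        # Run handle_idle every time the engine reports that it is idle.
        crawler.signals.connect(spider.handle_idle, signal=signals.spider_idle)
        return spider

    def handle_idle(self, spider):
        # Fires when there are no requests in flight and nothing left to schedule.
        if self.ocr_in_progress:
            # Raising DontCloseSpider tells Scrapy not to close the spider yet.
            raise DontCloseSpider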
Related
I would like to send a status report to a database after a Scrapy crawl. I already have a system in place that listens for the spider_closed signal and sends the reason for closing the spider to the database.
My problem is that I get a twisted.web._newclient.ResponseNeverReceived, which ends the spider with CloseSpider('finished'). That reason is not enough for me to deduce the actual cause of the failure.
How would you close the spider with a different reason when ResponseNeverReceived occurs? There doesn't seem to be a signal for that.
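One approach I'm considering is to attach an errback to my requests and close the spider myself with a custom reason - a rough sketch, assuming the failure that reaches the errback can be checked as ResponseNeverReceived:

from scrapy import Spider, Request
from twisted.web._newclient import ResponseNeverReceived

class ReportingSpider(Spider):
    name = "reporting"
    start_urls = ["https://example.com/"]  # placeholder

    def start_requests(self):
        for url in self.start_urls:
            # Attach an errback so download failures reach our own code.
            yield Request(url, callback=self.parse, errback=self.handle_error)

    def handle_error(self, failure):
        if failure.check(ResponseNeverReceived):
            # Close with a custom reason instead of 'finished'; this reason
            # is what the spider_closed handler will receive.
            self.crawler.engine.close_spider(self, reason="response_never_received")

    def parse(self, response):
        pass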
I'm writing a spider to scrape data about cars from the carsharing website https://fr.be.getaround.com/. The objective is to divide my spider into two parts. First, it scrapes data for available cars and keeps unavailable cars aside. Second, once all information about available cars has been scraped, i.e. at the end of the process, the spider scrapes additional information for the unavailable cars. For this second part, I've added a spider_idle handler to my spider, so it should be called once no available cars remain in the waiting list. However, I've also set a DOWNLOAD_DELAY (5 seconds) and AutoThrottle is enabled. I was wondering: will spider_idle be called during the waiting time between requests (within those 5 seconds)?
No.
The spider_idle signal is only sent when there are no further requests to process. It is not sent just because no request happens to be in progress while the next request waits for the delay to pass.
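For the two-phase crawl described above, a common pattern is to collect the deferred URLs and schedule them from a spider_idle handler once the first phase is done - a minimal sketch, with illustrative names such as deferred_urls and parse_unavailable:

from scrapy import Spider, Request, signals
from scrapy.exceptions import DontCloseSpider

class CarsSpider(Spider):
    name = "cars"

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        spider.deferred_urls = []  # filled during the first phase with unavailable cars
        crawler.signals.connect(spider.handle_idle, signal=signals.spider_idle)
        return spider

    def handle_idle(self, spider):
        # Only fires once the scheduler is empty and nothing is downloading,
        # i.e. after all available cars have been processed.
        if self.deferred_urls:
            for url in self.deferred_urls:
                # Older Scrapy versions also require the spider argument here.
                self.crawler.engine.crawl(Request(url, callback=self.parse_unavailable))
            self.deferred_urls = []
            # Keep the spider open so the newly scheduled requests are processed.
            raise DontCloseSpider

    def parse_unavailable(self, response):
        pass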
I have a serious problem: I cannot launch any request by clicking a link on my website until the first request has finished.
I have a music website on which a waveform is generated when a track is clicked. The problem is that generating the waveform takes some time (about 12 seconds), so if I click on another track, its request does not start until the first generation (the call to the web service) has finished.
I have looked at the front-end calls, but the problem seems to be in the backend, because even the Apache logs show that the first request was received at time x, and only when it finishes does Apache log the second request.
The strange thing is that the time Apache writes for the second request is x plus a few microseconds, as if Apache had also been blocked waiting for the first request to finish.
Please help me.
BTW: the website is password-protected, so I would need to be in contact with someone to show the problem.
Ask me whatever you want for more details.
Kind regards.
When I use scrapy-redis, it keeps the spider alive with DontCloseSpider.
How can I know when the Scrapy crawl has finished?
crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed) is not working.
Interesting.
I see this comment:
# Max idle time to prevent the spider from being closed when distributed crawling.
# This only works if queue class is SpiderQueue or SpiderStack,
# and may also block the same time when your spider start at the first time (because the queue is empty).
SCHEDULER_IDLE_BEFORE_CLOSE = 10
If you followed the setup instructions properly and it still doesn't work, you would at least have to provide some data that allows reproducing your setup, e.g. your settings.py or any relevant spiders/pipelines.
The spider_closed signal should indeed fire, just quite some seconds after the queue runs out of URLs. If the queue is not empty, the spider won't close - obviously.
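For reference, a minimal extension wired up this way looks roughly like the sketch below (class and module names are placeholders); if the signal still never fires, the redis queue most likely never becomes empty:

from scrapy import signals

class ClosedReporter:
    # Minimal extension that logs when - and why - the spider closes.

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_closed(self, spider, reason):
        spider.logger.info("Spider %s closed (reason: %s)", spider.name, reason)

# settings.py - the extension must be enabled; with scrapy-redis the spider only
# closes some seconds after the redis queue stays empty:
# EXTENSIONS = {"myproject.extensions.ClosedReporter": 500}
# SCHEDULER_IDLE_BEFORE_CLOSE = 10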
I am using Heroku to host a small app. It's running a screenscraper using Mechanize and Nokogiri for every search request, and this takes about 3 seconds to complete.
Does the screenscraper block anyone else who wants to access the app at that moment? In other words, is it in holding mode for only the current user or for everyone?
If you have only one Heroku dyno, then yes, other users would have to wait in line.
Background jobs are for cases like yours where there is some heavy processing to be done. The process running Rails doesn't do the hard work up front; instead, it triggers a background job to do it and responds quickly, freeing itself up to respond to other requests.
The data processed by the background job is viewed later - perhaps a few requests later, or loaded in by JavaScript whenever the job is done.
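To make the flow concrete, here is the same enqueue-and-poll pattern sketched in Python with a plain in-process queue, purely as an illustration - a real Rails app would use Delayed Job or a similar worker library, and the handler names are made up:

import queue
import threading
import time

jobs = queue.Queue()
results = {}

def worker():
    # Background worker: does the slow scraping so web requests don't have to.
    while True:
        job_id, search = jobs.get()
        time.sleep(3)  # stands in for the ~3 s Mechanize/Nokogiri scrape
        results[job_id] = "results for " + search
        jobs.task_done()

threading.Thread(target=worker, daemon=True).start()

def handle_search_request(job_id, search):
    # Web request handler: enqueue the job and respond immediately.
    jobs.put((job_id, search))
    return {"status": "queued", "job_id": job_id}

def handle_poll_request(job_id):
    # A later request (e.g. from JavaScript polling) picks up the result when ready.
    return results.get(job_id, {"status": "pending"})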
Definitely. Because a dyno is single-threaded, if it's busy scraping, other requests will be queued until the dyno is free, or will ultimately time out if they wait for more than 30 seconds.
Anything that relies on an external service is best run through a worker via DJ (Delayed Job) - even sending an email: the controller puts the message into the queue and returns the user to a screen, and the mail is then picked up by DJ and delivered, so the user doesn't have to wait for the mailing process to complete.
Install the NewRelic gem to see what your queue length is doing.
John.
If you're interested in a full-service worker system, we're looking for beta testers for our Heroku app-store app SimpleWorker.com. It's tightly integrated with Heroku and offers a number of advantages over DJ.
It takes only a few lines of code to send your work off to our elastic worker cloud, and you can take advantage of massive parallel processing because we don't limit the number of workers.
Shoot me a message if you're interested.
Chad