Let Scrapy continue crawling from the last break point

I'm using Scrapy to crawl a website, but sometimes bad things happen (a power outage, etc.).
I wonder how I can continue crawling from where it broke off. I don't want to start over from the seeds.

This can be done by persisting scheduled requests to the disk.
scrapy crawl somespider -s JOBDIR=crawls/somespider-1
See http://doc.scrapy.org/en/latest/topics/jobs.html for more information.
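Under the hood, JOBDIR works by serializing the scheduler's pending requests and the spider's state dict to disk, and reloading them on the next run. A rough sketch of the state-persistence part (file name and helpers are illustrative, not Scrapy's exact internals):

```python
import os
import pickle
import tempfile

# Rough sketch of what JOBDIR spider-state persistence does: the
# spider's .state dict is pickled when the spider closes and reloaded
# when it opens, so anything stored in it survives a restart.
def save_state(jobdir, state):
    with open(os.path.join(jobdir, "spider.state"), "wb") as f:
        pickle.dump(state, f)

def load_state(jobdir):
    path = os.path.join(jobdir, "spider.state")
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {}

jobdir = tempfile.mkdtemp()
save_state(jobdir, {"pages_seen": 120})
print(load_state(jobdir))  # {'pages_seen': 120}
```

Inside a real spider you would simply read and write self.state (available when JOBDIR is set) and keep everything you store in it picklable.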

Related

scrapy-splash crawler starts fast but slows down (not throttled by website)

I have a single crawler written in Scrapy that uses the Splash browser via the scrapy-splash Python package. I am using the Aquarium Python package to load-balance the parallel Scrapy requests across a Splash Docker cluster.
The scraper uses a long list of URLs as its start_urls list. There is no "crawling" from page to page via hrefs or pagination.
I am running six Splash Docker containers with 5 slots each as the load-balanced browser cluster, and Scrapy at six concurrent requests.
The dev machine is a MacBook Pro with a dual-core 2.4 GHz CPU and 16 GB of RAM.
When the spider starts up, the Aquarium stdout shows fast requests/responses, the onboard fan spins up, and the CPU is about 90% busy with 10% idle, so I am not overloading the system resources. Memory/swap is not exhausted either.
Even then, I get a very slow ~30 pages/minute. After a few minutes, the fans spin down, the system resources are largely free (>60% idle), and the Scrapy log shows every request ending in a 503 timeout.
When I look at the stdout of the aquarium cluster, there are requests being processed, albeit very slowly compared to when the spider is first invoked.
If I go to localhost:9050, I do get the Splash page after 10 seconds or so, so the load balancer/Splash is online.
If I stop the spider and restart it, it starts up normally, so this does not seem to be throttling by the target site; a restarted spider would be throttled too, but it isn't.
I appreciate any insight that the community can offer.
Thanks.

Scrapy Prevent Visiting Same URL Across Schedule

I am planning to deploy a Scrapy spider to ScrapingHub and use the schedule feature to run the spider daily. I know that, by default, Scrapy does not visit the same URL twice within a run. However, I was wondering whether this duplicate-URL avoidance is persistent across scheduled runs on ScrapingHub, and whether I can set things up so that Scrapy does not revisit URLs across its scheduled runs.
DeltaFetch is a Scrapy plugin that stores fingerprints of visited URLs across different spider runs. You can use this plugin for incremental (delta) crawls. Its main purpose is to avoid requesting pages that have already been scraped before, even if it happened in a previous execution. It will only make requests to pages from which no items were extracted before, to URLs from the spider's start_urls attribute, or to requests generated in the spider's start_requests method.
See: https://blog.scrapinghub.com/2016/07/20/scrapy-tips-from-the-pros-july-2016/
Plugin repository: https://github.com/scrapy-plugins/scrapy-deltafetch
In Scrapinghub's dashboard, you can activate it on the Addons Setup page inside a Scrapy Cloud project. Note that you'll also need to enable the DotScrapy Persistence addon for it to work.
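If you run the spider yourself rather than through the addons page, enabling the plugin is a couple of settings. A sketch of the relevant settings.py fragment (check the plugin's README for the exact names in your version):

```python
# settings.py -- enabling DeltaFetch in a Scrapy project (sketch)
SPIDER_MIDDLEWARES = {
    "scrapy_deltafetch.DeltaFetch": 100,
}
DELTAFETCH_ENABLED = True
# Optional: where the fingerprint database is stored. On Scrapy Cloud,
# the DotScrapy Persistence addon is what keeps this data between runs.
# DELTAFETCH_DIR = ".scrapy/deltafetch"
```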

Test or Mock Scrapy Pipeline

I was looking into testing the Scrapy pipeline (I already know the spider works) when it occurred to me that I could just use a local copy of a page from the target website instead of repeatedly hitting the site with my spider. But I did not see anything suggesting that option. Is there some reason why this won't work, or why it is not a best practice?
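Testing against a saved local copy is a common approach; also note that a pipeline is just a plain class with a process_item() method, so you can often unit-test it directly with hand-built items and no page at all. A minimal sketch (the pipeline and item here are made up for illustration):

```python
# Hypothetical pipeline under test: pipelines are plain classes, so
# they can be exercised with hand-built items -- no network, no spider.
class PriceToFloatPipeline:
    def process_item(self, item, spider):
        item["price"] = float(item["price"].lstrip("$"))
        return item

pipeline = PriceToFloatPipeline()
item = pipeline.process_item({"price": "$9.99"}, spider=None)
print(item)  # {'price': 9.99}
```

For the spider side, you can build a fake response from a saved HTML file and pass it to the parse method the same way, keeping the whole test offline.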

Usage of scrapyd instead of scrapy

Using scrapyd, I can run Scrapy on multiple cores.
Currently I invoke Scrapy like this:
scrapy crawl buch
According to the information below, there is no multiprocessor usage:
Scrapy does not use multithreading and will not use more than one core. If your spider is CPU bound, the usual way to speed up is to use multiple separate scrapy processes, avoiding any bottlenecks with the python GIL.
This information is based on:
CPU-intensive parsing with scrapy
How can I use Scrapy on all cores with scrapyd, as in the example above?
The problem is that I cannot simply start the spiders in parallel: without splitting the list of URLs to crawl among them, they would all work through the same list and crawl the same pages at the same time.
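One common pattern (a sketch, not a built-in scrapyd feature): split the URL list into one disjoint chunk per process and pass each chunk's index to the spider as an argument, so parallel jobs never crawl the same URLs. The project/spider names below are placeholders:

```python
# Split a URL list into n disjoint chunks, one per scrapyd job.
def split_urls(urls, n_parts):
    return [urls[i::n_parts] for i in range(n_parts)]

urls = [f"http://example.com/page/{i}" for i in range(10)]
chunks = split_urls(urls, 4)

# Each job is then scheduled with its chunk index as a spider argument:
#   curl http://localhost:6800/schedule.json \
#        -d project=myproject -d spider=buch -d part=0
# and the spider picks its chunk in __init__ via the 'part' kwarg.
```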

What are the advantages of using scrapyd?

The Scrapy docs say:
Scrapy comes with a built-in service, called “Scrapyd”, which allows you to deploy (aka. upload) your projects and control their spiders using a JSON web service.
Are there any advantages to using scrapyd?
Scrapyd allows you to run Scrapy on a different machine from the one you are using, via a handy web API: you can use curl or even a web browser to upload new project versions and run them. Otherwise, if you wanted to run Scrapy in the cloud somewhere, you would have to scp the new spider code over, then log in with ssh and spawn your scrapy crawl myspider.
Scrapyd will also manage processes for you if you want to run many spiders in parallel. But if you have Scrapy on your local machine with access to the command line and just want to run one spider at a time, you're better off running the spider manually.
If you are developing spiders, you certainly don't want to use scrapyd for quick edit/test iterations, as it just adds a layer of complexity.
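For reference, driving scrapyd really is just HTTP. A small sketch that builds the request for scrapyd's schedule.json endpoint (host and project names are placeholders; send the body as an HTTP POST):

```python
from urllib.parse import urlencode

# Build the URL and form body for scrapyd's schedule.json endpoint.
# Scrapyd listens on port 6800 by default.
def schedule_request(host, project, spider, **spider_args):
    data = {"project": project, "spider": spider, **spider_args}
    return f"http://{host}:6800/schedule.json", urlencode(data)

url, body = schedule_request("localhost", "myproject", "myspider")
print(url)   # http://localhost:6800/schedule.json
print(body)  # project=myproject&spider=myspider
```

The equivalent one-liner from the shell is curl http://localhost:6800/schedule.json -d project=myproject -d spider=myspider.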