I have developed a crawler using Scrapy and integrated it into Celery tasks. It runs well, without any problems.
Scrapy is built on the Twisted reactor, and, as the Scrapy docs say, "Scrapy runs a single spider per process". My question is: is there a way to run a spider inside a thread? Or can I run a Scrapy crawler with Celery's eventlet pool?
Thanks!
Related
I am planning on deploying a Scrapy spider to ScrapingHub and using the schedule feature to run the spider on a daily basis. I know that, by default, Scrapy does not visit the same URL twice within a single run. However, I was wondering whether this duplicate URL avoidance persists across scheduled starts on ScrapingHub, and, if not, whether I can configure Scrapy so that it does not revisit the same URLs across its scheduled starts.
DeltaFetch is a Scrapy plugin that stores fingerprints of visited URLs across different Spider runs. You can use this plugin for incremental (delta) crawls. Its main purpose is to avoid requesting pages that have been already scraped before, even if it happened in a previous execution. It will only make requests to pages from where no items were extracted before, to URLs from the spiders' start_urls attribute or requests generated in the spiders' start_requests method.
See: https://blog.scrapinghub.com/2016/07/20/scrapy-tips-from-the-pros-july-2016/
Plugin repository: https://github.com/scrapy-plugins/scrapy-deltafetch
In Scrapinghub's dashboard, you can activate it on the Addons Setup page, inside a Scrapy Cloud project. You'll also need to enable the DotScrapy Persistence addon for it to work.
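Outside the dashboard, the plugin is normally switched on through the project settings. The fragment below is a minimal sketch based on the plugin repository linked above, assuming scrapy-deltafetch is already installed and that the project uses a standard settings.py; the priority value 100 is just an example.

```python
# settings.py (fragment) -- a sketch, assuming scrapy-deltafetch is installed

# Register the DeltaFetch spider middleware; 100 is an example priority.
SPIDER_MIDDLEWARES = {
    "scrapy_deltafetch.DeltaFetch": 100,
}

# Turn the middleware on so fingerprints of scraped pages are stored
# and reused on the next run.
DELTAFETCH_ENABLED = True
```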
Using scrapyd, I can run Scrapy on multiple cores.
This is the call I make using Scrapy:
scrapy crawl buch
According to the following information, Scrapy itself does not use multiple processors:
Scrapy does not use multithreading and will not use more than one core. If your spider is CPU bound, the usual way to speed up is to use multiple separate scrapy processes, avoiding any bottlenecks with the python GIL.
This information is based on:
CPU-intensive parsing with scrapy
How can I now use Scrapy on all cores with scrapyd, as in the example above?
The problem is that I cannot simply start several instances in parallel: unless the list of URLs is split between them, the scrapers would all work through the same list and crawl the same pages.
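One possible way to split the work, not taken from this thread, is to give every process its own slice of the URL list. The sketch below is plain Python; shard_urls and the example.com URLs are hypothetical names used only for illustration. Each process could receive its slice number through spider arguments, for example scrapy crawl buch -a part=0 -a total=4, where part and total are made-up argument names.

```python
def shard_urls(urls, index, total):
    """Return the slice of `urls` that worker `index` (0-based) should crawl.

    Every `total`-th URL is taken, starting at `index`, so the slices of
    all workers are disjoint and together cover the whole list.
    """
    if total < 1 or not 0 <= index < total:
        raise ValueError("index must satisfy 0 <= index < total")
    return urls[index::total]


# Hypothetical URL list; in a real spider this would be the start URLs.
all_urls = [f"https://example.com/page/{n}" for n in range(10)]

# Spider arguments arrive as strings, so a spider would convert them first,
# e.g. shard_urls(all_urls, int(self.part), int(self.total)).
shards = [shard_urls(all_urls, i, 4) for i in range(4)]
```

Each of the four processes then starts only from its own shard. Links discovered later in the crawl can still overlap between processes, so this only helps when the start list is what drives the crawl.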
Is there a command-line switch or another out-of-the-box trick to force Scrapy to download only a few URLs (even though more are available)?
I am currently invoking a spider on the command line as shown below and would like it to finish after 10 URL retrievals.
scrapy crawl mySpider
You can pass a setting to scrapy:
scrapy crawl mySpider -s CLOSESPIDER_PAGECOUNT=10
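If the spider crawls more than 10 pages, it will be closed with the reason closespider_pagecount. Requests that are already in the downloader queue are still processed, so a few more than 10 responses may be handled before the spider stops.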
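The same limit can also be set in the project's settings module instead of on the command line. This is a minimal sketch, assuming a standard settings.py; the value 10 is the example from the question.

```python
# settings.py (fragment) -- project-wide equivalent of
# `-s CLOSESPIDER_PAGECOUNT=10` on the command line.

# Close the spider once this many responses have been crawled.
CLOSESPIDER_PAGECOUNT = 10
```

A value passed with -s on the command line takes precedence over the one in settings.py, so the file can hold a default that is overridden for individual runs.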
I'm using Scrapy to crawl a website, but sometimes something bad happens (a power outage, etc.).
How can I continue crawling from where it broke off? I don't want to start over from the seeds.
This can be done by persisting scheduled requests to disk. Start the spider with a JOBDIR, and run the same command again later to resume the crawl:
scrapy crawl somespider -s JOBDIR=crawls/somespider-1
See http://doc.scrapy.org/en/latest/topics/jobs.html for more information.
The scrapy doc says that:
Scrapy comes with a built-in service, called “Scrapyd”, which allows you to deploy (aka. upload) your projects and control their spiders using a JSON web service.
Are there any advantages to using Scrapyd?
Scrapyd allows you to run Scrapy on a different machine from the one you are using, via a handy web API, which means you can just use curl or even a web browser to upload new project versions and run them. Otherwise, if you wanted to run Scrapy in the cloud somewhere, you would have to copy the new spider code over with scp and then log in with ssh and run scrapy crawl myspider yourself.
Scrapyd will also manage processes for you if you want to run many spiders in parallel; but if you have Scrapy on your local machine and have access to the command-line or a way to run spiders and just want to run one spider at a time, then you're better off running the spider manually.
If you are developing spiders then for sure you don't want to use scrapyd for quick compile/test iterations as it just adds a layer of complexity.
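To make the web API mentioned above more concrete, here is a rough sketch of scheduling a spider run from Python using only the standard library. It assumes a Scrapyd instance listening on its default port 6800 and a project that has already been deployed; myproject and myspider are placeholder names, and build_schedule_request and schedule are helper names made up for this example.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_schedule_request(base_url, project, spider):
    """Build a POST request for Scrapyd's schedule.json endpoint."""
    data = urlencode({"project": project, "spider": spider}).encode("ascii")
    return Request(f"{base_url.rstrip('/')}/schedule.json", data=data, method="POST")


def schedule(base_url, project, spider):
    """Send the request and return Scrapyd's JSON reply as a dict."""
    with urlopen(build_schedule_request(base_url, project, spider), timeout=10) as resp:
        return json.load(resp)


# Building the request needs no running server; calling schedule() does.
req = build_schedule_request("http://localhost:6800/", "myproject", "myspider")
```

Calling schedule("http://localhost:6800", "myproject", "myspider") against a running Scrapyd would do the same as posting those two fields to schedule.json with curl.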