What are the advantages of using Scrapyd? - scrapy

The scrapy doc says that:
Scrapy comes with a built-in service, called “Scrapyd”, which allows you to deploy (aka. upload) your projects and control their spiders using a JSON web service.
Are there advantages to using Scrapyd?

Scrapyd allows you to run Scrapy on a different machine than the one you are using, via a handy web API, which means you can use curl or even a web browser to upload new project versions and run them. Otherwise, if you wanted to run Scrapy in the cloud somewhere, you would have to copy the new spider code over with scp, then log in with ssh and spawn your `scrapy crawl myspider` yourself.
Scrapyd will also manage processes for you if you want to run many spiders in parallel; but if you have Scrapy on your local machine, have access to the command line, and just want to run one spider at a time, then you're better off running the spider manually.
If you are developing spiders, you certainly don't want to use Scrapyd for quick edit/test iterations, as it just adds a layer of complexity.
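To make the web-API point concrete, here is a minimal sketch of scheduling a run through Scrapyd's schedule.json endpoint; the host, project, and spider names are placeholders, and the actual HTTP call is left commented out:

```python
from urllib.parse import urlencode

# Scrapyd listens on port 6800 by default and exposes a JSON API.
# "myproject" and "myspider" are placeholder names.
endpoint = "http://localhost:6800/schedule.json"
payload = urlencode({"project": "myproject", "spider": "myspider"})

# Against a live Scrapyd instance, the actual call would be:
# import urllib.request
# urllib.request.urlopen(endpoint, payload.encode("utf-8"))
print(endpoint, payload)
```

The same endpoint family (addversion.json, listjobs.json, cancel.json) covers deployment and job control, which is what makes curl or a browser sufficient.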

Related

Scrapyd vs. cron + git hooks

I have a project with about 30 spiders, all scheduled via cron jobs. Whenever I want to deploy the project, I git push to production, where a hook puts the files in place.
Now I came across Scrapyd, which seems to do both in a more sophisticated way by packaging the scraper as a Python egg and deploying it to the production environment. Looking at the code, it seems that development on this project came to a halt about 3 years ago. I am wondering whether there is an advantage to switching to Scrapyd, and what the reason is for this code being so old and no longer under development. Scrapy itself, in contrast, receives regular updates.
Would you advise using Scrapyd, and if so, why?
I've been using Scrapyd for about 2 years, and I do prefer it over just launching jobs with `scrapy crawl`:
You can set the number of scrapers that can run at the same time using `max_proc_per_cpu`. Any scrapers you launch once the maximum is reached are put in a queue and launched when a slot becomes available.
You have a minimalistic GUI in which you can check the queues and read the logs.
Scheduling spiders is easily done with API calls, and the same goes for listing spiders, cancelling spiders, and so on.
You can use the HTTP cache even when running multiple spiders at the same time.
You can deploy to multiple servers at once if you want to spread your crawls over different servers.
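For reference, the concurrency cap mentioned above lives in scrapyd.conf; a minimal sketch, with an illustrative value:

```ini
# scrapyd.conf -- the value 4 is just an example
[scrapyd]
max_proc_per_cpu = 4
```

Jobs beyond this per-CPU limit wait in Scrapyd's queue until a running job finishes.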

Using Scrapy Spider results for a website

I've experimented with some crawlers to pull web data from within a Python environment on my local machine. Ideally, I'd like to host a website that can initiate crawlers to aggregate content and display that on the site.
My question is, is it possible to do this from a web environment and not my local machine?
Sure, there are many services that do exactly what you want.
ScrapingHub is the best example: https://scrapinghub.com/
You can deploy your spiders there and run them periodically (paid service). Deploy and call the spider via the ScrapingHub API from your website, and use the spider's output on the site you host.
Alternatively, you can achieve the same idea on your own server, with your website driving it via API calls.
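As a sketch of the self-hosted option, a site backend could poll crawl status through Scrapyd's listjobs.json endpoint on its own server; the host and project names here are placeholders, and the live request is only shown in a comment:

```python
from urllib.parse import urlencode

# Hypothetical internal crawl server running Scrapyd.
SCRAPYD = "http://crawler.internal:6800"

def status_url(project):
    # Builds the URL a web backend would poll for job status.
    return f"{SCRAPYD}/listjobs.json?{urlencode({'project': project})}"

# A live call would be roughly:
#   import json, urllib.request
#   json.load(urllib.request.urlopen(status_url("news")))
# and the response contains "pending", "running", and "finished" job lists.
```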

Scrapy Prevent Visiting Same URL Across Schedule

I am planning on deploying a Scrapy spider to ScrapingHub and using the schedule feature to run the spider on a daily basis. I know that, by default, Scrapy does not visit the same URLs. However, I was wondering if this duplicate URL avoidance is persistent across scheduled starts on ScrapingHub? And whether or not I can set it so that Scrapy does not visit the same URLs across its scheduled starts.
DeltaFetch is a Scrapy plugin that stores fingerprints of visited URLs across different spider runs. You can use this plugin for incremental (delta) crawls. Its main purpose is to avoid requesting pages that have already been scraped, even if that happened in a previous execution. It will only make requests to pages from which no items were extracted before, to URLs in the spider's start_urls attribute, or to requests generated in the spider's start_requests method.
See: https://blog.scrapinghub.com/2016/07/20/scrapy-tips-from-the-pros-july-2016/
Plugin repository: https://github.com/scrapy-plugins/scrapy-deltafetch
In ScrapingHub's dashboard, you can activate it on the Addons Setup page inside a Scrapy Cloud project. Note, however, that you'll also need to enable the DotScrapy Persistence addon for it to work.
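Outside of Scrapy Cloud, DeltaFetch is enabled in the project's settings.py; a minimal sketch, assuming the scrapy-deltafetch package is installed (the middleware order value follows the plugin's README):

```python
# settings.py -- requires the scrapy-deltafetch package
SPIDER_MIDDLEWARES = {
    "scrapy_deltafetch.DeltaFetch": 100,
}
DELTAFETCH_ENABLED = True
```

With this in place, fingerprints of pages that yielded items are persisted between runs, so a daily scheduled crawl skips them automatically.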

Test or Mock Scrapy Pipeline

I was looking into testing the Scrapy pipeline (I already know the spider works) when it occurred to me that I could just use a local copy of a page from the target website instead of repeatedly hitting the site with my spider online. But I did not see anything suggesting that option. Is there some reason why this won't work, or is it not a best practice?
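One related approach that avoids the network entirely: since a Scrapy pipeline is just a class with a process_item method, you can unit-test it directly with hand-built item dicts, no spider, local page, or live site needed. A minimal sketch, where PricePipeline and its field names are hypothetical:

```python
# Hypothetical pipeline that normalizes a scraped price string.
class PricePipeline:
    def process_item(self, item, spider):
        # Strip the currency symbol and convert to a float.
        item["price"] = float(item["price"].lstrip("$"))
        return item

# Unit test with a hand-built item; spider can be None because
# this pipeline never touches it.
pipeline = PricePipeline()
item = pipeline.process_item({"price": "$9.99"}, spider=None)
assert item["price"] == 9.99
```

The local-copy idea works too for testing parsing: saving a page and feeding it to the spider's parse method keeps tests fast and avoids hammering the target site.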

Import.io some crawlers don't have the button for crawl locally

I was creating some crawlers using import.io, but for some of them the option to run locally is not showing. Does anyone know why they don't have the run-locally button, or how I can add it to those crawlers?
If you don't see the option to run the crawler [remotely or locally], it means that your crawler is already running locally only.
When you save a crawler, import.io runs a few checks to see if it can be run remotely on our servers; in some cases this increases the chances of the crawler working, as the servers do additional processing.
If those checks fail, the crawler can only run locally, and therefore your crawler will run locally by default.