Modules folder in Scrapinghub - scrapy

I'm currently using Scrapinghub's Scrapy Cloud to host my 12 spiders (in 12 different projects).
I'd like to have one folder of functions that are used by all 12 spiders, but I'm not sure of the best way to implement this without duplicating a functions folder in each spider.
I'm considering hosting all spiders under the same project, creating a private package in the cloud that the spiders depend on, or hosting Scrapyd myself so I can reference the modules.
Has anyone stumbled upon this and what was your solution?
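One way to sketch the private-package idea (all names below are hypothetical, not from the question) is to factor the shared functions into a small installable package and declare it as a dependency of every spider project, e.g. in each project's requirements file, so Scrapy Cloud installs it alongside the spider code:

```python
# shared_utils/__init__.py -- hypothetical package shared by all 12 spiders
def clean_price(raw):
    """Strip currency symbols and thousands separators and return a float."""
    return float(raw.replace("$", "").replace(",", "").strip())
```

```python
# myproject/spiders/example.py -- any spider can then simply import it
import scrapy

from shared_utils import clean_price


class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["https://example.com/products"]

    def parse(self, response):
        for product in response.css(".product"):
            yield {
                "title": product.css("h2::text").get(),
                "price": clean_price(product.css(".price::text").get(default="0")),
            }
```

The self-hosted Scrapyd option would work the same way, provided the shared package is installed on the machine that runs Scrapyd.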

Related

Scrapyd vs. cron + git hooks

I have a project with about 30 spiders, all scheduled via cron jobs. Whenever I want to deploy the project, I git push to production, where a hook puts the files in place.
Now I came across scrapyd, which seems to do both in a more sophisticated way by eggifying the scraper and deploying it to the production environment. Looking at the code, it seems that development on this project came to a halt about 3 years ago. I am wondering whether there is an advantage to switching to scrapyd, and why the code is so old and no longer under development. Scrapy itself, in contrast, receives regular updates.
Would you advise using scrapyd, and if so, why?
I've been using scrapyd for about 2 years, and I prefer it over just launching jobs with `scrapy crawl`:
You can set the number of scrapers that can run at the same time using `max_proc_per_cpu`. Any scrapers you launch once the maximum is reached are put in a queue and launched when a slot becomes available.
You have a minimalistic GUI in which you can check the queues & read the logs.
Scheduling spiders is easily done with API calls, and the same goes for listing spiders, cancelling spiders, and so on (see the sketch after this list).
You can use the HTTP cache even when running multiple spiders at the same time.
You can deploy to multiple servers at once if you want to spread your crawls over different servers.
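As a rough illustration of those API calls, here is a minimal sketch against the endpoints documented by Scrapyd (the project name, spider name, and `localhost:6800` address are placeholders). Note that the `max_proc_per_cpu` limit itself is set in Scrapyd's configuration file (`scrapyd.conf`), not through the API:

```python
import requests

SCRAPYD = "http://localhost:6800"  # default Scrapyd address; adjust as needed

# Schedule a spider run; Scrapyd queues it if the process limit is reached.
resp = requests.post(
    f"{SCRAPYD}/schedule.json",
    data={"project": "myproject", "spider": "myspider"},
)
job_id = resp.json()["jobid"]

# List the spiders available in a project.
spiders = requests.get(
    f"{SCRAPYD}/listspiders.json", params={"project": "myproject"}
).json()["spiders"]

# Inspect pending / running / finished jobs (i.e. the queue).
jobs = requests.get(
    f"{SCRAPYD}/listjobs.json", params={"project": "myproject"}
).json()

# Cancel a pending or running job.
requests.post(f"{SCRAPYD}/cancel.json", data={"project": "myproject", "job": job_id})
```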

Using Scrapy Spider results for a website

I've experimented with some crawlers to pull web data from within a Python environment on my local machine. Ideally, I'd like to host a website that can initiate crawlers to aggregate content and display that on the site.
My question is, is it possible to do this from a web environment and not my local machine?
Sure, there are many services that do exactly what you want.
Scrapinghub is the best example: https://scrapinghub.com/
You can deploy your spiders there and run them periodically (paid service). Deploy the spiders, trigger them via the Scrapinghub API from your website, and use the spider output in your website.
You can also achieve the same thing on your own server, exposing the spiders to your website via an API call.
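As a rough sketch of how a website backend could trigger a spider on Scrapy Cloud and read its output, using the `python-scrapinghub` client (the API key, project id, and spider name below are placeholders):

```python
from scrapinghub import ScrapinghubClient

# Placeholders: use your own Scrapy Cloud API key and numeric project id.
client = ScrapinghubClient("YOUR_API_KEY")
project = client.get_project(12345)

# Start a run of the spider deployed in that project.
job = project.jobs.run("myspider")

# Later, once the job has finished, fetch the scraped items
# and render them in your website.
for item in job.items.iter():
    print(item)
```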

Scrapy Prevent Visiting Same URL Across Schedule

I am planning on deploying a Scrapy spider to ScrapingHub and using the schedule feature to run the spider on a daily basis. I know that, by default, Scrapy does not revisit the same URLs within a single run. However, I was wondering whether this duplicate-URL avoidance persists across scheduled runs on ScrapingHub, and whether I can configure it so that Scrapy does not visit the same URLs across its scheduled runs.
DeltaFetch is a Scrapy plugin that stores fingerprints of visited URLs across different spider runs. You can use this plugin for incremental (delta) crawls. Its main purpose is to avoid requesting pages that have already been scraped, even if that happened in a previous execution. It will only make requests to pages from which no items were extracted before, to URLs in the spider's `start_urls` attribute, and to requests generated in the spider's `start_requests` method.
See: https://blog.scrapinghub.com/2016/07/20/scrapy-tips-from-the-pros-july-2016/
Plugin repository: https://github.com/scrapy-plugins/scrapy-deltafetch
In Scrapinghub's dashboard, you can activate it on the Addons Setup page inside a Scrapy Cloud project. Note that you'll also need to enable the DotScrapy Persistence addon for it to work.
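If you run the spider outside Scrapy Cloud, the plugin can be enabled directly in the project settings instead; a minimal sketch following the scrapy-deltafetch README (the middleware order value 100 is just a conventional choice):

```python
# settings.py -- requires: pip install scrapy-deltafetch

SPIDER_MIDDLEWARES = {
    "scrapy_deltafetch.DeltaFetch": 100,
}

DELTAFETCH_ENABLED = True

# Optional: uncomment to wipe the stored fingerprints and re-crawl everything once.
# DELTAFETCH_RESET = True
```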

handling file upload and serving in a distributed web application

I'm going to deploy a web application with multiple Pyramid application servers and nginx as a load balancer.
This application will have a feature for uploading files which should be available for downloading afterwards.
The total size of uploaded files may be very large, so I'd like to deploy a separate web server to serve these static files (this is one reason why I don't like the rsync solution proposed here).
What is the best solution for handling file upload and synchronization in this case? I was thinking about NFS or something like that, but I'm not sure it's a good way to solve the problem. I suppose there must be some best practices here, or even a tool or library for this purpose.
UPDATE:
I don't want to use cloud services like Dropbox; it would be nicer to find a synchronization solution inside the network segment.
UPDATE2:
I ended up setting up NFS; for now it works perfectly.
This isn't really a Python or Pyramid question, but you should investigate distributed file systems and CDNs, both of which are designed for this kind of thing. GridFS is easy enough to get going with, but there are plenty of other options; both Amazon and Google offer similar services.
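As a rough sketch of the GridFS option mentioned above, using `pymongo` (the database and file names are placeholders; this is not the NFS setup the asker ultimately chose):

```python
import gridfs
from pymongo import MongoClient

# Every application server connects to the same MongoDB deployment.
db = MongoClient("mongodb://localhost:27017")["uploads_db"]
fs = gridfs.GridFS(db)

# Store an uploaded file; GridFS splits it into chunks inside MongoDB.
with open("report.pdf", "rb") as f:
    file_id = fs.put(f, filename="report.pdf")

# Later, any app server (or a dedicated file server) can stream it back.
stored = fs.get(file_id)
data = stored.read()
```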

What are the advantages of using scrapyd?

The scrapy doc says that:
Scrapy comes with a built-in service, called “Scrapyd”, which allows you to deploy (aka. upload) your projects and control their spiders using a JSON web service.
Are there any advantages to using scrapyd?
Scrapyd allows you to run Scrapy on a different machine than the one you are using, via a handy web API, which means you can just use curl or even a web browser to upload new project versions and run them. Otherwise, if you wanted to run Scrapy in the cloud somewhere, you would have to scp the new spider code over, log in with ssh, and spawn your `scrapy crawl myspider`.
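For illustration, uploading a new project version over that web API might look roughly like this (the server address, project name, version label, and egg path are placeholders; the egg is normally built and uploaded for you by the `scrapyd-deploy` tool):

```python
import requests

SCRAPYD = "http://scrapyd-host:6800"  # placeholder address of the Scrapyd server

# Upload a new version of the project as a Python egg via addversion.json.
with open("myproject-1.0.egg", "rb") as egg:
    requests.post(
        f"{SCRAPYD}/addversion.json",
        data={"project": "myproject", "version": "1.0"},
        files={"egg": egg},
    )

# The freshly deployed spider can then be started with schedule.json,
# as shown in the earlier sketch.
```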
Scrapyd will also manage processes for you if you want to run many spiders in parallel; but if you have Scrapy on your local machine, have access to the command line or another way to run spiders, and only want to run one spider at a time, then you're better off running the spider manually.
If you are developing spiders, you certainly don't want to use scrapyd for quick compile/test iterations, as it just adds a layer of complexity.