I was looking into testing the Scrapy pipeline (I already know the spider works) when it occurred to me that I could just use a local copy of a page from the target website instead of repeatedly hitting the site with my spider. But I did not see anything suggesting that option. Is there some reason why this won't work, or why it is not a best practice?
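Here is the kind of test I have in mind, as a rough sketch only (MySpider, MyPipeline and the saved-page path are placeholders for my own project): wrap a locally saved copy of the page in a fake HtmlResponse, run the spider's parse method on it, and feed the resulting items to the pipeline by hand.

```python
# Sketch: test a pipeline against a locally saved page instead of the live site.
# MySpider, MyPipeline, module paths and the HTML file are placeholders.
from scrapy.http import HtmlResponse, Request

from myproject.spiders.my_spider import MySpider
from myproject.pipelines import MyPipeline


def fake_response_from_file(path, url="http://www.example.com"):
    """Build an HtmlResponse from a saved copy of a page."""
    with open(path, "rb") as f:
        body = f.read()
    return HtmlResponse(url=url, request=Request(url=url), body=body, encoding="utf-8")


def test_pipeline_with_local_page():
    spider = MySpider()
    pipeline = MyPipeline()
    response = fake_response_from_file("tests/pages/saved_page.html")
    for item in spider.parse(response):
        processed = pipeline.process_item(item, spider)
        assert processed is not None
```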
I'm currently trying to load test a homepage I'm developing. Until now Loader.io was good enough for my purposes, but I realized it does not download or use the embedded assets.
Is there a load testing service that gets as close as possible to real user behavior?
I haven't found anything so far. Hopefully somebody knows a suitable service.
Thanks in advance!
Apache JMeter does it for sure:
See Web Testing with JMeter: How To Properly Handle Embedded Resources in HTML Responses
Moreover, it simulates the browser cache via the HTTP Cache Manager.
If you are rather looking for a "service", there are several options for running a JMeter test in the cloud, ranging from shell scripts like the JMeter ec2 Script up to end-to-end solutions like Flood.io or BlazeMeter.
I've experimented with some crawlers to pull web data from within a Python environment on my local machine. Ideally, I'd like to host a website that can initiate crawlers to aggregate content and display that on the site.
My question is, is it possible to do this from a web environment and not my local machine?
Sure, there are many services that do exactly what you want.
Scrapinghub is the best example you can get: https://scrapinghub.com/
You can deploy your spiders there and run them periodically (paid service). Deploy and call the spider via the Scrapinghub API from your website, and use the spider's output in your host website.
Also, you can achieve the same idea on your own server and website via an API call.
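As a rough illustration of that API-call idea (a sketch only, not a complete recipe; the API key, project id and spider name below are placeholders, and it assumes the official scrapinghub Python client is installed):

```python
# Sketch: drive a Scrapy Cloud spider from your own web app using the
# `scrapinghub` Python client. APIKEY, PROJECT_ID and "myspider" are placeholders.
from scrapinghub import ScrapinghubClient

APIKEY = "your-scrapinghub-api-key"
PROJECT_ID = 123456

client = ScrapinghubClient(APIKEY)
project = client.get_project(PROJECT_ID)

# Schedule a run of the spider (e.g. from an endpoint in your website).
job = project.jobs.run("myspider")
print("scheduled job:", job.key)

# Later, fetch the items of the most recent finished run to display on the site.
last = next(project.jobs.iter(spider="myspider", state="finished", count=1), None)
if last:
    for item in client.get_job(last["key"]).items.iter():
        print(item)
```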
I am planning on deploying a Scrapy spider to ScrapingHub and using the schedule feature to run the spider on a daily basis. I know that, by default, Scrapy does not visit the same URL twice within a run. However, I was wondering whether this duplicate URL avoidance persists across scheduled runs on ScrapingHub, and whether I can set it up so that Scrapy does not visit the same URLs across its scheduled runs.
DeltaFetch is a Scrapy plugin that stores fingerprints of visited URLs across different spider runs. You can use this plugin for incremental (delta) crawls. Its main purpose is to avoid requesting pages that have already been scraped before, even if that happened in a previous execution. It will only make requests to pages from which no items were extracted before, to URLs from the spider's start_urls attribute, and to requests generated in the spider's start_requests method.
See: https://blog.scrapinghub.com/2016/07/20/scrapy-tips-from-the-pros-july-2016/
Plugin repository: https://github.com/scrapy-plugins/scrapy-deltafetch
In Scrapinghub's dashboard, you can activate it on the Addons Setup page inside a Scrapy Cloud project. Note that you'll also need to enable the DotScrapy Persistence addon for it to work.
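Outside of Scrapy Cloud, the plugin is enabled in the project's settings.py; a minimal sketch, assuming the scrapy-deltafetch package is installed:

```python
# settings.py — enable scrapy-deltafetch so URL fingerprints persist across runs.
SPIDER_MIDDLEWARES = {
    "scrapy_deltafetch.DeltaFetch": 100,
}
DELTAFETCH_ENABLED = True

# Optional: wipe the fingerprint database once and recrawl everything.
# DELTAFETCH_RESET = True
```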
I was creating some crawlers using import.io; however, it seems that for some of them the option to run locally is not showing. Does anyone know why they don't have the run-locally button, or how I can add it to the crawlers?
If you don't see the option to run the crawler (remotely or locally), it means that your crawler already runs locally only.
When you save a crawler, import.io does a few checks to see if it can be run remotely on our servers; in some cases this increases the chances of the crawler working, as the servers do additional processing.
If those checks fail, the crawler can only run locally, and therefore it will be run locally by default.
The Scrapy docs say:
Scrapy comes with a built-in service, called “Scrapyd”, which allows you to deploy (aka. upload) your projects and control their spiders using a JSON web service.
Are there any advantages to using Scrapyd?
Scrapyd allows you to run Scrapy on a different machine than the one you are using, via a handy web API: you can just use curl or even a web browser to upload new project versions and run them. Otherwise, if you wanted to run Scrapy in the cloud somewhere, you would have to scp the new spider code over, then log in with ssh and spawn your scrapy crawl myspider.
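For illustration, a rough sketch of what that web API looks like from Python (the Scrapyd host, project and spider names are placeholders; curl works just as well):

```python
# Sketch: schedule and inspect crawls via Scrapyd's JSON API.
import requests

SCRAPYD = "http://localhost:6800"  # placeholder Scrapyd host

# Schedule a crawl of "myspider" in the deployed project "myproject".
resp = requests.post(f"{SCRAPYD}/schedule.json",
                     data={"project": "myproject", "spider": "myspider"})
print(resp.json())  # e.g. {"status": "ok", "jobid": "..."}

# List pending/running/finished jobs for the project.
jobs = requests.get(f"{SCRAPYD}/listjobs.json",
                    params={"project": "myproject"}).json()
print(jobs)
```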
Scrapyd will also manage processes for you if you want to run many spiders in parallel; but if you have Scrapy on your local machine, have access to the command line or some other way to run spiders, and just want to run one spider at a time, then you're better off running the spider manually.
If you are developing spiders, you certainly don't want to use Scrapyd for quick compile/test iterations, as it just adds a layer of complexity.