I'm looking at ways of implementing crawl delays inside Scrapy spiders. Is it possible to access the reactor's callLater method from within a spider? That would make it easy to parse a page after n seconds.
You can actually set a delay quite easily with the DOWNLOAD_DELAY setting in the settings file.
DOWNLOAD_DELAY
Default: 0
The amount of time (in secs) that the downloader should wait before
downloading consecutive pages from the same spider. This can be used
to throttle the crawling speed to avoid hitting servers too hard.
Decimal numbers are supported. Example:
DOWNLOAD_DELAY = 0.25  # 250 ms of delay
This setting is also affected by the RANDOMIZE_DOWNLOAD_DELAY setting (which is enabled by default). By default, Scrapy doesn't wait a fixed amount of time between requests, but uses a random interval between 0.5 * DOWNLOAD_DELAY and 1.5 * DOWNLOAD_DELAY.
You can also change this setting per spider.
See also Scrapy's Docs - DOWNLOAD_DELAY
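For illustration, a minimal per-spider sketch (the spider name and URL are placeholders): Scrapy picks up a download_delay attribute on the spider class, and custom_settings can override any project-level setting for that spider only.

import scrapy

class MySpider(scrapy.Spider):
    name = "my_spider"                    # placeholder name
    start_urls = ["https://example.com"]  # placeholder URL
    download_delay = 2.0                  # per-spider delay, overrides DOWNLOAD_DELAY

    # custom_settings overrides any project setting just for this spider
    custom_settings = {
        "RANDOMIZE_DOWNLOAD_DELAY": False,  # use a fixed delay instead of 0.5-1.5x
    }

    def parse(self, response):
        yield {"title": response.css("title::text").get()}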
I would like to yield more requests at the end of a CrawlSpider that uses Rules.
I noticed I was not able to feed more requests by doing this in the spider_closed method:
self.crawler.engine.crawl(r, self)
I noticed that this technique works in the spider_idle method, but I would like to wait to be sure that the crawl is finished before feeding more requests.
I set the setting CLOSESPIDER_TIMEOUT = 30
What would be the code to wait 20 seconds idle before triggering the process of feeding more requests?
Is there a better way?
If it is really important that the previous crawling has completely finished before the new crawling starts, consider running either two separate spiders or the same spider twice in a row with different arguments that determine which URLs it crawls. See Run Scrapy from a script.
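For reference, a minimal sketch of running two crawls one after the other from a script, following the "Run Scrapy from a script" pattern (both spider classes here are placeholders for your own spiders):

import scrapy
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class FirstSpider(scrapy.Spider):          # placeholder first crawl
    name = "first"
    start_urls = ["https://example.com"]

    def parse(self, response):
        yield {"url": response.url}

class SecondSpider(FirstSpider):           # placeholder follow-up crawl
    name = "second"
    start_urls = ["https://example.com/more"]

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    # The second crawl starts only after the first has completely finished.
    yield runner.crawl(FirstSpider)
    yield runner.crawl(SecondSpider)
    reactor.stop()

crawl()
reactor.run()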
If you don’t really need for the previous crawling to finish, and you simply have URLs that should have a higher priority than other URLs for some reason, consider using request priorities instead. See the priority parameter of the Request class constructor.
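A hedged sketch of the priority approach (the selectors and callback names are hypothetical); requests with a lower priority value are scheduled after the default-priority ones, so the "extra" requests only go out once the more important ones are done or queued behind them:

import scrapy

class PrioritySpider(scrapy.Spider):       # placeholder spider
    name = "priority_example"
    start_urls = ["https://example.com"]

    def parse(self, response):
        # Default priority is 0; negative priorities are scheduled later.
        for href in response.css("a.follow-up::attr(href)").getall():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_item, priority=-10)
        for href in response.css("a.main::attr(href)").getall():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_item)

    def parse_item(self, response):
        yield {"url": response.url}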
Newbie designing his architecture question here:
My Goal
I want to keep track of multiple twitter profiles over time.
What I want to build:
A SpiderMother class that interfaces with some Database (holding CrawlJobs) to spawn and manage many small Spiders, each crawling 1 user-page on twitter at an irregular interval (the jobs will be added to the database according to some algorithm).
They get spawned as subprocesses by SpiderMother and, depending on the success of the crawl, the database job gets removed. Is this a good architecture?
Problem I see:
Let's say I spawn 100 spiders and my CONCURRENT_REQUESTS limit is 10: will twitter.com be hit by all 100 spiders immediately, or do they line up and go one after the other?
Most Scrapy settings and runtime configuration are isolated to the currently open spider during the run. The default Scrapy request downloader also acts only per spider, so you will indeed see 100 spiders making requests simultaneously if you fire up 100 processes. You have several options to enforce per-domain concurrency globally, and none of them is particularly hassle-free:
Use just one spider running per domain and feed it through redis (check out scrapy-redis). Alternatively don't spawn more than one spider at a time.
Have a fixed pool of spiders or limit the amount of spiders you spawn from your orchestrator. Set concurrency settings to be "desired_concurrency divided by number of spiders".
Override the scrapy downloader class behavior to store its values externally (in redis, for example).
Personally I would probably go with the first, and if hit by the performance limits of a single process, scale to the second.
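A minimal sketch of the second option, assuming a SpiderMother-style orchestrator that launches spiders through the scrapy CLI (the spider name, pool size, and job URLs are all placeholders):

import subprocess

DESIRED_GLOBAL_CONCURRENCY = 10   # total simultaneous requests you want against twitter.com
SPIDER_POOL_SIZE = 5              # fixed number of spider subprocesses
per_spider = max(1, DESIRED_GLOBAL_CONCURRENCY // SPIDER_POOL_SIZE)

# Hypothetical jobs pulled from the CrawlJobs database
jobs = ["https://twitter.com/user_a", "https://twitter.com/user_b"]

processes = [
    subprocess.Popen([
        "scrapy", "crawl", "profile_spider",                   # placeholder spider name
        "-a", f"profile_url={url}",                            # spider argument: the page to crawl
        "-s", f"CONCURRENT_REQUESTS={per_spider}",             # divide the global budget
        "-s", f"CONCURRENT_REQUESTS_PER_DOMAIN={per_spider}",
    ])
    for url in jobs[:SPIDER_POOL_SIZE]
]
for proc in processes:
    proc.wait()   # on success you would remove the job from the database here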
I wonder what techniques you would use when, say, a page contains links to 6 videos of 300 MB each and you want to download them all. Should I write my own custom downloader?
I'm used to using MediaPipeline, but it utilizes the framework scheduler, which has the following issues:
You never know which file is currently being downloaded
You have no idea on download progress/state until it fails
Strange timeout behaviour:
a) The timeout seems to apply to the whole request download operation, not only to pauses in the download. So, say, with a timeout of 5 minutes I will never be able to download a file which takes 6 minutes to download.
b) If you make 5 concurrent long requests and one of them is taking too long, all of them (not yet complete) get timed out. You have to limit the number of concurrent requests to 1 in settings (which will affect the whole spider).
You can make use of a YouTube downloader after having retrieved the links to the videos.
The YouTube downloader will try to continue if a video has not finished downloading. You can also force it to continue. Write a wrapper around it for concurrency if single downloads take long.
Disclaimer: I am not in any way affiliated with the maintainers of this package.
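For illustration, a minimal wrapper sketch, assuming the downloader in question is the youtube-dl command-line tool and that it is installed; the URLs and pool size are placeholders:

import subprocess
from concurrent.futures import ThreadPoolExecutor

def download(url):
    # --continue resumes a partially downloaded file instead of starting over.
    return subprocess.run(["youtube-dl", "--continue", url]).returncode

# Hypothetical video links collected by the spider
video_urls = [
    "https://example.com/video1",
    "https://example.com/video2",
]

with ThreadPoolExecutor(max_workers=2) as pool:
    return_codes = list(pool.map(download, video_urls))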
I am searching for a way to test how many requests my web server can handle before the load time exceeds 5 seconds. Is it possible to manage this with Apache JMeter?
Server: SLES OS running a WordPress blog (Apache web server, MySQL)
Best regards
Andy
It is. Example action plan:
Record anticipated test scenario using HTTP(S) Test Script Recorder or JMeter Chrome Extension
Perform correlation (handle dynamic values) and parametrization (if required)
Add virtual users. It is recommended to configure users to arrive gradually, like starting with 1 and adding 1 each second. You can use Ultimate Thread Group which provides an easy visual way of defining ramp-up and ramp-down.
Add a Duration Assertion with a value of 5000 ms so it will fail any request that takes more than 5 seconds
Use the Active Threads Over Time and Response Times Over Time listeners in combination to determine the maximum number of users that can be served while keeping response times under 5 seconds.
I am crawling one website and parsing some content and images, however even for a simple site with 100 pages or so it is taking hours to do the job. I am using the following settings. Any help would be highly appreciated. I have already seen this question - Scrapy's Scrapyd too slow with scheduling spiders - but couldn't gather much insight.
EXTENSIONS = {'scrapy.contrib.logstats.LogStats': 1}
LOGSTATS_INTERVAL = 60.0
RETRY_TIMES = 4
CONCURRENT_REQUESTS = 32
CONCURRENT_REQUESTS_PER_DOMAIN = 12
CONCURRENT_ITEMS = 200
DOWNLOAD_DELAY = 0.75
Are you sure the website is responding OK?
Setting DOWNLOAD_DELAY = 0.75 effectively makes requests to the same domain sequential, with a delay of roughly 0.75 seconds between them. Your crawl will certainly be faster if you remove this; however, with 12 concurrent requests per domain, be careful you are not hitting websites too aggressively.
Even with the delay it should not take hours, so that's why I am wondering if the website is slow or unresponsive. Some websites will do this to bots.
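As a rough sanity check (assuming about 100 page requests and the randomized delay behaviour described earlier), the delay alone should only account for a minute or two:

# Back-of-the-envelope estimate of the time spent purely on DOWNLOAD_DELAY.
# RANDOMIZE_DOWNLOAD_DELAY makes the effective delay 0.5-1.5 * DOWNLOAD_DELAY,
# so on average it is roughly DOWNLOAD_DELAY itself.
pages = 100            # assumption: roughly one request per page
download_delay = 0.75
print(f"Delay alone: about {pages * download_delay:.0f} seconds")   # ~75 seconds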