I am crawling a single website and parsing some content and images, but even for a simple site with 100 pages or so it is taking hours to finish. I am using the following settings. Any help would be highly appreciated. I have already seen this question - Scrapy's Scrapyd too slow with scheduling spiders - but couldn't gather much insight from it.
EXTENSIONS = {'scrapy.contrib.logstats.LogStats': 1}
LOGSTATS_INTERVAL = 60.0
RETRY_TIMES = 4
CONCURRENT_REQUESTS = 32
CONCURRENT_REQUESTS_PER_DOMAIN = 12
CONCURRENT_ITEMS = 200
DOWNLOAD_DELAY = 0.75
Are you sure the website is responding OK?
Setting DOWNLOAD_DELAY = 0.75 effectively makes requests sequential: Scrapy will start at most one new request per 0.75 seconds per domain, regardless of your concurrency settings. Your crawl will certainly be faster if you remove this; however, with 12 concurrent requests per domain, be careful that you are not hitting websites too aggressively.
Even with the delay it should not take hours, which is why I am wondering whether the website is slow or unresponsive. Some websites will do this to bots.
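For illustration, a settings sketch along those lines; all numbers are assumptions to tune against the target site, and the AutoThrottle lines are an extra option not mentioned in the answer above:

# settings.py -- sketch only; tune the numbers against the target site.
EXTENSIONS = {'scrapy.contrib.logstats.LogStats': 1}
LOGSTATS_INTERVAL = 60.0
RETRY_TIMES = 4

CONCURRENT_REQUESTS = 32
CONCURRENT_REQUESTS_PER_DOMAIN = 12
CONCURRENT_ITEMS = 200

# Dropping the fixed delay lets the 12 per-domain requests actually run in parallel.
DOWNLOAD_DELAY = 0

# Optional (not part of the answer above): let Scrapy adapt the request rate
# to the server's latency instead of hammering it at full concurrency.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 0.25
AUTOTHROTTLE_MAX_DELAY = 10.0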
Related
Locally the TTFB is about 200 ms, but through Cloudflare it reaches about 3 seconds. Why is there such a difference in TTFB when both local and Cloudflare serve the same project? I also have "Cache Everything" enabled in Cloudflare, yet I don't understand why the TTFB differs this much.
[(local: ttfb = 200.03 ms, content download = 6.66 ms), (Cloudflare: ttfb = 3.64 s, content download = 182.06 ms)]
I have done my best to increase the speed of my website API, but I haven't gotten the desired result yet. I have also added caching in my web application. I have checked the speed of the controller very carefully through several monitoring tools; according to them, my queries' average time comes out at around 50 ms.
Is this a problem? If so, what is the solution?
What is the best way to load a window so that there is less load with multiprocessing? I tried to run 5 thousand threads (1 thread = 1 window) embedded on the configuration shown below, but the load is always under 100%.
[Config and handles screenshots]
I would like to yield more requests at the end of a CrawlSpider that uses Rules.
I noticed I was not able to feed more requests by doing this in the spider_closed method:
self.crawler.engine.crawl(r, self)
I noticed that this technique works in the spider_idle method, but I would like to wait to be sure that the crawl is finished before feeding more requests.
I have set CLOSESPIDER_TIMEOUT = 30.
What would be the code to wait 20 seconds of idle time before triggering the process of feeding more requests?
Is there a better way?
If it is really important that the previous crawling has completely finished before the new crawling starts, consider running either two separate spiders or the same spider twice in a row with different arguments that determine which URLs it crawls. See Run Scrapy from a script.
If you don’t really need for the previous crawling to finish, and you simply have URLs that should have a higher priority than other URLs for some reason, consider using request priorities instead. See the priority parameter of the Request class constructor.
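A minimal sketch of the first option (running the same spider twice in a row), following the documented CrawlerRunner pattern from "Run Scrapy from a script"; MySpider, its import path, and the extra_urls argument are placeholders for your own spider and arguments:

# run.py -- run the same spider twice, sequentially.
from twisted.internet import defer, reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

from myproject.spiders.myspider import MySpider  # placeholder import

configure_logging()
runner = CrawlerRunner(get_project_settings())

@defer.inlineCallbacks
def crawl():
    # First pass: the normal CrawlSpider run with its Rules.
    yield runner.crawl(MySpider)
    # Only reached once the first crawl has completely finished.
    yield runner.crawl(MySpider, extra_urls="extra_urls.txt")
    reactor.stop()

crawl()
reactor.run()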
I'm uploading small files (via Bucket.upload), and occasionally I get a 503 or 500 from the Google backend, which it usually returns after ~5 or ~10 seconds (I assume it's a timeout on Google's end). I noticed that gcloud-node's util sets the request timeout to 60 seconds, but I can't seem to find a way to set the timeout myself.
Thanks!
On my phone while playing with Mickey Mouse figurines with my daughter, so my answer can't be too long. Check out Request interceptors: https://googlecloudplatform.github.io/gcloud-node/#/docs/v0.28.0/gcloud
Hopefully that helps!
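The link above covers gcloud-node's request interceptors. Only as a rough aside in a different client: the Python google-cloud-storage library exposes timeout and retry knobs directly on its upload calls. A minimal sketch (bucket and file names are placeholders):

# Not gcloud-node: the Python client equivalent of a shorter timeout plus
# retries on transient 500/503 responses.
from google.cloud import storage
from google.cloud.storage.retry import DEFAULT_RETRY

client = storage.Client()
bucket = client.bucket("my-bucket")
blob = bucket.blob("small-file.json")

blob.upload_from_filename(
    "small-file.json",
    timeout=10,                           # seconds per HTTP request
    retry=DEFAULT_RETRY.with_deadline(30) # keep retrying transient errors for up to 30 s
)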
I'm looking at ways of implementing crawl delays inside Scrapy spiders. I was wondering whether it is possible to access the reactor's callLater method from within a spider? That would make it quite easy to parse a page after n seconds.
You can actually set a delay quite easily by setting DOWNLOAD_DELAY in the settings file.
DOWNLOAD_DELAY
Default: 0
The amount of time (in secs) that the downloader should wait before downloading consecutive pages from the same spider. This can be used to throttle the crawling speed to avoid hitting servers too hard. Decimal numbers are supported. Example:
DOWNLOAD_DELAY = 0.25    # 250 ms of delay
This setting is also affected by the RANDOMIZE_DOWNLOAD_DELAY setting (which is enabled by default). By default, Scrapy doesn't wait a fixed amount of time between requests, but uses a random interval between 0.5 * DOWNLOAD_DELAY and 1.5 * DOWNLOAD_DELAY.
You can also change this setting per spider.
See also Scrapy's Docs - DOWNLOAD_DELAY
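For the per-spider variant mentioned above, a minimal sketch; the spider name and URL are placeholders, and pinning RANDOMIZE_DOWNLOAD_DELAY is just one way to make the delay exact:

import scrapy

class DelayedSpider(scrapy.Spider):
    name = "delayed"
    start_urls = ["https://example.com/"]

    # Overrides the project-wide DOWNLOAD_DELAY for this spider only.
    download_delay = 2.0
    custom_settings = {
        "RANDOMIZE_DOWNLOAD_DELAY": False,  # wait exactly 2 s, not a random 1-3 s
    }

    def parse(self, response):
        self.logger.info("Fetched %s", response.url)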