Scrapy high CPU usage

I have a very simple test spider which does no parsing. However, I'm passing a large number of urls (500k) to the spider in the start_requests method and seeing very high (99/100%) cpu usage. Is this the expected behaviour? If so, how can I optimize it (perhaps by batching and using spider_idle)?
from scrapy import Request, Spider


class TestSpider(Spider):
    name = 'test_spider'
    allowed_domains = ['mydomain.com']

    def __init__(self, **kw):
        super(TestSpider, self).__init__(**kw)
        urls_list = kw.get('urls')
        if urls_list:
            self.urls_list = urls_list

    def parse(self, response):
        pass

    def start_requests(self):
        with open(self.urls_list) as urls:
            for url in urls:
                yield Request(url.strip(), self.parse)

I think the main problem here is that you are scraping too many links; try adding a Rule to avoid scraping links that don't contain what you want.
Scrapy provides really useful docs, check them out:
http://doc.scrapy.org/en/latest/topics/spiders.html
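To the batching idea raised in the question itself: rather than yielding all 500k requests from start_requests at once, you can feed the scheduler in chunks and refill it from the spider_idle signal. A minimal sketch of that approach (the class name, batch size, and file handling here are illustrative, not taken from the original code):
import scrapy
from scrapy import signals
from scrapy.exceptions import DontCloseSpider


class BatchedTestSpider(scrapy.Spider):
    name = 'batched_test_spider'
    batch_size = 10000  # illustrative; tune to your memory/CPU budget

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(BatchedTestSpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.on_idle, signal=signals.spider_idle)
        return spider

    def __init__(self, urls=None, **kw):
        super(BatchedTestSpider, self).__init__(**kw)
        self.urls_file = open(urls)  # one URL per line

    def start_requests(self):
        return self.next_batch()

    def next_batch(self):
        # read at most batch_size lines from the file per refill
        for _, line in zip(range(self.batch_size), self.urls_file):
            yield scrapy.Request(line.strip(), callback=self.parse)

    def on_idle(self, spider):
        queued = False
        for request in self.next_batch():
            # newer Scrapy versions take only the request argument here
            self.crawler.engine.crawl(request, spider)
            queued = True
        if queued:
            # keep the spider alive until the URL file is exhausted
            raise DontCloseSpider

    def parse(self, response):
        pass
Whether this actually lowers CPU usage is worth measuring: building half a million Request objects up front mostly costs memory, and sustained high CPU can also simply mean the downloader is busy.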

Related

How can I add a new spider arg to my own template in Scrapy/Zyte

I am working on a paid proxy spider template and would like the ability to pass in a new argument on the command line for a Scrapy crawler. How can I do that?
This is achievable by using kwargs in your spider's __init__ method:
import scrapy


class YourSpider(scrapy.Spider):
    name = "your_spider"

    def __init__(self, *args, **kwargs):
        super(YourSpider, self).__init__(*args, **kwargs)
        self.your_arg = kwargs.get("your_cmd_arg", 42)
Now it would be possible to call the spider as follows:
scrapy crawl your_spider -a your_cmd_arg=foo
For more information on the topic, feel free to check this page in the Scrapy documentation.

How do I tell the spider to stop requesting after n failed requests?

import scrapy


class MySpider(scrapy.Spider):
    start_urls = []

    def __init__(self, **kwargs):
        super(MySpider, self).__init__(**kwargs)
        for i in range(1, 1000):
            self.start_urls.append("some url" + str(i))

    def parse(self, response):
        print(response)
Here we queue 1000 urls in the __init__ method, but I want to stop making all those requests if they fail or return something undesirable. How do I tell the spider to stop making requests, say, after 10 failed requests?
You might want to set CLOSESPIDER_ERRORCOUNT to 10 in that case. It probably doesn't account for failed requests only, though. Alternatively, you might set HTTPERROR_ALLOWED_CODES so that even the error responses (failed requests) reach your callbacks, and implement your own failed-request counter inside the spider. Then, when the counter goes above the threshold, you raise the CloseSpider exception yourself.
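A hedged sketch of that second suggestion, with the counter, threshold and URLs being my own placeholders rather than anything from the question:
import scrapy
from scrapy.exceptions import CloseSpider


class MySpider(scrapy.Spider):
    name = "my_spider"
    max_failures = 10

    def __init__(self, **kwargs):
        super(MySpider, self).__init__(**kwargs)
        self.failed_count = 0

    def start_requests(self):
        for i in range(1, 1000):
            yield scrapy.Request("http://example.com/page/" + str(i),  # placeholder URLs
                                 callback=self.parse,
                                 errback=self.on_error)

    def on_error(self, failure):
        self.failed_count += 1
        if self.failed_count >= self.max_failures:
            # on older Scrapy versions CloseSpider may not be honoured inside an
            # errback; self.crawler.engine.close_spider(self, reason) is an alternative
            raise CloseSpider("too many failed requests")

    def parse(self, response):
        print(response)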

Scrapy. How to yield item after spider_close call?

I want to yield an item only when the crawling is finished.
I am trying to do it via
def spider_closed(self, spider):
    item = EtsyItem()
    item['total_sales'] = 1111111
    yield item
But it does not yield anything, though the function is called.
How do I yield an item after the scraping is over?
Depending on what you want to do, there might be a veeeery hacky solution for this.
Instead of spider_closed you may want to consider using spider_idle signal which is fired before spider_closed. One difference between idle and close is that spider_idle allows execution of requests which then may contain a callback or errback to yield the desired item.
Inside spider class:
@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
    spider = super().from_crawler(crawler, *args, **kwargs)
    crawler.signals.connect(spider.spider_idle, signal=signals.spider_idle)
    return spider

# ...

def yield_item(self, response):
    # used as an errback, so this actually receives a Failure
    yield MyItem(name='myname')

def spider_idle(self, spider):
    req = Request('https://fakewebsite123.xyz',
                  callback=lambda response: None, errback=self.yield_item)
    self.crawler.engine.crawl(req, spider)
However, this comes with several side effects, so I discourage anyone from using it in production; for example, the final request will raise a DNSLookupError. I just want to show what is possible.
Oof, I'm afraid spider_closed is used for tearing down. I suppose you can do it by attaching some custom logic to an item pipeline to post-process your items.
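If the pipeline route is acceptable, a minimal sketch of what that could look like (the pipeline name, the 'sales' field, and the output file are illustrative assumptions): accumulate values in process_item and write the summary out in close_spider, since a pipeline cannot yield a new item at that point.
# pipelines.py -- illustrative sketch; remember to enable it in ITEM_PIPELINES
import json


class TotalSalesPipeline(object):
    def open_spider(self, spider):
        self.total_sales = 0

    def process_item(self, item, spider):
        # accumulate whatever you need from every scraped item
        self.total_sales += item.get('sales', 0)
        return item

    def close_spider(self, spider):
        # the summary cannot go through the item pipeline anymore,
        # so persist it directly
        with open('totals.json', 'w') as f:
            json.dump({'total_sales': self.total_sales}, f)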

Run scrapy on a set of hundred plus urls

I need to download CPU and GPU data for a set of phones from gsmarena. As step one, I downloaded the urls of those phones by running scrapy and deleted the unnecessary items.
Code for the same is below.
# -*- coding: utf-8 -*-
from scrapy import Spider
from scrapy.selector import Selector

from gsmarena_data.items import gsmArenaDataItem


class MobileInfoSpider(Spider):
    name = "mobile_info"
    allowed_domains = ["gsmarena.com"]
    start_urls = (
        # 'http://www.gsmarena.com/samsung-phones-f-9-10.php',
        # 'http://www.gsmarena.com/apple-phones-48.php',
        # 'http://www.gsmarena.com/microsoft-phones-64.php',
        # 'http://www.gsmarena.com/nokia-phones-1.php',
        # 'http://www.gsmarena.com/sony-phones-7.php',
        # 'http://www.gsmarena.com/lg-phones-20.php',
        # 'http://www.gsmarena.com/htc-phones-45.php',
        # 'http://www.gsmarena.com/motorola-phones-4.php',
        # 'http://www.gsmarena.com/huawei-phones-58.php',
        # 'http://www.gsmarena.com/lenovo-phones-73.php',
        # 'http://www.gsmarena.com/xiaomi-phones-80.php',
        # 'http://www.gsmarena.com/acer-phones-59.php',
        # 'http://www.gsmarena.com/asus-phones-46.php',
        # 'http://www.gsmarena.com/oppo-phones-82.php',
        # 'http://www.gsmarena.com/blackberry-phones-36.php',
        # 'http://www.gsmarena.com/alcatel-phones-5.php',
        # 'http://www.gsmarena.com/xolo-phones-85.php',
        # 'http://www.gsmarena.com/lava-phones-94.php',
        # 'http://www.gsmarena.com/micromax-phones-66.php',
        # 'http://www.gsmarena.com/spice-phones-68.php',
        'http://www.gsmarena.com/gionee-phones-92.php',
    )

    def parse(self, response):
        hxs = Selector(response)
        # create a fresh item per listing; note the '@href' xpath
        for phone_listing in hxs.css('.makers'):
            phone = gsmArenaDataItem()
            phone['model'] = phone_listing.xpath("ul/li/a/strong/text()").extract()
            phone['link'] = phone_listing.xpath("ul/li/a/@href").extract()
            yield phone
Now, I need to run scrapy on that set of urls to get the CPU and GPU data. All of that info sits under the css selector ".ttl".
Kindly guide me on how to loop scrapy over the set of urls and output the data to a single csv or json. I'm well aware of creating items and using css selectors; I need help with how to loop over those hundred-plus pages.
I have a list of urls like:
www.gsmarena.com/samsung_galaxy_s5_cdma-6338.php
www.gsmarena.com/samsung_galaxy_s5-6033.php
www.gsmarena.com/samsung_galaxy_core_lte_g386w-6846.php
www.gsmarena.com/samsung_galaxy_core_lte-6099.php
www.gsmarena.com/acer_iconia_one_8_b1_820-7217.php
www.gsmarena.com/acer_iconia_tab_a3_a20-7136.php
www.gsmarena.com/microsoft_lumia_640_dual_sim-7082.php
www.gsmarena.com/microsoft_lumia_532_dual_sim-6951.php
These are links to phone description pages on GSM Arena.
Now I need to download the CPU and GPU info of the 100 models I have.
I extracted the urls of those 100 models for which the data is required.
The spider written for the same is:
from scrapy import Spider
from scrapy.selector import Selector

from gsmarena_data.items import gsmArenaDataItem


class MobileInfoSpider(Spider):
    name = "cpu_gpu_info"
    allowed_domains = ["gsmarena.com"]
    start_urls = (
        "http://www.gsmarena.com/microsoft_lumia_435_dual_sim-6949.php",
        "http://www.gsmarena.com/microsoft_lumia_435-6942.php",
        "http://www.gsmarena.com/microsoft_lumia_535_dual_sim-6792.php",
        "http://www.gsmarena.com/microsoft_lumia_535-6791.php",
    )

    def parse(self, response):
        hxs = Selector(response)
        # iterate over the '.ttl' cells instead of the undefined phone_listings
        for cpu_gpu in hxs.css('.ttl'):
            phone = gsmArenaDataItem()
            phone['cpu'] = cpu_gpu.xpath("ul/li/a/strong/text()").extract()
            phone['gpu'] = cpu_gpu.xpath("ul/li/a/@href").extract()
            yield phone
If somehow I could run on the urls for which I want to extract this data, I could get the required data in a single csv file.
I think you need information from every vendor. If so, you don't have to put those hundreds of urls in start_urls; instead you can use this link as the start url and then, in parse(), extract those urls programmatically and process what you want.
This answer will help you to do so.
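For the concrete "loop over a hundred-plus urls" part, a hedged sketch along these lines may help. It assumes the phone-page URLs sit in a plain text file (urls.txt is my placeholder name) and that each CPU/GPU value lives in the cell following its ".ttl" label, which should be verified against the live markup:
import scrapy
from gsmarena_data.items import gsmArenaDataItem


class CpuGpuInfoSpider(scrapy.Spider):
    name = "cpu_gpu_info_from_file"
    allowed_domains = ["gsmarena.com"]

    def start_requests(self):
        # urls.txt: one phone-page URL per line (placeholder file name);
        # URLs must include the http:// scheme
        with open("urls.txt") as f:
            for line in f:
                url = line.strip()
                if url:
                    yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        phone = gsmArenaDataItem()
        # assumed layout: <td class="ttl">CPU</td><td>...value...</td>
        phone['cpu'] = response.xpath(
            "//td[@class='ttl' and .//text()='CPU']/following-sibling::td[1]//text()"
        ).extract_first()
        phone['gpu'] = response.xpath(
            "//td[@class='ttl' and .//text()='GPU']/following-sibling::td[1]//text()"
        ).extract_first()
        yield phone
Running it as scrapy crawl cpu_gpu_info_from_file -o phones.csv (or -o phones.json) collects all yielded items into a single file.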

Scrapy request+response+download time

UPD: Not closing the question, because I don't think my approach is as clear as it should be.
Is it possible to get current request + response + download time for saving it to Item?
In "plain" python I do
start_time = time()
urllib2.urlopen('http://example.com').read()
time() - start_time
But how i can do this with Scrapy?
UPD:
This solution is enough for me, but I'm not sure about the quality of the results. If you have many connections with timeout errors, the measured download time may be wrong (even exceeding DOWNLOAD_TIMEOUT * 3).
In settings.py:
DOWNLOADER_MIDDLEWARES = {
    'myscraper.middlewares.DownloadTimer': 0,
}
middlewares.py
from time import time

from scrapy.http import Response


class DownloadTimer(object):
    def process_request(self, request, spider):
        request.meta['__start_time'] = time()
        # returning None does not block middlewares with a greater order number
        return None

    def process_response(self, request, response, spider):
        request.meta['__end_time'] = time()
        return response  # process_response must return the response

    def process_exception(self, request, exception, spider):
        request.meta['__end_time'] = time()
        return Response(
            url=request.url,
            status=110,
            request=request)
Inside spider.py, in parse():
log.msg('Download time: %.2f - %.2f = %.2f' % (
    response.meta['__end_time'], response.meta['__start_time'],
    response.meta['__end_time'] - response.meta['__start_time']
), level=log.DEBUG)
You could write a Downloader Middleware which would time each request. It would add a start time to the request before it's made and then a finish time when it's finished. Typically, arbitrary data such as this is stored in the Request.meta attribute. This timing information could later be read by your spider and added to your item.
This downloader middleware sounds like it could be useful on many projects.
Not sure if you need a Middleware here. Scrapy has a request.meta which you can query and yield. For download latency, simply yield
download_latency=response.meta.get('download_latency'),
The amount of time spent to fetch the response, since the request has been started, i.e. HTTP message sent over the network. This meta key only becomes available when the response has been downloaded. While most other meta keys are used to control Scrapy behavior, this one is supposed to be read-only.
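For example, in a parse callback (the field names are just for illustration):
def parse(self, response):
    yield {
        'url': response.url,
        # seconds between sending the request and finishing the download
        'download_latency': response.meta.get('download_latency'),
    }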
I think the best solution is using scrapy signals. Whenever a request reaches the downloader, it emits the request_reached_downloader signal. After the download, it emits the response_downloaded signal. You can catch them from the spider and assign the time and its difference to meta from there.
@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
    spider = super(SignalSpider, cls).from_crawler(crawler, *args, **kwargs)
    crawler.signals.connect(spider.item_scraped, signal=signals.item_scraped)
    return spider
A more elaborate answer is here.
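A hedged sketch of that signal-based timing (the attribute names and the spider are mine; request_reached_downloader and response_downloaded are available in reasonably recent Scrapy versions):
import time

import scrapy
from scrapy import signals


class SignalSpider(scrapy.Spider):
    name = 'signal_spider'
    start_urls = ['http://example.com']  # placeholder

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(SignalSpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.on_reached_downloader,
                                signal=signals.request_reached_downloader)
        crawler.signals.connect(spider.on_response_downloaded,
                                signal=signals.response_downloaded)
        return spider

    def on_reached_downloader(self, request, spider):
        # stamp the request when it is handed to the downloader
        request.meta['__start_time'] = time.time()

    def on_response_downloaded(self, response, request, spider):
        # store the elapsed time so parse() can read it from response.meta
        request.meta['__download_time'] = time.time() - request.meta['__start_time']

    def parse(self, response):
        yield {'url': response.url,
               'download_time': response.meta.get('__download_time')}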