How can I add a new spider arg to my own template in Scrapy/Zyte - scrapy

I am working on a paid proxy spider template and would like the ability to pass in a new argument on the command line for a Scrapy crawler. How can I do that?

This is achievable by using kwargs in your spider's __init__ method:

import scrapy

class YourSpider(scrapy.Spider):
    name = "your_spider"

    def __init__(self, *args, **kwargs):
        super(YourSpider, self).__init__(*args, **kwargs)
        self.your_arg = kwargs.get("your_cmd_arg", 42)
Now it would be possible to call the spider as follows:
scrapy crawl your_spider -a your_cmd_arg=foo
For more information on the topic, see the spider arguments section of the Scrapy documentation.
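As a brief, hedged illustration (the URL pattern and log message are assumptions, not part of the original answer), the argument can then be used anywhere in the spider, for example when building the start requests:

import scrapy

class YourSpider(scrapy.Spider):
    name = "your_spider"

    def __init__(self, *args, **kwargs):
        super(YourSpider, self).__init__(*args, **kwargs)
        self.your_arg = kwargs.get("your_cmd_arg", 42)

    def start_requests(self):
        # self.your_arg holds whatever was passed with -a your_cmd_arg=...,
        # or the default 42 if the argument was omitted
        url = "https://example.com/?q=%s" % self.your_arg
        yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        self.logger.info("Crawled %s with your_cmd_arg=%r", response.url, self.your_arg)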

Related

How do I tell the spider to stop requesting after n failed requests?

import scrapy

class MySpider(scrapy.Spider):
    name = "my_spider"
    start_urls = []

    def __init__(self, **kwargs):
        super(MySpider, self).__init__(**kwargs)
        for i in range(1, 1000):
            self.start_urls.append("some url" + str(i))

    def parse(self, response):
        print(response)
Here I queue 1000 URLs in the __init__ method, but I want to stop making all those requests if they fail or return something undesirable. How do I tell the spider to stop making requests after, say, 10 failed requests?
You might want to set CLOSESPIDER_ERRORCOUNT to 10 in that case. It probably doesn't account for failed requests only, though. Alternatively, you might set HTTPERROR_ALLOWED_CODES so that even error responses (failed requests) are handled, and implement your own failed-request counter inside the spider. Then, when the counter is above the threshold, you raise the CloseSpider exception yourself.
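A hedged sketch of the counter approach (the threshold, URL pattern, and the list of allowed error codes are illustrative assumptions, not from the answer):

import scrapy
from scrapy.exceptions import CloseSpider

class LimitedFailureSpider(scrapy.Spider):
    name = "limited_failure_spider"
    # let error responses reach parse() instead of being filtered out
    # (the code list is an illustrative assumption)
    custom_settings = {"HTTPERROR_ALLOWED_CODES": [403, 404, 500, 503]}
    max_failures = 10  # assumed threshold

    def __init__(self, **kwargs):
        super(LimitedFailureSpider, self).__init__(**kwargs)
        self.failed_count = 0
        self.start_urls = ["https://example.com/page/%d" % i for i in range(1, 1000)]

    def parse(self, response):
        if response.status >= 400:
            self.failed_count += 1
            if self.failed_count >= self.max_failures:
                raise CloseSpider("too many failed requests")
            return
        # normal parsing of successful responses goes here
        self.logger.info("OK: %s", response.url)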

Scrapy. How to yield item after spider_close call?

I want to yield an item only when the crawling is finished.
I am trying to do it via
def spider_closed(self, spider):
    item = EtsyItem()
    item['total_sales'] = 1111111
    yield item
But it does not yield anything, though the function is called.
How do I yield an item after the scraping is over?
Depending on what you want to do, there might be a very hacky solution for this.
Instead of spider_closed, you may want to consider using the spider_idle signal, which is fired before spider_closed. One difference between idle and close is that spider_idle still allows scheduling requests, whose callback or errback can then yield the desired item.
Inside the spider class (this assumes from scrapy import signals, Request and an item class MyItem at module level):

@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
    spider = super().from_crawler(crawler, *args, **kwargs)
    # ...
    crawler.signals.connect(spider.spider_idle, signal=signals.spider_idle)
    return spider

# ...

def yield_item(self, failure):
    # called as an errback, so it receives a twisted Failure
    yield MyItem(name='myname')

def spider_idle(self, spider):
    req = Request('https://fakewebsite123.xyz',
                  callback=lambda response: None,
                  errback=self.yield_item)
    self.crawler.engine.crawl(req, spider)
However, this comes with several side effects, so I discourage anyone from using it in production; for example, the final request will raise a DNSLookupError. I just want to show what is possible.
Oof, I'm afraid spider_closed is used for tearing down. I suppose you can do it by attaching some custom logic to a pipeline to post-process your items.
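If you go the pipeline route, here is a hedged sketch (the aggregate field, the per-item key, and the output file are assumptions for illustration): accumulate while items flow through process_item, then persist the summary in close_spider, since items can no longer be yielded at that point.

import json

class TotalsPipeline(object):
    """Accumulates a running total while items flow through,
    then writes a summary when the spider closes."""

    def open_spider(self, spider):
        self.total_sales = 0

    def process_item(self, item, spider):
        # 'sales' is an assumed item field for this example
        self.total_sales += item.get('sales', 0)
        return item

    def close_spider(self, spider):
        # items can no longer be yielded here, but the aggregate
        # can be persisted directly
        with open('totals.json', 'w') as f:
            json.dump({'total_sales': self.total_sales}, f)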

Run multiple spiders from script in scrapy

I am working on a Scrapy project and I want to run multiple spiders at the same time.
This is my code for running the spiders from a script, but I am getting an error. How do I do this?
from spiders.DmozSpider import DmozSpider
from spiders.CraigslistSpider import CraigslistSpider
from scrapy import signals, log
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings

TO_CRAWL = [DmozSpider, CraigslistSpider]
RUNNING_CRAWLERS = []

def spider_closing(spider):
    """Activates on spider closed signal"""
    log.msg("Spider closed: %s" % spider, level=log.INFO)
    RUNNING_CRAWLERS.remove(spider)
    if not RUNNING_CRAWLERS:
        reactor.stop()

log.start(loglevel=log.DEBUG)
for spider in TO_CRAWL:
    settings = Settings()
    # crawl responsibly
    settings.set("USER_AGENT", "Kiran Koduru (+http://kirankoduru.github.io)")
    crawler = Crawler(settings)
    crawler_obj = spider()
    RUNNING_CRAWLERS.append(crawler_obj)
    # stop reactor when spider closes
    crawler.signals.connect(spider_closing, signal=signals.spider_closed)
    crawler.configure()
    crawler.crawl(crawler_obj)
    crawler.start()

# blocks the process, so always keep it as the last statement
reactor.run()
Sorry not to answer the question itself, but let me bring scrapyd and Scrapinghub to your attention (at least for a quick test). reactor.run() will run any number of Scrapy instances on a single CPU; do you want that side effect? If you look at scrapyd's code, it doesn't run multiple instances in a single thread, it forks/spawns subprocesses.
You need something like the code below. You can easily find it in the Scrapy docs :)
First utility you can use to run your spiders is
scrapy.crawler.CrawlerProcess. This class will start a Twisted reactor
for you, configuring the logging and setting shutdown handlers. This
class is the one used by all Scrapy commands.
# -*- coding: utf-8 -*-
import sys
import logging
import traceback

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from spiders.DmozSpider import DmozSpider
from spiders.CraigslistSpider import CraigslistSpider

SPIDER_LIST = [
    DmozSpider, CraigslistSpider
]

if __name__ == "__main__":
    try:
        # set up the crawler and start to crawl one spider at a time
        process = CrawlerProcess(get_project_settings())
        for spider in SPIDER_LIST:
            process.crawl(spider)
        process.start()
    except Exception:
        exc_type, exc_obj, exc_tb = sys.exc_info()
        logging.info('Error on line {}'.format(exc_tb.tb_lineno))
        logging.info("Exception: %s" % str(traceback.format_exc()))
References:
http://doc.scrapy.org/en/latest/topics/practices.html
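Since the question started from a reactor-based script, here is also a hedged sketch of the CrawlerRunner variant described on the same docs page (the spider imports follow the question's layout; everything else is a sketch, not the only way to do it):

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

from spiders.DmozSpider import DmozSpider
from spiders.CraigslistSpider import CraigslistSpider

configure_logging()
runner = CrawlerRunner(get_project_settings())

# schedule both spiders on the same reactor
runner.crawl(DmozSpider)
runner.crawl(CraigslistSpider)

# stop the reactor once every crawl has finished
d = runner.join()
d.addBoth(lambda _: reactor.stop())

reactor.run()  # blocks until both spiders are done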

Scrapy high CPU usage

I have a very simple test spider which does no parsing. However, I'm passing a large number of URLs (500k) to the spider in the start_requests method and seeing very high (99-100%) CPU usage. Is this the expected behaviour? If so, how can I optimize it (perhaps by batching and using spider_idle)?
from scrapy import Spider, Request

class TestSpider(Spider):
    name = 'test_spider'
    allowed_domains = ['mydomain.com']

    def __init__(self, **kw):
        super(TestSpider, self).__init__(**kw)
        urls_list = kw.get('urls')
        if urls_list:
            self.urls_list = urls_list

    def parse(self, response):
        pass

    def start_requests(self):
        with open(self.urls_list) as urls:
            for url in urls:
                yield Request(url.strip(), self.parse)
I think the main problem here is that you are scraping too many links; try adding a Rule to avoid scraping links that don't contain what you want.
Scrapy provides really useful docs, check them out:
http://doc.scrapy.org/en/latest/topics/spiders.html
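On the batching idea raised in the question, here is a hedged sketch (the batch size, file handling, and handler names are assumptions; the engine.crawl signature follows older Scrapy versions, as in the spider_idle answer earlier): instead of yielding all 500k requests up front, feed a slice at a time and schedule the next slice from a spider_idle handler.

from scrapy import Spider, Request, signals
from scrapy.exceptions import DontCloseSpider

class BatchedSpider(Spider):
    name = 'batched_spider'
    batch_size = 1000  # assumed batch size

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(BatchedSpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.on_idle, signal=signals.spider_idle)
        return spider

    def __init__(self, urls=None, **kw):
        super(BatchedSpider, self).__init__(**kw)
        self.urls_file = urls
        self.urls_iter = None

    def start_requests(self):
        self.urls_iter = open(self.urls_file)
        for request in self.next_batch():
            yield request

    def next_batch(self):
        # read at most batch_size URLs from the file
        for _ in range(self.batch_size):
            line = self.urls_iter.readline()
            if not line:
                return
            yield Request(line.strip(), callback=self.parse)

    def on_idle(self, spider):
        # schedule the next slice instead of letting the spider close
        scheduled = False
        for request in self.next_batch():
            self.crawler.engine.crawl(request, spider)
            scheduled = True
        if scheduled:
            raise DontCloseSpider

    def parse(self, response):
        pass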

Scrapy request+response+download time

UPD: I'm not closing the question, because I don't think my way is as clean as it should be.
Is it possible to get current request + response + download time for saving it to Item?
In "plain" python I do
from time import time
import urllib2

start_time = time()
urllib2.urlopen('http://example.com').read()
time() - start_time
But how i can do this with Scrapy?
UPD:
This solution is enough for me, but I'm not sure about the quality of the results. If there are many connections with timeout errors, the download time may be wrong (even more than DOWNLOAD_TIMEOUT * 3).
In settings.py:

DOWNLOADER_MIDDLEWARES = {
    'myscraper.middlewares.DownloadTimer': 0,
}
In middlewares.py:

from time import time
from scrapy.http import Response

class DownloadTimer(object):

    def process_request(self, request, spider):
        request.meta['__start_time'] = time()
        # returning None does not block middlewares with a greater order number
        return None

    def process_response(self, request, response, spider):
        request.meta['__end_time'] = time()
        return response  # process_response must return a response

    def process_exception(self, request, exception, spider):
        request.meta['__end_time'] = time()
        return Response(
            url=request.url,
            status=110,
            request=request)
Inside spider.py, in def parse(...):

log.msg('Download time: %.2f - %.2f = %.2f' % (
    response.meta['__end_time'], response.meta['__start_time'],
    response.meta['__end_time'] - response.meta['__start_time']
), level=log.DEBUG)
You could write a Downloader Middleware which would time each request. It would add a start time to the request before it's made and then a finish time when it's finished. Typically, arbitrary data such as this is stored in the Request.meta attribute. This timing information could later be read by your spider and added to your item.
This downloader middleware sounds like it could be useful on many projects.
Not sure you need a middleware here. Scrapy already records this in response.meta, which you can read and yield. For download latency, simply yield
download_latency=response.meta.get('download_latency'),
The amount of time spent to fetch the response, since the request has
been started, i.e. HTTP message sent over the network. This meta key
only becomes available when the response has been downloaded. While
most other meta keys are used to control Scrapy behavior, this one is
supposed to be read-only.
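A hedged sketch of that approach inside a parse callback (the item fields and the URL are assumptions for illustration):

import scrapy

class LatencySpider(scrapy.Spider):
    name = 'latency_spider'
    start_urls = ['https://example.com']

    def parse(self, response):
        yield {
            'url': response.url,
            # seconds spent fetching the response, as recorded by Scrapy
            'download_latency': response.meta.get('download_latency'),
        }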
I think the best solution is to use Scrapy signals. Whenever a request reaches the downloader, the request_reached_downloader signal is emitted; after the download, response_downloaded is emitted. You can catch them from the spider and record the times and their difference in meta from there.
@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
    spider = super(SignalSpider, cls).from_crawler(crawler, *args, **kwargs)
    crawler.signals.connect(spider.item_scraped, signal=signals.item_scraped)
    return spider
A more elaborate answer is available here.
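As a hedged sketch (the handler names, meta keys, and example URL are assumptions; the signal names come from the answer above), the connected handlers could look something like this:

from time import time

import scrapy
from scrapy import signals

class SignalSpider(scrapy.Spider):
    name = 'signal_spider'
    start_urls = ['https://example.com']

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(SignalSpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.on_reached_downloader,
                                signal=signals.request_reached_downloader)
        crawler.signals.connect(spider.on_response_downloaded,
                                signal=signals.response_downloaded)
        return spider

    def on_reached_downloader(self, request, spider):
        # stamp the request when it is handed to the downloader
        request.meta['_reached_downloader_at'] = time()

    def on_response_downloaded(self, response, request, spider):
        # record the elapsed time so parse() can read it from meta
        start = request.meta.get('_reached_downloader_at')
        if start is not None:
            request.meta['_download_time'] = time() - start

    def parse(self, response):
        yield {
            'url': response.url,
            'download_time': response.meta.get('_download_time'),
        }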