How to send private data from a spider to a pipeline? - scrapy

Say each time I run scrapy like below:
scrapy crawl [spidername] -a file='filename'
I want to send the filename to the pipeline to specify the item storage location. The location may be different on each run, so it can't be defined in settings.py.
The filename is saved in the spider as a private variable:
def __init__(self, file):
    self.filename = file
How can I pass this parameter to the pipeline?

There are a few ways to do this. One way would be to accept the filename as a parameter in your spider's __init__() method (Scrapy passes -a arguments there), store it on the spider, and read it in the pipeline from the spider object that process_item() receives.
Another way would be to define the filename as a class variable in your spider, and then access it from your pipeline as an attribute of the spider.
Here is an example of how you could do it using a class variable:
class MySpider(Spider):
    name = 'myspider'
    filename = 'filename'  # default, overridden by -a file=... below

    def __init__(self, file=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        if file:
            self.filename = file

    def parse(self, response):
        # do stuff
        ...

class MyPipeline(object):
    def process_item(self, item, spider):
        item['filename'] = spider.filename
        return item
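If the pipeline should open the storage file itself, it can also pick the value up in open_spider(). A minimal sketch under those assumptions (the pipeline name, the file mode, and writing items as JSON lines are my own choices, not from the question):
import json

class FileStoragePipeline(object):
    """Hypothetical pipeline that writes items to the path passed via -a file=..."""

    def open_spider(self, spider):
        # spider.filename was set in the spider's __init__ from the -a argument
        self.file = open(spider.filename, 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item)) + '\n')
        return item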

Related

Scrapy Custom Function Cannot Fire scrapy.Requests

It seems that yield scrapy.Request cannot be fired from a helper function, as in the following code.
Could anyone help me understand why, or help me get it to fire?
I really appreciate your help.
class MySpider(CrawlSpider):
    ...
    def start_requests(self):
        yield scrapy.Request(url,
                             callback=self.parse_items)
    ...
    def parse_items(self, response):
        __fire_here(response)
        ...

    def __fire_here(response):
        # Cannot fire here, why?
        yield scrapy.Request(url,
                             callback=self.parse_items)
To avoid code duplication you can call your __fire_here function in this way:
def parse_items(self, response):
    # re-yield the requests produced by the helper generator
    yield from self.__fire_here(response)

def __fire_here(self, response):
    # yield some request here
    ...
Your code seems to make endless calls from one function to another. Can you check your logic?
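For reference, a minimal self-contained sketch of this helper pattern (the spider name, URL, and helper name __follow_links are placeholders of mine, not from the question; it assumes a Scrapy version with response.follow):
import scrapy


class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://example.com']  # placeholder

    def parse(self, response):
        # re-yield whatever the helper generator produces
        yield from self.__follow_links(response)

    def __follow_links(self, response):
        # the helper takes self, so it can reference the spider's callbacks
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse_item)

    def parse_item(self, response):
        yield {'url': response.url}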

Scrapy Request's meta argument is a shallow copy, but in scrapy_redis the Request's meta argument is a deep copy. Why?

scrapy:
import scrapy
from scrapy import Request

class TestspiderSpider(scrapy.Spider):
    name = 'testspider'
    allowed_domains = ['mzitu.com']
    start_urls = ['http://www.mzitu.com/']

    def start_requests(self):
        L = []
        print("L-id:", id(L), "first")
        yield Request(url="http://www.mzitu.com/5675", callback=self.parse,
                      meta={"L": L}, dont_filter=True)

    def parse(self, response):
        L = response.meta.get('L')
        print("L-id:", id(L), "second")
The output:
L-id: 2769118042568 first
L-id: 2769118042568 second
They're equal: the same list object is passed through, i.e. a shallow copy.
scrapy_redis:
from scrapy_redis.spiders import RedisSpider
from scrapy import Request

class MzituSpider(RedisSpider):  # scrapy_redis
    name = 'mzitu'
    redis_key = 'a:a'  # this is discarded

    def start_requests(self):  # because this overrides RedisSpider's method
        L = []
        print("L-id:", id(L), "first")
        yield Request(url="http://www.mzitu.com/5675", callback=self.parse,
                      meta={"L": L}, dont_filter=True)

    def parse(self, response):
        L = response.meta.get('L')
        print("L-id:", id(L), "second")
The output:
L-id: 1338852857992 first
L-id: 1338852858312 second
They're not equal: the list was copied, which behaves like a deep copy.
Question:
I want to know why, and how I can solve it, i.e. make scrapy_redis behave like the shallow copy.
This has to do with the fact that scrapy-redis uses its own scheduler class, which serializes/deserializes all requests through redis before pushing them further to the downloader (it keeps its queue in redis). There is no "easy" way around this, as it is basically the core scrapy-redis functionality. My advice is not to put runtime-sensitive objects into meta; that is generally not a good idea in scrapy anyway.
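To see why serialization breaks object identity, here is a small standalone sketch, using pickle as a stand-in for whatever serializer the redis queue uses:
import pickle

L = []
meta = {"L": L}

# simulate a request's meta being pushed to and popped from a redis queue
restored = pickle.loads(pickle.dumps(meta))

print(id(L) == id(meta["L"]))      # True: in-memory scheduling keeps the same object
print(id(L) == id(restored["L"]))  # False: the round trip creates a new, equal list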

Scrapy, make http request in pipeline

Assume I have a scraped item that looks like this:
{
    name: "Foo",
    country: "US",
    url: "http://..."
}
In a pipeline I want to make a GET request to the url and check some headers like the content type and status. When the headers do not meet certain conditions I want to drop the item, like:
class MyPipeline(object):
    def process_item(self, item, spider):
        request(item['url'], function(response) {
            if (...) {
                raise DropItem()
            }
            return item
        }, function(error) {
            raise DropItem()
        })
Smells like this is not possible using pipelines. What do you think? Any ideas how to achieve this?
The spider:
import scrapy
import json

class StationSpider(scrapy.Spider):
    name = 'station'
    start_urls = ['http://...']

    def parse(self, response):
        jsonResponse = json.loads(response.body_as_unicode())
        for station in jsonResponse:
            yield station
Easy way
import requests

def process_item(self, item, spider):
    response = requests.get(item['url'])
    if response.status_code ...:
        raise DropItem()
    elif response.text ...:
        raise DropItem()
    else:
        return item
Scrapy way
Now, I think you shouldn't do this inside a pipeline; you should handle it inside the spider, yielding a request first and then yielding the item from its callback.
But if you still want to include a scrapy Request inside a pipeline, you could do something like this:
class MyPipeline(object):
    def __init__(self, crawler):
        self.crawler = crawler

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_item(self, item, spider):
        ...
        self.crawler.engine.crawl(
            Request(
                url='someurl',
                callback=self.custom_callback,
            ),
            spider,
        )
        # you have to drop the item, and send it again after your check
        raise DropItem()

    # YES, you can define a method callback inside the same pipeline
    def custom_callback(self, response):
        ...
        yield item
Note that we are emulating the behaviour of spider callbacks inside the pipeline. You need to figure out a way to always drop the items when you want to do an extra request, and to only pass along the ones that have already been checked by the extra callback.
One way could be to send different types of items, and check them inside the process_item of the pipeline:
def process_item(self, item, spider):
    if isinstance(item, TempItem):
        ...
    elif isinstance(item, FinalItem):
        return item
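For completeness, here is a minimal sketch of the spider-side approach recommended above: yield a request to the item's url and only yield the item once the headers check out. The header condition and URLs are placeholders, and cb_kwargs assumes Scrapy 1.7+:
import scrapy
import json


class StationSpider(scrapy.Spider):
    name = 'station'
    start_urls = ['http://...']

    def parse(self, response):
        for station in json.loads(response.text):
            # check the station's url before emitting the item
            yield scrapy.Request(
                station['url'],
                method='HEAD',
                callback=self.check_item,
                cb_kwargs={'station': station},
                errback=self.on_error,
            )

    def check_item(self, response, station):
        content_type = response.headers.get('Content-Type', b'').decode()
        if response.status == 200 and 'text/html' in content_type:
            yield station  # headers look fine, emit the item

    def on_error(self, failure):
        pass  # the item is simply never yielded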

Custom scrapy xml+rss exporter to S3

I'm trying to create a custom XML feed that will contain the items scraped by the spider, as well as some other high-level information stored in the spider definition. The output should be stored on S3.
The desired output looks like the following:
<xml>
    <title>my title defined in the spider</title>
    <description>The description from the spider</description>
    <items>
        <item>...</item>
    </items>
</xml>
In order to do so, I defined a custom exporter, which is able to export the desired output file locally.
spider.py:
class DmozSpider(scrapy.Spider):
    name = 'dmoz'
    allowed_domains = ['dmoz.org']
    start_urls = ['http://www.dmoz.org/Computers/']
    title = 'The DMOZ super feed'

    def parse(self, response):
        ...
        yield item
exporters.py:
from scrapy.conf import settings
from scrapy.exporters import XmlItemExporter

class CustomItemExporter(XmlItemExporter):
    def __init__(self, *args, **kwargs):
        self.title = kwargs.pop('title', 'no title found')
        self.link = settings.get('FEED_URI', 'localhost')
        super(CustomItemExporter, self).__init__(*args, **kwargs)

    def start_exporting(self):
        ...
        self._export_xml_field('title', self.title)
        ...
settings.py:
FEED_URI = 's3://bucket-name/%(name)s.xml'
FEED_EXPORTERS = {
    'custom': 'my.exporters.CustomItemExporter',
}
I'm able to run the whole thing and get the output on S3 by running the following command:
scrapy crawl dmoz -t custom
or, if I want to export JSON locally instead: scrapy crawl -o dmoz.json dmoz
But at this point, I'm unable to retrieve the spider title to put it in the output file.
I tried implementing a custom pipeline, which outputs data locally (following numerous examples):
pipelines.py:
class CustomExportPipeline(object):
    def __init__(self):
        self.files = {}

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        file = open('%s_feed.xml' % spider.name, 'w+b')
        self.files[spider] = file
        self.exporter = CustomItemExporter(
            file,
            title=spider.title,
        )
        self.exporter.start_exporting()
The problem is that the file is stored locally, and this short-circuits the FeedExporter logic defined in feedexport.py, which handles all the different storages.
No info from the FeedExporter is available in the pipeline, and I would like to reuse all that logic without duplicating code. Am I missing something? Thanks for any help.
Here's my solution:
Get rid of the pipeline.
Override scrapy's FeedExporter.
myproject/feedexport.py:
from scrapy.extensions.feedexport import FeedExporter as _FeedExporter
from scrapy.extensions.feedexport import SpiderSlot

class FeedExporter(_FeedExporter):
    def open_spider(self, spider):
        uri = self.urifmt % self._get_uri_params(spider)
        storage = self._get_storage(uri)
        file = storage.open(spider)
        extra = {
            # my extra settings
        }
        exporter = self._get_exporter(file, fields_to_export=self.export_fields, extra=extra)
        exporter.start_exporting()
        self.slot = SpiderSlot(file, exporter, storage, uri)
All I wanted to do was basically to pass those extra settings to the exporter, but the way it's built, there is no choice but to override open_spider.
To support other scrapy export formats at the same time, I would have to consider overriding the dont_fail setting to True in some scrapy exporters to prevent them from failing.
Replace scrapy's feed exporter with the new one:
myproject/settings.py:
EXTENSIONS = {
    'scrapy.extensions.feedexport.FeedExporter': None,
    'myproject.feedexport.FeedExporter': 0,
}
... otherwise the two feed exporters would run at the same time.
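On the exporter side, the extra dict built in the overridden open_spider() ends up as a keyword argument to the exporter's constructor, assuming _get_exporter forwards its keyword arguments to the exporter class as it does in the Scrapy versions this answer targets. A rough sketch of adapting the custom exporter to that (the two-argument _export_xml_field call mirrors the question's code; newer Scrapy versions add a depth parameter):
from scrapy.exporters import XmlItemExporter


class CustomItemExporter(XmlItemExporter):
    def __init__(self, file, extra=None, **kwargs):
        # 'extra' is the dict built in the overridden open_spider() above
        self.extra = extra or {}
        super(CustomItemExporter, self).__init__(file, **kwargs)

    def start_exporting(self):
        super(CustomItemExporter, self).start_exporting()
        # write the high-level fields before the items
        self._export_xml_field('title', self.extra.get('title', 'no title found'))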

How do Scrapy from_settings and from_crawler class methods work?

I need to add the class method described in the FAQ below to my existing pipeline:
http://doc.scrapy.org/en/latest/faq.html#i-m-getting-an-error-cannot-import-name-crawler
I am not sure how to have two of these class methods in my class.
from twisted.enterprise import adbapi
import MySQLdb.cursors

class MySQLStorePipeline(object):
    """A pipeline to store the item in a MySQL database.

    This implementation uses Twisted's asynchronous database API.
    """

    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        dbargs = dict(
            host=settings['DB_HOST'],
            db=settings['DB_NAME'],
            user=settings['DB_USER'],
            passwd=settings['DB_PASSWD'],
            charset='utf8',
            use_unicode=True,
        )
        dbpool = adbapi.ConnectionPool('MySQLdb', **dbargs)
        return cls(dbpool)

    def process_item(self, item, spider):
        pass
From my understanding of class methods, having several class methods in a Python class should be just fine. It just depends on which one the caller requires. However, I have only seen from_crawler in scrapy pipelines until now. From there you can get access to the settings via crawler.settings.
Are you sure that from_settings is required? I did not check all occurrences, but in middleware.py a priority seems to apply: if a crawler object is available and a from_crawler method exists, that one is taken. Otherwise, if there is a from_settings method, that is taken. Otherwise, the raw constructor is taken.
if crawler and hasattr(mwcls, 'from_crawler'):
    mw = mwcls.from_crawler(crawler)
elif hasattr(mwcls, 'from_settings'):
    mw = mwcls.from_settings(settings)
else:
    mw = mwcls()
I admit, I do not know if this is also the place where pipelines get created (I guess not, but there is no pipelines.py there), but the implementation seems very reasonable.
So, I'd just do either of the following:
reimplement the whole method as from_crawler and only use that one, or
add the method from_crawler and use both.
The new method could look as follows (to duplicate as little code as possible):
@classmethod
def from_crawler(cls, crawler):
    obj = cls.from_settings(crawler.settings)
    obj.do_something_on_me_with_crawler(crawler)
    return obj
Of course this depends a bit on what you need.
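Putting the suggestion together, the pipeline keeps its existing from_settings and gains a thin from_crawler on top of it. A condensed sketch, where the spider_closed hook and the close_pool helper are just examples of mine for crawler-specific setup:
from scrapy import signals
from twisted.enterprise import adbapi


class MySQLStorePipeline(object):
    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        # unchanged: build the ConnectionPool from the DB_* settings
        dbpool = adbapi.ConnectionPool('MySQLdb',
                                       host=settings['DB_HOST'],
                                       db=settings['DB_NAME'],
                                       user=settings['DB_USER'],
                                       passwd=settings['DB_PASSWD'],
                                       charset='utf8', use_unicode=True)
        return cls(dbpool)

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy prefers from_crawler when both exist; delegate to from_settings
        obj = cls.from_settings(crawler.settings)
        # crawler-specific setup, e.g. hook into the spider_closed signal
        crawler.signals.connect(obj.close_pool, signal=signals.spider_closed)
        return obj

    def close_pool(self, spider):
        self.dbpool.close()

    def process_item(self, item, spider):
        return item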