I'm trying to create a custom XML feed that contains the items scraped by the spider, along with some high-level information stored in the spider definition. The output should be stored on S3.
The desired output looks like the following:
<title>my title defined in the spider</title>
<description>The description from the spider</description>
In order to do so, I defined a custom exporter, which is able to export the desired output file locally.
class DmozSpider(scrapy.Spider):
    name = 'dmoz'
    allowed_domains = ['dmoz.org']
    start_urls = ['http://www.dmoz.org/Computers/']
    title = 'The DMOZ super feed'

    def parse(self, response):
        yield item
from scrapy.conf import settings
from scrapy.exporters import XmlItemExporter

class CustomItemExporter(XmlItemExporter):
    def __init__(self, *args, **kwargs):
        # Pop the custom keyword before the base class sees it.
        self.title = kwargs.pop('title', 'no title found')
        self.link = settings.get('FEED_URI', 'localhost')
        super(CustomItemExporter, self).__init__(*args, **kwargs)

    def start_exporting(self):
        self._export_xml_field('title', self.title)
FEED_URI = 's3://bucket-name/%(name)s.xml'
FEED_EXPORTERS = {
    'custom': 'my.exporters.CustomItemExporter',
}
I'm able to run the whole thing and get the output on s3 by running the following command:
scrapy crawl dmoz -t custom
or, if I want to export a json locally instead: scrapy crawl -o dmoz.json dmoz
But at this point, I'm unable to retrieve the spider title to put it in the output file.
I tried implementing a custom pipeline, which outputs data locally (following numerous examples):
from scrapy import signals

class CustomExportPipeline(object):
    def __init__(self):
        self.files = {}

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        file = open('%s_feed.xml' % spider.name, 'w+b')
        self.files[spider] = file
        # The exporter needs the output file as its first argument.
        self.exporter = CustomItemExporter(
            file,
            title=spider.title)
The problem is that the file is stored locally, which short-circuits the FeedExporter logic defined in feedexport.py that handles all the different storages.
No info from the FeedExporter is available in the pipeline, and I would like to reuse all that logic without duplicating code. Am I missing something? Thanks for any help.
Here's my solution:
Get rid of the pipeline.
Override Scrapy's FeedExporter:
from scrapy.extensions.feedexport import FeedExporter as _FeedExporter
from scrapy.extensions.feedexport import SpiderSlot

class FeedExporter(_FeedExporter):
    def open_spider(self, spider):
        uri = self.urifmt % self._get_uri_params(spider)
        storage = self._get_storage(uri)
        file = storage.open(spider)
        extra = {
            # my extra settings
        }
        # Forward the extra settings to the exporter's constructor.
        exporter = self._get_exporter(file, fields_to_export=self.export_fields, extra=extra)
        self.slot = SpiderSlot(file, exporter, storage, uri)
All I wanted was to pass those extra settings to the exporter, but the way it's built, there is no choice but to override.
To support other Scrapy export formats at the same time, I would have to consider overriding the dont_fail setting to True in some Scrapy exporters, to prevent them from failing on the unexpected extra keyword argument.
Replace Scrapy's feed exporter with the new one:
EXTENSIONS = {
    'scrapy.extensions.feedexport.FeedExporter': None,
    'myproject.feedexport.FeedExporter': 0,
}
... or the two feed exporters would run at the same time.
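To make the receiving side concrete, here is a minimal sketch of an exporter that accepts an `extra` keyword and writes those fields ahead of the items. It is a stdlib-only stand-in, not Scrapy's XmlItemExporter: the class name `HeaderXmlExporter`, the helper `export_feed`, and the element names are assumptions made for illustration.

```python
import io
from xml.sax.saxutils import XMLGenerator


class HeaderXmlExporter:
    """Hypothetical stand-in for an XmlItemExporter subclass (stdlib only)."""

    def __init__(self, file, extra=None, root_element="items", item_element="item"):
        self.extra = dict(extra or {})  # feed-level fields, e.g. the spider title
        self.root_element = root_element
        self.item_element = item_element
        self.xg = XMLGenerator(file, encoding="utf-8")

    def _field(self, name, value):
        # Write <name>value</name>, escaping the text content.
        self.xg.startElement(name, {})
        self.xg.characters(str(value))
        self.xg.endElement(name)

    def start_exporting(self):
        self.xg.startDocument()
        self.xg.startElement(self.root_element, {})
        # Feed-level fields go first, before any item.
        for name, value in self.extra.items():
            self._field(name, value)

    def export_item(self, item):
        self.xg.startElement(self.item_element, {})
        for name, value in item.items():
            self._field(name, value)
        self.xg.endElement(self.item_element)

    def finish_exporting(self):
        self.xg.endElement(self.root_element)
        self.xg.endDocument()


def export_feed(items, extra):
    """Run the exporter over an in-memory file and return the XML bytes."""
    buf = io.BytesIO()
    exporter = HeaderXmlExporter(buf, extra=extra)
    exporter.start_exporting()
    for item in items:
        exporter.export_item(item)
    exporter.finish_exporting()
    return buf.getvalue()
```

Because the exporter only writes to the file object it is given, the storage backend (local file, S3, and so on) stays the feed exporter's concern.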
Say that each time I run Scrapy like this:
scrapy crawl [spidername] -a file='filename'
I want to send the filename to the pipeline to specify where the items are stored. The location may be different on each run, so it can't be defined in settings.py.
The filename is saved in the spider as a private variable:
def __init__(self, file):
How can I send the parameter to the pipeline?
There are a few ways to do this. One way is to accept the filename as a parameter of your spider's __init__() method and store it on the instance; Scrapy passes the spider to the pipeline's process_item() method, so the pipeline can read it from there.
Another way is to define the filename as a class variable in your spider and access it from your pipeline as an attribute of the spider.
Here is an example of how you could do it using a class variable:
class MySpider(Spider):
    filename = 'filename'

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)

    def parse(self, response):
        # do stuff
        pass

# In the pipeline (the class name is illustrative):
class MyPipeline(object):
    def process_item(self, item, spider):
        item['filename'] = spider.filename
        return item
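For the per-run case from the question, a minimal sketch of a pipeline that reads the path from the spider instance when the spider opens. It assumes the spider stores the `-a file=...` argument as `spider.file`; the names `SpiderFilePipeline` and `run_demo` are hypothetical, and the stand-in spider exists only so the sketch runs without Scrapy.

```python
import json
from types import SimpleNamespace


class SpiderFilePipeline:
    """Writes each item as one JSON line to the path stored on the spider."""

    def open_spider(self, spider):
        # The spider instance is passed in, so its attributes are readable here.
        self.fh = open(spider.file, "w", encoding="utf-8")

    def process_item(self, item, spider):
        self.fh.write(json.dumps(dict(item)) + "\n")
        return item

    def close_spider(self, spider):
        self.fh.close()


def run_demo(path, items):
    """Drive the pipeline by hand with a stand-in spider (no Scrapy needed)."""
    spider = SimpleNamespace(name="demo", file=path)
    pipeline = SpiderFilePipeline()
    pipeline.open_spider(spider)
    for item in items:
        pipeline.process_item(item, spider)
    pipeline.close_spider(spider)
```

Opening the file in open_spider rather than in process_item means it is opened once per run, and the item itself does not need to carry the filename.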
I'm new to Scrapy and started a simple project several days ago. I have successfully implemented items.py, my_spider.py and pipelines.py to scrape some information into a JSON file. Now I'd like to add some features to my spider and have run into some questions.
I have already scraped the desired information from the threads of a forum, including the file_urls and image_urls. I'm a little confused by the tutorial in the Scrapy documentation; here are the related parts of my files:
ITEM_PIPELINES = {
    'my_project.pipelines.InfoPipeline': 300,
    'scrapy.pipelines.images.ImagesPipeline': 300,
    'scrapy.pipelines.files.FilesPipeline': 300,
}
FILES_STORE = './Downloads'
IMAGES_STORE = './Downloads'
class InfoIterm(scrapy.Item):
    movie_number_title = scrapy.Field()
    movie_pics_links = scrapy.Field()
    magnet_link = scrapy.Field()
    torrent_link = scrapy.Field()
    torrent_name = scrapy.Field()

class TorrentItem(scrapy.Item):
    file_urls = scrapy.Field()
    files = scrapy.Field()

class ImageItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()
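class InfoPipeline(object):
    def process_item(self, item, spider):
        contents = json.dumps(dict(item), indent=4, sort_keys=True, ensure_ascii=False)
        # "wb" truncates the file on every call, so each item replaces the last.
        with open("./threads_data.json", "wb") as f:
            f.write(contents.encode("utf-8"))
        return item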
def parse_thread(self, response):
    json_item = InfoIterm()
    json_item['movie_number_title'] = response.xpath("//span[@id='thread_subject']/text()").getall()
    json_item['movie_pics_links'] = response.xpath("//td[@class='t_f']//img/@file").getall()
    json_item['magnet_link'] = response.xpath("//div[@class='blockcode']/div//li/text()").getall()
    json_item['torrent_name'] = response.xpath("//p[@class='attnm']/a/text()").getall()
    json_item['torrent_link'] = self.base_url + response.xpath("//p[@class='attnm']/a/@href").getall()[0]
    yield json_item
    torrent_link = self.base_url + response.xpath("//p[@class='attnm']/a/@href").getall()
    yield {'file_urls': torrent_link}
    movie_pics_links = response.xpath("//td[@class='t_f']//img/@file").getall()
    yield {'image_urls': movie_pics_links}
Now I can download images successfully, but files are not downloaded. My JSON file is also overwritten by the last image_urls.
So, here are my questions:
Can one spider use multiple pipelines? If so, what's the best way to use them (for example, in my case)? Some examples would be great!
In some cases some of these json_item['xxx'] values are not present on certain threads, and the console prints some information reporting the problem. I tried wrapping each line of the code in try-except, but it becomes really ugly and I believe there should be a better way. What is the best way to do that?
Thanks a lot.
1- Yes, you can use several pipelines, but mind the order in which they are called: it is set by the integer values in ITEM_PIPELINES, with lower values running first.
If they are meant to process different Item objects, all you need to do is check the class of the item received in the process_item method. Process the ones you want and return the others untouched.
2- What is the error? I can't help much without that information. Please post an execution log.
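The class check described in point 1 can be sketched as follows. The item classes here are plain dict subclasses standing in for scrapy.Item subclasses, and the names `InfoItem`, `TorrentItem`, and `InfoOnlyPipeline` are hypothetical; the sketch only illustrates the dispatch, not a full pipeline.

```python
class InfoItem(dict):
    """Stand-in for the item type this pipeline is responsible for."""


class TorrentItem(dict):
    """Stand-in for an item type that belongs to another pipeline."""


class InfoOnlyPipeline:
    def __init__(self):
        self.handled = []

    def process_item(self, item, spider):
        # Items of other types pass through untouched for later pipelines.
        if not isinstance(item, InfoItem):
            return item
        self.handled.append(dict(item))
        return item


pipeline = InfoOnlyPipeline()
pipeline.process_item(InfoItem(title="thread 1"), spider=None)
passed = pipeline.process_item(TorrentItem(file_urls=[]), spider=None)
```

Note that this only works if the spider yields instances of the typed item classes; items yielded as plain dicts cannot be told apart by class.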
import scrapy
from scrapy import Request

class TestspiderSpider(scrapy.Spider):
    name = 'testspider'
    allowed_domains = ['mzitu.com']
    start_urls = ['http://www.mzitu.com/']

    def start_requests(self):
        L = []
        print("L-id:", id(L), "first")
        yield Request(url="http://www.mzitu.com/5675", callback=self.parse, meta={"L": L}, dont_filter=True)

    def parse(self, response):
        L = response.meta.get('L')
        print("L-id:", id(L), "second")
The output:
L-id: 2769118042568 first
L-id: 2769118042568 second
The ids are equal.
This is a shallow copy.
from scrapy_redis.spiders import RedisSpider
from scrapy import Request

class MzituSpider(RedisSpider):  # scrapy_redis
    name = 'mzitu'
    redis_key = 'a:a'  # unused, because start_requests is overridden

    def start_requests(self):  # overrides the RedisSpider method
        L = []
        print("L-id:", id(L), "first")
        yield Request(url="http://www.mzitu.com/5675", callback=self.parse, meta={"L": L}, dont_filter=True)

    def parse(self, response):
        L = response.meta.get('L')
        print("L-id:", id(L), "second")
The output:
L-id: 1338852857992 first
L-id: 1338852858312 second
The ids are not equal.
This is a deep copy.
I want to know why this happens.
And how can I solve it, so that scrapy_redis behaves like the shallow copy?
This happens because scrapy-redis uses its own scheduler class, which serializes and deserializes all requests through Redis before pushing them on to the downloader (it keeps a queue in Redis). There is no "easy" way around this, as it is core scrapy-redis functionality. My advice is not to put too much runtime-sensitive state into meta, as this is generally not the best idea in Scrapy anyway.
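The effect can be reproduced without Scrapy or Redis. The sketch below assumes a pickle-style serializer, and the helper `through_queue` is a hypothetical stand-in for storing a request in an external queue and reading it back.

```python
import pickle


def through_queue(meta):
    """Simulate meta being serialized into a queue and rebuilt on the way out."""
    return pickle.loads(pickle.dumps(meta))


L = []
meta = {"L": L}

in_memory = meta                # in-memory queue: the same object comes back
restored = through_queue(meta)  # serializing queue: a rebuilt copy comes back

print(in_memory["L"] is L)  # True
print(restored["L"] is L)   # False
print(restored["L"] == L)   # True
```

The rebuilt list has equal contents but a different identity, so mutations made to it in the callback are not visible through the original reference.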
To catch all redirection paths, including when the final url was already crawled, I wrote a custom duplicate filter:
import logging
from scrapy.dupefilters import RFPDupeFilter
from seoscraper.items import RedirectionItem

class CustomURLFilter(RFPDupeFilter):
    def __init__(self, path=None, debug=False):
        super(CustomURLFilter, self).__init__(path, debug)

    def request_seen(self, request):
        request_seen = super(CustomURLFilter, self).request_seen(request)
        if request_seen is True:
            item = RedirectionItem()
            item['sources'] = [u for u in request.meta.get('redirect_urls', u'')]
            item['destination'] = request.url
        return request_seen
Now, how can I send the RedirectionItem directly to the pipeline?
Is there a way to instantiate the pipeline from the custom filter so that I can send data directly? Or should I create a custom scheduler and get the pipeline from there, and if so, how?
I need to add the following class method to my existing pipeline.
I am not sure how to have two of these class methods in my class.
from twisted.enterprise import adbapi
import MySQLdb.cursors

class MySQLStorePipeline(object):
    """A pipeline to store the item in a MySQL database.

    This implementation uses Twisted's asynchronous database API.
    """

    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        dbargs = dict(
            host=settings['DB_HOST'],
            db=settings['DB_NAME'],
            user=settings['DB_USER'],
            passwd=settings['DB_PASSWD'],
        )
        dbpool = adbapi.ConnectionPool('MySQLdb', **dbargs)
        return cls(dbpool)

    def process_item(self, item, spider):
        ...
From my understanding of class methods, having several class methods in a Python class should be fine; which one gets used depends on the caller. However, so far I have only seen from_crawler in Scrapy pipelines. From there you can access the settings via crawler.settings.
Are you sure that from_settings is required? I did not check all occurrences, but in middleware.py a priority seems to apply: if a crawler object is available and a from_crawler method exists, that one is used. Otherwise, if there is a from_settings method, that is used. Otherwise, the plain constructor is used.
if crawler and hasattr(mwcls, 'from_crawler'):
    mw = mwcls.from_crawler(crawler)
elif hasattr(mwcls, 'from_settings'):
    mw = mwcls.from_settings(settings)
else:
    mw = mwcls()
I admit, I do not know if this is also the place where pipelines get created (I guess not, but there is no pipelines.py), but the implementation seems very reasonable.
So I'd either:
reimplement the whole method as from_crawler and use only that one, or
add a from_crawler method and use both.
The new method could look as follows (to duplicate as little code as possible):
@classmethod
def from_crawler(cls, crawler):
    obj = cls.from_settings(crawler.settings)
    return obj
Of course this depends a bit on what you need.
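To see the two class methods side by side, here is a minimal stdlib-only sketch. `StorePipeline` and `build` are hypothetical names, the crawler is faked with SimpleNamespace, and `build` mirrors the priority described above rather than reproducing Scrapy's own code.

```python
from types import SimpleNamespace


class StorePipeline:
    def __init__(self, dbargs):
        self.dbargs = dbargs
        self.stats = None

    @classmethod
    def from_settings(cls, settings):
        # All settings-based construction lives in one place.
        return cls({"host": settings["DB_HOST"], "db": settings["DB_NAME"]})

    @classmethod
    def from_crawler(cls, crawler):
        # Reuse from_settings, then attach anything only the crawler offers.
        obj = cls.from_settings(crawler.settings)
        obj.stats = crawler.stats
        return obj


def build(cls, settings, crawler=None):
    """Pick a constructor in the order: from_crawler, from_settings, plain."""
    if crawler is not None and hasattr(cls, "from_crawler"):
        return cls.from_crawler(crawler)
    if hasattr(cls, "from_settings"):
        return cls.from_settings(settings)
    return cls()


settings = {"DB_HOST": "localhost", "DB_NAME": "scraped"}
crawler = SimpleNamespace(settings=settings, stats={"items": 0})

with_crawler = build(StorePipeline, settings, crawler)
without_crawler = build(StorePipeline, settings)
```

Both paths produce the same database arguments; only the crawler path also receives the crawler-specific objects.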