Some questions about using multiple pipelines in Scrapy

I'm new to Scrapy and started a simple project several days ago. I have successfully implemented items.py, my_spider.py and pipelines.py to scrape some information into a JSON file. Now I'd like to add some features to my spider, and I have run into some questions.
I have already scraped the desired information from the threads of a forum, including the file_urls and image_urls. I'm a little confused by the tutorial in the Scrapy documentation; here are the relevant parts of my files:
**settings.py**
...
ITEM_PIPELINES = {
    'my_project.pipelines.InfoPipeline': 300,
    'scrapy.pipelines.images.ImagesPipeline': 300,
    'scrapy.pipelines.files.FilesPipeline': 300,
}
FILES_STORE = './Downloads'
IMAGES_STORE = './Downloads'
**items.py**
...
class InfoIterm(scrapy.Item):
    movie_number_title = scrapy.Field()
    movie_pics_links = scrapy.Field()
    magnet_link = scrapy.Field()
    torrent_link = scrapy.Field()
    torrent_name = scrapy.Field()

class TorrentItem(scrapy.Item):
    file_urls = scrapy.Field()
    files = scrapy.Field()

class ImageItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()
**pipelines.py**
...
def process_item(self, item, spider):
    contents = json.dumps(dict(item), indent=4, sort_keys=True, ensure_ascii=False)
    with open("./threads_data.json", "wb") as f:
        f.write(contents.encode("utf-8"))
    return item
**my_spider.py**
...
def parse_thread(self, response):
    json_item = InfoIterm()
    json_item['movie_number_title'] = response.xpath("//span[@id='thread_subject']/text()").getall()
    json_item['movie_pics_links'] = response.xpath("//td[@class='t_f']//img/@file").getall()
    json_item['magnet_link'] = response.xpath("//div[@class='blockcode']/div//li/text()").getall()
    json_item['torrent_name'] = response.xpath("//p[@class='attnm']/a/text()").getall()
    json_item['torrent_link'] = self.base_url + response.xpath("//p[@class='attnm']/a/@href").getall()[0]
    yield json_item

    torrent_link = self.base_url + response.xpath("//p[@class='attnm']/a/@href").getall()
    yield {'file_urls': torrent_link}

    movie_pics_links = response.xpath("//td[@class='t_f']//img/@file").getall()
    yield {'image_urls': movie_pics_links}
Now I can download images successfully, but files are not downloaded. My JSON file is also overwritten by the last image_urls item.
So, here are my questions:
Can one spider use multiple pipelines? If so, what's the best way to use them in a case like mine (an example would be great)?
In some cases, some of these json_item['xxx'] fields are not present on certain threads, and the console prints messages reporting the problem. I tried wrapping each of these lines in try-except, but it gets really ugly, and I believe there should be a better way. What is it?
Thanks a lot.

1 - Yes, you can use several pipelines; you just need to mind the order in which they are called. (More on that here.)
If they are meant to process different Item classes, all you need to do is check the class of the item received in the process_item method: process the ones you want and return the others untouched.
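For example, in your case a minimal sketch of InfoPipeline could look like this (class and module names are taken from your question; the append-mode write is only an illustration of one way to stop each item from overwriting the file):
import json

from my_project.items import InfoIterm  # as defined in items.py above

class InfoPipeline:
    def process_item(self, item, spider):
        # Only handle InfoIterm items; anything else (e.g. the dicts carrying
        # file_urls/image_urls) is returned untouched so the Files/Images
        # pipelines can still process it.
        if not isinstance(item, InfoIterm):
            return item
        contents = json.dumps(dict(item), indent=4, sort_keys=True, ensure_ascii=False)
        # Note: "a" appends; the original "wb" overwrote the file on every item.
        with open("./threads_data.json", "a", encoding="utf-8") as f:
            f.write(contents + "\n")
        return item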
2 - What is the error? I can't help much without that information; please post an execution log.

Related

Scrapy Request's meta arg is a shallow copy, but in scrapy_redis the Request's meta arg is a deep copy. Why?

scrapy:
import scrapy
from scrapy import Request

class TestspiderSpider(scrapy.Spider):
    name = 'testspider'
    allowed_domains = ['mzitu.com']
    start_urls = ['http://www.mzitu.com/']

    def start_requests(self):
        L = []
        print("L-id:", id(L), "first")
        yield Request(url="http://www.mzitu.com/5675", callback=self.parse, meta={"L": L}, dont_filter=True)

    def parse(self, response):
        L = response.meta.get('L')
        print("L-id:", id(L), "second")
The output:
L-id: 2769118042568 first
L-id: 2769118042568 second
They're equal
This is shallow copy
scrapy_redis:
from scrapy_redis.spiders import RedisSpider
from scrapy import Request

class MzituSpider(RedisSpider):  # scrapy_redis
    name = 'mzitu'
    redis_key = 'a:a'  # this is discarded, because start_requests is overridden

    def start_requests(self):  # overrides RedisSpider's method
        L = []
        print("L-id:", id(L), "first")
        yield Request(url="http://www.mzitu.com/5675", callback=self.parse, meta={"L": L}, dont_filter=True)

    def parse(self, response):
        L = response.meta.get('L')
        print("L-id:", id(L), "second")
The output:
L-id: 1338852857992 first
L-id: 1338852858312 second
They're not equal
This is deep copy
Question:
I want to know why this happens, and how I can solve it, i.e. make scrapy_redis keep the shallow-copy behaviour.
This has to do with the fact that scrapy-redis uses its own scheduler class, which serializes/deserializes all requests through Redis before pushing them on to the downloader (it keeps a queue in Redis). There is no "easy" way around this, as it is basically the core scrapy-redis functionality. My advice is not to put too much runtime-sensitive stuff into meta, as that is generally not the best idea in Scrapy anyway.
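The effect is easy to reproduce outside Scrapy: anything that goes through a serialize/deserialize round trip comes back as brand-new objects, and scrapy-redis pushes requests through such a round trip via its Redis queue (pickle-based by default). A minimal sketch:
import pickle

L = []
print("L-id:", id(L), "first")

# Roughly what happens to request.meta on its way through the Redis queue:
# it is serialized to bytes and later rebuilt as new Python objects.
meta = pickle.loads(pickle.dumps({"L": L}))
print("L-id:", id(meta["L"]), "second")  # a different id: a new list object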

Should objects know of the objects they're used in?

class Item:
    def __init__(self, box, description):
        self.box = box
        self.description = description

class Box:
    def __init__(self):
        self.item_1 = Item(self, 'A picture')
        self.item_2 = Item(self, 'A pencil')
        # etc.

old_stuff = Box()
print(old_stuff.item_1.box.item_1.box.item_2.box.item_1)
The code above demonstrates my problem better than I ever could with plain text. Is there a better way to find out which box something is in (in which box is the picture)? I am not particularly fond of the above solution, because it allows this weird up-and-down calling which could go on forever. Is there a better way to solve this problem, or is this just a case of: if it's stupid and it works, it ain't stupid?
Note: this trick isn't Python-specific; it's doable in any object-oriented programming language.
There is no right or wrong way to do this. The solution depends on how you want to use the object.
If your use-case requires that an item know in which box it is stored, then you need a reference to the box; if not, then you don't need the association.
Similarly, if you need to know which items are in a given box, then you need references to the items in the box object.
The immediate requirement (that is, the current context) always dictates how one designs a class model; for example, one models an item or a box differently in a UI layer from how one would model it in a service layer.
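For instance, if the item genuinely needs to know its box, a minimal sketch of the bidirectional association could look like this (add_item is a hypothetical helper that keeps both directions consistent in one place):
class Item:
    def __init__(self, description):
        self.description = description
        self.box = None  # back-reference, filled in when the item is stored

class Box:
    def __init__(self):
        self.items = []

    def add_item(self, item):
        self.items.append(item)
        item.box = self  # both sides of the association updated together

old_stuff = Box()
picture = Item('A picture')
old_stuff.add_item(picture)
print(picture.box is old_stuff)  # True: the picture knows which box holds it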
You can introduce a new class - an ItemManager - or simply a dict or other external structure to store the information about which box contains your item:
class Item:
    def __init__(self, description):
        self.description = description

class Box:
    def __init__(self, item_1, item_2):
        self.item_1 = item_1
        self.item_2 = item_2

class ItemManager:
    def __init__(self):
        self.item_boxes = {}

    def register_item(self, item, box):
        self.item_boxes[item] = box

    def deregister_item(self, item):
        del self.item_boxes[item]

    def get_box(self, item):
        return self.item_boxes.get(item, None)

item_manager = ItemManager()
item_1 = Item("A picture")
item_2 = Item("A pencil")
item_3 = Item("A teapot")

old_stuff = Box(item_1, item_2)
item_manager.register_item(item_1, old_stuff)
item_manager.register_item(item_2, old_stuff)

new_stuff = Box(item_3, None)
item_manager.register_item(item_3, new_stuff)

box_with_picture = item_manager.get_box(item_2)
print(box_with_picture.item_1.description)
Also see the single responsibility principle (SRP): an item should not need to know which box contains it.

Custom scrapy xml+rss exporter to S3

I'm trying to create a custom xml feed, that will contain the spider scraped items, as well as some other high level information, stored in the spider definition. The output should be stored on S3.
The desired output looks like the following:
<xml>
<title>my title defined in the spider</title>
<description>The description from the spider</description>
<items>
<item>...</item>
</items>
</xml>
In order to do so, I defined a custom exporter, which is able to export the desired output file locally.
spider.py:
class DmozSpider(scrapy.Spider):
    name = 'dmoz'
    allowed_domains = ['dmoz.org']
    start_urls = ['http://www.dmoz.org/Computers/']
    title = 'The DMOZ super feed'

    def parse(self, response):
        ...
        yield item
exporters.py:
from scrapy.conf import settings
from scrapy.exporters import XmlItemExporter

class CustomItemExporter(XmlItemExporter):
    def __init__(self, *args, **kwargs):
        self.title = kwargs.pop('title', 'no title found')
        self.link = settings.get('FEED_URI', 'localhost')
        super(CustomItemExporter, self).__init__(*args, **kwargs)

    def start_exporting(self):
        ...
        self._export_xml_field('title', self.title)
        ...
settings.py:
FEED_URI = 's3://bucket-name/%(name)s.xml'
FEED_EXPORTERS = {
    'custom': 'my.exporters.CustomItemExporter',
}
I'm able to run the whole thing and get the output on s3 by running the following command:
scrapy crawl dmoz -t custom
or, if I want to export a json locally instead: scrapy crawl -o dmoz.json dmoz
But at this point, I'm unable to retrieve the spider title to put it in the output file.
I tried implementing a custom pipeline, which outputs data locally (following numerous examples):
pipelines.py:
from scrapy import signals

from my.exporters import CustomItemExporter

class CustomExportPipeline(object):
    def __init__(self):
        self.files = {}

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        file = open('%s_feed.xml' % spider.name, 'w+b')
        self.files[spider] = file
        self.exporter = CustomItemExporter(
            file,
            title=spider.title,
        )
        self.exporter.start_exporting()
The problem is that the file is stored locally, which short-circuits the FeedExporter logic defined in feedexport.py that handles all the different storage backends.
No info from the FeedExporter is available in the pipeline, and I would like to reuse all that logic without duplicating code. Am I missing something? Thanks for any help.
Here's my solution:
Get rid of the pipeline.
Override Scrapy's FeedExporter.
myproject/feedexport.py:
from scrapy.extensions.feedexport import FeedExporter as _FeedExporter
from scrapy.extensions.feedexport import SpiderSlot
class FeedExporter(_FeedExporter):
    def open_spider(self, spider):
        uri = self.urifmt % self._get_uri_params(spider)
        storage = self._get_storage(uri)
        file = storage.open(spider)
        extra = {
            # my extra settings
        }
        exporter = self._get_exporter(file, fields_to_export=self.export_fields, extra=extra)
        exporter.start_exporting()
        self.slot = SpiderSlot(file, exporter, storage, uri)
All I wanted to do was basically to pass those extra settings to the exporter, but the way it's built, there is no choice but to override the method.
To support other Scrapy export formats simultaneously, I would have to consider overriding the dont_fail setting to True in some of Scrapy's exporters to prevent them from failing.
Replace Scrapy's feed exporter with the new one in myproject/settings.py:
EXTENSIONS = {
    'scrapy.extensions.feedexport.FeedExporter': None,
    'myproject.feedexport.FeedExporter': 0,
}
... otherwise the two feed exporters would run at the same time.
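For completeness, the custom exporter then has to accept that extra dict. A minimal sketch (the extra keyword and the 'title' key are whatever you choose to pass from the overridden open_spider above; the _export_xml_field call mirrors the one in the question and may take an additional depth argument on newer Scrapy versions):
from scrapy.exporters import XmlItemExporter

class CustomItemExporter(XmlItemExporter):
    def __init__(self, file, extra=None, **kwargs):
        # 'extra' is the dict built in the overridden open_spider();
        # pull out what you need before XmlItemExporter sees the kwargs.
        extra = extra or {}
        self.title = extra.get('title', 'no title found')
        super(CustomItemExporter, self).__init__(file, **kwargs)

    def start_exporting(self):
        super(CustomItemExporter, self).start_exporting()
        # Emit the header field, as in the question's exporter.
        self._export_xml_field('title', self.title)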

Django autocomplete-light: how to choose a specific method from a model

I am new to Django and autocomplete-light. I am trying to get different fields of the model from autocomplete-light, but it always returns the same field. The reason is that the def in the model returns one field, so I created another def, but I cannot make autocomplete-light call that specific def. Here is my code.
models.py:
class Item(models.Model):
    ...
    serial_number = models.CharField(max_length=100, unique=True)
    barcode = models.CharField(max_length=25, unique=True)

    def __unicode__(self):
        return self.serial_number

    def bar(self):
        return self.barcode
.......
autocomplete_light_registry.py
autocomplete_light.register(Item,
    name='AutocompleteItemserial',
    search_fields=['serial_number'],
)

autocomplete_light.register(Item,
    name='AutocompleteItembarcode',
    search_fields=['barcode'],
)
Here is the issue: when I try to get the barcodes from autocomplete-light, it returns serial numbers. No matter what I try to get from the Item model, it always returns the serial number. I really appreciate any answers. Thank you.
Just in case, here is the forms.py.
forms.py
class ItemForm(forms.ModelForm):
    widgets = {
        'serial_number': autocomplete_light.TextWidget('AutocompleteItemserial'),
        'barcode': autocomplete_light.TextWidget('AutocompleteItembarcode'),
    }
Although this is an old post, I just faced the same issue, so I am sharing my solution.
The reason autocomplete returns serial_number is that django-autocomplete-light uses the model's __unicode__ method to display the results. In your AutocompleteItembarcode, all that changes is that autocomplete-light searches by the barcode field of Item.
Try the following.
In app/autocomplete_light_registry.py
from django.utils.encoding import force_text

class ItemAutocomplete(autocomplete_light.AutocompleteModelBase):
    search_fields = ['serial_number']
    model = Item
    choices = Item.objects.all()

    def choice_label(self, choice):
        """Return the human-readable representation of a choice."""
        barcode = Item.objects.get(pk=self.choice_value(choice)).barcode
        return force_text(barcode)

autocomplete_light.register(ItemAutocomplete)
For more help you can have a look at the source code.

How do Scrapy from_settings and from_crawler class methods work?

I need to add the following class method to my existing pipeline:
http://doc.scrapy.org/en/latest/faq.html#i-m-getting-an-error-cannot-import-name-crawler
I am not sure how to have two of these class methods in my class.
from twisted.enterprise import adbapi
import MySQLdb.cursors
class MySQLStorePipeline(object):
    """A pipeline to store the item in a MySQL database.

    This implementation uses Twisted's asynchronous database API.
    """
    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        dbargs = dict(
            host=settings['DB_HOST'],
            db=settings['DB_NAME'],
            user=settings['DB_USER'],
            passwd=settings['DB_PASSWD'],
            charset='utf8',
            use_unicode=True,
        )
        dbpool = adbapi.ConnectionPool('MySQLdb', **dbargs)
        return cls(dbpool)

    def process_item(self, item, spider):
        pass
From my understanding of class methods, having several class methods in a Python class should be just fine; it simply depends on which one the caller requires. However, until now I have only seen from_crawler in Scrapy pipelines. From there you can get access to the settings via crawler.settings.
Are you sure that from_settings is required? I did not check all occurrences, but in middleware.py the following priority seems to apply: if a crawler object is available and a from_crawler method exists, that is taken; otherwise, if there is a from_settings method, that is taken; otherwise, the raw constructor is taken.
if crawler and hasattr(mwcls, 'from_crawler'):
    mw = mwcls.from_crawler(crawler)
elif hasattr(mwcls, 'from_settings'):
    mw = mwcls.from_settings(settings)
else:
    mw = mwcls()
I admit, I do not know if this is also the place where pipelines get created (I guess not, but there is no pipelines.py), but the implementation seems very reasonable.
So, I'd just do either of the following:
reimplement the whole method as from_crawler and only use that one, or
add a from_crawler method and use both.
The new method could look as follows (to duplicate as little code as possible):
@classmethod
def from_crawler(cls, crawler):
    obj = cls.from_settings(crawler.settings)
    obj.do_something_on_me_with_crawler(crawler)
    return obj
Of course this depends a bit on what you need.