Scrapy item metadata?

My spider yields two types of items, which are then processed by a pipeline. Is there a way for the pipeline to identify each type of item (other than through its keys)? Some sort of metadata type or title field?

In your pipelines.py:
def process_item(self, item, spider):
    if isinstance(item, YourItemType1):
        # code to process Item Type 1

Returning nested structures from spider

I'm trying to work out how to get scrapy to return a nested data structure, as the only examples I can find deal with flat structures.
I am trying to scrape a forum, which consists of a list of threads, each thread containing a list of posts.
I can successfully scrape the list of threads and the list of posts, but I am not sure how to keep the posts attached to their thread instead of all jumbled together.
In the end, I am aiming for output like this:
<thread id="1">
<post>Post 1</post>
<post>Post 2</post>
</thread>
<thread id="2">
<post>Post A</post>
<post>Post B</post>
</thread>
If I do something like this:
def parse(self, response):
    # For each thread on this page
    yield scrapy.Request(thread_url, self.parse_posts)

def parse_posts(self, response):
    # For each post on this page
    yield {'content': ...}
Then I just get a list of all posts without them being arranged into threads. Something like this doesn't work, of course:
def parse(self, response):
    # For each thread on this page
    yield {
        'id': ...,
        'posts': scrapy.Request(thread_url, self.parse_posts)
    }
So I am not sure how to get the "child" requests to go into the "parent" object.
As far as getting the association goes, this is exactly what meta is for, as JimmyZhang said. Parse an ID out of the thread list page before yielding a request, pass that thread ID into the request via the meta keyword, then access the ID when processing the post.
def parse(self, response):
    # For each thread on this page
    thread_id = response.xpath('thread_id_getter_xpath').extract()
    yield scrapy.Request(thread_url, callback=self.parse_posts,
                         meta={'thread_id': thread_id})

def parse_posts(self, response):
    # For each post on this page
    thread_id = response.meta['thread_id']
    yield {'thread_id': thread_id, 'content': ...}
At this point, the items are associated. How you compile data into a hierarchical format is entirely up to you, and dependent on your needs. You could, for instance, write a pipeline to compile it all in a dictionary and output it at the end of the crawl.
def process_item(self, item, spider):
    # Assume self.forum is an empty dict at initialization
    self.forum.setdefault(item['thread_id'], [])
    self.forum[item['thread_id']].append({'post': item['post_id'],
                                          'content': item['content']})
    return item

def close_spider(self, spider):
    # Do something with self.forum, like output it as XML or JSON
    # ... or just print it to stdout.
    print(self.forum)
Or you could build an XML tree incrementally and save it as you go. Or serialize each item into a JSON string and dump it to a file line by line. Or add items to a database as you go. Or whatever else your needs dictate.
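A minimal sketch of the JSON-lines variant (the output filename is an assumption; Scrapy's built-in feed exports can also write this format):

import json

class JsonLinesPipeline(object):
    def open_spider(self, spider):
        self.file = open('posts.jl', 'w')   # assumed output path

    def process_item(self, item, spider):
        # one JSON object per line, written as items arrive
        self.file.write(json.dumps(dict(item)) + '\n')
        return item

    def close_spider(self, spider):
        self.file.close()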
You can use the request's meta attribute.
First:
yield scrapy.Request(thread_url, self.parse_posts, meta={'thread_id': id})
Second, define a thread item:
class thread_item(Item):
    thread_id = Field()
    posts = Field()
Third, fetch the thread_id in parse_posts:
thread_id = response.meta['thread_id']
# parse posts content, construct thread item
yield item
Fourth, write a pipeline, and output the thread item.
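A minimal sketch of such a pipeline, assuming each thread item carries a scalar thread_id and a list of post strings in posts (the output filename is also an assumption):

import xml.etree.ElementTree as ET

class ThreadXmlPipeline(object):
    def open_spider(self, spider):
        self.root = ET.Element('threads')

    def process_item(self, item, spider):
        # one <thread> element per item, one <post> child per post
        thread = ET.SubElement(self.root, 'thread', id=str(item['thread_id']))
        for post in item['posts']:
            ET.SubElement(thread, 'post').text = post
        return item

    def close_spider(self, spider):
        # writes roughly the XML shape shown in the question
        ET.ElementTree(self.root).write('threads.xml')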

Scrapy pipeline architecture - need to return variables

I need some advice on how to proceed with my item pipeline. I need to POST an item to an API (working well) and, with the response object, get the ID of the entity created (have this working too), then use it to populate another entity. Ideally, the item pipeline could return the entity ID. Basically, I am in a situation where I have a one-to-many relationship that I need to encode in a NoSQL database. What would be the best way to proceed?
The best way for you to proceed is to use MongoDB, a NoSQL database that works well with Scrapy. A MongoDB pipeline can be found here, and the process is explained in this tutorial.
Now, as for what is explained in the solution from Pablo Hoffman, updating different items from different pipelines into one can be achieved with the following decorator on the process_item method of a pipeline object, so that it checks the pipeline attribute of your spider to decide whether or not it should be executed. (I have not tested the code, but I hope it helps.)
import functools
from scrapy import log

def check_spider_pipeline(process_item_method):
    @functools.wraps(process_item_method)
    def wrapper(self, item, spider):
        # message template for debugging
        msg = '%%s %s pipeline step' % (self.__class__.__name__,)

        # if this class is in the spider's pipeline, then use the
        # process_item method normally.
        if self.__class__ in spider.pipeline:
            spider.log(msg % 'executing', level=log.DEBUG)
            return process_item_method(self, item, spider)

        # otherwise, just return the untouched item (skip this step in
        # the pipeline)
        else:
            spider.log(msg % 'skipping', level=log.DEBUG)
            return item

    return wrapper
And it is used like this:
class MySpider(BaseSpider):
    pipeline = set([
        pipelines.Save,
        pipelines.Validate,
    ])

    def parse(self, response):
        # insert scrapy goodness here
        return item

class Save(BasePipeline):
    @check_spider_pipeline
    def process_item(self, item, spider):
        # more scrapy goodness here
        return item
Finally, you may also find this question helpful.
Perhaps I don't understand your question, but it sounds like you just need to call your submission code in the close_spider(self, spider) method. Have you tried that?
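For reference, one common pattern is to store the ID returned by the API on the item itself, so the same or a later pipeline can use it to create the related entities. A rough sketch, where the endpoint URLs, field names, and response shape are all assumptions:

import requests

class ApiExportPipeline(object):
    def process_item(self, item, spider):
        # POST the parent entity and keep the ID the API returns
        resp = requests.post('https://api.example.com/parents', json=dict(item))
        parent_id = resp.json()['id']   # assumed response shape

        # store it on the item so later pipelines can reuse it
        # (assumes the item declares a parent_id field, or is a plain dict)
        item['parent_id'] = parent_id

        # create the related ("many" side) entities with that ID
        for child in item.get('children', []):
            requests.post('https://api.example.com/children',
                          json=dict(child, parent_id=parent_id))
        return item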

How to get all objects of a class in OpenERP?

I need to get all objects from a class and iterate through them.
I tried this, but without any results:
def my_method(self, cr, uid, ids, context=None):
    pool_obj = pooler.get_pool(cr.dbname)
    my_objects = pool_obj.get('project.myobject')
    # here I'll iterate through them...
How can I get all objects of class 'project.myobject' into the 'my_objects' variable?
You have to search with empty parameters to get all the ids of existing objects, like:
myobj = pool.get('project.myobject')
ids = myobj.search(cr, uid, [])
Then you can browse or read them passing an id or the list of ids.
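For example, a minimal sketch of iterating over the results inside a model method ('name' is a placeholder field; use your model's actual fields):

from openerp import pooler

def my_method(self, cr, uid, ids, context=None):
    pool_obj = pooler.get_pool(cr.dbname)
    myobj = pool_obj.get('project.myobject')
    all_ids = myobj.search(cr, uid, [], context=context)
    for record in myobj.browse(cr, uid, all_ids, context=context):
        print(record.name)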
It seems you forgot to import pooler:
from openerp import pooler
Hope it helps.

Scrapy: Default values for items & fields. What is the best implementation?

As far as I could find out from the documentation and various discussions on the net, the ability to add default values to fields in a scrapy item has been removed.
This doesn't work
category = Field(default='null')
So my question is: what is a good way to initialize fields with a default value?
I already tried to implement it as an item pipeline, as suggested here, without any success:
https://groups.google.com/forum/?fromgroups=#!topic/scrapy-users/-v1p5W41VDQ
I figured out what the problem was. The pipeline works (code follows for other people's reference). My problem was that I was appending values to a field and wanted the default to apply to one of those list values. I chose a different way and it works: I am now implementing it with a custom setDefault processor method.
class DefaultItemPipeline(object):
    def process_item(self, item, spider):
        item.setdefault('amz_VendorsShippingDurationFrom', 'default')
        item.setdefault('amz_VendorsShippingDurationTo', 'default')
        # ...
        return item
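For reference, the "setDefault processor" idea could look roughly like this (a sketch, not the poster's actual code): use an ItemLoader whose TakeFirst output processor falls back to a default value added after the scraped one. The item class, spider name, URL, and XPath below are all placeholders.

import scrapy
from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst

class MyItem(scrapy.Item):
    category = scrapy.Field()

class DefaultingLoader(ItemLoader):
    # TakeFirst() outputs the first non-empty value collected for a field
    default_output_processor = TakeFirst()

class ExampleSpider(scrapy.Spider):
    name = 'defaults_example'
    start_urls = ['http://example.com']

    def parse(self, response):
        loader = DefaultingLoader(item=MyItem(), response=response)
        loader.add_xpath('category', '//span[@class="category"]/text()')
        # fallback value: only used when the XPath above matched nothing
        loader.add_value('category', 'null')
        yield loader.load_item()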
Typically, a constructor is used to initialize fields.
class SomeItem(scrapy.Item):
    id = scrapy.Field()
    category = scrapy.Field()

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self['category'] = 'null'  # set default value
This may not be a clean solution, but it avoids unnecessary pipelines.
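Used like this, a newly constructed item already carries the default, and a scraped value simply overwrites it. Note that, because the default is assigned after super().__init__(), a category value passed to the constructor would be overwritten as well, so assign it after construction:

item = SomeItem(id=1)
print(item['category'])     # -> 'null'
item['category'] = 'books'  # a real scraped value replaces the default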

Repackage Scrapy Spider Items

To keep things organized, I determined that there are three item classes the spider will populate.
Each item class has a variety of fields that are populated.
class item_01(Item):
    item1 = Field()
    item2 = Field()
    item3 = Field()

class item_02(Item):
    item4 = Field()
    item5 = Field()

class item_03(Item):
    item6 = Field()
    item7 = Field()
    item8 = Field()
There are multiple pages to crawl with the same items.
In the spider I use XPathItemLoader to populate the 'containers'.
The goal is to pass the items to a mysql pipeline to populate a single table. But here is the problem.
When I yield the three containers (per page), they are passed into the pipeline as three separate containers.
They go through the pipeline as their own BaseItem and populate only their section of the mysql table, leaving the other columns 'NULL'.
What I would like to do is repackage these three containers into a single BaseItem so that they are passed into the pipeline as a single ITEM.
Does anyone have any suggestions as to repackage the items? Either in the spider or pipeline?
Thanks
I did this hack to get things moving, but if someone can improve it or hint at a better solution, please share it.
Loading my items in the spider like this:
items = [item1.load_item(), item2.load_item(), item3.load_item()]
I then defined a function outside the spider:
def rePackIt(items):
    rePackage = rePackageItems()
    rePack = {}
    for item in items:
        rePack.update(dict(item))
    for key, value in rePack.items():
        rePackage.fields[key] = value
    return rePackage
Where in the items.py I added:
class rePackageItems(Item):
    """Repackage the items"""
    pass
After the spider is done crawling the page and loading items I yield:
yield rePackIt(items)
which takes me to the pipelines.py.
In process_item, to unpack the item, I did the following:
def process_item(self, item, spider):
    items = item.fields
items is now a dictionary that contains all the extracted fields from the spider, which I then use to insert into a single database table.
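For what it's worth, a somewhat cleaner variant of the same idea (an untested sketch; CombinedItem and combine are made-up names): declare one item class with all eight fields and merge the loaded items in the spider, so the pipeline receives ordinary item values instead of having to read item.fields.

from scrapy import Item, Field

class CombinedItem(Item):
    # one class declaring every column of the target table
    item1 = Field()
    item2 = Field()
    item3 = Field()
    item4 = Field()
    item5 = Field()
    item6 = Field()
    item7 = Field()
    item8 = Field()

def combine(loaders):
    """Merge the dicts produced by several item loaders into one item."""
    merged = {}
    for loader in loaders:
        merged.update(dict(loader.load_item()))
    return CombinedItem(merged)

# in the spider callback:
#     yield combine([item1, item2, item3])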