I'm trying to work out how to get scrapy to return a nested data structure, as the only examples I can find deal with flat structures.
I am trying to scrape a forum, which is comprised of a list of threads, with each thread having a list of posts.
I can successfully scrape the list of threads, and the list of posts, but I am not sure how to get all the posts attached to the thread, instead of all jumbled together.
In the end, I am aiming for output like this:
<thread id="1">
<post>Post 1</post>
<post>Post 2</post>
</thread>
<thread id="2">
<post>Post A</post>
<post>Post B</post>
</thread>
If I do something like this:
def parse(self, response):
    # For each thread on this page
    yield scrapy.Request(thread_url, self.parse_posts)

def parse_posts(self, response):
    # For each post on this page
    yield {'content': ...}
Then I just get a list of all posts without them being arranged into threads. Something like this doesn't work, of course:
def parse(self, response):
    # For each thread on this page
    yield {
        'id': ...,
        'posts': scrapy.Request(thread_url, self.parse_posts)
    }
So I am not sure how to get the "child" requests to go into the "parent" object.
As far as getting the association goes, as JimmyZhang said, this is exactly what meta is for: parse an ID out of the thread list page before yielding a request, pass that thread ID into the request via the meta keyword, then access the ID when processing the posts.
def parse(self, response):
    # For each thread on this page
    thread_id = response.xpath('thread_id_getter_xpath').extract()
    yield scrapy.Request(thread_url, callback=self.parse_posts,
                         meta={'thread_id': thread_id})

def parse_posts(self, response):
    # For each post on this page
    thread_id = response.meta['thread_id']
    yield {'thread_id': thread_id, 'content': ...}
At this point, the items are associated. How you compile data into a hierarchical format is entirely up to you, and dependent on your needs. You could, for instance, write a pipeline to compile it all in a dictionary and output it at the end of the crawl.
def process_item(self, item, spider):
    # Assume self.forum is an empty dict at initialization
    self.forum.setdefault(item['thread_id'], [])
    self.forum[item['thread_id']].append({'post': item['post_id'],
                                          'content': item['content']})

def close_spider(self, spider):
    # Do something with self.forum, like output it as XML or JSON
    # ... or just print it to stdout.
    print(self.forum)
Or you could compile an XML tree incrementally and save it at the end. Or serialize each item to a JSON string and dump it to a file line by line. Or add items to a database as you go. Or whatever else your needs dictate.
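As a rough illustration of the JSON-lines idea, a minimal pipeline might look like this (a sketch only; the filename is a placeholder):

import json

class JsonLinesPipeline:
    """Minimal sketch: one JSON object per line, each post tagged with its thread_id."""

    def open_spider(self, spider):
        self.file = open('forum_posts.jl', 'w')  # placeholder filename

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item)) + '\n')
        return item

    def close_spider(self, spider):
        self.file.close()

Scrapy's built-in feed exports can also write JSON lines directly, which may be simpler if you don't need any custom grouping.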
You can use metadata.
First:
yield scrapy.Request(thread_url, self.parse_posts, meta={'thread_id': thread_id})
Second, define a thread item:
class thread_item(Item):
    thread_id = Field()
    posts = Field()
Third, fetch the thread_id in parse_posts:
thread_id = response.meta['thread_id']
# parse the post content, then construct the thread item
item = thread_item(thread_id=thread_id, posts=posts)
yield item
Fourth, write a pipeline, and output the thread item.
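A sketch of that fourth step, writing each thread item out in the nested XML shape from the question (the filename is a placeholder and no XML escaping is done here):

class ThreadXmlPipeline:
    """Sketch only: writes each thread item in the nested format from the question."""

    def open_spider(self, spider):
        self.file = open('threads.xml', 'w')  # placeholder filename

    def process_item(self, item, spider):
        self.file.write('<thread id="%s">\n' % item['thread_id'])
        for post in item['posts']:
            self.file.write('  <post>%s</post>\n' % post)
        self.file.write('</thread>\n')
        return item

    def close_spider(self, spider):
        self.file.close()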
Related
My spider yields two types of items, which are then processed by a pipeline. Is there a way for the pipeline to identify each type of item (other than through the keys)? Some sort of metadata type or title field?
In your pipelines.py:
def process_item(self, item, spider):
    if isinstance(item, YourItemType1):
        # code to process Item Type 1
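A slightly fuller sketch of that pipeline (the class name is a placeholder, and YourItemType1/YourItemType2 stand for your two item classes):

class TypeAwarePipeline:
    def process_item(self, item, spider):
        if isinstance(item, YourItemType1):
            # code to process Item Type 1 goes here
            pass
        elif isinstance(item, YourItemType2):
            # code to process Item Type 2 goes here
            pass
        return item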
I need some advice on how to proceed with my item pipeline. I need to POST an item to an API (working well), get the ID of the created entity from the response object (have this working too), and then use it to populate another entity. Ideally, the item pipeline could return the entity ID. Basically, I am in a situation where I have a one-to-many relationship that I need to encode in a NoSQL database. What would be the best way to proceed?
The best way to proceed is to use MongoDB, a NoSQL database that works well with Scrapy. A pipeline for MongoDB can be found here, and the process is explained in this tutorial.
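For reference, a minimal MongoDB pipeline along those lines could look something like this (a sketch, assuming pymongo; the connection URI, database and collection names are placeholders):

import pymongo

class MongoPipeline:
    def open_spider(self, spider):
        self.client = pymongo.MongoClient('mongodb://localhost:27017')  # placeholder URI
        self.db = self.client['scrapy_db']  # placeholder database name

    def process_item(self, item, spider):
        # insert_one returns the generated _id, which could be reused to
        # link child documents for the one-to-many relationship
        result = self.db['items'].insert_one(dict(item))  # placeholder collection name
        spider.logger.debug('Stored item with _id %s', result.inserted_id)
        return item

    def close_spider(self, spider):
        self.client.close()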
Now, as explained in the solution from Pablo Hoffman, updating different items from different pipelines into one can be achieved with the following decorator on the process_item method of a pipeline object, so that it checks the pipeline attribute of your spider to decide whether or not it should be executed. (I have not tested the code, but I hope it helps.)
import functools

from scrapy import log


def check_spider_pipeline(process_item_method):
    @functools.wraps(process_item_method)
    def wrapper(self, item, spider):
        # message template for debugging
        msg = '%%s %s pipeline step' % (self.__class__.__name__,)

        # if this class is in the spider's pipeline, then use the
        # process_item method normally.
        if self.__class__ in spider.pipeline:
            spider.log(msg % 'executing', level=log.DEBUG)
            return process_item_method(self, item, spider)

        # otherwise, just return the untouched item (skip this step in
        # the pipeline)
        else:
            spider.log(msg % 'skipping', level=log.DEBUG)
            return item

    return wrapper
And it is used like this: the spider declares which pipelines apply to it, and each pipeline's process_item is wrapped with the decorator.
class MySpider(BaseSpider):

    pipeline = set([
        pipelines.Save,
        pipelines.Validate,
    ])

    def parse(self, response):
        # insert scrapy goodness here
        return item


class Save(BasePipeline):

    @check_spider_pipeline
    def process_item(self, item, spider):
        # more scrapy goodness here
        return item
Finally, you can also take a look at this question for more help.
Perhaps I don't understand your question, but it sounds like you just need to call your submission code in the close_spider(self, spider) method. Have you tried that?
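A rough sketch of that idea, assuming the requests library; the API URLs and the response/field names are made up for illustration:

import requests

class ApiSubmitPipeline:
    def open_spider(self, spider):
        self.pending = []

    def process_item(self, item, spider):
        # collect items during the crawl; submit everything at the end
        self.pending.append(dict(item))
        return item

    def close_spider(self, spider):
        for data in self.pending:
            # POST the parent entity and read the generated ID from the response
            resp = requests.post('https://api.example.com/parents', json=data)  # placeholder URL
            parent_id = resp.json().get('id')  # assumes the API returns {'id': ...}
            # use the ID to create the related (child) entity
            requests.post('https://api.example.com/children',  # placeholder URL
                          json={'parent_id': parent_id, 'data': data})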
As far as I could find out from the documentation and various discussions on the net, the ability to add default values to fields in a scrapy item has been removed.
This doesn't work
category = Field(default='null')
So my question is: what is a good way to initialize fields with a default value?
I already tried to implement it as an item pipeline as suggested here, without any success:
https://groups.google.com/forum/?fromgroups=#!topic/scrapy-users/-v1p5W41VDQ
I figured out what the problem was. The pipeline works (code follows for other people's reference). My problem was that I am appending values to a field, and I wanted the default to apply to one of those list values. I chose a different way and it works: I am now implementing it with a custom setDefault processor method.
class DefaultItemPipeline(object):
    def process_item(self, item, spider):
        item.setdefault('amz_VendorsShippingDurationFrom', 'default')
        item.setdefault('amz_VendorsShippingDurationTo', 'default')
        # ...
        return item
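The custom setDefault processor itself isn't shown above, but for reference, one way to get a similar default with an ItemLoader (a sketch, not necessarily what was done here; the item class, loader name and XPath are placeholders) is to add a fallback value and let TakeFirst pick the first non-empty one:

from itemloaders.processors import TakeFirst  # scrapy.loader.processors in older Scrapy
from scrapy.loader import ItemLoader

class MyItemLoader(ItemLoader):
    default_output_processor = TakeFirst()

# inside a spider callback:
loader = MyItemLoader(item=MyItem(), response=response)
loader.add_xpath('category', '//span[@class="category"]/text()')  # may match nothing
loader.add_value('category', 'null')  # fallback, used only if the XPath yielded nothing
yield loader.load_item()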
Typically, a constructor is used to initialize fields.
class SomeItem(scrapy.Item):
    id = scrapy.Field()
    category = scrapy.Field()

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self['category'] = 'null'  # set default value
This may not be a clean solution, but it avoids unnecessary pipelines.
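A quick usage example of that constructor approach:

item = SomeItem(id=1)
print(item['category'])  # 'null' until something else is assigned

item['category'] = 'books'
print(item['category'])  # 'books'

Note that because the default is set after super().__init__(), a category value passed directly to the constructor would also be overwritten, so assign it after construction as shown.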
I have two indirectly related tables - Posts and Follower_to_followee
models.py:
class Post(models.Model):
    auth_user = models.ForeignKey(User, null=True, blank=True, verbose_name='Author', help_text="Author")
    title = models.CharField(blank=True, max_length=255, help_text="Post Title")
    post_content = models.TextField(help_text="Post Content")

class Follower_to_followee(models.Model):
    follower = models.ForeignKey(User, related_name='user_followers', null=True, blank=True, help_text="Follower")
    followee = models.ForeignKey(User, related_name='user_followees', null=True, blank=True, help_text="Followee")
The followee is indirectly related to auth_user (the post author) in Posts. It is, though, directly related to the Django user table, and the user table is directly related to the Post table.
How can I select all followees for a specific follower and include post counts for each followee in the result of the query without involving the user table? Actually, at this point I am not even clear how to do that involving the user table. Please help.
It's possible to write a query that generates a single SQL statement; try something like:
qs = User.objects.filter(user_followees__follower=specific_follower).annotate(
    post_count=models.Count('post'))

for u in qs:
    print(u, u.post_count)
Check the second part of https://stackoverflow.com/a/13293460/165603 (things work similarly, except for the extra M2M manager).
When used inside User.objects.filter, both user_followees__follower=foo and user_followers__followee=foo cause a join against the Follower_to_followee table, with a WHERE condition checking follower=foo or followee=foo respectively.
(Note that user_followees__followee=foo or user_followers__follower=foo works differently from the above; the Django ORM simplifies these smartly and generates something like User.objects.filter(pk=foo.pk).)
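To illustrate the difference (foo here stands for any User instance, just for illustration):

from django.contrib.auth.models import User

foo = User.objects.first()  # any User instance

# Both of these join the Follower_to_followee table and add a WHERE clause:
followees_of_foo = User.objects.filter(user_followees__follower=foo)  # users that foo follows
followers_of_foo = User.objects.filter(user_followers__followee=foo)  # users following foo

# These, by contrast, are simplified by the ORM into a plain primary-key
# lookup, roughly User.objects.filter(pk=foo.pk):
User.objects.filter(user_followees__followee=foo)
User.objects.filter(user_followers__follower=foo)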
I'm not entirely sure I understand the question, but here is a simple solution. Note that this could be written more succinctly, but I broke it up so you can see each step.
How can I select all followees for a specific follower?
# First grab all the Follower_to_followee entries for a given
# follower called: the_follower
follows = Follower_to_followee.objects.filter(follower=the_follower)

followee_counts = []

# Next, let's iterate through those objects and pick out
# the followees and their posts
for follow in follows:
    followee = follow.followee

    # Posts for each followee
    followee_posts = Post.objects.filter(auth_user=followee)

    # Count the number of posts in the queryset
    count = followee_posts.count()

    # Add the followee/post_count pair to our list of followee_counts
    followee_counts.append((followee, count))

# followee_counts is now a list of (followee, post_count) tuples
To get the post counts you can use this:
# get the follower
follower = User.objects.get(username='username_of_follower')

# get all followees for a specific follower
for element in Follower_to_followee.objects.filter(follower=follower):
    print(element.followee.post_set.all().count())
views.py
def view_name(request):
    followers = Follower_to_followee.objects.filter(followee=request.user)
    .......
html
{{ user }}<br/>
My followers:<br/>
{% for follower in followers %}
    <p>{{ follower.follower }} - {{ follower.follower.post_set.count }}</p>
{% endfor %}
To keep things organized I determined there are three item classes that a spider will populate.
Each item class has a variety of fields that are populated.
class item_01(Item):
    item1 = Field()
    item2 = Field()
    item3 = Field()

class item_02(Item):
    item4 = Field()
    item5 = Field()

class item_03(Item):
    item6 = Field()
    item7 = Field()
    item8 = Field()
There are multiple pages to crawl with the same items.
In the spider I use XPathItemLoader to populate the 'containers'.
The goal is to pass the items to a mysql pipeline to populate a single table. But here is the problem.
When I yield the three containers (per page) they are passed as such into the pipeline, as three separate containers.
They go through the pipeline as their own BaseItem and populate only their section of the mysql table, leaving the other columns 'NULL'.
What I would like to do is repackage these three containers into a single BaseItem so that they are passed into the pipeline as a single ITEM.
Does anyone have any suggestions as to repackage the items? Either in the spider or pipeline?
Thanks
I did this hack to get things moving, but if someone can improve on it or hint at a better solution, please share it.
Loading my items in the spider like this:
items = [item1.load_item(), item2.load_item(), item3.load_item()]
I then defined a function outside the spider:
def rePackIt(items):
    rePackage = rePackageItems()
    rePack = {}
    for item in items:
        rePack.update(dict(item))
    for key, value in rePack.items():
        rePackage.fields[key] = value
    return rePackage
Where in the items.py I added:
class rePackageItems(Item):
    """Repackage the items"""
    pass
After the spider is done crawling the page and loading items I yield:
yield rePackIt(items)
which takes me to the pipelines.py.
In the process_item to unpack the item I did the following:
def process_item(self, item, spider):
    items = item.fields
items is now a dictionary that contains all of the extracted fields from the spider, which I then used to insert into a single database table.
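For completeness, that insert step might look roughly like this (a sketch only, using MySQLdb with made-up connection details and table name, and relying on the repacked dictionary from above):

import MySQLdb

class MySQLStorePipeline:
    def open_spider(self, spider):
        self.conn = MySQLdb.connect(host='localhost', user='scrapy',
                                    passwd='secret', db='forum')  # placeholder credentials
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        items = item.fields  # the repacked dictionary of extracted fields
        columns = ', '.join(items.keys())
        placeholders = ', '.join(['%s'] * len(items))
        sql = 'INSERT INTO forum_data (%s) VALUES (%s)' % (columns, placeholders)  # made-up table
        self.cursor.execute(sql, list(items.values()))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()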