Scrapy pipeline architecture - need to return variables

I need some advice on how to proceed with my item pipeline. I need to POST an item to an API (working well), get the ID of the created entity from the response object (also working), and then use that ID to populate another entity. Ideally, the item pipeline could return the entity ID. Basically, I am in a situation where I have a one-to-many relationship that I need to encode in a NoSQL database. What would be the best way to proceed?

The best way to proceed is to use MongoDB, a NoSQL database that works well with Scrapy. A MongoDB pipeline can be found here, and the process is explained in this tutorial.
As explained in Pablo Hoffman's solution, updating one item from several different pipelines can be handled with the following decorator on a pipeline's process_item method: it checks the spider's pipeline attribute to decide whether that pipeline step should run. (I have not tested the code, but I hope it helps.)
import functools

from scrapy import log


def check_spider_pipeline(process_item_method):
    @functools.wraps(process_item_method)
    def wrapper(self, item, spider):
        # message template for debugging
        msg = '%%s %s pipeline step' % (self.__class__.__name__,)

        # if this pipeline class is in the spider's pipeline set, then use the
        # process_item method normally.
        if self.__class__ in spider.pipeline:
            spider.log(msg % 'executing', level=log.DEBUG)
            return process_item_method(self, item, spider)

        # otherwise, just return the untouched item (skip this step in
        # the pipeline)
        else:
            spider.log(msg % 'skipping', level=log.DEBUG)
            return item

    return wrapper
And the decorator is used like this:
class MySpider(BaseSpider):
    pipeline = set([
        pipelines.Save,
        pipelines.Validate,
    ])

    def parse(self, response):
        # insert scrapy goodness here
        return item


class Save(BasePipeline):
    @check_spider_pipeline
    def process_item(self, item, spider):
        # more scrapy goodness here
        return item
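One caveat (my addition, not part of the original answer): the pipelines still have to be enabled in settings.py as usual; the decorator only decides whether an enabled pipeline acts on a given spider's items. Something like the following, where the module path is hypothetical:

# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.Save': 100,
    'myproject.pipelines.Validate': 200,
}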
Finally, you may also find this related question helpful.

Perhaps I don't understand your question, but it sounds like you just need to call your submission code in the pipeline's close_spider(self, spider) method. Have you tried that?
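To make that concrete, here is a rough sketch (mine, not tested) of a pipeline that POSTs each item to a hypothetical API with the requests library, remembers the IDs the API returns, and then creates the related "many"-side records in close_spider; the endpoint and field names are placeholders:

import requests


class ApiExportPipeline(object):
    API_URL = 'http://example.com/api'  # placeholder endpoint

    def open_spider(self, spider):
        self.parent_ids = []

    def process_item(self, item, spider):
        # POST the parent entity and keep the ID the API hands back
        response = requests.post(self.API_URL + '/parents', json=dict(item))
        self.parent_ids.append(response.json()['id'])
        return item

    def close_spider(self, spider):
        # once the crawl is done, create the related child entities
        for parent_id in self.parent_ids:
            requests.post(self.API_URL + '/children', json={'parent_id': parent_id})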

Related

How do I construct complex django query statements?

I am not very familiar with SQL, so making more complex calls via the Django ORM is stumping me. I have a Printer model that spawns Jobs, and the jobs receive statuses via a State model with a foreign key to Job. A job's status is determined by the most recent State object associated with it; this way I can track the history of a job's states throughout its life cycle. I want to be able to determine which Printers have successful jobs associated with them.
from django.db import models


class Printer(models.Model):
    label = models.CharField(max_length=120)


class Job(models.Model):
    label = models.CharField(max_length=120)
    printer = models.ForeignKey(
        Printer,
        related_name='jobs',
        related_query_name='job'
    )

    def set_state(self, state):
        State.objects.create(state=state, job=self)

    @property
    def current_state(self):
        return self.states.latest('created_at').state


class State(models.Model):
    created_at = models.DateTimeField(auto_now_add=True)
    state = models.SmallIntegerField()
    job = models.ForeignKey(
        Job,
        related_name='states',
        related_query_name='state'
    )
I need a QuerySet of Printer objects that have at least one related job whose most recent (latest) State object has State.state == '200'. Is there a way to construct a compound query that achieves this in the database, without pulling in all Job objects and iterating over them in Python? Perhaps a custom manager? I've been reading posts about Subquery, annotation and OuterRef, but these ideas are just not sinking in in a way that shows me a path. I need them explained like I'm 5; they are very unpythonic statements.
The naive Python way to describe what I want:
printers = []
for printer in Printer.objects.all():
    for job in printer.jobs.all():
        if job.states.latest('created_at').state == '200':
            printers.append(printer)
printers = list(set(printers))
But with the least number of DB round trips possible. Help!
edit: further question, what's the best way to filter Jobs based on the current state? Since Job.current_state is a calculated property it cannot be used in a QuerySet filter, but, again, I don't want to have to pull in all Job objects.
Took about two days to sink in, but I think I have an answer using annotation and Subqueries:
from django.db.models import Exists, OuterRef, Subquery

state_sq = State.objects.filter(job=OuterRef('pk')).order_by('-created_at')

successful_jobs = Job.objects.annotate(
    latest_state=Subquery(state_sq.values('state')[:1])
).filter(printer=OuterRef('pk'), latest_state='200')

printers_with_successful_jobs = Printer.objects.annotate(
    has_success_jobs=Exists(successful_jobs)
).filter(has_success_jobs=True)
And further, I constructed a custom manager to return latest_state by default.
class JobManager(models.Manager):
    def get_queryset(self):
        state_sq = State.objects.filter(
            job=OuterRef('pk')
        ).order_by('-created_at')
        return super().get_queryset().annotate(
            latest_state=Subquery(state_sq.values('state')[:1])
        )


class Job(models.Model):
    objects = JobManager()
    ...
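With that manager in place, the follow-up question in the edit more or less answers itself: since latest_state is now an ordinary annotation, it can be used directly in a filter. A usage sketch (mine, not from the original answer):

# only jobs whose most recent State is 200, evaluated entirely in the database
successful = Job.objects.filter(latest_state=200)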

Scrapy item metadata?

My spider yields two types of items, which are then processed by a pipeline. Is there a way for the pipeline to identify each type of item (other than through the keys)? Some sort of metadata type or title field?
In your pipelines.py:
def process_item(self, item, spider):
    if isinstance(item, YourItemType1):
        # code to process Item Type 1
        ...
    return item
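For completeness, a small end-to-end sketch of that idea; the item class names here are my own placeholders, not from the original answer:

import scrapy


class ThreadItem(scrapy.Item):    # hypothetical item type 1
    title = scrapy.Field()


class PostItem(scrapy.Item):      # hypothetical item type 2
    content = scrapy.Field()


class MultiTypePipeline(object):
    def process_item(self, item, spider):
        if isinstance(item, ThreadItem):
            spider.logger.debug('processing a thread: %s', item.get('title'))
        elif isinstance(item, PostItem):
            spider.logger.debug('processing a post')
        return item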

Where to write predefined queries in django?

I am working with a team of engineers, and this is my first Django project.
Since I have done SQL before, I chose to write the predefined queries that the front-end developers are supposed to use to build this page (result-set paging, simple find, etc.).
I just learned Django QuerySets, and I am ready to use them, but I do not know in which file/class to write them.
Should I write them as methods inside each class in models.py? The Django documentation simply runs queries in the shell, and I haven't seen it say where to put them.
Generally, the Django pattern is to write your queries in your views, in the views.py file. There, for a given URL, you take each of your predefined queries and return a response that either renders a template (which presumably your front-end team will build with you) or returns a JSON response (for example through Django REST Framework for an SPA front end).
The tutorial is strong on this, so it may be a better guide for where to put things than the reference docs themselves.
Queries can be run anywhere, but Django is built to receive requests through the URL schema and return a response. This is typically done in views.py, and each view is generally wired up by a line in the urls.py file.
If you're particularly interested in following the fat-models approach and putting the queries there, then you might be interested in Manager objects, which are what define the querysets you get through, for example, MyModel.objects.all()
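As a rough illustration of that fat-models idea (the model and field names are made up for the example), a predefined query can live on a custom Manager so the views stay thin:

from django.db import models


class ArticleManager(models.Manager):
    def published(self):
        # a predefined query, reusable from any view
        return self.filter(status='published').order_by('-created_at')


class Article(models.Model):
    title = models.CharField(max_length=200)
    status = models.CharField(max_length=20)
    created_at = models.DateTimeField(auto_now_add=True)

    objects = ArticleManager()

A view would then just call Article.objects.published() instead of repeating the filter everywhere.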
My example view (a class-based view which provides information about a list of matches):
from rest_framework import generics


class MatchList(generics.ListCreateAPIView):
    """
    Retrieve, update or delete a Match.
    """
    queryset = Match.objects.all()
    serializer_class = MatchSerialiser
That queryset could be anything, though.
A function based view with a different queryset would be:
def event(request, event_slug):
    from .models import Event, Comment, Profile

    event = Event.objects.get(event_url=event_slug)
    future_events = Event.objects.filter(date__gt=event.date)
    comments = Comment.objects.select_related('user').filter(event=event)
    final_comments = []

    return render(request, 'core/event.html', {"event": event, "future_events": future_events})
edit: That second example is quite old, and the query would be better refactored to (prefetch_related rather than select_related, since comments is a reverse foreign key):
future_events = Event.objects.filter(date__gt=event.date).prefetch_related('comments')
Edit edit: It's worth pointing out that QuerySet isn't a language in the way you're using the term. It's Django's API for the object-relational mapper that sits on top of the database, in the same way that SQLAlchemy does; in fact, you can swap in SQLAlchemy instead of the Django ORM if you really wanted to. Mostly you'll hear people talk about the Django ORM. :)
If you have some model SomeModel and you wanted to access its objects via a raw SQL query you would do: SomeModel.objects.raw(raw_query).
For example: SomeModel.objects.raw('SELECT * FROM myapp_somemodel')
https://docs.djangoproject.com/en/1.11/topics/db/sql/#performing-raw-queries
Django file structure:
app/
    models.py
    views.py
    urls.py
    templates/
        app/
            my_template.html
In models.py:
class MyModel(models.Model):
    pass  # field definitions and relations

In views.py:
from django.shortcuts import render
from .models import MyModel

def my_view(request):
    my_model = MyModel.objects.all()  # here you use the querysets
    return render(request, 'my_template.html', {'object_list': my_model})  # pass the objects to the template

In urls.py:
from .views import my_view

urlpatterns = [
    url(r'^myurl/$', my_view, name='my_view'),  # the url that points to your view
]

And finally in my_template.html:
{# display the data using the django template language #}
{% for obj in object_list %}
    <p>{{ obj }}</p>
{% endfor %}

Returning nested structures from spider

I'm trying to work out how to get scrapy to return a nested data structure, as the only examples I can find deal with flat structures.
I am trying to scrape a forum, which is comprised of a list of threads, with each thread having a list of posts.
I can successfully scrape the list of threads, and the list of posts, but I am not sure how to get all the posts attached to the thread, instead of all jumbled together.
In the end, I am aiming for output like this:
<thread id="1">
    <post>Post 1</post>
    <post>Post 2</post>
</thread>
<thread id="2">
    <post>Post A</post>
    <post>Post B</post>
</thread>
If I do something like this:
def parse(self, response):
    # For each thread on this page
    yield scrapy.Request(thread_url, self.parse_posts)

def parse_posts(self, response):
    # For each post on this page
    yield {'content': ...}
Then I just get a list of all posts without them being arranged into threads. Something like this doesn't work, of course:
def parse(self, response):
    # For each thread on this page
    yield {
        'id': ...,
        'posts': scrapy.Request(thread_url, self.parse_posts)
    }
So I am not sure how to get the "child" requests to go into the "parent" object.
As far as getting the association goes, as JimmyZhang said, this is exactly what meta is for: parse an ID out of the thread list page before yielding a request, pass that thread ID into the request via the meta keyword, then access the ID when processing the post.
def parse(self, response):
    # For each thread on this page
    thread_id = response.xpath('thread_id_getter_xpath').extract()
    yield scrapy.Request(thread_url, callback=self.parse_posts,
                         meta={'thread_id': thread_id})

def parse_posts(self, response):
    # For each post on this page
    thread_id = response.meta['thread_id']
    yield {'thread_id': thread_id, 'content': ...}
At this point, the items are associated. How you compile data into a hierarchical format is entirely up to you, and dependent on your needs. You could, for instance, write a pipeline to compile it all in a dictionary and output it at the end of the crawl.
def process_item(self, item, spider):
    # Assume self.forum is an empty dict at initialization
    self.forum.setdefault(item['thread_id'], [])
    self.forum[item['thread_id']].append({'post': item['post_id'],
                                          'content': item['content']})

def close_spider(self, spider):
    # Do something with self.forum, like output it as XML or JSON
    # ... or just print it to stdout.
    print(self.forum)
Or you could build up an XML tree incrementally, saving as you go. Or serialize each item into a JSON string and dump it to a file line by line. Or add items to a database as you go. Or whatever else your needs dictate.
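As one concrete illustration of the XML option (my own sketch, reusing the self.forum dict built above), close_spider could emit roughly the nested structure the question asks for:

import xml.etree.ElementTree as ET

def close_spider(self, spider):
    # turn the {thread_id: [post, ...]} dict into nested <thread>/<post> XML
    root = ET.Element('forum')
    for thread_id, posts in self.forum.items():
        thread_el = ET.SubElement(root, 'thread', id=str(thread_id))
        for post in posts:
            ET.SubElement(thread_el, 'post').text = post['content']
    ET.ElementTree(root).write('forum.xml', encoding='utf-8')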
You can use metadata.
First:
yield scrapy.Request(thread_url, self.parse_posts, meta={'thread_id': thread_id})
Second, define a thread item:
from scrapy import Field, Item

class thread_item(Item):
    thread_id = Field()
    posts = Field()
Third, fetch the thread_id in parse_posts:
thread_id = response.meta['thread_id']
# parse posts content, construct thread item
yield item
Fourth, write a pipeline, and output the thread item.
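That fourth step isn't spelled out; purely as an illustration (mine, not from the original answer), a minimal pipeline could dump each thread item as a line of JSON:

import json


class ThreadExportPipeline(object):
    def open_spider(self, spider):
        self.file = open('threads.jl', 'w')

    def process_item(self, item, spider):
        # one thread, with its collected posts, per line
        self.file.write(json.dumps(dict(item)) + '\n')
        return item

    def close_spider(self, spider):
        self.file.close()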

Scrapy: Default values for items & fields. What is the best implementation?

As far as I could find out from the documentation and various discussions on the net, the ability to add default values to fields in a scrapy item has been removed.
This doesn't work
category = Field(default='null')
So my question is: what is a good way to initialize fields with a default value?
I already tried to implement it as a item pipeline as suggested here, without any success.
https://groups.google.com/forum/?fromgroups=#!topic/scrapy-users/-v1p5W41VDQ
Figured out what the problem was: the pipeline is working (code follows for other people's reference). My problem was that I am appending values to a field, and I wanted the default to apply to one of those list values. I chose a different way and it works; I am now implementing it with a custom setDefault processor method.
class DefaultItemPipeline(object):
    def process_item(self, item, spider):
        item.setdefault('amz_VendorsShippingDurationFrom', 'default')
        item.setdefault('amz_VendorsShippingDurationTo', 'default')
        # ...
        return item
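The "setDefault processor method" mentioned above isn't shown. Purely as an illustration of that idea (my own sketch, using an ItemLoader input processor; the import path is the Scrapy 1.x one and the field name is a placeholder), it could look something like this:

from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose


def set_default(default):
    # per-value processor: replace missing/empty values with a default
    def fill(value):
        return value if value not in (None, '') else default
    return fill


class ProductLoader(ItemLoader):
    # applies the default to each value appended to 'category'
    category_in = MapCompose(set_default('null'))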
Typically, a constructor is used to initialize fields.
class SomeItem(scrapy.Item):
    id = scrapy.Field()
    category = scrapy.Field()

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self['category'] = 'null'  # set default value
This may not be a clean solution, but it avoids unnecessary pipelines.