Persist items using a POST request within a Pipeline - scrapy

I want to persist items within a Pipeline by posting them to a url.
I am using this code within the Pipeline:
class XPipeline(object):
    def process_item(self, item, spider):
        log.msg('in SpotifylistPipeline', level=log.DEBUG)
        yield FormRequest(url="http://www.example.com/additem",
                          formdata={'title': item['title'],
                                    'link': item['link'],
                                    'description': item['description']})
but it seems it's not making the http request.
Is it possible to make http request from pipelines? If not, do I have to do it in the Spider?
Do I need to specify a callback function? If so, which one?
If I can make the http call, can I check the response (JSON) and return the item if everything went ok, or discard the item if it didn't get saved?
One final thing: is there a diagram that explains the flow that Scrapy follows from beginning to end? I am getting slightly lost about what gets called when. For instance, if Pipelines returned items to Spiders, what would Spiders do with those items? What comes after a Pipeline call?
Many thanks in advance
Migsy

You can inherit your pipeline from scrapy.contrib.pipeline.media.MediaPipeline and yield Requests in 'get_media_requests'. Responses are passed into the 'media_downloaded' callback.
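For illustration, a rough sketch of that approach (untested; the class name is arbitrary, in modern Scrapy the base class lives at scrapy.pipelines.media.MediaPipeline rather than the legacy scrapy.contrib path, and newer versions pass extra keyword arguments to media_downloaded, hence the *args/**kwargs):

from scrapy.http import FormRequest
from scrapy.pipelines.media import MediaPipeline

class PostItemPipeline(MediaPipeline):
    def get_media_requests(self, item, info):
        # yield the POST request here instead of in process_item
        yield FormRequest(
            url="http://www.example.com/additem",
            formdata={'title': item['title'],
                      'link': item['link'],
                      'description': item['description']})

    def media_downloaded(self, response, request, info, *args, **kwargs):
        # the response to the POST ends up here; inspect it as needed
        # (*args/**kwargs absorb the extra item argument newer Scrapy passes)
        return response.status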

Quote:
This method is called for every item pipeline component and must
either return a Item (or any descendant class) object or raise a
DropItem exception. Dropped items are no longer processed by further
pipeline components.
So, only a spider can yield a request with a callback.
Pipelines are used for processing items.
You had better describe what you want to achieve.
is there a diagram that explains the flow that Scrapy follows from beginning to end
Architecture overview
For instance, if Pipelines returned items to Spiders
Pipelines do not return items to spiders. The items returned are passed to the next pipeline.

This could be done easily by using the requests library. If you don't want to use another library then look into urllib2.
import requests
from scrapy.exceptions import DropItem

class XPipeline(object):
    def process_item(self, item, spider):
        r = requests.post("http://www.example.com/additem",
                          data={'title': item['title'],
                                'link': item['link'],
                                'description': item['description']})
        if r.status_code == 200:
            return item
        else:
            raise DropItem("Failed to post item with title %s." % item['title'])
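For this to run at all, the pipeline also has to be registered in settings.py; a minimal example, assuming the class lives in myproject/pipelines.py (adjust the dotted path to your own project):

# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.XPipeline': 300,
}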

Related

Insert Record to BigQuery or some RDB during API Call

I am writing a REST API GET endpoint that needs to both return a response and store records to either BigQuery or GCP Cloud SQL (MySQL), but I want the return to not be dependent on completion of the writing of the records. Basically, my code will look like:
def predict():
    req = request.json.get("instances")
    resp = make_response(req)
    write_to_bq(req)
    write_to_bq(resp)
    return resp
Is there any easy way to do this with Cloud SQL Client Library or something?
Turns out Flask has functionality that does what I require:
@app.route("/predict", methods=["GET"])
def predict():
    # do some stuff with the request.json object
    return jsonify(response)

@app.after_request
def after_request_func(response):
    # do anything you want that relies on the context of predict()
    @response.call_on_close
    def persist():
        # this will happen after the response is sent,
        # so even if this function fails, predict()
        # will still get its response out
        write_to_db()
    return response
One important thing is that a function decorated with after_request must take an argument and return something of type flask.Response. Also, I think that a function decorated with call_on_close cannot access the context of the main method, so you need to define anything you want to use from the main method inside the after_request function but outside (above) the call_on_close function.
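As an illustration of that last point, a sketch (my own, not from the answer) that stashes the data on flask.g in the view and captures it into a local variable before the call_on_close closure is defined; the names to_persist and write_to_db are placeholders:

from flask import Flask, g, jsonify, request

app = Flask(__name__)

def write_to_db(payload):
    # placeholder for the real Cloud SQL / BigQuery write
    pass

@app.route("/predict", methods=["GET"])
def predict():
    result = request.json.get("instances")
    g.to_persist = result          # stash what the background write will need
    return jsonify(result)

@app.after_request
def after_request_func(response):
    payload = g.get("to_persist")  # capture it while the request context is live

    @response.call_on_close
    def persist():
        # runs after the response has been sent; uses only the captured local
        if payload is not None:
            write_to_db(payload)

    return response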

Scrapy spider_idle signal - need to add requests with parse item callback

In my Scrapy spider I have overridden the start_requests() method, in order to retrieve some additional urls from a database, that represent items potentially missed in the crawl (orphaned items). This should happen at the end of the crawling process. Something like (pseudo code):
def start_requests(self):
    for url in self.start_urls:
        yield Request(url, dont_filter=True)

    # attempt to crawl orphaned items
    db = MySQLdb.connect(host=self.settings['AWS_RDS_HOST'],
                         port=self.settings['AWS_RDS_PORT'],
                         user=self.settings['AWS_RDS_USER'],
                         passwd=self.settings['AWS_RDS_PASSWD'],
                         db=self.settings['AWS_RDS_DB'],
                         cursorclass=MySQLdb.cursors.DictCursor,
                         use_unicode=True,
                         charset="utf8",)
    c = db.cursor()

    c.execute("""SELECT p.url FROM products p LEFT JOIN product_data pd ON p.id = pd.product_id AND pd.scrape_date = CURDATE() WHERE p.website_id = %s AND pd.id IS NULL""", (self.website_id,))

    while True:
        url = c.fetchone()
        if url is None:
            break
        # record orphaned product
        self.crawler.stats.inc_value('orphaned_count')
        yield Request(url['url'], callback=self.parse_item)
    db.close()
Unfortunately, it appears as though the crawler queues up these orphaned items during the rest of the crawl - so, in effect, too many are regarded as orphaned (because the crawler has not yet retrieved these items in the normal crawl, when the database query is executed).
I need this orphaned process to happen at the end of the crawl - so I believe I need to use the spider_idle signal.
However, my understanding is that I can't simply yield requests in my spider_idle method; instead I can use self.crawler.engine.crawl?
I need requests to be processed by my spider's parse_item() method (and for my configured middleware, extensions and pipelines to be obeyed). How can I achieve this?
The idle method that was connected to the idle signal (let's say the idle method is called idle_method) should receive the spider as an argument, so you could do something like:
def idle_method(self, spider):
    self.crawler.engine.crawl(Request(url=myurl, callback=spider.parse_item), spider)
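A fuller sketch (untested) of how this usually fits together: connect the handler to the spider_idle signal in from_crawler, schedule the extra requests, and raise DontCloseSpider so the spider stays open until they are processed. get_orphaned_urls stands in for the MySQL query from the question, and note that recent Scrapy versions deprecate passing the spider argument to engine.crawl:

import scrapy
from scrapy import signals
from scrapy.exceptions import DontCloseSpider

class MySpider(scrapy.Spider):
    name = 'myspider'

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(MySpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.idle_method, signal=signals.spider_idle)
        return spider

    def idle_method(self, spider):
        # fires when the scheduler runs dry, i.e. at the end of the normal crawl
        urls = self.get_orphaned_urls()   # placeholder for the MySQL query above
        if urls:
            for url in urls:
                self.crawler.engine.crawl(
                    scrapy.Request(url, callback=spider.parse_item), spider)
            # keep the spider alive so the freshly scheduled requests get processed
            raise DontCloseSpider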

How to make Scrapy execute callbacks before the start_requests method finishes?

I have a large file of relative urls that I want to scrape with Scrapy, and I've written some code to read this file line-by-line and build requests for my spider to parse. Below is some sample code.
spider:
def start_requests(self):
    with open(self._file) as infile:
        for line in infile:
            inlist = line.replace("\n", "").split(",")
            item = MyItem(data=inlist[0])
            request = scrapy.Request(
                url="http://foo.org/{0}".format(item["data"]),
                callback=self.parse_some_page
            )
            request.meta["item"] = item
            yield request

def parse_some_page(self, response):
    ...
    request = scrapy.Request(
        url="http://foo.org/bar",
        callback=self.parse_some_page2
    )
    yield request
This works fine, but with a large input file, I'm seeing that parse_some_page2 isn't invoked until start_requests finishes yielding all the initial requests. Is there some way I can make Scrapy start invoking the callbacks earlier? Ultimately, I don't want to wait for a million requests before I start seeing items flow through the pipeline.
I came up with 2 solutions. 1) Run spiders in separate processes if there are too many large sites. 2) Use deferreds and callbacks via Twisted (please don't run away, it won't be too scary). I'll discuss how to use the 2nd method because the first one can simply be googled.
Every function that executes yield request will "block" until a result is available. So your parse_some_page() function yields a scrapy Request object and will not go on to the next URL until a response is returned. I did manage to find some sites (mostly foreign government sites) that take a while to fetch and hopefully it simulates a similar situation you're experiencing. Here is a quick and easy example:
# spider/stackoverflow_spider.py
from twisted.internet import defer
import scrapy

class StackOverflow(scrapy.Spider):
    name = 'stackoverflow'

    def start_requests(self):
        urls = [
            'http://www.gob.cl/en/',
            'http://www.thaigov.go.th/en.html',
            'https://www.yahoo.com',
            'https://www.stackoverflow.com',
            'https://swapi.co/',
        ]
        for index, url in enumerate(urls):
            # create callback chain after a response is returned
            deferred = defer.Deferred()
            deferred.addCallback(self.parse_some_page)
            deferred.addCallback(self.write_to_disk, url=url, filenumber=index + 1)
            # add callbacks and errorbacks as needed

            yield scrapy.Request(
                url=url,
                callback=deferred.callback)  # this func will start the callback chain AFTER a response is returned

    def parse_some_page(self, response):
        print('[1] Parsing %s' % (response.url))
        return response.body  # this will be passed to the next callback

    def write_to_disk(self, content, url, filenumber):
        print('[2] Writing %s content to disk' % (url))
        filename = '%d.html' % filenumber
        with open(filename, 'wb') as f:
            f.write(content)
        # return what you want to pass to the next callback function
        # or raise an error and start Errbacks chain
I've changed things slightly to be a bit easier to read and run. The first thing to take note of in start_requests() is that Deferred objects are created and callback functions are being chained (via addCallback()) within the urls loop. Now take a look at the callback parameter for scrapy.Request:
yield scrapy.Request(
    url=url,
    callback=deferred.callback)
What this snippet will do is start the callback chain immediately after scrapy.Response becomes available from the request. In Twisted, Deferreds start running callback chains only after Deferred.callback(result) is executed with a value.
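A stripped-down illustration of that mechanism, outside of Scrapy entirely:

from twisted.internet import defer

def step_one(result):
    print('got %r' % result)
    return result.upper()   # whatever is returned feeds the next callback

def step_two(result):
    print('then got %r' % result)

d = defer.Deferred()
d.addCallback(step_one)
d.addCallback(step_two)

# nothing has run yet; the chain only fires once .callback() supplies a value
d.callback('a response body')
# prints: got 'a response body'
#         then got 'A RESPONSE BODY'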
After a response is provided, the parse_some_page() function will run with the Response as an argument. What you will do is extract whatever you need here and pass it to the next callback (i.e. write_to_disk() in my example). You can add more callbacks to the Deferred in the loop if necessary.
So the difference between this answer and what you did originally is that you used yield to wait for all the responses first, then executed callbacks. Whereas my method uses Deferred.callback() as the callback for each request, such that each response will be processed immediately.
Hopefully this helps (and/or works).
References
Twisted Deferred Reference
Explanation of parse(): Briefly summarizes how yield/return affects parsing.
Non-Blocking Recipes (Klein): A blog post I wrote a while back on async callbacks in Klein/Twisted. Might be helpful to newbies.
PS
I have no clue if this will actually work for you since I couldn't find a site that is too large to parse. Also, I'm brand-spankin' new at Scrapy :D but I have years of Twisted under my belt.

Scrapy: Changing media pipeline download priorities: How to delay media files downloads at the very end of the crawl?

http://doc.scrapy.org/en/latest/topics/media-pipeline.html
When the item reaches the FilesPipeline, the URLs in the file_urls field are scheduled for download using the standard Scrapy scheduler and downloader (which means the scheduler and downloader middlewares are reused), but with a higher priority, processing them before other pages are scraped. The item remains “locked” at that particular pipeline stage until the files have finish downloading (or fail for some reason).
I want to do the exact opposite: Scrape all HTML urls first, then, download all media files at once. How can I do that?
Not an answer, but if you're curious to know how this behavior is implemented, check the MediaPipeline source code, especially the process_item method:
def process_item(self, item, spider):
    info = self.spiderinfo
    requests = arg_to_iter(self.get_media_requests(item, info))
    dlist = [self._process_request(r, info) for r in requests]
    dfd = DeferredList(dlist, consumeErrors=1)
    return dfd.addCallback(self.item_completed, item, info)
You see that a bunch of requests are queued, to be processed (request sent + response downloaded) BEFORE item_completed is eventually called, returning the original item + the downloaded media info.
In the nominal case, requests generated by the MediaPipeline subclasses will be sent for download immediately by using crawler.engine.download directly:
(...)
    else:
        request.meta['handle_httpstatus_all'] = True
        dfd = self.crawler.engine.download(request, info.spider)
        dfd.addCallbacks(
            callback=self.media_downloaded, callbackArgs=(request, info),
            errback=self.media_failed, errbackArgs=(request, info))
        return dfd

Where is a Response transformed into one of its subclasses?

I'm trying to write a downloader middleware that ignores responses that don't have a pre-defined element. However, I can't use the css method of the HtmlResponse class inside the middleware because, at that point, the response's type is just Response. When it reaches the spider it's an HtmlResponse, but then it's too late because I can't perform certain actions to the middleware state.
Where is the response's final type set?
Without seeing the code of your middleware it is hard to tell what the matter is.
However my middleware below gets an HtmlResponse object:
class FilterMiddleware(object):
    def process_response(self, request, response, spider):
        print response.__class__
        print type(response)
        return response
Both print statements verify this:
<class 'scrapy.http.response.html.HtmlResponse'>
<class 'scrapy.http.response.html.HtmlResponse'>
And I can use the css method on the response without any exception. The order of the middleware in the settings.py does not matter either: with 10, 100 or 500 I get the same result as above.
However, if I configure the middleware order to 590 or above, I get a plain old Response object. And this is because the conversion happens in the HttpCompressionMiddleware class on line 35 in the current version.
To solve your issue, order your middleware somewhere later in the chain (with a lower order number), or convert the response yourself (I would not do this, however).
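If you do decide to convert it yourself, a sketch (untested) using Scrapy's response type machinery, which picks the most specific Response subclass from the headers, URL and body:

from scrapy.responsetypes import responsetypes

class FilterMiddleware(object):
    def process_response(self, request, response, spider):
        # sniff the appropriate Response subclass (e.g. HtmlResponse)
        respcls = responsetypes.from_args(
            headers=response.headers, url=response.url, body=response.body)
        if not isinstance(response, respcls):
            response = response.replace(cls=respcls)
        # response.css(...) is now available when the body is HTML
        return response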