Insert Record to BigQuery or some RDB during API Call

I am writing a REST API GET endpoint that needs to both return a response and store records to either BigQuery or GCP Cloud SQL (MySQL), but I don't want the response to depend on the record writes completing. Basically, my code will look like:
def predict():
    req = request.json.get("instances")
    resp = make_response(req)
    write_to_bq(req)
    write_to_bq(resp)
    return resp
Is there any easy way to do this with Cloud SQL Client Library or something?

Turns out Flask has functionality that does what I require:
@app.route("/predict", methods=["GET"])
def predict():
    # do some stuff with the request.json object
    return jsonify(response)

@app.after_request
def after_request_func(response):
    # do anything you want that relies on context of predict()

    @response.call_on_close
    def persist():
        # this will happen after the response is sent,
        # so even if this function fails, predict()
        # will still get its response out
        write_to_db()

    return response
One important thing is that a function decorated with after_request must take an argument and return something of type flask.Response. Also, as far as I can tell, the function decorated with call_on_close cannot access the request context of the main view, so anything you need from the main view has to be captured inside the after_request function but outside (above) the call_on_close function.
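To make the "capture it outside call_on_close" point concrete, here is a minimal sketch (not the asker's actual code) that passes data from the view to the deferred write via flask.g; write_to_db and the payload shape are placeholders:

from flask import Flask, g, jsonify, request

app = Flask(__name__)


def write_to_db(payload):
    # stand-in for the real Cloud SQL / BigQuery write
    print("persisting:", payload)


@app.route("/predict", methods=["GET"])
def predict():
    req = request.json.get("instances")
    resp = jsonify(req)
    g.payload = req  # stash whatever the deferred write will need
    return resp


@app.after_request
def after_request_func(response):
    # capture from the request context HERE, outside call_on_close
    payload = g.get("payload")

    @response.call_on_close
    def persist():
        # runs after the response body has been sent to the client
        if payload is not None:
            write_to_db(payload)

    return response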

Related

Which request to use to fetch data from database based on some data sent?

I am using django-rest-framework's genericAPIViews
I want to send some data from the front end to the backend, and depending on the data sent, Django should query a model and return some data to the frontend. The data sent is protected and thus can't be attached to the URL, so a GET request can't be used. I am not manipulating the database, just querying it and returning a response (a typical GET use case).
Now in DRF's genericAPIViews, I can't find a view which does this:
As can be seen from Tom Christie's GitHub page, only 2 views have a post handler:
CreateAPIView: return self.create()
ListCreateAPIView: return self.create()
As can be seen, both these views have post methods which create entries in the database, which I don't want. Is there a built-in class which does my job, or should I use generics.GenericAPIView and write my own post handler?
Currently I am using generic.View which has post(self, request, *args, **kwargs)
I think you have a few options to choose from. One way is to use a ModelViewSet which could be quite useful because of how it nicely handles the communication between views, serializers and models. Here is a link to django-rest-framework ModelViewSet docs.
These are the actions that it provides by default (since it inherits from GenericAPIView):
.list(), .retrieve(), .create(), .update(), .partial_update(), .destroy().
If you don't want all of them you could specify which methods you want by doing the following:
class AppViewSet(viewsets.ModelViewSet):
    queryset = App.objects.all()
    serializer_class = AppSerializer
    http_method_names = ['get', 'post', 'head']
Note: http_method_names seems to work from Django >= 1.8 onwards.
Source: Disable a method in a ViewSet, django-rest-framework
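If you would rather not expose the write actions at all, the other option from the question, a plain generics.GenericAPIView with a hand-written post() handler that only queries, is also workable. A rough sketch, reusing the App/AppSerializer names from the snippet above and a made-up "lookup" field:

from rest_framework import generics, status
from rest_framework.response import Response

from .models import App                    # assumed module layout
from .serializers import AppSerializer     # assumed module layout


class AppQueryView(generics.GenericAPIView):
    queryset = App.objects.all()
    serializer_class = AppSerializer

    def post(self, request, *args, **kwargs):
        # the protected data travels in the request body, not the URL
        lookup_value = request.data.get("lookup")                 # hypothetical field name
        matches = self.get_queryset().filter(name=lookup_value)   # 'name' is an assumed model field
        serializer = self.get_serializer(matches, many=True)
        return Response(serializer.data, status=status.HTTP_200_OK)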

How to make Scrapy execute callbacks before the start_requests method finishes?

I have a large file of relative urls that I want to scrape with Scrapy, and I've written some code to read this file line-by-line and build requests for my spider to parse. Below is some sample code.
spider:
def start_requests(self):
    with open(self._file) as infile:
        for line in infile:
            inlist = line.replace("\n", "").split(",")
            item = MyItem(data=inlist[0])
            request = scrapy.Request(
                url="http://foo.org/{0}".format(item["data"]),
                callback=self.parse_some_page
            )
            request.meta["item"] = item
            yield request

def parse_some_page(self, response):
    ...
    request = scrapy.Request(
        url="http://foo.org/bar",
        callback=self.parse_some_page2
    )
    yield request
This works fine, but with a large input file, I'm seeing that parse_some_page2 isn't invoked until start_requests finishes yielding all the initial requests. Is there some way I can make Scrapy start invoking the callbacks earlier? Ultimately, I don't want to wait for a million requests before I start seeing items flow through the pipeline.
I came up with 2 solutions. 1) Run spiders in separate processes if there are too many large sites. 2) Use deferreds and callbacks via Twisted (please don't run away, it won't be too scary). I'll discuss how to use the 2nd method because the first one can simply be googled.
Every function that executes yield request will "block" until a result is available, so your parse_some_page() function yields a scrapy Request object and will not go on to the next URL until a response is returned. I did manage to find some sites (mostly foreign government sites) that take a while to fetch, and hopefully that simulates a situation similar to the one you're experiencing. Here is a quick and easy example:
# spider/stackoverflow_spider.py
from twisted.internet import defer
import scrapy


class StackOverflow(scrapy.Spider):
    name = 'stackoverflow'

    def start_requests(self):
        urls = [
            'http://www.gob.cl/en/',
            'http://www.thaigov.go.th/en.html',
            'https://www.yahoo.com',
            'https://www.stackoverflow.com',
            'https://swapi.co/',
        ]
        for index, url in enumerate(urls):
            # create callback chain after a response is returned
            deferred = defer.Deferred()
            deferred.addCallback(self.parse_some_page)
            deferred.addCallback(self.write_to_disk, url=url, filenumber=index + 1)
            # add callbacks and errbacks as needed
            yield scrapy.Request(
                url=url,
                callback=deferred.callback)  # this will start the callback chain AFTER a response is returned

    def parse_some_page(self, response):
        print('[1] Parsing %s' % (response.url))
        return response.body  # this will be passed to the next callback

    def write_to_disk(self, content, url, filenumber):
        print('[2] Writing %s content to disk' % (url))
        filename = '%d.html' % filenumber
        with open(filename, 'wb') as f:
            f.write(content)
        # return what you want to pass to the next callback function
        # or raise an error and start the errback chain
I've changed things slightly to be a bit easier to read and run. The first thing to take note of in start_requests() is that Deferred objects are created and callback functions are being chained (via addCallback()) within the urls loop. Now take a look at the callback parameter for scrapy.Request:
yield scrapy.Request(
    url=url,
    callback=deferred.callback)
What this snippet does is start the callback chain as soon as the scrapy Response becomes available for the request. In Twisted, Deferreds only start running their callback chains after Deferred.callback(result) is executed with a value.
After a response is provided, the parse_some_page() function will run with the Response as an argument. What you do there is extract whatever you need and pass it to the next callback (i.e. write_to_disk() in my example). You can add more callbacks to the Deferred in the loop if necessary.
So the difference between this answer and what you did originally is that you used yield to wait for all the responses first and then executed callbacks, whereas my method uses Deferred.callback() as the callback for each request so that each response is processed immediately.
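If the Deferred mechanics are new to you, this tiny Twisted-only sketch (no Scrapy involved) shows the same idea in isolation: the chain is built up front, but nothing runs until .callback() fires, which is what Scrapy effectively does when it invokes callback=deferred.callback with the downloaded response:

from twisted.internet import defer


def parse(result):
    print('parsing %r' % result)
    return result.upper()          # passed on to the next callback


def write(result):
    print('writing %r' % result)


d = defer.Deferred()
d.addCallback(parse)
d.addCallback(write)
print('chain built, nothing has run yet')

# In the spider above, Scrapy fires the chain for us by calling
# deferred.callback(response) once the response arrives.
d.callback('response body')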
Hopefully this helps (and/or works).
References
Twisted Deferred Reference
Explanation of parse(): Briefly summarizes how yield/return affects parsing.
Non-Blocking Recipes (Klein): A blog post I wrote a while back on async callbacks in Klein/Twisted. Might be helpful to newbies.
PS
I have no clue if this will actually work for you since I couldn't find a site that is too large to parse. Also, I'm brand-spankin' new at Scrapy :D but I have years of Twisted under my belt.

Django-Rest-Framework: How to Document GET-less Endpoint?

My co-worker implemented an API that only allows GET requests with an ID parameter (so I can GET /foo/5 but can't GET /foo/). If I try to access the API's endpoint without providing an ID parameter, it (correctly) throws an unimplemented exception.
I want to fix this endpoint to show its documentation when viewed, without an ID, over the web. However, I still want it to throw an exception when that endpoint is accessed programmatically.
As I remember it, django-rest-framework is capable of distinguishing those two cases (via request headers), but I'm not sure how to define the endpoint such that it returns either documentation HTML or an exception as appropriate.
Can anyone help provide the pattern for this?
Based on the description, I would guess that the endpoint is a function-based view which is registered on a route where it listens for GET requests WITH parameters. I would suggest registering another route where you listen for GET requests without parameters...
from rest_framework.decorators import api_view
from rest_framework.response import Response
from rest_framework import status


@api_view(['GET'])
def existing_get_item_api(request, item_id, *args, **kwargs):
    # query and return the item here ...
    pass


@api_view(['GET'])
def get_help(request, *args, **kwargs):
    # compose the help
    return Response(data=help, status=status.HTTP_200_OK)


# somewhere in urls.py
urlpatterns = [
    url(r'api/items/(?P<item_id>[0-9]+)/', existing_get_item_api),
    url(r'api/items/', get_help),
]
Let me know how this works out for you.
We can use ModelViewSets and routers for this implementation.
viewsets.py
class AccountViewSet(viewsets.ModelViewSet):
    """
    A simple ViewSet for viewing and editing accounts.
    """
    http_method_names = ['get']  # values must be lowercase
    queryset = Account.objects.all()
    serializer_class = AccountSerializer
routers.py
from rest_framework import routers
router = routers.SimpleRouter()
router.register(r'accounts', AccountViewSet)
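To actually expose the registered routes, the router's URLs still need to be pulled into the URL conf; a minimal sketch, with the module path and prefix assumed:

# urls.py (sketch)
from django.conf.urls import include, url

from .routers import router

urlpatterns = [
    url(r'^api/', include(router.urls)),
]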

Where is a Response transformed into one of its subclasses?

I'm trying to write a downloader middleware that ignores responses that don't have a pre-defined element. However, I can't use the css method of the HtmlResponse class inside the middleware because, at that point, the response's type is just Response. When it reaches the spider it's an HtmlResponse, but then it's too late because I can't perform certain actions to the middleware state.
Where is the response's final type set?
Without seeing the code of your middleware it is hard to tell what the problem is.
However my middleware below gets an HtmlResponse object:
class FilterMiddleware(object):
    def process_response(self, request, response, spider):
        print response.__class__
        print type(response)
        return response
Both print statements verify this:
<class 'scrapy.http.response.html.HtmlResponse'>
<class 'scrapy.http.response.html.HtmlResponse'>
And I can use the css method on the response without any exception. The order of the middleware in the settings.py does not matter either: with 10, 100 or 500 I get the same result as above.
However if I configure the middleware to 590 or above I get plain old Response object. And this is because the conversion happens in the HttpCompressionMiddleware class on line 35 in the current version.
To solve your issue, order your middleware later in the pipeline (with a lower order number), or convert the response yourself (I would not do that, however).
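For the first option, the ordering is just a number in settings.py; a sketch, assuming the FilterMiddleware above lives in myproject.middlewares (any value below HttpCompressionMiddleware's default of 590 should do):

# settings.py (sketch; the module path is assumed)
# process_response() hooks run in decreasing order of these numbers, so a
# value below HttpCompressionMiddleware's 590 means FilterMiddleware sees
# the already-converted HtmlResponse.
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.FilterMiddleware': 500,
}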

Persist items using a POST request within a Pipeline

I want to persist items within a Pipeline posting them to a url.
I am using this code within the Pipeline
class XPipeline(object):
    def process_item(self, item, spider):
        log.msg('in SpotifylistPipeline', level=log.DEBUG)
        yield FormRequest(url="http://www.example.com/additem",
                          formdata={'title': item['title'],
                                    'link': item['link'],
                                    'description': item['description']})
but it seems it's not making the http request.
Is it possible to make http request from pipelines? If not, do I have to do it in the Spider?
Do I need to specify a callback function? If so, which one?
If I can make the http call, can I check the response (JSON) and return the item if everything went ok, or discard the item if it didn't get saved?
As a final thing, is there a diagram that explains the flow Scrapy follows from beginning to end? I am getting slightly lost about what gets called when. For instance, if Pipelines returned items to Spiders, what would Spiders do with those items? What happens after a Pipeline call?
Many thanks in advance
Migsy
You can inherit your pipeline from scrapy.contrib.pipeline.media.MediaPipeline and yield Requests in get_media_requests(); responses are passed into the media_downloaded() callback (there is a sketch at the end of this answer).
Quote:
This method is called for every item pipeline component and must
either return a Item (or any descendant class) object or raise a
DropItem exception. Dropped items are no longer processed by further
pipeline components.
So, only spider can yield a request with a callback.
Pipelines are used for processing items.
You'd better describe what you want to achieve.
is there a diagram that explains the flow that Scrapy follows from beginning to end
Architecture overview
For instance, if Pipelines returned items to Spiders
Pipelines do not return items to spiders. The items returned are passed to the next pipeline.
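For completeness, a hedged sketch of the MediaPipeline route mentioned at the top of this answer, using the names from the question; the exact method signatures vary between Scrapy versions (newer releases import it from scrapy.pipelines.media instead of scrapy.contrib.pipeline.media):

from scrapy.http import FormRequest
from scrapy.pipelines.media import MediaPipeline  # scrapy.contrib.pipeline.media in old versions


class XPipeline(MediaPipeline):

    def get_media_requests(self, item, info):
        # requests yielded here are scheduled and downloaded by Scrapy itself
        yield FormRequest(
            url="http://www.example.com/additem",
            formdata={'title': item['title'],
                      'link': item['link'],
                      'description': item['description']})

    def media_downloaded(self, response, request, info):
        # called with the response of each request from get_media_requests()
        return response.status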
This could be done easily by using the requests library. If you don't want to use another library then look into urllib2.
import requests
from scrapy.exceptions import DropItem


class XPipeline(object):
    def process_item(self, item, spider):
        r = requests.post("http://www.example.com/additem",
                          data={'title': item['title'],
                                'link': item['link'],
                                'description': item['description']})
        if r.status_code == 200:
            return item
        else:
            raise DropItem("Failed to post item with title %s." % item['title'])