Best way to cache RESTful API results of GET calls

I'm thinking about the best way to create a cache layer in front of, or as the first layer of, GET requests to my RESTful API (written in Ruby).
Not every request can be cached, because even for some GET requests the API has to validate the requesting user / application. That means I need to configure which requests are cacheable and how long each cached answer stays valid. For a few cases I need a very short expiration time, e.g. 15 s and below. And the API application should be able to expire cache entries even if their expiration date is not reached yet.
I already thought about many possible solutions; my two best ideas:
a first layer inside the API (even before the routing), with the cache logic written by myself (so I have all configuration options in my hands), answers and expiration dates stored in Memcached
a web server proxy (highly configurable), perhaps something like Squid, but I have never used a proxy for a case like this before and I'm absolutely not sure about it
I also thought about a cache solution like Varnish; I have used Varnish for "usual" web applications and it's impressive, but the configuration is kind of special. Still, I would use it if it turned out to be the fastest solution.
Another thought was to cache in the Solr index, which I'm already using in the data layer so that most requests don't hit the database.
If someone has a hint or good sources to read about this topic, let me know.

Firstly, build your RESTful API to be RESTful. That means authenticated users can also get cached content: to keep all state in the URL, it needs to contain the auth details. Of course the hit rate will be lower here, but it is cacheable.
With a good number of logged-in users it is very beneficial to have some sort of model cache behind a full-page cache, since many models are still shared even if some aren't (in a good OOP structure).
Then, for a full-page cache, you are best off keeping all requests away from the web server and especially away from the dynamic processing in the next step (in your case Ruby). The fastest way to cache full pages from a normal web server is always a caching proxy in front of the web servers.
Varnish is, in my opinion, as good and easy as it gets, but some indeed prefer Squid.
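To make that concrete, here is a minimal sketch of the application side of such a setup, assuming a caching proxy (Varnish, Squid, ...) sits in front of the app. Flask is used purely for illustration (the OP's API is in Ruby), and the endpoints and payloads are made up:

import flask
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/events")
def events():
    # Public, cacheable answer: the proxy may serve it to everyone
    # for 15 seconds (the short TTL mentioned in the question).
    response = jsonify({"events": []})  # placeholder payload
    response.headers["Cache-Control"] = "public, max-age=15"
    return response

@app.route("/me")
def me():
    # Per-user answer: a shared proxy cache must never store it.
    response = jsonify({"user": "alice"})  # placeholder payload
    response.headers["Cache-Control"] = "private, no-store"
    return response

The proxy then only forwards requests it cannot answer itself, which is exactly the "keep requests away from the dynamic processing" point above.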

memcached is a great option, and I see you already mentioned it as a possible option. Redis also seems to be praised a lot as another option at this level.
On an application level, in terms of a more granular, file-by-file and/or module-by-module approach, local storage is always an option for common objects a user may request over and over again. It can be as simple as dropping response objects into the session so they can be reused, instead of making another HTTP REST call, with appropriate coding.
People go back and forth debating Varnish vs. Squid, and both seem to have their pros and cons, so I can't comment on which one is better, but many people say Varnish in front of a tuned Apache server is great for dynamic websites.
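As a hedged sketch of that application-level pattern in Python (pymemcache is used as the memcached client; get_events and the key name are hypothetical stand-ins for the real data layer):

import json
from pymemcache.client.base import Client

client = Client(("localhost", 11211))

def get_events():
    # hypothetical stand-in for the expensive data-layer call
    return [{"id": 1, "name": "example"}]

def cached_events():
    key = "api:v1:events"
    raw = client.get(key)
    if raw is not None:
        return json.loads(raw)  # cache hit: skip the expensive call
    data = get_events()
    client.set(key, json.dumps(data), expire=15)  # short TTL, as in the question
    return data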

Since REST is an HTTP thing, it could be that the best way of caching requests is to use HTTP caching.
Look into using ETags on your responses, checking the ETag in requests to reply with '304 Not Modified', and having Rack::Cache serve cached data when the ETags match. This works great for cache-control 'public' content.
Rack::Cache is best configured to use memcache for its storage needs.
I wrote a blog post last week about the interesting way that Rack::Cache uses ETags to detect and return cached content to new clients: http://blog.craz8.com/articles/2012/12/19/rack-cache-and-etags-for-even-faster-rails
Even if you're not using Rails, the Rack middleware tools are quite good for this stuff.
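Rack::Cache itself is Ruby middleware, but the ETag conversation it relies on is plain HTTP. Here is a rough Python/Flask sketch of that mechanism (the endpoint and payload are made up):

import hashlib
from flask import Flask, request

app = Flask(__name__)

@app.route("/resource")
def resource():
    body = '{"hello": "world"}'  # placeholder payload
    etag = hashlib.md5(body.encode()).hexdigest()
    # Client already holds this version: reply 304 with an empty body.
    if etag in request.if_none_match:
        return "", 304
    response = app.response_class(body, mimetype="application/json")
    response.set_etag(etag)
    response.headers["Cache-Control"] = "public, max-age=60"
    return response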

Redis Cache is the best option here.
It is open source: an advanced key-value cache and store.

I’ve used redis successfully this way in my REST view:
from django.conf import settings
from django.utils.encoding import force_bytes
from redis import StrictRedis
from rest_framework.generics import ListAPIView
from rest_framework.permissions import IsAdminUser
from rest_framework.renderers import JSONRenderer
from rest_framework.response import Response
import hashlib
import json
import logging

# Event, EventSerializer and get_key_from_request come from the author's app
# and are assumed to be importable here.
logger = logging.getLogger(__name__)

def get_redis():
    # get redis connection from RQ config in settings
    rc = settings.RQ_QUEUES['default']
    return StrictRedis(host=rc['HOST'], port=rc['PORT'], db=rc['DB'])

class EventList(ListAPIView):
    queryset = Event.objects.all()
    serializer_class = EventSerializer
    renderer_classes = (JSONRenderer, )

    def get(self, request, format=None):
        if IsAdminUser not in self.permission_classes:  # don't cache requests from admins
            # make a key that represents the request results you want to cache;
            # your requirements may vary
            key = get_key_from_request()
            # I find it useful to hash the key once query params are added.
            # I also preface the event cache key with a string, so I can
            # clear the cache when events are changed.
            key = "todaysevents" + hashlib.md5(force_bytes(key)).hexdigest()
            # I don't want any cache issue (such as not being able to connect
            # to redis) to affect my end users, so I protect this section
            try:
                cache = get_redis()
                data = cache.get(key)
                if not data:
                    # not cached, so perform the standard REST functions for this view
                    queryset = self.filter_queryset(self.get_queryset())
                    serializer = self.get_serializer(queryset, many=True)
                    data = serializer.data
                    # cache the data as a string
                    cache.set(key, json.dumps(data))
                    # manage the expiration of the cache (two hours)
                    expire = 60 * 60 * 2
                    cache.expire(key, expire)
                else:
                    # this is the place where you save all the time:
                    # just return the cached data
                    data = json.loads(data)
                return Response(data)
            except Exception as e:
                logger.exception("Error accessing event cache\n %s" % e)
        # for Admins or exceptions, business as usual
        return super(EventList, self).get(request, format)
In my Event model updates, I clear any event caches. This is hardly ever performed (only Admins create events, and not that often), so I always clear all event caches:
class Event(models.Model):
    ...

    def clear_cache(self):
        try:
            cache = get_redis()
            eventkey = "todaysevents"
            for key in cache.scan_iter("%s*" % eventkey):
                cache.delete(key)
        except Exception:
            pass

    def save(self, *args, **kwargs):
        self.clear_cache()
        return super(Event, self).save(*args, **kwargs)

Related

Can caching an API based on the hash of the whole URL be a potential threat?

I am adding caching to an API server. The caching implementation of the framework I am using simply hashes the whole URL of the request and uses it as a cache key (along with some other data like the language, etc.).
But with this simple scheme I can add fake query parameters to the URL with arbitrary values and my system will also cache those requests. For example:
GET https://example.com/apicall # Cached
GET https://example.com/apicall?fake=1 # Also cached
GET https://example.com/apicall?fake=2 # Also cached!
Is this something I should worry about? That people can very easily fill my cache with junk entries that aren't used? Or am I exaggerating the potential impact of this?
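To illustrate the scheme being described, here is a hedged Python sketch: hashing the whole URL gives every fake variant its own key, while rebuilding the key from a whitelist of known parameters (a common mitigation; an assumption on my part, not something stated above) collapses them:

import hashlib
from urllib.parse import urlencode, urlparse, parse_qsl

def cache_key(url):
    # the scheme described in the question: hash the whole URL
    return hashlib.sha256(url.encode()).hexdigest()

# every fake variant gets its own cache entry
print(cache_key("https://example.com/apicall"))
print(cache_key("https://example.com/apicall?fake=1"))
print(cache_key("https://example.com/apicall?fake=2"))

def whitelisted_key(url, allowed=("page", "lang")):
    # hypothetical mitigation: keep only parameters the endpoint understands
    parts = urlparse(url)
    params = [(k, v) for k, v in parse_qsl(parts.query) if k in allowed]
    canonical = parts.path + "?" + urlencode(sorted(params))
    return hashlib.sha256(canonical.encode()).hexdigest()

# all fake variants now collapse to a single cache entry
assert (whitelisted_key("https://example.com/apicall?fake=1")
        == whitelisted_key("https://example.com/apicall?fake=2"))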

API design for machine learning models

We are using Django REST framework to build an API web service that contains many already-trained machine learning models. Some models can predict on a batch size of 1, i.e. one image at a time. Others need a history of data (timelines) to be able to predict/forecast. Usually these timelines can hardly fit in, and be passed as, a parameter. Given that, we want to give the requester the ability to request by either:
sending the data (small batches) to predict on as a parameter, or
passing a database id/reference as a parameter; the API will then query the database and make the predictions.
So the question is: what would be the best API design for identifying which approach the requester chose? Some considered approaches:
Add /db to the path of the endpoint, e.g. POST models/<X>/db. The problem with this approach is that two endpoints are generated for each model.
Add a boolean parameter db to each request. The problem with this approach is that it adds additional overhead to each request just to check which approach was chosen. It also makes the code less readable.
A global setting stored for each requester when they sign up for the API token. The problem is that it restricts the requester to one mode, which is not convenient.
What would be the best approach for this case?
The fact that you currently have more than one source would cause me to seriously consider abstracting the "source" component as much as possible, to allow all manner of sources. For example, suppose future users would like to pull data out of a MongoDB instead of whatever db you currently are using? Or from some other storage structure? Or pull from a third party? Or, or, or....
In any case the question is now "how much do they all have in common, and what should they all implement?"
class Source(object):
    def __get_batch__(self, batch_size=1):
        raise NotImplementedError()  # each source needs to implement this on its own

# @http_library.POST_endpoint("/db")  (hypothetical routing decorator)
class DBSource(Source):
    def __init__(self, post_data):
        if post_data["table"] in ["data1", "data2"]:
            self.table = post_data["table"]
        else:
            raise Exception("Must use predefined table to prevent SQL injection")

    def __get_batch__(self, batch_size=1):
        return sql_library.query("SELECT * FROM {} LIMIT ?".format(self.table), batch_size)

# @http_library.POST_endpoint("/local")  (hypothetical routing decorator)
class LocalSource(Source):
    def __init__(self, post_data):
        self.data = post_data["data"]
        self.i = 0

    def __get_batch__(self, batch_size=1):
        batch = self.data[self.i : self.i + batch_size]
        self.i += batch_size
        return batch
This is just an example. However, if a fixed part of your path designates "the source", then you have left yourself open to scale this indefinitely.
Add /db to the path of the endpoint, e.g. POST models/<X>/db. The problem with this approach is that two endpoints are generated for each model.
Inevitable. DRY out common code to sub-methods.
Add a boolean parameter db to each request. The problem with this approach is that it adds additional overhead to each request just to check which approach was chosen. It also makes the code less readable.
There won't be any additional overhead (that's what your underlying framework does to match a URL to a function/method anyway). However, these are two separate functionalities; I would keep them separate, so I would prefer the first approach.
A global setting stored for each requester when they sign up for the API token. The problem is that it restricts the requester to one mode, which is not convenient.
Yikes! Unless you provide a UI letting the user select his preference and apply it globally, I don't think any UX designer will agree to that.
That being said, the API design should be driven by asking who masters (or owns) the data. If it's the application, and the user already knows the ID of that entity, then you shouldn't be asking the user for the data.
If it's the user, and the data won't fit in a POST body, then I would say a real-time API may not be the right solution; think about message queues / pub-sub based systems.
If you need a hybrid solution, as you asked in the question, then I would prefer the first approach.

Optimize API call in Symfony

How do I optimize an API call in Symfony?
I call it with the Guzzle bundle, but in some situations the call takes very long.
The client application calls a function on the server.
The server application extracts the objects from the database and sends them back to the client.
The client creates the new object with properties from the server response.
One of the ways to improve your API calls is to use caching. In Symfony there are many different ways to achieve this; I can show you one of them (a PhpFileCache example):
In services.yml, create the cache service:
your_app.cache_provider:
    class: Doctrine\Common\Cache\PhpFileCache
    arguments: ["%kernel.cache_dir%/path_to/your_cache_dir", ".your.cached_file_name.php"]
(Remember, you need the Doctrine extension in your app for this to work.)
Then pass your caching service your_app.cache_provider to any service where you need caching.
Again in your services.yml:
some_service_of_yours:
    class: AppBundle\Services\YourService
    arguments: ['@your_app.cache_provider']
Finally, in your service (where you want to perform API caching):
use Doctrine\Common\Cache\CacheProvider;

class YourService
{
    private $cache;

    public function __construct(CacheProvider $cache)
    {
        $this->cache = $cache;
    }

    public function makeApiRequest()
    {
        $key = 'some_unique_identifier_of_your_cache_record';
        if (false !== ($cached = $this->cache->fetch($key))) {
            // cache hit: skip the API call entirely
            return unserialize($cached);
        }
        // $provider stands for whatever client actually performs the HTTP call
        $data = $provider->makeActualApiCallHere('http://some_url');
        // 10800 is the number of seconds to store the data in the cache
        // before it is invalidated; change it to your needs
        $this->cache->save($key, serialize($data), 10800);
        return $data; // now you can use the data
    }
}
This is quite a GENERIC example; you should adapt it to your exact needs, but the idea is simple: you can cache data and avoid unnecessary API calls to speed things up. Be careful though, because a cache has the drawback of presenting stale (obsolete) data. Some things can (and should) be cached, but some things shouldn't.
If you control the server
You should put a caching reverse proxy like Varnish on top of your PHP server. The PHP app must send HTTP cache headers to tell the proxy how long it must cache the request. Alternatively, you can use a library like FOSHttpCache to set up a cache invalidation strategy (the PHP server will purge the cache from the proxy when an update of the data occurs; it's a more advanced and complex scenario).
The PHP server will not even be called if the requested resource is in the reverse proxy cache.
You should also use a profiler like Blackfire.io or xhprof to find out why some parts of your PHP code (or your SQL queries) take so much time to execute, then optimize.
If you control the client
You can use this HTTP cache middleware for Guzzle to cache every API result according to HTTP headers sent by the API.
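The middleware linked above is for Guzzle (PHP). As a rough Python analogue of the same client-side idea, the requests-cache library can cache every API result according to the HTTP headers the API sends (the URL below is a placeholder):

from requests_cache import CachedSession

session = CachedSession(
    "api_cache",          # backing store name (an SQLite file by default)
    cache_control=True,   # honor Cache-Control headers sent by the API
    expire_after=3600,    # fallback TTL when the API sends no cache headers
)

response = session.get("https://api.example.com/resource")
print(response.from_cache)  # False on the first call, True on repeats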

Define custom load balancing algorithm

Here is the situation:
I have a number of web servers, say 10. I need to use a (software) load balancer, which can be implemented using a reverse proxy server like HAProxy or Varnish. Now, all the traffic that we serve is over HTTPS, not HTTP, so Varnish is out of the question.
Now, I want to divide the users' requests into a few categories depending on one of the input (POST) parameters of the request. Based on that parameter I need to divide the requests among the servers, because (even if all other input (POST) parameters are the same) different servers would serve them differently.
So I need to define a custom load-balancing algorithm such that, for a particular value of that parameter, I send the load to a specific 3 servers (say); for some other value, to a specific 2; and for other value(s), to the remaining 5.
Since I cannot use Varnish, as it cannot be used to terminate SSL (defining a custom algorithm would have been easy in VCL), I am thinking of using HAProxy.
So, here is the question:
Can anyone help me with how to define a custom load-balancing function using HAProxy?
I've researched a lot and could not find any document on how to do it. So, if it is not possible with HAProxy, can you refer me to some other reverse-proxy service that can be used as a load balancer too, and that meets both of the above criteria (SSL termination and the ability to define custom load balancing)?
EDIT:
This question is a follow-up to one of my previous questions: Varnish to be used for https
I'm not sure what your goal is, but I'd suggest NOT doing custom routing based on the HTTP request body at all. It will perform very poorly, and the cost will likely outweigh any benefit you are trying to achieve.
Anything that has to parse values beyond the typical HTTP headers at your load balancer will slow things down. Even cookies by themselves are generally a bad idea if you can avoid them.
If you can control the path/route values, that is likely a much better idea than parsing every POST for certain values.
You can probably achieve what you want via NGINX with Lua scripts (the Kong platform is based on them), but I can't say how hard that would be for you...
https://github.com/openresty/lua-nginx-module#readme
Here's an article with a specific example of setting different upstreams based on Lua input:
http://sosedoff.com/2012/06/11/dynamic-nginx-upstreams-with-lua-and-redis.html
server {
    ...BASE CONFIG HERE...
    port_in_redirect off;

    location /somepath {
        lua_need_request_body on;
        set $upstream "default.server.hostname";
        rewrite_by_lua '
            ngx.req.read_body()  -- explicitly read the req body
            local data = ngx.req.get_body_data()
            if data then
                -- use data: see
                -- https://github.com/openresty/lua-nginx-module#ngxreqget_body_data
                ngx.var.upstream = some_deterministic_value
            end
        ';
        ...OTHER PARAMS...
        proxy_pass http://$upstream;
    }
}
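To make the routing concrete, here is a hedged plain-Python sketch of the deterministic mapping that the rewrite_by_lua block above would implement, matching the question's split of a specific 3 / a specific 2 / the remaining 5 servers (pool and server names are made up):

import hashlib

POOLS = {
    "category_a": ["srv1", "srv2", "srv3"],   # specific 3 servers
    "category_b": ["srv4", "srv5"],           # specific 2 servers
}
DEFAULT_POOL = ["srv6", "srv7", "srv8", "srv9", "srv10"]  # remaining 5

def pick_upstream(param_value, client_key):
    # choose the pool from the POST parameter, then spread load inside it;
    # a stable hash keeps the same client on the same server within the pool
    pool = POOLS.get(param_value, DEFAULT_POOL)
    idx = int(hashlib.md5(client_key.encode()).hexdigest(), 16) % len(pool)
    return pool[idx]

print(pick_upstream("category_a", "client-42"))  # one of srv1..srv3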

Server with the sole purpose of setting cookies

At work we ran up against the problem of setting server-side cookies, a lot of them. Right now we have a PHP script whose sole purpose is to set a cookie on the client for our domain. This happens a lot more often than 'normal' requests to the server (which is running an app), so we've discussed moving it to its own server. This would be an Apache server, probably dedicated, with one PHP script three lines long, just running over and over again.
Surely there must be a faster, better way of doing this than starting up the whole PHP environment. Basically, I need something super simple that can sit around all day/night doing the following:
Check if a certain cookie is set, and
If that cookie is not set, fill it with a random hash (right now it's a simple md5(microtime))
Any suggestions?
You could create a simple HTTP server yourself to accept requests and return the Set-Cookie header with an empty body. This would allow you to move the cookie-generation overhead to wherever you see fit.
I echo the sentiments above, though: unless cookie generation is significantly expensive, I don't think you will gain much by moving away from your current setup.
By way of an example, here is an extremely simple server written with Tornado that simply sets a cookie on GET or HEAD requests to '/'. It includes an async example listening on '/async', which may be of use depending on what you are doing to get your cookie value.
import time
import tornado.ioloop
import tornado.web

class CookieHandler(tornado.web.RequestHandler):
    def get(self):
        cookie_value = str(time.time())
        self.set_cookie('a_nice_cookie', cookie_value, expires_days=10)
        # self.set_secure_cookie('a_double_choc_cookie', cookie_value)
        self.finish()

    def head(self):
        return self.get()

class AsyncCookieHandler(tornado.web.RequestHandler):
    @tornado.web.asynchronous
    def get(self):
        self._calculate_cookie_value(self._on_create_cookie)

    @tornado.web.asynchronous
    def head(self):
        self._calculate_cookie_value(self._on_create_cookie)

    def _on_create_cookie(self, cookie_value):
        self.set_cookie('double_choc_cookie', cookie_value, expires_days=10)
        self.finish()

    def _calculate_cookie_value(self, callback):
        # meaningless async example... just wastes 2 seconds
        def _fake_expensive_op():
            val = str(time.time())
            callback(val)
        tornado.ioloop.IOLoop.instance().add_timeout(time.time() + 2, _fake_expensive_op)

application = tornado.web.Application([
    (r"/", CookieHandler),
    (r"/async", AsyncCookieHandler),
])

if __name__ == "__main__":
    application.listen(8888)
    tornado.ioloop.IOLoop.instance().start()
Launch this process with Supervisord and you'll have a simple, fast, low-overhead server that sets cookies.
You could try using mod_headers (usually available in the default install) to manually construct a Set-Cookie header and emit it; no programming needed, as long as it's the same cookie every time. Something like this could work in an .htaccess file:
Header add Set-Cookie "foo=bar; Path=/; Domain=.foo.com; Expires=Sun, 06 May 2012 00:00:00 GMT"
However, this won't work for you. There's no code here; it's just a stupid header. It can't come up with the new random value you'd want, and it can't adjust the expiry date as is standard practice.
This would be an Apache server, probably dedicated, with one PHP script 3 lines long, just running over and over again. [...] Surely there must be a faster, better way of doing this, rather than starting up the whole PHP environment.
Are you using APC or another bytecode cache? If so, there's almost no startup cost. Since you're talking about setting up an entire server just for this, it sounds like you control the server as well, which means you can turn off apc.stat for even less of a startup hit.
Really though, if all the script is doing is building an md5 hash and setting a cookie, it should already be blisteringly fast, especially if it's mod_php. Do you already know, through benchmarking and testing, that the script isn't performing as well as you'd like? If so, can you share those benchmarks with us?
It would be interesting to know why you think you need an extra server. Do you actually have a bottleneck in generating the cookie, or somewhere else? Is it the log writing, since requests happen a lot? Ajax polling? Client download speed?
At least for starters, I'd look for something more efficient than fetching the time to generate the "random hash". For example, on this Intel i7 laptop, generating 999999 md5 hashes from microtime takes roughly 4 seconds, and doing the same thing with random numbers is a second faster (not taking seeding of rand into account).
Then, if you take the opening and closing of the socket into account, just moving your script (which is most likely already really fast) to another server may actually end up slowing down the requests. Actually, now that I've re-read your question: is your cookie-setter script already a dedicated page, or do you just include it into the real content served by another PHP script? If the latter, try the dedicated-page approach. Also keep Apache's default logging rules in mind: if cookies are set on their own page, Apache will log a row for each such request, and in high-load systems this adds up in the total IO time spent by Apache.
Also, consider that testing whether the cookie is set and then setting it might be slower than just forcefully setting it every time, whether it already exists or not.
But overall, without knowing more about how you handle the cookies now, I don't think you need to set up a server just to offload cookie generation... unless you are doing something really nasty.
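For anyone who wants to reproduce that comparison, here is a small hypothetical micro-benchmark in Python (absolute numbers will differ per machine):

import hashlib
import random
import time
import timeit

# hash the current time vs. hash a random number, 999999 times each
md5_time = timeit.timeit(
    lambda: hashlib.md5(str(time.time()).encode()).hexdigest(), number=999999)
md5_rand = timeit.timeit(
    lambda: hashlib.md5(str(random.random()).encode()).hexdigest(), number=999999)
print("md5(time): %.2f s" % md5_time)
print("md5(rand): %.2f s" % md5_rand)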
Apache has a module called mod_usertrack which looks like it might do exactly what you want. There's no need for PHP, and you could likely create a really optimised, lightweight Apache config to serve this.
If you want to go for something even faster and are happy not to use Apache, you could use lighttpd and its mod_usertrack, or nginx's HttpUserId module.