Caching GitHub API calls - api

I have a general question about caching API calls, in this instance calls to the GitHub API.
Let's say I have a page in my app that shows the filenames in a repo and the content of its README. That takes a few API calls to retrieve.
Now, let's say I want to put something like memcached in between, so I'm not making these calls over and over when I don't need to.
How would you normally go about this? Unless I enable a webhook on GitHub, I have no way of knowing when the cache should expire. I could always make a single call to get the current SHA of HEAD and, if it hasn't changed, use the cache instead. But that works at the repo level, not at the file level.
I imagine I could do something similar with the object SHAs, but if I need to call the API anyway to get those, it defeats the purpose of caching.
How would you go about it? I know a service like prose.io has no caching right now, but if it should, what would the approach be?
Thanks

Would just using HTTP caching be good enough for your use case? HTTP caching is not only a way to avoid making a request when you already have a fresh response; it also lets you quickly validate whether the response in your cache is still valid, without the server sending the complete response again if nothing has changed.
Looking at GitHub API responses, I can see that GitHub correctly sets the relevant HTTP headers (ETag, Last-Modified, Cache-Control).
So, you just do a GET, e.g. for:
GET https://api.github.com/users/izuzak/repos
and this returns:
200 OK
...
ETag: "df739f00c5053d12ef3c625ad6b0fd08"
Last-Modified: Thu, 14 Feb 2013 22:31:14 GMT
...
Next time, you do a GET for the same resource, but also supply the relevant HTTP caching headers, which makes it a conditional GET:
GET https://api.github.com/users/izuzak/repos
...
If-Modified-Since: Thu, 14 Feb 2013 22:31:14 GMT
If-None-Match: "df739f00c5053d12ef3c625ad6b0fd08"
...
And lo and behold, the server returns a 304 Not Modified response and your HTTP client pulls the response from its cache:
304 Not Modified
So, the GitHub API does HTTP caching right and you should use it. Granted, you also need an HTTP client that supports HTTP caching. The best thing is that a 304 Not Modified response does not count against your remaining API call quota. See: https://docs.github.com/en/rest/overview/resources-in-the-rest-api#conditional-requests
The GitHub API also sets the Cache-Control: private, max-age=60 header, so a response stays fresh for 60 seconds. Requests for the same resource made less than 60 seconds apart will not even reach the server; a caching client answers them from its cache.
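To make the flow concrete, here is a minimal sketch of a client-side cache that combines the 60-second freshness window with ETag revalidation. It is not GitHub's or any library's implementation: the ConditionalGetCache class, the transport callable and its (status, headers, body) return shape are assumptions made for illustration, so that the logic can be read and tested without a network.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

# Hypothetical transport: (url, request_headers) -> (status, response_headers, body).
# In a real client this would wrap an HTTP library call.
Transport = Callable[[str, Dict[str, str]], Tuple[int, Dict[str, str], Optional[str]]]


@dataclass
class CacheEntry:
    body: Optional[str]
    etag: Optional[str]
    fresh_until: float


class ConditionalGetCache:
    def __init__(self, transport: Transport, max_age: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._transport = transport
        self._max_age = max_age  # assumed freshness window, mirroring max-age=60
        self._clock = clock
        self._entries: Dict[str, CacheEntry] = {}

    def get(self, url: str) -> Optional[str]:
        now = self._clock()
        entry = self._entries.get(url)
        # Still fresh: answer from the cache without contacting the server.
        if entry is not None and now < entry.fresh_until:
            return entry.body
        # Stale: revalidate with a conditional GET if we have an ETag.
        headers: Dict[str, str] = {}
        if entry is not None and entry.etag:
            headers["If-None-Match"] = entry.etag
        status, response_headers, body = self._transport(url, headers)
        if status == 304 and entry is not None:
            # Not modified: keep the cached body and restart the freshness window.
            entry.fresh_until = now + self._max_age
            return entry.body
        if status != 200:
            raise RuntimeError(f"unexpected status {status} for {url}")
        self._entries[url] = CacheEntry(body, response_headers.get("ETag"),
                                        now + self._max_age)
        return body
```

The same idea carries over to a shared store such as memcached: keep the body and its ETag under the URL as the key, and only replace them when the server answers with something other than 304.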
Your idea of making a single conditional GET to a resource that is certain to change whenever anything in the repo changes (a resource showing the SHA of HEAD, for example) sounds reasonable. If that resource hasn't changed, you don't have to check the individual files, since they surely haven't changed either.
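That repo-level check could look roughly like the sketch below. The RepoSnapshotCache class and the two fetch callables are hypothetical stand-ins for whatever API wrapper you use: one cheap call returns the current HEAD SHA, and the expensive calls for filenames and README are only made when that SHA differs from the cached one.

```python
from typing import Callable, Dict, Tuple


class RepoSnapshotCache:
    def __init__(self, fetch_head_sha: Callable[[str], str],
                 fetch_files: Callable[[str], Dict[str, str]]) -> None:
        # Both callables are assumptions: they stand in for real API calls.
        self._fetch_head_sha = fetch_head_sha
        self._fetch_files = fetch_files
        # repo name -> (HEAD SHA at fetch time, file contents keyed by path)
        self._snapshots: Dict[str, Tuple[str, Dict[str, str]]] = {}

    def get_files(self, repo: str) -> Dict[str, str]:
        head_sha = self._fetch_head_sha(repo)  # one cheap call per lookup
        cached = self._snapshots.get(repo)
        if cached is not None and cached[0] == head_sha:
            # HEAD has not moved, so no file can have changed.
            return cached[1]
        files = self._fetch_files(repo)  # expensive calls only on change
        self._snapshots[repo] = (head_sha, files)
        return files
```

If the HEAD lookup is itself made as a conditional GET, an unchanged repo costs a single 304 round trip.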

Related

ResponseCache attribute causing non 200 responses to be cached

We are using the [ResponseCacheAttribute] from Microsoft.AspNetCore.Mvc.Core with a policy like so on action methods or controllers:
[ResponseCache(CacheProfileName = "Default")]
If a non-200 response such as 400, 403 or 500 is sent, it is also cached. So the first time, the request goes to the server and gets, for example, a bad request. The second time, no call is made to the server and the answer is still bad request (from disk cache).
I read in the documentation that the response caching middleware only caches 200 responses. This attribute seems to be flawed and always adds the caching response header, whatever the status code.
We would like to enable caching only on certain controllers or action methods, not on all requests.
Does anyone know the solution for this?
I simulate the problem by using a status code result:
return StatusCode(500);
I would then expect every request to reach this code (and hit a breakpoint there) and never be served from the cache.

RESTDataSource - How to know if response comes from get request or cache

I need to get some data from a REST API in my GraphQL API. For that I'm extending RESTDataSource from apollo-datasource-rest.
From what I understand, RESTDataSource automatically caches requests, but I'd like to verify that a response is indeed cached. Is there a way to know whether my request is getting its data from the cache or hitting the REST API?
I noticed that the first request takes some time, but the following ones are much faster. Also, the didReceiveResponse method is not called every time I make a query. Is that because the data is loaded from the cache?
I'm using apollo-server-express.
Thanks for your help!
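You can time the requests as follows (inside an async method of your data source):
console.time('restdatasource get req')
await this.get(url)
console.timeEnd('restdatasource get req')
Note the await: this.get returns a promise, so without it the timer stops before the request completes. As a rough rule of thumb, if the time is well under 100-150 milliseconds, the response most likely came from the cache.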
You can monitor the console, under the network tab. You will be able to see which endpoints the application is calling. If it uses cached data, no new request to your endpoint will be logged.
If you are trying to verify this locally, one good option is to set up a local proxy so that you can see all the network calls being made (no network call means the response was read from the cache). You can then configure your app, following the Apollo documentation, to forward all outgoing calls through a proxy such as mitmproxy.

REST/HTTP: best status code to prevent cached upload?

I'm designing an API where a client PUTs a file to the server, but the server may already have a copy of this file and not need it re-uploaded.
I'm already planning on using Expect: 100-continue so that the server can inform the client before the client performs the entire, inefficient upload.
My question is, what's the best status code to return instead of 100 Continue in the case that the server doesn't need the upload?
Typically, the client could send an If-None-Match header, and the server could respond with a 412 Precondition Failed if there was already a match.
But, in my case, the de-duplication is almost an implementation detail, and I don't want the client to be concerned with knowing the de-dup'ing strategy (e.g. what the value to match is).
Would a 302 Found, a 303 See Other, or a 304 Not Modified make sense?
It doesn't seem like a 4xx is appropriate since it's not a client error, nor 5xx since I don't want to trigger any automatic retry logic in the client.
Thanks!
From the client's point of view, the PUT succeeded. So I believe a 2xx status code would be right, such as 200 with a message body giving a status message.
At least using cURL as a client, it turns out that 304 works great.

How to poll for updates with JSONP?

I have a Web server that updates its data once per minute, and want to make that data available to clients of all types. To reduce bandwidth, I set up the PHP script to support conditional GETs, using If-Modified-Since and/or If-None-Match. The idea is that clients can poll every 30 seconds and thereby be sure that they won't miss anything, but also won't get duplicate data.
That all works great for most types of clients, and I've verified that it works with clients that support the standard HTTP conditional GET semantics.
But it doesn't work with JavaScript because JSONP inserts a <script> tag into the DOM and lets the browser handle things--and there's no support (at least, none that I know of) for conditional GETs in <script> tags.
So I modified my PHP script to support passing an etag value. The returned data contains an etag value that's unique for that minute. When the JavaScript client receives data from the server, it saves the etag value so it can use that value in subsequent requests. The request takes the form:
http://api.mydomain.com/script.php?fmt=json&callback=jscallback&etag=ab79bc65e
If the etag of the data doesn't match the passed etag, then I send the new data.
This all works well and was surprisingly easy to code up using jQuery. My dilemma, though, is what to do if the etag matches. I see two choices:
1. Return an HTTP 304 (Not Modified).
2. Return an HTTP 200 (OK), but with the returned data containing just the header information (modified date, etag, etc.) and no actual data items.
If I do the first, then the JavaScript client code is greatly simplified. The browser seems to work just fine if it gets a 304 response to an injected <script> tag. But ... something bothers me about this solution. I don't know what it is, but it seems like I'm depending on behavior that could be browser-specific. Some browser might decide to report an error if it gets a 304.
Doing the second would require a little bit more work on the server, slightly more bandwidth, and would require the clients to check the data to see if the data was updated. It's more work for everybody, but it seems cleaner.
So, to my question. If you were writing a JavaScript client to get this data, which would you prefer? A silent failure that never calls your "success" callback? Or a "success" return that has no data (beyond status) in it? A third option?
Absent any discussion from others, I went with my gut here and implemented the second option. The web server returns an HTTP 200, and the data contains a "Not Modified" status code along with header information, but no records. That makes the JavaScript just slightly more complicated, but prevents me from depending on undocumented behavior.
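For illustration, the second option could be sketched on the server side as below. The original script is PHP; this Python version only shows the decision logic, and the function name, the payload field names ("status", "etag", "records") and the callback-name check are assumptions, not the actual implementation.

```python
import json
import re
from typing import Any, List

# Accept only plain JavaScript identifiers as callback names (assumed policy),
# so the response cannot be turned into arbitrary script.
_CALLBACK_RE = re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*")


def build_jsonp_body(callback: str, client_etag: str,
                     current_etag: str, records: List[Any]) -> str:
    """Return the body of an HTTP 200 JSONP response."""
    if not _CALLBACK_RE.fullmatch(callback):
        raise ValueError("invalid callback name")
    if client_etag == current_etag:
        # Nothing new: report that in the payload and send no records.
        payload = {"status": "not_modified", "etag": current_etag, "records": []}
    else:
        payload = {"status": "ok", "etag": current_etag, "records": records}
    return f"{callback}({json.dumps(payload)});"
```

The client callback then checks the status field first, stores the returned etag, and only processes records when the status says the data changed.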

Does the `Expires` HTTP header need to be consistent across multiple cold-cache requests?

I'm implementing a custom web server of sorts and am looking into adding support for the Expires header. However, I'm a little unsure of exactly how to implement it.
If multiple cold-cache requests are made for the same unchanged resource and the server returns a different Expires header each time (say it computes the Expires date relative to the request time, e.g. +6 hours), does that invalidate the cache on all the proxy servers in between as well? Or can that not happen (per the spec)?
Does the Expires HTTP header need to be consistent across multiple cold-cache requests?
OK, never mind, I found the relevant information under the Cache Revalidation and Reload Controls section of the HTTP spec.
Basically, you can serve all the different validators you want, but you must be aware that proxies may then hold a set of different validators, some from their own cache and some from the various user agents communicating with the proxy. They may choose to send one of them to you, and it might not be the correct or the most optimal one for the end users. However, a "best approach" is suggested in the spec.
I suppose this should cover Expires headers as well as ETags, Cache-Control and whatnot.
Here's the relevant excerpt, in case anyone's interested:
When an intermediate cache is forced, by means of a max-age=0 directive, to revalidate its own cache entry, and the client has supplied its own validator in the request, the supplied validator might differ from the validator currently stored with the cache entry. In this case, the cache MAY use either validator in making its own request without affecting semantic transparency. However, the choice of validator might affect performance. The best approach is for the intermediate cache to use its own validator when making its request. If the server replies with 304 (Not Modified), then the cache can return its now validated copy to the client with a 200 (OK) response. If the server replies with a new entity and cache validator, however, the intermediate cache can compare the returned validator with the one provided in the client's request, using the strong comparison function. If the client's validator is equal to the origin server's, then the intermediate cache simply returns 304 (Not Modified). Otherwise, it returns the new entity with a 200 (OK) response. If a request includes the no-cache directive, it SHOULD NOT include min-fresh, max-stale, or max-age.
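The "best approach" in the excerpt can be sketched as follows. This is only an illustration of the quoted steps, not code from any proxy: the CachedEntity type, the origin callable and its (status, validator, body) return shape are assumptions, and the strong comparison is simplified to "neither validator is weak and both are identical".

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class CachedEntity:
    validator: str
    body: str


# Hypothetical origin: takes the validator sent by the cache and returns
# (status, validator, body). On 304 the body is assumed to be empty.
Origin = Callable[[str], Tuple[int, str, str]]


def strong_match(a: str, b: str) -> bool:
    # Simplified strong comparison: weak validators (W/ prefix) never match.
    return not a.startswith("W/") and not b.startswith("W/") and a == b


def revalidate(entry: CachedEntity, client_validator: Optional[str],
               origin: Origin) -> Tuple[int, Optional[str]]:
    """Return the (status, body) the intermediate cache sends to the client."""
    # The cache revalidates with its own validator, not the client's.
    status, new_validator, new_body = origin(entry.validator)
    if status == 304:
        # The cached copy is still valid: return it with a 200.
        return 200, entry.body
    # The origin sent a new entity: store it, then compare with the client's validator.
    entry.validator, entry.body = new_validator, new_body
    if client_validator is not None and strong_match(client_validator, new_validator):
        return 304, None
    return 200, new_body
```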