Does the `Expires` HTTP header need to be consistent across multiple cold-cache requests? - http-headers

I'm implementing a custom web server of sorts and am looking into adding support for the Expires header. However, I'm a little unsure of how exactly to implement it.
If multiple cold-cache requests are made for the same unchanged resource and the server returns a different Expires header each time (say it computes the value relative to the request time, e.g. +6 hours), does that invalidate the cached copies on all the proxy servers in between as well? Or can that not happen (per the spec)?
Does the Expires HTTP header need to be consistent across multiple cold-cache requests?
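To make the scenario in the question concrete, here is a minimal sketch of a relative Expires calculation. The function name expires_header and the ttl_seconds parameter are made up for illustration; only email.utils.formatdate is a real standard-library call.

```python
from email.utils import formatdate

def expires_header(request_time, ttl_seconds=6 * 3600):
    """Return an HTTP-date that lies ttl_seconds after request_time.

    request_time is a Unix timestamp. Because the value is relative to
    the request, two requests at different times get different headers.
    """
    return formatdate(request_time + ttl_seconds, usegmt=True)

# Two cold-cache requests one minute apart, for the same unchanged resource:
first = expires_header(0)    # "Thu, 01 Jan 1970 06:00:00 GMT"
second = expires_header(60)  # one minute later than the first value
```

This is exactly the situation being asked about: the resource is unchanged, but each response carries a different Expires value.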

Ok, never mind, I found the relevant information under the Cache Revalidation and Reload Controls section of the HTTP spec.
Basically, you can serve as many different validators as you want, but be aware that proxies may then hold a set of different validators, some from their own cache and some from the various user agents talking to the proxy. They may choose to send any one of them to you, and it might not be the correct or the most optimal one for the end users. However, the spec does suggest a "best approach".
I suppose this should cover Expires headers as well as ETags, Cache-Control and whatnot.
Here's the relevant excerpt, in case anyone's interested:
When an intermediate cache is forced, by means of a max-age=0 directive, to revalidate its own cache entry, and the client has supplied its own validator in the request, the supplied validator might differ from the validator currently stored with the cache entry. In this case, the cache MAY use either validator in making its own request without affecting semantic transparency. However, the choice of validator might affect performance. The best approach is for the intermediate cache to use its own validator when making its request. If the server replies with 304 (Not Modified), then the cache can return its now validated copy to the client with a 200 (OK) response. If the server replies with a new entity and cache validator, however, the intermediate cache can compare the returned validator with the one provided in the client's request, using the strong comparison function. If the client's validator is equal to the origin server's, then the intermediate cache simply returns 304 (Not Modified). Otherwise, it returns the new entity with a 200 (OK) response. If a request includes the no-cache directive, it SHOULD NOT include min-fresh, max-stale, or max-age.
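The "best approach" in the excerpt can be sketched roughly as follows. This is only an illustration under stated assumptions: cache_entry is a hypothetical dict with "etag" and "body" keys, and origin is a hypothetical callable standing in for the request to the origin server.

```python
def strong_match(a, b):
    """Strong comparison: both validators identical and neither weak."""
    return a is not None and a == b and not a.startswith("W/")

def revalidate(cache_entry, client_validator, origin):
    """Revalidate with the cache's own validator, as the excerpt suggests.

    origin(validator) is assumed to return (304, None, None) when the
    entry is still valid, or (200, new_validator, new_body) otherwise.
    Returns the (status, body) pair to send back to the client.
    """
    status, new_validator, new_body = origin(cache_entry["etag"])
    if status == 304:
        # The cached copy is now validated; serve it with a 200.
        return 200, cache_entry["body"]
    # The origin sent a new entity: store it, then compare validators.
    cache_entry.update(etag=new_validator, body=new_body)
    if strong_match(client_validator, new_validator):
        # The client already holds the current entity.
        return 304, None
    return 200, new_body
```

Note that the client's validator is only consulted after the origin has answered, which is what lets the cache use its own validator for the upstream request.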

Related

Is the URL subject to HTTP/2 header compression?

I understand that, if you send duplicate header values in subsequent requests, the dynamic table makes it so that you do not send the value again but a reference to it in the table is sent instead.
My question is whether this applies to the URL as well?
Say you have repeated requests to the same URL (possibly containing long IDs and/or tokens), would bandwidth be saved in this instance?
There are various options that a client can use to send headers under HTTP/2, as defined in the HPACK specification. These basically say whether to refer to a previously sent header, whether to store a header for later reference, whether to never store a header for reuse, and so on. The client decides which of these to use for the headers it sends.
In HTTP/2 the URL is sent in the :path pseudo-header, so unlike in HTTP/1.1 it is just like any other HTTP header and can be compressed. Typically a URL is not repeated often, however, so it would be sent as a Literal Header Field without Indexing, which means it is a one-off header that should not be stored for reuse. Of course, as it’s an HTTP header much like any other, there’s nothing to stop an HTTP/2 client sending it as an indexed type, but web browsers are unlikely to do this, so this is probably only really an option for custom clients.
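To illustrate the difference in size between these representations, here is a small hand-rolled sketch of the HPACK byte layouts (RFC 7541), without Huffman coding. The helper names are made up for this example and this is not a full encoder; a real client would use an HPACK library. It assumes the :path name is referenced through static table index 4 and that the stored header ends up as the first dynamic table entry (index 62).

```python
def encode_int(value, prefix_bits, flags=0):
    """HPACK prefix-integer encoding (RFC 7541, section 5.1)."""
    limit = (1 << prefix_bits) - 1
    if value < limit:
        return bytes([flags | value])
    out = [flags | limit]
    value -= limit
    while value >= 128:
        out.append(value % 128 + 128)
        value //= 128
    out.append(value)
    return bytes(out)

def encode_str(text):
    """String literal with the Huffman bit cleared (plain octets)."""
    raw = text.encode("ascii")
    return encode_int(len(raw), 7) + raw

PATH_NAME_INDEX = 4  # static table entry whose name is ":path"

def literal_without_indexing(path):
    # Pattern 0000xxxx: the value is sent in full and not stored.
    return encode_int(PATH_NAME_INDEX, 4, 0x00) + encode_str(path)

def literal_with_indexing(path):
    # Pattern 01xxxxxx: sent in full once and added to the dynamic table.
    return encode_int(PATH_NAME_INDEX, 6, 0x40) + encode_str(path)

def indexed(index):
    # Pattern 1xxxxxxx: only a table reference is sent.
    return encode_int(index, 7, 0x80)

path = "/items/" + "a" * 40                   # a long path with an ID in it
first_request = literal_with_indexing(path)   # full value plus 2 bytes
repeat_request = indexed(62)                  # a single byte
```

So a custom client that chooses the indexed representation pays for the full path once and then only a byte or so per repeat, as long as the entry has not been evicted from the dynamic table.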
Incidentally, if you want to know more about this and find the spec a little difficult to follow, my book HTTP/2 in Action goes into this in a lot more detail in Chapter 8.

HTTP HEAD verb's status code

According to the RFC http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html:
The HEAD method is identical to GET except that the server MUST NOT return a message-body in the response. The metainformation contained in the HTTP headers in response to a HEAD request SHOULD be identical to the information sent in response to a GET request. This method can be used for obtaining metainformation about the entity implied by the request without transferring the entity-body itself. This method is often used for testing hypertext links for validity, accessibility, and recent modification.
The response to a HEAD request MAY be cacheable in the sense that the information contained in the response MAY be used to update a previously cached entity from that resource. If the new field values indicate that the cached entity differs from the current entity (as would be indicated by a change in Content-Length, Content-MD5, ETag or Last-Modified), then the cache MUST treat the cache entry as stale.
Given this definition, should we return 200, as the corresponding GET would, or should we return 204 because there is no content?
Personally, I think the better interpretation would be to use 204 status code. What is your interpretation?
See Section 10, which describes the status codes. The description of code 200 includes examples, and they include HEAD. So obviously they intend that the HEAD request should return this code.
The description of 204 explains the purpose:
This response is primarily intended to allow input for actions to take place without causing a change to the user agent's active document view, although any new or updated metainformation SHOULD be applied to the document currently in the user agent's active view.
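As a rough illustration of that reading (HEAD answers with the same status and headers as GET, just without the body), here is a sketch built on Python's standard http.server. The handler, the BODY constant and the fetch helper are all hypothetical names for this example.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

BODY = b"hello\n"

class Handler(BaseHTTPRequestHandler):
    def _send_headers(self):
        # Same status line and headers for both GET and HEAD.
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(BODY)))
        self.end_headers()

    def do_GET(self):
        self._send_headers()
        self.wfile.write(BODY)

    def do_HEAD(self):
        # Identical metainformation, but no message body.
        self._send_headers()

    def log_message(self, *args):
        pass  # keep the example quiet

def fetch(method):
    """Start a throwaway server on a free local port and send one request."""
    server = HTTPServer(("127.0.0.1", 0), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
        conn.request(method, "/")
        resp = conn.getresponse()
        result = (resp.status, resp.getheader("Content-Length"), resp.read())
        conn.close()
        return result
    finally:
        server.shutdown()
        server.server_close()
```

Calling fetch("HEAD") and fetch("GET") should give the same status and Content-Length, with an empty body for HEAD.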

Caching Github API calls

I have a general question related to caching of API calls, in this instance calls to the Github API.
Let's say I have a page in my app that shows the filenames of a repo, and the content of the README. This means that I will have to do a few API calls in order to retrieve that.
Now, let's say I want to add something like memcached in between, so I'm not doing these calls over and over, if I don't need to.
How would you normally go about this? If I don't enable a webhook on Github, I have no way of knowing whether the cache should expire. I could always make a single call to get the current sha of HEAD, and if it hadn't changed, use cache instead. But that's on a repo-level, and not on a file level.
I can imagine I could do something like that with the object-sha's, but if I need to call the API anyway to get those, it defeats the purpose of caching.
How would you go about it? I know a service like prose.io has no caching right now, but if it should, what would the approach be?
Thanks
Would just using HTTP caching be good enough for your use case? The purpose of HTTP caching is not just to provide a way of not making requests if you already have a fresh response, rather - it also enables you to quickly validate if the response you already have in cache is valid (without the server sending the complete response again if it is fresh).
Looking at GitHub API responses, I can see that GitHub is correctly setting the relevant HTTP headers (ETag, Last-Modified, Cache-Control).
So, you just do a GET, e.g. for:
GET https://api.github.com/users/izuzak/repos
and this returns:
200 OK
...
ETag:"df739f00c5053d12ef3c625ad6b0fd08"
Last-Modified:Thu, 14 Feb 2013 22:31:14 GMT
...
Next time - you do a GET for the same resource, but also supply the relevant HTTP caching headers so that it is actually a conditional GET:
GET https://api.github.com/users/izuzak/repos
...
If-Modified-Since:Thu, 14 Feb 2013 22:31:14 GMT
If-None-Match:"df739f00c5053d12ef3c625ad6b0fd08"
...
And lo and behold - the server returns a 304 Not Modified response and your HTTP client will pull the response from its cache:
304 Not Modified
So, the GitHub API does HTTP caching right and you should use it. Granted, you also have to use an HTTP client that supports HTTP caching. The best thing is that if you get a 304 Not Modified response, GitHub does not decrease your remaining API call quota. See: https://docs.github.com/en/rest/overview/resources-in-the-rest-api#conditional-requests
GitHub API also sets the Cache-Control: private, max-age=60 header, so you have 60 seconds of freshness -- which means that requests for the same resource made less than 60 seconds apart will not even be made to the server.
Your reasoning about using a single conditional GET request to a resource that is sure to change if anything in the repo changes (a resource showing the sha of HEAD, for example) sounds reasonable: if that resource hasn't changed, then you don't have to check the individual files, since they surely haven't changed either.
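If your HTTP client does not do this for you, the conditional GET flow described in this answer can be sketched by hand. The class below is a hypothetical illustration, not part of any GitHub client library; send stands in for whatever transport you use and is assumed to return a (status, headers, body) tuple.

```python
class ConditionalCache:
    """Tiny validator-based cache around an injected transport function."""

    def __init__(self, send):
        # send(url, headers) -> (status, response_headers, body); hypothetical.
        self._send = send
        self._entries = {}  # url -> (etag, last_modified, body)

    def get(self, url):
        headers = {}
        cached = self._entries.get(url)
        if cached:
            etag, last_modified, _ = cached
            if etag:
                headers["If-None-Match"] = etag
            if last_modified:
                headers["If-Modified-Since"] = last_modified
        status, resp_headers, body = self._send(url, headers)
        if status == 304 and cached:
            # Not modified: reuse the stored body.
            return cached[2]
        if status == 200:
            # Fresh response: remember its validators for next time.
            self._entries[url] = (
                resp_headers.get("ETag"),
                resp_headers.get("Last-Modified"),
                body,
            )
        return body
```

The entries dict could just as well live in memcached, keyed by URL, which is the setup the question has in mind.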

Idempotence of HTTP PUT and DELETE

So the HTTP spec says that HTTP PUT and DELETE should be idempotent. Meaning, multiple PUT requests to the same URL with the same body should not result in additional side-effects on the server. Same goes with multiple HTTP DELETEs, if 2 or more DELETE requests are sent to the same URL, the second (or third, etc) requests should not return an error indicating that the resource has already been deleted.
However, what about PUT requests to a URI after a DELETE has been processed? Should it return 404?
For example, consider the following requests are executed in this order:
POST /api/items - creates an item resource, returns HTTP 201 and URI /api/items/6
PUT /api/items/6 - updates the data associated with item #6
PUT /api/items/6 - has no side effects as long as request body is same as previous PUT
DELETE /api/items/6 - deletes item #6 and returns HTTP 202
DELETE /api/items/6 - has no side effects, and also returns HTTP 202
GET /api/items/6 - this will now return a 404
PUT /api/items/6 - WHAT SHOULD HAPPEN HERE? 404? 409? something else?
So, should PUT be consistent with GET and return a 404, or, as @CodeCaster suggests, would a 409 be more appropriate?
RFC 2616, section 9.6, PUT:
The fundamental difference between the POST and PUT requests is
reflected in the different meaning of the Request-URI. The URI in a
POST request identifies the resource that will handle the enclosed
entity. That resource might be a data-accepting process, a gateway to
some other protocol, or a separate entity that accepts annotations.
In contrast, the URI in a PUT request identifies the entity enclosed
with the request -- the user agent knows what URI is intended and the
server MUST NOT attempt to apply the request to some other resource.
And:
If the resource could not be created or modified with the Request-URI, an appropriate error response SHOULD be given that reflects the nature of the problem.
So, to define 'appropriate', look at the 400 series, which indicates a client error. First I'll eliminate the irrelevant ones:
400 Bad Request: The request could not be understood by the server due to malformed syntax.
401 Unauthorized: The request requires user authentication.
402 Payment Required: This code is reserved for future use.
406 Not Acceptable: The resource identified by the request [...] not acceptable according to the accept headers sent in the request.
407 Proxy Authentication Required: This code [...] indicates that the client must first authenticate itself with the proxy.
408 Request Timeout: The client did not produce a request within the time that the server was prepared to wait.
411 Length Required: The server refuses to accept the request without a defined Content-Length.
So, which ones may we use?
403 Forbidden
The server understood the request, but is refusing to fulfill it.
Authorization will not help and the request SHOULD NOT be repeated.
This description actually fits pretty well, although it is usually used in a permissions-related context (as in: YOU may not ...).
404 Not Found
The server has not found anything matching the Request-URI. No
indication is given of whether the condition is temporary or
permanent. The 410 (Gone) status code SHOULD be used if the server
knows, through some internally configurable mechanism, that an old
resource is permanently unavailable and has no forwarding address.
This status code is commonly used when the server does not wish to
reveal exactly why the request has been refused, or when no other
response is applicable.
This one too, especially the last line.
405 Method Not Allowed
The method specified in the Request-Line is not allowed for the
resource identified by the Request-URI. The response MUST include an
Allow header containing a list of valid methods for the requested
resource.
There are no valid methods we can respond with, since we don't want any method to be executed on this resource at the moment, so we cannot return a 405.
409 Conflict
Conflicts are most likely to occur in response to a PUT request. For
example, if versioning were being used and the entity being PUT
included changes to a resource which conflict with those made by an
earlier (third-party) request, the server might use the 409 response
to indicate that it can't complete the request. In this case, the
response entity would likely contain a list of the differences
between the two versions in a format defined by the response
Content-Type.
But that assumes there already is a resource at the URI (how can there be a conflict with nothing?).
410 Gone
The requested resource is no longer available at the server and no
forwarding address is known. This condition is expected to be
considered permanent. Clients with link editing capabilities SHOULD
delete references to the Request-URI after user approval. If the
server does not know, or has no facility to determine, whether or not
the condition is permanent, the status code 404 (Not Found) SHOULD be
used instead.
This one also makes sense.
I've edited this post a few times now. It was accepted when it claimed "use 410 or 404", but now I think 403 might also be applicable, since the RFC doesn't state that a 403 has to be permissions-related (though popular web servers seem to implement it that way). I think I have eliminated all other 400-codes, but feel free to comment (before you downvote).
Your question has an unstated, assumed premise, that the resource must exist for a PUT to succeed. This is not a valid assumption.
The relevant portion of the spec (RFC2616) says:
the user agent knows what URI is intended and the server MUST NOT attempt to apply the request to some other resource.
The spec does not say, "An object at the referenced URI must already exist in order for a PUT to that URI to succeed."
The easy example is a web store implemented via REST. GET returns a representation of the object at the given path, while DELETE removes the item at the given path. Those are easy. But the POST and PUT are not much more difficult to understand. POST can do anything, but one use of POST creates an object in a container that the client specifies, and lets the server return the URI of the newly created object within that container. PUT is more limited; it gives the server the representation for an object at a given URI. The object may already exist, or it may not. PUT is not a synonym for REPLACE.
In my opinion 409 or 410 is wrong for a PUT, unless the container itself (the thing you are trying to put into) does not exist.
Therefore:
POST /container
==> returns 200 with `Location:/container/resource-12345`
PUT /container/resource-98928
==> returns 201 CREATED or 200 OK
PUT /this-container-does-not-exist/resource-22828282
==> returns 400
Of course, it is up to you whether you'd like your server to allow these PUT semantics. But there's nothing in the spec that says you must not allow clients to provide the URI of the resource they are PUTting.
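The PUT semantics argued for in this answer can be sketched with a small in-memory store. Everything here is hypothetical: the Store class, its method names, and the 204 returned by delete are assumptions made for illustration (the question itself uses 202 for DELETE), while the 201/200/400 choices follow the examples above.

```python
class Store:
    """In-memory resource store where PUT creates or replaces."""

    def __init__(self, containers):
        self._items = {name: {} for name in containers}

    def put(self, container, key, body):
        if container not in self._items:
            return 400  # the answer's choice for a missing container
        created = key not in self._items[container]
        self._items[container][key] = body
        return 201 if created else 200

    def delete(self, container, key):
        # Idempotent: deleting a missing item leaves the same state.
        self._items.get(container, {}).pop(key, None)
        return 204  # assumed status code, not taken from the thread

    def get(self, container, key):
        item = self._items.get(container, {}).get(key)
        return (404, None) if item is None else (200, item)
```

Under these semantics the final PUT in the question's sequence simply recreates /api/items/6 and returns 201, rather than a 404 or 409.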

How to poll for updates with JSONP?

I have a web server that updates its data once per minute, and I want to make that data available to clients of all types. In order to reduce bandwidth, I set up the PHP script to support conditional GETs, using If-Modified-Since and/or If-None-Match. The idea is that clients can poll every 30 seconds and thereby be sure that they won't miss anything, but also won't get duplicate data.
That all works great for most types of clients, and I've verified that it works with clients that support the standard HTTP conditional GET semantics.
But it doesn't work with JavaScript because JSONP inserts a <script> tag into the DOM and lets the browser handle things--and there's no support (at least, none that I know of) for conditional GETs in <script> tags.
So I modified my PHP script to support passing an etag value. The returned data contains an etag value that's unique for that minute. When the JavaScript client receives data from the server, it saves the etag value so it can use that value in subsequent requests. The request takes the form:
http://api.mydomain.com/script.php?fmt=json&callback=jscallback&etag=ab79bc65e
If the etag of the data doesn't match the passed etag, then I send the new data.
This all works well and was surprisingly easy to code up using jQuery. My dilemma, though, is what to do if the etag matches. I see two choices:
Return an HTTP 304 (Not Modified)
Return an HTTP 200 (OK), but with the returned data containing just the header information (modified date, etag, etc.) and no actual data items.
If I do the first, then the JavaScript client code is greatly simplified. The browser seems to work just fine if it gets a 304 response to an injected <script> tag. But ... something bothers me about this solution. I don't know what it is, but it seems like I'm depending on behavior that could be browser-specific. Some browser might decide to report an error if it gets a 304.
Doing the second would require a little bit more work on the server, slightly more bandwidth, and would require the clients to check the data to see if the data was updated. It's more work for everybody, but it seems cleaner.
So, to my question. If you were writing a JavaScript client to get this data, which would you prefer? A silent failure that never calls your "success" callback? Or a "success" return that has no data (beyond status) in it? A third option?
Absent any discussion from others, I went with my gut here and implemented the second option. The web server returns an HTTP 200, and the data contains a "Not Modified" status code along with header information, but no records. That makes the JavaScript just slightly more complicated, but prevents me from depending on undocumented behavior.
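The server in the question is written in PHP, so the following is only a language-neutral sketch of the second option, written in Python. The function name and the payload field names (status, etag, modified, records) are assumptions, since the post does not show the actual response format.

```python
import json

def jsonp_body(callback, client_etag, current_etag, modified, records):
    """Build the JSONP response text, always served with HTTP 200.

    When the client's etag matches, the payload carries only the header
    information and an empty record list.
    """
    payload = {"etag": current_etag, "modified": modified}
    if client_etag == current_etag:
        payload["status"] = "Not Modified"
        payload["records"] = []
    else:
        payload["status"] = "OK"
        payload["records"] = records
    return f"{callback}({json.dumps(payload)});"
```

The client's callback then checks the status field before touching records. A real implementation would also want to validate the callback name before echoing it back into the script.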