Understanding caching strategies of a dynamically generated search page

Understanding caching strategies of a dynamically generated search page - http-headers

While studying the caching strategies adopted by various search engine websites and Stackoverflow itself, I can't help but notice the subtle differences in the response headers:
Google Search
Cache-Control: private, max-age=0
Expires: -1
Yahoo Search
Cache-Control: private
Connection: Keep-Alive
Keep-Alive: timeout=60, max=100
Stackoverflow Search
Cache-Control: private
There must be some logical explanation behind the settings adopted. Can someone care to explain the differences so that everyone of us can learn and benefit?

From RFC2616 HTTP/1.1 Header Field Definitions, 14.9.1 What is Cacheable:
private
Indicates that all or part of the response message is intended for a single
user and MUST NOT be cached by a shared cache. This allows an origin server
to state that the specified parts of the response are intended for only one
user and are not a valid response for requests by other users. A private
(non-shared) cache MAY cache the response.
max-age=0 means that it may be cached up to 0 seconds. The value zero would mean that no caching should be performed.
Expires=-1 should be ignored when there's a max-age present, and -1 is an invalid date and should be parsed as a value in the past (meaning already expired).
From RFC2616 HTTP/1.1 Header Field Definitions, 14.21 Expires:
Note: if a response includes a Cache-Control field with the max-age directive
(see section 14.9.3), that directive overrides the Expires field
HTTP/1.1 clients and caches MUST treat other invalid date formats, especially
including the value "0", as in the past (i.e., "already expired").
The Connection: Keep-Alive and Keep-Alive: timeout=60, max=100 configures settings for persistent connections. All connections using HTTP/1.1 are persistent unless otherwise specified, but these headers change the actual timeout values instead of using the browsers default (which varies greatly).

Related

Getting file contents with Range header returns Partial Content and subsequent request returns no data

We have a non-understandable issue with getting file contents using OneDrive API.
When we request file contents with Range header:
GET /blahblah/foobar.docx HTTP/1.1
Host: qw122q-ch3301.files.1drv.com
Accept: */*
Accept-Encoding: deflate, gzip
Range: bytes=0-77270
OneDrive returns:
HTTP/1.1 206 Partial Content
Cache-Control: no-cache
Content-Length: 18325
We checked that the file size is correct on OneDrive server using web interface. Usually OneDrive returns full requested content but from last week they returns partial contents. But it's OK if we can get remaining parts with another API calls.
But when we send another request with Range header:
Range: bytes=18325-77270
OneDrive returns no data:
HTTP/1.1 206 Partial Content
Control: no-cache
Content-Length: 0
Has anyone experienced this issue? I can't find any clues on this issue from OneDrive developer documents. Please shed some light on this..

Actually I have a theory so I'm going to take a shot at an answer. There are actually two different issues that are resulting in this confusing behavior, so I'll tackle each one separately.
Reported file size doesn't match content size
This is an unfortunate quirk of the system that is being tracked with this GitHub issue. Ryan explains in more detail here.
Range downloads of word docs do not correctly handle unsatisfiable ranges
When a range outside of the actual file size is requested we should be failing with a 416 Requested Range Not Satisfiable like we do for "normal" files. But that's obviously not working. You can see in the Content-Range of the result there's something screwy going on:
FileSize: 15 bytes
Range Requested: bytes=15-
Content-Range Response: bytes=15-14/15
The value of the Content-Range obviously makes no sense.
Together these two issues should result in the weird behavior you're seeing. We're close to resolving the first, while the second was unknown (at least to me) so I've opened a new GitHub issue to track it.

Do any browsers support trailers sent in chunked encoding responses?

HTTP/1.1 specifies that a response sent as Transfer-Encoding: chunked can include optional trailers (ie. what would normally be sent as headers, but for whatever reason can't be calculated before the content, so they can be appended to the end), for example:
Request:
GET /trailers.html HTTP/1.1
TE: chunked, trailers
Response:
HTTP/1.1 200 OK
Transfer-Encoding: chunked
Trailer: My-Test-Trailer
D\r\n
All your base\r\n
B\r\n;
are belong\r\n
6\r\n
to us\r\n
0\r\n
My-Test-Trailer: something\r\n
\r\n
This request specifies in the TE header that it's expecting a chunked response, and will be looking for trailers after the final chunk.
The response specifies in the Trailer header the list of trailers it will be sending (in this case, just one: My-Test-Trailer)
Each of the chunks are sent as:
size of chunk in hex (D = 13), followed by a CRLF
chunk data (All your base), followed by a CRLF
A zero size chunk (0\r\n) indicates the end of the body.
Then the trailer(s) are specified (My-Test-Trailer: something\r\n), followed by a final CRLF.
Now, from everything I've read so far, trailers are rarely (if ever) used. Most discussions here and elsewhere concerning trailers typically start with "but why do you want to use trailers anyway?".
Putting aside the question of why, out of curiosity I've been trying to simulate a HTTP request/response exchange that uses trailers; but so far I have not yet been able to get it to work, and I'm not sure if it's something wrong with response I'm generating, or whether (as some have suggested) there are simply no clients that look for trailing headers.
Clients I've tried include: curl, wfetch, Chrome + jQuery.
In all cases, the client receives and correctly reconstructs the chunked response (All your base are belong to us); and I can see in the response headers that Trailer: My-Test-Trailer is being sent; but I'm not seeing My-Test-Trailier: something returned either in the response headers, or anywhere.
It's unclear whether a trailing header like this should appear in the client as a normal response header, after the entire response has been received and the connection closed?
Interestingly, the curl change logs appear to suggest that curl does support optional trailers, and that curl will process any trailers it finds into the normal header stream.
So does anybody know:
of a valid URL that I could ping, which sends trailers in a chunked response? (so that I can confirm whether it's just my test response that's not working); and
which clients are known to support (and access/display) trailers sent by the server?

No common browsers support HTTP/1.1 trailers. Look at the column "Headers in trailer" in the "Network" tab of browserscope.
Chrome: No, and won't fix (bug). Supports H/2 trailers (bug).
Firefox: No, and I don't see a ticket in bugzilla for it. Appears to support H/2.
IE: No
Edge: No
Safari: No
Opera: Old versions only (v10 - 12, removed in 14)
As you've discovered, a number of non-browser clients support it.

Over 5 years since asking this question, I can now definitively answer it myself.
Mozilla just announced that they will be supporting the new Server-Timing field as a HTTP trailing header (their first ever support for trailers).
https://bugzilla.mozilla.org/show_bug.cgi?id=1413999
However, more importantly, they confirm that it will be whitelisted so that Server-Timing is the only support value (emphasis mine):
Server-Timing is an HTTP trailer, not a header. :mcmanus tells me we currently parse trailers, but then silently throw them away. We don't want to change that behavior in general (we don't want to encourage trailers), so we'll want to whitelist the Server-Timing trailer, store it somewhere (probably even just a mServerTiming header will work for now, since it's the only trailer we support) and then make it available via some new channel.getTrailers() call.
So I guess that confirms it once and for all: trailing headers are not supported (and never likely to be in a general sense) by Moz, and presumably the same stance is taken by all other browser vendors.

Since this commit, Jodd HTTP Java client support trailer headers.
On the first question, I haven't yet found any live response that uses them ;)

As of May 2022, all browsers support the Trailer response header: https://caniuse.com/mdn-http_headers_trailer.
Library support:
Node.js
Jodd
EDIT ME, to add more. (This is a Community wiki answer.)

Just recently, according to caniuse.com, it is claimed that most major browser providers now support the Trailer response header in their latest versions (e.g. Firefox-88, Safari-14.1, Chrome-88, etc).
However, it seems that this only due to their support of Server-Timing (which can use trailers as mentioned in another answer), and there does not currently seem to be a general way to access a Trailer header from the browser Javascript - it's currently an open issue for the Fetch API, and the bug request for trailers in Chrome is marked wontfix.

What's the difference Expires and Cache-control:max-age?

Could you tell me the difference of Expires and Cache-control:max-age?

Expires was defined in the HTTP/1.0 specifications, and Cache-Control in the HTTP/1.1 specifications.
I would suggest defining both so you cater to both, the older clients that only understand HTTP/1.0, and the newer ones.

Expires was specified in HTTP 1.0 specification as compared to Cache-Control: max-age, which was introduced in the early HTTP 1.1 specification. The value of the Expires header has to be in a very specific date and time format, any error in which will make your resources non-cacheable. The Cache-Control: max-age header's value when sent to the browser is in seconds, the chances of any error happening in which is quite less.
Since you can specify only one of the two headers in your web.config file, I'd suggest going with the Cache-Control: max-age header because of the flexibility it offers in setting a relative timespan from the present date to a date in the future. You can basically set and forget, as compared to the case with Expires header, whose value you will have to remember to update at least once every year. And if you set both headers programmatically from within your code, know that the value of Cache-Control: max-age header will take precedence over Expires header. So, something to keep in mind there as well.
From Setting Expires and Cache-Control: max-age headers for static resources in ASP.NET

Cache Control Question

If I set this for cache control on my site:
Header unset Pragma
FileETag None
Header unset ETag
# 1 YEAR
<FilesMatch "\.(ico|pdf|flv|jpg|jpeg|png|gif|swf|mp3|mp4)$">
Header set Cache-Control "public"
Header set Expires "Thu, 15 Apr 2010 20:00:00 GMT"
Header unset Last-Modified
</FilesMatch>
# 2 HOURS
<FilesMatch "\.(html|htm|xml|txt|xsl)$">
Header set Cache-Control "max-age=7200, must-revalidate"
</FilesMatch>
# CACHED FOREVER
# MOD_REWRITE TO RENAME EVERY CHANGE
<FilesMatch "\.(js|css)$">
Header set Cache-Control "public"
Header set Expires "Thu, 15 Apr 2010 20:00:00 GMT"
Header unset Last-Modified
</FilesMatch>
...then what if I update any css or image or other files, will the users browser still use the caches version until it expires (a year later)?
Thanks

Your css, js and image files will never be cached, as you are setting a date in the past.
I assume this is a mistake, and you intended to set it for a year in the future, this is one reason to favour max-age over expires.
If this was the case, then your images will be cached up to a year. It's allowable to drop something out of the cache at any time, for example to clean out less-frequently used entries to reduce the size on disk that the cache is taking up.
There are two possible approaches to deal with the possibility of reducing the risk of staleness. One is to set a much lower expiry time, and use e-tags and modification dates so that after that expiry time has past you can send a 304 if there is no change, so the server need send only a few bytes rather than the entire entity.
The other is to keep the expiry at a year, but to change the URI used when you change. This can be useful in the case of e.g. a large file that is used on almost every page on your site. It requires that you change all references to that resource when it does change (because you are essentially changing to use a new resource), which can be fiddly and therefore is only advised as an optimisation in a few hotspot cases. If a file ignores query attributes (e.g. it's just served straight from a file) the browser won't know that, hence you could use something like /scripts/bigScript.js?version=1.2.3 and then change to /scripts/bigScript.js?version=1.2.4 when you change bigScript.js. This will have no effect on bigScript.js, but will cause the browser to get a new file, as for all it knows it's a completely different resource.

Yes, a response with an expiration date in the future will be considered as fresh until the expiration date:
The Expires entity-header field gives the date/time after which the response is considered stale. […]
The presence of an Expires header field with a date value of some time in the future on a response that otherwise would by default be non-cacheable indicates that the response is cacheable, unless indicated otherwise by a Cache-Control header field (section 14.9).
Note that an expiration date more than one year in the future may be interpreted as never expires:
To mark a response as "never expires," an origin server sends an Expires date approximately one year from the time the response is sent. HTTP/1.1 servers SHOULD NOT send Expires dates more than one year in the future.
So if a cache has the response stored, it will probably take the response from the cache even without revalidating the cached response before sending it.
Now if you change a resource that is already stored in caches and still fresh, there is no way to invalidate them:
[…] although they might continue to be "fresh," they do not accurately reflect what the origin server would return for a new request on that resource.
There is no way for the HTTP protocol to guarantee that all such cache entries are marked invalid. For example, the request that caused the change at the origin server might not have gone through the proxy where a cache entry is stored.
This is the reason for why such never expiring resources use a unique version number in the URL (e.g. style-v123.css) that is changed with each update. This is also what I recommend in this case.
By the way, declaring the response with Cache-Control as public doesn’t do anything in this case. This is only used when a response that required authorization should be cacheable:
public – Indicates that the response MAY be cached by any cache, even if it would normally be non-cacheable or cacheable only within a non- shared cache. (See also Authorization, section 14.8, for additional details.)
For further information on HTTP caching:
HTTP 1.1 specification – Caching in HTTP
Mark Nottingham’s Caching Tutorial

RFCs for content-type header?

I've looked # rfc 2231 and 2183.
Dealing with a multipart/related mime payload.
I'm trying to decypher if the following is syntactically correct, specifically the "start" attribute for the first Content-Type, but I haven't been able to find the correct RFC.
Content-Type: multipart/related; boundary="=_34e1b39f5c290f66360ff510d4c38da4"; type="application/smil"; start="<cid:eaec2c30d892902b14044d57dbb6ff85>"
--=_34e1b39f5c290f66360ff510d4c38da4
Content-ID: <eaec2c30d892902b14044d57dbb6ff85>
Content-Type: application/vnd.oma.drm.message; boundary=ihvdxymhvdhobklkqbcn;
name="IrishJi2.dm";
Content-Disposition: attachment;
filename="IrishJi2.dm";
--ihvdxymhvdhobklkqbcn
Content-Type: audio/mpeg
Content-Transfer-Encoding: binary
Some background information for the curious. application/vnd.oma.drm.* file types is just a wrapper around a payload item (mp3,jpg, etc) that tells cellular devices the wrapped file is to be considered protected payload and not to allow it to be forwarded or transfered off the phone in anyway. If not for contractual obligations I'd just rip the wrapper off, send the payload on, and be happy, but that is too easy and probably illegal.

From RFC 2387 (The MIME Multipart/Related Content-type):
3.2. The Start Parameter
The start parameter, if given, is the content-ID of the compound object's "root". If not present the "root" is the first body part in the Multipart/Related entity. The "root" is the element the applications processes first.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas