Must-revalidate headers of this request wrong? - apache

I noticed that Chrome cached a video file. I replaced it with another one on the server, and Chrome kept serving the old one from cache (using JW Flash Player 5).
The headers of the request look like this:
joe@joe-desktop:~$ wget -O - -S --spider http://www.2xfun.de/files_geheimhihi14/20759.mp4
Spider mode enabled. Check if remote file exists.
--2011-05-15 22:40:56-- http://www.2xfun.de/files_geheimhihi14/20759.mp4
Resolving www.2xfun.de... 213.239.214.112
Connecting to www.2xfun.de|213.239.214.112|:80... connected.
HTTP request sent, awaiting response...
HTTP/1.1 200 OK
Date: Sun, 15 May 2011 20:40:56 GMT
Server: Apache
Last-Modified: Sun, 15 May 2011 20:37:59 GMT
ETag: "89b38-3bb227-4a35683b477c0"
Accept-Ranges: bytes
Content-Length: 3912231
Cache-Control: max-age=29030400, public, must-revalidate
Expires: Sun, 15 Apr 2012 20:40:56 GMT
Connection: close
Content-Type: video/mp4
Length: 3912231 (3.7M) [video/mp4]
Remote file exists.
I am using mod_headers and mod_expires in apache2 like this:
<FilesMatch "\.(flv|ico|pdf|avi|mov|ppt|doc|mp3|wmv|wav|mp4)$">
ExpiresDefault A29030400
Header append Cache-Control "public, must-revalidate"
</FilesMatch>
Did I spell revalidate wrong or something?
edit:
To make the use case clear: I want the files to be cached, because they are rather big and I want to save bandwidth. But on the other hand I want the files to be re-validated. So the client makes a conditional request and checks whether the content has changed (that's what the ETag is for), and only re-fetches the file if necessary.

Your problem is that must-revalidate only kicks in once a cache entry is no longer fresh, but you've marked the response as cacheable for 29 million seconds. 'Cache-Control: max-age=0, must-revalidate' may be closer to what you want, if you want to allow caching but require revalidation on each use.
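Translated back into the original mod_expires/mod_headers block, that might look roughly like the following. This is a minimal, untested sketch: ExpiresDefault A0 makes mod_expires emit max-age=0, and the appended directives then produce Cache-Control: max-age=0, public, must-revalidate (the question's own output shows that the append runs after mod_expires).
<FilesMatch "\.(flv|ico|pdf|avi|mov|ppt|doc|mp3|wmv|wav|mp4)$">
# Expires = access time, so mod_expires sends Cache-Control: max-age=0
ExpiresDefault A0
# Appending then yields: Cache-Control: max-age=0, public, must-revalidate
Header append Cache-Control "public, must-revalidate"
</FilesMatch>
A cache holding the old copy should then revalidate before reusing it; for illustration, a conditional request carrying the ETag from above should be answered with 304 Not Modified instead of the full 3.7 MB body if the file is unchanged:
wget -S -O /dev/null --header='If-None-Match: "89b38-3bb227-4a35683b477c0"' http://www.2xfun.de/files_geheimhihi14/20759.mp4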

Related

Use wget to download pdf with no direct link

Some websites provide PDF files for viewing, but I can't download them with wget.
Calling the website in question from my browser views the pdf:
https://www.lokalmatador.de/epaper/ausgabe/gemeinderundschau-muehlhausen-14-2021/
But using the following command I only get a PDF file of length 0.
wget --content-disposition -nd https://www.lokalmatador.de/epaper/ausgabe/gemeinderundschau-muehlhausen-14-2021/
I tried some combinations with saving and loading cookies and referer but nothing works.
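(The cookie/referer attempts were along these lines; the exact commands below are an illustrative reconstruction, not taken verbatim from the question:)
wget --save-cookies cookies.txt --keep-session-cookies https://www.lokalmatador.de/
wget --load-cookies cookies.txt --referer=https://www.lokalmatador.de/ --content-disposition -nd https://www.lokalmatador.de/epaper/ausgabe/gemeinderundschau-muehlhausen-14-2021/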
At this point I'm just curious what is happening and why wget is not fetching anything except maybe an empty index.html.
When I looked at the server response, it said the content length was 0.
--2021-04-17 14:59:35-- https://www.lokalmatador.de/epaper/ausgabe/gemeinderundschau-muehlhausen-14-2021/
Resolving www.lokalmatador.de (www.lokalmatador.de)... 37.202.6.70
Connecting to www.lokalmatador.de (www.lokalmatador.de)|37.202.6.70|:443... connected.
HTTP request sent, awaiting response...
HTTP/1.1 200 OK
Date: Sat, 17 Apr 2021 13:59:36 GMT
Server: Apache
Set-Cookie: fe_typo_user=477e8a1d2b3dd74bc5b6b408a6d74edd; expires=Mon, 17-May-2021 13:59:36 GMT; Max-Age=2592000; path=/; domain=.lokalmatador.de; httponly; samesite=lax
Upgrade: h2,h2c
Connection: Upgrade, Keep-Alive
Content-Length: Array
Cache-Control: max-age=2592000
Expires: Mon, 17 May 2021 13:59:36 GMT
X-UA-Compatible: IE=edge
X-Content-Type-Options: nosniff
Keep-Alive: timeout=5, max=100
Content-Type: application/pdf
Length: 0 [application/pdf]
Remote file exists but does not contain any link -- not retrieving.
So I looked at the manual:
https://www.gnu.org/software/wget/manual/html_node/HTTP-Options.html
And there is an option exactly for this:
‘--ignore-length’
Unfortunately, some HTTP servers (CGI programs, to be more precise) send out bogus Content-Length headers, which makes Wget go wild, as it thinks not all the document was retrieved. You can spot this syndrome if Wget retries getting the same document again and again, each time claiming that the (otherwise normal) connection has closed on the very same byte.
With this option, Wget will ignore the Content-Length header—as if it never existed.
Then the wget command started working as expected:
wget --ignore-length -O epaper.pdf https://www.lokalmatador.de/epaper/ausgabe/gemeinderundschau-muehlhausen-14-2021
Here is output which I'm seeing with the ignore length:
--2021-04-17 14:56:19-- https://www.lokalmatador.de/epaper/ausgabe/gemeinderundschau-muehlhausen-14-2021
Resolving www.lokalmatador.de (www.lokalmatador.de)... 37.202.6.70
Connecting to www.lokalmatador.de (www.lokalmatador.de)|37.202.6.70|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: ignored [application/pdf]
Saving to: ‘epaper.pdf’
epaper.pdf [ <=> ] 4.39M 1.23MB/s in 3.6s
2021-04-17 14:56:23 (1.21 MB/s) - ‘epaper.pdf’ saved [4601842]

Cache-control max-age meta tag not registering

I've put this in my head section. It appears in the page source in the browser.
<meta http-equiv="Cache-Control" content="max-age=1209600">
However, when I look in the Chrome extension Live HTTP Headers, it says the following.
Cache-Control: max-age=0
Content-Encoding: gzip
Content-Length: 5849
Content-Type: text/html; charset=utf-8
Date: Sat, 05 Apr 2014 04:29:16 GMT
Expires: Sat, 05 Apr 2014 04:29:16 GMT
Last-Modified: Sat, 05 Apr 2014 03:33:19 GMT
The max-age isn't registering. I've emptied the browser cache but it makes no difference.
Any explanations? This is the site, incidentally.
UPDATES:
Firebug similarly records Cache-Control: max-age=0.
Google also makes clear here that max-age overrides the Expires header (which I don't set) and that you don't need both.
When you use tools like Live HTTP Headers, they show you the actual HTTP headers sent by the server. What they do with meta tags used to simulate HTTP headers is a different issue. We can expect any conflict to be resolved in favor of the actual headers. (This has been normatively specified in HTML specs for the Content-Type header.)
To control caching, you should (at least primarily) use server configuration. See the Caching Tutorial for Web Authors and Webmasters.
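For example, if the site runs on Apache (an assumption; the question does not say), something along these lines in the server or virtual host configuration would send the intended header, given that mod_expires is enabled. The 14 days matches the 1209600 seconds from the meta tag above.
# Sketch only: requires mod_expires
ExpiresActive On
ExpiresByType text/html "access plus 14 days"
mod_expires then emits both an Expires header and a matching Cache-Control: max-age=1209600, so the meta tag is no longer needed.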

Cache-control in response headers

I have this server response for a file that I do not want to be cached by browsers. The response has two Cache-Control headers.
Cache-Control: no-cache, no-store, must-revalidate (which is what I want and)
Cache-Control: private (which is appended by default from netscaler and the server side guys tell me they cannot remove it)
My question is which one will prevail?
HTTP/1.1 200 OK
Date: Mon, 20 Jan 2014 15:29:53 GMT
Server: Apache
Last-Modified: Fri, 17 Jan 2014 16:50:54 GMT
ETag: "682-4f02d58643780"
Accept-Ranges: bytes
Cteonnt-Length: 1666
P3P: CP="NOI ADM DEV PSAi COM NAV OUR OTR STP IND DEM"
Keep-Alive: timeout=5, max=1000
Connection: Keep-Alive
Content-Type: text/javascript
Cache-Control: no-cache, no-store, must-revalidate
Pragma: no-cache
Expires: 0
Cache-Control: private
Content-Encoding: gzip
Content-Length: 716
As per RFC 2616, setting the same header multiple times should be equivalent to setting it once with all the values separated by commas:
Multiple message-header fields with the same field-name MAY be present in a message if and only if the entire field-value for that header field is defined as a comma-separated list [i.e., #(values)]. It MUST be possible to combine the multiple header fields into one "field-name: field-value" pair, without changing the semantics of the message, by appending each subsequent field-value to the first, each separated by a comma.
So in your case, it would be equivalent to
Cache-Control: no-cache, no-store, must-revalidate, private
private will just additionally prevent the response from being cached by a shared proxy between the server and the browser, so it shouldn't have any adverse effect.
Having researched a similar issue for a client, I can tell you from my own experience that, if this content is being served through a Citrix NetScaler and compression has been enabled, anything with a content-type of text will have a Cache-Control: private value set by the NetScaler. How you're getting two entries is beyond me. However, Yolanda's answer is most likely correct. The only reason for the caveat is that RFC2616 was superseded in 2014. (See https://www.w3.org/Protocols/rfc2616/rfc2616.html)
Regarding the NetScaler adding/replacing the Cache-Control header, it appears that it can be turned off; you just have to know how. I had to open a case with Citrix to learn about CTX124717 (FAQ: Preventing the Cache-Control Response Header from being Set to private).
If compression is enabled on the NetScaler, two of the default policies (ns_cmp_content_type and ns_adv_cmp_content_type) "compress data when the response contains Content-Type header and contains text" (see http://docs.citrix.com/en-us/netscaler/10-5/ns-optimization-wrapper-10-con/ns-compression-gen-wrapper-con/ns-compression-configactions-tsk.html). Using the NetScaler API Mgr (nsapimgr) you can prevent the Compression feature from adding the Cache-Control response header (nsapimgr -ys cmp_no_cc_hdr=1).

Analysis of HTTP header

Hello, I want to first analyze and understand, and then optimize, the HTTP response headers of my site. What I get when I fetch as Google in Webmaster Tools is:
HTTP/1.1 200 OK
Date: Fri, 26 Oct 2012 17:34:36 GMT // The date and time that the message was sent
Server: Apache // A name for the server
P3P: CP="NOI ADM DEV PSAi COM NAV OUR OTRo STP IND DEM" // P3P: Does an e-commerce store need this?
ETag: c4241ffd9627342f5f6f8a4af8cc22ed // Identifies a specific version of a resource
Content-Encoding: gzip // The type of encoding used on the data
X-Content-Encoded-By: Joomla! 1.5 // This is obviously generated by Joomla; there won't be any issue if I just remove it, right?
Expires: Mon, 1 Jan 2001 00:00:00 GMT // Gives the date/time after which the response is considered stale. Since the date set here is already in the past, does this create any conflicts?
Cache-Control: post-check=0, pre-check=0 // Does this mean the site is not cached, or what?
Pragma: no-cache // any idea?
Set-Cookie: 5d962cb89e7c3329f024e48072fcb9fe=9qdp2q2fk3hdddqev02a9vpqt0; path=/ // Why do I need to set a cookie on every page?
Last-Modified: Fri, 26 Oct 2012 17:34:37 GMT
X-Powered-By: PleskLin // Can this be removed?
Cache-Control: max-age=0, must-revalidate // There are two Cache-Control headers; this needs to be fixed, right? Which one takes precedence: max-age=0, must-revalidate or post-check=0, pre-check=0?
Keep-Alive: timeout=3, max=100 // What's that?
Connection: Keep-Alive
Transfer-Encoding: chunked // Shouldn't this be deflate or gzip?
Content-Type: text/html
post-check
Defines an interval in seconds after which an entity must be checked for freshness. The check may happen after the user is shown the resource but ensures that on the next roundtrip the cached copy will be up-to-date.
http://www.rdlt.com/cache-control-post-check-pre-check.html
pre-check
Defines an interval in seconds after which an entity must be checked for freshness prior to showing the user the resource.
The Pragma: no-cache header field is an HTTP/1.0 header intended for use in requests. It is a means for the browser to tell the server and any intermediate caches that it wants a fresh version of the resource, not for the server to tell the browser not to cache the resource. Some user agents do pay attention to this header in responses, but the HTTP/1.1 RFC specifically warns against relying on this behavior.
Set-Cookie: When the user browses the same website in the future, the data stored in the cookie can be retrieved by the website to notify the website of the user's previous activity. Cookies were designed to be a reliable mechanism for websites to remember the state of the website or activity the user had taken in the past. This can include clicking particular buttons, logging in, or a record of which pages were visited by the user even months or years ago.
X-Powered-By: specifies the technology (e.g. ASP.NET, PHP, JBoss) supporting the web application. This is a common non-standard response header and can be removed (see the sketch after this list).
Keep-Alive: It is meant to reduce the number of connections for a website. Instead of creating a new connection for each image/CSS/JavaScript file in a webpage, many requests are made reusing the same connection.
Transfer-Encoding: The form of encoding used to safely transfer the entity to the user. Currently defined methods are: chunked, compress, deflate, gzip, identity.
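To illustrate the X-Powered-By and Keep-Alive points above, a rough Apache sketch follows. This is an assumption-laden example, not taken from the question: it presumes mod_headers is loaded, and depending on what actually adds these headers (Plesk, PHP, Joomla), they may also need to be disabled at the source.
# Strip the advertising headers (assumes mod_headers; may not catch headers added upstream)
Header unset X-Powered-By
Header unset X-Content-Encoded-By
# The core directives that correspond to "Keep-Alive: timeout=3, max=100"
KeepAlive On
KeepAliveTimeout 3
MaxKeepAliveRequests 100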

Keep the assets fresh in browser and cancel the freshness check request of the cache [for rails 3.1 app on heroku]

I have a lot of small images (~3 KB or so) and a lot of CSS and JS files. After the first request they get cached in the browser, but when I reload the page the browser checks the freshness of the cached content with the server (by setting If-Modified-Since etc.) and gets a 304 Not Modified response. Each of these validation requests seriously increases the page load time (say 20 times 300 ms).
How can I stop the browser from making this freshness check against the server? How can I instruct the browser to use the locally cached files/images for a certain time (say 1 hour) without revalidating them against the remote server on every reload within that period?
Sample header details for a small image fetch are below [using Rails 3.1, on Heroku]:
Response Headers
HTTP/1.1 304 Not Modified
Server: nginx/0.7.67
Date: Thu, 10 Nov 2011 17:53:33 GMT
Connection: keep-alive
Via: 1.1 varnish
X-Varnish: 1968827848
Last-Modified: Tue, 08 Nov 2011 07:36:04 GMT
Cache-Control: public, max-age=31536000
Etag: "5bda917d22f8a144c293f3f19723dbc6"
Request Headers
GET /assets/icons/flash_close_button-5bda917d22f8a144c293f3f19723dbc6.png HTTP/1.1
Host: ???.heroku.com
User-Agent: Mozilla/5.0 (X11; Linux i686; rv:6.0.1) Gecko/20100101 Firefox/6.0.1
Accept: image/png,image/*;q=0.8,*/*;q=0.5
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip, deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Connection: keep-alive
Referer: http://???.heroku.com/
Cookie: ???
If-Modified-Since: Tue, 08 Nov 2011 07:36:04 GMT
If-None-Match: "5bda917d22f8a144c293f3f19723dbc6"
Cache-Control: max-age=0
This line:
Cache-Control: public, max-age=31536000
is telling the browser not to ask for updates for a long time, and to store the files in a publicly accessible cache (which here means public to the local machine, not the general public). Your browser should therefore not really be re-checking those files. Have you tried another browser to verify this behaviour exists elsewhere?
Saying all of this though, considering that your files are coming from the Varnish cache and not your dyno, and are being returned as HTTP 304, 300 ms for 20 files sounds like a very long time. However, this should be barely perceptible to the user.
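If the goal is the one-hour lifetime mentioned in the question rather than a year, one place it can be set is config/environments/production.rb. This is only a sketch, assuming the Rails app serves its own static assets (config.serve_static_assets = true); the Varnish layer in front of a Heroku app may still apply its own policy.
# Sketch only, not taken from the question:
# Cache-Control sent for files served from public/ (including compiled assets)
config.serve_static_assets = true
config.static_cache_control = "public, max-age=3600"  # 1 hour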