Browser cache, Last-Modified header

Let's assume there are two URLs:
http://example.com/myaccount?user=12345
http://example.com/myaccount?user=34567
As far as I understand the browser will cache them separately and will not use the Last-Modified header from the first request to revalidate the second.
Is it possible to force the browser to use the Last-Modified header in this case?
Could you please explain why it works this way?

I think that in certain situations the web server will ignore the query string. You can try this, for instance, on an out-of-the-box Apache server: run curl -I http://example.com/styles.css | grep Last-Modified and then run the same command for http://example.com/styles.css?v=2. Assuming the file exists, you'll probably get the same Last-Modified timestamp.
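For example, the comparison can be done from a shell like this (styles.css is just a stand-in for any static file on your server):
# fetch only the headers and show the Last-Modified line, with and without a query string
curl -sI http://example.com/styles.css | grep -i Last-Modified
curl -sI "http://example.com/styles.css?v=2" | grep -i Last-Modified
If both commands print the same timestamp, the server itself is ignoring the query string.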
It is probably the browsers (that's my guess, anyway) that treat ?v=2 as a different file, or as a file with updated content. Also, I think that most Content Delivery Networks are configured that way, which allows them to serve a fresh copy of a file if its query string differs.
It's an interesting question anyway. I'll read up on it more. I hope somebody can explain it in more detail here.

Archiving an old PHP website: will any webhost let me totally disable query string support?

I want to archive an old website which was built with PHP. Its URLs are full of .phps and query strings.
I don't want anything to actually change from the perspective of the visitor -- the URLs should remain the same. The only actual difference is that it will no longer be interactive or dynamic.
I ran wget --recursive to spider the site and grab all the static content. So now I have thousands of files such as page.php?param1=a&param2=b. I want to serve them up as they were before, so that means they'll mostly have Content-Type: text/html, and the webserver needs to treat ? and & in the URL as literal ? and & in the files it looks up on disk -- in other words it needs to not support query strings.
And ideally I'd like to host it for free.
My first thought was Netlify, but deployment on Netlify fails if any files have ? in their filename. I'm also concerned that I may not be able to tell it that most of these files are to be served as text/html (and one as application/rss+xml) even though there's no clue about that in their filenames.
I then considered https://surge.sh/, but hit exactly the same problems.
I then tried AWS S3. It's not free but it's pretty close. I got further here: I was able to attach metadata to the files I was uploading so each would have the correct content type, and it doesn't mind the files having ? and & in their filenames. However, its webserver interprets ?... as a query string, and it looks up and serves the file without that suffix. I can't find any way to disable query strings.
Did I miss anything -- is there a way to make any of the above hosts act the way I want them to?
Is there another host which will fit the bill?
If all else fails, I'll find a way to transform all the filenames and all the links between the files. I found how to get wget to transform ? to #, which may be good enough. It would be a shame to go this route, however, since then the URLs are all changing.
I found a solution with Netlify.
I added the wget options --adjust-extension and --restrict-file-names=windows.
The --adjust-extension part adds .html at the end of filenames which were served as HTML but didn't already have that extension, so now we have for example index.php.html. This was the simplest way to get Netlify to serve these files as HTML. It may be possible to skip this and manually specify the content types of these files.
The --restrict-file-names=windows alters filenames in a few ways, the most important of which is that it replaces ? with #. This is needed since Netlify doesn't let us deploy files with ? in the name. It's a bit of a hack; this is not really what this option is meant for.
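Put together, the crawl command looks something like this (http://example.com/ stands in for the site being archived):
wget --recursive --adjust-extension --restrict-file-names=windows http://example.com/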
This gives static files with names like myfile.php#param1=value1&param2=value2.html and myfile.php.html.
I did some cleanup. For example, I needed to adjust a few link and resource paths to be absolute rather than relative due to how Netlify manages presence or lack of trailing slashes.
I wrote a _redirects file to define URL rewriting rules. As the Netlify redirect options documentation shows, we can test for specific query parameters and capture their values. We can use those values in the destinations, and we can specify a 200 code, which makes Netlify handle it as a rewrite rather than a redirect (i.e. the visitor still sees the original URL). An exclamation mark is needed after the 200 code if a "query-string-less" version (such as mypage.php.html) exists, to tell Netlify we are intentionally shadowing it.
/mypage.php param1=:param1 param2=:param2 /mypage.php#param1=:param1&param2=:param2.html 200!
/mypage.php param1=:param1 /mypage.php#param1=:param1.html 200!
/mypage.php param2=:param2 /mypage.php#param2=:param2.html 200!
If not all query parameter combinations are actually used in the dumped files, not all of the redirect lines need to be included of course.
There's no need for a final /mypage.php /mypage.php.html 200 line, since Netlify automatically looks for a file with a .html extension added to the requested URL and serves it if found.
I wrote a _headers file to set the content type of my RSS file:
/rss.php
Content-Type: application/rss+xml
I hope this helps somebody.

Apache mod_proxy: write POST to log file

Is there any way to capture POST requests and write them to a log when running Apache mod_proxy (or any other module)?
For example, I have a CMS behind Apache mod_proxy and I want to capture the login form fields (which are sent via POST) in the Apache log file. Is that possible?
Thanks :).
Please take a look at mod_dumpio. All input and/or all output will be logged into error.log.
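For Apache 2.4, a minimal sketch of the configuration might look like this (the module path and exact way of enabling it vary by distribution):
# load the module if it is not already enabled
LoadModule dumpio_module modules/mod_dumpio.so
# dump all request input (including POST bodies) to the error log
DumpIOInput On
# mod_dumpio logs at trace level, so raise the log level for this module only
LogLevel dumpio:trace7
Bear in mind this writes everything, including passwords, in clear text to the error log, so only use it for debugging.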
mod_security can log POST data too, but it is a little more complex.
mod_log_post (a stripped-down version of mod_security) may also suit your needs, but it has less documentation and support. Still, it might work for your purpose.
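If you do go the mod_security route, a rough sketch of the relevant directives is below (this assumes mod_security2 is already installed and loaded; the log path is just an example):
SecRuleEngine On
# buffer request bodies so they can be inspected and logged
SecRequestBodyAccess On
SecAuditEngine On
SecAuditLog /var/log/apache2/modsec_audit.log
# part I of the audit log contains the request (POST) body
SecAuditLogParts ABIJDEFHZ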

Can we detect if a site is on a CDN?

Is there a way to detect if a site is on a Content Delivery Network and, if so, can we tell which service they are using?
A method that works from the command line is to use the 'host' command with the -a flag to see the DNS records, e.g.
host -a www.visitbritain.com
Returns:
www.visitbritain.com. 0 IN CNAME d18sjq5nyxcof4.cloudfront.net.
Here you can see that the CNAME entry tells us that the site is using cloudfront as the CDN.
Just take a look at the URLs of the images (and other media) on the site.
Reverse-lookup the IPs of the hostnames you see there and you will see who owns them.
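For example (the hostname and IP below are placeholders):
# resolve a media hostname found in the page source
host images.cdn.example.com
# reverse-lookup the resulting IP, and check who owns the netblock
host 203.0.113.10
whois 203.0.113.10
The PTR record and the whois output for the netblock will usually name the CDN operator.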
I built this little tool to identify the CDN used by a site or a domain, feel free to try it.
The URL: http://www.whatsmycdn.com/
You might also be able to tell from the HTTP headers of the media if the URL doesn't give it away. For example, media served by SimpleCDN has Server: SimpleCDN 5.6a4 in its headers.
CDN Planet now has its CDN Finder tool on GitHub:
http://www.cdnplanet.com/blog/better-cdn-finder/ The tool installs on the command line and lets you feed in host names and check whether they use a CDN.
If a website is using GCP CDN, you can simply check it using curl:
curl -I https://<site url>
In the response you can find headers like the following:
x-goog-metageneration: 2
x-goog-stored-content-encoding: identity
x-goog-stored-content-length: 17393
x-goog-meta-object-id: 11602
x-goog-meta-source-id: 013dea516b21eedfd422a05b96e2c3e4
x-goog-meta-file-hash: cf3690283997e18819b224c6c094f26c
Yes, you can find out by running:
host -a www.website.com
Apart from the excellent answers already posted here, which include some direct methods that may or may not work for every website out there, there is also an indirect way to see whether a CDN is in place, especially if it's your own website and you want to know whether you are getting what you are paying for!
The promise of a CDN is that connections from your users are terminated closer to them, so they get less TCP/TLS connection-establishment overhead, and static content is cached closer to them so that it loads faster and puts less strain on your origin servers.
To verify this, you can take measurements of site load times across the globe and see if all users get similar load times. No, you don't have to get a machine everywhere in the world to do that! Someone has already done it for you.
Head to https://prober.tech/ and enter the URL you wish to test for load times.
Because this site itself is behind Cloudflare's CDN, you can put that link itself in the test box and use it as a baseline!
More information on using the tool can be found here
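If you just want a rough number from a single vantage point, curl's timing variables show the connection setup time and time to first byte; run it from a few different networks or regions to see the CDN effect (the URL is a placeholder):
# -o /dev/null discards the body, -w prints the selected timing variables
curl -o /dev/null -s -w 'connect: %{time_connect}s  first byte: %{time_starttransfer}s  total: %{time_total}s\n' https://www.example.com/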

HTTP Content-type header for cached files

Using Apache with mod_rewrite, when I load a .css or .js file and view the HTTP headers, the Content-type is only set correctly the first time I load it - subsequent refreshes are missing Content-type altogether and it's creating some problems for me.
I can get around this by appending a random query string value to the end of each filename, e.g. http://www.site.com/script.js?12345
However, I don't want to have to do that, since caching is good and all I want is for the Content-type to be present. I've tried using a RewriteRule to force the type but still didn't solve the problem. Any ideas?
Thanks, Brian
The answer depends on information you've not provided here, specifically: where are you seeing these headers?
Unless it's from sniffing the network traffic between the browser and the server, you can't be sure whether you are looking at a real request to the server or a request that has been satisfied from the cache. Indeed, changing the URL as you describe is a very simple way to force a reload from the server rather than a load from the cache.
I don't think it's as broken as you seem to think. Fire up Wireshark and see for yourself, or just disable caching for these content types.
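As a quick check without Wireshark, you can request the file from the command line, which always bypasses the browser cache, so the header shown is exactly what the server sends (URL taken from the question):
curl -sI http://www.site.com/script.js | grep -i Content-Type
If that prints the header on every run, the "missing" Content-type in the browser is almost certainly a response served from the cache (or a 304 revalidation) rather than a server-side problem.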
C.

Configuring ETags

I am using YSlow as a simple speed benchmarking tool and I came across a really confusing concept: the ETag.
So the main problem is: how do I configure ETags? My grade in YSlow says:
There are 19 components with misconfigured ETags
* http://thehotelinventory.com/media/js/jquery.min.js
* http://thehotelinventory.com/media/js/jquery.colorbox.min.js
* http://thehotelinventory.com/media/js/easyslider.min.js
* http://thehotelinventory.com/media/js/jquery.tools.min.js
* http://thehotelinventory.com/media/js/custom.min.js
* http://thehotelinventory.com/media/js/jquery.validate.min.js
* http://thehotelinventory.com/media/images/colorbox/loading_background.png
* http://thehotelinventory.com/media/images/productheaderbg.jpg
* http://thehotelinventory.com/media/images/buttons/field-bg. //etc
I browsed through the developer.yahoo.com guidelines on website optimization, yet I can't really understand the thing with ETags.
This page shows how to disable ETags for IIS and this page shows how to do it for Apache.
Assuming you are running Apache...
You can set up a simple ETag like this:
FileETag MTime Size
If you have multiple servers serving the same content, you will probably want to disable ETags instead, because the values that go into the tag (by default the file's inode, and potentially even the modification time) can differ from server to server, so clients behind a load balancer would rarely get a 304.
FileETag None
Put the above code in your httpd.conf (if you have access), otherwise you can put it in .htaccess.
Think of E-Tags as a sort of hash. When a browser makes a request for a resource, it sends along the E-tag of the file version it has cached. If the server decides that the files are similar enough (there are "strong" and "weak" versions of E-Tags so it's not always a simple comparison check) it will send a "304 Not Modified" response to the client, rather than the resource itself. This translates into a speed boost, since it prevents bandwidth from being wasted on unchanged files.
E-Tags are sent via HTTP headers.
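For illustration, a typical exchange looks roughly like this (the tag value is made up; the path is one of the files from the question):
First response from the server:
HTTP/1.1 200 OK
ETag: "686897696a7c876b7e"
A later conditional request from the browser:
GET /media/js/custom.min.js HTTP/1.1
If-None-Match: "686897696a7c876b7e"
The server's reply when the file has not changed (no body is sent, just headers):
HTTP/1.1 304 Not Modified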
There's a good example of E-Tags at work (and also how to disable them for Apache) here:
http://www.askapache.com/htaccess/apache-speed-etags.html
By removing the ETag header, you prevent caches and browsers from validating files against it, so they are forced to rely on your Cache-Control and Expires headers.
Add these lines to .htaccess:
<IfModule mod_headers.c>
Header unset ETag
</IfModule>
FileETag None
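Since caches then rely on Cache-Control/Expires alone, it makes sense to pair this with far-future expiry headers, for example via mod_expires (a sketch, assuming the module is enabled; adjust the types and lifetimes to your content):
<IfModule mod_expires.c>
ExpiresActive On
ExpiresByType text/css "access plus 1 year"
ExpiresByType application/javascript "access plus 1 year"
ExpiresByType image/png "access plus 1 year"
</IfModule>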
Go straight to the source: YSlow provides guidance on all of its advice, including how to configure ETags.
The best way to configure your ETags is to remove them. For static files, far-future expiration dates are a much better approach.
The way to remove them depends on the web server you're using. For IIS 7, it can be done with a simple HttpModule.
Entity tags are a feature of the HTTP protocol, see http://www.ietf.org/rfc/rfc2616.txt
Entity tags are used for comparing two or more entities from the same requested resource. HTTP/1.1 uses entity tags in the ETag (section 14.19), If-Match (section 14.24), If-None-Match (section 14.26), and If-Range (section 14.27) header fields. The definition of how they are used and compared as cache validators is in section 13.3.3. An entity tag consists of an opaque quoted string, possibly prefixed by a weakness indicator.
Wikipedia is man's best friend :)
http://en.wikipedia.org/wiki/HTTP_ETag
Basically a hash, as ShZ said, that should be unique (or nearly so) for a given file.