CloudFront and mod_pagespeed - wrong content received - apache

I am using CloudFront with mod_pagespeed running on the server.
When I update a CSS file or flush the cache I see problematic behavior: the first refresh in the browser returns the original CSS (this is fine). When I refresh a second time I get the correct rewritten CSS file name, but the content of the file served by CloudFront is still the original, not the rewritten content.
Why would this happen?
Any idea how to fix this?
Update:
For some reason it just stopped happening... I don't know why.

SimonW, since your original post a feature has been added to PageSpeed (in March 2013, in version 1.2.24.1) to deal with this issue directly. The directive is set as follows:
Apache:
ModPagespeedRewriteDeadlinePerFlushMs deadline_value_in_milliseconds
Nginx:
pagespeed RewriteDeadlinePerFlushMs deadline_value_in_milliseconds;
The docs describe the directive as follows (emphasis mine):
When PageSpeed attempts to rewrite an uncached (or expired) resource
it will wait for up to 10ms per flush window (by default) for it to
finish and return the optimized resource if it's available. If it has
not completed within that time the original (unoptimized) resource is
returned and the optimizer is moved to the background for future
requests. The following directive can be applied to change the
deadline. Increasing this value will increase page latency, but might
reduce load time (for instance on a bandwidth-constrained link where
it's worth waiting for image compression to complete). Note that a
value less than or equal to zero will cause PageSpeed to wait
indefinitely.
So, if you specify a value of 0 for deadline_value_in_milliseconds, you should always get the fully optimized page. I would caution that the latency can be high in some cases. In my case I really wanted this behavior, even with the latency concern, because the content was to be cached on my CDN's edge servers, and I therefore wanted the most optimized version possible to be served to the CDN for caching.
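For example, using the directives quoted above (0 means wait indefinitely):
Apache:
ModPagespeedRewriteDeadlinePerFlushMs 0
Nginx:
pagespeed RewriteDeadlinePerFlushMs 0;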

This could happen if you have multiple backend servers and CloudFront is hitting a different one than the HTML request went through. In that case the resource was rewritten on the HTML server, but not on the other server. There is a short timeout and if the other server doesn't finish the rewrite in that time, it will just serve the original content with Cache-Control: private,max-age=300. It's possible CloudFront caches that for a little while (even though obviously it shouldn't), but then eventually re-requests the resource from your backend and gets the correctly rewritten version this time.
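One way to check for this is to request the rewritten resource directly from each backend and compare the response headers (the hostnames and the .pagespeed. file name below are placeholders, not your actual URLs):
# ask each backend for the rewritten CSS and look at Cache-Control
curl -sI http://backend1.example.com/styles.css.pagespeed.cf.HASH.css | grep -i cache-control
curl -sI http://backend2.example.com/styles.css.pagespeed.cf.HASH.css | grep -i cache-control
A backend that has not finished the rewrite yet will typically answer with Cache-Control: private,max-age=300 and the original content.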


Fail2Ban ignore 404 of local redirect

Assume a bad actor scripts access to an Apache server to probe for vulnerabilities. With Fail2Ban we can catch some number of 404's and ban the IP. Now assume a single web page has a bad local reference to a CSS, JS, or image file. Repeated hits by the same legitimate site visitor will result in some number of 404s, and possibly an IP ban.
Is there a good way to separate these local requests from remote so that we don't ban the valued visitor?
I know all requests are remote, in that a page gets returned to a browser and the content of the page triggers more requests for assets. The thing is, how do we know the difference between that kind of page load pattern, and a script query for the same resource?
If we do know that a request is coming in based on a link that we just generated, we could do a 302 redirect rather than returning a 404, thus avoiding the banning process.
The HTTP Referer header could be used: if the Referer has the same origin as the requested page, or matches the local site's FQDN, then we should not ban. But that header can be spoofed, so is it a good tool to use?
I'm thinking cookies could be used, or a session nonce, so that a request for assets arriving without a current session cookie could be treated differently. But I don't know if something like that is a built-in feature.
The best solution is obviously to make sure that all pages generated on a site include valid references back to the site, but we all know that's not always possible. Some CMSs add version info to files, or adjust image paths to include an image size based on the client device/size. Any of these generated references might simply be wrong until we can find and fix the code that creates them. Between the time we deploy something faulty and the time we fix it, I'm concerned about accidentally banning legitimate visitors with Fail2Ban (and other tools) that don't factor in where the request originates.
Is there another solution to this challenge? Thanks!
how do we know the difference between that kind of page load pattern
In the normal case you don't (at least not without some kind of whitelist or blacklist).
But you do know which URI or path segments, file extensions, etc. would rarely, if ever, be the target of such attack vectors, and those you can ignore.
Some CMS add version info to files, or they adjust image paths to include an image size based on the client device/size.
But you surely know which prefixes are correct, so a regex allowing certain path segments is possible. For instance, this one:
# regex ignoring site and cms paths:
^<HOST> -[^"]*\"[A-Z]{3,}\s+/(?!site/|cms/)\S+ HTTP/[^"]+" 40\d\s\d+
will ignore this one:
192.0.2.1 - - [02/Mar/2021:18:01:06] "GET /site/style.css?ver=1.0 HTTP/1.1" 404 469
and match this one:
192.0.2.1 - - [02/Mar/2021:18:01:06] "GET /xampp/phpmyadmin/scripts/setup.php HTTP/1.1" 404 469
Similarly, you can write a regex with a negative lookahead to ignore certain extensions like .css or .js, or arguments like ?ver=1.0.
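A sketch of such a regex (the extensions and the ?ver= argument here are only examples; adjust them to your application):
# regex ignoring 404s for .css/.js files and ?ver=... arguments:
^<HOST> -[^"]*\"[A-Z]{3,}\s+/(?!\S*\.(?:css|js)\b)(?!\S*\?ver=)\S+ HTTP/[^"]+" 40\d\s\d+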
Another possibility would be to create a special fallback location that logs completely bogus requests to a separate log file (not into the access or error logs), as described in wiki :: Best practice. That way it is possible to catch evildoers whose clearly wrong URIs don't match any proper location handled by the web server.
Or simply disable logging of 404s in locations known to be valid (paths, prefixes, extensions, whatever).
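With Apache, something along these lines would keep requests for known static-asset extensions out of the log that Fail2Ban watches (note this suppresses logging of those requests entirely, not only their 404s; the extensions and log path are illustrative):
# mark requests for known-good static assets ...
SetEnvIf Request_URI "\.(?:css|js|png|jpg|gif|ico)$" static_asset
# ... and exclude them from the access log that Fail2Ban reads
CustomLog ${APACHE_LOG_DIR}/access.log combined env=!static_asset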
To reduce or completely avoid false positives you can, at first, increase maxretry or reduce findtime and observe the behavior for a while (so evildoers with too many attempts get banned, while legitimate users whose "broken" requests cause 404s, but not in large numbers, are still ignored). Meanwhile you can accumulate the full list of "valid" 404 requests of your application, in order to write a more precise regex or to filter them in certain locations.
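For instance, in jail.local (the jail name and the values here are only illustrative):
[apache-404]
# require more 404s within a shorter window before banning,
# while you collect the list of "valid" 404s your site produces
maxretry = 10
findtime = 120
bantime  = 600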

DNS resolves correctly using command line tools but fails on browser

The dig, wget, nslookup and curl commands work perfectly for a specific URL I pointed to another server less than 24 hours ago.
The problem is that it just refuses to resolve in the browser (Chrome, Safari and Firefox). The strangest part is that it is successfully resolved by Postman (testing the OPTIONS and GET methods separately), but it still doesn't return a proper response on the browser side of things.
DNS checks come back positive, which is when I started suspecting that the problem actually lies in the headers of the HTTP requests being sent, along with the fact that different responses are returned for requests that don't include the default browser headers (issued through the various command-line tools and Postman) and those that do (issued by the browsers automatically, or manually via the dev tools).
After fully flushing the local system's DNS cache, including the browsers', and even trying another device on another network, I still get no response in the browser.
I kept going and attempted to verify this with a VPN (locally, which didn't work) and an online web proxy tool (which did work).
Finally, I extracted the router's default DNS server address and used nslookup to look up the URL again, this time explicitly specifying the desired DNS server (the one just mentioned). After getting a successful response with the correct values, I am now pretty much sure the HTTP request is what's causing the problem.
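For illustration, the lookup looked roughly like this (the domain and the DNS server address are placeholders, not my real values):
nslookup example.dev 192.168.1.1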
The URL is hosted using the Amazon S3 static hosting option, which I have used many times before without problems, with this exact same configuration. Looking into recently added changes/features suggested that I may need to explicitly set a CORS policy on the newly created bucket, on top of the usual public access policy that is required.
After applying that as well, it still doesn't seem to work.
As a quick change of direction that might make what's going on a little clearer (I had started to think that the browser might not be getting the correct Content-Type header in the response, which should be text/html, and therefore might not handle the URL as expected), I went ahead and applied a 301 redirect on the S3 bucket instead of static file hosting. Again, it all works perfectly through the command-line tools, but not through the browsers.
Anyway, the browser just doesn't seem to complete any of the requests being sent to the URL.
It might be that the OPTIONS pre-flight request fails, so the browser never goes on to issue the GET request, or that the URL is not being found along the DNS route the browser is taking; it is currently unclear to me which of these, if either, is the case.
Any ideas? (Besides the fact that it sometimes just takes longer for some DNS servers along the chosen route to update/refresh their cache, which doesn't appear to be affecting my local machine's DNS path in this specific case. That, said with caution, was verified by checking the different parts of DNS configuration and prioritization on my system (Mac OS X), including the fact that the response does come back with the correct address.)
Found my answer here:
https://serverfault.com/questions/942030/aws-s3-static-hosting-how-to-debug-connection-timeout
As linked there, more details can be found here:
Non-Authoritative-Reason header field [HTTP]
Solution & explanation: Because of the domain extension I purchased (.dev), Chrome was silently using HTTPS: the URL falls under Chrome's HTTP Strict Transport Security (HSTS) preloading, since all .dev domains must be served over HTTPS only. Therefore the issue kept showing up, even when explicitly typing http:// into the address bar.
This can be addressed by putting a CloudFront distribution with HTTPS support in front of the S3 static hosting, as usual (but it should still be noted that HSTS listings can cause this in other cases as well; this one is caused by the .dev domain extension).
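To illustrate the difference (using a placeholder domain here): a plain-HTTP request from the command line reaches the S3 website endpoint just fine, because curl does not consult Chrome's HSTS preload list, while Chrome upgrades any .dev URL to https:// before the request is even sent:
curl -I http://example.dev/index.html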
Useful Resources (for debugging purposes)
In addition to what is stated here:
https://gist.github.com/stollcri/7c09bafc97223481920e
You can issue a lookup query (and also add or delete entries in your local set of HSTS listings) through Chrome's internal settings page at chrome://net-internals/#hsts
You can also check the current listings here: https://hstspreload.org/

Azure CDN caching with query string

I am curious about an issue I am facing at the moment with Azure CDN, and I don't have an answer for it. I have a CDN profile and endpoint configured to cache some content stored in a storage container. For the caching behavior I am using the default (ignore query strings). I modified one file in the container and was able to retrieve the modified file from the container, but not from the CDN edge, since the edge was returning the previously cached version of the file. So I purged the file in the CDN, and after the purge I was able to get the modified version of the file. But if I request the file from the CDN edge with any query string parameter, I get the original version of the file instead of the modified version.
Example requesting the file via edge:
w/o qs: https://#storage_account#/#file_path#/hh.min.css -> It gives me the modified version
w qs: https://#storage_account#/#file_path#/hh.min.css?v=0.5 -> It gives me the original version
w qs (2): https://#storage_account#/#file_path#/hh.min.css?a=b -> It gives me the original version
Any idea why this is happening?
Thanks.
Most likely what's happening is that the request with a query string is being served from a cached asset, as described in the documentation:
Ignore query strings: Default mode. In this mode, the CDN point-of-presence (POP) node passes the query strings from the requestor to the origin server on the first request and caches the asset. All subsequent requests for the asset that are served from the POP ignore the query strings until the cached asset expires.
So my guess is that the cached asset has not expired yet. To avoid this issue, you should consider bypassing caching for query strings:
Bypass caching for query strings: In this mode, requests with query strings are not cached at the CDN POP node. The POP node retrieves the asset directly from the origin server and passes it to the requestor with each request.
If the above option results in latency, I'd recommend adjusting the caching rules.
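If you manage the endpoint with the Azure CLI, switching the behavior and purging the stale copies should look roughly like this (the resource names and path are placeholders; double-check the parameters against the current CLI documentation):
# switch the endpoint to bypass caching for query strings
az cdn endpoint update --resource-group my-rg --profile-name my-cdn-profile --name my-endpoint --query-string-caching-behavior BypassCaching
# purge the already-cached copies of the file
az cdn endpoint purge --resource-group my-rg --profile-name my-cdn-profile --name my-endpoint --content-paths '/<file_path>/hh.min.css'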

Apache statically and dynamically compressed content

I have a website which has both dynamically generated (PHP) and static content. Setting up Apache to transparently compress everything in accordance with content negotiation is a trifle.
However, I am interested in not compressing static content that rarely, if ever, changes, but instead serving precompressed data in an "asis" manner.
The idea behind this is to reduce latency and save CPU power, and at the same time compress better. Basically, instead of compressing the same data over and over again, I would like the server to sendfile the contents without touching it, but with proper headers. And, ideally, it would work seamlessly with .html and .html.gz files, using transparent compression in one case and none in the other.
There is mod_asis, but it will not provide proper headers (most importantly the ones affecting cache and proxy operation), and it is agnostic of content negotiation. Adding a content-encoding for .gz seems like the right thing, but it does nothing: the .html.gz pages appear as downloads (maybe this interferes with some default typemap?).
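(For reference, by "adding a content-encoding for .gz" I mean something along the lines of this one-liner:)
AddEncoding gzip .gz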
It seems that the gatling webserver does just what I want in this respect, but I'd really prefer staying with Apache, because despite anything one could blame Apache for, it's the one mainstream server that has been working kind of OK for many years.
Another workaround would be to serve the static content with another server on a different port or subdomain, but I'd prefer if it just worked "invisibly", and if the system was not made more complex than necessary.
Is there a well-known configuration idiom that makes Apache behave in the way indicated?

HTTP Content-type header for cached files

Using Apache with mod_rewrite, when I load a .css or .js file and view the HTTP headers, the Content-type is only set correctly the first time I load it - subsequent refreshes are missing Content-type altogether and it's creating some problems for me.
I can get around this by appending a random query string value to the end of each filename, eg. http://www.site.com/script.js?12345
However, I don't want to have to do that, since caching is good and all I want is for the Content-Type to be present. I've tried using a RewriteRule to force the type, but that still didn't solve the problem. Any ideas?
Thanks, Brian
The answer depends on information you've not provided here, specifically: where are you seeing these headers?
Unless it's from sniffing the network traffic between the browser and the server, you can't be sure whether you are looking at a real request to the server or a request that has been satisfied from the cache. Indeed, changing the URL as you describe is a very simple way to force a reload from the server rather than a load from the cache.
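An easy way to see what the server itself actually sends, bypassing the browser cache entirely, is a command-line request against the example URL from the question:
curl -I http://www.site.com/script.js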
I don't think it's as broken as you seem to think. Fire up Wireshark and see for yourself, or just disable caching for these content types.
C.