When I update a file on S3 and I have CloudFront enabled, does S3 send an invalidation signal to CloudFront? Or do I need to send it myself after updating the file?
I can't seem to see an obvious answer in the documentation.
S3 doesn't send any invalidation information to CloudFront. By default CloudFront will hold content for up to the maximum time specified by the Cache-Control headers that were set when it retrieved the data from the origin (it may evict items from its cache earlier at its own discretion).
You can invalidate cache entries by creating an invalidation batch. This will cost you money: the first 1,000 paths a month are free, but beyond that it costs $0.005 per path - if you were invalidating 1,000 files a day it would cost you around $150 a month (unless you can make use of the wildcard feature). You can of course trigger this in response to an S3 event using an AWS Lambda function.
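A minimal sketch of such a Lambda handler using boto3 (the DISTRIBUTION_ID environment variable and the one-path-per-key mapping are my own illustrative choices):

import os
import time
import boto3

cloudfront = boto3.client("cloudfront")

def handler(event, context):
    # One invalidation path per object key in the S3 event notification.
    # (Keys arrive URL-encoded; decode them if your paths contain spaces etc.)
    paths = ["/" + record["s3"]["object"]["key"] for record in event["Records"]]
    cloudfront.create_invalidation(
        DistributionId=os.environ["DISTRIBUTION_ID"],   # hypothetical env var
        InvalidationBatch={
            "Paths": {"Quantity": len(paths), "Items": paths},
            "CallerReference": str(time.time()),        # must be unique per batch
        },
    )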
Another approach is to use a different path when the object changes (in effect a generational cache key). Similarly, you could append a query parameter to the URL and change that parameter whenever you want CloudFront to fetch a fresh copy (to do this you'll need to tell CloudFront to forward query string parameters - by default it ignores them).
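As a rough sketch of the generational-key idea (the helper and the MD5 fingerprint are purely illustrative; putting the fingerprint in the path instead of the query string works the same way):

import hashlib

def versioned_url(base_url, local_path):
    # Fingerprint the file contents; any change yields a new query value,
    # which CloudFront treats as a brand new cache entry.
    with open(local_path, "rb") as f:
        digest = hashlib.md5(f.read()).hexdigest()[:8]
    return f"{base_url}?v={digest}"

# e.g. versioned_url("https://dxxx.cloudfront.net/js/app.js", "build/js/app.js")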
Another way, if you only make infrequent (but large) changes, is to simply create a new CloudFront distribution.
As far as I know, all CDNs work like this.
It's why you generally use something like foo-x.y.z.ext to version assets on a CDN. I wouldn't use foo.ext?x.y.z because some browsers and proxies never cache assets whose URLs contain a query string.
In general you may want to check this out:
https://developers.google.com/speed/docs/best-practices/caching
It contains lots of best practices and goes into detail about what to do and how it works.
In regard to S3 and CloudFront, I'm not super familiar with the cache invalidation, but what Frederick Cheung mentioned is all correct.
Some providers also allow you to clear the cache directly but because of the nature of a CDN these changes are almost never instant. Another method is to set a smaller TTL (expiration headers) so assets will be refreshed more often. But I think that defeats the purpose of a CDN as well.
In our case (Edgecast), cache invalidation is possible (a manual process) and free of charge, but we rarely do this because we version our assets accordingly.
Related
What would be the best practice to refresh content that is already cached by CF?
We have a few APIs that generate JSON, and we cache the responses.
Once in a while the JSON needs to be updated, and what we do right now is purge it via the API.
https://api.cloudflare.com/client/v4/zones/dcbcd3e49376566e2a194827c689802d/purge_cache
Later on, when a user hits a page that needs the JSON, it gets cached again.
But in our case we have 100+ JSON files that we purge at once, and we want to push the new content to CF ourselves instead of waiting for users (to avoid a bad experience for them).
Right now I'm considering pinging (via HTTP request) the needed JSON endpoints right after we purge the cache.
My question is whether that is the right way, and whether CF already has an API to do what we need.
Thanks.
Currently, the purge API is the recommended way to invalidate cached content on-demand.
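If you do go the purge-and-ping route described in the question, a rough Python sketch could look like this (the API token and URL list are placeholders; the zone ID is the one from the example URL):

import requests

ZONE_ID = "dcbcd3e49376566e2a194827c689802d"       # zone from the question
API_TOKEN = "..."                                   # placeholder token
JSON_URLS = ["https://example.com/api/one.json"]    # your 100+ JSON URLs

# 1. Purge the stale copies (the API caps how many URLs one call may contain,
#    so batch a large list accordingly).
requests.post(
    f"https://api.cloudflare.com/client/v4/zones/{ZONE_ID}/purge_cache",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json={"files": JSON_URLS},
)

# 2. Re-request each URL so the cache is repopulated before real users arrive.
#    Note this only warms the data center closest to wherever this script runs.
for url in JSON_URLS:
    requests.get(url)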
Another approach for your scenario could be to look at Workers and Workers KV, and combine it with the Cloudflare API. You could have:
A Worker reading the JSON from the KV and returning it to the user.
When you have a new version of the JSON, you could use the API to create/update the JSON stored in the KV.
This setup can perform very well, since the Worker code in (1) runs in every Cloudflare data center and responds to users quickly. It is also important to note that KV is an "eventually consistent" store, so whether it is feasible depends on your specific application.
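A rough sketch of step (2), pushing a new JSON version into Workers KV through the REST API (the account ID, namespace ID, token and key are placeholders; the Worker from step (1) would read the same key through its KV binding):

import json
import requests

ACCOUNT_ID = "..."      # placeholder
NAMESPACE_ID = "..."    # placeholder KV namespace
API_TOKEN = "..."       # placeholder

def publish_json(key, payload):
    # Writes/overwrites one value in Workers KV; the Worker serves it on the
    # next read, subject to KV's eventual consistency noted above.
    url = (f"https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}"
           f"/storage/kv/namespaces/{NAMESPACE_ID}/values/{key}")
    resp = requests.put(
        url,
        headers={"Authorization": f"Bearer {API_TOKEN}",
                 "Content-Type": "application/json"},
        data=json.dumps(payload),
    )
    resp.raise_for_status()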
The first test is always slow. The second test shows the speed benefits of Cloudflare. Why is that and does this mean users will have to load the website twice?
"speed benefits of Cloudflare" could be referring to a variety of unique features that Cloudflare offers (such as image compression, lazy loading javascript, etc.). For this answer, I am assuming that you are referring to its CDN/caching capabilities.
Essentially, with a CDN, each edge node's cache is only primed from the origin server after a client requests one of your site's resources through that node.
GTmetrix is similar to a human website visitor in the sense that if it is the first to request a resource from a CDN edge node (or the first since the cached copy expired), the request has to go all the way back to the origin server rather than being answered from the closer edge node. The second time that resource is requested from the edge node, however, it will be cached and will be served much more quickly due to the reduced network latency.
I'd recommend reading up a bit more on how CDNs work if you are not already familiar with that. You will probably want to tweak your caching headers so that resources that are relatively static are rarely purged from the edge nodes, which will reduce how often requests pay this "first-timer penalty".
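Purely to illustrate what such headers look like at the origin, here is a small Flask sketch (the framework, route and one-year max-age are arbitrary choices for the example):

from flask import Flask, send_from_directory

app = Flask(__name__)

@app.route("/static-assets/<path:filename>")
def static_assets(filename):
    # A long max-age lets edge nodes (and browsers) keep the file, so only the
    # very first visitor per edge location pays the trip back to the origin.
    response = send_from_directory("static", filename)
    response.headers["Cache-Control"] = "public, max-age=31536000"
    return response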
My project uses the Presets plugin with the flag onlyAllowPresets=true.
The reason for this is to close a potential vulnerability where a script might request an image thousands of times, resizing in 1px increments or something like that.
My question is: Is this a real vulnerability? Or does ImageResizer have some kind of protection built-in?
I kind of want to set the onlyAllowPresets to false, because it's a pain in the butt to deal with all the presets in such a large project.
I only know of one instance where this kind of attack was performed. If you're that valuable of a target, I'd suggest using a firewall (or CloudFlare) that offers DDOS protection.
An attack that targets cache-misses can certainly eat a lot of CPU, but it doesn't cause paging and destroy your disk queue length (bitmaps are locked to physical ram in the default pipeline). Cached images are still typically served with a reasonable response time, so impact is usually limited.
That said, run a test, fake an attack, and see what happens under your network/storage/cpu conditions. We're always looking to improve attack handling, so feedback from more environments is great.
Most applications or CMSes will have multiple endpoints that are storage- or CPU-intensive (often a wildcard search). Not to say that this is good - it's not - but the most cost-effective layer to handle this is often the firewall or CDN. And today, most CMSes include some (often poor) form of dynamic image processing, so remember to test or disable that as well.
Request signing
If your image URLs are originating from server-side code, then there's a clean solution: sign the URLs before spitting them out, and validate them during the Config.Current.Pipeline.Rewrite event. We'd planned to have a plugin for this shipping in v4, but it was delayed - and we've only had ~3 requests for the functionality in the last 5 years.
The sketch for signing would be:
Sort querystring by key
Concatenate path and pairs
HMACSHA256 the result with a secret key
Append to end of querystring.
For verification:
Parse the query
Remove the HMAC value
Sort the query and concatenate with the path as before
HMACSHA256 the result and compare it to the value we removed.
Raise an exception if it's wrong.
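A rough Python sketch of that sign/verify scheme (framework-agnostic and purely illustrative; the "sig" parameter name, helper names and secret handling are my own choices, and ImageResizer's eventual plugin may differ):

import hashlib
import hmac
from urllib.parse import parse_qsl, urlencode

SECRET = b"replace-with-a-real-secret"   # placeholder key

def _canonical(path, params):
    # Steps 1-2: sort the querystring by key and concatenate with the path.
    return path + "?" + urlencode(sorted(params.items()))

def sign_url(path, params):
    # Steps 3-4: HMAC-SHA256 the canonical form, append it to the querystring.
    sig = hmac.new(SECRET, _canonical(path, params).encode(), hashlib.sha256).hexdigest()
    return path + "?" + urlencode(sorted(dict(params, sig=sig).items()))

def verify_url(path, query_string):
    params = dict(parse_qsl(query_string))
    supplied = params.pop("sig", None)   # remove the HMAC before re-hashing
    expected = hmac.new(SECRET, _canonical(path, params).encode(), hashlib.sha256).hexdigest()
    if supplied is None or not hmac.compare_digest(supplied, expected):
        raise ValueError("invalid or missing signature")

# e.g. sign_url("/photo.jpg", {"width": "300"}) -> "/photo.jpg?sig=...&width=300"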
Our planned implementation would allow for 'whitelisted' variations - certain values that a signature would permit the client to modify - say, for breakpoint-based width values. This would be done by replacing the targeted key/value pairs with a serialized whitelist policy prior to signature. For validation, pairs targeted by a policy would be removed prior to signature verification, and policy enforcement would happen if the signature was otherwise a match.
Perhaps you could add more detail about your workflow and what is possible?
I am using ModSecurity to look for specific values in POST parameters and blocking the request if a duplicate comes in. I am using a ModSecurity user collection to do just that. The problem is that my requests are long-running, so a single request can take more than 5 minutes. The user collection, I assume, does not get written to disk until the first request has been processed. If, during the execution of the first request, another request comes in using a duplicate value for the POST parameter, the second request does not get blocked, since the collection is not available yet. I need to avoid this situation. Can I use memory-based shared collections across requests in ModSecurity? Any other way? Snippet below:
# Initialise a per-user collection keyed on the uploaded file name and count requests for it
SecRule ARGS_NAMES "uploadfilename" "id:400000,phase:2,nolog,setuid:%{ARGS.uploadfilename},initcol:USER=%{ARGS.uploadfilename},setvar:USER.duplicaterequests=+1,expirevar:USER.duplicaterequests=3600"
# Deny with a 409 once the counter shows more than one request for the same file name
SecRule USER:duplicaterequests "@gt 1" "id:400001,phase:2,deny,status:409,msg:'Duplicate Request!'"
ErrorDocument 409 "<h1>Duplicate request!</h1><p>Looks like this is a duplicate request, if this is not on purpose, your original request is most likely still being processed. If this is on purpose, you'll need to go back, refresh the page, and re-submit the data.</p>"
ModSecurity is really not a good place to put this logic.
As you rightly state there is no guarantee when a collection is written, so even if collections were otherwise reliable (which they are not - see below), you shouldn't use them for absolutes like duplicate checks. They are OK for things like brute force or DoS checks where, for example, stopping after 11 or 12 checks rather than 10 checks isn't that big a deal. However for absolute checks, like stopping duplicates, the lack of certainty here means this is a bad place to do this check. A WAF to me should be an extra layer of defence, rather than be something you depend on to make your application work (or at least stop breaking). To me, if a duplicate request causes a real problem to the transactional integrity of the application, then those checks belong in the application rather than in the WAF.
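If the check does move into the application, one atomic "first request wins" approach is a set-if-not-exists operation against a shared store. A rough Python/Redis sketch (Redis, the key naming and the TTL are my own illustrative choices, not something ModSecurity provides):

import redis

r = redis.Redis()

def claim_upload(upload_filename, ttl_seconds=3600):
    # SET with nx=True is atomic, so two concurrent requests with the same
    # filename cannot both succeed, even while the first is still running.
    claimed = r.set(f"upload:{upload_filename}", "in-progress", nx=True, ex=ttl_seconds)
    return bool(claimed)

# In the upload handler (pseudocode):
# if not claim_upload(request.form["uploadfilename"]):
#     return "Duplicate request", 409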
In addition to this, the disk based way that collections work in ModSecurity, causes lots of problems - especially when multiple processes/threads try to access them at once - which make them unreliable both for persisting data, and for removing persisted data. Many folks on the ModSecurity and OWASP ModSecurity CRS mailing lists have seen errors in the log file when ModSecurity tried to automatically clean up collections, and so have seen collections files grow and grow until it starts to have detrimental effects on Apache. In general I don't recommend user collections for production usage - especially for web servers with any volume.
There was a memcache-based version of ModSecurity collections created, which stopped using the disk-based SDBM format and may have addressed a lot of the above issues; however, it was never completed, though it may be part of ModSecurity v3. I still disagree, however, that a WAF is the place to do this check.
The web service that I want to run on AWS has to store and retrieve user data, present it to the user via a website, and needs to be able to parse the sitemaps of a few thousand sites every 10 minutes or so. Which components of AWS, such as S3, EC2, and CloudFront, do I need to use? A short synopsis of the purpose of each component would be nice. :)
I particularly do not understand the purpose of the Simple Queue Service.
You might, for example, use EC2 (on-demand, scalable, VPS) to host the actual application and S3 (networked storage) to store the data. You would probably not need Cloudfront (geographically optimized content mirroring).
We use SQS (Simple Queue Service) to queue tasks we want performed asynchronously, i.e. without making the user wait for them to complete. As it turns out, SQS becomes incredibly expensive if your site has even modest traffic, so we'll be handling queueing on one of our own boxes soon.
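As a rough illustration of that queueing pattern with boto3 (the queue name and message shape are made up for the example):

import json
import boto3

sqs = boto3.resource("sqs")
queue = sqs.get_queue_by_name(QueueName="sitemap-parse-jobs")   # placeholder queue

# Web tier: enqueue the work and return to the user immediately.
queue.send_message(MessageBody=json.dumps({"site": "https://example.com/sitemap.xml"}))

# Worker tier: poll and process in the background.
for message in queue.receive_messages(WaitTimeSeconds=20, MaxNumberOfMessages=10):
    job = json.loads(message.body)
    # ... fetch and parse job["site"] here ...
    message.delete()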
Another service you might want to look at is the Elastic Block Store (EBS), which provides persistent storage for an EC2 instance. The default storage that you get with an instance is not persisted if you shut down the instance, so I'd recommend storing all your critical data on EBS so that you can recover quickly if an instance goes down.
SimpleDB might also be useful for your service.
Have a look at the Wikipedia entry for AWS to learn more about each service.