I am using an S3 bucket to store a bunch of product images for a large web site. These images are served through CloudFront with the S3 bucket as the origin. I have noticed that CloudFront does not put an expiration header on the images, even though I have set the distribution behavior to customize the cache headers and set long min, max, and default TTLs in CloudFront.
I understand that I can put an expiration on the S3 objects; however, this is going to be quite impractical, as I have millions of images. I was hoping that CloudFront would do me the honors of adding this header for me, but it does not.
So my question is: is the only way to get this expiration header to apply it to every S3 object, or am I missing something in CloudFront that will do it for me?
CloudFront's TTL configuration only controls the amount of time CloudFront keeps the object in the cache.
It doesn't add any headers.
So, yes, you'll need to set these on the objects in S3.
Note that Cache-Control: is usually considered a better choice than Expires:.
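If you do end up stamping the header onto the existing objects, the usual approach is an in-place self-copy, since S3 has no call to edit metadata directly. Here's a minimal sketch assuming boto3; the bucket and key names are hypothetical, and for millions of objects you'd drive this from an inventory listing or S3 Batch Operations:

```python
import boto3

# A self-copy with MetadataDirective="REPLACE" rewrites the object's
# metadata, which is how you attach Cache-Control to an existing object.
s3 = boto3.client("s3")

s3.copy_object(
    Bucket="my-product-images",    # hypothetical bucket
    Key="products/1234.jpg",       # hypothetical key
    CopySource={"Bucket": "my-product-images", "Key": "products/1234.jpg"},
    MetadataDirective="REPLACE",
    CacheControl="public, max-age=31536000",  # one year
    ContentType="image/jpeg",      # re-set explicitly, or it is lost on REPLACE
)
```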
An alternative that avoids updating the objects is to configure a proxy server in EC2 in the same region as the bucket, and let the server add the headers as the responses pass through it.
Request: CloudFront >> Proxy >> S3
Response: S3 >> Proxy >> CloudFront
...for what it's worth.
I have built a system with product templates. A brand can override a template to create a product. Images can be uploaded to the template and then overridden on the product. The product images are uploaded to the corresponding brand's S3 bucket, but the product template images are uploaded to a generic S3 bucket.
Is there a way to make the brand's bucket fall back to the generic bucket if a file URL returns a 404 or 403, similar to the hosted-website redirect rules? These are just buckets with images, so they aren't hosted websites, and I was hoping to avoid turning that on.
There is not a way to do this with S3 alone, but it can be done with CloudFront, in conjunction with two S3 buckets, configured in an origin group with appropriate origin failover settings, so that 403/404 errors from the first bucket cause CloudFront to make a follow-up request from the second bucket.
After you configure origin failover for a cache behavior, CloudFront does the following for viewer requests:
When there’s a cache hit, CloudFront returns the requested file.
When there’s a cache miss, CloudFront routes the request to the primary origin that you identified in the origin group.
When a status code that has not been configured for failover is returned, such as an HTTP 2xx or HTTP 3xx status code, CloudFront serves the requested content.
When the primary origin returns an HTTP status code that you’ve configured for failover, or after a timeout, CloudFront routes the request to the backup origin in the origin group.
https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/high_availability_origin_failover.html
This seems to be the behavior you're describing. It means, of course, that cache misses that need to fall back to the second bucket will take additional time to serve, but cache hits see no delay, since CloudFront only goes through the "try one, then try the other" sequence on cache misses. It also means that you'll be paying for some traffic on the primary bucket for objects that aren't present, so it makes the most sense if the primary bucket will have the object more often than not.
This solution does not redirect the browser -- CloudFront follows the second path internally before returning a response -- so be mindful of the Cache-Control settings you attach to the fallback objects when you upload them: adding a (previously absent) primary object after a fallback object has already been fetched and cached (by either CloudFront or the browser) will not be visible until the cached copies expire.
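For reference, here is roughly what the failover criteria look like when configured programmatically -- a sketch of one entry in the OriginGroups section of a boto3 DistributionConfig. The origin IDs are hypothetical and must match Origins defined elsewhere in the same config:

```python
# One item of DistributionConfig["OriginGroups"]["Items"]: fail over from
# the brand bucket to the generic bucket on 403/404 responses.
origin_group = {
    "Id": "brand-with-generic-fallback",
    "FailoverCriteria": {
        "StatusCodes": {"Quantity": 2, "Items": [403, 404]},
    },
    "Members": {
        "Quantity": 2,
        "Items": [
            {"OriginId": "brand-bucket-origin"},    # primary: the brand's bucket
            {"OriginId": "generic-bucket-origin"},  # fallback: the generic bucket
        ],
    },
}
```

The cache behavior then points its TargetOriginId at the origin group's Id instead of at a single origin.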
I'm getting cloudfront endpoint redirecting to S3 with 307 Temporary Redirect. Is there a reason why this is happening?
I've tried creating a website endpoint and changing the origin, but no luck; same result.
This temporary request redirection is actually caused by the way S3 buckets behave when they are newly created (thanks to Michael-sqlbot for clarifying this).
From the docs (Temporary Request Redirection)
Due to the distributed nature of Amazon S3, requests can be temporarily routed to the wrong facility. This is most likely to occur immediately after buckets are created or deleted. For example, if you create a new bucket and immediately make a request to the bucket, you might receive a temporary redirect, depending on the location constraint of the bucket.
Change your Origin Domain Name to bucketname.s3-region.amazonaws.com per the docs:
If you're using an Amazon CloudFront distribution with an Amazon S3 origin, CloudFront forwards requests to the default S3 endpoint (s3.amazonaws.com), which is in the us-east-1 Region. If you must access Amazon S3 within the first 24 hours of creating the bucket, you can change the Origin Domain Name of the distribution to include the regional endpoint of the bucket. For example, if the bucket is in us-west-2, you can change the Origin Domain Name from bucketname.s3.amazonaws.com to bucketname.s3-us-west-2.amazonaws.com.
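A quick way to get the right value is to look the bucket's region up programmatically. A minimal sketch assuming boto3, with a hypothetical bucket name; the docs example above uses the older s3-<region> form, but the dotted s3.<region> form works for all regions:

```python
import boto3

# Look up the bucket's region and build the regional S3 endpoint to use
# as the CloudFront distribution's Origin Domain Name.
s3 = boto3.client("s3")
location = s3.get_bucket_location(Bucket="my-bucket")["LocationConstraint"]
region = location or "us-east-1"  # us-east-1 is reported as None

origin_domain = f"my-bucket.s3.{region}.amazonaws.com"
print(origin_domain)  # e.g. my-bucket.s3.us-west-2.amazonaws.com
```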
I am trying to implement a solution on AWS which is as follows:
I have a crawler that will run once a day to index certain sites. I want to cache this data and expose it in the form of an API, since after crawling, this data will not change for an entire day. After the crawler re-fetches, I want to invalidate and rebuild this cache to serve the updated data. I'm trying to use a serverless architecture to build this.
Possible Solutions
It is clear that the crawler will run on AWS Lambda. What is unclear to me is how to manage the cache that will serve the data. Here are some solutions I thought of:
S3 and CloudFront for caching: after crawling, store the data in the form of .json files in S3 that will be cached using AWS CloudFront. When the crawler fetches new data, it will rebuild these files and ask CloudFront to invalidate the cache.
API Gateway + DynamoDB: after crawling, store the data in DynamoDB, which will then be served by API Gateway with caching enabled. The only problem here is how to get this cache invalidated at the end of the day when the crawler re-crawls. Also, since the data will be static for a day, how can I avoid paying for the extra time that DynamoDB will be running? (If I implement caching on API Gateway, there will be only one call to DynamoDB to populate the cache; after that it will sit idle for a day.)
Is there any other way that I am missing?
Thanks!
You can store new data under a different path in S3 that includes the date of creation. Maybe something like:
index_2017_08_11.json
Then there is no need to invalidate caches on the CloudFront side. Since these new objects are accessed through new URLs, the old CloudFront cache won't be an issue. You can remove the previous day's S3 files using an S3 lifecycle expiration rule.
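A minimal sketch of the upload side, assuming boto3; the bucket name and key layout are hypothetical:

```python
import json
from datetime import date

import boto3

# Each crawl writes to a date-stamped key, so yesterday's CloudFront cache
# entries simply stop being requested -- no invalidation needed.
s3 = boto3.client("s3")
key = f"index_{date.today():%Y_%m_%d}.json"  # e.g. index_2017_08_11.json

s3.put_object(
    Bucket="crawler-results",              # hypothetical bucket
    Key=key,
    Body=json.dumps({"results": []}),      # your crawl output goes here
    ContentType="application/json",
    CacheControl="public, max-age=86400",  # safe: the key changes daily anyway
)
```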
Another option is to set the Expires HTTP caching header to control when the cached data should be invalidated:
The Expires header field lets you specify an expiration date and time using the format specified in RFC 2616, Hypertext Transfer Protocol -- HTTP/1.1 Section 3.3.1, Full Date, for example: Sat, 27 Jun 2015 23:59:59 GMT
You can set this header in API Gateway to specify when an object should be invalidated.
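For example, if the API is backed by a Lambda proxy integration (an assumption here), the function can stamp the header itself. A sketch that expires responses at the next midnight UTC, on the assumption that the crawler re-runs then:

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime

def handler(event, context):
    # Expire the cached response at the next midnight UTC, when the
    # crawler produces fresh data.
    now = datetime.now(timezone.utc)
    next_midnight = (now + timedelta(days=1)).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    return {
        "statusCode": 200,
        "headers": {"Expires": format_datetime(next_midnight, usegmt=True)},
        "body": '{"data": "..."}',
    }
```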
Since the data will be static for a day, how can I avoid paying for the extra time that DynamoDB will be running?
If data is static, can you store it in S3 and use API Gateway to serve data from S3 instead of DynamoDB?
How do I enable Keep-Alive connections in AWS S3 or CloudFront? I uploaded images to S3 and found that the URLs don't use keep-alive connections. They cannot be cached by the client application even though I added Cache-Control headers to each image file.
From the tag wiki for Keep-Alive:
A feature of HTTP where the same connection is used for multiple requests, speeding up downloading of web pages with multiple resources.
I'm not aware of any relation that this has to cache behavior. I usually see mentions of Keep-Alive headers in relation to long-polling, which wouldn't make any sense to enable on S3.
I think you are incorrectly linking keep-alive headers with your browser's ability to cache static content. The cache-control headers should be all that is needed for caching of static content in the browser.
Are you verifying that the response from CloudFront includes the Cache-Control headers you have set on the S3 objects? Perhaps you need to invalidate the CloudFront cache after updating the headers.
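One quick way to check is to inspect the response headers directly. A sketch using only the Python standard library; the distribution domain and object path are hypothetical:

```python
import urllib.request

# HEAD the object through CloudFront and print the caching-related headers.
req = urllib.request.Request(
    "https://d111111abcdef8.cloudfront.net/images/example.jpg", method="HEAD"
)
with urllib.request.urlopen(req) as resp:
    print(resp.headers.get("Cache-Control"))  # should echo the S3 metadata
    print(resp.headers.get("X-Cache"))        # "Hit from cloudfront" vs "Miss from cloudfront"
```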
Related to your question, I think the problem is in setting a correct TTL (> 0) on your origins/behaviors in CloudFront.
Also, AWS CloudFront (since 30 March 2017) lets you configure custom read and keep-alive timeouts for custom origins.
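Those timeouts live in the CustomOriginConfig section of the distribution config. A sketch of that fragment as it appears in a boto3 DistributionConfig; the values shown are illustrative:

```python
# CustomOriginConfig fragment for a custom (non-S3) origin; values in seconds.
custom_origin_config = {
    "HTTPPort": 80,
    "HTTPSPort": 443,
    "OriginProtocolPolicy": "https-only",
    "OriginReadTimeout": 60,       # response/read timeout
    "OriginKeepaliveTimeout": 30,  # idle keep-alive timeout
}
```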
I'm using the CloudFront CDN to simply cache my static content in "Origin Pull" mode. The CloudFront origin is my website.
However, I've encountered a CORS problem. My browser doesn't let my web pages load my font files from CloudFront... The ironic thing about it is that those fonts were fetched and cached from my website in the first place :(
After googling this matter a bit, I noticed that all blogs/tutorials explain how to enable CORS on an S3 bucket used as the origin for CloudFront, and letting CloudFront forward the Access-Control-Allow-XXX headers from S3 to the client.
I don't need an S3 bucket and would like to keep it that way for the sake of simplicity, if possible.
Is it possible to enable CORS on CloudFront? Even a quick-and-dirty solution, such as setting the access control header on all responses, would be good enough.
Or what other alternatives do I have on CloudFront? If the easiest alternative is indeed to use an S3 bucket, what are the drawbacks (modifications to make on my website, service performance, and cost)?