Caching and invalidating AWS Lambda response

I am trying to implement a solution on AWS which is as follows:
I have a crawler that will run once a day to index certain sites. I want to cache this data and expose it in the form of an API, since after crawling the data will not change for an entire day. After the crawler refetches, I want to invalidate and rebuild this cache to serve the updated data. I'm trying to use a serverless architecture to build this.
Possible Solutions
It is clear that the crawler will run on AWS Lambda. What is unclear to me is how to manage the cache that will serve the data. Here are some solutions I thought of:
S3 and CloudFront for caching: After crawling, store the data as .json files in S3, served and cached through Amazon CloudFront. When the crawler fetches new data, it rebuilds these files and asks CloudFront to invalidate the cache.
API Gateway + DynamoDB: After crawling, store the data in DynamoDB and serve it through API Gateway with caching enabled. Two problems here: how can I ask for this cache to be invalidated at the end of the day when the crawler re-crawls? And since the data will be static for a day, how can I avoid paying for the time DynamoDB sits idle? (With caching on API Gateway, there will be only one call to DynamoDB to populate the cache, after which it will sit idle for a day.)
Is there any other way that I am missing?
Thanks!

You can store new data under a different path in S3 that includes the date of creation. Maybe something like:
index_2017_08_11.json
Then there is no need to invalidate caches on the CloudFront side. Since accessing the new objects requires new URLs, the old CloudFront cache won't be an issue. You can remove the previous day's S3 files automatically with an S3 lifecycle expiration rule.
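For illustration, a minimal sketch of the crawler's upload step with boto3 (the Python SDK); the bucket name and function are hypothetical:

```python
# Sketch: write today's crawl output under a date-stamped key so the
# CloudFront cache for older keys never needs invalidating.
# Assumption: a bucket named "crawler-results" exists.
import json
from datetime import date

import boto3

s3 = boto3.client("s3")

def store_crawl_results(results: dict) -> str:
    key = "index_{}.json".format(date.today().strftime("%Y_%m_%d"))
    s3.put_object(
        Bucket="crawler-results",  # hypothetical bucket name
        Key=key,
        Body=json.dumps(results),
        ContentType="application/json",
    )
    return key
```

The API (or whatever embeds the URL) then just needs to compute the same date-stamped key when requesting the data.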
Another option is to set the Expires caching HTTP header to specify when the cached data should be invalidated:
The Expires header field lets you specify an expiration date and time using the format specified in RFC 2616, Hypertext Transfer Protocol -- HTTP/1.1 Section 3.3.1, Full Date, for example: Sat, 27 Jun 2015 23:59:59 GMT
You can set this header in API Gateway to specify when an object should be invalidated.
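For example, with a Lambda proxy integration behind API Gateway, the handler itself can return the header; this is just a sketch, assuming the crawler refreshes the data around midnight UTC:

```python
# Sketch of a Lambda proxy handler that marks the response as expiring at
# the next midnight UTC, when the crawler is assumed to re-run.
import json
from datetime import datetime, time, timezone

def handler(event, context):
    now = datetime.now(timezone.utc)
    expires = datetime.combine(now.date(), time.max, tzinfo=timezone.utc)
    return {
        "statusCode": 200,
        "headers": {
            "Content-Type": "application/json",
            # RFC 1123 date format, as required for the Expires header
            "Expires": expires.strftime("%a, %d %b %Y %H:%M:%S GMT"),
        },
        "body": json.dumps({"data": "..."}),  # placeholder payload
    }
```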
Since the data will be static for a day, how can I not pay for the extra time that DynamoDB will be running
If data is static, can you store it in S3 and use API Gateway to serve data from S3 instead of DynamoDB?

Related

S3 Fallback Bucket

I have built a system where I have product templates. A brand will overwrite the template to create a product. Images can be uploaded to the template and overwritten on the product. Product images are uploaded to the corresponding brand's S3 bucket, but product template images are uploaded to a generic S3 bucket.
Is there a way to make the brand's bucket fall back to the generic bucket if it returns a 404 or 403 for a file URL? Similar to the hosted-website redirect rules? These are just buckets with images, not hosted websites, and I was hoping to avoid turning that feature on.
There is not a way to do this with S3 alone, but it can be done with CloudFront, in conjunction with two S3 buckets, configured in an origin group with appropriate origin failover settings, so that 403/404 errors from the first bucket cause CloudFront to make a follow-up request from the second bucket.
After you configure origin failover for a cache behavior, CloudFront does the following for viewer requests:
When there’s a cache hit, CloudFront returns the requested file.
When there’s a cache miss, CloudFront routes the request to the primary origin that you identified in the origin group.
When a status code that has not been configured for failover is returned, such as an HTTP 2xx or HTTP 3xx status code, CloudFront serves the requested content.
When the primary origin returns an HTTP status code that you’ve configured for failover, or after a timeout, CloudFront routes the request to the backup origin in the origin group.
https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/high_availability_origin_failover.html
This seems to be the desired behavior you're describing. It means, of course, that cache misses that need to fall back to the second bucket will require additional time to be served, but for cache hits there won't be any delay, since CloudFront only goes through the "try one, then try the other" process on cache misses. It also means that you'll be paying for some traffic on the primary bucket for objects that aren't present, so it makes the most sense if the primary bucket will have the object more often than not.
This solution does not redirect the browser -- CloudFront follows the second path before returning a response -- so you'll want to be mindful of the Cache-Control settings you attach to the fallback objects when you upload them, since adding a (previously-absent) primary object after a fallback object is already fetched and cached (by either CloudFront or the browser) will not be visible until any cached objects expire.
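For reference, here is a sketch of the origin-group fragment of a DistributionConfig as you would pass it to boto3's create_distribution or update_distribution; the origin IDs are hypothetical and the rest of the config is omitted:

```python
# Sketch: fail over from the brand bucket to the generic bucket on 403/404.
origin_groups = {
    "Quantity": 1,
    "Items": [
        {
            "Id": "brand-with-generic-fallback",  # hypothetical ID
            "FailoverCriteria": {
                "StatusCodes": {"Quantity": 2, "Items": [403, 404]},
            },
            "Members": {
                "Quantity": 2,
                "Items": [
                    {"OriginId": "brand-bucket"},    # primary origin
                    {"OriginId": "generic-bucket"},  # fallback origin
                ],
            },
        }
    ],
}
# The cache behavior's TargetOriginId then points at the origin group's Id
# ("brand-with-generic-fallback") instead of a single origin.
```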

Cloudfront Won't Set Expiration Header from S3 Origin

I am using an S3 bucket to store a bunch of product images for a large web site. These images are served through CloudFront with the S3 bucket as the origin. I have noticed that CloudFront does not put an expiration header on the images, even though I have set the distribution behavior to customize the cache headers and set long min, max, and default TTLs in CloudFront.
I understand that I can put an expiration on each S3 object, however this is going to be quite impractical as I have millions of images. I was hoping that CloudFront would do me the honors of adding this header for me, but it does not.
So my question is: is the only way to get this expiration header to apply it to every S3 object, or am I missing something in CloudFront that will do it for me?
CloudFront's TTL configuration only controls the amount of time CloudFront keeps the object in the cache.
It doesn't add any headers.
So, yes, you'll need to set these on the objects in S3.
Note that Cache-Control: is usually considered a better choice than Expires:.
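If you do end up retrofitting headers onto existing objects, it can be done in place by copying each object onto itself with replaced metadata; a rough boto3 sketch (the bucket name is hypothetical, and with millions of images you'd want to parallelize this):

```python
# Sketch: add Cache-Control to every object via self-copy with REPLACE.
import boto3

s3 = boto3.client("s3")
bucket = "product-images"  # hypothetical bucket name

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket):
    for obj in page.get("Contents", []):
        # Preserve the original Content-Type, since REPLACE discards it.
        head = s3.head_object(Bucket=bucket, Key=obj["Key"])
        s3.copy_object(
            Bucket=bucket,
            Key=obj["Key"],
            CopySource={"Bucket": bucket, "Key": obj["Key"]},
            MetadataDirective="REPLACE",  # required to rewrite headers
            ContentType=head["ContentType"],
            CacheControl="public, max-age=31536000",  # one year
        )
```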
An alternative to avoid updating the objects is to configure a proxy server in EC2 in the same region as the bucket, and let the server add the headers as the responses pass through it.
Request: CloudFront >> Proxy >> S3
Response: S3 >> Proxy >> CloudFront
...for what it's worth.

What is the difference between caching pages on S3 versus CloudFront

What is the difference between caching pages on S3 versus CloudFront ?
I'm currently using CloudFront to cache pages previously generated by my server (Tomcat on EB) and also images referenced in those pages, but for some reason CloudFront doesn't always seem to use the cache.
Page generation requires a number of web-service calls to another service and is computationally intensive, but once a page is created it does not change for at least a month. This is why I want additional requests for the same page to use the CloudFront cache; failing that, I thought that once the server creates a page it could store it on S3, and if it received the same request again it could check S3 and, if the page existed, serve it from there. This would avoid redoing the web-service calls and computations.
The biggest difference is CloudFront is in more than 50 locations worldwide, so it will deliver content faster to viewers worldwide.
Actually, you don't have to choose - you can generate pages, put them to S3 for maximum cacheability, and then deliver through CloudFront for best viewer experience.
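As a sketch of that combination, the server can publish each rendered page to S3 with a month-long Cache-Control, so both CloudFront and browsers can hold on to it; the bucket name and helper are hypothetical:

```python
# Sketch: store a generated page in S3 with a 30-day Cache-Control.
import boto3

s3 = boto3.client("s3")

def publish_page(path: str, html: str) -> None:
    s3.put_object(
        Bucket="rendered-pages",  # hypothetical bucket name
        Key=path.lstrip("/"),
        Body=html.encode("utf-8"),
        ContentType="text/html; charset=utf-8",
        CacheControl="public, max-age=2592000",  # pages are stable ~1 month
    )
```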

Does Amazon pass custom headers to origin?

I am using CloudFront to front requests to our service hosted outside of Amazon. The service is protected and we expect an "Authorization" header to be passed by the applications invoking our service.
We have tried invoking our service through CloudFront, but it looks like the header is getting dropped by CloudFront. Hence the service rejects the request and the client gets a 401 Unauthorized response.
For some static requests, which do not need authorization, we are not getting any error and are getting proper responses from CloudFront.
I have gone through the CloudFront documentation and there is no specific information available on how headers are handled, so I was hoping they would be passed as-is, but it looks like that's not the case. Any guidance from you folks?
The list of the headers CF drops or modifies can be found here
http://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/RequestAndResponseBehaviorCustomOrigin.html#RequestCustomRemovedHeaders
CloudFront does drop Authorization headers by default and will not pass it to the origin.
If you would like certain headers to be sent to the origin, you can set up a whitelist of headers under CloudFront -> Behavior Settings -> Forward Headers. Just select the headers that you would like to be forwarded and CloudFront will do the job for you. I have tested it this way for one of our location-based services and it works like a charm.
One thing that I need to verify is whether the Authorization header will be included in the cache key, and whether it's safe to do that. That is something you might want to watch out for as well.
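In API terms, that whitelist is the Headers list inside the cache behavior's ForwardedValues (the legacy forwarded-values style the console path above refers to); a fragment sketch, with the rest of the DistributionConfig omitted:

```python
# Sketch: forward (and therefore cache on) the Authorization header.
default_cache_behavior_fragment = {
    "ForwardedValues": {
        "QueryString": True,
        "Cookies": {"Forward": "none"},
        "Headers": {
            "Quantity": 1,
            "Items": ["Authorization"],  # becomes part of the cache key
        },
    },
}
```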
It makes sense that CF drops the Authorization header by default: just imagine two users asking for the same object. The first one would be granted access and CF would cache the object; the second user would then receive the object simply because it was previously cached by CloudFront.
The good news is that, using forwarded headers, you can forward the Authorization header to the origin; that means the object will be cached more than once, as the header value becomes part of the cache "key".
For example, user A GETs private/index.html
Authorization: XXXXXXXXXXXXX
The object will be cached as private/index.html + XXXXXXXXXXXXX (this is the key used to cache the object in CF).
Now a new request from a different user arrives at CloudFront:
GET private/index.html
Authorization: YYYYYYYYYYYY
The request will be passed to the origin, as the combination of private/index.html + YYYYYYYYYYYY is not in the CF cache.
CF will then cache two different objects with the same name (but different key combinations).
In addition to specifying them under the Behavior settings, you can also add custom headers to your origin configuration. From the AWS documentation for CloudFront custom headers:
If the header names and values that you specify are not already present in the viewer request, CloudFront adds them. If a header is present, CloudFront overwrites the header value before forwarding the request to the origin.
The benefit of this is that you can then use an All/wildcard setting for whitelisting your headers in the behaviour section.
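As a sketch, a custom origin header lives on the origin itself in the DistributionConfig; the origin ID, domain, and token are hypothetical:

```python
# Sketch: CloudFront adds/overwrites this header on every origin request.
origin_fragment = {
    "Id": "my-backend",               # hypothetical origin ID
    "DomainName": "api.example.com",  # hypothetical origin domain
    "CustomHeaders": {
        "Quantity": 1,
        "Items": [
            {"HeaderName": "Authorization", "HeaderValue": "Bearer <token>"},
        ],
    },
}
```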
It sounds like you are trying to serve up dynamic content from CloudFront (at least in the sense that the content is different for authenticated vs unauthenticated users) which is not really what it is designed to do.
CloudFront is a Content Distribution Network (CDN) for caching content at distributed edge servers so that the data is served near your clients rather than hitting your server each time.
You can configure CloudFront to cache pages for a short time if it changes regularly and there are some use cases where this is worthwhile (e.g. a high volume web site where you want to "micro cache" to reduce server load) but it doesn't sound like this is the way you are trying to use it.
In the case you describe:
The user will hit CloudFront for the page.
It won't be in the cache so CloudFront will try to pull a copy from the origin server.
The origin server will reply with a 401 so CloudFront will not cache it.
Even if this worked and headers were passed back and forth in some way, there is simply no point in using CloudFront if every page is going to hit your server anyway; you would just make the page slower because of the extra round trip to your server.

Getting a pre-authenticated URL to an S3 bucket

I am attempting to use an S3 bucket as a deployment location for an internal, auto-updating application's files. It would be the location where the new version's files are dumped for the application to pick up on an update. Since this is an internal application, I was hoping to keep the URL private, but still be able to access it using only a URL. I was hoping to look into using third-party auto-updating software, which means I can't use the Amazon API to access it.
Does anyone know a way to get a URL to a private bucket on S3?
You probably want to use one of the available AWS Software Development Kits (SDKs), which all implement the respective methods to generate these URLs by means of the GetPreSignedURL() method (e.g. Java: generatePresignedUrl(), C#: GetPreSignedURL()):
The GetPreSignedURL operation creates a signed HTTP request. Query string authentication is useful for giving HTTP or browser access to resources that would normally require authentication. When using query string authentication, you create a query, specify an expiration time for the query, sign it with your signature, place the data in an HTTP request, and distribute the request to a user or embed the request in a web page. A PreSigned URL can be generated for GET, PUT and HEAD operations on your bucket, keys, and versions.
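In Python, the equivalent is boto3's generate_presigned_url; a minimal sketch with a hypothetical bucket and key:

```python
# Sketch: create a time-limited GET URL for a private object.
import boto3

s3 = boto3.client("s3")

url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "internal-deployments",   # hypothetical bucket
            "Key": "app/latest/update.zip"},    # hypothetical key
    ExpiresIn=3600,  # URL stays valid for one hour
)
print(url)  # hand this URL to the auto-updater
```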
There are a couple of related questions already and e.g. Why is my S3 pre-signed request invalid when I set a response header override that contains a “+”? contains a working sample in C# (aside from the content type issue Ragesh is experiencing of course).
Good luck!