CloudFront and Lambda@Edge - fetch from custom origin depending on user agent - amazon-s3

I'm serving my JavaScript app (SPA) by uploading it to S3 (default root object index.html with Cache-Control:max-age=0, no-cache, pointing to fingerprinted js/css assets) and configuring it as an origin to a CloudFront distribution. My domain name, let's say SomeMusicPlatform.com has a CNAME entry in Route53 containing the distribution URL. This is working great and all is well cached.
Now I want to serve a prerendered HTML version for purposes of bots and social network crawlers. I have set up a server that responds with a pre-rendered version of the JavaScript app (SPA) at the domain prerendered.SomeMusicPlatform.com.
What I'm trying to do in the lambda function is to detect the user agent, identify bots and serve them the prerendered version from my custom server (and not the JavaScript contents from S3 as I would normally serve to regular browsers).
I thought I could achieve this with a Lambda@Edge function, following the documented example "Using an Origin-Request Trigger to Change From an Amazon S3 Origin to a Custom Origin": the function switches the origin to my custom prerender server when it identifies a crawler bot in the request headers (or, in the testing phase, when a prerendered=true query parameter is present).
The problem is that the Origin-Request trigger is not firing for the Lambda@Edge function, because CloudFront still has the Default Root Object index.html cached and returns the content from the edge cache. I get X-Cache:RefreshHit from cloudfront for both SomeMusicPlatform.com/?prerendered=true and SomeMusicPlatform.com, even though the Default Root Object index.html has Cache-Control:max-age=0, no-cache.
How can I keep the well-cached serving and low latency of my JavaScript SPA with CloudFront and add serving content from my custom prerender server just for crawler bots?

The problem with caching (getting the same hit when using either mywebsite.com/?prerendered=true or mywebsite.com) was solved by adding prerendered to the query whitelist in the cloudfront distribution. This means that CloudFront now correctly maintains both normal and prerendered version of the website content, depending on presence of the parameter (without the parameter cached content from S3 origin is served, and with the parameter cached content from my custom origin specified in the lambda function is served).
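For the testing phase described above, the origin-request function could look roughly like the sketch below. It is written in Python and is not the code I actually deployed. The event layout (Records[0].cf.request with querystring, headers and origin) follows the Lambda@Edge request event, but the port, protocol, timeouts and TLS version are assumptions you should adapt. It only sees the parameter because prerendered is whitelisted in the distribution.

```python
from urllib.parse import parse_qs

# Hostname taken from the question; replace it with your own prerender server.
PRERENDER_HOST = "prerendered.SomeMusicPlatform.com"


def custom_origin(domain_name):
    """Build a custom-origin structure for the request (values are assumptions)."""
    return {
        "custom": {
            "domainName": domain_name,
            "port": 443,
            "protocol": "https",
            "path": "",
            "sslProtocols": ["TLSv1.2"],
            "readTimeout": 30,
            "keepaliveTimeout": 5,
            "customHeaders": {},
        }
    }


def lambda_handler(event, context):
    """Origin-request trigger: send ?prerendered=true to the prerender server."""
    request = event["Records"][0]["cf"]["request"]
    params = parse_qs(request.get("querystring", ""))
    if params.get("prerendered") == ["true"]:
        request["origin"] = custom_origin(PRERENDER_HOST)
        # The Host header has to match the domain name of the new origin.
        request["headers"]["host"] = [{"key": "host", "value": PRERENDER_HOST}]
    # Without the parameter the request is returned untouched (S3 origin).
    return request
```

Requests without the parameter keep the S3 origin, so the normal cached SPA path is unaffected.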
This was enough for the testing phase - to ensure the mechanism is working correctly. Then I followed Michael's advice and added another lambda function in the Viewer Request trigger which adds a custom header Is-Bot in case a bot is detected in User-Agent. Again, whitelisting was needed, this time for the custom header (to maintain caches for both origins depending on the custom header). The other lambda function later in the Origin Request trigger then decides which origin to use, depending on the Is-Bot header.
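A minimal sketch of the two-function setup, again in Python and with assumptions: the list of crawler markers is illustrative and incomplete, and the handler names are hypothetical. Unlike my description above, this version always sets Is-Bot (to true or false) so that a client cannot inject its own value; CloudFront then keeps one cache variant per value once the header is whitelisted.

```python
import re

# Illustrative, incomplete list of crawler markers (an assumption).
BOT_PATTERN = re.compile(
    r"googlebot|bingbot|facebookexternalhit|twitterbot|linkedinbot|slackbot",
    re.IGNORECASE,
)


def viewer_request_handler(event, context):
    """Viewer-request trigger: classify the User-Agent and set Is-Bot."""
    request = event["Records"][0]["cf"]["request"]
    headers = request["headers"]
    user_agent = headers.get("user-agent", [{"value": ""}])[0]["value"]
    # Always overwrite the header so a viewer-supplied value is ignored.
    is_bot = "true" if BOT_PATTERN.search(user_agent) else "false"
    headers["is-bot"] = [{"key": "Is-Bot", "value": is_bot}]
    return request


def is_bot_request(request):
    """Origin-request side: read the header set by the viewer-request trigger."""
    values = request["headers"].get("is-bot", [])
    return bool(values) and values[0]["value"] == "true"
```

The origin-request function would call is_bot_request(request) and switch the origin to the prerender server only when it returns True.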

Related

S3 Fallback Bucket

I have built a system where I have product templates. A brand will overwrite the template to create a product. Images can be uploaded to the template and be overwritten on the product. The product images are uploaded to the corresponding brand's S3 bucket. But on the product template images are uploaded to a generic S3 bucket.
Is there a way to make the brand's bucket fallback to the generic bucket if it receives a 404 or 403 with a file url. Similar to the hosted website redirect rules? These are just buckets with images so it wouldn't be a hosted website and I was hoping to avoid turning that on.
There is not a way to do this with S3 alone, but it can be done with CloudFront, in conjunction with two S3 buckets, configured in an origin group with appropriate origin failover settings, so that 403/404 errors from the first bucket cause CloudFront to make a follow-up request from the second bucket.
After you configure origin failover for a cache behavior, CloudFront does the following for viewer requests:
When there’s a cache hit, CloudFront returns the requested file.
When there’s a cache miss, CloudFront routes the request to the primary origin that you identified in the origin group.
When a status code that has not been configured for failover is returned, such as an HTTP 2xx or HTTP 3xx status code, CloudFront serves the requested content.
When the primary origin returns an HTTP status code that you’ve configured for failover, or after a timeout, CloudFront routes the request to the backup origin in the origin group.
https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/high_availability_origin_failover.html
This seems to be the desired behavior you're describing. It means, of course, that cache misses that need to fall back to the second bucket will require additional time to be served, but for cache hits, there won't be any delay since CloudFront only goes through the "try one, then try the other" on cache misses. It also means that you'll be paying for some traffic on the primary bucket for objects that aren't present, so it makes the most sense if the primary bucket will have the object more often than not.
This solution does not redirect the browser -- CloudFront follows the second path before returning a response -- so you'll want to be mindful of the Cache-Control settings you attach to the fallback objects when you upload them, since adding a (previously-absent) primary object after a fallback object is already fetched and cached (by either CloudFront or the browser) will not be visible until any cached objects expire.
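To make the "try one, then try the other" flow concrete, here is a small Python simulation of the steps quoted above. It is only a model of the logic, not CloudFront code: the origins are plain functions backed by dictionaries, timeouts and error caching are not modeled, and all names are hypothetical.

```python
FAILOVER_STATUS_CODES = {403, 404}  # codes assumed to be configured for failover


def bucket_origin(objects):
    """Return a fake origin that serves objects from a dict or answers 404."""
    def fetch(path):
        if path in objects:
            return 200, objects[path]
        return 404, None
    return fetch


def fetch_with_failover(path, primary, backup, cache):
    """Simulate one viewer request against a cache behavior with an origin group."""
    if path in cache:                      # cache hit: no origin is contacted
        return cache[path]
    status, body = primary(path)           # cache miss: try the primary origin
    if status in FAILOVER_STATUS_CODES:    # configured error: try the backup
        status, body = backup(path)
    if status == 200:                      # simplified: only successes are cached
        cache[path] = (status, body)
    return status, body
```

With the brand bucket as primary and the generic bucket as backup, a template image that exists only in the generic bucket costs two origin requests on the first miss and none afterwards.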

AWS S3 Static hosting: Routing rules don't work with CloudFront

I am using AWS S3 static web hosting for my VueJs SPA app. I have set up routing rules in S3 and they work perfectly fine when I access the app using the S3 static hosting URL. But I have also configured CloudFront to use it with my custom domain. Since single page apps need to be routed via index.html, I have set up a custom error page in CloudFront to redirect 404 errors to index.html. Now the routing rules I set up in S3 no longer work.
What is the best way to get S3 routing rules to work along with CloudFront custom error page setup for SPA?
I think I am a bit late but here goes anyway,
Apparently you can't do that if you use the S3 REST API endpoint (example-bucket.s3.amazonaws.com) as the origin for your CloudFront distribution; you have to use the website URL provided by S3 (example-bucket.s3-website-[region].amazonaws.com) as the origin. Also, objects must be public; you can't lock your bucket to the distribution by origin policy.
So,
Objects must be public.
S3 bucket website option must be turned on.
Distribution origin has to come from the S3 website url, not the rest api endpoint.
EDIT:
I was mistaken. You can actually do it with the REST API endpoint too: you only have to create a Custom Error Response in your CloudFront distribution, probably only for the 404 and 403 error codes. Set the "Customize Error Response" option to "yes", Response Page Path to "/index.html" and HTTP Response Code to "200". If you are using the console, you can find that option in the error pages tab of your distribution.
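If you manage the distribution from code rather than the console, the same settings can be expressed as data. The Python helper below is a sketch that builds a structure shaped like the CustomErrorResponses element of a CloudFront distribution config; the helper name and the error caching TTL of 0 are my assumptions, so check them against your own setup before using them.

```python
def spa_error_responses(error_codes=(403, 404), page="/index.html", ttl=0):
    """Build a CustomErrorResponses-style structure that maps errors to the SPA page."""
    items = [
        {
            "ErrorCode": code,            # origin status code to intercept
            "ResponsePagePath": page,     # page returned instead of the error
            "ResponseCode": "200",        # status sent to the viewer (a string)
            "ErrorCachingMinTTL": ttl,    # assumed value, tune as needed
        }
        for code in error_codes
    ]
    return {"Quantity": len(items), "Items": items}
```

The result would replace the CustomErrorResponses part of the distribution config you send back when updating the distribution.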

How to use Akamai in front of S3 buckets?

I have a static website that is currently hosted in apache servers. I have an akamai server which routes requests to my site to those servers. I want to move my static websites to Amazon S3, to get away from having to host those static files in my servers.
I created a S3 bucket in amazon, gave it appropriate policies. I also set up my bucket for static website hosting. It told me that I can access the site at
http://my-site.s3-website-us-east-1.amazonaws.com
I modified my Akamai properties to point to this URL as my origin server. When I go to my website, I get HTTP 504 errors.
What am I missing here?
Thanks
K
S3 buckets don't support HTTPS?
Buckets support HTTPS, but not directly in conjunction with the static web site hosting feature.
See Website Endpoints in the S3 Developer Guide for discussion of the feature set differences between the REST endpoints and the web site hosting endpoints.
Note that if you try to connect directly to your web site hosting endpoint with your browser over HTTPS, you will get a timeout error.
The REST endpoint https://your-bucket.s3.amazonaws.com will work for providing HTTPS between bucket and CDN, as long as there are no dots in the name of your bucket.
Or if you need the web site hosting features (index documents and redirects), you can place CloudFront between Akamai and S3, encrypting the traffic inside CloudFront as it leaves the AWS network on its way to Akamai (it would still be in the clear from S3 to CloudFront, but this is internal traffic on the AWS network). CloudFront automatically provides HTTPS support on the dddexample.cloudfront.net hostname it assigns to each distribution.
I admit, it sounds a bit silly, initially, to put CloudFront behind another CDN but it's really pretty sensible -- CloudFront was designed in part to augment the capabilities of S3. CloudFront also provides Lambda@Edge, which allows injection of logic at 4 trigger points in the request processing cycle (before and after the CloudFront cache, during the request and during the response) where you can modify request and response headers, generate dynamic responses, and make external network requests if needed to implement processing logic.
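As a side note on the "no dots in the bucket name" condition for the REST endpoint: my understanding is that the certificate on that endpoint is a wildcard for *.s3.amazonaws.com, and a wildcard only covers a single hostname label. The Python sketch below illustrates that matching rule; the function name is hypothetical and this is a simplified model of certificate hostname matching, not a full implementation.

```python
def matches_wildcard(hostname, pattern="*.s3.amazonaws.com"):
    """Return True if hostname matches a name whose '*' covers exactly one label."""
    host_labels = hostname.lower().split(".")
    pattern_labels = pattern.lower().split(".")
    # A dotted bucket name adds labels, so the label counts no longer agree.
    if len(host_labels) != len(pattern_labels):
        return False
    return all(p == "*" or p == h for h, p in zip(host_labels, pattern_labels))
```

For example, my-site.s3.amazonaws.com matches the pattern, while my.site.s3.amazonaws.com does not, which is why the dotted name fails certificate validation.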
I faced this problem recently. As mentioned by Michael - sqlbot, putting CloudFront between Akamai and the S3 bucket could be a workaround, but then you're using a CDN behind another CDN. I strongly recommend configuring the redirects, and customizing the response on origin errors, directly in Akamai (using the REST API endpoint of your bucket). You'll need to create three rules. First, go to CDN > Properties and select your property, choose Edit New Version based on the last one, and click Add Rule in the Property Configuration Settings section. The first rule is responsible for redirecting empty paths to index.html; create it just like the image below:
builtin.AK_PATH is an Akamai variable. The next rule is responsible for redirecting paths other than the static ones (html, ico, json, js, css, jpg, png, gif, etc) to /index.html:
The last rule is responsible for customizing the error response when the origin returns an HTTP error code (just like the CloudFront Error Pages). When the origin returns a 404 or 403 HTTP status code, Akamai will call the Failover Hostname Edge Server (which is inside the Akamai network) with the /index.html path. This setup is triggered when refreshing pages in the browser and when the application has redirection links (which open new tabs, for example). In the Property Hostnames section, add a new hostname that will work as the Failover Hostname Edge Server; the name should have fewer than 16 characters. Then add the -a.akamaihd.net suffix to it (that's the Akamai pattern). For example: failover-a.akamaihd.net:
Finally, create a new empty rule just like the image below (type the hostname that you just created in the Alternate Hostname in This Property section):
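Since the rules are only described in words here, the following Python sketch shows the decision logic the three rules are meant to implement. It is not Akamai configuration and uses no Akamai API; the function names are hypothetical and the extension list contains only the types named above.

```python
import posixpath

# Only the extensions named in the answer; extend the set for your own assets.
STATIC_EXTENSIONS = {"html", "ico", "json", "js", "css", "jpg", "png", "gif"}


def rewrite_path(path):
    """Rules 1 and 2: empty paths and non-static paths go to /index.html."""
    if path in ("", "/"):
        return "/index.html"
    extension = posixpath.splitext(path)[1].lstrip(".").lower()
    if extension not in STATIC_EXTENSIONS:
        return "/index.html"
    return path


def failover_path(origin_status, path):
    """Rule 3: on a 403 or 404 from the origin, retry with /index.html."""
    return "/index.html" if origin_status in (403, 404) else path
```

A deep link such as /albums/42 is rewritten to /index.html, while /app.js is passed through unchanged.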
Since you are already using Akamai as a CDN, you could simply use their NetStorage product line to achieve this in a simplified manner.
All you would need to do is to move the content from S3 to Akamai and it would take care of the rest (hosting, distribution, scaling, security, redundancy).
The origin settings on the Luna control panel could simply point to the NetStorage FTP location. This will also remove the network latency otherwise present when accessing the S3 bucket from the Akamai network.

Does Amazon pass custom headers to origin?

I am using CloudFront to front requests to our service hosted outside of amazon. The service is protected and we expect an "Authorization" header to be passed by the applications invoking our service.
We have tried invoking our service through CloudFront, but it looks like the header is getting dropped by CloudFront. Hence the service rejects the request and the client gets a 401 Unauthorized response.
For some static requests, which do not need authorization, we are not getting any error and are getting a proper response from CloudFront.
I have gone through the CloudFront documentation and found no specific information on how headers are handled, so I was hoping they would be passed as is, but it looks like that's not the case. Any guidance from you folks?
The list of the headers CF drops or modifies can be found here
http://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/RequestAndResponseBehaviorCustomOrigin.html#RequestCustomRemovedHeaders
CloudFront does drop the Authorization header by default and will not pass it to the origin.
If you would like certain headers to be sent to the origin, you can setup a whitelist of headers under CloudFront->Behavior Settings->Forward headers. Just select the headers that you would like to be forwarded and CloudFront will do the job for you. I have tested it this way for one of our location based services and it works like a charm.
One thing that I need to verify is whether the Authorization header will be included in the cache key and whether it's safe to do that. That is something you might want to watch out for as well.
It makes sense that CF drops the Authorization header. Just imagine 2 users asking for the same object: the first one is granted access, CF caches the object, and then the second user gets the object as it was previously cached by CloudFront.
The good news is that by using forward headers you can forward the Authorization header to the origin. That means the object will be cached more than once, as the header value is part of the cache "key".
For example, user A GETs private/index.html
Authorization: XXXXXXXXXXXXX
The object will be cached as private/index.html + XXXXXXXXXXXXX (this is the key used to cache the object in CF).
Now a new request from a different user arrives at CloudFront:
GET private/index.html
Authorization: YYYYYYYYYYYY
The request will be passed to the origin, as the combination private/index.html + YYYYYYYYYYYY is not in the CF cache.
CF will then cache 2 different objects with the same name (but different combined cache keys).
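The example above can be modeled with a few lines of Python. This is a toy cache, not CloudFront itself: the class name and the origin function are hypothetical, and the only point is that forwarded header values become part of the cache key.

```python
class EdgeCache:
    """Toy edge cache whose key is the path plus the forwarded header values."""

    def __init__(self, origin, forwarded_headers=()):
        self.origin = origin
        self.forwarded = tuple(name.lower() for name in forwarded_headers)
        self.store = {}
        self.origin_hits = 0

    def get(self, path, headers):
        lowered = {name.lower(): value for name, value in headers.items()}
        # Headers that are not forwarded never reach the key (or the origin).
        key = (path,) + tuple(lowered.get(name) for name in self.forwarded)
        if key not in self.store:
            self.origin_hits += 1
            forwarded = {name: lowered.get(name) for name in self.forwarded}
            self.store[key] = self.origin(path, forwarded)
        return self.store[key]
```

With Authorization in forwarded_headers, users A and B each cause one origin request and get separate cached objects; without it, both share a single cached object and the origin never sees the header.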
In addition to specifying them under the Origin Behaviour section, you can also add custom headers to your origin configuration. In the AWS documentation for CloudFront custom headers:
If the header names and values that you specify are not already present in the viewer request, CloudFront adds them. If a header is present, CloudFront overwrites the header value before forwarding the request to the origin.
The benefit of this is that you can then use an All/wildcard setting for whitelisting your headers in the behaviour section.
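The overwrite rule quoted from the documentation can be sketched in Python as a simple merge. The function name and the X-Origin-Token header are hypothetical; the sketch only shows that a header configured on the origin wins over a header of the same name sent by the viewer.

```python
def apply_origin_custom_headers(viewer_headers, custom_headers):
    """Merge headers so that origin custom headers overwrite viewer headers."""
    merged = {name.lower(): value for name, value in viewer_headers.items()}
    # Headers configured on the origin are added, or replace the viewer's value.
    merged.update({name.lower(): value for name, value in custom_headers.items()})
    return merged
```

A viewer sending its own X-Origin-Token would therefore not be able to influence the value the origin receives.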
It sounds like you are trying to serve up dynamic content from CloudFront (at least in the sense that the content is different for authenticated vs unauthenticated users) which is not really what it is designed to do.
CloudFront is a Content Distribution Network (CDN) for caching content at distributed edge servers so that the data is served near your clients rather than hitting your server each time.
You can configure CloudFront to cache pages for a short time if it changes regularly and there are some use cases where this is worthwhile (e.g. a high volume web site where you want to "micro cache" to reduce server load) but it doesn't sound like this is the way you are trying to use it.
In the case you describe:
The user will hit CloudFront for the page.
It won't be in the cache so CloudFront will try to pull a copy from the origin server.
The origin server will reply with a 401 so CloudFront will not cache it.
Even if this worked and headers were passed back and forth in some way, there is simply no point in using CloudFront if every page is going to hit your server anyway; you would just make the page slower because of the extra round trip to your server.

How can I hide a custom origin server from the public when using AWS CloudFront?

I am not sure if this exactly qualifies for StackOverflow, but since I need to do this programmatically, and I figure lots of people on SO use CloudFront, I think it does... so here goes:
I want to hide public access to my custom origin server.
CloudFront pulls from the custom origin, however I cannot find documentation or any sort of example on preventing direct requests from users to my origin when proxied behind CloudFront unless my origin is S3... which isn't the case with a custom origin.
What technique can I use to identify/authenticate that a request is being proxied through CloudFront instead of being directly requested by the client?
The CloudFront documentation only covers this case when used with an S3 origin. The AWS forum post that lists CloudFront's IP addresses has a disclaimer that the list is not guaranteed to be current and should not be relied upon. See https://forums.aws.amazon.com/ann.jspa?annID=910
I assume that anyone using CloudFront has some sort of way to hide their custom origin from direct requests / crawlers. I would appreciate any sort of tip to get me started. Thanks.
I would suggest using something similar to facebook's robots.txt in order to prevent all crawlers from accessing all sensitive content in your website.
https://www.facebook.com/robots.txt (you may have to tweak it a bit)
After that, just point your app.. (eg. Rails) to be the custom origin server.
Now rewrite all the urls on your site to become absolute urls like :
https://d2d3cu3tt4cei5.cloudfront.net/hello.html
Basically all urls should point to your cloudfront distribution. Now if someone requests a file from https://d2d3cu3tt4cei5.cloudfront.net/hello.html and it does not have hello.html.. it can fetch it from your server (over an encrypted channel like https) and then serve it to the user.
so even if the user does a view source, they do not know your origin server... only know your cloudfront distribution.
more details on setting this up here:
http://blog.codeship.io/2012/05/18/Assets-Sprites-CDN.html
Create a custom CNAME that only CloudFront uses. On your own servers, block any request for static assets not coming from that CNAME.
For instance, if your site is http://abc.mydomain.net then set up a CNAME for http://xyz.mydomain.net that points to the exact same place and put that new domain in CloudFront as the origin pull server. Then, on requests, you can tell if it's from CloudFront or not and do whatever you want.
Downside is that this is security through obscurity. The client never sees the requests for http://xyz.mydomain.net but that doesn't mean they won't have some way of figuring it out.
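On the origin server, the check for the CloudFront-only CNAME could look something like this Python sketch. It assumes CloudFront sends the origin domain name as the Host header (which is the case when the Host header is not forwarded from the viewer); the /assets/ prefix and the function names are hypothetical.

```python
CDN_ONLY_HOST = "xyz.mydomain.net"   # the CNAME that only CloudFront uses
STATIC_PREFIXES = ("/assets/",)      # hypothetical location of static assets


def request_host(headers):
    """Return the Host header without port, lowercased, or '' if absent."""
    for name, value in headers.items():
        if name.lower() == "host":
            return value.split(":")[0].lower()
    return ""


def allow_static_request(path, headers):
    """Serve static assets only when the request came in via the CDN hostname."""
    if not path.startswith(STATIC_PREFIXES):
        return True  # non-static requests are not restricted here
    return request_host(headers) == CDN_ONLY_HOST
```

Anyone who learns the hostname can still send that Host header directly, so this remains obscurity rather than authentication.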
[I know this thread is old, but I'm answering it for people like me who see it months later.]
From what I've read and seen, CloudFront does not consistently identify itself in requests. But you can get around this problem by overriding robots.txt at the CloudFront distribution.
1) Create a new S3 bucket that only contains one file: robots.txt. That will be the robots.txt for your CloudFront domain.
2) Go to your distribution settings in the AWS Console and click Create Origin. Add the bucket.
3) Go to Behaviors and click Create Behavior:
Path Pattern: robots.txt
Origin: (your new bucket)
4) Set the robots.txt behavior at a higher precedence (lower number).
5) Go to invalidations and invalidate /robots.txt.
Now abc123.cloudfront.net/robots.txt will be served from the bucket and everything else will be served from your domain. You can choose to allow/disallow crawling at either level independently.
Another domain/subdomain will also work in place of a bucket, but why go to the trouble?
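To show why the precedence in step 4 matters, here is a rough Python model of how behaviors are picked: the first matching path pattern in precedence order wins, and the default behavior (*) comes last. It uses fnmatch as an approximation of CloudFront's wildcard matching, and the origin names are hypothetical.

```python
from fnmatch import fnmatchcase

# Ordered by precedence: lower index is evaluated first.
BEHAVIORS = [
    ("robots.txt", "robots-bucket"),   # the new single-file bucket
    ("*", "custom-origin"),            # default behavior: your own server
]


def pick_origin(path, behaviors=BEHAVIORS):
    """Return the origin of the first behavior whose pattern matches the path."""
    normalized = path.lstrip("/")
    for pattern, origin in behaviors:
        if fnmatchcase(normalized, pattern.lstrip("/")):
            return origin
    raise LookupError(f"no behavior matches {path!r}")
```

If the robots.txt behavior were listed after the default one, the * pattern would match first and the bucket would never be used.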