CloudFront static sites with versioning - amazon-s3

I like the idea of hosting a static site in S3 + Cloudfront.
Best practice seems to be to version files in S3. For example, for site version 2324, I'd put stuff in
s3://my-site-assets/2324/images/logo.jpg
The thing I'm having trouble with is how to version the actual pages. If a "hello" page is at
s3://my-site-assets/2324/hello.html
I would want visitors to https://my-site.com/hello.html to get the correct version.
Is this possible with a 100% static site? Right now, I'm doing something similar by versioning assets, but my pages are all served via EC2/Varnish/ELB. It seems quite heavyweight just for rewriting hello.html -> 2324/hello.html.

It is possible today with Lambda@Edge. You have to do server-side redirection to load the latest versioned site. Since you're versioning your site, you must be maintaining the version number somewhere; use that number in your Lambda@Edge logic.
Request (https://my-site.com/hello.html) -> Lambda@Edge (redirect here) -> CloudFront -> S3 (and all the way back)
Lambda@Edge logic: replace(base_url, base_url + '/' + ${latest_version})
Reference doc on routing via Lambda@Edge: https://making.close.com/posts/redirects-using-cloudfront-lambda-edge
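A minimal sketch of that idea as a Lambda@Edge origin-request trigger in Node.js: it rewrites the URI before CloudFront forwards the request to S3, so the viewer URL stays clean. The hard-coded LATEST_VERSION constant is an assumption; in practice you would bake the current version in at deploy time or look it up from wherever you maintain it.

'use strict';

// Assumed version source: hard-coded at deploy time. "2324" matches the
// example version used in the question.
const LATEST_VERSION = '2324';

exports.handler = (event, context, callback) => {
    const request = event.Records[0].cf.request;

    // Rewrite /hello.html -> /2324/hello.html before the request reaches
    // the S3 origin, unless the path is already versioned.
    if (!request.uri.startsWith('/' + LATEST_VERSION + '/')) {
        request.uri = '/' + LATEST_VERSION + request.uri;
    }

    callback(null, request);
};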

Related

ListBucketResult XML trying to show home page of site in S3 through CloudFront

I created a bucket where I'm hosting my static website.
I set the properties to use it for static website hosting (with index document value index.html).
The URL was: http://mywebsitelearningcurve.s3-website-us-east-1.amazonaws.com (not currently up, just to explain)
I made it public (permissions).
Overview of my bucket
/images
/static
/asset-manifest.json
/favicon.ico
/index.html
/manifest.json
/service-worker.js
Using http://mywebsitelearningcurve.s3-website-us-east-1.amazonaws.com I could access my site. However, I decided to put CloudFront in front of my bucket.
I created a new distribution for WEB.
On Origin Domain Name I used mywebsitelearningcurve.s3.amazonaws.com
Origin ID: S3-mywebsitelearningcurve
In Viewer Protocol Policy I selected: Redirect HTTP to HTTPS.
Once it finished and I had waited a reasonable time for it to propagate, I had the URL https://d2qf2r44tssakh.cloudfront.net/ (not currently up, just to explain).
The issue:
When I tried to use https://d2qf2r44tssakh.cloudfront.net/ it showed me an XML listing:
<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/"><Name>mywebsitelearningcurve</Name>
...
...
...
</ListBucketResult>
However, when I tried https://d2qf2r44tssakh.cloudfront.net/index.html it worked properly.
I've gone through several tutorials and posts but I still can't make it work. Can anyone provide help?
Thanks
I had the same problem today and was able to fix it by adding index.html to the Default Root Object in the distribution settings:
Optional. The object that you want CloudFront to return (for example,
index.html) when a viewer request points to your root URL
(http://www.example.com) instead of to a specific object in your
distribution (http://www.example.com/index.html).
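If you prefer to set this programmatically rather than in the console, a sketch with the AWS SDK for JavaScript (v2) could look like the following; the distribution ID is a placeholder.

const AWS = require('aws-sdk');
const cloudfront = new AWS.CloudFront();

async function setDefaultRootObject(distributionId) {
  // updateDistribution requires the full current config plus its ETag.
  const { DistributionConfig, ETag } = await cloudfront
    .getDistributionConfig({ Id: distributionId })
    .promise();

  DistributionConfig.DefaultRootObject = 'index.html';

  await cloudfront
    .updateDistribution({ Id: distributionId, IfMatch: ETag, DistributionConfig })
    .promise();
}

setDefaultRootObject('EDFDVBD6EXAMPLE').catch(console.error); // placeholder ID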
I have 5 years of production experience on AWS, with 5 certifications in place.
When it comes to S3 + CloudFront, I always got into trouble.
I tried to automate this using CloudFormation, but CloudFormation does not support everything needed (i.e. custom origins in CloudFront).
In the end, I rely only on Terraform to automate this part:
https://github.com/riboseinc/terraform-aws-s3-cloudfront-website/blob/master/sample-site/main.tf
If you don't mind using Terraform, I highly recommend jumping there.

Rewriting static resources url in .htaccess for CDN

I have an existing live application based on a PHP/MySQL/Apache stack. A quick performance evaluation revealed that a CDN solution would help us gain a lot of speed. I'm planning to use CloudFront for the CDN.
The issue is that the existing code wasn't written with a CDN in mind.
At the moment, our HTML output contains static resources under link tags, referenced with "./images/test.png" etc.
Is there any way to identify these links just before sending the output and rewrite them to load from the CDN URL?

React Router + AWS Backend, how to SEO

I am using React and React Router in my single page web application. Since I'm doing client side rendering, I'd like to serve all of my static files (HTML, CSS, JS) with a CDN. I'm using Amazon S3 to host the files and Amazon CloudFront as the CDN.
When the user requests /css/styles.css, the file exists so S3 serves it.
When the user requests /foo/bar, this is a dynamic URL so S3 adds a hashbang: /#!/foo/bar. This will serve index.html. On my client side I remove the hashbang so my URLs are pretty.
This all works great for 100% of my users.
All static files are served through a CDN
A dynamic URL will be routed to /#!/{...} which serves index.html (my single page application)
My client side removes the hashbang so the URLs are pretty again
The problem
The problem is that Google won't crawl my website. Here's why:
Google requests /
They see a bunch of links, e.g. to /foo/bar
Google requests /foo/bar
They get redirected to /#!/foo/bar (302 Found)
They remove the hashbang and request /
Why is the hashbang being removed? My app works great for 100% of my users so why do I need to redesign it in such a way just to get Google to crawl it properly? It's 2016, just follow the hashbang...
</rant>
Am I doing something wrong? Is there a better way to get S3 to serve index.html when it doesn't recognize the path?
Setting up a node server to handle these paths isn't the correct solution because that defeats the entire purpose of having a CDN.
In this thread Michael Jackson, top contributor to React Router, says "Thankfully hashbang is no longer in widespread use." How would you change my set up to not use the hashbang?
You can also check out this trick. You need to set up a CloudFront distribution and then alter the 404 behaviour in the "Error Pages" section of your distribution. That way you can have domain.com/foo/bar links again :)
I know this is a few months old, but for anyone who came across the same problem: you can simply specify "index.html" as the error document in S3. The error document property can be found under bucket Properties => Static Website Hosting => Enable website hosting.
Please keep in mind that taking this approach means you will be responsible for handling HTTP errors like 404 in your own application, along with other HTTP errors.
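For reference, the same setting can be applied with the AWS SDK for JavaScript (v2) instead of the console; the bucket name below is a placeholder.

const AWS = require('aws-sdk');
const s3 = new AWS.S3();

s3.putBucketWebsite({
  Bucket: 'my-spa-bucket', // placeholder
  WebsiteConfiguration: {
    IndexDocument: { Suffix: 'index.html' },
    // Serving index.html as the error document lets the client-side router
    // handle unknown paths like /foo/bar.
    ErrorDocument: { Key: 'index.html' },
  },
}).promise()
  .then(() => console.log('website configuration updated'))
  .catch(console.error);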
The hashbang is not recommended when you want to make an SEO-friendly website; even if it's indexed in Google, the page will show only a little, thin content.
The best way to build your website is by using the latest trend and techniques, which is "progressive web enhancement"; search for it on Google and you will find many articles about it.
Mainly, you should have a separate link for each page, and when the user clicks on any page they are taken to that page using any effect you want, even if it is a single-page website.
In this case, Google will have a unique link for each page and the user will have the fancy effect and great UX.
For example: a "Contact Us" link that points to its own page URL.

Redirection issue following migration from Node.js app to static site on Amazon S3

I plan to migrate my personal blog, which presently uses Node.js as a backend, to Amazon S3, considering that the content is pretty much always static.
One problem I noticed is that there's no way to do redirection or anything of the sort on Amazon S3 (as far as I know).
Let's say I have this URL:
http://blogue.jpmonette.net/2013/06/11/hebergez-vos-applications-nodejs-grace-a-digitalocean
When I migrate it to Amazon, I'll have to create this folder hierarchy:
/2013/06/11/hebergez-vos-applications-nodejs-grace-a-digitalocean/
and then add the file index.html in it, containing the data.
Considering this, my URL will then be changed from:
http://blogue.jpmonette.net/2013/06/11/hebergez-vos-applications-nodejs-grace-a-digitalocean
to
http://blogue.jpmonette.net/2013/06/11/hebergez-vos-applications-nodejs-grace-a-digitalocean/
There's no way to redirect that right now using Amazon S3.
Also, anyone requesting http://blogue.jpmonette.net/2013/06/11/hebergez-vos-applications-nodejs-grace-a-digitalocean/index.html will obtain a file, and this is annoying in terms of SEO.
Is there an available solution to prevent this behavior and preserve good SEO of my blog?
EDIT
And for people flagging this as an inappropriate question: I'm looking to set up proper permanent redirects on Amazon S3, to make sure that visitors looking for articles in the future will find them. Please note that "visitor" here includes humans and robots.
It seems like we can create redirection rules this way (redirecting a-propos to a-propos/):
<?xml version="1.0"?>
<RoutingRules>
  <RoutingRule>
    <Condition>
      <KeyPrefixEquals>a-propos</KeyPrefixEquals>
    </Condition>
    <Redirect>
      <ReplaceKeyWith>a-propos/</ReplaceKeyWith>
    </Redirect>
  </RoutingRule>
</RoutingRules>
Considering I have a ton of URLs to redirect (83 in total), this seems impossible because there's a limit on the number of routing rules:
83 routing rules provided, the number of routing rules in a website configuration is limited to 50.
Other than that, the only option I see is to add the x-amz-website-redirect-location header to a file with the same name as the prior URL.
For the example above, create a file named a-propos, add the x-amz-website-redirect-location header, and put a-propos/ as the value. This should work, but it takes forever to do by hand.
Syntax can be found here:
http://docs.aws.amazon.com/AmazonS3/latest/dev/HowDoIWebsiteConfiguration.html#configure-bucket-as-website-routing-rule-syntax
Rule generator can be found here:
http://quiet-cove-8872.herokuapp.com/
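Since creating those objects by hand for 83 URLs is tedious, a small script with the AWS SDK for JavaScript (v2) could create the redirect objects; the bucket name and the key list below are placeholders.

const AWS = require('aws-sdk');
const s3 = new AWS.S3();

// Old (slash-less) keys to redirect; fill in the full list of 83.
const oldKeys = [
  'a-propos',
  '2013/06/11/hebergez-vos-applications-nodejs-grace-a-digitalocean',
];

async function createRedirects() {
  for (const key of oldKeys) {
    // Zero-byte object whose only purpose is to carry the redirect header;
    // the S3 website endpoint serves it as a 301 to the trailing-slash URL.
    await s3.putObject({
      Bucket: 'my-blog-bucket', // placeholder
      Key: key,
      Body: '',
      WebsiteRedirectLocation: '/' + key + '/',
    }).promise();
    console.log('redirect created: /' + key + ' -> /' + key + '/');
  }
}

createRedirects().catch(console.error);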

How can I hide a custom origin server from the public when using AWS CloudFront?

I am not sure if this exactly qualifies for Stack Overflow, but since I need to do this programmatically, and I figure lots of people on SO use CloudFront, I think it does... so here goes:
I want to hide public access to my custom origin server.
CloudFront pulls from the custom origin; however, I cannot find documentation or any example of preventing direct requests from users to my origin when it is proxied behind CloudFront, unless my origin is S3... which isn't the case with a custom origin.
What technique can I use to identify/authenticate that a request is being proxied through CloudFront instead of being directly requested by the client?
The CloudFront documentation only covers this case when used with an S3 origin. The AWS forum post that lists CloudFront's IP addresses has a disclaimer that the list is not guaranteed to be current and should not be relied upon. See https://forums.aws.amazon.com/ann.jspa?annID=910
I assume that anyone using CloudFront has some sort of way to hide their custom origin from direct requests / crawlers. I would appreciate any sort of tip to get me started. Thanks.
I would suggest using something similar to Facebook's robots.txt in order to prevent crawlers from accessing sensitive content on your website.
https://www.facebook.com/robots.txt (you may have to tweak it a bit)
After that, just point your app (e.g. Rails) to be the custom origin server.
Now rewrite all the URLs on your site to become absolute URLs like:
https://d2d3cu3tt4cei5.cloudfront.net/hello.html
Basically, all URLs should point to your CloudFront distribution. Now, if someone requests a file from https://d2d3cu3tt4cei5.cloudfront.net/hello.html and CloudFront does not have hello.html cached, it can fetch it from your server (over an encrypted channel like HTTPS) and then serve it to the user.
So even if the user does a view-source, they do not know your origin server... they only know your CloudFront distribution.
more details on setting this up here:
http://blog.codeship.io/2012/05/18/Assets-Sprites-CDN.html
Create a custom CNAME that only CloudFront uses. On your own servers, block any request for static assets not coming from that CNAME.
For instance, if your site is http://abc.mydomain.net then set up a CNAME for http://xyz.mydomain.net that points to the exact same place, and put that new domain in CloudFront as the origin pull server. Then, on requests, you can tell whether it's from CloudFront or not and do whatever you want.
The downside is that this is security through obscurity. The client never sees the requests for http://xyz.mydomain.net, but that doesn't mean they won't have some way of figuring it out.
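A minimal Node.js sketch of that check on the origin, assuming xyz.mydomain.net is the CloudFront-only CNAME from the example above (a variant of the same idea is to have CloudFront add a secret custom header at the origin and check for that instead):

const http = require('http');

// The CNAME that only CloudFront uses (placeholder from the example).
const CLOUDFRONT_ONLY_HOST = 'xyz.mydomain.net';

http.createServer((req, res) => {
  const host = (req.headers.host || '').toLowerCase();

  if (host !== CLOUDFRONT_ONLY_HOST) {
    // Request did not come in through the CloudFront origin pull domain.
    res.writeHead(403);
    res.end('Forbidden');
    return;
  }

  res.writeHead(200, { 'Content-Type': 'text/plain' });
  res.end('Hello from the origin, via CloudFront');
}).listen(8080);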
[I know this thread is old, but I'm answering it for people like me who see it months later.]
From what I've read and seen, CloudFront does not consistently identify itself in requests. But you can get around this problem by overriding robots.txt at the CloudFront distribution.
1) Create a new S3 bucket that only contains one file: robots.txt. That will be the robots.txt for your CloudFront domain.
2) Go to your distribution settings in the AWS Console and click Create Origin. Add the bucket.
3) Go to Behaviors and click Create Behavior:
Path Pattern: robots.txt
Origin: (your new bucket)
4) Set the robots.txt behavior at a higher precedence (lower number).
5) Go to invalidations and invalidate /robots.txt.
Now abc123.cloudfront.net/robots.txt will be served from the bucket and everything else will be served from your domain. You can choose to allow/disallow crawling at either level independently.
Another domain/subdomain will also work in place of a bucket, but why go to the trouble?
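Step 5 can also be done with the AWS SDK for JavaScript (v2) instead of the console; the distribution ID below is a placeholder.

const AWS = require('aws-sdk');
const cloudfront = new AWS.CloudFront();

cloudfront.createInvalidation({
  DistributionId: 'EDFDVBD6EXAMPLE', // placeholder
  InvalidationBatch: {
    // CallerReference must be unique per invalidation request.
    CallerReference: 'robots-' + Date.now(),
    Paths: { Quantity: 1, Items: ['/robots.txt'] },
  },
}).promise()
  .then(data => console.log('invalidation id:', data.Invalidation.Id))
  .catch(console.error);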