How to make sure web crawler works for site hosted on AWS S3 and uses AJAX

Google's webmaster guide explains that the web server should handle requests for URLs that contain _escaped_fragment_ (the crawler rewrites www.example.com/ajax.html#!mystate to www.example.com/ajax.html?_escaped_fragment_=mystate):
http://support.google.com/webmasters/bin/answer.py?hl=en&answer=174992
My site is hosted on AWS S3 and I have no web server to handle such requests. How can I make sure the crawler is fed the right content and my site gets indexed?
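The URL rewrite described in the question can be sketched in a few lines of Python. This is only an illustration: the function name `to_escaped_fragment` is hypothetical, and `urllib.parse.quote` is used as an approximation of the scheme's escaping rules, so it may escape more characters than the crawler did.

```python
from urllib.parse import quote


def to_escaped_fragment(url: str) -> str:
    """Rewrite a #! URL into its _escaped_fragment_ form."""
    base, sep, fragment = url.partition("#!")
    if not sep:
        return url  # no hashbang, nothing to rewrite
    # Append to an existing query string if there is one.
    joiner = "&" if "?" in base else "?"
    # Assumption: quote() approximates the scheme's escaping; '=' is kept as-is.
    return f"{base}{joiner}_escaped_fragment_={quote(fragment, safe='=')}"


print(to_escaped_fragment("www.example.com/ajax.html#!mystate"))
```

On S3 there is nothing that can answer the rewritten URL dynamically, which is why the question arises in the first place.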

S3-hosted sites are static HTML: no POST handling, no PHP rendering, nothing server-side. So why do you care about Google indexing AJAX sites?
For a static website, simply upload well-formed robots.txt and sitemap.xml files to your root path.
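As a rough sketch of what those two files could contain, the following Python generates a minimal sitemap.xml and robots.txt from a list of page URLs. The helper names and the example URLs are assumptions for illustration; the generated strings would still need to be uploaded to the bucket root by whatever means you normally use.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a minimal sitemap.xml document listing the given URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


def build_robots(sitemap_url):
    """Return a permissive robots.txt that points crawlers at the sitemap."""
    return f"User-agent: *\nAllow: /\nSitemap: {sitemap_url}\n"


# Hypothetical pages of the static site.
pages = ["http://www.example.com/", "http://www.example.com/ajax.html"]
print(build_sitemap(pages))
print(build_robots("http://www.example.com/sitemap.xml"))
```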

Related

AWS S3 static hosting: routing rules don't work with CloudFront

I am using AWS S3 static web hosting for my Vue.js SPA. I have set up routing rules in S3, and they work perfectly when I access the site through the S3 static hosting URL. However, I have also configured CloudFront so I can use my custom domain. Since single-page apps need to be routed via index.html, I set up a custom error page in CloudFront to redirect 404 errors to index.html. As a result, the routing rules I set up in S3 no longer work.
What is the best way to get S3 routing rules to work alongside the CloudFront custom error page setup for an SPA?

I think I am a bit late but here goes anyway,
Apparently you can't do that if you are using the S3 REST API endpoint (example-bucket.s3.amazonaws.com) as the origin for your CloudFront distribution; you have to use the website URL provided by S3 as the origin (example-bucket.s3-website-[region].amazonaws.com). Also, the objects must be public; you can't lock the bucket to the distribution by an origin policy.
So,
Objects must be public.
The S3 bucket website option must be turned on.
The distribution origin has to be the S3 website URL, not the REST API endpoint.
EDIT:
I was mistaken. You can actually do it with the REST API endpoint too: you only have to create a Custom Error Response in your CloudFront distribution, probably only for the 404 and 403 error codes. Set the "Customize Error Response" option to "yes", Response Page Path to "/index.html" and HTTP Response Code to "200". If you are using the console, you can find that option in the Error Pages tab of your distribution.
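For anyone scripting this instead of using the console, the settings from the edit above map onto the `CustomErrorResponses` part of a CloudFront distribution config. The sketch below only builds that fragment as a plain dictionary; the helper name is hypothetical, the zero caching TTL is an assumption, and merging the fragment into a full `DistributionConfig` (for example with boto3's `update_distribution`) is left out.

```python
def spa_error_responses(codes=(403, 404), page="/index.html"):
    """Build a CustomErrorResponses fragment that serves the SPA entry page."""
    items = [
        {
            "ErrorCode": code,           # origin status to intercept
            "ResponsePagePath": page,    # "Response Page Path" in the console
            "ResponseCode": "200",       # "HTTP Response Code" in the console
            "ErrorCachingMinTTL": 0,     # assumption: do not cache the error
        }
        for code in codes
    ]
    return {"Quantity": len(items), "Items": items}


print(spa_error_responses())
```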

How to use Akamai in front of S3 buckets?

I have a static website that is currently hosted on Apache servers. I have an Akamai server that routes requests for my site to those servers. I want to move my static website to Amazon S3 so that I no longer have to host those static files on my own servers.
I created an S3 bucket, gave it appropriate policies, and set it up for static website hosting. It told me that I can access the site at
http://my-site.s3-website-us-east-1.amazonaws.com
I modified my Akamai properties to point to this URL as my origin server. When I go to my website, I get HTTP 504 errors.
What am I missing here?

S3 buckets don't support HTTPS?
Buckets support HTTPS, but not directly in conjunction with the static web site hosting feature.
See Website Endpoints in the S3 Developer Guide for discussion of the feature set differences between the REST endpoints and the web site hosting endpoints.
Note that if you try to connect directly to your web site hosting endpoint over HTTPS with your browser, you will get a timeout error.
The REST endpoint https://your-bucket.s3.amazonaws.com will work for providing HTTPS between bucket and CDN, as long as there are no dots in the name of your bucket.
Or, if you need the web site hosting features (index documents and redirects), you can place CloudFront between Akamai and S3, encrypting the traffic inside CloudFront as it leaves the AWS network on its way to Akamai (it would still be in the clear from S3 to CloudFront, but this is internal traffic on the AWS network). CloudFront automatically provides HTTPS support on the dddexample.cloudfront.net hostname it assigns to each distribution.
I admit, it sounds a bit silly, initially, to put CloudFront behind another CDN but it's really pretty sensible -- CloudFront was designed in part to augment the capabilities of S3. CloudFront also provides Lambda@Edge, which allows injection of logic at 4 trigger points in the request processing cycle (before and after the CloudFront cache, during the request and during the response) where you can modify request and response headers, generate dynamic responses, and make external network requests if needed to implement processing logic.
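The caveat above about dots in the bucket name comes from how wildcard certificates match: a wildcard covers a single hostname label, so a dotted bucket name no longer matches the certificate on the REST endpoint. A small guard like the following, with a hypothetical function name, can catch that before the endpoint is configured as an HTTPS origin.

```python
def rest_endpoint(bucket: str) -> str:
    """Return the HTTPS REST endpoint for a bucket, rejecting dotted names."""
    if "." in bucket:
        # A wildcard certificate only matches one hostname label.
        raise ValueError(f"bucket name {bucket!r} contains a dot; HTTPS hostname will not validate")
    return f"https://{bucket}.s3.amazonaws.com"


print(rest_endpoint("your-bucket"))
```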

I faced this problem recently and, as mentioned by Michael - sqlbot, putting CloudFront between Akamai and the S3 bucket could be a workaround, but then you are using a CDN behind another CDN. I strongly recommend configuring the redirects and customizing the response on origin errors directly in Akamai (using the REST API endpoint of your bucket). You'll need to create three rules. First, go to CDN > Properties and select your property, click Edit New Version based on the latest one, and click Add Rule in the Property Configuration Settings section. The first rule is responsible for redirecting empty paths to index.html; create it just like the image below:
builtin.AK_PATH is an Akamai variable. The next rule is responsible for redirecting paths other than the static ones (html, ico, json, js, css, jpg, png, gif, etc.) to /index.html:
The last rule is responsible for customizing the error response when the origin returns an HTTP error code (just like the CloudFront Error Pages). When the origin returns a 404 or 403 HTTP status code, Akamai will call the Failover Hostname Edge Server (which is inside the Akamai network) with the /index.html path. This setup is triggered when refreshing pages in the browser and when the application has redirection links (which open new tabs, for example). In the Property Hostnames section, add a new hostname that will work as the Failover Hostname Edge Server. The name should have fewer than 16 characters; then add the -a.akamaihd.net suffix to it (that's the Akamai pattern). For example: failover-a.akamaihd.net:
Finally, create a new empty rule just like the image below (type the hostname that you just created in the Alternate Hostname in This Property section):
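The decision logic of the first two rules can be summarized in plain Python. This is not Akamai configuration, only a sketch of the behaviour described above; the function name is hypothetical and the extension list is the one given in the answer (the "etc." is not filled in).

```python
import posixpath

# Extensions treated as static files in the answer above.
STATIC_EXTENSIONS = {"html", "ico", "json", "js", "css", "jpg", "png", "gif"}


def rewrite_path(path: str) -> str:
    """Return the path that would be requested from the S3 origin."""
    # Rule 1: an empty path goes to the SPA entry page.
    if path in ("", "/"):
        return "/index.html"
    # Rule 2: anything that is not a known static file also goes there.
    extension = posixpath.splitext(path)[1].lstrip(".").lower()
    if extension not in STATIC_EXTENSIONS:
        return "/index.html"
    return path


for example in ("/", "/foo/bar", "/css/app.css"):
    print(example, "->", rewrite_path(example))
```

The third rule (failover on 403/404) covers the remaining case where a static-looking path does not exist at the origin.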

Since you are already using Akamai as a CDN, you could simply use their NetStorage product line to achieve this in a simplified manner.
All you would need to do is move the content from S3 to Akamai, and it would take care of the rest (hosting, distribution, scaling, security, redundancy).
The origin settings in the Luna control panel could simply point to the NetStorage FTP location. This would also remove the network latency otherwise present when accessing the S3 bucket from the Akamai network.

React Router + AWS Backend, how to SEO

I am using React and React Router in my single page web application. Since I'm doing client side rendering, I'd like to serve all of my static files (HTML, CSS, JS) with a CDN. I'm using Amazon S3 to host the files and Amazon CloudFront as the CDN.
When the user requests /css/styles.css, the file exists so S3 serves it.
When the user requests /foo/bar, this is a dynamic URL so S3 adds a hashbang: /#!/foo/bar. This will serve index.html. On my client side I remove the hashbang so my URLs are pretty.
This all works great for 100% of my users.
All static files are served through a CDN
A dynamic URL will be routed to /#!/{...} which serves index.html (my single page application)
My client side removes the hashbang so the URLs are pretty again
The problem
The problem is that Google won't crawl my website. Here's why:
Google requests /
They see a bunch of links, e.g. to /foo/bar
Google requests /foo/bar
They get redirected to /#!/foo/bar (302 Found)
They remove the hashbang and request /
Why is the hashbang being removed? My app works great for 100% of my users so why do I need to redesign it in such a way just to get Google to crawl it properly? It's 2016, just follow the hashbang...
</rant>
Am I doing something wrong? Is there a better way to get S3 to serve index.html when it doesn't recognize the path?
Setting up a node server to handle these paths isn't the correct solution because that defeats the entire purpose of having a CDN.
In this thread Michael Jackson, top contributor to React Router, says "Thankfully hashbang is no longer in widespread use." How would you change my setup to not use the hashbang?
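The crawl sequence in the question follows from how URLs work: the fragment (everything after #) is never sent to the server, so a client that follows the 302 to /#!/foo/bar ends up requesting /. A short Python illustration, with hypothetical helper names and an example hostname:

```python
from urllib.parse import urlsplit


def path_sent_to_server(url: str) -> str:
    """Return the path portion of a URL, which is what the server sees."""
    return urlsplit(url).path or "/"


def strip_hashbang(location: str) -> str:
    """Mimic the client-side step that turns /#!/foo/bar back into /foo/bar."""
    _, sep, route = location.partition("#!")
    return route if sep else location


print(path_sent_to_server("http://example.com/foo/bar"))     # /foo/bar
print(path_sent_to_server("http://example.com/#!/foo/bar"))  # /
print(strip_hashbang("/#!/foo/bar"))                         # /foo/bar
```

Only code running in the browser can see the fragment and restore the route, which is why a client that does not run that code lands on the root page.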

You can also check out this trick. You need to set up a CloudFront distribution and then alter the 404 behaviour in the "Error Pages" section of your distribution. That way you can use domain.com/foo/bar links again :)

I know this is a few months old, but for anyone who comes across the same problem: you can simply specify "index.html" as the error document in S3. The error document property can be found under bucket Properties => Static Website Hosting => Enable website hosting.
Please keep in mind that taking this approach means you will be responsible for handling HTTP errors such as 404 in your own application.
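If the bucket is configured from a script rather than the console, the same setting corresponds to a website configuration in which the index document and the error document are the same file. The sketch below only builds that structure; the helper name is hypothetical, and passing the result to something like boto3's `put_bucket_website` is an assumption about how you deploy.

```python
def spa_website_configuration(entry: str = "index.html") -> dict:
    """Website configuration that serves the SPA entry page for unknown paths."""
    return {
        "IndexDocument": {"Suffix": entry},  # served for directory-style requests
        "ErrorDocument": {"Key": entry},     # served when the key does not exist
    }


print(spa_website_configuration())
```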

The hashbang is not recommended when you want an SEO-friendly website; even if the site is indexed by Google, the indexed page will show only thin content.
The best approach is the technique known as "Progressive web enhancement"; search for it on Google and you will find many articles about it.
Essentially, give each page its own link. When the user clicks a link, they are taken to that page with whatever transition effect you want, even if the site is a single-page website.
This way, Google has a unique link for each page, and the user still gets the fancy effect and a great UX.
EX:
Contact Us

Access js and css files from API gateway and lambda

I have an API Gateway in AWS that calls a Lambda function that returns some HTML. That HTML is rendered properly on the screen, but without any of the styles or JS files included. How do I get those to the client as well? Is there a better method than creating /js and /css GET endpoints on the API Gateway to fetch those files? I was hoping I could just store them in S3 and have them loaded from there.

Store them on S3, and enable S3 static website hosting. Then include the correct URL to those assets in the HTML.

I put the exact address of each JS/CSS file I wanted to include into my HTML. You need to use the https address, not the http address of the bucket. Each file has its own https address, which can be found by following Mark B's instructions above. Specifically, in the AWS admin console, navigate to the file in the S3 bucket, click the "Properties" button in the upper right, copy the "Link" field, and paste it into the HTML file (which was also hosted in S3 in my case). The HTML looks like this:
<link href="https://s3-us-west-2.amazonaws.com/my-bucket-name/css/bootstrap.min.css" rel="stylesheet">
I don't have static website hosting enabled on the bucket. I don't have any CORS permissions allowing reading from a certain host.
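If the HTML is generated in the Lambda function, the asset links can be built from the bucket name and object key instead of being copied by hand. The following sketch reproduces the URL shape shown above; the helper names are hypothetical, and the region and bucket are the ones from the example link.

```python
from html import escape
from urllib.parse import quote


def s3_https_url(region: str, bucket: str, key: str) -> str:
    """Build a path-style HTTPS URL in the same shape as the link above."""
    return f"https://s3-{region}.amazonaws.com/{bucket}/{quote(key)}"


def stylesheet_tag(href: str) -> str:
    """Return a <link> element for a stylesheet URL."""
    return f'<link href="{escape(href, quote=True)}" rel="stylesheet">'


url = s3_https_url("us-west-2", "my-bucket-name", "css/bootstrap.min.css")
print(stylesheet_tag(url))
```

The objects still need to be publicly readable for the browser to fetch them.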

Redirect /blog to / in Amazon CloudFront and S3

My existing blog was built on Octopress and served from a VPS. It has the following path structure:
http://blog.example.com/blog/2014/12/26/title-of-the-blog/
I have now moved to Jekyll and deployed the blog to S3 and CloudFront. The new structure of my blog is:
http://blog.example.com/2014/12/26/title-of-the-blog/
There is no blog segment in the URL path. My old links are already out there on social networking sites (Twitter, Facebook) and in search engines. If any traffic comes from the old links, I want to redirect links with the /blog/ prefix to /.
How can I do this in CloudFront or S3?

I think this might solve your problem: http://docs.aws.amazon.com/AmazonS3/latest/dev/how-to-page-redirect.html. If the blog is pretty large, though, you might consider using the API to generate these redirects.
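Generating the redirects through the API could start from a mapping like the one sketched below: for every page in the new layout, compute the old /blog/ key and the location it should redirect to. The function name is hypothetical, and the upload step (for example, creating an empty object per old key with boto3's `put_object` and its `WebsiteRedirectLocation` parameter) is an assumption that is not shown here. Object-level redirects of this kind are only honoured by the S3 website endpoint.

```python
def blog_redirects(new_keys, old_prefix="blog/"):
    """Map each old /blog/... object key to the new URL path it should redirect to."""
    redirects = {}
    for key in new_keys:
        target = "/" + key
        # Link to the directory-style URL rather than the index file itself.
        if target.endswith("index.html"):
            target = target[: -len("index.html")]
        redirects[old_prefix + key] = target
    return redirects


# Hypothetical key produced by the Jekyll build.
print(blog_redirects(["2014/12/26/title-of-the-blog/index.html"]))
```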
