React Router + AWS Backend, how to SEO

I am using React and React Router in my single page web application. Since I'm doing client side rendering, I'd like to serve all of my static files (HTML, CSS, JS) with a CDN. I'm using Amazon S3 to host the files and Amazon CloudFront as the CDN.
When the user requests /css/styles.css, the file exists so S3 serves it.
When the user requests /foo/bar, this is a dynamic URL so S3 adds a hashbang: /#!/foo/bar. This will serve index.html. On my client side I remove the hashbang so my URLs are pretty.
This all works great for 100% of my users.
All static files are served through a CDN
A dynamic URL will be routed to /#!/{...} which serves index.html (my single page application)
My client side removes the hashbang so the URLs are pretty again
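For reference, that client-side cleanup can be very small. Here is a hedged sketch (not the asker's actual code) of stripping the #! prefix with history.replaceState before the router takes over; all names are illustrative:

```typescript
// Hypothetical sketch: if S3 redirected the browser to /#!/foo/bar, rewrite
// the address bar back to /foo/bar before rendering the app.
function stripHashbang(): void {
  const { hash, origin } = window.location;
  if (hash.startsWith('#!')) {
    const prettyPath = hash.slice(2) || '/'; // "#!/foo/bar" -> "/foo/bar"
    window.history.replaceState(null, '', origin + prettyPath);
  }
}

// Run once on startup, before the router initializes.
stripHashbang();
```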
The problem
The problem is that Google won't crawl my website. Here's why:
Google requests /
They see a bunch of links, e.g. to /foo/bar
Google requests /foo/bar
They get redirected to /#!/foo/bar (302 Found)
They remove the hashbang and request /
Why is the hashbang being removed? My app works great for 100% of my users so why do I need to redesign it in such a way just to get Google to crawl it properly? It's 2016, just follow the hashbang...
</rant>
Am I doing something wrong? Is there a better way to get S3 to serve index.html when it doesn't recognize the path?
Setting up a node server to handle these paths isn't the correct solution because that defeats the entire purpose of having a CDN.
In this thread Michael Jackson, top contributor to React Router, says "Thankfully hashbang is no longer in widespread use." How would you change my set up to not use the hashbang?
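To illustrate what a hashbang-free client looks like, here is a minimal sketch using React Router's browser history (shown with the react-router-dom v5-style API; the 2016-era equivalent was `browserHistory` from react-router). The components and paths are placeholders:

```typescript
// index.tsx - hedged sketch, assuming react-router-dom v5 and placeholder pages
import React from 'react';
import ReactDOM from 'react-dom';
import { BrowserRouter, Route, Switch } from 'react-router-dom';
import { Home, FooBar, NotFound } from './pages'; // hypothetical components

ReactDOM.render(
  <BrowserRouter>
    <Switch>
      <Route exact path="/" component={Home} />
      <Route path="/foo/bar" component={FooBar} />
      {/* Because S3/CloudFront always answer with index.html, unknown paths
          have to be handled here, e.g. with a 404 component. */}
      <Route component={NotFound} />
    </Switch>
  </BrowserRouter>,
  document.getElementById('root')
);
```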

You can also check out this trick. You need to set up a CloudFront distribution and then alter the 404 behaviour in the "Error Pages" section of your distribution. That way you can serve domain.com/foo/bar links again :)
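If the distribution is managed as code, the same "Error Pages" trick looks roughly like this with the AWS CDK (a hedged sketch; the construct names are placeholders, and doing it by hand in the CloudFront console works just as well):

```typescript
// Hedged CDK sketch: map S3's 403/404 responses to index.html with a 200,
// so /foo/bar is answered by the SPA instead of an error page.
import * as cdk from 'aws-cdk-lib';
import * as s3 from 'aws-cdk-lib/aws-s3';
import * as cloudfront from 'aws-cdk-lib/aws-cloudfront';
import * as origins from 'aws-cdk-lib/aws-cloudfront-origins';

const app = new cdk.App();
const stack = new cdk.Stack(app, 'SpaStack');
const siteBucket = new s3.Bucket(stack, 'SiteBucket'); // placeholder bucket

new cloudfront.Distribution(stack, 'SiteDistribution', {
  defaultBehavior: { origin: new origins.S3Origin(siteBucket) },
  defaultRootObject: 'index.html',
  errorResponses: [
    { httpStatus: 403, responseHttpStatus: 200, responsePagePath: '/index.html' },
    { httpStatus: 404, responseHttpStatus: 200, responsePagePath: '/index.html' },
  ],
});
```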

I know this question is a few months old, but for anyone who comes across the same problem: you can simply specify "index.html" as the error document in S3. The error document property can be found under bucket Properties => Static Website Hosting => Enable website hosting.
Please keep in mind that taking this approach means you will be responsible for handling HTTP errors like 404 in your own application.
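In infrastructure code, that setting is a single property on the bucket. A hedged AWS CDK sketch (the construct name is a placeholder, and the public-read bucket policy the website endpoint needs is left out for brevity):

```typescript
// Hedged sketch: S3 static website hosting where the error document is also
// index.html, so any unknown path falls back to the SPA shell.
import * as cdk from 'aws-cdk-lib';
import * as s3 from 'aws-cdk-lib/aws-s3';

const app = new cdk.App();
const stack = new cdk.Stack(app, 'SpaBucketStack');

new s3.Bucket(stack, 'SiteBucket', {
  websiteIndexDocument: 'index.html',
  websiteErrorDocument: 'index.html', // unknown paths serve the SPA shell
});
```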

The hashbang is not recommended when you want to make an SEO-friendly website; even if the page is indexed by Google, it will show only thin content.
The best way to build your website is with the current approach, "progressive enhancement"; search for it on Google and you will find many articles about it.
Mainly, you should have a separate link for each page, and when the user clicks on any page they are taken to that page with whatever effect you want, even if it is a single-page website.
That way Google has a unique link for each page and the user still gets the fancy effect and a great UX.
Example: a "Contact Us" link that points to its own URL.
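A hedged sketch of the idea (the paths and component are made up): every page gets its own crawlable URL, while navigation still happens client-side through the router's Link component.

```typescript
// Hypothetical sketch: real hrefs that Google can follow, rendered as
// client-side links by react-router-dom.
import React from 'react';
import { Link } from 'react-router-dom';

export function SiteNav(): JSX.Element {
  return (
    <nav>
      <Link to="/">Home</Link>
      <Link to="/about">About</Link>
      <Link to="/contact-us">Contact Us</Link>
    </nav>
  );
}
```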

Related

Is it possible to have GitHub Readme images follow redirects?

I'm trying to add a test coverage badge to the Readme of a private repository on GitHub. Our continuous integration process saves out the image to a secured Google Cloud Storage bucket that's not accessible to the public, and should remain that way.
Google's authorization layer is smart enough that if I go to the URL for the image, I'm automatically redirected to the resource with a valid auto-generated signed URL.
E.g., if I go to http://storage.cloud.google.com/secret-files/mysecretfile.png, then if I'm logged in and allowed to view it, I'm automatically redirected to something like https://blahblah-apidata.googleusercontent.com/download/storage/v1/b/secret-files/o/mysecretfile.png?key=verylongkey, where I can load the image.
This seemed perfect. Reference the canonical path in the GitHub Readme, authenticated users see the image, unauthenticated users are still blocked, we don't have to make the file public, and we don't have to do anything complicated.
Except that GitHub is proxying the image request, meaning that it will always be unauthenticated. My browser is loading something like https://camo.githubusercontent.com/mysecretimage.png.
Is there a clever way to work around this? Or do I need to go back to the drawing board?
All images on github.com are proxied through the Camo image proxy. There are a couple of reasons for this:
It preserves the privacy of users: a document can't track users by directing them to a different site or by setting cookies.
It means images can be cached and served at an appropriate size.
GitHub can have a very strict content security policy that does not allow loading from untrusted sites, which means that any sort of accidental security problem (like an XSS) is a lot less likely to work.
Note the last part. Even if you found some sneaky way to get another image URL to render properly in the website, your browser wouldn't load it because it violates the Content-Security-Policy header the site sent, and moreover, your browser would tattle about that to the reporting URL that GitHub provided.
So any image URL you provide will need to be readable by GitHub's image proxy and it won't be possible to serve different content to different users.

Rewriting static resources url in .htaccess for CDN

I have an existing live application based on a PHP/MySQL/Apache stack. A quick performance evaluation revealed that a CDN would help us gain a lot of speed. We are planning to use CloudFront for the CDN.
The issue is that the existing code wasn't written with a CDN in mind.
At the moment, our HTML output references static resources in link tags with paths like "./images/test.png", etc.
Is there any way to identify these links just before sending the output and rewrite them to load from the CDN URL?
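As a language-agnostic illustration of the rewrite idea (sketched in TypeScript; in the PHP/Apache stack from the question, an output buffer such as ob_start, or a filter like mod_substitute, would play the same role), you can post-process the generated HTML and point relative asset references at the CDN host. The domain and regex are placeholders:

```typescript
// Hedged sketch: rewrite relative asset references in the generated HTML so
// they load from the CDN instead of the origin server.
const CDN_HOST = 'https://d1234example.cloudfront.net'; // placeholder domain

export function rewriteAssetUrls(html: string): string {
  // Handles values like "./images/test.png" or "/css/styles.css";
  // the regex is intentionally simple for illustration.
  return html.replace(
    /(src|href)=["']\.?(\/(?:images|css|js)\/[^"']+)["']/g,
    (_match: string, attr: string, path: string) => `${attr}="${CDN_HOST}${path}"`
  );
}

// Example:
// rewriteAssetUrls('<img src="./images/test.png">')
// -> '<img src="https://d1234example.cloudfront.net/images/test.png">'
```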

How to make sure web crawler works for site hosted on AWS S3 and uses AJAX

Google's webmaster guide explains that the web server should handle requests for URLs that contain _escaped_fragment_ (the crawler rewrites www.example.com/ajax.html#!mystate as www.example.com/ajax.html?_escaped_fragment_=mystate).
http://support.google.com/webmasters/bin/answer.py?hl=en&answer=174992
My site is hosted on AWS S3 and I have no web server to handle such requests. How can I make sure the crawler gets fed and my site gets indexed?
S3-hosted sites are static HTML: no POST handling, no PHP rendering, nothing server-side. So why do you care about Google's AJAX indexing scheme?
For a static website, simply upload well-formed robots.txt and sitemap.xml files to your root path.
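If the pages are known at build time, both files can be generated from the route list and uploaded to the bucket root next to index.html. A hedged sketch (the domain and routes are placeholders):

```typescript
// Hedged sketch: build sitemap.xml and robots.txt for a static S3 site from
// a known list of routes at build time.
import { writeFileSync } from 'fs';

const SITE = 'https://www.example.com';          // placeholder domain
const routes = ['/', '/foo/bar', '/contact-us']; // placeholder routes

const urls = routes
  .map((path) => `  <url><loc>${SITE}${path}</loc></url>`)
  .join('\n');

const sitemap =
  `<?xml version="1.0" encoding="UTF-8"?>\n` +
  `<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n` +
  `${urls}\n</urlset>\n`;

writeFileSync('sitemap.xml', sitemap);

// robots.txt allows everything and points at the sitemap.
writeFileSync('robots.txt', `User-agent: *\nAllow: /\nSitemap: ${SITE}/sitemap.xml\n`);
```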

How to Inform Google For Page URL Modifications in Same Domain?

I am redesigning my website and changing the site structure. It was in ASP and now it will be in ASP.NET.
So the page URLs will change; some pages will be removed and some will be added, but mostly the content and page names stay the same, only the URLs change.
The site has had SEO work done and we want to lose as little of that as possible. The site is registered in Analytics and Webmaster Tools.
Google searches will end up at blank pages and I don't want to lose my rank.
So I'm looking for a way to inform Google about the new page URLs. The domain is the same, only the URLs change. For example: the home page was /default.asp and is now /home.aspx.
Is there a way to tell Google that a particular URL address or page name has changed?
If all that is changing are the page URLs, Google Analytics cannot "know" that a page is the same one, just with a different URL.
However, you could record a customized pageview using the _trackPageview() method, giving it the original URL as a parameter.
If you choose to do this, you will have to remove the line that calls the method in the original GA snippet and call it yourself, passing the original URL directly. This is done on each page.
You can also read more about the method here.
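For what it's worth, with the classic ga.js snippet the customized pageview looks roughly like this (a hedged sketch; the account ID and path are placeholders, and newer analytics.js/gtag setups have their own equivalents):

```typescript
// Hedged sketch of a customized pageview with classic ga.js: report the old
// URL (/default.asp) even though the page now lives at /home.aspx.
declare const _gaq: Array<[string, ...unknown[]]>; // defined by the ga.js snippet

_gaq.push(['_setAccount', 'UA-XXXXX-Y']);      // placeholder account ID
_gaq.push(['_trackPageview', '/default.asp']); // report the original URL
```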
For IIS (ASP.NET), you want to look into the following to find out how to do 301 redirects:
Response.RedirectPermanent(...) for redirecting from a page
or
"IIS 7 Routing Module and web.config" to set up bulk redirecting
I'd also suggest you consider supporting Search Engine Friendly (SEF) URLs while you're making the move. The Routing Module can help you there as well.
You need to implement some form of 301 redirect (301 is key). This way, when Google or any other search engine hits an old page, the index is refreshed with the new page. ASP.NET allows you to do these redirects even at the IIS level, which is where I'd suggest they live. You'll also want to submit an up-to-date sitemap in Webmaster Tools.
Edit: Here's a good link on the redirects, http://www.iis.net/ConfigReference/system.webServer/httpRedirect
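As a language-neutral sketch of the idea only (the question's stack would use Response.RedirectPermanent or the web.config httpRedirect section from the link above instead), a redirect map that answers old URLs with 301 Moved Permanently could look like this; the mappings are placeholders:

```typescript
// Hedged sketch: keep a map of old URLs to new ones and answer requests for
// old URLs with a permanent (301) redirect so search engines update their index.
import express from 'express';

const redirects: Record<string, string> = {
  '/default.asp': '/home.aspx', // placeholder mappings
  '/about.asp': '/about.aspx',
};

const app = express();

app.use((req, res, next) => {
  const target = redirects[req.path];
  if (target) {
    res.redirect(301, target); // permanent redirect preserves the SEO value
  } else {
    next();
  }
});

app.listen(8080);
```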

Umbraco - use HTTPS for some pages

I'm building a site with Umbraco, and there are a couple of pages that need to be visited over HTTPS instead of HTTP (e.g. a login page).
I've seen a couple of macros that get put on the page that needs to use HTTPS, and essentially just check the protocol used and do a Response.Redirect with the correct protocol if necessary. This seems like a poor way of achieving a fairly basic requirement: ideally I'd want Umbraco to render any links to these pages as <a href="https://...", not do a redirect when the user reaches the page.
With these redirecting macros, there's also the possibility of a browser displaying a warning if the user is on an HTTPS page and navigates to an HTTP one. If the links are relative, the user will be redirected from HTTPS to HTTP, and the browser may warn about this.
Is there a way to achieve this without modifying any Umbraco framework code?
There's currently no built-in way to make a few pages in Umbraco return an HTTPS URL.
The only way I can think of doing this at the moment is just by making sure that you set up your links correctly.
But there's no way of stopping people from entering the insecure link. That is where the redirects come in handy, though: they make sure you don't reach a secure page insecurely.
I would recommend running the whole site over HTTPS. In the past, performance might have been an objection to doing that, but with modern servers this really shouldn't be a problem any more.