Disallow crawling of the CDN site - apache

So I have a site http://www.example.com.
The JS/CSS/Images are served from a CDN - http://xxxx.cloudfront.net OR http://cdn.example.com; they are both the same things. Now the CDN just serves any type of file, including my PHP pages. Google somehow got crawling that CDN site as well; two site actually - from cdn.example.com AND from http://xxxx.cloudfront.net. Considering
I am NOT trying set up a subdomain OR a mirror site. If that happens, that is a side affect of me trying to set up a CDN.
CDN is some web server, not necessarily an Apache. I do not know what type of server would that be.
There is no request processing on CDN. it just fetches things from origin server. I think, you cannot put custom files out there on the CDN; it just fetches things from the origin server. Whatever you need to put on the CDN comes from the origin server.
How do I prevent the crawling of PHP pages?
Should I allow crawling of images from cdn.example.com OR from example.com? The links to images inside the HTML are all to cdn.example.com. If I allow crawling of images only from example.com, then there is practically nothing to crawl - there are no links to such images. If I allow crawling of images from cdn.example.com, then does it not leak away the SEO benefits?
Some alternatives that I considered, based on stackoverflow answers:
Write custom robot_cdn.txt and serve that custom robots_cdn.txt based on HTTP_HOST. This is as per many answers on the stack overflow.
Serve a new robots.txt from subdomain. As I explained above, I do not think that CDN can be treated like a subdomain.
Do 301 redirects when HTTP_HOST is cdn.example.com to www.example.com
Suggestions?
Questions related to this, e.g. How Disallow a mirror site (on sub-domain) using robots.txt?

You can put robots.txt in your root directory so that it will be served with cdn.-yourdomain-.com/robots.txt. In this robots.txt you can disallow all the crawlers with the below setting
User-agent: *
Disallow: /

Related

Which URL variations to add in to Analytics and Search Console?

I have a domain example.co.uk on an Apache web server that is secured with a letsencrypt ssl certificate. Currently it redirects all http requests to https. I have also setup redirects from non-www to www, meaning all traffic ends up at https://www.example.co.uk
So I have four variations of the URL that always end up at this location:
http://example.co.uk
https://example.co.uk
http://www.example.co.uk
https://www.example.co.uk
I am trying to set up Google Search Console and Analytics. My question is which URLs do I need to add in to the two? Currently I have all four variations set up in Search Console with a sitemap attached to them all, or do I only need to do this for one? I have told the https www URL to prefer www in search results, which changes it for all four variations.
In Analytics should I only add https://www.example.co.uk as this is where all the traffic ends up, or do I need to add all variations of the URL to see all the traffic?
Short answer: no, unless you are migrating an existing site to https for the first time.
If all requests for your site eventually redirect to https://www.example.co.uk via a permanent 301 status code, then there isn't any benefit to adding all the links in Google Search Console. This feature is useful if you have duplicate content, such as an http site that you can't redirect to your https version for some reason, or if you've just migrated your site to a different domain name or URL scheme. If you're migrating an existing site to https, you can track how many http pages are still indexed while watching your https pages get indexed separately.
Otherwise, if you add all four links, you'll only see pages on the https://www.example.co.uk site get indexed. The Search Console allows you to track your site in the Google index, and if you are using 301 redirects then Google should never index the non-http versions of your site.

How may i prevent search engines from crawling a subdomain on my website?

I have cPanel installed on my website.
I went to the Domains section on cPanel
I clicked on subdomains.
I assigned the subdomain name (e.g : personal.mywebsite.com )
It wanted me to assign document root folder also. I assigned mywebsite.com/personal
if i create robots.txt in my website root(e.g : website.com)
User-agent:
Disallow: /personal/
Can it also block personal.mywebsite.com?
what should i do?
thanks
When you want to block URLs on personal.example.com, visit http://personal.example.com/robots.txt (resp. https instead of http).
It doesn’t matter how your server organizes folders in the backend, it only matters which robots.txt is available when accessing this URL.

Single robots.txt file for all subdomains

I have a site (example.com) and have my robots.txt set up in the root directory. I have also multiple subdomains (foo.example.com, bar.example.com, and more to come in the future) whose robots.txt will all be identical as that of example.com. I know that I can place a file at the root of each subdomain but I'm wondering if it's possible to redirect the crawlers searching for robots.txt on any subdomain to example.com/robots.txt?
Sending a redirect header for your robots.txt file is not advised, nor is it officially supported.
Google's documentation specifically states:
Handling of robots.txt redirects to disallowed URLs is undefined and discouraged.
But the documentation does say redirect "will be generally followed". If you add your subdomains into Google Webmaster Tools and go to "Crawl > Blocked URLs" you can test your subdomain robots.txts that are 301 redirecting. It should come back as positively working.
However, with that said, I would strongly suggest that you just symlink the files into place and that each robots.txt file responds with a 200 OK at the appropriate URLs. This is much more inline with the original robots.txt specification, as well as, Google's documentation, and who knows exactly how bing / yahoo will handle it over time.

robots.txt which folders to disallow - SEO?

I am currently writing my robots.txt file and have some trouble deciding whether I should allow or disallow some folders for SEO purposes.
Here are the folders I have:
/css/ (css)
/js/ (javascript)
/img/ (images i use for the website)
/php/ (PHP which will return a blank page such as for example checkemail.php which checks an email address or register.php which puts data into a SQL database and sends an email)
/error/ (my error 401,403,404,406,500 html pages)
/include/ (header.html and footer.html I include)
I was thinking about disallowing only the PHP pages and let the rest.
What do you think?
Thanks a lot
Laurent
/css and /js -- CSS and Javascript files will probably be crawled by googlebot whether or not you have them in robots.txt. Google uses them to render your pages for site preview. Google has asked nicely that you not put them in robots.txt.
/img -- Googlebot may crawl this even when in robots.txt the same way as CSS and Javascript. Putting your images in robots.txt generally prevents them from being indexed in Google image search. Google image search may be a source of visitors to your site so you may wish to be indexed there.
/php -- sounds like you don't want spiders hitting the urls that perform actions. Good call to use robots.txt
/error -- If your site is set up correctly the spiders will probably never know what directory your error pages are served from. They generally get served at the url that has the error and the spider never sees their actual url. This isn't the case if you redirect to them, which isn't recommended practice anyway. As such, I would say there is no need to put them in robots.txt

Redirecting Pages: Names to Standard Address

I have WordPress installed in the root of a website, and recently enabled a custom permalink structure just for the sake of having good looking page URLs (only pages are used in this website, no posts at all — it's not a blog). Unfortunately this is causing some problems with other parts of the website, outside WordPress.
So I'd like to go the manual way: and redirect URLs like /my-page to /?page_id=32 just for a selected amount of pages. Is it possible to do that using the .htaccess file? What would the rules look like?
If you're redirecting pages from Wordpress to other URLs, you can use .htaccess. But it's probably easier to use a plugin to redirect rather than edit .htaccess.
See WordPress › Redirection « WordPress Plugins to easily set up redirects and log redirects, errors, and more.