Single robots.txt file for all subdomains - apache

I have a site (example.com) and have my robots.txt set up in the root directory. I also have multiple subdomains (foo.example.com, bar.example.com, and more to come in the future) whose robots.txt files will all be identical to that of example.com. I know that I can place a file at the root of each subdomain, but I'm wondering whether it's possible to redirect crawlers searching for robots.txt on any subdomain to example.com/robots.txt?

Sending a redirect header for your robots.txt file is not advised, nor is it officially supported.
Google's documentation specifically states:
Handling of robots.txt redirects to disallowed URLs is undefined and discouraged.
That said, the documentation also notes that redirects "will be generally followed". If you add your subdomains to Google Webmaster Tools and go to "Crawl > Blocked URLs", you can test a subdomain's robots.txt that is 301-redirecting, and it should report the file as working.
However, with that said, I would strongly suggest that you simply symlink the files into place so that each robots.txt responds with a 200 OK at the appropriate URL. That is much more in line with the original robots.txt specification and with Google's documentation, and who knows exactly how Bing or Yahoo will handle the redirect over time.
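If all the subdomains are served from the same Apache box, one way to do the symlink-style setup purely in the Apache config (a sketch; the /var/www/example.com path and per-subdomain VirtualHost layout are assumptions about your setup) is to alias every subdomain's /robots.txt to the main site's file, so each host answers with a 200 OK:
# Inside each subdomain's <VirtualHost> block
Alias /robots.txt /var/www/example.com/robots.txt
# Make sure the main docroot is readable from this vhost
<Directory /var/www/example.com>
    Require all granted
</Directory>
A plain filesystem symlink from each subdomain's document root to example.com's robots.txt achieves the same result; the Alias just keeps everything visible in one config file.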

Related

How may I prevent search engines from crawling a subdomain on my website?

I have cPanel installed on my website.
I went to the Domains section in cPanel and clicked on Subdomains.
I assigned the subdomain name (e.g. personal.mywebsite.com).
It also wanted me to assign a document root folder, so I assigned mywebsite.com/personal.
If I create a robots.txt in my website root (e.g. mywebsite.com) containing
User-agent: *
Disallow: /personal/
can it also block personal.mywebsite.com? What should I do?
Thanks
When you want to block URLs on personal.example.com, the robots.txt that counts is the one at http://personal.example.com/robots.txt (or https instead of http).
It doesn't matter how your server organizes folders in the backend; it only matters which robots.txt is served when accessing that URL.
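In your setup the subdomain's document root is mywebsite.com/personal, so whatever robots.txt sits in that folder is what crawlers get at personal.mywebsite.com/robots.txt. To block the whole subdomain, that file would simply need:
User-agent: *
Disallow: /
The robots.txt in the main root only governs mywebsite.com itself; its Disallow: /personal/ line blocks mywebsite.com/personal/... URLs, not the subdomain's own hostname.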

Disallow crawling of the CDN site

So I have a site http://www.example.com.
The JS/CSS/images are served from a CDN - http://xxxx.cloudfront.net OR http://cdn.example.com; they are both the same thing. The CDN just serves any type of file, including my PHP pages, and Google somehow started crawling that CDN site as well - two sites, actually: cdn.example.com AND http://xxxx.cloudfront.net. Considering that:
I am NOT trying to set up a subdomain or a mirror site. If that happens, it is a side effect of setting up a CDN.
The CDN is some web server, not necessarily Apache; I do not know what type of server it is.
There is no request processing on the CDN; it just fetches things from the origin server. I don't think you can put custom files on the CDN itself; whatever you need on the CDN has to come from the origin server.
How do I prevent the crawling of PHP pages?
Should I allow crawling of images from cdn.example.com or from example.com? The image links inside the HTML all point to cdn.example.com. If I only allow crawling of images from example.com, then there is practically nothing to crawl - there are no links to such images. If I allow crawling of images from cdn.example.com, does that not leak away the SEO benefit?
Some alternatives that I considered, based on stackoverflow answers:
Write a custom robots_cdn.txt and serve it based on HTTP_HOST. This is as per many answers on Stack Overflow.
Serve a new robots.txt from the subdomain. As I explained above, I do not think the CDN can be treated like a subdomain.
Do a 301 redirect to www.example.com when HTTP_HOST is cdn.example.com.
Suggestions?
A related question: How to disallow a mirror site (on a subdomain) using robots.txt?
You can put a robots.txt in your root directory so that it will be served at cdn.yourdomain.com/robots.txt. In this robots.txt you can disallow all crawlers with the setting below:
User-agent: *
Disallow: /
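Keep in mind that the CDN fetches everything from the origin, so a blanket disallow in the origin's root robots.txt would also be served to crawlers hitting www.example.com. One way around this, along the lines of option 1 in the question (a sketch assuming an Apache origin with mod_rewrite, and that the CDN forwards the original Host header; robots_cdn.txt is the question's own name for the CDN-only file), is to serve the blocking file only for the CDN hostnames:
# In the origin's .htaccess: serve robots_cdn.txt for requests arriving via the CDN hostnames
RewriteEngine On
RewriteCond %{HTTP_HOST} ^(cdn\.example\.com|.*\.cloudfront\.net)$ [NC]
RewriteRule ^robots\.txt$ /robots_cdn.txt [L]
robots_cdn.txt would then contain the User-agent: * / Disallow: / rules above, while the normal robots.txt served on www.example.com stays unrestricted.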

robots.txt which folders to disallow - SEO?

I am currently writing my robots.txt file and have some trouble deciding whether I should allow or disallow some folders for SEO purposes.
Here are the folders I have:
/css/ (css)
/js/ (javascript)
/img/ (images i use for the website)
/php/ (PHP scripts that return a blank page, for example checkemail.php, which checks an email address, or register.php, which puts data into an SQL database and sends an email)
/error/ (my 401, 403, 404, 406, and 500 HTML error pages)
/include/ (header.html and footer.html that I include)
I was thinking about disallowing only the PHP pages and allowing the rest.
What do you think?
Thanks a lot
Laurent
/css and /js -- CSS and JavaScript files will probably be crawled by Googlebot whether or not you have them in robots.txt. Google uses them to render your pages for site preview, and Google has asked nicely that you not put them in robots.txt.
/img -- Googlebot may crawl this even when it is in robots.txt, the same way as CSS and JavaScript. Putting your images in robots.txt generally prevents them from being indexed in Google Image Search. Google Image Search may be a source of visitors to your site, so you may wish to be indexed there.
/php -- It sounds like you don't want spiders hitting the URLs that perform actions. Good call to use robots.txt here.
/error -- If your site is set up correctly, spiders will probably never know which directory your error pages are served from. They generally get served at the URL that has the error, and the spider never sees their actual URL. This isn't the case if you redirect to them, which isn't recommended practice anyway. As such, I would say there is no need to put them in robots.txt.
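Put together, a minimal robots.txt for this layout would only block the action scripts (folder names taken from the question):
User-agent: *
Disallow: /php/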

Redirecting Pages: Names to Standard Address

I have WordPress installed in the root of a website, and recently enabled a custom permalink structure just for the sake of having good looking page URLs (only pages are used in this website, no posts at all — it's not a blog). Unfortunately this is causing some problems with other parts of the website, outside WordPress.
So I'd like to go the manual way and redirect URLs like /my-page to /?page_id=32, just for a selected set of pages. Is it possible to do that using the .htaccess file? What would the rules look like?
If you're redirecting pages from WordPress to other URLs, you can use .htaccess, but it's probably easier to use a plugin than to edit .htaccess by hand.
See the Redirection plugin (WordPress › Redirection « WordPress Plugins) to easily set up redirects and log redirects, errors, and more.
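If you do want the manual .htaccess route, the rules could look something like this (a sketch using the /my-page slug and page ID 32 from the question; you'd add one rule per page, above the standard WordPress rewrite block):
# Map a pretty page URL back to its WordPress page ID
RewriteEngine On
RewriteRule ^my-page/?$ /index.php?page_id=32 [L]
As written this is an internal rewrite, so visitors keep seeing /my-page in the address bar; change the flags to [R=301,L] if you want an actual redirect to /?page_id=32.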

sitemap for multiple domains of same site

Here is the situation: I have a website that can be accessed from multiple domains, let's say www.domain1.com, www.domain2.net, and www.domain3.com. The domains access the exact same code base, but depending on the domain, different CSS, graphics, etc. are loaded.
Everything works fine, but now my question is: how do I deal with the sitemap.xml?
I wrote the sitemap.xml for the default domain (www.domain1.com), but what about when the site is accessed from the other domains? The content of the sitemap.xml will contain the wrong domain.
I read that I can add multiple sitemap files to robots.txt, so does that mean that I can, for example, create sitemap-domain2.net.xml and sitemap-domain3.com.xml (containing the links with the matching domains) and simply add them to robots.txt?
Somehow I have doubts that this would work, so I turn to you experts to shed some light on the subject :)
Thanks
You should use server-side code to send the correct sitemap based on the domain name for requests to /sitemap.xml
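With Apache, for instance, that can be a small rewrite keyed on the Host header (a sketch; the sitemaps/ folder and per-host file naming are my own convention, not anything standard):
# Serve a per-domain sitemap, e.g. sitemaps/www.domain2.net.xml
RewriteEngine On
RewriteCond %{DOCUMENT_ROOT}/sitemaps/%{HTTP_HOST}.xml -f
RewriteRule ^sitemap\.xml$ /sitemaps/%{HTTP_HOST}.xml [L]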
Apache rewrite rules for /robots.txt requests
If you're using Apache as the web server, you can create a directory called robots, put a robots.txt in it for each website you serve from that docroot, and serve the right one with rewrite rules in your .htaccess file like this:
# URL rewrite solution for robots.txt for multiple domains on a single docroot
RewriteEngine On
# not an existing directory
RewriteCond %{REQUEST_FILENAME} !-d
# not an existing file
RewriteCond %{REQUEST_FILENAME} !-f
# and the host-specific robots file exists
RewriteCond %{DOCUMENT_ROOT}/robots/%{HTTP_HOST}.txt -f
RewriteRule ^robots\.txt$ /robots/%{HTTP_HOST}.txt [L]
NginX mapping for /robots.txt requests
When using NginX as a webserver (taking yourdomain1.tld and yourdomain2.tld as example domains), you can achieve the same goal as the Apache solution above with the following map (place it outside your server directive, in the http context):
map $host $robots_file {
    default          /robots/default.txt;
    yourdomain1.tld  /robots/yourdomain1.tld.txt;
    yourdomain2.tld  /robots/yourdomain2.tld.txt;
}
This way you can use this variable in a try_files statement inside your server directive:
location = /robots.txt {
    try_files $robots_file =404;
}
Content of /robots/*.txt
After setting up the aliases to the domain-specific robots.txt files, add a sitemap reference to each of the robots files (e.g. /robots/yourdomain1.tld.txt) using this syntax at the bottom of the file:
# Sitemap for this specific domain
Sitemap: https://yourdomain1.tld/sitemaps/yourdomain1.tld.xml
Do this for all domains you have, and you'll be set!
You have to make sure that the URLs in each XML sitemap match the domain or subdomain they are submitted for. But if you really want, you can host all the sitemaps on one domain; look up "Sitemaps & Cross Submits".
I'm not an expert with this, but I have a similar situation: one domain with 3 subdomains.
In my case each subdomain points to a different directory and contains its own sitemap.xml, so I'm pretty sure a sitemap.xml can be specified for each domain.
The easiest method that I have found to achieve this is to use an XML sitemap generator to create a sitemap for each domain name.
Place each /sitemap.xml in the root directory of its domain or subdomain.
Go to Google Search Console and create separate properties for each domain name.
Submit the appropriate sitemap for each domain in Search Console. The submission should show success.
I'm facing a similar situation in a project I'm working on right now, and Google Search Central actually has the following answer:
If you have multiple websites, you can simplify the process of creating and submitting sitemaps by creating one or more sitemaps that include URLs for all your verified sites, and saving the sitemap(s) to a single location. All sites must be verified in Search Console.
So it seems that as long as you have added the different domains as your properties in Google Search Console, at least Google will know how to deal with the rest, even if you upload sitemaps for the other domains to only one of your properties in the Google Search Console.
For my use case, I then use server side code to generate sitemaps where all the dynamic pages with English content end up getting a location on my .io domain, and my pages with German content end up with a location on the .de domain:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.mydomain.io/page/some-english-content</loc>
    <changefreq>weekly</changefreq>
  </url>
  <url>
    <loc>https://www.mydomain.de/page/some-german-content</loc>
    <changefreq>weekly</changefreq>
  </url>
</urlset>
And then Google handles the rest. See docs.