Use `robots.txt` in a multilingual site - seo

I have to manage a multilingual site where users are redirected to a local version of the site, like:
myBurger.com/en // for users from the US, UK, etc.
myBurger.com/fr // for users from France, Switzerland, etc.
How should the robots.txt file be organized in relation to the sitemap?
myBurger.com/robots.txt // with - Sitemap: http://myBurger.com/??/sitemap
OR
myBurger.com/en/robots.txt // with - Sitemap: http://myBurger.com/en/sitemap
myBurger.com/fr/robots.txt // with - Sitemap: http://myBurger.com/fr/sitemap
knowing that the en and fr sites are in fact independent entities that do not share common content, even if they look similar.

You need to put one robots.txt at the top level.
The robots.txt file must be in the top-level directory of the host,
accessible through the appropriate protocol and port number.
https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt

Put the robots.txt at the root, myBurger.com/robots.txt, and register your sitemaps in the robots.txt file using the Sitemap: directive (see the example below).
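For example, a single root robots.txt could register both sitemaps at once (the sitemap file names below are assumptions; adjust them to wherever your sitemaps actually live):

User-agent: *
Disallow:

Sitemap: http://myBurger.com/en/sitemap.xml
Sitemap: http://myBurger.com/fr/sitemap.xml

Multiple Sitemap: lines are allowed in one file, so the two independent sitemaps can both be registered at the root.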

Related

Noindex Only One Subdomain

I am having difficulty finding information on how to completely noindex only one particular subdomain via .htaccess (from my understanding, that's the best way?). It is important for me that only that one subdomain and its files are never indexed or crawlable.
I have an Apache server that uses Plesk, and the subdomain is for email software we use for newsletter campaigns etc.
The subdomain is "mail" (e.g. https://mail.test.com) and my goal is to make only "mail" noindex, because for some reason the software has SEO features that can wind up harming our site in general.
Create a robots.txt inside the subdomain's document root with the following content:
User-agent: *
Disallow: /
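If you specifically need noindex rather than just a crawl block, a possible alternative is an X-Robots-Tag header set in that subdomain's .htaccess. This is a minimal sketch assuming Apache with mod_headers enabled (worth verifying on your Plesk server):

<IfModule mod_headers.c>
# Ask search engines not to index or follow anything served from this subdomain
Header set X-Robots-Tag "noindex, nofollow"
</IfModule>

Note that a crawler must be able to fetch a page to see this header, so it should not be combined with a Disallow: / rule that blocks crawling entirely.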

Single robots.txt file for all subdomains

I have a site (example.com) with my robots.txt set up in the root directory. I also have multiple subdomains (foo.example.com, bar.example.com, and more to come in the future) whose robots.txt files will all be identical to that of example.com. I know that I can place a file at the root of each subdomain, but I'm wondering if it's possible to redirect crawlers searching for robots.txt on any subdomain to example.com/robots.txt?
Sending a redirect header for your robots.txt file is not advised, nor is it officially supported.
Google's documentation specifically states:
Handling of robots.txt redirects to disallowed URLs is undefined and discouraged.
But the documentation does say redirects "will be generally followed". If you add your subdomains into Google Webmaster Tools and go to "Crawl > Blocked URLs", you can test the subdomain robots.txt files that are 301 redirecting; they should come back as working.
However, with that said, I would strongly suggest that you just symlink the files into place so that each robots.txt responds with a 200 OK at the appropriate URL. This is much more in line with the original robots.txt specification, as well as Google's documentation, and who knows exactly how Bing and Yahoo will handle redirects over time.
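A minimal sketch of the symlink approach on a typical Linux layout (the document-root paths below are assumptions):

# Serve one shared robots.txt from every subdomain's document root
ln -s /var/www/example.com/robots.txt /var/www/foo.example.com/robots.txt
ln -s /var/www/example.com/robots.txt /var/www/bar.example.com/robots.txt

Each subdomain then answers requests for /robots.txt with a 200 OK and identical content, while there is only one file to maintain.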

Avoid google indexing subdomains

As far as I have searched, I have not been able to find a proper answer to this kind of problem.
I have a few separate domains hosted on the same cPanel account.
One of them is known as the main domain, and the rest are secondary (addon) domains.
cPanel automatically creates subdomains when you add a secondary domain, something like:
http://secondary.maindomain.com
My problem is that Google has indexed my pages under both addresses, like:
secondary.com/blabla.html
secondary.maindomain.com/blabla.html
How can I remove those pages from Google's index? And
how can I prevent those subdomains from being indexed in the future?
For this purpose you can add a robots.txt to the document root and use a 'Disallow:' rule to stop any search engine, including Google, from indexing your files or directories.
For example, to stop Google from indexing your subdomain, add the entries below to a robots.txt placed in the document root of your subdomain:
User-agent: Googlebot
Disallow: /
or for all search engines:
User-agent: *
Disallow: /

robots.txt which folders to disallow - SEO?

I am currently writing my robots.txt file and have some trouble deciding whether I should allow or disallow some folders for SEO purposes.
Here are the folders I have:
/css/ (css)
/js/ (javascript)
/img/ (images I use for the website)
/php/ (PHP scripts that return a blank page, for example checkemail.php, which checks an email address, or register.php, which puts data into a SQL database and sends an email)
/error/ (my error 401,403,404,406,500 html pages)
/include/ (header.html and footer.html I include)
I was thinking about disallowing only the PHP pages and leaving the rest.
What do you think?
Thanks a lot
Laurent
/css and /js -- CSS and JavaScript files will probably be crawled by Googlebot whether or not you have them in robots.txt. Google uses them to render your pages for site previews, and Google has asked nicely that you not put them in robots.txt.
/img -- Googlebot may crawl this even when it is in robots.txt, the same way as CSS and JavaScript. However, putting your images in robots.txt generally prevents them from being indexed in Google Image Search. Google Image Search may be a source of visitors to your site, so you may wish to be indexed there.
/php -- it sounds like you don't want spiders hitting the URLs that perform actions. Good call to use robots.txt here.
/error -- if your site is set up correctly, the spiders will probably never know what directory your error pages are served from. They generally get served at the URL that has the error, and the spider never sees their actual URL. This isn't the case if you redirect to them, which isn't recommended practice anyway. As such, I would say there is no need to put them in robots.txt.
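Putting that advice together, the resulting robots.txt might look like this minimal sketch, which blocks only the action-performing PHP scripts and leaves everything else crawlable:

User-agent: *
Disallow: /php/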

Can a relative sitemap url be used in a robots.txt?

In robots.txt can I write the following relative URL for the sitemap file?
sitemap: /sitemap.ashx
Or do I have to use the complete (absolute) URL for the sitemap file, like:
sitemap: http://subdomain.domain.com/sitemap.ashx
Why I wonder:
I own a new blog service, www.domain.com, that allows users to blog on accountname.domain.com.
I use wildcards, so all subdomains (accounts) point to: "blog.domain.com".
On blog.domain.com I put the robots.txt to let search engines find the sitemap.
But, due to the wildcards, all user accounts share the same robots.txt file. That's why I can't use the second alternative. And for now I can't use URL rewriting for .txt files. (I guess that later versions of IIS can handle this?)
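(On the parenthetical IIS question: the URL Rewrite module available for IIS 7 and later can rewrite requests for .txt files, so the shared robots.txt could be generated per host by a dynamic handler that emits the right absolute Sitemap: URL for each account. A minimal web.config sketch, where robots.ashx is a hypothetical handler you would write yourself:)

<configuration>
  <system.webServer>
    <rewrite>
      <rules>
        <!-- Serve robots.txt from a dynamic handler so each host gets its own Sitemap: URL -->
        <rule name="DynamicRobots">
          <match url="^robots\.txt$" />
          <action type="Rewrite" url="robots.ashx" />
        </rule>
      </rules>
    </rewrite>
  </system.webServer>
</configuration>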
According to the official documentation on sitemaps.org, it needs to be a full URL:
You can specify the location of the Sitemap using a robots.txt file. To do this, simply add the following line including the full URL to the sitemap:
Sitemap: http://www.example.com/sitemap.xml
Google's crawlers can't resolve relative URLs here, which is why absolute URLs are always recommended for better crawlability and indexability.
Therefore, you cannot use this variation:
> sitemap: /sitemap.xml
The recommended syntax is:
Sitemap: https://www.yourdomain.com/sitemap.xml
Note:
Don't forget to capitalise the first letter of "Sitemap".
Don't forget to put a space after "Sitemap:".
Good technical and logical question, my dear friend.
No, in a robots.txt file you can't use a relative URL for the sitemap; you need to use the complete URL of the sitemap.
It's better to go with "Sitemap: https://www.example.com/sitemap_index.xml".
In the above line, note the space after the colon.
I would also like to support Deepak's answer.