I want to disable crawling for my subdomains.
For example:
my main domain is maindomain.com
subdomain_one.com (add-on domain)
subdomain_two.com (add-on domain)
So I want to disable crawling for subdomain_one.maindomain.com.
I have used this in robot.txt:
User-agent: *
Disallow: /subdomain_one/
Disallow: /subdomain_two/
The file must be called robots.txt, not robot.txt.
If you want to disallow all bots from crawling your subdomain, you have to place a robots.txt file in the document root of that subdomain, with the following content:
User-agent: *
Disallow: /
Each host needs its own robots.txt. You can’t specify subdomains inside of the robots.txt, only beginnings of URL paths.
So if you want to block all files on http://sub.example.com/, the robots.txt must be accessible from http://sub.example.com/robots.txt.
It doesn’t matter how your sites are organized on the server side; it only matters what is publicly accessible.
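To make the "one robots.txt per host" rule concrete, here is a minimal Python sketch that derives the robots.txt URL governing a given page. The function name `robots_url` is made up for illustration; the sketch only assumes that the file lives at `/robots.txt` on the same scheme, host, and port as the page.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL that applies to page_url.

    Only the scheme and host (plus port, if any) are kept; the path,
    query string and fragment of the page play no role.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# A page on the subdomain is governed by the subdomain's own file ...
print(robots_url("http://sub.example.com/some/page.html"))
# ... while a sub-directory on the main host still uses the main host's file.
print(robots_url("http://example.com/sub/page.html"))
```

The two calls print `http://sub.example.com/robots.txt` and `http://example.com/robots.txt`, which is why rules written in the main domain's file never reach the subdomain.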
I want my site to be indexed in search engines, except for a few sub-directories. These are my robots.txt settings:
robots.txt in the root directory
User-agent: *
Allow: /
Separate robots.txt in the sub-directory (to be excluded)
User-agent: *
Disallow: /
Is this the correct way, or will the root directory rule override the sub-directory rule?
No, this is wrong.
You can’t have a robots.txt in a sub-directory. Your robots.txt must be placed in the document root of your host.
If you want to disallow crawling of URLs whose paths begin with /foo, use this record in your robots.txt (http://example.com/robots.txt):
User-agent: *
Disallow: /foo
This allows crawling everything (so there is no need for Allow) except URLs like
http://example.com/foo
http://example.com/foo/
http://example.com/foo.html
http://example.com/foobar
http://example.com/foo/bar
…
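If you want to check the prefix matching yourself, a quick sketch with Python's standard `urllib.robotparser` is one way to do it. Keep in mind that this is just one parser implementation, and real crawlers may differ in details; the helper name `allowed` and the extra test URLs are made up for the example.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /foo
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())  # parse from a string, no network access


def allowed(url: str) -> bool:
    """True if a generic crawler ('*') may fetch the URL under RULES."""
    return parser.can_fetch("*", url)


urls = [
    "http://example.com/foo",
    "http://example.com/foo/",
    "http://example.com/foo.html",
    "http://example.com/foobar",
    "http://example.com/foo/bar",
    "http://example.com/",          # not under /foo
    "http://example.com/bar/foo",   # /foo is not at the start of the path
]
for url in urls:
    print(f"{url:32} {'allowed' if allowed(url) else 'blocked'}")
```

The first five URLs come out as blocked and the last two as allowed, because the rule is compared against the beginning of the path only.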
Yes there are
User-agent: *
Disallow: /
The above directive is useful if you are developing a new website and do not want search engines to index your incomplete website.
Also, you can get more advanced information right here.
You can manage them with a robots.txt file, which sits in the root directory. Put your Allow patterns before your Disallow patterns, since some parsers apply the first rule that matches.
I just changed the DNS settings so the folder /forum is now a subdomain instead of a subdirectory. If I do a robots.txt file and say:
User-agent: *
Disallow: /forum
Will that disallow crawling for the subdirectory AND subdomain?
I want to disallow crawling of the subdirectory, but ALLOW crawling of the subdomain. Note: this is on shared hosting so both the subdirectory and subdomain can be visited. This is why I have this issue.
So, how can I permit crawling for the subdomain only?
It's the correct way, if you want to stop crawling. But note: if the URLs are already indexed, they won't be removed.
The way I would prefer is to set all pages to "noindex, follow" via meta tags or, even better, to use the canonical tag to point the search engines to the subdomain URL.
In the <head> of a given URL like "http://www.yourdomain.com/directoryname/post-of-the-day", use
<link rel="canonical" href="http://directoryname.yourdomain.com/post-of-the-day" />
The latter URL (the subdomain one) should then be the only one shown in the SERPs.
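If the pages are generated by a script, the canonical tag can be built from the requested URL. The following Python sketch is only an illustration: the function name `canonical_link` and its default arguments (`directoryname`, `yourdomain.com`) are assumptions taken from the example URLs above, not part of any library.

```python
from html import escape
from urllib.parse import urlsplit, urlunsplit


def canonical_link(url: str,
                   directory: str = "directoryname",
                   domain: str = "yourdomain.com") -> str:
    """Build a canonical <link> tag pointing a sub-directory URL
    at the equivalent URL on the subdomain."""
    parts = urlsplit(url)
    prefix = "/" + directory
    # Only rewrite URLs that really live under the sub-directory.
    if parts.path != prefix and not parts.path.startswith(prefix + "/"):
        raise ValueError(f"{url!r} is not under {prefix}/")
    new_path = parts.path[len(prefix):] or "/"
    target = urlunsplit(
        (parts.scheme, f"{directory}.{domain}", new_path, parts.query, "")
    )
    return f'<link rel="canonical" href="{escape(target, quote=True)}" />'


print(canonical_link("http://www.yourdomain.com/directoryname/post-of-the-day"))
```

This prints the tag with `http://directoryname.yourdomain.com/post-of-the-day` as the `href`. Note that the sub-directory pages must stay crawlable for this to work: a crawler that is blocked by robots.txt never sees the tag.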
User-agent: *
Disallow:
Disallow: /admin
Disallow: /admin
Sitemap: http://www.myadress.com/ext/sm/Sitemap_114.xml
I've found this robots.txt file in the root folder of one of my websites. I don't know whether I made it or someone else did.
I think this file keeps all robots out of the admin folder. This is good.
But I wonder whether it also blocks all robots from all files on my website?
I've changed it with this file:
User-agent: *
Disallow: /admin
Allow: /
Sitemap: http://www.myadress.com/ext/sm/Sitemap_114.xml
PS: This website has not been indexed for a long time. Was the old file the problem?
The old file, other than the duplicate /admin entry, was correct. The original robots.txt standard has no Allow: directive (although major crawlers such as Googlebot do support it), and an empty 'Disallow:' opens the site up to robots.
http://www.free-seo-news.com/all-about-robots-txt.htm
Specifically, check item #7 under 'Things you should avoid' and #1 under 'Tips and Tricks'
Possibly. I read your old file the same way you did - that root was being disallowed. Either way, your new robots file is set up how I expect that you want it.
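To compare the old and the new file side by side, here is a small Python sketch of the "most specific (longest) matching rule wins" evaluation that RFC 9309 describes and that Google documents for Googlebot. It is deliberately simplified: it only reads the `User-agent: *` group, ignores wildcards, and skips rules with an empty path. The names `parse_rules` and `is_allowed` are made up for this example.

```python
OLD_FILE = """\
User-agent: *
Disallow:
Disallow: /admin
Disallow: /admin
Sitemap: http://www.myadress.com/ext/sm/Sitemap_114.xml
"""

NEW_FILE = """\
User-agent: *
Disallow: /admin
Allow: /
Sitemap: http://www.myadress.com/ext/sm/Sitemap_114.xml
"""


def parse_rules(text):
    """Collect (is_allow, path_prefix) pairs from the 'User-agent: *' group."""
    rules = []
    in_star_group = False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            in_star_group = (value == "*")
        elif in_star_group and field in ("allow", "disallow") and value:
            # An empty value (e.g. 'Disallow:') restricts nothing, so skip it.
            rules.append((field == "allow", value))
    return rules


def is_allowed(rules, path):
    """Longest matching prefix wins; on a tie, Allow beats Disallow."""
    best = None  # (prefix length, is_allow)
    for is_allow, prefix in rules:
        if path.startswith(prefix):
            candidate = (len(prefix), is_allow)
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]


for name, text in (("old", OLD_FILE), ("new", NEW_FILE)):
    rules = parse_rules(text)
    print(name, is_allowed(rules, "/index.html"), is_allowed(rules, "/admin/login"))
```

Under these assumptions both files behave the same: `/index.html` is allowed and `/admin/login` is blocked. A parser that applies the first matching rule instead may read the old file differently, so it is worth testing with the tools of the search engine you care about.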
Should I then do
User-agent: *
Disallow: /
Is it as simple as that?
Or will that not crawl the files in the root either?
Basically, that is what I am after: crawling all the files/pages in the root, but not any of the folders at all.
Or am I going to have to specify each folder explicitly, i.e.
disallow: /admin
disallow: /this
.. etc
thanks
nat
Your example will block all the files in the root.
There isn't a "standard" way to easily do what you want without specifying each folder explicitly.
Some crawlers however do support extensions that will allow you to do pattern matching. You could disallow all bots that don't support the pattern matching, but allow those that do.
For example
# disallow all robots
User-agent: *
Disallow: /
# let Google read HTML and PDF files
User-agent: Googlebot
Allow: /*.html
Allow: /*.pdf
Disallow: /
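To get a feel for what such patterns match, here is a rough Python sketch of the wildcard syntax as Google documents it: `*` stands for any run of characters and a trailing `$` anchors the end of the URL. The helper names `pattern_to_regex` and `matches` are invented for this example, and other crawlers may interpret patterns differently or not at all.

```python
import re


def pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a robots.txt path pattern into a compiled regex.

    '*' becomes '.*'; a trailing '$' anchors the end; everything else
    is matched literally, starting at the beginning of the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def matches(pattern: str, path: str) -> bool:
    """True if the pattern matches from the start of the path."""
    return pattern_to_regex(pattern).match(path) is not None


for path in ("/index.html", "/docs/manual.pdf", "/folder/page.html", "/image.png"):
    hits = [p for p in ("/*.html", "/*.pdf") if matches(p, path)]
    print(f"{path:20} matched by {hits or 'no Allow pattern'}")
```

One thing the sketch makes visible: `*` also spans slashes, so `/folder/page.html` matches `/*.html` as well. With this syntax the Googlebot group above would allow HTML and PDF files inside folders too, not just the ones in the root.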
I have a situation where we have two code bases that need to stay intact:
the old site, http://example.com,
and a new site, http://www.example.com.
The old site (no WWW) supports some legacy code and has the rule:
User-agent: *
Disallow: /
But in the new version (with WWW) there is no robots.txt.
Is Google using the old (no WWW) robots.txt file as its rule for the new site? And will adding
User-agent: *
Allow: /
to the (WWW) side override this?
Changing robots.txt in the old codebase is not an option at this time.
No, www.example.com and example.com are separate hosts, and the robots.txt from one of them is not used for the other.
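As a rough illustration of the per-host lookup, the following Python sketch keeps one robots.txt body per host and picks the one that belongs to the URL being checked. The table `ROBOTS_BY_HOST` and the function `can_crawl` are made up for this example; a missing file is treated as "everything allowed", which is how crawlers conventionally handle a robots.txt that does not exist.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Hypothetical snapshot of what each host serves at /robots.txt.
ROBOTS_BY_HOST = {
    "example.com": "User-agent: *\nDisallow: /\n",  # old site blocks everything
    "www.example.com": None,                        # new site has no robots.txt
}


def can_crawl(url: str, agent: str = "*") -> bool:
    """Check a URL against the robots.txt of its own host only."""
    host = urlsplit(url).hostname
    body = ROBOTS_BY_HOST.get(host)
    if body is None:
        # No robots.txt for this host (or unknown host): nothing is restricted.
        return True
    parser = RobotFileParser()
    parser.parse(body.splitlines())
    return parser.can_fetch(agent, url)


print(can_crawl("http://example.com/page"))      # governed by the old file
print(can_crawl("http://www.example.com/page"))  # old file does not apply here
```

The first call returns `False` and the second `True`: the blocking rule on the bare domain never carries over to the www host, so no extra Allow rule is needed there.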