How to configure robots.txt file to block all but 2 directories - seo

I don't want search engines to index most of my website.
I do, however, want them to index 2 folders (and their children). This is what I set up, but I don't think it works: I see pages in Google that I wanted to hide.
Here's my robots.txt:
User-agent: *
Allow: /archive/
Allow: /lsic/
User-agent: *
Disallow: /
What's the correct way to disallow all folders except for 2?

I gave a tutorial about this on this forum, and it's also covered on Wikipedia.
Basically, the first matching robots.txt pattern wins (Google actually applies the most specific, i.e. longest, match, which gives the same result here), so put the Allow rules first:
User-agent: *
Allow: /archive/
Allow: /lsic/
Disallow: /
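If you want to sanity-check the rules before relying on them, Python's urllib.robotparser gives a rough approximation. It applies rules in file order (first match wins), which agrees with Google's longest-match logic for this particular file, so treat it as a sketch of how the record should behave (example.com stands in for your own domain):

# Quick check of the rules above with Python's urllib.robotparser.
from urllib import robotparser

rules = """\
User-agent: *
Allow: /archive/
Allow: /lsic/
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("*", "http://example.com/archive/2012/"))   # True  (crawlable)
print(rp.can_fetch("*", "http://example.com/lsic/page.html"))  # True  (crawlable)
print(rp.can_fetch("*", "http://example.com/private/"))        # False (blocked)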
But I suspect it might be too late: once a page is indexed, it's pretty hard to remove. The simplest fixes are to move the content to another folder or password-protect the folder, which you should be able to do in your host's cPanel.

Related

robots.txt allow all except few sub-directories

I want my site to be indexed by search engines except for a few sub-directories. These are my robots.txt settings:
robots.txt in the root directory
User-agent: *
Allow: /
Separate robots.txt in the sub-directory (to be excluded)
User-agent: *
Disallow: /
Is this the correct way, or will the root-directory rules override the sub-directory ones?
No, this is wrong.
You can’t have a robots.txt in a sub-directory. Your robots.txt must be placed in the document root of your host.
If you want to disallow crawling of URLs whose paths begin with /foo, use this record in your robots.txt (http://example.com/robots.txt):
User-agent: *
Disallow: /foo
This allows crawling everything (so there is no need for Allow) except URLs like
http://example.com/foo
http://example.com/foo/
http://example.com/foo.html
http://example.com/foobar
http://example.com/foo/bar
…
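If you want to double-check a prefix rule like this before deploying it, Python's urllib.robotparser gives a quick approximation, since it uses the same simple prefix matching this record relies on (example.com is just the placeholder domain from above):

# Rough check of the Disallow: /foo prefix rule with Python's urllib.robotparser.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse("""\
User-agent: *
Disallow: /foo
""".splitlines())

for url in ("http://example.com/",
            "http://example.com/bar.html",
            "http://example.com/foo",
            "http://example.com/foobar",
            "http://example.com/foo/bar"):
    print(url, rp.can_fetch("*", url))
# The first two URLs print True (crawlable); the /foo… ones print False (blocked).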
As a side note, there is also:
User-agent: *
Disallow: /
The above directive blocks everything; it's useful if you are developing a new website and don't want search engines to index it before it's finished.
You can also find more advanced info here.
You can manage this with a robots.txt file that sits in the root directory. Make sure your Allow patterns come before your Disallow patterns.

Prevent Search Spiders from accessing a Rails 3 nested resource with robots.txt

I'm trying to prevent Google, Yahoo et al. from hitting my /products/ID/purchase page and am unsure how to do it.
I currently block them from hitting the sign-in page with the following:
User-agent: *
Disallow: /sign_in
Can I do something like the following?
User-agent: *
Disallow: /products/*/purchase
Or should it be:
User-agent: *
Disallow: /purchase
I assume you want to block /products/ID/purchase but allow /products/ID.
Your last suggestion would only block URLs whose paths start with /purchase:
User-agent: *
Disallow: /purchase
So this is not what you want.
You'd need your second suggestion:
User-agent: *
Disallow: /products/*/purchase
This would block all URLs that start with /products/, followed by any character(s), followed by /purchase.
Note: it uses the wildcard *. In the original robots.txt "specification", this is not a character with special meaning. However, some search engines extended the spec and use it as a kind of wildcard, so it should work for Google and probably some other search engines, but you can't count on it working with every other crawler/bot.
So your robots.txt could look like:
User-agent: *
Disallow: /sign_in
Disallow: /products/*/purchase
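As a rough illustration of how a wildcard-aware bot like Googlebot interprets the Disallow: /products/*/purchase rule, here is a small sketch that translates the path pattern into a regular expression. This is only an illustration of the extended matching; Python's own urllib.robotparser, like crawlers that follow the original spec, does not treat * as a wildcard:

import re

def rule_to_regex(rule: str) -> "re.Pattern[str]":
    # Googlebot-style matching: '*' matches any run of characters,
    # '$' anchors the end of the path, everything else is literal.
    # Sketch only, not a complete robots.txt parser.
    parts = (".*" if ch == "*" else "$" if ch == "$" else re.escape(ch) for ch in rule)
    return re.compile("".join(parts))

blocked = rule_to_regex("/products/*/purchase")

print(bool(blocked.match("/products/42/purchase")))          # True  -> blocked
print(bool(blocked.match("/products/42/purchase/confirm")))  # True  -> blocked (prefix match)
print(bool(blocked.match("/products/42")))                   # False -> still crawlable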
Also note that some search engines (including Google) might still list a URL in their search results (without title/snippet) although it is blocked in robots.txt. This might be the case when they find a link to a blocked page on a page that is allowed to be crawled. To prevent this, you'd have to noindex the document.
According to Google, Disallow: /products/*/purchase should work. But according to robotstxt.org, it doesn't.

What does this robots.txt mean? Doesn't it allow any robots?

User-agent: *
Disallow:
Disallow: /admin
Disallow: /admin
Sitemap: http://www.myadress.com/ext/sm/Sitemap_114.xml
I found this robots.txt file in my website's root folder. I don't know whether I made it or someone else did.
I think this file doesn't allow any robots into the admin folder, which is good.
But I wonder: does it also block all robots from every file on my website?
I've replaced it with this file:
User-agent: *
Disallow: /admin
Allow: /
Sitemap: http://www.myadress.com/ext/sm/Sitemap_114.xml
PS: this website hasn't been getting indexed for a long time. Was the old file the problem?
The old file, other than the duplicate /admin entry, was correct. Allow: is not part of the original robots.txt spec (though most major crawlers support it), and an empty Disallow: opens the whole site up to robots.
http://www.free-seo-news.com/all-about-robots-txt.htm
Specifically, check item #7 under 'Things you should avoid' and #1 under 'Tips and Tricks'
Possibly. I read your old file the same way you did, i.e. that the root was being disallowed. Either way, your new robots.txt file is set up the way you seem to want it.
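If you want to see how a parser reads the corrected file, here is a quick check with Python's urllib.robotparser (the myadress.com paths are just placeholders based on the question; Googlebot's longest-match logic gives the same answers here):

# Quick check of the corrected robots.txt with Python's urllib.robotparser.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse("""\
User-agent: *
Disallow: /admin
Allow: /
""".splitlines())

print(rp.can_fetch("*", "http://www.myadress.com/"))             # True  (crawlable)
print(rp.can_fetch("*", "http://www.myadress.com/some-page"))    # True  (crawlable)
print(rp.can_fetch("*", "http://www.myadress.com/admin/login"))  # False (blocked)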

Blocking folders in between allowed content

I have a site with the following structure:
http://www.example.com/folder1/folder2/folder3
I would like to disallow indexing in folder1 and folder2, but I would like robots to index everything under folder3.
Is there a way to do this with robots.txt?
From what I've read, I think that everything inside a disallowed folder is disallowed as well.
Would the following achieve my goal?
user-agent: *
Crawl-delay: 0
Sitemap: <Sitemap url>
Allow: /folder1/folder2/folder3
Disallow: /folder1/folder2/
Disallow: /folder1/
Allow: /
Yes, it works. However, Google has a tool to test your robots.txt file: go to Google Webmaster Tools (https://www.google.com/webmasters/tools/) and open the "Site configuration -> Crawler access" section.
All you would need is:
user-agent: *
Crawl-delay: 0
Sitemap:
Allow: /folder1/folder2/folder3
Disallow: /folder1/
Allow: /
At least Googlebot will see the more specific Allow for that one directory and disallow everything else under folder1. This is backed up by a post from a Google employee.
Blank lines within a record are not allowed, so your original robots.txt should look like this:
user-agent: *
Crawl-delay: 0
Sitemap: <Sitemap url>
Allow: /folder1/folder2/folder3
Disallow: /folder1/folder2/
Disallow: /folder1/
Allow: /
Possible improvements:
Specifying Allow: / is superfluous, as it’s the default anyway.
Specifying Disallow: /folder1/folder2/ is superfluous, as Disallow: /folder1/ is sufficient.
As Sitemap is not per record, but for all bots, you could specify it as a separate block.
So your robots.txt could look like this:
User-agent: *
Crawl-delay: 0
Allow: /folder1/folder2/folder3
Disallow: /folder1/
Sitemap: http://example.com/sitemap
(Note that the Allow field is not part of the original robots.txt specification, so don’t expect all bots to understand it.)
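If you want to verify the trimmed-down rules, here is a quick sketch with Python's urllib.robotparser. Its parser applies rules in file order, while Google picks the most specific (longest) match; both give the same result here because the Allow line is listed first and is more specific (example.com and the page names are placeholders; Crawl-delay and Sitemap are omitted because they don't affect allow/disallow decisions):

# Sanity check of the shortened rules with Python's urllib.robotparser.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse("""\
User-agent: *
Allow: /folder1/folder2/folder3
Disallow: /folder1/
""".splitlines())

print(rp.can_fetch("*", "http://example.com/folder1/folder2/folder3/page"))  # True  (crawlable)
print(rp.can_fetch("*", "http://example.com/folder1/folder2/other"))         # False (blocked)
print(rp.can_fetch("*", "http://example.com/elsewhere"))                     # True  (crawlable)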

robots.txt ignore all folders but crawl all files in root

Should I then do:
User-agent: *
Disallow: /
Is it as simple as that?
Or will that stop the files in the root from being crawled as well?
Basically, that's what I'm after: crawling all the files/pages in the root, but none of the folders at all.
Or am I going to have to specify each folder explicitly, i.e.
disallow: /admin
disallow: /this
.. etc
Thanks,
Nat
Your example will block everything, including the files in the root.
There isn't a "standard" way to easily do what you want without specifying each folder explicitly.
Some crawlers, however, do support extensions that allow pattern matching. You could disallow all bots that don't support pattern matching but allow those that do.
For example
# disallow all robots
User-agent: *
Disallow: /
# let google read html and files
User-agent: Googlebot
Allow: /*.html
Allow: /*.pdf
Disallow: /
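If you do end up listing every folder explicitly, a short script can generate the Disallow lines from the top-level directories of your document root. The /var/www/html path is only an assumed example; point it at your own docroot:

# Sketch: emit one Disallow line per top-level directory of the docroot,
# so only files in the root itself stay crawlable by spec-only bots.
from pathlib import Path

DOCROOT = Path("/var/www/html")  # assumed path; adjust for your server

lines = ["User-agent: *"]
for entry in sorted(DOCROOT.iterdir()):
    if entry.is_dir():
        lines.append(f"Disallow: /{entry.name}/")

print("\n".join(lines))
# Paste the output into robots.txt and regenerate it whenever you add a folder.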