Prevent Search Spiders from accessing a Rails 3 nested resource with robots.txt - ruby-on-rails-3

I'm trying to prevent Google, Yahoo et al. from hitting my /products/ID/purchase page and am unsure how to do it.
I currently block them from hitting sign in with the following:
User-agent: *
Disallow: /sign_in
Can I do something like the following?
User-agent: *
Disallow: /products/*/purchase
Or should it be:
User-agent: *
Disallow: /purchase

I assume you want to block /products/ID/purchase but allow /products/ID.
Your last suggestion would only block URLs whose paths start with /purchase:
User-agent: *
Disallow: /purchase
So this is not what you want.
You'd need your second suggestion:
User-agent: *
Disallow: /products/*/purchase
This would block all URLs that start with /products/, followed by any character(s), followed by /purchase.
Note: it uses the wildcard *. In the original robots.txt "specification", this is not a character with special meaning. However, some search engines extended the spec and use it as a kind of wildcard, so it should work for Google and probably some other search engines, but you can't count on it working with every other crawler/bot.
So your robots.txt could look like:
User-agent: *
Disallow: /sign_in
Disallow: /products/*/purchase
Also note that some search engines (including Google) might still list a URL in their search results (without title/snippet) although it is blocked in robots.txt. This might be the case when they find a link to a blocked page on a page that is allowed to be crawled. To prevent this, you'd have to noindex the document.
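In a Rails 3 app, one way to do that is to send an X-Robots-Tag response header (or render a <meta name="robots" content="noindex"> tag in the view) for the purchase action. A minimal sketch, assuming purchase is a member action on a hypothetical ProductsController:
class ProductsController < ApplicationController
  # Ask compliant crawlers not to index the purchase page.
  before_filter :set_noindex_header, :only => :purchase

  def purchase
    # existing purchase logic ...
  end

  private

  def set_noindex_header
    response.headers["X-Robots-Tag"] = "noindex"
  end
end
Keep in mind that a crawler only sees the noindex if it is allowed to fetch the page, so for that URL the header would replace the robots.txt Disallow rather than complement it.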

According to Google, Disallow: /products/*/purchase should work.
But according to robotstxt.org (which documents the original specification, without wildcard support), it doesn't.

Related

About robots.txt

What if I leave the user-agent empty in the robots.txt file?
It is normally written like this:
User-agent: *
Disallow: /specific page
But what if it is like this:
User-agent:
Disallow: /specific page
User-agent: * specifies that the instructions apply to all robots of all search engines at once.
If you want to target a specific search engine instead, name its bot, for example:
User-agent: Googlebot
So if you leave the value empty instead of specifying an asterisk, there is a chance that search bots will not understand which robots the directives after the User-agent line are meant for, and may simply ignore the group.
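For example, a hypothetical robots.txt with one group of rules for Googlebot and a separate group for every other bot could look like this (the paths are just placeholders):
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /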

Robots.txt disallowing folder but allowing subfolders

I'm trying to set it up where www.url.com/folder is disallowed, but www.url.com/folder/1 is allowed. I have it set up as follows:
User-agent: *
Disallow: /folder
Allow: /folder/*
which works when testing with the Google robots.txt tester, but if I look at the logs, I can see Googlebot hitting all of the URLs other than /folder.
Am I missing something? Should Allow go first?
I think this one should work ($ is Google's extension that anchors the end of the URL, so the Disallow line targets only /folder/ itself while deeper URLs stay allowed):
User-agent: *
Disallow: /folder/$
Allow: /folder/*
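If www.url.com/folder (without the trailing slash) should be blocked as well, an additional end-anchored rule could be added, again relying on Google's extended syntax rather than the original specification:
Disallow: /folder$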

Trying to disallow one page in subdirectory

I am trying to disallow one page in a subdirectory.
This is the robots.txt I am currently using:
User-Agent: *
Disallow:
Disallow: /form.aspx
but form.aspx is in the process folder, and my URL looks like
www.yoursite.com/process/form.aspx
so how can I disallow form.aspx in robots.txt?
Is the robots.txt format given above correct?
Please advise.
If you want to block http://example.com/process/form.aspx and allow everything else, you can use:
# robots.txt on <http://example.com/robots.txt>
User-agent: *
Disallow: /process/form.aspx
Note that this would also block URLs like http://example.com/process/form.aspx.foo, http://example.com/process/form.aspx/bar, etc.
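If you only want to block that exact URL and not the longer variants, Google's extended syntax lets you anchor the end of the URL with $ (this is not part of the original robots.txt specification, so other crawlers may ignore it):
User-agent: *
Disallow: /process/form.aspx$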

Disallow all parameters in a specific url in robots.txt

I would like to disallow all parameters in a specific url.
If I add this rule:
Disallow: /*?*
it works, but for all URLs.
What I would like to do is:
Disallow: /my-specific-url/*?*
But according to Google Webmaster Tools, this rule doesn’t work.
Your example looks like it should be working, but you do need to include the User-agent line. The following robots.txt file:
User-agent: *
Disallow: /my-specific-url/*?*
Will block the following URLs:
http://example.com/my-specific-url/?
http://example.com/my-specific-url/?a=b
but it will not block the following:
http://example.com/my-specific-url/
http://example.com/some-other-url/?a=b
Note that the trailing * is harmless but serves no useful purpose. A cleaner way to do exactly the same thing would be:
User-agent: *
Disallow: /my-specific-url/*?
Also note that wildcards are supported by the major search engines, but they are not supported by many other crawlers.
While you can't use regular expressions, you are allowed to use wildcards:
https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt#url-matching-based-on-path-values
Have you tried something like
Disallow: /my-specific-url/*var1=*

How to configure robots.txt file to block all but 2 directories

I don't want search engines to index most of my website.
I do, however, want search engines to index 2 folders (and their children). This is what I set up, but I don't think it works; I see pages in Google that I wanted to hide.
Here's my robots.txt
User-agent: *
Allow: /archive/
Allow: /lsic/
User-agent: *
Disallow: /
What's the correct way to disallow all folders except for 2?
I gave a tutorial about this on this forum, and Wikipedia covers it as well.
Under the original first-match rule, the first matching robots.txt pattern wins, so the Allow lines need to come before the Disallow; Google instead applies the most specific (longest) matching rule, so this ordering works there too:
User-agent: *
Allow: /archive/
Allow: /lsic/
Disallow: /
But I suspect it might be too late: once a page is indexed, it's pretty hard to remove it. The only way is to move it to another folder or password-protect the folder, which you should be able to do in your host's cPanel.