Usage of 'Allow' in robots.txt

Recently I saw a site's robots.txt as follows:
User-agent: *
Allow: /login
Allow: /register
I could find only Allow entries and no Disallow entries.
From this, I understood that robots.txt is essentially a blacklist file, using Disallow to block pages from being crawled. So Allow is only used to permit a sub-part of a domain that is already blocked with Disallow, similar to this:
Allow: /crawlthis
Disallow: /
But, that robots.txt has no Disallow entries. So, does this robots.txt let Google crawl all the pages? Or, does it allow only the specified pages tagged with Allow?

You are right that this robots.txt file allows Google to crawl all the pages on the website. A thorough guide can be found here: http://www.robotstxt.org/robotstxt.html.
If you want Googlebot to be allowed to crawl only the specified pages, then the correct format would be:
User-agent: *
Disallow: /
Allow: /login
Allow: /register
(I would normally disallow those specific pages though as they don't provide much value to searchers.)
It's important to note that the Allow line only works with some robots (including Googlebot).
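As a quick sanity check, this whitelist behavior can be verified with Python's standard-library urllib.robotparser, which supports Allow lines. One caveat (and an assumption of this sketch): urllib.robotparser applies rules in file order (first match wins) rather than using Googlebot's most-specific-rule logic, so the Allow lines are placed before the blanket Disallow here:

```python
import urllib.robotparser

# Hypothetical whitelist record: only /login and /register are crawlable.
# urllib.robotparser evaluates rules in file order (first match wins),
# so the Allow lines must precede the blanket Disallow for this check.
robots_txt = """\
User-agent: *
Allow: /login
Allow: /register
Disallow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("Googlebot", "/login"))     # True
print(rp.can_fetch("Googlebot", "/register"))  # True
print(rp.can_fetch("Googlebot", "/search"))    # False
```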

There is no point in having a robots.txt record that has Allow lines but no Disallow lines. Everything is allowed to be crawled by default anyway.
According to the original robots.txt specification (which doesn’t define Allow), it’s even invalid, as at least one Disallow line is required (bold emphasis mine):
The record starts with one or more User-agent lines, followed by one or more Disallow lines […]
At least one Disallow field needs to be present in a record.
In other words, a record like
User-agent: *
Allow: /login
Allow: /register
is equivalent to the record
User-agent: *
Disallow:
i.e., everything is allowed to be crawled, including (but not limited to) URLs with paths that start with /login and /register.
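If it helps, the equivalence of the two records can be checked with Python's standard-library urllib.robotparser (a rough stand-in for a real crawler's parser, not Googlebot itself):

```python
import urllib.robotparser

# The two records discussed above: Allow-only, and an empty Disallow.
allow_only = "User-agent: *\nAllow: /login\nAllow: /register\n"
empty_disallow = "User-agent: *\nDisallow:\n"

for text in (allow_only, empty_disallow):
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(text.splitlines())
    # No Disallow rule ever matches a path, so everything is crawlable.
    print(rp.can_fetch("*", "/login"), rp.can_fetch("*", "/private"))
```

Both records print "True True": every path, including /login and /register, is crawlable either way.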

Related

Allow only Googlebot to index everything

I want to disallow all bots from crawling and indexing a site, except Googlebot. I want to allow Google to index the index (/) URL, but nothing else. Preferably in robots.txt.
Do you have any ideas on how to achieve this? Thanks!
You'll need to use a robots.txt file.
Just create one in the public folder of the website and add:
User-agent: Googlebot
Allow: /
User-agent: *
Disallow: /
This will allow Google's crawler to crawl all pages and disallow all other crawlers for the whole website.
For details refer: https://support.google.com/webmasters/answer/6062596?hl=en
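If you want to verify the record before deploying it, Python's standard-library urllib.robotparser gives a rough approximation of how crawlers read it (it is not Googlebot's exact parser):

```python
import urllib.robotparser

# The record above: Googlebot gets everything, all other bots nothing.
robots_txt = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("Googlebot", "/"))          # True
print(rp.can_fetch("Googlebot", "/any/page"))  # True
print(rp.can_fetch("SomeOtherBot", "/"))       # False
```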

Robots.txt and sub-folders

Several domains are configured as add-ons to my primary hosting account (shared hosting).
The directory structure looks like this (primary domain is example.com):
public_html (example.com)
_sub
ex1 --> displayed as example-realtor.com
ex2 --> displayed as example-author.com
ex3 --> displayed as example-blogger.com
(the SO requirement to use example as the domain makes explanation more difficult - for example, sub ex1 might point to plutorealty and ex2 might point to amazon, or some other business sub-hosting with me. The point is that each ex# is a different company's website, so mentally substitute something normal and different for each "example")
Because these domains (ex1, ex2, etc) are add-on domains, they are accessible in two ways (ideally, the 2nd method is known only to me):
(1) http://example1.com
(2) http://example.com/_sub/ex1/index.php
Again, example1.com is a totally unrelated website/domain name from example.com
QUESTIONS:
(a) How will the sites be indexed by search engines? Will both (1) and (2) show up in search results? It is undesirable for method (2) to show up in Google.
(b) Should I put a robots.txt in public_html that disallows each folder in the _sub folder? Eg:
User-agent: *
Disallow: /_sub/
Disallow: /_sub/ex1/
Disallow: /_sub/ex2/
Disallow: /_sub/ex3/
(c) Is there a more common way to configure add-on domains?
This robots.txt would be sufficient; you don’t have to list anything that comes after /_sub/:
User-agent: *
Disallow: /_sub/
This would disallow bots (that honor robots.txt) from crawling any URL whose path starts with /_sub/. But that doesn’t necessarily stop these bots from indexing the URLs themselves (e.g., listing them in their search results).
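To see that the single prefix rule already covers every deeper path, you can check it with Python's standard-library urllib.robotparser (an approximation of a compliant crawler's behavior):

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /_sub/"])

# One prefix rule covers all the add-on folders below /_sub/.
print(rp.can_fetch("*", "/_sub/ex1/index.php"))  # False
print(rp.can_fetch("*", "/_sub/ex2/"))           # False
print(rp.can_fetch("*", "/index.php"))           # True
```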
Ideally you would redirect from http://example.com/_sub/ex1/ to http://example1.com/ with HTTP status code 301. It depends on your server how that works (for Apache, you could use a .htaccess). Then everyone ends up on the canonical URL for your site.
Do not serve multiple sites as subdirectories of the main domain. Their content can affect Google's ranking of the main domain as well, especially if any of the sub-sites use black-hat techniques or generate spam.
My suggestion: if the sites under those subdirectories matter to you, mark them all noindex.
robots.txt:
User-agent: *
Disallow: /_sub/
Disallow: /_sub/ex1/
Disallow: /_sub/ex2/
Disallow: /_sub/ex3/

Prevent Google Indexing my Pagination System

I am wondering if there is a way to include in my robots.txt a line which stops Google from indexing any URL in my website, that contains specific text.
I have different sections, all of which contain different pages. I don't want Google to index page2, page3, etc, just the main page.
The URL structure I have is as follows:
http://www.domain.com/section
http://www.domain.com/section/page/2
http://www.domain.com/section/article_name
Is there any way to put in my robots.txt file a way to NOT index any URL containing:
/page/
Thanks in advance everyone!
User-agent: Googlebot
Disallow: /section/
or depending on your requirement:
User-agent: Googlebot
Disallow: /section/page/
(Note that Disallow takes a URL path, not a full URL, and the trailing wildcard is unnecessary.)
Alternatively, you may use Google Webmaster Tools rather than the robots.txt file:
Go to GWT / Crawl / URL Parameters
Add Parameter: page
Set to: No URLs
You can directly use:
Disallow: /*/page/
(A plain Disallow: /page would not match these URLs, because their paths start with /section, not /page.)
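For what it's worth, Python's standard-library urllib.robotparser does not understand wildcards, so here is a simplified, assumption-level model of Googlebot-style rule matching (rules anchor at the start of the path, * matches any run of characters, a trailing $ anchors the end). It shows why a plain /page rule would not match these URLs:

```python
import re

def rule_matches(pattern: str, path: str) -> bool:
    """Simplified Googlebot-style matching: the rule is anchored at the
    start of the URL path, '*' matches any sequence of characters, and a
    trailing '$' anchors the match to the end of the path."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    regex = "^" + ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    return re.search(regex, path) is not None

# "Disallow: /page" only matches paths that *start* with /page:
print(rule_matches("/page", "/section/page/2"))    # False
print(rule_matches("/page", "/page/2"))            # True

# A wildcard rule matches /page/ after any first path segment:
print(rule_matches("/*/page/", "/section/page/2")) # True
```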

robots.txt - exclude any URL that contains "/node/"

How do I tell crawlers / bots not to index any URL that has /node/ pattern?
The following has been in place since day one, but I noticed that Google has still indexed a lot of URLs that have
/node/ in them, e.g. www.mywebsite.com/node/123/32
Disallow: /node/
Is there any way to state: do not index any URL that has /node/ in it?
Should I write something like following:
Disallow: /node/*
Update:
The real problem is despite:
Disallow: /node/
in robots.txt, Google has indexed pages under this URL e.g. www.mywebsite.com/node/123/32
/node/ is not a physical directory; this is how Drupal 6 presents its content. I guess that is my problem: node is not a directory, merely part of the URLs generated by Drupal for the content. How do I handle this? Will this work?
Disallow: /*node
Thanks
Disallow: /node/ will disallow any URL whose path starts with /node/ (after the host). The asterisk is not required.
So it will block www.mysite.com/node/bar.html, but will not block www.mysite.com/foo/node/bar.html.
If you want to block anything that contains /node/, you have to write Disallow: /*node/ (the wildcard lets the rule match /node/ anywhere in the path).
Note also that Googlebot can cache robots.txt for up to 7 days. So if you make a change to your robots.txt today, it might be a week before Googlebot updates its copy of your robots.txt. During that time, it will be using its cached copy.
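The prefix behavior described above can be reproduced with Python's standard-library urllib.robotparser, which uses plain starts-with matching like the original spec (it does not support wildcards):

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /node/"])

# Paths starting with /node/ are blocked; /node/ elsewhere is not.
print(rp.can_fetch("*", "/node/123/32"))   # False
print(rp.can_fetch("*", "/foo/node/bar"))  # True
```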
Disallow: /node/* is exactly what you want to do. Major search engines support wildcards in their robots.txt notation, and the * character means "any sequence of characters". See Google's notes on robots.txt for more.
update
An alternative way to make sure search engines stay out of a directory, and all directories below it, is to block them with the X-Robots-Tag HTTP header. This can be done by placing the following in an .htaccess file in your node directory:
Header set X-Robots-Tag "noindex"
Your original Disallow was fine. Jim Mischel's comment seemed spot on and would cause me to wonder if it was just taking time for Googlebot to fetch the updated robots.txt and then unindex relevant pages.
A couple additional thoughts:
Your page URLs may appear in Google search results even if you've blocked them in robots.txt. See: http://support.google.com/webmasters/bin/answer.py?hl=en&answer=156449 ("...While Google won't crawl or index the content of pages blocked by robots.txt, we may still index the URLs if we find them on other pages on the web."). To many people, this is counter-intuitive.
Second, I'd highly recommend verifying ownership of your site in Google Webmaster Tools (https://www.google.com/webmasters/tools/home?hl=en), then using tools such as Health->"Fetch as Google" to see real time diagnostics related to retrieving your page. (Does that result indicate that robots.txt is preventing crawling?)
I haven't used it, but Bing has a similar tool: http://www.bing.com/webmaster/help/fetch-as-bingbot-fe18fa0d . It seems well worthwhile to use diagnostic tools provided by Google, Bing, etc. to perform real-time diagnostics on the site.
This question is a bit old, so I hope you've solved the original problem.

How to get certain pages to not be indexed by search engines?

I did:
<meta name="robots" content="none"/>
Is that the best way to go about it, or is there a better way?
You can create a file called robots.txt in the root of your site. The format is this:
User-agent: (user agent string to match)
Disallow: (URL here)
Disallow: (other URL here)
...
Example:
User-agent: *
Disallow: /
Will make (the nice) robots not index anything on your site. Of course, some robots will completely ignore robots.txt, but for the ones that honor it, it's your best bet.
If you'd like more information on the robots.txt file, please see http://www.robotstxt.org/
You could use a robots.txt file to tell search engines which pages not to crawl.
http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=156449
Or use the meta noindex tag in each page.