How to get certain pages to not be indexed by search engines? - indexing

I did:
<meta name="robots" content="none"/>
Is that the best way to go about it, or is there a better way?

You can create a file called robots.txt in the root of your site. The format is this:
User-agent: (user agent string to match)
Disallow: (URL here)
Disallow: (other URL here)
...
Example:
User-agent: *
Disallow: /
Will make (the nice) robots not index anything on your site. Of course, some robots will completely ignore robots.txt, but for the ones that honor it, it's your best bet.
If you'd like more information on the robots.txt file, please see http://www.robotstxt.org/

You could use a robots.txt file to direct search engines which pages not to index.
http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=156449
Or use the meta noindex tag in each page.

Related

Allow only Googlebot to index everything

I want to disallow all bots to crawl and index a site. Except Googlebot. I want to allow google to index the index (/) URL, but nothing else. Preferably in robots.txt.
Do you have any ideas on how to achieve this? Thanks!
You'll need to use a robot.txt.
Just create one in your public folder of the website and add;
User-agent: Googlebot
Allow: /
User-agent: *
Disallow: /
This will allow google crawler to index all pages and disallow all other crawlers for the whole website.
For details refer: https://support.google.com/webmasters/answer/6062596?hl=en
I know this could be considered off topic, but still thought I'd answer if this helps.
As John Conde suggested, try webmasters....

robots.txt for disallowing Google to not to follow specific URLs

I have included robots.txt in the root directory of my application in order to tell Google bots that do not follow this http://www.test.com/example.aspx?id=x&date=10/12/2014 URL or the URL with the same extension but different query string values. For that I have used following piece of code:
User-agent: *
Disallow:
Disallow: /example.aspx/
But I found in the Webmaster Tools that Google is still following this page and has chached a number of URLs with the specified extension, is it something that query strings are creating problem because as far as I know that Google do not bother about query string, but just in case. Am I using it correctly or something else also needs to be done in order to achieve the task.
Your instruction is wrong :
Disallow: /example.aspx/
This is blocking all URLs in the direcory /example.aspx/
If you want to block all URLs of the file /example.aspx, use this instruction:
Disallow: /example.aspx
You can test it with Google Webmaster Tools.

Prevent Google Indexing my Pagination System

I am wondering if there is a way to include in my robots.txt a line which stops Google from indexing any URL in my website, that contains specific text.
I have different sections, all of which contain different pages. I don't want Google to index page2, page3, etc, just the main page.
The URL structure I have is as follows:
http://www.domain.com/section
http://www.domain.com/section/page/2
http://www.domain.com/section/article_name
Is there any way to put in my robots.txt file a way to NOT index any URL containing:
/page/
Thanks in advance everyone!
User-agent: Googlebot
Disallow: http://www.domain.com/section/*
or depending on your requirement:
User-agent: Googlebot
Disallow: http://www.domain.com/section/page/*
Also you may use the Google Webmaster tools rather than the robots.txt file
Goto GWT / Crawl / URL Parameters
Add Parameter: page
Set to: No URLs
You can directly use
Disallow: /page

robots.txt - exclude any URL that contains "/node/"

How do I tell crawlers / bots not to index any URL that has /node/ pattern?
Following is since day one but I noticed that Google has still indexed a lot of URLs that has
/node/ in it, e.g. www.mywebsite.com/node/123/32
Disallow: /node/
Is there anything that states that do not index any URL that has /node/
Should I write something like following:
Disallow: /node/*
Update:
The real problem is despite:
Disallow: /node/
in robots.txt, Google has indexed pages under this URL e.g. www.mywebsite.com/node/123/32
/node/ is not a physical directory, this is how drupal 6 shows it's content, I guess this is my problem that node is not a directory, merely part of URLs being generated by drupal for the content, how do I handle this? will this work?
Disallow: /*node
Thanks
Disallow: /node/ will disallow any url that starts with /node/ (after the host). The asterisk is not required.
So it will block www.mysite.com/node/bar.html, but will not block www.mysite.com/foo/node/bar.html.
If you want to block anything that contains /node/, you have to write Disallow: */node/
Note also that Googlebot can cache robots.txt for up to 7 days. So if you make a change to your robots.txt today, it might be a week before Googlebot updates its copy of your robots.txt. During that time, it will be using its cached copy.
Disallow: /node/* is exactly what you want to do. Search engines support wildcards in their robots.txt notation and the * characters means "any characters". See Google's notes on robots.txt for more.
update
An alternative way to make sure search engines stay out of a directory, and all directories below it, is to block them with the robots HTTP header. This can be done by placing the following in an htaccess file in your node directory:
Header set x-robots-tag: noindex
Your original Disallow was fine. Jim Mischel's comment seemed spot on and would cause me to wonder if it was just taking time for Googlebot to fetch the updated robots.txt and then unindex relevant pages.
A couple additional thoughts:
Your page URLs may appear in Google search results even if you've included it in robots.txt. See: http://support.google.com/webmasters/bin/answer.py?hl=en&answer=156449 ("...While Google won't crawl or index the content of pages blocked by robots.txt, we may still index the URLs if we find them on other pages on the web."). To many people, this is counter-intuitive.
Second, I'd highly recommend verifying ownership of your site in Google Webmaster Tools (https://www.google.com/webmasters/tools/home?hl=en), then using tools such as Health->"Fetch as Google" to see real time diagnostics related to retrieving your page. (Does that result indicate that robots.txt is preventing crawling?)
I haven't used it, but Bing has a similar tool: http://www.bing.com/webmaster/help/fetch-as-bingbot-fe18fa0d . It seems well worthwhile to use diagnostic tools provided by Google, Bing, etc. to perform real-time diagnostics on the site.
This question is a bit old, so I hope you've solved the original problem.

Dynamic robots.txt

Let's say I have a web site for hosting community generated content that targets a very specific set of users. Now, let's say in the interest of fostering a better community I have an off-topic area where community members can post or talk about anything they want, regardless of the site's main theme.
Now, I want most of the content to get indexed by Google. The notable exception is the off-topic content. Each thread has it's own page, but all the threads are listed in the same folder so I can't just exclude search engines from a folder somewhere. It has to be per-page. A traditional robots.txt file would get huge, so how else could I accomplish this?
This will work for all well-behaving search engines, just add it to the <head>:
<meta name="robots" content="noindex, nofollow" />
If using Apache I'd use mod-rewrite to alias robots.txt to a script that could dynamically generate the necessary content.
Edit: If using IIS you could use ISAPIrewrite to do the same.
You can implement it by substituting robots.txt with dynamic script generating the output.
With Apache You could make simple .htaccess rule to acheive that.
RewriteRule ^robots\.txt$ /robots.php [NC,L]
Simlarly to #James Marshall's suggestion - in ASP.NET you could use an HttpHandler to redirect calls to robots.txt to a script which generated the content.
Just for that thread , make sure your head contains a noindex meta tag. Thats one more way to tell search engines not to crawl your page other than blocking in robots.txt
Just keep in mind that a robots.txt disallow will NOT prevent Google from indexing pages that have links from external sites, all it does is prevent crawling internally. See http://www.webmasterworld.com/google/4490125.htm or http://www.stonetemple.com/articles/interview-matt-cutts.shtml.
You can disallow search engines to read or index your content by restricting robot meta tags. In this way, spider will consider your instructions and will index only such pages that you want.
block dynamic webpage by robots.txt use this code
User-agent: *
Disallow: /setnewsprefs?
Disallow: /index.html?
Disallow: /?
Allow: /?hl=
Disallow: /?hl=*&