Block specific file types from Google search

I want to block XML files from Googlebot, except sitemap.xml. I am using Lazyest Gallery for my WordPress image gallery. Every gallery folder has an XML file containing the details of the images. The problem is that Google now indexes those XML files instead of the galleries. My site search is also showing XML files instead of albums.
Will
Disallow: /*/*.xml$
work?
I have excluded feeds by adding
Disallow: /*/rss/$
to my robots.txt

To block all files of a certain type the simplest way is:
Disallow: /*.xml$
Disallow: /*.XML$
Path matching in robots.txt is case-sensitive, hence the two entries (you can leave one out if you know the files are all one case). Now, to make sure we aren't blocking sitemap.xml, we need to allow it first:
Allow: /sitemap.xml
Disallow: /*.xml$
Disallow: /*.XML$
There is also a sitemap directive in robots.txt to reference the location of the sitemap, so we can add that too:
Allow: /sitemap.xml
Disallow: /*.xml$
Disallow: /*.XML$
Sitemap: http://example.com/sitemap.xml
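If the real goal is to get those gallery XML files out of Google's index (not just stop them from being crawled), an X-Robots-Tag response header is another option. A minimal sketch, assuming an Apache server with mod_headers enabled and that you can add directives to the site's .htaccess; note that Googlebot only sees this header if the XML files are not blocked in robots.txt, so it is an alternative to the Disallow rules rather than an addition to them:
# send a noindex header with every .xml / .XML response
<FilesMatch "\.(xml|XML)$">
  Header set X-Robots-Tag "noindex"
</FilesMatch>
# but leave sitemap.xml untouched
<Files "sitemap.xml">
  Header unset X-Robots-Tag
</Files>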

Related

Usage of 'Allow' in robots.txt

Recently I saw a site's robots.txt as follows:
User-agent: *
Allow: /login
Allow: /register
I could find only Allow entries and no Disallow entries.
From this, I understood that robots.txt is essentially a blacklist: you Disallow pages that should not be crawled, and Allow is used only to open up a sub-part of the site that is already blocked with Disallow. Something like this:
Allow: /crawlthis
Disallow: /
But that robots.txt has no Disallow entries. So does it let Google crawl all the pages, or does it allow only the pages listed with Allow?
You are right that this robots.txt file allows Google to crawl all the pages on the website. A thorough guide can be found here: http://www.robotstxt.org/robotstxt.html.
If you want Googlebot to be allowed to crawl only the specified pages, then the correct format would be:
User-agent: *
Disallow: /
Allow: /login
Allow: /register
(I would normally disallow those specific pages though as they don't provide much value to searchers.)
It's important to note that the Allow line only works with some robots (including Googlebot).
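As a rough illustration of how Googlebot resolves that whitelist record (when an Allow and a Disallow rule both match a URL, the rule with the longer path wins), here is how a few made-up sample paths would be treated:
/login/reset   -> allowed  (Allow: /login is longer than Disallow: /)
/register      -> allowed  (Allow: /register matches)
/account       -> blocked  (only Disallow: / matches)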
There is no point in having a robots.txt record that has Allow lines but no Disallow lines. Everything is allowed to be crawled by default anyway.
According to the original robots.txt specification (which doesn't define Allow), it's even invalid, as at least one Disallow line is required:
The record starts with one or more User-agent lines, followed by one or more Disallow lines […]
At least one Disallow field needs to be present in a record.
In other words, a record like
User-agent: *
Allow: /login
Allow: /register
is equivalent to the record
User-agent: *
Disallow:
i.e., everything is allowed to be crawled, including (but not limited to) URLs with paths that start with /login and /register.

Robots.txt and sub-folders

Several domains are configured as add-ons to my primary hosting account (shared hosting).
The directory structure looks like this (primary domain is example.com):
public_html (example.com)
_sub
ex1 --> displayed as example-realtor.com
ex2 --> displayed as example-author.com
ex3 --> displayed as example-blogger.com
(The SO requirement to use example as the domain makes this harder to explain. For example, ex1 might point to plutorealty and ex2 might point to amazon, or some other business sub-hosting with me. The point is that each ex# is a different company's website, so mentally substitute something normal and distinct for each "example".)
Because these domains (ex1, ex2, etc) are add-on domains, they are accessible in two ways (ideally, the 2nd method is known only to me):
(1) http://example1.com
(2) http://example.com/_sub/ex1/index.php
Again, example1.com is a totally unrelated website/domain name from example.com
QUESTIONS:
(a) How will the site be indexed on search engines? Will both (1) and (2) show up in search results? (It is undesirable for method 2 to show up in Google.)
(b) Should I put a robots.txt in public_html that disallows each folder in the _sub folder? E.g.:
User-agent: *
Disallow: /_sub/
Disallow: /_sub/ex1/
Disallow: /_sub/ex2/
Disallow: /_sub/ex3/
(c) Is there a more common way to configure add-on domains?
This robots.txt would be sufficient; you don't have to list anything that comes after /_sub/:
User-agent: *
Disallow: /_sub/
This would disallow bots (those that honor the robots.txt) from crawling any URL whose path starts with /_sub/. But it doesn't necessarily stop these bots from indexing the URLs themselves (e.g., listing them in their search results).
Ideally you would redirect from http://example.com/_sub/ex1/ to http://example1.com/ with an HTTP 301 status code. How that works depends on your server (for Apache, you could use .htaccess). Then everyone ends up on the canonical URL for your site.
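For Apache, a minimal sketch of such a redirect in the .htaccess of public_html, assuming mod_rewrite is available (the domain and folder names are just the placeholders from the question):
RewriteEngine On
# only redirect requests that arrive via the primary domain,
# not requests already coming in through the add-on domain itself
RewriteCond %{HTTP_HOST} ^(www\.)?example\.com$ [NC]
RewriteRule ^_sub/ex1/(.*)$ http://example1.com/$1 [R=301,L]
The same RewriteCond/RewriteRule pair would be repeated for ex2 and ex3 with their respective domains.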
Be careful about exposing multiple sites to Google like this: if the sub-directory sites use black-hat techniques or generate spam, the main domain's ranking can suffer as well.
My suggestion: if the sub-directory sites are important to you, set them to noindex. For robots.txt:
User-agent: *
Disallow: /_sub/
Disallow: /_sub/ex1/
Disallow: /_sub/ex2/
Disallow: /_sub/ex3/

robots.txt for disallowing Google from crawling specific URLs

I have included a robots.txt in the root directory of my application in order to tell Google's bots not to crawl the URL http://www.test.com/example.aspx?id=x&date=10/12/2014, or URLs with the same path but different query string values. For that I have used the following:
User-agent: *
Disallow:
Disallow: /example.aspx/
But I found in Webmaster Tools that Google is still crawling this page and has cached a number of URLs with the specified extension. Are the query strings causing the problem? As far as I know Google does not bother about query strings, but just in case: am I using this correctly, or does something else also need to be done in order to achieve this?
Your instruction is wrong:
Disallow: /example.aspx/
This blocks all URLs in the directory /example.aspx/.
If you want to block all URLs of the file /example.aspx, use this instruction:
Disallow: /example.aspx
You can test it with Google Webmaster Tools.
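As for the query strings: Disallow matching is by prefix, so Disallow: /example.aspx also covers /example.aspx?id=x&date=10/12/2014. If, hypothetically, you only wanted to block the query-string variants and leave /example.aspx itself crawlable, you could include the ? in the prefix (it is matched literally):
User-agent: *
Disallow: /example.aspx?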

Prevent Google Indexing my Pagination System

I am wondering if there is a way to include in my robots.txt a line which stops Google from indexing any URL on my website that contains specific text.
I have different sections, all of which contain different pages. I don't want Google to index page2, page3, etc, just the main page.
The URL structure I have is as follows:
http://www.domain.com/section
http://www.domain.com/section/page/2
http://www.domain.com/section/article_name
Is there any way to put in my robots.txt file a way to NOT index any URL containing:
/page/
Thanks in advance everyone!
User-agent: Googlebot
Disallow: /section/
or, depending on your requirement:
User-agent: Googlebot
Disallow: /section/page/
Alternatively, you can use Google Webmaster Tools rather than the robots.txt file (note that the URL Parameters tool only applies to query-string parameters such as ?page=2, not to path segments like /page/2):
Go to GWT / Crawl / URL Parameters
Add Parameter: page
Set to: No URLs
You can also use a wildcard pattern (Googlebot supports * in robots.txt), which covers the pagination URLs of every section:
Disallow: /*/page/
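If the aim is to keep the paginated URLs out of the index (not merely uncrawled), a robots meta tag on the paginated templates is another option; this sketch assumes you can edit the template that renders the /page/N URLs, and that those URLs are not blocked in robots.txt (otherwise Googlebot never sees the tag):
<meta name="robots" content="noindex, follow">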

How to get certain pages to not be indexed by search engines?

I did:
<meta name="robots" content="none"/>
Is that the best way to go about it, or is there a better way?
You can create a file called robots.txt in the root of your site. The format is this:
User-agent: (user agent string to match)
Disallow: (URL here)
Disallow: (other URL here)
...
Example:
User-agent: *
Disallow: /
This will make (the nice) robots not crawl anything on your site. Of course, some robots will completely ignore robots.txt, but for the ones that honor it, it's your best bet.
If you'd like more information on the robots.txt file, please see http://www.robotstxt.org/
You could use a robots.txt file to tell search engines which pages not to crawl:
http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=156449
Or use the meta noindex tag in each page.
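That tag, placed in the <head> of each page you want kept out of the index, looks like this (noindex on its own still lets crawlers follow the links on the page):
<meta name="robots" content="noindex">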