What is the meaning of /*+* in robots.txt file? - robot

I have a question regarding robots.txt file.
Disallow: Blog/*+*
What does that mean?

In theory it would stop a robot that chooses to respect it from accessing any URL whose path starts with Blog/ and contains a + somewhere after that, provided the robot treats * as a wildcard (under the original specification the value is matched literally). However, a bot doesn't have to respect it, and since the value doesn't start with a leading slash, there is no telling how different robots will deal with it.
From http://www.robotstxt.org/orig.html:
Disallow
The value of this field specifies a partial URL that is not to be visited. This can be a full path, or a partial path; any URL that starts with this value will not be retrieved. For example, Disallow: /help disallows both /help.html and /help/index.html, whereas Disallow: /help/ would disallow /help/index.html but allow /help.html.
Any empty value, indicates that all URLs can be retrieved. At least one Disallow field needs to be present in a record.
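The prefix rule quoted above can be illustrated with a few lines of Python. This is only a sketch of the matching described in the quote, not a full robots.txt parser; the function name is_disallowed is made up for this example.

```python
def is_disallowed(path: str, disallow_values: list[str]) -> bool:
    """Original-spec matching: a path is blocked if it starts with any
    non-empty Disallow value. An empty value blocks nothing."""
    return any(bool(value) and path.startswith(value) for value in disallow_values)


print(is_disallowed("/help.html", ["/help"]))         # True
print(is_disallowed("/help/index.html", ["/help"]))   # True
print(is_disallowed("/help.html", ["/help/"]))        # False
print(is_disallowed("/help/index.html", ["/help/"]))  # True
print(is_disallowed("/anything", [""]))               # False
```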

Related

Allow certain parameter in robots.txt

I have this in my robots.txt and that needs to stay there:
Disallow: /*?
However I also need Google to index pages that have ?amp at the end of the url. Like this:
www.domain.com/product-name?amp=1
Is there a way to allow those in robots.txt, but also keep the Disallow mentioned earlier?
To quote Google's documentation:
At a group-member level, in particular for allow and disallow directives, the most specific rule based on the length of the [path] entry trumps the less specific (shorter) rule. In case of conflicting rules, including those with wildcards, the least restrictive rule is used.
This means that if you allow ?amp but keep the broader Disallow, the more specific (longer) rule should win: amp pages are allowed, while every other URL matched by the Disallow stays blocked.
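To make the precedence concrete, here is a rough Python sketch of the behaviour described in the quote: wildcards are expanded, the longest matching pattern wins, and ties go to Allow. The rule Allow: /*?amp is an assumption about what one might add; it is not taken from the question, and it would also match a parameter such as ?amplitude. The helper names are hypothetical, and this is not Google's actual implementation.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """'*' matches any run of characters; a trailing '$' anchors the end."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """rules are (directive, pattern) pairs. The longest matching pattern
    wins; on a tie, 'allow' (the least restrictive rule) wins."""
    best_len, allowed = -1, True  # no matching rule: crawling is allowed
    for directive, pattern in rules:
        if not pattern or not pattern_to_regex(pattern).match(path):
            continue  # empty values and non-matching patterns are skipped
        is_allow = directive.lower() == "allow"
        if len(pattern) > best_len or (len(pattern) == best_len and is_allow):
            best_len, allowed = len(pattern), is_allow
    return allowed


# Assumed rule set: the existing Disallow plus a hypothetical Allow line.
rules = [("Disallow", "/*?"), ("Allow", "/*?amp")]
print(is_allowed("/product-name?amp=1", rules))   # True
print(is_allowed("/product-name?page=2", rules))  # False
print(is_allowed("/product-name", rules))         # True
```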

Robots.txt disallow by regex

On my website I have a page for the cart, http://www.example.com/cart, and another for the cartoons, http://www.example.com/cartoons. What should I write in my robots.txt file to block only the cart page?
The cart page does not accept a trailing slash on the URL, so if I do:
Disallow: /cart, it will block /cartoon too.
I don't know whether something like /cart$ is possible and would be parsed correctly by spider bots. I don't want to force Allow: /cartoon, because there may be other pages with the same prefix.
In the original robots.txt specification, this is not possible. It neither supports Allow nor any characters with special meaning inside a Disallow value.
But some consumers support additional things. For example, Google gives a special meaning to the $ sign, where it represents the end of the URL path:
Disallow: /cart$
For Google, this will block /cart, but not /cartoon.
Consumers that don’t give this special meaning will interpret $ literally, so they will block /cart$, but not /cart or /cartoon.
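The two readings of the same line can be compared side by side. The following Python sketch covers only the $ character and ignores * wildcards; the function name blocks and the dollar_is_anchor flag are invented for this illustration.

```python
def blocks(rule: str, path: str, dollar_is_anchor: bool) -> bool:
    """Return True if a single Disallow value blocks the path under the
    chosen reading of a trailing '$'."""
    if dollar_is_anchor and rule.endswith("$"):
        # Google-style reading: '$' marks the end of the URL path.
        return path == rule[:-1]
    # Original-spec reading: plain prefix match, '$' is an ordinary character.
    return path.startswith(rule)


for path in ["/cart", "/cartoon", "/cart$"]:
    print(path, blocks("/cart$", path, True), blocks("/cart$", path, False))
```

Under the anchored reading only /cart is blocked; under the literal reading only /cart$ is.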
So if you use this, you should name the bots that support it in User-agent instead of applying the rule to all bots.
Alternative
Maybe you are fine with crawling but just want to prevent indexing? In that case you could use meta-robots (with a noindex value) instead of robots.txt. Supporting bots will still crawl the /cart page (and follow links, unless you also use nofollow), but they won’t index it.
<!-- in the <head> of the /cart page -->
<meta name="robots" content="noindex" />
You could explicitly allow and disallow both paths. For consumers that support Allow (such as Google), the longer, more specific path takes precedence:
disallow: /cart
allow: /cartoon
More info is available at: https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt

Robots.txt Specific Exclusion

Currently my robots.txt is the following
#Sitemaps
Sitemap: http://www.baopals.com.com/sitemap.xml
#Disallow select URLs
User-agent: *
Disallow: /admin/
Disallow: /products/
My products have a lot of duplicate content, as I pull data from taobao.com and automatically translate it, resulting in a lot of duplicate and low-quality names, which is why I just disallow the whole thing. However, I manually change the titles on certain products, re-save them to the database, and showcase them on the homepage with proper translations. They still get saved back to /products/ and are lost forever when I remove them from the homepage.
I'm wondering whether it would be possible to have the products that I save to the homepage with updated translations still be indexed by Google, or am I forced to change the directory of the manually updated products?
Some bots (including the Googlebot) support the Allow field. It lets you specify paths that may be crawled anyway.
So you would have to add an Allow line for each product that you want to get crawled.
User-agent: *
Disallow: /admin/
Disallow: /products/
Allow: /products/foo-bar-1
Allow: /products/foo-foo-2
Allow: /products/bar-foo
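Because the set of curated products changes over time, the file could be generated rather than edited by hand. A minimal Python sketch, assuming the curated product slugs are available as a list (the function name build_robots_txt and the slug list are hypothetical):

```python
def build_robots_txt(curated_slugs: list[str]) -> str:
    """Render the record shown above, with one Allow line per curated slug."""
    lines = ["User-agent: *", "Disallow: /admin/", "Disallow: /products/"]
    lines += [f"Allow: /products/{slug}" for slug in curated_slugs]
    return "\n".join(lines) + "\n"


print(build_robots_txt(["foo-bar-1", "foo-foo-2", "bar-foo"]))
```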
But instead of disallowing crawling of your product pages, you might want to disallow indexing. Then a bot is still allowed to visit your pages and follow links, but it won’t add the pages to its search index.
Add <meta name="robots" content="noindex" /> to each product page (in the head), and remove it (or change it to index) for each product page you want to get indexed. There’s also a corresponding HTTP header (X-Robots-Tag), if that’s easier for you.

Disallow query strings in robots.txt for only one url

I have one URL, chickens.com/hatching, that could be indexed with various query strings, e.g. chickens.com/hatching?type=fast. I would definitely like to keep the base URL, chickens.com/hatching, indexed, but without any query parameters. I would like query parameters indexed on other pages, just not this one, so a catch-all for all pages will not work. Secondarily, I am rewriting URLs to remove trailing slashes; would this catch chickens.com/hatching/?type=fast as well as chickens.com/hatching?type=fast?
Does this work as a solution to my issue?
Disallow: /hatching?*
I have heard this only works for Google's crawlers... is there a more robust solution for all crawlers?
Thanks for any help! It is greatly appreciated.
User-agent: *
Disallow: /hatching?
Disallow: /hatching/
This robots.txt will block all URLs whose path starts with /hatching? or /hatching/, so for example:
/hatching?
/hatching?foo=bar
/hatching/
/hatching/foo
/hatching/?foo=bar
It’s only using features from the original robots.txt specification, so all conforming bots should be able to understand this.
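The two Disallow lines can be checked against full URLs with a small Python sketch that applies the original-spec prefix match to the path plus query string. The function name is_blocked is made up for this example. Note that urlsplit drops a bare trailing "?" with an empty query, so that edge case is not covered here.

```python
from urllib.parse import urlsplit

DISALLOW = ["/hatching?", "/hatching/"]


def is_blocked(url: str) -> bool:
    """Prefix-match the URL's path plus query string against each value."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return any(target.startswith(value) for value in DISALLOW)


for url in [
    "http://chickens.com/hatching",
    "http://chickens.com/hatching?type=fast",
    "http://chickens.com/hatching/?type=fast",
    "http://chickens.com/hatchings",
]:
    print(url, is_blocked(url))
```

The base URL /hatching stays crawlable, while both query-string variants are blocked.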

robots.txt file syntax: can I disallow all, then allow only some sites?

Can you disallow all and then allow specific sites only? I am aware that one approach is to disallow specific sites and allow all. Is it valid to do the reverse? E.g.:
User-agent: *
Disallow: /
Allow: /siteOne/
Allow: /siteTwo/
Allow: /siteThree/
Simply disallowing everything and then allowing specific sites seems much more secure than allowing everything and then having to think about all the places you don't want crawled.
Could the method above be responsible for the site's description saying 'A description for this result is not available because of this site's robots.txt – learn more.' in the organic ranking on Google's home page?
UPDATE - I have gone into Google Webmaster Tools > Crawl > robots.txt Tester. At first, when I entered siteTwo/default.asp, it said Blocked and highlighted the 'Disallow: /' line. After leaving and revisiting the tool, it now says Allowed. Very weird. So if this says Allowed, I wonder why it gave the message above in the description for the site?
UPDATE2 - The example robots.txt file above should have said dirOne, dirTwo and not siteOne, siteTwo. Two great links for learning all about robots.txt are unor's robots.txt specification link in the accepted answer below and the robots exclusion standard, which is also a must-read. This is all explained in these two pages. In summary: yes, you can disallow and then allow, BUT always place the disallow last.
(Note: You don’t disallow/allow crawling of "sites" in the robots.txt, but URLs. The value of Disallow/Allow is always the beginning of a URL path.)
The robots.txt specification does not define Allow.
Consumers following this specification would simply ignore any Allow fields. Some consumers, like Google, extend the spec and understand Allow.
For those consumers that don’t know Allow: Everything is disallowed.
For those consumers that know Allow: Yes, your robots.txt should work for them. Everything’s disallowed, except those URLs matched by the Allow fields.
Assuming that your robots.txt is hosted at http://example.org/robots.txt, Google would be allowed to crawl the following URLs:
http://example.org/siteOne/
http://example.org/siteOne/foo
http://example.org/siteOne/foo/
http://example.org/siteOne/foo.html
Google would not be allowed to crawl the following URLs:
http://example.org/siteone/ (it’s case-sensitive)
http://example.org/siteOne (missing the trailing slash)
http://example.org/foo/siteOne/ (not matching the beginning of the path)
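The difference between the two kinds of consumers can be sketched in Python. This is a simplified model, not any crawler's real implementation: it uses plain, case-sensitive prefix matching with the longest matching rule winning, and the supports_allow flag and the function name may_crawl are assumptions made for the illustration.

```python
RULES = [
    ("Disallow", "/"),
    ("Allow", "/siteOne/"),
    ("Allow", "/siteTwo/"),
    ("Allow", "/siteThree/"),
]


def may_crawl(path: str, supports_allow: bool) -> bool:
    """Longest matching prefix wins. A consumer without Allow support
    skips the Allow lines and only sees 'Disallow: /'."""
    best_len, allowed = -1, True
    for directive, prefix in RULES:
        is_allow = directive == "Allow"
        if is_allow and not supports_allow:
            continue  # an original-spec consumer ignores unknown fields
        if path.startswith(prefix) and len(prefix) > best_len:
            best_len, allowed = len(prefix), is_allow
    return allowed


for path in ["/siteOne/", "/siteOne/foo.html", "/siteone/", "/siteOne", "/foo/siteOne/"]:
    print(path, may_crawl(path, supports_allow=True), may_crawl(path, supports_allow=False))
```

With Allow support, only the first two paths may be crawled; without it, none of them may.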