Allow a certain parameter in robots.txt - seo

I have this in my robots.txt and that needs to stay there:
Disallow: /*?
However, I also need Google to index pages that have ?amp at the end of the URL, like this:
www.domain.com/product-name?amp=1
Is there a way to allow those in robots.txt, but also keep the Disallow mentioned earlier?

To quote Google's documentation:
At a group-member level, in particular for allow and disallow directives, the most specific rule based on the length of the [path] entry trumps the less specific (shorter) rule. In case of conflicting rules, including those with wildcards, the least restrictive rule is used.
This means that if you allow ?amp but disallow everything above it, the longer, more specific Allow rule takes precedence: the amp pages are allowed, while anything else matched by the shorter Disallow rule stays blocked.
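A minimal sketch of what that could look like, assuming the amp pages always end in ?amp=1 as in your example (the * and $ wildcards are Google extensions, so other crawlers may ignore them):
User-agent: *
Disallow: /*?
Allow: /*?amp=1$
You can check individual URLs against this file with the robots.txt tester in Google Search Console before relying on it.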

Related

Robots.txt disallow by regex

On my website I have a page for the cart, http://www.example.com/cart, and another for cartoons, http://www.example.com/cartoons. What should I write in my robots.txt file so that only the cart page is ignored?
The cart page does not accept a trailing slash on the URL, so if I use
Disallow: /cart, it will ignore /cartoon too.
I don't know whether something like /cart$ is possible and whether it will be correctly parsed by the spider bots. I don't want to rely on Allow: /cartoon because there may be other pages with the same prefix.
In the original robots.txt specification, this is not possible. It neither supports Allow nor any characters with special meaning inside a Disallow value.
But some consumers support additional things. For example, Google gives a special meaning to the $ sign, where it represents the end of the URL path:
Disallow: /cart$
For Google, this will block /cart, but not /cartoon.
Consumers that don’t give this special meaning will interpret $ literally, so they will block /cart$, but not /cart or /cartoon.
So if you use this, you should target the supporting bots explicitly in User-agent.
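For example, a sketch that scopes the $-anchored rule to Googlebot; crawlers that don't match this group fall back to any User-agent: * group (if present), where you would need a different strategy, such as the noindex alternative below:
User-agent: Googlebot
Disallow: /cart$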
Alternative
Maybe you are fine with crawling but just want to prevent indexing? In that case you could use meta-robots (with a noindex value) instead of robots.txt. Supporting bots will still crawl the /cart page (and follow links, unless you also use nofollow), but they won’t index it.
<!-- in the <head> of the /cart page -->
<meta name="robots" content="noindex" />
You could explicitly allow and disallow both paths. Longer, more specific paths take higher precedence:
disallow: /cart
allow: /cartoon
More info is available at: https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt
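Note that Allow and Disallow lines only take effect inside a User-agent group, so a complete file following this approach might look like the sketch below (the catch-all * group is an assumption; only bots that understand Allow will honor it):
User-agent: *
Allow: /cartoon
Disallow: /cart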

robots.txt file syntax: can I disallow all then only allow some sites?

Can you disallow all and then allow specific sites only? I am aware one approach is to disallow specific sites and allow all. Is it valid to do the reverse? E.g.:
User-agent: *
Disallow: /
Allow: /siteOne/
Allow: /siteTwo/
Allow: /siteThree/
Simply disallowing all and then allowing specific sites seems much more secure than allowing all and then having to think about all the places you don't want them to crawl.
Could the method above be responsible for the site's description saying 'A description for this result is not available because of this site's robots.txt – learn more.' in the organic listings on Google?
UPDATE - I have gone into Google Webmaster Tools > Crawl > robots.txt tester. At first, when I entered siteTwo/default.asp, it said Blocked and highlighted the 'Disallow: /' line. After leaving and revisiting the tool it now says Allowed. Very weird. So if this says Allowed, I wonder why it gave the message above in the description for the site?
UPDATE 2 - The example robots.txt file above should have said dirOne, dirTwo and not siteOne, siteTwo. Two great links for learning all about robots.txt are unor's robots.txt specification in the accepted answer below and the robots exclusion standard, which is also a must-read. This is all explained on these two pages. In summary, yes, you can disallow and then allow, BUT always place the disallow last.
(Note: You don’t disallow/allow crawling of "sites" in the robots.txt, but URLs. The value of Disallow/Allow is always the beginning of a URL path.)
The robots.txt specification does not define Allow.
Consumers following this specification would simply ignore any Allow fields. Some consumers, like Google, extend the spec and understand Allow.
For those consumers that don’t know Allow: Everything is disallowed.
For those consumers that know Allow: Yes, your robots.txt should work for them. Everything’s disallowed, except those URLs matched by the Allow fields.
Assuming that your robots.txt is hosted at http://example.org/robots.txt, Google would be allowed to crawl the following URLs:
http://example.org/siteOne/
http://example.org/siteOne/foo
http://example.org/siteOne/foo/
http://example.org/siteOne/foo.html
Google would not be allowed to crawl the following URLs:
http://example.org/siteone/ (it’s case-sensitive)
http://example.org/siteOne (missing the trailing slash)
http://example.org/foo/siteOne/ (not matching the beginning of the path)
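If you also wanted the bare http://example.org/siteOne (without the trailing slash) to be crawlable, one option for consumers that support the $ end-of-path marker (such as Google) would be an extra, more specific Allow rule; a sketch:
User-agent: *
Disallow: /
Allow: /siteOne/
Allow: /siteOne$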

What is the meaning of /*+* in a robots.txt file?

I have a question regarding a robots.txt file.
Disallow: Blog/*+*
What does that mean?
In theory it would stop a robot that chose to respect it from accessing any URL matching Blog/*+* (for bots that support * wildcards, that is anything under Blog/ whose URL contains a + character); however, the bot doesn't have to respect it, and since the value doesn't start with a slash indicating a directory, there is no telling how different robots will deal with it.
From http://www.robotstxt.org/orig.html:
Disallow
The value of this field specifies a partial URL that is not to be visited. This can be a full path, or a partial path; any URL that starts with this value will not be retrieved. For example, Disallow: /help disallows both /help.html and /help/index.html, whereas Disallow: /help/ would disallow /help/index.html but allow /help.html.
Any empty value, indicates that all URLs can be retrieved. At least one Disallow field needs to be present in a record.
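If the intent was to block blog URLs containing a + (often an encoded space) for bots that understand * wildcards, a more conventional form would start the value with a slash and scope it to a specific group; a sketch, assuming the blog actually lives under /Blog/:
User-agent: Googlebot
Disallow: /Blog/*+*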

vBulletin forum under multiple domains

Hope someone will give me a hand with this problem I have. So here it goes.
There is a website with an integrated vBulletin forum. The forum is accessible at
https://site.de/forum. The main site itself has several other domains based on locale; that is to say, there are https://site.ch, https://site.it, https://site.at, etc. (each one in the corresponding language).
Now there is a need to have this forum under at least two of these additional domains. I mean, there should be a https://site.ch/forum URL which will contain the same forum, but with some differences in style and, of course, working in-forum links using its own domain (site.ch). The whole system also needs to be SEO-friendly.
So now my question is how to achieve this. I know there are plugins of sorts to manage multi-domain access, but they are unsupported and still in beta.
First, how do I set up the forum to work under multiple domains?
Then, do I perhaps need to manually change some code to set $vbulletin->options['bburl'], which is used to generate the links inside the forum?
And finally, how do I make all of this search engine optimized?
You're asking numerous questions; you might get better results if you created a separate question for each of:
1) How to use one forum directory for multiple domains? (with the vbulletin tag and the tag for the web server you are using)
2) How to set the language based on the current domain in vbulletin? (with the vbulletin tag and one or more of these tags: localized, locale, multi-language, multilanguage)
3) Best practices for duplicate content presented in multiple languages on multiple domains (with the seo and vbulletin tags)
Some Answers:
1) If you're using the Apache web server, you could add something like this to your httpd.conf file:
# use the path to your forum directory, no trailing slash
# (on Apache 2.4+, replace Order/Allow with "Require all granted")
Alias /forums /var/www/...xxx.../forum_directory
<Directory /var/www/...xxx.../forum_directory>
    Order allow,deny
    Allow from all
</Directory>
Then in the vBulletin ACP, change the setting for your base path URL to "No":
Admin Control Panel -> Site Name / URL / Contact Details -> Always use Forum URL as Base Path
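If each locale domain is a separate Apache virtual host, the same idea can be repeated per host; a rough sketch with placeholder paths and the ServerName values taken from the question (SSL directives such as SSLEngine and the certificates are omitted for brevity, and the <Directory> block above still applies):
<VirtualHost *:443>
    ServerName site.de
    DocumentRoot /var/www/site.de
    # both hosts alias /forum to the same vBulletin directory
    Alias /forum /var/www/forum_directory
</VirtualHost>
<VirtualHost *:443>
    ServerName site.ch
    DocumentRoot /var/www/site.ch
    Alias /forum /var/www/forum_directory
</VirtualHost>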
2) There are a few plugins that detect the language used by the browser and set vBulletin to use that language:
Language Detection
Set forum-language automatic to browser-language for first-time-visitors
3) SEO covers many things, but to deal with having duplicate content on multiple domains you can look at the Google Webmaster Central Blog.
This posting is helpful:
Working with multi-regional websites
A section from the post: Dealing with duplicate content on global websites
Websites that provide content for different regions and in different languages sometimes create content that is the same or similar but available on different URLs. This is generally not a problem as long as the content is for different users in different countries. While we strongly recommend that you provide unique content for each different group of users, we understand that this may not always be possible for all pages and variations from the start. There is generally no need to "hide" the duplicates by disallowing crawling in a robots.txt file or by using a "noindex" robots meta tag. However, if you're providing the same content to the same users on different URLs (for instance, if both "example.de/" and "example.com/de/" show German language content for users in Germany), it would make sense to choose a preferred version and to redirect (or use the "rel=canonical" link element) appropriately.
I don't have anything on the other search engines.

Does the user agent string have to be exactly as it appears in my server logs?

When using a Robots.txt file, does the user agent string have to be exactly as it appears in my server logs?
For example, when trying to match GoogleBot, can I just use googlebot?
Also, will a partial-match work? For example just using Google?
At least for googlebot, the user-agent match is case-insensitive. Read the 'Order of precedence for user-agents' section:
https://code.google.com/intl/de/web/controlcrawlindex/docs/robots_txt.html
(As already answered in another question)
In the original robots.txt specification (from 1994), it says:
User-agent
[…]
The robot should be liberal in interpreting this field. A case insensitive substring match of the name without version information is recommended.
[…]
But whether (and which) parsers work like that is another question. Your best bet would be to look for the documentation of the bots you want to add; you'll typically find the agent identifier string in it, for example in the quotes below (a combined sketch follows the list):
Bing:
We want webmasters to know that bingbot will still honor robots.txt directives written for msnbot, so no change is required to your robots.txt file(s).
DuckDuckGo:
DuckDuckBot is the Web crawler for DuckDuckGo. It respects WWW::RobotRules […]
Google:
The Google user-agent is (appropriately enough) Googlebot.
Internet Archive:
User Agent archive.org_bot is used for our wide crawl of the web. It is designed to respect robots.txt and META robots tags.
…
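Putting those tokens together, a robots.txt sketch that addresses each crawler by its documented identifier (the paths are just placeholders):
User-agent: Googlebot
Disallow: /no-google/

User-agent: bingbot
Disallow: /no-bing/

User-agent: archive.org_bot
Disallow: /no-archive/

User-agent: *
Disallow: /private/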
robots.txt matching is case-sensitive, although Google is less strict than other bots and may accept its string either way; other bots may not.
Also, will a partial-match work? For example just using Google?
In theory, yes. However, in practice it seems to be specific partial matches or "substrings" (as mentioned in unor's answer) that match. These specific "substrings" appear to be referred to as "tokens", and often the match must be exact for these "tokens".
With regard to the standard Googlebot, this only appears to match Googlebot (case-insensitively). Any lesser partial match, such as Google, fails to match. Any longer partial match, such as Googlebot/1.2, fails to match. Using the full user-agent string (Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)) also fails to match. (There is technically more than one user-agent for Googlebot anyway, so matching on the full user-agent string would not be recommended even if it did work.)
These tests were performed with Google's robots.txt tester.
Reference:
Google Crawlers - Includes User agent "tokens" (to be used in robots.txt)
Google's robots.txt tester
Yes, the user agent has to be an exact match.
From robotstxt.org: "globbing and regular expression are not supported in either the User-agent or Disallow lines"