Does the user agent string have to be exactly as it appears in my server logs? - seo

When using a robots.txt file, does the user agent string have to be exactly as it appears in my server logs?
For example when trying to match GoogleBot, can I just use googlebot?
Also, will a partial-match work? For example just using Google?

At least for Googlebot, the user-agent is case-insensitive. See the 'Order of precedence for user-agents' section:
https://code.google.com/intl/de/web/controlcrawlindex/docs/robots_txt.html

(As already answered in another question)
In the original robots.txt specification (from 1994), it says:
User-agent
[…]
The robot should be liberal in interpreting this field. A case insensitive substring match of the name without version information is recommended.
[…]
Whether parsers actually work like that, and which ones do, is another question. Your best bet is to check the documentation of the bots you want to address. You’ll typically find the agent identifier string in it, e.g.:
Bing:
We want webmasters to know that bingbot will still honor robots.txt directives written for msnbot, so no change is required to your robots.txt file(s).
DuckDuckGo:
DuckDuckBot is the Web crawler for DuckDuckGo. It respects WWW::RobotRules […]
Google:
The Google user-agent is (appropriately enough) Googlebot.
Internet Archive:
User Agent archive.org_bot is used for our wide crawl of the web. It is designed to respect robots.txt and META robots tags.
…
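To make the 1994 recommendation quoted above concrete, here is a minimal Python sketch of a "case insensitive substring match of the name without version information". The function name and the reading that the User-agent field value should be a substring of the robot's own name are my assumptions; the specification does not spell out the direction of the match, and real parsers may differ.

```python
def matches_1994_recommendation(field_value: str, robot_name: str) -> bool:
    """Return True if a User-agent field value applies to the given robot.

    Assumed reading of the recommendation: the field value matches when it is
    '*' or a case-insensitive substring of the robot's name, after any
    '/version' suffix has been stripped from the robot's name.
    """
    token = field_value.strip().lower()
    if token == "*":
        return True
    name_without_version = robot_name.split("/")[0].lower()
    # An empty field value should not match everything, so require a token.
    return bool(token) and token in name_without_version


print(matches_1994_recommendation("googlebot", "Googlebot/2.1"))  # True
print(matches_1994_recommendation("Google", "Googlebot/2.1"))     # True
print(matches_1994_recommendation("bingbot", "Googlebot/2.1"))    # False
```

Under this liberal reading, both googlebot and Google would match, which is exactly why you need to check what each real bot does.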

robots.txt is case-sensitive. Google is more lenient than other bots and may accept its string either way, but other bots may not.

Also, will a partial-match work? For example just using Google?
In theory, yes. However, in practice it seems that only specific partial matches or "substrings" (as mentioned in @unor's answer) match. These specific "substrings" appear to be referred to as "tokens", and often the value must be an exact match for one of these "tokens".
With regard to the standard Googlebot, only Googlebot itself (case-insensitive) appears to match. Any shorter partial match, such as Google, fails to match. Any longer string, such as Googlebot/1.2, fails to match. And using the full user-agent string (Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)) also fails to match. (There is technically more than one user-agent string for Googlebot, so matching on the full user-agent string would not be recommended even if it did work.)
These tests were performed with Google's robots.txt tester.
Reference:
Google Crawlers - Includes User agent "tokens" (to be used in robots.txt)
Google's robots.txt tester

Yes, the user agent has to be an exact match.
From robotstxt.org: "globbing and regular expression are not supported in either the User-agent or Disallow lines"

Related

robots.txt file syntax: can I disallow all, then allow only some sites?

Can you disallow all and then allow specific sites only? I am aware that one approach is to disallow specific sites and allow everything else. Is it valid to do the reverse? E.g.:
User-agent: *
Disallow: /
Allow: /siteOne/
Allow: /siteTwo/
Allow: /siteThree/
To simply disallow all and then allow sites seems much more secure than to allow all and then have to think about all the places you don't want them to crawl.
Could the method above be responsible for the site's description saying 'A description for this result is not available because of this site's robots.txt – learn more.' in the organic ranking on Google's home page?
UPDATE - I have gone into Google Webmaster Tools > Crawl > robots.txt tester. At first, when I entered siteTwo/default.asp, it said Blocked and highlighted the 'Disallow: /' line. After leaving and revisiting the tool, it now says Allowed. Very weird. So if this says Allowed, I wonder why it gave the message above in the description for the site?
UPDATE2 - The example robots.txt file above should have said dirOne, dirTwo and not siteOne, siteTwo. Two great links for learning all about robots.txt are the robots.txt specification linked in unor's accepted answer below and the robots exclusion standard, which is also a must-read. This is all explained in these two pages. In summary: yes, you can disallow and then allow, BUT always place the disallow last.
(Note: You don’t disallow/allow crawling of "sites" in the robots.txt, but URLs. The value of Disallow/Allow is always the beginning of a URL path.)
The robots.txt specification does not define Allow.
Consumers following this specification would simply ignore any Allow fields. Some consumers, like Google, extend the spec and understand Allow.
For those consumers that don’t know Allow: Everything is disallowed.
For those consumers that know Allow: Yes, your robots.txt should work for them. Everything’s disallowed, except those URLs matched by the Allow fields.
Assuming that your robots.txt is hosted at http://example.org/robots.txt, Google would be allowed to crawl the following URLs:
http://example.org/siteOne/
http://example.org/siteOne/foo
http://example.org/siteOne/foo/
http://example.org/siteOne/foo.html
Google would not be allowed to crawl the following URLs:
http://example.org/siteone/ (it’s case-sensitive)
http://example.org/siteOne (missing the trailing slash)
http://example.org/foo/siteOne/ (not matching the beginning of the path)
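The URL lists above can be reproduced with a short Python sketch. It assumes the "longest matching path wins, Allow wins ties" precedence that Google documents for its own parser; that rule, the RULES list, and the function name are assumptions on top of the question's example, and the sketch deliberately ignores wildcards and percent-encoding.

```python
from urllib.parse import urlsplit

# The rules from the question's "User-agent: *" record.
RULES = [
    ("disallow", "/"),
    ("allow", "/siteOne/"),
    ("allow", "/siteTwo/"),
    ("allow", "/siteThree/"),
]


def is_allowed(url: str, rules=RULES) -> bool:
    """Case-sensitive prefix match on the URL path; the longest matching
    rule decides, and Allow wins when two matching rules have equal length."""
    path = urlsplit(url).path or "/"
    best_length = -1
    allowed = True  # nothing matched: crawling is allowed by default
    for kind, prefix in rules:
        if prefix and path.startswith(prefix):
            longer = len(prefix) > best_length
            tie_for_allow = len(prefix) == best_length and kind == "allow"
            if longer or tie_for_allow:
                best_length = len(prefix)
                allowed = kind == "allow"
    return allowed


for url in [
    "http://example.org/siteOne/foo.html",
    "http://example.org/siteone/",
    "http://example.org/siteOne",
    "http://example.org/foo/siteOne/",
]:
    print(url, "->", "allowed" if is_allowed(url) else "blocked")
```

A consumer that does not know Allow would effectively see only the Disallow: / rule and block everything.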

Duplicate content and international sites clarification

Something is not clear to me. Here is my case:
I want to have the same content for US and UK visitors.
Could I safely avoid duplicate content issues with these URLs?
www.example.us/info.html (hosted on us server)
www.example.co.uk/info.html (hosted on uk server)
From Google:
Websites that provide content for different regions and in different languages sometimes create content that is the same or similar but available on different URLs. This is generally not a problem as long as the content is for different users in different countries. While we strongly recommend that you provide unique content for each different group of users, we understand that this might not always be possible. There is generally no need to "hide" the duplicates by disallowing crawling in a robots.txt file or by using a "noindex" robots meta tag. However, if you're providing the same content to the same users on different URLs (for instance, if both example.de/ and example.com/de/ show German language content for users in Germany), you should pick a preferred version and redirect (or use the rel=canonical link element) appropriately. In addition, you should follow the guidelines on rel-alternate-hreflang to make sure that the correct language or regional URL is served to searchers.
This is not clear to me. What do you think about my case?
flau
Go for hreflang. When implemented properly, you will avoid all duplicate content issues.
if you're providing the same content to the same users on different URLs (for instance, if both example.de/ and example.com/de/ show German language content for users in Germany), you should pick a preferred version and redirect (or use the rel=canonical link element) appropriately. In addition, you should follow the guidelines on rel-alternate-hreflang to make sure that the correct language or regional URL is served to searchers
That covers your scenario:
Choose one as your preferred URL for the US and make it redirect (or use canonical), and
Follow hreflang guidelines: https://support.google.com/webmasters/answer/189077?hl=en
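As a rough illustration of the hreflang part, here is a small Python sketch that generates the link elements both pages could carry in their head section. The http:// scheme, the en-us/en-gb language-region codes, and the helper name are my assumptions for the two URLs in the question; check the guidelines linked above for the authoritative rules.

```python
from html import escape

# Assumed mapping of hreflang values to the question's two URLs.
ALTERNATES = {
    "en-us": "http://www.example.us/info.html",
    "en-gb": "http://www.example.co.uk/info.html",
}


def hreflang_links(alternates: dict) -> str:
    """Build one <link rel="alternate"> element per language/region variant.
    Every variant page is expected to list all variants, including itself."""
    return "\n".join(
        '<link rel="alternate" hreflang="{}" href="{}" />'.format(
            escape(lang, quote=True), escape(url, quote=True)
        )
        for lang, url in alternates.items()
    )


print(hreflang_links(ALTERNATES))
```

The same block of link elements would go on both the .us and the .co.uk page.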

If I respond to requests for robots.txt with HTTP code 418 AKA "I'm a teapot", will this make search engines dislike me?

I have a very simple webapp that runs within HTML5's Canvas and doesn't have any public files that need to be indexed by search engines (beyond the front-page HTML file that includes calls to all the necessary resources). As such, I don't really need a robots.txt file, since crawlers will just see the public files and that's it.
Now, as a joke, I'd like to return an HTTP 418 AKA "I'm a teapot" response every time a web crawler asks for robots.txt. However, if this will end up screwing me over in terms of my position in search results, then this is not a joke that would be very worthwhile for me.
Does anybody know anything about how different web-crawlers will respond to non-standard (though in this case it technically is standard) HTTP codes?
Also, on a more serious note, is there any reason to have a robots.txt file that says "everything is indexable!" instead of just not having a file?
Having a blank robots.txt file will also tell crawlers that you want all of your content indexed. (There is an Allow directive for robots.txt, but it is non-standard and should not be relied upon.) Serving a blank file is worth doing because it keeps 404 errors from piling up in your access logs whenever a search engine requests a non-existent robots.txt from your site.
Sending out non-standard HTTP codes is not a good idea as you have absolutely no idea how search engines will respond to it. If they don't accept it they may use a 404 header as a fallback and that's obviously not what you want to happen. Basically, this is a bad place to make a joke.
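For a sense of what a crawler might do with the status code of a robots.txt response, here is a hedged Python sketch. It is loosely modelled on the behaviour Google documents for its own crawlers (most 4xx codes treated as "no robots.txt", 5xx and 429 treated as "do not crawl for now"); the function name and the returned strings are hypothetical, and other crawlers may behave quite differently, which is the point made above.

```python
def assumed_policy_for_robots_status(status: int) -> str:
    """Map the HTTP status of a robots.txt response to an assumed crawler
    reaction. This models one crawler family's documented behaviour only."""
    if 200 <= status < 300:
        return "parse the returned rules"
    if 300 <= status < 400:
        return "follow the redirect"
    if status == 429 or 500 <= status < 600:
        return "assume everything is disallowed for now"
    if 400 <= status < 500:
        return "assume no robots.txt exists (crawl everything)"
    return "undefined"


for code in (200, 404, 418, 503):
    print(code, "->", assumed_policy_for_robots_status(code))
```

Under these assumptions a 418 would be handled like a 404, but nothing obliges any given crawler to follow this model.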

How to track all website activity and filtering web robot data

I'm doing some very rudimentary tracking of page views by logging URLs, referral codes, sessions, times etc., but I'm finding that the log is getting bombarded by robots (Google, Yahoo etc.). What is an effective way to filter out these hits, or to avoid logging them at all?
I've experimented with robot IP lists etc but this isn't foolproof.
Is there some kind of robots.txt, htaccess, PHP server-side code, javascript or other method(s) that can "trick" robots or ignore non-human interaction?
Just to add - a technique you can employ within your interface is to use JavaScript to wrap the actions that lead to certain user-interaction view/counter increments. As a very rudimentary example, a robot will not (and usually cannot) follow a link like the following, whose click handler calls viewItem:
Chicken Farms
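// Navigate to the item page and flag the request as a real user click.
function viewItem(id)
{
    // The scheme is required; without it the browser treats the URL as relative.
    window.location.href = 'http://www.example.com/items?id=' + encodeURIComponent(id) + '&from=userclick';
}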
To make those clicks easier to track, they might yield a request such as
www.example.com/items?id=4&from=userclick
That would help you reliably track how many times something is 'clicked', but it has obvious drawbacks, and of course it really depends on what you're trying to achieve.
It depends on what you want to achieve.
If you want search bots to stop visiting certain paths/pages you can include them in robots.txt. The majority of well-behaving bots will stop hitting them.
If you want bots to index these paths but you don't want to see them in your reports then you need to implement some filtering logic. E.g. all major bots have a very clear user-agent string (e.g. Googlebot/2.1). You can use these strings to filter these hits out from your reporting.
Well, well-behaved robots each use a specific user-agent, so you can just disregard those requests.
Alternatively, if you use a robots.txt file to deny them from visiting, that will work too.
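A minimal Python sketch of the user-agent filtering suggested above might look like this. The token list is illustrative and far from complete, the hit dictionaries and function names are hypothetical, and bots that fake a browser user-agent will slip through.

```python
# Illustrative, incomplete list of lowercase substrings seen in crawler UAs.
BOT_TOKENS = ("googlebot", "bingbot", "msnbot", "duckduckbot",
              "archive.org_bot", "slurp")


def is_known_bot(user_agent) -> bool:
    """Case-insensitive substring check against the known bot tokens."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in BOT_TOKENS)


def human_hits(hits):
    """Keep only the logged hits whose user-agent is not a known bot."""
    return [hit for hit in hits if not is_known_bot(hit.get("user_agent"))]


hits = [
    {"url": "/items?id=4", "user_agent": "Mozilla/5.0 (Windows NT 10.0) Firefox/115.0"},
    {"url": "/items?id=4", "user_agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"},
]
print(len(human_hits(hits)))  # 1
```

Filtering at report time rather than at logging time keeps the raw data around in case the token list needs to change later.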
Don't reinvent the wheel!
Virtually any statistics tool these days filters out robot requests. You can install AWStats (open source) even on shared hosting. If you don't want to install software on your server, you can use Google Analytics by adding just a script at the end of your pages. Both solutions are very good. This way you only have to log your errors (500, 404 and 403 are enough).

URL scheme for a multi-version web app

I'm looking for the best URL schema to use for a web app that has multiple versions, namely several languages and a simplified version for use by mobile phones - both aspects can be combined, so there's an English regular and mobile version, a German regular and mobile version, etc.
Goals (in order of importance):
User-friendliness
Search engine friendliness
Ease of development
Aspects to consider:
What should the URLs look like?
How should the user navigate between versions?
How much logic should there be to automatically decide on a version?
I'll describe my concept so far below, maybe some of you have better ideas.
My current concept:
When a new user arrives, the app decides, based on cookies (see below), the Accept-Language: header and the user agent string (used to identify mobile browsers) which version to show, but does not reflect this in the URL (no redirects)
It defaults to the non-simplified English version
There are prominently displayed icons (flags, a stylized mobile phone) to choose other versions
When the user explicitly chooses a different version, this is reflected both in a changed URL and a browser cookie
The URL schema is / for the "automatic" version, /en/, /de/, etc. for the language version, /mobile/ for the simplified version, /normal/ for the non-simplified one, and combinations thereof, e.g. /mobile/en/ and /normal/de/
mod_rewrite is used to strip these URL prefixes and convert them to GET parameters for the app to parse
robots.txt disallows /mobile/ and /normal/
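To illustrate the prefix-stripping step of this concept independently of mod_rewrite, here is a Python sketch that splits a request path into the version parameters and the remaining path. The set of languages, the dictionary keys, and the function name are assumptions for illustration only.

```python
DEVICES = {"mobile", "normal"}
LANGUAGES = {"en", "de"}  # assumption: the languages the app supports


def split_version_prefix(path: str) -> dict:
    """Strip an optional device prefix followed by an optional language
    prefix, and return them together with the remaining application path."""
    segments = [segment for segment in path.split("/") if segment]
    device = lang = None
    if segments and segments[0] in DEVICES:
        device = segments.pop(0)
    if segments and segments[0] in LANGUAGES:
        lang = segments.pop(0)
    return {"device": device, "lang": lang, "path": "/" + "/".join(segments)}


print(split_version_prefix("/mobile/en/items"))
print(split_version_prefix("/de/"))
print(split_version_prefix("/"))
```

A value of None for device or lang corresponds to the "automatic" case, where cookies and request headers decide.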
Advantages:
The different language versions are all indexed separately by search engines
Cookies help, but are not necessary
There's a good chance that people will see the version that's ideal for them without having to make any choice
The user can always explicitly choose which version he wants (this makes the /normal/ URL necessary)
Each version has a URL which will display exactly that version when passed to others
/mobile/ and /normal/ are ignored by search engines; they would only be duplicate content.
Disadvantages:
Requires heavy use of mod_rewrite, which I find rather cryptic
Users could send their current URL to someone and that person, when visiting it, could end up seeing a different version, which could cause confusion
There is still duplicate content between / and /en/ - I can't disallow / in robots.txt - should I trust the search engines not to penalize me for exact duplicate content on the same domain, or disallow /en/ and accept that people coming to / via a search engine may see a different version than what they found in the search engine?
I suggest subdomains, personally.
I wouldn't include the mobile version at all - use the user agent to determine this, and possibly a cookie in case the user wants to view the full site on their mobile (think of how Flickr and Google do it). But for languages, yes - primary language at http://mydomain.com/, secondary languages at e.g. http://de.mydomain.com/ or http://fr.mydomain.com/
I am unclear why you would want to encode in the URL scheme any of what you call versioning information, such as a designation derived from Accept-Language or User-Agent. The URL scheme should be indicative of the content only. The server should inspect the various request headers to determine how to retrieve and/or format the response.
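As a sketch of what inspecting the request headers could look like for the language part, here is a simplified Accept-Language picker in Python. The supported languages, the default, and the function name are assumptions, and the parsing is deliberately minimal (it ignores wildcards and treats malformed q-values as zero).

```python
def pick_language(accept_language: str, supported=("en", "de"), default="en") -> str:
    """Return the supported primary language with the highest q-value in an
    Accept-Language header, or the default if none of them is supported."""
    candidates = []
    for part in accept_language.split(","):
        tag, _, params = part.strip().partition(";")
        tag = tag.strip().lower()
        if not tag:
            continue
        quality = 1.0  # a missing q parameter means q=1
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0
        candidates.append((quality, tag))
    # sorted() is stable, so equal q-values keep their header order.
    for quality, tag in sorted(candidates, key=lambda item: -item[0]):
        primary = tag.split("-")[0]
        if quality > 0 and primary in supported:
            return primary
    return default


print(pick_language("de-DE,de;q=0.9,en;q=0.8"))  # de
print(pick_language("fr;q=0.9, en;q=0.5"))       # en
```

The same idea extends to the User-Agent header for choosing between the mobile and the regular layout.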