Searching for specific information on robots.txt, I stumbled upon a Yandex help page‡ on this topic. It suggests that I could use the Host directive to tell crawlers my preferred mirror domain:
User-Agent: *
Disallow: /dir/
Host: www.example.com
Also, the Wikipedia article states that Google too understands the Host directive, but there wasn’t much (i.e. any) information on it.
At robotstxt.org, I didn’t find anything on Host (or Crawl-delay as stated on Wikipedia).
Is it encouraged to use the Host directive at all?
Are there any resources from Google on this specific robots.txt directive?
How compatible is it with other crawlers?
‡ At least since the beginning of 2021, the linked entry does not deal with the directive in question any longer.
The original robots.txt specification says:
Unrecognised headers are ignored.
They call them "headers", but this term is not defined anywhere. However, as it’s mentioned in the section about the format, in the same paragraph as User-agent and Disallow, it seems safe to assume that "headers" means "field names".
So yes, you can use Host or any other field name.
Robots.txt parsers that support such fields, well, support them.
Robots.txt parsers that don’t support such fields must ignore them.
But keep in mind: As they are not specified by the robots.txt project, you can’t be sure that different parsers support this field in the same way. So you’d have to check every supporting parser manually.
Assume a bad actor scripts access to an Apache server to probe for vulnerabilities. With Fail2Ban we can catch some number of 404's and ban the IP. Now assume a single web page has a bad local reference to a CSS, JS, or image file. Repeated hits by the same legitimate site visitor will result in some number of 404s, and possibly an IP ban.
Is there a good way to separate these local requests from remote so that we don't ban the valued visitor?
I know all requests are remote, in that a page gets returned to a browser and the content of the page triggers more requests for assets. The thing is, how do we know the difference between that kind of page load pattern, and a script query for the same resource?
If we do know that a request is coming in based on a link that we just generated, we could do a 302 redirect rather than returning a 404, thus avoiding the banning process.
The HTTP Referer header can be used. If the Referer has the same origin as the requested page, or matches the local site's FQDN, then we should not ban. But that header can be spoofed. So is this a good tool to use?
I'm thinking cookies can be used, or a session nonce, where a request might come in for assets from a page without a current session cookie. But I don't know if something like that is a built-in feature.
The best solution is obviously to make sure that all pages generated on a site include only valid references back to the site, but we all know that's not always possible. Some CMSes add version info to files, or they adjust image paths to include an image size based on the client device/size. Any of these generated references might simply be wrong until we can find and fix the code that creates them. Between the time we deploy something faulty and the time we fix it, I'm concerned about accidentally banning legitimate visitors with Fail2Ban (and other tools) that do not factor in where the request originates.
Is there another solution to this challenge? Thanks!
how do we know the difference between that kind of page load pattern
In the normal case you don't (at least not without some whitelist or blacklist).
But you do know which URI or path segments, file extensions, etc. would practically never be the target of such attack vectors, and those you can ignore.
Some CMS add version info to files, or they adjust image paths to include an image size based on the client device/size.
But you surely know which prefixes are correct, so a regex allowing certain path segments is possible. For instance, this one:
# regex ignoring site and cms paths:
^<HOST> -[^"]*\"[A-Z]{3,}\s+/(?!site/|cms/)\S+ HTTP/[^"]+" 40\d\s\d+
will ignore this one:
192.0.2.1 - - [02/Mar/2021:18:01:06] "GET /site/style.css?ver=1.0 HTTP/1.1" 404 469
and match this one:
192.0.2.1 - - [02/Mar/2021:18:01:06] "GET /xampp/phpmyadmin/scripts/setup.php HTTP/1.1" 404 469
Similarly, you can write a regex with a negative lookahead to ignore certain extensions like .css or .js, or arguments like ?ver=1.0, as sketched below.
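For example, a variant of the regex above with additional lookahead alternatives (an untested sketch; the extensions and the ?ver= argument are only illustrative, adjust them to your site) could look like this:
# regex additionally ignoring common asset extensions and a ?ver= argument:
^<HOST> -[^"]*\"[A-Z]{3,}\s+/(?!site/|cms/|\S*\.(?:css|js|png|jpe?g|gif|ico)(?:\?|\s)|\S*\?ver=)\S+ HTTP/[^"]+" 40\d\s\d+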
Another possibility would be to create a special fallback location that logs completely bogus requests to a separate log file (not into the access or error logs), as described in wiki :: Best practice. This way it is possible to single out evildoers whose clearly wrong URIs did not match any proper location handled by the web server.
Or simply disable logging of 404s in locations known to be valid (paths, prefixes, extensions, whatever).
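With Apache, one way to do that (a sketch; the prefixes and extensions are assumptions, and note that it suppresses all logging for the matched requests, not only 404s) is to keep such requests out of the access log that Fail2Ban reads:
# don't log requests for known asset prefixes or extensions
SetEnvIf Request_URI "^/(site|cms)/" assetreq
SetEnvIf Request_URI "\.(css|js|png|jpe?g|gif|ico)(\?|$)" assetreq
CustomLog /var/log/apache2/access.log combined env=!assetreq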
To reduce or completely avoid false positives, you can initially increase maxretry or reduce findtime and observe the results for a while (so evildoers with too many attempts still get banned, while legitimate users whose "broken" requests cause a smaller number of 404s are ignored). That also lets you accumulate a full list of "valid" 404 requests for your application, in order to write a more precise regex or filter them in certain locations.
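For instance, a jail override along these lines (the jail and filter names are hypothetical and the values purely illustrative) gives legitimate visitors more headroom while still catching aggressive scanners:
# jail.local sketch (assumes a filter containing a failregex like the one above)
[apache-404]
enabled  = true
filter   = apache-404
logpath  = /var/log/apache2/access.log
# tolerate more 404s within a shorter window, so a single broken page load
# does not trigger a ban, while persistent scanners still do
maxretry = 20
findtime = 60
bantime  = 3600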
I'm creating a simple crawler that will scrape from a list of pre-defined sites. My simple question: are there any HTTP headers that the crawler should specifically use? What's considered required, and what's desirable to have defined?
You should at least specify a custom user agent (as done here by StormCrawler) so that the webmasters of the sites you are crawling can see that you are a robot and contact you if needed.
More importantly, your crawler should follow the robots.txt directives, throttle the frequency of requests to the sites, etc., which leads me to the following question: why not reuse and customise an existing open source crawler like StormCrawler, Nutch or Scrapy instead of reinventing the wheel?
It's good to tell who you are, what your intentions are, and how to get hold of you. I remember from running a site and looking at the access.log for Apache that the following info actually served a purpose (like some of the items listed in the StormCrawler code):
Agent name - just the brand name of your crawler
Version of your agent software - if there were issues with earlier versions of the agent, it's good to see that this is an evolved version
URL to info about agent - a link to an info page about the crawler, with more detail on its purpose, technical makeup etc. It's also a place to get in contact with the people behind the bot
If you check out Request fields, you'll find two of interest: User-Agent and From. The second one is an email address, but last I checked it doesn't appear in the access.log for Apache2. The User-Agent for automated agents should contain the name, version and URL of a page with more info about the agent. It is also common to have the word "bot" in your agent name.
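So the relevant headers sent by such a crawler might look like this (the bot name, version and addresses are made up for illustration):
User-Agent: ExampleBot/1.2 (+https://www.example.com/bot-info.html)
From: bot-admin@example.com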
Today I stumbled upon a folder on my web host called 'error.log'. I thought I'd take a look.
I see multiple 'file does not exist' errors - there are three types of entries:
robots.txt
missing.html
apple-touch-icon-precomposed.png
I have some guesses about what these files are used for, but would like to know definitively:
What are the files in question?
Should I add them to my server?
What prompts an error log to be written for these? Is it someone explicitly requesting them? If so, who and how?
A robots.txt file is read by web crawlers/robots to find out which resources on your server they are allowed or disallowed to scrape. It's not mandatory for a robot to read this file, but the well-behaved ones do. There are some further examples at http://en.wikipedia.org/wiki/Robots.txt An example file, which would reside in the web root directory, may look like this:
User-agent: * # All robots
Disallow: / # Do not enter website
or
User-Agent: googlebot # For this robot
Disallow: /something # do not enter
The apple-touch-icon-precomposed.png file is explained at https://stackoverflow.com/a/12683605/722238
I believe missing.html is used by some as a customized 404 page. It's possible that a robot is configured to scrape this file, hence the requests for it.
You should add a robots.txt file if you want to control the resources a robot will scrape off your server. As said before, it's not mandatory for a robot to read this file.
If you wanted to add the other two files to remove the error messages you could, but I don't believe it is necessary. There is nothing to say that joe_random won't make a request on your server for /somerandomfile.txt, in which case you will get another error message for another file that doesn't exist. You could then just redirect them to a customized 404 page.
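In Apache, that last part could be done with something like the following in .htaccess or the vhost config (a sketch; the page name simply echoes the missing.html mentioned above):
# serve /missing.html as the body of 404 responses
ErrorDocument 404 /missing.html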
How can a client detect if a server is using mod_rewrite? Now I know that some mod_rewrite rules are not very obvious. But some are, such as "SEO friendly URLs". What types of behavior are impossible unless a server is running mod_rewrite?
What types of behavior are impossible unless a server is running mod_rewrite?
The real answer is "none". In theory, any URL could be formed by actual files or directories, including the classical "SEO friendly" URLs.
There is only circumstantial evidence:
The best indication that I can think of is when the entire site structure consists of URLs without .htm, .php, or .html file extensions:
http://domain.com/slugs/house-warming-party
To exclude the possibility of that URL being a directory, request:
http://domain.com/slugs/house-warming-party/index.htm
http://domain.com/slugs/house-warming-party/index.html
http://domain.com/slugs/house-warming-party/index.php
http://domain.com/slugs/house-warming-party/index.asp
... whatever other extensions there are .....
If those requests all fail, it is very likely that the site is using mod_rewrite. However, if they succeed, as @Gumbo says, it could also be the MultiViews option fixing up the request. Either way, this is nowhere near a safe test!
Depending on what your use case is, you could also try to deduce things from the CMS used on the site. WordPress with mod_rewrite turned on will show a different URL structure than with it turned off (roughly the rules sketched below). The same holds true for most other CMSes. But of course, this is also a highly imperfect approach.
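For reference, the rewrite block that WordPress and many similar CMSes place in .htaccess when pretty permalinks are enabled looks roughly like this:
# typical front-controller rewrite: send everything that isn't an
# existing file or directory to index.php
RewriteEngine On
RewriteBase /
RewriteRule ^index\.php$ - [L]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]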
The use of HTML resources with a .html/.htm/.php ending would point slightly against the use of mod_rewrite, but you can never be sure.
The use of the PATH_INFO variable (also known as the poor man's mod_rewrite) would point somewhat strongly against the use of mod_rewrite:
http://example.com/index.php/slugs/house-warming-party
In conclusion, mod_rewrite (like most URL-rewriting tools) is supposed to be a module transparent to the outside world. I know of no sure-fire way to detect it from outside, and there may well be none.
I am using YSlow as a simple speed benchmarking tool and I came across a really confusing concept: the ETag.
So the main problem is: how do I configure ETags? My grade in YSlow says:
There are 19 components with misconfigured ETags
* http://thehotelinventory.com/media/js/jquery.min.js
* http://thehotelinventory.com/media/js/jquery.colorbox.min.js
* http://thehotelinventory.com/media/js/easyslider.min.js
* http://thehotelinventory.com/media/js/jquery.tools.min.js
* http://thehotelinventory.com/media/js/custom.min.js
* http://thehotelinventory.com/media/js/jquery.validate.min.js
* http://thehotelinventory.com/media/images/colorbox/loading_background.png
* http://thehotelinventory.com/media/images/productheaderbg.jpg
* http://thehotelinventory.com/media/images/buttons/field-bg. //etc
I browsed through the developer.yahoo.com guidelines on website optimization, yet I can't really understand the thing with ETags.
This page shows how to disable ETags for IIS and this page shows how to do it for Apache.
Assuming you are running Apache...
You can set up a simple ETag like this:
FileETag MTime Size
If you have multiple servers, you want to disable ETags, since by default the ETag can include the file's inode, which differs from server to server.
FileETag None
Put the above code in your httpd.conf (if you have access), otherwise you can put it in .htaccess.
Think of E-Tags as a sort of hash. When a browser makes a request for a resource, it sends along the E-tag of the file version it has cached. If the server decides that the files are similar enough (there are "strong" and "weak" versions of E-Tags so it's not always a simple comparison check) it will send a "304 Not Modified" response to the client, rather than the resource itself. This translates into a speed boost, since it prevents bandwidth from being wasted on unchanged files.
E-Tags are sent via HTTP headers.
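A simplified exchange might look like this (the tag value is made up):
# first response: the server hands out an ETag with the resource
HTTP/1.1 200 OK
ETag: "686897696a7c876b7e"

# later revalidation request from the browser
GET /media/js/custom.min.js HTTP/1.1
If-None-Match: "686897696a7c876b7e"

# server: resource unchanged, so no body is sent again
HTTP/1.1 304 Not Modified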
There's a good example of E-Tags at work (and also how to disable them for Apache) here:
http://www.askapache.com/htaccess/apache-speed-etags.html
By removing the ETag header, you prevent caches and browsers from validating files this way, so they are forced to rely on your Cache-Control and Expires headers.
Add these lines to .htaccess:
<IfModule mod_headers.c>
Header unset ETag
</IfModule>
FileETag None
Go straight to the source: YSlow provides guidance on all of its advice, including how to configure ETags.
The best way to configure your ETags is to remove them. For static files, far-future expiration dates are a much better approach.
The way to remove them depends on the web server you're using. For IIS 7, it can be done with a simple HttpModule.
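On Apache, the far-future expiration part might look roughly like this (a sketch assuming mod_expires is enabled; the lifetimes are illustrative):
<IfModule mod_expires.c>
ExpiresActive On
# far-future lifetimes for static assets
ExpiresByType text/css "access plus 1 year"
ExpiresByType application/javascript "access plus 1 year"
ExpiresByType image/png "access plus 1 year"
ExpiresByType image/jpeg "access plus 1 year"
</IfModule>
FileETag None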
Entity tags are a feature of the HTTP protocol; see http://www.ietf.org/rfc/rfc2616.txt:
Entity tags are used for comparing two or more entities from the same requested resource. HTTP/1.1 uses entity tags in the ETag (section 14.19), If-Match (section 14.24), If-None-Match (section 14.26), and If-Range (section 14.27) header fields. The definition of how they are used and compared as cache validators is in section 13.3.3. An entity tag consists of an opaque quoted string, possibly prefixed by a weakness indicator.
Wikipedia is man's best friend :)
http://en.wikipedia.org/wiki/HTTP_ETag
Basically a hash, as ShZ said, that should be unique (or nearly so) for a given version of a file.