Crawler headers

I'm creating a simple crawler that will scrape from a list of pre-defined sites. My simple question: are there any HTTP headers that the crawler should specifically use? What's considered required, and what's desirable to have defined?

You should at least specify a custom user agent (as done here by StormCrawler) so that the webmasters of the sites you are crawling can see that you are a robot and contact you if needed.
More importantly, your crawler should follow the robots.txt directives, throttle the frequency of requests to the sites, etc., which leads me to the following question: why not reuse and customise an existing open source crawler like StormCrawler, Nutch or Scrapy instead of reinventing the wheel?
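As an illustration, here is a minimal Python sketch of that kind of politeness, with a custom user agent, a robots.txt check and a fixed delay; the bot name, info URL and target URLs are placeholders, and a real crawler would also cache robots.txt per host and handle errors:
import time
import urllib.robotparser
import urllib.request
USER_AGENT = "ExampleBot/1.0 (+https://example.com/bot-info)"  # placeholder identity
URLS = ["https://example.com/page1", "https://example.com/page2"]
# Fetch and parse the site's robots.txt once for this host
rp = urllib.robotparser.RobotFileParser("https://example.com/robots.txt")
rp.read()
for url in URLS:
    if not rp.can_fetch(USER_AGENT, url):
        continue  # robots.txt disallows this path for our agent
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req) as resp:
        body = resp.read()
    time.sleep(5)  # crude throttling; real crawlers use per-host politeness delays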

It's good to tell who you are and your intentions, and how to get hold of you. I remember from running a site and looking at the access.log for Apache that the following info actually served a purpose (like some of the fields set in the StormCrawler code):
Agent name - just the brand name of your crawler
Version of your agent software - If there were issues with earlier versions of the agent, it's good to see that this is a newer version
URL to info about agent - A link to an info page about the crawler, with more info on its purpose, technical makeup etc. Also a place to get in contact with the people behind the bot.
If you check out Request fields, you'll find two of interest: User-Agent and From. The second one is an email address, but last I checked it doesn't appear in the access.log for Apache2. The User-Agent for automated agents should contain the name, version and a URL to a page with more info about the agent. It is also common to have the word "bot" in your agent name.
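For example, the two headers might be set like this (the bot name, info URL and contact address are all placeholders):
import urllib.request
# Name/version plus an info URL in User-Agent, contact address in the (rarely logged) From header
headers = {
    "User-Agent": "ExampleBot/2.1 (+https://example.com/bot-info)",
    "From": "crawler-admin@example.com",
}
req = urllib.request.Request("https://example.com/", headers=headers)
with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.getheader("Content-Type"))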

List of served files in Apache

I am doing some reverse engineering on a website.
We are using a LAMP stack under CentOS 5, without any commercial/open source framework (Symfony, Laravel, etc.). Just plain PHP with an in-house framework.
I wonder if there is any way to know which files in the server have been used to produce a request.
For example, let's say I am requesting http://myserver.com/index.php.
Let's assume that 'index.php' calls other PHP scripts (e.g. to connect to the database and retrieve some info), and that it also includes a couple of other HTML files, etc.
How can I get the list of those accessed files?
I already tried enabling the server-status directive in Apache, and although it is working, it doesn't give me what I want (I also passed the 'refresh' parameter).
I also used lsof -c httpd, as suggested in other forums, but it is producing a very big output and I can't find what I'm looking for.
I also read the apache logs, but I am only getting the requests that the server handled.
Some other users suggested adding PHP directives like 'self', but that means I would need to know beforehand which files to modify to include that directive (which I don't), and that is precisely what I am trying to find out.
Is it actually possible to trace the internal activity of the server and get those file names and locations?
Regards.
Not that I have tried it, but it looks like mod_log_config is the answer to my own question.
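Untested, but a sketch of what that could look like: mod_log_config's %f token records the filesystem path Apache resolved for each request, which at least confirms the entry script, though it will not list files that PHP itself includes (the log path and format nickname here are placeholders):
LogFormat "%h %t \"%r\" %>s %f" filetrace
CustomLog /var/log/httpd/filetrace.log filetrace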

Missing files listed in error.log

Today I stumbled upon a folder on my web host called 'error.log'. I thought I'd take a look.
I see multiple 'file does not exist' errors - there are three types of entries:
robots.txt
missing.html
apple-touch-icon-precomposed.png
I have some guesses about what these files are used for, but would like to know definitively:
What are the files in question?
Should I add them to my server?
What prompts an error log to be written for these? Is it someone explicitly requesting them? If so, who and how?
A robots.txt file is read by web crawlers/robots to tell them which resources on your server they are allowed to scrape. It's not mandatory for a robot to obey this file, but the nice ones do. There are some further examples at http://en.wikipedia.org/wiki/Robots.txt. An example file, which would reside in the web root directory, may look like:
User-agent: * # All robots
Disallow: / # Do not enter website
or
User-Agent: googlebot # For this robot
Disallow: /something # do not enter
The apple-touch-icon-precomposed.png is explained at https://stackoverflow.com/a/12683605/722238
I believe missing.html is used by some as a customized 404 page. It's possible that a robot may be configured to scrape this file, hence the requests for it.
You should add a robots.txt file if you want to control the resources a robot will scrape off your server. As said before, it's not mandatory for a robot to read this file.
If you wanted to add the other two files to remove the error messages you could, however, I don't believe it is necessary. There is nothing to say that joe_random won't make a request on your server for /somerandomfile.txt in which case you will get another error message for another file that doesn't exist. You could then just redirect them to a customized 404 page.
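If you do go that route, a single directive in the Apache configuration (or a .htaccess file) is enough; the page name here is just a placeholder:
ErrorDocument 404 /my-custom-404.html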

Secure/second access for web development - deny view for public; allow developer

Sorry for the strange title, I can't find a better description for my question.
I'm building some websites with a team of 4 people - 2 developers and 2 testers. The developers build the page on a local Apache/MySQL server. Every now and then they upload a snapshot of what they have done to a dedicated server, which serves the files to the testers behind htaccess basic authentication.
Are there better solutions for this workflow? I would like to have more security for this whole thing. The snapshots of the website often show debug/development info that shouldn't be seen by the public.
Something like a different port on the Apache server...? Any suggestions?
I think another way is to use Git or some other version control system for deployment, so that only new code is added and you can permanently disable the display of this debug information in a config file that will not be overwritten.
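For example (assuming mod_php; the log path is a placeholder), the dedicated server's vhost or .htaccess could force PHP's error output off no matter what gets uploaded; any application-level debug flags would need a similar switch of their own:
php_flag display_errors off
php_value error_log /var/log/apache2/php_errors.log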
OR
You can use some cloud service like getpantheon.com (for Drupal). It can provide a good environment for testing.

A web application that lets users choose a domain name for the website they are about to create?

I want to create a web application that allows users to sign up, register a domain name and create their own website. This will be done with Ubuntu 9.10, Apache 2, MySQL 5 and PHP 5.
At the moment, the only area of development I'm uncertain about is the domain name registration and mapping it to the web application.
I'm going to postpone developing the web interface that lets users register domains because I don't have the slightest idea how to do it. For the time being, I'll let an employee register the domain name on the user's behalf. I'll automate the process in the future (any advice on this matter would be appreciated). The employee will also input the registered domain name into my CMS, which will also update the Apache VirtualHost files with the new domain information. I will have a cron job reload Apache every 5 minutes to capture the virtual host changes.
Does this sound like the right approach? Will what I'm about to do be very disruptive to the server? Can anyone offer suggestions or point out issues I need to be aware of?
Additional details
the DocumentRoot will remain the same at /var/www/public_html/websitemaker/ for all domains. I'll track user settings and styles based on PHP's $_SERVER variable
I don't believe restarting Apache every 5 minutes is the way to go, as it won't scale well.
One option would be to use logic to grab the domain name used to access the site. Verify that against your list of accounts in MySQL. If there is a match, load the user's site; if not, behave as normal or send them to an error page.
As for registering domain names, you will need to create (or use an existing) script implementing the API of the registrar of your choice. They will provide the ability to check whether a domain is available and to register it, assigning it specific DNS values (plus other options), all in real time.
I think what you're looking for is Apache with mass virtual hosting so that you don't have to restart/reload Apache every 5 mins. Any specific questions about this would be more appropriate for Serverfault.
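Not tested against this exact setup, but a rough sketch of a catch-all virtual host for the single-DocumentRoot approach described above (the ServerName is a placeholder); mod_vhost_alias and its VirtualDocumentRoot directive are what the mass virtual hosting suggestion refers to, if each domain ever needs its own directory:
<VirtualHost *:80>
    ServerName websitemaker.example.com
    ServerAlias *
    DocumentRoot /var/www/public_html/websitemaker/
</VirtualHost>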

exim configuration - accept all mail

I've just set up Exim on my Ubuntu computer. At the moment it will only accept email for accounts that exist on that computer, but I would like it to accept all email (just because I'm interested). Unfortunately there seem to be a million Exim-related config files, and I'm not having much success finding anything on Google.
Is there an introduction to exim for complete beginners?
Thanks.
There's a mailing list at http://www.exim.org/maillist.html. The problem you will face as an Ubuntu user is that there's always been a slight tension between Debian packagers/users and the main Exim user base because Debian chose to heavily customize their configuration. Their reasons for customizing it are sound, but it results in Debian users showing up on the main mailing list asking questions using terms that aren't recognizable to non-Debian users. Debian runs its own exim-dedicated help list (I don't have the address handy, but it's in the distro docs). Unfortunately this ends up causing you a problem because Ubuntu adopted all these packages from Debian, but doesn't support them in the same way as Debian does, and Debian packagers seem to feel put upon to be asked to support these Ubuntu users.
So, Ubuntu user goes to main Exim list and is told to ask their packager for help. So they go to the Debian lists and ask for help and may or may not be helped.
Now, to answer your original question, there are a ton of ways to do what you ask, and probably the best way for you is going to be specific to the Debian/Ubuntu configurations. However, to get you started, you could add something like this to your routers:
catchall:
driver = redirect
domains = +local_domains
data = youraddress@example.com
If you place that after your general alias/local delivery routers and before any forced-failure routers, it will redirect all mail to any unhandled local_part at any domain in local_domains to youraddress@example.com.
local_domains is a domain list defined in the standard Exim config file. If you don't have it or an equivalent, you can replace it with a colon-delimited list of local domains, like "example.com:example.net:example.foo".
One of the reasons it's hard to get up to speed with Exim is that you can literally do anything with it (literally: someone on the list proved the expansion syntax is Turing-complete a few years ago, IIRC). So, for instance, you could use the above framework to look the domains up from a file, apply regular expressions against the local_parts to catch, save the mail to a file instead of redirecting to an address, put it in front of the routers and use "unseen" to save copies of all mail, etc. If you really want to administer an Exim install, I strongly recommend reading the documentation from cover to cover; it's really, really good once you get a toehold.
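As a rough sketch of that last "unseen" idea (the archive address is a placeholder), a redirect router placed before the normal delivery routers would copy everything while letting normal delivery carry on:
copy_all:
driver = redirect
unseen
data = archive@example.com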
Good luck!