I have a list of URLs and want to find out whether they redirect somewhere else or not, and if they do, what their final location is. I am doing this by sending HEAD requests to these URLs.
The list contains links to certain hosts that disallow my bot (or any bot in general) in robots.txt.
My question is, in order to be polite:
Should I follow robots.txt for HEAD requests too, and stop requesting these hosts?
If there is a crawl delay mentioned in robots.txt, should I obey it for these HEAD requests?
Is there a web service that can do this job for me and return the final URLs for a batch of input URLs?
You should always abide by robots.txt, even for HEAD requests. If you don't, not only are you violating the website's politeness preferences, but you're also risking having your IP permanently blocked by the website. A simple HEAD request to a restricted, non-humanly-accessible directory or page on a website can put you on the operator's ban list.
Should I follow robots.txt for HEAD requests too, and stop requesting these hosts?
You should follow robots.txt, and if you're already banned, stop requesting those hosts.
If there is a crawl delay mentioned in robots.txt, should I obey it for these HEAD requests?
Yes.
Is there a web service that can do this job for me and return the final URLs for a batch of input URLs?
I don't know of any, but perhaps you can adapt an existing crawler to do that. What programming language do you prefer?
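Since no language is specified, here is a rough sketch in PHP (assuming the cURL extension) of how the redirect-resolution part could look: a HEAD request per URL that follows the redirect chain and reports the final location, with a fixed pause between requests as a crude stand-in for Crawl-delay. The bot name is made up, and checking each host's robots.txt before requesting it is only marked as a TODO here.

<?php
// Illustrative only: resolve the final location of each URL with a HEAD request.
function finalLocation(string $url): array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_NOBODY         => true,   // send a HEAD request, no body
        CURLOPT_FOLLOWLOCATION => true,   // follow the whole redirect chain
        CURLOPT_MAXREDIRS      => 10,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_USERAGENT      => 'ExampleRedirectChecker/1.0', // hypothetical bot name
        CURLOPT_TIMEOUT        => 15,
    ]);
    curl_exec($ch);
    $result = [
        'final'  => curl_getinfo($ch, CURLINFO_EFFECTIVE_URL),
        'status' => curl_getinfo($ch, CURLINFO_HTTP_CODE),
        'hops'   => curl_getinfo($ch, CURLINFO_REDIRECT_COUNT),
    ];
    curl_close($ch);
    return $result;
}

$urls = ['http://example.com/some-link'];   // your batch of input URLs
foreach ($urls as $url) {
    // TODO: fetch and parse robots.txt for parse_url($url, PHP_URL_HOST)
    //       and skip the URL if the host disallows your user agent.
    $info = finalLocation($url);
    printf("%s -> %s (HTTP %d, %d redirects)\n",
        $url, $info['final'], $info['status'], $info['hops']);
    sleep(2); // crude stand-in for the host's Crawl-delay
}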
I have a site (example.com) and have my robots.txt set up in the root directory. I also have multiple subdomains (foo.example.com, bar.example.com, and more to come in the future) whose robots.txt will all be identical to that of example.com. I know that I can place a file at the root of each subdomain, but I'm wondering if it's possible to redirect the crawlers searching for robots.txt on any subdomain to example.com/robots.txt?
Sending a redirect header for your robots.txt file is not advised, nor is it officially supported.
Google's documentation specifically states:
Handling of robots.txt redirects to disallowed URLs is undefined and discouraged.
But the documentation does say a redirect "will be generally followed". If you add your subdomains to Google Webmaster Tools and go to "Crawl > Blocked URLs", you can test your subdomains' robots.txt files that are 301 redirecting. It should come back as working.
However, with that said, I would strongly suggest that you just symlink the files into place and that each robots.txt file responds with a 200 OK at the appropriate URL. This is much more in line with the original robots.txt specification as well as Google's documentation, and who knows exactly how Bing/Yahoo will handle it over time.
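If you do serve a copy (or symlink) per subdomain, a quick illustrative check like the one below (plain PHP, hostnames taken from the question) confirms that each subdomain answers robots.txt itself with a 200 OK and identical content instead of redirecting.

<?php
// Illustrative sanity check: every subdomain should serve robots.txt directly.
$reference = file_get_contents('http://example.com/robots.txt');
foreach (['foo.example.com', 'bar.example.com'] as $host) {
    // Do not follow redirects, and do not fail on non-2xx responses.
    $context = stream_context_create(['http' => ['follow_location' => 0, 'ignore_errors' => true]]);
    $body    = file_get_contents("http://$host/robots.txt", false, $context);
    $status  = $http_response_header[0] ?? 'no response';   // e.g. "HTTP/1.1 200 OK"
    $same    = ($body === $reference) ? 'identical' : 'DIFFERENT';
    echo "$host: $status, content $same\n";
}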
I am currently writing my robots.txt file and have some trouble deciding whether I should allow or disallow some folders for SEO purposes.
Here are the folders I have:
/css/ (CSS)
/js/ (JavaScript)
/img/ (images I use for the website)
/php/ (PHP scripts that return a blank page, such as checkemail.php, which checks an email address, or register.php, which puts data into a SQL database and sends an email)
/error/ (my 401, 403, 404, 406, and 500 error HTML pages)
/include/ (header.html and footer.html that I include)
I was thinking about disallowing only the PHP pages and allowing the rest.
What do you think?
Thanks a lot
Laurent
/css and /js -- CSS and JavaScript files will probably be crawled by Googlebot whether or not you have them in robots.txt. Google uses them to render your pages for site preview, and Google has asked nicely that you not put them in robots.txt.
/img -- Googlebot may crawl this even when it is in robots.txt, the same way as CSS and JavaScript. Putting your images in robots.txt generally prevents them from being indexed in Google Image Search. Google Image Search may be a source of visitors to your site, so you may wish to be indexed there.
/php -- It sounds like you don't want spiders hitting the URLs that perform actions. Good call to use robots.txt.
/error -- If your site is set up correctly, the spiders will probably never know what directory your error pages are served from. They generally get served at the URL that has the error, and the spider never sees their actual URL. This isn't the case if you redirect to them, which isn't recommended practice anyway. As such, I would say there is no need to put them in robots.txt.
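Putting that advice together, the resulting robots.txt could be as small as the following (only the /php/ scripts are blocked; everything else is left crawlable):

User-agent: *
Disallow: /php/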
We are setting up an internal banner system, and we want to track links clicked. So within our website, we link to
http://example.com/forward/33
Which in fact forwards to
http://example.com/article/145
But by passing through the forward page, we can record some statistics. Now, for SEO purposes I would guess a 301 redirect would be best (we'd be using a PHP header), so that search engines consider this a link to the final internal page and not to the forwarder page. Is this the recommended approach? Is there a problem with having tons of 301s within your website? And is there anything else to be taken into account when forwarding internal links?
A 301 redirect would be your best bet. However, as with everything else, I would use it in moderation.
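For illustration only, a bare-bones forward script along those lines might look like this in PHP; lookupBannerTarget() and recordBannerClick() are hypothetical placeholders for your own lookup and statistics code.

<?php
// Hypothetical /forward/33 handler: log the click, then 301 to the real article.
$bannerId = (int) ($_GET['id'] ?? 0);          // e.g. 33, extracted by your router
$target   = lookupBannerTarget($bannerId);     // placeholder: returns e.g. "/article/145"
recordBannerClick($bannerId);                  // placeholder: record the statistic first

header('Location: ' . $target, true, 301);     // permanent redirect, good for SEO
exit;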
We're doing a whitelabel site, which must not be indexed by Google.
Does anyone know of a tool to check whether Googlebot will index a given URL?
I've put <meta name="robots" content="noindex" /> on all pages, so it shouldn't be indexed - however I'd rather be 110% certain by testing it.
I know I could use robots.txt, however the problem with robots.txt is as follows:
Our main site should be indexed, and it's the same application on IIS (ASP.Net) as the whitelabel site; the only difference is the URL.
I cannot modify the robots.txt depending on the incoming URL, but I can add a meta tag to all pages from my code-behind.
You should add a Robots.txt to your site.
However, the only perfect way to prevent search engines from indexing a site is to require authentication. (Some spiders ignore Robots.txt)
EDIT: You need to add a handler for Robots.txt to serve different files depending on the Host header.
You'll need to configure IIS to send the Robots.txt request through ASP.Net; the exact instructions depend on the IIS version.
Google Webmaster Tools (google.com/webmasters/tools) will (besides permitting you to upload a sitemap) do a test crawl of your site and tell you what it crawled, how it rates for certain queries, and what it will and will not crawl.
The test crawl isn't automatically included in Google results. In any case, if you're trying to hide sensitive data from the prying eyes of Google, you cannot count on that alone: put some authentication in the line of fire, no matter what.
I have been noticing on my trackers that bots are visiting my site A LOT. Should I change or edit my robots.txt or change something? I'm not sure if that's good, because they are indexing or what?
Should I change or edit my robots.txt or change something?
Depends on the bot. Some bots will dutifully ignore robots.txt.
We had a similar problem 18 months ago with the Google ad bot because our customer was purchasing Soooo many ads.
Google ad bots will (as documented) ignore wildcard (*) exclusions, but listen to explicit exclusions.
Remember, bots that honor robots.txt will just not crawl your site. This is undesirable if you want them to get access to your data for indexing.
A better solution is to throttle or supply static content to the bots.
I'm not sure if that's good, because they are indexing or what?
They could be indexing, scraping, or stealing; all the same, really. What I think you want is to throttle their HTTP request processing based on user agents. How to do this depends on your web server and app container.
As suggested in other answers, if the bot is malicious, then you'll need to find its user-agent pattern and send it 403 Forbidden responses. If the malicious bots dynamically change their user-agent strings, you have two further options:
Whitelist user agents - e.g. create a user-agent filter that only accepts certain user agents. This is very imperfect.
IP banning - the request will carry a source IP you can block. But if you're getting DoSed (a denial-of-service attack), then you have bigger problems.
I really don't think changing robots.txt is going to help, because only GOOD bots abide by it. All others ignore it and parse your content as they please. Personally, I use http://www.codeplex.com/urlrewriter to get rid of the undesirable robots by responding with a forbidden message if they are detected.
The spam bots don't care about robots.txt. You can block them with something like mod_security (which is a pretty cool Apache plugin in its own right). Or you could just ignore them.
You might have to use .htaccess to deny some bots that would otherwise screw with your logs.
See here : http://spamhuntress.com/2006/02/13/another-hungry-java-bot/
I had lots of Java bots crawling my site, adding
# Flag any request whose User-Agent starts with "Java/1." or "Java1."
SetEnvIfNoCase User-Agent ^Java/1. javabot=yes
SetEnvIfNoCase User-Agent ^Java1. javabot=yes
# Refuse (403) any request carrying that flag
Deny from env=javabot
made them stop. Now they only get 403 one time and that's it :)
I once worked for a customer who had a number of "price comparison" bots hitting the site all of the time. The problem was that our backend resources were scarce and cost money per transaction.
We tried to fight off some of these for a while, but the bots just kept changing their recognizable characteristics. We ended up with the following strategy:
For each session on the server, we determined whether the user was at any point clicking too fast. After a given number of repeats, we'd set the "isRobot" flag to true and simply throttle down the response speed within that session by adding sleeps. We did not tell the user in any way, since he'd just start a new session in that case.
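A rough PHP sketch of that idea (the names and thresholds are made up for illustration; this is not the original code):

<?php
// Illustrative session-based throttling: flag sessions that click too fast,
// then quietly slow their responses down with sleeps.
session_start();

$now  = microtime(true);
$last = $_SESSION['lastHit'] ?? 0.0;
$_SESSION['lastHit'] = $now;

// Count hits that arrive faster than a human plausibly clicks.
if ($now - $last < 0.5) {
    $_SESSION['fastHits'] = ($_SESSION['fastHits'] ?? 0) + 1;
}

// After enough suspiciously fast hits, mark the session as a robot.
if (($_SESSION['fastHits'] ?? 0) >= 10) {
    $_SESSION['isRobot'] = true;
}

// Throttle flagged sessions by sleeping; the client is never told anything.
if (!empty($_SESSION['isRobot'])) {
    sleep(3);
}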