I am looking for a full list of user agents of bots (crawlers, spiders, Twitter bots, etc.).
Do you know anything?
Thanks
Check this list:
http://www.botsvsbrowsers.com/category/1/index.html
It contains a total of 4768 bot user agents.
The other way to accomplish bot detection is the reverse, white-list approach: check whether the user agent belongs to a known browser, and treat anything else as a bot. :-)
To compile a comprehensive list of non-bot user agents you can use the lists at http://www.user-agents.org/ and http://www.botsvsbrowsers.com/.
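As a very rough PHP sketch of that white-list idea (the browser tokens below are my own assumption, not a complete list; compile your own from the sites above):
// White-list approach: anything whose user agent does not contain a
// known browser token is treated as a bot.
$browserTokens = array('Firefox', 'Chrome', 'Safari', 'Opera', 'MSIE');
$userAgent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$isBot = true;
foreach ($browserTokens as $token) {
    if (stripos($userAgent, $token) !== false) {
        $isBot = false;
        break;
    }
}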
Long story short: you can't; there's no silver bullet. Any bot can set its user-agent string to anything, from 'googlebot' to 'spamalot'.
You can see it for yourself: go to the first site Shinnok pointed to and start counting all the Googlebot/2.X bots listed there. You block them, they change the bot's name to random gibberish, and so on. In the end you'll have a 10k-entry bot list that will increase your users' load times every time you try to verify whether they're a bot or not.
I am starting to see a few of these requests in my Apache logs. They seem to come in pairs: first a request for /notified-Notify_AUP, followed by a request for /verify-Notify_AUP.
The requests come with a Google search referrer pointing to my site. The requests seem to come from legit companies -- of course anything can be hacked.
I have never heard of these files, unlike so many of the other fishing expeditions aimed at all of our sites. Is this something new, or are these legit and I am supposed to be providing some sort of reply?
Thanks,
Boggle
I finally found out that this is an attack on ProxySG. Since I do not have a ProxySG box, I can safely ignore this problem.
I'm completely stumped on an SEO issue and could really use some direction from an expert. We built a website recently, http://www.ecovinowines.net, and because it is all about wine, we set up an age verification that requires the user to click before entering the site. By using cookies, we prevent the user from accessing any page of the site before clicking the age verification link.
It's been a couple of months since launching the site, so I thought I'd check out some keywords on Google. I just typed in the name of the website to see what pages would be indexed, and it is only showing the age verification pages. From the googling I've done, apparently nothing behind the age verification will be visible to the Google bots because they ignore cookies.
Is there no safe workaround for this? I checked out New Belgium's site, which uses a similar age verification link, and all of its pages seem to be getting indexed. Once you click on one of its links from Google, it redirects the user to the age verification page. Are they not using cookies? Or how might they be getting around the cookie/bot issue?
Do a test for Googlebot's user agent and allow access if it matches. You might want to let other search engines through too:
Googlebot/2.1 (+http://www.google.com/bot.html)
msnbot/1.0 (+http://search.msn.com/msnbot.htm)
Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
ia_archiver
Semi-official response from Google:
This topic comes up periodically for sites (alcohol, porn, etc.) that need to serve an age verification notice on every page. What we recommend in this case is to serve it via JavaScript. That way users can see the age verification any time they try to access your content, but search engines that don't run JavaScript won't see the warning and will instead be able to see your content.
http://groups.google.com/group/Google_Webmaster_Help-Tools/browse_thread/thread/3ce6c0b3cc72ede8/
I think a more modern technique would be to render all the content normally, then obscure it with a Javascript overlay.
I had a quick look at New Belgium and it's not clear what they're doing. Further investigation needed.
Assuming you are using PHP, something like this will do the job. You'd need to add other bots if you want them.
// True if the user agent contains "Googlebot"
$isBot = strpos($_SERVER['HTTP_USER_AGENT'], 'Googlebot') !== false;
Then you can toggle your age verification on or off based on that variable.
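If you want to cover the other crawlers as well, a sketch along these lines could work (the tokens are taken from the user-agent strings listed earlier in this thread):
// Treat the request as a bot if the user agent matches any known crawler token.
$botTokens = array('Googlebot', 'msnbot', 'Slurp', 'ia_archiver');
$userAgent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$isBot = false;
foreach ($botTokens as $token) {
    if (stripos($userAgent, $token) !== false) {
        $isBot = true;
        break;
    }
}
// When $isBot is true, skip the age verification redirect so crawlers can see the content.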
There is one IP (from China) which is trying to download my entire website. It downloads all my pages and loads the server significantly (I have more than 500 000 pages). Looking at the access logs I can tell it's definitely not a Google bot or any other search engine bot.
Temporarily I've banned it (using iptables rules), but that's not a solution for me, because some of my real users share the same IP, so they are also banned and cannot access the website.
Is there any way to prevent this kind of "user activity"? Maybe a mechanism that shows a captcha if a client makes more than 5 requests a second, or something similar?
P.S. I'm using Yii framework (PHP).
Any suggestions are greatly appreciated.
thank you!
You have answered your own question!
Make a captcha appear if the number of requests exceeds a certain threshold per second or per minute!
You could use CCaptchaAction to implement it, along the lines of the sketch below.
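A rough sketch of that wiring in Yii 1.1 (the controller name, view names and the 'needsCaptcha' session flag are placeholders of mine, not part of the framework):
class SiteController extends CController
{
    public function actions()
    {
        return array(
            // Serves the captcha image; reference it from a CCaptcha widget in your view.
            'captcha' => array(
                'class' => 'CCaptchaAction',
                'backColor' => 0xFFFFFF,
            ),
        );
    }

    public function actionView()
    {
        // Hypothetical flow: if this session has been flagged as too active
        // (e.g. by a rate-limit check), show a captcha form and validate the
        // submitted code (e.g. with CCaptchaValidator) before rendering.
        if (Yii::app()->user->getState('needsCaptcha')) {
            $this->render('captchaChallenge');
            return;
        }
        $this->render('view');
    }
}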
I guess the best way to monitor for suspicious user activity is really the user session, via CWebUser's getState()/setState(). Store the current request time in the user session, compare it to several previous values, and show a captcha if the user makes requests too often.
Create a new component, preload it via CWebApplication::$preload, and check user activity in the component's init() function. This way you'll be able to turn the bot check on and off easily.
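Sketching that component (the class name, state keys and thresholds are my own placeholders):
// protected/components/RequestThrottle.php -- add it to the 'preload' and
// 'components' arrays in protected/config/main.php so init() runs on every request.
class RequestThrottle extends CApplicationComponent
{
    public $maxRequests = 5; // allowed requests per window (assumed limit)
    public $window = 1;      // window length in seconds

    public function init()
    {
        parent::init();

        $user  = Yii::app()->user;
        $now   = time();
        $times = $user->getState('requestTimes', array());

        // Keep only the timestamps that fall inside the current window.
        $recent = array();
        foreach ($times as $t) {
            if ($t > $now - $this->window) {
                $recent[] = $t;
            }
        }
        $recent[] = $now;
        $user->setState('requestTimes', $recent);

        if (count($recent) > $this->maxRequests) {
            // Flag the session; a filter or controller can then demand a
            // captcha (see the CCaptchaAction sketch above) before serving content.
            $user->setState('needsCaptcha', true);
        }
    }
}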
I have a big list of websites and I need to know if they have areas that are password protected.
I am thinking about doing this: downloading all of them with httrack and then writing a script that looks for keywords like "Log In" and "401 Forbidden". But the problem is these websites are all different -- some static, some dynamic (HTML, CGI, PHP, Java applets...) -- and most of them won't use the same keywords...
Do you have any better ideas?
Thanks a lot!
Looking for password fields will get you so far, but won't help with sites that use HTTP authentication. Looking for 401s will help with HTTP authentication, but won't get you sites that don't use it, or ones that don't return 401. Looking for links like "log in" or "username" fields will get you some more.
I don't think that you'll be able to do this entirely automatically and be sure that you're actually detecting all the password-protected areas.
You'll probably want to take a library that is good at web automation and write a little program of your own that reads the list of target sites from a file, checks each one, and writes one file of "these are definitely passworded" and one of "these are not". Then you might want to manually check the ones that are not, and modify your program to accommodate them. Using httrack is great for grabbing data, but it's not going to help with detection -- if you write your own "check for password-protected area" program in a general-purpose HLL, you can do more checks, and you can avoid generating more requests per site than are necessary to determine that a password-protected area exists.
You may need to ignore robots.txt
I recommend using the Python port of Perl's Mechanize, or whatever nice web automation library your preferred language has. Almost all modern languages will have a nice library for opening and searching through web pages and looking at HTTP headers.
If you are not capable of writing this yourself, you're going to have a rather difficult time using httrack or wget or similar and then searching through responses.
Look for forms with password fields.
You may need to scrape the site to find the login page. Look for links with phrases like "log in", "login", "sign in", "signin", or scrape the whole site (needless to say, be careful here).
I would use httrack with several limits and then search the downloaded files for password fields.
Typically, a login form could be found within two links of the home page. Almost all ecommerce sites, web apps, etc. have login forms that are accessed just by clicking on one link on the home page, but another layer or even two of depth would almost guarantee that you didn't miss any.
I would also limit the speed at which httrack downloads, tell it not to download any non-HTML files, and prevent it from downloading external links. I'd also limit the number of simultaneous connections to the site to 2 or even 1. This should work for just about all of the sites you are looking at, and it should keep you off the hosts.deny list.
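As a rough PHP sketch of the "search the downloaded files" step, assuming the httrack mirror sits in a local directory (the path below is just a placeholder):
// Recursively scan a local mirror for HTML-ish files that contain a
// password input, which usually indicates a login form.
function findPasswordForms($dir)
{
    $hits = array();
    $iter = new RecursiveIteratorIterator(new RecursiveDirectoryIterator($dir));
    foreach ($iter as $file) {
        if (!$file->isFile()) {
            continue;
        }
        $ext = strtolower(pathinfo($file->getPathname(), PATHINFO_EXTENSION));
        if (!in_array($ext, array('html', 'htm', 'php'))) {
            continue;
        }
        $contents = file_get_contents($file->getPathname());
        if (stripos($contents, 'type="password"') !== false
            || stripos($contents, "type='password'") !== false) {
            $hits[] = $file->getPathname();
        }
    }
    return $hits;
}

print_r(findPasswordForms('/path/to/httrack/mirror'));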
You could just use wget and do something like:
wget -A html,php,jsp,htm -S -r http://www.yoursite.com 2> output_yoursite.txt
This will make wget download the entire site recursively, but only files with the extensions listed after the -A option, which here helps avoid heavy files.
The response headers (wget's -S output goes to stderr, hence the 2>) end up in output_yoursite.txt, which you can then parse for 401 status codes, meaning that part of the site requires authentication; you can also search the downloaded files themselves, per Konrad's recommendation.
Looking for 401 codes won't reliably catch them as sites might not produce links to anything you don't have privileges for. That is, until you are logged in, it won't show you anything you need to log in for. OTOH some sites (ones with all static content for example) manage to pop a login dialog box for some pages so looking for password input tags would also miss stuff.
My advice: find a spider program that you can get the source for, add in whatever tests (plural) you plan on using, and make it stop at the first positive result. Look for a spider that can be throttled way back, can ignore non-HTML files (maybe by making HEAD requests and looking at the MIME type) and can work with more than one site independently and simultaneously.
You might try using cURL and just attempting to connect to each site in turn (possibly put them in a text file and read each line, try to connect, repeat).
You can set up one of the callbacks to check the HTTP response code and do whatever you need from there.
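A sketch of that in PHP's cURL binding, reading the sites from a hypothetical sites.txt (one URL per line) and checking the response code with curl_getinfo() rather than a callback:
// Issue a HEAD request to each site and report the ones answering 401,
// i.e. the ones protected by HTTP authentication.
$sites = file('sites.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
foreach ($sites as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_NOBODY, true);          // HEAD request only
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // don't echo the response
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    curl_exec($ch);
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    if ($code == 401) {
        echo "$url requires HTTP authentication\n";
    }
}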
I'm doing very rudimentary tracking of page views by logging URL, referral codes, sessions, times etc., but I'm finding it's getting bombarded with robots (Google, Yahoo etc.). I'm wondering what an effective way is to filter these out or avoid logging them.
I've experimented with robot IP lists etc but this isn't foolproof.
Is there some kind of robots.txt, htaccess, PHP server-side code, javascript or other method(s) that can "trick" robots or ignore non-human interaction?
Just to add: a technique you can employ within your interface would be to use JavaScript to encapsulate the actions that lead to certain user-interaction view/counter increments. For a very rudimentary example, a robot will not (and cannot) follow a link like this:
<a href="javascript:viewItem(4)">Chicken Farms</a>
function viewItem(id)
{
    window.location.href = 'www.example.com/items?id=' + id + '&from=userclick';
}
To make those clicks easier to track, they might yield a request such as
www.example.com/items?id=4&from=userclick
That would help you reliably track how many times something is 'clicked', but it has obvious drawbacks, and of course it really depends on what you're trying to achieve.
It depends on what you want to achieve.
If you want search bots to stop visiting certain paths/pages, you can include them in robots.txt. The majority of well-behaved bots will stop hitting them.
If you want bots to index these paths but you don't want to see them in your reports then you need to implement some filtering logic. E.g. all major bots have a very clear user-agent string (e.g. Googlebot/2.1). You can use these strings to filter these hits out from your reporting.
Well-behaved robots all identify themselves with a specific user agent, so you can just disregard those requests.
But also, if you just use a robots.txt to deny them from visiting, that will work too.
Don't rediscover the wheel!
Any statistics tool nowadays filters out robot requests. You can install AWStats (open source) even if you have shared hosting. If you don't want to install software on your server, you can use Google Analytics by adding just a script at the end of your pages. Both solutions are very good. That way you only have to log your errors (500, 404 and 403 are enough).