How do I make sure my website can block automation scripts, bots? - automation

I'd like to make sure that my website blocks automation tools like Selenium and QTP. Is there a way to do that ?
What settings on a website is Selenium bound to fail with ?

With due consideration to the comments on the original question asking "why on earth would you do this?", you basically need to follow the same strategy that any site uses to verify that a user is actually human. Methods such as asking users to authenticate or enter text from images or the like will probably work, but this will likely have the effect of blocking google crawlers and everything else.
Doing anything based on user agent strings or anything like that is mostly useless. Those are trivial to fake.
Rate-limiting connections or similar might have limited effectiveness, but it seems like you're going to inadvertently block any web crawlers too.

While this questions seems to be strange it is funny, so I tried to investigate possibilities
Besides adding a CAPTCHA which is the best and the only ultimate solution, you can block Selenium by adding the following JavaScript to your pages (this example will redirect to the Google page, but you can do anything you want):
<script>
var loc = window.parent.location.toString();
if (loc.indexOf("RemoteRunner.html")!=-1) {
// It is run in Selenium RC, so do something
document.location="http://www.google.com";
}
</script>
I do not know how can you block other automation tools and I am not sure if this will not block Selenium IDE

to be 100% certain that no automated bots/scripts can be run against your websites, don't have a website online. This will meet your requirement with certainty.
CAPTCHA are easy to break if not cheap, thanks to crowdsourcing and OCR methods.
Proxies can be found in the wild for free or bulk are available at extremely low costs. Again, useless to limit connection rates or detect bots.
One possible approach can be in your application logic, implement ways to increase time and cost for access to the site by having things like phone verification, credit card verification. Your website will never get off the ground because nobody will trust your site at it's infancy.
Solution: Do not put your website online and expect to be able to effectively eliminate bots and scripts from running.

Related

Hiding a page part from Google, does it hurt SEO?

We all know that showing inexistent stuff to Google bots is not allowed and will hurt the search positioning but what about the other way around; showing stuff to visitors that are not displayed for Google bots?
I need to do this because I have photo pages each with the short title and the photo along with textarea containing the embed HTML code. googlebot is taking the embed code and putting it at the page description on its search results which is very ugly.
Please advise.
When you start playing with tricks like that, you need to consider several things.
... showing stuff to visitors that are not displayed for Google bots.
That approach is a bit tricky.
You can certainly check User-agents to see if a visitor is Googlebot, but Google can add any number of new spiders with different User-agents, which will index your images in the end. You will have to constantly monitor that.
Testing of each code release your website will have to check "images and Googlebot" scenario. That will extend testing phase and testing cost.
That can also affect future development - all changes will have to be done with "images and Googlebot" scenario in mind which can introduce additional constraints to your system.
Personally I would choose a bit different approach:
First of all review if you can use any methods recommended by Google. Google provides a few nice pages describing that problem e.g. Blocking Google or Block or remove pages using a robots.txt file.
If that is not enough, maybe restructuring of you HTML would help. Consider using JavaScript to build some customer facing interfaces.
And whatever you do, try to keep it as simple as possible, otherwise very complex solutions can turn around and bite you.
It is very difficult to give you very good advise without knowledge of your system, constraints and strategy. But I hope my answer will help you out to choose good architecture / solution for your system.
Boy, you want more.
Google does not because of a respect therefore judge you cheat, he needs a review, as long as your purpose to the user experience, the common cheating tactics, Google does not think you cheating.
just block these pages with robots.txt and you`ll be fine, it is not cheating - that's why they came with solution like that in the first place

reCAPTCHA vs other captcha systems

What is a good reason to choose reCAPTCHA over a well known and tested captcha generator on the server. Is it just philanthropy (helping with digitizing texts) or are there other good reasons.
reCAPTCHA is rather neat. Not only does it stop spammers but it helps digitize books. Each word that appears in the captcha has actually been scanned in from a book but sometimes the character recognition is off so the computer my save some gibberish of a sentence without knowing any better.
See the image off their site:
By making people type in what they think the word is, it helps create a digital copy of the book or word that was scanned with accuracy while at the same time checking what the user submit, comparing it to other's submissions, and determining if the user is human or not.
For that reason I use reCAPTCHA. I'm not just selfishly protecting my site, I'm providing a service for others.
Not only that but it's fairly simple to implement and provided by a reliable company (Google).
The question was "why should I use it"; that question must include "why shouldn't I use it", so some criticisms:
Recaptcha volunteers your users to be OCR monkeys, without bothering to ask their opinion.
It requires that you advertise recaptcha in the captcha widget, which isn't always appropriate.
It's a web service, which means there's no hard guarantee it'll still exist a week or a year or two years from now. (Google has crippled or removed public, widely-used APIs in the past, such as their translation API.)
It only supports web pages, loading everything with scripts and iframes. It doesn't have a proper API, so if you ever want to have an iOS or Android app that logs into your system, and need to show a captcha there, you'll be out of luck.
You have no control over the complexity of the generated captcha. Captchas always have a tradeoff between how hard they are to read and how difficult they are to OCR. There are no knobs to adjust, based on how important stopping robots is to your use case. If they decide to make the captchas much harder to read (which they've done at times), and this becomes a nuisance to your users, there's nothing you can do about it.
reCAPTCHA is quite good. Most other generators are broken easily while reCAPTCHA usually gets good scores.
Another good thing is that it has the accessiblity button so that it would read the text.
This is an old threat but I would just like to confirm that in my case we used reCAPTCHA on a number of Drupal 6 websites in combination with the Honeypot module. We did that to stop automated spam user registrations.
I presume these user accounts were being created automatically by desktop applications such as SEnuke XCr and XRumer with the aim of then posting spam. They create the user account but they rarely do anything further but I found it annoying. Further reading on this subject can be found here: How to prevent spam user registrations? (links to an article on Drupal.org).
I can confirm that the above reduced my spam user registrations from a little over 100 a day to none at all.
We need to register our IP address on which server would be running. Its seems some what risky. So we might be required to change registration work flow in case of use of reCAPTCHA.

Negative Captchas - help me understand spam bots better

I have to decide a technique to prevent spam bots from registering my site. In this question I am mainly asking about negative captchas.
I came to know about many weaknesses of bots but want to know more. I read somewhere that majority of bots do not render/support javascript. Why is it so? How do I test that the visiting program can't evaluate javascript?
I started with this question Need suggestions/ideas for easy-to-use but secure captchas
Please answer to that question if you have some good captcha ideas.
Then I got ideas about negative captchas here
http://damienkatz.net/2007/01/negative_captch.html
But Damien has written that though this technique likely won't work on big community sites (for long), it will work just fine for most smaller sites.
So, what are the chances of somebody making site-specific bots? I assume my site will be a very popular one. How much safe this technique will be considering that?
Negative captchas using complex honeypot implementations here described here
http://nedbatchelder.com/text/stopbots.html
Does anybody know how easily can it be implemented? Are there some plugins available?
Thanks,
Sandeepan
I read somewhere that majority of bots do not render/support javascript. Why is it so?
Simplicity of implementation — you can read web page source and post forms with just dozen lines of code in high-level languages. I've seen bots that are ridiculously bad, e.g. parsing HTML with regular expressions and getting ../ in URLs wrong. But it works well enough apparently.
However, running JavaScript engine and implementing DOM library is much more complex task. You have to deal with scripts that do while(1);, that depend on timers, external resources, CSS, sniff browsers and do lots of crazy stuff. The amount of work you need to do quickly starts looking like writing a full browser engine.
It's also computationally much much expensive, so probably it's not as profitable for spammers — they can have dumb bot that silently spams 100 pages/second, or fully-featured one that spams 2 pages/second and hogs victim's computer like a typical web browser would.
There's middle ground in implementing just a simple site-specific hack, like filling in certain form field if known script pattern is noticed in the page.
So, what are the chances of somebody making site-specific bots? I assume my site will be a very popular one. How much safe this technique will be considering that?
It's a cost/benefit trade-off. If you have high pagerank, lots of visitors or something of monetary value, or useful for spamming, then some spammer might notice you and decide workaround is worth his time. OTOH if you just have a personal blog or small forum, there's million others unprotected waiting to be spammed.
How do I test that the visiting program can't evaluate javascript?
Create a hidden field with some fixed value, then write a js which increments or changes it and you will see in the response..

What should I check before I release a web application?

I am nearly finished a web application. I need to test it and find the security issues before it release. Is there any methods / guideline to do this kind of testing? Or is there any tools to help me check my application is ready to go online? Thank you.  
I would say:
check that there are no warnings or errors even in strict mode (error report).
In case you store any sensitive data (as passwords, credit cards, etc.) be sure they are encrypted with non-standard algorithms. Use SSL and try to be somehow paranoid with it.
Set your database with specific accesses by action and hosts, and do not use root account.
Perform exhaustive testing (use unit test when possible). Involve as many people you can.
Test it under the main browsers (Firefox, Chrome, Opera, Safari, IE) and if have time in others.
Validate all your HTML/CSS against standards (W3C). (recommendable)
Depends on the platform you are using, there are profilers which can help you identify bottlenecks in your code. (can be done in later stages).
Tune settings for your web server / script language.
Be sure it is search-engine friendly.
Pray once is online :)
This is not a complete list as it depends in:
which language/platform/web server you are using.
what kind of application you developed (social, financial, management, etc.)
who will use that application (the entirely world, an specific company, your family or just you).
are you going to sell it? then you must have at least most of the previous points.
is your application using very sensitive information (as credit cards)? if so, you should pay for some professional (company?) to check your code, settings and methods.
This is just my opinion, take it as it is. I would also like to hear what other people suggests.
Good Luck
As well as what's already been suggested, depending on what type of application it is, you can use a vulnerability scanner to scan your application for any vulnerabilities that could lead to hackers gaining entry.
There are quite a few good scanners out there, but note when using them that the results may or may not be 100%. It's hard to say.
For a list of scanners, commercial and free, see: http://projects.webappsec.org/Web-Application-Security-Scanner-List
For more information on scanners: http://en.wikipedia.org/wiki/Web_Application_Security_Scanner
Good luck.
Here you can find a practical checklist to use before launching a website
http://launchlist.net/
And here is a list of all the stuff you forgot to test
http://www.thebraidytester.com/downloads/YouAreNotDoneYet.pdf

Checklist for testing a new site

What are the most common things to test in a new site?
For instance to prevent exploits by bots, malicious users, massive load, etc.?
And just as importantly, what tools and approaches should you use?
(some stress test tools are really expensive/had to use, do you write your own? etc)
Common exploits that should be checked for.
Edit: the reason for this question is partially from being in SO beta, however please refrain from SO beta discussion, SO beta got me thinking about my own site and good thing too. This is meant to be a checklist for things that I, you, or someone else hasn't thought of before.
Try and break your own site before someone else does. Your web site is basically a publicly accessible API that allows access to a database and other backend systems. Test the URLs as if they were any other API. I like to start by cataloging all URLs that have some sort of permenant affect on the state of the system - this is easy if you are doing Ruby on Rails development or trying to follow a RESTful design pattern. For each of those URLs, try running a GET, POST, PUT or DELETE HTTP methods with different parameters so that you can ensure that you're only giving access to what you want to give access to.
This of course is in addition to obvious: Functional testing, Load Testing, SQL Injection, XSS etc.
Turn off javascript and make sure your site can still be navigated.
Even if you want to ignore the small but significant number of people who have it disabled, this will impact search engines as well.
YSlow can give you a quick analysis of different metrics.
What do friendly bots see (eg: Google); check using Google Webmaster Tools;
Regarding tools for running functional tests of a web pages, I've found that Selenium IDE to be useful.
The Firefox (version 2 only compatible at the moment) plug in lets your capture almost all web events, and save them and replay them in the same browser.
In conjunction with another Firefox https://addons.mozilla.org/en-US/firefox/addon/1843"> Firebug
you can create some very powerful tests.
If you want to set up Selenium Remote Control
you can then convert the Selenium IDE tests into nUnit tests, which you can run automatically.
I use cruise control and run these web tests as part of a daily build.
The nice thing about using Selenium remote control is that it can run the same functional tests on multiple browsers and operating systems, something that you can't do with the IDE.
Although the web tests will take ages to run, there is an version of Selenium called Selenium Grid that lets you use any old hardware you have spare to run the tests in parallel as part of a computing grid. Not tried this myself, but it sounds interesting.
All of the above is open source and free which helped me convince management to use if :-)
For checking the cross browser and cross platform look of your site, browershots.org is maybe the best free tool that can safe a lot of time and costs.
There's seperate stages for this one.
Firstly there's the technical testing, where you check all technical functionality:
SQL injections
Cross-site Scripting (XSS)
load times
stress levels
Then there's the phase where you have someone completely computer-illiterate sit down and ask them to find something. Not only does it show you where there's flaws in your navigational logic (I find that developers look upon things way differently than 'other people') but they're also guaranteed to find some way to break your site.