Another answer to the CAPTCHA problem? [closed] - captcha

Closed. This question is off-topic. It is not currently accepting answers.
Want to improve this question? Update the question so it's on-topic for Stack Overflow.
Closed 10 years ago.
Improve this question
Most sites at least employ server access log checking and banning along with some kind of bot prevention measure like a CAPTCHA (those messed-up text images).
The problem with CAPTCHAs is that they poss a threat to the user experience. Luckily they now come with user friendly features like refresh and audio versions.
Anyway, like linux vs windows, it isn't worth the time of a spammer to customize and/or build a script to handle a custom CAPTCHA example that only pertains to one site. Therefore, I was wondering if there might be better ways to handle the whole CAPTCHA thing.
In A Better CAPTCHA Peter Bromberg mentions that one way would be to convert the image to HTML and display it embedded in the page. On http://shiflett.org/ Chris simply asks users to type his name into an input. Examples like this are ways to simplifying the CAPTCHA experience while decreasing the value for spammers. Does anyone know of more good examples I could use or see any problem with the embedded image idea?

Image presented as HTML table is just a technical speed bump. There's no difficulty in extraction of pixels from such document.
IMHO CAPTCHA puts focus on a wrong thing – you're not interested whether there's a human on the other side. You wouldn't like human to spam you either. So take a step back and focus on spam:
Analyze text (look for spammy keywords, use bayesian filtering)
Analyze links (blacklist spammy domains – SURBL, LinkSleeve)
Look at traffic patterns and block floods
There's no single perfectly accurate method, but you can use few of them and weight the result to get pretty close.
Have a look at source code of Sblam! (it's a completely transparent server-side comment spam filter).

Alternatives to captchas are going to be to consider the problem from other angles. The reason for this is because captchas are built around the idea that a human and computer actor can be distinguished. As Artificial intelligence progresses, this will always become an increasingly difficult problem as the gap between computer and human users shrinks.
The technique used here on slashdot is for other users of the site to act as gatekeepers, marking abuse and removing offending posts before they become noticeable to a wide audience.
Another technique is to detect spam-like posts directly, using the same technology used to filter spam from email. Obviously it isn't 100% effective for email, and wont be for other uses, either, but if you can filter out 75% of the spam with very few false positives being filtered, then other techniques will only have to deal with the remaining 25%.
Keep a log of spam-related activity, so that you can track trends about offending ip addresses, content of posts, claimed user agent, and so forth, so that you can block abusive users at a routing level.
In nearly all cases, your users would rather put up with the slight inconvenience of abuse prevention, than the huge inconvenience of a major spam problem.
Ultimately, the arms race between you and spammers is one of cost-benefit. Initially, it will cost spammers close to nothing to spam your site, but you can change that to make it very difficult. Even if they continue to spam your site, the benefit they recieve will never grow beyond a few innocent users falling for their schemes. Once the cost of spamming rises sharply above the benefit, the spammers will go away.
Another way to benefit from that is to allow advertising on your site. Make it inexpensive (but not free, of course) and easy for legitimate advertisers to post responsible marketing material for your users to see. Would be spammers may find that it is a better deal to just pay you a few dollars and get their offering seen than to pursue clandestine methods.
Obviously most spammers won't fit in this category, since that is often more about getting your users to fall victim to malware exploits. You can do your part for that by encouring users to use modern, up to date browsers or plugins so that they become less vulnerable to those same exploits.

This article describes a technique based on hashed field names (changing with each page view) with some of them being honeypot fields (i.e. the request is rejected if they're filled) that are hidden from human users via various techniques.
Basically, it relies on spam scripts not being sophisticated enough to determine which form fields are actually visible. In a way, that is a CAPTCHA, since in order to solve it reliably, not only would they have to implement HTML, CSS and JavaScript fully, they'd also have to recognize when a field is too small to see, colored the same as the background, hidden behind another field, placed outside the browser's viewport, etc.
It's the same basic problem that makes Web Standards a farce: there is no algorithm to determine whether a webpage "looks right" - only a human can decide that.

seen this?
It's a system with cute pictures instead of captcha ;)
But I still think honeypots are a better solution - they're so cheap&easy&invisible

I really think that Dinah hit the nail on the head. The fact seems to be that the beauty of the whole CAPTCHA setup is that there is no standard. Standardizing would only help the market to be more profitable.
Therefore it seems that the best way to handle the CAPTCHA problem is to come up with a fairly hard system for bots to catch that is NOT used by anyone else on the planet. It could be a question system, a very custom image creator, or even a mix of JS calls that only browsers respect.
By the time that your site is big enough for spammers to care you should have the budget to rethink your CAPTCHA setup and optimize it much more. In the mean time we should be monitoring our server logs and banning bad agents, refers, and IP's.
In my case I created a CAPTCHA image that I believe is very different from any other CAPTCHA I have seen. This should do fine for now along side my Apache logs + htaccess banning and Aksimet checking. Maybe I should spend time on a reporting feature as well.

although not a true image captcha, good turing test is asking users a random question - common options are: is ice hot or cold? 5+2= ..? etc.

Related

reCAPTCHA vs other captcha systems

What is a good reason to choose reCAPTCHA over a well known and tested captcha generator on the server. Is it just philanthropy (helping with digitizing texts) or are there other good reasons.
reCAPTCHA is rather neat. Not only does it stop spammers but it helps digitize books. Each word that appears in the captcha has actually been scanned in from a book but sometimes the character recognition is off so the computer my save some gibberish of a sentence without knowing any better.
See the image off their site:
By making people type in what they think the word is, it helps create a digital copy of the book or word that was scanned with accuracy while at the same time checking what the user submit, comparing it to other's submissions, and determining if the user is human or not.
For that reason I use reCAPTCHA. I'm not just selfishly protecting my site, I'm providing a service for others.
Not only that but it's fairly simple to implement and provided by a reliable company (Google).
The question was "why should I use it"; that question must include "why shouldn't I use it", so some criticisms:
Recaptcha volunteers your users to be OCR monkeys, without bothering to ask their opinion.
It requires that you advertise recaptcha in the captcha widget, which isn't always appropriate.
It's a web service, which means there's no hard guarantee it'll still exist a week or a year or two years from now. (Google has crippled or removed public, widely-used APIs in the past, such as their translation API.)
It only supports web pages, loading everything with scripts and iframes. It doesn't have a proper API, so if you ever want to have an iOS or Android app that logs into your system, and need to show a captcha there, you'll be out of luck.
You have no control over the complexity of the generated captcha. Captchas always have a tradeoff between how hard they are to read and how difficult they are to OCR. There are no knobs to adjust, based on how important stopping robots is to your use case. If they decide to make the captchas much harder to read (which they've done at times), and this becomes a nuisance to your users, there's nothing you can do about it.
reCAPTCHA is quite good. Most other generators are broken easily while reCAPTCHA usually gets good scores.
Another good thing is that it has the accessiblity button so that it would read the text.
This is an old threat but I would just like to confirm that in my case we used reCAPTCHA on a number of Drupal 6 websites in combination with the Honeypot module. We did that to stop automated spam user registrations.
I presume these user accounts were being created automatically by desktop applications such as SEnuke XCr and XRumer with the aim of then posting spam. They create the user account but they rarely do anything further but I found it annoying. Further reading on this subject can be found here: How to prevent spam user registrations? (links to an article on Drupal.org).
I can confirm that the above reduced my spam user registrations from a little over 100 a day to none at all.
We need to register our IP address on which server would be running. Its seems some what risky. So we might be required to change registration work flow in case of use of reCAPTCHA.

Captcha's + Differnet Possibilities

I wanted to run some captcha possibities past people to see if they are easily by passed by bots etc.
What if colors were used - eg: there is a string of 10 characters are you ask people to type the red characters of where there are 5? Easy to bypass?
I've noticed a captcha on plentyoffish that involves typing in the characters under the circles. This seems a touch more complex - would this be more challenging for bots?
The other idea I was thinking was putting the requirement in an image as well meaning like in no. 1 above - you can put "type the red characters" in an image and this could change with different colors. Any value here?
Interested in what people think.
cheers
Colours are easy to bypass. A bot just takes the red channel and gets the answer. It is even easier than choosing between many possible solutions. The same applies to any noise that has another colour than the letters the user needs to find.
Symbols that don't touch the letters are very easy to ignore. Why would a bot even look at those circles that probably always stay at the same position? (valid but wasn't asked here)
Identifying circles or other symbols is easier than identifying letters, if one can do the latter, a simple symbol is no challenge.
I think captchas are used too frequently in places where they aren't the best tool. For instance, are you trying to prevent registration spam? Why use a captcha rather than email validation?
What are your intentions and have you considered alternatives to the (relatively ineffective) captcha technology?
As a side note, if you have to use them, I prefer KittyAuth myself :) http://thepcspy.com/kittenauth/#5
Color blind people will have trouble separating red from green letters. People who have trouble reading and understanding descriptions, or have other disabilities may have trouble reading the captchas too.
In some of these, the texts are so mangled that almost everyone has a hard time reading them.
I think captcha's, if used at all, should be quite easy to read. The one with the dots and triangles is doable, although it's a matter of time before someone writes an algorithm to hack them. It is very easy for computers to read this kind too.
The best way to deal with this, is increase moderation. Make your site so that it isn't rewarding to spam it at all. Don't make it the problem of your users.
Also, if you're gonna use captcha's, it may be better to build something yourself than to use common libraries. I've found that these are easier hacked, probably because it is more rewarding to write a captcha solver for something that is used by thhousands of sites.
No matter which CAPTCHA you construct, spammers will find a way to work around it, given enough incentive. Large CAPTCHA services like reCAPTCHA, for instance, get bypassed by outsourcing solving them to cheap labor in India(source).
If you run a small site, your best bet is to make your own mini-CAPTCHA, which asks a simple question. If it isn't a standard question, isn't a standard CAPTCHA module and isn't a large site, it isn't worth it for the spammers to automate bypassing it.
I've been working on a community site for an organization at my university, and we've had trouble with spammers registering, despite us using every CAPTCHA module in the book. As soon as we made our own simple one-question CAPTCHA, all spam stopped. The key to preventing this sort of spam often lies in uniqueness.

Negative Captchas - help me understand spam bots better

I have to decide a technique to prevent spam bots from registering my site. In this question I am mainly asking about negative captchas.
I came to know about many weaknesses of bots but want to know more. I read somewhere that majority of bots do not render/support javascript. Why is it so? How do I test that the visiting program can't evaluate javascript?
I started with this question Need suggestions/ideas for easy-to-use but secure captchas
Please answer to that question if you have some good captcha ideas.
Then I got ideas about negative captchas here
http://damienkatz.net/2007/01/negative_captch.html
But Damien has written that though this technique likely won't work on big community sites (for long), it will work just fine for most smaller sites.
So, what are the chances of somebody making site-specific bots? I assume my site will be a very popular one. How much safe this technique will be considering that?
Negative captchas using complex honeypot implementations here described here
http://nedbatchelder.com/text/stopbots.html
Does anybody know how easily can it be implemented? Are there some plugins available?
Thanks,
Sandeepan
I read somewhere that majority of bots do not render/support javascript. Why is it so?
Simplicity of implementation — you can read web page source and post forms with just dozen lines of code in high-level languages. I've seen bots that are ridiculously bad, e.g. parsing HTML with regular expressions and getting ../ in URLs wrong. But it works well enough apparently.
However, running JavaScript engine and implementing DOM library is much more complex task. You have to deal with scripts that do while(1);, that depend on timers, external resources, CSS, sniff browsers and do lots of crazy stuff. The amount of work you need to do quickly starts looking like writing a full browser engine.
It's also computationally much much expensive, so probably it's not as profitable for spammers — they can have dumb bot that silently spams 100 pages/second, or fully-featured one that spams 2 pages/second and hogs victim's computer like a typical web browser would.
There's middle ground in implementing just a simple site-specific hack, like filling in certain form field if known script pattern is noticed in the page.
So, what are the chances of somebody making site-specific bots? I assume my site will be a very popular one. How much safe this technique will be considering that?
It's a cost/benefit trade-off. If you have high pagerank, lots of visitors or something of monetary value, or useful for spamming, then some spammer might notice you and decide workaround is worth his time. OTOH if you just have a personal blog or small forum, there's million others unprotected waiting to be spammed.
How do I test that the visiting program can't evaluate javascript?
Create a hidden field with some fixed value, then write a js which increments or changes it and you will see in the response..

What are the common sense SEO practices that aren't dodgy or crap? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 5 years ago.
Improve this question
In SEO there are a few techniques that have been flagged that need to avoided at all costs. These are all techniques that used to be perfectly acceptable but are now taboo. Number 1: Spammy guest blogging: Blowing up a page with guest comments is no longer a benefit. Number 2: Optimized Anchors: These have become counterproductive, instead use safe anchors. Number 3: Low Quality Links: Often sites will be flooded with hyperlinks that take you to low quality Q&A sites, don’t do this. Number 4: Keyword Heavy Content: Try and avoid too many of these, use longer well written sections more liberally. Number 5: Link-Back Overuse: Back links can be a great way to redirect to your site but over saturation will make people feel trapped
Content, Content, CONTENT! Create worthwhile content that other people will want to link to from their sites.
Google has the best tools for webmasters, but remember that they aren't the only search engine around. You should also look into Bing and Yahoo!'s webmaster tool offerings (here are the tools for Bing; here for Yahoo). Both of them also accept sitemap.xml files, so if you're going to make one for Google, then you may as well submit it elsewhere as well.
Google Analytics is very useful for helping you tweak this sort of thing. It makes it easy to see the effect that your changes are having.
Google and Bing both have very useful SEO blogs. Here is Google's. Here is Bing's. Read through them--they have a lot of useful information.
Meta keywords and meta descriptions may or may not be useful these days. I don't see the harm in including them if they are applicable.
If your page might be reached by more than one URL (i.e., www.mysite.com/default.aspx versus mysite.com/default.aspx versus www.mysite.com/), then be aware that that sort of thing sometimes confuses search engines, and they may penalize you for what they perceive as duplicated content. Use the link rel="canoncial" element to help avoid this problem.
Adjust your site's layout so that the main content comes as early as possible in the HTML source.
Understand and utilize your robots.txt and meta robots tags.
When you register your domain name, go ahead and claim it for as long of a period of time as you can. If your domain name registration is set to expire ten years from now rather than one year from now, search engines will take you more seriously.
As you probably know already, having other reputable sites that link to your site is a good thing (as long as those links are legitimate).
I'm sure there are many more tips as well. Good luck!
In addition to having quality content, content should be added/updated regularly. I believe that Google (an likely others) will have some bias toward the general "freshness" of content on your site.
Also, try to make sure that the content that the crawler sees is as close as possible to what the user will see (can be tricky for localized pages). If you're careless, your site may be be blacklisted for "bait-and-switch" tactics.
Don't implement important text-based
sections in Flash - Google will
probably not see them and if it does,
it'll screw it up.
Google can Index Flash. I don't know how well but it can. :)
A well organized, easy to navigate, hierarchical site.
There are many SEO practices that all work and that people should take into consideration. But fundamentally, I think it's important to remember that Google doesn't necessarily want people to be using SEO. More and more, google is striving to create a search engine that is capable of ranking websites based on how good the content is, and solely on that. It wants to be able to see what good content is in ways in which we can't trick it. Think about, at the very beginning of search engines, a site which had the same keyword on the same webpage repeated 200 times was sure to rank for that keyword, just like a site with any number of backlinks, regardless of the quality or PR of the sites they come from, was assured Google popularity. We're past that now, but is SEO is still , in a certain way, tricking a search engine into making it believe that your site has good content, because you buy backlinks, or comments, or such things.
I'm not saying that SEO is a bad practice, far from that. But Google is taking more and more measures to make its search results independant of the regular SEO practices we use today. That is way I can't stress this enough: write good content. Content, content, content. Make it unique, make it new, add it as often as you can. A lot of it. That's what matters. Google will always rank a site if it sees that there is a lot of new content, and even more so if it sees content coming onto the site in other ways, especially through commenting.
Common sense is uncommon. Things that appear obvious to me or you wouldn't be so obvious to someone else.
SEO is the process of effectively creating and promoting valuable content or tools, ensuring either is totally accessible to people and robots (search engine robots).
The SEO process includes and is far from being limited to such uncommon sense principles as:
Improving page load time (through minification, including a trailing slash in URLs, eliminating unnecessary code or db calls, etc.)
Canonicalization and redirection of broken links (organizing information and ensuring people/robots find what they're looking for)
Coherent, semantic use of language (from inclusion and emphasis of targeted keywords where they semantically make sense [and earn a rankings boost from SE's] all the way through semantic permalink architecture)
Mining search data to determine what people are going to be searching for before they do, and preparing awesome tools/content to serve their needs
SEO matters when you want your content to be found/accessed by people -- especially for topics/industries where many players compete for attention.
SEO does not matter if you do not want your content to be found/accessed, and there are times when SEO is inappropriate. Motives for not wanting your content found -- the only instances when SEO doesn't matter -- might vary, and include:
Privacy
When you want to hide content from the general public for some reason, you have no incentive to optimize a site for search engines.
Exclusivity
If you're offering something you don't want the general public to have, you need not necessarily optimize that.
Security
For example, say, you're an SEO looking to improve your domain's page load time, so you serve static content through a cookieless domain. Although the cookieless domain is used to improve the SEO of another domain, the cookieless domain need not be optimized itself for search engines.
Testing In Isolation
Let's say you want to measure how many people link to a site within a year which is completely promoted with AdWords, and through no other medium.
When One's Business Doesn't Rely On The Web For Traffic, Nor Would They Want To
Many local businesses or businesses which rely on point-of-sale or earning their traffic through some other mechanism than digital marketing may not want to even consider optimizing their site for search engines because they've already optimized it for some other system, perhaps like people walking down a street after emptying out of bars or an amusement park.
When Competing Differently In An A Saturated Market
Let's say you want to market entirely through social media, or internet cred & reputation here on SE. In such instances, you don't have to worry much about SEO.
Go real and do for user not for robots you will reach the success!!
Thanks!

Are you human? (or How to prevent spam) [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
What mechanisms do you know that prevent your site from being abused by anonymous spammers.
For example, let's say that I have a site where people can vote something. But I don't want someone to spam something all the way to the top. So I found (a) creating an account and only allowed to vote once and (b) CAPTCHA to decrease spam. What other methods do you know and how good do they work?
From xkcd
The big thing I've noticed is that whatever you do, you want your system to be unique. You want an attacker to have to tailor their automation program for your specific site, rather than just throw a pre-existing script at it that will work almost anywhere. It doesn't even have to be cryptographically secure; it just has to make your site a little different from the norm.
This doesn't mean you can't or shouldn't use something like a pre-built captcha widget. Absolutely do use one of those as a staring point! It just means you have to customize it somewhere so that something extra happens that is outside the norm and will break any pre-existing script that could normally defeat it.
If your site gets big enough that you have attackers targeting it specifically, then your simple little customization probably won't hold up anymore and you might have do something a little more special and think about real cryptography and all that. But that's one of those things that's a "good" problem to have.
For a CAPTCHA system, I heartily recommend reCAPTCHA.
Traditional computer-generated CAPTCHAs will eventually be broken by developing a sufficiently intelligent system. For instance, here's someone who claims to break the Google CAPTCHA, formerly considered unbreakable, with a 30% hit rate. reCAPTCHA, by definition, shows you only images that cannot be recognized by optical character recognition.
And at the same time, your users' effort will be directed towards the common good - they help digitize books by recognizing words that cannot be recognized automatically.
See here for further explanation and to try it out.
From Quantum Random Bit Generator Service, via MNeylon
Limit the number of votes per IP address per time
Block anonymizing proxies.
For voting: How about shuffling the value that has to be returned by the form on a "per session basis". Once "1" means the first item, "2" means the second. Then "77" means the first item, "812" means the second, ... could be some simple maths behind the scene, but it prevents users from just sending the same HTTP query over and over again.
What's worked for me very well: Use AJAX forms, not simple HTTP forms. Technically it's not much more complicated to fake votes, but I have written a simple blog software and it's only SPAM protection mechanism is to submit the comments via AJAX - no SPAM so far.
I'm a fan of the "hidden field" CAPTCHA. I don't remember where I read about it, but the idea is this:
create your form as normal
add an extra field but hide it (i.e. style="display:none" on the surrounding div or table row)
after submission, if the field is blank, do the appropriate action (eg send an email); if the field has been filled in, then it's a robot submitter
The only case where this falls down is if the user's browser doesn't handle CSS (or they have it switched off), which is very rare.
Charge for votes, like they do on some television "talent" shows, and get spammed all the way to the bank!
Seriously, this is a really tough problem, and someday (maybe soon, if you listen to Ray Kurzweil), computers will do testing to screen out humans. The answers I'm adding to the list have obvious drawbacks, but just for the sake of enumeration: moderation (have humans do the testing), and IP-based tracking (limit the number of votes from a host).
stackoverflow has a few features that help with this; I think the single most useful step you can take is disabling the ability of anonymous users and new accounts to vote. This way, no one can sign up for hundreds of accounts and use their one vote to overpower other users. I'd say requiring a few posts or membership for a certain period of time are both decent options.
Some would say you could allow one vote per IP address to help address this, but I've played plenty of games where malicious users with a nigh-infinite number of proxies defied IP address-based security. It's a deterrent, but a savvy user will get around it easily.
This is the study area of Human Computation.
there is an excellent video from Luis von Ahn here:
http://video.google.com/videoplay?docid=-8246463980976635143
There's a few ideas in the answers to the Best non-image based CAPTCHA? question if you haven't seen it already.
I normally use a combination of the two: anonmous user is free to browse everything, but if he wants to vote, then he has to register.
In the registration process, depending on the situation, I use an optin thru mail (to complete registration and confirm that at least the mailbox exists) and/or a CAPTCHA.
From that point on you can decide if the user can vonte more than once, or any other rule.
Btw I'm not a fan of the IP-based constraints: there are a lot of situation in which big organization's network use few IP for all their users, so the risk to block users that could vote is high.