What is preventing people from using someone else's CAPTCHA as their own?

Why (other than moral reasons) don't more people use the CAPTCHAs of other sites as their own while selling the solving of said CAPTCHAs?
To me, such a system seems like it would be simple to implement:
set up a script that does something on another website that requires a CAPTCHA to be completed through the use of a proxy service
when a user on your site performs a task that requires the completion of a CAPTCHA, simply serve them the CAPTCHA that the other site asks you to solve
when the user solves the CAPTCHA, your script can perform the desired action on the other site that is the source of the CAPTCHA, and the user on your site is also verified through this process
Is this commonplace? If not, why not? What, if anything, could be done to prevent this?

Fetching the captcha. Assume one could easily fetch the exact image of the captcha from the foreign host. To do this, you have to pass the referrer check (browsers navigated by humans normally send the http_referer, so the foreign host can require it). You would also have to save the session_id and the secret from the hidden input.
Checking the result. The foreign host must link the saved variables with the ones associated with the session of your first request, which requires you to implement tricky cURL methods. You would have to handle multiple parallel requests, all from your single IP.
Your server will probably use more resources when hacking a captcha on a foreign host than if it generates a captcha on its own.
http_referer check
limit requests for single IP to e.g. 5 / minute
good session handling and tricky cookies
it's not impossible to reverse engineer javascript, but the more complicated your javascript is, ...
You have to find a pattern that recognizes the result on the foreign host. The easiest signature may be the Location header field, leading either to /path/success.html or /path/tryagain.php.
I took a moment to prepare an example: http://woisteinebank.de/test/
In this example, I attach keys to the session_id(); and save it in the database.
Through session_regenerate_id(); I have a fresh session on every request.
In check.php, I compare the database values to the $_GET values.
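The key handling described above can be sketched roughly as follows. This is a minimal Python illustration, not the PHP behind the linked example: the `CaptchaStore` class, its method names, and the in-memory dict standing in for the database are all assumptions made for the sketch.

```python
import hmac
import secrets


class CaptchaStore:
    """In-memory stand-in for the database table (hypothetical helper)."""

    def __init__(self):
        self._pending = {}  # session_id -> (key, expected_answer)

    def issue(self, session_id, expected_answer):
        # Bind a fresh random key to this session for exactly one challenge.
        key = secrets.token_hex(16)
        self._pending[session_id] = (key, expected_answer)
        return key

    def check(self, session_id, key, answer):
        # pop() makes the challenge single-use: a replayed request finds nothing.
        record = self._pending.pop(session_id, None)
        if record is None:
            return False
        stored_key, expected = record
        # Constant-time comparisons on bytes.
        key_ok = hmac.compare_digest(stored_key.encode(), key.encode())
        answer_ok = hmac.compare_digest(expected.encode(), answer.encode())
        return key_ok and answer_ok
```

Because the record is removed on the first check, a leeching site cannot reuse a key it has already relayed once, whether or not the answer was correct.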
Try to find a way to leech this captcha, and I'll try to defend it. Every time you successfully use my captcha on your site, I'll try to defend against it.


In IdentityServer4, how do you securely store the ReturnUrl?

I am developing an IdentityServer4 dotnet core application, so this is as much a dotnet question as an IDS4 question. One example of state I need to maintain between pages (login, signup etc...) is the returnUrl. The application I'm migrating from used to store it in a session variable but, as I understand, unless I run a persistent session strategy, this won't scale well.
So currently, I'm passing it around as a field in each View Model used by each view so it can be returned. Is this a sound approach? I'll be needing other fields to be passed around as well so I'm wondering whether this is a secure and logical way to do it.
So currently, I'm passing it around as a field in each View Model used by each view so it can be returned. Is this a sound approach?
Yes, how you choose to pass it around is up to you, I choose this same approach. You could use TempData, Sessions or even localStorage as an alternative. I think having it in the models (view models) is a good approach because you are explicitly specifying where you want the return url to exist, otherwise it might persist in context that you wouldn't want.
Now the security question because obviously you might be able to see the return url in the browser address field.
As part of the Identity Server 4 setup you specify which return URLs you are allowed to redirect back to, so I don't think there is any harm in having the users see the redirect URL.
Something to consider: if the user shared the URL with someone else in the middle of the authentication process, would that person be able to resume from the point where the initial user stopped? Is this something you want in your app?
If you mean reliably instead of securely, write tests which will provide you with confidence that your code works.

How do I make my selenium tests detectable by the server?

I'm currently working on making a few improvements to our selenium based UI tests. One feature I'm looking for is a reliable way for our website to detect what traffic is coming from our tests, so I can filter this traffic out of our browser usage metrics and logging.
One thought I had is to set a tracking cookie with selenium that I could read server side to append to my logs/metrics making it easier to filter it out. The challenge here is cookies are domain specific, and as far as I know wouldn't be readable from other sites. Cookies are also a finite resource, and given the size/distributed nature of our website it's quite possible to run into a situation where this could blow the size limit on cookies/headers and cause issues in the page.
Is this my best option, or is there another reliable way to detect from my webserver if my page is being automated with selenium? (I'm not trying to combat bots; we have other systems in place to guard against DoS/DDoS attacks.)
When using Chrome, the Selenium driver injects a webdriver property into the browser’s navigator object. This means that for me, adding the following js to my page redirected it to StackOverflow:
// navigator.webdriver is set when the browser is driven by WebDriver
if (navigator.webdriver == true) {
  window.location = "https://stackoverflow.com";
}
So I guess just replace window.location = "https://stackoverflow.com"; with whatever you want, I'm guessing logging the requests somewhere or somehow excluding them from whatever tool you use to measure traffic.
So, the server obviously needs some token that tells it that a session is selenium based. Given that, here is what I would do.
Create a super simple API on your server. Have the test run call that API with the session token of the logged-in user (sending the token is almost always automatic). When the API receives that session token, mark something in the database (a new table, or the same table that stores session IDs, if any).
Have the API flag the session as a test session, and thus not valid for metrics.
This is not a statistically significant impact on any server, so there is no worry about resources or impact.
Should take a very simple code-behind API, a very lightweight table that could simply have a single column with a foreign key to the session id involved. All inserted session IDs in this table, by virtue of existing here, are test sessions.
And then, your metrics recording will need to add a single clause to a query that has effectively "WHERE (SELECT COUNT(sessionId) FROM TestSessionsTable WHERE sessionId = currentIdChecked) = 0"
And that would give you what you need. I am happy to be told of a better solution, but this strikes me as the simplest effort, with the least impact on resources.
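As a rough sketch of the table and the filter clause, here is a self-contained Python example using an in-memory SQLite database. The table names, column names, and the `flag_test_session` / `real_page_views` helpers are hypothetical; your own schema and metrics query will differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sessions (session_id TEXT PRIMARY KEY);
    -- A row here means "this session belongs to a test run".
    CREATE TABLE test_sessions (
        session_id TEXT PRIMARY KEY REFERENCES sessions(session_id)
    );
    CREATE TABLE page_views (session_id TEXT, path TEXT);
""")


def flag_test_session(conn, session_id):
    # What the small API endpoint would do when the test run calls it.
    conn.execute("INSERT OR IGNORE INTO test_sessions VALUES (?)", (session_id,))


def real_page_views(conn):
    # NOT EXISTS expresses the same filter as the COUNT(...) = 0 clause.
    return conn.execute("""
        SELECT v.session_id, v.path FROM page_views AS v
        WHERE NOT EXISTS (
            SELECT 1 FROM test_sessions AS t WHERE t.session_id = v.session_id
        )
        ORDER BY v.session_id, v.path
    """).fetchall()


conn.executemany("INSERT INTO sessions VALUES (?)", [("human",), ("selenium",)])
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("human", "/home"), ("selenium", "/home")],
)
flag_test_session(conn, "selenium")
print(real_page_views(conn))  # prints [('human', '/home')]
```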
As for detecting Webdriver sessions on the client side, you can either use C. Peck's suggestion, or directly call the API from your automation run using the WebDriver's Javascript Executor logic.

remote image embeds: how to handle ones that require authentication?

I manage a large and active forum and we're being plagued by a very serious problem. We allow users to embed remote images, much like how stackoverflow handles images (imgur). However, we don't have a specific set of hosts; images can be embedded from any host with the following code:
and this works fine and dandy... except users can embed an image that requires authentication. The image causes a pop-up to appear, and because the text of authentication pop-ups can be edited, they put something like "please enter your [sitename] username and password here", and unfortunately our users have been falling for it.
What is the correct response to this? I have been considering the following:
Each page load has a piece of Javascript execute that checks each image on the page and its status
Have an authorised list of image hosts
Disable remote embedding completely
The problem is I've NEVER seen this happen anywhere else, yet we're plagued with it, how do we prevent this?
It's more than the password problem. You are also allowing some of your users to carry out CSRF attacks against other users. For example, a user can set up his profile image as [img]http://my-active-forum.com/some-dangerous-operation?with-some-parameters[/img].
The best solution is to -
Download the image server side and store it on the file system/database. Enforce a reasonable maximum file size, otherwise the attacker can make your servers download many GBs of data and hog network and disk resources.
Optionally, verify the file is actually an image
Serve the image using a throw-away domain or IP address. It is possible to create images that masquerade as a jar or applet; serving all files from a throwaway domain protects you from such malicious activity.
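The size limit and the "is it actually an image" step could look something like the Python sketch below. The 2 MiB limit, the `sniff_image` name, and the short list of signatures are assumptions for illustration. A signature check only looks at the first bytes, so a crafted file can still pass it, which is why serving from a throwaway domain remains worthwhile.

```python
MAX_IMAGE_BYTES = 2 * 1024 * 1024  # hypothetical limit, tune for your forum

# Leading "magic" bytes of a few common image formats.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
}


def sniff_image(data):
    """Return the image type if data is within the size limit and starts
    with a known signature, otherwise None."""
    if len(data) > MAX_IMAGE_BYTES:
        return None
    for magic, kind in SIGNATURES.items():
        if data.startswith(magic):
            return kind
    return None
```

In practice the size limit should also be enforced while downloading (stop reading once the limit is passed), not only after the whole body is in memory.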
If you cannot download the images on the server side, create a white list of allowed url patterns (not just domains) on the server side. Discard any urls that don't match this URL pattern.
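A whitelist of URL patterns, as opposed to bare domains, might be sketched like this in Python. The two patterns and the `is_allowed_image_url` helper are hypothetical examples, not a recommended list of hosts.

```python
import re
from urllib.parse import urlsplit

# Hypothetical patterns: scheme, host, path shape and extension are all pinned.
ALLOWED_PATTERNS = [
    re.compile(r"https://i\.imgur\.com/[A-Za-z0-9]+\.(?:png|jpe?g|gif)"),
    re.compile(r"https://images\.example\.org/uploads/[\w-]+\.(?:png|jpe?g|gif)"),
]


def is_allowed_image_url(url):
    """Return True only if the whole URL matches one allowed pattern."""
    try:
        parts = urlsplit(url)
        has_credentials = bool(parts.username or parts.password)
    except ValueError:
        return False
    # Reject embedded credentials, query strings and fragments outright.
    if has_credentials or parts.query or parts.fragment:
        return False
    return any(p.fullmatch(url) for p in ALLOWED_PATTERNS)
```

Matching the full URL rather than just the host is what blocks a URL such as the some-dangerous-operation example above, even if it points at a whitelisted domain.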
You MUST NOT perform any checks in javascript. Performing checks in JS solves your immediate problem, but does not protect you from CSRF. You are still making a request to an attacker-controlled URL from your user's browser, and that is risky. Besides, the performance impact of that approach is prohibitive.
I think you mostly answered your own question. Personally I would have gone for a mix between option 1 and option 2: i.e. create a client-side Javascript which first checks image embed URLs against a set of white-listed hosts. For each embedded URL which is not in that list, do something along these lines, while checking that the server does not return the 401 status code.
This way there is a balance between latency (we attempt to minimize duplicate requests via the HEAD method and domain whitelists) and security.
Having said that, option 2 is the safest one, if your users can accept it.

Prevention from entire website downloading?

There is one IP (from China) which is trying to download my entire website. It downloads all my pages and loads the server significantly (I have more than 500 000 pages). Looking at the access logs I can tell it's definitely not a Google bot or any other search engine bot.
Temporarily I've banned it (using iptables rules), but it's not a solution for me, because some of my real users also have the same IP, so they are also banned and cannot access the website.
Is there any way to prevent such kind of "user activity"? Maybe a mechanism which implements captcha if you try to request more than 5 requests a second or something?
P.S. I'm using Yii framework (PHP).
Any suggestions are greatly appreciated.
thank you!
You have answered your own question!
Make captcha appear if the request exceeds certain number per second or per minute!
You should use CCaptchaAction to implement, like this.
I guess the best way to monitor for suspicious user activity is really user session, CWebUser's getState()/setState(). Store current request time in user session, compare it to several previous values, show captcha if user makes requests too often.
Create a new component, preload it via CWebApplication::$preload and check user activity in the component's init() function. This way you'll be able to turn the bot check on and off easily.
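The "store request times in the session and compare" idea can be sketched independently of Yii. The Python below is only an illustration of the logic: the `session` dict stands in for whatever getState()/setState() persists, and `MAX_REQUESTS`, `WINDOW_SECONDS` and `needs_captcha` are hypothetical names and thresholds.

```python
import time

MAX_REQUESTS = 5      # hypothetical threshold
WINDOW_SECONDS = 1.0  # hypothetical window length


def needs_captcha(session, now=None):
    """Record one request in the session and report whether the client
    has exceeded MAX_REQUESTS within the last WINDOW_SECONDS."""
    now = time.time() if now is None else now
    times = session.get("request_times", [])
    times.append(now)
    # Keep only the timestamps that are still inside the window.
    times = [t for t in times if now - t <= WINDOW_SECONDS]
    session["request_times"] = times
    return len(times) > MAX_REQUESTS
```

A crawler that discards cookies gets a fresh session on every request, so a session-based check like this one would likely need an IP-based counter alongside it.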

Figure out if a website has restricted/password protected area

I have a big list of websites and I need to know if they have areas that are password protected.
I am thinking about doing this: downloading all of them with httrack and then writing a script that looks for keywords like "Log In" and "401 Unauthorized". But the problem is that these websites are all different, some static and some dynamic (html, cgi, php, java-applets...), and most of them won't use the same keywords...
Do you have any better ideas?
Thanks a lot!
Looking for password fields will get you so far, but won't help with sites that use HTTP authentication. Looking for 401s will help with HTTP authentication, but won't get you sites that don't use it, or ones that don't return 401. Looking for links like "log in" or "username" fields will get you some more.
I don't think that you'll be able to do this entirely automatically and be sure that you're actually detecting all the password-protected areas.
You'll probably want to take a library that is good at web automation, and write a little program yourself that reads the list of target sites from a file, checks each one, and writes to one file of "these are definitely passworded" and one of "these are not", and then you might want to go manually check the ones that are not, and make modifications to your program to accommodate. Using httrack is great for grabbing data, but it's not going to help with detection -- if you write your own "check for password protected area" program with a general purpose HLL, you can do more checks, and you can avoid generating more requests per site than would be necessary to determine that a password-protected area exists.
You may need to ignore robots.txt
I recommend using the Python port of Perl's mechanize, or whatever nice web automation library your preferred language has. Almost all modern languages will have a nice library for opening and searching through web pages, and looking at HTTP headers.
If you are not capable of writing this yourself, you're going to have a rather difficult time using httrack or wget or similar and then searching through responses.
Look for forms with password fields.
You may need to scrape the site to find the login page. Look for links with phrases like "log in", "login", "sign in", "signin", or scrape the whole site (needless to say, be careful here).
I would use httrack with several limits and then search the downloaded files for password fields.
Typically, a login form could be found within two links of the home page. Almost all ecommerce sites, web apps, etc. have login forms that are accessed just by clicking on one link on the home page, but another layer or even two of depth would almost guarantee that you didn't miss any.
I would also limit the speed at which httrack downloads, tell it not to download any non-HTML files, and prevent it from downloading external links. I'd also limit the number of simultaneous connections to the site to 2 or even 1. This should work for just about all of the sites you are looking at, and it should keep you off the hosts.deny list.
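Searching downloaded pages for password fields could be done along these lines. This is a small Python sketch using only the standard library; the `PasswordFieldFinder` class and `has_password_field` function are names made up for the example.

```python
from html.parser import HTMLParser


class PasswordFieldFinder(HTMLParser):
    """Sets .found to True once an <input type="password"> tag is seen."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "input":
            attr_map = dict(attrs)
            if (attr_map.get("type") or "").lower() == "password":
                self.found = True


def has_password_field(html_text):
    finder = PasswordFieldFinder()
    finder.feed(html_text)
    return finder.found
```

Run it over each downloaded HTML file and record the site as soon as one page returns True. Forms that are built entirely by JavaScript at runtime will not be seen this way.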
You could just use wget and do something like:
wget -A html,php,jsp,htm -S -r -o output_yoursite.txt http://www.yoursite.com
This will cause wget to download the entire site recursively, but only files whose extensions are listed with the -A option, which in this case avoids heavy files.
The headers will be written to the file output_yoursite.txt (wget prints them as part of its log, which is why -o is used rather than redirecting standard output). You can then parse that file for the status code 401, which means that part of the site requires authentication, and also parse the downloaded files according to Konrad's recommendation.
Looking for 401 codes won't reliably catch them as sites might not produce links to anything you don't have privileges for. That is, until you are logged in, it won't show you anything you need to log in for. OTOH some sites (ones with all static content for example) manage to pop a login dialog box for some pages so looking for password input tags would also miss stuff.
My advice: find a spider program that you can get source for, add in whatever tests (plural) you plan on using and make it stop on the first positive result. Look for a spider that can be throttled way back, can ignore non-HTML files (maybe by making HEAD requests and looking at the MIME type) and can work with more than one site independently and simultaneously.
You might try using cURL and just attempting to connect to each site in turn (possibly put them in a text file and read each line, try to connect, repeat).
You can set up one of the callbacks to check the HTTP response code and do whatever you need from there.
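Whatever client produces the response, the decision made in such a callback might look like the Python sketch below. The `classify_response` function, its category names, and the `LOGIN_HINTS` heuristic are assumptions for illustration; the status code and headers are passed in as plain values, so it can be fed from cURL or any other HTTP library.

```python
# Substrings that suggest a redirect target is a login page (heuristic only).
LOGIN_HINTS = ("login", "log-in", "signin", "sign-in")


def classify_response(status, headers):
    """Classify one HTTP response given its status code and a dict of headers."""
    lowered = {name.lower(): value for name, value in headers.items()}
    # 401 or a WWW-Authenticate header indicates HTTP authentication.
    if status == 401 or "www-authenticate" in lowered:
        return "http-auth"
    if status == 403:
        return "forbidden"
    # A redirect to something that looks like a login page.
    if 300 <= status < 400:
        location = lowered.get("location", "").lower()
        if any(hint in location for hint in LOGIN_HINTS):
            return "login-redirect"
    return "open"
```

For example, `classify_response(401, {"WWW-Authenticate": 'Basic realm="x"'})` returns `"http-auth"`. As the other answers note, an "open" result for the home page does not prove that the site has no protected area deeper in.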