How to prevent someone from downloading my entire website? - iptables

There is one IP address (from China) that is trying to download my entire website. It downloads all my pages and puts a significant load on the server (I have more than 500,000 pages). Looking at the access logs, I can tell it's definitely not a Google bot or any other search engine bot.
Temporarily I've banned it (using iptables rules), but that's not a solution for me, because some of my real users share the same IP, so they are also banned and cannot access the website.
Is there any way to prevent this kind of "user activity"? Maybe a mechanism that shows a captcha if a client makes more than 5 requests a second, or something similar?
P.S. I'm using Yii framework (PHP).
Any suggestions are greatly appreciated.
Thank you!

You have answered your own question!
Make a captcha appear if the requests exceed a certain number per second or per minute!
You should use CCaptchaAction to implement it, like this:
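A minimal Yii 1.x sketch of the wiring (the controller name and the validation rule are illustrative; the rate-limiting trigger itself is covered in the next answer):

```php
// In your controller: expose the framework's CAPTCHA image action.
class SiteController extends CController
{
    public function actions()
    {
        return array(
            'captcha' => array(
                'class'     => 'CCaptchaAction',
                'backColor' => 0xFFFFFF,
            ),
        );
    }
}

// In the form model's rules(): validate what the user typed.
//   array('verifyCode', 'captcha', 'allowEmpty' => !CCaptcha::checkRequirements()),

// In the view: render the image and the input field.
//   $this->widget('CCaptcha');
//   echo CHtml::activeTextField($model, 'verifyCode');
```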

I guess the best way to monitor for suspicious user activity is really the user session: CWebUser's getState()/setState(). Store the current request time in the user session, compare it to several previous values, and show a captcha if the user makes requests too often.
Create a new component, preload it via CWebApplication::$preload, and check user activity in the component's init() function. This way you'll be able to turn the bot check on and off easily.
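A minimal sketch of such a component (the class name, threshold, and redirect route are assumptions, not part of the framework):

```php
// Preload it in config/main.php:
//   'preload'    => array('botCheck'),
//   'components' => array('botCheck' => array('class' => 'BotCheckComponent')),
class BotCheckComponent extends CApplicationComponent
{
    public $maxPerSecond = 5; // tune to taste

    public function init()
    {
        parent::init();
        $user = Yii::app()->user;
        $now  = microtime(true);
        // Keep only the timestamps from the last second.
        $times = array_filter(
            $user->getState('requestTimes', array()),
            function ($t) use ($now) { return $now - $t < 1.0; }
        );
        $times[] = $now;
        $user->setState('requestTimes', $times);
        if (count($times) > $this->maxPerSecond) {
            // Route to a captcha page; 'site/captcha' is a placeholder.
            Yii::app()->request->redirect(Yii::app()->createUrl('site/captcha'));
        }
    }
}
```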

Counting the number of users or executions of an application

I made a program that gets the data from the clipboard and saves it in a string variable. Then it looks for specific words in that string and generates several URLs. Afterwards it opens the browser and shows each URL in its own tab.
Some of my friends already use this program frequently, and I want to have some statistics about how often. A simple counter variable would be enough, but I need a way to access it.
I came up with two options that could work:
I could send an email to a specific address every time my app is executed. Then I could track the number of uses by manually or automatically counting the emails in the mailbox. I think this would be a very dirty solution.
I could create and publish a website containing a counter that my application refreshes. This solution is a bit better, I think, but a lot of work for just one single counter.
Do you have better ideas to solve my problem or is one of mine already a good one?
Thank you in advance!
You can use the Google Analytics Measurement Protocol (see the "Measurement Protocol Overview" in Google's documentation). It gives you usage statistics for your application in Google Analytics: you can even see geographic stats, version distribution, and crash reports. It is easy to use from .NET; it boils down to sending an HTTP request to Google.
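For what it's worth, a hit is just a form-encoded POST to Google's collection endpoint. A minimal sketch (shown in PHP, though the same request works from .NET; the tracking ID and client ID are placeholders):

```php
<?php
// Report one "application launched" event via the Measurement Protocol.
$payload = http_build_query(array(
    'v'   => 1,                // protocol version
    'tid' => 'UA-XXXXX-Y',     // your Google Analytics tracking ID (placeholder)
    'cid' => '35009a79-1a05-49d7-b876-2b884d0f825b', // anonymous client ID (placeholder)
    't'   => 'event',          // hit type
    'ec'  => 'app',            // event category
    'ea'  => 'launch',         // event action
));

$context = stream_context_create(array('http' => array(
    'method'  => 'POST',
    'header'  => "Content-Type: application/x-www-form-urlencoded\r\n",
    'content' => $payload,
)));
file_get_contents('https://www.google-analytics.com/collect', false, $context);
```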

What is preventing people from using someone else's CAPTCHA as their own?

Why (other than moral reasons) don't more people use the CAPTCHAs of other sites as their own while selling the solving of said CAPTCHAs?
To me, such a system seems like it would be simple to implement:
set up a script that does something on another website that requires a CAPTCHA to be completed through the use of a proxy service
when a user on your site performs a task that requires the completion of a CAPTCHA, simply serve them the CAPTCHA that the other site asks you to solve
when the user solves the CAPTCHA, your script can perform the desired action on the other site that is the source of the CAPTCHA, and the user on your site is also verified through this process
Is this commonplace? If not, why not? What, if anything, could be done to prevent this?
Fetching the captcha. Assuming you could easily fetch the exact visual of the captcha from the foreign host, you would still have to pass the referer check (most browsers, navigated by humans, do send the HTTP referer). You would also have to save the session_id and the secret from the hidden input.
Checking the result. The foreign host must link the saved variables with the ones associated with the session of your first request, which forces you into tricky cURL gymnastics. You would also have to handle multiple parallel requests, all from your single IP.
Your server will probably use more resources hacking a captcha on a foreign host than it would generating a captcha of its own.
Countermeasures:
http_referer check
limit requests from a single IP to e.g. 5 per minute
good session handling and tricky cookies
it's not impossible to reverse engineer JavaScript, but the more complicated your JavaScript is, the more work an attacker has to invest
the attacker has to find a pattern that recognizes the result on the foreign host; the easiest signature may be the Location header field, leading either to /path/success.html or /path/tryagain.php
Challenge:
I took a moment to prepare an example: http://woisteinebank.de/test/
In this example, I attach keys to session_id() and save them in the database.
Through session_regenerate_id(), I get a fresh session on every request.
In check.php, I compare the database values to the $_GET values.
Try to find a way to leech this captcha, and I'll try to defend. Every time you successfully use my captcha on your site, I'll adjust the defense.
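For illustration, a rough sketch of the check.php logic described above (this is not the actual code behind the link; the table name and connection details are placeholders):

```php
<?php
// check.php - compare the secret stored for this session against $_GET.
session_start();
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass'); // placeholders

// Fetch the secret that was saved when the captcha was generated.
$stmt = $pdo->prepare('SELECT secret FROM captcha_keys WHERE session_id = ?');
$stmt->execute(array(session_id()));
$secret = $stmt->fetchColumn();

$ok = ($secret !== false && isset($_GET['secret']) && $secret === $_GET['secret']);

// Hand out a fresh session id so a replayed request cannot reuse the old one.
session_regenerate_id(true);

echo $ok ? 'pass' : 'fail';
```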

May we use Yii flash messages in this scenario?

I haven't seen this scenario covered here:
Yii Framework: How to work with Flash Messages.
So, after user registration, I wish to redirect the user to a thank-you page where he/she can read more about what he/she should do and what will happen next. It's a fair amount of information, so adding that message to an already existing page is not an option, because it would get too noisy. A temporarily displayed message isn't an option either, because it's a fair amount of text to be read.
In cases like this:
Should we still use flash messages, with a conditional so that what normally exists on the page stays hidden while a success flash message is displayed?
OR
Should we simply redirect to a dedicated thank-you view (by creating the respective thankyou action)?
Is there a better option?
You could use a flash message. But these are really for things like "Your account is now created".
If you want to include a good amount of information, I think it best to have a separate thankyou action/view that people are redirected to after the sign up process is complete.
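A minimal Yii 1.x sketch of that approach (the controller, action, and view names are illustrative):

```php
class UserController extends CController
{
    public function actionRegister()
    {
        // ... validate and save the new user ...
        $this->redirect(array('user/thankyou'));
    }

    public function actionThankyou()
    {
        // Renders protected/views/user/thankyou.php with all the
        // "what happens next" content.
        $this->render('thankyou');
    }
}
```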

From email to shared hosted backend to remote frontend

So my friend hosts a little get together every once in a while where space is limited to the first 14 people who RSVP. He emails the invite out to a list and then accepts the first people who respond. Tonight I barely got in because I can't always check my email, so I told him that I would write a program that would respond instantly to his request. This would not normally be a problem (autoresponder, easy) except he has recently created an online signup form. I think it would be funny for him to send out his next invite and get a sub-100ms response from me, so I would like to give this a try.
The problem is, I'm not quite sure how to go about it without going to too much expense. I have a personal site that can host some .NET backend code, but it's on a shared GoDaddy server, so I don't really have much access to the mail server or anything. I was thinking that if I could get an email sent to a certain address, maybe it could trigger a web request that could pull down his page, fill out the (very simple, like 2 or 3 inputs) form, and submit it, but again, I'm not quite sure how.
Would anyone have an idea about how I could go about this? I would want for this to happen automatically without any sort of interaction from me, just basically as soon as I get an email from a certain email address, somehow my code is triggered and the form filled out and submitted.
This is just for fun, but the programmer in me is curious as to how I could actually get this to work.
Thanks!
The most affordable option I know of is NearlyFreeSpeech.NET. If you set up an account there, you can configure a domain with email forwarding for 3 cents/day. They have an option to forward the email to a script, so you could write something that looks at the mail, pulls down the form, and posts to the server.
I'm not sure, but I think the script has to run on their servers, so you'll have to set up a website (another few cents per day) and write the script for a UNIX environment (PHP or Perl or such). If you insist on .NET, you could write a minimal PHP script that forwards the data to your GoDaddy account.
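For illustration, a sketch of such a mail-triggered script in PHP, assuming the host pipes the raw message to the script's stdin (the sender address, form URL, and field names are all placeholders):

```php
#!/usr/bin/env php
<?php
// Read the forwarded message from stdin.
$mail = file_get_contents('php://stdin');

// Only react to the friend's invitations.
if (stripos($mail, 'From: friend@example.com') === false) {
    exit(0);
}

// Post the RSVP form directly; the fields are known in advance,
// so there is no need to fetch and parse the page first.
$ch = curl_init('http://example.com/rsvp.php');
curl_setopt_array($ch, array(
    CURLOPT_POST           => true,
    CURLOPT_POSTFIELDS     => http_build_query(array(
        'name'  => 'Me',
        'email' => 'me@example.com',
    )),
    CURLOPT_RETURNTRANSFER => true,
));
curl_exec($ch);
curl_close($ch);
```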

Figure out if a website has restricted/password protected area

I have a big list of websites and I need to know if they have areas that are password protected.
I am thinking about doing this: downloading all of them with httrack and then writing a script that looks for keywords like "Log In" and "401 Forbidden". But the problem is these websites are all different, some static and some dynamic (HTML, CGI, PHP, Java applets...), and most of them won't use the same keywords...
Do you have any better ideas?
Thanks a lot!
Looking for password fields will only get you so far; it won't help with sites that use HTTP authentication. Looking for 401s will catch HTTP authentication, but won't get you sites that don't use it, or ones that don't return a 401. Looking for "log in" links or "username" fields will get you some more.
I don't think that you'll be able to do this entirely automatically and be sure that you're actually detecting all the password-protected areas.
You'll probably want to take a library that is good at web automation and write a little program yourself that reads the list of target sites from a file, checks each one, and writes one file of "these are definitely passworded" and one of "these are not". Then you might want to manually check the ones that are not and modify your program to accommodate what you find. Using httrack is great for grabbing data, but it won't help with detection: if you write your own "check for password-protected area" program in a general-purpose language, you can do more checks, and you can avoid generating more requests per site than are necessary to determine that a password-protected area exists.
You may need to ignore robots.txt
I recommend using the Python port of Perl's Mechanize, or whatever nice web-automation library your preferred language has. Almost all modern languages have a nice library for opening and searching through web pages and looking at HTTP headers.
If you are not capable of writing this yourself, you're going to have a rather difficult time using httrack or wget or similar and then searching through responses.
Look for forms with password fields.
You may need to scrape the site to find the login page. Look for links with phrases like "log in", "login", "sign in", "signin", or scrape the whole site (needless to say, be careful here).
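A minimal sketch of the password-field check (the URL is a placeholder; real pages will need error handling):

```php
<?php
// Fetch a page and look for <input type="password"> fields.
$html = file_get_contents('http://example.com/');

$doc = new DOMDocument();
@$doc->loadHTML($html); // suppress warnings from messy real-world markup

$xpath  = new DOMXPath($doc);
$fields = $xpath->query('//input[@type="password"]');

if ($fields->length > 0) {
    echo "Password field found - probably a protected area.\n";
}
```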
I would use httrack with several limits and then search the downloaded files for password fields.
Typically, a login form could be found within two links of the home page. Almost all ecommerce sites, web apps, etc. have login forms that are accessed just by clicking on one link on the home page, but another layer or even two of depth would almost guarantee that you didn't miss any.
I would also limit the speed at which httrack downloads, tell it not to download any non-HTML files, and prevent it from downloading external links. I'd also limit the number of simultaneous connections to the site to 2 or even 1. This should work for just about all of the sites you are looking at, and it should keep you off the hosts.deny list.
You could just use wget and do something like:
wget -A html,php,jsp,htm -S -r http://www.yoursite.com 2> output_yoursite.txt
This will cause wget to download the entire site recursively, but only files whose extensions are listed with the -A option, which avoids heavy files.
The headers end up in output_yoursite.txt (note the 2> redirection: wget writes its log, including the -S header dump, to stderr), which you can then search for 401 status lines; a 401 means that part of the site requires authentication. You can also parse the downloaded files following Konrad's recommendation above.
Looking for 401 codes won't reliably catch protected areas, as sites might not produce links to anything you don't have privileges for; that is, until you are logged in, a site won't show you anything you need to log in for. On the other hand, some sites (ones with all static content, for example) manage to pop up a login dialog box for some pages, so looking for password input tags alone would also miss things.
My advice: find a spider program whose source you can get, add whatever tests (plural) you plan on using, and make it stop at the first positive result. Look for a spider that can be throttled way back, can ignore non-HTML files (perhaps by making HEAD requests and looking at the MIME type), and can work on more than one site independently and simultaneously.
You might try using cURL and just attempting to connect to each site in turn (possibly putting them in a text file, reading each line, trying to connect, and repeating).
You can set up one of the callbacks to check the HTTP response code and do whatever you need from there.
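A minimal sketch of that loop in PHP, reading the status code with curl_getinfo() rather than a callback (the file name and timeout are assumptions):

```php
<?php
// Read the site list, request each one, and report those answering 401.
$sites = file('sites.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

foreach ($sites as $site) {
    $ch = curl_init($site);
    curl_setopt_array($ch, array(
        CURLOPT_NOBODY         => true,  // HEAD request: only the status matters
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_TIMEOUT        => 10,
    ));
    curl_exec($ch);
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    if ($code === 401) {
        echo "$site requires authentication\n";
    }
}
```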