Figure out if a website has a restricted/password-protected area

I have a big list of websites and I need to know if they have areas that are password protected.
I am thinking about doing this: downloading all of them with httrack and then writing a script that looks for keywords like "Log In" and "401 Unauthorized". But the problem is that these websites are all different, some static and some dynamic (HTML, CGI, PHP, Java applets...), and most of them won't use the same keywords...
Do you have any better ideas?
Thanks a lot!

Looking for password fields will only get you so far, and won't help with sites that use HTTP authentication. Looking for 401s will help with HTTP authentication, but won't get you sites that don't use it, or ones that don't return a 401. Looking for "log in" links or "username" fields will get you some more.
I don't think that you'll be able to do this entirely automatically and be sure that you're actually detecting all the password-protected areas.
You'll probably want to take a library that is good at web automation and write a little program yourself that reads the list of target sites from a file, checks each one, and writes one file of "these are definitely passworded" and another of "these are not". Then you might want to manually check the ones that are not, and modify your program to accommodate what you find. Using httrack is great for grabbing data, but it's not going to help with detection -- if you write your own "check for password-protected area" program in a general-purpose high-level language, you can do more checks, and you can avoid generating more requests per site than are necessary to determine that a password-protected area exists.
You may need to ignore robots.txt.
I recommend using the Python port of Perl's Mechanize, or whatever nice web automation library your preferred language has. Almost all modern languages will have a nice library for opening and searching through web pages and looking at HTTP headers.
If you are not capable of writing this yourself, you're going to have a rather difficult time using httrack or wget or similar and then searching through responses.
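To make that concrete, here is a minimal sketch of the "check each site and sort the results into files" program, using the requests library rather than mechanize, purely for illustration (sites.txt and the output file names are placeholders):
import requests
def check_http_auth(url):
    # Returns True if the site answers with an HTTP auth challenge,
    # False if not, None if unreachable (check those by hand).
    try:
        resp = requests.get(url, timeout=10)
    except requests.RequestException:
        return None
    return resp.status_code == 401 and "WWW-Authenticate" in resp.headers
with open("sites.txt") as f:
    for url in (line.strip() for line in f if line.strip()):
        result = check_http_auth(url)
        outfile = {True: "passworded.txt", None: "unreachable.txt"}.get(result, "no-auth-found.txt")
        with open(outfile, "a") as out:
            out.write(url + "\n")
This only catches HTTP (Basic/Digest) authentication on the page you request; form-based logins need the HTML checks described in the other answers.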

Look for forms with password fields.
You may need to scrape the site to find the login page. Look for links with phrases like "log in", "login", "sign in", "signin", or scrape the whole site (needless to say, be careful here).
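For example, with BeautifulSoup (assuming you already have the page's HTML as a string), both signals can be checked in a few lines; a sketch:
from bs4 import BeautifulSoup
LOGIN_WORDS = ("log in", "login", "sign in", "signin")
def looks_like_login(html):
    soup = BeautifulSoup(html, "html.parser")
    # Any <input type="password"> is a strong signal.
    if soup.find("input", attrs={"type": "password"}):
        return True
    # Otherwise look for links whose text mentions logging in.
    for a in soup.find_all("a"):
        text = a.get_text(strip=True).lower()
        if any(word in text for word in LOGIN_WORDS):
            return True
    return False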

I would use httrack with several limits and then search the downloaded files for password fields.
Typically, a login form could be found within two links of the home page. Almost all ecommerce sites, web apps, etc. have login forms that are accessed just by clicking on one link on the home page, but another layer or even two of depth would almost guarantee that you didn't miss any.
I would also limit the speed at which httrack downloads, tell it not to download any non-HTML files, and prevent it from downloading external links. I'd also limit the number of simultaneous connections to the site to 2 or even 1. This should work for just about all of the sites you are looking at, and it should keep you off the hosts.deny list.
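Once httrack has produced a mirror, the "search the downloaded files for password fields" step can be a simple walk over the mirror directory; a sketch (mirror_dir is a placeholder for wherever httrack wrote its output):
import os
import re
PASSWORD_FIELD = re.compile(r'<input[^>]+type=["\']?password', re.IGNORECASE)
def mirror_has_login(mirror_dir):
    # Return the first mirrored HTML file containing a password field, if any.
    for root, _dirs, files in os.walk(mirror_dir):
        for name in files:
            if not name.lower().endswith((".html", ".htm")):
                continue
            path = os.path.join(root, name)
            with open(path, encoding="utf-8", errors="ignore") as f:
                if PASSWORD_FIELD.search(f.read()):
                    return path
    return None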

You could just use wget and do something like:
wget -A html,php,jsp,htm -S -r http://www.yoursite.com 2> output_yoursite.txt
This will cause wget to download the entire site recursively, but only files whose extensions are listed with the -A option; in this case we try to avoid heavy files.
The headers will be written to output_yoursite.txt (wget sends them to stderr, hence the 2> redirect), which you can then scan for 401 status lines, meaning that part of the site requires authentication; you can also parse the downloaded files according to Konrad's recommendation.
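Scanning the captured headers is then a tiny job; a sketch:
def log_has_401(logfile):
    # Look for an HTTP status line containing 401 in the wget -S output.
    with open(logfile, encoding="utf-8", errors="ignore") as f:
        return any(line.lstrip().startswith("HTTP/") and " 401" in line for line in f)
print(log_has_401("output_yoursite.txt"))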

Looking for 401 codes won't reliably catch them, as sites might not produce links to anything you don't have privileges for -- that is, until you are logged in, they won't show you anything you need to log in for. On the other hand, some sites (ones with all static content, for example) manage to pop up a login dialog box for some pages, so looking only for password input tags would also miss things.
My advice: find a spider program that you can get the source for, add in whatever tests (plural) you plan on using, and make it stop on the first positive result. Look for a spider that can be throttled way back, can ignore non-HTML files (maybe by making HEAD requests and looking at the MIME type), and can work with more than one site independently and simultaneously.
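The HEAD-request idea might look something like this (the two-second delay is an arbitrary politeness throttle):
import time
import requests
def is_html(url, delay=2.0):
    # Issue a HEAD request first so non-HTML resources can be skipped cheaply.
    time.sleep(delay)
    try:
        resp = requests.head(url, timeout=10, allow_redirects=True)
    except requests.RequestException:
        return False
    return resp.headers.get("Content-Type", "").startswith("text/html")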

You might try using cURL and just attempting to connect to each site in turn (possibly put them in a text file and read each line, try to connect, repeat).
You can set up one of the callbacks to check the HTTP response code and do whatever you need from there.
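From Python, pycurl (the libcurl binding) exposes the response code after the transfer; a rough sketch, reusing the sites.txt placeholder from above:
import pycurl
from io import BytesIO
def status_code(url):
    buf = BytesIO()
    c = pycurl.Curl()
    c.setopt(pycurl.URL, url)
    c.setopt(pycurl.WRITEDATA, buf)  # discard the body into a buffer
    c.setopt(pycurl.FOLLOWLOCATION, True)
    c.setopt(pycurl.TIMEOUT, 10)
    try:
        c.perform()
        return c.getinfo(pycurl.RESPONSE_CODE)
    finally:
        c.close()
with open("sites.txt") as f:
    for url in (line.strip() for line in f if line.strip()):
        if status_code(url) == 401:
            print(url, "requires authentication")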

Related

Basic Web Development Questions (building a working test site)

I am new to this site and to coding. I have taught myself HTML and I understand CSS. I have been putting together a site of mine using my basic knowledge. I have no college experience, but it is MY DREAM to put this site together, so I have done a lot of research and read books to get started, but I have hit a roadblock now. Here is what I have done:
- I have put together all of the front-end pages and design using HTML/CSS. So I have all of the pages that would be involved with the site ready to go, all designed and laid out how I wish them to be.
I guess I would call it the "skeleton" of the site. Any page that a user would be directed to, I have in a folder.
I have put together a little "demo" for myself to mimic a user experience. For example, I created a login page that "looks" how I want it to, but it doesn't actually store or save any logins.
This is my first question:
What is my next step? I admit it sounds stupid, but I am self-taught and I really have the ambition to achieve this; I just cannot figure out where to go from here in order to actually make a functioning site. All I have right now is my HTML "demo", where basically I have to follow a certain path down my site that mimics what a user would do. Right now, when I click the "sign up" button on my HTML form, it just redirects to my "new user" page, and it is the same formula throughout the rest of the demo: I just link my other HTML pages together to sort of give a "user experience". But I REALLY want to be able to have working accounts and saved data.
How do I create/save a user login on my site? Do I need to get a SQL database? Is there a free one to use while I build the site? Honestly, I really need someone who is willing to help me out with the steps in this journey without me sharing my entire site (I wish to keep it to myself), but I understand this is basic web stuff; I am just genuinely lost as to how to take it to the next level. I have all of the HTML done and now I need a way to actually make it work. I would love to talk with someone about this kink in the chain I seem to have found myself in. Thank you so much; I would be grateful. :)
Basically: what programming languages do I need to learn, or, when looking for someone to hire, what should they be skilled in? Any software or sites or databases that I need? Please help!
HTML and CSS are the languages that make up the front end of a website, like you said. In order for your website to have dynamic content (content specific to a user) and the ability to actually process logins, etc., there needs to be a server involved. A webpage is a text document that is interpreted by a browser. HTML makes up the content and CSS tells the browser how you want it to look. What you are missing, primarily, is server scripts, most commonly, in my experience, PHP. You can also include JavaScript for client-side effects.
Specific to your question about a user login, yes, you will need a database. The process should look something like this.
User visits login page
User enters information into an HTML form
User clicks submit
Form is submitted to a server URL using the 'POST' method
Server validates the form content
Server checks database for username or email (whichever you are using)
If the username/email exists, it compares the passwords
Server sends a response back to the client, either good or bad
Once the user is validated, you can redirect the user to the dashboard or user section.
Please keep in mind this is a very simplistic version of events. There are more in-depth steps that need to be taken. For example, your passwords should never be stored in a database as plain text; you should use a one-way hashing algorithm to make them unreadable. Then, when a password is given to the server, it should be hashed and you should compare the hashes. You can also use salts when hashing for more security. The form should be submitted over SSL to prevent man-in-the-middle attacks, etc.
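To make the hashing-and-salting part concrete, here is a minimal sketch using Python's standard library (the same idea applies in PHP or any other server language; the parameters are illustrative, not recommendations):
import hashlib
import hmac
import os
def hash_password(password, salt=None, iterations=200000):
    # PBKDF2 with a random per-user salt; store both the salt and the digest.
    salt = salt if salt is not None else os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, iterations)
    return salt, digest
def verify_password(password, salt, stored_digest, iterations=200000):
    _, digest = hash_password(password, salt, iterations)
    return hmac.compare_digest(digest, stored_digest)  # constant-time comparison
salt, digest = hash_password("correct horse battery staple")
print(verify_password("correct horse battery staple", salt, digest))  # True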
Sounds like you are off to a good start, but in order to make it work you have to add the server logic. Self-teaching will get you as far as you are willing to let it. I taught myself how to do web programming, and now I do it as a business. The Internet is a great resource. There are a ton of great tutorials online that will show you how to do everything I just laid out.

Restrict unauthenticated access to files with mod_rewrite and scripting language

I have scavenged for the answers online but none seem to be similar to what I am trying to achieve. As such, I hope that gurus at stackoverflow can help me out.
What is it that I am trying to accomplish?
I want to restrict access to content for non-authorized users. The content accessible to non-authorized users will be specified in a white list; all other content is blacklisted.
What is my environment?
I am running Apache in conjunction with a scripting language very similar to PHP. The scripting language will not be known by many, but it is Fazzt (in case you do know it and can infer its differences from PHP: there are no pointers / memory management, decimal values, or binary data). I have to use this environment due to the nature of the project.
What is happening on the site?
The site authenticates users and stores authentication in sessions. An unauthenticated user is presented with a styled (images, CSS, JS, etc.) webpage. Hence, I need to white-list all of the static images, CSS and JS files so that they are available for download by the client browser. Once signed in, a broader range of dynamic content is presented (as such, anything that is not white-listed is automatically black-listed).
How did I plan to solve the problem?
This is silly, but I guess the obvious is not always seen. My approach involved mod_rewriting all requests for existing files that do not match .fzt and .fsp pages. The rewrite would go to a scripting file that would check the requested file against the white list. If the file is present in the list, the request would get routed directly to the file (yes, silly me... it would get mod_rewritten again >_<). If it's not in the list, the user's authentication would be checked. If the user is not authenticated, an HTTP "File not found" would be returned. Otherwise, the request would be redirected to the file and served (same folly).
As you can see, the approach is greatly flawed. However, I am sure something of this nature should be possible... yet I have not found any proof just yet. What do you think? Is the mod_rewrite / script combination a completely wrong way of performing this task? How would you do it otherwise? Note that I cannot simply slap on an .htaccess file, as access is determined by user authentication that is tracked by Fazzt (read above: the scripting language similar to PHP).
Any suggestions or thoughts would be greatly appreciated!

Remote image embeds: how to handle ones that require authentication?

I manage a large and active forum, and we're being plagued by a very serious problem. We allow users to embed remote images, much like how Stack Overflow handles images (imgur); however, we don't have a specific set of hosts, and images can be embedded from any host with the following code:
[img]http://randomsource.org/image.png[/img]
and this works fine and dandy... except users can embed an image that requires authentication. The image causes an authentication pop-up to appear, and because the text shown in such pop-ups can be controlled by the remote server, they put something like "please enter your [sitename] username and password here", and unfortunately our users have been falling for it.
What is the correct response to this? I have been considering the following:
Each page load has a piece of Javascript execute that checks each image on the page and its status
Have an authorised list of image hosts
Disable remote embedding completely
The problem is, I've NEVER seen this happen anywhere else, yet we're plagued with it. How do we prevent this?
It's more than the password problem. You are also allowing some of your users to carry out CSRF attacks against other users. For example, a user can set up his profile image as [img]http://my-active-forum.com/some-dangerous-operation?with-some-parameters[/img].
The best solution is to -
Download the image server side and store it on the file system/database. Keep a reasonable maximum file size, otherwise an attacker can push gigabytes of data onto your servers to hog network and disk resources. (A sketch follows below.)
Optionally, verify the file is actually an image
Serve the image using a throw-away domain or IP address. It is possible to create images that masquerade as a jar or applet; serving all files from a throwaway domain protects you from such malicious activity.
If you cannot download the images on the server side, create a white list of allowed URL patterns (not just domains) on the server side, and discard any URLs that don't match it.
You MUST NOT perform any checks in JavaScript. Performing checks in JS solves your immediate problem, but does not protect you from CSRF: you are still making a request to an attacker-controlled URL from your users' browsers, and that is risky. Besides, the performance impact of that approach is prohibitive.
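A sketch of the server-side download with a size cap and a basic content-type check (the 2 MB limit and the function name are placeholders):
import requests
MAX_BYTES = 2 * 1024 * 1024  # refuse anything over ~2 MB
def fetch_image(url):
    resp = requests.get(url, stream=True, timeout=10)
    if resp.status_code != 200:
        raise ValueError("remote host did not return 200")
    if not resp.headers.get("Content-Type", "").startswith("image/"):
        raise ValueError("not served as an image")
    data = b""
    for chunk in resp.iter_content(8192):
        data += chunk
        if len(data) > MAX_BYTES:
            raise ValueError("image too large")
    return data  # store it, then serve it from your own (throwaway) domain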
I think you mostly answered your own question. Personally I would have gone for a mix between option 1 and option 2: i.e. create a client-side Javascript which first checks image embed URLs against a set of white-listed hosts. For each embedded URL which is not in that list, do something along these lines, while checking that the server does not return the 401 status code.
This way there is a balance between latency (we attempt to minimize duplicate requests via the HEAD method and domain whitelists) and security.
Having said that, option 2 is the safest one, if your users can accept it.

How do I implement a secure upload/download area?

I've been asked to create a solution where people log in and are able to upload to and download from our work server. So John uploads a photo and Jen can download it, for example. They also have to authenticate themselves.
Can someone give me a rough overview of how to implement this? I'm familiar enough with MySQL, C#, and JavaScript.
The rough overview
This should just be a matter of planning out the pieces.
At the very top of the page, put some code that checks if a user is logged in. If they are, show the rest of the page. If not, show a login form (or redirect to one), check the submitted credentials once the form is posted, and set a SESSION cookie or something similar.
Once the user is logged in, on the homepage you might have a file-upload form and a listing of existing files. How you style this would depend on how many files you expect to have. To keep things extremely simple, you could simply iterate through whatever files are in the upload directory. If you expect many more files than that, you may want to consider using a database.
Handle a file upload by sanitizing filenames (checking for filetype/filesize if you want to limit those) and putting the file into the directory.
Force the users to download the files (instead of having the browser decide what to do with them) for security purposes. Implementing this only for certain file types may also be acceptable.
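A sketch of the sanitize-and-force-download part, shown with Python/Flask purely for brevity (the same Content-Disposition: attachment idea applies in C# or PHP; UPLOAD_DIR is a placeholder path, and the login/session checks are omitted):
import os
from flask import Flask, request, send_from_directory
from werkzeug.utils import secure_filename
app = Flask(__name__)
UPLOAD_DIR = "/srv/uploads"  # keep this outside the public web root
@app.route("/upload", methods=["POST"])
def upload():
    f = request.files["file"]
    name = secure_filename(f.filename)  # strips path tricks like ../../
    f.save(os.path.join(UPLOAD_DIR, name))
    return "ok"
@app.route("/download/<path:name>")
def download(name):
    # as_attachment=True sends Content-Disposition: attachment,
    # so the browser downloads the file instead of rendering it.
    return send_from_directory(UPLOAD_DIR, secure_filename(name), as_attachment=True)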
Other thoughts
You probably would not want the users to be able to execute any files, so keeping the file directory hidden would be a good idea.
Keeping track of who uploaded and downloaded what is also doable, but would add another layer of complication to the script.

Refresh browser via cron (or not) to a different page on remote request?

I need to display pages in a tutorial fashion. I looked into NetSupport, BeamYourScreen and other possibilities, but I do not want the viewers to download anything. I cannot use GD / send screenshots because of the audio/video instructions embedded in some of the pages.
Basically, I need the ability to "refresh" a user's browser window to a different page via an interface on my end, whether via a form submission, JavaScript or any other type of "controller" that allows me to change the page in the viewer's browser. Perl preferred, but PHP/JavaScript, whatever works and is cross-browser. I set up a simple JavaScript page-forward timer that "works", but page load times and conversation interruptions are a huge factor.
The entire tutorial website will be developed around this ability.
I was looking into curl / cron / wget methods but found little information.
I have seen forum and chat scripts that basically perform a similar task, but there must be a simple(ish) solution in lieu of hacking up another script to suit my needs.
I do not want others to control the pages either. The site really only needs to be accessible during the tutorial; however, it could remain web accessible as long as user interaction was normal whenever it is not being controlled.
The initial site concept is based on instructing people how to properly introduce new pets into a home. It will be operated by a veterinarian who saved my pet's life. I wanted to give something back.
Possible? I really appreciate simple examples etc...
You have no other way but to keep polling the server for "instructions" using JavaScript. No, you can't push anything to the end user's browser; neither curl nor wget can do that.
Mainly, you'll have to set up a simple request/response protocol between the browser and the server.
If you want to go deeper, you can use something like cometd/meteord/etc. If not, a hidden iframe that reloads itself and receives pages with JavaScript code for the needed actions can do the trick.
Another alternative.
With JavaScript polling and a single-character flat file: have a simple one-character flat file holding a single var, and write the server side in Perl (it is faster and uses fewer resources than PHP). The parent page reads a JavaScript variable from the flat file: it hits the flat file and goes wherever the var sets it. The flat file is written to by the controller. Done.
I guess you could also rename an empty flat file and use that as the controller. I am unsure which is faster: opening and reading a specific file, or hitting the directory and returning the file name. On the controller side it is opening and writing to a file vs. renaming a file. Maybe they counter each other in resources and time?
This way the site can act as a normal site. When you want remote users to see a "presentation" (automatically being shown the site pages at the controller's pace), the controller activates polling and tells the viewers to push a start button. This allows a remote instructor to load pages for the viewers at his leisure.
It is a simple solution with nothing really sophisticated going on. No frames are needed either; viewers just need JavaScript enabled.
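A minimal sketch of that request/response protocol on the server side, in Python rather than Perl just for illustration (current_page.txt is the flat file the controller overwrites; the viewers' JavaScript polls /current every few seconds and navigates when the value changes):
from http.server import BaseHTTPRequestHandler, HTTPServer
FLATFILE = "current_page.txt"  # the controller writes the target page here
class PollHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/current":
            self.send_error(404)
            return
        try:
            with open(FLATFILE) as f:
                target = f.read().strip()
        except OSError:
            target = ""
        body = target.encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
if __name__ == "__main__":
    HTTPServer(("", 8000), PollHandler).serve_forever()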
Any better suggestions are welcome!
It occurred to me that what you might want to use is HTTP push technology. Check out the wiki; they have several links. I have never used it myself.