How Safe is an Obscure File Download Link? - apache

Here's what I'm trying to do:
I want to distribute my vCard (.vcf) file by hosting it on my personal website (this part is a rigid requirement). People will access it from a QR code on my business card; however, no links to the file will exist on my webpages.
I want to make the file publicly accessible, while ensuring that it doesn't get scraped by a bot. It will be contained in a folder disallowed from "normal" bots via robots.txt, and I will disable directory listings in Apache.
I do NOT want to introduce additional steps such as captchas or authentication.
My thought is something like how Google Drive does public sharing: a 44-character random string that represents the file. So...
http://mywebsite.com/private/34599771831821330576336168849178778047996955.vcf
My questions are:
1) How safe is this? Presumably, as long as I disable directory listing on Apache, the only way a bot can stumble on the file without a direct link is via random guessing. Do bots really bother trying to do such a thing?
2) If it's safe, presumably string length is key. Just how long does the string need to be to make it "safe"?
3) Is there a better way to do this than filename obscurity?
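For question 2, a rough way to reason about length is to count bits of entropy. The sketch below assumes the name is drawn uniformly at random from a cryptographically secure source; the helper names `entropy_bits` and `random_vcf_name` are made up for illustration. It estimates the entropy of the 44-digit example name and generates an alternative name with Python's `secrets` module.

```python
import math
import secrets


def entropy_bits(length: int, alphabet_size: int) -> float:
    """Bits of entropy in a uniformly random string of the given length."""
    return length * math.log2(alphabet_size)


def random_vcf_name(num_bytes: int = 32) -> str:
    """Return a hard-to-guess file name built from num_bytes random bytes."""
    # token_urlsafe encodes the random bytes with a URL-safe base64 alphabet.
    return secrets.token_urlsafe(num_bytes) + ".vcf"


# The 44-digit decimal name from the question: 44 * log2(10) bits.
digits_bits = entropy_bits(44, 10)

# A name carrying 256 bits of randomness (32 bytes).
name = random_vcf_name()
```

Note that this only addresses guessing. The name can still leak through other channels, for example if someone shares the URL publicly.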

Yes, there is a better way. It is called recaptcha.
The idea should be to present the user with the captcha and if he/she/it solves it correctly, then you proceed to the download.
https://www.google.com/recaptcha/intro/index.html

Related

Is it possible for others to find images on my server that aren't referenced on my website?

If I upload a file to my webserver, is it possible for anyone or any crawler of some sort to find that file even though I haven't linked to it from anywhere or referenced to it?
Say for example you have a site that hides content to non logged in users, if I know the path to an image file I am able to reach that file even though I am not logged in. This is the case of several sites I regularly visit. But is this really a problem, is it possible for people with bad intentions to find these images even though they can't log in?
My next question would of course be (maybe that's another thread though): how can I as a web developer, using a LAMP stack, protect file paths from being requested from non logged in users?

Sub-domain vs Sub-directory to block from crawlers

I've googled a lot and read a lot of articles, but got mixed reactions.
I'm a little confused about which is a better option if I want a certain section of my site to be blocked from being indexed by Search Engines. Basically I make a lot of updates to my site and also design for clients, I don't want all the "test data" that I upload for previews to be indexed to avoid the duplicate content issue.
Should I use a sub-domain and block the whole sub-domain
or
Create a sub-directory and block it using robots.txt.
I'm new to web design and was a little insecure about using sub-domains. I read somewhere that it's a somewhat advanced procedure and that even a tiny mistake could have big consequences. Moreover, Matt Cutts has also mentioned something similar (source):
"I’d recommend using sub directories until you start to feel pretty
confident with the architecture of your site. At that point, you’ll be
better equipped to make the right decision for your own site."
But on the other hand I'm hesitant on using robots.txt as well as anyone could access the file.
What are the pros and cons of both?
For now I am under the impression that Google treats both similarly and it would be best to go for a sub-directory with robots.txt, but I'd like a second opinion before "taking the plunge".
Either you ask bots not to index your content (→ robots.txt) or you lock everyone out (→ password protection).
For this decision it's not relevant whether you use a separate subdomain or a folder. You can use robots.txt or password protection for both. Note that the robots.txt always has to be put in the document root.
Using robots.txt gives no guarantee; it's only a polite request. Polite bots will honor it, others will not. Human users will still be able to visit your "disallowed" pages. Even those bots that honor your robots.txt (e.g. Google) may still link to your "disallowed" content in their search (they won't index content, though).
Using a login mechanism protects your pages from all bots and visitors.
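To illustrate why robots.txt is only a request, here is a minimal sketch of what a polite crawler does before fetching a URL, using Python's standard `urllib.robotparser`. The rules, the bot name `ExampleBot`, and the URLs are assumptions made up for this example. Nothing forces a crawler to run this check at all.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content placed in the document root.
rules = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A polite bot asks before fetching; an impolite one simply skips this step.
allowed_private = parser.can_fetch("ExampleBot", "http://mywebsite.com/private/card.vcf")
allowed_public = parser.can_fetch("ExampleBot", "http://mywebsite.com/index.html")
```

Here `allowed_private` is `False` and `allowed_public` is `True`, but the server itself would still serve both URLs to anyone who requests them.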

Restrict unauthenticated access to files with mod_rewrite and scripting language

I have scavenged for the answers online but none seem to be similar to what I am trying to achieve. As such, I hope that gurus at stackoverflow can help me out.
What is it that I am trying to accomplish?
I want to restrict access to content for non-authorized users. Accessible content to non-authorized users will be specified in a white list. All other content is blacklisted.
What is my environment?
I am running Apache in conjunction with a scripting language very similar to that of PHP. The scripting language will not be known by many but it is Fazzt ( in case you do know and are able to infer the differences of it as compared to PHP... there are no pointers / memory management, decimal values, and binary data ). I have to use this environment due to the nature of the project.
What is happening on the site?
The site authenticates users and stores authentication in sessions. An unauthenticated user is presented with a styled ( contains images, css, js, etc ) webpage. Hence, I need to white-list all of the static images, css, js files in order for them to be available for download by the client browser. Once signed in, broader range of dynamic content is presented ( as such, anything that is not white-listed is automatically black-listed ).
How did I plan to solve the problem?
This is silly but I guess obvious is not always seen. My approach involved mod_rewriting all requests to existing files that do not match .fzt and .fsp pages. The rewrite would go to a scripting file that would check the requested file against the white list. If the file is present in the list, request would get routed directly to the file ( yes, silly me... it would get mod_rewritten again >_< ). If it's not in the list, user's authentication would be checked. If the user is not authenticated, "File not found" HTTP would be returned. Otherwise, the request would be redirected to the file and served ( same folly ).
As you can see, the approach is greatly flawed. However, I am sure something of the nature should be possible... yet, I have not found any proof just yet. What do you think? Is the mod_rewrite / script a completely wrong way of performing this task? How would you do it otherwise? Note that I cannot simply slap .htaccess as the access determined by user authentication that is tracked by Fazzt ( read above, scripting language similar to that of PHP ).
Any suggestions or thoughts would be greatly appreciated!
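The decision the question describes (white list first, then authentication) can be sketched as a small function. This is written in Python rather than Fazzt, and the prefixes in `PUBLIC_PREFIXES` and the names `Decision` and `decide` are hypothetical. It only shows the access decision, not how the file is ultimately delivered.

```python
import posixpath
from enum import Enum


class Decision(Enum):
    SERVE = "serve"
    NOT_FOUND = "not found"


# Hypothetical white list: static assets every visitor may fetch.
PUBLIC_PREFIXES = ("/css/", "/js/", "/images/")


def decide(path: str, authenticated: bool) -> Decision:
    """Decide whether a request for an existing file should be served."""
    # Collapse "../" segments so "/css/../secret" cannot pass as white-listed.
    normalized = posixpath.normpath(path)
    if normalized.startswith(PUBLIC_PREFIXES):
        return Decision.SERVE
    if authenticated:
        return Decision.SERVE
    # Unauthenticated and not white-listed: answer as if the file did not exist.
    return Decision.NOT_FOUND
```

One possible way around the repeated rewrite described above is for the script to send the file contents itself instead of redirecting back to the file's URL. Whether that is practical in Fazzt is not something this sketch assumes.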

How do I implement a secure upload/download area?

I've been asked to create a solution where people log in and are able to upload and download off of our work server. So John uploads a photo, and Jen can download it, for example. They also have to authenticate themselves.
Can someone give me a rough overview of how to implement this? I'm familiar enough with MySQL, C#, and JavaScript.
The rough overview
This should just be a matter of planning out the pieces.
At the very top of the page, put some code that checks whether a user is logged in. If they are, show the rest of the page. If not, show a login form (or redirect to one), check the submitted credentials, and on success set a SESSION cookie or something similar.
Once the user is logged in, on the homepage, you might have a file-upload form and a listing of existing files. How you would style it depends on how many files you expect to have. To keep things extremely simple, you could simply iterate through whatever files are in the upload directory. If you expect many more files than that, you may consider using a db.
Handle a file upload by sanitizing filenames (checking for filetype/filesize if you want to limit those) and putting the file into the directory.
Force the users to download the files (instead of having the browser decide what to do with them) for security purposes. Implementing this on certain filetypes may also be acceptable.
Other thoughts
You probably would not want the users to be able to execute any files, so keeping the file directory hidden would be a good idea.
Keeping track of who uploaded and downloaded what is also doable, but would add another layer of complication to the script.
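The upload and download steps above can be sketched as follows. This is a Python illustration rather than C#, and the allowed extensions, the size limit, and all function names are assumptions chosen for the example, not requirements from the answer.

```python
import os
import re

# Hypothetical limits; adjust to whatever the application should accept.
ALLOWED_EXTENSIONS = {".jpg", ".png", ".pdf"}
MAX_BYTES = 10 * 1024 * 1024


def sanitize_filename(raw_name: str) -> str:
    """Reduce a client-supplied name to a safe, flat file name."""
    # Drop any directory part, including Windows-style separators.
    name = os.path.basename(raw_name.replace("\\", "/"))
    # Replace anything outside a conservative character set.
    name = re.sub(r"[^A-Za-z0-9._-]", "_", name)
    # Avoid hidden files such as ".htaccess" and reject empty names.
    name = name.lstrip(".")
    if not name:
        raise ValueError("empty file name")
    return name


def is_allowed(name: str, size: int) -> bool:
    """Check the file type (by extension) and the file size."""
    extension = os.path.splitext(name)[1].lower()
    return extension in ALLOWED_EXTENSIONS and size <= MAX_BYTES


def download_headers(name: str) -> dict:
    """Headers that ask the browser to save the file instead of rendering it."""
    return {
        "Content-Type": "application/octet-stream",
        "Content-Disposition": f'attachment; filename="{name}"',
    }
```

For example, `sanitize_filename("../../etc/passwd")` yields `passwd`, which `is_allowed` then rejects because it has no permitted extension.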

Figure out if a website has restricted/password protected area

I have a big list of websites and I need to know if they have areas that are password protected.
I am thinking about doing this: downloading all of them with httrack and then writing a script that looks for keywords like "Log In" and "401 Unauthorized". But the problem is that these websites are all different, some static and some dynamic (html, cgi, php, java-applets...), and most of them won't use the same keywords...
Do you have any better ideas?
Thanks a lot!
Looking for password fields will get you so far, but won't help with sites that use HTTP authentication. Looking for 401s will help with HTTP authentication, but won't get you sites that don't use it, or ones that don't return 401. Looking for links like "log in" or "username" fields will get you some more.
I don't think that you'll be able to do this entirely automatically and be sure that you're actually detecting all the password-protected areas.
You'll probably want to take a library that is good at web automation, and write a little program yourself that reads the list of target sites from a file, checks each one, and sorts them into two files: "these are definitely passworded" and "these are not". Then you might want to manually check the ones that are not, and modify your program to accommodate what you find. Using httrack is great for grabbing data, but it's not going to help with detection. If you write your own "check for password protected area" program with a general purpose HLL, you can do more checks, and you can avoid generating more requests per site than would be necessary to determine that a password-protected area exists.
You may need to ignore robots.txt
I recommend using the Python port of Perl's mechanize, or whatever nice web automation library your preferred language has. Almost all modern languages will have a nice library for opening and searching through web pages, and looking at HTTP headers.
If you are not capable of writing this yourself, you're going to have a rather difficult time using httrack or wget or similar and then searching through responses.
Look for forms with password fields.
You may need to scrape the site to find the login page. Look for links with phrases like "log in", "login", "sign in", "signin", or scrape the whole site (needless to say, be careful here).
I would use httrack with several limits and then search the downloaded files for password fields.
Typically, a login form could be found within two links of the home page. Almost all ecommerce sites, web apps, etc. have login forms that are accessed just by clicking on one link on the home page, but another layer or even two of depth would almost guarantee that you didn't miss any.
I would also limit the speed that httrack downloads, tell it not to download any non-HTML files, and prevent it from downloading external links. I'd also limit the number of simultaneous connections to the site to 2 or even 1. This should work for just about all of the sites you are looking at, and it should keep you off the hosts.deny list.
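The password-field check on downloaded pages could be sketched like this, using only Python's standard `html.parser`. The class and function names are made up for illustration, and the sketch assumes the page's HTML has already been fetched and decoded to a string.

```python
from html.parser import HTMLParser


class PasswordFieldFinder(HTMLParser):
    """Records whether the page contains an <input type="password">."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; values keep their case.
        if tag == "input":
            input_type = dict(attrs).get("type") or ""
            if input_type.lower() == "password":
                self.found = True


def has_password_field(html: str) -> bool:
    """Return True if the HTML contains at least one password input."""
    finder = PasswordFieldFinder()
    finder.feed(html)
    return finder.found
```

This will miss login forms that are built by JavaScript after the page loads, so it is one test among several rather than a complete detector.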
You could just use wget and do something like:
wget -A html,php,jsp,htm -S -r -o output_yoursite.txt http://www.yoursite.com
This will cause wget to crawl the entire site recursively, but keep only files whose names end in one of the suffixes listed with the -A option, which in this case avoids heavy files.
The server response headers (printed because of -S) will be written to the log file output_yoursite.txt, which you can then parse for the status code 401, which means that part of the site requires authentication. You can also parse the downloaded files according to Konrad's recommendation.
Looking for 401 codes won't reliably catch them as sites might not produce links to anything you don't have privileges for. That is, until you are logged in, it won't show you anything you need to log in for. OTOH some sites (ones with all static content for example) manage to pop a login dialog box for some pages so looking for password input tags would also miss stuff.
My advice: find a spider program that you can get source for, add in whatever tests (plural) you plan on using and make it stop on the first positive result. Look for a spider that can be throttled way back, can ignore non HTML files (maybe by making HEAD requests and looking at the mime type) and can work with more than one site independently and simultaneously.
You might try using cURL and just attempting to connect to each site in turn (possibly put them in a text file and read each line, try to connect, repeat).
You can set up one of the callbacks to check the HTTP response code and do whatever you need from there.
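The same idea can be sketched with Python's standard `urllib` in place of cURL. The function names and the three labels returned by `classify` are assumptions made up for this example. `probe` makes a real network request, so it should be throttled when run over a long list of sites.

```python
import urllib.error
import urllib.request


def classify(status: int, headers) -> str:
    """Label a response by its status code and authentication header."""
    # 401 with a WWW-Authenticate challenge indicates HTTP authentication.
    if status == 401 and headers.get("WWW-Authenticate"):
        return "http-auth"
    if status in (401, 403):
        return "restricted"
    return "open"


def probe(url: str, timeout: float = 10.0) -> str:
    """Send a HEAD request to url and classify the response."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return classify(response.status, response.headers)
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx; the error object carries code and headers.
        return classify(err.code, err.headers)
```

As the other answers point out, a site that uses a form-based login will come back as "open" here, so this check only covers HTTP authentication and outright refusals.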