Handling large amounts of file uploads - Any limitations I should know of? - file-upload

I'm building a website that will involve a lot of uploaded files. Hopefully, more than I intend for there to be.
I figured I'd have an uploaded files path and use a UUID as the filename. I was curious if there are any limitations on this? For instance, would storing thousands of files in the one folder on my server create problems?

There are quite many issues that couldappear, from file system limitations to backup problems.
I suggest using the first X characters of tue UUIS as folder name - possibly multiple levels deep (first 4, second 43, third 4). This way you have one structure but can back up folders and move them to different servers if needed later (by using the folders as redirection points).

Related

What's a best approach to create a filestore

This is an open ended question. I have noob understanding of databases but willing to learn whatever is required. Though I believe my problem could be done without learning a lot.
So, here goes the question:
I have large amount of files getting generated in mt projects(depending on the builds) and I need to archive them and also need to reproduce them according to buildNumber if requested by users. I don't expect these requests to be a lot. May be 1-2 requests a day.
For eg: 16GB data per build every week. Most of the files in weekly builds are duplicate. And I don't want to archive them again and again. I prefer to store them only once. There is one caveat that it can happen that the files relative location can change, even though content hasn't changed.
My approach is as follow: Create a hash from each file. Create the key-value pair as fileHash-actual file and store it. Store this information in some kind of manifest file for each build. So, I should be able to create the builds back with correct files/paths etc.
Can it ever happen that 2 different files will ever have the same hash? Can some database help to do it efficiently? I am currently thinking of dumping all files in one folder.
Thanks

Amazon S3, storing large number of files (millions, and many TB of data)

I'll have to store millions of files (many TB in the future) in S3.
Are there any limitations? (not a price :) ), i'm asking about architectural limitations (like - don't store it this way, the other way will be better/faster).
My files are in a hierarchy
/{country}/{number}/{code}/docs
and i checked i can keep them that way (to access them easy thru REST)
(of course i know S3 keeps them internally in other way - not important to me).
So, are there any limitations/pitfalls ?
S3 has no limits that you would hit. The files are not really in folders, they are just strings as locations. Make the folder structure something that is easy for you to keep track of and organize.
You do NOT want to be listing the "folder" contents in S3 to find things.
S3 is slow at giving directory listings, because it's not really directories.
You should be storing either the whole path /{country}/{number}/{code}/docs in a database or the logic should be so repeatable that you can be confident that the file will be in that location.
James Brady gave an excellent and very detailed answer to how s3 treats file storage in a question here https://stackoverflow.com/a/394505/4179009
AWS S3 does definitely have limits to access 100req/sec in case of similar path prefix, see the official docs: http://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html
From the other side a hierarchical approach makes logic complicated. A trade off depends on your requirements, one of good options can be using at least 4 symbols length key (primary id or hash key) in front of URL. In case of having limited number countries try using multiple buckets with country code as a bucket name, it also helps to define a specific physical location if required.

Asset Management: which is the better way to organise user generated files on a web server?

We are in the process of building a system which allows users to upload multiple images and videos to our servers.
The team I'm working with have decided to save all the assets belonging to a user in a folder named using the user's unique identifier. This folder in turn will be a sub-folder of our main assets folder on the file server.
The file structure they have proposed is as follows:
[asset_root]/userid1/assets1
[asset_root]/userid1/assets2
[asset_root]/userid2/assets1
[asset_root]/userid2/assets2
etc.
We are expecting to have thousands or possibly a million+ users in the life time of this system.
I always thought that it wasn't a good idea to have many sub-folders in a single location and suggested a year/month/day approach as follows:
[asset_root]/2010/11/04/userid1/assets1
[asset_root]/2010/11/04/userid1/assets2
[asset_root]/2010/11/04/userid2/assets1
[asset_root]/2010/11/04/userid2/assets2
etc.
Does anyone know which of the above approaches would be better suited for this many assets? Is there a better method to organize images/videos on a server?
The system in question will be an Windows IIS 7.5 with a SAN.
Many thanks in advance.
In general you are correct, in that many file systems impose a limit on the number of files and folders which may be in one folder. If you hit that limit with the number of users you have, your in trouble.
In general, I would simply use a uuid for each image, with some dimension of partitioning. e.g. A hash of ABCDEFGH would end up as [asset_root]/ABC/DEFGH. Using a hash gives you a greater degree of assurance about the number of files which will end up in each folder and prevents you from having to worry about, for example, not knowing which month an image you need was stored in.
I'm presuming your file system is NTFS? IF so, you've got a limit of 4,294,967,295 files on the disk - the limit of files in a folder is the same. If you have on the order of millions of users you should be fine, though you might want to consider having only one folder per user instead of several as your example indicates.

Storing uploaded content on a website

For the past 5 years, my typical solution for storing uploaded files (images, videos, documents, etc) was to throw everything into an "upload" folder and give it a unique name.
I'm looking to refine my methods for storing uploaded content and I'm just wondering what other methods are used / preferred.
I've considered storing each item in their own folder (folder name is the Id in the db) so I can preserve the uploaded file name. I've also considered uploading all media to a locked folder, then using a file handler, which you pass the Id of the file you want to download in the querystring, it would then read the file and send the bytes to the user. This is handy for checking access, and restricting bandwidth for users.
I think the file handler method is a good way to handle files, as long as you know to how make good use of resources on your platform of choice. It is possible to do stupid things like read a 1GB file into memory if you don't know what you are doing.
In terms of storing the files on disk it is a question of how many, what are the access patterns, and what OS/platform you are using. For some people it can even be advantageous to store files in a database.
Creating a separate directory per upload seems like overkill unless you are doing some type of versioning. My personal preference is to rename files that are uploaded and store the original name. When a user downloads I attach the original name again.
Consider a virtual file system such as SolFS. Here's how it can solve your task:
If you have returning visitors, you can have a separate container for each visitors (and name it by visitor login, for example). One of the benefits of this approach is that you can encrypt the container using visitor's password.
If you have many probably one-time visitors, you can have one or several containers with files grouped by date of upload.
Virtual file system lets you keep original filenames either as actual filesnames, or as a metadata for the files being stored.
Next, you can compress the data being stored in the container.

Vb.Net Document Storage

I am attempting to add a document storage module to our AR software.
I will be prompting the user to attach a doc/image to thier account. I will then put a copy of this file into our folder so that we can reference it without having to rely on them keeping the file in its original place. This system is not using a database but instead its using multiple flat files.
I am looking for guidance on how to handle these files once they have attached them to our system.
How should I store these attached files?
I was thinking I could copy the file over to a sub directory then renaming it to a auto-generated number so that we do not have duplicates. The bad thing about this, is the contents of the folder can get rather large.
Anyone have a better way? Should I create directories and store them...?
This system is not using a database but instead its using multiple flat files.
This sounds like a multi-user system. How are you handing concurrent access issues? Your answer to that will greatly influence anything we tell you here.
Since you aren't doing anything special with your other files to handle concurrent access, what I would do is add a new folder under your main data folder specifically for document storage, and write your user files there. Additionally, you need to worry about name collisions. To handle that, I'd name each file there with by appending the date and username to the original file name and taking the md5 or sha1 hash of that string. Then add a file to your other data files to map the hash values to original file names for users.
Given your constraints (and assuming a limited number of total users) I'd also be inclined to go with a "documents" folder -- plus a subfolder for each user. Each file name should include the date to prevent collisions. Over time, you'll have to deal with getting rid of old or outdated files either administratively or with a UI for users. Consider setting a maximum number of files or maximum byte count for each user. You'll also want to handle the files of departed users.