How should I format user uploaded pictures' filenames? - naming-conventions

My website deals with pictures that users upload. I'm kind of conflicted on what my picture filename should consist of. I'm mainly worried about scalability and possibly security. Maybe someone out there deals with the same thing and can tell me what they use on their site?
Currently, my filename convention is
{pictureId}_{userId}_{salt}_{variant}.{fileExt}
where salt is a token generated server-side (not sure why I decided to put this here, maybe for security purposes I don't know) and variant is something like t where it signifies it's a thumbnail. So it would look something like
12332_22_hb8324jk_t.jpg
Please advise, thanks.

In addition to the previous comments, you may want to consider creating a directory hierarchy for your files. Depending on volume and the particular OS hosting the files, you can easily reach a point where you have an unreasonably large number of files in a single directory. There may be limits on the number of files allowed per folder. If you ever need to do any manual QA or maintenance on your files, this may be problematic (especially if such maintenance is not scripted).
I once worked on a project with a high volume of images. We decided to record a subpath in our database in addition to the filename of each file. Our folder names looked like this:
a/e/2/f/9
3/3/2/b/7
Essentially, we created folders 5 deep with a single hex value as the folder name. The depth was probably excessive, but effective. I suppose this could have led to us reaching a limit on the number of folders on a volume (not sure if such a limit exists).
I would also consider storing a drive in addition to a path (assuming you have a bunch of disks for storage). This way you can move images around and then update your database (assuming you have one) as part of the move.
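A minimal sketch of the five-deep hex folder layout described above. The original project does not say how the hex digits were chosen; here they are taken from a SHA-1 hash of the filename, which is an assumption made for illustration. The function names and the "/data/images" root are hypothetical.

```python
import hashlib
from pathlib import PurePosixPath


def hex_subpath(filename: str, depth: int = 5) -> str:
    """Return a subpath such as 'a/e/2/f/9': one hex digit per folder level.

    Assumption: the digits come from a SHA-1 hash of the filename.
    """
    digest = hashlib.sha1(filename.encode("utf-8")).hexdigest()
    return "/".join(digest[:depth])


def storage_path(root: str, filename: str, depth: int = 5) -> PurePosixPath:
    """Combine a storage root, the hex subpath, and the filename."""
    return PurePosixPath(root) / hex_subpath(filename, depth) / filename


# The subpath is what you would record in the database next to the filename.
print(storage_path("/data/images", "12332_22_hb8324jk_t.jpg"))
```

With 16 possible names per level, a depth of 5 gives 16^5 (about a million) leaf folders, which is why the depth above was probably more than needed.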

My 2 pence worth: I would say there is a bit of a conflict between scalability and security in this problem.
If you have real security concerns, then you should not rely at all on the filename of the target image: this is just security through obscurity, and somebody could guess the name eventually (even with your salt idea, which makes it harder).
Instead you should at least have a login mechanism to create a session between client and server, to make sure you can only get at stuff once you have authenticated. Even then stuff is sniffable: if security really is a concern, then I would say you have to use SSL.
Regarding scalability: I would suggest you actually do give your images sequential numbers, and store them in 'bins' of (say) 500 images each. As you fill up a bin, create a new one. Store bin (min-image-id, max-image-id) information in one DB table and image numbers in another: you can then comparatively cheaply find which bin a particular image lives in from its id. This is a fairly common solution for storing lots of docs/images.
You could then map your URLs to the bin + image id; but to avoid the problem noted by Jason Williams (sequential numbering makes it easy to probe), you really should address security separately, as in the first point above.
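A rough sketch of the bin lookup, assuming the bin table has been loaded into memory as (bin_id, min_image_id, max_image_id) rows sorted by minimum id. The BINS data, the function names, and the "binNNNNN" folder naming are all made up for illustration; a real system would read the ranges from the database.

```python
import bisect

# Stand-in for the bins table: (bin_id, min_image_id, max_image_id),
# sorted by min_image_id. Hypothetical data, 500 images per bin.
BINS = [(0, 1, 500), (1, 501, 1000), (2, 1001, 1500)]
_MIN_IDS = [row[1] for row in BINS]


def find_bin(image_id: int) -> int:
    """Return the id of the bin whose [min, max] range contains image_id."""
    i = bisect.bisect_right(_MIN_IDS, image_id) - 1
    if i < 0 or image_id > BINS[i][2]:
        raise KeyError(f"no bin holds image {image_id}")
    return BINS[i][0]


def image_path(image_id: int, ext: str = "jpg") -> str:
    """Map an image id to a relative path inside its bin folder."""
    return f"bin{find_bin(image_id):05d}/{image_id}.{ext}"


print(image_path(501))  # bin00001/501.jpg
```

The binary search keeps the lookup cheap even with many bins. If every bin is exactly the same size, plain integer division on the image id would do the same job without a table.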

You might like to consider replacing the underscores with (e.g.) minuses. (An underscore is a single-character wildcard in SQL LIKE patterns, so you could potentially run into trouble one day in a LIKE comparison.) (And of course, underscores are just plain evil :-)
It looks from your example like you're avoiding spaces and upper-case characters - good move. I'd keep everything lowercase and use case-insensitive comparisons to eliminate any potential case-sensitivity issues with different file systems.
Scalability should be fine as long as you can cope with any number of digits in your user, picture and type IDs. You're very unlikely to hit any filename length limits with this scheme.
Security could be an issue if you use sequential IDs, as someone could potentially tweak the numbers and request a picture they shouldn't be able to access - but the salt should make it virtually impossible for someone to guess the correct filename for another picture. If users can't see/access the internal filename in any way, that may be an unnecessary measure though.
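To illustrate the LIKE wildcard point above, here is a small sketch using Python's built-in sqlite3 module with an in-memory database. The table name, the second filename, and the helper function are invented for the example; other databases have their own escaping rules.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pictures (filename TEXT)")
conn.executemany(
    "INSERT INTO pictures (filename) VALUES (?)",
    [("12332_22_hb8324jk_t.jpg",), ("12332x22xhb8324jkxt.jpg",)],
)


def find(pattern, escaped=False):
    """Run a LIKE query, optionally treating backslash as the escape character."""
    sql = "SELECT filename FROM pictures WHERE filename LIKE ?"
    if escaped:
        sql += " ESCAPE '\\'"
    rows = conn.execute(sql + " ORDER BY filename", (pattern,))
    return [row[0] for row in rows]


# Unescaped: each "_" matches any single character, so both rows come back.
naive = find("12332_22_%")
# Escaped: "\_" matches only a literal underscore, so one row comes back.
strict = find("12332\\_22\\_%", escaped=True)
print(len(naive), len(strict))  # 2 1
```

So underscores are workable as long as every LIKE pattern built from a filename escapes them; switching to minuses just removes the need to remember.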

The first thing to do is to set up a directory structure that models your use case. In your case you have a user that uploads a picture. You would probably have a directory structure like this (probably on a network share somewhere):
-Pictures
    -UserID1
        -PictureID1~^~Variant.jpg
        -PictureID2~^~Variant.jpg
    -UserID2
        -PictureID1~^~Variant.jpg
        -PictureID2~^~Variant.jpg
Pictures - simply the root directory for the following.
UserID - is the database user ID.
PictureID is simply the picture ID from the database (assuming you record the filename of each uploaded picture in a database.)
~^~ - This is simply a delimiter. You can use a one-character or X-character sequence. I like three characters as it is easily handled with the split function and is readily distinguishable in the file name.
Sometimes I like to add the size of the picture to the file name, for example .256.jpg or .1024.jpg.
At any rate, all of this depends on your use case. The most important thing is setting up the directory structure properly. That will make it easier to access/serve and manage the pictures.
You can add any other information you need into the filename as long as it doesn't exceed the maximum filename length on your system.
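A small sketch of building and splitting names with the ~^~ delimiter inside the Pictures/UserID layout shown above. The function names are hypothetical, and the code assumes the delimiter never appears inside a picture ID or variant.

```python
from pathlib import PurePosixPath

DELIM = "~^~"


def build_name(picture_id, variant, ext="jpg"):
    """Join picture id and variant with the delimiter, e.g. '12332~^~t.jpg'."""
    return f"{picture_id}{DELIM}{variant}.{ext}"


def parse_name(filename):
    """Split a stored name back into (picture_id, variant, extension)."""
    stem, _, ext = filename.rpartition(".")
    picture_id, variant = stem.split(DELIM)
    return picture_id, variant, ext


def picture_path(root, user_id, picture_id, variant, ext="jpg"):
    """Place the file under <root>/<user_id>/ as in the layout above."""
    return PurePosixPath(root) / str(user_id) / build_name(picture_id, variant, ext)


print(picture_path("Pictures", 22, 12332, "t"))  # Pictures/22/12332~^~t.jpg
print(parse_name("12332~^~t.jpg"))               # ('12332', 't', 'jpg')
```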

Related

Amazon S3, storing large number of files (millions, and many TB of data)

I'll have to store millions of files (many TB in the future) in S3.
Are there any limitations? (I don't mean price :) ) I'm asking about architectural limitations (like: don't store it this way, the other way will be better/faster).
My files are in a hierarchy
/{country}/{number}/{code}/docs
and I checked that I can keep them that way (to access them easily through REST)
(of course i know S3 keeps them internally in other way - not important to me).
So, are there any limitations/pitfalls ?
S3 has no limits that you would hit. The files are not really in folders, they are just strings as locations. Make the folder structure something that is easy for you to keep track of and organize.
You do NOT want to be listing the "folder" contents in S3 to find things.
S3 is slow at giving directory listings, because they aren't really directories.
You should be storing either the whole path /{country}/{number}/{code}/docs in a database or the logic should be so repeatable that you can be confident that the file will be in that location.
James Brady gave an excellent and very detailed answer on how S3 treats file storage in a question here: https://stackoverflow.com/a/394505/4179009
AWS S3 definitely does have request-rate limits: roughly 100 requests/sec for keys that share a similar path prefix, according to the official docs at the time of writing: http://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html
On the other hand, a hierarchical approach complicates the logic. The trade-off depends on your requirements; one good option is to put a key of at least 4 characters (a primary ID or hash) at the front of the path. If you have a limited number of countries, try using multiple buckets with the country code as the bucket name; this also lets you choose a specific physical location if required.
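One way to sketch the "short key in front of the path" idea is to derive the key name deterministically from the logical hierarchy, so the location can always be recomputed without listing anything. The function name, the choice of SHA-256, and the example values are assumptions for illustration; the answers above do not prescribe a particular hash.

```python
import hashlib


def s3_key(country: str, number: str, code: str, name: str, prefix_len: int = 4) -> str:
    """Build an object key with a short hash in front of the logical path.

    The hash is computed from the logical path itself, so the same inputs
    always give the same key.
    """
    logical = f"{country}/{number}/{code}/docs/{name}"
    prefix = hashlib.sha256(logical.encode("utf-8")).hexdigest()[:prefix_len]
    return f"{prefix}/{logical}"


key = s3_key("pl", "1234", "ab12", "contract.pdf")
print(key)
```

The cost is that keys no longer group by country when browsed, which is the logic trade-off mentioned above.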

Database Schema, pointer to file

This is probably a really simple question, but just making sure. I am designing a database schema and some of the tables should link to files on the file system (PDF, PPT, etc).
How should this be done?
My initial idea is varchar(255) with the absolute/relative path to the file. Is there a better way to do this? I've searched online and found varbinary(max), but not sure if that's what I actually want; I don't wish to actually load any binary into the database, merely to have a pointer to a file.
This depends on the OS and the max length of a valid path. What you are calling a "pointer" is just a text field with the file path, so no different than other character data.
I would usually store the relative path, and have the root folder specified in my application. This way you can move files to a different drive, for example, and not have to update the rows in your db.
The actual data type you choose depends on the dbms you are using. Some databases also provide specific data types for files that you may want to explore, e.g., the FileStream data type introduced in SQL Server 2008.
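As a sketch of the relative-path approach, the application can hold the root and join it with the value read from the database column. The root "/srv/app/files" and the function name are placeholders; the check against absolute paths and ".." segments is an extra precaution not mentioned in the answers above.

```python
from pathlib import PurePosixPath


def resolve(relative: str, root: str = "/srv/app/files") -> PurePosixPath:
    """Join the configured root with a relative path stored in the database."""
    rel = PurePosixPath(relative)
    # Refuse values that would escape the root folder.
    if rel.is_absolute() or ".." in rel.parts:
        raise ValueError(f"not a safe relative path: {relative!r}")
    return PurePosixPath(root) / rel


print(resolve("docs/report.pdf"))                       # /srv/app/files/docs/report.pdf
print(resolve("docs/report.pdf", root="/mnt/newdisk"))  # /mnt/newdisk/docs/report.pdf
```

Moving the files to another drive then only means changing the root setting; the rows stay as they are.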
You need to store the name of the file and its path in the database, is that right? Then you should create a field with varchar(255). I have always done it that way and never had problems.
Hope it helped.
If you don't want to store the file's binary data in the database, then storing the path is the only way to go. Whether you store the absolute path or the relative path is up to you.
Yep that's basically it.
A relative path from some base location, configured as a parameter in the DB, is the usual way of doing it.
It also helps with getting round length restrictions.
If you had, say, C:\MySystem\MyData as the base path, then you could store Images\MyImageFile.jpg, Docs\MyDoc.pdf etc.
Note the impact on backup and restore though. You have to do the database and the file system.
One other potential consideration is that filenames have to be unique. So if Fred and Wilma both upload Picture1.jpg, the db is okay, but the file system will be stuffed.
Usual way round this is to have a user filename and an actual filename.
So Fred's Picture1.jpg is actually p000004566.jpg
Don't forget to add code to cope with the case where the file you think should be there has been deleted by some twit.
Also some sort of admin task to tidy up orphaned files might be in order, in the infinitely unlikely event that a coding error was made. :)
Also, if the path to the file is configurable by software, make sure you check that the account that will be doing the work has read/write access. You might also want to use a UNC path, but don't saddle yourself with a mapped drive.
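A minimal sketch of keeping a user filename and an actual filename, as described above. The numbering scheme mirrors the p000004566.jpg example; the function name and the in-memory "files" dict standing in for a database table are hypothetical.

```python
import os


def actual_filename(file_id: int, user_filename: str) -> str:
    """Return a unique on-disk name such as 'p000004566.jpg'.

    The name is built from the database id, so two users uploading
    'Picture1.jpg' can never collide on disk.
    """
    ext = os.path.splitext(user_filename)[1].lower()
    return f"p{file_id:09d}{ext}"


# Stand-in for a table mapping id -> (owner, user filename, actual filename).
files = {}
for file_id, owner in ((4566, "Fred"), (4567, "Wilma")):
    files[file_id] = (owner, "Picture1.jpg", actual_filename(file_id, "Picture1.jpg"))

print(files[4566])  # ('Fred', 'Picture1.jpg', 'p000004566.jpg')
```

The user only ever sees the name they uploaded; the stored name is an internal detail.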

Asset Management: which is the better way to organise user generated files on a web server?

We are in the process of building a system which allows users to upload multiple images and videos to our servers.
The team I'm working with have decided to save all the assets belonging to a user in a folder named using the user's unique identifier. This folder in turn will be a sub-folder of our main assets folder on the file server.
The file structure they have proposed is as follows:
[asset_root]/userid1/assets1
[asset_root]/userid1/assets2
[asset_root]/userid2/assets1
[asset_root]/userid2/assets2
etc.
We are expecting to have thousands or possibly a million+ users in the life time of this system.
I always thought that it wasn't a good idea to have many sub-folders in a single location and suggested a year/month/day approach as follows:
[asset_root]/2010/11/04/userid1/assets1
[asset_root]/2010/11/04/userid1/assets2
[asset_root]/2010/11/04/userid2/assets1
[asset_root]/2010/11/04/userid2/assets2
etc.
Does anyone know which of the above approaches would be better suited for this many assets? Is there a better method to organize images/videos on a server?
The system in question will be a Windows IIS 7.5 server with a SAN.
Many thanks in advance.
In general you are correct, in that many file systems impose a limit on the number of files and folders which may be in one folder. If you hit that limit with the number of users you have, you're in trouble.
In general, I would simply use a uuid for each image, with some dimension of partitioning. e.g. A hash of ABCDEFGH would end up as [asset_root]/ABC/DEFGH. Using a hash gives you a greater degree of assurance about the number of files which will end up in each folder and prevents you from having to worry about, for example, not knowing which month an image you need was stored in.
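The partitioning in the ABCDEFGH example can be sketched as follows. The function names are illustrative, and using a random UUID's hex form as the identifier is one possible reading of "a uuid for each image".

```python
import uuid
from pathlib import PurePosixPath


def partitioned_path(asset_root: str, asset_id: str, prefix_len: int = 3) -> PurePosixPath:
    """Split an id into <prefix>/<rest>, e.g. ABCDEFGH -> ABC/DEFGH."""
    return PurePosixPath(asset_root) / asset_id[:prefix_len] / asset_id[prefix_len:]


def new_asset_path(asset_root: str) -> PurePosixPath:
    """Generate a fresh random id and return where its file would live."""
    return partitioned_path(asset_root, uuid.uuid4().hex)


print(partitioned_path("asset_root", "ABCDEFGH"))  # asset_root/ABC/DEFGH
print(new_asset_path("asset_root"))
```

With hex ids and a 3-character prefix there are at most 16^3 = 4096 top-level folders, and random ids spread files across them roughly evenly.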
I'm presuming your file system is NTFS? If so, you've got a limit of 4,294,967,295 files on the disk - the limit of files in a folder is the same. If you have on the order of millions of users you should be fine, though you might want to consider having only one folder per user instead of several as your example indicates.

Changing the hash of files

I have a folder full of binary files and I want to make a change to these files so that their hash will change. I want to do this in a fashion that doesn't permanently corrupt the files, meaning that the change should still allow the file to operate normally or that I should be able to undo the change at any point in time.
Does anyone know of a script that I could use to do this, or maybe a program that will automate this?
Cheers
UPDATE
It's an edge case that I am trying to deal with. I have a system that only allows me to store a file with a given hash once. Hence I want to change the content hash of the file to allow the file to be stored. Note the system in question is not one I control or can change.
Couldn't I just add a random 1 to the end of the file and then remove it afterward without breaking anything? I'm just not sure how to script this - as in how to modify the binary data in this way. Note I'm in a Windows environment.
Without knowing the format of the files, we can't tell. It may in fact be impossible - for instance if these binary files are self-signed with some private key. Changing any single bit within the file is likely to render it invalid.
Is your hash calculated purely from the contents, and not any other metadata that you can change (such as filename or modified date)? If so, you're probably out of luck. If the hash is meant to detect when the content changes, but you're trying to change the hash without actually changing the content, you've clearly got a problem...
What is the hash used for? Why do you want to change it? There may be an alternative solution if you could give us more information about the bigger picture.
EDIT: One alternative is to effectively create your own container format - so while a file is stored in your container format, it's not usable in its original form, but it can be extracted easily. Your container could be as simple as "add four bytes at the end as a seed to disturb the hash" - "extracting" the file would just involve copying it and removing the last four bytes. But the important point is that what you end up with isn't an MP3 file or whatever you started with - it's your custom format, simple as it is. You need to package/extract the file any time you interact with the store.
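The "four bytes at the end" container could be sketched like this. The function names are made up, and the seed is random by default; as noted above, the packed bytes are no longer a valid file in the original format until they are unpacked.

```python
import hashlib
import os

SEED_LEN = 4  # number of extra bytes appended to disturb the hash


def pack(data: bytes, seed: bytes = None) -> bytes:
    """Append a 4-byte seed so the content hash changes."""
    if seed is None:
        seed = os.urandom(SEED_LEN)
    if len(seed) != SEED_LEN:
        raise ValueError("seed must be exactly 4 bytes")
    return data + seed


def unpack(packed: bytes) -> bytes:
    """Strip the trailing seed to recover the original bytes."""
    if len(packed) < SEED_LEN:
        raise ValueError("too short to be a packed file")
    return packed[:-SEED_LEN]


original = b"example binary content"
packed = pack(original)
print(hashlib.sha256(original).hexdigest() != hashlib.sha256(packed).hexdigest())  # True
print(unpack(packed) == original)  # True
```

For files on disk, the same two functions can be applied to the bytes read with open(path, "rb") and written back with open(path, "wb").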

Vb.Net Document Storage

I am attempting to add a document storage module to our AR software.
I will be prompting the user to attach a doc/image to their account. I will then put a copy of this file into our folder so that we can reference it without having to rely on them keeping the file in its original place. This system is not using a database but instead it's using multiple flat files.
I am looking for guidance on how to handle these files once they have attached them to our system.
How should I store these attached files?
I was thinking I could copy the file over to a subdirectory and then rename it to an auto-generated number so that we do not have duplicates. The bad thing about this is that the contents of the folder can get rather large.
Anyone have a better way? Should I create directories and store them...?
This system is not using a database but instead it's using multiple flat files.
This sounds like a multi-user system. How are you handling concurrent access issues? Your answer to that will greatly influence anything we tell you here.
Since you aren't doing anything special with your other files to handle concurrent access, what I would do is add a new folder under your main data folder specifically for document storage, and write your user files there. Additionally, you need to worry about name collisions. To handle that, I'd name each file there by appending the date and username to the original file name and taking the md5 or sha1 hash of that string. Then add a file to your other data files to map the hash values to original file names for users.
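The naming scheme above is described for a VB.Net system; the sketch below uses Python only to show the idea. The function names, the "|" separator, and the tab-separated mapping line are assumptions, not part of the original suggestion.

```python
import hashlib
from datetime import date


def hashed_name(original_name: str, username: str, when: date) -> str:
    """Hash 'original name + date + username' into a collision-resistant file name."""
    key = f"{original_name}|{when.isoformat()}|{username}"
    return hashlib.sha1(key.encode("utf-8")).hexdigest()


def mapping_line(original_name: str, username: str, when: date) -> str:
    """One line for the flat mapping file: hash, user, date, original name."""
    stored = hashed_name(original_name, username, when)
    return f"{stored}\t{username}\t{when.isoformat()}\t{original_name}"


print(mapping_line("invoice.pdf", "fred", date(2010, 11, 4)))
```

Because the date has only day resolution here, the same user attaching the same file name twice in one day would get the same hash; including a time of day in the key would avoid that.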
Given your constraints (and assuming a limited number of total users) I'd also be inclined to go with a "documents" folder -- plus a subfolder for each user. Each file name should include the date to prevent collisions. Over time, you'll have to deal with getting rid of old or outdated files either administratively or with a UI for users. Consider setting a maximum number of files or maximum byte count for each user. You'll also want to handle the files of departed users.