How to avoid overwriting files in S3? E.g. user1 uploads IMG_123 and user2 uploads IMG_123

Both users are uploading unique images but their camera roll has the same image name.
Should I use a uuid as a file name instead?

You can't prevent overwrites in S3 through the API.
For avoidance, there are multiple strategies that all come down to namespacing or UUIDs. What you want to use depends on what you plan to do with the data.
Approach 1
s3://<bucket>/<userId>/<filename>
This way, you avoid users overwriting each other's files, but a user could still overwrite their own files. You could, with relative ease, list the uploads a specific user has made (but it could get expensive).
Approach 2
s3://<bucket>/<userId>/<uuid>.jpg
You still avoid users being able to overwrite each other's data and make it exceedingly unlikely that a user overwrites their own images - but you lose the information about the original file name.
Approach 3
s3://<bucket>/<userId>/<uuid>/<filename>
This key schema retains the benefits of the first two approaches and also allows you to retain the original filename, but it will be more annoying if you want to look at the data in the console because there will be more "directory" levels.
Approach 4
s3://<bucket>/<uuid>.jpg
This way, you don't namespace anything and just rely on UUIDs to avoid overwriting data. You lose the information about the original file name and which user the object belongs to unless you have a secondary data structure (e.g. an index of your data in DynamoDB).
All of these options (and more) are completely valid. Personally, I'd pick something that at least namespaces my data by the user ID, because that makes it easier to delete specific users if necessary and also allows me to write IAM policies to allow or deny access to specific users.
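As a rough illustration of the four key schemas above, here is a minimal Python sketch. The function names and the user_id/filename parameters are made up for this example, and it only builds key strings; it does not talk to S3 and does not sanitize the user-supplied filename.

```python
import uuid
from pathlib import PurePosixPath


def key_by_user(user_id: str, filename: str) -> str:
    # Approach 1: <userId>/<filename>
    return f"{user_id}/{filename}"


def key_by_user_uuid(user_id: str, filename: str) -> str:
    # Approach 2: <userId>/<uuid><ext>; the original name is dropped
    ext = PurePosixPath(filename).suffix.lower()
    return f"{user_id}/{uuid.uuid4()}{ext}"


def key_by_user_uuid_name(user_id: str, filename: str) -> str:
    # Approach 3: <userId>/<uuid>/<filename>
    return f"{user_id}/{uuid.uuid4()}/{filename}"


def key_by_uuid(filename: str) -> str:
    # Approach 4: <uuid><ext>; no user namespace at all
    ext = PurePosixPath(filename).suffix.lower()
    return f"{uuid.uuid4()}{ext}"
```

With approach 3, two users uploading IMG_123.jpg end up with different keys that both still end in /IMG_123.jpg.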

Related

Amazon S3, storing large number of files (millions, and many TB of data)

I'll have to store millions of files (many TB in the future) in S3.
Are there any limitations? (Not price :) ) I'm asking about architectural limitations (like: don't store it this way, the other way will be better/faster).
My files are in a hierarchy
/{country}/{number}/{code}/docs
and I checked that I can keep them that way (to access them easily through REST).
(Of course I know S3 keeps them internally in another way; that's not important to me.)
So, are there any limitations/pitfalls?
S3 has no limits that you would hit. The files are not really in folders, they are just strings as locations. Make the folder structure something that is easy for you to keep track of and organize.
You do NOT want to be listing the "folder" contents in S3 to find things.
S3 is slow at giving directory listings, because they're not really directories.
You should be storing either the whole path /{country}/{number}/{code}/docs in a database or the logic should be so repeatable that you can be confident that the file will be in that location.
James Brady gave an excellent and very detailed answer on how S3 treats file storage in a question here: https://stackoverflow.com/a/394505/4179009
S3 definitely does have request-rate limits: about 100 requests/sec for keys that share a similar path prefix. See the official docs: http://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html
On the other hand, a hierarchical approach complicates the logic. The trade-off depends on your requirements. One good option is to put a key of at least 4 characters (a primary ID or a hash) at the front of the path. If you have a limited number of countries, try using multiple buckets with the country code as the bucket name; that also lets you choose a specific physical location if required.
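A minimal sketch of the "short hash in front of the path" idea, in Python. The function name prefixed_key and the choice of SHA-256 are assumptions for illustration; the answer only asks for a key of at least 4 characters at the front.

```python
import hashlib


def prefixed_key(country: str, number: str, code: str, prefix_len: int = 4) -> str:
    # Logical path from the question: {country}/{number}/{code}/docs
    logical = f"{country}/{number}/{code}/docs"
    # A short hex digest in front spreads keys across many different prefixes
    digest = hashlib.sha256(logical.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}/{logical}"
```

Because the prefix is derived from the logical path, the full key can always be recomputed from country, number and code without a lookup table.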

Storing uploaded content on a website

For the past 5 years, my typical solution for storing uploaded files (images, videos, documents, etc) was to throw everything into an "upload" folder and give it a unique name.
I'm looking to refine my methods for storing uploaded content and I'm just wondering what other methods are used / preferred.
I've considered storing each item in its own folder (folder name is the Id in the db) so I can preserve the uploaded file name. I've also considered uploading all media to a locked folder, then using a file handler to which you pass the Id of the file you want to download in the query string; it would then read the file and send the bytes to the user. This is handy for checking access and restricting bandwidth for users.
I think the file handler method is a good way to handle files, as long as you know how to make good use of resources on your platform of choice. It is possible to do stupid things like read a 1GB file into memory if you don't know what you are doing.
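To make the memory point concrete, here is a small Python sketch of reading a file in fixed-size chunks instead of all at once. The name iter_chunks and the 64 KB chunk size are arbitrary choices for this example; a real handler would write each chunk to the response stream of whatever framework is in use.

```python
from typing import BinaryIO, Iterator


def iter_chunks(fileobj: BinaryIO, chunk_size: int = 64 * 1024) -> Iterator[bytes]:
    # Yield the file piece by piece so only one chunk is in memory at a time
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        yield chunk
```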
In terms of storing the files on disk it is a question of how many, what are the access patterns, and what OS/platform you are using. For some people it can even be advantageous to store files in a database.
Creating a separate directory per upload seems like overkill unless you are doing some type of versioning. My personal preference is to rename files that are uploaded and store the original name. When a user downloads I attach the original name again.
Consider a virtual file system such as SolFS. Here's how it can solve your task:
If you have returning visitors, you can have a separate container for each visitor (and name it by visitor login, for example). One of the benefits of this approach is that you can encrypt the container using the visitor's password.
If you have many probably one-time visitors, you can have one or several containers with files grouped by date of upload.
A virtual file system lets you keep original filenames either as actual filenames, or as metadata for the files being stored.
Next, you can compress the data being stored in the container.

How should I format user uploaded pictures' filenames?

My website deals with pictures that users upload. I'm kind of conflicted on what my picture filename should consist of. I'm worried about scalability and possibly security. Maybe someone out there deals with the same thing and can tell me what they use on their site?
Currently, my filename convention is
{pictureId}_{userId}_{salt}_{variant}.{fileExt}
where salt is a token generated server-side (not sure why I decided to put this here, maybe for security purposes I don't know) and variant is something like t where it signifies it's a thumbnail. So it would look something like
12332_22_hb8324jk_t.jpg
Please advise, thanks.
In addition to the previous comments, you may want to consider creating a directory hierarchy for your files. Depending on volume and the particular OS hosting the files, you can easily reach a point where you have an unreasonably large number of files in a single directory. There may be limits on the number of files allowed per folder. If you ever need to do any manual QA or maintenance on your files, this may be problematic (especially if such maintenance is not scripted).
I once worked on a project with a high volume of images. We decided to record a subpath in our database in addition to the filename of each file. Our folder names looked like this:
a/e/2/f/9
3/3/2/b/7
Essentially, we created folders 5 deep with a single hex value as the folder name. The depth was probably excessive, but effective. I suppose this could have led to us reaching a limit on the number of folders on a volume (not sure if such a limit exists).
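The answer does not say how the hex subpath was chosen. One way to get such a path, shown here purely as an assumption, is to hash the filename and use the first few hex digits as nested folder names (shard_subpath is a made-up name):

```python
import hashlib


def shard_subpath(filename: str, depth: int = 5) -> str:
    # Hash the name, then use the first `depth` hex digits as nested folder names
    digest = hashlib.sha256(filename.encode("utf-8")).hexdigest()
    return "/".join(digest[:depth])
```

With depth 5 this gives 16^5 (about a million) leaf folders, which supports the remark that the depth was probably excessive.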
I would also consider storing a drive in addition to a path (assuming you have a bunch of disks for storage). This way you can move images around and then update your database (assuming you have one) as part of the move.
My 2 pence worth; there is a bit of a conflict between scalability and security in this problem I would say.
If you have real security concerns, then you should not rely at all on the filename of the target image: this is just security by obscurity, and somebody could guess the name eventually (even with your salt idea, which makes it harder).
Instead, you should at least have a login mechanism to create a session between client and server, to make sure you can only get at stuff once you have authenticated. Even then stuff is sniffable: if security really is a concern, then I would say you have to use SSL.
Regarding scalability: I would suggest you actually do give your images sequential numbers, and store them in 'bins' of (say) 500 images each. As you fill up a bin, create a new one. Store bin (min-image-id, max-image-id) information in one DB table and image numbers in another: you can then comparatively cheaply find which bin a particular image lives in from its id. This is a fairly common solution for storing lots of docs/images.
You could then map your URLs to the bin+image id: but then to avoid the problem noted by Jason Williams (sequential numbering, makes it easy to probe), you really should address security separately as in point 1.
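A sketch of the bin lookup in Python, with an in-memory list standing in for the bin table. The bin names, the BINS list and find_bin are all hypothetical; in the scheme described above this would be a range query against the database.

```python
import bisect

# Stand-in for the bin table: (bin_name, min_image_id, max_image_id), sorted by min id
BINS = [
    ("bin0001", 1, 500),
    ("bin0002", 501, 1000),
    ("bin0003", 1001, 1500),
]
_MIN_IDS = [b[1] for b in BINS]


def find_bin(image_id: int) -> str:
    # Locate the last bin whose min id is <= image_id
    i = bisect.bisect_right(_MIN_IDS, image_id) - 1
    if i < 0 or image_id > BINS[i][2]:
        raise KeyError(f"no bin holds image {image_id}")
    return BINS[i][0]
```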
You might like to consider replacing the underscores with (e.g.) minuses. (Underscores are used as wildcards in SQL, so you could potentially run into trouble one day in a LIKE comparison). (And of course, underscores are just plain evil :-)
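To show what the LIKE problem looks like, here is a small Python example using the standard sqlite3 module. The table and file names are invented; other databases have the same single-character wildcard, though their escape syntax may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pictures (filename TEXT)")
conn.executemany(
    "INSERT INTO pictures VALUES (?)",
    [("12332_22_t.jpg",), ("12332x22xt.jpg",)],
)

# Unescaped: each _ matches any single character, so both rows match
loose = conn.execute(
    "SELECT filename FROM pictures WHERE filename LIKE '12332_22_t.jpg'"
).fetchall()

# Escaped: \_ matches only a literal underscore, so one row matches
strict = conn.execute(
    "SELECT filename FROM pictures "
    "WHERE filename LIKE '12332\\_22\\_t.jpg' ESCAPE '\\'"
).fetchall()
```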
It looks from your example like you're avoiding spaces and upper-case characters - good move. I'd keep everything lowercase and use case-insensitive comparisons to eliminate any potential case-sensitivity issues with different file systems.
Scalability should be fine as long as you can cope with any number of digits in your user, picture and type IDs. You're very unlikely to hit any filename length limits with this scheme.
Security could be an issue if you use sequential IDs, as someone could potentially tweak the numbers and request a picture they shouldn't be able to access - but the salt should make it virtually impossible for someone to guess the correct filename for another picture. If users can't see/access the internal filename in any way, that may be an unnecessary measure though.
The first thing to do is to set up a directory structure that models your use case. In your case you have a user that uploads a picture. You would probably have a directory structure like this (probably on a network share somewhere):
-Pictures
    -UserID1
        -PictureID1~^~Variant.jpg
        -PictureID2~^~Variant.jpg
    -UserID2
        -PictureID1~^~Variant.jpg
        -PictureID2~^~Variant.jpg
Pictures - simply the root directory for the following.
UserID - is the database user ID.
PictureID is simply the picture ID from the database (assuming you record the filename of each uploaded picture in a database.)
~^~ - This is simply a delimiter. You can use a one-character or X-character sequence. I like three characters as it is easily handled with the split function and is readily distinguishable in the file name.
Sometimes I like to add the size of the picture in with the file name .256.jpg or .1024.jpg.
At any rate, all of this depends on your use case. The most important thing is setting up the directory structure properly. That will make it easier to access/serve and manage the pictures.
You can add any other information you need into the filename as long as it doesn't exceed the maximum filename length on your system.
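A short Python sketch of building and splitting names with the ~^~ delimiter described above. The helper names build_name and parse_name are made up, and the sketch assumes the delimiter appears exactly once in the name.

```python
DELIM = "~^~"


def build_name(picture_id: str, variant: str, ext: str = "jpg") -> str:
    return f"{picture_id}{DELIM}{variant}.{ext}"


def parse_name(filename: str) -> tuple[str, str, str]:
    # Split off the extension first, then split the stem on the delimiter
    stem, _, ext = filename.rpartition(".")
    picture_id, variant = stem.split(DELIM)
    return picture_id, variant, ext
```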

Vb.Net Document Storage

I am attempting to add a document storage module to our AR software.
I will be prompting the user to attach a doc/image to their account. I will then put a copy of this file into our folder so that we can reference it without having to rely on them keeping the file in its original place. This system is not using a database but instead it's using multiple flat files.
I am looking for guidance on how to handle these files once they have attached them to our system.
How should I store these attached files?
I was thinking I could copy the file over to a subdirectory, then rename it to an auto-generated number so that we do not have duplicates. The bad thing about this is that the contents of the folder can get rather large.
Anyone have a better way? Should I create directories and store them...?
"This system is not using a database but instead it's using multiple flat files."
This sounds like a multi-user system. How are you handling concurrent access issues? Your answer to that will greatly influence anything we tell you here.
Since you aren't doing anything special with your other files to handle concurrent access, what I would do is add a new folder under your main data folder specifically for document storage, and write your user files there. Additionally, you need to worry about name collisions. To handle that, I'd name each file there by appending the date and username to the original file name and taking the md5 or sha1 hash of that string. Then add a file to your other data files to map the hash values to original file names for users.
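The naming scheme in this answer could look roughly like the following. This is a Python sketch rather than VB.Net, the function names and the JSON index file are assumptions, and the read-modify-write on the index is not safe under the concurrent access mentioned above.

```python
import hashlib
import json
from datetime import date
from pathlib import Path


def stored_name(original: str, username: str, day: date) -> str:
    # Append date and username to the original name, then hash the result
    raw = f"{original}{day.isoformat()}{username}"
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()


def record_mapping(index_file: Path, hashed: str, original: str) -> None:
    # Keep hash -> original name in a flat JSON file (no database needed)
    mapping = json.loads(index_file.read_text()) if index_file.exists() else {}
    mapping[hashed] = original
    index_file.write_text(json.dumps(mapping, indent=2))
```

Note that the same user attaching the same file name twice on the same day would produce the same hash; adding a time of day to the string would avoid that.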
Given your constraints (and assuming a limited number of total users) I'd also be inclined to go with a "documents" folder -- plus a subfolder for each user. Each file name should include the date to prevent collisions. Over time, you'll have to deal with getting rid of old or outdated files either administratively or with a UI for users. Consider setting a maximum number of files or maximum byte count for each user. You'll also want to handle the files of departed users.

What is a manageable way to store e-mails for extended periods of time?

If you have a site which sends out emails to the customer, and you want to save a copy of the mail, what is an effective strategy?
If you save it to a table in your database (e.g. create a table called Mail), it gets very large very quickly.
Some strategies I've seen are:
Save it to the file system
Run a scheduled task to clear old entries from the database - but then you wind up not having a copy;
Create a separate table for each time frame (one each year, or one each month)
What strategies have you used?
I don't agree that Gmail is an effective backup for business data.
Why trust your business information to a provider who makes no guarantees of service, or over whom you have no control whatsoever?
Makes no sense to me.
Depending on how frequently you need to access this information, I'd say go with the filesystem or database archive. At least that way, you have control over your own data.
Data you want to save is saved in a database. The only exception that is justified is large binary data (images, videos). Who cares how large the table gets? If the mails are automated and template-based, you just have to save the variable parts anyway. The size will be about the same wherever you save it, but you probably already have a mechanism to backup your database, so you won't have to invent one to handle millions of files.
Lots of assumptions:
1. You're running windows / would like an archive in windows
2. The ability to search in the mails is important.
Since you are sending mails to your customers, there isn't any reason you can't bcc a mail account of your own. Assuming you have a suitable account on your own server, I'd look at using MailStore (home) to pull the mails out from your account and put them into its own compressed database.
Another option (depending on the email content) is to not save the email, but make sure you can recreate the email by archiving the original content that went into generating the email.
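A sketch of that "archive the inputs, not the email" idea in Python. The template text, the template id and both function names are invented for illustration; it assumes templates are never edited in place, otherwise old mails could not be reproduced exactly.

```python
from string import Template

# Hypothetical template store; each version keeps its own id
TEMPLATES = {
    "order_shipped_v1": Template("Hi $name, your order $order_id has shipped."),
}


def archive_record(template_id: str, **variables: str) -> dict:
    # Store only the template id and the variable parts, not the rendered mail
    return {"template": template_id, "vars": variables}


def recreate(record: dict) -> str:
    # Re-render the mail from the archived inputs
    return TEMPLATES[record["template"]].substitute(record["vars"])
```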
It depends on the content of your email. If it contains large images, I would plump for the file system. Otherwise, if your Mail table is getting very large very quickly, I would go for the separate table, archiving off dead customers.
We save the email to a database table. It really doesn't get that big that quickly. We have a table with 32,000 emails in it (they're biggish emails too, around 50kb per email) and with compression, the file only uses 16MB.
If you're sending a shed load of email, then know that Gmail (free) currently only allows 7GB of data. I'd be happy holding that on a disk.
I'd think about putting in place some sort of general archiving functionality. How you implement that depends on your specific retrieval needs.
For example, if you just wish to retrieve emails sent to a particular customer for a certain month, then storing them in an appropriate hierarchy on the file system (zip them up if necessary) should be simple to do. You might want to record a list of sent emails in a database table with a pointer to the appropriate directory, but a naming convention for your directories and files might be sufficient.
You might need to access very old emails only very infrequently, so you might archive these to DVD, for example, if online storage is a problem.
If you often want to search the actual content of emails, then you're going to have to put the content in a DB table or use an indexer like Lucene to examine the files stored on disk.
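One possible naming convention for such a hierarchy, sketched in Python. The layout (customer, then year, then month) and the .eml extension are assumptions made for this example, as is the function name archive_path.

```python
from datetime import datetime
from pathlib import PurePosixPath


def archive_path(root: str, customer_id: str, sent_at: datetime, message_id: str) -> PurePosixPath:
    # <root>/<customer>/<YYYY>/<MM>/<message>.eml
    return (
        PurePosixPath(root)
        / customer_id
        / f"{sent_at:%Y}"
        / f"{sent_at:%m}"
        / f"{message_id}.eml"
    )
```

With this layout, "all mails to one customer for one month" is a single directory, which can also be zipped as a unit.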