S3 Bucket Types - ruby-on-rails-3

Just wondering if there is a recommended strategy for storing different types of assets/files in separate S3 buckets, or should I just put them all in one bucket? The different types of assets that I have include static site images, users' profile images, and user-generated content like documents, files, and videos.

As for how to group files into buckets, that is really not a critical issue unless you want different domain names or CNAMEs for different types of content, in which case you would need a separate bucket for each domain name you want to use.
I would tend to group them by functionality. Perhaps static files used in your application that you have full control over you might deploy into a separate bucket from content that is going to be user generated. Or you might want to have video in a different bucket than images, etc.
To add to my earlier comments about S3 metadata: it is going to be a critical part of optimizing how you serve up content from S3/CloudFront.
Basically, S3 metadata consists of key-value pairs. For example, you could have Content-Type as a key with a value of image/jpeg if the file is a .jpg. S3 will then send the corresponding Content-Type header for requests made directly to the S3 URL or via CloudFront. The same is true of Cache-Control metadata. You can also use your own custom metadata. For example, I use a custom metadata key named x-amz-meta-md5 to store an MD5 hash of the file. It is used for simple comparisons of bucket contents against content stored in a revision control system, so we don't have to compute checksums of each file in the bucket on the fly. We use this for pushing differential content updates to the buckets (i.e. only pushing the files that have changed).
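As a rough illustration of the differential push described above, here is a minimal Python sketch using only the standard library. It assumes you have already fetched the stored x-amz-meta-md5 values into a plain dict (the remote_md5 argument); that dict, the function names, and the directory layout are hypothetical and not part of any AWS SDK.

```python
import hashlib
from pathlib import Path


def md5_hex(path: Path) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_to_push(local_root: Path, remote_md5: dict[str, str]) -> list[str]:
    """List keys whose local MD5 differs from the stored value or is missing.

    remote_md5 is a hypothetical mapping of object key -> x-amz-meta-md5
    value, gathered however you normally list your bucket.
    """
    changed = []
    for path in sorted(local_root.rglob("*")):
        if path.is_file():
            key = path.relative_to(local_root).as_posix()
            if remote_md5.get(key) != md5_hex(path):
                changed.append(key)
    return changed
```

Only the keys returned by files_to_push would need uploading, and each upload would set x-amz-meta-md5 again so the next comparison stays cheap.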
As for revision control, I would HIGHLY recommend using versioned file names. In other words, say you have bigimage.jpg and you want to make an update: call it bigimage1.jpg and change your code to reflect this. Why? Because ideally you would like to set long expiration times in your Cache-Control headers. Unfortunately, if you then want to deploy a file of the same name and you are using CloudFront, it becomes problematic to invalidate the cached copies at the edge locations. Whereas if you use a new file name, CloudFront just begins to populate the edge nodes with it and you don't have to worry about invalidating the cache at all.
Similarly for user-produced content, you might want to include an md5 or some other (mostly) unique identifier scheme, so that each video/image can have its own unique filename and place in the cache.
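One way to get such unique, cache-friendly names is to derive them from the file contents. The sketch below is only an illustration in Python; the naming pattern (a short MD5 tag before the extension) and the function name are my own assumptions, not something S3 or CloudFront requires.

```python
import hashlib
from pathlib import PurePosixPath


def hashed_name(key: str, content: bytes, length: int = 8) -> str:
    """Insert a short content hash before the extension.

    For example img/bigimage.jpg becomes img/bigimage.<hash>.jpg, so a
    changed file always gets a new name and a new place in the cache.
    """
    path = PurePosixPath(key)
    tag = hashlib.md5(content).hexdigest()[:length]
    return str(path.with_name(f"{path.stem}.{tag}{path.suffix}"))
```

Because the name changes whenever the bytes change, the old and new versions can both carry long Cache-Control lifetimes without any invalidation.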
For your reference, here is a link to the AWS documentation on setting up streaming in CloudFront:
http://docs.amazonwebservices.com/AmazonCloudFront/latest/DeveloperGuide/CreatingStreamingDistributions.html

Related

hiding s3 path in aws cloudfront url

I am trying to make sure I did not miss anything in the AWS CloudFront documentation or anywhere else ...
I have a (non-public) S3 bucket configured as the origin in a CloudFront web distribution (I don't think it matters, but I am using signed URLs).
Let's say I have a file at an S3 path like
/someRandomString/someCustomerName/someProductName/somevideo.mp4
So, perhaps the url generated by CloudFront would be something like:
https://my.domain.com/someRandomString/someCustomerName/someProductName/somevideo.mp4?Expires=1512062975&Signature=unqsignature&Key-Pair-Id=keyid
Is there a way to obfuscate the path to the actual file in the generated URL? All 3 parts before the filename can change, so I prefer not to use "Origin Path" in the Origin Settings to hide the beginning of the path. With that approach, I would have to create a lot of origins mapped to the same bucket but with different paths. If that's the only way, then the limit of 25 origins per distribution would be a problem.
Ideally, I would like to get something like
https://my.domain.com/someRandomObfuscatedPath/somevideo.mp4?Expires=1512062975&Signature=unqsignature&Key-Pair-Id=keyid
Note: I am also using my own domain/CNAME.
Thanks
Cris
One way could be to use a Lambda function that receives the S3 file's path, copies the file into an obfuscated directory (maybe using a simple mapping from source path to obfuscated path), and then returns the signed URL of the copied file. This will ensure that only the obfuscated path is visible externally.
Of course, this will (potentially) double the data storage, so you need some way to clean up the obfuscated folders. That could be done in a time-based manner: if each signed URL is expected to expire after 24 hours, you could create folders based on date, and each of the obfuscated directories could be deleted every other day.
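To make the date-based idea a bit more concrete, here is a small Python sketch of just the naming and cleanup logic. The key layout (date/random token/filename), the two-day retention, and both function names are assumptions for illustration; the actual copy and delete calls against S3 are left out.

```python
import secrets
from datetime import date, timedelta


def obfuscated_key(source_key: str, today: date) -> str:
    """Build a copy target like 2017-11-30/<random token>/somevideo.mp4."""
    filename = source_key.rsplit("/", 1)[-1]
    return f"{today.isoformat()}/{secrets.token_urlsafe(16)}/{filename}"


def expired_prefixes(prefixes: list[str], today: date, keep_days: int = 2) -> list[str]:
    """Return the date prefixes older than keep_days, i.e. safe to delete.

    Assumes every prefix is an ISO date, optionally followed by a slash.
    """
    cutoff = today - timedelta(days=keep_days)
    return [p for p in prefixes if date.fromisoformat(p.rstrip("/")) < cutoff]
```

The signed URL would then be generated for the obfuscated key, and a scheduled job could delete whatever expired_prefixes returns.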
Alternatively, you could use a service like tinyurl.com or something similar to create a mapping. It would be much easier, save on storage, etc. The only downside would be that it would not reflect your domain name.
If you have the ability to modify the routing of your domain then this is a non-issue, but I presume that's not an option.
Obfuscation is not a form of security.
If you wish to control which objects users can access, you should use Pre-Signed URLs or Cookies. This way, you can grant access to private objects via S3 or CloudFront and not worry about people obtaining access to other objects.
See: Serving Private Content through CloudFront

AWS S3 bucket image URL

I am using AWS S3 in a Ruby on Rails project to store images for my models. Everything is working fine. I was just wondering if it is okay/normal that if someone right-clicks an image, it shows the following URL:
https://mybucketname.s3.amazonaws.com/uploads/photo/picture/100/batman.jpg
Is this a hacking risk, letting people see your bucket name? I guess I was expecting to see a bunch of randomized letters or something. /Noob
Yes, it's normal.
It's not a security risk unless your bucket permissions allow unauthenticated actions like uploading and deleting objects by anonymous users (obviously, having the bucket name would be necessary if a malicious user wanted to overwrite your files) or your bucket name itself provides some kind of information you don't want revealed.
If it makes you feel better, you can always associate a CloudFront distribution with your bucket -- a CloudFront distribution has a default hostname like d1a2b3c4dexample.cloudfront.net, which you can use in your links, or you can associate a vanity hostname with the CloudFront distribution, like assets.example.com, neither of which will reveal the bucket name.
But your bucket name, itself, is not considered sensitive information. It is common practice to use links to objects in buckets, which necessarily include the bucket name.

Amazon S3. Maximum object size

Is there a possibility to set maximum file (object) size using a bucket's policy?
I found here a question like this, but there is no size limitation in the examples.
No, you can't do this with a bucket policy. Check the Element Descriptions page of the S3 documentation for an exhaustive list of the things you can do in a bucket policy.
However, you can specify a content-length-range restriction within a Browser Uploads policy document. This feature is commonly used for giving untrusted users write access to specific keys within an S3 bucket you control (e.g. user-facing media uploads), and it provides the appropriate tools for limiting the location, size, and data types that can be uploaded without needing to expose your S3 credentials.
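For illustration, a browser upload (POST) policy with a size limit could be assembled roughly like this in Python. The bucket name, key prefix, and size limit are placeholders, the function names are my own, and the signing step that must accompany the encoded policy is deliberately left out; treat it as a sketch of the policy document only.

```python
import base64
import json
from datetime import datetime, timedelta, timezone


def build_upload_policy(bucket: str, key_prefix: str, max_bytes: int,
                        now: datetime,
                        valid_for: timedelta = timedelta(hours=1)) -> dict:
    """Build a POST policy document that caps the upload size.

    now is expected to be a UTC datetime; the policy expires valid_for later.
    """
    expires = now + valid_for
    return {
        "expiration": expires.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "conditions": [
            {"bucket": bucket},
            ["starts-with", "$key", key_prefix],         # restrict the location
            ["starts-with", "$Content-Type", "image/"],  # restrict the data type
            ["content-length-range", 1, max_bytes],      # restrict the size
        ],
    }


def encode_policy(policy: dict) -> str:
    """Base64-encode the policy JSON, the form that gets signed and posted."""
    return base64.b64encode(json.dumps(policy).encode("utf-8")).decode("ascii")
```

An upload larger than max_bytes, or one outside key_prefix, should then be rejected by S3 rather than by your own application code.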

Searching Inside an Amazon S3 Bucket

If I have a bucket with hundreds of thousands of images, is it OK to search for each image I want to display on my site via its ID, or is there a more efficient way (maybe including having multiple folders in a bucket)?
I was also thinking of giving each image a unique hash or something similar in order to stop duplicated names in the bucket. Does that seem like a good idea?
You just link to each image using normal URLs. For public files the URLs are in the format:
http://mybucket.s3.amazonaws.com/myimage.jpg
For private files, you need to generate a URL (which is easy using any of the SDKs) in the format:
http://mybucket.s3.amazonaws.com/myimage.jpg?AWSAccessKeyId=44CF9SAMPLEF252F707&Expires=1177363698&Signature=vjSAMPLENmGa%2ByT272YEAiv4%3D
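The query parameters in that example correspond to the legacy Signature Version 2 query-string scheme. The Python sketch below shows roughly how such a URL is put together, purely to explain the format; the function name and credentials are placeholders, and since newer regions expect Signature Version 4, in practice you would let an SDK generate the URL.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote, urlencode


def presign_v2(bucket: str, key: str, access_key: str,
               secret_key: str, expires: int) -> str:
    """Build a legacy (Signature Version 2) pre-signed GET URL.

    expires is a Unix timestamp after which the URL stops working.
    """
    resource = f"/{bucket}/{quote(key)}"
    # Verb, empty Content-MD5, empty Content-Type, expiry, then the resource.
    string_to_sign = f"GET\n\n\n{expires}\n{resource}"
    digest = hmac.new(secret_key.encode("utf-8"),
                      string_to_sign.encode("utf-8"),
                      hashlib.sha1).digest()
    signature = base64.b64encode(digest).decode("ascii")
    query = urlencode({
        "AWSAccessKeyId": access_key,
        "Expires": expires,
        "Signature": signature,
    })
    return f"https://{bucket}.s3.amazonaws.com/{quote(key)}?{query}"
```

The secret key never appears in the URL; only the access key ID, the expiry time, and the resulting signature do.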
There's nothing wrong with storing each file under a unique name. If you set the correct headers on the file, any downloads can still use the original name, e.g. Content-Disposition: attachment; filename=myimage.jpg;
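A minimal Python sketch of that idea follows: store the object under a random key and keep the original name for the Content-Disposition header. The helper names and the choice of a UUID as the unique part are assumptions for illustration only.

```python
import uuid
from pathlib import PurePosixPath


def unique_key(original_name: str) -> str:
    """Return a random object key that keeps the original file extension."""
    return f"{uuid.uuid4().hex}{PurePosixPath(original_name).suffix.lower()}"


def content_disposition(original_name: str) -> str:
    """Build a Content-Disposition value that restores the original filename."""
    # Escape backslashes and double quotes so the quoted string stays valid.
    safe = original_name.replace("\\", "\\\\").replace('"', '\\"')
    return f'attachment; filename="{safe}"'
```

You would store the object under unique_key(name) and set the header value from content_disposition(name) as metadata at upload time.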
For listing a bucket's contents you would use the API's GET Bucket (List Objects) operation. I find it easier to use the SDKs for any access via the API.
It can be a pain to search or process bucket objects in parallel, as Amazon lists everything lexicographically (the only ordering currently supported). The problem with using random IDs is that all of it would be written to the same block storage, and you cannot search in parallel to optimize.
Here is an interesting article on performance improvements. I use it for my work and see a significant difference under high load.
http://aws.typepad.com/aws/2012/03/amazon-s3-performance-tips-tricks-seattle-hiring-event.html

How to create multiple keys for the same object in Amazon S3 OR CLOUDFRONT?

I am syndicating out my multi-media content (mp4 and images) to several clients. So I create one S3 object for every mp4 say "my_content_that_pays_my_bills.mp4" and let the client access the S3 URL for the objects and embed it wherever they want.
What I want is for client A to access this MP4 as "A_my_content_that_pays_my_bills.mp4"
and Client B to access this as "B_my_content_that_pays_my_bills.mp4" and so on.
I want to bill the clients by usage: so I could process access logs and count access to "B_my_content_that_pays_my_bills.mp4" and bill client B for usage.
I know that S3 allows only one key per object. So how do I get around this?
I don't know that you can alias file names in the way you'd like. Here are a couple of hacks I can think of for public files embedded freely by a customer:
1) Create one Cloudfront distribution per client, each pointing at the same bucket. Each AWS account can have 100 distributions, so you could support only that many clients. Or,
2) Duplicate the files, using the client-specific names that you'd like. This is simpler, but your file storage costs scale with your number of clients (which may or may not be significant).
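With option 2, the billing side could be as simple as tallying requests by the client prefix in the key. The Python sketch below assumes you have already extracted the requested object keys from your access logs (the log parsing itself is not shown) and that every client-specific file is named <client>_<original name>; both the function name and that naming rule are assumptions taken from the question's example.

```python
from collections import Counter


def usage_by_client(requested_keys: list[str], clients: set[str]) -> Counter:
    """Count requests per client, based on the prefix before the first underscore."""
    counts = Counter()
    for key in requested_keys:
        filename = key.rsplit("/", 1)[-1]
        prefix = filename.split("_", 1)[0]
        if prefix in clients:
            counts[prefix] += 1
    return counts
```

Keys that do not start with a known client prefix are ignored, so unrelated objects in the same bucket would not skew the counts.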