If I have a bucket with hundreds of thousands of images, is it OK to look up each image I want to display on my site by its ID, or is there a more efficient way (perhaps using multiple folders in a bucket)?
I was also thinking of giving each image a unique hash or something similar in order to stop duplicated names in the bucket. Does that seem like a good idea?
You just link to each image using normal URLs. For public files the URLs are in the format:
http://mybucket.s3.amazonaws.com/myimage.jpg
For private files, you need to generate a URL (which is easy using any of the SDKs) in the format:
http://mybucket.s3.amazonaws.com/myimage.jpg?AWSAccessKeyId=44CF9SAMPLEF252F707&Expires=1177363698&Signature=vjSAMPLENmGa%2ByT272YEAiv4%3D
There's nothing wrong with storing each file under a unique name. If you set the correct headers on the file, any downloads can still have the original name, e.g. Content-Disposition: attachment; filename=myimage.jpg
For listing a bucket's contents you would use the API's GET Bucket command. I find it easier to use the SDKs for any access via the API.
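As a rough sketch of the unique-name idea, assuming Python: build_upload_params below is a hypothetical helper (not part of any SDK) that produces a random key which keeps the file extension, plus a Content-Disposition value that preserves the original name for downloads. How the header is attached on upload depends on the SDK you use.

```python
import os
import uuid


def build_upload_params(original_name):
    """Return a unique object key plus the header that keeps the original name.

    Hypothetical helper for illustration only; it does not talk to S3.
    """
    # Keep the extension so the key still hints at the file type.
    _, ext = os.path.splitext(original_name)
    key = uuid.uuid4().hex + ext.lower()
    # Escape backslashes and quotes so the quoted filename stays valid.
    safe_name = original_name.replace("\\", "\\\\").replace('"', '\\"')
    headers = {"Content-Disposition": 'attachment; filename="%s"' % safe_name}
    return key, headers


key, headers = build_upload_params("myimage.jpg")
print(key)      # 32 random hex characters followed by ".jpg"
print(headers)
```

You would store the generated key alongside the image's ID in your own database, so displaying an image is a direct URL lookup rather than a search of the bucket.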
It can be a pain to search or process bucket objects in parallel, as Amazon lists everything lexicographically (the only ordering currently supported). The problem with using random IDs is that all of it would be written to the same block storage and you cannot search in parallel to optimize.
Here is an interesting article on performance improvements. I use its advice at work and see a significant difference under high load.
http://aws.typepad.com/aws/2012/03/amazon-s3-performance-tips-tricks-seattle-hiring-event.html
Related
I want the user to upload a picture when registering.
When the user uploads the image, a folder named after the user's ID should be created automatically, like this: wwwroot/images/UserID/fadi.jpg
Basically: you really shouldn't. The wwwroot folder is for static assets used by the application. You're running server-side code, so in theory it might be possible, but that's not what the folder is meant for. An alternative like AWS would be preferred, but if you can't do that (either because of payment requirements or other complications) I would suggest saving the image to your database. One way to do this would be to base64 encode the image and save it that way. I'm not going to give an example of that here, there are plenty available elsewhere. One such example is this.
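For readers who just want a quick picture of the base64 approach, here is a minimal sketch in Python (not the framework from the question). encode_image and decode_image are hypothetical helper names, and the sample bytes stand in for a real upload.

```python
import base64


def encode_image(raw_bytes):
    """Encode raw image bytes as an ASCII base64 string for a text column."""
    return base64.b64encode(raw_bytes).decode("ascii")


def decode_image(encoded):
    """Reverse encode_image and return the original bytes."""
    return base64.b64decode(encoded)


# Stand-in bytes (the PNG file signature); a real upload supplies the file contents.
sample = b"\x89PNG\r\n\x1a\n"
stored = encode_image(sample)
assert decode_image(stored) == sample
```

Keep in mind that base64 makes the stored value about a third larger than the raw bytes, so a binary column is worth considering if your database offers one.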
I'd like to download a complete repository from S3. I know the bucket is reachable at https://s3.amazonaws.com/big-data-benchmark/pavlo
I'd like everything under /pavlo/sequence-snappy/5nodes
How can I download this with the least manual effort, using readily available tools like wget? (The S3 tools require an actual S3 account, which I do not have and do not want.)
Though a bit of manual effort is needed, this is how it can be done:
Go to the bucket's HTTP URL and add ?marker=pavlo/sequence-snappy/5nodes, resulting in https://s3.amazonaws.com/big-data-benchmark/?marker=pavlo/sequence-snappy/5nodes
Now, binary search manually to find out how large the dataset is. Fortunately, the listing of your specific bucket is predictable and it seems to have 100 items ranging from 000000_0 to 000099_0
Use the following shell one-liner:
for i in {0000..0099}; do echo https://s3.amazonaws.com/big-data-benchmark/pavlo/sequence-snappy/5nodes/rankings/00${i}_0; done | xargs -n1 -P8 wget
Preferably we would like a more general solution which would also work for unpredictable filenames.
I think you will find that the S3 tools do not require an account for anonymous access to public buckets. (Nor do I understand why anyone wouldn't want a free account, but I digress.)
But here is a solution that works when the keys (paths/filenames) are not known or predictable:
If a bucket is truly public, as this one is, you'll find a paginated XML list of all the keys at the root of the bucket.
curl -v https://s3.amazonaws.com/big-data-benchmark/, for example.
Each <Key> contains the path to an object. This is the List Objects V1 API, so you add ?marker= and the value of the last key in the listing, on the next request, to resume the listing, repeating the process until <IsTruncated> is no longer true.
Use this to build a list to pass to curl, wget, or your http client of choice, by appending the key to the bucket URL. S3 can handle many, many parallel requests, so you may want to parallelize the process.
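The listing steps above could be sketched roughly as follows in Python. The function names and the hand-written SAMPLE_PAGE are assumptions for illustration; a real run would fetch each page over HTTP (for example with urllib.request) instead of using the sample string.

```python
import urllib.parse
import xml.etree.ElementTree as ET

# XML namespace used by S3 listing responses.
S3_NS = "{http://s3.amazonaws.com/doc/2006-03-01/}"

# Hand-written, shortened stand-in for one page of the bucket listing.
SAMPLE_PAGE = """<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Name>big-data-benchmark</Name>
  <IsTruncated>true</IsTruncated>
  <Contents><Key>pavlo/sequence-snappy/5nodes/rankings/000000_0</Key></Contents>
  <Contents><Key>pavlo/sequence-snappy/5nodes/rankings/000001_0</Key></Contents>
</ListBucketResult>"""


def parse_listing(xml_text):
    """Return (keys, is_truncated) from one List Objects V1 response page."""
    root = ET.fromstring(xml_text)
    keys = [el.text for el in root.iter(S3_NS + "Key")]
    truncated = root.findtext(S3_NS + "IsTruncated", default="false")
    return keys, truncated.strip().lower() == "true"


def next_page_url(bucket_url, keys):
    """Build the URL of the next page, using the last key as the marker."""
    return bucket_url + "?marker=" + urllib.parse.quote(keys[-1], safe="")


def object_urls(bucket_url, keys, prefix=""):
    """Turn keys into download URLs, keeping only those under prefix."""
    return [bucket_url + urllib.parse.quote(k) for k in keys if k.startswith(prefix)]


bucket = "https://s3.amazonaws.com/big-data-benchmark/"
keys, truncated = parse_listing(SAMPLE_PAGE)
if truncated:
    print("next page:", next_page_url(bucket, keys))
for url in object_urls(bucket, keys, prefix="pavlo/sequence-snappy/5nodes/"):
    print(url)
```

The printed object URLs can be written to a file and handed to wget or xargs, as in the one-liner earlier in the thread.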
I'm designing a RESTful API for file storage and having trouble finding the best way to map URLs to actions.
Files can be grouped into folders, but it must also be possible to get all the files.
Guidelines suggest using the following URL to get the files in a specific folder.
GET /folders/{folderName}/files
But what should be used to just get all files? GET /files or GET /folders/files?
Also, Google Drive has somewhat similar functionality and uses a different approach:
GET files/{folderName}/children
As you've noticed, this can vary from one API designer to another.
If I was facing this problem I would want to consider all use cases and figure out what works best.
It looks like the following would meet your needs:
GET / Retrieves all files and folders
GET /{folderId} Retrieves all contents of said folderId (folders and files)
GET /{fileId} Retrieves the file
GET /{folderId}/{folderId} Same as above, but for nested folder
GET /{folderId}/{folderId}/{fileId} Retrieves the file
This pattern can continue for however deeply nested the file structure is (note there is a limit on URL length).
Then, if you have a special requirement such as listing all files, you just create a new API endpoint.
GET /files/ Retrieves all files
GET /files/?filter="*.txt" Retrieves all text files
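One possible way to back the filter parameter on the server side is shell-style pattern matching. This is only a sketch in Python: filter_files is a hypothetical helper, and the file names are made up.

```python
import fnmatch


def filter_files(names, pattern=None):
    """Return all names, or only those matching a shell-style pattern."""
    if not pattern:
        return list(names)
    # Tolerate the quoted form shown above, e.g. filter="*.txt".
    pattern = pattern.strip('"')
    return [n for n in names if fnmatch.fnmatchcase(n, pattern)]


all_files = ["notes.txt", "photo.jpg", "todo.txt"]
print(filter_files(all_files))             # GET /files/
print(filter_files(all_files, '"*.txt"'))  # GET /files/?filter="*.txt"
```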
So to answer your EXACT question of:
But what should be used to just get all files? GET /files or GET /folders/files
I would lean towards /files instead of /folders/files. /folders/files does not make much sense to an API consumer.
I am building a new web application and want to use Cloudinary for users' images. My question is: do I need to create folders in my Cloudinary cloud? The reason I am asking is that if I were using a file system and had 100,000+ images in one folder, it would start killing my app, and I would need to break them into several folders.
Is it the same for Cloudinary?
Thanks,
It depends on your current and future requirements.
In general, I believe that folders can help with better organizing your resources, especially when there are lots of them.
Note that besides folders, you can also assign tags to your images (e.g., by user) or add a prefix to the images' public IDs (e.g., user1-<image_name>).
You can later use Cloudinary's Admin API to list your resources either by folder/prefix or by tag.
Just wondering if there is a recommended strategy for storing different types of assets/files in separate S3 buckets, or should I just put them all in one bucket? The different types of assets that I have include: static site images, users' profile images, and user-generated content like documents, files, and videos.
As for how to group files into buckets: that is really not a critical issue unless you want different domain names or CNAMEs for different types of content, in which case you would need a separate bucket for each domain name you want to use.
I would tend to group them by functionality. Perhaps static files used in your application that you have full control over you might deploy into a separate bucket from content that is going to be user generated. Or you might want to have video in a different bucket than images, etc.
To add to my earlier comments about S3 metadata: it is going to be a critical part of optimizing how you serve up content from S3/CloudFront.
Basically, S3 metadata consists of key-value pairs. For example, you could have Content-Type as a key with a value of image/jpeg if the file is a .jpg. S3 will then send the corresponding Content-Type header on requests made directly to the S3 URL or via CloudFront. The same is true of Cache-Control metadata. You can also use your own custom metadata. For example, I use a custom key named x-amz-meta-md5 to store an md5 hash of the file. It is used for simple bucket comparisons against content stored in a revision control system, so we don't have to compute checksums of each file in the bucket on the fly. We use this for pushing differential content updates to the buckets (i.e. only push the files that have changed).
As for revision control, I would HIGHLY recommend using versioned file names. In other words, say you have bigimage.jpg and you want to make an update: call it bigimage1.jpg and change your code to reflect this. Why? Because optimally, you would like to set long expiration times in your Cache-Control headers. Unfortunately, if you then want to deploy a file of the same name and you are using CloudFront, it becomes problematic to invalidate the edge caching locations. Whereas if you have a new file name, CloudFront would just begin to populate the edge nodes and you don't have to worry about invalidating the cache at all.
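To illustrate the differential-update idea, here is a minimal Python sketch. It assumes you have already read each object's x-amz-meta-md5 value into a dictionary; md5_hex and keys_to_push are hypothetical names, and no S3 calls are made.

```python
import hashlib


def md5_hex(data):
    """Hex md5 digest of a bytes object, as it would be stored in x-amz-meta-md5."""
    return hashlib.md5(data).hexdigest()


def keys_to_push(local_files, remote_md5):
    """Return the keys whose local content differs from what the bucket reports.

    local_files: {key: bytes} taken from revision control (assumed shape).
    remote_md5:  {key: md5 string read from x-amz-meta-md5} (assumed shape).
    """
    return sorted(
        key for key, data in local_files.items()
        if remote_md5.get(key) != md5_hex(data)
    )


local = {"site.css": b"body{}", "app.js": b"x=1", "new.png": b"\x89PNG"}
remote = {"site.css": md5_hex(b"body{}"), "app.js": "stale-hash"}
print(keys_to_push(local, remote))  # changed or missing keys only
```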
Similarly for user-produced content, you might want to include an md5 or some other (mostly) unique identifier scheme, so that each video/image can have its own unique filename and place in the cache.
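A variation on the bigimage1.jpg idea is to derive the version from the file's content rather than from a manual counter. The Python sketch below assumes that approach; versioned_name is a hypothetical helper, and the 8-character digest length is an arbitrary choice.

```python
import hashlib
import os


def versioned_name(filename, data, digest_chars=8):
    """Insert a short content hash before the extension.

    The name changes whenever the content changes, so a long Cache-Control
    lifetime never serves a stale file under the new name.
    """
    stem, ext = os.path.splitext(filename)
    digest = hashlib.md5(data).hexdigest()[:digest_chars]
    return "%s.%s%s" % (stem, digest, ext)


old = versioned_name("bigimage.jpg", b"first version of the image")
new = versioned_name("bigimage.jpg", b"second version of the image")
print(old)
print(new)
assert old != new
```

Identical content always maps to the same name, so re-deploying an unchanged file does not create a new cache entry.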
For your reference, here is a link to the AWS documentation on setting up streaming in CloudFront:
http://docs.amazonwebservices.com/AmazonCloudFront/latest/DeveloperGuide/CreatingStreamingDistributions.html