Software to manage 1 Million files on Amazon S3

I have an Amazon S3 account with 1 million items in the root folder. Is there any Windows software that can be used to manage these files?
I've tried Bucket Explorer and CloudBerry, but both seize up trying to list this many files.

http://www.bucketexplorer.com/documentation/amazon-s3--search-on-objects-in-bucket.html shows how to set up a delimiter and prefix for a search, which might be what you need. Your files might share some common prefixes (file-name prefixes), or there may be a character that's common enough to use as a delimiter to make folders; say there are a lot of numbers, then try a delimiter of '1' instead of the default '/'.
You may need to read up a bit to learn about delimiters and prefixes in S3.
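If you end up scripting around the problem rather than relying on a GUI tool, the same trick works from code. Here is a minimal sketch with the AWS SDK for Java, using the '1' delimiter suggested above; the bucket name my-bucket and the class name are placeholders of my own, not from the question:

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import com.amazonaws.services.s3.model.ListObjectsRequest;
    import com.amazonaws.services.s3.model.ObjectListing;

    public class ListWithDelimiter {
        public static void main(String[] args) {
            AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

            // Group keys at their first '1' so the response comes back as a short
            // list of "common prefixes" instead of a million individual keys.
            // A prefix could be added with .withPrefix(...) to drill further down.
            ListObjectsRequest request = new ListObjectsRequest()
                    .withBucketName("my-bucket")   // placeholder bucket name
                    .withDelimiter("1");

            ObjectListing listing = s3.listObjects(request);
            for (String commonPrefix : listing.getCommonPrefixes()) {
                System.out.println("Pseudo-folder: " + commonPrefix);
            }
        }
    }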

Well, after trying many S3 tools I've finally found one that handles more than a million files with ease, and it can do a sync as well. It's free, though that wasn't important to me; I just wanted something that worked.
Dragon Disk:
http://www.dragondisk.com

Related

aws s3 search sub-folders containing one specific file

I understand that S3 does not have "folders", but I will still use the term to illustrate what I am looking for.
I have this folder structure in s3:
my-bucket/folder-1/file-named-a
my-bucket/folder-2/...
my-bucket/folder-3/file-named-a
my-bucket/folder-4/...
I would like to find all folders containing "file-named-a", so folder-1 and folder-3 in the above example would be returned. I only need to search the "top level" folders under my-bucket. There could be tens of thousands of folders to search. How do I construct the ListObjectsRequest to do that?
Thanks,
Sam
An Amazon S3 bucket can be listed (ListBucket()) to view its contents, and this API call can be limited by a Prefix. However, it is not possible to put a wildcard within the prefix.
Therefore, you would need to retrieve the entire bucket listing, looking for these files. This would require repeated calls if there are a large number of objects.
Example: Listing Keys Using the AWS SDK for Java
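Along those lines, here is a minimal sketch with the AWS SDK for Java. The bucket name my-bucket and the key shape folder-name/file-named-a come from the question; the class name and the way the results are collected are just one way to write it:

    import java.util.ArrayList;
    import java.util.List;

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import com.amazonaws.services.s3.model.ListObjectsRequest;
    import com.amazonaws.services.s3.model.ObjectListing;
    import com.amazonaws.services.s3.model.S3ObjectSummary;

    public class FindFoldersWithFile {
        public static void main(String[] args) {
            AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
            List<String> matches = new ArrayList<>();

            ObjectListing listing = s3.listObjects(
                    new ListObjectsRequest().withBucketName("my-bucket"));
            while (true) {
                for (S3ObjectSummary summary : listing.getObjectSummaries()) {
                    String key = summary.getKey();
                    // Keep keys that look exactly like "<top-level-folder>/file-named-a".
                    String[] parts = key.split("/");
                    if (parts.length == 2 && parts[1].equals("file-named-a")) {
                        matches.add(parts[0]);
                    }
                }
                if (!listing.isTruncated()) {
                    break;
                }
                // The listing is paginated (1000 keys per call), so keep going.
                listing = s3.listNextBatchOfObjects(listing);
            }
            matches.forEach(System.out::println);
        }
    }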

Amazon S3, storing large number of files (millions, and many TB of data)

I'll have to store millions of files (many TB in the future) in S3.
Are there any limitations? (Not price :) ) I'm asking about architectural limitations (like: don't store them this way, another way will be better/faster).
My files are in a hierarchy
/{country}/{number}/{code}/docs
and I checked that I can keep them that way (to access them easily through REST)
(of course I know S3 stores them internally in a different way, but that's not important to me).
So, are there any limitations/pitfalls?
S3 has no limits that you would hit. The files are not really in folders, they are just strings as locations. Make the folder structure something that is easy for you to keep track of and organize.
You do NOT want to be listing the "folder" contents in S3 to find things.
S3 is slow at giving directory listings, because it doesn't really have directories.
You should either store the whole path /{country}/{number}/{code}/docs in a database, or make the key-building logic so repeatable that you can be confident the file will be at that location.
James Brady gave an excellent and very detailed answer on how S3 treats file storage in a question here: https://stackoverflow.com/a/394505/4179009
AWS S3 definitely does have access limits, around 100 req/sec when keys share a similar path prefix; see the official docs: http://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html
On the other hand, a hierarchical approach makes the logic more complicated. The trade-off depends on your requirements; one good option is to put a key of at least 4 characters (a primary id or hash key) at the front of the URL. If you only have a limited number of countries, try using multiple buckets with the country code as the bucket name; that also helps you pin the data to a specific physical location if required.
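As a rough illustration of that idea (the buildKey helper, the MD5 hash, and the 4-character prefix length are my own choices, not anything S3 requires), deterministic key building with a short spreading prefix could look like this:

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    public class S3KeyBuilder {

        // Hypothetical helper: prepend a short hash prefix so keys with the same
        // /{country}/{number}/{code}/ path do not all share one hot prefix.
        public static String buildKey(String country, String number, String code, String docName)
                throws NoSuchAlgorithmException {
            String logicalPath = country + "/" + number + "/" + code + "/docs/" + docName;
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(logicalPath.getBytes(StandardCharsets.UTF_8));
            // First 2 bytes -> 4 hex characters, used as the spreading prefix.
            String prefix = String.format("%02x%02x", digest[0], digest[1]);
            return prefix + "/" + logicalPath;
        }

        public static void main(String[] args) throws NoSuchAlgorithmException {
            System.out.println(buildKey("pl", "12345", "abc", "invoice.pdf"));
            // e.g. "3f9c/pl/12345/abc/docs/invoice.pdf"
        }
    }

Because the prefix is derived from the path itself, the same logic always reproduces the same key, which is exactly the "repeatable logic" the earlier answer recommends.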

Migrate to Amazon S3 - Keeping my hierarchical directories?

I have a Rails 3 app with Paperclip gem.
Currently, my local directory structure is based on my records' UUIDs to store images:
5D5E5641-FCE8-4D0B-A413-A9F993CD0E34
becomes:
5/D/5/E/5/6/....... 3/4/full/image.jpg
5/D/5/E/5/6/....... 3/4/thumb/image.jpg
so that I never have more than 32,000 nodes per directory.
I want to migrate to S3:
1) Can I keep this directory structure on S3? Could it be a performance issue?
2) Does Amazon S3 have its own directory management per bucket?
Thanks.
There is no such thing as folders in Amazon S3. It is a "flat" file system. The closest you can get to folders is adding prefixes, like you said: 5/D/image.jpg in your file names. In this case, 5 is a prefix and 5/D is also a prefix. Your delimiter, in turn, would be /.
Even though several S3 tools will show you stuff as if they were contained inside folders, this concept does not exist on S3. Please see this and this related threads.
You can definitely use the pattern you suggested, and I don't think you will suffer any performance penalties by doing so.
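Purely as an illustration (the imageKey helper is hypothetical), mapping a record UUID onto that one-character-per-level layout is just string building, since on S3 the whole path ends up as a single key:

    public class UuidKey {

        // Hypothetical helper: turn a record UUID into the
        // one-character-per-level key layout described in the question.
        public static String imageKey(String uuid, String style, String fileName) {
            // Strip the dashes, then use each remaining character as a "folder".
            String hex = uuid.replace("-", "");
            StringBuilder key = new StringBuilder();
            for (char c : hex.toCharArray()) {
                key.append(c).append('/');
            }
            return key.append(style).append('/').append(fileName).toString();
        }

        public static void main(String[] args) {
            System.out.println(imageKey("5D5E5641-FCE8-4D0B-A413-A9F993CD0E34", "full", "image.jpg"));
            // 5/D/5/E/5/6/4/1/.../3/4/full/image.jpg
        }
    }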

Amazon MapReduce input splitting and downloading

I'm new to EMR and have a few questions I've been struggling with over the past few days. The first is that the logs I want to process are already compressed as .gz, and I was wondering whether these types of files can be split by EMR so that more than one mapper will work on a file. I have also been reading that input files will not be split unless they are 5 GB; my files are not that large, so does that mean they will only be processed by one instance?
My other question might seem relatively dumb, but is it possible to use EMR + streaming and have an input somewhere other than S3? It seems redundant to have to download the logs from the CDN and then upload them to my S3 bucket to run MapReduce on them. Right now I have them downloading onto my server, and then my server uploads them to S3. Is there a way to cut out the middleman and have them go straight to S3, or to run the inputs off my server?
are already compressed as .gz, and I was wondering whether these types of files can be split by EMR so that more than one mapper will work on a file
Alas, no, straight gzip files are not splittable. One option is to just roll your log files more frequently; this very simple solution works for some people though it's a bit clumsy.
I have also been reading that input files will not be split unless they are 5 GB,
This is definitely not the case. If a file is splittable, you have lots of options for how you want to split it, e.g. configuring mapred.max.split.size. I found [1] to be a good description of the options available.
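For a custom Java job that could look roughly like the sketch below; with streaming the same property can be passed on the command line instead (e.g. -D mapred.max.split.size=...). The 128 MB figure is only an illustrative value, not something from this answer:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class SplitSizeExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Cap each input split at 128 MB (value is in bytes) so one large,
            // splittable file is handed to several mappers instead of one.
            conf.setLong("mapred.max.split.size", 128L * 1024 * 1024);

            Job job = Job.getInstance(conf, "split-size-example");
            // ... set mapper, reducer, input/output formats and paths, then:
            // job.waitForCompletion(true);
        }
    }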
is it possible to use EMR + streaming and have an input somewhere other than S3?
Yes. Elastic MapReduce now supports VPC, so you could connect directly to your CDN [2].
[1] http://www.scribd.com/doc/23046928/Hadoop-Performance-Tuning
[2] http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/EnvironmentConfig_VPC.html?r=146

Handling large amounts of file uploads - Any limitations I should know of?

I'm building a website that will involve a lot of uploaded files. Hopefully, even more than I'm planning for.
I figured I'd have an uploaded-files path and use a UUID as the filename. I was curious whether there are any limitations on this. For instance, would storing thousands of files in one folder on my server create problems?
There are quite a few issues that could appear, from file system limitations to backup problems.
I suggest using the first X characters of the UUID as the folder name, possibly multiple levels deep (first 4, next 4, then another 4). This way you have one structure but can back up folders and move them to different servers if needed later (by using the folders as redirection points).
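A minimal sketch of that layout (the pathFor helper and the uploads base directory are my own placeholders), using three levels of 4 characters each:

    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.UUID;

    public class UploadPath {

        // Hypothetical helper: nest uploads three levels deep using 4-character
        // chunks of the UUID, so no single directory accumulates too many files.
        public static Path pathFor(UUID id) {
            String hex = id.toString().replace("-", "");
            return Paths.get("uploads",
                    hex.substring(0, 4),
                    hex.substring(4, 8),
                    hex.substring(8, 12),
                    hex + ".dat");
        }

        public static void main(String[] args) {
            System.out.println(pathFor(UUID.randomUUID()));
            // e.g. uploads/3f9c/1a2b/7d4e/3f9c1a2b7d4e....dat
        }
    }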