Copy multiple objects into one object in Amazon S3 - amazon-s3

I'm stuck with the following problem: I need to upload objects in small parts (512KB), so I cannot use multipart upload (because of the 5MB minimum part size). Because of that, I have to put my parts in a "partitions" bucket and run a Cron task that downloads the partitions and uploads a single concatenated object into a "completed" bucket.
I would like to know, however, whether there is a more elegant way to do this than direct download and concatenation. The AWS CLI suggests one can copy objects as a whole, but I see no way to copy and concatenate several objects into one. Is there a way to do this via AWS S3 means?
UPD: I am not guaranteed a 512KB chunk size (in fact, it is 512KB to 16MB), but it is usually 512KB, and this limit comes from the vendor of my IP cameras, so I cannot really change it. I do know the resulting size beforehand: the camera tells me "I am going to upload 33MB" in a separate call to my backend, but I have no control over the number of chunks or their size beyond the guaranteed boundaries above.
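For reference, here is a minimal sketch of what that Cron task looks like with boto3; the bucket names, the prefix argument, and the assumption that chunk keys sort in upload order are illustrative, not part of the actual setup.

```python
import boto3

s3 = boto3.client("s3")

def concatenate(prefix: str, target_key: str) -> None:
    """Download every chunk under `prefix` from the partitions bucket,
    concatenate them in key order, and upload the result as one object."""
    paginator = s3.get_paginator("list_objects_v2")
    body = bytearray()
    for page in paginator.paginate(Bucket="partitions", Prefix=prefix):
        for obj in page.get("Contents", []):
            # Assumes chunk keys sort correctly, e.g. zero-padded part indexes.
            part = s3.get_object(Bucket="partitions", Key=obj["Key"])
            body.extend(part["Body"].read())
    s3.put_object(Bucket="completed", Key=target_key, Body=bytes(body))
```

The whole object is buffered in memory here, which is tolerable for tens of MB but would need streaming for anything larger.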

Related

AWS: How do Athena GET requests on S3 work?

How do Athena GET requests on S3 work? I had the impression that one S3 GET request = getting one single file from a bucket. But that doesn't seem to be the case since a single query that uses 4 files is costing me around 400 GET requests.
What's happening exactly?
If you run queries against files that are splittable and large enough, Athena will spin up workers that read partial files. This improves performance through parallelization. Parquet files, for example, are splittable.
A 100x amplification sounds very high though. I don't know what size Athena aims for when it comes to splits, and I don't know the sizes for your files. There could also be other explanations for the additional GET operations, both inside of Athena and from other sources – how sure are you that these requests are from Athena?
One way you could investigate further is to turn on object-level logging in CloudTrail for the bucket. You should be able to see all the request parameters, like which byte ranges are read. If you assume a role, pass a unique session name, and make only a single query with the credentials you get, you should be able to isolate all the S3 operations made by Athena for that query.
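A rough sketch of that isolation approach with boto3; the role ARN, database, query, and output location are placeholders, not values from the question.

```python
import uuid
import boto3

# Assume a role under a unique session name so the S3 data events recorded
# by CloudTrail for this query can be traced back to that session.
sts = boto3.client("sts")
session_name = f"athena-debug-{uuid.uuid4()}"
creds = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/athena-debug",  # placeholder
    RoleSessionName=session_name,
)["Credentials"]

athena = boto3.client(
    "athena",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)

# Run exactly one query with these credentials, then filter the CloudTrail
# S3 data events by the session name to see each GET and its byte range.
athena.start_query_execution(
    QueryString="SELECT COUNT(*) FROM my_table",             # placeholder
    QueryExecutionContext={"Database": "my_db"},             # placeholder
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
```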

Does S3 multipart upload actually create multiple objects in my bucket?

Here is an example I'm working through to understand the under-the-hood mechanism.
I decide to upload a 2GB file to my S3 bucket, and I decide to use 128MB as the part size. Then I will have
(2 * 1024) / 128 => 16 parts
Here are my questions:
Am I going to see 16 128MB objects in my bucket, or a single 2GB object in my bucket?
How does S3 know the order of the parts (1 -> 2 -> ... -> 16) and reassemble them into a single 2GB file when I download them back? Is there an extra 'meta' object (see the question above) that I need to download first to give the client the information it needs for reassembly?
When the S3 client downloads the above in parallel, at what point does it create the file descriptor for this 2GB file in the local file system (I guess it does not know all the needed information before all the parts have been downloaded)?
While uploading the individual parts, there will be multiple uploads stored in Amazon S3 that you can view with the ListMultipartUploads command.
When completing a multipart upload with the CompleteMultipartUpload command, you must specify a list of the individual parts uploaded in the correct order. The uploads will then be combined into a single object.
Downloading depends upon the client/code you use -- you could download an object in parallel or just single-threaded.
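A minimal sketch of that upload flow with boto3; the bucket, key, file path, and part size are placeholders.

```python
import boto3

s3 = boto3.client("s3")
bucket, key = "my-bucket", "big-file.bin"   # placeholders
part_size = 128 * 1024 * 1024               # 128MB parts, as in the example above

upload = s3.create_multipart_upload(Bucket=bucket, Key=key)
parts = []
with open("big-file.bin", "rb") as f:
    part_number = 1
    while chunk := f.read(part_size):
        resp = s3.upload_part(
            Bucket=bucket, Key=key, UploadId=upload["UploadId"],
            PartNumber=part_number, Body=chunk,
        )
        parts.append({"PartNumber": part_number, "ETag": resp["ETag"]})
        part_number += 1

# The ordered part list passed here is what tells S3 how to assemble
# the parts into a single object.
s3.complete_multipart_upload(
    Bucket=bucket, Key=key, UploadId=upload["UploadId"],
    MultipartUpload={"Parts": parts},
)
```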

How can I organize a million+ files now that I'm moving to Amazon S3?

Well I'm getting booted from my shared host and I'm switching over to a combination of a VPS from Linode and Amazon S3 to host a few million jpegs.
My big worry is keeping some kind of sanity with all these images. Is there any hope of that? My understanding is you're only allowed 100 "buckets" and "buckets" are the only type of structure within S3.
Is putting a few million files in a bucket something you'd advise against?
You may notice that Bucket Restrictions and Limitations states:
There is no limit to the number of objects that can be stored in a bucket
My experience is that a very large number of objects in a single bucket will not affect the performance of getting a single object by its key (that is, get appears to be of constant complexity).
Having a very large number of objects also does not affect the speed of listing a given number of objects:
List performance is not substantially affected by the total number of keys in your bucket
However, I must warn you, that most S3 management tools I've used (like S3Fox) will choke and die a horrible slow death when attempting to access a bucket with a very large number of objects. One tool that seems to cope well with very large numbers of objects is S3 Browser (they have a free version and a Pro version, I am not affiliated with them in any way).
Using "folders" or prefixes, does not change any of these points (get and listing a given number of objects are still constant, most tools still fall over themselves and hang).

Max files per directory on Amazon S3

I'm currently storing ~3 million images files in a single directory on my server, which is causing serious performance issues. I'd like to move them to Amazon S3 and I'm wondering whether I'd need to use a hierarchical folder structure or whether I can store them in a single folder on S3.
I get a large percentage of my traffic from google image search and I don't want to hurt my SEO by changing the image path, so a single folder on S3 would be ideal if there aren't any performance issues. I imagine LIST operations would be slow, but I'm okay with that.
S3 has no limit on the number of items stored in a bucket. In fact, using a 'directory' separator in key names is completely optional.
There is a practical use for a separator in your keys: as you correctly guessed, listing the keys will be more difficult, as you'll have to page through many list results.
However, as the S3 documentation points out, you can use any character as a separator.
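A small sketch of what that looks like with boto3, using "|" as the separator; the bucket and key scheme are hypothetical.

```python
import boto3

s3 = boto3.client("s3")

# Keys such as "user|2024|photo1.jpg" use "|" instead of "/" as the separator.
resp = s3.list_objects_v2(
    Bucket="my-image-bucket",   # placeholder
    Prefix="user|",
    Delimiter="|",              # any character can act as the "directory" separator
)
# CommonPrefixes then plays the role of "subdirectories" under that prefix.
for cp in resp.get("CommonPrefixes", []):
    print(cp["Prefix"])
```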

Distributed datastore

We're trying to add some kind of persistence in our app.
The app generates about 250 entries per second. Each of these entries belongs to one of 2M files. For each file, we want to keep the last 10 entries so we can look them up later.
The way our client application works:
it gets a stream of all the data
it fetches the right file (GET)
it adds the new content
it saves the file back (PUT)
We're looking for an efficient way to store this data that can scale horizontally as the amount of data we're getting is doubling every few weeks.
We initially looked at S3. It works fine, but becomes very expensive very fast (>$1000 monthly just in PUT operations!)
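For context, here is a minimal sketch of that GET/append/PUT cycle against S3 with boto3; the bucket name, key scheme, and JSON storage format are illustrative, not details from our actual app.

```python
import json
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET = "entries-store"   # placeholder

def append_entry(file_key: str, entry: dict) -> None:
    """Fetch the file, append the new entry, keep only the last 10, save it back."""
    try:
        body = s3.get_object(Bucket=BUCKET, Key=file_key)["Body"].read()
        entries = json.loads(body)
    except ClientError as err:
        if err.response["Error"]["Code"] != "NoSuchKey":
            raise
        entries = []
    entries = (entries + [entry])[-10:]   # keep only the last 10 entries
    s3.put_object(Bucket=BUCKET, Key=file_key,
                  Body=json.dumps(entries).encode())
```

At 250 entries per second this is one GET and one PUT per entry, roughly 650 million PUT requests a month, which is consistent with the cost figure above.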
We then gave Riak a shot, but it seems we can't get more than 60 writes/sec on each node, which is very, very slow.
Any other solution out there?
There are lots of knobs you can turn in Riak - ask the mailing list if you haven't already and we'll figure out a sane configuration for you. 60 writes/sec is not within the norm.
See: http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
What about Hadoop's HDFS spread over Amazon EC2 instances? I know each instance has a good amount of storage space, and you don't have to pay for put/get, only the inbound transfer.
I would suggest looking at CloudIQ Storage from Appistry. It's a fully distributed file store, accessible via a REST-based API, and it can run on commodity hardware. You can define the number of copies retained on a file-by-file basis. It supports an eventually consistent model, so you can balance file consistency with performance.