Can I trust aws-cli to re-upload my data without corrupting when the transfer fails? - amazon-s3

I extensively use S3 to store encrypted and compressed backups of my workstations. I use the aws cli to sync them to S3. Sometimes, the transfer might fail when in progress. I usually just retry it and let it finish.
My question is: does S3 have some kind of check to make sure that the previously failed transfer didn't leave corrupted files behind? Is syncing again enough to fix the previously failed transfer?
Thanks!

Individual files uploaded to S3 are never partially uploaded. Either the entire file completes and S3 stores it as an S3 object, or the upload is aborted and no S3 object is stored.
Even in the multipart upload case, multiple parts can be uploaded, but they never form a complete S3 object unless all of the pieces are uploaded and the "Complete Multipart Upload" operation is performed. So there is no need to worry about corruption via partial uploads.
Syncing will certainly be enough to fix the previously failed transfer.
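Since a failed sync leaves no partial objects behind, the usual recovery is simply to re-run the sync until it succeeds. A minimal retry wrapper might look like the sketch below (Python for illustration; the bucket path in the comment is a placeholder):

```python
import time

def run_with_retries(run, attempts=3, delay=1.0):
    """Call run() until it succeeds or attempts are exhausted.

    `run` should raise an exception on failure (e.g. a network error
    or a non-zero exit status from the AWS CLI).
    """
    for attempt in range(1, attempts + 1):
        try:
            return run()
        except Exception:
            if attempt == attempts:
                raise
            time.sleep(delay * attempt)  # simple linear backoff

# Example: re-running `aws s3 sync` until it exits cleanly.
# import subprocess
# run_with_retries(lambda: subprocess.run(
#     ["aws", "s3", "sync", "backups/", "s3://my-backup-bucket/"],
#     check=True))
```

Because S3 never exposes partially uploaded objects, re-running the sync is safe: already-transferred files are skipped and incomplete ones are re-uploaded from scratch.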

Yes, it looks like the AWS CLI does validate what it uploads and handles corruption scenarios by employing an MD5 checksum.
From https://docs.aws.amazon.com/cli/latest/topic/s3-faq.html
The AWS CLI will perform checksum validation for uploading and downloading files in specific scenarios.
The AWS CLI will calculate and auto-populate the Content-MD5 header for both standard and multipart uploads. If the checksum that S3 calculates does not match the Content-MD5 provided, S3 will not store the object and instead will return an error message back to the AWS CLI.
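The Content-MD5 value is simply the base64-encoded 128-bit MD5 digest of the object body. A quick sketch of how that header value is derived (pure Python, no AWS access needed):

```python
import base64
import hashlib

def content_md5(body: bytes) -> str:
    """Return the base64-encoded MD5 digest sent in the
    Content-MD5 header for an upload."""
    digest = hashlib.md5(body).digest()  # raw 16-byte digest, not hex
    return base64.b64encode(digest).decode("ascii")

print(content_md5(b"hello"))  # XUFAKrxLKna5cZ2REBfFkg==
```

S3 recomputes the MD5 of the bytes it actually received and rejects the request if it differs from this header, so a corrupted transfer can never be stored silently.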

Related

AWS S3 File Integrity and multipart practicality

I am trying to implement file integrity verification using AWS S3, but have been dumbfounded by S3's multipart gotcha.
As per the documentation:
When using MD5, Amazon S3 calculates the checksum of the entire multipart object after the upload is complete. This checksum is not a checksum of the entire object, but rather a checksum of the checksums for each individual part.
My particular usecase is distribution of binaries, I produce a manifest of the uploaded binaries which contains the SHA256 of each file, and then sign the manifest with my PGP key.
There is no straightforward way for me to either upload a binary with an appropriate whole-object checksum, or to check the checksum of a binary while it's still on S3 without downloading it, even using the getObjectAttributes API call.
My current understanding for an integrity implementation based on what is suggested by AWS's documentation is:
Chunk each binary to be uploaded locally, based on AWS S3's multipart specifications.
Produce the SHA256 for each of those chunks, BASE64 encoded.
Concatenate the chunks' SHA256 digests and hash the result again to produce the "Object's checksum"
Store all the chunk's SHA256 as well as the "Object's checksum" on my manifest.
So I can then:
Invoke the getObjectAttributes API call to get the "Object's checksum" from AWS and compare it with my manifest.
Use the manifest's stored chunk SHA256's to reliably verify locally by repeating the chunk-and-sha process described above, when I have the binary downloaded.
Is that what AWS really expects us to implement for integrity verification? Am I missing something glaringly obvious as to how one implements an end-to-end file integrity check, given this very particular way S3 chooses to checksum large files?
fwiw, my stack is Node.js and I am using the AWS SDK
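The chunk-and-hash procedure described above can be sketched as follows (shown in Python for brevity even though the question's stack is Node.js; the 8 MB part size is an assumption and must match the part size actually used for the multipart upload):

```python
import base64
import hashlib

def composite_sha256(data: bytes, part_size: int = 8 * 1024 * 1024):
    """Compute an S3-style checksum-of-checksums for a multipart object.

    Returns (per-part base64 SHA256 digests, composite checksum in the
    "<base64>-<part count>" form reported alongside multipart objects).
    """
    parts = [data[i:i + part_size] for i in range(0, len(data), part_size)]
    digests = [hashlib.sha256(p).digest() for p in parts]
    # The composite checksum is the SHA256 of the concatenated
    # *binary* part digests, not of the object data itself.
    composite = hashlib.sha256(b"".join(digests)).digest()
    b64_parts = [base64.b64encode(d).decode("ascii") for d in digests]
    return b64_parts, f"{base64.b64encode(composite).decode('ascii')}-{len(parts)}"
```

Storing both the per-part digests and the composite value in the manifest lets you compare against what S3 reports without downloading the object, and re-verify locally by repeating the chunk-and-hash process after download.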

Uploading smaller files with multipart file upload with only one part using AWS CLI

I have configured AWS S3 and a Lambda function that triggers when a file is inserted into S3. I have configured the Lambda to trigger on the s3:ObjectCreated:CompleteMultipartUpload event. When I tested through the AWS CLI with large files, it worked. But when I upload a smaller file, less than 5 MB in size, the event does not trigger the Lambda. How can I make this work for small files with only one part?
The AWS CLI only uses multipart upload for files above its multipart threshold (8 MB by default), so small files are uploaded with a single PutObject call and never fire the CompleteMultipartUpload event. Add the s3:ObjectCreated:Put event so your Lambda gets notified for those uploads too.
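A bucket notification configuration covering both cases might look like this (the function ARN is a placeholder):

```json
{
  "LambdaFunctionConfigurations": [
    {
      "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:my-handler",
      "Events": [
        "s3:ObjectCreated:Put",
        "s3:ObjectCreated:CompleteMultipartUpload"
      ]
    }
  ]
}
```

Alternatively, subscribing to s3:ObjectCreated:* covers Put, Post, Copy, and CompleteMultipartUpload in a single rule.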

Best approach for setting up AEM S3 Data Store

We have an existing setup of AEM 6.1 which uses TarMK for data storage. To migrate all assets to S3, I followed all the steps here: https://docs.adobe.com/docs/en/aem/6-1/deploy/platform/data-store-config.html#Data%20Store%20Configurations (Amazon S3 Data Store). Apparently the data synced to S3, but when I checked the disk usage report I still see that assets are using disk space, even for existing and newly added assets. What's the purpose of using S3 for assets if they still use disk space? Or am I doing something wrong? How can I verify that my setup is really using S3? Here is my S3DataStore.config
accessKey="xxxxxxxxxx"
secretKey="xxxxxxxxxx"
s3Bucket="dev-aem-assets-local"
s3Region="eu-west-1"
connectionTimeout="120000"
socketTimeout="120000"
maxConnections="40"
writeThreads="30"
maxErrorRetry="10"
continueOnAsyncUploadFailure=B"true"
cacheSize="0"
minRecordLength="10"
Another question is: Do I need to do the same setup on publisher? Or is it ok just to do it on author and use publisher as is by replicating the binary data?
There are a few parts to your question, so I'll break down the answer into logical blocks. Shout if I miss anything.
Your setup for migration is correct and S3 will use disk space. This is for the write-through cache.
AEM uses a write-through cache for writing to S3, and all the settings for this cache are in your S3 config file. Any writes to the data store are first written to this cache; asynchronous background threads then upload them to the S3 bucket. This mechanism makes AEM very responsive, as it's not blocked by slow S3 writes, and reads of recently written blobs are fast because they don't require slow reads from S3. In short, S3 IO traffic is too slow for AEM, so this cache boosts performance. You cannot disable it, as it is required for asynchronous writes to S3. You can reduce its size, but it's recommended to be at least 50% of your S3 bucket size.
You can verify your S3 setup by looking at your logs for messages related to AWS (grep for aws).
As for publisher, yes you need to migrate from your old publisher to the new publisher. Assuming that you are not using binary-less replication, you will need a different S3 bucket for your publisher. In general, you migrate from author to author and publisher to publisher for a standard implementation.
You can also verify your S3 data usage by looking at the S3 bucket and the traffic on it. If versioning is enabled on your S3 bucket, all the blobs will show version stamps.
Async upload of blobs can be monitored from the logs, and IP traffic monitoring will show activity related to your S3 bucket. The most useful check is to watch the network traffic between your AEM server and the S3 endpoint.

Uploading large files with Carrierwave to S3

So I need to upload large files (zip files a few GB in size) to S3, and I would like Carrierwave to manage the download/distribution of those files.
Meaning, when a user pays, Carrierwave can automagically generate the dynamic URL and send it to them. I know how to do this already, but it just occurred to me that I have never uploaded files via Carrierwave that are bigger than a few dozen MB, much less a few GB, to S3.
Given the flakiness of HTTP connections, I figure this is a suboptimal way to do it.
I don't have that many files to upload (maybe 10 - 20 max), and users won't be uploading them. It will be a storefront where the customers will be buying/downloading the files, not uploading them.
It would be nice if there was a way for me to upload the files into my S3 bucket separately (say FTP, git, or some other mechanism) and then just link it to my app through Carrierwave in some way.
What's the best way to approach this?
Also, don't forget that you will encounter the Heroku 30 second timeout when you are uploading the file in the first place.
Don't worry though, there are options:
Direct Upload - S3 supports direct upload where you present a form which uploads directly to s3 bypassing Heroku, you then receive a call back into your application with the uploaded files details for you to process (https://github.com/dwilkie/carrierwave_direct)
Upload to S3 and then expose bucket/folder in your application to connect to your models. We do this approach with a number of clients. They use Transmit (Mac Client) to upload large assets to S3 and then visit their app to link the asset to a Rails model.
Also, I'm pretty sure S3 is an HTTP-based service, so you're only going to be able to upload via HTTP.

AWS S3 and AjaXplorer

I'm using AjaXplorer to give my clients access to a shared directory stored in Amazon S3. I installed the SD, configured the plugin (http://ajaxplorer.info/plugins/access/s3/) and could upload and download files, but the upload size is limited by my host's PHP limit, which is 64 MB.
Is there a way I can upload directly to S3 without going through my host, to improve speed and be subject to S3's limits rather than PHP's?
Thanks
I think that is not possible, because the upload first goes through PHP on the server, which then transfers the file to the bucket.
The only way around this is to use some JQuery or JS that can bypass your server/PHP entirely and stream directly into S3. This involves enabling CORS and creating a signed policy on the fly to allow your uploads, but it can be done!
I ran into just this issue with some inordinately large media files for our website users that I no longer wanted to host on the web servers themselves.
The best place to start, IMHO is here:
https://github.com/blueimp/jQuery-File-Upload
A demo is here:
https://blueimp.github.io/jQuery-File-Upload/
This was written to upload+write files to a variety of locations, including S3. The only tricky bits are getting your MIME type correct for each particular upload, and getting your bucket policy the way you need it.
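The "signed policy" mentioned above can be sketched as follows. This is a simplified illustration of the legacy (signature V2) browser-POST signing scheme, in pure Python with placeholder credentials and a hypothetical bucket; modern setups should generate SigV4 presigned POSTs via an SDK instead:

```python
import base64
import hashlib
import hmac
import json

def sign_post_policy(policy: dict, secret_key: str):
    """Base64-encode an S3 POST policy document and sign it with the
    account's secret key (legacy V2 HMAC-SHA1 scheme, for illustration)."""
    encoded = base64.b64encode(
        json.dumps(policy).encode("utf-8")).decode("ascii")
    signature = base64.b64encode(
        hmac.new(secret_key.encode("utf-8"),
                 encoded.encode("utf-8"),
                 hashlib.sha1).digest()).decode("ascii")
    return encoded, signature

# Hypothetical policy restricting uploads to one bucket, one key
# prefix, and a 5 GB size cap.
policy = {
    "expiration": "2030-01-01T00:00:00Z",
    "conditions": [
        {"bucket": "my-upload-bucket"},
        ["starts-with", "$key", "uploads/"],
        ["content-length-range", 0, 5 * 1024 * 1024 * 1024],
    ],
}
encoded_policy, signature = sign_post_policy(policy, "FAKE/SECRET/KEY")
```

The browser form then posts the file together with the encoded policy and signature straight to the bucket, so the bytes never pass through your PHP host; CORS must be enabled on the bucket for this to work.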