I am trying to implement file integrity verification with AWS S3, but I have been stumped by S3's multipart upload gotcha.
As per the documentation:
When using MD5, Amazon S3 calculates the checksum of the entire multipart object after the upload is complete. This checksum is not a checksum of the entire object, but rather a checksum of the checksums for each individual part.
My particular use case is distributing binaries: I produce a manifest of the uploaded binaries that contains the SHA256 of each file, and then sign the manifest with my PGP key.
There is no straightforward way for me to either upload a binary with its whole-file checksum, or to check the checksum of a binary while it is still on S3 (without downloading it) via the getObjectAttributes API call.
My current understanding of an integrity implementation, based on what AWS's documentation suggests, is:
Chunk each binary to be uploaded locally, based on AWS S3's multipart specifications.
Produce the SHA256 for each of those chunks, BASE64 encoded.
Hash the concatenation of the chunks' raw SHA256 digests (and Base64-encode the result) to produce the "Object's checksum", i.e. the checksum of checksums.
Store all the chunks' SHA256 digests, as well as the "Object's checksum", in my manifest.
So I can then:
Invoke the getObjectAttributes API call to get the "Object's checksum" from AWS and compare it with my manifest.
Use the chunk SHA256s stored in the manifest to verify locally, once I have the binary downloaded, by repeating the chunk-and-hash process described above (sketched below).
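Concretely, I picture the local side looking something like this in Node.js (TypeScript). This is only a sketch: the part size is an assumption and must match whatever part size the uploader actually used, and it assumes the object was uploaded with ChecksumAlgorithm set to SHA256 so getObjectAttributes has a checksum to report.

// Sketch only: per-part SHA256 digests plus the S3-style "checksum of checksums".
import { createHash } from "node:crypto";
import { open } from "node:fs/promises";
import { S3Client, GetObjectAttributesCommand } from "@aws-sdk/client-s3";

const PART_SIZE = 8 * 1024 * 1024; // assumption: must equal the part size used for the multipart upload

async function localChecksums(path: string) {
  const file = await open(path, "r");
  const partDigests: Buffer[] = [];
  const chunk = Buffer.alloc(PART_SIZE);
  try {
    for (;;) {
      const { bytesRead } = await file.read(chunk, 0, PART_SIZE);
      if (bytesRead === 0) break;
      partDigests.push(createHash("sha256").update(chunk.subarray(0, bytesRead)).digest());
    }
  } finally {
    await file.close();
  }
  return {
    parts: partDigests.map((d) => d.toString("base64")), // per-chunk SHA256, Base64, for the manifest
    // "Object's checksum": SHA256 over the concatenated raw part digests
    objectChecksum: createHash("sha256").update(Buffer.concat(partDigests)).digest("base64"),
  };
}

async function matchesS3(bucket: string, key: string, path: string): Promise<boolean> {
  const s3 = new S3Client({});
  const attrs = await s3.send(new GetObjectAttributesCommand({
    Bucket: bucket,
    Key: key,
    ObjectAttributes: ["Checksum", "ObjectParts"],
  }));
  // S3 may report the composite checksum with a "-<part count>" suffix, so strip it before comparing.
  const reported = (attrs.Checksum?.ChecksumSHA256 ?? "").split("-")[0];
  return reported === (await localChecksums(path)).objectChecksum;
}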
Is that what AWS really expects us to implement for integrity verification? Am I missing something glaringly obvious about how to implement an end-to-end file integrity check, given this very particular way S3 chooses to checksum large files?
FWIW, my stack is Node.js and I am using the AWS SDK.
I am trying to decrypt a file uploaded via SFTP to an S3 bucket while preserving the folder structure of the S3 key.
I have a GPG-encrypted file being uploaded via SFTP to an S3 bucket. The customer uploads files with a certain folder structure (which I am relying on for metadata), so they might upload a file that looks like this:
customer/folder1/file1.xlsx.gpg
or another file that appears like this:
customer/folder2/file2.xlsx.gpg
I want to decrypt these files so that their s3 keys are
customer/folder1/file1.xlsx
and
customer/folder2/file2.xlsx
but I only see the option to use ${Transfer:User Name} when parameterizing the file location of the decrypt step, so I end up with
customer/file1.xlsx
and
customer/file2.xlsx
instead and lose the folder structure.
Is there a way to do this?
For anyone else running into the limits of AWS Transfer Family, the solution I came up with is: store the GPG keys as a secret, handle the S3 trigger that fires when a .gpg file lands in the bucket, read the .gpg file from the S3 bucket as a stream, decrypt it using a Python GPG client and the stored key (which is looked up based on the folder structure of the .gpg file), and then store the decrypted file back in the S3 bucket, preserving the folder structure. A second S3 trigger fires when that decrypted file is created, and my Lambda can then pick up that trigger and process the decrypted file normally.
I have also discovered that the Python API for S3 lets you store metadata with an object, but I don't believe that is doable when a file is placed via SFTP. So I think I'm stuck relying on folder structure for metadata.
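For what it's worth, the shape of the Lambda is roughly the following. My real implementation is in Python; this TypeScript sketch only shows the flow, the secret naming is made up, and decryptGpg is a placeholder for whatever GPG client you use.

// Sketch of the flow above; decryptGpg is a placeholder and the secret naming is illustrative.
import { S3Client, GetObjectCommand, PutObjectCommand } from "@aws-sdk/client-s3";
import { SecretsManagerClient, GetSecretValueCommand } from "@aws-sdk/client-secrets-manager";
import type { S3Event } from "aws-lambda";

const s3 = new S3Client({});
const secrets = new SecretsManagerClient({});

// Placeholder: plug in your GPG library of choice here (I used a Python GPG client).
async function decryptGpg(ciphertext: Uint8Array, privateKey: string): Promise<Uint8Array> {
  throw new Error("not implemented in this sketch");
}

export const handler = async (event: S3Event): Promise<void> => {
  for (const record of event.Records) {
    const bucket = record.s3.bucket.name;
    const key = decodeURIComponent(record.s3.object.key.replace(/\+/g, " "));
    if (!key.endsWith(".gpg")) continue; // the second trigger (for the decrypted file) is handled elsewhere

    // Look up the key material based on the folder structure, e.g. "customer/folder1/file1.xlsx.gpg"
    const customer = key.split("/")[0];
    const secret = await secrets.send(new GetSecretValueCommand({ SecretId: `gpg-keys/${customer}` }));

    // Read the encrypted object, decrypt it, and write it back under the same
    // key minus the ".gpg" suffix so the folder structure is preserved.
    const encrypted = await s3.send(new GetObjectCommand({ Bucket: bucket, Key: key }));
    const plaintext = await decryptGpg(await encrypted.Body!.transformToByteArray(), secret.SecretString!);
    await s3.send(new PutObjectCommand({ Bucket: bucket, Key: key.replace(/\.gpg$/, ""), Body: plaintext }));
  }
};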
I think I have a fatal misunderstanding of how SSE-S3 encryption works on an Amazon S3 bucket.
I encrypted some of my files and it says the encrypting was successful but I was never given any key to store.
How does SSE-S3 work? Once I enable it on a file, is the accessing of that file any different? It seems to be the same. I'm still able to access the file using its URL in my web browser. I guess the key is stored for me by the bucket and once I access my bucket, any file I want is automatically decrypted? I guess this is to deter people attempting to hack into a bucket and steal all its files?
This is what I'm seeing on a particular file.
Do you need to obtain a key and save it somewhere once Amazon uses SSE-S3 to encrypt a file?
No, the encryption keys are fully managed by Amazon S3. The whole encryption and decryption process is taken care of by S3; you don't need to do anything besides flipping the switch.
I encrypted some of my files and it says the encrypting was successful but I was never given any key to store.
Because the key storage is also managed by S3.
How does SSE-S3 work?
You upload a file to S3
S3 generates a plain data key 🔑 and encrypts it with the S3 master key, so now there are two blobs which correspond to 🔑 and E(🔑)
S3 encrypts your file using the plain data key 🔑
S3 stores your encrypted file and E(🔑) side by side
S3 servers wipe the plain data key 🔑 from memory
Once I enable it on a file, is the accessing of that file any different?
No, S3 does all the hard encryption and decryption work for you. You just access the file as normal.
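For example, uploading with SSE-S3 and reading the object back uses exactly the same calls as for an unencrypted object. A minimal sketch (bucket and key names are made up):

// Upload with SSE-S3, then read the object back as usual; S3 decrypts transparently.
import { S3Client, PutObjectCommand, GetObjectCommand, HeadObjectCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({});

async function demo(): Promise<void> {
  // Ask S3 to encrypt the object at rest with an S3-managed key (SSE-S3).
  await s3.send(new PutObjectCommand({
    Bucket: "my-bucket",
    Key: "hello.txt",
    Body: "hello",
    ServerSideEncryption: "AES256",
  }));

  // No keys, no extra parameters: reading it back looks exactly like reading a plain object.
  const obj = await s3.send(new GetObjectCommand({ Bucket: "my-bucket", Key: "hello.txt" }));
  console.log(await obj.Body!.transformToString()); // "hello"

  // The object metadata records which server-side encryption was applied.
  const head = await s3.send(new HeadObjectCommand({ Bucket: "my-bucket", Key: "hello.txt" }));
  console.log(head.ServerSideEncryption); // "AES256"
}

demo().catch(console.error);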
I guess the key is stored for me by the bucket and once I access my bucket, any file I want is automatically decrypted?
You are right. S3 stores the E(🔑) for you with your file side-by-side. When you access the file, the underlying data is automatically decrypted.
I guess this is to deter people attempting to hack into a bucket and steal all its files?
This prevents malicious people with physical access to the hard drives that hold your data from gaining access to the raw bytes of your file.
I extensively use S3 to store encrypted and compressed backups of my workstations, and I use the AWS CLI to sync them to S3. Sometimes a transfer fails while in progress; I usually just retry it and let it finish.
My question is: does S3 have some kind of check to make sure that the previously failed transfer didn't leave corrupted files behind? Does anyone know if syncing again is enough to fix the previously failed transfer?
Thanks!
Individual files uploaded to S3 are never partially stored. Either the entire file is uploaded and S3 stores it as an S3 object, or the upload is aborted and no object is ever stored.
Even in the multipart upload case, multiple parts can be uploaded, but they never form a complete S3 object unless all of the pieces are uploaded and the "Complete Multipart Upload" operation is performed. So there is no need to worry about corruption via partial uploads.
Syncing will certainly be enough to fix the previously failed transfer.
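If you want to check for leftovers yourself, you can list any multipart uploads that were started but never completed; the parts do occupy storage until the upload is aborted, but they never show up as an object in the bucket. A sketch (the bucket name is made up):

// List multipart uploads that were started but never completed, and abort them.
import { S3Client, ListMultipartUploadsCommand, AbortMultipartUploadCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({});

async function abortIncompleteUploads(bucket: string): Promise<void> {
  const { Uploads = [] } = await s3.send(new ListMultipartUploadsCommand({ Bucket: bucket }));
  for (const upload of Uploads) {
    console.log(`aborting ${upload.Key} (started ${upload.Initiated})`);
    await s3.send(new AbortMultipartUploadCommand({
      Bucket: bucket,
      Key: upload.Key!,
      UploadId: upload.UploadId!,
    }));
  }
}

abortIncompleteUploads("my-backup-bucket").catch(console.error);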
Yes, it looks like the AWS CLI does validate what it uploads and takes care of corruption scenarios by employing MD5 checksums.
From https://docs.aws.amazon.com/cli/latest/topic/s3-faq.html
The AWS CLI will perform checksum validation for uploading and downloading files in specific scenarios.
The AWS CLI will calculate and auto-populate the Content-MD5 header for both standard and multipart uploads. If the checksum that S3 calculates does not match the Content-MD5 provided, S3 will not store the object and instead will return an error message back to the AWS CLI.
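If you upload through the SDK yourself rather than the CLI, you can get the same guarantee by supplying the header explicitly; the value is the Base64-encoded binary MD5 digest of the body. A minimal sketch for a single PutObject:

// Supply Content-MD5 yourself so S3 rejects the upload if the bytes it received don't match.
import { createHash } from "node:crypto";
import { readFile } from "node:fs/promises";
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({});

async function putWithMd5(bucket: string, key: string, path: string): Promise<void> {
  const body = await readFile(path);
  const md5 = createHash("md5").update(body).digest("base64"); // Base64 of the *binary* digest
  await s3.send(new PutObjectCommand({ Bucket: bucket, Key: key, Body: body, ContentMD5: md5 }));
}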
I am using the GCS Transfer Service to move objects from S3 into GCS. I then have a Ruby script on GAE that downloads each new GCS object and operates on it. The download fails because the MD5 and CRC32C hash verification fails. The verification (part of the google-cloud-storage gem) works by comparing the object.md5 and object.crc32c hashes to the hashes calculated from the downloaded file, and these are mismatched.
I downloaded the file from AWS and calculated its MD5 and CRC32C hashes, and I got the same values that the GCS file attributes report (object.md5 and object.crc32c). However, when I download the file directly from GCS and calculate the hashes, I get different MD5 and CRC32C values.
To replicate this:
Calculate the hash of an AWS object
Transfer the object to GCS via the transfer service
Pull the GCS object's stored hashes using: gsutil ls -L gs://bucket/path/to/file
Calculate the hashes of the GCS object
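For the "calculate the hashes" steps, this is the kind of thing I run locally (a sketch in Node just for illustration; MD5 only, since CRC32C needs a third-party library, and gsutil ls -L prints hashes Base64-encoded):

// MD5 of a local file in both hex and Base64 (gsutil ls -L reports hashes Base64-encoded).
import { createHash } from "node:crypto";
import { readFile } from "node:fs/promises";

async function md5Of(path: string): Promise<{ hex: string; base64: string }> {
  const digest = createHash("md5").update(await readFile(path)).digest();
  return { hex: digest.toString("hex"), base64: digest.toString("base64") };
}

md5Of("file-downloaded-from-gcs").then(console.log).catch(console.error); // filename is a placeholder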
The error that I originally got was:
/usr/local/bundle/gems/google-cloud-storage-0.23.2/lib/google/cloud/storage/file/verifier.rb:34:in `verify_md5!': The downloaded file failed MD5 verification. (Google::Cloud::Storage::FileVerificationError)
from /usr/local/bundle/gems/google-cloud-storage-0.23.2/lib/google/cloud/storage/file.rb:809:in `verify_file!'
from /usr/local/bundle/gems/google-cloud-storage-0.23.2/lib/google/cloud/storage/file.rb:407:in `download'
from sample.rb:9:in `'
I have a 1 GB zip file in an S3 bucket. After downloading it, I can't seem to unzip it. It always says:
End-of-central-directory signature not found. Either this file is not
a zipfile, or it constitutes one disk of a multi-part archive. In the
latter case the central directory and zipfile comment will be found on
the last disk(s) of this archive.
Later, I downloaded it again, using s3cmd this time. It said:
WARNING: MD5 signatures do not match: computed=384c9a702c2730a6b46d21606137265d, received="b42099447c7a1a390d8e7e06a988804b-18"
Is there any S3 limitation I need to know about, or is this a bug?
This question seems dead, but I'll answer it for anyone landing here:
Amazon S3's multipart uploads (the kind suited to big files) produce ETag values that no longer match the file's MD5, so if you're using the ETag as a checksum (as your received MD5 suggests) it won't work.
The best you can do for validation is to ensure a Content-MD5 header is added to every part of your multipart upload (so the file does not get corrupted during upload), and to add your own MD5 metadata field for checking the data after download.
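For example, with the Node SDK (a sketch only; names are illustrative): compute your own MD5, attach it as object metadata on upload, and compare it against the downloaded file later.

// Store your own MD5 as object metadata at upload time, then check it after download.
import { createHash } from "node:crypto";
import { createReadStream } from "node:fs";
import { S3Client, HeadObjectCommand } from "@aws-sdk/client-s3";
import { Upload } from "@aws-sdk/lib-storage";

const s3 = new S3Client({});

async function fileMd5(path: string): Promise<string> {
  const hash = createHash("md5");
  for await (const chunk of createReadStream(path)) hash.update(chunk as Buffer);
  return hash.digest("hex");
}

async function uploadWithMd5Metadata(bucket: string, key: string, path: string): Promise<void> {
  const md5 = await fileMd5(path);
  await new Upload({
    client: s3,
    params: {
      Bucket: bucket,
      Key: key,
      Body: createReadStream(path),
      Metadata: { md5 }, // our own checksum, independent of the multipart ETag
    },
  }).done();
}

async function verifyDownload(bucket: string, key: string, downloadedPath: string): Promise<boolean> {
  const head = await s3.send(new HeadObjectCommand({ Bucket: bucket, Key: key }));
  return head.Metadata?.md5 === (await fileMd5(downloadedPath));
}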
Thanks @ergoithz for reminding me that I had this question :) The problem is already fixed; the AWS SDK for Node.js was the culprit. Apparently it could not upload large files using stream data from fs.createReadStream(), so I switched to Knox, where it worked perfectly.