GCS Transfer Service Object MD5 and CRC32C hash mismatch - amazon-s3

I am using GCS Transfer Service to move objects from S3 into GCS. I then have a Ruby script on GAE that downloads the new GCS object and operates on it. The script fails to download because the MD5 and CRC32C hash verification fails. The verification (part of the google-cloud-storage gem) works by comparing the object.md5 and object.crc32c hashes to the hashes calculated over the downloaded file, but these are mismatched.
I downloaded the file from AWS and calculated its MD5 and CRC32C hashes, and I got the same values as the GCS object's attributes (object.md5 and object.crc32c). However, when I download the file directly from GCS and calculate the hashes, I get different MD5 and CRC32C values.
To replicate this:
Calculate the hash of an AWS object
Transfer the object to GCS via the transfer service
Pull the GCS object's stored hashes using: gsutil ls -L gs://bucket/path/to/file
Calculate the hashes of the GCS object
The error that I originally got was:
/usr/local/bundle/gems/google-cloud-storage-0.23.2/lib/google/cloud/storage/file/verifier.rb:34:in `verify_md5!': The downloaded file failed MD5 verification. (Google::Cloud::Storage::FileVerificationError)
from /usr/local/bundle/gems/google-cloud-storage-0.23.2/lib/google/cloud/storage/file.rb:809:in `verify_file!'
from /usr/local/bundle/gems/google-cloud-storage-0.23.2/lib/google/cloud/storage/file.rb:407:in `download'
from sample.rb:9:in `'
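
For reference, a rough Python sketch of calculating both hashes locally (the script in question is Ruby; this is just for illustration and assumes the google-crc32c package is installed). It prints both the hex form that gsutil hash -h shows and the base64 form stored in object.md5 / object.crc32c:

import base64
import hashlib

import google_crc32c  # assumed dependency: pip install google-crc32c


def local_hashes(path):
    """Compute MD5 and CRC32C over a file, in hex and in the base64 form GCS stores."""
    md5 = hashlib.md5()
    crc = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            md5.update(chunk)
            crc = google_crc32c.extend(crc, chunk)
    crc_bytes = crc.to_bytes(4, "big")
    return {
        "md5_hex": md5.hexdigest(),
        "md5_b64": base64.b64encode(md5.digest()).decode(),   # compare with object.md5
        "crc32c_hex": crc_bytes.hex(),
        "crc32c_b64": base64.b64encode(crc_bytes).decode(),   # compare with object.crc32c
    }


print(local_hashes("path/to/downloaded/file"))  # placeholder path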

Related

GPG Decrypt using AWS Transfer Family and Preserve Folder Structure

I am trying to decrypt a file uploaded via SFTP to an S3 bucket and preserve the folder structure of the S3 key.
I have a GPG-encrypted file being uploaded via SFTP to an S3 bucket. The customer uploads a file with a certain folder structure (which I am relying on for metadata), so they might upload a file that looks like this:
customer/folder1/file1.xlsx.gpg
or another file that appears like this:
customer/folder2/file2.xlsx.gpg
I want to decrypt these files so that their S3 keys are
customer/folder1/file1.xlsx
and
customer/folder2/file2.xlsx
but I only see the option to use ${transfer:UserName} when parameterizing the file location of the decrypt step, so I end up with
customer/file1.xlsx
and
customer/file2.xlsx
instead and lose the folder structure.
Is there a way to do this?
For anyone else running into limitations with AWS Transfer Family, the solution I have come up with is: store the GPG keys as secrets, handle the S3 trigger sent when a .gpg file is placed in the bucket, read the .gpg file from the S3 bucket as a stream, decrypt it using a Python GPG client and the stored key (which is looked up based on the folder structure of the .gpg file), then store the decrypted file in the S3 bucket, preserving the folder structure. A second S3 trigger is sent upon creation of this file, and my Lambda can then pick up that trigger and process the decrypted file normally.
I have discovered that with the Python API for S3 you can store metadata with an object, but I don't believe this is doable when a file is being placed via SFTP. So I think I'm stuck relying on folder structure for metadata.
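
A rough Python sketch of that Lambda flow; the secret naming scheme, bucket layout, and the python-gnupg dependency are assumptions, not the exact implementation:

import boto3
import gnupg  # python-gnupg, assumed to be packaged with the Lambda

s3 = boto3.client("s3")
secrets = boto3.client("secretsmanager")
gpg = gnupg.GPG(gnupghome="/tmp/gnupg")


def handler(event, context):
    # Triggered by the S3 "object created" notification for *.gpg uploads.
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]            # e.g. customer/folder1/file1.xlsx.gpg
    if not key.endswith(".gpg"):
        return

    # Look up the private key based on the folder structure (hypothetical secret name).
    customer = key.split("/")[0]
    secret = secrets.get_secret_value(SecretId=f"gpg/{customer}")
    gpg.import_keys(secret["SecretString"])

    # Stream the ciphertext out of S3 and decrypt it.
    body = s3.get_object(Bucket=bucket, Key=key)["Body"]
    decrypted = gpg.decrypt_file(body)
    if not decrypted.ok:
        raise RuntimeError(f"decryption failed for {key}: {decrypted.status}")

    # Write the plaintext back, preserving the folder structure minus ".gpg",
    # which fires the second trigger for normal processing.
    s3.put_object(Bucket=bucket, Key=key[: -len(".gpg")], Body=decrypted.data)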

AWS S3 File Integrity and multipart practicality

I am trying to implement file integrity verification using AWS S3, but have been dumbfounded by the multipart gotcha of S3.
As per the documentation:
When using MD5, Amazon S3 calculates the checksum of the entire multipart object after the upload is complete. This checksum is not a checksum of the entire object, but rather a checksum of the checksums for each individual part.
My particular use case is distribution of binaries: I produce a manifest of the uploaded binaries which contains the SHA256 of each file, and then sign the manifest with my PGP key.
There is no straightforward way for me to either upload a binary with the appropriate whole-object checksum, or to check the checksum of a binary while it's still on S3 (without downloading it) using the getObjectAttributes API call.
My current understanding of an integrity implementation, based on what is suggested by AWS's documentation, is:
Chunk each binary to be uploaded locally, based on AWS S3's multipart specifications.
Produce the SHA256 for each of those chunks, BASE64 encoded.
Concatenate the chunks' binary SHA256 digests and take the SHA256 of that concatenation to produce the "object checksum" (the checksum of checksums).
Store all the chunks' SHA256 digests, as well as the "object checksum", in my manifest.
So I can then:
Invoke the getObjectAttributes API call to get the "Object's checksum" from AWS and compare it with my manifest.
Use the manifest's stored chunk SHA256s to verify locally, by repeating the chunk-and-hash process described above once I have downloaded the binary.
Is that what AWS really expects us to implement for integrity verification? Am I missing something glaringly obvious as to how one implements an end-to-end file integrity check, given this very particular way S3 chooses to checksum large files?
FWIW, my stack is Node.js and I am using the AWS SDK.
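
For reference, a minimal Python sketch of reproducing the composite SHA256 locally (the stack in question is Node.js; this is only an illustration). The part size is an assumption and has to match whatever part size was used for the multipart upload; the bucket and key in the commented usage are placeholders:

import base64
import hashlib

PART_SIZE = 8 * 1024 * 1024  # hypothetical 8 MiB part size; must match the upload


def composite_sha256(path, part_size=PART_SIZE):
    """Rebuild the checksum-of-checksums S3 reports for a multipart SHA256 upload."""
    part_digests = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(part_size)
            if not chunk:
                break
            part_digests.append(hashlib.sha256(chunk).digest())
    if len(part_digests) == 1:
        # A single-part upload carries a plain whole-object checksum.
        return base64.b64encode(part_digests[0]).decode()
    # Composite checksum: SHA256 over the concatenated binary part digests,
    # reported by S3 with a "-<part count>" suffix.
    combined = hashlib.sha256(b"".join(part_digests)).digest()
    return f"{base64.b64encode(combined).decode()}-{len(part_digests)}"


# Compare against what S3 reports without downloading the object, e.g.:
# s3 = boto3.client("s3")
# attrs = s3.get_object_attributes(Bucket="my-bucket", Key="binaries/app.tar.gz",
#                                  ObjectAttributes=["Checksum", "ObjectParts"])
# print(attrs["Checksum"]["ChecksumSHA256"])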

Do you need to obtain a key and save it somewhere once Amazon uses SSE-S3 to encrypt a file?

I think I have a fatal misunderstanding of how SSE-S3 encryption works on an Amazon S3 bucket.
I encrypted some of my files and it says the encryption was successful, but I was never given any key to store.
How does SSE-S3 work? Once I enable it on a file, is accessing that file any different? It seems to be the same: I'm still able to access the file using its URL in my web browser. I guess the key is stored for me by the bucket, and once I access my bucket, any file I want is automatically decrypted? I guess this is to deter people attempting to hack into a bucket and steal all its files?
Do you need to obtain a key and save it somewhere once Amazon uses SSE-S3 to encrypt a file?
No, the encryption key is fully managed by Amazon S3. The whole encryption and decryption process is taken care of by S3, and you don't need to do anything else besides flipping the switch.
I encrypted some of my files and it says the encrypting was successful but I was never given any key to store.
Because the key storage is also managed by S3.
How does SSE-S3 work?
You upload a file to S3
S3 generates a plain data key 🔑 and encrypts it with the S3 master key, so now there are two blobs which correspond to 🔑 and E(🔑)
S3 encrypts your file using the plain data key 🔑
S3 stores your encrypted file and E(🔑) side by side
S3 servers wipe the plain data key 🔑 from memory
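
Purely to illustrate that envelope-encryption pattern in code (a toy sketch, not how S3 is actually implemented; the cryptography package used here is just an assumption):

from cryptography.fernet import Fernet  # assumed dependency: pip install cryptography

master = Fernet(Fernet.generate_key())      # stands in for the S3-managed master key

# 1. Generate a plain data key and encrypt it under the master key: 🔑 and E(🔑).
data_key = Fernet.generate_key()
encrypted_data_key = master.encrypt(data_key)

# 2. Encrypt the object with the plain data key 🔑.
ciphertext = Fernet(data_key).encrypt(b"contents of your file")

# 3. Store ciphertext and E(🔑) side by side; the plain data key is discarded.
stored = (ciphertext, encrypted_data_key)
del data_key

# On read: decrypt E(🔑) with the master key, then decrypt the object with 🔑.
recovered_key = master.decrypt(stored[1])
plaintext = Fernet(recovered_key).decrypt(stored[0])
assert plaintext == b"contents of your file"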
Once I enable it on a file, is the accessing of that file any different?
No, S3 does all the hard encryption and decryption work for you. You just access the file as normal.
I guess the key is stored for me by the bucket and once I access my bucket, any file I want is automatically decrypted?
You are right. S3 stores E(🔑) side by side with your file. When you access the file, the underlying data is automatically decrypted.
I guess this is to deter people attempting to hack into a bucket and steal all its files?
This prevents malicious people with physical access to the hard drives that hold your data from gaining access to the raw bytes of your files.

cloud storage: how to check md5 on object

I have the following scenario in my bucket: I have a file called red.dat in my storage, and this file is updated regularly by Jenkins. Once the file has been updated, I trigger an event to deploy red.dat. I want to check the MD5 hash of the file before and after the update, and only do the deployment if the value is different.
this is how I upload the file to GCS
gsutil cp red.dat gs://example-bucket
and I have tried this command to get hash
gsutil hash -h gs://example-bucket/red.dat
and the result is this
Hashes [hex] for red.dat:
Hash (crc32c): d4c9895e
Hash (md5): 732b9e36d945f31a6f436a8d19f64671
but I'm a little confused about how to implement the MD5 comparison before and after the update, since the file always stays in a remote location (GCS). I would like some advice or a pointer in the right direction; a solution in shell commands or Ansible is fine.
You can use the gsutil hash command on the local file, and then compare the output with what you saw from gsutil hash against the cloud object:
gsutil hash red.dat
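
If you would rather do the comparison in code than in shell, a small Python sketch using the google-cloud-storage client could look like the following; the state-file location for remembering the last deployed hash is just an assumption:

from pathlib import Path

from google.cloud import storage  # assumed dependency: pip install google-cloud-storage

STATE_FILE = Path("/var/lib/deploy/red.dat.md5")  # hypothetical place to keep the last-seen hash


def needs_deploy(bucket_name="example-bucket", object_name="red.dat"):
    """Return True when the object's MD5 in GCS differs from the last deployed one."""
    blob = storage.Client().bucket(bucket_name).get_blob(object_name)
    current = blob.md5_hash  # base64-encoded MD5 from the object metadata
    previous = STATE_FILE.read_text().strip() if STATE_FILE.exists() else None
    if current == previous:
        return False
    STATE_FILE.parent.mkdir(parents=True, exist_ok=True)
    STATE_FILE.write_text(current)
    return True


if needs_deploy():
    print("hash changed, run the deployment")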

Can I trust aws-cli to re-upload my data without corrupting when the transfer fails?

I extensively use S3 to store encrypted and compressed backups of my workstations. I use the aws cli to sync them to S3. Sometimes, the transfer might fail when in progress. I usually just retry it and let it finish.
My question is: does S3 have some kind of check to make sure that the previously failed transfer didn't leave corrupted files? Does anyone know if syncing again is enough to fix the previously failed transfer?
Thanks!
Individual files uploaded to S3 are never partially uploaded. Either the entire file is completed and S3 stores it as an S3 object, or the upload is aborted and the S3 object is never stored.
Even in the multipart upload case, multiple parts can be uploaded, but they never form a complete S3 object unless all of the pieces are uploaded and the "Complete Multipart Upload" operation is performed. So there is no need to worry about corruption via partial uploads.
Syncing will certainly be enough to fix the previously failed transfer.
Yes, it looks like the AWS CLI does validate what it uploads and takes care of corruption scenarios by employing MD5 checksums.
From https://docs.aws.amazon.com/cli/latest/topic/s3-faq.html
The AWS CLI will perform checksum validation for uploading and downloading files in specific scenarios.
The AWS CLI will calculate and auto-populate the Content-MD5 header for both standard and multipart uploads. If the checksum that S3 calculates does not match the Content-MD5 provided, S3 will not store the object and instead will return an error message back to the AWS CLI.
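
To make the Content-MD5 mechanism concrete, here is a small Python (boto3) sketch of a single-object upload; the file, bucket, and key names are placeholders. If the base64 MD5 sent in the header doesn't match what S3 computes from the received bytes, the put fails and nothing is stored:

import base64
import hashlib

import boto3

s3 = boto3.client("s3")


def upload_with_md5(path, bucket, key):
    """Upload a file and let S3 reject it if the bytes don't match the supplied MD5."""
    with open(path, "rb") as f:
        data = f.read()
    content_md5 = base64.b64encode(hashlib.md5(data).digest()).decode()
    s3.put_object(Bucket=bucket, Key=key, Body=data, ContentMD5=content_md5)


upload_with_md5("backup.tar.gz.gpg", "my-backup-bucket", "workstation/backup.tar.gz.gpg")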