Corrupted files when downloading large zip file from S3 - amazon-s3

I have a 1 GB zip file in an S3 bucket. After downloading it, I can't seem to unzip it. It always says:
End-of-central-directory signature not found. Either this file is not
a zipfile, or it constitutes one disk of a multi-part archive. In the
latter case the central directory and zipfile comment will be found on
the last disk(s) of this archive.
Later, I downloaded it again, using s3cmd this time. It says:
WARNING: MD5 signatures do not match: computed=384c9a702c2730a6b46d21606137265d, received="b42099447c7a1a390d8e7e06a988804b-18"
Is there an S3 limitation I need to know about, or is this a bug?

This question seems dead, but I'll answer it for anyone landing here:
Amazon S3's multipart uploads (the kind suited to big files) produce ETag values that no longer match the file's MD5, so if you're using the ETag as a checksum (as your quoted "received" value suggests) it won't work.
The best you can do for validation is to ensure a Content-MD5 header is sent with every part of your multipart upload, so the file cannot be corrupted in transit, and to add your own MD5 as a metadata field for checking the data after download.
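For the second half of that, a minimal sketch with the AWS SDK for Node.js (the bucket, key and metadata field name are illustrative, and the whole file is buffered for brevity):
// Attach your own MD5 as object metadata at upload time, then compare it
// against the downloaded bytes instead of relying on the multipart ETag.
var crypto = require('crypto');
var fs = require('fs');
var AWS = require('aws-sdk');
var s3 = new AWS.S3();

var body = fs.readFileSync('backup.zip');
var md5 = crypto.createHash('md5').update(body).digest('hex');

s3.upload({ Bucket: 'my-bucket', Key: 'backup.zip', Body: body, Metadata: { md5: md5 } }, function (err) {
  if (err) throw err;
  // Later, after downloading, recompute and compare.
  s3.getObject({ Bucket: 'my-bucket', Key: 'backup.zip' }, function (err, data) {
    if (err) throw err;
    var downloaded = crypto.createHash('md5').update(data.Body).digest('hex');
    console.log(downloaded === data.Metadata.md5 ? 'checksums match' : 'file corrupted');
  });
});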

Thanks @ergoithz for reminding me that I had this question :) The problem has since been fixed; the AWS SDK for Node.js was the culprit. Apparently it could not upload large files from a stream created with fs.createReadStream(), so I switched to Knox, where it worked perfectly.
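For reference, the Knox streaming upload ended up looking roughly like this (a sketch; credentials, bucket and paths are placeholders):
var fs = require('fs');
var knox = require('knox');

var client = knox.createClient({
  key: '<api-key>',
  secret: '<secret>',
  bucket: 'my-bucket'
});

var file = '/tmp/large.zip';
var stream = fs.createReadStream(file);
var headers = {
  'Content-Length': fs.statSync(file).size, // Knox needs the length up front
  'Content-Type': 'application/zip'
};

client.putStream(stream, '/large.zip', headers, function (err, res) {
  if (err) throw err;
  console.log('upload finished with status', res.statusCode);
});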

Related

AWS S3 File Integrity and multipart practicality

I am trying to build file integrity verification on top of AWS S3, but I have been dumbfounded by S3's multipart gotcha.
As per the documentation:
When using MD5, Amazon S3 calculates the checksum of the entire multipart object after the upload is complete. This checksum is not a checksum of the entire object, but rather a checksum of the checksums for each individual part.
My particular usecase is distribution of binaries, I produce a manifest of the uploaded binaries which contains the SHA256 of each file, and then sign the manifest with my PGP key.
There is no straightforward way for me either to upload a binary with an appropriate whole-file checksum, or to check the checksum of a binary while it's still on S3 without downloading it, even using the getObjectAttributes API call.
My current understanding for an integrity implementation based on what is suggested by AWS's documentation is:
Chunk each binary to be uploaded locally, based on AWS S3's multipart specifications.
Produce the SHA256 for each of those chunks, BASE64 encoded.
Concatenate the chunks' SHA256 digests and hash the result to produce the "Object's checksum".
Store all the chunks' SHA256 digests as well as the "Object's checksum" in my manifest.
So I can then:
Invoke the getObjectAttributes API call to get the "Object's checksum" from AWS and compare it with my manifest.
Use the manifest's stored chunk SHA256 digests to verify reliably on the local side, by repeating the chunk-and-hash process described above once I have the binary downloaded.
Is that really what AWS expects us to implement for integrity verification? Am I missing something glaringly obvious about how to implement an end-to-end file integrity check, given this very particular way S3 chooses to checksum large files?
FWIW, my stack is Node.js and I am using the AWS SDK.
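For what it's worth, the chunk-and-hash scheme described above can be reproduced locally along these lines (a Node.js sketch; the 8 MB part size is an assumption and must match whatever part size the upload actually used, and the whole file is buffered for brevity):
var crypto = require('crypto');
var fs = require('fs');

// Compute per-part SHA256 digests and the S3-style "checksum of checksums".
function compositeSha256(filePath, partSize) {
  partSize = partSize || 8 * 1024 * 1024;
  var buf = fs.readFileSync(filePath);
  var partDigests = [];
  for (var offset = 0; offset < buf.length; offset += partSize) {
    var part = buf.slice(offset, offset + partSize);
    partDigests.push(crypto.createHash('sha256').update(part).digest()); // raw bytes
  }
  // The object-level value is the SHA256 of the concatenated raw part digests,
  // base64-encoded, which (as I understand it) is what getObjectAttributes
  // reports for multipart uploads.
  var composite = crypto.createHash('sha256').update(Buffer.concat(partDigests)).digest('base64');
  return {
    parts: partDigests.map(function (d) { return d.toString('base64'); }),
    composite: composite
  };
}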

Upload compressed file to S3 -> download with CloudFront -> different directory depth after extracting

I uploaded a compressed file (extension: .tar.gz) and often download it via its CloudFront URL. But whenever I extract it after downloading, there are cases where the directory structure changes: unlike the original layout, the archive sometimes extracts with one extra level of directory depth.
Does anyone know why this is happening? If you do, I hope you can help me. Thank you.

Can I trust aws-cli to re-upload my data without corruption when a transfer fails?

I extensively use S3 to store encrypted and compressed backups of my workstations, and I use the AWS CLI to sync them to S3. Sometimes a transfer fails while in progress; I usually just retry it and let it finish.
My question is: does S3 have some kind of check to make sure that the previously failed transfer didn't leave corrupted files behind? And does anyone know whether syncing again is enough to fix a previously failed transfer?
Thanks!
Individual files uploaded to S3 are never partially uploaded. Either the entire file completes and S3 stores it as an S3 object, or the upload is aborted and the S3 object is never stored.
Even in the multipart upload case, multiple parts can be uploaded, but they never form a complete S3 object unless all of the pieces are uploaded and the "Complete Multipart Upload" operation is performed. So there is no need to worry about corruption via partial uploads.
Syncing will certainly be enough to fix the previously failed transfer.
Yes, it looks like the AWS CLI does validate what it uploads and handles corruption scenarios by employing MD5 checksums.
From https://docs.aws.amazon.com/cli/latest/topic/s3-faq.html
The AWS CLI will perform checksum validation for uploading and downloading files in specific scenarios.
The AWS CLI will calculate and auto-populate the Content-MD5 header for both standard and multipart uploads. If the checksum that S3 calculates does not match the Content-MD5 provided, S3 will not store the object and instead will return an error message back to the AWS CLI.
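The same check can be reproduced when uploading through the SDK by supplying the header yourself (a sketch with the AWS SDK for Node.js; bucket and key are placeholders):
var crypto = require('crypto');
var fs = require('fs');
var AWS = require('aws-sdk');
var s3 = new AWS.S3();

var body = fs.readFileSync('backup.tar.gz');
// Content-MD5 is the base64-encoded MD5 of the body; S3 refuses the PUT
// if its own computation disagrees.
var contentMd5 = crypto.createHash('md5').update(body).digest('base64');

s3.putObject({
  Bucket: 'my-bucket',
  Key: 'backups/backup.tar.gz',
  Body: body,
  ContentMD5: contentMd5
}, function (err) {
  if (err) throw err; // a BadDigest error here means the body was corrupted in transit
  console.log('stored with verified checksum');
});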

How to detect that a file is being uploaded over FTP

My application watches a set of folders where users can upload files. When a file upload is finished I have to process the file, but I don't know how to detect that a file has not finished uploading yet.
Is there any way to detect whether a file has not yet been released by the FTP server?
There's no generic solution to this problem.
Some FTP servers lock the file being uploaded, preventing you from accessing it, while the file is still being uploaded. For example IIS FTP server does that. Most other FTP servers do not. See my answer at Prevent file from being accessed as it's being uploaded.
There are some common workarounds to the problem (originally posted in SFTP file lock mechanism, but relevant for the FTP too):
You can have the client upload a "done" file once the upload finishes. Make your automated system wait for the "done" file to appear.
You can have a dedicated "upload" folder and have the client (atomically) move the uploaded file to a "done" folder. Make your automated system look to the "done" folder only.
Have a file naming convention for files being uploaded (".filepart") and have the client (atomically) rename the file to its final name after the upload. Make your automated system ignore the ".filepart" files (a sketch of this approach follows this list).
See (my) article Locking files while uploading / Upload to temporary file name for an example of implementing this approach.
Also, some FTP servers have this functionality built-in. For example ProFTPD with its HiddenStores directive.
A gross hack is to periodically check the file attributes (size and time) and consider the upload finished if the attributes have not changed for some time interval.
You can also make use of the fact that some file formats have a clear end-of-file marker (like XML or ZIP), so you can tell that the file is incomplete.
Some FTP servers allow you to configure a hook to be called, when an upload is finished. You can make use of that. For example ProFTPD has a mod_exec module (see the ExecOnCommand directive).
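As an illustration of the ".filepart" rename convention above, a minimal Node.js sketch (the directory path is illustrative, and it assumes the client renames the file atomically once the upload completes):
var fs = require('fs');
var path = require('path');

var watchDir = '/srv/ftp/incoming'; // hypothetical upload folder

// Ignore in-progress ".filepart" uploads; react only when a file appears
// under its final name, i.e. after the client's atomic rename.
fs.watch(watchDir, function (eventType, filename) {
  if (!filename || /\.filepart$/.test(filename)) return; // still uploading
  var fullPath = path.join(watchDir, filename);
  fs.stat(fullPath, function (err, stats) {
    if (err || !stats.isFile()) return;
    console.log('upload finished, processing', fullPath);
    // ...apply your processing here...
  });
});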
I use ftputil to implement this work-around:
connect to ftp server
list all files of the directory
call stat() on each file
wait N seconds
For each file: call stat() again. If the result is different, skip this file, since it was modified in the meantime.
If the stat() result is unchanged, download the file.
This whole FTP-fetching approach is old, obsolete technology. I hope the customer will use a modern HTTP API next time :-)
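For a Node.js stack like the ones elsewhere on this page, a rough equivalent of this polling approach could look like the following (a sketch assuming the third-party basic-ftp package; the host, credentials and wait interval are placeholders):
const ftp = require('basic-ftp');

// List the directory twice, N milliseconds apart, and treat only the files
// whose size did not change in between as fully uploaded.
async function stableFiles(dirPath, waitMs) {
  const client = new ftp.Client();
  await client.access({ host: 'ftp.example.com', user: 'user', password: 'secret' });
  const before = await client.list(dirPath);
  await new Promise((resolve) => setTimeout(resolve, waitMs));
  const after = await client.list(dirPath);
  client.close();

  const sizes = new Map(before.map((f) => [f.name, f.size]));
  return after.filter((f) => sizes.get(f.name) === f.size).map((f) => f.name);
}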
If you are watching for files with particular extensions, consider using WinSCP for the transfer. It creates a temporary file with the extension .filepart and renames it to the actual file name once the file has been fully transferred.
I hope it helps someone.
This is a classic problem with FTP transfers. The only mostly reliable method I've found is to send a file, then send a second short "marker" file just to tell the recipient the transfer of the first is complete. You can use a file naming convention and just check for existence of the second file.
You might get fancy and make the content of the second file a checksum of the first file. Then you could verify the first file. (You don't have the problem with the second file because you just wait until file size = checksum size).
And of course this only works if you can get the sender to send a second file.
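A minimal sketch of the sending side of that idea (the file names and the choice of MD5 are just for illustration):
var crypto = require('crypto');
var fs = require('fs');

// After sending data.bin, write and send data.bin.md5 containing the hex MD5
// of data.bin. The receiver waits for the .md5 file, then verifies the data
// file against its contents before processing it.
function writeChecksumMarker(dataPath) {
  var md5 = crypto.createHash('md5').update(fs.readFileSync(dataPath)).digest('hex');
  fs.writeFileSync(dataPath + '.md5', md5);
}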

How to receive an uploaded file using node.js formidable library and save it to Amazon S3 using knox?

I would like to upload a form from a web page and directly save the file to S3 without first saving it to disk. This node.js app will be deployed to Heroku, where there is no local disk to save the file to.
The node-formidable library provides a great way to upload files and save them to disk. I am not sure how to turn off formidable (or connect-form) from saving file first. The Knox library on the other hand provides a way to read a file from the disk and save it on Amazon S3.
1) Is there a way to hook into formidable's events (on('data')) and feed the stream into Knox, so that I can save the uploaded file directly to my Amazon S3 bucket?
2) Are there any libraries or code snippets that would allow me to take the uploaded file and save it directly to Amazon S3 using node.js?
There is a similar question here but the answers there do not address NOT saving the file to disk.
It looks like there is no good way to do it. One reason might be that the node-formidable library saves the uploaded file to disk; I could not find any option to do otherwise. The Knox library then takes the file saved on disk and, using your Amazon S3 credentials, uploads it to Amazon.
Since I cannot save files locally on Heroku, I ended up using the Transloadit service. Although their authentication docs have a bit of a learning curve, I found the service useful.
For those who want to use Transloadit with node.js, the following code sample may help (the Transloadit page had only Ruby and PHP examples):
// Compute the Transloadit request signature: an HMAC-SHA1 of the request's
// params, keyed with your auth secret and hex-encoded.
var crypto = require('crypto');
var signature = crypto.createHmac('sha1', 'auth secret')
  .update('some string') // replace 'some string' with the actual params payload
  .digest('hex');
console.log(signature);
This is Andy, creator of AwsSum:
https://github.com/appsattic/node-awssum/
I just released v0.2.0 of this library. It uploads the files that were created by Express's bodyParser(), though as you say, this won't work on Heroku:
https://github.com/appsattic/connect-stream-s3
However, I shall be looking at adding the ability to stream from formidable directly to S3 in the next (v0.3.0) version. For the moment though, take a look and see if it can help. :)
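For anyone landing here later, streaming a formidable part straight to S3 without touching disk might look roughly like this (a sketch, assuming the aws-sdk package and formidable's onPart hook; property names such as part.filename vary between formidable versions):
var formidable = require('formidable');
var AWS = require('aws-sdk');
var s3 = new AWS.S3();

function handleUpload(req, res) {
  var form = new formidable.IncomingForm();

  // Intercept file parts before formidable writes them to disk and pipe them
  // into a managed S3 upload, which handles multipart chunking for us.
  form.onPart = function (part) {
    if (!part.filename) return form.handlePart(part); // let ordinary fields through
    s3.upload({ Bucket: 'my-bucket', Key: part.filename, Body: part }, function (err, data) {
      if (err) return res.end('upload failed');
      res.end('stored at ' + data.Location);
    });
  };

  form.parse(req);
}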