s3cmd: "WARNING: MD5 signatures do not match:" - what should I do?

When I use s3cmd to pull down files (of not unreasonable size - less than 100 megabytes) I occasionally see this error:
WARNING: MD5 signatures do not match: computed=BLAH, received="NOT-BLAH"
Googling suggests that this may be caused by the way S3 segments files. Others have said to ignore it.
Does anybody know why this happens and what the right thing to do is?
Thank you for your time,
-- Henry

Looking into this deeper, it seems as though s3cmd is reading the wrong MD5 sum from Amazon: it is getting its sum from the ETag field. Comparing the actual data of the object that was PUT with the object that was GET'ed, the contents are identical, so this error can be safely ignored.

The ETag of a file in S3 will not match the MD5 if the file was uploaded as "Multipart". When a file is uploaded multipart, AWS hashes each part, concatenates the per-part digests, hashes that value, and appends "-N", where N is the number of parts.
If the upload actually had only one part, the result is still a hash of that part's hash, with "-1" appended. Try disabling multipart in the tool you use to upload files to S3. For s3cmd, the option is --disable-multipart.
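If you want to sanity-check this yourself, a rough sketch (the bucket and file names are placeholders) is to upload with multipart disabled and compare the local MD5 against what s3cmd reports for the remote object:
# Local MD5 of the file
md5sum bigfile.dat
# Upload as a single PUT so the ETag is a plain MD5
s3cmd put --disable-multipart bigfile.dat s3://my-bucket/bigfile.dat
# The "MD5 sum" line in the output should now match the local value
s3cmd info s3://my-bucket/bigfile.dat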

ETags with a '-' in them are expected, if the file was uploaded using the S3 Multipart Upload feature (typically used for files >15MB or files read from stdin). s3cmd 1.5.2 knows this and ignores such ETags. If your s3cmd is older than 1.5.2, please upgrade.
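To check which version you are running (and to upgrade, assuming your s3cmd was installed from PyPI; otherwise use your distribution's package manager):
s3cmd --version
pip install --upgrade s3cmd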

This is a bigger problem if you are using s3cmd sync, because it causes s3cmd to re-download previously-synced files. To solve this, add the --no-check-md5 option, which makes s3cmd check only file sizes to determine changed files (this is good for my purposes, but probably not for everyone, depending on the application).
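For example (a sketch; the bucket and local paths are placeholders):
# Compare by size only, so multipart ETags don't trigger re-downloads
s3cmd sync --no-check-md5 s3://my-bucket/logs/ /var/backups/logs/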

I saw reports about an hour ago that S3 is currently having exactly this problem, e.g. this tweet:
RT @drags: @ylastic S3 returning incorrect md5s to s3cmd as well. Never seen an md5 with a '-' in it, until AWS. #AWS #S3
Though the AWS Status Page reports no issue, I expect this is a transient problem. Try again soon :-)

Related

SilverStripe 4: large files in UploadField

When uploading a large file with an UploadField I get the error:
"Server responded with an error.
Expected a value of type "Int" but received: 4008021167"
To set the allowed file size I used $upload->getValidator()->setAllowedMaxFileSize(6291456000);
$upload is an UploadField.
Every file larger than 2 GB gets this error; smaller files are uploaded without any error.
Where can I adjust this so that I can upload bigger files?
I remember that there has been a 2 GB limit in the past, but I don't know where to adjust it.
Thanks for your answers,
Klaus
The regular file upload limits don't seem to be the issue if you are already at 2 GB. This might be the memory limit of the process itself. I would recommend looking into chunked uploads - that allows you to process larger files.
I know this answer is late, but the problem is rooted in the GraphQL type definition of the File type: its size field is declared as Int, and GraphQL's Int is a signed 32-bit integer, so any size above 2,147,483,647 bytes (about 2 GB) is rejected - which is exactly why 4008021167 fails. I've submitted a pull request to the upstream repository. Also, here is the sed one-liner to patch it:
sed -i 's/size\: Int/size\: Float/g' vendor/silverstripe/asset-admin/_graphql/types/File.yml
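To confirm the patch took effect (the path is the one from the sed command above; the exact layout of the YAML may differ between versions):
grep -n 'size:' vendor/silverstripe/asset-admin/_graphql/types/File.yml
# should now show "size: Float" instead of "size: Int"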

Moving files >5 gig to AWS S3 using a Data Pipeline

We are experiencing problems with files produced by Java code which are written locally and then copied by the Data Pipeline to S3. The error mentions file size.
I would have thought that if a multipart upload is required, the Pipeline would figure that out. I wonder if there is a way of configuring the Pipeline so that it does use multipart uploading. Otherwise the current Java code, which is agnostic about S3, either has to write directly to S3 or has to do what it used to do and then use multipart uploading separately -- in fact, I would think the code should just write directly to S3 and not worry about uploading.
Can anyone tell me whether Pipelines can use multipart uploading and, if not, whether the correct approach is to have the program write directly to S3, or to continue writing to local storage and then have a separate program invoked within the same Pipeline do the multipart uploading?
The answer, based on AWS support, is that files over 5 GB indeed can't be uploaded directly to S3. And there is no way currently for a Data Pipeline to say, "You are trying to upload a large file, so I will do something special to handle this." It simply fails.
This may change in the future.
Data Pipeline CopyActivity does not support files larger than 4GB. http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-copyactivity.html
This is below the 5GB limit imposed by S3 for each file-part put.
You need to write your own script wrapping AWS CLI or S3cmd (older). This script may be executed as a shell activity.
Writing directly to S3 may be an issue as S3 does not support append operations - unless you can somehow write multiple smaller objects in a folder.
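As a sketch of that shell-activity approach (bucket and paths are placeholders), the AWS CLI's high-level copy command switches to multipart upload automatically for large files, so the wrapper script run by a ShellCommandActivity can be as small as:
#!/bin/bash
# aws s3 cp uses multipart upload automatically once the file
# exceeds the CLI's multipart threshold, so >5 GB files work.
set -e
aws s3 cp /mnt/output/bigfile.dat s3://my-bucket/exports/bigfile.dat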

Differences in some filenames case after uploading to Amazon S3

I uploaded a lot of files (about 5,800) to Amazon S3, which seemed to work perfectly well, but a few of them (about 30) had their filenames converted to lowercase.
The first time, I uploaded with Cyberduck. When I saw this problem, I deleted them all and re-uploaded with Transmit. Same result.
I see absolutely no pattern that would link the files that got their names changed, it seems very random.
Has anyone had this happen to them?
Any idea what could be going on?
Thank you!
Daniel
First, note that Amazon S3 object URLs are case sensitive. When you uploaded the file with upper-case characters and accessed it with the matching URL, it worked; but once the objects were renamed to lower case, if you are still requesting the old URL you may get an Access Denied/NoSuchKey error message.
Can you try Bucket Explorer to generate the URL for the Amazon S3 object and then try to access that file?
Disclosure: I work for Bucket Explorer.
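To see the case sensitivity directly (placeholder bucket and key names), the two spellings below are entirely separate keys, and only the one that actually exists will be listed:
s3cmd ls s3://my-bucket/Photo.JPG
s3cmd ls s3://my-bucket/photo.jpg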
When I upload to Amazon servers, I always use Filezilla and SFTP, and I never had such a problem. I'd guess (and honestly, this is just a guess, since I haven't used Cyberduck or Transmit) that the utilities you're using are doing the filename changing. Try it with Filezilla and see what the result is.

Amazon MapReduce input splitting and downloading

I'm new to EMR and have a few questions I have been struggling with for the past few days. The first is that the logs I want to process are already compressed as .gz, and I was wondering whether these files can be split by EMR so that more than one mapper will work on a file. I have also been reading that input files will not be split unless they are 5 GB; my files are not that large, so does that mean they will only be processed by one instance?
My other question might seem relatively dumb, but is it possible to use EMR + streaming and have an input somewhere other than S3? It seems redundant to have to download the logs from the CDN and then upload them to my S3 bucket to run MapReduce on them. Right now they download onto my server and my server uploads them to S3. Is there a way to cut out the middleman and have them go straight to S3, or to run the inputs off my server?
are already compressed as .gz, and I was wondering whether these files can be split by EMR so that more than one mapper will work on a file
Alas, no, straight gzip files are not splittable. One option is to just roll your log files more frequently; this very simple solution works for some people, though it's a bit clumsy.
I have also been reading that input files will not be split unless they are 5 GB,
This is definitely not the case. If a file is splittable, you have lots of options for how you want to split it, e.g. configuring mapred.max.split.size. I found [1] to be a good description of the options available; see the sketch after the links below.
is it possible to use EMR + streaming and have an input somewhere other than S3?
Yes. Elastic MapReduce now supports VPC, so you could connect directly to your CDN [2].
[1] http://www.scribd.com/doc/23046928/Hadoop-Performance-Tuning
[2] http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/EnvironmentConfig_VPC.html?r=146
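As a rough illustration of the split-size point (placeholder buckets; the streaming jar path varies by Hadoop/EMR version, and plain .gz input still won't split, so this assumes an uncompressed or otherwise splittable input), a streaming job can cap the split size with a -D option:
hadoop jar /home/hadoop/contrib/streaming/hadoop-streaming.jar \
  -D mapred.max.split.size=134217728 \
  -input s3://my-bucket/logs/ \
  -output s3://my-bucket/output/ \
  -mapper cat \
  -reducer cat
Here 134217728 caps each split at 128 MB, and the trivial cat mapper/reducer are just stand-ins to show where the option goes.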

robocopy, jungledisk file copy problems

I'm a huge fan of robocopy and use it extensively to copy between various servers I need to update.
Lately I've been archiving to an Amazon S3 account that I access via a mapped drive using JungleDisk. I then robocopy my files from local PC to S3.
Sometimes I get a very strange 'Incorrect function' error message in robocopy and the file fails to copy. I've tried xcopy and straightforward copy and paste between file explorer windows. In each case I get some variation of the 'Incorrect function' or 'Illegal MS-DOS function' and the file will never copy.
I've deleted the target, but to no avail.
Any ideas?
Don't know if you're allowed to answer your own questions, but I think I've fixed it...
I found this in the JungleDisk support forums:
The quick solution is to zip the files, delete the originals, then unzip the files, because zip can't handle extended attributes. Another solution is to move them to a FAT filesystem, then move them again to an NTFS filesystem, because FAT doesn't manage extended attributes. In both cases the result is the deletion of the extended attributes, and the files can then be moved to the JungleDisk. The files can have extended attributes for different reasons, especially after migrations from other filesystems: in my case it was the migration of a CVS repository from an ext2 filesystem to NTFS.
Seems to have worked for me...
I've had similar issues from both OS X and Linux. At first I was not concerned, but then it occurred to me that these issues could result in data corruption or backup failure, so I have abandoned JungleDisk for everything except my lightweight work.
Zipping/tarring files was not an option for me because of the size of my data set; with that approach you have to upload your entire data set each and every time.
I'm not sure which attributes you're referring to, but could you run robocopy with the /COPY:DT switch to strip off the attributes?
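For example (a sketch; the source directory and the JungleDisk drive letter are placeholders), /COPY:DT copies only file data and timestamps, leaving the attributes behind:
robocopy C:\data X:\backup /E /COPY:DT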