Replacing bytes of an uploaded file in Amazon S3 - amazon-s3

I understand that in order to upload a file to Amazon S3 using Multipart, the instructions are here:
http://docs.aws.amazon.com/AmazonS3/latest/dev/llJavaUploadFile.html
How do I go about replacing the bytes (say, between the range 4-1523) of an uploaded file? Do I need to make use of Multipart Upload to achieve this? Or do I fire a REST call with the range specified in the HTTP header?
Appreciate any advice.

Objects in S3 are immutable.
If it's a small object, you'll need to upload the entire object again.
If it's an object over 5MB in size, then there is a workaround that allows you to "patch" a file, using a modified approach to the multipart upload API.
Background:
As you know, a multipart upload allows you to upload a file in "parts," with minimum part size 5MB and maximum part count 10,000.
However a multipart "upload" doesn't mean you have to "upload" all the data again, if some or all of it already exists in S3, and you can address it.
PUT part/copy allows you to "upload" the individual parts by specifying octet ranges in an existing object. Or more than one object.
Since uploads are atomic, the "existing object" can be the object you're in the process of overwriting, since it remains unharmed and in place until you actually complete the multipart upload.
But there appears to be nothing stopping you from using the copy capability to provide the data for the parts you want to leave the same, avoiding the actual upload, and then using a normal PUT part request to upload the parts whose content you want to change.
So, while not a byte-range patch with single-octet granularity, this can be useful for emulating an in-place modification of a large file. Valid "parts" would be, for example, a minimum 5 MB chunk on a 5 MB boundary for files smaller than 50 GB, or a minimum 500 MB chunk on a 500 MB boundary for objects up to 5 TB, with minimum part sizes varying between those two extremes because of the requirement that a multipart upload have no more than 10,000 parts. The catch is that a part must start at an appropriate offset, and you need to replace the whole part.
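To make that concrete, here is a minimal boto3 sketch of the approach, assuming the object is at least ~15 MB and we want to replace its second 5 MiB-aligned chunk; the bucket name, key, offsets, and file names are hypothetical:

import boto3

s3 = boto3.client("s3")
bucket, key = "my-bucket", "big-object.bin"   # hypothetical bucket/key

PART_SIZE = 5 * 1024 * 1024                   # 5 MiB: minimum size for every part except the last
with open("new-chunk.bin", "rb") as f:        # hypothetical file holding the new bytes for chunk 2
    replacement = f.read()

mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)
upload_id = mpu["UploadId"]
parts = []

# Part 1: reuse bytes 0..PART_SIZE-1 of the existing object -- no re-upload needed.
copy1 = s3.upload_part_copy(
    Bucket=bucket, Key=key, UploadId=upload_id, PartNumber=1,
    CopySource={"Bucket": bucket, "Key": key},
    CopySourceRange=f"bytes=0-{PART_SIZE - 1}",
)
parts.append({"PartNumber": 1, "ETag": copy1["CopyPartResult"]["ETag"]})

# Part 2: upload the replacement bytes for the patched chunk.
up2 = s3.upload_part(Bucket=bucket, Key=key, UploadId=upload_id, PartNumber=2, Body=replacement)
parts.append({"PartNumber": 2, "ETag": up2["ETag"]})

# Part 3: reuse everything after the patched range, again by copy.
size = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]
copy3 = s3.upload_part_copy(
    Bucket=bucket, Key=key, UploadId=upload_id, PartNumber=3,
    CopySource={"Bucket": bucket, "Key": key},
    CopySourceRange=f"bytes={2 * PART_SIZE}-{size - 1}",
)
parts.append({"PartNumber": 3, "ETag": copy3["CopyPartResult"]["ETag"]})

# Completing the upload atomically replaces the object with the "patched" version.
s3.complete_multipart_upload(
    Bucket=bucket, Key=key, UploadId=upload_id,
    MultipartUpload={"Parts": parts},
)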

Michael's answer explains the background of the issue pretty well. I'm just adding the actual steps to be performed to achieve this, in case you're wondering.
1. List object parts using ListParts
2. Identify the part that has been modified
3. Start a multipart upload
4. Copy the unchanged parts using UploadPartCopy
5. Upload the modified part
6. Finish the upload to save the modification
Skip step 2 if you already know which part has to be changed.
Tip: Each part has an ETag, which is the MD5 hash of that part. This can be used to verify whether a particular part has been changed.
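For the Tip, a rough boto3 sketch of comparing the ETags of an in-progress multipart upload against local MD5s (the bucket, key, upload ID, part size, and local file name are assumptions; it relies on a part's ETag being the hex MD5 of its bytes, which holds for plain uploads but not, for example, for SSE-KMS-encrypted parts; pagination of ListParts is omitted):

import hashlib
import boto3

s3 = boto3.client("s3")
bucket, key = "my-bucket", "big-object.bin"   # hypothetical names
upload_id = "EXAMPLE-UPLOAD-ID"               # from CreateMultipartUpload / ListMultipartUploads
part_size = 5 * 1024 * 1024                   # same part size used when uploading

# PartNumber -> ETag (hex MD5) as stored by S3 for this upload.
stored = {p["PartNumber"]: p["ETag"].strip('"')
          for p in s3.list_parts(Bucket=bucket, Key=key, UploadId=upload_id).get("Parts", [])}

with open("local-copy.bin", "rb") as f:       # hypothetical local source file
    part_number = 1
    while True:
        chunk = f.read(part_size)
        if not chunk:
            break
        if stored.get(part_number) != hashlib.md5(chunk).hexdigest():
            print(f"part {part_number} is missing or has changed and needs (re)uploading")
        part_number += 1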

Related

Lambda not invoking if the uploaded files are large in size in s3 bucket?

I have created a Lambda function that is invoked and performs a transformation based on events in the target source bucket.
This works fine when I upload a small file to the targeted source bucket.
But when I upload a large file (e.g. a 65 MB file), it looks like the Lambda function is not invoked for that event.
I'd appreciate it if anyone can help with this kind of issue.
Thanks
I am guessing that big files are uploaded to S3 via S3 Multipart Upload instead of a regular put-object operation.
Maybe your Lambda function is only subscribed to s3:ObjectCreated:Put events. You need to add the s3:ObjectCreated:CompleteMultipartUpload event to the Lambda subscription as well.
The large files in S3 are uploaded via S3 Multipart Upload instead of a regular PUT or single part upload process.
There can be two problems:
In your Lambda you have probably created the subscription for s3:ObjectCreated:Put events only. You should add s3:ObjectCreated:CompleteMultipartUpload to the Lambda subscription list as well.
Your Lambda timeout could be too small, which works for the smaller files but not the large ones. You might want to increase it.
There could be any of these issues:
Your event only captures the s3:ObjectCreated:Put event, as others have mentioned. Usually if it's a big file, the event is s3:ObjectCreated:CompleteMultipartUpload instead. You could either (a) add the s3:ObjectCreated:CompleteMultipartUpload event to your capture, or (b) simply use the s3:ObjectCreated:* event (a sketch follows this list) - this will include Put, Multipart Upload, Post, Copy, and also other similar events added in the future (source: https://aws.amazon.com/blogs/aws/s3-event-notification/)
Your Lambda function might run longer than the timeout limit you set (the limit is 15 min).
Your Lambda function requires more memory than the limit you set.
Your Lambda function requires more disk space than the limit you set. This may be an issue if your function downloads the data to disk first and performs the transformation there (the limit is 512 MB).
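For what it's worth, a minimal boto3 sketch of option (b), subscribing the function to all object-created events; the bucket name and function ARN are placeholders, and this call replaces the bucket's existing notification configuration, so any existing rules would need to be included too:

import boto3

s3 = boto3.client("s3")

s3.put_bucket_notification_configuration(
    Bucket="my-source-bucket",                # hypothetical bucket
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                "Id": "invoke-transform-on-any-object-created",
                # Hypothetical function ARN; the function also needs permission for S3 to invoke it.
                "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:transform",
                # Covers Put, Post, Copy and CompleteMultipartUpload.
                "Events": ["s3:ObjectCreated:*"],
            }
        ]
    },
)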

Why shouldn't List Parts be used with Complete Multipart Upload?

The multipart upload overview documentation has, in the Multipart Upload Listings section, the following warning:
Note
Only use the returned listing for verification. You should not use the result of this listing when sending a complete multipart upload request. Instead, maintain your own list of the part numbers you specified when uploading parts and the corresponding ETag values that Amazon S3 returns.
Why?
Why I ask: Let's say I want to support resuming an upload that is interrupted. Doing so means knowing what remains to be uploaded, and therefore what already was uploaded. Knowing this is simpler if I may disregard the above warning. S3 is persisting the list of already-uploaded parts. I can obtain it from List Parts.
Whereas if I heed that warning, instead I'd need to intercept break or kill signals and persist the uploaded parts list locally. Although that's feasible, it seems silly to do this if S3 already has the list.
Furthermore, the warning says to use List Parts "only for verification". OK. Let's say I persist my own list, and compare it to List Parts. If they do not match, what am I going to do? I'm going to believe List Parts -- if S3 doesn't think it has a part, of course I'm going to upload it again. Therefore if List Parts is the ultimate authority, why not simply use it in the first place, and use it alone?
If they do not match, what am I going to do? I'm going to believe List Parts -- if S3 doesn't think it has a part, of course I'm going to upload it again.
You're missing the point of the warning.
It's not so much about whether parts were received. It's about whether they were received intact.
When you complete a multipart upload, you have to send a list of the parts and their etags. The etags are the hex md5sum of each part.
The lazy and careless way to complete a multipart upload would be to blindly submit the etags of the parts by just reading them from the "list" operation.
That is what they are warning against.
The correct way is to use your locally-created list, based on what you think S3 should have received and what you think the ETag of each part should have been, based on the local file.
If you are resuming an upload that was interrupted, you should go back and compare the parts already uploaded (by re-reading and re-checksumming the parts of the local file) against the checksums S3 has calculated for the parts already stored (as returned by the list operation)... then either resend any incorrect or missing parts, or abandon the upload, because the local file may have changed if one or more parts don't match your local calculation.
Additionally, in the interest of data integrity, you should be sending the md5 of each part with the individual part uploads, base64-encoded, with a Content-MD5 header, since this will cause S3 to refuse to accept a part that has been corrupted in any way during the upload.
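A minimal boto3 sketch of that last point, with placeholder bucket, key, upload ID, and part bytes:

import base64
import hashlib
import boto3

s3 = boto3.client("s3")

chunk = b"..."                                # placeholder: the bytes of this part
md5_digest = hashlib.md5(chunk).digest()

response = s3.upload_part(
    Bucket="my-bucket",                       # hypothetical
    Key="big-object.bin",                     # hypothetical
    UploadId="EXAMPLE-UPLOAD-ID",             # hypothetical
    PartNumber=1,
    Body=chunk,
    # Base64-encoded MD5; S3 rejects the part if the bytes it received don't match.
    ContentMD5=base64.b64encode(md5_digest).decode("ascii"),
)

# The returned ETag should equal the hex MD5 of the same bytes -- record it locally
# for the CompleteMultipartUpload request rather than reading it back from ListParts.
assert response["ETag"].strip('"') == hashlib.md5(chunk).hexdigest()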

md5 checksums when uploading to file picker

Background
I'm working on integrating an existing app with File Picker. In our existing setup we are relying on MD5 checksums to ensure data integrity. As far as I can see, File Picker does not provide any MD5 when it responds to an upload against the REST API (nor when using the JavaScript client).
S3 storage, md5 and data integrity
We are using S3 for storage, and as far as I know you may provide S3 with an MD5 checksum when storing files, so that Amazon can verify the data and reject the store request if it appears to be wrong.
To ensure that data is not corrupted traversing the network, use the Content-MD5 header. When you use this header, Amazon S3 checks the object against the provided MD5 value and, if they do not match, returns an error. Additionally, you can calculate the MD5 while putting an object to Amazon S3 and compare the returned ETag to the calculated MD5 value.
I have investigated the ETag header that Amazon returns a bit, and found that it isn't clear what is actually returned as the ETag. The Java documentation states:
Gets the hex encoded 128-bit MD5 hash of this object's contents as computed by Amazon S3.
The Ruby documentation states:
Generally the ETAG is the MD5 of the object. If the object was uploaded using multipart upload then this is the MD5 of all of the upload-part-md5s
Another place in their documentation I found this:
The entity tag is a hash of the object. The ETag only reflects changes to the contents of an object, not its metadata. The ETag is determined when an object is created. For objects created by the PUT Object operation and the POST Object operation, the ETag is a quoted, 32-digit hexadecimal string representing the MD5 digest of the object data. For other objects, the ETag may or may not be an MD5 digest of the object data. If the ETag is not an MD5 digest of the object data, it will contain one or more non-hexadecimal characters and/or will consist of less than 32 or more than 32 hexadecimal digits.
This seems to describe how etag is actually calculated on S3, and this stack overflow post seems to imply the same thing: Etag cannot be trusted to always be equal to the file MD5.
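For reference, the commonly observed (though not officially documented) pattern for multipart ETags is the hex MD5 of the concatenated binary part MD5s, followed by a dash and the part count. A rough sketch, assuming you know the part size that was used for the upload:

import hashlib

def expected_multipart_etag(path, part_size):
    # Reproduce the "md5-of-md5s" ETag pattern commonly seen on multipart uploads.
    part_digests = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(part_size)
            if not chunk:
                break
            part_digests.append(hashlib.md5(chunk).digest())
    if len(part_digests) <= 1:
        # An object small enough for a single simple PUT has a plain MD5 ETag.
        with open(path, "rb") as f:
            return hashlib.md5(f.read()).hexdigest()
    combined = hashlib.md5(b"".join(part_digests)).hexdigest()
    return f"{combined}-{len(part_digests)}"

# e.g. expected_multipart_etag("local-copy.bin", 5 * 1024 * 1024)   # hypothetical file / part size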
So - here are my questions
In general, how does File Picker store files to S3? Are multipart POST requests used?
I see that when I do a HEAD request against, for example, https://www.filepicker.io/api/file/<file handle> I do get an ETag header back. The ETag I get back does indeed match the MD5 of the file I have uploaded. Are the headers returned more or less taken from S3 directly? Or is this actually an MD5 calculated by filepicker which I can trust?
Is it possible to have an explicit statement of the MD5 returned to clients of File Picker's API? For instance, when we POST a file we get a JSON structure back including the URL to the file and its size. Could the MD5 be included here?
Is it possible to provide File Picker with an md5 which in turn will be used when posting files to S3 so we can get an end-to-end check on files?
1. Yes, we use the python boto library, to be specific.
2. The ETag is pulled from S3.
3 & 4. It's been considered and is in our backlog, but hasn't been implemented yet.

Checking filesize before download of multiple files

I'm currently implementing an update feature in an app that I'm building. It uses NSURLConnection and the NSURLConnectionDelegate to download the files and save them to the user's device.
At the moment, one update item downloads multiple files, but I want to display the download of these multiple files using one UIProgressView. So my problem is: how do I get the expected content length of all the files I'm about to download? I know I can get the expectedContentLength of the NSURLResponse object that gets passed into the didReceiveResponse method, but that's just for the file that's currently being downloaded.
Any help is much appreciated. Thanks.
How about having some kind of information file on your server which actually gives you the total bytes? You could load that first and then load your files. Then you can subtract the loaded amount for each file from the total amount.
Another method would be to connect to all files first and cancel the connections after you receive the responses. Add up the expected bytes of all files and then use that as a basis for showing the total progress while loading the files (a sketch of this idea follows below).
Downside of #1: you have to manually keep track of the bytes.
Downside of #2: you'll have the double amount of requests, even though they get cancelled after the response.
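To illustrate the second approach in a language-neutral way (a Python sketch rather than Objective-C, with placeholder URLs): issue a lightweight HEAD-style request per file, sum the Content-Length headers, and use that total to drive the single progress view; on iOS the same idea maps to reading expectedContentLength from each response.

import urllib.request

urls = [
    "https://example.com/files/update-part-1.zip",   # hypothetical URLs
    "https://example.com/files/update-part-2.zip",
]

total_bytes = 0
for url in urls:
    # A HEAD request returns only the headers (including Content-Length), not the body.
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        total_bytes += int(response.headers.get("Content-Length", 0))

print("Expected total download size:", total_bytes, "bytes")
# Progress for the single UIProgressView is then bytes_received_so_far / total_bytes.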
Use the ASIHTTPRequest open-source framework, which is widely used for this purpose.
Here you just need to set the progress view delegate, and it will keep updating your progress view.
Try this:
http://allseeing-i.com/ASIHTTPRequest/

I need Multi-Part DOWNLOADS from Amazon S3 for huge files

I know Amazon S3 added multipart upload for huge files. That's great. What I also need is similar functionality on the client side for customers who get part way through downloading a gigabyte-plus file and have errors.
I realize browsers have some level of retry and resume built in, but when you're talking about huge files I'd like to be able to pick up where they left off regardless of the type of error.
Any ideas?
Thanks,
Brian
S3 supports the standard HTTP "Range" header if you want to build your own solution.
S3 Getting Objects
I use aria2c. For private content, you can use "GetPreSignedUrlRequest" to generate temporary private URLs that you can pass to aria2c.
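GetPreSignedUrlRequest is the .NET SDK's name for this; as a rough equivalent sketch with boto3 (bucket and key are placeholders):

import boto3

s3 = boto3.client("s3")

# Temporary URL for a private object; anyone holding it can GET the object until it expires.
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "my-bucket", "Key": "huge-file.bin"},   # hypothetical bucket/key
    ExpiresIn=3600,                                           # seconds
)
print(url)
# Then, for example:  aria2c -x 8 -s 8 "<that url>"   (aria2c splits the download into ranged segments)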
S3 has a feature called byte-range fetches. It's kind of the download complement to multipart upload:
Using the Range HTTP header in a GET Object request, you can fetch a byte-range from an object, transferring only the specified portion. You can use concurrent connections to Amazon S3 to fetch different byte ranges from within the same object. This helps you achieve higher aggregate throughput versus a single whole-object request. Fetching smaller ranges of a large object also allows your application to improve retry times when requests are interrupted. For more information, see Getting Objects.
Typical sizes for byte-range requests are 8 MB or 16 MB. If objects are PUT using a multipart upload, it’s a good practice to GET them in the same part sizes (or at least aligned to part boundaries) for best performance. GET requests can directly address individual parts; for example, GET ?partNumber=N.
Source: https://docs.aws.amazon.com/whitepapers/latest/s3-optimizing-performance-best-practices/use-byte-range-fetches.html
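A small boto3 sketch of ranged GETs along those lines (bucket, key, and chunk size are assumptions); because each range is an independent request, a failed chunk can simply be retried or resumed:

import boto3

s3 = boto3.client("s3")
bucket, key = "my-bucket", "huge-file.bin"    # hypothetical names
chunk_size = 8 * 1024 * 1024                  # 8 MB ranges, per the guidance above

size = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]

with open("huge-file.bin", "wb") as out:
    offset = 0
    while offset < size:
        end = min(offset + chunk_size, size) - 1
        part = s3.get_object(Bucket=bucket, Key=key, Range=f"bytes={offset}-{end}")
        out.write(part["Body"].read())
        offset = end + 1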
Just updating for the current situation: S3 natively supports multipart GET as well as PUT. https://youtu.be/uXHw0Xae2ww?t=1459.
NOTE: For Ruby users only
Try the aws-sdk gem for Ruby, and download with:
object = Aws::S3::Object.new(...)
object.download_file('path/to/file.rb')
This is because it downloads large files with multipart by default.
Files larger than 5MB are downloaded using the multipart method.
http://docs.aws.amazon.com/sdkforruby/api/Aws/S3/Object.html#download_file-instance_method