How to manage user profile pic updates on AWS S3?

I am using AWS S3 for saving user profile pics on a mobile app.
How do I guarantee that a request for one of those pics will not return a corrupted file if it arrives while the user is updating their image?
Please note that although those files are small, it could happen that the connection on the mobile app drops, resulting in a stalled upload (possibly even for hours).
My first idea was to upload the new file under a temporary name and, upon completion, delete the original file and rename the uploaded one.
I couldn't find any commands for that in the iOS SDK though.
Another approach would be to append an incrementing number to the filename and always point the database at the new file upon completion. But this creates a big headache and unneeded cleanup complexity, since I am using a denormalized NoSQL database.
Any ideas?

You're worrying about a non-problem.
S3 uploads are atomic. When you overwrite an object on S3, there is zero chance of corrupting a download of the previous object. The object isn't technically "overwritten" -- it is replaced -- a fine distinction, but with a difference -- nothing at all happens to the old object until the replacement upload has finished successfully.
(In fact, it's possible though unlikely that the previous object will still be returned for a short time after the new upload has completed, because of S3's eventual consistency model on overwrites).
Additionally, if you send the Content-MD5 header with an S3 upload, then a failure in the upload process (stall, lost connection, corruption, etc.) will absolutely not allow the replacement object to be stored at all -- S3 will abort the operation and the prior version will remain intact unless the uploaded object can be validated against the Content-MD5 specified. (The SDK should be doing this for you.)
Note that this holds true whether or not object versioning is enabled on the bucket.
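For illustration, here is a minimal sketch of sending the Content-MD5 header with an upload so that a truncated or corrupted transfer is rejected instead of stored. It is written in Python with boto3 (the question concerns the iOS SDK, but the idea is the same), and the bucket and key names are placeholders.

    import base64
    import hashlib

    import boto3

    s3 = boto3.client("s3")

    def upload_profile_pic(bucket, key, data):
        """Upload an image; S3 rejects the object if the bytes it receives
        do not match the MD5 digest computed locally."""
        md5_b64 = base64.b64encode(hashlib.md5(data).digest()).decode("ascii")
        # If the received body doesn't hash to this value, the PUT fails with
        # a BadDigest error and the previous object remains untouched.
        s3.put_object(Bucket=bucket, Key=key, Body=data, ContentMD5=md5_b64)

    # Hypothetical usage:
    # upload_profile_pic("my-app-media", "users/123/profile.jpg", image_bytes)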

Related

S3 API operation failure, garbage handler

I have built, on top of the AWS S3 SDK, an operation which uses the Amazon SDK's copy operation.
I'm using multipart copy, as my object is larger than the maximum size allowed for a single copy (5 GB).
My question is: what happens if all parts of the multipart copy complete successfully except the last one?
Should I handle deleting the parts that have already been copied?
Generally, I'm expecting the copy operation to put the object in a temporary location and only move it to the final name (in the destination S3 bucket) if the operation succeeds. Does it work like that?
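For context, the flow being described looks roughly like this. A sketch in Python with boto3 (the question does not name the SDK language); bucket names, keys and the part size are placeholders.

    import boto3

    s3 = boto3.client("s3")

    SRC = {"Bucket": "source-bucket", "Key": "big-object"}   # placeholders
    DEST_BUCKET, DEST_KEY = "dest-bucket", "big-object-copy"
    PART_SIZE = 1024 * 1024 * 1024                           # 1 GiB per part

    size = s3.head_object(**SRC)["ContentLength"]
    upload_id = s3.create_multipart_upload(Bucket=DEST_BUCKET, Key=DEST_KEY)["UploadId"]

    parts = []
    for i, start in enumerate(range(0, size, PART_SIZE), start=1):
        end = min(start + PART_SIZE, size) - 1
        resp = s3.upload_part_copy(
            Bucket=DEST_BUCKET, Key=DEST_KEY, UploadId=upload_id,
            PartNumber=i, CopySource=SRC, CopySourceRange=f"bytes={start}-{end}",
        )
        parts.append({"ETag": resp["CopyPartResult"]["ETag"], "PartNumber": i})

    # The destination object only comes into existence at this point.
    s3.complete_multipart_upload(
        Bucket=DEST_BUCKET, Key=DEST_KEY, UploadId=upload_id,
        MultipartUpload={"Parts": parts},
    )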
If a part doesn't transfer successfully, you can send it again.
Until the parts are all copied and the multipart upload (including those created using put-part+copy) is completed, you don't have an accessible object... but you are still being charged for storage of what you have successfully uploaded/copied, unless you clean up manually or configure the bucket to automatically purge incomplete multipart objects.
Best practice is to do both -- configure the bucket to discard, but also configure your code to clean up after itself.
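A sketch of both cleanup mechanisms in Python with boto3 (bucket and key names are placeholders, and copy_parts is a hypothetical helper that runs the upload_part_copy loop): abort the multipart upload in code if a part copy fails, and configure a lifecycle rule that purges incomplete multipart uploads automatically.

    import boto3
    from botocore.exceptions import ClientError

    s3 = boto3.client("s3")
    DEST_BUCKET, DEST_KEY = "dest-bucket", "big-object-copy"   # placeholders

    def copy_with_cleanup(copy_parts):
        """copy_parts(upload_id) is assumed to run the upload_part_copy loop
        and return the list of {'ETag', 'PartNumber'} dicts."""
        upload_id = s3.create_multipart_upload(
            Bucket=DEST_BUCKET, Key=DEST_KEY)["UploadId"]
        try:
            parts = copy_parts(upload_id)
            s3.complete_multipart_upload(
                Bucket=DEST_BUCKET, Key=DEST_KEY, UploadId=upload_id,
                MultipartUpload={"Parts": parts},
            )
        except ClientError:
            # Discard whatever parts were copied so they stop accruing charges.
            s3.abort_multipart_upload(
                Bucket=DEST_BUCKET, Key=DEST_KEY, UploadId=upload_id)
            raise

    # Belt and braces: purge multipart uploads that were never completed or
    # aborted (e.g. because the process died mid-copy).
    s3.put_bucket_lifecycle_configuration(
        Bucket=DEST_BUCKET,
        LifecycleConfiguration={"Rules": [{
            "ID": "abort-incomplete-mpu",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},
            "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
        }]},
    )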
It looks like the AWS SDK doesn't write/close the object as an S3 object until the entire object has finished copying successfully.
I ran a simple test to verify whether the parts are written to the destination during the part-copy calls, and it does not write the object to S3 at that point.
So the answer is that a multipart copy won't write the object until all parts have been copied successfully to the destination bucket.
There is no need for cleanup.

Amazon S3: How to safely upload multiple files?

I have two client programs which are using S3 to communicate some information. That information is a list of files.
Let's call the clients the "uploader" and "downloader":
The uploader does something like this:
upload file A
upload file B
upload file C
upload a SUCCESS marker file
The downloader does something like this:
check for SUCCESS marker
if found, download A, B, C.
else, get data from somewhere else
and both of these programs are being run periodically. The uploader will populate a new directory when it is done, and the downloader will try to get the latest versions of A,B,C available.
Hopefully the intent is clear — I don't want the downloader to see a partial view, but rather get all of A,B,C or skip that directory.
However, I don't think that works as written. Thanks to eventual consistency, the effects of the uploader's PUTs could become visible to the downloader in a reordered sequence, such as:
upload file B
upload a SUCCESS marker file
upload file A
...
And at this moment, the downloader might run, see the SUCCESS marker, and assume the directory is populated (which it is not).
So what's the right approach, here?
One idea is for the uploader to first upload A, B, C, then repeatedly check that the files are stored, and only after it sees all of them, finally write the SUCCESS marker.
Would that work?
I stumbled upon a similar issue in my project.
If the intention is to guarantee cross-file consistency (between files A, B, C), the only possible solution (purely within S3) is:
1) put them as NEW objects, and
2) do not explicitly check for their existence with a HEAD or GET request prior to the PUT.
These two constraints are required for fully consistent read-after-write behavior (https://aws.amazon.com/about-aws/whats-new/2015/08/amazon-s3-introduces-new-usability-enhancements/).
Each time you update the files, you need to generate a unique prefix (folder) name and put this name into your marker file (the manifest) which you are going to UPDATE.
The manifest will have a stable name but will be eventually consistent. Some clients may get the old version and some may get the new one.
The old manifest will point to the old “folder” and the new one will point to the new “folder”. Thus each client will read only old files or only new files but never a mix, so cross-file consistency is achieved. Still, different clients may end up with different versions. If the clients keep pulling the manifest and act on changes, they will eventually become consistent too.
A possible solution for this client-side inconsistency is to move the manifest metadata out of S3 into a strongly consistent database (such as DynamoDB).
A few obvious caveats with the pure S3 approach:
1) requires full set of files to be uploaded each time (incremental updates are not possible)
2) needs eventual cleanup of old obsolete folders
3) clients need to keep pulling manifest to get updated
4) clients may be inconsistent between each other
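A minimal sketch of the unique-prefix + manifest pattern described above, in Python with boto3 (bucket, prefix layout and manifest key are placeholder choices):

    import json
    import uuid

    import boto3

    s3 = boto3.client("s3")
    BUCKET = "my-sync-bucket"          # placeholder
    MANIFEST_KEY = "manifest.json"     # stable but eventually consistent name

    def publish(files):
        """Uploader: write A, B, C as NEW objects under a fresh prefix,
        then update the manifest to point at that prefix."""
        prefix = f"sets/{uuid.uuid4()}"
        for name, body in files.items():
            s3.put_object(Bucket=BUCKET, Key=f"{prefix}/{name}", Body=body)
        manifest = {"prefix": prefix, "files": sorted(files)}
        s3.put_object(Bucket=BUCKET, Key=MANIFEST_KEY,
                      Body=json.dumps(manifest).encode())

    def fetch_latest():
        """Downloader: read the manifest (possibly a slightly stale one), then
        read only the files it points at -- never a mixed set."""
        manifest = json.loads(
            s3.get_object(Bucket=BUCKET, Key=MANIFEST_KEY)["Body"].read())
        return {name: s3.get_object(
                    Bucket=BUCKET,
                    Key=f"{manifest['prefix']}/{name}")["Body"].read()
                for name in manifest["files"]}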
It is possible to do this with single copies in S3. Each file (A, B, C) will have a unique hash or version code prepended to it [e.g. an md5sum generated from the concatenation of all three files].
In addition, the hash value will be uploaded to the bucket as a separate object.
When consuming the files, first read the hash file and compare it to the last hash successfully consumed. If it has changed, read the files and check the hash value within each. If they all match, the data is valid and may be used. If not, the downloaded files should be discarded and downloaded again (after a suitable delay).
This will catch the occasional race condition between writes and reads across multiple objects.
This works because the hash is repeated in all objects. The hash file itself is actually optional, serving as a low-cost and fast shortcut for determining whether the data has been updated.
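A sketch of this embedded-hash variant in Python with boto3, following the answer literally by prepending the set-wide hash as the first line of each object (bucket and key names are placeholders):

    import hashlib

    import boto3

    s3 = boto3.client("s3")
    BUCKET = "my-sync-bucket"          # placeholder
    NAMES = ["A", "B", "C"]

    def publish(contents):
        """Prepend the same set-wide hash line to every object, plus a tiny hash file."""
        set_hash = hashlib.md5(b"".join(contents[n] for n in NAMES)).hexdigest()
        for name in NAMES:
            s3.put_object(Bucket=BUCKET, Key=name,
                          Body=set_hash.encode() + b"\n" + contents[name])
        s3.put_object(Bucket=BUCKET, Key="HASH", Body=set_hash.encode())

    def fetch(last_hash):
        """Return the new set if it is complete and consistent, else None (retry later)."""
        set_hash = s3.get_object(Bucket=BUCKET, Key="HASH")["Body"].read().decode()
        if set_hash == last_hash:
            return None                        # nothing new yet
        files = {}
        for name in NAMES:
            body = s3.get_object(Bucket=BUCKET, Key=name)["Body"].read()
            head, _, payload = body.partition(b"\n")
            if head.decode() != set_hash:
                return None                    # mixed old/new set seen -- discard, retry later
            files[name] = payload
        return files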

S3 Eventual Consistency: file parts are lost on PUT with overwrite

I am using Amazon S3 to store a large number of text files.
My software is written in Java, and I am using the official S3 SDK.
Apart from create/delete/retrieve, I often need to append new content to files.
S3 does not support append, so I have implemented an append operation that basically:
- with an S3 GET, obtains the file metadata from S3
- with an S3 GET, downloads the whole file into a local copy
- performs the append on the local copy
- with an S3 PUT, uploads the local file on S3 overwriting the old one.
Appends are never performed concurrently.
I have tested the software, and so far it seems to work well.
And here's my issue: in scenarios where appends are very frequent, big parts of my files get lost when I perform an append.
Might this depend on S3 eventual consistency on overwrite PUTs?
Thanks for your help!
Yes, it could. Eventual consistency means that the next GET of an object may or may not return the results of the last PUT when an object has been overwritten.
Enable bucket versioning and you should easily be able to identify what happens in these events by capturing and logging the object's version-id each time you upload or download it.
If the version you last uploaded isn't the one you subsequently download, that's a sign of eventual consistency causing the issue.
On the other hand, if you actively manage your download by specifically requesting the latest version using its last known version ID (which you'd need to capture when you PUT the object, and store somewhere that offers strongly-consistent reads, like DynamoDB or RDS) then you can always request the latest version explicitly when you download it.
Explicit requests for a specific version of an object solve the problem because they have no consistency limitations -- a given, specified version of an object either exists or doesn't. The consistency issue is related to implicitly fetching the "latest" version of an object. If the specific index replica that happens to serve your request hasn't yet learned of the latest version, it will serve up a prior version.
This holds true whether versioning is enabled, or not, because an overwrite of an object is not truly an overwrite, even in an unversioned bucket. It's a store + update index to new internal storage location + purge old storage location operation. This isn't documented but atomic overwrites and the consistency model dictate that it must necessarily be the case.
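A sketch of this version-tracking approach in Python with boto3 (the asker's code is in Java, and the bucket and DynamoDB table names are placeholders): capture the version ID of every PUT, record it in a strongly consistent store, and GET that exact version later.

    import boto3

    s3 = boto3.client("s3")
    table = boto3.resource("dynamodb").Table("object-versions")   # placeholder table
    BUCKET = "my-text-files"                                      # placeholder bucket

    def put_and_record(key, body):
        """Upload (bucket versioning must be enabled) and record the new version id."""
        version_id = s3.put_object(Bucket=BUCKET, Key=key, Body=body)["VersionId"]
        table.put_item(Item={"key": key, "version_id": version_id})
        return version_id

    def get_latest(key):
        """Read the exact version recorded at write time, sidestepping the
        eventually consistent 'latest version' lookup."""
        version_id = table.get_item(Key={"key": key})["Item"]["version_id"]
        return s3.get_object(Bucket=BUCKET, Key=key,
                             VersionId=version_id)["Body"].read()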

How do services like Dropbox implement delta encoding if their files are stored in the cloud?

Dropbox claims that during syncing only the portions of files that change are transmitted back to the main server, which is obviously great functionality. But how do they apply those changes to files stored in the Amazon S3 cloud? For example, let's say a 30-page document on a user's desktop contains changes to only page 4. Dropbox syncs the blocks representing the changes, but what happens on the backend if the files they store are in the cloud? Does that mean they have to download the 30-page document from S3 to their server, replace the blocks representing page 4, and then upload it back to the cloud? I doubt this is the case, because it would be rather inefficient. The other option I could think of is that Amazon S3 supports updating a stored file by byte range, so for example a PUT request to file X for bytes 100-200 would replace those bytes with the contents of the PUT request. So I was curious how companies that use cloud services such as Amazon implement this type of syncing.
Thanks
As S3 and similar storages don't offer filesystem capabilities, anything that pretends to store files and directories needs to emulate a file system. When doing this, files are often split into pages of a certain size, where each page is stored as a separate object in the storage. This way a changed block requires uploading only one page (for example) and not the whole file. I should note that with files like office documents this approach can be defeated when the file size changes: for example, if you insert a page at the beginning or delete a page, the whole file shifts and the complete file needs to be re-uploaded. We didn't analyze how Dropbox in particular does its job; I have just described the common scenario. There also exist various "patch algorithms", where a patch can be created locally (if Dropbox has an older local copy in its cache) and then applied to one or more blocks on the server.
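A minimal sketch of that page/block-splitting idea in Python with boto3 (an illustration of the general scheme, not how Dropbox actually works; the bucket name and block size are placeholders): split the file into fixed-size blocks, key each block by its content hash, and upload only blocks the store doesn't already have.

    import hashlib

    import boto3
    from botocore.exceptions import ClientError

    s3 = boto3.client("s3")
    BUCKET = "block-store"            # placeholder
    BLOCK_SIZE = 4 * 1024 * 1024      # 4 MiB pages, as an example

    def sync_file(path):
        """Upload only the blocks S3 doesn't already have; return the ordered
        list of block hashes that describes the file (its 'recipe')."""
        recipe = []
        with open(path, "rb") as f:
            while block := f.read(BLOCK_SIZE):
                digest = hashlib.sha256(block).hexdigest()
                recipe.append(digest)
                try:
                    s3.head_object(Bucket=BUCKET, Key=f"blocks/{digest}")
                except ClientError:
                    # Block not stored yet (HEAD returned 404) -- upload it.
                    s3.put_object(Bucket=BUCKET, Key=f"blocks/{digest}", Body=block)
        return recipe

Note that this fixed-size scheme has exactly the caveat described above: an insertion near the start of the file shifts every subsequent block.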
There are several synchronizing tools which transfer deltas over the wire, like rsync, rdiff, rdiff-backup, etc. For bi-directional synchronizing with S3 there are paid services like s3rsync, for example. For pure client-side synchronizing, tools like zsync can be considered (which is what many people use to roll out app updates).
An alternative approach would be to tarball a directory, generate a delta file (using rdiff or xdelta3), and upload the delta file using a timestamp as part of the key. In order to sync, all you need to do is perform these two checks client-side:
You have all the delta files from S3. If not pull them and apply them to generate the latest backup state.
Your last backup state corresponds to your current directory. If not generate a new delta file and push to S3.
The concern here would be at least 100% additional space utilization client-side. But this approach will help you revert changes if needed.
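A sketch of the tarball-plus-delta approach in Python, shelling out to xdelta3 and uploading with boto3 (it assumes xdelta3 is installed; the paths, bucket name and key layout are placeholders):

    import subprocess
    import tarfile
    import time

    import boto3

    s3 = boto3.client("s3")
    BUCKET = "my-backups"             # placeholder

    def push_delta(directory, previous_tar):
        """Tar the directory, diff it against the previous tarball with xdelta3,
        and upload the delta under a timestamped key."""
        new_tar = "current.tar"
        with tarfile.open(new_tar, "w") as tar:
            tar.add(directory)

        delta = "current.vcdiff"
        # xdelta3 flags: -e = encode, -f = overwrite, -s <source> = diff base
        subprocess.run(["xdelta3", "-e", "-f", "-s", previous_tar, new_tar, delta],
                       check=True)

        key = f"deltas/{int(time.time())}.vcdiff"
        s3.upload_file(delta, BUCKET, key)
        return key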

Backup application with single instance functionality

I am currently working on an application that helps you back up the files on your machine to a server (hosted by the company itself), so that you can recover your data after any HDD crash. I have implemented a Single Instance feature across users.
Single Instance: a file already uploaded to the server won't be uploaded again. Whenever another instance of the exact same file is uploaded, no actual upload takes place; only some database changes link it to the previously uploaded file.
The issue arises when the same file (one that has not already been uploaded before) is uploaded simultaneously by more than one user. At the start, the file isn't detected as an existing instance (since the database is updated only after a successful upload/backup), so all the uploads run at once. What would be the best way to implement single instance in this scenario?
I am thinking of letting all the instances upload as-is, so more than one copy of the file will reside on the server. Then, whenever another backup of the same file is taken later, I will remove all the previous copies and link them to a single one. This spares the user a double upload and is also less complex, at the cost of some disk space, and that only for a while (probably until the next upload of the same file).
Thanks for your thoughts in advance.
Calculate the hash (signature) of the file before upload and store it in the DB.
Then start uploading.
If a similar file is marked for upload while the first file is still uploading (you will know because you already saved the hash), hold the second upload until the first one finishes successfully, and then just link it. If the first one fails, you can fall back to the second source and upload from there.
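A minimal sketch of that flow in Python with boto3 and DynamoDB (the table and bucket names are placeholders, and the table's partition key is assumed to be the hash). The conditional write "claims" the hash so only one client actually uploads; for brevity this sketch records the link immediately instead of waiting for the first upload to finish, as the answer suggests.

    import hashlib

    import boto3
    from botocore.exceptions import ClientError

    s3 = boto3.client("s3")
    table = boto3.resource("dynamodb").Table("file-hashes")   # placeholder table
    BUCKET = "backup-store"                                   # placeholder bucket

    def backup(path):
        """Upload the file only if no other client has claimed its hash yet;
        otherwise just link to the existing (or in-flight) copy."""
        with open(path, "rb") as f:
            data = f.read()
        digest = hashlib.sha256(data).hexdigest()

        try:
            # Atomically claim this hash; fails if another client got there first.
            table.put_item(Item={"hash": digest, "status": "uploading"},
                           ConditionExpression="attribute_not_exists(#h)",
                           ExpressionAttributeNames={"#h": "hash"})
        except ClientError as e:
            if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
                return digest        # someone else owns it -- just record the link
            raise

        s3.put_object(Bucket=BUCKET, Key=f"files/{digest}", Body=data)
        table.update_item(Key={"hash": digest},
                          UpdateExpression="SET #s = :done",
                          ExpressionAttributeNames={"#s": "status"},
                          ExpressionAttributeValues={":done": "uploaded"})
        return digest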