Comparing uncompressed local files to compressed files stored on Amazon S3? - amazon-s3

We put hundreds of image files on Amazon S3 that our users need to synchronize to their local directories. In order to save storage space and bandwidth, we zip the files stored on S3.
On the user's end they have a python script that runs every 5 min to get a current list of files, and download new/updated files.
My question is what's the best way determine what is new or changed to download?
Currently we add an additional header that we put with the compressed file which contains the MD5 value of the uncompressed file...
We start with a file like this:
image_file_1.tif 17MB MD5 = xxxx1234
We compress it (with 7zip) and put it to S3 (with Python/Boto):
image_file_1.tif.z 9MB MD5 = yyy3456 x-amz-meta-uncompressedmd5 = xxxx1234
The problems is we can't get a large list of files from S3 that include the x-amz-meta-uncompressedmd5 header without an additional API for EACH one (SLOW for hundreds/thousands of files).
Our most practical solution is have users get a full list of files (without the extra headers), download the files that do not exist locally. If it does exist locally, then do and additional API call to get the full headers to compare local MD5 checksum against x-amz-meta-uncompressedmd5.
I'm thinking there must be a better way.

You could include the MD5 hash of the uncompressed image into the compressed filename.
So image_file_1.tif could become image_file_1.xxxx1234.tif.z
Your user python file which does the synchronising would therefore have the information needed to determine if it needed to go get the file again from S3, and could either strip out the MD5 part of the filename, or maintain it, depending on what you wanted to do.
Or, you could also maintain, on S3, a single file containing the full file list including the MD5 metadata. So the python script just need to fetch that single file, parse that, and then decide what to do.

Related

Amazon S3: How to safely upload multiple files?

I have two client programs which are using S3 to communicate some information. That information is a list of files.
Let's call the clients the "uploader" and "downloader":
The uploader does something like this:
upload file A
upload file B
upload file C
upload a SUCCESS marker file
The downloader does something lie this:
check for SUCCESS marker
if found, download A, B, C.
else, get data from somewhere else
and both of these programs are being run periodically. The uploader will populate a new directory when it is done, and the downloader will try to get the latest versions of A,B,C available.
Hopefully the intent is clear — I don't want the downloader to see a partial view, but rather get all of A,B,C or skip that directory.
However, I don't think that works, as written. Thanks to eventual consistency, the uploader's PUTs could be reordered into:
upload file B
upload a SUCCESS marker file
upload file A
...
And at this moment, the downloader might run, see the SUCCESS marker, and assume the directory is populated (which it is not).
So what's the right approach, here?
One idea is for the uploader to first upload A,B,C, then repeatedly check that the files are stored, and only after it sees all of them, then finally write the SUCCESS marker.
Would that work?
Stumbled upon similar issue in my project.
If the intention is to guarantee cross-file consistency (between files A,B,C) the only possible solution (purely within s3) is:
1) to put them as NEW objects
2) do not explicitly check for existence using HEAD or GET request prior to the put.
These two constraints above are required for fully consistent read-after-write behavior (https://aws.amazon.com/about-aws/whats-new/2015/08/amazon-s3-introduces-new-usability-enhancements/)
Each time you update the files, you need to generate a unique prefix (folder) name and put this name into your marker file (the manifest) which you are going to UPDATE.
The manifest will have a stable name but will be eventually consistent. Some clients may get the old version and some may get the new one.
The old manifest will point to the old “folder” and the new one will point the new “folder”. Thus each client will read only old files or only new files but never mixed, so cross file consistency will be achieved. Still different clients may end up having different versions. If the clients keep pulling the manifest and getting updated on change, they will eventually become consistent too.
Possible solution for client inconsistency is to move manifest meta data out of s3 into a consistent database (such as dynamo db)
A few obvious caveats with pure s3 approach:
1) requires full set of files to be uploaded each time (incremental updates are not possible)
2) needs eventual cleanup of old obsolete folders
3) clients need to keep pulling manifest to get updated
4) clients may be inconsistent between each other
It is possible to do this single copies in S3. Each file (A B C) will have prepended to it a unique hash or version code [e.g. md5sum generated from the concatenation of all three files.]
In addition the hash value will be uploaded to the bucket as well into a separate object.
When consuming the files, first read the hash file and compare to the last hash successfully consumed. If changed, then read the files and check the hash value within each. If they all match, the data is valid and may be used. If not, the downloaded files should be disgarded and downloaded again (after a suitable delay)..
This will catch the occassional race condition between write and read across multiple objects.
This works because the hash is repeated in all objects. The hash file is actually optional, serving as a low-cost and fast short cut for determining if the data is updated.

Generate A Large File Inside s3 with .NET

I would to generate a big file (several TB) with special format using my C# logic and persist it to S3. What is the best way to do this. I can launch a node in EC2 and then write the big file into EBS and then upload the file from the EBS into S3 using the S3 .net Clinent library.
Can I stream the file content as I am generating in my code and directly stream it to S3 until the generation is done specially for such large file and out of memory issues. I can see this code help with stream but it sounds like the stream should have already filled up with. I obviously can not put such a mount of data to memory and also do not want to save it as a file to the disk first.
PutObjectRequest request = new PutObjectRequest();
request.WithBucketName(BUCKET_NAME);
request.WithKey(S3_KEY);
request.WithInputStream(ms);
s3Client.PutObject(request);
What is my best bet to generate this big file ans stream it to S3 as I am generating it?
You certainly could upload any file up to 5 TB that's the limit. I recommend using the streaming and multipart put operations. Uploading a file 1TB could easily fail in the process and you'd have to do it all over, break it up into parts when you're storing it. Also you should be aware that if you need to modify the file you would need to download the file, modify the file and re-upload. If you plan on modifying the file at all i recommend trying to split it up into smaller files.
http://docs.amazonwebservices.com/AmazonS3/latest/dev/UploadingObjects.html

Ways to achieve de-duplicated file storage within Amazon S3?

I am wondering the best way to achieve de-duplicated (single instance storage) file storage within Amazon S3. For example, if I have 3 identical files, I would like to only store the file once. Is there a library, api, or program out there to help implement this? Is this functionality present in S3 natively? Perhaps something that checks the file hash, etc.
I'm wondering what approaches people have use to accomplish this.
You could probably roll your own solution to do this. Something along the lines of:
To upload a file:
Hash the file first, using SHA-1 or stronger.
Use the hash to name the file. Do not use the actual file name.
Create a virtual file system of sorts to save the directory structure - each file can simply be a text file that contains the calculated hash. This 'file system' should be placed separately from the data blob storage to prevent name conflicts - like in a separate bucket.
To upload subsequent files:
Calculate the hash, and only upload the data blob file if it doesn't already exist.
Save the directory entry with the hash as the content, like for all files.
To read a file:
Open the file from the virtual file system to discover the hash, and then get the actual file using that information.
You could also make this technique more efficient by uploading files in fixed-size blocks - and de-duplicating, as above, at the block level rather than the full-file level. Each file in the virtual file system would then contain one or more hashes, representing the block chain for that file. That would also have the advantage that uploading a large file which is only slightly different from another previously uploaded file would involve a lot less storage and data transfer.

How do services like Dropbox implement delta encoding if their files are stored in the cloud?

Dropbox claims that during syncing only the portion of files that changes are transmitted back to main server, which is obviously a great functionality, but how do they perform changes to files stored in Amazon S3 cloud? So for example, lets say a 30 page document on user's desktop contains changes to only page 4. Dropbox now syncs the blocks representing the changes and what happens on the backend if they files that they store are in the cloud? Does that mean they have to download the 30 page document stored in S3 to their server, then perform replacement of blocks representing page 4, and then uploading back to the cloud? I doubt this would be the case because that would be somewhat inefficient. The other option I could think of is if Amazon S3 provides update of file stored in the cloud based on byte ranges, so for example, make a PUT request to file X from bytes 100-200 which will replace all the bytes from 100 to 200 with value of PUT request. So I was curious how companies that use other cloud services such as Amazon, implement this type of syncing.
Thanks
As S3 and similar storages don't offer filesystem capabilities, anything that pretends to store files and directories needs to emulate a file system. And when doing this files are often split to pages of certain size, where each page is stored in a separate file in the storage. This way the changed block requires uploading only one page (for example) and not the whole file. I should note, that with files like office documents this approach can be faulty if file size is changed - for example, if you insert a page at the beginning or delete a page, then the whole file will be changed and the complete file would need to be re-uploaded. We didn't analyze how Dropbox in particular does his job, and I just described the common scenario. There exist also different "patch algorithms", where a patch can be created locally (if Dropbox has an older local copy in the cache) and then applied to one or more blocks on the server.
There are several synchronizing tools which transfer deltas over the wire like rsync, rdiff, rdiff-backup, etc. For bi-directional synchronising with S3 there are paid services like s3rsync for example. For pure client-side synchronising, tools like zsync can be considered (which is what many people employ to roll-out app updates).
An alternative approach would be to tar-ball a directory, generate a delta file (using rdiff or xdelta3), and upload the delta file by using a timestamp as part of the key. In order to sync, all you need to do is to perform these 2 checks client-side:
You have all the delta files from S3. If not pull them and apply them to generate the latest backup state.
Your last backup state corresponds to your current directory. If not generate a new delta file and push to S3.
The concerning factor here would be the at least 100% additional space utilization, client-side. But this approach will help you revert changes if needed.

Storing uploaded content on a website

For the past 5 years, my typical solution for storing uploaded files (images, videos, documents, etc) was to throw everything into an "upload" folder and give it a unique name.
I'm looking to refine my methods for storing uploaded content and I'm just wondering what other methods are used / preferred.
I've considered storing each item in their own folder (folder name is the Id in the db) so I can preserve the uploaded file name. I've also considered uploading all media to a locked folder, then using a file handler, which you pass the Id of the file you want to download in the querystring, it would then read the file and send the bytes to the user. This is handy for checking access, and restricting bandwidth for users.
I think the file handler method is a good way to handle files, as long as you know to how make good use of resources on your platform of choice. It is possible to do stupid things like read a 1GB file into memory if you don't know what you are doing.
In terms of storing the files on disk it is a question of how many, what are the access patterns, and what OS/platform you are using. For some people it can even be advantageous to store files in a database.
Creating a separate directory per upload seems like overkill unless you are doing some type of versioning. My personal preference is to rename files that are uploaded and store the original name. When a user downloads I attach the original name again.
Consider a virtual file system such as SolFS. Here's how it can solve your task:
If you have returning visitors, you can have a separate container for each visitors (and name it by visitor login, for example). One of the benefits of this approach is that you can encrypt the container using visitor's password.
If you have many probably one-time visitors, you can have one or several containers with files grouped by date of upload.
Virtual file system lets you keep original filenames either as actual filesnames, or as a metadata for the files being stored.
Next, you can compress the data being stored in the container.