How to preserve space and special chars in filename when autostoring files in s3? - uploadcare

I configured an S3 bucket to automatically back up all the files from Uploadcare. But the backed-up files don't have the same names: the names are stripped of spaces and special characters.
For example, if I upload "My Resume.pdf" through Uploadcare, I see "My Resume.pdf" on the CDN but "MyResume.pdf" in S3. This makes it very difficult to map files between the CDN and S3.
Any options to overcome this issue?

Unfortunately, we have to strip special and whitespace characters from the actual file name that is saved to disk during the upload process.
If we did not do that, all sorts of bad things would happen.
The original file name is accessible via the REST API.
If you want to map a file on the Uploadcare CDN to a file backed up to your S3 bucket, use the UUID for this, not the filename (if that doesn't work for you, please explain why in a comment or by updating the question).
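For anyone wanting a concrete starting point, here is a minimal Python sketch of that UUID-based mapping. It assumes the Uploadcare REST file-info endpoint (GET /files/&lt;uuid&gt;/) and the original_filename field described in Uploadcare's REST API docs, and it assumes the backup key in S3 contains the UUID; verify both against the current docs and your own bucket layout before relying on this.

    import boto3
    import requests

    # All of the values below are placeholders for illustration.
    UPLOADCARE_PUBLIC_KEY = "your_public_key"
    UPLOADCARE_SECRET_KEY = "your_secret_key"
    BACKUP_BUCKET = "my-uploadcare-backup"

    def original_filename(uuid):
        """Look up the original file name for a UUID via the Uploadcare REST API.
        Endpoint and field names should be checked against the current API docs;
        you may also need to pin an API version via the Accept header."""
        resp = requests.get(
            "https://api.uploadcare.com/files/%s/" % uuid,
            headers={"Authorization": "Uploadcare.Simple %s:%s"
                     % (UPLOADCARE_PUBLIC_KEY, UPLOADCARE_SECRET_KEY)},
        )
        resp.raise_for_status()
        return resp.json()["original_filename"]

    def backup_keys_for(uuid):
        """List backed-up S3 objects whose key contains the UUID.
        Assumes the backup key includes the UUID; adjust to your bucket layout."""
        s3 = boto3.client("s3")
        pages = s3.get_paginator("list_objects_v2").paginate(Bucket=BACKUP_BUCKET)
        return [obj["Key"] for page in pages
                for obj in page.get("Contents", []) if uuid in obj["Key"]]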

Related

S3 — Auto generate folder structure?

I need to store user-uploaded files in Amazon S3. I'm new to S3, but from what I gathered from the docs, S3 requires me to specify the file's upload path (key) in the PUT request.
I'm wondering if there is a way to send a file to S3 and simply get back a link for http(s) access. I'd like Amazon to handle all the headache related to file/folder structure itself. For example, I just pipe a file from node.js to S3, and in the callback I get an HTTP link with no expiration date, with Amazon itself creating something like /2014/12/01/.../$hash.jpg and returning me the final link. Such a use case seems quite common.
Is this possible? If not, could you suggest any options to simplify the file storage / filesystem tree structure in S3?
Many thanks.
S3 doesn't actually have folders. In a normal filesystem, 2014/12/01/blah.jpg would mean you've got a 2014 folder with a folder called 12 inside it, and so on, but in S3 the entire string 2014/12/01/blah.jpg is the key, essentially a single long filename. You don't have to create any folders.
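Since there are no folders to create, the usual approach is to generate whatever key you like in your own code and build the URL from it. A rough Python/boto3 sketch of the date-plus-hash scheme described above (the bucket name is illustrative, and the object is assumed to be publicly readable, e.g. via a bucket policy):

    import datetime
    import hashlib

    import boto3

    BUCKET = "my-public-uploads"  # hypothetical bucket whose objects are public-read

    def store(file_bytes, extension="jpg"):
        """Upload a blob under a generated key like 2014/12/01/<hash>.jpg
        and return a plain, non-expiring URL."""
        digest = hashlib.sha1(file_bytes).hexdigest()
        key = "%s/%s.%s" % (datetime.date.today().strftime("%Y/%m/%d"),
                            digest, extension)
        boto3.client("s3").put_object(Bucket=BUCKET, Key=key, Body=file_bytes)
        return "https://%s.s3.amazonaws.com/%s" % (BUCKET, key)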

Query regarding cloud file storage services: can I append data to an existing file?

I am working to create an application where some files will be stored in Amazon S3/Rackspace Cloud Files/other similar cloud file storage providers.
There are a couple of scenarios where it would be easier for me if I could append data to an existing file... Is this possible? Or do I have to download the file from Amazon S3, append data to it, and finally upload the modified file back to Amazon S3?
There is no way to append anything to existing objects in S3.
You will have to download the file, modify it, and upload it again.
If you wish, though, you can always upload each piece of new data as a separate object with a tag in its name (a timestamp or a counter), e.g. file_201201011344. When reading, you fetch all objects matching your pattern and concatenate them on the client side.
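A quick Python/boto3 sketch of that read path, assuming the pieces share a common key prefix followed by a sortable timestamp or counter (the names here are only illustrative):

    import boto3

    def read_appended(bucket, prefix):
        """Fetch every object whose key starts with `prefix`, in key order,
        and concatenate their bodies on the client side."""
        s3 = boto3.client("s3")
        keys = []
        for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket,
                                                                  Prefix=prefix):
            keys.extend(obj["Key"] for obj in page.get("Contents", []))
        parts = [s3.get_object(Bucket=bucket, Key=key)["Body"].read()
                 for key in sorted(keys)]  # timestamps/counters sort lexicographically
        return b"".join(parts)

    # e.g. read_appended("my-bucket", "file_")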

Differences in some filenames case after uploading to Amazon S3

I uploaded a lot of files (about 5,800) to Amazon S3, which seemed to work perfectly well, but a few of them (about 30) had their filenames converted to lowercase.
The first time, I uploaded with Cyberduck. When I saw this problem, I deleted them all and re-uploaded with Transmit. Same result.
I see absolutely no pattern that would link the files that got their names changed, it seems very random.
Has anyone had this happen to them?
Any idea what could be going on?
Thank you!
Daniel
First, note that Amazon S3 object URLs are case sensitive. When you uploaded a file with uppercase letters in its name and accessed it with a URL in the same case, it worked. But after the objects were renamed to lowercase, I assume you are still trying the old URLs, so you may get an Access Denied/NoSuchKey error message.
Can you try Bucket Explorer to generate the URL for the Amazon S3 object and then try to access that file?
Disclosure: I work for Bucket Explorer.
When I upload to Amazon servers, I always use FileZilla over SFTP, and I have never had such a problem. I'd guess (and honestly, this is just a guess, since I haven't used Cyberduck or Transmit) that the utilities you're using are doing the filename changing. Try it with FileZilla and see what the result is.

Ways to achieve de-duplicated file storage within Amazon S3?

I am wondering about the best way to achieve de-duplicated (single instance storage) file storage within Amazon S3. For example, if I have 3 identical files, I would like to store the file only once. Is there a library, API, or program out there to help implement this? Is this functionality present in S3 natively? Perhaps something that checks the file hash, etc.
I'm wondering what approaches people have used to accomplish this.
You could probably roll your own solution to do this. Something along the lines of:
To upload a file:
Hash the file first, using SHA-1 or stronger.
Use the hash to name the file. Do not use the actual file name.
Create a virtual file system of sorts to save the directory structure - each file can simply be a text file that contains the calculated hash. This 'file system' should be placed separately from the data blob storage to prevent name conflicts - like in a separate bucket.
To upload subsequent files:
Calculate the hash, and only upload the data blob file if it doesn't already exist.
Save the directory entry with the hash as the content, like for all files.
To read a file:
Open the file from the virtual file system to discover the hash, and then get the actual file using that information.
You could also make this technique more efficient by uploading files in fixed-size blocks - and de-duplicating, as above, at the block level rather than the full-file level. Each file in the virtual file system would then contain one or more hashes, representing the block chain for that file. That would also have the advantage that uploading a large file which is only slightly different from another previously uploaded file would involve a lot less storage and data transfer.
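For what it's worth, here is a minimal boto3 sketch of the whole-file variant of this scheme. The bucket names, the pointer-file layout, and the choice of SHA-256 are illustrative; none of this is native S3 functionality.

    import hashlib

    import boto3
    from botocore.exceptions import ClientError

    s3 = boto3.client("s3")
    BLOB_BUCKET = "dedup-blobs"  # content-addressed blobs, keyed by hash
    FS_BUCKET = "dedup-fs"       # "virtual file system": tiny pointer objects

    def upload(path, logical_name):
        """Store the blob once, keyed by its hash, and record a pointer under its name."""
        with open(path, "rb") as f:
            data = f.read()
        digest = hashlib.sha256(data).hexdigest()
        try:
            s3.head_object(Bucket=BLOB_BUCKET, Key=digest)  # blob already stored?
        except ClientError:
            s3.put_object(Bucket=BLOB_BUCKET, Key=digest, Body=data)
        # The directory entry is just a small text object containing the hash.
        s3.put_object(Bucket=FS_BUCKET, Key=logical_name, Body=digest.encode())

    def download(logical_name):
        """Resolve the pointer, then fetch the actual blob."""
        digest = s3.get_object(Bucket=FS_BUCKET, Key=logical_name)["Body"].read().decode()
        return s3.get_object(Bucket=BLOB_BUCKET, Key=digest)["Body"].read()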

Comparing uncompressed local files to compressed files stored on Amazon S3?

We put hundreds of image files on Amazon S3 that our users need to synchronize to their local directories. In order to save storage space and bandwidth, we zip the files stored on S3.
On the user's end they have a python script that runs every 5 min to get a current list of files, and download new/updated files.
My question is: what's the best way to determine what is new or changed and needs to be downloaded?
Currently we add a metadata header to the compressed file that contains the MD5 value of the uncompressed file...
We start with a file like this:
image_file_1.tif 17MB MD5 = xxxx1234
We compress it (with 7zip) and put it to S3 (with Python/Boto):
image_file_1.tif.z 9MB MD5 = yyy3456 x-amz-meta-uncompressedmd5 = xxxx1234
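For reference, that upload step looks roughly like this with boto3 (the question used the older boto library; the key and metadata names follow the question, the rest is illustrative). boto3 automatically prefixes custom metadata with x-amz-meta-.

    import boto3

    def upload_compressed(bucket, key, compressed_path, uncompressed_md5):
        """Upload the compressed file and record the uncompressed MD5 as user
        metadata, served back as the x-amz-meta-uncompressedmd5 header."""
        with open(compressed_path, "rb") as f:
            boto3.client("s3").put_object(
                Bucket=bucket,
                Key=key,  # e.g. "image_file_1.tif.z"
                Body=f,
                Metadata={"uncompressedmd5": uncompressed_md5},
            )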
The problem is that we can't get a large list of files from S3 that includes the x-amz-meta-uncompressedmd5 header without an additional API call for EACH one (slow for hundreds/thousands of files).
Our most practical solution so far is to have users get a full list of files (without the extra headers) and download the files that do not exist locally. If a file does exist locally, do an additional API call to get the full headers and compare the local MD5 checksum against x-amz-meta-uncompressedmd5.
I'm thinking there must be a better way.
You could include the MD5 hash of the uncompressed image into the compressed filename.
So image_file_1.tif could become image_file_1.xxxx1234.tif.z
The user's Python script that does the synchronising would then have the information it needs to decide whether to fetch the file again from S3, and could either strip the MD5 part out of the filename or keep it, depending on what you want to do.
Or, you could maintain, on S3, a single file containing the full file list including the MD5 metadata. The Python script then just needs to fetch that single file, parse it, and decide what to do.
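A small Python sketch of that manifest idea: the uploader maintains one JSON object on S3 mapping each key to the uncompressed MD5, and the client fetches just that one object to decide what to download (the bucket, manifest key, and JSON layout are all just one possible choice).

    import json

    import boto3

    s3 = boto3.client("s3")
    BUCKET = "image-sync-bucket"    # hypothetical bucket
    MANIFEST_KEY = "manifest.json"  # single object: {key: uncompressed MD5}

    def publish_manifest(entries):
        """entries is a dict like {"image_file_1.tif.z": "xxxx1234", ...}."""
        s3.put_object(Bucket=BUCKET, Key=MANIFEST_KEY,
                      Body=json.dumps(entries).encode("utf-8"))

    def files_to_download(local_md5s):
        """Compare the remote manifest against MD5s of the local uncompressed
        files and return keys that are new or changed."""
        body = s3.get_object(Bucket=BUCKET, Key=MANIFEST_KEY)["Body"].read()
        remote = json.loads(body)
        return [key for key, md5 in remote.items() if local_md5s.get(key) != md5]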