Some quick questions:
Does S3 support soft link?
On an S3 bucket mounted on a Linux EC2 instance, I can't access a directory created on the instance from the AWS UI, although the files I create are accessible.
Thanks
Amazon S3 is an object store, not a filesystem. It has a specific set of APIs for uploading, listing, downloading, etc but it does not behave like a normal filesystem.
There are some utilities that can mount S3 as a filesystem (eg Expandrive, Cloudberry Drive, s3fs), but in the background these utilities are actually translating requests into API calls. This can cause some issues -- for example, you can modify a 100MB file on a local disk by just writing one byte to disk. If you wish to modify one byte on S3, you must upload the whole object again. This can cause synchronization problems between your computer and S3, so such methods are not recommended for production situations. (However, they're a great way of uploading/downloading initial data.)
A good in-between option is to use the AWS Command-Line Interface (CLI), which has commands such as aws s3 cp and aws s3 sync, which are reliable ways to upload/download/sync files with Amazon S3.
To answer your questions...
Amazon S3 does not support a "soft link" (symbolic link). Amazon S3 is an object store, not a file system, so it only contains objects. Objects can also have metadata, which is often used for cache control, redirection, classification, etc.
Amazon S3 does not support directories (sort of). Amazon S3 objects are kept within buckets, and the buckets are 'flat' -- they do not contain directories/sub-folders. However, it does maintain the illusion of directories. For example, if file bar.jpg is stored in the foo directory, then the Key (filename) of the object is foo/bar.jpg. This makes the object 'appear' to be in the foo directory, but that's not how it is stored. The AWS Management Console maintains this illusion by allowing users to create and open Folders, but the actual data is stored 'flat'.
This leads to some interesting behaviours:
You do not need to create a directory to store an object in the directory.
Directories don't exist. Just store a file called images/cat.jpg and the images directory magically appears (even though it doesn't exist).
You cannot rename objects. The Key (filename) is a unique identifier for the object. To 'rename' an object, you must copy it to a new Key and delete the original.
You cannot rename a directory. They don't exist. Instead, rename all the objects within the directory (which really means you have to copy the objects, then delete their old versions).
You might create a directory but not see it. Amazon S3 keeps track of CommonPrefixes to assist in listing objects by path, but it doesn't create traditional directories. So, don't get worried if you create a (pretend) directory and then don't see it. Just store your object with a full-path name and the directory will 'appear'.
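The way "directories" fall out of a flat list of Keys can be sketched in a few lines of plain Python. This is only a local simulation of a delimiter-based listing, not a call to S3; the function name `list_level` and the sample keys are made up for illustration.

```python
def list_level(keys, prefix="", delimiter="/"):
    """Group flat keys the way a delimiter-based S3 listing does.

    Returns (objects, common_prefixes) for one 'directory level'.
    """
    objects, common_prefixes = [], set()
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        if sep:
            # Something follows the delimiter: report only the shared prefix.
            common_prefixes.add(prefix + head + sep)
        else:
            objects.append(key)
    return objects, sorted(common_prefixes)


# A 'flat' bucket: nothing but full Keys, no directory entries at all.
bucket = ["foo/bar.jpg", "images/cat.jpg", "images/dogs/rex.jpg", "readme.txt"]
print(list_level(bucket))
print(list_level(bucket, prefix="images/"))
```

The "images/" entry shows up only because some Key starts with it. Remove those Keys and the "directory" is gone as well.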
The above-mentioned utilities take all this into account when allowing an Amazon S3 bucket to be mounted. They translate 'normal' filesystem commands into Amazon S3 API calls, but they can't do everything (eg they might emulate renaming a file but they typically won't let you rename a directory).
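As a rough sketch of what such a translation involves, here is how a "rename" might look with the boto3 SDK. It assumes `client` is a boto3 S3 client (for example `boto3.client("s3")`); the function names `rename_object` and `rename_prefix` are illustrative, and a single `copy_object` call only handles objects up to 5 GB.

```python
def rename_object(client, bucket, old_key, new_key):
    """'Rename' an S3 object: copy it to the new Key, then delete the original."""
    client.copy_object(
        Bucket=bucket,
        Key=new_key,
        CopySource={"Bucket": bucket, "Key": old_key},
    )
    client.delete_object(Bucket=bucket, Key=old_key)


def rename_prefix(client, bucket, old_prefix, new_prefix):
    """'Rename a directory' by renaming every object under old_prefix."""
    paginator = client.get_paginator("list_objects_v2")
    # Collect the keys first so the listing is not affected by the deletes.
    keys = [
        obj["Key"]
        for page in paginator.paginate(Bucket=bucket, Prefix=old_prefix)
        for obj in page.get("Contents", [])
    ]
    for key in keys:
        rename_object(client, bucket, key, new_prefix + key[len(old_prefix):])
    return len(keys)
```

Note that the operation is not atomic: if it stops halfway, some objects live under the old prefix and some under the new one.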
We have an Apache Camel app that is supposed to read files in a certain directory structure in S3, process the files (generating some metadata based on the folder the file is in), submit the data in the file (and metadata) to another system and finally put the consumed files into a different bucket, deleting the original from the incoming bucket.
The behaviour I'm seeing is that when I programmatically create the directory structure in S3, those "folders" are being consumed, so the dir structure disappears.
I know S3 technically does not have folders, just empty files ending in /.
The twist here is that any "folders" created in the S3 Console are NOT consumed; they stay there as we want them to. Any folders created via the AWS CLI or boto3 are immediately consumed.
The problem is that we do need the folders to be created with automation, there are too many to do by hand.
I've reached out to AWS Support, and they just tell me that there are no differences between how the Console creates folders, and how the CLI does it. Support confirmed that the command I used in CLI is correct.
I think my issue is similar to Apache Camel deleting AWS S3 bucket's folder , but that has no answer...
How can I get Camel to not "eat" any folders?
Is it possible to move or copy files from s3 to glacier (or if not possible another cheaper storage class) although the original s3 files will be deleted? Looking for a robust solution for server backups from whm > s3 > glacier. I've trialled multiple lifecycle rules, and can see several questions have been asked around this here, but I can't seem to get the settings right.
WHM sends backups to s3 fine for me. It works by essentially creating a mirror of the on-server backups on s3. My problem is that the way the whm/s3 integration works means that when the on-server backups are deleted at the end of the month so are the backups in the s3 bucket.
What I'd like to achieve is that before the files are deleted from s3 they're permanently kept for a specified period, say 6 months. I've tried rules to archive them to glacier without success and think this is because the original files are deleted and so are the glacier instances?
Is what I'm trying to achieve possible?
Thanks.
There are actually two ways to use Amazon Glacier:
As an Amazon S3 storage class (as you describe), or
By interacting with Amazon Glacier directly
Amazon Glacier has its own API that you can use to upload/download objects to/from a Glacier vault (which is the equivalent to an S3 bucket). In fact, when you use Amazon S3 to move data into Glacier, S3 is simply calling the standard Glacier API to send the data to Glacier. The difference is that S3 is managing the vault for you, so you won't see the objects listed in your Glacier console.
So, what you might choose to do is:
Create your WHM backups
Send them directly to Glacier
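A minimal sketch of the "send directly to Glacier" step, assuming `glacier` is a boto3 Glacier client (for example `boto3.client("glacier")`) and that the vault already exists. The function name and arguments are illustrative, and a single `upload_archive` request is limited to 4 GB, so larger backups would need the multipart upload calls instead.

```python
def upload_backup_to_glacier(glacier, vault_name, path, description=""):
    """Upload one backup file to a Glacier vault and return its archive ID."""
    with open(path, "rb") as fh:
        response = glacier.upload_archive(
            vaultName=vault_name,
            archiveDescription=description or path,
            body=fh,
        )
    # Glacier archives have no file names: store the archive ID somewhere,
    # because it is what you need to retrieve or delete the archive later.
    return response["archiveId"]
```

Because the archive lives in your own vault rather than behind an S3 object, deleting the file from S3 (or never sending it there) does not affect it.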
Versioning
An alternative approach is to use Amazon S3 Versioning. This means that objects deleted from Amazon S3 are not actually deleted. Rather, a delete marker hides the object, but the object is still accessible.
You could then define a lifecycle policy to delete non-current versions (including deleted objects) after a period of time.
See (old article): Amazon S3 Lifecycle Management for Versioned Objects | AWS News Blog
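The versioning approach could be set up with boto3 roughly as follows. This is a sketch under assumptions: `s3` is a boto3 S3 client, the 180-day retention stands in for the "6 months" in the question, and the function names, rule ID and optional `glacier_after` parameter are invented for the example.

```python
def noncurrent_retention_rules(days=180, prefix="", glacier_after=None):
    """Build a lifecycle configuration that keeps non-current versions for `days`."""
    rule = {
        "ID": f"expire-noncurrent-after-{days}-days",
        "Status": "Enabled",
        "Filter": {"Prefix": prefix},
        # Permanently remove versions once they have been non-current this long.
        "NoncurrentVersionExpiration": {"NoncurrentDays": days},
        # Clean up delete markers that no longer hide any version.
        "Expiration": {"ExpiredObjectDeleteMarker": True},
    }
    if glacier_after is not None:
        # Optionally move non-current versions to the Glacier storage class first.
        rule["NoncurrentVersionTransitions"] = [
            {"NoncurrentDays": glacier_after, "StorageClass": "GLACIER"}
        ]
    return {"Rules": [rule]}


def apply_retention(s3, bucket, days=180, glacier_after=None):
    """Enable versioning on the bucket, then attach the lifecycle rule."""
    s3.put_bucket_versioning(
        Bucket=bucket, VersioningConfiguration={"Status": "Enabled"}
    )
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration=noncurrent_retention_rules(
            days, glacier_after=glacier_after
        ),
    )
```

With this in place, a delete issued by the backup tool only adds a delete marker, and the hidden version stays until the lifecycle rule removes it.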
I am a little puzzled and hope someone can help me out.
We create some ORC files that we would like to query while they are stored on S3.
We noticed that the S3 native filesystem (s3n) does not really work for this. I am not sure what the problem is, but my guess is that the reader cannot jump to specific bytes inside the file, so it has to load the whole file before it can query it.
So we tried storing the files on S3 (uri s3://) which is a block file system just like HDFS backed by s3 and it worked great.
But I am a little worried after reading this source about Amazon EMR, which says:
Amazon S3 block file system (URI path: s3bfs://)
The Amazon S3 block file system is a legacy file storage system. We strongly discourage the use of this system.
Important
We recommend that you do not use this file system because it can trigger a race condition that might cause your cluster to fail. However, it might be required by legacy applications.
EMRFS (URI path: s3://)
EMRFS is an implementation of HDFS used for reading and writing regular files from Amazon EMR directly to Amazon S3.
I am not using EMR. I create my files by launching an EC2 cluster and then use S3 as cold storage. I am puzzled and not sure which filesystem I use when I store my files on S3 with the URI scheme s3://. Do I use EMRFS or the deprecated s3bfs filesystem?
Amazon S3 is an object storage system. It is not recommended to "mount" S3 as a filesystem. Amazon Elastic Block Store (EBS) is a block storage system that appears as volumes on Amazon EC2 instances.
When used from Amazon Elastic MapReduce (EMR), Hadoop has extensions that make it easy to work with Amazon S3. However, if you are not using EMR, there is no need to use EMRFS (which is available only on EMR), nor should you use S3 as a block storage system.
The easiest way to use S3 from EC2 is via the AWS Command-Line Interface (CLI). You can copy files to/from S3 by using the aws s3 cp command. There's also a sync command to make it easy to synchronize data to/from S3.
You can also programmatically connect to Amazon S3 via an SDK, so that your app can directly transfer files to/from S3.
As to which to choose... typically, applications like to work with files on a local filesystem, so copy your files from S3 to a local device. However, if your app can directly communicate with S3, there will be fewer "moving parts".
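For the SDK route, a small boto3 sketch might look like this. It assumes `s3` is a boto3 S3 client (`boto3.client("s3")`); the helper names are made up. Besides copying a whole object to local disk, an SDK can also request just a byte range of an object, which is the kind of access a columnar reader wants.

```python
def byte_range(offset, length):
    """Format an HTTP Range header value covering `length` bytes from `offset`."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    # HTTP byte ranges are inclusive on both ends.
    return f"bytes={offset}-{offset + length - 1}"


def read_slice(s3, bucket, key, offset, length):
    """Fetch only part of an object instead of downloading all of it."""
    response = s3.get_object(Bucket=bucket, Key=key, Range=byte_range(offset, length))
    return response["Body"].read()


def fetch_to_local(s3, bucket, key, local_path):
    """Copy a whole object to local disk, similar to `aws s3 cp`."""
    s3.download_file(bucket, key, local_path)
```

Whether a given file-format reader actually issues ranged reads depends on the library in use, so treat this only as an illustration of what the S3 API allows.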
How can I access the files in the following S3 folder, which is owned by someone else?
s3n://elasticmapreduce/samples/wordcount/input
The files in s3n://elasticmapreduce/samples/wordcount/input are public, and made available as input by Amazon to the sample word count Hadoop program. The best way to fetch them is to
Start a new Amazon Elastic MapReduce Job Flow (it doesn't matter which one) from the Amazon Web Services console, and make sure that you keep the job alive with the Keep Alive option
Once the EC2 machines have started, find the instances on EC2 from the Amazon Web Services console
ssh into one of the running EC2 instances, using the hadoop user, for example
ssh -i keypair.pem hadoop@ec2-IPADDRESS.compute-1.amazonaws.com
Obtain the files you need, using hadoop dfs -copyToLocal s3://elasticmapreduce/samples/wordcount/input/0002 .
sftp the files to your local system
You can access wordSplitter.py here:
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/wordSplitter.py
You can access the input files here:
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0012
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0011
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0010
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0009
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0008
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0007
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0006
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0005
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0004
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0003
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0002
https://elasticmapreduce.s3.amazonaws.com/samples/wordcount/input/0001
The owner of the folder (or, more likely, of a file in the folder) must have made it accessible to anonymous readers.
If that is the case, s3n://x/y... is translated to
http://s3.amazonaws.com/x/y...
or
http://x.s3.amazonaws.com/y...
x is the name of the bucket.
y... is the path within the bucket.
If you want to make sure the file exists, e.g. if you suspect the name was misspelled, you can open this in your browser:
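The translation described above can be written down as a small helper. This is just string manipulation following the two URL forms given in this answer; the function name is hypothetical and nothing here checks that the object exists or is public.

```python
from urllib.parse import urlparse


def s3_uri_to_http_urls(uri):
    """Translate s3://x/y or s3n://x/y into the two equivalent HTTP URL forms."""
    parsed = urlparse(uri)
    if parsed.scheme not in ("s3", "s3n"):
        raise ValueError(f"not an S3 URI: {uri!r}")
    bucket = parsed.netloc            # x: the bucket name
    path = parsed.path.lstrip("/")    # y...: the path (Key) within the bucket
    return (
        f"http://s3.amazonaws.com/{bucket}/{path}",
        f"http://{bucket}.s3.amazonaws.com/{path}",
    )


for url in s3_uri_to_http_urls("s3n://elasticmapreduce/samples/wordcount/input"):
    print(url)
```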
http://s3.amazonaws.com/x
and you'll see XML describing the available "files" (that is, S3 objects).
Try this:
http://s3.amazonaws.com/elasticmapreduce
I tried this, and it seems that the path you want is not public.
The AWS EMR documentation quotes s3://elasticmapreduce/samples/wordcount/input in one of the "getting started" examples. But s3 is different from s3n, so input might be available to EMR, but not to HTTP access.
In Amazon S3, there is no concept of folders; a bucket is just a flat collection of objects. But you can list all the files you are interested in from a browser with the following URL:
s3.amazonaws.com/elasticmapreduce?prefix=samples/wordcount/input/
Then you can download them by specifying the whole name, e.g.
s3.amazonaws.com/elasticmapreduce/samples/wordcount/input/0001
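The same two steps (list by prefix, then download by full name) can be scripted with the Python standard library. This sketch assumes the bucket allows anonymous listing, which may no longer be true for this bucket, and that the listing is returned in the usual ListBucketResult XML format; the function names are illustrative.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

# XML namespace used by S3 bucket listings.
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}


def listing_url(bucket, prefix):
    """Build the URL that lists the objects whose Key starts with `prefix`."""
    return f"https://s3.amazonaws.com/{bucket}?{urlencode({'prefix': prefix})}"


def keys_from_listing(xml_text):
    """Extract the object Keys from a ListBucketResult XML document."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.findall("s3:Contents/s3:Key", S3_NS)]


def list_public_keys(bucket, prefix):
    """Fetch and parse the listing; needs network access and a public bucket."""
    with urlopen(listing_url(bucket, prefix)) as resp:
        return keys_from_listing(resp.read())
```

Calling `list_public_keys("elasticmapreduce", "samples/wordcount/input/")` would return the full Keys, each of which can then be appended to the bucket URL to download the object.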
Reduced Redundancy Storage (RRS) is a new service from Amazon that is a bit cheaper than S3 because there is less redundancy.
However, I cannot find any information on how to specify that my data should use RRS rather than standard S3. In fact, there doesn't seem to be any website interface for the S3 service. If I log into AWS, there are only options for EC2, Elastic MapReduce, CloudFront and RDS, none of which I use.
I know this question is old but it's worth mentioning that Amazon's interface for S3 now has an option to change your files (recursively) to RRS. Select a folder and right click on it, under properties change the storage to RRS.
You can use S3 Browser to switch to Reduced Redundancy Storage. It allows you to view/edit storage class for a single file or for multiple files. Moreover, you can configure default storage class for the bucket, so S3 Browser will automatically apply predefined storage class for all new files you are uploading through S3 Browser.
If you are using S3 Browser to work with RRS, the following article may be helpful:
Working with Amazon S3 Reduced Redundancy Storage (RRS)
Note: Storage Class preferences are stored in a local settings file. Other S3 applications use their own ways to store bucket defaults, and currently there is no single standard for this.
All objects in Amazon S3 have a storage class setting. The default setting is STANDARD. You can use an optional header on a PUT request to specify the setting REDUCED_REDUNDANCY.
From: http://aws.amazon.com/s3/faqs/#How_do_I_specify_that_I_want_to_store_my_data_using_RRS
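With the current boto3 SDK, the optional header from the FAQ corresponds to the `StorageClass` parameter of `put_object`. A minimal sketch, assuming `s3` is a boto3 S3 client; the function name and defaults are illustrative.

```python
def put_with_storage_class(s3, bucket, key, body, storage_class="REDUCED_REDUNDANCY"):
    """Upload an object and set its storage class in the same PUT request."""
    # boto3 sends StorageClass as the x-amz-storage-class request header.
    return s3.put_object(
        Bucket=bucket,
        Key=key,
        Body=body,
        StorageClass=storage_class,
    )
```

Omitting `StorageClass` leaves the object in the default STANDARD class.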
If you are looking for a way to convert existing data in amazon s3, you can use a fairly recent version of boto and a script I wrote. Details explained on my blog:
http://www.bryceboe.com/2010/07/02/amazon-s3-convert-objects-to-reduced-redundancy-storage/
If you're on a Mac, the free Cyberduck FTP program will do it. Log into S3, right-click on the bucket (or folder, or file), choose 'info' and change the storage class from 'unknown' or 'regular s3 storage' to 'reduced redundancy storage'. It took about 2 hours to change 30,000 files for me...
If you use boto, you can do this:
key.change_storage_class('REDUCED_REDUNDANCY')
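The one-liner above is for the older boto library and assumes `key` is an existing boto Key object. With boto3, a comparable effect can be had by copying the object onto itself with a new storage class. This is a sketch, assuming `s3` is a boto3 S3 client; the function name is made up, and a single `copy_object` call only works for objects up to 5 GB.

```python
def change_storage_class(s3, bucket, key, storage_class="REDUCED_REDUNDANCY"):
    """Rewrite an existing object in place with a different storage class."""
    s3.copy_object(
        Bucket=bucket,
        Key=key,
        # Source and destination are the same object.
        CopySource={"Bucket": bucket, "Key": key},
        StorageClass=storage_class,
        # Keep the object's existing metadata unchanged.
        MetadataDirective="COPY",
    )
```

To convert many objects, call this for each Key returned by a listing of the bucket.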