block file system on S3 - amazon-s3

I am a little puzzled and hope someone can help me out.
We create ORC files that we would like to query while they are stored on S3.
We noticed that the S3 native file system (s3n) does not really work for this purpose. I am not sure what the problem is, but my guess is that the reader cannot seek to specific byte offsets inside the file, so it has to load the whole file before it can query it.
So we tried storing the files on S3 with the s3:// URI, which is a block file system backed by S3, much like HDFS, and it worked great.
But I am a little worried after reading this documentation about Amazon EMR, which says:
Amazon S3 block file system (URI path: s3bfs://)
The Amazon S3 block file system is a legacy file storage system. We strongly discourage the use of this system.
Important
We recommend that you do not use this file system because it can trigger a race condition that might cause your cluster to fail. However, it might be required by legacy applications.
EMRFS (URI path: s3://)
EMRFS is an implementation of HDFS used for reading and writing regular files from Amazon EMR directly to Amazon S3.
I am not using EMR. I create my files by launching an EC2 cluster and then use S3 as cold storage. Now I am puzzled about which file system I am actually using when I store my files on S3 with the URI scheme s3://. Am I using EMRFS, or the deprecated s3bfs file system?

Amazon S3 is an object storage system. It is not recommended to "mount" S3 as a filesystem. Amazon Elastic Block Store (EBS) is a block storage system that appears as volumes on Amazon EC2 instances.
When used from Amazon Elastic MapReduce (EMR), Hadoop has extensions that make it easy to work with Amazon S3. However, if you are not using EMR, there is no need to use EMRFS (which is available only on EMR), nor should you use S3 as a block storage system.
The easiest way to use S3 from EC2 is via the AWS Command-Line Interface (CLI). You can copy files to/from S3 by using the aws s3 cp command. There's also a sync command that makes it easy to synchronize data to/from S3.
You can also programmatically connect to Amazon S3 via an SDK, so that your app can directly transfer files to/from S3.
As to which to choose: applications typically like to work with files on a local filesystem, so copy your files from S3 to a local device. However, if your app can communicate with S3 directly, there will be fewer "moving parts".
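The SDK route could look roughly like the following. This is a minimal sketch, assuming Python with the boto3 package installed and AWS credentials already configured; the helper names, bucket, key, and local path are hypothetical.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/path/to/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an s3:// URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def download(uri, local_path):
    """Copy one S3 object to the local file system."""
    import boto3  # imported lazily so split_s3_uri works without boto3

    bucket, key = split_s3_uri(uri)
    boto3.client("s3").download_file(bucket, key, local_path)


# Example call (hypothetical object and destination):
# download("s3://my-bucket/warehouse/part-0000.orc", "/tmp/part-0000.orc")
```

Once the file is on a local device, the application can read it like any other file.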

Related

AWS S3 auto save folder

Is there a way to autosave AutoCAD files, or changes to them, directly to an S3 bucket? Is there perhaps an API I can use for this workflow?
I was not able to quickly find a plugin that does this for you, but you can do one of the following:
Mount S3 bucket as a drive. You can read more at CloudBerry Drive - Mount S3 bucket as Windows drive
This might create some performance issues with AutoCad.
Sync saved files to S3
You can schedule a script to run every n minutes that automatically syncs your files to S3 using aws s3 sync (see the AWS CLI documentation for s3 sync). Your command might look something like
aws s3 sync /path/to/cad/files s3://bucket-with-cad/project/test
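The "run every n minutes" part could be sketched as below. This assumes Python is available on the workstation and that the aws CLI is installed and configured; the function names and the 5-minute default are illustrative choices, not part of the original answer.

```python
import subprocess
import time


def sync_command(local_dir, s3_uri):
    """Build the argv for a one-way `aws s3 sync` from local_dir to s3_uri."""
    return ["aws", "s3", "sync", local_dir, s3_uri]


def sync_forever(local_dir, s3_uri, interval_minutes=5):
    """Run the sync every interval_minutes until interrupted."""
    while True:
        # check=False: a failed run is simply retried on the next cycle.
        subprocess.run(sync_command(local_dir, s3_uri), check=False)
        time.sleep(interval_minutes * 60)


# Example call (blocks until interrupted):
# sync_forever("/path/to/cad/files", "s3://bucket-with-cad/project/test")
```

A scheduler such as cron or Windows Task Scheduler running the plain command would work just as well; the loop only keeps everything in one place.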

How to copy files from an encrypted S3 bucket to Google Cloud Storage?

I need to sync some files between an encrypted (S3-SSE) S3 bucket and a Google Cloud Storage bucket.
The task sounds simple, as gsutil supports S3, but unfortunately it seems it does not support SSE:
Requests specifying Server Side Encryption with AWS KMS managed keys require AWS Signature Version 4.
Is there an easy way to sync files between an encrypted (S3-SSE) S3 bucket and a Google Cloud Storage bucket (apart from writing our own script)?
As gsutil doesn't currently support Signature Version 4, there doesn't appear to be an "easy" way (i.e. without writing a script of your own) to sync files between your two buckets. A naive solution might simply chain together the AWS CLI and gsutil for each copy, using your machine as the middleman in a daisy-chain approach, as gsutil already does for cross-cloud-provider copies.
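A rough sketch of that daisy-chain idea, assuming both the aws CLI and gsutil are installed, and that the AWS credentials in use are allowed to read and decrypt the objects. The function names and bucket URIs are hypothetical.

```python
import subprocess
import tempfile


def daisy_chain_commands(s3_uri, gcs_uri, staging_dir):
    """Return the two argv lists: S3 -> local staging, then staging -> GCS."""
    pull = ["aws", "s3", "sync", s3_uri, staging_dir]
    push = ["gsutil", "-m", "rsync", "-r", staging_dir, gcs_uri]
    return pull, push


def sync_s3_to_gcs(s3_uri, gcs_uri):
    """Copy everything under s3_uri to gcs_uri via a temporary local directory."""
    with tempfile.TemporaryDirectory() as staging_dir:
        for command in daisy_chain_commands(s3_uri, gcs_uri, staging_dir):
            subprocess.run(command, check=True)  # stop at the first failure


# Example call (hypothetical buckets):
# sync_s3_to_gcs("s3://encrypted-bucket/data", "gs://target-bucket/data")
```

The local machine needs enough disk space to stage the data, which is the main cost of this approach.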

AWS FTP behavior

I'm having an issue with my AWS S3 bucket and vsftpd.
I've created a vsftpd instance and mounted an AWS S3 bucket on it. My issue is that whenever an upload is interrupted and the FTP client retries, the data is appended to the existing file on the S3 bucket instead of overwriting it. What should I set in the S3 bucket policy so that the file is overwritten instead of appended to?
There are no Amazon S3 configuration settings that would impact this behaviour -- it is totally the result of the software you are using.
It's also worth mentioning that FTP is a rather old protocol and these days there are much better alternatives, such as uploads via the browser or Dropbox-like shared folders.
One of the easiest options is to have your users upload directly to Amazon S3 -- that way, you don't need to run any servers. This could be done by uploading via a browser, or by providing users with some software, such as Cloudberry Explorer or the AWS Command-Line Interface (CLI).
I highly encourage you to stop using FTP these days.

How do I copy files from S3 to Amazon EMR HDFS?

I'm running Hive on EMR and need to copy some files to all EMR instances.
As I understand it, one way is to copy the files to the local file system on each node, and the other is to copy the files to HDFS. However, I haven't found a simple way to copy straight from S3 to HDFS.
What is the best way to go about this?
The best way to do this is to use Hadoop's distcp command. Example (on one of the cluster nodes):
% ${HADOOP_HOME}/bin/hadoop distcp s3n://mybucket/myfile /root/myfile
This would copy a file called myfile from an S3 bucket named mybucket to /root/myfile in HDFS. Note that this example assumes you are using the S3 file system in "native" mode; this means that Hadoop sees each object in S3 as a file. If you use S3 in block mode instead, you would replace s3n with s3 in the example above. For more info about the differences between native S3 and block mode, as well as an elaboration on the example above, see http://wiki.apache.org/hadoop/AmazonS3.
I found that distcp is a very powerful tool. In addition to copying a large number of files in and out of S3, you can also use it to perform fast cluster-to-cluster copies of large data sets. Instead of pushing all the data through a single node, distcp uses multiple nodes in parallel to perform the transfer. This makes distcp considerably faster for large amounts of data than copying everything to the local file system as an intermediary.
Amazon now provides its own wrapper around distcp, namely S3DistCp:
S3DistCp is an extension of DistCp that is optimized to work with Amazon Web Services (AWS), particularly Amazon Simple Storage Service (Amazon S3). You use S3DistCp by adding it as a step in a job flow. Using S3DistCp, you can efficiently copy large amounts of data from Amazon S3 into HDFS where it can be processed by subsequent steps in your Amazon Elastic MapReduce (Amazon EMR) job flow. You can also use S3DistCp to copy data between Amazon S3 buckets or from HDFS to Amazon S3.
Example: Copy log files from Amazon S3 to HDFS
The following example illustrates how to copy log files stored in an Amazon S3 bucket into HDFS. In this example the --srcPattern option is used to limit the data copied to the daemon logs.
elastic-mapreduce --jobflow j-3GY8JC4179IOJ --jar \
s3://us-east-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar \
--args '--src,s3://myawsbucket/logs/j-3GY8JC4179IOJ/node/,\
--dest,hdfs:///output,\
--srcPattern,.*daemons.*-hadoop-.*'
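To get a feel for what the --srcPattern value above selects, the same regular expression can be tried locally. This is only a sketch: it assumes the pattern has to match the whole path (hence fullmatch), and the object keys below are made up for illustration.

```python
import re

# Same expression as the --srcPattern argument above.
SRC_PATTERN = re.compile(r".*daemons.*-hadoop-.*")

# Hypothetical keys under the --src prefix.
candidate_keys = [
    "logs/j-3GY8JC4179IOJ/node/i-1234/daemons/hadoop-hadoop-datanode.log",
    "logs/j-3GY8JC4179IOJ/node/i-1234/daemons/setup.log",
    "logs/j-3GY8JC4179IOJ/node/i-1234/applications/hadoop/steps.log",
]


def selected(keys, pattern=SRC_PATTERN):
    """Return the keys whose full path matches the pattern."""
    return [key for key in keys if pattern.fullmatch(key)]


for key in selected(candidate_keys):
    print(key)
```

Only the first key is printed: the second contains "daemons" but no "-hadoop-", and the third contains neither.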
Note that, according to Amazon's "Amazon Elastic MapReduce - File System Configuration" page at http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/FileSystemConfig.html, the S3 block file system is deprecated and its URI prefix is now s3bfs://. Amazon specifically discourages using it since "it can trigger a race condition that might cause your job flow to fail".
According to the same page, HDFS is now a 'first-class' file system on EMR, although it is ephemeral (it goes away when the Hadoop job ends).

How do I use Amazon's new RRS for S3?

Reduced Redundancy Storage (RRS) is a new service from Amazon that is a bit cheaper than S3 because there is less redundancy.
However, I cannot find any information on how to specify that my data should use RRS rather than standard S3. In fact, there doesn't seem to be any web interface for S3 at all. If I log into AWS, there are only options for EC2, Elastic MapReduce, CloudFront and RDS, none of which I use.
I know this question is old but it's worth mentioning that Amazon's interface for S3 now has an option to change your files (recursively) to RRS. Select a folder and right click on it, under properties change the storage to RRS.
You can use S3 Browser to switch to Reduced Redundancy Storage. It allows you to view and edit the storage class for a single file or for multiple files. Moreover, you can configure a default storage class for the bucket, so S3 Browser will automatically apply the predefined storage class to all new files you upload through S3 Browser.
If you are using S3 Browser to work with RRS, the following article may be helpful:
Working with Amazon S3 Reduced Redundancy Storage (RRS)
Note that storage class preferences are stored in a local settings file. Other S3 applications store bucket defaults in their own way, and there is currently no single standard for this.
All objects in Amazon S3 have a storage class setting. The default setting is STANDARD. You can use an optional header on a PUT request to specify the setting REDUCED_REDUNDANCY.
From: http://aws.amazon.com/s3/faqs/#How_do_I_specify_that_I_want_to_store_my_data_using_RRS
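As an illustration of setting the storage class on a PUT, here is a minimal sketch assuming the boto3 SDK, where the StorageClass argument of put_object is sent as the x-amz-storage-class request header. The function name, bucket, and key are hypothetical.

```python
def put_reduced_redundancy(s3_client, bucket, key, body):
    """Upload body and ask S3 to store it with the REDUCED_REDUNDANCY class.

    s3_client is expected to be a boto3 S3 client, e.g. boto3.client("s3").
    """
    return s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=body,
        StorageClass="REDUCED_REDUNDANCY",
    )


# Example call (requires boto3 and configured credentials):
# import boto3
# put_reduced_redundancy(boto3.client("s3"), "my-bucket", "reports/2010.csv", b"a,b\n")
```

Passing the client in as an argument keeps the function easy to try out with a stand-in object before pointing it at a real bucket.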
If you are looking for a way to convert existing data in Amazon S3, you can use a fairly recent version of boto and a script I wrote. Details are explained on my blog:
http://www.bryceboe.com/2010/07/02/amazon-s3-convert-objects-to-reduced-redundancy-storage/
If you're on a Mac, the free Cyberduck FTP program will do it. Log into S3, right-click on the bucket (or folder, or file), choose 'Info' and change the storage class from 'unknown' or 'regular s3 storage' to 'reduced redundancy storage'. It took about 2 hours to change 30,000 files for me...
If you use boto, you can do this:
key.change_storage_class('REDUCED_REDUNDANCY')
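To apply that call to many objects, a loop along these lines might work. It is a sketch assuming the legacy boto (version 2) library, whose Key objects expose a storage_class attribute and the change_storage_class method shown above; the function name and bucket name are hypothetical.

```python
def convert_to_rrs(keys, storage_class="REDUCED_REDUNDANCY"):
    """Switch every key not already in storage_class; return how many changed."""
    changed = 0
    for key in keys:
        if key.storage_class != storage_class:
            key.change_storage_class(storage_class)
            changed += 1
    return changed


# Usage with legacy boto (requires credentials; bucket name is hypothetical):
# import boto
# bucket = boto.connect_s3().get_bucket("my-bucket")
# print(convert_to_rrs(bucket.list()))
```

Skipping keys that already have the target class avoids issuing a copy request for objects that do not need one.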