How do I copy files from S3 to Amazon EMR HDFS? - amazon-s3

I'm running Hive on EMR
and need to copy some files to all EMR instances.
One way, as I understand it, is to copy the files to the local file system on each node; the other is to copy them to HDFS. However, I haven't found a simple way to copy straight from S3 to HDFS.
What is the best way to go about this?

The best way to do this is to use Hadoop's distcp command. Example (on one of the cluster nodes):
% ${HADOOP_HOME}/bin/hadoop distcp s3n://mybucket/myfile /root/myfile
This would copy a file called myfile from an S3 bucket named mybucket to /root/myfile in HDFS. Note that this example assumes you are using the S3 file system in "native" mode; this means that Hadoop sees each object in S3 as a file. If you use S3 in block mode instead, you would replace s3n with s3 in the example above. For more info about the differences between native S3 and block mode, as well as an elaboration on the example above, see http://wiki.apache.org/hadoop/AmazonS3.
I have found distcp to be a very powerful tool. In addition to copying a large number of files in and out of S3, you can also use it to perform fast cluster-to-cluster copies of large data sets. Instead of pushing all the data through a single node, distcp uses multiple nodes in parallel to perform the transfer. This makes distcp considerably faster when transferring large amounts of data, compared to the alternative of copying everything to the local file system as an intermediary.
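As a rough sketch, the command above could be wrapped in a small shell function so it can be previewed before it is run. The function name copy_s3_to_hdfs and the DRY_RUN variable are conventions invented for this sketch (they are not Hadoop features), and it assumes hadoop is on the PATH of the cluster node.

```shell
# Sketch: copy an S3 object or prefix into HDFS with distcp.
# copy_s3_to_hdfs and DRY_RUN are made up for this example.
copy_s3_to_hdfs() {
  src="$1"    # e.g. s3n://mybucket/myfile
  dest="$2"   # e.g. hdfs:///root/myfile
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Only print the command that would be executed.
    printf '%s\n' "hadoop distcp $src $dest"
  else
    hadoop distcp "$src" "$dest"
  fi
}
```

For example, DRY_RUN=1 copy_s3_to_hdfs s3n://mybucket/myfile hdfs:///root/myfile prints the distcp command without starting a job. The destination is written here as an explicit hdfs:/// URI purely to make the target file system unambiguous.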

Amazon now provides its own wrapper over distcp: S3DistCp.
S3DistCp is an extension of DistCp that is optimized to work with
Amazon Web Services (AWS), particularly Amazon Simple Storage Service
(Amazon S3). You use S3DistCp by adding it as a step in a job flow.
Using S3DistCp, you can efficiently copy large amounts of data from
Amazon S3 into HDFS where it can be processed by subsequent steps in
your Amazon Elastic MapReduce (Amazon EMR) job flow. You can also use
S3DistCp to copy data between Amazon S3 buckets or from HDFS to Amazon
S3.
Example: Copy log files from Amazon S3 to HDFS
The following example illustrates how to copy log files stored in an Amazon S3 bucket into HDFS. In this example the --srcPattern option is used to limit the data copied to the daemon logs.
elastic-mapreduce --jobflow j-3GY8JC4179IOJ --jar \
s3://us-east-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar \
--args '--src,s3://myawsbucket/logs/j-3GY8JC4179IOJ/node/,\
--dest,hdfs:///output,\
--srcPattern,.*daemons.*-hadoop-.*'

Note that according to Amazon's "Amazon Elastic MapReduce - File System Configuration" page at http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/FileSystemConfig.html, the S3 Block FileSystem is deprecated and its URI prefix is now s3bfs://. Amazon specifically discourages using it, since "it can trigger a race condition that might cause your job flow to fail".
According to the same page, HDFS is now a 'first-class' file system under S3, although it is ephemeral (it goes away when the Hadoop job ends).

Related

Apache flink with S3 as source and S3 as sink

Is it possible to read events as they land in an S3 source bucket via Apache Flink, process them, and sink them back to some other S3 bucket? Is there a special connector for that, or do I have to use the available read/save examples mentioned in the Apache Flink docs?
How does checkpointing work in this case: does Flink keep track of what it has read from the S3 source bucket automatically, or does custom code need to be built? Does Flink also guarantee exactly-once processing in the S3 source case?
In Flink 1.11 the FileSystem SQL Connector is much improved; that will be an excellent solution for this use case.
With the DataStream API you can use FileProcessingMode.PROCESS_CONTINUOUSLY with readFile to monitor a bucket and ingest new files as they are atomically moved into it. Flink keeps track of the last-modified timestamp of the bucket, and ingests any children modified since that timestamp -- doing so in an exactly-once way (the read offsets into those files are included in checkpoints).

Any AWS S3 API to move files from HDFS on Amazon EMR to Amazon S3 from spark application

We have a requirement to copy files within a Spark job (running on a Hadoop cluster spun up by EMR) to the respective S3 bucket.
As of now, we are using the Hadoop FileSystem API (FileUtil.copy) to copy or move files between two different file systems.
val config = Spark.sparkContext.hadoopConfiguration
FileUtil.copy(sourceFileSystem, sourceFile, destinationFileSystem, targetLocation, true, config)
This method works as required but is not efficient. It streams each file, so execution time depends on the size and number of files to be copied.
For another, similar requirement, moving files between two folders of the same S3 bucket, we are using the com.amazonaws.services.s3 package as below.
val uri1 = new AmazonS3URI(sourcePath)
val uri2 = new AmazonS3URI(targetPath)
s3Client.copyObject(uri1.getBucket, uri1.getKey, uri2.getBucket, uri2.getKey)
The above package only has methods to copy or move between two S3 locations. My requirement is to copy files between HDFS (on a cluster spun up by EMR) and the root S3 bucket.
Can anyone suggest a better way, or an AWS S3 API usable from Spark Scala, for moving files between HDFS and an S3 bucket?
We had a similar scenario and ended up using S3DistCp.
S3DistCp is an extension of DistCp that is optimized to work with AWS, particularly S3. You can use S3DistCp to copy data between Amazon S3 buckets or from HDFS to Amazon S3. S3DistCp is more scalable and efficient for copying large numbers of objects in parallel across buckets and across AWS accounts.
You can find more details here.
You can refer to this sample Java code for the same here.
Hope this helps!
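For reference, a minimal sketch of what an HDFS-to-S3 copy with S3DistCp might look like when run on the EMR master node. It assumes the s3-dist-cp command is installed there, as it is on recent EMR releases. The paths, the bucket name, the function name copy_hdfs_to_s3 and the DRY_RUN variable are placeholders made up for this sketch.

```shell
# Sketch: push an HDFS directory to S3 with S3DistCp.
# copy_hdfs_to_s3 and DRY_RUN are made up for this example.
copy_hdfs_to_s3() {
  src="$1"          # e.g. hdfs:///output
  dest="$2"         # e.g. s3://mybucket/output
  pattern="${3:-}"  # optional regex, passed to --srcPattern
  # Rebuild the positional parameters as the command line to run.
  set -- s3-dist-cp --src "$src" --dest "$dest"
  if [ -n "$pattern" ]; then
    set -- "$@" --srcPattern "$pattern"
  fi
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Only print the command that would be executed.
    printf '%s\n' "$*"
  else
    "$@"
  fi
}
```

Calling DRY_RUN=1 copy_hdfs_to_s3 hdfs:///output s3://mybucket/output prints the command for inspection; without DRY_RUN the copy runs as a distributed job on the cluster rather than streaming files through a single process, which is the inefficiency described in the question.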

block file system on S3

I am a little puzzled and hope someone can help me out.
We create some ORC files that we would like to query while they are stored on S3.
We noticed that the S3 native file system (s3n) does not really work for this purpose. I am not sure what the problem is, but my guess is that the reader cannot seek to specific bytes inside the file, so it has to load the whole file before it can query it.
So we tried storing the files on S3 using the s3:// URI, which is a block file system much like HDFS but backed by S3, and it worked great.
But I am a little worried after reading this source about Amazon EMR, which says:
Amazon S3 block file system (URI path: s3bfs://)
The Amazon S3 block file system is a legacy file storage system. We strongly discourage the use of this system.
Important
We recommend that you do not use this file system because it can trigger a race condition that might cause your cluster to fail. However, it might be required by legacy applications.
EMRFS (URI path: s3://)
EMRFS is an implementation of HDFS used for reading and writing regular files from Amazon EMR directly to Amazon S3.
I am not using EMR. I create my files by launching an EC2 cluster and then use S3 as cold storage. But I am puzzled right now and not sure which file system I am using when I store my files on S3 with the s3:// URI scheme: EMRFS, or the deprecated s3bfs file system?
Amazon S3 is an object storage system. It is not recommended to "mount" S3 as a filesystem. Amazon Elastic Block Store (EBS) is a block storage system that appears as volumes on Amazon EC2 instances.
When used from Amazon Elastic MapReduce (EMR), Hadoop has extensions that make it easy to work with Amazon S3. However, if you are not using EMR, there is no need to use EMRFS (which is available only on EMR), nor should you use S3 as a block storage system.
The easiest way to use S3 from EC2 is via the AWS Command-Line Interface (CLI). You can copy files to/from S3 by using the aws s3 cp command. There's also a sync command to make it easy to synchronize data to/from S3.
You can also programmatically connect to Amazon S3 via an SDK, so that your app can directly transfer files to/from S3.
As to which to choose: typically, applications like to work with files on a local filesystem, so copy your files from S3 to a local device. However, if your app can communicate with S3 directly, there will be fewer "moving parts".
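A minimal sketch of the two CLI commands mentioned above, wrapped in one helper. It assumes the AWS CLI is installed and has credentials configured on the EC2 instance. The helper name s3_transfer, the DRY_RUN variable, and the example bucket and paths are made up for illustration.

```shell
# Sketch: copy a single object (cp) or mirror a prefix (sync) with the AWS CLI.
# s3_transfer and DRY_RUN are made up for this example.
s3_transfer() {
  mode="$1"   # "cp" for one file, "sync" for a whole prefix or directory
  src="$2"    # e.g. s3://mybucket/orc/
  dest="$3"   # e.g. /data/orc
  case "$mode" in
    cp|sync) ;;
    *) echo "usage: s3_transfer cp|sync SRC DEST" >&2; return 2 ;;
  esac
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Only print the command that would be executed.
    printf '%s\n' "aws s3 $mode $src $dest"
  else
    aws s3 "$mode" "$src" "$dest"
  fi
}
```

For instance, s3_transfer sync s3://mybucket/orc/ /data/orc would pull new or changed objects to local disk, and swapping the two arguments pushes local changes back to S3.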

Script to take a S3 bucket, Compress it, push the compressed file to an SFTP server

I have an S3 bucket with about 100 GB of small files (in folders).
I have been asked to back this up to a local NAS on a weekly basis.
I have access to an EC2 instance that is attached to the S3 storage.
My NAS allows me to run an SFTP server.
I also have access to a local server on which I can run a cron job to pull the backup if need be.
How can I best go about this? If possible, I would like to download only the files that have been added or changed, or compress the data on the server end and then push the compressed file to the SFTP server on the NAS.
The end goal is to have a complete backup of the S3 bucket on my NAS with the lowest amount of transfer each week.
Any suggestions are welcome!
Thanks for your help!
Ryan
I think the most scalable way to achieve this is to use AWS Elastic MapReduce and Data Pipeline.
The architecture is as follows:
You use Data Pipeline to configure S3 as an input data node, then EC2 with Pig/Hive scripts to do the required processing and send the data to SFTP. Pig can be extended with a custom UDF (user-defined function) to send data to SFTP. You can then set up this pipeline to run at a periodic interval. That said, it takes quite some reading to get all of this working, but it is a good skill to acquire if you foresee future data transformation needs.
Start reading from here:
http://aws.typepad.com/aws/2012/11/the-new-amazon-data-pipeline.html
A similar method can be used for taking periodic backups of DynamoDB to S3, or for reading files from FTP servers, processing them, and moving them to, say, S3 or RDS.

getting large datasets onto amazon elastic map reduce

There are some large datasets (25 GB+, downloadable on the Internet) that I want to play around with using Amazon EMR. Instead of downloading the datasets onto my own computer and then re-uploading them to Amazon, what's the best way to get the datasets onto Amazon?
Do I fire up an EC2 instance, download the datasets (using wget) into S3 from within the instance, and then access S3 when I run my EMR jobs? (I haven't used Amazon's cloud infrastructure before, so not sure if what I just said makes any sense.)
I recommend the following...
fire up your EMR cluster
elastic-mapreduce --create --alive --other-options-here
log on to the master node and download the data from there
wget http://blah/data
copy into HDFS
hadoop fs -copyFromLocal data /data
There's no real reason to put the original dataset through S3. If you want to keep the results you can move them into S3 before shutting down your cluster.
If the dataset is represented by multiple files you can use the cluster to download it in parallel across the machines. Let me know if this is the case and I'll walk you through it.
Mat
If you're just getting started and experimenting with EMR, I'm guessing you want these on S3 so you don't have to start an interactive Hadoop session (and can instead use the EMR wizards via the AWS console).
The best way would be to start a micro instance in the same region as your S3 bucket, download to that machine using wget, and then upload to S3 using something like s3cmd (which you'll probably need to install on the instance). On Ubuntu:
wget http://example.com/mydataset -O dataset
sudo apt-get install s3cmd
s3cmd --configure
s3cmd put dataset s3://mybucket/
The reason you'll want your instance and S3 bucket in the same region is to avoid extra data transfer charges. Although you'll be charged for inbound bandwidth to the instance for the wget, the transfer to S3 will be free.
I'm not sure about this, but it seems to me that Hadoop should be able to download files directly from your sources.
Just enter http://blah/data as your input, and Hadoop should do the rest. It certainly works with S3, so why shouldn't it work with HTTP?