How do I use EMRFS for checkpointing with Structured Streaming? - amazon-emr

I have been using S3 for checkpointing with Structured Streaming. However, I am getting a FileNotFoundException related to eventual consistency in S3.
Below is what I currently have with S3 checkpointing.
val msg = testMsgs.writeStream.option("checkpointLocation",
  "s3://<bucket-name>/checkpoint123").foreach(writer).start()
I am planning to switch to EMRFS, since my Spark job runs on EMR.
How reliable is EMRFS and how do I use EMRFS for checkpointing?
Will there be a change in the way we implement checkpoint?
How do I enable EMRFS in EMR?

Related

Apache flink with S3 as source and S3 as sink

Is it possible to read events as they land in an S3 source bucket via Apache Flink, process them, and sink them back to some other S3 bucket? Is there a special connector for that, or do I have to use the available read/save examples mentioned in the Apache Flink documentation?
How does checkpointing happen in such a case? Does Flink keep track of what it has read from the S3 source bucket automatically, or does custom code need to be built? Does Flink also guarantee exactly-once processing in the S3 source case?
In Flink 1.11 the FileSystem SQL Connector is much improved; that will be an excellent solution for this use case.
With the DataStream API you can use FileProcessingMode.PROCESS_CONTINUOUSLY with readFile to monitor a bucket and ingest new files as they are atomically moved into it. Flink keeps track of the last-modified timestamp of the bucket, and ingests any children modified since that timestamp -- doing so in an exactly-once way (the read offsets into those files are included in checkpoints).

Terraform resource for AWS S3 Batch Operation

I couldn't find a Terraform resource for an AWS S3 batch operation. I was able to create an AWS S3 inventory file through Terraform, but couldn't create an S3 batch operation.
Has anyone created an S3 batch operation through Terraform?
No, there is no Terraform resource for an S3 batch operation. In general, most Terraform providers only have resources for things that are actually resources (they hang around), not things that could be considered "tasks". For the same reason, there's no CloudFormation resource for S3 batch operations either.
Your best bet is to use a module that allows you to run shell commands and use the AWS CLI for it. I like to use this module for these kinds of tasks. You would use it in combination with the AWS CLI command for S3 batch jobs.
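As a rough sketch of the AWS CLI side of that approach: the call below uses aws s3control create-job with an S3 Inventory manifest, which fits the inventory file mentioned in the question. The account ID, role ARN, bucket ARNs, ETag placeholder, the tagging operation, and the run/DRY_RUN helper are all assumptions made up for illustration. By default the script only prints the command.

```shell
#!/bin/sh
# Sketch: create an S3 Batch Operations job with the AWS CLI.
# All IDs, ARNs and bucket names below are hypothetical placeholders.
set -eu

ACCOUNT_ID="111122223333"
ROLE_ARN="arn:aws:iam::111122223333:role/example-batch-ops-role"
MANIFEST_ARN="arn:aws:s3:::example-inventory-bucket/inventory/manifest.json"
MANIFEST_ETAG="REPLACE_WITH_MANIFEST_ETAG"
REPORT_BUCKET_ARN="arn:aws:s3:::example-report-bucket"

# Print the command by default; set DRY_RUN=0 to actually execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

create_batch_job() {
  # Example operation (an assumption): tag every object listed in the inventory.
  operation='{"S3PutObjectTagging":{"TagSet":[{"Key":"reviewed","Value":"true"}]}}'
  manifest='{"Spec":{"Format":"S3InventoryReport_CSV_20161130"},"Location":{"ObjectArn":"'"$MANIFEST_ARN"'","ETag":"'"$MANIFEST_ETAG"'"}}'
  report='{"Bucket":"'"$REPORT_BUCKET_ARN"'","Prefix":"batch-reports","Format":"Report_CSV_20180820","Enabled":true,"ReportScope":"AllTasks"}'
  run aws s3control create-job \
    --account-id "$ACCOUNT_ID" \
    --operation "$operation" \
    --manifest "$manifest" \
    --report "$report" \
    --priority 10 \
    --role-arn "$ROLE_ARN" \
    --no-confirmation-required
}

create_batch_job
```

From Terraform, a script like this would typically be called from whatever shell-command mechanism you use (for example a local-exec provisioner), with DRY_RUN=0 set in the environment.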

Any AWS S3 API to move files from HDFS on Amazon EMR to Amazon S3 from spark application

We have a requirement to copy files from within a Spark job (running on a Hadoop cluster spun up by EMR) to the respective S3 bucket.
As of now, we are using the Hadoop FileSystem API (FileUtil.copy) to copy or move files between two different file systems.
val config = Spark.sparkContext.hadoopConfiguration
FileUtil.copy(sourceFileSystem, sourceFile, destinationFileSystem, targetLocation, true, config)
This method works as required but is not efficient. It streams each file, so execution time depends on the size and number of the files to be copied.
For a similar requirement, moving files between two folders of the same S3 bucket, we are using the com.amazonaws.services.s3 package as below.
val uri1 = new AmazonS3URI(sourcePath)
val uri2 = new AmazonS3URI(targetPath)
s3Client.copyObject(uri1.getBucket, uri1.getKey, uri2.getBucket, uri2.getKey)
The above package only has methods to copy or move objects between two S3 locations. My requirement is to copy files between HDFS (on a cluster spun up by EMR) and the root S3 bucket.
Can anyone suggest a better way, or an AWS S3 API that can be used from Spark with Scala, for moving files between HDFS and an S3 bucket?
We had a similar scenario and ended up using S3DistCp.
S3DistCp is an extension of DistCp that is optimized to work with AWS, particularly S3. You can use S3DistCp to copy data between Amazon S3 buckets or from HDFS to Amazon S3. S3DistCp is more scalable and efficient for copying large numbers of objects in parallel across buckets and across AWS accounts.
You can find more details here.
You can refer to this sample Java code for the same here.
Hope this helps!
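The S3DistCp suggestion above can be sketched roughly as follows, assuming an EMR release (4.x or later) where the s3-dist-cp command is available on the master node. The HDFS path, bucket name, and the run/DRY_RUN helper are placeholders for illustration, and the script only prints the command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: copy an HDFS directory to S3 with S3DistCp from the EMR master node.
# The paths and bucket below are hypothetical placeholders.
set -eu

SRC="hdfs:///data/output"              # assumed HDFS source directory
DEST="s3://my-example-bucket/output"   # assumed S3 destination prefix

# Print the command by default; set DRY_RUN=0 on the cluster to run it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Copy everything under the source directory to the destination prefix.
copy_hdfs_to_s3() {
  run s3-dist-cp --src "$1" --dest "$2"
}

copy_hdfs_to_s3 "$SRC" "$DEST"
```

Because S3DistCp runs as a distributed job on the cluster, the copy is spread across nodes instead of being streamed through the Spark driver the way FileUtil.copy is.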

block file system on S3

I am a little puzzled and hope someone can help me out.
We create some ORC files that we would like to query while they are stored on S3.
We noticed that the S3 native filesystem (s3n) does not really work for this. I am not sure what the problem is, but my guess is that the reader cannot seek to specific bytes inside the file, so it has to load the whole file before it can query it.
So we tried storing the files on S3 with the s3:// URI, which is a block file system just like HDFS but backed by S3, and it worked great.
But I am a little worried after reading this Amazon EMR source, which says:
Amazon S3 block file system (URI path: s3bfs://)
The Amazon S3 block file system is a legacy file storage system. We strongly discourage the use of this system.
Important
We recommend that you do not use this file system because it can trigger a race condition that might cause your cluster to fail. However, it might be required by legacy applications.
EMRFS (URI path: s3://)
EMRFS is an implementation of HDFS used for reading and writing regular files from Amazon EMR directly to Amazon S3.
I am not using EMR. I create my files by launching an EC2 cluster and then use S3 as cold storage. But I am puzzled right now and not sure which filesystem I am using when I store my files on S3 with the URI scheme s3://. Am I using EMRFS, or the deprecated s3bfs filesystem?
Amazon S3 is an object storage system. It is not recommended to "mount" S3 as a filesystem. Amazon Elastic Block Store (EBS) is a block storage system that appears as volumes on Amazon EC2 instances.
When used from Amazon Elastic MapReduce (EMR), Hadoop has extensions that make it easy to work with Amazon S3. However, if you are not using EMR, there is no need to use EMRFS (which is available only on EMR), nor should you use S3 as a block storage system.
The easiest way to use S3 from EC2 is via the AWS Command-Line Interface (CLI). You can copy files to/from S3 by using the aws s3 cp command. There's also a sync command to make it easy to synchronize data to/from S3.
You can also programmatically connect to Amazon S3 via an SDK, so that your app can directly transfer files to/from S3.
As to which to choose... typically, applications like to work with files on a local filesystem, so copy your files from S3 to a local device. However, if your app can communicate with S3 directly, there will be fewer "moving parts".
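To make the two CLI commands mentioned above concrete, here is a minimal sketch. The bucket name, object key, local directory, and the run/DRY_RUN helper are assumptions for illustration; the script prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: moving files between S3 and an EC2 instance with the AWS CLI.
# Bucket, key and directory names are hypothetical placeholders.
set -eu

BUCKET="my-example-bucket"
LOCAL_DIR="/tmp/orc-data"

# Print the command by default; set DRY_RUN=0 to actually execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Download a single object from S3 to a local path.
fetch_object() {
  run aws s3 cp "s3://$1/$2" "$3"
}

# Upload a local directory to an S3 prefix; sync only copies new or changed files.
push_dir() {
  run aws s3 sync "$1" "s3://$2/$3"
}

fetch_object "$BUCKET" "tables/part-0000.orc" "$LOCAL_DIR/part-0000.orc"
push_dir "$LOCAL_DIR" "$BUCKET" "tables/"
```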

How do I copy files from S3 to Amazon EMR HDFS?

I'm running Hive on EMR and need to copy some files to all EMR instances.
As I understand it, one way is to copy the files to the local file system on each node, and the other is to copy the files to HDFS. However, I haven't found a simple way to copy straight from S3 to HDFS.
What is the best way to go about this?
The best way to do this is to use Hadoop's distcp command. Example (on one of the cluster nodes):
% ${HADOOP_HOME}/bin/hadoop distcp s3n://mybucket/myfile /root/myfile
This would copy a file called myfile from an S3 bucket named mybucket to /root/myfile in HDFS. Note that this example assumes you are using the S3 file system in "native" mode; this means that Hadoop sees each object in S3 as a file. If you use S3 in block mode instead, you would replace s3n with s3 in the example above. For more info about the differences between native S3 and block mode, as well as an elaboration on the example above, see http://wiki.apache.org/hadoop/AmazonS3.
I found that distcp is a very powerful tool. In addition to being able to use it to copy a large amount of files in and out of S3, you can also perform fast cluster-to-cluster copies with large data sets. Instead of pushing all the data through a single node, distcp uses multiple nodes in parallel to perform the transfer. This makes distcp considerably faster when transferring large amounts of data, compared to the alternative of copying everything to the local file system as an intermediary.
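For the cluster-to-cluster case, a minimal sketch could look like the following. The NameNode hosts, port, paths, map count, and the run/DRY_RUN helper are assumptions for illustration; -m is distcp's option for the maximum number of simultaneous copy tasks. The script prints the command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: parallel cluster-to-cluster copy with distcp.
# Host names, port and paths are hypothetical placeholders.
set -eu

SRC="hdfs://namenode-a.example.com:8020/data/events"
DEST="hdfs://namenode-b.example.com:8020/data/events"
MAX_MAPS=20   # upper bound on simultaneous copy tasks (assumed value)

# Print the command by default; set DRY_RUN=0 on a cluster node to run it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Copy a directory tree from one cluster to another using up to $1 map tasks.
cluster_copy() {
  run hadoop distcp -m "$1" "$2" "$3"
}

cluster_copy "$MAX_MAPS" "$SRC" "$DEST"
```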
Amazon itself now provides a wrapper around distcp, namely S3DistCp.
S3DistCp is an extension of DistCp that is optimized to work with Amazon Web Services (AWS), particularly Amazon Simple Storage Service (Amazon S3). You use S3DistCp by adding it as a step in a job flow. Using S3DistCp, you can efficiently copy large amounts of data from Amazon S3 into HDFS where it can be processed by subsequent steps in your Amazon Elastic MapReduce (Amazon EMR) job flow. You can also use S3DistCp to copy data between Amazon S3 buckets or from HDFS to Amazon S3.
Example: Copy log files from Amazon S3 to HDFS
The following example illustrates how to copy log files stored in an Amazon S3 bucket into HDFS. In this example the --srcPattern option is used to limit the data copied to the daemon logs.
elastic-mapreduce --jobflow j-3GY8JC4179IOJ --jar \
s3://us-east-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar \
--args '--src,s3://myawsbucket/logs/j-3GY8JC4179IOJ/node/,\
--dest,hdfs:///output,\
--srcPattern,.*daemons.*-hadoop-.*'
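The elastic-mapreduce CLI used above has since been replaced by the AWS CLI. A rough equivalent of the same step, assuming an EMR 4.x or later cluster where command-runner.jar and s3-dist-cp are available, might look like this. The step name and the run/DRY_RUN helper are made up for illustration, and the script only prints the command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: submit the same S3DistCp copy as an EMR step via the AWS CLI.
# Assumes an EMR 4.x+ cluster; the step name is a hypothetical placeholder.
set -eu

# Print the command by default; set DRY_RUN=0 to actually submit the step.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Arguments: cluster id, source URI, destination URI, source regex pattern.
add_s3distcp_step() {
  run aws emr add-steps --cluster-id "$1" \
    --steps "Type=CUSTOM_JAR,Name=S3DistCpLogs,Jar=command-runner.jar,Args=[s3-dist-cp,--src=$2,--dest=$3,--srcPattern=$4]"
}

add_s3distcp_step "j-3GY8JC4179IOJ" \
  "s3://myawsbucket/logs/j-3GY8JC4179IOJ/node/" \
  "hdfs:///output" \
  ".*daemons.*-hadoop-.*"
```

Note that the shorthand Args=[...] list is comma-separated, so a --srcPattern that itself contains commas would need the JSON form of --steps instead.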
Note that according to Amazon's "Amazon Elastic MapReduce - File System Configuration" page (http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/FileSystemConfig.html), the S3 block file system is deprecated, its URI prefix is now s3bfs://, and Amazon specifically discourages using it since "it can trigger a race condition that might cause your job flow to fail".
According to the same page, HDFS is now a 'first-class' file system under S3, although it is ephemeral (it goes away when the Hadoop job ends).