Flink on EMR cannot access S3 bucket from "flink run" command - amazon-s3

I'm prototyping the use of AWS EMR for a Flink-based system that we're planning to deploy. My cluster has the following versions:
Release label: emr-5.10.0
Hadoop distribution: Amazon 2.7.3
Applications: Flink 1.3.2
Both the documentation provided by Amazon (Amazon Flink documentation) and the documentation from Flink (Apache Flink documentation) mention directly using S3 resources as an integrated file system with the s3://<bucket>/<file> pattern. I have verified that all the correct permissions are set: I can use the AWS CLI to copy S3 resources to the master node with no problem, but attempting to start a Flink job using a JAR from S3 does not work.
I am executing the following step:
JAR location : command-runner.jar
Main class : None
Arguments : flink run -m yarn-cluster -yid application_1513333002475_0001 s3://mybucket/myapp.jar
Action on failure: Continue
The step always fails with
JAR file does not exist: s3://mybucket/myapp.jar
I have spoken to AWS support, and they suggested having a previous step copy the S3 file to the local Master node and then referencing it with a local path. While this would obviously work, I would rather get the native S3 integration working.
I have also tried using the s3a filesystem and get the same result.

You need to download your JAR from S3 so that it is available locally on the master node:
aws s3 cp s3://mybucket/myapp.jar myapp.jar
and then run flink run -m yarn-cluster myapp.jar
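A minimal sketch of that workaround, reusing the bucket and YARN application id from the question and assuming the default hadoop user's home directory on the master node:
aws s3 cp s3://mybucket/myapp.jar /home/hadoop/myapp.jar
flink run -m yarn-cluster -yid application_1513333002475_0001 /home/hadoop/myapp.jar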

Related

Apache Flink to use S3 for backend state and checkpoints

Background
I was planning to use S3 to store Flink's checkpoints using the FsStateBackend, but I was getting the following error.
Error
org.apache.flink.core.fs.UnsupportedFileSystemSchemeException: Could not find a file system implementation for scheme 's3'. The scheme is not directly supported by Flink and no Hadoop file system to support this scheme could be loaded.
Flink version: I am using Flink 1.10.0.
I have found the solution to the above issue, so I am listing the required steps here.
Steps
We need to add some configuration to the flink-conf.yaml file, which I have listed below.
state.backend: filesystem
state.checkpoints.dir: s3://s3-bucket/checkpoints/ #"s3://<your-bucket>/<endpoint>"
state.backend.fs.checkpointdir: s3://s3-bucket/checkpoints/ #"s3://<your-bucket>/<endpoint>"
s3.access-key: XXXXXXXXXXXXXXXXXXX #your-access-key
s3.secret-key: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx #your-secret-key
s3.endpoint: http://127.0.0.1:9000 #your-endpoint-hostname (I have used Minio)
After completing the first step, we need to copy the respective JAR files (flink-s3-fs-hadoop-1.10.0.jar and flink-s3-fs-presto-1.10.0.jar) from the opt directory to the plugins directory of your Flink installation.
E.g.:
1. Copy /flink-1.10.0/opt/flink-s3-fs-hadoop-1.10.0.jar to /flink-1.10.0/plugins/s3-fs-hadoop/flink-s3-fs-hadoop-1.10.0.jar (recommended for the StreamingFileSink)
2. Copy /flink-1.10.0/opt/flink-s3-fs-presto-1.10.0.jar to /flink-1.10.0/plugins/s3-fs-presto/flink-s3-fs-presto-1.10.0.jar (recommended for checkpointing)
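As shell commands, the copy step could look like this (assuming Flink is installed at /flink-1.10.0; the plugin sub-directories do not exist by default and have to be created):
mkdir -p /flink-1.10.0/plugins/s3-fs-hadoop /flink-1.10.0/plugins/s3-fs-presto
cp /flink-1.10.0/opt/flink-s3-fs-hadoop-1.10.0.jar /flink-1.10.0/plugins/s3-fs-hadoop/
cp /flink-1.10.0/opt/flink-s3-fs-presto-1.10.0.jar /flink-1.10.0/plugins/s3-fs-presto/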
Add this to your checkpointing code:
env.setStateBackend(new FsStateBackend("s3://s3-bucket/checkpoints/"))
After completing all the above steps, restart Flink if it is already running.
Note:
If you are using both (flink-s3-fs-hadoop and flink-s3-fs-presto) in Flink, then please use s3p:// specifically for flink-s3-fs-presto and s3a:// for flink-s3-fs-hadoop, instead of s3://.

AWS EMR s3a filesystem not found

I am running an EMR instance. It was working fine, but suddenly it started giving the below error when I try to access S3 files from a Python Spark script:
py4j.protocol.Py4JJavaError: An error occurred while calling o36.json.:
java.lang.RuntimeException:
java.lang.ClassNotFoundException:
Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
How can we resolve this?
Thanks in advance.
It was an issue with the dependencies of Spark. I had to add a jars config in spark-defaults.conf.
spark.jars.packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.2
Please follow below link:
https://gist.github.com/eddies/f37d696567f15b33029277ee9084c4a0
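Alternatively, the same packages can be supplied at submit time instead of editing spark-defaults.conf (the script name here is a placeholder):
spark-submit --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.2 my_script.py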
Download the hadoop-aws-3.2.1.jar (or any version above 2.7.10, based on your EMR version) and put it in /usr/lib/spark/jars
Download the latest AWS SDK and put it in /usr/lib/spark/jars
Update /usr/lib/spark/conf/spark-defaults.conf
Update spark.driver.extraClassPath - at the end, add the full paths of these 2 new jars, separated by a colon
Run spark-submit after that
Note: I used AWS EMR version 6.0+
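A rough shell sketch of those steps (the jar versions and Maven Central URLs are examples; pick versions that match your EMR release):
cd /usr/lib/spark/jars
sudo wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.2.1/hadoop-aws-3.2.1.jar
sudo wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.874/aws-java-sdk-bundle-1.11.874.jar
# Then append both jar paths to spark.driver.extraClassPath in
# /usr/lib/spark/conf/spark-defaults.conf, separated by colons, and re-run spark-submit.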
For Amazon EMR, use the "s3:" prefix. The S3A connector is the ASF's open-source one; Amazon has its own (closed-source) connector, which is the only one it supports.

How do I download one or more files from a stopped Fargate task?

I have an ECS task that runs some test cases. I have it running in Fargate. Yay!
Now I want to download the test results file(s) from the container. I have the task and container IDs handy. I can find the exit code with
aws ecs describe-tasks --cluster Fargate --tasks <my-task-id>
How do I download the log and/or files produced?
It looks like, as of right now, the only way to get test results off of my server is to send the results to S3 before the container shuts down.
From this thread, there's no way to mount a volume / EFS onto a Fargate container.
Here's my bash script for running my tests (in build.sh) and then uploading the results to S3:
#!/bin/bash
echo "Running tests..."
pushd ~circleci/project/
export AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY
export AWS_SECRET_ACCESS_KEY=$AWS_SECRET_KEY
commandToRun="~/project/.circleci/build_scripts/build.sh"
# Run the command, teeing its output to a log file
eval $commandToRun 2>&1 | tee /tmp/build-$FEATURE.log
# Get the exit code of the command itself, not of tee
exitCode=${PIPESTATUS[0]}
# Upload the log to S3 before the container shuts down
aws s3 cp /tmp/build-$FEATURE.log s3://$CICD_BUCKET/build.log \
    --storage-class REDUCED_REDUNDANCY \
    --region us-east-1
exit ${exitCode}
Of course, you'll have to pass in the AWS_ACCESS_KEY, AWS_SECRET_KEY and CICD_BUCKET environment variables. The bucket name you choose needs to be pre-created, but any directory structure below it does NOT need to be created in advance.
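Creating the bucket up front is a one-liner (the bucket name is a placeholder):
aws s3 mb s3://my-cicd-bucket --region us-east-1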
You probably want to look at using CodeBuild for this use case, which can automatically copy artifacts to S3.
It's actually quite easy to orchestrate the following using a simple bash script and the AWS CLI:
Idempotently Create/Update a CodeBuild project (using a simple CloudFormation template you can define in your source repository)
Run a CodeBuild job that executes a given revision of your source repository (again using a buildspec.yml specification defined in your source repository)
Attach to the CloudWatch logs log group for your CodeBuild job and stream log output
Finally, detect whether the job has completed successfully and then download any artifacts locally from S3
I use this approach to run builds in CodeBuild, with Bamboo as the overarching continuous delivery system.
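A hypothetical sketch of that orchestration with the AWS CLI (the project, stack, and bucket names are made up, and the log-streaming step is omitted):
# Idempotently create or update the CodeBuild project from a template in the repo
aws cloudformation deploy --template-file codebuild.yml --stack-name my-codebuild-project
# Kick off a build of a given revision and remember its id
BUILD_ID=$(aws codebuild start-build --project-name my-project \
    --source-version "$GIT_COMMIT" --query 'build.id' --output text)
# Poll until the build leaves the IN_PROGRESS state
while [ "$(aws codebuild batch-get-builds --ids "$BUILD_ID" \
    --query 'builds[0].buildStatus' --output text)" = "IN_PROGRESS" ]; do
    sleep 10
done
# Pull down whatever artifacts the buildspec published to S3
aws s3 cp s3://my-artifact-bucket/my-project/ ./artifacts --recursive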

External checkpoints to S3 on EMR

I am trying to deploy a production cluster for my Flink program. I am using a standard hadoop-core EMR cluster with Flink 1.3.2 installed, using YARN to run it.
I am trying to configure RocksDB to write my checkpoints to an S3 bucket. I am following these docs: https://ci.apache.org/projects/flink/flink-docs-release-1.3/setup/aws.html#set-s3-filesystem. The problem seems to be getting the dependencies working correctly. I receive this error when trying to run the program:
java.lang.NoSuchMethodError: org.apache.hadoop.conf.Configuration.addResource(Lorg/apache/hadoop/conf/Configuration;)V
at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.initialize(EmrFileSystem.java:93)
at org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.initialize(HadoopFileSystem.java:328)
at org.apache.flink.core.fs.FileSystem.getUnguardedFileSystem(FileSystem.java:350)
at org.apache.flink.core.fs.FileSystem.get(FileSystem.java:389)
at org.apache.flink.core.fs.Path.getFileSystem(Path.java:293)
at org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory.<init>(FsCheckpointStreamFactory.java:99)
at org.apache.flink.runtime.state.filesystem.FsStateBackend.createStreamFactory(FsStateBackend.java:282)
at org.apache.flink.contrib.streaming.state.RocksDBStateBackend.createStreamFactory(RocksDBStateBackend.java:273)
I have tried both adjusting the core-site.xml and leaving it as is. I have tried setting HADOOP_CLASSPATH to /usr/lib/hadoop/share, which contains (what I assume are) most of the JARs described in the above guide. I tried downloading the Hadoop 2.7.2 binaries and copying them into the flink/lib directory. All attempts resulted in the same error.
Has anyone successfully gotten Flink to write to S3 on EMR?
EDIT: My cluster setup
AWS Portal:
1) EMR -> Create Cluster
2) Advanced Options
3) Release = emr-5.8.0
4) Only select Hadoop 2.7.3
5) Next -> Next -> Next -> Create Cluster ( I do fill out names/keys/etc)
Once the cluster is up I ssh into the Master and do the following:
1 wget http://apache.claz.org/flink/flink-1.3.2/flink-1.3.2-bin-hadoop27-scala_2.11.tgz
2 tar -xzf flink-1.3.2-bin-hadoop27-scala_2.11.tgz
3 cd flink-1.3.2
4 ./bin/yarn-session.sh -n 2 -tm 5120 -s 4 -d
5 Change conf/flink-conf.yaml
6 ./bin/flink run -m yarn-cluster -yn 1 ~/flink-consumer.jar
In my conf/flink-conf.yaml I add the following fields:
state.backend: rocksdb
state.backend.fs.checkpointdir: s3://bucket/location
state.checkpoints.dir: s3://bucket/location
My program's checkpointing setup:
env.enableCheckpointing(getCheckpointRate,CheckpointingMode.EXACTLY_ONCE)
env.getCheckpointConfig.enableExternalizedCheckpoints(ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION)
env.getCheckpointConfig.setMinPauseBetweenCheckpoints(getCheckpointMinPause)
env.getCheckpointConfig.setCheckpointTimeout(getCheckpointTimeout)
env.getCheckpointConfig.setMaxConcurrentCheckpoints(1)
env.setStateBackend(new RocksDBStateBackend("s3://bucket/location", true))
If there are any steps you think I am missing, please let me know
I assume that you installed Flink 1.3.2 on your own on the EMR Yarn cluster, because Amazon does not yet offer Flink 1.3.2, right?
Given that, it seems as if you have a dependency conflict. The method org.apache.hadoop.conf.Configuration.addResource(Lorg/apache/hadoop/conf/Configuration;)V was only introduced with Hadoop 2.4.0. Therefore, I assume that you have deployed a Flink 1.3.2 version which was built against Hadoop 2.3.0. Please deploy a Flink version which was built against the Hadoop version running on EMR. This will most likely solve all dependency conflicts.
Putting the Hadoop dependencies into the lib folder seems to not reliably work because the flink-shaded-hadoop2-uber.jar appears to have precedence in the classpath.
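For example, a Flink 1.3.2 distribution matching EMR 5.8.0's Hadoop 2.7.3 can be built from source roughly like this (a sketch; it assumes Maven is available on the build machine):
wget http://archive.apache.org/dist/flink/flink-1.3.2/flink-1.3.2-src.tgz
tar -xzf flink-1.3.2-src.tgz
cd flink-1.3.2
mvn clean install -DskipTests -Dhadoop.version=2.7.3
# The built distribution ends up under build-target/, ready to copy to the EMR master node.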

How to install sqoop in Amazon EMR?

I've created a cluster in Amazon EMR and using -emr-4.0.0. Hadoop distribution:Amazon 2.6.0 and Hive 1.0.0. Need to install Sqoop so that I can communicate between Hive and Redshift? What are the steps to install Sqoop in EMR cluster? Requesting to provide the steps. Thank You!
Note that in EMR 4.0.0 hadoop fs -copyToLocal will throw errors.
Use aws s3 cp instead.
To be more specific than Amal:
Download the latest version of SQOOP and upload it to an S3 location. I am using sqoop-1.4.4.bin__hadoop-2.0.4-alpha and it seems to work just fine with EMR 4.0.0
Download the JAR connector for Redshift and upload it to same S3 location. This page might help.
Upload a script similar to the one below to S3
#!/bin/bash
# Install Sqoop and the Redshift JDBC connector. Store in S3 and load
# as a bootstrap step.
bucket_location='s3://your-sqoop-jars-location/'
sqoop_jar='sqoop-1.4.4.bin__hadoop-2.0.4-alpha'
sqoop_jar_gz=$sqoop_jar.tar.gz
redshift_jar='RedshiftJDBC41-1.1.7.1007.jar'
cd /home/hadoop
aws s3 cp $bucket_location$sqoop_jar_gz .
tar -xzf $sqoop_jar_gz
aws s3 cp $bucket_location$redshift_jar .
cp $redshift_jar $sqoop_jar/lib/
Set SQOOP_HOME and add SQOOP_HOME to the PATH to be able to call sqoop from anywhere. These entries should be made in /etc/bashrc. Otherwise you will have to use the full path, in this case: /home/hadoop/sqoop-1.4.4.bin__hadoop-2.0.4-alpha/bin/sqoop
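For example, appended from the same bootstrap script (the path matches the tarball extracted above):
echo 'export SQOOP_HOME=/home/hadoop/sqoop-1.4.4.bin__hadoop-2.0.4-alpha' | sudo tee -a /etc/bashrc
echo 'export PATH=$PATH:$SQOOP_HOME/bin' | sudo tee -a /etc/bashrc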
I am using Java to programmatically launch my EMR cluster. To configure bootstrap steps in Java, I create a BootstrapActionConfigFactory:
import com.amazonaws.services.elasticmapreduce.model.BootstrapActionConfig;
import com.amazonaws.services.elasticmapreduce.model.ScriptBootstrapActionConfig;

public final class BootstrapActionConfigFactory {
private static final String bucket = Config.getBootstrapBucket();
// make class non-instantiable
private BootstrapActionConfigFactory() {
}
/**
* Adds an install Sqoop step to the job that corresponds to the version set in the Config class.
*/
public static BootstrapActionConfig newInstallSqoopBootstrapActionConfig() {
return newInstallSqoopBootstrapActionConfig(Config.getHadoopVersion().charAt(0));
}
/**
* Adds an install Sqoop step to the job that corresponds to the version specified in the parameter
*
* @param hadoopVersion the main version number for Hadoop. E.g.: 1, 2
*/
public static BootstrapActionConfig newInstallSqoopBootstrapActionConfig(char hadoopVersion) {
return new BootstrapActionConfig().withName("Install Sqoop")
.withScriptBootstrapAction(
new ScriptBootstrapActionConfig().withPath("s3://" + bucket + "/sqoop-tools/hadoop" + hadoopVersion + "/bootstrap-sqoop-emr4.sh"));
}
}
Then when creating the job:
Job job = new Job(Region.getRegion(Regions.US_EAST_1));
job.addBootstrapAction(BootstrapActionConfigFactory.newInstallSqoopBootstrapActionConfig());
Download the Sqoop tarball and keep it in an S3 bucket. Create a bootstrap script that performs the following activities:
Download the Sqoop tarball to the required instances
Extract the tarball
Set SQOOP_HOME and add SQOOP_HOME to the PATH. These entries should be made in /etc/bashrc
Add the required connector JARs to Sqoop's lib directory
Keep this script in S3 and point to it in the bootstrap actions
Note that from EMR 4.4.0 onward, AWS added support for Sqoop 1.4.6 to the EMR cluster. Installation is done with a couple of clicks during setup; there is no need for manual installation.
References:
https://aws.amazon.com/blogs/aws/amazon-emr-4-4-0-sqoop-hcatalog-java-8-and-more/
http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-sqoop.html