AWS EMR s3a filesystem not found

I am running an EMR instance. It was working fine, but it suddenly started giving the error below when I try to access S3 files from a Python Spark script:
py4j.protocol.Py4JJavaError: An error occurred while calling o36.json.:
java.lang.RuntimeException:
java.lang.ClassNotFoundException:
Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
How can we resolve this?
Thanks in advance.

It was an issue with Spark's dependencies. I had to add the jars config in spark-defaults.conf:
spark.jars.packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.2
Please follow the link below:
https://gist.github.com/eddies/f37d696567f15b33029277ee9084c4a0
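If editing spark-defaults.conf is not convenient, the same Maven coordinates can be passed directly on the command line (the script name here is just a placeholder):
spark-submit --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.2 my_script.py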

Download hadoop-aws-3.2.1.jar (or any version above 2.7.10, depending on your EMR version) and put it in /usr/lib/spark/jars.
Download the latest AWS Java SDK and put it in /usr/lib/spark/jars.
Update /usr/lib/spark/conf/spark-defaults.conf:
update spark.driver.extraClassPath: at the end, append the full paths of these two new jars, separated by a colon.
Run spark-submit after that; a rough sketch of these steps is shown below.
Note: I used AWS EMR version 6.0+
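A rough shell sketch of those steps (the jar versions, Maven Central URLs, and the existing extraClassPath value are assumptions; check what your EMR release actually ships):
cd /usr/lib/spark/jars
sudo curl -O https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.2.1/hadoop-aws-3.2.1.jar
sudo curl -O https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.375/aws-java-sdk-bundle-1.11.375.jar
# then, in /usr/lib/spark/conf/spark-defaults.conf, append both jars to the existing value:
# spark.driver.extraClassPath <existing entries>:/usr/lib/spark/jars/hadoop-aws-3.2.1.jar:/usr/lib/spark/jars/aws-java-sdk-bundle-1.11.375.jar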

For Amazon EMR, use the "s3://" prefix. The S3A connector is the ASF's open-source one; Amazon has its own (closed-source) connector, which is the only one they support.
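In other words, a read that fails with s3a:// on EMR would instead use the s3:// scheme. A minimal Scala sketch, with the bucket and path as placeholders:
// assuming an existing SparkSession named spark
val df = spark.read.json("s3://my-bucket/path/data.json")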

Related

Checkpoint s3p flink on EMR

I have a problem with checkpointing via s3p in Flink on EMR.
When creating the EMR cluster, I ticked Presto and added the jar file as instructed at https://ci.apache.org/projects/flink/flink-docs-stable/ops/plugins.html.
But when checkpointing via s3p in Flink, it still reports:
Caused by: org.apache.flink.core.fs.UnsupportedFileSystemSchemeException: Could not find a file system implementation for scheme 's3p'. The scheme is directly supported by Flink through the following plugin: flink-s3-fs-presto. Please ensure that each plugin resides within its own subfolder within the plugins directory. See https://ci.apache.org/projects/flink/flink-docs-stable/ops/plugins.html for more information. If you want to use a Hadoop file system for that scheme, please add the scheme to the configuration fs.allowed-fallback-filesystems. For a full list of supported file systems, please see https://ci.apache.org/projects/flink/flink-docs-stable/ops/filesystems/.
Can you help me checkpoint via s3p in Flink on EMR?
Thanks.
Presto in EMR has nothing to do with the flink-s3-fs-presto plugin in Flink. You can leave it unticked in the future (having it ticked doesn't hurt either, apart from installing components you don't need).
The most likely reason is that you forgot to create a subfolder in the plugins folder. Could you give me an ls of your Flink distribution?
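For reference, each plugin jar must sit in its own subfolder under plugins/, not directly in plugins/ itself. On EMR that would look roughly like this (the /usr/lib/flink paths are assumptions; adjust to where your Flink distribution and its opt/ directory actually live):
sudo mkdir -p /usr/lib/flink/plugins/s3-fs-presto
sudo cp /usr/lib/flink/opt/flink-s3-fs-presto-*.jar /usr/lib/flink/plugins/s3-fs-presto/
# resulting layout: plugins/s3-fs-presto/flink-s3-fs-presto-<version>.jar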

Apache Flink to use S3 for backend state and checkpoints

Background
I was planning to use S3 to store Flink's checkpoints using the FsStateBackend, but somehow I was getting the following error.
Error
org.apache.flink.core.fs.UnsupportedFileSystemSchemeException: Could not find a file system implementation for scheme 's3'. The scheme is not directly supported by Flink and no Hadoop file system to support this scheme could be loaded.
Flink version: I am using Flink 1.10.0.
I have found the solution to the above issue, so I am listing the required steps here.
Steps
We need to add some configs to the flink-conf.yaml file, which I have listed below:
state.backend: filesystem
state.checkpoints.dir: s3://s3-bucket/checkpoints/ #"s3://<your-bucket>/<endpoint>"
state.backend.fs.checkpointdir: s3://s3-bucket/checkpoints/ #"s3://<your-bucket>/<endpoint>"
s3.access-key: XXXXXXXXXXXXXXXXXXX #your-access-key
s3.secret-key: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx #your-secret-key
s3.endpoint: http://127.0.0.1:9000 #your-endpoint-hostname (I have used Minio)
After completing the first step, we need to copy the respective JAR files (flink-s3-fs-hadoop-1.10.0.jar and flink-s3-fs-presto-1.10.0.jar) from the opt directory to the plugins directory of your Flink distribution.
E.g.: 1. Copy /flink-1.10.0/opt/flink-s3-fs-hadoop-1.10.0.jar to /flink-1.10.0/plugins/s3-fs-hadoop/flink-s3-fs-hadoop-1.10.0.jar // Recommended for StreamingFileSink
2. Copy /flink-1.10.0/opt/flink-s3-fs-presto-1.10.0.jar to /flink-1.10.0/plugins/s3-fs-presto/flink-s3-fs-presto-1.10.0.jar // Recommended for checkpointing
Add this in the checkpointing code:
env.setStateBackend(new FsStateBackend("s3://s3-bucket/checkpoints/"))
After completing all the above steps, restart Flink if it is already running.
Note:
If you are using both (flink-s3-fs-hadoop and flink-s3-fs-presto) in Flink, then please use s3p:// specifically for flink-s3-fs-presto and s3a:// for flink-s3-fs-hadoop, instead of s3://.
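Putting the pieces together, a minimal Scala sketch of the checkpointing setup (the bucket and the 60-second interval are placeholders; FsStateBackend matches the Flink 1.10 API used above):
import org.apache.flink.runtime.state.filesystem.FsStateBackend
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

val env = StreamExecutionEnvironment.getExecutionEnvironment
// checkpoint every 60 seconds to the bucket configured in flink-conf.yaml
env.enableCheckpointing(60000)
// s3:// resolves to whichever S3 filesystem plugin is installed; use s3p://
// to force flink-s3-fs-presto when both plugins are present
env.setStateBackend(new FsStateBackend("s3://s3-bucket/checkpoints/"))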

Alluxio + Hive on EMR

I have Alluxio 1.8 installed on an EMR 5.19.0 cluster, and can see my S3 tables using /usr/local/alluxio/bin/alluxio fs ls /.
However, when I start up hive and issue
hive> [[ DDL w/ LOCATION = alluxio://master_host:19998/my_table ]], I get the following:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:java.lang.RuntimeException: java.lang.ClassNotFoundException: Class alluxio.hadoop.FileSystem not found
Is there a way of getting past this? I've tried starting hive with --auxpath pointing to both /usr/local/alluxio/client/alluxio-1.8.1-client.jar and a copy of the jar on hdfs without any success.
Any help?
I posted a blog post about the reasons for the error message java.lang.ClassNotFoundException: Class alluxio.hadoop.FileSystem not found. Here are some tips; I hope they help:
For Hive, set environment variable HIVE_AUX_JARS_PATH in conf/hive-env.sh:
export HIVE_AUX_JARS_PATH=/<PATH_TO_ALLUXIO>/client/alluxio-1.8.1-client.jar:${HIVE_AUX_JARS_PATH}
which I guess is equivalent to what you have done to set --auxpath.
Depending on your setup of Hive (e.g., Hive on MR, Spark, or Tez), you may also need to make sure the runtime can access the client jar. Taking Hive on MR as an example, you perhaps also need to append the path of the Alluxio client jar to mapreduce.application.classpath or yarn.application.classpath to ensure each task of the MR jobs can access this jar.
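For the Hive-on-MR case, a hedged sketch of that classpath change (the property name is the standard Hadoop one; the existing, site-specific entries are elided here):
<!-- mapred-site.xml -->
<property>
  <name>mapreduce.application.classpath</name>
  <value>...existing entries...,/<PATH_TO_ALLUXIO>/client/alluxio-1.8.1-client.jar</value>
</property>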

Streaming Sink to S3

I am trying to create an S3 sink for my streaming output. I figured a BucketingSink would be fine, since it is used for HDFS, but it seems that an S3 URL is not recognized as HDFS. I get the following error:
Exception in thread "main" org.apache.flink.runtime.client.JobExecutionException: java.lang.RuntimeException: Error while creating FileSystem when initializing the state of the BucketingSink.
Caused by: org.apache.flink.core.fs.UnsupportedFileSystemSchemeException: Cannot support file system for 's3' via Hadoop, because Hadoop is not in the classpath, or some classes are missing from the classpath.
Is there a way to make S3 work for BucketingSink, or is there another option other than BucketingSink that I can use? I am currently running 1.5.2. Would be happy to provide any additional information.
Thank you!
Edit:
My sink creation/usage looks like the following:
val s3Sink = new BucketingSink[String]("s3://s3bucket/sessions")
s3Sink.setBucketer(new DateTimeBucketer[String]("yyyy-MM-dd--HHmm"))
s3Sink.setWriter(new StringWriter[String]())
s3Sink.setBatchSize(200)
s3Sink.setPendingPrefix("sessions-")
s3Sink.setPendingSuffix(".csv")
// Create stream and do stuff here
stream.addSink(s3Sink)
You probably have to include the hadoop-aws jar in your Flink job. Referring to this link will help: https://ci.apache.org/projects/flink/flink-docs-release-1.6/ops/deployment/aws.html#provide-s3-filesystem-dependency
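A hedged sketch of the usual fix for a standalone Flink 1.5.x distribution (the path is a placeholder; on EMR the Flink install typically lives under /usr/lib/flink):
# make the bundled S3/Hadoop filesystem available to the cluster, then restart it
cp /path/to/flink-1.5.2/opt/flink-s3-fs-hadoop-1.5.2.jar /path/to/flink-1.5.2/lib/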

Flink on EMR cannot access S3 bucket from "flink run" command

I'm prototyping the use of AWS EMR for a Flink-based system that we're planning to deploy. My cluster has the following versions:
Release label: emr-5.10.0
Hadoop distribution: Amazon 2.7.3
Applications: Flink 1.3.2
Both the documentation provided by Amazon (Amazon flink documentation) and the documentation from Flink (Apache flink documentation) mention directly using S3 resources as an integrated file system with the s3://<bucket>/<file> pattern. I have verified that all the correct permissions are set, and I can use the AWS CLI to copy S3 resources to the Master node with no problem, but attempting to start a Flink job using a Jar from S3 does not work.
I am executing the following step:
JAR location : command-runner.jar
Main class : None
Arguments : flink run -m yarn-cluster -yid application_1513333002475_0001 s3://mybucket/myapp.jar
Action on failure: Continue
The step always fails with
JAR file does not exist: s3://mybucket/myapp.jar
I have spoken to AWS support, and they suggested having a previous step copy the S3 file to the local Master node and then referencing it with a local path. While this would obviously work, I would rather get the native S3 integration working.
I have also tried using the s3a filesystem and get the same result.
You need to download your jar from S3 so that it is available on the classpath:
aws s3 cp s3://mybucket/myapp.jar myapp.jar
and then run flink run -m yarn-cluster myapp.jar
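Putting the workaround together, the two EMR steps would look roughly like this (the local path on the master node is an assumption; the bucket, jar, and application id are from the question):
aws s3 cp s3://mybucket/myapp.jar /home/hadoop/myapp.jar
flink run -m yarn-cluster -yid application_1513333002475_0001 /home/hadoop/myapp.jar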