Create EMR Hive cluster with glue catalog using CLI - hive

I would like to create an EMR Hive cluster that uses AWS Glue as its data catalog, using the AWS CLI.
I couldn't find anything related to this in the AWS docs or anywhere else.
Is this possible?

First, we create a configuration classification file named emr.json that specifies the AWS Glue Data Catalog as the metastore for Hive:
[
  {
    "Classification": "hive-site",
    "Properties": {
      "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory",
      "hive.metastore.schema.verification": "false"
    }
  }
]
Note: on EMR release versions 5.28.0, 5.28.1, and 5.29.0, if you're creating a cluster that uses the AWS Glue Data Catalog as the metastore, set hive.metastore.schema.verification to false.
Finally, we pass the configuration classification file to the create-cluster command as follows:
aws emr create-cluster --name "syumaK-cluster" \
  --configurations file://emr.json \
  --release-label emr-5.28.0 \
  --use-default-roles \
  --applications Name=Hadoop Name=Spark Name=Hive Name=HUE \
  --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m1.medium \
                    InstanceGroupType=CORE,InstanceCount=2,InstanceType=m1.medium
Response:
{
    "ClusterId": "j-2NZ6xxxxxx",
    "ClusterArn": "arn:aws:elasticmapreduce:us-east-1:1925xxxxx:cluster/j-2NZ6xxxxxx"
}
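As an optional sanity check, you can confirm that the hive-site classification was applied to the new cluster (a sketch, using the cluster ID returned above):
# confirm the hive-site classification made it onto the cluster
aws emr describe-cluster --cluster-id j-2NZ6xxxxxx --query 'Cluster.Configurations'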
Hope this helps!

Related

Is it possible to trigger lambda by changing the file of local s3 manually in serverless framework?

I used serverless-s3-local to trigger AWS Lambda locally with the Serverless Framework.
It worked when I created or updated a file from a function in the local S3 folder, but when I added a file or changed the content of a file in the local S3 folder manually, it didn't trigger the Lambda.
Is there any good way to solve it?
Thanks for using serverless-s3-local. I'm the author of serverless-s3-local.
How did you add the file or change its content? Did you use the AWS CLI command like the following?
$ AWS_ACCESS_KEY_ID=S3RVER AWS_SECRET_ACCESS_KEY=S3RVER aws --endpoint http://localhost:8000 s3 cp ./face.jpg s3://local-bucket/incoming/face.jpg
{
"ETag": "\"6fa1ab0763e315d8b1a0e82aea14a9d0\""
}
If you don't use the aws command and instead modify the files in the local directory directly, those modifications aren't detected by S3rver, which is the local S3 emulator. The resize_image example may be useful for you.

Is there a way to setup bootstrap actions to run on EMR after core services are installed (Spark etc)?

Is there a way to set up bootstrap actions to run on EMR after core services (Spark etc.) are installed? I am using emr-5.27.0.
You can submit a script as a step instead of a bootstrap action. For example, I made an SSL certificate update script and applied it to the EMR cluster as a step. The snippet below is part of my Lambda function, written in Python, but you can also add the step manually in the console or from other languages.
Steps=[{
    'Name': 'PrestoCertificate',
    'ActionOnFailure': 'CONTINUE',
    'HadoopJarStep': {
        'Jar': 's3://ap-northeast-2.elasticmapreduce/libs/script-runner/script-runner.jar',
        'Args': ['s3://myS3/PrestoSteps_InstallCertificate.sh']
    }
}]
The key point is script-runner.jar, which is pre-built by Amazon; you can use it in each region by changing the region prefix. It receives a .sh file and runs it.
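For reference, here is a rough sketch of adding the same step from the AWS CLI instead of boto3; the cluster ID is a placeholder and the S3 paths are the ones from the snippet above:
# sketch: same step submitted via the CLI (cluster ID is a placeholder)
aws emr add-steps --cluster-id j-XXXXXXXXXXXX --steps \
  'Type=CUSTOM_JAR,Name=PrestoCertificate,ActionOnFailure=CONTINUE,Jar=s3://ap-northeast-2.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[s3://myS3/PrestoSteps_InstallCertificate.sh]'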
One thing you should know: the script will run on all the nodes, so if you want it to run only on the master instance you have to guard it with an if statement.
#!/bin/bash
# run the payload only on the master node
BOOL=$(cat /emr/instance-controller/lib/info/instance.json | jq .isMaster)
if [ "$BOOL" == "true" ]
then
    <your code>
fi

Flink on EMR cannot access S3 bucket from "flink run" command

I'm prototyping the use of AWS EMR for a Flink-based system that we're planning to deploy. My cluster has the following versions:
Release label: emr-5.10.0
Hadoop distribution: Amazon 2.7.3
Applications: Flink 1.3.2
Both the documentation provided by Amazon (Amazon Flink documentation) and the documentation from Flink (Apache Flink documentation) mention directly using S3 resources as an integrated file system with the s3://<bucket>/<file> pattern. I have verified that all the correct permissions are set, and I can use the AWS CLI to copy S3 resources to the master node with no problem, but attempting to start a Flink job using a JAR from S3 does not work.
I am executing the following step:
JAR location : command-runner.jar
Main class : None
Arguments : flink run -m yarn-cluster -yid application_1513333002475_0001 s3://mybucket/myapp.jar
Action on failure: Continue
The step always fails with
JAR file does not exist: s3://mybucket/myapp.jar
I have spoken to AWS support, and they suggested having a previous step copy the S3 file to the local Master node and then referencing it with a local path. While this would obviously work, I would rather get the native S3 integration working.
I have also tried using the s3a filesystem and get the same result.
You need to download your JAR from S3 so that it is available locally:
aws s3 cp s3://mybucket/myapp.jar myapp.jar
and then run flink run -m yarn-cluster myapp.jar
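If you would rather keep everything as EMR steps instead of an SSH session, a rough sketch of the same workaround using command-runner.jar might look like this (the cluster ID is a placeholder; the bucket and application ID are the ones from the question):
# sketch: copy the JAR to the master node, then submit it with a local path
aws emr add-steps --cluster-id j-XXXXXXXXXXXX --steps \
  'Type=CUSTOM_JAR,Name=CopyFlinkJar,ActionOnFailure=CONTINUE,Jar=command-runner.jar,Args=[aws,s3,cp,s3://mybucket/myapp.jar,/home/hadoop/myapp.jar]' \
  'Type=CUSTOM_JAR,Name=RunFlinkJob,ActionOnFailure=CONTINUE,Jar=command-runner.jar,Args=[flink,run,-m,yarn-cluster,-yid,application_1513333002475_0001,/home/hadoop/myapp.jar]'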

How do you make use of cloudformation outputs within serverless framework?

If you deploy a CloudFormation template that creates a Kinesis stream, how can you provide its outputs, such as the ARN, to a Lambda created in the same deployment? Does CF run before Serverless creates the Lambdas, and is there a way to make the CloudFormation values available in the Lambda?
To store the ARN from your CloudFormation template "s-resource-cf.json", add some items to the "Outputs" section.
"Outputs": {
"InsertVariableNameForLaterUse": {
"Description": "This is the Arn of My new Kinesis Stream",
"Value": {
"Fn::GetAtt": [
"InsertNameOfCfSectionToFindArnOf",
"Arn"
]
}
}
}
Fn::GetAtt is a CF function that gets an attribute from another resource being created.
When you deploy the CF template using serverless resources deploy -s dev -r eu-west-1, the Kinesis stream is created for that stage/region and the ARN is saved into the region properties file /_meta/resources/variables/s-variables-dev-euwest1.json. Note the change in initial capitalisation: insertVariableNameForLaterUse.
You can then use that value in the function's s-function.json as ${insertVariableNameForLaterUse}, for example in the environment section:
"environment": {
"InsertVariableNameWeWantToUseInLambda": "${insertVariableNameForLaterUse}"
...
}
and reference this variable in your Lambda using something like:
var myKinesisStreamArn = process.env.InsertVariableNameWeWantToUseInLambda;
CloudFormation deployment happens before Lambda deployments, though you should probably control that with a script rather than just using the dashboard:
serverless resources deploy -s dev -r eu-west-1
serverless function deploy --a -s dev -r eu-west-1
serverless endpoint deploy --a -s dev -r eu-west-1
Hope that helps.
What deployment steps are you following here with Serverless? For the first part of your question, I believe you can do a 'sls resources deploy' to deploy all CF-related resources, and then a 'sls function deploy' or 'sls dash deploy' to deploy the Lambda functions. So technically, resources deploy (CF) does not actually deploy Lambda functions.
For the second part of your question: if you have a use case where you want to use the output of a CF resource being created, this feature has (as of now) been added/merged into v0.5 of Serverless, which has not yet been released.

How to install sqoop in Amazon EMR?

I've created a cluster in Amazon EMR using emr-4.0.0, with Hadoop distribution Amazon 2.6.0 and Hive 1.0.0. I need to install Sqoop so that I can communicate between Hive and Redshift. What are the steps to install Sqoop in an EMR cluster? Thank you!
Note that in EMR 4.0.0 hadoop fs -copyToLocal will throw errors.
Use aws s3 cp instead.
To be more specific than Amal:
Download the latest version of Sqoop and upload it to an S3 location. I am using sqoop-1.4.4.bin__hadoop-2.0.4-alpha and it seems to work just fine with EMR 4.0.0.
Download the Redshift JDBC connector JAR and upload it to the same S3 location. This page might help.
Upload a script similar to the one below to S3
#!/bin/bash
# Install Sqoop and the Redshift JDBC connector. Store in S3 and load
# as a bootstrap step.
bucket_location='s3://your-sqoop-jars-location/'
sqoop_jar='sqoop-1.4.4.bin__hadoop-2.0.4-alpha'
sqoop_jar_gz=$sqoop_jar.tar.gz
redshift_jar='RedshiftJDBC41-1.1.7.1007.jar'
cd /home/hadoop
aws s3 cp $bucket_location$sqoop_jar_gz .
tar -xzf $sqoop_jar_gz
aws s3 cp $bucket_location$redshift_jar .
cp $redshift_jar $sqoop_jar/lib/
Set SQOOP_HOME and add SQOOP_HOME to the PATH so you can call sqoop from anywhere. These entries should be made in /etc/bashrc; otherwise you will have to use the full path, in this case /home/hadoop/sqoop-1.4.4.bin__hadoop-2.0.4-alpha/bin/sqoop.
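A minimal sketch of those entries, appended at the end of the bootstrap script above (the directory name matches the tarball used there):
# append SQOOP_HOME and PATH entries system-wide (sketch)
echo 'export SQOOP_HOME=/home/hadoop/sqoop-1.4.4.bin__hadoop-2.0.4-alpha' | sudo tee -a /etc/bashrc
echo 'export PATH=$PATH:$SQOOP_HOME/bin' | sudo tee -a /etc/bashrc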
I am using Java to programmatically launch my EMR cluster. To configure bootstrap steps in Java, I create a BootstrapActionConfigFactory:
public final class BootstrapActionConfigFactory {

    private static final String bucket = Config.getBootstrapBucket();

    // make class non-instantiable
    private BootstrapActionConfigFactory() {
    }

    /**
     * Adds an install Sqoop step to the job that corresponds to the version set in the Config class.
     */
    public static BootstrapActionConfig newInstallSqoopBootstrapActionConfig() {
        return newInstallSqoopBootstrapActionConfig(Config.getHadoopVersion().charAt(0));
    }

    /**
     * Adds an install Sqoop step to the job that corresponds to the version specified in the parameter.
     *
     * @param hadoopVersion the main version number for Hadoop. E.g.: 1, 2
     */
    public static BootstrapActionConfig newInstallSqoopBootstrapActionConfig(char hadoopVersion) {
        return new BootstrapActionConfig().withName("Install Sqoop")
                .withScriptBootstrapAction(
                        new ScriptBootstrapActionConfig().withPath("s3://" + bucket + "/sqoop-tools/hadoop" + hadoopVersion + "/bootstrap-sqoop-emr4.sh"));
    }
}
Then when creating the job:
Job job = new Job(Region.getRegion(Regions.US_EAST_1));
job.addBootstrapAction(BootstrapActionConfigFactory.newInstallSqoopBootstrapActionConfig());
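For completeness, a rough CLI equivalent of the same bootstrap action (the bucket name, cluster name and instance settings here are placeholders):
# sketch: attach the bootstrap script from the CLI instead of the Java SDK
aws emr create-cluster --name "sqoop-cluster" --release-label emr-4.0.0 \
  --applications Name=Hadoop Name=Hive --use-default-roles \
  --instance-type m3.xlarge --instance-count 3 \
  --bootstrap-actions Name="Install Sqoop",Path=s3://your-bucket/sqoop-tools/hadoop2/bootstrap-sqoop-emr4.sh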
Download the Sqoop tarball and keep it in an S3 bucket. Create a bootstrap script that performs the following activities:
Download the Sqoop tarball to the required instances
Extract the tarball
Set SQOOP_HOME and add SQOOP_HOME to the PATH (these entries should be made in /etc/bashrc)
Add the required connector JARs to the lib directory of Sqoop
Keep this script in S3 and point to it in the bootstrap actions.
Note that from emr-4.4.0 onwards, AWS added support for Sqoop 1.4.6 to EMR clusters. Installation is done with a couple of clicks during setup, so there is no need for manual installation.
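For example, a minimal sketch of creating such a cluster with Sqoop selected as an application (cluster name, roles and instance settings are placeholders; on early 4.x releases the application may be listed as Sqoop-Sandbox rather than Sqoop):
# sketch: request Sqoop as an application at cluster creation
aws emr create-cluster --name "sqoop-cluster" --release-label emr-5.28.0 \
  --applications Name=Hadoop Name=Hive Name=Sqoop \
  --use-default-roles --instance-type m4.large --instance-count 3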
References:
https://aws.amazon.com/blogs/aws/amazon-emr-4-4-0-sqoop-hcatalog-java-8-and-more/
http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-sqoop.html