How to create a bootstrap action for Impala on EMR

The latest version of Impala for which I can find an EMR bootstrap action is this one, which is from 2015 and installs Impala 2.2.0.
Is there an easy way to update this to 2.7 or 2.8? Spinning up an Ubuntu 14.04 box to do a build is one option, but I'm unclear on how to ultimately install the result on an EMR cluster.

At the moment, there is no documented script to easily update Impala to work on EMR. It's a sysadmin task, and yes, you might look at the install script and tweak it to include your own build. Also, make sure you have all the dependencies.
Once you install it, you will need to ensure the relevant configuration files are placed somewhere in the CLASSPATH established by set-classpath.sh, so that Impala can use EMR's HDFS, HBase, Hive metastore, or S3.
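For illustration, a minimal sketch of that step inside a bootstrap action (the source paths and the Impala conf directory here are assumptions about a typical layout, not documented EMR locations):

#!/bin/bash
# copy cluster configs where set-classpath.sh will pick them up
IMPALA_CONF_DIR=/home/hadoop/impala/conf   # assumption: your build's conf dir
mkdir -p "$IMPALA_CONF_DIR"
cp /etc/hadoop/conf/core-site.xml "$IMPALA_CONF_DIR/"
cp /etc/hadoop/conf/hdfs-site.xml "$IMPALA_CONF_DIR/"
cp /etc/hive/conf/hive-site.xml "$IMPALA_CONF_DIR/"
cp /etc/hbase/conf/hbase-site.xml "$IMPALA_CONF_DIR/"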

Related

Installing spark-sql-kinesis_2.11-2.2.0.jar in my Spark environment

I am using an EMR cluster to develop/run Spark jobs. Now I need to access Kinesis, and for that I need to install spark-sql-kinesis_2.11-2.2.0.jar. I am not very clear on how to do that yet. Any pointers or experience would be very helpful.
I ended up building it myself from the source code (GitHub) and then specifying it in my spark-submit command. That seems to be the recommended approach.
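For reference, passing a locally built jar along with the application looks roughly like this (the jar path, class name, and application jar are placeholders):

spark-submit \
    --jars /home/hadoop/jars/spark-sql-kinesis_2.11-2.2.0.jar \
    --class com.example.MyKinesisApp \
    my-app.jar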

/home/hadoop/bin/hadoop missing in ami 4.x

I am trying to migrate a legacy MapReduce pipeline that is using AMI 3.x to AMI 4.x. It currently has bash scripts as part of the bootstrapping, and one of them calls hadoop fs -get s3n://somefile ~/otherfile. This fails in my current migration attempt to AMI 4.x, and adding ls /home/hadoop/bin to the script shows that the directory /home/hadoop/bin does not exist, so of course the binary /home/hadoop/bin/hadoop would not exist either. Is there something I need to configure to ensure the hadoop binary exists? I can't seem to find anything obvious in the documentation.
The file system layout changed considerably between 3.x and 4.x. The differences between 3.x and 4.x, along with instructions for migrating, can be found here: http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-4.1.0/emr-release-differences.html
The short answer for solving your issue, though, is that you should use "aws s3 cp" instead of "hadoop fs -get" in bootstrap actions, since Hadoop is not installed until after bootstrap actions run on 4.x+.
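In other words, the bootstrap script line would change to something like this (note that the AWS CLI uses the s3:// scheme rather than s3n://):

# old, fails on 4.x because Hadoop is not installed yet:
# hadoop fs -get s3n://somefile ~/otherfile
# replacement:
aws s3 cp s3://somefile ~/otherfile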

How to integrate hadoop with zookeeper and hbase

I have set up a single-node cluster of Hadoop 2.6, but I need to integrate ZooKeeper and HBase with it.
I am a beginner with no prior experience in big data tools.
How do you set up ZooKeeper to coordinate a Hadoop cluster, and how do you use HBase over HDFS?
How do they combine to make an ecosystem?
For standalone mode, just follow the steps provided in this HBase guide: http://hbase.apache.org/book.html#quickstart
HBase has a standalone mode that makes it easy for beginners to get going. In standalone mode, HBase and a local ZooKeeper run in a single JVM process, using the local filesystem rather than HDFS.
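From the quickstart, getting the standalone instance up is roughly this (assuming HBase is unpacked and JAVA_HOME is set):

# start all HBase daemons (and the built-in ZooKeeper) in one JVM
bin/start-hbase.sh
# open an interactive shell to verify the install
bin/hbase shell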
It depends on the kind of system that you want to build. As you said, the Hadoop ecosystem is made up of three major components: HBase, HDFS, and ZooKeeper. Although they can be installed independently of each other, sometimes there is no need to install them all, depending on the kind of cluster that you want to set up.
Since you are using a single-node cluster, there are two HBase run modes that you can choose from: standalone mode and pseudo-distributed mode. In standalone mode there is no need to install HDFS or ZooKeeper; HBase will do everything in a transparent way. If you want to use pseudo-distributed mode, you can run HBase against the local filesystem or against HDFS. If you want to use HDFS, you'll have to install Hadoop. As for ZooKeeper, again, HBase can manage it by itself (you just need to tell it to through the configuration files).
In case you want to use HDFS in pseudo-distributed mode, by downloading Hadoop you will get both HDFS and MapReduce. If you don't want to execute MapReduce jobs, just ignore its tools.
If you want to learn more, I think that this guide explains it all very well: https://hbase.apache.org/book.html (check the HBase run modes).
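As an illustration of the pseudo-distributed-over-HDFS setup described above, the key settings go inside the <configuration> element of conf/hbase-site.xml. A minimal sketch, assuming your NameNode runs at hdfs://localhost:8020 (match this to your fs.defaultFS):

<!-- run HBase daemons as separate processes instead of one JVM -->
<property>
  <name>hbase.cluster.distributed</name>
  <value>true</value>
</property>
<!-- store HBase data in HDFS rather than the local filesystem -->
<property>
  <name>hbase.rootdir</name>
  <value>hdfs://localhost:8020/hbase</value>
</property>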

How do I start an Amazon EC2 VM from a saved AMI using Jenkins?

I'm trying to create a Jenkins job to spin up a VM on Amazon EC2 based on an AMI that I currently have saved. I've done my searching and can't find an easy way to do this other than through Amazon's GUI. That isn't ideal, as there are a lot of manual steps involved and it's time-consuming.
If anyone's had any luck doing this or could point me in the right direction, that would be great.
Cheers,
Darwin
Unless I'm misunderstanding the question, this should be possible using the CLI. Assuming you can install and configure the CLI on your Jenkins server, you can just run the command as a shell script as part of the build.
See: Create an instance with the CLI.
The command would be something along the lines of:
[path to cli]/aws ec2 run-instances --image-id ami-xyz
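A slightly fuller sketch, with placeholder values (ami-xyz, my-key, my-sg) that you would replace with your own:

aws ec2 run-instances \
    --image-id ami-xyz \
    --count 1 \
    --instance-type t2.micro \
    --key-name my-key \
    --security-groups my-sg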
If your setup is too complicated for a single CLI command, I would recommend creating a simple CloudFormation template.
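A minimal sketch of that route could be a one-resource template launched from the CLI (the AMI ID, instance type, and stack name are placeholders):

# write a minimal template with a single EC2 instance
cat > instance.json <<'EOF'
{
  "Resources": {
    "MyInstance": {
      "Type": "AWS::EC2::Instance",
      "Properties": { "ImageId": "ami-xyz", "InstanceType": "t2.micro" }
    }
  }
}
EOF
aws cloudformation create-stack --stack-name my-vm --template-body file://instance.json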
If you are unable to install the CLI, you could use any number of SDKs (e.g. the Java SDK) to make a simple application you could run with Jenkins.
There is also the Jenkins EC2 Plugin.
Looking at the documentation, it looks like you may be able to reuse your AMI. If not, you can configure it with an init script:
Next, configure AMIs that you want to launch. For this, you need to find the AMI IDs for the OS of your choice. ElasticFox is a good tool for doing that, but there are a number of other ways to do it. Jenkins can work with any Unix AMIs. If using an Ubuntu EC2 or UEC AMI you need to fill out the rootCommandPrefix and remoteAdmin fields under 'advanced'. Windows is currently unsupported.
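If you go the init-script route mentioned above, it is just a shell script the plugin runs when the agent boots; a minimal sketch (the package name is an assumption about an Ubuntu AMI) might be:

#!/bin/bash
# install the Java runtime the Jenkins agent needs
apt-get update
apt-get install -y openjdk-8-jre-headless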

How to find hadoop-examples.jar in version 2.4?

I just set up the Hadoop environment on my Mac and wanted to test whether it was installed properly.
And
hdfs namenode -format
works fine. However, nearly all the online tutorials use "hadoop-examples.jar", which was in libexec.
My Hadoop is the newest release, 2.4, and there's no such jar in libexec or any other folder. Did they remove it, or is it not used for testing the environment anymore?
In Hadoop 2.4.0 there are different example jars available for each application.
For MapReduce-related jar files:
cd hadoop/mapreduce
For HDFS:
cd hadoop/hdfs
"hadoop-examples.jar" is not available in the latest version; instead, there are separate example jar files for the different applications.
You can use them from your Hadoop installation.
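For example, running the bundled MapReduce examples from a default 2.4.0 layout looks roughly like this (the exact path under your install directory may differ):

# estimate pi with the bundled example job
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.0.jar pi 10 100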