How Can I Automate Running Pig Batch Jobs on Elastic MapReduce without Amazon GUI? - apache-pig

I have some Pig batch jobs in .pig files that I'd love to run automatically on EMR once every hour or so. I found a tutorial for doing that here, but it requires using Amazon's GUI for every job I set up, which I'd really rather avoid. Is there a good way to do this using Whirr? Or the Ruby elastic-mapreduce client? I have all my files in S3, along with a couple of Pig jars with functions I need to use.

Though I don't know how to run Pig scripts with the tools you mention, I know of two possible approaches:
To run scripts locally: you can use cron
To run scripts on the cluster: you can use Oozie
That being said, most tools with a GUI can be controlled from the command line as well (though setup may be easier if you have the GUI available).
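For the EMR part specifically, here is a rough sketch of what an hourly cron entry driving the Ruby elastic-mapreduce client could look like. The bucket names, paths, and instance settings are placeholders, and the exact flags vary between client versions, so check elastic-mapreduce --help before relying on this:

# /etc/cron.d/pig-hourly -- run the wrapper script at the top of every hour (paths are illustrative)
0 * * * *  emruser  /home/emruser/bin/run_pig_job.sh >> /var/log/pig-hourly.log 2>&1

# run_pig_job.sh -- submit the .pig script stored in S3 as a new EMR job flow
#!/bin/bash
set -e
/opt/emr-cli/elastic-mapreduce --create \
  --name "hourly-pig-$(date +%Y%m%d%H)" \
  --num-instances 3 --instance-type m1.small \
  --pig-script s3://my-bucket/scripts/report.pig \
  --args "-p,INPUT=s3://my-bucket/input,-p,OUTPUT=s3://my-bucket/output/$(date +%Y%m%d%H)"
# Depending on the client version, the script location may instead need to be passed via --args;
# extra Pig jars stored in S3 can be registered inside the script itself with REGISTER.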

Related

How to schedule a Pentaho job on the Skybot scheduler?

I could not see any reference to Pentaho in the Skybot documentation. Is there a way to schedule Pentaho transformations and jobs with Skybot? I have tried creating agents and referring to the file path, but nothing is working! Any pointers?
To execute or schedule Pentaho jobs, you need Pentaho installed on the system. If you are on Linux, first install Pentaho Data Integration (DI). Once that is done, use the Skybot scheduler: point it at Pentaho DI's Kitchen.sh or Pan.sh and at the files you need to schedule or execute. This link may help:
How to schedule Pentaho Kettle transformations?
Once that is set up you can execute a transformation. Skybot only needs the OS and Pentaho to execute or schedule a job; the same goes for the Windows scheduler or any other scheduling tool.
Hope it helps :)
This should be pretty simple. Just use the CLI tools to start your job: Kitchen runs jobs and Pan runs transformations.
Here's the documentation for Kitchen. It's very straightforward.
http://wiki.pentaho.com/display/EAI/Kitchen+User+Documentation
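As a concrete illustration (all paths and file names below are made up), a Skybot agent or any other scheduler ultimately just invokes the Kettle command-line tools like this:

# Run a job (.kjb) with Kitchen; -level sets the logging verbosity and is optional
/opt/pentaho/data-integration/kitchen.sh -file=/home/etl/jobs/nightly_load.kjb -level=Basic

# Run a single transformation (.ktr) with Pan
/opt/pentaho/data-integration/pan.sh -file=/home/etl/transforms/extract_orders.ktr -level=Basic

Both tools return a non-zero exit code on failure, which is what a scheduler typically keys off.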

Using different virtualenv for jobs in apscheduler

I have an apscheduler implementation that can run different kinds of tasks. These tasks may have different dependencies, which need to be installed when they are executed. The best way would be to create a virtualenv, install the respective dependencies (taken from a resource file) for each task, and then perhaps release the environment when the task is done.
I have been trying to implement this but haven't had much success. The idea is probably to have a custom executor that can start a subprocess connected to a separate Python interpreter in the respective virtualenv, run the task there, and get some results back. Note: I only have process pools for running tasks.
Does anybody have any idea how to proceed with this or any code snippets?
Nobody has asked for this yet, so I'd say you need to implement the custom executor you mentioned.
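The mechanism such an executor would wrap is straightforward, though: a task launched through the target virtualenv's own interpreter automatically sees that environment's packages. A minimal sketch of the command the executor would spawn via subprocess (the directory layout and task script are hypothetical):

# Hypothetical layout: one virtualenv per task under /opt/venvs/<task_name>/
# The custom executor would run something equivalent to this in a subprocess
# and collect the result from stdout or the exit code:
/opt/venvs/report_task/bin/python /opt/tasks/report_task.py --input /data/in.csv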

How to integrate hadoop with zookeeper and hbase

I have set up a single-node cluster of Hadoop 2.6, but I need to integrate ZooKeeper and HBase with it.
I am a beginner with no prior experience in big data tools.
How do you set up ZooKeeper to coordinate a Hadoop cluster, and how do we use HBase over HDFS?
How do they combine to make an ecosystem?
For standalone mode, just follow the steps provided in the HBase guide: http://hbase.apache.org/book.html#quickstart
HBase has a standalone mode that makes it easy for starters to get going. In standalone mode, HBase and an embedded ZooKeeper run in a single JVM process, using the local filesystem instead of HDFS.
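In practice the standalone quickstart boils down to unpacking a release and starting HBase; the version below is only an example:

tar xzf hbase-1.0.1-bin.tar.gz   # any recent release; the version shown is illustrative
cd hbase-1.0.1
bin/start-hbase.sh               # starts HBase with its embedded ZooKeeper in one JVM
bin/hbase shell                  # verify with e.g.: create 't1', 'cf'  then  list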
It depends on the kind of system that you want to build. As you said, the part of the Hadoop ecosystem you describe is made up of three major components: HBase, HDFS, and ZooKeeper. Although they can be installed independently of each other, you may not need to install them all, depending on the kind of cluster you want to set up.
Since you are using a single-node cluster, there are two HBase run modes to choose from: standalone mode and pseudo-distributed mode. In standalone mode there is no need to install HDFS or ZooKeeper; HBase handles everything transparently. If you want to use pseudo-distributed mode, you can run HBase against the local filesystem or against HDFS. If you want to use HDFS, you'll have to install Hadoop. As for ZooKeeper, HBase will again manage it by itself (you just need to tell it so through the configuration files).
If you want to use HDFS in pseudo-distributed mode, downloading Hadoop gives you both HDFS and MapReduce. If you don't want to execute MapReduce jobs, just ignore its tools.
If you want to learn more, I think that this guide explains it all very well: https://hbase.apache.org/book.html (check the HBase run modes).
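To make the mode choice concrete: switching to pseudo-distributed mode over HDFS mainly comes down to two properties in conf/hbase-site.xml. The HDFS URI below assumes a local NameNode on port 9000, which depends on your Hadoop core-site.xml:

# Point HBase at HDFS and enable (pseudo-)distributed mode
cat > conf/hbase-site.xml <<'EOF'
<configuration>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:9000/hbase</value>
  </property>
</configuration>
EOF
bin/start-hbase.sh   # HBase still manages its own ZooKeeper unless configured otherwise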

How do I start an Amazon EC2 VM from a saved AMI using Jenkins?

I'm trying to create a Jenkins job to spin up a VM on Amazon EC2 based on an AMI that I currently have saved. I've done my searching and can't find an easy way to do this other than through Amazon's GUI, which isn't ideal as there are a lot of manual steps involved and it's time-consuming.
If anyone's had any luck doing this or could point me in the right direction that would be great.
Cheers,
Darwin
Unless I'm misunderstanding the question, this should be possible using the CLI. Assuming you can install and configure the AWS CLI on your Jenkins server, you can just run the command as a shell build step.
Create an instance with CLI.
The command would be something along the lines of:
[path to cli]/aws ec2 run-instances --image-id ami-xyz
If your setup is too complicated for a single CLI command, I would recommend creating a simple CloudFormation template.
If you are unable to install the CLI, you could use one of the SDKs (e.g. Java) to build a simple application you could run from Jenkins.
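Fleshing that out a little, the "Execute shell" build step in the Jenkins job could look roughly like the following. The AMI ID, instance type, and key pair are placeholders, and the build machine needs AWS credentials configured for the aws CLI:

#!/bin/bash
set -e
# Launch one instance from the saved AMI (all identifiers below are placeholders)
INSTANCE_ID=$(aws ec2 run-instances \
  --image-id ami-xyz \
  --instance-type t2.micro \
  --key-name my-keypair \
  --count 1 \
  --query 'Instances[0].InstanceId' --output text)

# Wait until it is actually running, then print its public IP for later build steps
aws ec2 wait instance-running --instance-ids "$INSTANCE_ID"
aws ec2 describe-instances --instance-ids "$INSTANCE_ID" \
  --query 'Reservations[0].Instances[0].PublicIpAddress' --output text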
There is also the Jenkins EC2 Plugin.
Looking at its documentation, it looks like you may be able to reuse your AMI. If not, you can configure one with an init script:
Next, configure AMIs that you want to launch. For this, you need to
find the AMI IDs for the OS of your choice. ElasticFox is a good tool
for doing that, but there are a number of other ways to do it. Jenkins
can work with any Unix AMIs. If using an Ubuntu EC2 or UEC AMI you
need to fill out the rootCommandPrefix and remoteAdmin fields under
'advanced'. Windows is currently unsupported.

Automatic Jenkins deployment

I want to be able to automate Jenkins server installation using a script.
Given a Jenkins release version and a list of (plugin, version) pairs, I want a script that deploys a new Jenkins server and starts it using Jetty or Tomcat.
It sounds like a common thing to do (e.g. when you need to replicate a Jenkins master environment or create a clean one). What's the best practice in this case?
Searching Google only gives me examples of how to deploy products with Jenkins but I want to actually deploy Jenkins.
Thanks!
This may require some additional setup at the beginning, but it could save you time in the long run. You could use a product called Puppet (puppetlabs.com) to trigger the script automatically whenever you want; I'm basically using it to trigger build-outs of my development environments. As I find new things that need to be modified, I simply update my Puppet modules and don't need to worry about what needs to be done to recreate the environments for the next go-round.
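Whether you drive it from Puppet or from a plain shell script, the core of an automated install is just fetching a pinned Jenkins WAR plus pinned plugin versions into JENKINS_HOME before starting the server. A rough, unverified sketch (the version numbers, the plugins.txt format, and the mirror URLs are assumptions you would adapt to your setup):

#!/bin/bash
set -e
JENKINS_VERSION=1.609.1              # example LTS release
JENKINS_HOME=/var/lib/jenkins
mkdir -p "$JENKINS_HOME/plugins"

# Fetch the pinned Jenkins WAR
wget -O /opt/jenkins.war "http://mirrors.jenkins-ci.org/war-stable/${JENKINS_VERSION}/jenkins.war"

# plugins.txt holds "name version" pairs, one per line, e.g. "git 2.4.0"
while read -r name version; do
  wget -O "$JENKINS_HOME/plugins/${name}.hpi" \
    "https://updates.jenkins-ci.org/download/plugins/${name}/${version}/${name}.hpi"
done < plugins.txt

# Start with the embedded servlet container (or drop the WAR into Tomcat's webapps/ instead)
JENKINS_HOME="$JENKINS_HOME" java -jar /opt/jenkins.war --httpPort=8080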