create a directory in HDFS using ssh action in Oozie - ssh

I have to create a directory in HDFS using ssh action in Oozie.
My sample workflow is
<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.1">
<action name="testjob">
<ssh>
<host>name#host</host>
<command>mkdir</command>
<args>hdfs://host/user/xyz/</args>
</ssh>
<ok to="end"/>
<error to="fail"/>
</action>
</workflow-app>
I am getting an error during execution.
Can anybody please guide me on what I am missing here?

You cannot create a directory in HDFS using the *nix mkdir command. The usage shown in your code will try to execute mkdir on the local file system of the remote host, whereas you want to create a directory in HDFS.
Quoting the Oozie documentation (http://oozie.apache.org/docs/3.3.0/DG_SshActionExtension.html), it states:
The shell command is executed in the home directory of the specified user in the remote host.
<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.1">
<action name="testjob">
<ssh>
<host>name#host</host>
<command>/usr/bin/hadoop</command>
<args>dfs</args>
<args>-mkdir</args>
<args>NAME_OF_THE_DIRECTORY_YOU_WANT_TO_CREATE</args>
</ssh>
<ok to="end"/>
<error to="fail"/>
</action>
</workflow-app>
The above assumes /usr/bin/hadoop is the path to the Hadoop binary on the remote host; adjust it to match your installation.
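If you are not sure where the Hadoop binary lives on the remote host, a quick check over SSH can confirm it (user and host are placeholders):
ssh user@host 'which hadoop'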

Related

Failing Oozie Launcher, Cannot Run Program

I am trying to run an Oozie workflow with a very basic script. The script itself is very simple: it takes a CSV file and loads it into a table in Impala. My workflow looks like this:
<workflow-app xmlns='uri:oozie:workflow:0.5' name='TEST'>
<global>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapreduce.job.queuename</name>
<value>${queueName}</value>
</property>
</configuration>
</global>
<start to='postLoad' />
<action name="postLoad">
<shell xmlns="uri:oozie:shell-action:0.3">
<exec>${nameNode}/${baseCodePath}/Util/Test.sh</exec>
<env-var>impalaConn=${impalaConn}</env-var>
<env-var>xferUser=${xferUser}</env-var>
<env-var>ingUser=${ingUser}</env-var>
<env-var>explTableName=${explTableName}</env-var>
<env-var>stagingPath=${stagingPath}</env-var>
<file>${nameNode}/${baseCodePath}/Util/Test.sh</file>
<file>${nameNode}/${commonCodePath}/Util/loadUsrEnv.sh</file>
</shell>
...
However, when I run it, I always get this error, and I'm not sure why it cannot run the program. The directories/files all point to the right places.
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.ShellMain], main() threw exception, Cannot run program "Test.sh" (in directory "/data13/yarn/nm/usercache/user/appcache/application"): error=2, No such file or directory
java.io.IOException: Cannot run program "Test.sh" (in directory "/data13/yarn/nm/usercache/user/appcache/application"): error=2, No such file or directory
Use
<exec>Test.sh</exec>
The <file> tag tells Oozie to download the file from its HDFS location into the YARN container's working directory, so <exec> should reference the script by name only.
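For instance, a minimal sketch of the corrected action, keeping the question's variables and changing only the <exec> value (env-var elements omitted for brevity):
<shell xmlns="uri:oozie:shell-action:0.3">
<!-- reference the script by its bare name; the <file> elements below ship it into the container -->
<exec>Test.sh</exec>
<file>${nameNode}/${baseCodePath}/Util/Test.sh</file>
<file>${nameNode}/${commonCodePath}/Util/loadUsrEnv.sh</file>
</shell>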

Oozie shell action failing

I am trying to test the Oozie shell action in my Cloudera VM (quickstart VM). When running a simple HDFS command script (hadoop fs -put ...), it works, but when I trigger a Hive script the Oozie job finishes with status "KILLED". On the Oozie console the only error message I am getting is
"Main class [org.apache.oozie.action.hadoop.ShellMain], exit code [1]"
Meanwhile the underlying job in the history server (NameNode logs) comes up as SUCCEEDED. Below are the Oozie job details:
workflow.xml
<workflow-app xmlns="uri:oozie:workflow:0.5" name="WorkFlow1">
<start to="shell-node" />
<action name="shell-node">
<shell xmlns="uri:oozie:shell-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<exec>${myscript}</exec>
<file>${myscriptpath}#${myscript}</file>
<capture-output/>
</shell>
<ok to="end"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>Workflow failed, error
message[${wf:errorMessage(wf:lastErrorNode())}] </message>
</kill>
<end name="end" />
</workflow-app>
------------------------------------
job.properties
nameNode=hdfs://quickstart.cloudera:8020
jobTracker=hdfs://quickstart.cloudera:8032
queueName=default
myscript=test.sh
myscriptpath=${nameNode}/oozie/sl/test.sh
oozie.use.system.libpath=true
oozie.wf.application.path=${nameNode}/oozie/sl/
workflowAppUri=${nameNode}/oozie/sl/
-----------------------------------------------
test.sh
hive -e "create table test2 as select * from test"
Would really appreciate it if anyone could point me in the right direction as to where I am getting it wrong.
It would be good to have a look at the Oozie Hive action.
It's pretty easy to configure, and the Hive action will take care of setting everything up.
https://oozie.apache.org/docs/4.3.0/DG_HiveActionExtension.html
To connect to Hive, you need to explicitly add the hive-site.xml or the Hive server details so the action can connect.
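A minimal sketch of such a Hive action, assuming hive-site.xml has been uploaded next to the workflow on HDFS and the script is called test.hql (both names are placeholders):
<action name="hive-node">
<hive xmlns="uri:oozie:hive-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<!-- points the action at your Hive metastore configuration -->
<job-xml>hive-site.xml</job-xml>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<script>test.hql</script>
</hive>
<ok to="end"/>
<error to="fail"/>
</action>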

Submit pig job from oozie

I am working on automating Pig jobs using Oozie in a Hadoop cluster.
I was able to run a sample Pig script from Oozie, but my next requirement is to run a Pig job where the Pig script receives its input parameters from a shell script.
Please share your thoughts.
UPDATE:
OK, to make the original question clear: how can you pass a parameter from a shell script's output? Here's a working example:
WORKFLOW.XML
<workflow-app xmlns='uri:oozie:workflow:0.3' name='shell-wf'>
<start to='shell1' />
<action name='shell1'>
<shell xmlns="uri:oozie:shell-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<exec>so.sh</exec>
<argument>A</argument>
<argument>B</argument>
<file>so.sh</file>
<capture-output/>
</shell>
<ok to="shell2" />
<error to="fail" />
</action>
<action name='shell2'>
<shell xmlns="uri:oozie:shell-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<exec>so2.sh</exec>
<argument>${wf:actionData('shell1')['out']}</argument>
<file>so2.sh</file>
</shell>
<ok to="end" />
<error to="fail" />
</action>
<kill name="fail">
<message>Script failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name='end' />
</workflow-app>
SO.SH
echo "out=test"
SO2.SH
echo "I'm so2.sh and I get the following param:"
echo $1
If you replace the second shell action with your Pig action and pass the param to the Pig script like this:
...
<param>MY_PARAM=${wf:actionData('shell1')['out']}</param>
...
Then your original question is solved.
Regarding your sharelib issue, make sure that in the properties you configure LIB_PATH=where/your/jars/are and hand this parameter over to the Pig action:
<param>LIB_PATH=${LIB_PATH}</param>
then just register the jars from there:
REGISTER '$LIB_PATH/my_jar'
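Putting the pieces together, a sketch of what the replaced second action could look like as a Pig action (the script name mypig.pig is a placeholder):
<action name='pig1'>
<pig>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<script>mypig.pig</script>
<!-- value captured from the first shell action's "out" property -->
<param>MY_PARAM=${wf:actionData('shell1')['out']}</param>
<param>LIB_PATH=${LIB_PATH}</param>
</pig>
<ok to="end"/>
<error to="fail"/>
</action>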
==========================================================================
What you are looking for is the
Map wf:actionData(String node)
This function is only applicable to action nodes that produce output
data on completion.
The output data is in a Java Properties format and via this EL function it is available as a Map.
Documentation
Here's a nice example: http://www.infoq.com/articles/oozieexample
(actually you have to capture the output as Samson wrote in the comments)
Some more details:
"If the capture-output element is present, it indicates Oozie to capture output of the STDOUT of the shell command execution. The Shell command output must be in Java Properties file format and it must not exceed 2KB. From within the workflow definition, the output of an Shell action node is accessible via the String action:output(String node, String key) function (Refer to section '4.2.6 Action EL Functions')."
Or you can use a not-so-nice but simple workaround and execute your shell script inside the Pig script itself, save its result in a variable, and use that, like this:
%DECLARE MY_VAR `echo "/abc/cba"`
A = LOAD '$MY_VAR' ...
But this is not nice at all; the first solution is the suggested one.

Saving hive output through oozie using ">"

Is something like this possible in oozie?
hive -f hiveScript.hql > output.txt
I have the following oozie hive action for the above code as follows:
<hive xmlns="uri:oozie:hive-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<script>hiveScript.hql</script>
</hive>
<ok to="end" />
<error to="kill" />
</action>
How can I tell the script where the output should go?
That is not possible with Oozie in the way that you want, because Oozie starts (most of) its workflow actions on nodes within the cluster.
You could use the Oozie shell action to run hive -f hiveScript.hql > output.txt, but that has other implications: Hive would need to be installed on every node, your hiveScript.hql would need to be available everywhere, and the output file would end up on whichever node happened to run the shell action. https://oozie.apache.org/docs/3.3.0/DG_ShellActionExtension.html
I think your best bet would be to include INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT * FROM ... in your hiveScript.hql file and to pull the results down from HDFS afterwards.
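For example, a minimal sketch of what hiveScript.hql could contain (my_table and the output directory are placeholders):
-- write the query result into an HDFS directory instead of stdout
INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out'
SELECT * FROM my_table;
Afterwards something like hdfs dfs -getmerge /tmp/hdfs_out output.txt would pull the result down to a local output.txt.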
Edit:
Another option I just thought of would be to use the SSH action: https://oozie.apache.org/docs/3.2.0-incubating/DG_SshActionExtension.html. You could have the SSH action shell into your target machine and run hive -f hiveScript.hql > output.txt there.
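A rough sketch of that SSH action, assuming a wrapper script run_hive.sh already sits in the remote user's home directory and contains the hive -f hiveScript.hql > output.txt line (user, host and the script name are placeholders):
<action name="hive-over-ssh">
<ssh>
<host>user@host</host>
<!-- the command runs in the home directory of the remote user -->
<command>./run_hive.sh</command>
</ssh>
<ok to="end"/>
<error to="fail"/>
</action>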

Falcon & Oozie - How to configure job.properties for oozie in falcon

I have an Oozie workflow which calls a Sqoop and a Hive action. This individual workflow works fine when I run Oozie from the command line.
Since the Sqoop and Hive scripts vary, I pass the values to the workflow.xml using a job.properties file.
sudo oozie job -oozie http://hostname:port/oozie -config job.properties -run
Now I want to configure this Oozie workflow in Falcon. Can you please help me figure out where I can configure or pass the job.properties?
Below is the Falcon process.xml:
<process name="demoProcess" xmlns="uri:falcon:process:0.1">
<tags>pipeline=degIngestDataPipeline,owner=hadoop, externalSystem=svServers</tags>
<clusters>
<cluster name="demoCluster">
<validity start="2015-01-30T00:00Z" end="2016-02-28T00:00Z"/>
</cluster>
</clusters>
<parallel>1</parallel>
<order>FIFO</order>
<frequency>hours(1)</frequency>
<outputs>
<output name="output" feed="demoFeed" instance="now(0,0)" />
</outputs>
<workflow name="dev-wf" version="0.2.07"
engine="oozie" path="/apps/demo/workflow.xml" />
<retry policy="periodic" delay="minutes(15)" attempts="3" />
</process>
I could not find much help on the web or in the Falcon documentation regarding this.
I have worked on some Falcon development but have not tried vanilla Falcon much; still, based on the tutorial below:
http://hortonworks.com/hadoop-tutorial/defining-processing-data-end-end-data-pipeline-apache-falcon/
I would try creating the Oozie workflow.xml so that it accepts the job.properties dynamically. Place the properties file in the respective HDFS folder from which the workflow.xml picks it up, and you can change it for every process. Then you can use your Falcon process.xml and submit it from the command line using:
falcon entity -type process -submit -file process.xml
Also, for path="/apps/demo/workflow.xml" you need not specify workflow.xml explicitly; you can just give the folder name, for example:
<process name="rawEmailIngestProcess" xmlns="uri:falcon:process:0.1">
<tags>pipeline=churnAnalysisDataPipeline,owner=ETLGroup,externalSystem=USWestEmailServers</tags>
<clusters>
<cluster name="primaryCluster">
<validity start="2014-02-28T00:00Z" end="2016-03-31T00:00Z"/>
</cluster>
</clusters>
<parallel>1</parallel>
<order>FIFO</order>
<frequency>hours(1)</frequency>
<outputs>
<output name="output" feed="rawEmailFeed" instance="now(0,0)" />
</outputs>
<workflow name="emailIngestWorkflow" version="2.0.0"
engine="oozie" path="/user/ambari-qa/falcon/demo/apps/ingest/fs" />
<retry policy="periodic" delay="minutes(15)" attempts="3" />
</process>
On second thought, you could create an Oozie workflow with a shell action that calls sqoop_hive.sh, which has the following line of code in it:
sudo oozie job -oozie http://hostname:port/oozie -config job.properties -run
Workflow.xml looks like:
<workflow-app xmlns="uri:oozie:workflow:0.4" name="shell-wf">
<start to="shell-node"/>
<action name="shell-node">
<shell xmlns="uri:oozie:shell-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<exec>sqoop_hive.sh</exec>
<argument>${feedInstancePaths}</argument>
<file>${wf:appPath()}/sqoop_hive.sh#sqoop_hive.sh</file>
<!-- <file>/tmp/ingest.sh#ingest.sh</file> -->
<!-- <capture-output/> -->
</shell>
<ok to="end"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>Shell action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
</workflow-app>
Call this using a Falcon process submission like:
falcon entity -type process -submit -file process.xml
The job.properties can be changed locally if you create a shell action in Oozie which calls the Oozie command line from within the shell script.
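For completeness, a sketch of what that sqoop_hive.sh wrapper could contain, assuming the Oozie client is available on the worker nodes and job.properties is shipped alongside the script (hostname and port are placeholders, as above):
#!/bin/bash
# submit the inner Oozie workflow using the locally supplied job.properties
oozie job -oozie http://hostname:port/oozie -config job.properties -run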