Oozie coordinator configuration properties

Configuration properties specified in the coordinator job are not seen in the workflow's param tags.
Coordinator:
<action>
<workflow>
<app-path>${workflowRoot}/report_action.xml</app-path>
<configuration>
<property>
<name>OUTPUT_COORD</name>
<value>${workflowRoot}/2014_05_01</value>
</property>
</configuration>
</workflow>
</action>
Workflow:
<action name="pig-node">
<pig>
...
<param>OUTPUT=${OUTPUT_COORD}</param>
</pig>
<ok to="end"/>
<error to="fail"/>
</action>
What I get is an 'EL_ERROR': variable OUTPUT_COORD cannot be resolved.
What could be the problem?

Syntax-wise, the variables are well defined in both the coordinator and the workflow. As Mzf pointed out, it seems you are running the workflow directly. Instead, you need to submit the coordinator, which will eventually run the workflow (as defined) and pass the value of OUTPUT_COORD from the coordinator to the workflow.
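For illustration, a minimal sketch of the coordinator job.properties and the submit command might look like the following; the host names, ports, paths, and the coordinator.xml file name are assumptions, not values taken from the question:

job.properties:
nameNode=hdfs://namenode-host:8020
jobTracker=jobtracker-host:8032
workflowRoot=${nameNode}/user/me/reports
# point Oozie at the coordinator definition, not at the workflow
oozie.coord.application.path=${workflowRoot}/coordinator.xml

Submit the coordinator (not the workflow), so OUTPUT_COORD gets passed down to the workflow:
oozie job -oozie http://oozie-host:11000/oozie -config job.properties -run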

Oozie shell action failing

I am trying to test the Oozie shell action in my Cloudera VM (quickstart VM). When running a script with a simple HDFS command (hadoop fs -put ...) it works, but when I trigger a Hive script the Oozie job finishes with status "KILLED". On the Oozie console the only error message I get is
"Main class [org.apache.oozie.action.hadoop.ShellMain], exit code [1]"
while the underlying job in the history server (name node logs) shows up as SUCCEEDED. Below are the Oozie job details:
workflow.xml
<workflow-app xmlns="uri:oozie:workflow:0.5" name="WorkFlow1">
<start to="shell-node" />
<action name="shell-node">
<shell xmlns="uri:oozie:shell-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<exec>${myscript}</exec>
<file>${myscriptpath}#${myscript}</file>
<capture-output/>
</shell>
<ok to="end"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>Workflow failed, error
message[${wf:errorMessage(wf:lastErrorNode())}] </message>
</kill>
<end name="end" />
</workflow-app>
------------------------------------
job.properties
nameNode=hdfs://quickstart.cloudera:8020
jobTracker=hdfs://quickstart.cloudera:8032
queueName=default
myscript=test.sh
myscriptpath=${nameNode}/oozie/sl/test.sh
oozie.use.system.libpath=true
oozie.wf.application.path=${nameNode}/oozie/sl/
workflowAppUri=${nameNode}/oozie/sl/
-----------------------------------------------
test.sh
hive -e "create table test2 as select * from test"
Would really appreciate it if anyone can point me in the direction where I am getting this wrong.
It would be good if you had a look at the Oozie Hive action.
It's pretty easy to configure, and the Hive action will take care of setting everything up.
https://oozie.apache.org/docs/4.3.0/DG_HiveActionExtension.html
To connect to Hive, you need to explicitly add the hive-site.xml or the Hive server details for it to connect.
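For reference, a minimal sketch of such a Hive action is shown below; the action name, the script name create_table.hql, and the assumption that hive-site.xml sits next to the workflow are illustrative only, not a definitive configuration:

<action name="hive-node">
<hive xmlns="uri:oozie:hive-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<!-- ship the Hive client configuration so the action can reach the metastore -->
<job-xml>hive-site.xml</job-xml>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<!-- create_table.hql would contain: create table test2 as select * from test; -->
<script>create_table.hql</script>
</hive>
<ok to="end"/>
<error to="fail"/>
</action>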

Several hive2 actions in oozie worflow receiving same timestamp

I built a workflow that has two hive2 actions and I am running it using Hue. I need to get the current time from the system when the workflow starts and pass it to both actions.
This is the structure of the workflow:
<?xml version="1.0" encoding="UTF-8"?>
<workflow-app xmlns="uri:oozie:workflow:0.5" name="workflow.xml">
<global>
<job-tracker>host1:1234</job-tracker>
<name-node>hdfs://myhost:4312</name-node>
<configuration>
<property>
<name>execution_start</name>
<value>${timestamp()}</value>
</property>
</configuration>
</global>
<start to="script1" />
<action name="script1">
<hive2 xmlns="uri:oozie:hive2-action:0.2">
<jdbc-url>jdbc:hive2://myhost:10/default</jdbc-url>
<script>script1.hql</script>
<param>execution_start=${execution_start}</param>
</hive2>
<ok to="script2" />
<error to="fail" />
</action>
<action name="script2">
<hive2 xmlns="uri:oozie:hive2-action:0.2">
<jdbc-url>jdbc:hive2://myhost:10/default</jdbc-url>
<script>script2.hql</script>
<param>execution_start=${execution_start}</param>
</hive2>
<ok to="end" />
<error to="fail" />
</action>
<kill name="fail">
<message>Sub workflow failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end" />
</workflow-app>
I need to have the same timestamp in both Hive actions. So far Hue asks me to input a value for the parameter named execution_start.
I also tried <param>execution_start=${wf:conf('execution_start')}</param>. With this I'm not prompted for the parameter, but I get a NULL value inside the script.
Notice that <param>execution_start=${timestamp()}</param> works, but it doesn't do the job for me, as the timestamps would be different in each action.
You can first invoke an Oozie shell action that just emits a timestamp, capture the output of this first action with <capture-output/>, and pass it to both Hive2 actions using something like <param>execution_start=${wf:actionData('TimestampShell')['timestamp']}</param> (where 'timestamp' is whatever key the script writes to stdout).
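A minimal sketch of that first shell action and its script, assuming the action is called TimestampShell, the script is timestamp.sh, and the output key is named timestamp (all hypothetical names):

<!-- <start to="TimestampShell"/> would replace <start to="script1"/> -->
<action name="TimestampShell">
<shell xmlns="uri:oozie:shell-action:0.2">
<job-tracker>host1:1234</job-tracker>
<name-node>hdfs://myhost:4312</name-node>
<exec>timestamp.sh</exec>
<file>timestamp.sh#timestamp.sh</file>
<capture-output/>
</shell>
<ok to="script1"/>
<error to="fail"/>
</action>

timestamp.sh (capture-output expects Java-Properties-style key=value output on stdout):
#!/bin/bash
echo "timestamp=$(date +%Y-%m-%dT%H:%M:%S)"

Both hive2 actions would then read the same value with:
<param>execution_start=${wf:actionData('TimestampShell')['timestamp']}</param>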

Submit pig job from oozie

I am working on automating Pig jobs using Oozie on a Hadoop cluster.
I was able to run a sample Pig script from Oozie, but my next requirement is to run a Pig job where the Pig script receives its input parameters from a shell script.
Please share your thoughts.
UPDATE:
OK, to make the original question clear: how can you pass a parameter from a shell script's output? Here's a working example:
WORKFLOW.XML
<workflow-app xmlns='uri:oozie:workflow:0.3' name='shell-wf'>
<start to='shell1' />
<action name='shell1'>
<shell xmlns="uri:oozie:shell-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<exec>so.sh</exec>
<argument>A</argument>
<argument>B</argument>
<file>so.sh</file>
<capture-output/>
</shell>
<ok to="shell2" />
<error to="fail" />
</action>
<action name='shell2'>
<shell xmlns="uri:oozie:shell-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<exec>so2.sh</exec>
<argument>${wf:actionData('shell1')['out']}</argument>
<file>so2.sh</file>
</shell>
<ok to="end" />
<error to="fail" />
</action>
<kill name="fail">
<message>Script failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name='end' />
</workflow-app>
SO.SH
echo "out=test"
SO2.SH
echo "I'm so2.sh and I get the following param:"
echo $1
If you replace the second shell action with your Pig action and pass the param to the Pig script like this:
...
<param>MY_PARAM=${wf:actionData('shell1')['out']}</param>
...
then your original question is solved.
Regarding your sharelib issue, you have to make sure that in the properties you configure LIB_PATH=where/your/jars/are and hand this param over to the Pig action:
<param>LIB_PATH=${LIB_PATH}</param>
then just register the jars from there:
REGISTER '$LIB_PATH/my_jar'
==========================================================================
What you are looking for is the
Map wf:actionData(String node)
EL function. This function is only applicable to action nodes that produce output data on completion. The output data is in Java Properties format, and via this EL function it is available as a Map (see the Oozie documentation).
Here's a nice example: http://www.infoq.com/articles/oozieexample
(actually, you have to capture the output, as Samson wrote in the comments)
Some more details:
"If the capture-output element is present, it indicates Oozie to capture output of the STDOUT of the shell command execution. The Shell command output must be in Java Properties file format and it must not exceed 2KB. From within the workflow definition, the output of an Shell action node is accessible via the String action:output(String node, String key) function (Refer to section '4.2.6 Action EL Functions')."
Or you can use a not-so-nice but simple workaround and execute your shell command in the Pig script itself, save its result in a variable, and use that. Like this:
%declare MY_VAR `echo "/abc/cba"`;
A = LOAD '$MY_VAR' ...
But this is not nice at all; the first solution is the suggested one.

Falcon & Oozie - How to configure job.properties for oozie in falcon

I have an Oozie workflow which calls a Sqoop and a Hive action. This individual workflow works fine when I run Oozie from the command line.
Since the Sqoop and Hive scripts vary, I pass the values to the workflow.xml using a job.properties file.
sudo oozie job -oozie http://hostname:port/oozie -config job.properties -run
Now I want to configure this Oozie workflow in Falcon. Can you please help me figure out where I can configure or pass the job.properties?
Below is the falcon process.xml
<process name="demoProcess" xmlns="uri:falcon:process:0.1">
<tags>pipeline=degIngestDataPipeline,owner=hadoop, externalSystem=svServers</tags>
<clusters>
<cluster name="demoCluster">
<validity start="2015-01-30T00:00Z" end="2016-02-28T00:00Z"/>
</cluster>
</clusters>
<parallel>1</parallel>
<order>FIFO</order>
<frequency>hours(1)</frequency>
<outputs>
<output name="output" feed="demoFeed" instance="now(0,0)" />
</outputs>
<workflow name="dev-wf" version="0.2.07"
engine="oozie" path="/apps/demo/workflow.xml" />
<retry policy="periodic" delay="minutes(15)" attempts="3" />
</process>
I could not find much help on the web or in the Falcon documentation regarding this.
I worked on some Falcon development but have not tried vanilla Falcon a lot; from what I understand from the tutorial below:
http://hortonworks.com/hadoop-tutorial/defining-processing-data-end-end-data-pipeline-apache-falcon/
I would try creating the Oozie workflow.xml so that it accepts the job.properties values dynamically. Place the properties file in the respective HDFS folder where workflow.xml picks it up, and you can change it for every process (one concrete way to do this is sketched below). Now you can use your Falcon process.xml and submit it from the command line using:
falcon entity -type process -submit -file process.xml
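If what you need are per-process default values for the workflow parameters, one concrete way to realize the "properties file next to workflow.xml" idea is Oozie's config-default.xml, which Oozie reads automatically from the workflow application directory; a minimal sketch with made-up property names:

config-default.xml (placed in /apps/demo/ next to workflow.xml):
<configuration>
<property>
<name>queueName</name>
<value>default</value>
</property>
<property>
<name>hiveScript</name>
<value>/apps/demo/scripts/transform.hql</value>
</property>
</configuration>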
Also, in path=/apps/demo/workflow.xml you need not specify workflow.xml explicitly. You can just give the folder name, for example:
<process name="rawEmailIngestProcess" xmlns="uri:falcon:process:0.1">
<tags>pipeline=churnAnalysisDataPipeline,owner=ETLGroup,externalSystem=USWestEmailServers</tags>
<clusters>
<cluster name="primaryCluster">
<validity start="2014-02-28T00:00Z" end="2016-03-31T00:00Z"/>
</cluster>
</clusters>
<parallel>1</parallel>
<order>FIFO</order>
<frequency>hours(1)</frequency>
<outputs>
<output name="output" feed="rawEmailFeed" instance="now(0,0)" />
</outputs>
<workflow name="emailIngestWorkflow" version="2.0.0"
engine="oozie" path="/user/ambari-qa/falcon/demo/apps/ingest/fs" />
<retry policy="periodic" delay="minutes(15)" attempts="3" />
</process>
On second thought, I felt you could create an Oozie workflow with a shell action that calls sqoop_hive.sh, which has the following line of code in it:
sudo oozie job -oozie http://hostname:port/oozie -config job.properties -run
Workflow.xml looks like:
<workflow-app xmlns="uri:oozie:workflow:0.4" name="shell-wf">
<start to="shell-node"/>
<action name="shell-node">
<shell xmlns="uri:oozie:shell-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<exec>sqoop_hive.sh</exec>
<argument>${feedInstancePaths}</argument>
<file>${wf:appPath()}/sqoop_hive.sh#sqoop_hive.sh</file>
<!-- <file>/tmp/ingest.sh#ingest.sh</file> -->
<!-- <capture-output/> -->
</shell>
<ok to="end"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>Shell action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
</workflow-app>
Call this with a Falcon process submission like:
falcon entity -type process -submit -file process.xml
The job.properties can be changed locally if you create a shell action in Oozie that calls the Oozie command line within the shell script.

create a directory in HDFS using ssh action in Oozie

I have to create a directory in HDFS using an ssh action in Oozie.
My sample workflow is
<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.1">
<action name="testjob">
<ssh>
<host>name@host</host>
<command>mkdir</command>
<args>hdfs://host/user/xyz/</args>
</ssh>
<ok to="end"/>
<error to="fail"/>
</action>
</workflow-app>
I am getting an error during execution.
Can anybody please guide me on what I am missing here?
You cannot make a directory in HDFS using the *nix mkdir command. The usage you have shown in the code will try to execute the mkdir command on the local file system, whereas you want to create a directory in HDFS.
Quoting the Oozie documentation at http://oozie.apache.org/docs/3.3.0/DG_SshActionExtension.html, it states:
The shell command is executed in the home directory of the specified user in the remote host.
<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.1">
<action name="testjob">
<ssh>
<host>name@host</host>
<command>/usr/bin/hadoop</command>
<args>dfs</args>
<args>-mkdir</args>
<args>NAME_OF_THE_DIRECTORY_YOU_WANT_TO_CREATE</args>
</ssh>
<ok to="end"/>
<error to="fail"/>
</action>
</workflow-app>
The above code depends upon the path to your hadoop binary.
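If SSH is not strictly required and the only goal is to create the HDFS directory, Oozie's built-in fs action avoids depending on the remote hadoop binary path altogether; a minimal sketch (the directory path is just an example):

<action name="make-dir">
<fs>
<mkdir path="hdfs://host/user/xyz/new_dir"/>
</fs>
<ok to="end"/>
<error to="fail"/>
</action>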