Running Sqoop job on YARN using Oozie - hadoop-yarn

I've got a problem running a Sqoop job on YARN in Oozie using Hue. I want to download a table from an Oracle database and upload it to HDFS. I have a multinode cluster consisting of 4 nodes.
I want to run a simple Sqoop statement:
import --options-file /tmp/oracle_dos.txt --table BD.BD_TABLE --target-dir /user/user1/files/user_temp_20160930_30 --m 1
The options file is located on the local file system of node 1. The other nodes have no options file in their /tmp/ directory. I created an Oozie workflow with a Sqoop job and tried to run it, but I got this error:
3432 [main] ERROR org.apache.sqoop.Sqoop - Error while expanding arguments
java.lang.Exception: Unable to read options file: /tmp/oracle_dos.txt
The weirdest thing is that the job sometimes succeeds and sometimes fails. The log file gave me the answer why: Oozie runs Sqoop jobs on YARN.
The Resource Manager (a component of YARN) decides which node will execute the Sqoop job. When the Resource Manager decides that node 1 (which has the options file on its local file system) should execute the job, everything is fine. But when the RM picks one of the other 3 nodes, the job fails.
This is a big problem for me, because I don't want to upload the options file to every node (what if I have 1000 nodes?). So my question is: is there any way to tell the Resource Manager which node it should use?

You can make a custom file available to your Oozie action on whichever node it runs on by using the <file> tag in your Sqoop action. Look at this syntax:
<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:0.1">
    ...
    <action name="[NODE-NAME]">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>[JOB-TRACKER]</job-tracker>
            <name-node>[NAME-NODE]</name-node>
            <prepare>
                <delete path="[PATH]"/>
                ...
                <mkdir path="[PATH]"/>
                ...
            </prepare>
            <configuration>
                <property>
                    <name>[PROPERTY-NAME]</name>
                    <value>[PROPERTY-VALUE]</value>
                </property>
                ...
            </configuration>
            <command>[SQOOP-COMMAND]</command>
            <arg>[SQOOP-ARGUMENT]</arg>
            ...
            <file>[FILE-PATH]</file>
            ...
            <archive>[FILE-PATH]</archive>
            ...
        </sqoop>
        <ok to="[NODE-NAME]"/>
        <error to="[NODE-NAME]"/>
    </action>
    ...
</workflow-app>
Also read this:
The file and archive elements make files and archives available to map-reduce jobs. If the specified path is relative, it is assumed the file or archive is within the application directory, in the corresponding sub-path. If the path is absolute, the file or archive is expected at the given absolute path.
Files specified with the file element will be symbolic links in the home directory of the task.
...
So in the simplest case you put your oracle_dos.txt file in your workflow directory, add a <file>oracle_dos.txt</file> element to workflow.xml, and change your command to something like this:
import --options-file ./oracle_dos.txt --table BD.BD_TABLE --target-dir /user/user1/files/user_temp_20160930_30 --m 1
In this case, even though your Sqoop action runs on some arbitrarily picked node in the cluster, Oozie will copy oracle_dos.txt to that node and you can refer to it as a local file.
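A minimal sketch of the resulting action, assuming workflow.xml and oracle_dos.txt sit in the same HDFS workflow directory and that jobTracker/nameNode are defined in job.properties:
<action name="sqoop-import">
    <sqoop xmlns="uri:oozie:sqoop-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <command>import --options-file ./oracle_dos.txt --table BD.BD_TABLE --target-dir /user/user1/files/user_temp_20160930_30 --m 1</command>
        <!-- ships oracle_dos.txt from the workflow directory to whichever node runs the action -->
        <file>oracle_dos.txt</file>
    </sqoop>
    <ok to="end"/>
    <error to="kill"/>
</action>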

Perhaps this is a file permissions issue. Try putting the file in /home/{user}.
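For instance (the paths and user name are only illustrative), copying the options file somewhere readable by the user that runs the job:
cp /tmp/oracle_dos.txt /home/user1/oracle_dos.txt
chmod 644 /home/user1/oracle_dos.txt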

Related

Not able to start hiveserver2 for Apache Hive

Could anyone help resolve the problem below? I'm trying to start hiveserver2. I configured hive-site.xml and the Hadoop directory path in the configuration file, and the jar hive-service-rpc-2.1.1.jar is available in the lib directory. I am able to start hive but not hiveserver2:
$ hive --service hiveserver2 Exception in thread "main" java.lang.ClassNotFoundException: /home/directory/Hadoop/Hive/apache-hive-2/1/1-bin/lib/hive-service-rpc-2/1/1/jar
export HIVE_HOME=/usr/local/hive-1.2.1/
export HIVE_HOME=/usr/local/hive-2.1.1
I am glad that I solved the problem. Here is what was wrong: I have different Hive versions installed, my command used 1.2.1, but it was picking up the jar from 2.1.1.
You can use the command which hiveserver2 to find out where your command comes from.
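For example (assuming both versions are installed on the machine), something like this shows which binaries and which HIVE_HOME the shell actually resolves:
which hiveserver2     # the hiveserver2 script first on the PATH
which hive            # the hive launcher being used
echo $HIVE_HOME       # should point at the same installation, e.g. /usr/local/hive-2.1.1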

how to configure job.properties for oozie workflow

First of all, I am new to Oozie and I don't know how to do this in practice. I am trying to run the default example files with Hive actions in Oozie...
This is my job.properties file:
nameNode=hdfs://localhost:8020
jobTracker=localhost:8021
queueName=default
examplesRoot=exampless
oozie.libpath=/user/${user.name}/${examplesRoot}/apps/examples-lib
oozie.wf.application.path=/practical/examples/apps/hive
Note: /practical/examples/apps/hive is an HDFS path that contains the workflow.xml file.
I am getting this error:
Error: E0504 : E0504: Workflow app directory
[/practical/examples/apps/hive] does not exist
(but I do have a workflow.xml file in that path)
I think you should set oozie.wf.application.path to
hdfs://hostname:8020/practical/examples/apps/hive
instead of /practical/examples/apps/hive, as the latter is not a full HDFS location and Oozie will start searching the local file system instead of HDFS.
Example:
nameNode=hdfs://quickstart.cloudera:8020
oozie.wf.application.path=${nameNode}/user/${user.name}/practical/examples/apps/hive
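Putting it together, a job.properties along these lines should work (the host names are just examples, and the oozie.use.system.libpath line is an assumption so that the Hive action can find its sharelib):
nameNode=hdfs://quickstart.cloudera:8020
jobTracker=quickstart.cloudera:8021
queueName=default
oozie.use.system.libpath=true
oozie.wf.application.path=${nameNode}/practical/examples/apps/hive
and then submit the workflow with the standard CLI command:
oozie job -oozie http://localhost:11000/oozie -config job.properties -run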

Apache oozie sharedlib is showing a blank list

I'm relatively new to Apache Oozie and did an installation on Ubuntu 14.04, Hadoop 2.6.0, JDK 1.8. I was able to install Oozie and the web console is visible on port 11000 of my server.
Now when I copy the examples bundled with Oozie and try to run them, I run into an error which says no sharelib exists.
I installed the sharelib as below:
bin/oozie-setup.sh sharelib create -fs hdfs://localhost:54310
(my namenode is running on localhost 54310 and JT on localhost 54311)
hadoop fs -ls /user/hduser/share/lib shows the shared library created as per the oozie-site.xml file. However, when I check the shared library using the command
oozie admin -oozie http://localhost:11000/oozie -shareliblist
the list is blank, and jobs are also failing for the same reason.
Any clues on how I should approach this problem?
Thanks.
The sharelib create command looks fine.
If you haven't done so already, copy the core-site.xml from your Hadoop installation folder into $OOZIE_HOME/conf/hadoop-conf/.
There might already be a "placeholder" core-site.xml in the hadoop-conf folder; delete or rename that one. Oozie doesn't get its Hadoop configuration directly from your Hadoop install (like Hive does, for example) but from the core-site.xml you place in that hadoop-conf folder.
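A hedged sketch of those steps (the /etc/hadoop/conf path is a typical location, adjust to your install):
cp /etc/hadoop/conf/core-site.xml $OOZIE_HOME/conf/hadoop-conf/core-site.xml
$OOZIE_HOME/bin/oozied.sh stop
$OOZIE_HOME/bin/oozied.sh start
oozie admin -oozie http://localhost:11000/oozie -shareliblist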
Okay, I got a solution for this.
When I was trying to create the sharelib directory, it was created on HDFS, but while running the job the local path was being referred to. So I extracted the oozie-sharelib tar.gz file into my local /user/hduser/share/lib directory and it is working now.
But I did not find out the reason, so it is still an open question.
I have encountered the same issue, and it turned out that Oozie was not able to communicate with HDFS, because it could not find core-site.xml or any other Hadoop configuration, whose location has to be declared inside oozie-site.xml.
The corresponding property in oozie-site.xml is oozie.service.HadoopAccessorService.hadoop.configurations.
This property was defined wrongly in my case. I changed it to point to where my Hadoop configuration XMLs are present, and then Oozie started communicating with HDFS and was able to locate the sharelib on HDFS.
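For reference, the property typically looks like this in oozie-site.xml (the /etc/hadoop/conf value is just an example; point it at the directory holding your core-site.xml and hdfs-site.xml):
<property>
    <name>oozie.service.HadoopAccessorService.hadoop.configurations</name>
    <value>*=/etc/hadoop/conf</value>
</property>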

How can I reload oozie job configuration file without restart oozie job

I'd like to know if there is a way to reload the configuration file of an Oozie job without restarting the job (coordinator).
The coordinator actually runs many of our tasks; sometimes we only need to change one line of the job configuration file and apply the update without disturbing the other tasks.
Thank you very much.
The properties of an Oozie coordinator can be updated with the command below once the coordinator has started running. Update the property file on the Unix file system and then submit it as below.
oozie job -oozie http://namenodeinfo/oozie -config job.properties -update coordinator_job_id
Note that all the coordinator actions created so far (including the ones in WAITING status) will still use the old configuration. The new configuration will be applied to new coordinator actions as they materialize.
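For example (the coordinator ID below is a made-up placeholder), after editing job.properties locally:
oozie job -oozie http://localhost:11000/oozie -config job.properties -update 0000123-161005123456789-oozie-oozi-C
oozie job -oozie http://localhost:11000/oozie -info 0000123-161005123456789-oozie-oozi-C
The second command just checks the coordinator's status afterwards.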
The latest Oozie (4.1) allows updating the coordinator definition. See https://oozie.apache.org/docs/4.1.0/DG_CommandLineTool.html#Updating_coordinator_definition_and_properties
Not really (well, you could go into the database table and make the change, but that might require a shutdown of Oozie if you're using an embedded Derby DB, and besides it probably isn't advisable).
If you need to change the configuration often, consider pushing the value down into the launched workflow.xml file; you can change this file's contents between coordinator instantiations.
You could also (if this is a one-time change) kill the running coordinator, make the change, and start the coordinator up again, amending the start time so that previous instances won't be scheduled to run again.
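If you go the kill-and-resubmit route, the sequence looks roughly like this (the coordinator ID and property file name are placeholders):
oozie job -oozie http://localhost:11000/oozie -kill 0000123-161005123456789-oozie-oozi-C
# edit coordinator.properties / workflow.xml and adjust the start time
oozie job -oozie http://localhost:11000/oozie -config coordinator.properties -run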
Not really :-)
Here is what you can do.
Create another config file in HDFS with the properties that you want to be able to change.
Read this file at the beginning of your workflow.
Example:
<action name="devices-location">
    <shell xmlns="uri:oozie:shell-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <exec>hadoop</exec>
        <argument>fs</argument>
        <argument>-cat</argument>
        <argument>/path/to/config/file.properties</argument>
        <capture-output/>
    </shell>
    <ok to="report"/>
    <error to="kill"/>
</action>
<action name="report">
    <java>
        ...
        <main-class>com.twitter.scalding.Tool</main-class>
        <arg>--device-graph</arg>
        <arg>${wf:actionData('devices-location')['path']}</arg>
        <file>${scalding_jar}</file>
    </java>
    <ok to="end"/>
    <error to="kill"/>
</action>
Where the config file in hdfs at /path/to/config/file.properties looks like this:
path=/some/path/to/data
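The <capture-output/> element makes Oozie parse the action's stdout as key=value properties, which is why the next action can read ${wf:actionData('devices-location')['path']}. Changing the behaviour later is then just a matter of overwriting the file in HDFS, for example:
hdfs dfs -put -f file.properties /path/to/config/file.properties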

Hadoop DFS permission issue when running job

I'm getting the following permission error, and am not sure why Hadoop is trying to write to this particular folder:
hadoop jar /usr/lib/hadoop/hadoop-*-examples.jar pi 2 100000
Number of Maps = 2
Samples per Map = 100000
Wrote input for Map #0
Wrote input for Map #1
Starting Job
org.apache.hadoop.security.AccessControlException: org.apache.hadoop.security.AccessControlException: Permission denied: user=myuser, access=WRITE, inode="/":hdfs:supergroup:drwxr-xr-x
Any idea why it is trying to write to the root of my hdfs?
Update: After temporarily setting the HDFS root (/) to 777 permissions, I saw that a "/tmp" folder is being written. I suppose one option is to just create a "/tmp" folder with open permissions for everyone to write to, but from a security standpoint it would be nice if this were instead written to the user folder (i.e. /user/myuser/tmp).
I was able to get this working with the following setting:
<configuration>
    <property>
        <name>mapreduce.jobtracker.staging.root.dir</name>
        <value>/user</value>
    </property>
    <!-- ... -->
</configuration>
A restart of the jobtracker service was required as well (special thanks to Jeff on the Hadoop mailing list for helping me track down the problem!).
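With that staging root, each submitting user also needs a writable directory under /user; a sketch (user name assumed):
sudo -u hdfs hadoop fs -mkdir /user/myuser
sudo -u hdfs hadoop fs -chown myuser /user/myuser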
1) Create the {mapred.system.dir}/mapred directory in HDFS using the following command:
sudo -u hdfs hadoop fs -mkdir /hadoop/mapred/
2) Give ownership of it to the mapred user:
sudo -u hdfs hadoop fs -chown mapred:hadoop /hadoop/mapred/
You can also make a new user named "hdfs". It's quite a simple solution but probably not as clean.
Of course, this applies when you are using Hue with Cloudera Hadoop Manager (CDH3).
You need to set the permission on the Hadoop root directory (/) instead of setting the permission on the system's root directory. I was confused too, but then realized that the directory mentioned belongs to Hadoop's file system and not the local system's.
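In other words, any such permission change has to go through the HDFS shell rather than the local one, for example:
sudo -u hdfs hadoop fs -chmod 775 /    # changes the HDFS root directory, not the local "/"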