How to configure job.properties for an Oozie workflow (Hive)

First of all, I am new to Oozie and don't yet know how to use it in practice. I am trying to run the bundled example workflow with Hive actions in Oozie.
This is my job.properties file:
nameNode=hdfs://localhost:8020
jobTracker=localhost:8021
queueName=default
examplesRoot=exampless
oozie.libpath=/user/${user.name}/${examplesRoot}/apps/examples-lib
oozie.wf.application.path=/practical/examples/apps/hive
Note: /practical/examples/apps/hive is the HDFS path that contains the workflow.xml file.
I am getting this error:
Error: E0504 : E0504: Workflow app directory [/practical/examples/apps/hive] does not exist
(but I do have a workflow.xml file at that path)

I think you should set oozie.wf.application.path to
hdfs://hostname:8020/practical/examples/apps/hive
instead of /practical/examples/apps/hive. The bare path is not a fully qualified HDFS location, so the lookup will go to the local file system rather than HDFS.
Example:
nameNode=hdfs://quickstart.cloudera:8020
oozie.wf.application.path=${nameNode}/user/${user.name}/practical/examples/apps/hive
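To make the suggestion concrete, here is a minimal Python sketch of the qualification step: it prefixes the NameNode URI to a path that has no scheme. The helper name qualify_hdfs_path is hypothetical and not part of Oozie; it only illustrates what a value of the form ${nameNode}/... expands to in job.properties.

```python
from urllib.parse import urlparse


def qualify_hdfs_path(name_node: str, path: str) -> str:
    """Return a fully qualified URI, prefixing name_node when path has no scheme.

    Hypothetical helper for illustration only; Oozie performs the
    ${nameNode} substitution itself when it reads job.properties.
    """
    if urlparse(path).scheme:
        # The path already carries a scheme (hdfs://, file://, ...): keep it.
        return path
    return name_node.rstrip("/") + "/" + path.lstrip("/")


print(qualify_hdfs_path("hdfs://localhost:8020", "/practical/examples/apps/hive"))
```

With the values from the question, this prints hdfs://localhost:8020/practical/examples/apps/hive, which is the form the answer recommends for oozie.wf.application.path.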

Related

How to fix 'Not found: Files /bigstore/project/testing/filename.json' error when loading into Bigquery

I'm trying to load about 4000 JSON files into a BigQuery table using the following command: bq load --source_format=NEWLINE_DELIMITED_JSON --replace=true kx-test.store_requests gs://kx-gam-test/store/requests/*
I am getting the following error:
Error encountered during job execution:
Not found: Files /bigstore/kx-gam-test/store/requests/7fb27d63-5581-43a1-821d-fcf47b3412fd.json.gz
Failure details:
- Not found: Files /bigstore/kx-gam-test/store/requests/93b54246-2284-4b85-8620-76657f4a338b.json.gz
- Not found: Files /bigstore/kx-gam-test/store/requests/fd24a53d-2c49-4f66-bf54-a7ccf14a1cfe.json.gz
- Not found: Files /bigstore/kx-gam-test/store/requests/35a27032-930c-456a-846d-67481a21e52d.json.gz
I am not sure why it is not working. Is it possibly due to the number of files I am trying to load? And what is this bigstore folder prefixed to my GCS bucket?
Note that kx-gam-test/store/requests contains several subfolders, and I want to load the gzipped JSON files inside all of them.
According to the documentation:
BigQuery does not support source URIs that include multiple consecutive slashes after the initial double slash.
The documentation also has additional guidance to consider when loading data from Cloud Storage.
A few things you can check:
- Make sure that you have the necessary permissions.
- Make sure that the files actually exist in GCS.
- Do you have any process that deletes the files after loading? Check the audit logs for traces of a file being deleted while BigQuery was still reading or loading it.
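The quoted restriction on consecutive slashes can be checked before submitting the load job. Below is a minimal Python sketch; the function name has_consecutive_slashes and the second URI in the usage lines are made up for illustration, and the check only covers this one rule, not permissions or file existence.

```python
def has_consecutive_slashes(uri: str) -> bool:
    """Return True if a gs:// URI contains '//' after the initial 'gs://'."""
    prefix = "gs://"
    if not uri.startswith(prefix):
        raise ValueError(f"not a Cloud Storage URI: {uri!r}")
    # Only look at the part after the scheme's own double slash.
    return "//" in uri[len(prefix):]


# URI from the question: passes the check.
print(has_consecutive_slashes("gs://kx-gam-test/store/requests/*"))
# Hypothetical malformed URI with a doubled slash: fails the check.
print(has_consecutive_slashes("gs://kx-gam-test/store//requests/*"))
```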

HIVE> FAILED: SemanticException Line 1:23 Invalid path

I tried to load data into my table 'users' in LOCAL mode; I am using Cloudera on VirtualBox. I placed my file inside the /home/cloudera/Desktop/Hive/ directory, but I am getting an error:
FAILED: SemanticException Line 1:23 Invalid path ''/home/cloudera/Desktop/Hive/hive_input.txt'': No files matching path file:/home/cloudera/Desktop/Hive/hive_input.txt
My statement to load the data into the table:
Load DATA LOCAL INPATH '/home/cloudera/Desktop/Hive/hive_input.txt' INTO Table users
Yes, I removed LOCAL as per @Bhaskar, and the path is now my HDFS path where the file exists, not the underlying Linux path.
Load DATA INPATH '/user/cloudera/input_project/' INTO Table users;
You should change the permissions on the folder that contains your file.
chmod -R 755 /home/user/
Another cause could be a file access issue. If you are running the Hive CLI as user01 and accessing a file (your INPATH) in user02's home directory, you will get the same error.
So the solution could be:
1. Move the file to a location where user01 can access the file.
OR
2. Relaunch the Hive CLI after logging in with user02.
Check whether your script uses a Sqoop import that tries to import data into Hive from an empty table.
This may cause the Sqoop import to delete the HDFS location of the Hive table.
To confirm, run hdfs dfs -ls before and after you execute the Sqoop import; if the directory is gone, re-create it using hdfs dfs -mkdir.
My path to the file in HDFS was data/file.csv, note, it is not /data/file.csv.
I specified the LOCATION during table creation as data/file.csv.
Executing
LOAD DATA INPATH '/data/file.csv' INTO TABLE example_table;
failed with the mentioned exception. However, executing
LOAD DATA INPATH 'data/file.csv' INTO TABLE example_table;
worked as desired.
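The difference between the two statements comes down to how a path without a leading slash is resolved. The Python sketch below mimics the default HDFS behaviour as I understand it: a relative path is resolved against the user's home directory, assumed here to be /user/<user>. The function name and the user name alice are hypothetical.

```python
import posixpath


def resolve_hdfs_path(path: str, user: str) -> str:
    """Resolve a scheme-less HDFS path.

    Assumption: the cluster uses the default home directory prefix /user,
    so relative paths are resolved against /user/<user>.
    """
    if path.startswith("/"):
        # Absolute path: taken from the HDFS root as-is.
        return posixpath.normpath(path)
    return posixpath.join("/user", user, path)


print(resolve_hdfs_path("/data/file.csv", "alice"))
print(resolve_hdfs_path("data/file.csv", "alice"))
```

Under these assumptions, /data/file.csv and data/file.csv point to two different locations, which would explain why only one of the LOAD DATA statements found the file.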

Running Sqoop job on YARN using Oozie

I have a problem running a Sqoop job on YARN through Oozie, using Hue. I want to download a table from an Oracle database and upload it to HDFS. I have a multi-node cluster consisting of 4 nodes.
I want to run a simple Sqoop statement:
import --options-file /tmp/oracle_dos.txt --table BD.BD_TABLE --target-dir /user/user1/files/user_temp_20160930_30 --m 1
The options file is located on the local file system of node 1. The other nodes have no options file in their /tmp/ directory. I created an Oozie workflow with a Sqoop job and tried to run it, but I got this error:
3432 [main] ERROR org.apache.sqoop.Sqoop - Error while expanding arguments
java.lang.Exception: Unable to read options file: /tmp/oracle_dos.txt
The weirdest thing is that the job sometimes succeeds and sometimes fails. The log file told me why: Oozie runs Sqoop jobs on YARN.
The Resource Manager (a component of YARN) decides which node will execute the Sqoop job. When the Resource Manager picks node 1 (which has the options file on its local file system), everything is fine. But when it picks one of the other 3 nodes, the job fails.
This is a big problem for me, because I don't want to copy the options file to every node (what if I had 1000 nodes?). So my question is: is there any way to tell the Resource Manager which node it should use?
You can make a custom file available to your Oozie action on whichever node it runs by using the <file> tag in your Sqoop action. Look at this syntax:
<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:0.1">
    ...
    <action name="[NODE-NAME]">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>[JOB-TRACKER]</job-tracker>
            <name-node>[NAME-NODE]</name-node>
            <prepare>
                <delete path="[PATH]"/>
                ...
                <mkdir path="[PATH]"/>
                ...
            </prepare>
            <configuration>
                <property>
                    <name>[PROPERTY-NAME]</name>
                    <value>[PROPERTY-VALUE]</value>
                </property>
                ...
            </configuration>
            <command>[SQOOP-COMMAND]</command>
            <arg>[SQOOP-ARGUMENT]</arg>
            ...
            <file>[FILE-PATH]</file>
            ...
            <archive>[FILE-PATH]</archive>
            ...
        </sqoop>
        <ok to="[NODE-NAME]"/>
        <error to="[NODE-NAME]"/>
    </action>
    ...
</workflow-app>
Also read this:
The file, archive elements make available, to map-reduce jobs, files
and archives. If the specified path is relative, it is assumed the
file or archiver are within the application directory, in the
corresponding sub-path. If the path is absolute, the file or archive
it is expected in the given absolute path.
Files specified with the file element, will be symbolic links in the
home directory of the task.
...
So in the simplest case you put your file oracle_dos.txt in your workflow directory, add the element <file>oracle_dos.txt</file> to workflow.xml, and change your command to something like this:
import --options-file ./oracle_dos.txt --table BD.BD_TABLE --target-dir /user/user1/files/user_temp_20160930_30 --m 1
In this case, even though your Sqoop action runs on an arbitrarily chosen node in the cluster, Oozie will copy oracle_dos.txt to that node and you can refer to it as a local file.
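As a rough illustration of how the pieces fit together, the Python sketch below assembles a minimal Sqoop action element with the command and a file entry. It is not a complete workflow: the ${jobTracker} and ${nameNode} parameters are assumed to be defined in job.properties, the function name sqoop_action is hypothetical, and the optional prepare and configuration elements are left out.

```python
import xml.etree.ElementTree as ET
from typing import Sequence


def sqoop_action(command: str, files: Sequence[str]) -> str:
    """Build a minimal <sqoop> action element as an XML string.

    Illustration only: ${jobTracker} and ${nameNode} are assumed
    job.properties parameters, not values taken from the question.
    """
    sqoop = ET.Element("sqoop", {"xmlns": "uri:oozie:sqoop-action:0.2"})
    ET.SubElement(sqoop, "job-tracker").text = "${jobTracker}"
    ET.SubElement(sqoop, "name-node").text = "${nameNode}"
    ET.SubElement(sqoop, "command").text = command
    # Each <file> path is relative to the workflow application directory.
    for path in files:
        ET.SubElement(sqoop, "file").text = path
    return ET.tostring(sqoop, encoding="unicode")


print(sqoop_action(
    "import --options-file ./oracle_dos.txt --table BD.BD_TABLE "
    "--target-dir /user/user1/files/user_temp_20160930_30 --m 1",
    ["oracle_dos.txt"],
))
```

The printed element would sit inside the action node of workflow.xml, with oracle_dos.txt uploaded next to workflow.xml in HDFS.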
Perhaps this is about file permissions. Try to put this file in /home/{user}.

What is the path for a bootstrapped file for a Pig job running in Amazon EMR

I bootstrap a data file in my EMR job. The bootstrapping succeeds and the file is copied to the /home/hadoop/contents/ folder with the right permissions.
However when I try to access it in the Pig script like below:
userdidstopick = load '/home/hadoop/contents/UserIdsToPick.txt' AS (uid:chararray);
I get an error that the input path does not exist:
hdfs://10.183.166.176:9000/home/hadoop/contents/UserIdsToPick.txt
When running Ruby jobs the bootstrapped file was always accessible under /home/hadoop/contents/ folder and everything worked for me.
Is it different for Pig?
By default, Pig on EMR is configured to read from HDFS rather than the local file system. The error shows the HDFS location.
There are two ways to solve this:
Either copy the file to S3 and load it directly from there:
userdidstopick = load 's3_bucket_location/UserIdsToPick.txt' AS (uid:chararray);
Or first copy the file to HDFS (instead of the local file system) and then use the same path you are using today.
I would prefer the first option.

Apache oozie sharedlib is showing a blank list

I am relatively new to Apache Oozie and installed it on Ubuntu 14.04 with Hadoop 2.6.0 and JDK 1.8. The installation succeeded and the web console is reachable on port 11000 of my server.
When I copied the examples bundled with Oozie and tried to run them, I ran into an error saying that no sharelib exists.
I installed the sharelib as follows:
bin/oozie-setup.sh sharelib create -fs hdfs://localhost:54310
(my NameNode is running on localhost:54310 and the JobTracker on localhost:54311)
hadoop fs -ls /user/hduser/share/lib shows the shared library, created as configured in oozie-site.xml. However, when I check the shared library with the command
oozie admin -oozie http://localhost:11000/oozie -shareliblist
the list is blank, and jobs are failing for the same reason.
Any clues on how I should approach this problem?
Thanks.
The sharelib create command looks fine.
If you haven't done so already, copy the core-site.xml from your Hadoop installation folder into $OOZIE_HOME/conf/hadoop-conf/.
There might already be a "placeholder" core-site.xml in the hadoop-conf folder; delete or rename that one. Oozie doesn't get its Hadoop configuration directly from your Hadoop install (as Hive does, for example) but from the core-site.xml you place in that hadoop-conf folder.
Okay, I found a solution for this.
When I created the sharelib directory it was created on HDFS, but when the job ran, a local path was being referred to. So I extracted the Oozie sharelib tar.gz file into my local /user/hduser/share/lib directory, and it is working now.
I still don't understand the reason, though, so it remains an open question.
I encountered the same issue. It turned out that Oozie was not able to communicate with HDFS because it could not find core-site.xml or any other Hadoop configuration; the location of those files has to be declared in oozie-site.xml.
The corresponding property in oozie-site.xml is oozie.service.HadoopAccessorService.hadoop.configurations.
This property was defined incorrectly in my case.
I changed it to point to the directory where my Hadoop configuration XML files are, and Oozie then started communicating with HDFS and was able to locate the sharelib there.
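For reference, my understanding is that the value of this property is a comma-separated list of AUTHORITY=HADOOP_CONF_DIR pairs, where * acts as a wildcard authority. The Python sketch below parses such a value and looks up the directory for a given authority; both function names and the /etc/hadoop/conf path are hypothetical and only illustrate the format.

```python
from typing import Dict, Optional


def parse_hadoop_configurations(value: str) -> Dict[str, str]:
    """Parse 'AUTHORITY=CONF_DIR,AUTHORITY=CONF_DIR' into a dict."""
    mapping: Dict[str, str] = {}
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue
        authority, sep, conf_dir = entry.partition("=")
        if not sep:
            raise ValueError(f"expected AUTHORITY=CONF_DIR, got {entry!r}")
        mapping[authority.strip()] = conf_dir.strip()
    return mapping


def conf_dir_for(mapping: Dict[str, str], authority: str) -> Optional[str]:
    """Return the conf dir for an authority, falling back to the '*' entry."""
    return mapping.get(authority, mapping.get("*"))


# Hypothetical value: every authority uses /etc/hadoop/conf.
mapping = parse_hadoop_configurations("*=/etc/hadoop/conf")
print(conf_dir_for(mapping, "localhost:54310"))
```

If the directory that comes back does not contain the cluster's core-site.xml, that would be consistent with the symptom described above.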