When I run my Pig script locally (pig -file script.pig -param INPUT=val -param OUTPUT=val) everything works fine. But when I schedule the same script with Oozie (coordinator/workflow), it fails. I don't understand why...
Can someone help me out?
Pig script
alarms = LOAD '$INPUT' USING PigStorage('|', '-noschema') AS (
    row_num:long,
    timestamp:chararray,
    protocol_name:chararray,
    source_ip:chararray,
    destination_ip:chararray,
    source_port:int,
    destination_port:int
);

alarms_projection = FOREACH alarms {
    GENERATE
        SUBSTRING(timestamp, 0, 10) AS alarm_date:chararray,
        SUBSTRING(timestamp, 11, 19) AS alarm_time:chararray,
        protocol_name,
        source_ip,
        destination_ip,
        source_port,
        destination_port;
}

STORE alarms_projection INTO '$OUTPUT' USING PigStorage('|');
ERROR
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.PigMain], exception invoking main(), Scheme not present in uri /etl/av/complete/alarms
org.apache.oozie.action.hadoop.LauncherException: Scheme not present in uri /etl/av/complete/alarms
at org.apache.oozie.action.hadoop.LauncherMapper.map(LauncherMapper.java:177)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: org.apache.oozie.action.hadoop.LauncherException: Scheme not present in uri /etl/av/complete/alarms
at org.apache.oozie.action.hadoop.LauncherURIHandlerFactory.getURIHandler(LauncherURIHandlerFactory.java:41)
at org.apache.oozie.action.hadoop.PrepareActionsDriver.doOperations(PrepareActionsDriver.java:65)
at org.apache.oozie.action.hadoop.LauncherMapper.executePrepare(LauncherMapper.java:444)
at org.apache.oozie.action.hadoop.LauncherMapper.map(LauncherMapper.java:173)
... 8 more
Oozie Launcher failed, finishing Hadoop job gracefully
It was a configuration error in the workflow.xml. I was using a prepare statement to empty the output directory, but instead of specifying the full URI hdfs://node:port/path/to/the/files, I used the bare path /path/to/the/files.
The right way to use prepare:
<prepare>
    <delete path="hdfs://node:8020/path/to/files"/>
</prepare>
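In practice you usually don't hard-code the NameNode. A common pattern, sketched here with the conventional ${nameNode} property (the name is a convention, not something Oozie requires), is:

<prepare>
    <delete path="${nameNode}/path/to/files"/>
</prepare>

with the value supplied once in job.properties:

nameNode=hdfs://node:8020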
Pentaho PDI version 8.3.0 CE, if it matters.
When I try to run a job or transformation from the command line using Kitchen or Pan respectively, I get a NullPointerException. This happens only when trying to run something from a repository.
When I run the same transformation or job from Spoon, all is fine and it runs great.
I use the following commands, which both produce the same error:
./pan.sh -trans=get_clusters -rep=myrepo -user=admin -pass=mypass -dir=/Transformations
and
./kitchen.sh -job=scheduled_update_job -rep=myrepo -user=admin -pass=mypass -dir=/Jobs
NOTE: This error also happens when I try to run the job or transformation from a Docker container.
The error I receive is as follows and is identical for Pan and Kitchen:
2020/02/05 09:07:56 - Pan - Start of run.
Processing has stopped because of an error: null
java.lang.NullPointerException
at org.pentaho.di.core.plugins.PluginRegistry.getPluginId(PluginRegistry.java:689)
at org.pentaho.di.core.plugins.PluginRegistry.getPlugin(PluginRegistry.java:715)
at org.pentaho.di.core.plugins.PluginRegistry.loadClass(PluginRegistry.java:370)
at org.pentaho.di.base.AbstractBaseCommandExecutor.establishRepositoryConnection(AbstractBaseCommandExecutor.java:195)
at org.pentaho.di.pan.PanCommandExecutor.execute(PanCommandExecutor.java:119)
at org.pentaho.di.pan.Pan.main(Pan.java:270)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.pentaho.commons.launcher.Launcher.main(Launcher.java:92)
Any help would be appreciated.
Run the job from your home directory (as the working directory), using the full path of pan.sh or kitchen.sh; a sketch follows this list.
I'm not sure what exactly causes the trouble. Likely causes:
- Your KETTLE_HOME is not valid, causing Pentaho to look for .kettle in the working directory. (Do not include .kettle itself in KETTLE_HOME.)
- A variant of this: you don't have permissions on the files because you copied/moved them as root.
- Your user does not have write access to the data-integration directory, causing a failure when writing configuration that would normally go into the working directory. Running Pentaho with an account that cannot write to data-integration is normal and is not the problem in itself; it just doesn't like a non-writable working directory.
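A minimal sketch of that workaround, assuming PDI is installed under /opt/data-integration and your login is youruser (both hypothetical):

cd ~
# KETTLE_HOME must be the directory that CONTAINS .kettle, not .kettle itself
export KETTLE_HOME=/home/youruser
/opt/data-integration/pan.sh -trans=get_clusters -rep=myrepo -user=admin -pass=mypass -dir=/Transformations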
I'm receiving this exception when executing a Hive query on Tez, with Hive 2.3.6 and Tez 0.9.2.
I know Tez is configured correctly because I can manually run MapReduce jobs via Hadoop.
Dag submit failed due to java.io.FileNotFoundException: Path is not a file: /tmp/hive/root/_tez_session_dir/f4f4b17c-0657-41fa-8674-df83fa3ad362/lib
at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:76)
at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:62)
at org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getBlockLocations(FSDirStatAndListingOp.java:150)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1829)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:709)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:381)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:503)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:871)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:817)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2606)
This error is seen on Hive 2.2+ (including 2.3.x) when either:
- hive.aux.jars.path in hive-site.xml is configured with an invalid path, or
- the HIVE_AUX_JARS_PATH environment variable is configured improperly (usually in hive-env.sh).
A correct configuration is sketched below.
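For example, a working hive-site.xml entry might look like this; the jar path is hypothetical and must point at a file that actually exists:

<property>
    <name>hive.aux.jars.path</name>
    <value>file:///opt/hive/auxlib/my-udfs.jar</value>
</property>

If you use hive-env.sh instead, the equivalent is a line such as export HIVE_AUX_JARS_PATH=/opt/hive/auxlib (again a hypothetical path); verify the path exists before restarting Hive.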
I am using Apache Pig. I am trying to load a comma-separated file as a Pig table. It does not throw any error while loading the file, but when I try to print that table using the "dump" command, it gives an error.
File I loaded
Error,fdgdf
Error,dfgdf
Error,dfgdf
Info,dfgdf
Info,dfgdf
Info,dfgdf
Info,dfgdf
Info,dfgdf
Info,dfgdf
Debug,dfgdf
Debug,dfgdf
Debug,dfgdf
Debug,dfgdf
Debug,dfgdf
Debug,dfgdf
Command to load
logFile1 = LOAD 'PigTestFile' using PigStorage();
Command to print table
dump logFile1
Error I get
Failed Jobs:
JobId Alias Feature Message Outputs
job_1454617624671_0152 logFile1 MAP_ONLY Message: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Input path does not exist: hdfs://ip-172-31-53-48.ec2.internal:8020/user/e1681fe26eed362777aabca1682510/PigTestFile
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:279)
at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
at org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob.submit(ControlledJob.java:335)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.pig.backend.hadoop23.PigJobControl.submit(PigJobControl.java:128)
at org.apache.pig.backend.hadoop23.PigJobControl.run(PigJobControl.java:194)
at java.lang.Thread.run(Thread.java:745)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$1.run(MapReduceLauncher.java:276)
Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://ip-172-31-53-48.ec2.internal:8020/user/e1681fe26eed362777aabca1682510/PigTestFile
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:323)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTextInputFormat.listStatus(PigTextInputFormat.java:36)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:265)
... 18 more
hdfs://ip-172-31-53-48.ec2.internal:8020/tmp/temp1258481141/tmp-1928081547,
:
:
2016-02-07 06:31:20,100 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
2016-02-07 06:31:20,107 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias logFile1. Backend error : java.lang.IllegalStateException: Job in state DEFINE instead of RUNNING
[EDIT]
When I closely read the log I found that it is not able to find the file which was used to load the table; it expects the file to be in HDFS, whereas my file was on the local box.
I then moved the file into HDFS and ran the same commands. It worked well.
But then why did it not give an error while executing the LOAD command itself?
As explained by Murali in his answer (which I have accepted), Map/Reduce jobs for a script are triggered only when a STORE or DUMP is encountered.
Here is more explanation from the Apache Pig documentation:
In general, Pig processes Pig Latin statements as follows:
First, Pig validates the syntax and semantics of all statements.
Next, if Pig encounters a DUMP or STORE, Pig will execute the statements.
In this example Pig will validate, but not execute, the LOAD and FOREACH statements.
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
B = FOREACH A GENERATE name;
In this example, Pig will validate and then execute the LOAD, FOREACH, and DUMP statements.
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
B = FOREACH A GENERATE name;
DUMP B;
(John)
(Mary)
(Bill)
(Joe)
Map/Reduce jobs for a script are triggered only when a STORE or DUMP is encountered.
In this case, the map phase for the LOAD command will start only when a STORE or DUMP is encountered in the script.
The default execution mode is MapReduce. If the file is on a local path, you have to use local mode for execution (a concrete sketch follows the quoted extract):
pig -x local {pigfilename.pig}
Refer to: https://pig.apache.org/docs/r0.9.1/start.html#execution-modes
Extract from the above link:
Pig has two execution modes or exectypes:
Local Mode - To run Pig in local mode, you need access to a single machine; all files are installed and run using your local host and file system. Specify local mode using the -x flag (pig -x local).
Mapreduce Mode - To run Pig in mapreduce mode, you need access to a Hadoop cluster and HDFS installation. Mapreduce mode is the default mode; you can, but don't need to, specify it using the -x flag (pig OR pig -x mapreduce).
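Applied to the question above, either of these would work (a sketch; the HDFS user directory is hypothetical):

# run against the local filesystem, where PigTestFile actually was
pig -x local script.pig

# or copy the file into HDFS first and use the default mapreduce mode
hdfs dfs -put PigTestFile /user/<your-user>/PigTestFile
pig script.pig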
I am trying to get Nutch 1.11 to execute a crawl. I am using Cygwin to run these commands on Windows 7.
Nutch is running; I get results from running bin/nutch, but I keep getting error messages when I try to run a crawl.
I am getting the following error when I try to execute a crawl with Nutch:
Error running: /cygdrive/c/Users/User5/Documents/Nutch/apache-nutch-1.11/runtime/local/bin/nutch inject TestCrawl/crawldb C:/Users/User5/Documents/Nutch/apache-nutch-1.11/runtime/local/urls/seed.txt
Failed with exit value 127.
I have JAVA_HOME set, and I have altered the hosts file to include 127.0.0.1 as localhost.
I am wondering whether I am passing the right directory, and if maybe that is the problem.
The full printout looks like:
User5#User5-PC /cygdrive/c/Users/User5/Documents/Nutch/apache-nutch-1.11/runtime/local
$ bin/crawl -i -D solr.server.url=http://localhost:8983/solr/ C:/Users/User5/Documents/Nutch/apache-nutch-1.11/runtime/local/urls/ TestCrawl/ 2
Injecting seed URLs
/cygdrive/c/Users/User5/Documents/Nutch/apache-nutch-1.11/runtime/local/bin/nutch inject TestCrawl//crawldb C:/Users/User5/Documents/Nutch/apache-nutch-1.11/runtime/local/urls/
Injector: starting at 2015-12-23 17:48:21
Injector: crawlDb: TestCrawl/crawldb
Injector: urlDir: C:/Users/User5/Documents/Nutch/apache-nutch-1.11/runtime/local/urls
Injector: Converting injected urls to crawl db entries.
Injector: java.lang.NullPointerException
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:445)
at org.apache.hadoop.util.Shell.run(Shell.java:418)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:739)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:722)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:633)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:421)
at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:281)
at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:125)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:348)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1285)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1282)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1282)
at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:562)
at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:557)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:557)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:548)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:833)
at org.apache.nutch.crawl.Injector.inject(Injector.java:323)
at org.apache.nutch.crawl.Injector.run(Injector.java:379)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.crawl.Injector.main(Injector.java:369)
Error running:
/cygdrive/c/Users/User5/Documents/Nutch/apache-nutch-1.11/runtime/local/bin/nutch inject TestCrawl//crawldb C:/Users/User5/Documents/Nutch/apache-nutch-1.11/runtime/local/urls/
Failed with exit value 127.
The Hadoop log entry that I think may have something to do with the error I am getting is:
2016-01-07 12:24:40,360 ERROR util.Shell - Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:326)
at org.apache.hadoop.util.GenericOptionsParser.preProcessForWindows(GenericOptionsParser.java:432)
at org.apache.hadoop.util.GenericOptionsParser.parseGeneralOptions(GenericOptionsParser.java:478)
at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:170)
at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:153)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:64)
at org.apache.nutch.crawl.Injector.main(Injector.java:369)
2016-01-07 12:24:40,450 ERROR crawl.Injector - Injector: java.lang.IllegalArgumentException: java.net.URISyntaxException: Illegal character in scheme name at index 15: solr.server.url=http://localhost:8983/solr
at org.apache.hadoop.fs.Path.initialize(Path.java:206)
at org.apache.hadoop.fs.Path.<init>(Path.java:172)
at org.apache.nutch.crawl.Injector.run(Injector.java:379)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.crawl.Injector.main(Injector.java:369)
Caused by: java.net.URISyntaxException: Illegal character in scheme name at index 15: solr.server.url=http://localhost:8983/solr
at java.net.URI$Parser.fail(URI.java:2848)
at java.net.URI$Parser.checkChars(URI.java:3021)
at java.net.URI$Parser.parse(URI.java:3048)
at java.net.URI.<init>(URI.java:746)
at org.apache.hadoop.fs.Path.initialize(Path.java:203)
... 4 more
You are running Linux commands from Cygwin, and there is no C:\ path on Linux systems. The correct command should be something like:
/cygdrive/c/Users/User5/Documents/Nutch/apache-nutch-1.11/runtime/local/bin/nutch inject TestCrawl/crawldb /cygdrive/c/Users/User5/Documents/Nutch/apache-nutch-1.11/runtime/local/urls/seed.txt
You have the answer to your problem in this message:
2016-01-07 12:24:40,360 ERROR util.Shell - Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
This is happening because the Hadoop version included with Nutch 1.11 is designed to work on Linux out of the box, not on Windows.
I had the same situation and ended up running Nutch 1.11 in an Ubuntu VirtualBox VM.
The hadoop-core jar file is needed when you are working with Nutch.
The hadoop-core jar compatible with Nutch 1.11 is 0.20.0; please download the jar from this link:
http://www.java2s.com/Code/Jar/h/Downloadhadoop0200corejar.htm
Paste that jar into the "C:\cygwin64\home\apache-nutch-1.11\lib" folder and it will run successfully.
The problem is pretty clear: according to your Hadoop log, it cannot find the winutils.exe file. Include winutils.exe in the %HADOOP_HOME%\bin folder, as sketched below.
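A minimal sketch of that fix from a Cygwin shell, assuming you unpacked winutils into C:\hadoop (a hypothetical location):

# HADOOP_HOME\bin must contain winutils.exe, i.e. C:\hadoop\bin\winutils.exe
export HADOOP_HOME='C:\hadoop'
export PATH="$PATH:/cygdrive/c/hadoop/bin"
# re-run the crawl from the same shell so the variables are inherited
bin/crawl -i -D solr.server.url=http://localhost:8983/solr/ urls/ TestCrawl/ 2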
I am creating an Oozie workflow for a Hive CREATE TABLE command.
I have added hive-site.xml to the HDFS location.
I am getting the below error:
Launcher ERROR, reason: Main class [org.apache.oozie.action.hadoop.HiveMain], main() threw exception, com/facebook/fb303/FacebookService$Iface
java.lang.NoClassDefFoundError: com/facebook/fb303/FacebookService$Iface
at java.lang.ClassLoader.defineClass1(Native Method)
This might be because the Thrift jar is missing, or because of a version mismatch. One way to supply the jar is sketched below.
Refer to the following:
Error while executing program with Hive JDBC
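The com.facebook.fb303 classes ship in Thrift's fb303 jar (libfb303), which the Hive metastore client depends on. One common remedy is to make that jar visible to the Oozie action; a sketch with hypothetical paths, and a jar version you should match to your Hive build:

# option 1: ship the jar with the workflow itself
hdfs dfs -put libfb303-0.9.3.jar /user/$USER/hive-wf/lib/

# option 2: add it to the Oozie Hive sharelib and refresh it
hdfs dfs -put libfb303-0.9.3.jar /user/oozie/share/lib/lib_<timestamp>/hive/
oozie admin -sharelibupdate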