I've been trying to use CurrentTime() on the sandbox provided by Hortonworks and can't get it to work.
This is all I have in the Pig script:
<code>
REGISTER zookeeper.jar
REGISTER piggybank.jar
REGISTER hbase-common-0.98.4.2.2.0.0-2041-hadoop2.jar
REGISTER hbase-common-0.98.4.2.2.0.0-2041-hadoop2-tests.jar
REGISTER hbase-client-0.98.4.2.2.0.0-2041-hadoop2.jar
REGISTER guava.jar
a = CurrentTime();
dump a;
</code>
The error I see in the logs says:
15/04/13 21:23:22 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
15/04/13 21:23:22 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
15/04/13 21:23:22 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
2015-04-13 21:23:22,428 [main] INFO org.apache.pig.Main - Apache Pig version 0.14.0.2.2.0.0-2041 (rexported) compiled Nov 19 2014, 15:24:46
2015-04-13 21:23:22,429 [main] INFO org.apache.pig.Main - Logging error messages to: /hadoop/yarn/local/usercache/hue/appcache/application_1428957295391_0006/container_1428957295391_0006_01_000002/pig_1428960202427.log
2015-04-13 21:23:24,041 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /home/yarn/.pigbootup not found
2015-04-13 21:23:24,615 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://sandbox.hortonworks.com:8020
2015-04-13 21:23:27,864 [main] ERROR org.apache.pig.PigServer - exception during parsing: Error during parsing. <file script.pig, line 10> Cannot expand macro 'CurrentTime'. Reason: Macro must be defined before expansion.
Failed to parse: <file script.pig, line 10> Cannot expand macro 'CurrentTime'. Reason: Macro must be defined before expansion.
at org.apache.pig.parser.PigMacro.macroInline(PigMacro.java:455)
at org.apache.pig.parser.QueryParserDriver.inlineMacro(QueryParserDriver.java:301)
at org.apache.pig.parser.QueryParserDriver.expandMacro(QueryParserDriver.java:290)
at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:183)
at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1735)
at org.apache.pig.PigServer$Graph.access$000(PigServer.java:1443)
at org.apache.pig.PigServer.parseAndBuild(PigServer.java:387)
at org.apache.pig.PigServer.executeBatch(PigServer.java:412)
at org.apache.pig.PigServer.executeBatch(PigServer.java:398)
at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:171)
at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:741)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:372)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:230)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:205)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81)
at org.apache.pig.Main.run(Main.java:495)
at org.apache.pig.Main.main(Main.java:170)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
2015-04-13 21:23:27,871 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <file script.pig, line 10> Cannot expand macro 'CurrentTime'. Reason: Macro must be defined before expansion.
Details at logfile: /hadoop/yarn/local/usercache/hue/appcache/application_1428957295391_0006/container_1428957295391_0006_01_000002/pig_1428960202427.log
2015-04-13 21:23:27,908 [main] INFO org.apache.pig.Main - Pig script completed in 5 seconds and 684 milliseconds (5684 ms)
I included those REGISTER lines since I thought maybe the UDFs somehow weren't included. I have no idea what to do now.
I think you have to use the fully qualified name of the CurrentTime UDF. I have an example below where I am converting a given time into Unix date format.
register '/usr/lib/pig/piggybank.jar' ;
DEFINE ISOToUnix org.apache.pig.piggybank.evaluation.datetime.convert.ISOToUnix();
date_filter = FOREACH parsed_log GENERATE ISOToUnix(date) AS unixTime:long;
STORE date_filter INTO '/root/pig/output/parselogdate/';
I think for your application you have to use the below; just try it and see if it works. CurrentTime() is evaluated for each input tuple and returns the current time as its output.
-- users is a file with tuples in it
a = LOAD 'users';
b = FOREACH a GENERATE org.apache.pig.builtin.CurrentTime();
dump b;
I tried executing the above and got output. You can also find the Javadoc for CurrentTime.
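If you want to sanity-check the fully qualified name quickly, here is a minimal one-liner from the shell (a sketch only; it assumes a file named users exists in your HDFS home directory, or add -x local to read a local file instead):
# run the statements directly with pig -e, using the fully qualified builtin name
pig -e "a = LOAD 'users'; b = FOREACH a GENERATE org.apache.pig.builtin.CurrentTime(); dump b;"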
I have tried updating my POM from v8.6.1 to v8.7.2 and in the process successfully re-created a sample repo with the new version's preload tool.
Although I have not altered my Java code at all (which runs perfectly with v8.6.1), now I get an error when trying to retrieve the repository from the manager with the following command:
repository = repositoryManager.getRepository(repositoryId);
The error is the following:
197822 [main] INFO com.ontotext.plugin.magic-predicates - Registering InverseMagicPredicate: http://jena.hpl.hp.com/ARQ/property#strSplit
197823 [main] INFO com.ontotext.trree.sdk.impl.PluginManager - Initializing plugin 'literals-index'
198002 [main] INFO com.ontotext.plugin.literals-index - Literals indices restored.
198003 [main] INFO com.ontotext.trree.sdk.impl.PluginManager - Initializing plugin 'geospatial'
198009 [main] INFO com.ontotext.trree.plugin.geo.GeoSpatialPlugin - Plugin:geospatial initialized
198010 [main] INFO com.ontotext.trree.sdk.impl.PluginManager - Initializing plugin 'sparql-mm'
198400 [main] INFO com.ontotext.graphdb.sparqlmm.FunctionLoader - Registered 48 functions from package com.github.tkurz.sparqlmm.function.
198400 [main] INFO com.ontotext.trree.sdk.impl.PluginManager - Initializing plugin 'dependencies-plugin'
198409 [main] INFO com.ontotext.trree.sdk.impl.PluginManager - Initializing plugin 'similarity'
198429 [main] INFO com.ontotext.trree.sdk.impl.PluginManager - Initializing plugin 'GeoSPARQL'
231881 [main] INFO com.ontotext.trree.geosparql.FunctionLoader - Registered 50 functions from package com.useekm.geosparql.
231882 [main] INFO com.ontotext.trree.sdk.impl.PluginManager - Initializing plugin 'lucene-connector'
231896 [main] ERROR com.ontotext.trree.sdk.impl.PluginManager - Plugin 'lucene-connector' failed to initialize:org/json/simple/parser/ParseException
231897 [main] INFO com.ontotext.trree.sdk.impl.PluginManager - Initializing plugin 'rdfrank'
232224 [main] INFO com.ontotext.trree.sdk.impl.PluginManager - Initializing plugin 'notifications'
232237 [main] ERROR com.ontotext.trree.free.GraphDBFreeSchemaRepository - Error initializing plugins:
java.lang.NullPointerException
at com.ontotext.trree.plugin.externalsync.ExternalSyncPlugin.shutdown(ExternalSyncPlugin.java:803)
at com.ontotext.trree.sdk.PluginBase.shutdown(PluginBase.java:100)
at com.ontotext.trree.sdk.impl.PluginManager.disablePluginInt(PluginManager.java:986)
at com.ontotext.trree.sdk.impl.PluginManager.removePlugin(PluginManager.java:361)
at com.ontotext.trree.sdk.impl.PluginManager.initialize(PluginManager.java:128)
at com.ontotext.trree.OwlimSchemaRepository.initPlugins(OwlimSchemaRepository.java:1979)
at com.ontotext.trree.OwlimSchemaRepository.initializeInternal(OwlimSchemaRepository.java:242)
at org.eclipse.rdf4j.sail.helpers.AbstractSail.initialize(AbstractSail.java:188)
at org.eclipse.rdf4j.repository.sail.SailRepository.initializeInternal(SailRepository.java:151)
at org.eclipse.rdf4j.repository.base.AbstractRepository.initialize(AbstractRepository.java:34)
at org.eclipse.rdf4j.repository.manager.LocalRepositoryManager.createRepository(LocalRepositoryManager.java:270)
at org.eclipse.rdf4j.repository.manager.RepositoryManager.getRepository(RepositoryManager.java:424)
I have specified the -Dregister-external-plugins=.... in the VM Options.
Any ideas what might be wrong? Should I go for a previous version and if so, which one?
Thanks
It looks like you have an incompatible Lucene connector configuration. I recommend deleting the Lucene connector directory; once the repository starts, you can recreate the connector(s). The Lucene connector directory is located in the repository's data directory: <graphdb-data-dir>/repositories/<repository-id>/storage/lucene-connector. The easiest way to find <graphdb-data-dir> is to look at the startup messages of GraphDB, where it will print something like:
GraphDB Data directory: /opt/test/graphdb-free-8.7.2/data
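For example, a rough sketch of the cleanup, using the data directory from the message above and a placeholder <repository-id> (stop GraphDB before deleting, then restart it afterwards):
# stop GraphDB first, then remove the stale Lucene connector data
rm -rf /opt/test/graphdb-free-8.7.2/data/repositories/<repository-id>/storage/lucene-connector
# start GraphDB again and recreate the connector(s)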
As Konstantin mentioned, the problem might also have to do with register-external-plugins.
I'm using Apache Nutch 1.13 and Solr 6.6.0.
I'm running the following command to crawl the content:
bin/crawl -i -D solr.server.url=http://localhost:8983/solr/nutch urls/seed.txt TestCrawl 2
I got this exception:
Indexer: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:865)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:147)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:230)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:239)
Error running:
/Users/myedlapalli/documents/nutch-solr-3/apache-nutch-1.13/runtime/local/bin/nutch index -Dsolr.server.url=http://localhost:8983/solr/nutch TestCrawl/crawldb -linkdb TestCrawl/linkdb TestCrawl/segments/20171017090519
Failed with exit value 255.
And in the logs:
2017-10-17 09:36:35,032 INFO solr.SolrIndexWriter - Indexing 1/1 documents
2017-10-17 09:36:35,032 INFO solr.SolrIndexWriter - Deleting 0 documents
2017-10-17 09:36:35,161 INFO solr.SolrIndexWriter - Indexing 1/1 documents
2017-10-17 09:36:35,161 INFO solr.SolrIndexWriter - Deleting 0 documents
2017-10-17 09:36:35,174 WARN mapred.LocalJobRunner - job_local193014604_0001
java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/nutch: ERROR: [doc=http://www.cmo.com/features/articles/2017/8/21/5-emerging-technologies-rewrite-the-media-and-entertainment-script-.html] unknown field 'sp_type'
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/nutch: ERROR: [doc=http://www.cmo.com/features/articles/2017/8/21/5-emerging-technologies-rewrite-the-media-and-entertainment-script-.html] unknown field 'sp_type'
at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:576)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:240)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:229)
at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1219)
at org.apache.nutch.indexwriter.solr.SolrIndexWriter.push(SolrIndexWriter.java:210)
at org.apache.nutch.indexwriter.solr.SolrIndexWriter.commit(SolrIndexWriter.java:188)
at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:179)
at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:117)
at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:44)
at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.close(ReduceTask.java:502)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:456)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2017-10-17 09:36:36,109 ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:865)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:147)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:230)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:239)
Can someone please help me?
Thanks in advance.
In these cases it is usually a good idea to check the logs on the Solr side, but for this particular error you already have the answer, especially this portion:
ERROR: [doc=http://www.cmo.com/features/articles/2017/8/21/5-emerging-technologies-rewrite-the-media-and-entertainment-script-.html] unknown field 'sp_type'
at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:576)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:240)
Solr is complaining that you're sending a document (with id http://www.cmo.com/features/articles/2017/8/21/5...) with a field that is not defined in the schema: sp_type.
You should check what you're sending in that field, or just add the field to the Solr schema.
Keep in mind that if you have more fields that are not defined in the Solr schema, this error will continue to appear. It is usually a good idea to run the bin/nutch indexchecker <URL> command to see what Nutch is going to send to Solr.
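If your nutch core uses the default managed schema, a hedged sketch of adding the missing field through the Schema API looks like this (the string field type is only an assumption; pick whatever type actually fits your data):
# add the missing sp_type field to the nutch core's managed schema
curl -X POST -H 'Content-type:application/json' \
  --data-binary '{"add-field": {"name": "sp_type", "type": "string", "indexed": true, "stored": true}}' \
  http://localhost:8983/solr/nutch/schema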
The first thing I can tell you is to follow the crawling process with the individual bin/nutch commands instead of the crawl script; https://cwiki.apache.org/confluence/display/nutch/NutchTutorial gives you more details.
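As a rough sketch of that step-by-step flow (the crawl directory names and the urls seed directory are placeholders; the tutorial linked above has the full details):
bin/nutch inject TestCrawl/crawldb urls
bin/nutch generate TestCrawl/crawldb TestCrawl/segments
s1=$(ls -d TestCrawl/segments/2* | tail -1)   # newest segment
bin/nutch fetch "$s1"
bin/nutch parse "$s1"
bin/nutch updatedb TestCrawl/crawldb "$s1"
bin/nutch invertlinks TestCrawl/linkdb -dir TestCrawl/segments
bin/nutch index -Dsolr.server.url=http://localhost:8983/solr/nutch TestCrawl/crawldb -linkdb TestCrawl/linkdb "$s1"
bin/nutch indexchecker http://www.example.com/   # preview what would be sent to Solr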
I am using Apache Pig. I am trying to load a comma-separated file as a Pig table. It does not throw any error while loading the file.
But when I try to print that table using the "dump" command, it gives an error.
File I loaded
Error,fdgdf
Error,dfgdf
Error,dfgdf
Info,dfgdf
Info,dfgdf
Info,dfgdf
Info,dfgdf
Info,dfgdf
Info,dfgdf
Debug,dfgdf
Debug,dfgdf
Debug,dfgdf
Debug,dfgdf
Debug,dfgdf
Debug,dfgdf
Command to load
logFile1 = LOAD 'PigTestFile' using PigStorage();
Command to print table
dump logFile1
Error I get
Failed Jobs:
JobId Alias Feature Message Outputs
job_1454617624671_0152 logFile1 MAP_ONLY Message: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Input path does not exist: hdfs://ip-172-31-53-48.ec2.internal:8020/user/e1681fe26eed362777aabca1682510/PigTestFile
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:279)
at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
at org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob.submit(ControlledJob.java:335)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.pig.backend.hadoop23.PigJobControl.submit(PigJobControl.java:128)
at org.apache.pig.backend.hadoop23.PigJobControl.run(PigJobControl.java:194)
at java.lang.Thread.run(Thread.java:745)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$1.run(MapReduceLauncher.java:276)
Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://ip-172-31-53-48.ec2.internal:8020/user/e1681fe26eed362777aabca1682510/PigTestFile
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:323)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTextInputFormat.listStatus(PigTextInputFormat.java:36)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:265)
... 18 more
hdfs://ip-172-31-53-48.ec2.internal:8020/tmp/temp1258481141/tmp-1928081547,
:
:
2016-02-07 06:31:20,100 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
2016-02-07 06:31:20,107 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias logFile1. Backend error : java.lang.IllegalStateException: Job in state DEFINE instead of RUNNING
[EDIT]
When I read the log closely I found that it is not able to find the file which was used to load the table. It expects it to be in HDFS, whereas my file was on the local box.
I then moved the file into HDFS and ran the same commands. It worked well.
But then why did it not give an error while executing the LOAD command itself?
As explained by Murali in his answer (which I have accepted), Map/Reduce jobs for a script get triggered only when a STORE/DUMP is encountered.
Here is more explanation about it from the Apache Pig documentation:
In general, Pig processes Pig Latin statements as follows:
First, Pig validates the syntax and semantics of all statements.
Next, if Pig encounters a DUMP or STORE, Pig will execute the statements.
In this example Pig will validate, but not execute, the LOAD and FOREACH statements.
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
B = FOREACH A GENERATE name;
In this example, Pig will validate and then execute the LOAD, FOREACH, and DUMP statements.
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
B = FOREACH A GENERATE name;
DUMP B;
(John)
(Mary)
(Bill)
(Joe)
Map/Reduce jobs for a script get triggered only when a STORE/DUMP is encountered.
In this case, the map phase for the LOAD command will start only when a STORE/DUMP is encountered in the script.
The default execution mode is mapreduce. If the file is on a local path, you have to use local mode for execution.
pig -x local {pigfilename.pig}
Refer : https://pig.apache.org/docs/r0.9.1/start.html#execution-modes
Extract from the above link:
Pig has two execution modes or exectypes:
Local Mode - To run Pig in local mode, you need access to a single
machine; all files are installed and run using your local host and
file system. Specify local mode using the -x flag (pig -x local).
Mapreduce Mode - To run Pig in mapreduce mode, you need access to a
Hadoop cluster and HDFS installation. Mapreduce mode is the default
mode; you can, but don't need to, specify it using the -x flag (pig OR
pig -x mapreduce).
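For example, a hedged sketch of the two options for this case (myscript.pig is just a placeholder for a script containing the LOAD and DUMP statements above, and the HDFS user directory is whatever your user's home directory happens to be):
# Option 1: keep the default mapreduce mode, but copy the file into HDFS first
hdfs dfs -put PigTestFile /user/<your-user>/PigTestFile
pig myscript.pig

# Option 2: leave the file on the local filesystem and run Pig in local mode
pig -x local myscript.pig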
I am trying to get Nutch 1.11 to execute a crawl. I am using Cygwin to run these commands in Windows 7.
Nutch is running, I am getting results from running bin/nutch, but I keep getting error messages when I try to run a crawl.
I am getting the following error when I try to execute a crawl with Nutch:
Error running: /cygdrive/c/Users/User5/Documents/Nutch/apache-nutch-1.11/runtime/local/bin/nutch inject TestCrawl/crawldb C:/Users/User5/Documents/Nutch/apache-nutch-1.11/runtime/local/urls/seed.txt
Failed with exit value 127.
I have my JAVA_HOME set, and I have altered the hosts file to include 127.0.0.1 as the localhost.
I am wondering whether I am referencing the right directory, and whether that might be the problem.
The full printout looks like:
User5#User5-PC /cygdrive/c/Users/User5/Documents/Nutch/apache-nutch-1.11/runtime/local
$ bin/crawl -i -D solr.server.url=http://localhost:8983/solr/ C:/Users/User5/Documents/Nutch/apache-nutch-1.11/runtime/local/urls/ TestCrawl/ 2
Injecting seed URLs
/cygdrive/c/Users/User5/Documents/Nutch/apache-nutch-1.11/runtime/local/bin/nutch inject TestCrawl//crawldb C:/Users/User5/Documents/Nutch/apache-nutch-1.11/runtime/local/urls/
Injector: starting at 2015-12-23 17:48:21
Injector: crawlDb: TestCrawl/crawldb
Injector: urlDir: C:/Users/User5/Documents/Nutch/apache-nutch-1.11/runtime/local/urls
Injector: Converting injected urls to crawl db entries.
Injector: java.lang.NullPointerException
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:445)
at org.apache.hadoop.util.Shell.run(Shell.java:418)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:739)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:722)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:633)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:421)
at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:281)
at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:125)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:348)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1285)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1282)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1282)
at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:562)
at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:557)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:557)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:548)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:833)
at org.apache.nutch.crawl.Injector.inject(Injector.java:323)
at org.apache.nutch.crawl.Injector.run(Injector.java:379)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.crawl.Injector.main(Injector.java:369)
Error running:
/cygdrive/c/Users/User5/Documents/Nutch/apache-nutch-1.11/runtime/local/bin/nutch inject TestCrawl//crawldb C:/Users/User5/Documents/Nutch/apache-nutch-1.11/runtime/local/urls/
Failed with exit value 127.
The hadoop log that I think may have something to do with the error I am getting is:
2016-01-07 12:24:40,360 ERROR util.Shell - Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:326)
at org.apache.hadoop.util.GenericOptionsParser.preProcessForWindows(GenericOptionsParser.java:432)
at org.apache.hadoop.util.GenericOptionsParser.parseGeneralOptions(GenericOptionsParser.java:478)
at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:170)
at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:153)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:64)
at org.apache.nutch.crawl.Injector.main(Injector.java:369)
2016-01-07 12:24:40,450 ERROR crawl.Injector - Injector: java.lang.IllegalArgumentException: java.net.URISyntaxException: Illegal character in scheme name at index 15: solr.server.url=http://localhost:8983/solr
at org.apache.hadoop.fs.Path.initialize(Path.java:206)
at org.apache.hadoop.fs.Path.<init>(Path.java:172)
at org.apache.nutch.crawl.Injector.run(Injector.java:379)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.crawl.Injector.main(Injector.java:369)
Caused by: java.net.URISyntaxException: Illegal character in scheme name at index 15: solr.server.url=http://localhost:8983/solr
at java.net.URI$Parser.fail(URI.java:2848)
at java.net.URI$Parser.checkChars(URI.java:3021)
at java.net.URI$Parser.parse(URI.java:3048)
at java.net.URI.<init>(URI.java:746)
at org.apache.hadoop.fs.Path.initialize(Path.java:203)
... 4 more
You are running Linux commands from Cygwin, and there is no C:\ path on Linux systems. The correct command should be something like:
/cygdrive/c/Users/User5/Documents/Nutch/apache-nutch1.11/runtime/local/bin/nutch inject TestCrawl/crawldb /cygdrive/c/Users/User5/Documents/Nutch/apache-nutch1.11/runtime/local/urls/seed.txt
You have the answer to your problem in this message:
2016-01-07 12:24:40,360 ERROR util.Shell - Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
This is happening because the Hadoop version included with Nutch 1.11 is designed to work on Linux out of the box, not on Windows.
I had the same situation and ended up using Nutch 1.11 in an Ubuntu VirtualBox VM.
The hadoop-core jar file is needed when you are working with Nutch.
With Nutch 1.11 the compatible hadoop-core jar is 0.20.0.
Please download the jar from this link:
http://www.java2s.com/Code/Jar/h/Downloadhadoop0200corejar.htm
Paste that jar into the "C:\cygwin64\home\apache-nutch-1.11\lib" folder and it will run successfully.
The problem is pretty clear. According to your Hadoop log, it cannot find the winutils.exe file. Include winutils.exe in the %HADOOP_HOME%/bin folder.
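A minimal sketch of that from Cygwin, assuming winutils.exe has already been downloaded for the bundled Hadoop version and that you want Hadoop under C:\hadoop (both are assumptions; adjust the paths to your setup):
# use the Windows-style path so Hadoop can resolve %HADOOP_HOME%\bin\winutils.exe
export HADOOP_HOME='C:\hadoop'
mkdir -p /cygdrive/c/hadoop/bin
cp ~/Downloads/winutils.exe /cygdrive/c/hadoop/bin/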
I am trying to run the latest version of the Apache Giraph examples, described on the quickstart page (http://giraph.apache.org/quick_start.html). I am using CDH 4.4.0 (the Cloudera distribution of Hadoop).
I have built Giraph with the dependencies updated to CDH 4.4.0. Everything went OK.
When I run the examples I get the following output:
-bash-4.1$ hadoop jar /usr/local/giraph/giraph-examples/target/giraph-examples-1.1.0-SNAPSHOT-for-hadoop-2.0.0-cdh4.4.0-jar-with-dependencies.jar
org.apache.giraph.GiraphRunner
org.apache.giraph.examples.SimpleShortestPathsComputation
-vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat
-vip /user/hdfs/input/tiny_graph.txt
-vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat
-op /user/hdfs/output/shortestpaths -w 1
13/10/02 18:31:58 INFO utils.ConfigurationUtils: No edge input format specified. Ensure your InputFormat does not require one.
13/10/02 18:31:58 INFO utils.ConfigurationUtils: No edge output format specified. Ensure your OutputFormat does not require one.
13/10/02 18:31:58 INFO job.GiraphJob: run: Since checkpointing is disabled (default), do not allow any task retries (setting mapred.map.max.attempts = 0, old value = 4)
13/10/02 18:31:58 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/10/02 18:32:00 INFO job.GiraphJob: run: Tracking URL: http://hadoop57:50030/jobdetails.jsp?jobid=job_201310021452_0015
13/10/02 18:32:22 INFO mapred.JobClient: Running job: job_201310021452_0015
13/10/02 18:32:22 INFO mapred.JobClient: Job complete: job_201310021452_0015
13/10/02 18:32:22 INFO mapred.JobClient: Counters: 6
13/10/02 18:32:22 INFO mapred.JobClient: Job Counters
13/10/02 18:32:22 INFO mapred.JobClient: Failed map tasks=1
13/10/02 18:32:22 INFO mapred.JobClient: Launched map tasks=2
13/10/02 18:32:22 INFO mapred.JobClient: Total time spent by all maps in occupied slots (ms)=29054
13/10/02 18:32:22 INFO mapred.JobClient: Total time spent by all reduces in occupied slots (ms)=0
13/10/02 18:32:22 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
13/10/02 18:32:22 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
and the job log shows an exception:
java.lang.IllegalStateException: run: Caught an unrecoverable exception java.io.FileNotFoundException: File _bsp/_defaultZkManagerDir/job_201310021452_0015/_zkServer does not exist.
at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:101)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at org.apache.hadoop.mapred.Child.main(Child.java:262)
Caused by: java.lang.RuntimeException: java.io.FileNotFoundException: File _bsp/_defaultZkManagerDir/job_201310021452_0015/_zkServer does not exist.
at org.apache.giraph.zk.ZooKeeperManager.onlineZooKeeperServers(ZooKeeperManager.java:792)
at org.apache.giraph.graph.GraphTaskManager.startZooKeeperManager(GraphTaskManager.java
The file _bsp/_defaultZkManagerDir/job_201310021452_0015/_zkServer sometimes gets created and sometimes not.
Could you please give any hints on where to start hunting for this issue?
BR
Konrad
It looks like Giraph is starting its own ZooKeeper session. Just try passing the following as a VM argument to the GiraphRunner:
-Dgiraph.zkList=<zookeeper server address>:<port>
e.g.
-Dgiraph.zkList=localhost:2181
Your command will look something like this:
-bash-4.1$ hadoop jar /usr/local/giraph/giraph-examples/target/giraph-examples-1.1.0-SNAPSHOT-for-hadoop-2.0.0-cdh4.4.0-jar-with-dependencies.jar
org.apache.giraph.GiraphRunner
org.apache.giraph.examples.SimpleShortestPathsComputation
-Dgiraph.zkList=localhost:2181
-vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat
-vip /user/hdfs/input/tiny_graph.txt
-vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat
-op /user/hdfs/output/shortestpaths -w 1
Best of luck!