Apache Giraph cannot run on CDH4.4.0

Apache Giraph cannot run on CDH4.4.0 - apache

I try to run latest version of apache giraph examples, describe on the quickstart page (http://giraph.apache.org/quick_start.html). I use CDH 4.4.0 (Cloudera distribution of Hadoop)
I have built Giraph with the dependecies updated to CDH 4.4.0. Everything went ok
When I run the examples I got following output
-bash-4.1$ hadoop jar /usr/local/giraph/giraph-examples/target/giraph-examples-1.1.0- SNAPSHOT-for-hadoop-2.0.0-cdh4.4.0-jar-with-dependencies.jar
org.apache.giraph.GiraphRunner
org.apache.giraph.examples.SimpleShortestPathsComputation
-vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat
-vip /user/hdfs/input/tiny_graph.txt
-vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat
-op /user/hdfs/output/shortestpaths -w 1
13/10/02 18:31:58 INFO utils.ConfigurationUtils: No edge input format specified. Ensure your InputFormat does not require one.
13/10/02 18:31:58 INFO utils.ConfigurationUtils: No edge output format specified. Ensure your OutputFormat does not require one.
13/10/02 18:31:58 INFO job.GiraphJob: run: Since checkpointing is disabled (default), do not allow any task retries (setting mapred.map.max.attempts = 0, old value = 4)
13/10/02 18:31:58 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/10/02 18:32:00 INFO job.GiraphJob: run: Tracking URL: http://hadoop57:50030/jobdetails.jsp?jobid=job_201310021452_0015
13/10/02 18:32:22 INFO mapred.JobClient: Running job: job_201310021452_0015
13/10/02 18:32:22 INFO mapred.JobClient: Job complete: job_201310021452_0015
13/10/02 18:32:22 INFO mapred.JobClient: Counters: 6
13/10/02 18:32:22 INFO mapred.JobClient: Job Counters
13/10/02 18:32:22 INFO mapred.JobClient: Failed map tasks=1
13/10/02 18:32:22 INFO mapred.JobClient: Launched map tasks=2
13/10/02 18:32:22 INFO mapred.JobClient: Total time spent by all maps in occupied slots (ms)=29054
13/10/02 18:32:22 INFO mapred.JobClient: Total time spent by all reduces in occupied slots (ms)=0
13/10/02 18:32:22 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
13/10/02 18:32:22 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
and the job log shows exception:
java.lang.IllegalStateException: run: Caught an unrecoverable exception
java.io.FileNotFoundException: File
_bsp/_defaultZkManagerDir/job_201310021452_0015/_zkServer does not exist.
at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:101)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at org.apache.hadoop.mapred.Child.main(Child.java:262)
Caused by: java.lang.RuntimeException: java.io.FileNotFoundException: File
_bsp/_defaultZkManagerDir/job_201310021452_0015/_zkServer does not exist.
at org.apache.giraph.zk.ZooKeeperManager.onlineZooKeeperServers(ZooKeeperManager.java:792)
at org.apache.giraph.graph.GraphTaskManager.startZooKeeperManager(GraphTaskManager.java
The file _bsp/_defaultZkManagerDir/job_201310021452_0015/_zkServer sometimes gets created and sometimes not.
Could you please give any hints where to start hunting for this issue.
BR
Konrad

Looks like Giraph is starting it's own zookeeper session. Just try passing the following as a VM argument to the GiraphRunner.
-Dgiraph.zkList=<zookeeper server address>:<port>
e.g.
-Dgiraph.zkList=localhost:2181
Your command will look something like this:
-bash-4.1$ hadoop jar /usr/local/giraph/giraph-examples/target/giraph-examples-1.1.0- SNAPSHOT-for-hadoop-2.0.0-cdh4.4.0-jar-with-dependencies.jar
org.apache.giraph.GiraphRunner
org.apache.giraph.examples.SimpleShortestPathsComputation
-Dgiraph.zkList=localhost:2181
-vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat
-vip /user/hdfs/input/tiny_graph.txt
-vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat
-op /user/hdfs/output/shortestpaths -w 1
Best luck..!!

Related

Selenium Chrome nodes fail to stop the processes after execution

I'm using the SeleniumGrid in the most recent version 4.1.2 in a Kubernetes cluster.
In many cases (I would say in about half) when I execute a test through the grid, the node fails to kill the processes and does not go back to being idle. The container then keeps using one full CPU all the time until I kill it manually.
The log in the container is the following:
10:51:34.781 INFO [NodeServer$1.lambda$start$1] - Sending registration event...
10:51:35.680 INFO [NodeServer.lambda$createHandlers$2] - Node has been added
Starting ChromeDriver 98.0.4758.102 (273bf7ac8c909cde36982d27f66f3c70846a3718-refs/branch-heads/4758#{#1151}) on port 39592
Only local connections are allowed.
Please see https://chromedriver.chromium.org/security-considerations for suggestions on keeping ChromeDriver safe.
[1C6h4r6o1m2e9D1r2i3v.e9r8 7w]a[sS EsVtEaRrEt]e:d bsiuncdc(e)s sffauillleyd.:
Cannot assign requested address (99)
11:08:24.970 WARN [SeleniumSpanExporter$1.lambda$export$0] - {"traceId": "99100300a4e6b4fe2afe5891b50def09","eventTime": 1646129304968456597,"eventName": "No slot matched the requested capabilities. ","attributes"
11:08:44.672 INFO [OsProcess.destroy] - Unable to drain process streams. Ignoring but the exception being swallowed follows.
org.apache.commons.exec.ExecuteException: The stop timeout of 2000 ms was exceeded (Exit value: -559038737)
at org.apache.commons.exec.PumpStreamHandler.stopThread(PumpStreamHandler.java:295)
at org.apache.commons.exec.PumpStreamHandler.stop(PumpStreamHandler.java:180)
at org.openqa.selenium.os.OsProcess.destroy(OsProcess.java:135)
at org.openqa.selenium.os.CommandLine.destroy(CommandLine.java:152)
at org.openqa.selenium.remote.service.DriverService.stop(DriverService.java:281)
at org.openqa.selenium.grid.node.config.DriverServiceSessionFactory.apply(DriverServiceSessionFactory.java:183)
at org.openqa.selenium.grid.node.config.DriverServiceSessionFactory.apply(DriverServiceSessionFactory.java:65)
at org.openqa.selenium.grid.node.local.SessionSlot.apply(SessionSlot.java:143)
at org.openqa.selenium.grid.node.local.LocalNode.newSession(LocalNode.java:314)
at org.openqa.selenium.grid.node.NewNodeSession.execute(NewNodeSession.java:52)
at org.openqa.selenium.remote.http.Route$TemplatizedRoute.handle(Route.java:192)
at org.openqa.selenium.remote.http.Route.execute(Route.java:68)
at org.openqa.selenium.grid.security.RequiresSecretFilter.lambda$apply$0(RequiresSecretFilter.java:64)
at org.openqa.selenium.remote.tracing.SpanWrappedHttpHandler.execute(SpanWrappedHttpHandler.java:86)
at org.openqa.selenium.remote.http.Filter$1.execute(Filter.java:64)
at org.openqa.selenium.remote.http.Route$CombinedRoute.handle(Route.java:336)
at org.openqa.selenium.remote.http.Route.execute(Route.java:68)
at org.openqa.selenium.grid.node.Node.execute(Node.java:240)
at org.openqa.selenium.remote.http.Route$CombinedRoute.handle(Route.java:336)
at org.openqa.selenium.remote.http.Route.execute(Route.java:68)
at org.openqa.selenium.remote.AddWebDriverSpecHeaders.lambda$apply$0(AddWebDriverSpecHeaders.java:35)
at org.openqa.selenium.remote.ErrorFilter.lambda$apply$0(ErrorFilter.java:44)
at org.openqa.selenium.remote.http.Filter$1.execute(Filter.java:64)
at org.openqa.selenium.remote.ErrorFilter.lambda$apply$0(ErrorFilter.java:44)
at org.openqa.selenium.remote.http.Filter$1.execute(Filter.java:64)
at org.openqa.selenium.netty.server.SeleniumHandler.lambda$channelRead0$0(SeleniumHandler.java:44)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
11:08:44.673 ERROR [OsProcess.destroy] - Unable to kill process Process[pid=75, exitValue=143]
11:08:44.675 WARN [SeleniumSpanExporter$1.lambda$export$0] - {"traceId": "99100300a4e6b4fe2afe5891b50def09","eventTime": 1646129316638154262,"eventName": "exception","attributes": {"driver.url": "http:\u002f\u002f
Here's an excerpt from the Kubernetes manifest:
- name: selenium-node-chrome
image: selenium/node-chrome:latest
...
env:
- name: TZ
value: Europe/Berlin
- name: START_XVFB
value: "false"
- name: SE_NODE_OVERRIDE_MAX_SESSIONS
value: "true"
- name: SE_NODE_MAX_SESSIONS
value: "1"
envFrom:
- configMapRef:
name: selenium-event-bus-config
...
volumeMounts:
- name: dshm
mountPath: /dev/shm
...
volumes:
- name: dshm
emptyDir:
medium: Memory
The selenium-event-bus-config contains the following vars:
data:
SE_EVENT_BUS_HOST: selenium-hub
SE_EVENT_BUS_PUBLISH_PORT: "4442"
SE_EVENT_BUS_SUBSCRIBE_PORT: "4443"
Did I misconfigure anything? Has anyone any idea how I can fix this?

If you don't need to use Xvfb you can remove it from your code and your problem will be resolved.
Apparently the issue resolves when removing the START_XVFB parameter. With a node with only the timezone config I did not yet have the problem.
For the workaround you can try to change your driver for example to Chromedriver. You can read about the differences between them here.
See also this similar problem.

yarn application accepted but not running cloudera despite resource allocation

I am using a Cloudera quickstart VM 5.13.0.0 to run Spark applications in yarn-client mode. I have allocated 10GB and 3 cores to my Cloudera VM. When I submit the application, the application is ACCEPTED but never moves on to RUNNING. When I try to look for logs using yarn logs -applicationId I do not see anything. Its absolutely blank.
I have looked up this issue on:
here
here
here
here
here
here
here
I have practically meddled with all the configs that these links see a problem with. I still do not have an answer to my problem which on the face of it looks like the ones in the links above. Here are the config parameters of my cloudera cluster:
mapreduce.map.memory.mb 128M
mapreduce.reduce.memory.mb 128M
mapreduce.job.heap.memory-mb.ratio 0.8
yarn.nodemanager.resource.memory-mb 1900M
yarn.nodemanager.resource.percentage-physical-cpu-limit 100
yarn.nodemanager.resource.cpu-vcores 1
yarn.scheduler.minimum-allocation-mb 1M
yarn.scheduler.increment-allocation-mb 100M
yarn.scheduler.maximum-allocation-mb 1600M
yarn.scheduler.minimum-allocation-vcores 1
yarn.scheduler.increment-allocation-vcores 1
yarn.scheduler.maximum-allocation-vcores 2
yarn.scheduler.fair.continuous-scheduling-enabled unchecked
mapreduce.am.max-attempts 1
yarn.resourcemanager.am.max-retries, yarn.resourcemanager.am.max-attempts 1
yarn.app.mapreduce.am.resource.mb 1G
yarn.app.mapreduce.am.resource.cpu-vcores 1
ApplicationMaster Java Maximum Heap Size 512M
yarn.resourcemanager.scheduler.class org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler
yarn.scheduler.fair.user-as-default-queue unchecked
yarn.scheduler.fair.preemption unchecked
yarn.scheduler.fair.preemption.cluster-utilization-threshold 0.8
yarn.scheduler.fair.sizebasedweight unchecked
Fair Scheduler Allocations (deployed) {"defaultFairSharePreemptionThreshold":null,"defaultFairSharePreemptionTimeout":null,"defaultMinSharePreemptionTimeout":null,"defaultQueueSchedulingPolicy":"drf","queueMaxAMShareDefault":-1.0,"queueMaxAppsDefault":null,"queuePlacementRules":[{"create":true,"name":"specified","queue":null,"rules":null},{"create":null,"name":"nestedUserQueue","queue":null,"rules":[{"create":true,"name":"default","queue":"users","rules":null}]},{"create":null,"name":"default","queue":null,"rules":null}],"queues":[{"aclAdministerApps":null,"aclSubmitApps":null,"allowPreemptionFrom":null,"fairSharePreemptionThreshold":null,"fairSharePreemptionTimeout":null,"minSharePreemptionTimeout":null,"name":"root","queues":[{"aclAdministerApps":null,"aclSubmitApps":null,"allowPreemptionFrom":null,"fairSharePreemptionThreshold":null,"fairSharePreemptionTimeout":null,"minSharePreemptionTimeout":null,"name":"default","queues":[],"schedulablePropertiesList":[{"impalaDefaultQueryMemLimit":null,"impalaDefaultQueryOptions":null,"impalaMaxMemory":null,"impalaMaxQueuedQueries":null,"impalaMaxRunningQueries":null,"impalaQueueTimeout":null,"maxAMShare":-1.0,"maxChildResources":null,"maxResources":null,"maxRunningApps":null,"minResources":null,"scheduleName":"default","weight":1.0}],"schedulingPolicy":"drf","type":null},{"aclAdministerApps":null,"aclSubmitApps":null,"allowPreemptionFrom":null,"fairSharePreemptionThreshold":null,"fairSharePreemptionTimeout":null,"minSharePreemptionTimeout":null,"name":"users","queues":[],"schedulablePropertiesList":[{"impalaDefaultQueryMemLimit":null,"impalaDefaultQueryOptions":null,"impalaMaxMemory":null,"impalaMaxQueuedQueries":null,"impalaMaxRunningQueries":null,"impalaQueueTimeout":null,"maxAMShare":-1.0,"maxChildResources":null,"maxResources":null,"maxRunningApps":null,"minResources":null,"scheduleName":"default","weight":1.0}],"schedulingPolicy":"drf","type":"parent"}],"schedulablePropertiesList":[{"impalaDefaultQueryMemLimit":null,"impalaDefaultQueryOptions":null,"impalaMaxMemory":null,"impalaMaxQueuedQueries":null,"impalaMaxRunningQueries":null,"impalaQueueTimeout":null,"maxAMShare":null,"maxChildResources":null,"maxResources":null,"maxRunningApps":null,"minResources":null,"scheduleName":"default","weight":1.0}],"schedulingPolicy":"drf","type":null}],"userMaxAppsDefault":1,"users":[]}
Here is what the queue description looks like when the application is still in ACCEPTED state:
Likewise, here is the record from the Yarn RM UI (Note that the resources are allocated (memory/cpu) and Running Containers shows 1 container running):
Here is the Application Summary:
Here are the application logs (empty):
And, lastly, here is what the driver sees:
enter code here19/12/26 00:16:42 INFO Client:
client token: N/A
diagnostics: Application application_1577297544619_0002 failed 1 times due to AM Container for appattempt_1577297544619_0002_000001 exited with exitCode: 10
For more detailed output, check application tracking page:http://quickstart.cloudera:8088/proxy/application_1577297544619_0002/Then, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_1577297544619_0002_01_000001
Exit code: 10
Stack trace: ExitCodeException exitCode=10:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:604)
at org.apache.hadoop.util.Shell.run(Shell.java:507)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:789)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:213)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Container exited with a non-zero exit code 10
Failing this attempt. Failing the application.
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: root.default
start time: 1577299469533
final status: FAILED
tracking URL: http://quickstart.cloudera:8088/cluster/app/application_1577297544619_0002
user: shepanch
19/12/26 00:16:42 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:85)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:62)
at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:165)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:512)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2511)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:909)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:901)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:901)
at cloudera.jobs.ClouderaSampleJob$.delayedEndpoint$cloudera$jobs$ClouderaSampleJob$1(ClouderaSampleJob.scala:17)
at cloudera.jobs.ClouderaSampleJob$delayedInit$body.apply(ClouderaSampleJob.scala:6)
at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
at scala.App$class.main(App.scala:76)
at cloudera.jobs.ClouderaSampleJob$.main(ClouderaSampleJob.scala:6)
at cloudera.jobs.ClouderaSampleJob.main(ClouderaSampleJob.scala)
Is there anything that can be done to solve this issue?

After all the research and apart from the reasons mentioned in the links I have mentioned in the question, I found that this can happen due to various reasons:
when you have different versions of spark in the client (driver) and the cluster. Once you ensure that both bundle the same version of spark, it runs fine.
you might need to mention the property spark.driver.host. Make sure the IP passed in here can be pinged from the guest VM.

Error initializing plugins - GraphDB 8.7.2

I have tried updating my POM from v8.6.1 to v8.7.2 and in the process successfully re-created a sample repo with the new version's preload tool.
Although I have not altered my java code at all (which runs perfectly with v.8.6.1), now I get an error when trying to retrieve the repository from the manager with the following command:
repository = repositoryManager.getRepository(repositoryId);
The error is the following:
197822 [main] INFO com.ontotext.plugin.magic-predicates - Registering InverseMagicPredicate: http://jena.hpl.hp.com/ARQ/property#strSplit
197823 [main] INFO com.ontotext.trree.sdk.impl.PluginManager - Initializing plugin 'literals-index'
198002 [main] INFO com.ontotext.plugin.literals-index - Literals indices restored.
198003 [main] INFO com.ontotext.trree.sdk.impl.PluginManager - Initializing plugin 'geospatial'
198009 [main] INFO com.ontotext.trree.plugin.geo.GeoSpatialPlugin - Plugin:geospatial initialized
198010 [main] INFO com.ontotext.trree.sdk.impl.PluginManager - Initializing plugin 'sparql-mm'
198400 [main] INFO com.ontotext.graphdb.sparqlmm.FunctionLoader - Registered 48 functions from package com.github.tkurz.sparqlmm.function.
198400 [main] INFO com.ontotext.trree.sdk.impl.PluginManager - Initializing plugin 'dependencies-plugin'
198409 [main] INFO com.ontotext.trree.sdk.impl.PluginManager - Initializing plugin 'similarity'
198429 [main] INFO com.ontotext.trree.sdk.impl.PluginManager - Initializing plugin 'GeoSPARQL'
231881 [main] INFO com.ontotext.trree.geosparql.FunctionLoader - Registered 50 functions from package com.useekm.geosparql.
231882 [main] INFO com.ontotext.trree.sdk.impl.PluginManager - Initializing plugin 'lucene-connector'
231896 [main] ERROR com.ontotext.trree.sdk.impl.PluginManager - Plugin 'lucene-connector' failed to initialize:org/json/simple/parser/ParseException
231897 [main] INFO com.ontotext.trree.sdk.impl.PluginManager - Initializing plugin 'rdfrank'
232224 [main] INFO com.ontotext.trree.sdk.impl.PluginManager - Initializing plugin 'notifications'
232237 [main] ERROR com.ontotext.trree.free.GraphDBFreeSchemaRepository - Error initializing plugins:
java.lang.NullPointerException
at com.ontotext.trree.plugin.externalsync.ExternalSyncPlugin.shutdown(ExternalSyncPlugin.java:803)
at com.ontotext.trree.sdk.PluginBase.shutdown(PluginBase.java:100)
at com.ontotext.trree.sdk.impl.PluginManager.disablePluginInt(PluginManager.java:986)
at com.ontotext.trree.sdk.impl.PluginManager.removePlugin(PluginManager.java:361)
at com.ontotext.trree.sdk.impl.PluginManager.initialize(PluginManager.java:128)
at com.ontotext.trree.OwlimSchemaRepository.initPlugins(OwlimSchemaRepository.java:1979)
at com.ontotext.trree.OwlimSchemaRepository.initializeInternal(OwlimSchemaRepository.java:242)
at org.eclipse.rdf4j.sail.helpers.AbstractSail.initialize(AbstractSail.java:188)
at org.eclipse.rdf4j.repository.sail.SailRepository.initializeInternal(SailRepository.java:151)
at org.eclipse.rdf4j.repository.base.AbstractRepository.initialize(AbstractRepository.java:34)
at org.eclipse.rdf4j.repository.manager.LocalRepositoryManager.createRepository(LocalRepositoryManager.java:270)
at org.eclipse.rdf4j.repository.manager.RepositoryManager.getRepository(RepositoryManager.java:424)
I have specified the -Dregister-external-plugins=.... in the VM Options.
Any ideas what might be wrong? Should I go for a previous version and if so, which one?
Thanks

It looks like you have an incompatible Lucene connector configuration. I recommend deleting the Lucene connector directory and once the repository starts you can recreate the connector(s). The Lucene connector directory is located in the repository's data directory: <graphdb-data-dir>/repositories/<repository-id>/storage/lucene-connector. The easiest way to find <graphdb-data-dir> is looking at the startup messages of GraphDB where it will print something like:
GraphDB Data directory: /opt/test/graphdb-free-8.7.2/data
As Konstantin mentioned the problem might also have to do with register-external-plugins.

Using Pig on Hortonworks Sandbox

I've been trying to use CurrentTime() on the sandbox provided by Hortonworks and can't get it to work.
This is all I have in the Pig script:
<code>
REGISTER zookeeper.jar
REGISTER piggybank.jar
REGISTER hbase-common-0.98.4.2.2.0.0-2041-hadoop2.jar
REGISTER hbase-common-0.98.4.2.2.0.0-2041-hadoop2-tests.jar
REGISTER hbase-client-0.98.4.2.2.0.0-2041-hadoop2.jar
REGISTER guava.jar
a = CurrentTime();
dump a;
</code>
The error I see in the logs say:
15/04/13 21:23:22 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
15/04/13 21:23:22 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
15/04/13 21:23:22 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
2015-04-13 21:23:22,428 [main] INFO org.apache.pig.Main - Apache Pig version 0.14.0.2.2.0.0-2041 (rexported) compiled Nov 19 2014, 15:24:46
2015-04-13 21:23:22,429 [main] INFO org.apache.pig.Main - Logging error messages to: /hadoop/yarn/local/usercache/hue/appcache/application_1428957295391_0006/container_1428957295391_0006_01_000002/pig_1428960202427.log
2015-04-13 21:23:24,041 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /home/yarn/.pigbootup not found
2015-04-13 21:23:24,615 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://sandbox.hortonworks.com:8020
2015-04-13 21:23:27,864 [main] ERROR org.apache.pig.PigServer - exception during parsing: Error during parsing. <file script.pig, line 10> Cannot expand macro 'CurrentTime'. Reason: Macro must be defined before expansion.
Failed to parse: <file script.pig, line 10> Cannot expand macro 'CurrentTime'. Reason: Macro must be defined before expansion.
at org.apache.pig.parser.PigMacro.macroInline(PigMacro.java:455)
at org.apache.pig.parser.QueryParserDriver.inlineMacro(QueryParserDriver.java:301)
at org.apache.pig.parser.QueryParserDriver.expandMacro(QueryParserDriver.java:290)
at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:183)
at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1735)
at org.apache.pig.PigServer$Graph.access$000(PigServer.java:1443)
at org.apache.pig.PigServer.parseAndBuild(PigServer.java:387)
at org.apache.pig.PigServer.executeBatch(PigServer.java:412)
at org.apache.pig.PigServer.executeBatch(PigServer.java:398)
at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:171)
at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:741)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:372)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:230)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:205)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81)
at org.apache.pig.Main.run(Main.java:495)
at org.apache.pig.Main.main(Main.java:170)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
2015-04-13 21:23:27,871 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <file script.pig, line 10> Cannot expand macro 'CurrentTime'. Reason: Macro must be defined before expansion.
Details at logfile: /hadoop/yarn/local/usercache/hue/appcache/application_1428957295391_0006/container_1428957295391_0006_01_000002/pig_1428960202427.log
2015-04-13 21:23:27,908 [main] INFO org.apache.pig.Main - Pig script completed in 5 seconds and 684 milliseconds (5684 ms)
I included those register lines since I thought maybe the UDFs weren't somehow included. I have no idea what to do now

I think you have to use complete path of the method current time. I have an example below where I am converting a given time into unix date format.
register '/usr/lib/pig/piggybank.jar' ;
DEFINE ISOToUnix org.apache.pig.piggybank.evaluation.datetime.convert.ISOToUnix();
date_filter = FOREACH parsed_log GENERATE ISOToUnix(date) AS unixTime:long;
STORE date_filter INTO '/root/pig/output/parselogdate/';
I think for your application you have to use below, just try if it works. Your currenttime will take tuple as input and return current time as output.
//Users is a file with tuples in it.
a = load users;
b = foreach a generate org.apache.pig.builtin.CurrentTime();
dump b;
I tried executing above and got output. You can find java doc.

MapReduce jobs in hive-0.8.1-cdh4.0.1 Failed.

Queries in hive-0.8.1-cdh4.0.1 that invoke the Reducer results in Task Failed.
The queries having MAPJOIn is working fine but JOIN gives error.
eg:
hive> select count(*) from table1;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
12/10/15 23:07:02 WARN conf.Configuration: mapred.job.name is deprecated. Instead, use mapreduce.job.name
12/10/15 23:07:02 WARN conf.Configuration: mapred.system.dir is deprecated. Instead, use mapreduce.jobtracker.system.dir
12/10/15 23:07:02 WARN conf.Configuration: mapred.local.dir is deprecated. Instead, use mapreduce.cluster.local.dir
12/10/15 23:07:02 WARN conf.HiveConf: hive-site.xml not found on CLASSPATH
WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
Execution log at: /tmp/XXXX
/XXXX_20121015230707_c93521d0-4a97-4972-92b9-0fdd3ab42e5f.log
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/XXXX/hadoop-2.0.0-cdh4.0.1/share/hadoop/common/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/XXXX/hive-0.8.1-cdh4.0.1/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See <http://www.slf4j.org/codes.html#multiple_bindings> for an explanation.
Job running in-process (local Hadoop)
Hadoop job information for null: number of mappers: 0; number of reducers: 0
2012-10-15 23:07:04,721 null map = 0%, reduce = 0%
Ended Job = job_local_0001 with errors
Error during job, obtaining debugging information...
**Execution failed with exit status: 2**
Obtaining error information
**Task failed!**
Task ID:
Stage-1
Logs:
/tmp/XXXX/hive.log
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
The log file shows that its due to Java heap space problem.
**java.lang.Exception: java.lang.OutOfMemoryError: Java heap space**
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:400)
Caused by: java.lang.OutOfMemoryError: Java heap space
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:912)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:391)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:334)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:232)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

For hadoop 2.0.0 +,
in etc/hadoop/mapred-site.xml
set:
<property>
<name>mapreduce.task.io.sort.mb</name>
<value>1</value>
</property>
It will work

map join will need more memory.
increase your mapreduce jvm memory size in file conf/mapred-site.xml. mapreduce conf
<property>
<name>mapred.child.java.opts</name>
<value>-Xmx1024m -server</value>
</property>

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Apache Giraph cannot run on CDH4.4.0 - apache

Related

Selenium Chrome nodes fail to stop the processes after execution

yarn application accepted but not running cloudera despite resource allocation

Error initializing plugins - GraphDB 8.7.2

Using Pig on Hortonworks Sandbox

MapReduce jobs in hive-0.8.1-cdh4.0.1 Failed.

Categories

Resources