Using Storm 1.2.2 and Kafka 1.1.0.
After submitting the topology, the supervisor launches a worker process. Checking the worker.log file for that worker process shows that, somewhere in the middle of loading all the executors, the worker process gets killed by the supervisor.
Following are the supervisor logs:
{"#timestamp":"2020-01-09 11:18:57,719","message":"SLOT 6700: Assignment Changed from LocalAssignment(topology_id:trident-Topology-578320979, executors:[ExecutorInfo(task_start:22, task_end:22), ExecutorInfo(task_start:2, task_end:2), ExecutorInfo(task_start:42, task_end:42), ExecutorInfo(task_start:18, task_end:18), ExecutorInfo(task_start:10, task_end:10), ExecutorInfo(task_start:14, task_end:14), ExecutorInfo(task_start:6, task_end:6), ExecutorInfo(task_start:38, task_end:38), ExecutorInfo(task_start:30, task_end:30), ExecutorInfo(task_start:34, task_end:34), ExecutorInfo(task_start:50, task_end:50), ExecutorInfo(task_start:46, task_end:46), ExecutorInfo(task_start:26, task_end:26), ExecutorInfo(task_start:39, task_end:39), ExecutorInfo(task_start:47, task_end:47), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:51, task_end:51), ExecutorInfo(task_start:3, task_end:3), ExecutorInfo(task_start:35, task_end:35), ExecutorInfo(task_start:31, task_end:31), ExecutorInfo(task_start:27, task_end:27), ExecutorInfo(task_start:43, task_end:43), ExecutorInfo(task_start:23, task_end:23), ExecutorInfo(task_start:11, task_end:11), ExecutorInfo(task_start:19, task_end:19), ExecutorInfo(task_start:15, task_end:15), ExecutorInfo(task_start:24, task_end:24), ExecutorInfo(task_start:12, task_end:12), ExecutorInfo(task_start:8, task_end:8), ExecutorInfo(task_start:4, task_end:4), ExecutorInfo(task_start:32, task_end:32), ExecutorInfo(task_start:40, task_end:40), ExecutorInfo(task_start:36, task_end:36), ExecutorInfo(task_start:28, task_end:28), ExecutorInfo(task_start:20, task_end:20), ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:48, task_end:48), ExecutorInfo(task_start:44, task_end:44), ExecutorInfo(task_start:21, task_end:21), ExecutorInfo(task_start:33, task_end:33), ExecutorInfo(task_start:41, task_end:41), ExecutorInfo(task_start:37, task_end:37), ExecutorInfo(task_start:1, task_end:1), ExecutorInfo(task_start:9, task_end:9), ExecutorInfo(task_start:13, task_end:13), ExecutorInfo(task_start:17, task_end:17), ExecutorInfo(task_start:5, task_end:5), ExecutorInfo(task_start:29, task_end:29), ExecutorInfo(task_start:25, task_end:25), ExecutorInfo(task_start:45, task_end:45), ExecutorInfo(task_start:49, task_end:49)], resources:WorkerResources(mem_on_heap:0.0, mem_off_heap:0.0, cpu:0.0), owner:root) to null","thread_name":"SLOT_6700","level":"WARN"}
{"#timestamp":"2020-01-09 11:18:57,724","message":"Killing 29a1f333-55f1-45c2-988d-daf0712c2862:5e19382e-c3e5-4c8d-8706-185e00e658a8","thread_name":"SLOT_6700","level":"INFO"}
{"#timestamp":"2020-01-09 11:19:00,808","message":"STATE RUNNING msInState: 120187 topo:trident-Topology-578320979 worker:5e19382e-c3e5-4c8d-8706-185e00e658a8 -> KILL msInState: 0 topo:trident-Topology-578320979 worker:5e19382e-c3e5-4c8d-8706-185e00e658a8","thread_name":"SLOT_6700","level":"INFO"}
{"#timestamp":"2020-01-09 11:19:00,809","message":"GET worker-user for 5e19382e-c3e5-4c8d-8706-185e00e658a8","thread_name":"SLOT_6700","level":"INFO"}
{"#timestamp":"2020-01-09 11:19:00,828","message":"SLOT 6700 force kill and wait...","thread_name":"SLOT_6700","level":"WARN"}
{"#timestamp":"2020-01-09 11:19:00,831","message":"Force Killing 29a1f333-55f1-45c2-988d-daf0712c2862:5e19382e-c3e5-4c8d-8706-185e00e658a8","thread_name":"SLOT_6700","level":"INFO"}
{"#timestamp":"2020-01-09 11:19:01,432","message":"Worker Process 5e19382e-c3e5-4c8d-8706-185e00e658a8 exited with code: 137","thread_name":"Thread-30","level":"INFO"}
{"#timestamp":"2020-01-09 11:19:03,851","message":"GET worker-user for 5e19382e-c3e5-4c8d-8706-185e00e658a8","thread_name":"SLOT_6700","level":"INFO"}
{"#timestamp":"2020-01-09 11:19:03,858","message":"SLOT 6700 all processes are dead...","thread_name":"SLOT_6700","level":"WARN"}
{"#timestamp":"2020-01-09 11:19:03,859","message":"Cleaning up 29a1f333-55f1-45c2-988d-daf0712c2862:5e19382e-c3e5-4c8d-8706-185e00e658a8","thread_name":"SLOT_6700","level":"INFO"}
{"#timestamp":"2020-01-09 11:19:03,859","message":"GET worker-user for 5e19382e-c3e5-4c8d-8706-185e00e658a8","thread_name":"SLOT_6700","level":"INFO"}
{"#timestamp":"2020-01-09 11:19:03,859","message":"Deleting path /data/workers/5e19382e-c3e5-4c8d-8706-185e00e658a8/pids/3100","thread_name":"SLOT_6700","level":"INFO"}
{"#timestamp":"2020-01-09 11:19:03,860","message":"Deleting path /data/workers/5e19382e-c3e5-4c8d-8706-185e00e658a8/heartbeats","thread_name":"SLOT_6700","level":"INFO"}
{"#timestamp":"2020-01-09 11:19:03,871","message":"Deleting path /data/workers/5e19382e-c3e5-4c8d-8706-185e00e658a8/pids","thread_name":"SLOT_6700","level":"INFO"}
{"#timestamp":"2020-01-09 11:19:03,872","message":"Deleting path /data/workers/5e19382e-c3e5-4c8d-8706-185e00e658a8/tmp","thread_name":"SLOT_6700","level":"INFO"}
{"#timestamp":"2020-01-09 11:19:03,872","message":"Deleting path /data/workers/5e19382e-c3e5-4c8d-8706-185e00e658a8","thread_name":"SLOT_6700","level":"INFO"}
{"#timestamp":"2020-01-09 11:19:03,873","message":"REMOVE worker-user 5e19382e-c3e5-4c8d-8706-185e00e658a8","thread_name":"SLOT_6700","level":"INFO"}
{"#timestamp":"2020-01-09 11:19:03,874","message":"Deleting path /data/workers-users/5e19382e-c3e5-4c8d-8706-185e00e658a8","thread_name":"SLOT_6700","level":"INFO"}
{"#timestamp":"2020-01-09 11:19:03,876","message":"Removed Worker ID 5e19382e-c3e5-4c8d-8706-185e00e658a8","thread_name":"SLOT_6700","level":"INFO"}
{"#timestamp":"2020-01-09 11:19:03,876","message":"STATE KILL msInState: 3068 topo:trident-Topology-578320979 worker:null -> EMPTY msInState: 0","thread_name":"SLOT_6700","level":"INFO"}
After the worker with id 5e19382e-c3e5-4c8d-8706-185e00e658a8 was killed, the supervisor launched a new worker process with a different id. Loading of the executors starts again, and once some of the executors have finished loading, the new worker process again receives a kill signal from the supervisor.
Following are the worker logs at port 6700:
...
2020-01-09 14:42:19.455 o.a.s.d.executor main [INFO] Loading executor b-14:[10 10]
2020-01-09 14:42:20.942 o.a.s.d.executor main [INFO] Loaded executor tasks b-14:[10 10]
2020-01-09 14:42:20.945 o.a.s.d.executor main [INFO] Finished loading executor b-14[10 10]
2020-01-09 14:42:20.962 o.a.s.d.executor main [INFO] Loading executor b-39:[37 37]
2020-01-09 14:42:22.547 o.a.s.d.executor main [INFO] Loaded executor tasks b-39:[37 37]
2020-01-09 14:42:22.549 o.a.s.d.executor main [INFO] Finished loading executor b-39:[37 37]
2020-01-09 14:42:22.566 o.a.s.d.executor main [INFO] Loading executor b-5:[46 46]
2020-01-09 14:42:25.267 o.a.s.d.executor main [INFO] Loaded executor tasks b-5:[46 46]
2020-01-09 14:42:25.269 o.a.s.d.executor main [INFO] Finished loading executor b-5:[46 46]
2020-01-09 14:42:31.175 o.a.s.d.executor main [INFO] Loading executor b-0:[4 4]
2020-01-09 14:42:37.512 o.s.c.n.e.InstanceInfoFactory Thread-10 [INFO] Setting initial instance status as: STARTING
2020-01-09 14:42:37.637 o.s.s.c.ThreadPoolTaskScheduler [Ljava.lang.String;#174cb0d8.container-0-C-1 [INFO] Shutting down ExecutorService
2020-01-09 14:42:37.851 o.s.k.l.KafkaMessageListenerContainer$ListenerConsumer [Ljava.lang.String;#174cb0d8.container-0-C-1 [INFO] Consumer stopped
2020-01-09 14:42:37.855 o.s.i.k.i.KafkaMessageDrivenChannelAdapter Thread-10 [INFO] stopped org.springframework.integration.kafka.inbound.KafkaMessageDrivenChannelAdapter#2459333a
2020-01-09 14:42:37.870 o.s.s.c.ThreadPoolTaskScheduler [Ljava.lang.String;#6e355249.container-0-C-1 [INFO] Shutting down ExecutorService
2020-01-09 14:42:38.054 o.s.k.l.KafkaMessageListenerContainer$ListenerConsumer [Ljava.lang.String;#6e355249.container-0-C-1 [INFO] Consumer stopped
After this, the cycle repeats: the log again shows 'Launching worker for trident-Topology-578320979 ...' and all the executors and tasks are loaded again.
Can anyone please explain what "Worker Process 5e19382e-c3e5-4c8d-8706-185e00e658a8 exited with code: 137" means?
The following link [https://issues.apache.org/jira/browse/STORM-2176] explains the configuration property supervisor.worker.shutdown.sleep.secs, which defaults to 1 second. It controls how long the supervisor waits for a worker to exit gracefully before forcibly killing it with kill -9; when that happens, the supervisor logs that the worker terminated with exit code 137 (128 + 9).
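Just to illustrate the 128 + signal-number convention, the same exit code can be reproduced in a plain bash shell (this is only a demonstration, not Storm-specific):

# start a throwaway process, kill -9 it, and print its exit status
sleep 100 &
kill -9 $!
wait $!
echo $?    # prints 137, i.e. 128 + 9 (SIGKILL)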
Would increasing the value of supervisor.worker.shutdown.sleep.secs help?
Or could it be that the JVM doesn't have enough memory? But then I would expect an Exception in thread "main" java.lang.OutOfMemoryError: Java heap space, and no such exception is visible in any of the logs.
Is it recommended to try increasing the JVM memory via the worker.childopts setting in storm.yaml?
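If it helps, this is roughly what I was planning to try in storm.yaml on the supervisor nodes (the values are placeholders I picked for experimenting, not known-good settings; the supervisors need a restart after changing them):

# storm.yaml -- illustrative values only
# give workers longer to shut down gracefully before the supervisor sends kill -9
supervisor.worker.shutdown.sleep.secs: 10
# raise the worker JVM heap
worker.childopts: "-Xmx2048m"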
Any help would be greatly appreciated.
P.S. I have been trying to find a solution for a few days now, with no success.
I'm working with Hive/Hadoop/Sqoop through Cloudera 5.8.0, which includes Sqoop 1.4.6. My Hadoop cluster has 4 Hadoop datanodes, each with 16 GB of memory, and all of them run Impala daemons and YARN NodeManagers. The YARN server runs along with Hue, Hive and Sqoop2 on a server with 32 GB of RAM (it has many roles).
Using Sqoop to import from a MySQL database (from the main server, using Sqoop 1 via a bash script, as an incremental job into Parquet format) seemed slow (about 50 seconds on average), even when importing a table with as few as 200 rows (or even 30 rows in one case). It would consistently hang on the following step for 30 seconds (and eventually succeed), even in uber mode:
Note: the clean phase and repeats are omitted for brevity.
2016-11-03 10:07:50,534 WARN org.apache.hadoop.hdfs.BlockReaderFactory: I/O error constructing remote block reader.
java.io.IOException: Got error for OP_READ_BLOCK, status=ERROR, self=/192.168.1.31:58178, remote=/192.168.1.34:50010, for file /user/(user profile name)/.staging/job_1478124814973_0001/libjars/commons-math-2.1.jar, for pool BP-15528599-192.168.1.31-1472851278753 block 1074078887_338652
at org.apache.hadoop.hdfs.RemoteBlockReader2.checkSuccess(RemoteBlockReader2.java:467)
at org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(RemoteBlockReader2.java:432)
at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:881)
at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:759)
at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:376)
at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:662)
at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:889)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:942)
at java.io.DataInputStream.read(DataInputStream.java:100)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:85)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:59)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:119)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:369)
at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:265)
at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:61)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:357)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:356)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:60)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
2016-11-03 10:07:50,541 WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to /192.168.1.34:50010 for block, add to deadNodes and continue. java.io.IOException: Got error for OP_READ_BLOCK, status=ERROR, self=/192.168.1.31:58178, remote=/192.168.1.34:50010, for file /user/(user profile name)/.staging/job_1478124814973_0001/libjars/commons-math-2.1.jar, for pool BP-15528599-192.168.1.31-1472851278753 block 1074078887_338652
java.io.IOException: Got error for OP_READ_BLOCK, status=ERROR, self=/192.168.1.31:58178, remote=/192.168.1.34:50010, for file /user/(user profile name)/.staging/job_1478124814973_0001/libjars/commons-math-2.1.jar, for pool BP-15528599-192.168.1.31-1472851278753 block 1074078887_338652
at org.apache.hadoop.hdfs.RemoteBlockReader2.checkSuccess(RemoteBlockReader2.java:467)
at org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(RemoteBlockReader2.java:432)
at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:881)
at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:759)
at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:376)
at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:662)
at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:889)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:942)
at java.io.DataInputStream.read(DataInputStream.java:100)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:85)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:59)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:119)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:369)
at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:265)
at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:61)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:357)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:356)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:60)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
2016-11-03 10:07:50,543 INFO org.apache.hadoop.hdfs.DFSClient: Successfully connected to /192.168.1.33:50010 for BP-15528599-192.168.1.31-1472851278753:blk_1074078887_338652
This error repeated 4 times.
When I ran the job again, I got this:
2016-11-03 10:37:38,093 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Start request for container_e86_1478124814973_0002_01_000001 by user (user profile name)
2016-11-03 10:37:38,093 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Creating a new application reference for app application_1478124814973_0002
2016-11-03 10:37:38,095 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=(user profile name) IP=192.168.1.34 OPERATION=Start Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_1478124814973_0002 CONTAINERID=container_e86_1478124814973_0002_01_000001
2016-11-03 10:37:38,096 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Application application_1478124814973_0002 transitioned from NEW to INITING
2016-11-03 10:37:38,096 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Adding container_e86_1478124814973_0002_01_000001 to application application_1478124814973_0002
2016-11-03 10:37:38,106 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: rollingMonitorInterval is set as -1. The log rolling mornitoring interval is disabled. The logs will be aggregated after this application is finished.
2016-11-03 10:37:38,134 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Application application_1478124814973_0002 transitioned from INITING to RUNNING
2016-11-03 10:37:38,138 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_e86_1478124814973_0002_01_000001 transitioned from NEW to LOCALIZING
2016-11-03 10:37:38,138 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got event CONTAINER_INIT for appId application_1478124814973_0002
2016-11-03 10:37:38,147 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Created localizer for container_e86_1478124814973_0002_01_000001
2016-11-03 10:37:38,148 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Writing credentials to the nmPrivate file /yarn/nm/nmPrivate/container_e86_1478124814973_0002_01_000001.tokens. Credentials list:
2016-11-03 10:37:38,149 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Initializing user (user profile name)
2016-11-03 10:37:38,151 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Copying from /yarn/nm/nmPrivate/container_e86_1478124814973_0002_01_000001.tokens to /yarn/nm/usercache/(user profile name)/appcache/application_1478124814973_0002/container_e86_1478124814973_0002_01_000001.tokens
2016-11-03 10:37:38,151 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Localizer CWD set to /yarn/nm/usercache/(user profile name)/appcache/application_1478124814973_0002 = file:/yarn/nm/usercache/(user profile name)/appcache/application_1478124814973_0002
2016-11-03 10:37:41,791 WARN org.apache.hadoop.hdfs.BlockReaderFactory: I/O error constructing remote block reader.
java.io.IOException: Got error for OP_READ_BLOCK, status=ERROR, self=/192.168.1.31:39276, remote=/192.168.1.35:50010, for file /user/(user profile name)/.staging/job_1478124814973_0002/libjars/jackson-core-2.3.1.jar, for pool BP-15528599-192.168.1.31-1472851278753 block 1074079133_338898
at org.apache.hadoop.hdfs.RemoteBlockReader2.checkSuccess(RemoteBlockReader2.java:467)
at org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(RemoteBlockReader2.java:432)
at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:881)
at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:759)
at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:376)
at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:662)
at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:889)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:942)
at java.io.DataInputStream.read(DataInputStream.java:100)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:85)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:59)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:119)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:369)
at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:265)
at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:61)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:357)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:356)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:60)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
2016-11-03 10:37:41,792 WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to /192.168.1.35:50010 for block, add to deadNodes and continue. java.io.IOException: Got error for OP_READ_BLOCK, status=ERROR, self=/192.168.1.31:39276, remote=/192.168.1.35:50010, for file /user/(user profile name)/.staging/job_1478124814973_0002/libjars/jackson-core-2.3.1.jar, for pool BP-15528599-192.168.1.31-1472851278753 block 1074079133_338898
java.io.IOException: Got error for OP_READ_BLOCK, status=ERROR, self=/192.168.1.31:39276, remote=/192.168.1.35:50010, for file /user/(user profile name)/.staging/job_1478124814973_0002/libjars/jackson-core-2.3.1.jar, for pool BP-15528599-192.168.1.31-1472851278753 block 1074079133_338898
at org.apache.hadoop.hdfs.RemoteBlockReader2.checkSuccess(RemoteBlockReader2.java:467)
at org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(RemoteBlockReader2.java:432)
at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:881)
at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:759)
at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:376)
at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:662)
at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:889)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:942)
at java.io.DataInputStream.read(DataInputStream.java:100)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:85)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:59)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:119)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:369)
at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:265)
at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:61)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:357)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:356)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:60)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
2016-11-03 10:37:41,795 INFO org.apache.hadoop.hdfs.DFSClient: Successfully connected to /192.168.1.32:50010 for BP-15528599-192.168.1.31-1472851278753:blk_1074079133_338898
2016-11-03 10:37:42,928 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_e86_1478124814973_0002_01_000001 transitioned from LOCALIZING to LOCALIZED
2016-11-03 10:37:42,951 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_e86_1478124814973_0002_01_000001 transitioned from LOCALIZED to RUNNING
2016-11-03 10:37:42,955 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: launchContainer: [bash, /yarn/nm/usercache/(user profile name)/appcache/application_1478124814973_0002/container_e86_1478124814973_0002_01_000001/default_container_executor.sh]
2016-11-03 10:37:43,011 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Starting resource-monitoring for container_e86_1478124814973_0002_01_000001
2016-11-03 10:37:43,034 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 25215 for container-id container_e86_1478124814973_0002_01_000001: 1.4 MB of 2 GB physical memory used; 103.6 MB of 4.2 GB virtual memory used
2016-11-03 10:37:46,242 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 25215 for container-id container_e86_1478124814973_0002_01_000001: 268.1 MB of 2 GB physical memory used; 1.4 GB of 4.2 GB virtual memory used
2016-11-03 10:37:49,261 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 25215 for container-id container_e86_1478124814973_0002_01_000001: 398.4 MB of 2 GB physical memory used; 1.5 GB of 4.2 GB virtual memory used
2016-11-03 10:37:52,279 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 25215 for container-id container_e86_1478124814973_0002_01_000001: 408.5 MB of 2 GB physical memory used; 1.5 GB of 4.2 GB virtual memory used
2016-11-03 10:37:55,297 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 25215 for container-id container_e86_1478124814973_0002_01_000001: 416.6 MB of 2 GB physical memory used; 1.5 GB of 4.2 GB virtual memory used
2016-11-03 10:37:58,315 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 25215 for container-id container_e86_1478124814973_0002_01_000001: 414.1 MB of 2 GB physical memory used; 1.5 GB of 4.2 GB virtual memory used
2016-11-03 10:38:00,934 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Container container_e86_1478124814973_0002_01_000001 succeeded
2016-11-03 10:38:00,934 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_e86_1478124814973_0002_01_000001 transitioned from RUNNING to EXITED_WITH_SUCCESS
2016-11-03 10:38:00,935 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_e86_1478124814973_0002_01_000001
2016-11-03 10:38:00,967 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /yarn/nm/usercache/(user profile name)/appcache/application_1478124814973_0002/container_e86_1478124814973_0002_01_000001
2016-11-03 10:38:00,968 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=(user profile name) OPERATION=Container Finished - Succeeded TARGET=ContainerImpl RESULT=SUCCESS APPID=application_1478124814973_0002 CONTAINERID=container_e86_1478124814973_0002_01_000001
2016-11-03 10:38:00,968 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_e86_1478124814973_0002_01_000001 transitioned from EXITED_WITH_SUCCESS to DONE
2016-11-03 10:38:00,968 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Removing container_e86_1478124814973_0002_01_000001 from application application_1478124814973_0002
2016-11-03 10:38:00,968 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Considering container container_e86_1478124814973_0002_01_000001 for log-aggregation
2016-11-03 10:38:00,968 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Stopping container with container Id: container_e86_1478124814973_0002_01_000001
2016-11-03 10:38:00,980 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=(user profile name) IP=192.168.1.34 OPERATION=Stop Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_1478124814973_0002 CONTAINERID=container_e86_1478124814973_0002_01_000001
2016-11-03 10:38:01,316 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Stopping resource-monitoring for container_e86_1478124814973_0002_01_000001
2016-11-03 10:38:01,972 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Removed completed containers from NM context: [container_e86_1478124814973_0002_01_000001]
2016-11-03 10:38:01,972 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Application application_1478124814973_0002 transitioned from RUNNING to APPLICATION_RESOURCES_CLEANINGUP
2016-11-03 10:38:01,973 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /yarn/nm/usercache/(user profile name)/appcache/application_1478124814973_0002
2016-11-03 10:38:01,973 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got event APPLICATION_STOP for appId application_1478124814973_0002
2016-11-03 10:38:01,973 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Application application_1478124814973_0002 transitioned from APPLICATION_RESOURCES_CLEANINGUP to FINISHED
2016-11-03 10:38:01,973 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Application just finished : application_1478124814973_0002
2016-11-03 10:38:02,072 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Uploading logs for container container_e86_1478124814973_0002_01_000001. Current good log dirs are /yarn/container-logs
2016-11-03 10:38:02,073 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting path : /yarn/container-logs/application_1478124814973_0002/container_e86_1478124814973_0002_01_000001/stderr
2016-11-03 10:38:02,074 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting path : /yarn/container-logs/application_1478124814973_0002/container_e86_1478124814973_0002_01_000001/stdout
2016-11-03 10:38:02,074 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting path : /yarn/container-logs/application_1478124814973_0002/container_e86_1478124814973_0002_01_000001/syslog
2016-11-03 10:38:02,160 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting path : /yarn/container-logs/application_1478124814973_0002
After subsequent tests, it is having problems with nodes 3 and 4 (192.168.1.34 and 192.168.1.35). The Cloudera interface says all nodes are healthy (I realize that may not be accurate). I could believe one node might be bad (I tried decommissioning it, deleting it, and later recommissioning it), but two seems odd, especially when I can query Impala or Hive with no issues and both Cloudera and fsck say the nodes are healthy.
I've run hdfs fsck on the root directory and no errors were found. Does anybody understand why this is happening and, better yet, whether it can be fixed?
Oh, it should be noted that all nodes are virtual machines on the same physical server and all /etc/hosts files are configured with all node hostnames (not using internal DNS for now). I've checked the iptables service on both 192.168.1.34 and 192.168.1.35 and iptables is not running. I've also verified that both machines are listening on port 50010.
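For completeness, these are roughly the checks I ran (commands reconstructed from memory, so treat them as approximate):

# on the main server
hdfs fsck /                          # reported the filesystem as healthy, no missing or corrupt blocks
# on 192.168.1.34 and 192.168.1.35
sudo service iptables status         # iptables is not running
sudo netstat -tlnp | grep 50010      # the DataNode is listening on port 50010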
Thanks All!
Okay, I managed to get rid of the error. I changed this setting:
mapreduce.client.submit.file.replication
Setting this value to 4 (the number of YARN NodeManagers in the cluster; it was 2 before) made the I/O exception error go away. As for speed, research indicates that small tables with small files are handled inefficiently when the Parquet format is used. So I'm guessing that for the smaller tables I should import the intermediate tables in the HBase file format instead (or perhaps all of them, since they are basically used as intermediate tables between the raw Sqoop import and the native table format with timestamp columns; Sqoop converts timestamps to long integers when using the Parquet format). Ironically, if I make those HBase tables I don't need to convert them anymore.
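For anyone else hitting this, the property can go into mapred-site.xml (or the corresponding MapReduce client safety valve in Cloudera Manager), or be passed per job with Hadoop's generic -D option. The snippets below are only sketches of both approaches; the connection string and table name are placeholders:

<!-- mapred-site.xml: replicate job submission files (job.jar, libjars, etc.) to 4 datanodes -->
<property>
  <name>mapreduce.client.submit.file.replication</name>
  <value>4</value>
</property>

# per-job alternative: the generic -D option must come before the Sqoop-specific arguments
sqoop import -D mapreduce.client.submit.file.replication=4 \
  --connect jdbc:mysql://dbhost/mydb --table my_table --as-parquetfile ...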
First of all, I'm on OS X 10.7.3 using MonoDevelop 2.8.8.4 with MonoDroid 4.0.6 and Mono 2.10.9.
So I have purchased MFA (Mono for Android) and created the generic "Mono for Android Application" project for testing.
I have checked the ABIs "armeabi", "armeabi-v7a" and "x86" in the Advanced tab under Options/Build/Mono for Android Build.
I have also set the build configuration to Release.
I then go to Project/CreateAndroidProject in the menu to build the .apk file that I upload to the Logitech Revue Google TV device or the x86 emulator.
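For reference, the ABI selection from that dialog ends up as a property in the .csproj, roughly like the following (I'm writing the property name from memory for MFA 4.x, so double-check it against your own project file):

<!-- in the Release PropertyGroup of the .csproj -->
<AndroidSupportedAbis>armeabi;armeabi-v7a;x86</AndroidSupportedAbis>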
After uploading and running the application, I get this error:
"The Application AndroidTest(process AndroidTest.AndroidTest) has stopped unexpectedly. Please try again.".
I also get this same error when using the Android Emulator "API lvl 10 Intel Atom x86".
Has anyone got MonoDroid to work on any x86 platform? If so, which one, and what settings did you use? Were you using VirtualBox or the standard Android emulator? Also, what API level did you use, and which MonoDroid project/solution settings did you need to set to get it to work?
NOTE: The project I used works on my ARM Android phone and on the ARM Android emulator.
I have also set this AndroidManifest.xml flag:
<uses-feature android:name="android.hardware.touchscreen" android:required="false" />
When I use "adb logcat", it gives this error on the x86 emulators: "java.lang.UnsatisfiedLinkError: Cannot load library: reloc_library[1311]: 799 cannot locate 'atexit'..."
EDIT - Here is the logcat output when running the application on a Logitech Google TV:
"
I/ActivityManager( 193): Starting: Intent {
act=android.intent.action.MAIN flg=0x10200000
cmp=com.Reign.WaterDemo_Android/waterdemo_android.Activity1 } from pid
247 I/ActivityManager( 193): Start proc com.Reign.WaterDemo_Android
for activity com.Reign.WaterDemo_Android/waterdemo_android.Activity1:
pid=2084 uid=10060 gids={1015} I/ActivityThread( 2084): Pub
com.Reign.WaterDemo_Android.mono_init: mono.MonoRuntimeProvider
D/AndroidRuntime( 2084): Shutting down VM W/dalvikvm( 2084):
threadid=1: thread exiting with uncaught exception (group=0x66995778)
E/AndroidRuntime( 2084): FATAL EXCEPTION: main E/AndroidRuntime(
2084): java.lang.UnsatisfiedLinkError: Couldn't load monodroid:
findLibrary returned null E/AndroidRuntime( 2084): at
java.lang.Runtime.loadLibrary(Runtime.java:425) E/AndroidRuntime(
2084): at java.lang.System.loadLibrary(System.java:554)
E/AndroidRuntime( 2084): at
mono.MonoPackageManager.LoadApplication(MonoPackageManager.java:24)
E/AndroidRuntime( 2084): at
mono.MonoRuntimeProvider.attachInfo(MonoRuntimeProvider.java:22)
E/AndroidRuntime( 2084): at
android.app.ActivityThread.installProvider(ActivityThread.java:3938)
E/AndroidRuntime( 2084): at
android.app.ActivityThread.installContentProviders(ActivityThread.java:3693)
E/AndroidRuntime( 2084): at
android.app.ActivityThread.handleBindApplication(ActivityThread.java:3649)
E/AndroidRuntime( 2084): at
android.app.ActivityThread.access$2200(ActivityThread.java:124)
E/AndroidRuntime( 2084): at
android.app.ActivityThread$H.handleMessage(ActivityThread.java:1054)
E/AndroidRuntime( 2084): at
android.os.Handler.dispatchMessage(Handler.java:99) E/AndroidRuntime(
2084): at android.os.Looper.loop(Looper.java:132) E/AndroidRuntime(
2084): at android.app.ActivityThread.main(ActivityThread.java:4083)
E/AndroidRuntime( 2084): at
java.lang.reflect.Method.invokeNative(Native Method) E/AndroidRuntime(
2084): at java.lang.reflect.Method.invoke(Method.java:491)
E/AndroidRuntime( 2084): at
com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:841)
E/AndroidRuntime( 2084): at
com.android.internal.os.ZygoteInit.main(ZygoteInit.java:599)
E/AndroidRuntime( 2084): at dalvik.system.NativeStart.main(Native
Method) W/ActivityManager( 193): Force finishing activity
com.Reign.WaterDemo_Android/waterdemo_android.Activity1 D/dalvikvm(
193): GC_FOR_ALLOC freed 324K, 18% free 9559K/11591K, paused 59ms
I/dalvikvm-heap( 193): Grow heap (frag case) to 9.816MB for
178700-byte allocation D/dalvikvm( 193): GC_FOR_ALLOC freed 9K, 18%
free 9723K/11783K, paused 59ms D/dalvikvm( 193): GC_FOR_ALLOC freed
117K, 19% free 9606K/11783K, paused 58ms I/dalvikvm-heap( 193): Grow
heap (frag case) to 10.794MB for 1155900-byte allocation D/dalvikvm(
193): GC_FOR_ALLOC freed 2K, 18% free 10733K/12935K, paused 56ms
D/dalvikvm( 193): GC_FOR_ALLOC freed <1K, 18% free 10733K/12935K,
paused 57ms I/dalvikvm-heap( 193): Grow heap (frag case) to 12.752MB
for 2054924-byte allocation D/dalvikvm( 193): GC_FOR_ALLOC freed 0K,
15% free 12740K/14983K, paused 57ms W/ActivityManager( 193): Activity
pause timeout for ActivityRecord{66e1c680
com.Reign.WaterDemo_Android/waterdemo_android.Activity1} D/dalvikvm(
193): GC_CONCURRENT freed 12K, 15% free 12867K/14983K, paused 1ms+3ms
"
Google TV does not support the NDK, so the MonoDroid Java framework cannot load the native libmonodroid.so library. There are no ABIs that will work at this time.
There is a feature request open for NDK support on Google TV:
http://code.google.com/p/googletv-issues/issues/detail?id=12
This is a known issue which affects Mono for Android apps in all x86 emulators, and a fix for it is going to be included in the next release of Mono for Android. It's a bug in Google's x86 NDK which was supposedly fixed (but, it turns out, wasn't), so we had to do a little workaround. Debug builds of your app should work correctly; this should only affect Release builds.