Last night, my Node.js API threw this error:
All host(s) tried for query failed.First host tried, 127.0.0.1:Host considered as DOWN.
As the error suggests, it looks like the Cassandra node was down, so I dug into the Cassandra logs and found that Cassandra had thrown an exception, but I could not figure out why this error would have been thrown.
Cassandra logs
INFO [MemtableFlushWriter:73] 2015-04-02 19:12:58,844 Memtable.java:370 - Completed flushing; nothing needed to be retained. Commitlog position was ReplayPosition(segmentId=1427954898588, position=3473692)
INFO [CompactionExecutor:60] 2015-04-02 19:18:28,208 CompactionManager.java:521 - No files to compact for user defined compaction
INFO [CompactionExecutor:61] 2015-04-02 19:28:28,209 CompactionManager.java:521 - No files to compact for user defined compaction
ERROR [CompactionExecutor:62] 2015-04-02 19:38:17,782 CassandraDaemon.java:153 - Exception in thread Thread[CompactionExecutor:62,1,main]
java.lang.NullPointerException: null
at org.apache.cassandra.service.CacheService$KeyCacheSerializer.serialize(CacheService.java:475) ~[apache-cassandra-2.1.1.jar:2.1.1]
at org.apache.cassandra.service.CacheService$KeyCacheSerializer.serialize(CacheService.java:463) ~[apache-cassandra-2.1.1.jar:2.1.1]
at org.apache.cassandra.cache.AutoSavingCache$Writer.saveCache(AutoSavingCache.java:236) ~[apache-cassandra-2.1.1.jar:2.1.1]
at org.apache.cassandra.db.compaction.CompactionManager$11.run(CompactionManager.java:1089) ~[apache-cassandra-2.1.1.jar:2.1.1]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[na:1.7.0_67]
at java.util.concurrent.FutureTask.run(FutureTask.java:262) ~[na:1.7.0_67]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[na:1.7.0_67]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_67]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_67]
INFO [ScheduledTasks:1] 2015-04-02 19:38:22,991 ColumnFamilyStore.java:856 - Enqueuing flush of sstable_activity: 65753 (0%) on-heap, 0 (0%) off-heap
INFO [MemtableFlushWriter:74] 2015-04-02 19:38:22,992 Memtable.java:324 - Writing Memtable-sstable_activity#166239084(6642 serialized bytes, 2952 ops, 0%/0% of on/off-heap limit)
INFO [MemtableFlushWriter:74] 2015-04-02 19:38:23,105 Memtable.java:363 - Completed flushing /var/lib/cassandra/data/system/sstable_activity-5a1ff267ace03f128563cfae6103c65e/system-sstable_activity-ka-208-Data.db (3897 bytes) for commitlog position ReplayPosition(segmentId=1427954898588, position=3589687)
INFO [CompactionExecutor:63] 2015-04-02 19:38:23,106 CompactionTask.java:136 - Compacting [SSTableReader(path='/var/lib/cassandra/data/system/sstable_activity-5a1ff267ace03f128563cfae6103c65e/system-sstable_activity-ka-205-Data.db'), SSTableReader(path='/var/lib/cassandra/data/system/sstable_activity-5a1ff267ace03f128563cfae6103c65e/system-sstable_activity-ka-207-Data.db'), SSTableReader(path='/var/lib/cassandra/data/system/sstable_activity-5a1ff267ace03f128563cfae6103c65e/system-sstable_activity-ka-208-Data.db'), SSTableReader(path='/var/lib/cassandra/data/system/sstable_activity-5a1ff267ace03f128563cfae6103c65e/system-sstable_activity-ka-206-Data.db')]
INFO [CompactionExecutor:63] 2015-04-02 19:38:23,221 CompactionTask.java:252 - Compacted 4 sstables to [/var/lib/cassandra/data/system/sstable_activity-5a1ff267ace03f128563cfae6103c65e/system-sstable_activity-ka-209,]. 15,482 bytes to 3,897 (~25% of original) in 113ms = 0.032889MB/s. 328 total partitions merged to 82. Partition merge counts were {4:82, }
INFO [CompactionExecutor:62] 2015-04-02 19:38:28,210 CompactionManager.java:521 - No files to compact for user defined compaction
INFO [CompactionExecutor:64] 2015-04-02 19:48:28,213 CompactionManager.java:521 - No files to compact for user defined compaction
INFO [CompactionExecutor:65] 2015-04-02 19:58:28,214 CompactionManager.java:521 - No files to compact for user defined compaction
INFO [CompactionExecutor:66] 2015-04-02 20:08:28,215 CompactionManager.java:521 - No files to compact for user defined compaction
INFO [CompactionExecutor:67] 2015-04-02 20:18:28,216 CompactionManager.java:521 - No files to compact for user defined compaction
INFO [BatchlogTasks:1] 2015-04-02 20:19:58,913 ColumnFamilyStore.java:856 - Enqueuing flush of batchlog: 5848 (0%) on-heap, 0 (0%) off-heap
INFO [MemtableFlushWriter:75] 2015-04-02 20:19:58,914 Memtable.java:324 - Writing Memtable-batchlog#1542233641(4538 serialized bytes, 10 ops, 0%/0% of on/off-heap limit)
INFO [MemtableFlushWriter:75] 2015-04-02 20:19:58,915 Memtable.java:370 - Completed flushing; nothing needed to be retained. Commitlog position was ReplayPosition(segmentId=1427954898588, position=3789073)
INFO [CompactionExecutor:68] 2015-04-02 20:28:28,217 CompactionManager.java:521 - No files to compact for user defined compaction
INFO [ScheduledTasks:1] 2015-04-02 20:38:22,989 ColumnFamilyStore.java:856 - Enqueuing flush of compaction_history: 1214 (0%) on-heap, 0 (0%) off-heap
INFO [MemtableFlushWriter:76] 2015-04-02 20:38:22,990 Memtable.java:324 - Writing Memtable-compaction_history#1912511539(240 serialized bytes, 9 ops, 0%/0% of on/off-heap limit)
INFO [ScheduledTasks:1] 2015-04-02 20:38:22,992 ColumnFamilyStore.java:856 - Enqueuing flush of sstable_activity: 65753 (0%) on-heap, 0 (0%) off-heap
INFO [MemtableFlushWriter:77] 2015-04-02 20:38:22,993 Memtable.java:324 - Writing Memtable-sstable_activity#1348219988(6642 serialized bytes, 2952 ops, 0%/0% of on/off-heap limit)
INFO [MemtableFlushWriter:76] 2015-04-02 20:38:23,049 Memtable.java:363 - Completed flushing /var/lib/cassandra/data/system/compaction_history-b4dbb7b4dc493fb5b3bfce6e434832ca/system-compaction_history-ka-107-Data.db (250 bytes) for commitlog position ReplayPosition(segmentId=1427954898588, position=3841853)
INFO [MemtableFlushWriter:77] 2015-04-02 20:38:23,053 Memtable.java:363 - Completed flushing /var/lib/cassandra/data/system/sstable_activity-5a1ff267ace03f128563cfae6103c65e/system-sstable_activity-ka-210-Data.db (3916 bytes) for commitlog position ReplayPosition(segmentId=1427954898588, position=3841853)
INFO [CompactionExecutor:69] 2015-04-02 20:38:28,217 CompactionManager.java:521 - No files to compact for user defined compaction
No logs are generated after this point; presumably the node went down and the server-side APIs started throwing errors.
Upgrade to Cassandra 2.1.5, see CASSANDRA-8067
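If an immediate upgrade is not possible, a stopgap I would consider (this is my assumption, not something stated above: the NullPointerException occurs while the key cache is being saved, so skipping that periodic save should avoid the failing code path) is to disable key-cache saving in cassandra.yaml and restart the node. A minimal sketch, assuming a package install with the config at /etc/cassandra/cassandra.yaml:

sudo sed -i 's/^key_cache_save_period:.*/key_cache_save_period: 0/' /etc/cassandra/cassandra.yaml
sudo service cassandra restart
nodetool version   # confirm which version the node is actually running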
Related
I am running my Hive query on an EMR cluster of 25 nodes, using r4.4xlarge instances.
When I run my query I get the error below.
Job Commit failed with exception 'org.apache.hadoop.hive.ql.metadata.HiveException(java.io.IOException: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Not Found (Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found; Request ID: FEAF40B78D086BEE; S3 Extended Request ID: yteHc4bRl1MrmVhqmnzm06rdzQNN8VcRwd4zqOa+rUY8m2HC2QTt9GoGR/Qu1wuJPILx4mchHRU=), S3 Extended Request ID: yteHc4bRl1MrmVhqmnzm06rdzQNN8VcRwd4zqOa+rUY8m2HC2QTt9GoGR/Qu1wuJPILx4mchHRU=)'
FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.tez.TezTask
/mnt/var/lib/hadoop/steps/s-10YQZ5Z5PRUVJ/./hive-script:617:in `<main>': Error executing cmd: /usr/share/aws/emr/scripts/hive-script "--base-path" "s3://us-east-1.elasticmapreduce/libs/hive/" "--hive-versions" "latest" "--run-hive-script" "--args" "-f" "s3://205067-pcfp-app-stepfun-s3appbucket-qa/2019-02-22_App/d77a6a82-26f4-4f06-a1ea-e83677256a55/01/DeltaOutPut/processing/Scripts/script.sql" (RuntimeError)
Command exiting with ret '1'
I have tried setting all kinds of Hive parameter combinations like the ones below (classification, property, value):
emrfs-site fs.s3.consistent.retryPolicyType exponential
emrfs-site fs.s3.consistent.metadata.tableName EmrFSMetadataAlt
emrfs-site fs.s3.consistent.metadata.write.capacity 300
emrfs-site fs.s3.consistent.metadata.read.capacity 600
emrfs-site fs.s3.consistent true
hive-site hive.exec.stagingdir .hive-staging
hive-site hive.tez.java.opts -Xmx47364m
hive-site hive.stats.fetch.column.stats true
hive-site hive.stats.fetch.partition.stats true
hive-site hive.vectorized.execution.enabled false
hive-site hive.vectorized.execution.reduce.enabled false
hive-site tez.am.resource.memory.mb 15000
hive-site hive.auto.convert.join false
hive-site hive.compute.query.using.stats true
hive-site hive.cbo.enable true
hive-site tez.task.resource.memory.mb 16000
But every time it failed.
I tried increasing the number of nodes / using bigger instances in the EMR cluster, but the result is still the same.
I also tried with and without Tez, but it still did not work for me.
Here is my sample query; I am just copying part of it:
insert into filediffPcfp.TableDelta
Select rgt.FILLER1,rgt.DUNSNUMBER,rgt.BUSINESSNAME,rgt.TRADESTYLENAME,rgt.REGISTEREDADDRESSINDICATOR
Please help me identify the issue.
Adding the full YARN logs:
2019-02-26 06:28:54,318 [INFO] [TezChild] |exec.FileSinkOperator|: Final Path: FS s3://205067-pcfp-app-stepfun-s3appbucket-qa/2019-02-26_App/d996dfaa-1a62-4062-9350-d0c2bd62e867/01/DeltaOutPut/processing/Delta/.hive-staging_hive_2019-02-26_06-15-00_804_541842212852799084-1/_tmp.-ext-10000/000000_1
2019-02-26 06:28:54,319 [INFO] [TezChild] |exec.FileSinkOperator|: Writing to temp file: FS s3://205067-pcfp-app-stepfun-s3appbucket-qa/2019-02-26_App/d996dfaa-1a62-4062-9350-d0c2bd62e867/01/DeltaOutPut/processing/Delta/.hive-staging_hive_2019-02-26_06-15-00_804_541842212852799084-1/_task_tmp.-ext-10000/_tmp.000000_1
2019-02-26 06:28:54,319 [INFO] [TezChild] |exec.FileSinkOperator|: New Final Path: FS s3://205067-pcfp-app-stepfun-s3appbucket-qa/2019-02-26_App/d996dfaa-1a62-4062-9350-d0c2bd62e867/01/DeltaOutPut/processing/Delta/.hive-staging_hive_2019-02-26_06-15-00_804_541842212852799084-1/_tmp.-ext-10000/000000_1
2019-02-26 06:28:54,681 [INFO] [TezChild] |exec.FileSinkOperator|: FS[11]: records written - 1
2019-02-26 06:28:54,877 [INFO] [TezChild] |exec.MapOperator|: MAP[0]: records read - 1000
2019-02-26 06:28:56,632 [INFO] [TezChild] |exec.MapOperator|: MAP[0]: records read - 10000
2019-02-26 06:29:13,301 [INFO] [TezChild] |exec.MapOperator|: MAP[0]: records read - 100000
2019-02-26 06:31:59,207 [INFO] [TezChild] |exec.MapOperator|: MAP[0]: records read - 1000000
2019-02-26 06:34:42,686 [INFO] [TaskHeartbeatThread] |task.TaskReporter|: Received should die response from AM
2019-02-26 06:34:42,686 [INFO] [TaskHeartbeatThread] |task.TaskReporter|: Asked to die via task heartbeat
2019-02-26 06:34:42,687 [INFO] [TaskHeartbeatThread] |task.TezTaskRunner2|: Attempting to abort attempt_1551161362408_0001_7_01_000000_1 due to an invocation of shutdownRequested
2019-02-26 06:34:42,687 [INFO] [TaskHeartbeatThread] |tez.TezProcessor|: Received abort
2019-02-26 06:34:42,687 [INFO] [TaskHeartbeatThread] |tez.TezProcessor|: Forwarding abort to RecordProcessor
2019-02-26 06:34:42,687 [INFO] [TaskHeartbeatThread] |tez.MapRecordProcessor|: Forwarding abort to mapOp: {} MAP
2019-02-26 06:34:42,687 [INFO] [TaskHeartbeatThread] |exec.MapOperator|: Received abort in operator: MAP
2019-02-26 06:34:42,705 [INFO] [TezChild] |s3.S3FSInputStream|: Encountered exception while reading '2019-02-26_App/d996dfaa-1a62-4062-9350-d0c2bd62e867/01/IncrFile/WB.ACTIVE.OCT17_01_OF_10.gz', will retry by attempting to reopen stream.
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.AbortedException:
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.internal.SdkFilterInputStream.abortIfNeeded(SdkFilterInputStream.java:53)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:81)
at com.amazon.ws.emr.hadoop.fs.s3n.InputStreamWithInfo.read(InputStreamWithInfo.java:173)
at com.amazon.ws.emr.hadoop.fs.s3.S3FSInputStream.read(S3FSInputStream.java:136)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at java.io.DataInputStream.read(DataInputStream.java:149)
at org.apache.hadoop.io.compress.DecompressorStream.getCompressedData(DecompressorStream.java:179)
at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:163)
at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:105)
at java.io.InputStream.read(InputStream.java:101)
at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:182)
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:218)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:176)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:255)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:48)
at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:360)
at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:79)
at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:33)
at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:116)
at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.next(TezGroupedSplitsInputFormat.java:151)
at org.apache.tez.mapreduce.lib.MRReaderMapred.next(MRReaderMapred.java:116)
at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:62)
Switch from Tez mode to MR. It should start working. Also remove all the Tez-related properties.
set hive.execution.engine=mr;
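For a single run you can also force the engine from the command line instead of editing hive-site.xml. A sketch, where script.sql is just a placeholder for a local copy of your HQL:

hive --hiveconf hive.execution.engine=mr -f script.sql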
Let me answer my own question.
The first and very important thing we noticed while running Hive jobs on EMR is that the STEP error is misleading; the "vertex failed" message will not point you in the correct direction.
So it is better to check the Hive logs.
Now, if the instance is terminated, we cannot log into the master instance and see the logs; in that case we have to look at the node application logs.
Here is how we can find those node logs.
Get the master instance ID (something like i-04d04d9a8f7d28fd1) and with that search under the nodes.
Then open the path below:
/applications/hive/user/hive/hive.log.gz
Here you can find the expected error.
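A sketch of pulling that log with the AWS CLI, assuming a log URI was configured for the cluster; the bucket name and IDs below are placeholders, and the path follows the layout described above:

aws s3 ls --recursive s3://<your-emr-log-bucket>/<cluster-id>/node/<master-instance-id>/applications/hive/
aws s3 cp s3://<your-emr-log-bucket>/<cluster-id>/node/<master-instance-id>/applications/hive/user/hive/hive.log.gz .
gunzip -c hive.log.gz | less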
We also have to look at the container logs for the failed nodes; the failed node details can be found on the master instance node:
hadooplogs/j-25RSD7FFOL5JB/node/i-03f8a646a7ae97aae/daemons/
These daemon node logs can be found only while the cluster is running; after terminating the cluster, EMR does not push these logs into the S3 log URI.
When I looked at them, I found the real reason why it was failing.
For me, this was the reason for the failure.
On checking the master instance's instance-controller logs, I saw that multiple core instances had gone into an unhealthy state:
2019-02-27 07:50:03,905 INFO Poller: InstanceJointStatusMap contains 21 entries (R:21):
i-0131b7a6abd0fb8e7 1541s R 1500s ig-28 ip-10-97-51-145.tr-fr-nonprod.aws-int.thomsonreuters.com I: 18s Y:U 81s c: 0 am: 0 H:R 0.6%Yarn unhealthy Reason : 1/1 local-dirs are bad: /mnt/yarn; 1/1 log-dirs are bad: /var/log/hadoop-yarn/containers
i-01672279d170dafd3 1539s R 1500s ig-28 ip-10-97-54-69.tr-fr-nonprod.aws-int.thomsonreuters.com I: 16s Y:R 79s c: 0 am:241664 H:R 0.7%
i-0227ac0f0932bd0b3 1539s R 1500s ig-28 ip-10-97-51-197.tr-fr-nonprod.aws-int.thomsonreuters.com I: 16s Y:R 79s c: 0 am:241664 H:R 4.1%
i-02355f335c190be40 1544s R 1500s ig-28 ip-10-97-52-150.tr-fr-nonprod.aws-int.thomsonreuters.com I: 22s Y:R 84s c: 0 am:241664 H:R 0.2%
i-024ed22b6affdd5ec 1540s R 1500s ig-28 ip-10-97-55-123.tr-fr-nonprod.aws-int.thomsonreuters.com I: 16s Y:U 79s c: 0 am: 0 H:R 0.6%Yarn unhealthy Reason : 1/1 local-dirs are bad: /mnt/yarn; 1/1 log-dirs are bad: /var/log/hadoop-yarn/containers
Also, after some time YARN blacklisted the core instances:
2019-02-27 07:46:39,676 INFO Poller: Determining health status for App Monitor: aws157.instancecontroller.apphealth.monitor.YarnMonitor
2019-02-27 07:46:39,688 INFO Poller: SlaveRecord i-0ac26bd7886fec338 changed state from RUNNING to BLACKLISTED
2019-02-27 07:47:13,695 INFO Poller: SlaveRecord i-0131b7a6abd0fb8e7 changed state from RUNNING to BLACKLISTED
2019-02-27 07:47:13,695 INFO Poller: Update SlaveRecordDbRow for i-0131b7a6abd0fb8e7 ip-10-97-51-145.tr-fr-nonprod.aws-int.thomsonreuters.com
2019-02-27 07:47:13,696 INFO Poller: SlaveRecord i-024ed22b6affdd5ec changed state from RUNNING to BLACKLISTED
2019-02-27 07:47:13,696 INFO Poller: Update SlaveRecordDbRow for i-024ed22b6affdd5ec ip-10-97-55-123.tr-fr-nonprod.aws-int.thomsonreuters.com
On checking the instance-controller logs on those nodes, I could see that /mnt got full due to job caching, and usage went beyond the threshold (90% by default).
Because of this, YARN marked those nodes unhealthy:
2019-02-27 07:40:52,231 INFO dsm-1: /mnt total 27633 MB free 2068 MB used 25565 MB
2019-02-27 07:40:52,231 INFO dsm-1: / total 100663 MB free 97932 MB used 2731 MB
2019-02-27 07:40:52,231 INFO dsm-1: cycle 17 /mnt/var/log freeSpaceMb: 2068/27633 MB freeRatio:0.07
2019-02-27 07:40:52,248 INFO dsm-1: /mnt/var/log stats :
-> In my dataset, the source table uses .gz compression. Since .gz-compressed files are not splittable, each file gets exactly one map task assigned to it, and since the map task decompresses the file under /mnt, this can also contribute to the issue.
-> Processing a large amount of data on EMR needs some Hive properties to be optimized. Below are a few optimization properties that can be set on the cluster to make the query run better.
Most important of all:
Increase the EBS volume size for the core instances.
The important point is that we have to increase the EBS volume for each core node, not only for the master, because the EBS volume is where /mnt gets mounted, not the root volume.
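A quick way to confirm this while a job is running (a sketch; run it over SSH on one of the core nodes) is to watch /mnt fill up compared to the root volume:

df -h / /mnt                  # one-off check of root vs. /mnt usage
watch -n 30 'df -h /mnt'      # keep an eye on /mnt while the Hive job runs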
This alone solved my problem, but the configuration below also helped me optimize the Hive jobs.
hive-site.xml
-------------
"hive.exec.compress.intermediate" : "true",
"hive.intermediate.compression.codec" : "org.apache.hadoop.io.compress.SnappyCodec",
"hive.intermediate.compression.type" : "BLOCK"
yarn-site.xml
-------------
"max-disk-utilization-per-disk-percentage" : "99"
And this resolved my issue permanently.
I hope someone will benefit from my answer.
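For reference, a sketch of how these same settings could be supplied at cluster creation through the EMR configuration API (my assumption of the equivalent JSON; note that the full YARN key is yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage):

aws emr create-cluster --configurations file://configurations.json ...   # other create-cluster flags omitted

# configurations.json
[
  {"Classification": "hive-site",
   "Properties": {
     "hive.exec.compress.intermediate": "true",
     "hive.intermediate.compression.codec": "org.apache.hadoop.io.compress.SnappyCodec",
     "hive.intermediate.compression.type": "BLOCK"}},
  {"Classification": "yarn-site",
   "Properties": {
     "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage": "99"}}
]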
I import data from the file logs.csv into an HBase table using the command hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=',' -Dimporttsv.columns="HBASE_ROW_KEY,log" logs hdfs://ip:9000/tmp/logs.csv. At the end of the command's execution I get the summary shown below, but there is no information about how long it took to load the data into HBase. Do you have any idea how I can check this?
2018-10-06 23:09:17,647 INFO [LocalJobRunner Map Task Executor #0] mapred.Task: Final Counters for attempt_local1534176268_0001_m_000001_0: Counters: 21
File System Counters
FILE: Number of bytes read=37162012
FILE: Number of bytes written=37835107
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=162892986
HDFS: Number of bytes written=0
HDFS: Number of read operations=4
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
Map-Reduce Framework
Map input records=175896
Map output records=175896
Input split bytes=106
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=18
Total committed heap usage (bytes)=2075918336
ImportTsv
Bad Lines=0
File Input Format Counters
Bytes Read=28671162
File Output Format Counters
Bytes Written=0
2018-10-06 23:09:17,647 INFO [LocalJobRunner Map Task Executor #0] mapred.LocalJobRunner: Finishing task: attempt_local1534176268_0001_m_000001_0
2018-10-06 23:09:17,647 INFO [Thread-37] mapred.LocalJobRunner: map task executor complete.
2018-10-06 23:09:18,191 INFO [main] mapreduce.Job: Job job_local1534176268_0001 completed successfully
2018-10-06 23:09:18,220 INFO [main] mapreduce.Job: Counters: 21
File System Counters
FILE: Number of bytes read=74323793
FILE: Number of bytes written=75670214
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=297114810
HDFS: Number of bytes written=0
HDFS: Number of read operations=7
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
Map-Reduce Framework
Map input records=1000000
Map output records=1000000
Input split bytes=212
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=55
Total committed heap usage (bytes)=4151836672
ImportTsv
Bad Lines=0
File Input Format Counters
Bytes Read=162892986
File Output Format Counters
Bytes Written=0
yarn-site.xml:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
LOGS
2018-10-16 09:39:53,350 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
2018-10-16 09:39:53,350 WARN org.apache.hadoop.yarn.server.nodemanager.NodeResourceMonitorImpl: org.apache.hadoop.yarn.server.nodemanager.NodeResourceMonitorImpl is interrupted. Exiting.
2018-10-16 09:39:53,351 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Public cache exiting
2018-10-16 09:39:53,352 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping NodeManager metrics system...
2018-10-16 09:39:53,353 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NodeManager metrics system stopped.
2018-10-16 09:39:53,353 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NodeManager metrics system shutdown complete.
2018-10-16 09:39:53,354 FATAL org.apache.hadoop.yarn.server.nodemanager.NodeManager: Error starting NodeManager
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.net.ConnectException: Call From myserver/myip to 0.0.0.0:8031 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.serviceStart(NodeStatusUpdaterImpl.java:238)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStart(NodeManager.java:369)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:637)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:684)
Caused by: java.net.ConnectException: Call From myserver/myip to 0.0.0.0:8031 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
at sun.reflect.GeneratedConstructorAccessor30.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:801)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:732)
at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1493)
at org.apache.hadoop.ipc.Client.call(Client.java:1435)
at org.apache.hadoop.ipc.Client.call(Client.java:1345)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:227)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
at com.sun.proxy.$Proxy73.registerNodeManager(Unknown Source)
at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.registerNodeManager(ResourceTrackerPBClientImpl.java:73)
at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:409)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:163)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:155)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:346)
at com.sun.proxy.$Proxy74.registerNodeManager(Unknown Source)
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.registerWithRM(NodeStatusUpdaterImpl.java:343)
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.serviceStart(NodeStatusUpdaterImpl.java:232)
... 6 more
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:685)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:788)
at org.apache.hadoop.ipc.Client$Connection.access$3500(Client.java:410)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1550)
at org.apache.hadoop.ipc.Client.call(Client.java:1381)
... 22 more
2018-10-16 09:39:53,358 INFO org.apache.hadoop.yarn.server.nodemanager.NodeManager: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NodeManager at myserver/myip
************************************************************/
It is a map/reduce job, so you can see the execution time in the YARN UI; its default port is 8088.
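If the job does not show up there, another simple option is to measure wall-clock time on the client by wrapping the command with time; a sketch reusing the command from the question:

time hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=',' -Dimporttsv.columns="HBASE_ROW_KEY,log" logs hdfs://ip:9000/tmp/logs.csv

You can also subtract the first and last timestamps printed in the job output to get the same number after the fact.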
I launched two m1.medium nodes on Amazon EC2 to execute my Pig script, but it looks like it failed at the first line (even before MapReduce started): raw = LOAD 's3n://uw-cse-344-oregon.aws.amazon.com/btc-2010-chunk-000' USING TextLoader as (line:chararray);
The error message I got:
2015-02-04 02:15:39,804 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2015-02-04 02:15:39,821 [JobControl] INFO org.apache.hadoop.mapred.JobClient - Default number of map tasks: null
2015-02-04 02:15:39,822 [JobControl] INFO org.apache.hadoop.mapred.JobClient - Setting default number of map tasks based on cluster size to : 20
... (omitted)
2015-02-04 02:18:40,955 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Ooops! Some job has failed! Specify -stop_on_failure if you want Pig to stop immediately on failure.
2015-02-04 02:18:40,956 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_201502040202_0002 has failed! Stop running all dependent jobs
2015-02-04 02:18:40,956 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2015-02-04 02:18:40,997 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to recreate exception from backed error: Error: Java heap space
2015-02-04 02:18:40,997 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
2015-02-04 02:18:40,997 [main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:
HadoopVersion   PigVersion      UserId   StartedAt             FinishedAt            Features
1.0.3           0.11.1.1-amzn   hadoop   2015-02-04 02:15:32   2015-02-04 02:18:40   GROUP_BY
Failed!
Failed Jobs:
JobId Alias Feature Message Outputs
job_201502050202_0002 ngroup,raw,triples,tt GROUP_BY,COMBINER Message: Job failed! Error - # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201502050202_0002_m_000022
Input(s):
Failed to read data from "s3n://uw-cse-344-oregon.aws.amazon.com/btc-2010-chunk-000"
Output(s):
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
I think the code should be fine, since I have successfully loaded other data with the same syntax before, and the link s3n://uw-cse-344-oregon.aws.amazon.com/btc-2010-chunk-000 looks valid. I suspect it might be related to some of my EC2 settings, but I am not sure how to investigate further or narrow down the problem. Does anyone have a clue?
"Java heap space" error message gives some clues. Your files seem to be quite large (~2GB). Make sure that you have enough memory for each task runner to read the data.
The problem has been solved for now by changing my nodes from m1.medium to m3.large; thanks for the good hint from #Nat, who pointed out the error message about Java heap space. I'll update with more details later.
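For the record, if someone wants to stay on the smaller instances, one thing worth trying (a sketch; I have not verified it on this exact AMI, the script name is a placeholder, and 1024m is just an example heap size) is to raise the map/reduce child JVM heap when launching the script:

pig -Dmapred.child.java.opts=-Xmx1024m myscript.pig

The same property can also be set at the top of the script with: set mapred.child.java.opts '-Xmx1024m';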
Running Hadoop version 1.2.1 on Ubuntu VMs
4 VMs:
1. hadoop-NN ( Name Node)
2. hadoop-snn ( Secondary Name Node)
3. hadoop-dn01 ( data node 1)
4. hadoop-dn02 ( data node 2)
All processes are started using start-all.sh.
I don't see edits happening on the Secondary Name Node, which means that the fsimage on the secondary is not getting updated.
The log file on the SecondaryNameNode shows the following error.
2015-02-04 13:16:12,083 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files = 50
2015-02-04 13:16:12,086 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files under construction = 0
2015-02-04 13:16:12,087 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Start loading edits file /tmp/hadoop-hadoop/dfs/namesecondary/current/edits
2015-02-04 13:16:12,088 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: EOF of /tmp/hadoop-hadoop/dfs/namesecondary/current/edits, reached end of edit log Number of transactions found: 8. Bytes read: 740
2015-02-04 13:16:12,088 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Edits file /tmp/hadoop-hadoop/dfs/namesecondary/current/edits of size 740 edits # 8 loaded in 0 seconds.
2015-02-04 13:16:12,088 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 0 Total time for transactions(ms): 0 Number of transactions batched in Syncs: 0 Number of syncs: 0 SyncTimes(ms): 0
2015-02-04 13:16:12,128 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: closing edit log: position=740, editlog=/tmp/hadoop-hadoop/dfs/namesecondary/current/edits
2015-02-04 13:16:12,128 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: close success: truncate to 740, editlog=/tmp/hadoop-hadoop/dfs/namesecondary/current/edits
2015-02-04 13:16:12,130 INFO org.apache.hadoop.hdfs.server.common.Storage: Image file /tmp/hadoop-hadoop/dfs/namesecondary/current/fsimage of size 5124 bytes saved in 0 seconds.
2015-02-04 13:16:12,229 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: closing edit log: position=4, editlog=/tmp/hadoop-hadoop/dfs/namesecondary/current/edits
2015-02-04 13:16:12,230 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: close success: truncate to 4, editlog=/tmp/hadoop-hadoop/dfs/namesecondary/current/edits
2015-02-04 13:16:12,485 INFO org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Posted URL hadoop-nn:50070putimage=1&port=50090&machine=0.0.0.0&token=-41:307905665:0:1423080068000:1423079764851&newChecksum=9bbe4619db3323211ed473f3f8acb7a9
2015-02-04 13:16:12,485 INFO org.apache.hadoop.hdfs.server.namenode.TransferFsImage: Opening connection to http://hadoop-nn:50070/getimage?putimage=1&port=50090&machine=0.0.0.0&token=-41:307905665:0:1423080068000:1423079764851&newChecksum=9bbe4619db3323211ed473f3f8acb7a9
2015-02-04 13:16:12,489 ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Exception in doCheckpoint:
2015-02-04 13:16:12,490 ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: java.io.FileNotFoundException: http://hadoop-nn:50070/getimage?putimage=1&port=50090&machine=0.0.0.0&token=-41:307905665:0:1423080068000:1423079764851&newChecksum=9bbe4619db3323211ed473f3f8acb7a9
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.getFileClient(TransferFsImage.java:177)
at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.putFSImage(SecondaryNameNode.java:462)
at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:525)
at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doWork(SecondaryNameNode.java:396)
at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.run(SecondaryNameNode.java:360)
at java.lang.Thread.run(Thread.java:745)
<property>
<name>dfs.secondary.http.address</name>
<value>hadoop-snn:50090</value>
</property>
Adding this property to hdfs-site.xml solves the issue.
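After adding it, a sketch of restarting the SecondaryNameNode and confirming that the checkpoint now completes (Hadoop 1.x daemon scripts; paths assume a default install):

$HADOOP_HOME/bin/hadoop-daemon.sh stop secondarynamenode
$HADOOP_HOME/bin/hadoop-daemon.sh start secondarynamenode
tail -f $HADOOP_HOME/logs/hadoop-*-secondarynamenode-*.log   # the FileNotFoundException on getimage should no longer appear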
I have a Pig script (not particularly more complex than any others I have built); before the job starts, it seems to loop on this for a long time:
2013-10-08 10:46:07,655 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 10
2013-10-08 10:46:07,659 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 10
2013-10-08 10:46:09,168 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 10
2013-10-08 10:46:09,168 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 10
2013-10-08 10:46:11,381 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 10
2013-10-08 10:46:11,381 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 10
2013-10-08 10:46:13,875 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 10
2013-10-08 10:46:13,875 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 10
2013-10-08 10:46:16,303 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 10
It repeats the above for around 4 minutes, when usually this step is completed in seconds. I have not been able to identify the cause other than by removing parts of the script, but the issue does not seem to be caused by any particular part of it. I have other scripts as complex as this one and have not had this problem. What could be causing the issue?
I can't say for certain without more information, but it appears that Pig is waiting for your cluster's JobTracker to start running the underlying Map/Reduce jobs generated by your script. There are numerous reasons why this could happen, such as running on a shared cluster that has run out of resources. You'll most likely have to look at your cluster's JobTracker and/or TaskTrackers to find the exact reason.
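A couple of quick checks for that on a Hadoop 1.x cluster like this one (a sketch):

hadoop job -list                      # shows jobs already submitted and their current state
# The JobTracker web UI (default port 50030) shows pending jobs and free map/reduce slots:
# http://<jobtracker-host>:50030/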