rabbitmq-server start failed - rabbitmq

I'm facing an issue with RabbitMQ server startup. rabbitmq-server start fails with these logs:
[noti] <0.147.0> Protocol 'inet_tcp': register/listen error: etimedout
[noti] <0.147.0>
[erro] <0.144.0> supervisor: {local,net_sup}
[erro] <0.144.0> errorContext: start_error
[erro] <0.144.0> reason: {'EXIT',nodistribution}
[erro] <0.144.0> offender: [{pid,undefined},
[erro] <0.144.0> {id,net_kernel},
[erro] <0.144.0> {mfargs,{net_kernel,start_link,
[erro] <0.144.0> [[rabbit_prelaunch_186103@localhost,
[erro] <0.144.0> shortnames],
[erro] <0.144.0> false,net_sup_dynamic]}},
[erro] <0.144.0> {restart_type,permanent},
[erro] <0.144.0> {shutdown,2000},
[erro] <0.144.0> {child_type,worker}]
[erro] <0.144.0>
[erro] <0.131.0>
[erro] <0.131.0> BOOT FAILED
[erro] <0.131.0> ===========
[erro] <0.131.0> Exception during startup:
[erro] <0.131.0>
[erro] <0.131.0> error:{badmatch,{error,{{shutdown,{failed_to_start_child,net_kernel,{'EXIT',nodistribution}}},{child,undefined,net_sup_dynamic,{erl_distribution,start_link,[[rabbit_prelaunch_186103@localhost,shortnames],false,net_sup_dynamic]},permanent,1000,supervisor,[erl_distribution]}}}}
[erro] <0.131.0>
[erro] <0.131.0> rabbit_prelaunch_dist:duplicate_node_check/1, line 78
[erro] <0.131.0> rabbit_prelaunch_dist:setup/1, line 23
[erro] <0.131.0> rabbit_prelaunch:do_run/0, line 115
[erro] <0.131.0> rabbit_prelaunch:run_prelaunch_first_phase/0, line 32
[erro] <0.131.0> supervisor:do_start_child_i/3, line 385
[erro] <0.131.0> supervisor:do_start_child/2, line 371
[erro] <0.131.0> supervisor:-start_children/2-fun-0-/3, line 355
[erro] <0.131.0> supervisor:children_map/4, line 1171
This is on CentOS, with RabbitMQ 3.9.11 and Erlang 23.3.4.
I have already tried rebooting the system and deleting the /var/lib/rabbitmq/mnesia files before restarting RabbitMQ, with no luck.
Has anyone faced this issue?
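If it helps anyone debugging the same thing: "Protocol 'inet_tcp': register/listen error: etimedout" is raised when the Erlang VM cannot register with epmd or bind its distribution listener, so epmd health and localhost resolution are worth ruling out first. A few checks, assuming a default epmd setup on port 4369:
epmd -names                 # is epmd up, and which node names are registered?
ss -tlnp | grep 4369        # is the epmd port bound, and by what?
getent hosts localhost      # does localhost resolve to 127.0.0.1?
ping -c1 "$(hostname -s)"   # does the machine's short hostname resolve?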

Related

RabbitMQ server crashing

Versions: RabbitMQ 3.9.11, Erlang 24.3.4.1.
We are using a Node.js process to publish and consume.
All of a sudden, RabbitMQ stops responding; even the management dashboard web page stops responding.
Usage, RAM, and queue sizes are all normal at the time of the crash.
The log at crash time shows:
2023-02-16 15:12:29.939986+05:30 [erro] <0.21497.4> supervisor: {<0.21497.4>,rabbit_channel_sup}
2023-02-16 15:12:29.939986+05:30 [erro] <0.21497.4> errorContext: shutdown_error
2023-02-16 15:12:29.939986+05:30 [erro] <0.21497.4> reason: killed
2023-02-16 15:12:29.939986+05:30 [erro] <0.21497.4> offender: [{pid,<0.21500.4>},
2023-02-16 15:12:29.939986+05:30 [erro] <0.21497.4> {id,channel},
2023-02-16 15:12:29.939986+05:30 [erro] <0.21497.4> {mfargs,
2023-02-16 15:12:29.939986+05:30 [erro] <0.21497.4> {rabbit_channel,start_link,
2023-02-16 15:12:29.939986+05:30 [erro] <0.21497.4> [1,<0.21491.4>,<0.21498.4>,<0.21491.4>,
2023-02-16 15:12:29.939986+05:30 [erro] <0.21497.4> <<"myserverip:59346 -> myserverip:5672">>,
2023-02-16 15:12:29.939986+05:30 [erro] <0.21497.4> rabbit_framing_amqp_0_9_1,
2023-02-16 15:12:29.939986+05:30 [erro] <0.21497.4> {user,<<"username">>,
2023-02-16 15:12:29.939986+05:30 [erro] <0.21497.4> [administrator],
2023-02-16 15:12:29.939986+05:30 [erro] <0.21497.4> [{rabbit_auth_backend_internal,none}]},
2023-02-16 15:12:29.939986+05:30 [erro] <0.21497.4> <<"/">>,
2023-02-16 15:12:29.939986+05:30 [erro] <0.21497.4> [{<<"publisher_confirms">>,bool,true},
2023-02-16 15:12:29.939986+05:30 [erro] <0.21497.4> {<<"exchange_exchange_bindings">>,bool,true},
2023-02-16 15:12:29.939986+05:30 [erro] <0.21497.4> {<<"basic.nack">>,bool,true},
2023-02-16 15:12:29.939986+05:30 [erro] <0.21497.4> {<<"consumer_cancel_notify">>,bool,true},
2023-02-16 15:12:29.939986+05:30 [erro] <0.21497.4> {<<"connection.blocked">>,bool,true},
2023-02-16 15:12:29.939986+05:30 [erro] <0.21497.4> {<<"authentication_failure_close">>,bool,true}],
2023-02-16 15:12:29.939986+05:30 [erro] <0.21497.4> <0.21492.4>,<0.21499.4>]}},
2023-02-16 15:12:29.939986+05:30 [erro] <0.21497.4> {restart_type,intrinsic},
2023-02-16 15:12:29.939986+05:30 [erro] <0.21497.4> {shutdown,70000},
2023-02-16 15:12:29.939986+05:30 [erro] <0.21497.4> {child_type,worker}]
2023-02-16 15:12:35.434023+05:30 [erro] <0.21684.4> supervisor: {<0.21684.4>,rabbit_channel_sup}
We are not getting any exception in our Node process; it keeps publishing.
Please help us diagnose and solve this issue.
Thank you.
We tried to diagnose with rabbitmq-diagnostics cluster_status, but we get no response from the server.
The server won't respond until we restart the service.
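Not a fix, but when the node stops answering cluster_status it can help to capture runtime state before restarting. These diagnostics exist in 3.9.x, though they may also hang if the VM is truly wedged:
rabbitmq-diagnostics maybe_stuck            # flags Erlang processes that appear stuck
rabbitmq-diagnostics runtime_thread_stats   # scheduler and thread activity breakdown
rabbitmqctl list_connections name state     # which AMQP connections are still alive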

Calico Felix-Typha connection cancelled on ARM64 EKS

I'm attempting to get Calico installed on a Graviton EKS cluster using the manifests listed here: https://docs.aws.amazon.com/eks/latest/userguide/calico.html
In order to run successfully on ARM64, I'm using a tigera ImageSet with the sha256 digests of the master-arm64 tags for the calico and tigera containers (where they exist). ref: https://projectcalico.docs.tigera.io/maintenance/image-options/imageset
apiVersion: operator.tigera.io/v1
kind: ImageSet
metadata:
  name: calico-master
spec:
  images:
    - image: "calico/apiserver"
      digest: "sha256:1a2bc0bad25eb95e77353d59e6ad9edc9d56aa9caebdcfbd027e8ddb7eb956b1"
    - image: "calico/cni"
      digest: "sha256:a257ee22e3d9e74d2b4c6362045147002104cea6101d3aaefa74661b91fea89b"
    - image: "calico/kube-controllers"
      digest: "sha256:fd101df470937e14033f602e5817e31e46933c6088a8bdc6fc80e43a1c9e011b"
    - image: "calico/node"
      digest: "sha256:8694683b9bd0d13caef2e67f1486ded0e843c810f1eb9d4c021a5ffdedd4af8d"
    - image: "calico/typha"
      digest: "sha256:174b0c47db4297623500cc044826bc259af28974cf5e0df4f84244e824cfda52"
    - image: "calico/pod2daemon-flexvol"
      digest: "sha256:a276db19af1cba49b7a032ee259e0e0f198575d8af27c9cadfebfe4d63bf15bf"
    - image: "calico/windows-upgrade"
      digest: "sha256:0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef"
    - image: "tigera/operator"
      digest: "sha256:01327468115202b72519fbe99344b2cc64cca37302d12840827190d97c2ba9cb"
    - image: "tigera/key-cert-provisioner"
      digest: "sha256:b8b4f0ae606626e029c77dc1c30199f8484f797d2bc52d8e484efc5b938725ad"
Tigera and Calico seem to launch fine, but my calico-node daemonset remains 0/1 forever due to the Felix-Typha connection:
kubectl -n calico-system get all
The logs for Typha (calico-typha deployment):
2022-03-03 15:27:34.692 [INFO][7] sync_server.go 368: Accepted from 192.168.40.212:49482 port=5473
2022-03-03 15:27:34.705 [INFO][7] sync_server.go 393: New connection connID=0x3e1 port=5473
2022-03-03 15:27:34.705 [INFO][7] sync_server.go 558: Per-connection goroutine started client=192.168.40.212:49482 connID=0x3e1
2022-03-03 15:27:34.705 [INFO][7] sync_server.go 636: Failed to read from client client=192.168.40.212:49482 connID=0x3e1 error=gob: name not registered for interface: "github.com/projectcalico/calico/typha/pkg/syncproto.MsgClientHello" thread="read"
2022-03-03 15:27:34.705 [INFO][7] sync_server.go 629: Read goroutine finished client=192.168.40.212:49482 connID=0x3e1 thread="read"
2022-03-03 15:27:34.705 [INFO][7] sync_server.go 666: Asked to stop by context. client=192.168.40.212:49482 connID=0x3e1
2022-03-03 15:27:34.705 [WARNING][7] sync_server.go 675: Failed to read client hello. client=192.168.40.212:49482 connID=0x3e1 error=context canceled
2022-03-03 15:27:34.705 [INFO][7] sync_server.go 545: Client connection shutting down. client=192.168.40.212:49482 connID=0x3e1
2022-03-03 15:27:34.705 [INFO][7] sync_server.go 554: Client connection shut down. client=192.168.40.212:49482 connID=0x3e1
2022-03-03 15:27:34.705 [INFO][7] sync_server.go 421: Connection handler finished error=context canceled
The logs for Felix (calico-node daemonset):
2022-03-03 15:33:40.427 [INFO][13853] status-reporter/startup.go 425: Early log level set to info
2022-03-03 15:33:40.428 [INFO][13853] status-reporter/config.go 60: Found FELIX_TYPHAK8SSERVICENAME=calico-typha
2022-03-03 15:33:40.428 [INFO][13853] status-reporter/config.go 60: Found FELIX_TYPHAK8SNAMESPACE=calico-system
2022-03-03 15:33:40.428 [INFO][13853] status-reporter/config.go 60: Found FELIX_TYPHAKEYFILE=/felix-certs/key.key
2022-03-03 15:33:40.428 [INFO][13853] status-reporter/config.go 60: Found FELIX_TYPHACERTFILE=/felix-certs/cert.crt
2022-03-03 15:33:40.428 [INFO][13853] status-reporter/config.go 60: Found FELIX_TYPHACAFILE=/typha-ca/caBundle
2022-03-03 15:33:40.428 [INFO][13853] status-reporter/config.go 60: Found FELIX_TYPHACN=typha-server
2022-03-03 15:33:40.447 [INFO][13853] status-reporter/discovery.go 163: Found ready Typha addresses. addrs=[]string{"192.168.193.138:5473", "192.168.36.92:5473"}
2022-03-03 15:33:40.447 [INFO][13853] status-reporter/discovery.go 166: Chose Typha to connect to. choice="192.168.36.92:5473"
2022-03-03 15:33:40.447 [INFO][13853] status-reporter/startsyncerclient.go 56: Connecting to Typha. addr="192.168.36.92:5473"
2022-03-03 15:33:40.447 [INFO][13853] status-reporter/sync_client.go 71: requiringTLS=true
2022-03-03 15:33:40.447 [INFO][13853] status-reporter/sync_client.go 200: Starting Typha client
2022-03-03 15:33:40.447 [INFO][13853] status-reporter/sync_client.go 71: requiringTLS=true
2022-03-03 15:33:40.448 [INFO][13853] status-reporter/tlsutils.go 39: Make certificate verifier requiredCN="typha-server" requiredURISAN="" roots=&x509.CertPool{byName:map[string][]int{"0,1*0(\x06\x03U\x04\x03\f!tigera-operator-signer#1646270114":[]int{0}}, lazyCerts:[]x509.lazyCert{x509.lazyCert{rawSubject:[]uint8{0x30, 0x2c, 0x31, 0x2a, 0x30, 0x28, 0x6, 0x3, 0x55, 0x4, 0x3, 0xc, 0x21, 0x74, 0x69, 0x67, 0x65, 0x72, 0x61, 0x2d, 0x6f, 0x70, 0x65, 0x72, 0x61, 0x74, 0x6f, 0x72, 0x2d, 0x73, 0x69, 0x67, 0x6e, 0x65, 0x72, 0x40, 0x31, 0x36, 0x34, 0x36, 0x32, 0x37, 0x30, 0x31, 0x31, 0x34}, getCert:(func() (*x509.Certificate, error))(0x6d99f0)}}, haveSum:map[x509.sum224]bool{x509.sum224{0xc0, 0x54, 0x82, 0x63, 0xb1, 0xf5, 0xe0, 0xda, 0x83, 0x69, 0x3f, 0x40, 0x66, 0xf7, 0x5a, 0x72, 0x3a, 0x4e, 0x4a, 0xe6, 0x1a, 0xfe, 0xb0, 0xa5, 0x5d, 0xd1, 0x2e, 0xdf}:true}}
2022-03-03 15:33:40.448 [INFO][13853] status-reporter/sync_client.go 252: Connecting to Typha. address="192.168.36.92:5473" connID=0x0 type="node-status"
2022-03-03 15:33:40.455 [INFO][13853] status-reporter/tlsutils.go 46: Verify certificate chain signing address="192.168.36.92:5473" connID=0x0 type="node-status"
2022-03-03 15:33:40.461 [INFO][13853] status-reporter/sync_client.go 267: Connected to Typha. address="192.168.36.92:5473" connID=0x0 type="node-status"
2022-03-03 15:33:40.461 [INFO][13853] status-reporter/sync_client.go 301: Started Typha client main loop address="192.168.36.92:5473" connID=0x0 type="node-status"
2022-03-03 15:33:40.462 [ERROR][13853] status-reporter/sync_client.go 293: Failed to read from server address="192.168.36.92:5473" connID=0x0 error=EOF type="node-status"
2022-03-03 15:33:40.462 [INFO][13853] status-reporter/sync_client.go 166: Typha client Context asked us to exit connID=0x0 type="node-status"
2022-03-03 15:33:40.462 [FATAL][13853] status-reporter/startsyncerclient.go 77: Connection to Typha failed
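One hypothesis worth checking (an assumption on my part, not something the logs prove): "gob: name not registered for interface" on the Typha side usually means the client and server are running incompatible builds of the syncproto package, i.e. the calico/node and calico/typha images in the ImageSet come from different master builds. Comparing the images actually deployed would confirm or rule that out:
kubectl -n calico-system get deploy calico-typha -o jsonpath='{.spec.template.spec.containers[0].image}'; echo
kubectl -n calico-system get ds calico-node -o jsonpath='{.spec.template.spec.containers[0].image}'; echo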

Hive ORC table loaded by insert from another Hive format cannot be selected from

After creating a new Hive external table in ORC format which is bucketed, and inserting into it from another table with the exact same schema but in Avro format (and non-bucketed), selecting from the new table produces many errors. I put the error stack here (some parts are repeated, and I had to delete from the end due to lack of space):
Status: Failed
Vertex failed, vertexName=Map 1, vertexId=vertex_1572541024266_0020_1_00, diagnostics=[Task failed, taskId=task_1572541024266_0020_1_00_000000, diagnostics=[TaskAttempt 0 failed, info=[Error: Error while running task ( failure ) : attempt_1572541024266_0020_1_00_000000_0:java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: java.io.IOException: Error reading file: /path/000003_0
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:211)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:168)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: java.io.IOException: Error reading file: /path/000003_0
at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:74)
at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:419)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:185)
... 14 more
Caused by: java.io.IOException: java.io.IOException: Error reading file: /path/000003_0
at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:365)
at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:79)
at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:33)
at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:116)
at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.next(TezGroupedSplitsInputFormat.java:151)
at org.apache.tez.mapreduce.lib.MRReaderMapred.next(MRReaderMapred.java:116)
at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:62)
... 16 more
Caused by: java.io.IOException: Error reading file: /path/000003_0
at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1191)
at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.ensureBatch(RecordReaderImpl.java:77)
at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.hasNext(RecordReaderImpl.java:93)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$OrcRecordReader.next(OrcInputFormat.java:238)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$OrcRecordReader.next(OrcInputFormat.java:213)
at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:360)
... 22 more
Caused by: java.io.EOFException: Read past EOF for compressed stream Stream for column 44 kind DATA position: 0 length: 0 range: 0 offset: 0 limit: 0
at org.apache.orc.impl.SerializationUtils.readFully(SerializationUtils.java:119)
at org.apache.orc.impl.SerializationUtils.readLongLE(SerializationUtils.java:102)
at org.apache.orc.impl.SerializationUtils.readDouble(SerializationUtils.java:98)
at org.apache.orc.impl.TreeReaderFactory$DoubleTreeReader.nextVector(TreeReaderFactory.java:762)
at org.apache.orc.impl.TreeReaderFactory$StructTreeReader.nextVector(TreeReaderFactory.java:1833)
at org.apache.orc.impl.TreeReaderFactory$ListTreeReader.nextVector(TreeReaderFactory.java:2001)
at org.apache.orc.impl.TreeReaderFactory$StructTreeReader.nextBatch(TreeReaderFactory.java:1815)
at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1184)
... 27 more
], TaskAttempt 1, TaskAttempt 2 and TaskAttempt 3 failed with stack traces identical to TaskAttempt 0 (repeated output omitted)
]], Vertex did not succeed due to OWN_TASK_FAILURE, failedTasks:1 killedTasks:0, Vertex vertex_1572541024266_0020_1_00 [Map 1] killed/failed due to:OWN_TASK_FAILURE]
DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:0
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Map 1, vertexId=vertex_1572541024266_0020_1_00 (the diagnostics repeat the stack traces above and are omitted)
Any suggestions on how to resolve this issue?
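For context: "Read past EOF for compressed stream ... length: 0" means the reader found a zero-length DATA stream for column 44, which usually indicates an empty or truncated ORC bucket file rather than a query problem. A way to check (the /path/000003_0 below is the placeholder path from the logs; substitute the real HDFS location):
hdfs dfs -ls /path                   # look for zero-length bucket files
hive --orcfiledump /path/000003_0    # dump stripe/stream metadata of the suspect file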

Apache Metrics Collector install failed while deploying Apache Ambari 2.5.1

I've tried to deploy Apache Ambari 2.5.1, and the Apache Metrics Collector install failed. I have researched this issue and cannot find the same problem reported anywhere on the Internet. Can you help me solve this? Thanks!
stderr:
Traceback (most recent call last):
  File "/var/lib/ambari-agent/cache/common-services/AMBARI_METRICS/0.1.0/package/scripts/metrics_collector.py", line 86, in <module>
    AmsCollector().execute()
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 329, in execute
    method(env)
  File "/var/lib/ambari-agent/cache/common-services/AMBARI_METRICS/0.1.0/package/scripts/metrics_collector.py", line 36, in install
    self.install_packages(env)
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 693, in install_packages
    retry_count=agent_stack_retry_count)
  File "/usr/lib/python2.6/site-packages/resource_management/core/base.py", line 155, in __init__
    self.env.run()
  File "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", line 160, in run
    self.run_action(resource, action)
  File "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", line 124, in run_action
    provider_action()
  File "/usr/lib/python2.6/site-packages/resource_management/core/providers/package/__init__.py", line 54, in action_install
    self.install_package(package_name, self.resource.use_repos, self.resource.skip_repos)
  File "/usr/lib/python2.6/site-packages/resource_management/core/providers/package/yumrpm.py", line 51, in install_package
    self.checked_call_with_retries(cmd, sudo=True, logoutput=self.get_logoutput())
  File "/usr/lib/python2.6/site-packages/resource_management/core/providers/package/__init__.py", line 86, in checked_call_with_retries
    return self._call_with_retries(cmd, is_checked=True, **kwargs)
  File "/usr/lib/python2.6/site-packages/resource_management/core/providers/package/__init__.py", line 98, in _call_with_retries
    code, out = func(cmd, **kwargs)
  File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 72, in inner
    result = function(command, **kwargs)
  File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 102, in checked_call
    tries=tries, try_sleep=try_sleep, timeout_kill_strategy=timeout_kill_strategy)
  File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 150, in _call_wrapper
    result = _call(command, **kwargs_copy)
  File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 303, in _call
    raise ExecutionFailed(err_msg, code, out, err)
resource_management.core.exceptions.ExecutionFailed: Execution of '/usr/bin/yum -d 0 -e 0 -y install ambari-metrics-collector' returned 1.
Error: Nothing to do
stdout:
2017-07-19 17:09:34,336 - Stack Feature Version Info: stack_version=2.6, version=None, current_cluster_version=None -> 2.6
2017-07-19 17:09:34,338 - Using hadoop conf dir: /usr/hdp/current/hadoop-client/conf
User Group mapping (user_group) is missing in the hostLevelParams
2017-07-19 17:09:34,341 - Group['livy'] {}
2017-07-19 17:09:34,343 - Group['spark'] {}
2017-07-19 17:09:34,343 - Group['hadoop'] {}
2017-07-19 17:09:34,344 - Group['users'] {}
2017-07-19 17:09:34,344 - User['hive'] {'gid': 'hadoop', 'fetch_nonlocal_groups': True, 'groups': ['hadoop']}
2017-07-19 17:09:34,345 - User['livy'] {'gid': 'hadoop', 'fetch_nonlocal_groups': True, 'groups': ['hadoop']}
2017-07-19 17:09:34,345 - User['zookeeper'] {'gid': 'hadoop', 'fetch_nonlocal_groups': True, 'groups': ['hadoop']}
2017-07-19 17:09:34,346 - User['spark'] {'gid': 'hadoop', 'fetch_nonlocal_groups': True, 'groups': ['hadoop']}
2017-07-19 17:09:34,347 - User['ams'] {'gid': 'hadoop', 'fetch_nonlocal_groups': True, 'groups': ['hadoop']}
2017-07-19 17:09:34,348 - User['ambari-qa'] {'gid': 'hadoop', 'fetch_nonlocal_groups': True, 'groups': ['users']}
2017-07-19 17:09:34,348 - User['tez'] {'gid': 'hadoop', 'fetch_nonlocal_groups': True, 'groups': ['users']}
2017-07-19 17:09:34,349 - User['hdfs'] {'gid': 'hadoop', 'fetch_nonlocal_groups': True, 'groups': ['hadoop']}
2017-07-19 17:09:34,350 - User['yarn'] {'gid': 'hadoop', 'fetch_nonlocal_groups': True, 'groups': ['hadoop']}
2017-07-19 17:09:34,350 - User['hcat'] {'gid': 'hadoop', 'fetch_nonlocal_groups': True, 'groups': ['hadoop']}
2017-07-19 17:09:34,351 - User['mapred'] {'gid': 'hadoop', 'fetch_nonlocal_groups': True, 'groups': ['hadoop']}
2017-07-19 17:09:34,352 - File['/var/lib/ambari-agent/tmp/changeUid.sh'] {'content': StaticFile('changeToSecureUid.sh'), 'mode': 0555}
2017-07-19 17:09:34,354 - Execute['/var/lib/ambari-agent/tmp/changeUid.sh ambari-qa /tmp/hadoop-ambari-qa,/tmp/hsperfdata_ambari-qa,/home/ambari-qa,/tmp/ambari-qa,/tmp/sqoop-ambari-qa'] {'not_if': '(test $(id -u ambari-qa) -gt 1000) || (false)'}
2017-07-19 17:09:34,361 - Skipping Execute['/var/lib/ambari-agent/tmp/changeUid.sh ambari-qa /tmp/hadoop-ambari-qa,/tmp/hsperfdata_ambari-qa,/home/ambari-qa,/tmp/ambari-qa,/tmp/sqoop-ambari-qa'] due to not_if
2017-07-19 17:09:34,362 - Group['hdfs'] {}
2017-07-19 17:09:34,363 - User['hdfs'] {'fetch_nonlocal_groups': True, 'groups': ['hadoop', 'hdfs']}
2017-07-19 17:09:34,363 - FS Type:
2017-07-19 17:09:34,364 - Directory['/etc/hadoop'] {'mode': 0755}
2017-07-19 17:09:34,395 - File['/usr/hdp/current/hadoop-client/conf/hadoop-env.sh'] {'content': InlineTemplate(...), 'owner': 'hdfs', 'group': 'hadoop'}
2017-07-19 17:09:34,396 - Directory['/var/lib/ambari-agent/tmp/hadoop_java_io_tmpdir'] {'owner': 'hdfs', 'group': 'hadoop', 'mode': 01777}
2017-07-19 17:09:34,421 - Initializing 2 repositories
2017-07-19 17:09:34,422 - Repository['HDP-2.6'] {'base_url': 'http://s3.amazonaws.com/dev.hortonworks.com/HDP/centos6/2.x/BUILDS/2.6.3.0-63', 'action': ['create'], 'components': ['HDP', 'main'], 'repo_template': '[{{repo_id}}]\nname={{repo_id}}\n{% if mirror_list %}mirrorlist={{mirror_list}}{% else %}baseurl={{base_url}}{% endif %}\n\npath=/\nenabled=1\ngpgcheck=0', 'repo_file_name': 'HDP', 'mirror_list': None}
2017-07-19 17:09:34,430 - File['/etc/yum.repos.d/HDP.repo'] {'content': '[HDP-2.6]\nname=HDP-2.6\nbaseurl=http://s3.amazonaws.com/dev.hortonworks.com/HDP/centos6/2.x/BUILDS/2.6.3.0-63\n\npath=/\nenabled=1\ngpgcheck=0'}
2017-07-19 17:09:34,432 - Repository['HDP-UTILS-1.1.0.21'] {'base_url': 'http://s3.amazonaws.com/dev.hortonworks.com/HDP-UTILS-1.1.0.21/repos/centos6', 'action': ['create'], 'components': ['HDP-UTILS', 'main'], 'repo_template': '[{{repo_id}}]\nname={{repo_id}}\n{% if mirror_list %}mirrorlist={{mirror_list}}{% else %}baseurl={{base_url}}{% endif %}\n\npath=/\nenabled=1\ngpgcheck=0', 'repo_file_name': 'HDP-UTILS', 'mirror_list': None}
2017-07-19 17:09:34,436 - File['/etc/yum.repos.d/HDP-UTILS.repo'] {'content': '[HDP-UTILS-1.1.0.21]\nname=HDP-UTILS-1.1.0.21\nbaseurl=http://s3.amazonaws.com/dev.hortonworks.com/HDP-UTILS-1.1.0.21/repos/centos6\n\npath=/\nenabled=1\ngpgcheck=0'}
2017-07-19 17:09:34,437 - Package['unzip'] {'retry_on_repo_unavailability': False, 'retry_count': 5}
2017-07-19 17:09:34,511 - Skipping installation of existing package unzip
2017-07-19 17:09:34,511 - Package['curl'] {'retry_on_repo_unavailability': False, 'retry_count': 5}
2017-07-19 17:09:34,519 - Skipping installation of existing package curl
2017-07-19 17:09:34,519 - Package['hdp-select'] {'retry_on_repo_unavailability': False, 'retry_count': 5}
2017-07-19 17:09:34,526 - Skipping installation of existing package hdp-select
2017-07-19 17:09:34,708 - Using hadoop conf dir: /usr/hdp/current/hadoop-client/conf
2017-07-19 17:09:34,712 - checked_call['hostid'] {}
2017-07-19 17:09:34,729 - checked_call returned (0, 'a8c02132')
2017-07-19 17:09:34,733 - Package['ambari-metrics-collector'] {'retry_on_repo_unavailability': False, 'retry_count': 5}
2017-07-19 17:09:34,809 - Installing package ambari-metrics-collector ('/usr/bin/yum -d 0 -e 0 -y install ambari-metrics-collector')
2017-07-19 17:09:37,830 - Execution of '/usr/bin/yum -d 0 -e 0 -y install ambari-metrics-collector' returned 1. Error: Nothing to do
2017-07-19 17:09:37,830 - Failed to install package ambari-metrics-collector. Executing '/usr/bin/yum clean metadata'
2017-07-19 17:09:38,131 - Retrying to install package ambari-metrics-collector after 30 seconds
Command failed after 1 tries
resource_management.core.exceptions.ExecutionFailed: Execution of '/usr/bin/yum -d 0 -e 0 -y install ambari-metrics-collector' returned 1.
Error: Nothing to do
Usually, this error means that the package is not available in the configured repos, is already installed, or similar. Try running the command manually in verbose mode, like
/usr/bin/yum -y install ambari-metrics-collector
and post the entire yum output.
Building on the analysis in Dmitriusan's answer: running yum manually may not help, since the issue is that the package is not available in any configured repo. You may have to either add a repo that contains the package under /etc/yum.repos.d/ and retry, or run the command from Dmitriusan's answer. Alternatively, manually download the RPM and install it with "rpm -i /path/to/package.rpm".
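A minimal sketch of that repo check on the failing host, assuming the HDP repo files shown in the stdout above (the dev.hortonworks.com URLs there may simply no longer serve the package):
yum clean all
yum repolist enabled                               # do the HDP and HDP-UTILS repos still resolve?
yum list available 'ambari-metrics*'               # is the package visible in any enabled repo?
/usr/bin/yum -y install ambari-metrics-collector   # retry without -d 0 -e 0 to see the full error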

Celery RabbitMQ broker failover connect issue

I have 3 RabbitMQ nodes in a cluster in HA mode. Each node is in a separate Docker container.
I am using Celery version 4 and kombu version 4.
I have used this command to set HA policy:
rabbitmqctl set_policy ha-all "" '{"ha-mode":"all","ha-sync-mode":"automatic"}'
Celery config looks like this:
CELERY = dict(
    broker_url=[
        'amqp://guest@rabbitmq1:5672',
        'amqp://guest@rabbitmq2:5672',
        'amqp://guest@rabbitmq3:5672',
    ],
    celery_queue_ha_policy='all',
    ...
)
Everything works fine until I stop the master RabbitMQ application, in order to test Celery's failover feature, using the command:
rabbitmqctl stop_app
Immediately after the RabbitMQ application is stopped, I start seeing the errors in the log below. The frequency of the log messages is very high and doesn't slow down with the number of attempts.
According to the logs, Celery tries to reconnect using the next failover, but gets interrupted by another attempt to reconnect to the stopped master node. The same thing happens over and over, like an infinite loop.
[2017-03-17 15:10:28,084: ERROR/MainProcess] consumer: Cannot connect to amqp://guest:**@rabbitmq1:5672//: [Errno 111] Connection refused.
Will retry using next failover.
[2017-03-17 15:10:28,300: DEBUG/MainProcess] Start from server, version: 0.9, properties: {'information': 'Licensed under the MPL. See http://www.rabbitmq.com/', 'product': 'RabbitMQ', 'copyright': 'Copyright (C) 2007-2016 Pivotal Software, Inc.', 'capabilities': {'exchange_exchange_bindings': True, 'connection.blocked': True, 'authentication_failure_close': True, 'direct_reply_to': True, 'basic.nack': True, 'per_consumer_qos': True, 'consumer_priorities': True, 'consumer_cancel_notify': True, 'publisher_confirms': True}, 'cluster_name': 'rabbit@rabbitmq1', 'platform': 'Erlang/OTP', 'version': '3.6.6'}, mechanisms: [u'PLAIN', u'AMQPLAIN'], locales: [u'en_US']
[2017-03-17 15:10:28,302: DEBUG/MainProcess] ^-- substep ok
[2017-03-17 15:10:28,303: DEBUG/MainProcess] | Consumer: Starting Mingle
[2017-03-17 15:10:28,303: INFO/MainProcess] mingle: searching for neighbors
[2017-03-17 15:10:28,303: DEBUG/MainProcess] using channel_id: 1
[2017-03-17 15:10:28,318: DEBUG/MainProcess] Channel open
[2017-03-17 15:10:28,470: WARNING/MainProcess] consumer: Connection to broker lost. Trying to re-establish the connection...
Traceback (most recent call last):
File "/usr/local/lib/python2.7/site-packages/celery/worker/consumer/consumer.py", line 318, in start
blueprint.start(self)
File "/usr/local/lib/python2.7/site-packages/celery/bootsteps.py", line 119, in start
step.start(parent)
File "/usr/local/lib/python2.7/site-packages/celery/worker/consumer/mingle.py", line 38, in start
self.sync(c)
File "/usr/local/lib/python2.7/site-packages/celery/worker/consumer/mingle.py", line 42, in sync
replies = self.send_hello(c)
File "/usr/local/lib/python2.7/site-packages/celery/worker/consumer/mingle.py", line 55, in send_hello
replies = inspect.hello(c.hostname, our_revoked._data) or {}
File "/usr/local/lib/python2.7/site-packages/celery/app/control.py", line 129, in hello
return self._request('hello', from_node=from_node, revoked=revoked)
File "/usr/local/lib/python2.7/site-packages/celery/app/control.py", line 81, in _request
timeout=self.timeout, reply=True,
File "/usr/local/lib/python2.7/site-packages/celery/app/control.py", line 436, in broadcast
limit, callback, channel=channel,
File "/usr/local/lib/python2.7/site-packages/kombu/pidbox.py", line 315, in _broadcast
serializer=serializer)
File "/usr/local/lib/python2.7/site-packages/kombu/pidbox.py", line 290, in _publish
serializer=serializer,
File "/usr/local/lib/python2.7/site-packages/kombu/messaging.py", line 181, in publish
exchange_name, declare,
File "/usr/local/lib/python2.7/site-packages/kombu/messaging.py", line 187, in _publish
channel = self.channel
File "/usr/local/lib/python2.7/site-packages/kombu/messaging.py", line 209, in _get_channel
channel = self._channel = channel()
File "/usr/local/lib/python2.7/site-packages/kombu/utils/functional.py", line 38, in __call__
value = self.__value__ = self.__contract__()
File "/usr/local/lib/python2.7/site-packages/kombu/messaging.py", line 224, in <lambda>
channel = ChannelPromise(lambda: connection.default_channel)
File "/usr/local/lib/python2.7/site-packages/kombu/connection.py", line 819, in default_channel
self.connection
File "/usr/local/lib/python2.7/site-packages/kombu/connection.py", line 802, in connection
self._connection = self._establish_connection()
File "/usr/local/lib/python2.7/site-packages/kombu/connection.py", line 757, in _establish_connection
conn = self.transport.establish_connection()
File "/usr/local/lib/python2.7/site-packages/kombu/transport/pyamqp.py", line 130, in establish_connection
conn.connect()
File "/usr/local/lib/python2.7/site-packages/amqp/connection.py", line 294, in connect
self.transport.connect()
File "/usr/local/lib/python2.7/site-packages/amqp/transport.py", line 120, in connect
self._connect(self.host, self.port, self.connect_timeout)
File "/usr/local/lib/python2.7/site-packages/amqp/transport.py", line 161, in _connect
self.sock.connect(sa)
File "/usr/local/lib/python2.7/socket.py", line 228, in meth
return getattr(self._sock,name)(*args)
error: [Errno 111] Connection refused
[2017-03-17 15:10:28,508: DEBUG/MainProcess] Closed channel #1
[2017-03-17 15:10:28,570: DEBUG/MainProcess] | Consumer: Restarting event loop...
[2017-03-17 15:10:28,572: DEBUG/MainProcess] | Consumer: Restarting Gossip...
[2017-03-17 15:10:28,575: DEBUG/MainProcess] | Consumer: Restarting Heart...
[2017-03-17 15:10:28,648: DEBUG/MainProcess] | Consumer: Restarting Control...
[2017-03-17 15:10:28,655: DEBUG/MainProcess] | Consumer: Restarting Tasks...
[2017-03-17 15:10:28,655: DEBUG/MainProcess] Canceling task consumer...
[2017-03-17 15:10:28,655: DEBUG/MainProcess] | Consumer: Restarting Mingle...
[2017-03-17 15:10:28,655: DEBUG/MainProcess] | Consumer: Restarting Events...
[2017-03-17 15:10:28,672: DEBUG/MainProcess] | Consumer: Restarting Connection...
[2017-03-17 15:10:28,673: DEBUG/MainProcess] | Consumer: Starting Connection
[2017-03-17 15:10:28,947: ERROR/MainProcess] consumer: Cannot connect to amqp://guest:**@rabbitmq1:5672//: [Errno 111] Connection refused.
Will retry using next failover.
[2017-03-17 15:10:29,345: DEBUG/MainProcess] Start from server, version: 0.9, properties: {'information': 'Licensed under the MPL. See http://www.rabbitmq.com/', 'product': 'RabbitMQ', 'copyright': 'Copyright (C) 2007-2016 Pivotal Software, Inc.', 'capabilities': {'exchange_exchange_bindings': True, 'connection.blocked': True, 'authentication_failure_close': True, 'direct_reply_to': True, 'basic.nack': True, 'per_consumer_qos': True, 'consumer_priorities': True, 'consumer_cancel_notify': True, 'publisher_confirms': True}, 'cluster_name': 'rabbit@rabbitmq1', 'platform': 'Erlang/OTP', 'version': '3.6.6'}, mechanisms: [u'PLAIN', u'AMQPLAIN'], locales: [u'en_US']
[2017-03-17 15:10:29,506: INFO/MainProcess] Connected to amqp://guest:**@rabbitmq2:5672//
[2017-03-17 15:10:29,535: DEBUG/MainProcess] ^-- substep ok
[2017-03-17 15:10:29,569: DEBUG/MainProcess] | Consumer: Starting Events
[2017-03-17 15:10:29,682: ERROR/MainProcess] consumer: Cannot connect to amqp://guest:**@rabbitmq1:5672//: [Errno 111] Connection refused.
Will retry using next failover.
[2017-03-17 15:10:29,740: DEBUG/MainProcess] Start from server, version: 0.9, properties: {'information': 'Licensed under the MPL. See http://www.rabbitmq.com/', 'product': 'RabbitMQ', 'copyright': 'Copyright (C) 2007-2016 Pivotal Software, Inc.', 'capabilities': {'exchange_exchange_bindings': True, 'connection.blocked': True, 'authentication_failure_close': True, 'direct_reply_to': True, 'basic.nack': True, 'per_consumer_qos': True, 'consumer_priorities': True, 'consumer_cancel_notify': True, 'publisher_confirms': True}, 'cluster_name': 'rabbit@rabbitmq1', 'platform': 'Erlang/OTP', 'version': '3.6.6'}, mechanisms: [u'PLAIN', u'AMQPLAIN'], locales: [u'en_US']
[2017-03-17 15:10:29,768: DEBUG/MainProcess] ^-- substep ok
[2017-03-17 15:10:29,770: DEBUG/MainProcess] | Consumer: Starting Mingle
[2017-03-17 15:10:29,770: INFO/MainProcess] mingle: searching for neighbors
[2017-03-17 15:10:29,771: DEBUG/MainProcess] using channel_id: 1
[2017-03-17 15:10:29,795: DEBUG/MainProcess] Channel open
[2017-03-17 15:10:29,874: WARNING/MainProcess] consumer: Connection to broker lost. Trying to re-establish the connection...
Traceback (most recent call last): (identical to the traceback above, omitted)
[2017-03-17 15:10:29,887: DEBUG/MainProcess] Closed channel #1
[2017-03-17 15:10:29,907: DEBUG/MainProcess] | Consumer: Restarting event loop...
[2017-03-17 15:10:29,908: DEBUG/MainProcess] | Consumer: Restarting Gossip...
[2017-03-17 15:10:29,908: DEBUG/MainProcess] | Consumer: Restarting Heart...
[2017-03-17 15:10:29,908: DEBUG/MainProcess] | Consumer: Restarting Control...
[2017-03-17 15:10:29,909: DEBUG/MainProcess] | Consumer: Restarting Tasks...
[2017-03-17 15:10:29,910: DEBUG/MainProcess] Canceling task consumer...
[2017-03-17 15:10:29,911: DEBUG/MainProcess] | Consumer: Restarting Mingle...
[2017-03-17 15:10:29,912: DEBUG/MainProcess] | Consumer: Restarting Events...
[2017-03-17 15:10:29,953: DEBUG/MainProcess] | Consumer: Restarting Connection...
[2017-03-17 15:10:29,954: DEBUG/MainProcess] | Consumer: Starting Connection
[2017-03-17 15:10:30,036: ERROR/MainProcess] consumer: Cannot connect to amqp://guest:**@rabbitmq1:5672//: [Errno 111] Connection refused.
Will retry using next failover.
Unfortunately, the Celery documentation doesn't say much about failover.
It's definitely a bug; I have created an issue on GitHub: https://github.com/celery/celery/issues/3921
Thanks to George Psarakis, I have managed to avoid the bug using the --without-mingle flag for Celery workers, e.g.:
celery worker -A app.tasks -l debug --without-mingle
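To confirm the workaround behaves, one option (my suggestion, not from the original thread) is to repeat the stop_app test and check that the workers settle on a surviving broker:
rabbitmqctl stop_app                # run on the current master node
celery -A app.tasks inspect ping    # workers should reply 'pong' via a failover broker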