Stuck adding a host to HDP cluster - Ambari

I have an HDP cluster with 2 nodes. We had a problem and one host's heartbeat was lost with no chance of recovering it, as the machine had failed, so we finally re-installed Ubuntu and configured it again.
It was impossible to recover the host in Ambari (trying to give it the same FQDN, IP, configuration, ...), so I changed the hostname and tried to add it as a completely new host.
I was able to complete installation step 2 with a 'SUCCESS' status, but then it gets stuck with the message 'Please wait while the hosts are being checked for potential problems...' for hours.
I attach the ambari-server log, ambari-agent log, registration log, and an error image.
Do you have any idea what is happening and how to solve it?
Thanks.
ambari-server.log
12 jun 2018 09:34:55,667 WARN [ambari-action-scheduler] ExecutionCommandWrapper:185 - Unable to lookup the cluster by ID; assuming that there is no cluster and therefore no configs for this execution command: Cluster not found, clusterName=clusterID=-1
12 jun 2018 09:34:56,675 WARN [ambari-action-scheduler] ExecutionCommandWrapper:185 - Unable to lookup the cluster by ID; assuming that there is no cluster and therefore no configs for this execution command: Cluster not found, clusterName=clusterID=-1
12 jun 2018 09:34:57,683 WARN [ambari-action-scheduler] ExecutionCommandWrapper:185 - Unable to lookup the cluster by ID; assuming that there is no cluster and therefore no configs for this execution command: Cluster not found, clusterName=clusterID=-1
ambari-agent.log
INFO 2018-06-12 09:00:16,026 Controller.py:512 - Registration response from bigdata was OK
INFO 2018-06-12 09:00:16,026 Controller.py:517 - Resetting ActionQueue...
INFO 2018-06-12 09:00:26,035 Controller.py:304 - Heartbeat (response id = 0) with server is running...
INFO 2018-06-12 09:00:26,036 Controller.py:311 - Building heartbeat message
INFO 2018-06-12 09:00:26,037 Heartbeat.py:90 - Adding host info/state to heartbeat message.
INFO 2018-06-12 09:00:26,099 logger.py:75 - Testing the JVM's JCE policy to see it if supports an unlimited key length.
INFO 2018-06-12 09:00:26,168 Hardware.py:176 - Some mount points were ignored: /dev, /run, /, /dev/shm, /run/lock, /sys/fs/cgroup, /boot, /run/user/1000, /run/user/0, /run/user/994
INFO 2018-06-12 09:00:26,169 Controller.py:320 - Sending Heartbeat (id = 0)
INFO 2018-06-12 09:00:26,174 Controller.py:332 - Heartbeat response received (id = 1)
INFO 2018-06-12 09:00:26,174 Controller.py:341 - Heartbeat interval is 10 seconds
INFO 2018-06-12 09:00:26,174 Controller.py:377 - Updating configurations from heartbeat
INFO 2018-06-12 09:00:26,174 Controller.py:386 - Adding cancel/execution commands
INFO 2018-06-12 09:00:26,174 Controller.py:403 - Adding recovery commands
INFO 2018-06-12 09:00:26,174 Controller.py:471 - Waiting 9.9 for next heartbeat
INFO 2018-06-12 09:00:36,075 Controller.py:478 - Wait for next heartbeat over
registration log
INFO 2018-06-12 09:34:38,350 Controller.py:512 - Registration response from bigdata was OK
INFO 2018-06-12 09:34:38,350 Controller.py:517 - Resetting ActionQueue...
', None)
Connection to master.es closed.
SSH command execution finished
host=master.es, exitcode=0
Command end time 2018-06-12 09:34:38
Registering with the server...
Registering with the server...

Do an ambari-agent reset on all nodes.
Change the cluster name.
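A minimal sketch of that reset on each node, assuming the Ambari server host is the master.es seen in your registration log (replace it with your Ambari server's FQDN):
# run on every node: stop the agent, point it back at the Ambari server, start it again
sudo ambari-agent stop
sudo ambari-agent reset master.es
sudo ambari-agent start
After the agents re-register, retry the add-host wizard.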

Related

GCS Connector fails when enabling SASL, SSL, or SASL_SSL

I have been able to successfully connect the GCS connector without SASL or SSL enabled. When I enable SASL and SSL, connect-standalone does not seem to be able to communicate with the brokers.
The problem appears to be with the gcs-sink-license-manager. This is what I have found in the logs, but they aren't very helpful for actually figuring out what the issue is.
LOGS
[2018-12-19 16:29:05,645] INFO [AdminClient clientId=gcs-sink-license-manager] Metadata update failed (org.apache.kafka.clients.admin.internals.AdminMetadataManager:238)
org.apache.kafka.common.errors.TimeoutException: Timed out waiting to send the call.
[2018-12-19 16:29:05,647] ERROR WorkerConnector{id=gcs-sink} Error while starting connector (org.apache.kafka.connect.runtime.WorkerConnector:119)
org.apache.kafka.connect.errors.ConnectException: Timed out while checking for or creating topic(s) '_confluent-command'. This could indicate a connectivity issue, unavailable topic partitions, or if this is your first use of the topic it may have taken too long to create.
at org.apache.kafka.connect.util.TopicAdmin.createTopics(TopicAdmin.java:251)
at io.confluent.license.LicenseStore$1.run(LicenseStore.java:159)
at org.apache.kafka.connect.util.KafkaBasedLog.start(KafkaBasedLog.java:126)
at io.confluent.license.LicenseStore.start(LicenseStore.java:187)
at io.confluent.license.LicenseManager.<init>(LicenseManager.java:42)
at io.confluent.connect.gcs.GcsSinkConnector.checkLicense(GcsSinkConnector.java:80)
at io.confluent.connect.gcs.GcsSinkConnector.start(GcsSinkConnector.java:67)
at org.apache.kafka.connect.runtime.WorkerConnector.doStart(WorkerConnector.java:111)
at org.apache.kafka.connect.runtime.WorkerConnector.start(WorkerConnector.java:136)
at org.apache.kafka.connect.runtime.WorkerConnector.transitionTo(WorkerConnector.java:195)
at org.apache.kafka.connect.runtime.Worker.startConnector(Worker.java:241)
at org.apache.kafka.connect.runtime.standalone.StandaloneHerder.startConnector(StandaloneHerder.java:297)
at org.apache.kafka.connect.runtime.standalone.StandaloneHerder.putConnectorConfig(StandaloneHerder.java:206)
at org.apache.kafka.connect.cli.ConnectStandalone.main(ConnectStandalone.java:107)
Caused by: org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node assignment.
[2018-12-19 16:29:05,649] INFO Finished creating connector gcs-sink (org.apache.kafka.connect.runtime.Worker:257)
[2018-12-19 16:29:05,650] INFO Skipping reconfiguration of connector gcs-sink since it is not running (org.apache.kafka.connect.runtime.standalone.StandaloneHerder:329)
[2018-12-19 16:29:05,652] INFO Created connector gcs-sink (org.apache.kafka.connect.cli.ConnectStandalone:104)
Connector Properties
connector.class="io.confluent.connect.gcs.GcsSinkConnector"
storage.class="io.confluent.connect.gcs.storage.GcsStorage"
bootstrap.servers=kafka1:19092
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=false
value.converter.schemas.enable=false
security.protocol=SASL_PLAINTEXT
sasl.mechanism=PLAIN
offset.storage.file.filename=/tmp/connect.offsets
offset.flush.interval.ms=10000
plugin.path=/usr/share/java,/usr/share/confluent-hub-components
gcs.sasl.properties
#Connector
format.class=io.confluent.connect.gcs.format.json.JsonFormat
partitioner.class=io.confluent.connect.storage.partitioner.DefaultPartitioner
flush.size=3
# confluent.license=
#GCS
name=gcs-sink
connector.class=io.confluent.connect.gcs.GcsSinkConnector
gcs.bucket.name=kafka-bucket-4c
gcs.part.size=5242880
gcs.credentials.path=/usr/share/assets/gcs-key.json
confluent.topic.bootstrap.servers=kafka1:19092
topics=sandbox
confluent.topic.replication.factor=1
#Storage
storage.class=io.confluent.connect.gcs.storage.GcsStorage
client.id=gcs-standalone-sink
# Sink authentication settings
consumer.log4j.root.loglevel=DEBUG
consumer.bootstrap.servers=kafka1:19092
consumer.sasl.mechanism=PLAIN
consumer.security.protocol=SASL_PLAINTEXT
consumer.ssl.endpoint.identification.algorithm=
Dockerfile
FROM confluentinc/cp-kafka-connect
ADD assets /usr/share/assets
# ENV CONNECT_OPTS "-Djava.security.auth.login.config=/usr/share/assets/kafka_admin_account.conf -Djavax.net.ssl.trustStore=/usr/share/assets/secrets/kafka.client.truststore.jks -Djavax.net.ssl.trustStorePassword=changeit"
ENV KAFKA_OPTS "-Djava.security.auth.login.config=/usr/share/assets/secrets/kafka_admin_account.conf -Djavax.net.debug=all"
ENV CONNECT_OPTS "-Djava.security.auth.login.config=/usr/share/assets/secrets/kafka_admin_account.conf -Djavax.net.debug=all"
COPY assets/secrets/cacerts /usr/lib/jvm/zulu-8-amd64/jre/lib/security/cacerts
CMD ["/bin/bash", "-c", "connect-standalone ${CONNECT_PROPS} ${GCS_PROPS}"]
docker-compose file
kafka1:
  image: company-kafka-secure
  # build: ./
  depends_on:
    - zookeeper
  ports:
    - 19091:19091
  environment:
    KAFKA_BROKER_ID: 1
    KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
    KAFKA_ADVERTISED_LISTENERS: SASL_PLAINTEXT://kafka1:19092,EXT://localhost:19091
    KAFKA_LISTENERS: SASL_PLAINTEXT://:19092,EXT://:19091
    KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: SASL_PLAINTEXT:SASL_PLAINTEXT,EXT:SASL_PLAINTEXT
    KAFKA_SASL_MECHANISM_INTER_BROKER_PROTOCOL: PLAIN
    KAFKA_SECURITY_INTER_BROKER_PROTOCOL: SASL_PLAINTEXT
    KAFKA_SASL_ENABLED_MECHANISMS: PLAIN
    KAFKA_ZOOKEEPER_CONNECTION_TIMEOUT_MS: 6000
    ZOOKEEPER_SASL_ENABLED: "false"
    KAFKA_AUTHORIZER_CLASS_NAME: com.us.digital.kafka.security.authorization.KafkaAuthorizer
    CONFLUENT_METRICS_ENABLE: "false"
  volumes:
    - ./secrets:/etc/kafka/secrets
  networks:
    - message_hub
kafka_gcs_connect:
  build: ./kafka-connect
  ports:
    - 28082:28082
  depends_on:
    - kafka1
    - kafka3
    - kafka2
    - zookeeper
  environment:
    CONNECT_PROPS: /usr/share/assets/connect-standalone.sasl.properties
    CONNECT_REST_PORT: 28082
    GCS_PROPS: /usr/share/assets/gcs.sasl.properties
  networks:
    - message_hub
CONNECT_BOOTSTRAP_SERVERS=kafka1:19092,kafka2:29092,kafka3:39092
CONNECT_CONFLUENT_TOPIC_BOOTSTRAP_SERVERS=kafka1:19092,kafka2:29092,kafka3:39092
CONNECT_CONFLUENT_LICENSE=
CONNECT_KEY_CONVERTER=org.apache.kafka.connect.json.JsonConverter
CONNECT_VALUE_CONVERTER=org.apache.kafka.connect.json.JsonConverter
CONNECT_KEY_CONVERTER_SCHEMAS_ENABLE=false
CONNECT_VALUE_CONVERTER_SCHEMAS_ENABLE=false
CONNECT_CONFIG_STORAGE_TOPIC=connect-config
CONNECT_OFFSET_STORAGE_TOPIC=connect-offsets
CONNECT_STATUS_STORAGE_TOPIC=connect-status
CONNECT_REPLICATION_FACTOR=1
CONNECT_CONFIG_STORAGE_REPLICATION_FACTOR=1
CONNECT_OFFSET_STORAGE_REPLICATION_FACTOR=1
CONNECT_STATUS_STORAGE_REPLICATION_FACTOR=1
CONNECT_SECURITY_PROTOCOL=SASL_PLAINTEXT
CONNECT_SASL_MECHANISM=PLAIN
CONNECT_SSL_ENDPOINT_IDENTIFICATION_ALGORITHM=
CONNECT_CONSUMER_BOOTSTRAP_SERVERS=kafka1:19092,kafka2:29092,kafka3:39092
CONNECT_CONSUMER_SECURITY_PROTOCOL=SASL_PLAINTEXT
CONNECT_CONSUMER_SSL_ENDPOINT_IDENTIFICATION_ALGORITHM=
CONNECT_CONSUMER_SASL_MECHANISM=PLAIN
CONNECT_GROUP_ID=gcs-kafka-connector
CONNECT_INTERNAL_KEY_CONVERTER=org.apache.kafka.connect.json.JsonConverter
CONNECT_INTERNAL_VALUE_CONVERTER=org.apache.kafka.connect.json.JsonConverter
CONNECT_REST_PORT=28082
CONNECT_PLUGIN_PATH=/usr/share/java,/usr/share/confluent-hub-components
KAFKA_OPTS=-Djava.security.auth.login.config=/usr/share/assets/kafka_admin_account.conf
Here are all of the properties I found I needed to get SASL working with the GCS connector.
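For completeness, the JAAS file referenced by KAFKA_OPTS (kafka_admin_account.conf) would normally contain a KafkaClient login section along these lines. This is only a sketch; the username and password shown are placeholders, not values from the setup above:
KafkaClient {
  org.apache.kafka.common.security.plain.PlainLoginModule required
  username="connect_user"
  password="connect_password";
};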

YARN reports a Flink job as FINISHED and SUCCEEDED when the Flink job failed

I am running a Flink job on YARN; we use "flink run" on the command line to submit our job to YARN. One day we had an exception in the Flink job, and as we didn't enable the Flink restart strategy it simply failed. But we eventually found that the job status in the YARN application list was "SUCCEEDED", which we expected to be "FAILED".
Flink CLI log:
06/12/2018 03:13:37 FlatMap (getTagStorageMapper.flatMap)(23/32) switched to CANCELED
06/12/2018 03:13:37 GroupReduce (ResultReducer.reduceGroup)(31/32) switched to CANCELED
06/12/2018 03:13:37 FlatMap (SubClassEDFJoinMapper.flatMap)(29/32) switched to CANCELED
06/12/2018 03:13:37 CHAIN DataSource (SubClassInventory.AvroInputFormat.createInput) -> FlatMap (SubClassInventoryMapper.flatMap)(27/32) switched to CANCELED
06/12/2018 03:13:37 GroupReduce (OutputReducer.reduceGroup)(28/32) switched to CANCELED
06/12/2018 03:13:37 CHAIN DataSource (SubClassInventory.AvroInputFormat.createInput) -> FlatMap (BIMBQMInstrumentMapper.flatMap)(27/32) switched to CANCELED
06/12/2018 03:13:37 GroupReduce (BIMBQMGovCorpReduce.reduceGroup)(30/32) switched to CANCELED
06/12/2018 03:13:37 FlatMap (BIMBQMEVMJoinMapper.flatMap)(32/32) switched to CANCELED
06/12/2018 03:13:37 Job execution switched to status FAILED.
No JobSubmissionResult returned, please make sure you called ExecutionEnvironment.execute()
2018-06-12 03:13:37,625 INFO org.apache.flink.yarn.YarnClusterClient - Sending shutdown request to the Application Master
2018-06-12 03:13:37,625 INFO org.apache.flink.yarn.YarnClusterClient - Start application client.
2018-06-12 03:13:37,630 INFO org.apache.flink.yarn.ApplicationClient - Notification about new leader address akka.tcp://flink@ip-10-97-46-149.tr-fr-nonprod.aws-int.thomsonreuters.com:45663/user/jobmanager with session ID 00000000-0000-0000-0000-000000000000.
2018-06-12 03:13:37,632 INFO org.apache.flink.yarn.ApplicationClient - Sending StopCluster request to JobManager.
2018-06-12 03:13:37,633 INFO org.apache.flink.yarn.ApplicationClient - Received address of new leader akka.tcp://flink@ip-10-97-46-149.tr-fr-nonprod.aws-int.thomsonreuters.com:45663/user/jobmanager with session ID 00000000-0000-0000-0000-000000000000.
2018-06-12 03:13:37,634 INFO org.apache.flink.yarn.ApplicationClient - Disconnect from JobManager null.
2018-06-12 03:13:37,635 INFO org.apache.flink.yarn.ApplicationClient - Trying to register at JobManager akka.tcp://flink@ip-10-97-46-149.tr-fr-nonprod.aws-int.thomsonreuters.com:45663/user/jobmanager.
2018-06-12 03:13:37,688 INFO org.apache.flink.yarn.ApplicationClient - Successfully registered at the ResourceManager using JobManager Actor[akka.tcp://flink@ip-10-97-46-149.tr-fr-nonprod.aws-int.thomsonreuters.com:45663/user/jobmanager#182802345]
2018-06-12 03:13:38,648 INFO org.apache.flink.yarn.ApplicationClient - Sending StopCluster request to JobManager.
2018-06-12 03:13:39,480 INFO org.apache.flink.yarn.YarnClusterClient - Application application_1528772982594_0001 finished with state FINISHED and final state SUCCEEDED at 1528773218662
2018-06-12 03:13:39,480 INFO org.apache.flink.yarn.YarnClusterClient - YARN Client is shutting down
2018-06-12 03:13:39,582 INFO org.apache.flink.yarn.ApplicationClient - Stopped Application client.
2018-06-12 03:13:39,583 INFO org.apache.flink.yarn.ApplicationClient - Disconnect from JobManager Actor[akka.tcp://flink@ip-10-97-46-149.tr-fr-nonprod.aws-int.thomsonreuters.com:45663/user/jobmanager#182802345].
Flink job manager Log:
FlatMap (BIMBQMEVMJoinMapper.flatMap) (32/32) (67a002e07fe799c1624a471340c8cf9d) switched from CANCELING to CANCELED.
Try to restart or fail the job Flink Java Job at Tue Jun 12 03:13:17 UTC 2018 (1086cedb3617feeee8aace29a7fc6bd0) if no longer possible.
Requesting new TaskManager container with 8192 megabytes memory. Pending requests: 1
Job Flink Java Job at Tue Jun 12 03:13:17 UTC 2018 (1086cedb3617feeee8aace29a7fc6bd0) switched from state FAILING to FAILED.
Could not restart the job Flink Java Job at Tue Jun 12 03:13:17 UTC 2018 (1086cedb3617feeee8aace29a7fc6bd0) because the restart strategy prevented it.
Unregistered task manager ip-10-97-44-186/10.97.44.186. Number of registered task managers 31. Number of available slots 31
Stopping JobManager with final application status SUCCEEDED and diagnostics: Flink YARN Client requested shutdown
Shutting down cluster with status SUCCEEDED : Flink YARN Client requested shutdown
Unregistering application from the YARN Resource Manager
Waiting for application to be successfully unregistered.
Can anybody help me understand why YARN says my Flink job "SUCCEEDED"?
The application status reported in YARN does not reflect the status of the executed job but the status of the Flink cluster, since that is what the YARN application is. Thus, the final status of the YARN application only depends on whether the Flink cluster shut down properly or not. Put differently, if a job fails, that does not necessarily mean that the Flink cluster failed. These are two different things.
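As an aside, since the question mentions that no restart strategy was enabled: if the job should be retried instead of failing outright, a restart strategy can be configured. A minimal sketch for flink-conf.yaml, where the attempt count and delay are placeholder values:
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 10 s
This only affects whether the job itself is restarted; the YARN application status still describes the cluster, as explained above.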

WildFly Swarm apps using an external ActiveMQ broker

I'm having a very hard time getting two WildFly Swarm apps (based on version 2017.9.5) to communicate with each other over a standalone ActiveMQ 5.14.3 broker. Everything is done using YAML config, as I can't have a main method in my case.
After reading hundreds of outdated examples and inaccurate pages of documentation, I settled on the following settings for both the producer and consumer apps:
swarm:
  messaging-activemq:
    servers:
      default:
        jms-topics:
          domain-events: {}
  messaging:
    remote:
      name: remote-mq
      host: localhost
      port: 61616
      jndi-name: java:/jms/remote-mq
      remote: true
Now it seems that at least part of the configuration is correct, as the apps start, except for the following warning:
2017-09-16 14:20:04,385 WARN [org.jboss.activemq.artemis.wildfly.integration.recovery] (MSC service thread 1-2) AMQ122018: Could not start recovery discovery on XARecoveryConfig [transportConfiguration=[TransportConfiguration(name=, factory=org-apache-activemq-artemis-core-remoting-impl-netty-NettyConnectorFactory) ?port=61616&localAddress=::&host=localhost], discoveryConfiguration=null, username=null, password=****, JNDI_NAME=java:/jms/remote-mq], we will retry every recovery scan until the server is available
Also, when the producer tries to send messages it just times out and I get the following exception (just the last part):
Caused by: javax.jms.JMSException: Failed to create session factory
at org.apache.activemq.artemis.jms.client.ActiveMQConnectionFactory.createConnectionInternal(ActiveMQConnectionFactory.java:727)
at org.apache.activemq.artemis.jms.client.ActiveMQConnectionFactory.createXAConnection(ActiveMQConnectionFactory.java:304)
at org.apache.activemq.artemis.jms.client.ActiveMQConnectionFactory.createXAConnection(ActiveMQConnectionFactory.java:300)
at org.apache.activemq.artemis.ra.ActiveMQRAManagedConnection.setup(ActiveMQRAManagedConnection.java:785)
... 127 more
Caused by: ActiveMQConnectionTimedOutException[errorType=CONNECTION_TIMEDOUT message=AMQ119013: Timed out waiting to receive cluster topology. Group:null]
at org.apache.activemq.artemis.core.client.impl.ServerLocatorImpl.createSessionFactory(ServerLocatorImpl.java:797)
at org.apache.activemq.artemis.jms.client.ActiveMQConnectionFactory.createConnectionInternal(ActiveMQConnectionFactory.java:724)
... 130 more
I suspect the problem is that ActiveMQ has security turned on, but I found no place to give a username and password in the Swarm config.
The ActiveMQ instance is running using Docker with the following compose file:
version: '2'
services:
  activemq:
    image: webcenter/activemq
    environment:
      - ACTIVEMQ_NAME=amqp-srv1
      - ACTIVEMQ_REMOVE_DEFAULT_ACCOUNT=true
      - ACTIVEMQ_ADMIN_LOGIN=admin
      - ACTIVEMQ_ADMIN_PASSWORD=your_password
      - ACTIVEMQ_WRITE_LOGIN=producer_login
      - ACTIVEMQ_WRITE_PASSWORD=producer_password
      - ACTIVEMQ_READ_LOGIN=consumer_login
      - ACTIVEMQ_READ_PASSWORD=consumer_password
      - ACTIVEMQ_JMX_LOGIN=jmx_login
      - ACTIVEMQ_JMX_PASSWORD=jmx_password
      - ACTIVEMQ_MIN_MEMORY=1024
      - ACTIVEMQ_MAX_MEMORY=4096
      - ACTIVEMQ_ENABLED_SCHEDULER=true
    ports:
      - "1883:1883"
      - "5672:5672"
      - "8161:8161"
      - "61616:61616"
      - "61613:61613"
      - "61614:61614"
Any idea what's going wrong?
I had a hard time trying to get this working too. The following YAML solved my problem:
swarm:
  network:
    socket-binding-groups:
      standard-sockets:
        outbound-socket-bindings:
          myapp-socket-binding:
            remote-host: localhost
            remote-port: 61616
  messaging-activemq:
    servers:
      default:
        remote-connectors:
          myapp-connector:
            socket-binding: myapp-socket-binding
        pooled-connection-factories:
          myAppRemote:
            user: username
            password: password
            connectors:
              - myapp-connector
            entries:
              - 'java:/jms/remote-mq'
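From the application code, the pooled connection factory can then be looked up through the JNDI entry declared above. A minimal sketch of a producer bean using the 'domain-events' topic from the question's config; the class and method names are made up for illustration:
import javax.annotation.Resource;
import javax.ejb.Stateless;
import javax.jms.ConnectionFactory;
import javax.jms.JMSContext;

@Stateless
public class DomainEventSender {

    // injected via the JNDI entry registered by the pooled-connection-factories block above
    @Resource(lookup = "java:/jms/remote-mq")
    private ConnectionFactory connectionFactory;

    public void send(String text) {
        try (JMSContext context = connectionFactory.createContext()) {
            // "domain-events" is the jms-topic declared in the question's YAML
            context.createProducer().send(context.createTopic("domain-events"), text);
        }
    }
}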

Jenkins Selenium Grid hub refuses incoming Selenium node connections from external IPs

I have a Selenium Grid hub running on Jenkins using the Selenium Plugin.
I have a Selenium grid node running on the same machine and it is successfully connected to the Hub.
From an external machine I can't seem to reach port 4444, on which the hub is running through Jenkins.
I can reach the port if the hub is started separately from the command line.
I have firewalls disabled on both machines, so it's not a network issue.
java -jar selenium-server-standalone-2.46.0.jar -role node -hub http://<IP>:4444/grid/register -timeout 10000 -browserTimeout 10000 -sessionMaxIdleTimeInSeconds 10000
16:34:58.122 INFO - Launching a Selenium Grid node
16:34:59.982 WARN - error getting the parameters from the hub. The node may end up with wrong timeouts.Connect to <IP>:4444 [<IP>] failed: Connection refused: connect
16:35:00.029 INFO - Java: Oracle Corporation 25.51-b03
16:35:00.029 INFO - OS: Windows 8.1 6.3 amd64
16:35:00.044 INFO - v2.46.0, with Core v2.46.0. Built from revision 87c69e2
16:35:00.107 INFO - Driver class not found: com.opera.core.systems.OperaDriver
16:35:00.107 INFO - Driver provider com.opera.core.systems.OperaDriver is not registered
16:35:00.154 INFO - Version Jetty/5.1.x
16:35:00.154 INFO - Started HttpContext[/selenium-server,/selenium-server]
16:35:00.154 INFO - Started org.openqa.jetty.jetty.servlet.ServletHandler#76a4d6c
16:35:00.154 INFO - Started HttpContext[/wd,/wd]
16:35:00.154 INFO - Started HttpContext[/selenium-server/driver,/selenium-server/driver]
16:35:00.154 INFO - Started HttpContext[/,/]
16:35:00.154 INFO - Started SocketListener on 0.0.0.0:5555
16:35:00.154 INFO - Started org.openqa.jetty.jetty.Server#1f7030a6
16:35:00.154 INFO - Selenium Grid node is up and ready to register to the hub
16:35:00.185 INFO - Starting auto registration thread. Will try to register every 5000 ms.
16:35:00.200 INFO - Registering the node to the hub: http://<IP>/grid/register
16:35:01.232 INFO - Couldn't register this node: Error sending the registration request: Connect to <IP>:4444 [IP] failed: Connection refused: connect
16:35:07.232 INFO - Couldn't register this node: The hub is down or not responding: Connect to <IP>:4444 [IP] failed: Connection refused: connect
Any help is appreciated.
Could it be that the hub has a max connection limit set and you have already reached that limit?
You can debug it by shutting down the hub, restarting it, and then registering the node you are trying to connect to the hub. Hopefully it will register.
It is probably the connection limit: if you don't supply anything, I think a hub can handle 5 instances by default.
Please try it and share your results.
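Before re-registering, it may also help to confirm from the node machine that the hub port is reachable at all; a quick sketch, with <IP> as a placeholder for the hub's address:
curl -v http://<IP>:4444/grid/console
java -jar selenium-server-standalone-2.46.0.jar -role node -hub http://<IP>:4444/grid/register
If the curl call is refused while the hub runs inside Jenkins but succeeds when the hub is started from the command line, the problem is on the hub side rather than the node side.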

storm-YARN: Topology submitted but no supervisor nodes assigned

I'm running the WordCount Storm topology (https://github.com/nathanmarz/storm-starter/).
I'm launching this topology via storm-yarn, but it's not running any workers. In the Storm UI I cannot see any supervisor nodes assigned. In my code I've configured it to start three workers.
Do we need to do something else to get the supervisors started?
I also looked at nimbus.log, stderr, stdout and ui.log; there are no signs of errors.
storm jar target/storm-starter-0.9.3-rc2-SNAPSHOT-jar-with-dependencies.jar storm.starter.WordCountTopology TEST-WC-TOPO
1045 [main] INFO backtype.storm.StormSubmitter - Successfully uploaded topology jar to assigned location: storm-local/nimbus/inbox/stormjar-4a08ebf7-022c-4b22-86be-1efc6e8f6be0.jar
1046 [main] INFO backtype.storm.StormSubmitter - Submitting topology WCT in distributed mode with conf {"topology.workers":3,"topology.debug":true}
1438 [main] INFO backtype.storm.StormSubmitter - Finished submitting topology: WCT