storm-yarn: Topology submitted but no supervisor nodes assigned

I'm running the WordCount Storm topology (https://github.com/nathanmarz/storm-starter/).
I'm launching this topology via storm-yarn, but it isn't running any workers. In the Storm UI I cannot see any supervisor nodes assigned. In my code I've configured the topology to start three workers (the relevant snippet is shown below, after the submit output).
Do we need to do something else to get the supervisors started?
I also looked at nimbus.log, stderr, stdout, and ui.log; there are no signs of error.
storm jar target/storm-starter-0.9.3-rc2-SNAPSHOT-jar-with-dependencies.jar storm.starter.WordCountTopology TEST-WC-TOPO
1045 [main] INFO backtype.storm.StormSubmitter - Successfully uploaded topology jar to assigned location: storm-local/nimbus/inbox/stormjar-4a08ebf7-022c-4b22-86be-1efc6e8f6be0.jar
1046 [main] INFO backtype.storm.StormSubmitter - Submitting topology WCT in distributed mode with conf {"topology.workers":3,"topology.debug":true}
1438 [main] INFO backtype.storm.StormSubmitter - Finished submitting topology: WCT
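For reference, the worker count is set in the topology code itself, roughly as in storm-starter's WordCountTopology (a minimal sketch; the spout/bolt wiring is elided and my actual class may differ slightly):
import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.topology.TopologyBuilder;

public class WordCountTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // ... setSpout/setBolt wiring as in storm-starter ...
        Config conf = new Config();
        conf.setDebug(true);    // shows up as "topology.debug":true in the submit log above
        conf.setNumWorkers(3);  // shows up as "topology.workers":3, i.e. three worker processes requested
        StormSubmitter.submitTopology(args[0], conf, builder.createTopology());
    }
}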

Related

X-Ray daemon doesn't receive any data from Envoy

I have a service running a task definition with three containers:
service itself
envoy
x-ray daemon
I want to trace and monitor my services interacting with each other with X-Ray.
But I don't see any data in X-Ray.
I can see the request logs and everything in the Envoy logs, and there are no error messages about a missing connection to the X-Ray daemon.
The Envoy container has three env variables:
APPMESH_VIRTUAL_NODE_NAME = mesh/mesh-name/virtualNode/service-virtual-node
ENABLE_ENVOY_XRAY_TRACING = 1
ENVOY_LOG_LEVEL = trace
The X-Ray daemon container is pretty plain and has just a name and an image (amazon/aws-xray-daemon:1).
But when looking at the logs of the X-Ray daemon, there is only the following:
2022-05-31T14:48:05.042+02:00 2022-05-31T12:48:05Z [Info] Initializing AWS X-Ray daemon 3.0.0
2022-05-31T14:48:05.042+02:00 2022-05-31T12:48:05Z [Info] Using buffer memory limit of 76 MB
2022-05-31T14:48:05.042+02:00 2022-05-31T12:48:05Z [Info] 1216 segment buffers allocated
2022-05-31T14:48:05.051+02:00 2022-05-31T12:48:05Z [Info] Using region: eu-central-1
2022-05-31T14:48:05.788+02:00 2022-05-31T12:48:05Z [Error] Get instance id metadata failed: RequestError: send request failed
2022-05-31T14:48:05.788+02:00 caused by: Get http://169.254.169.254/latest/meta-data/instance-id: dial tcp xxx.xxx.xxx.254:80: connect: invalid argument
2022-05-31T14:48:05.789+02:00 2022-05-31T12:48:05Z [Info] Starting proxy http server on 127.0.0.1:2000
As far as I read, the error you can see in these logs doesn't affect the functionality (https://repost.aws/questions/QUr6JJxyeLRUK5M4tadg944w).
I'm pretty sure I'm missing a configuration or access right.
The same setup is already running on staging, but I set that up several weeks ago and I can't find any differences between the two configurations.
Thanks in advance!
In my case, I had made a copy-paste mistake: a trailing line break was copied into the name of the environment variable ENABLE_ENVOY_XRAY_TRACING, which wasn't visible in the overview, only inside the text field.
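For anyone comparing against their own setup: in the ECS task definition, the Envoy container's environment block should end up looking roughly like this (a sketch assuming the standard task-definition JSON; the virtual node name is the one from the question, and the variable names must match exactly, with no stray whitespace or line breaks):
"environment": [
  { "name": "APPMESH_VIRTUAL_NODE_NAME", "value": "mesh/mesh-name/virtualNode/service-virtual-node" },
  { "name": "ENABLE_ENVOY_XRAY_TRACING", "value": "1" },
  { "name": "ENVOY_LOG_LEVEL", "value": "trace" }
]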

YARN reports Flink job as FINISHED and SUCCEEDED when the Flink job fails

I am running a Flink job on YARN. We use "flink run" on the command line to submit our job to YARN. One day we had an exception in the Flink job; since we didn't enable the Flink restart strategy, it simply failed. Eventually, however, we found that the job status was "SUCCEEDED" in the YARN application list, where we expected "FAILED".
Flink CLI log:
06/12/2018 03:13:37 FlatMap (getTagStorageMapper.flatMap)(23/32) switched to CANCELED
06/12/2018 03:13:37 GroupReduce (ResultReducer.reduceGroup)(31/32) switched to CANCELED
06/12/2018 03:13:37 FlatMap (SubClassEDFJoinMapper.flatMap)(29/32) switched to CANCELED
06/12/2018 03:13:37 CHAIN DataSource (SubClassInventory.AvroInputFormat.createInput) -> FlatMap (SubClassInventoryMapper.flatMap)(27/32) switched to CANCELED
06/12/2018 03:13:37 GroupReduce (OutputReducer.reduceGroup)(28/32) switched to CANCELED
06/12/2018 03:13:37 CHAIN DataSource (SubClassInventory.AvroInputFormat.createInput) -> FlatMap (BIMBQMInstrumentMapper.flatMap)(27/32) switched to CANCELED
06/12/2018 03:13:37 GroupReduce (BIMBQMGovCorpReduce.reduceGroup)(30/32) switched to CANCELED
06/12/2018 03:13:37 FlatMap (BIMBQMEVMJoinMapper.flatMap)(32/32) switched to CANCELED
06/12/2018 03:13:37 Job execution switched to status FAILED.
No JobSubmissionResult returned, please make sure you called ExecutionEnvironment.execute()
2018-06-12 03:13:37,625 INFO org.apache.flink.yarn.YarnClusterClient - Sending shutdown request to the Application Master
2018-06-12 03:13:37,625 INFO org.apache.flink.yarn.YarnClusterClient - Start application client.
2018-06-12 03:13:37,630 INFO org.apache.flink.yarn.ApplicationClient - Notification about new leader address akka.tcp://flink@ip-10-97-46-149.tr-fr-nonprod.aws-int.thomsonreuters.com:45663/user/jobmanager with session ID 00000000-0000-0000-0000-000000000000.
2018-06-12 03:13:37,632 INFO org.apache.flink.yarn.ApplicationClient - Sending StopCluster request to JobManager.
2018-06-12 03:13:37,633 INFO org.apache.flink.yarn.ApplicationClient - Received address of new leader akka.tcp://flink@ip-10-97-46-149.tr-fr-nonprod.aws-int.thomsonreuters.com:45663/user/jobmanager with session ID 00000000-0000-0000-0000-000000000000.
2018-06-12 03:13:37,634 INFO org.apache.flink.yarn.ApplicationClient - Disconnect from JobManager null.
2018-06-12 03:13:37,635 INFO org.apache.flink.yarn.ApplicationClient - Trying to register at JobManager akka.tcp://flink@ip-10-97-46-149.tr-fr-nonprod.aws-int.thomsonreuters.com:45663/user/jobmanager.
2018-06-12 03:13:37,688 INFO org.apache.flink.yarn.ApplicationClient - Successfully registered at the ResourceManager using JobManager Actor[akka.tcp://flink@ip-10-97-46-149.tr-fr-nonprod.aws-int.thomsonreuters.com:45663/user/jobmanager#182802345]
2018-06-12 03:13:38,648 INFO org.apache.flink.yarn.ApplicationClient - Sending StopCluster request to JobManager.
2018-06-12 03:13:39,480 INFO org.apache.flink.yarn.YarnClusterClient - Application application_1528772982594_0001 finished with state FINISHED and final state SUCCEEDED at 1528773218662
2018-06-12 03:13:39,480 INFO org.apache.flink.yarn.YarnClusterClient - YARN Client is shutting down
2018-06-12 03:13:39,582 INFO org.apache.flink.yarn.ApplicationClient - Stopped Application client.
2018-06-12 03:13:39,583 INFO org.apache.flink.yarn.ApplicationClient - Disconnect from JobManager Actor[akka.tcp://flink@ip-10-97-46-149.tr-fr-nonprod.aws-int.thomsonreuters.com:45663/user/jobmanager#182802345].
Flink JobManager log:
FlatMap (BIMBQMEVMJoinMapper.flatMap) (32/32) (67a002e07fe799c1624a471340c8cf9d) switched from CANCELING to CANCELED.
Try to restart or fail the job Flink Java Job at Tue Jun 12 03:13:17 UTC 2018 (1086cedb3617feeee8aace29a7fc6bd0) if no longer possible.
Requesting new TaskManager container with 8192 megabytes memory. Pending requests: 1
Job Flink Java Job at Tue Jun 12 03:13:17 UTC 2018 (1086cedb3617feeee8aace29a7fc6bd0) switched from state FAILING to FAILED.
Could not restart the job Flink Java Job at Tue Jun 12 03:13:17 UTC 2018 (1086cedb3617feeee8aace29a7fc6bd0) because the restart strategy prevented it.
Unregistered task manager ip-10-97-44-186/10.97.44.186. Number of registered task managers 31. Number of available slots 31
Stopping JobManager with final application status SUCCEEDED and diagnostics: Flink YARN Client requested shutdown
Shutting down cluster with status SUCCEEDED : Flink YARN Client requested shutdown
Unregistering application from the YARN Resource Manager
Waiting for application to be successfully unregistered.
Can anybody help me understand why YARN says my Flink job "SUCCEEDED"?
The application status reported in YARN does not reflect the status of the executed job but the status of the Flink cluster, since the Flink cluster is what constitutes the YARN application. Thus, the final status of the YARN application only depends on whether the Flink cluster shut down properly or not. Put differently, if a job fails, that does not necessarily mean that the Flink cluster failed. These are two different things.
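A quick way to see the two statuses side by side (a sketch, using the application id from the log above; the exact report fields may vary by Hadoop version):
# YARN reports the state of the Flink cluster (the YARN application), not of the job inside it:
yarn application -status application_1528772982594_0001
# -> Final-State : SUCCEEDED, because the cluster was shut down cleanly on request
# The job's own outcome is what the Flink CLI printed above:
# "Job execution switched to status FAILED."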

DataNode and NodeManager are not starting in pseudo-cluster mode (Apache Hadoop)

The DataNode and NodeManager are not starting in pseudo-cluster mode (Apache Hadoop).
I am seeing this error in the log file:
2017-08-22 17:15:08,403 ERROR org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Unexpected error starting NodeStatusUpdater
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Recieved SHUTDOWN signal from Resourcemanager ,Registration of NodeManager failed, Message from ResourceManager: NodeManager from archit doesn't satisfy minimum allocations, Sending SHUTDOWN signal to the NodeManager.
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.registerWithRM(NodeStatusUpdaterImpl.java:278)
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.serviceStart(NodeStatusUpdaterImpl.java:197)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStart(NodeManager.java:272)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:496)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:543)
2017-08-22 17:15:08,404 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl failed in state STARTED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException:
The "ResourceManager: NodeManager from archit doesn't satisfy minimum allocations" error is seen when node on which node manager is being started does not have enough resources w.r.t yarn.scheduler.minimum-allocation-vcores and yarn.scheduler.minimum-allocation-mb configurations.
Reduce values of yarn.scheduler.minimum-allocation-vcores and yarn.scheduler.minimum-allocation-mb then restart yarn.
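For example, on a small pseudo-distributed node the minimums could be lowered in yarn-site.xml roughly like this (a sketch; choose values that actually fit the memory and vcores available on your box, then restart YARN):
<!-- yarn-site.xml: lower the smallest allocation the scheduler will accept -->
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>512</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-vcores</name>
  <value>1</value>
</property>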

ActiveMQ failover using multiple instances in master/slave mode on the same Linux machine

I have set up multiple ActiveMQ instances to achieve failover in master/slave mode on Windows.
While setting this up I just created 3 instances under the bin folder without changing any ports and started all 3 instances one by one. The first instance became the master and the remaining ones stayed in slave mode until I stopped the master instance.
Now I am trying to achieve the same in a Linux environment. The first instance starts successfully, but when I start the second instance in a different window it throws the error below:
ERROR | Failed to start Apache ActiveMQ ([instance2, ID:132vm6-57227-1478597606120-0:1], java.io.IOException: Transport Connector could not be registered in JMX: java.io.IOException: Failed to bind to server socket: tcp://0.0.0.0:61616?maximumConnections=1000&wireFormat.maxFrameSize=104857600 due to: java.net.BindException: Address already in use)
INFO | Apache ActiveMQ 5.14.0 (instance2, ID:132vm6-57227-1478597606120-0:1) is shutting down
INFO | Connector openwire stopped
INFO | Connector amqp stopped
INFO | Connector stomp stopped
INFO | Connector mqtt stopped
INFO | Connector ws stopped
INFO | PListStore:[/opt/apache-activemq-5.14.0/bin/instance2/data/instance2/tmp_storage] stopped
INFO | Stopping async queue tasks
INFO | Stopping async topic tasks
INFO | Stopped KahaDB
INFO | Apache ActiveMQ 5.14.0 (instance2, ID:132vm6-57227-1478597606120-0:1) uptime 0.585 seconds
INFO | Apache ActiveMQ 5.14.0 (instance2, ID:132vm6-57227-1478597606120-0:1) is shutdown
INFO | Closing org.apache.activemq.xbean.XBeanBrokerFactory$1@4233871a: startup date [Tue Nov 08 15:03:24 IST 2016]; root of context hierarchy
WARN | Exception thrown from LifecycleProcessor on context close
java.lang.IllegalStateException: LifecycleProcessor not initialized - call 'refresh' before invoking lifecycle methods via the context: org.apache.activemq.xbean.XBeanBrokerFactory$1@4233871a: startup date [Tue Nov 08 15:03:24 IST 2016]; root of context hierarchy
at org.springframework.context.support.AbstractApplicationContext.getLifecycleProcessor(AbstractApplicationContext.java:357)[spring-context-4.1.9.RELEASE.jar:4.1.9.RELEASE]
at org.springframework.context.support.AbstractApplicationContext.doClose(AbstractApplicationContext.java:884)[spring-context-4.1.9.RELEASE.jar:4.1.9.RELEASE]
at org.springframework.context.support.AbstractApplicationContext.close(AbstractApplicationContext.java:843)[spring-context-4.1.9.RELEASE.jar:4.1.9.RELEASE]
at org.apache.activemq.hooks.SpringContextHook.run(SpringContextHook.java:30)[activemq-spring-5.14.0.jar:5.14.0]
at org.apache.activemq.broker.BrokerService.stop(BrokerService.java:875)[activemq-broker-5.14.0.jar:5.14.0]
at org.apache.activemq.xbean.XBeanBrokerService.stop(XBeanBrokerService.java:122)[activemq-spring-5.14.0.jar:5.14.0]
at org.apache.activemq.broker.BrokerService.start(BrokerService.java:629)[activemq-broker-5.14.0.jar:5.14.0]
at org.apache.activemq.xbean.XBeanBrokerService.afterPropertiesSet(XBeanBrokerService.java:73)[activemq-spring-5.14.0.jar:5.14.0]
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)[:1.7.0_65]
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)[:1.7.0_65]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)[:1.7.0_65]
at java.lang.reflect.Method.invoke(Method.java:606)[:1.7.0_65]
I am using ActiveMQ 5.14.
If anybody has encountered a similar issue, kindly provide your input.
To get multiple instances of ActiveMQ running on the same machine, you need to change the ports that they try to open. There are (at least) 3 ports that need to be changed:
The transportConnector ports that accept messaging traffic. These are defined in the activemq.xml file. Typically you only need the openwire one - this is 61616 by default; I usually change this in the other ActiveMQ instances to 61626, 61636, etc. (see the example after this list). You can usually comment out the others if you don't intend to use them.
The Jetty HTTP port. This is defined in the jetty.xml file. The default is 8161; set the next ones to 8162, 8163, etc.
The JMX port. This one's a bit tricky, as you need to stick a piece of config into the activemq.xml to explicitly define it as follows:
<managementContext>
<managementContext createConnector="true" connectorPort="1099"/>
</managementContext>
You can then change this to 1199, 1299 on the other instances. Hope this helps.
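As an example of point 1, the openwire transportConnector in instance2's activemq.xml could look like this (a sketch; only the port differs from the 61616 default shown in the error above):
<transportConnectors>
  <!-- instance2 binds 61626 instead of 61616 so it no longer collides with the master -->
  <transportConnector name="openwire" uri="tcp://0.0.0.0:61626?maximumConnections=1000&amp;wireFormat.maxFrameSize=104857600"/>
</transportConnectors>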

Error running topology in production cluster with Apache Storm 1.0.0, topology does not start

I have a topology that runs well on a local cluster.
But when I try to run it on a production cluster, the following things happen:
The nimbus is up
The storm UI is up
The two workers I use are up
ZooKeeper is up
I run storm with
storm jar myjar.jar MyClass
Nimbus submits the topology
The topologies and the workers appears in the storm UI
BUT:
The topology does not start despite the fact that its status is ACTIVE
The log files of the topology do not appear on the workers.
I have the following log on the worker, in supervisor.log:
2016-04-15 13:18:19.831 o.a.s.d.supervisor [WARN] There was a connection problem with nimbus. #error {
:cause jobs-rec-storm-nimbus
:via
[{:type java.lang.RuntimeException
:message org.apache.storm.thrift.transport.TTransportException: java.net.UnknownHostException: jobs-rec-storm-nimbus
:at [org.apache.storm.security.auth.TBackoffConnect retryNext TBackoffConnect.java 64]}
{:type org.apache.storm.thrift.transport.TTransportException
:message java.net.UnknownHostException: jobs-rec-storm-nimbus
:at [org.apache.storm.thrift.transport.TSocket open TSocket.java 226]}
{:type java.net.UnknownHostException
:message jobs-rec-storm-nimbus
:at [java.net.AbstractPlainSocketImpl connect AbstractPlainSocketImpl.java 184]}]
:trace
[[java.net.AbstractPlainSocketImpl connect AbstractPlainSocketImpl.java 184]
[java.net.SocksSocketImpl connect SocksSocketImpl.java 392]
[java.net.Socket connect Socket.java 589]
[org.apache.storm.thrift.transport.TSocket open TSocket.java 221]
[org.apache.storm.thrift.transport.TFramedTransport open TFramedTransport.java 81]
[org.apache.storm.security.auth.SimpleTransportPlugin connect SimpleTransportPlugin.java 103]
[org.apache.storm.security.auth.TBackoffConnect doConnectWithRetry TBackoffConnect.java 53]
[org.apache.storm.security.auth.ThriftClient reconnect ThriftClient.java 99]
[org.apache.storm.security.auth.ThriftClient <init> ThriftClient.java 69]
[org.apache.storm.utils.NimbusClient <init> NimbusClient.java 106]
[org.apache.storm.utils.NimbusClient getConfiguredClientAs NimbusClient.java 78]
[org.apache.storm.utils.NimbusClient getConfiguredClient NimbusClient.java 41]
[org.apache.storm.blobstore.NimbusBlobStore prepare NimbusBlobStore.java 268]
[org.apache.storm.utils.Utils getClientBlobStoreForSupervisor Utils.java 462]
[org.apache.storm.daemon.supervisor$fn__9590 invoke supervisor.clj 942]
[clojure.lang.MultiFn invoke MultiFn.java 243]
[org.apache.storm.daemon.supervisor$mk_synchronize_supervisor$this__9351$fn__9369 invoke supervisor.clj 582]
[org.apache.storm.daemon.supervisor$mk_synchronize_supervisor$this__9351 invoke supervisor.clj 581]
[org.apache.storm.event$event_manager$fn__8903 invoke event.clj 40]
[clojure.lang.AFn run AFn.java 22]
[java.lang.Thread run Thread.java 745]]}
2016-04-15 13:18:19.831 o.a.s.d.supervisor [INFO] Finished downloading code for storm id jobs-KafkaMigration-topology-3-1460740616
2016-04-15 13:18:19.850 o.a.s.d.supervisor [INFO] Missing topology storm code, so can't launch worker with assignment ...(some more numbers)
So I assume that I have a connection problem with nimbus, but the properties file on the worker is:
storm.zookeeper.servers:
- "192.168.22.209"
- "192.168.22.216"
- "192.168.22.217"
storm.local.dir: "/app/home/storm"
storm.zookeeper.root: "/storm-prod"
#
nimbus.seeds: ["192.168.120.96"]
And if I ping the nimbus IP from the workers, it returns OK.
Where is the error, and how can I fix it?
Thanks!
What appears to happen in this context is that the Storm supervisor resolves nimbus from whatever is configured in storm.yaml (seeds/host) the first time, and from then on uses the nimbus host name to download the topology artifacts.
If that is correct, DNS is mandatory for a cluster setup. This is far from ideal, especially when using containers in an orchestrated environment like Kubernetes.
The current workaround I'm using is adding
storm.local.hostname: "<local.ip.value>"
to the storm.yaml
Thanks to @bastien who provided the tip on the Storm user mailing list.
I ran into a similar issue. It turned out my firewall rules were blocking the supervisor ports. Make sure the supervisor and nimbus are able to talk to each other.
I found that I needed the hostnames of the boxes to match what I was calling them in the /etc/hosts file.
In the hosts file I had
xxx.xxx.xxx.xxx nimbus
but the hostname on the box was different, and it was pulling the hostname from the OS.
Changing the hostname on the OS of the nimbus server resolved my issue.
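For reference, a consistent setup has every supervisor (and the nimbus box itself) resolving the same nimbus name. A sketch of the /etc/hosts entry, using the nimbus IP from the question's storm.yaml and the hostname from the UnknownHostException above (both taken from this thread; adjust to your environment):
# /etc/hosts on each supervisor and on the nimbus host
192.168.120.96   jobs-rec-storm-nimbus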