Yarn report flink job as FINISHED and SUCCEED when flink job failure - hadoop-yarn

I am running flink job on yarn, we use "fink run" in command line to submit our job to yarn, one day we had an exception on flink job, as we didn't enable the flink restart strategy so it simply failed, but eventually we found that the job status was "SUCCEED" from the yarn application list, which we expect to be "FAILED".
Flink CLI log:
06/12/2018 03:13:37 FlatMap (getTagStorageMapper.flatMap)(23/32) switched to CANCELED
06/12/2018 03:13:37 GroupReduce (ResultReducer.reduceGroup)(31/32) switched to CANCELED
06/12/2018 03:13:37 FlatMap (SubClassEDFJoinMapper.flatMap)(29/32) switched to CANCELED
06/12/2018 03:13:37 CHAIN DataSource (SubClassInventory.AvroInputFormat.createInput) -> FlatMap (SubClassInventoryMapper.flatMap)(27/32) switched to CANCELED
06/12/2018 03:13:37 GroupReduce (OutputReducer.reduceGroup)(28/32) switched to CANCELED
06/12/2018 03:13:37 CHAIN DataSource (SubClassInventory.AvroInputFormat.createInput) -> FlatMap (BIMBQMInstrumentMapper.flatMap)(27/32) switched to CANCELED
06/12/2018 03:13:37 GroupReduce (BIMBQMGovCorpReduce.reduceGroup)(30/32) switched to CANCELED
06/12/2018 03:13:37 FlatMap (BIMBQMEVMJoinMapper.flatMap)(32/32) switched to CANCELED
06/12/2018 03:13:37 Job execution switched to status FAILED.
No JobSubmissionResult returned, please make sure you called ExecutionEnvironment.execute()
2018-06-12 03:13:37,625 INFO org.apache.flink.yarn.YarnClusterClient - Sending shutdown request to the Application Master
2018-06-12 03:13:37,625 INFO org.apache.flink.yarn.YarnClusterClient - Start application client.
2018-06-12 03:13:37,630 INFO org.apache.flink.yarn.ApplicationClient - Notification about new leader address akka.tcp://flink#ip-10-97-46-149.tr-fr-nonprod.aws-int.thomsonreuters.com:45663/user/jobmanager with session ID 00000000-0000-0000-0000-000000000000.
2018-06-12 03:13:37,632 INFO org.apache.flink.yarn.ApplicationClient - Sending StopCluster request to JobManager.
2018-06-12 03:13:37,633 INFO org.apache.flink.yarn.ApplicationClient - Received address of new leader akka.tcp://flink#ip-10-97-46-149.tr-fr-nonprod.aws-int.thomsonreuters.com:45663/user/jobmanager with session ID 00000000-0000-0000-0000-000000000000.
2018-06-12 03:13:37,634 INFO org.apache.flink.yarn.ApplicationClient - Disconnect from JobManager null.
2018-06-12 03:13:37,635 INFO org.apache.flink.yarn.ApplicationClient - Trying to register at JobManager akka.tcp://flink#ip-10-97-46-149.tr-fr-nonprod.aws-int.thomsonreuters.com:45663/user/jobmanager.
2018-06-12 03:13:37,688 INFO org.apache.flink.yarn.ApplicationClient - Successfully registered at the ResourceManager using JobManager Actor[akka.tcp://flink#ip-10-97-46-149.tr-fr-nonprod.aws-int.thomsonreuters.com:45663/user/jobmanager#182802345]
2018-06-12 03:13:38,648 INFO org.apache.flink.yarn.ApplicationClient - Sending StopCluster request to JobManager.
2018-06-12 03:13:39,480 INFO org.apache.flink.yarn.YarnClusterClient - Application application_1528772982594_0001 finished with state FINISHED and final state SUCCEEDED at 1528773218662
2018-06-12 03:13:39,480 INFO org.apache.flink.yarn.YarnClusterClient - YARN Client is shutting down
2018-06-12 03:13:39,582 INFO org.apache.flink.yarn.ApplicationClient - Stopped Application client.
2018-06-12 03:13:39,583 INFO org.apache.flink.yarn.ApplicationClient - Disconnect from JobManager Actor[akka.tcp://flink#ip-10-97-46-149.tr-fr-nonprod.aws-int.thomsonreuters.com:45663/user/jobmanager#182802345].
Flink job manager Log:
FlatMap (BIMBQMEVMJoinMapper.flatMap) (32/32) (67a002e07fe799c1624a471340c8cf9d) switched from CANCELING to CANCELED.
Try to restart or fail the job Flink Java Job at Tue Jun 12 03:13:17 UTC 2018 (1086cedb3617feeee8aace29a7fc6bd0) if no longer possible.
Requesting new TaskManager container with 8192 megabytes memory. Pending requests: 1
Job Flink Java Job at Tue Jun 12 03:13:17 UTC 2018 (1086cedb3617feeee8aace29a7fc6bd0) switched from state FAILING to FAILED.
Could not restart the job Flink Java Job at Tue Jun 12 03:13:17 UTC 2018 (1086cedb3617feeee8aace29a7fc6bd0) because the restart strategy prevented it.
Unregistered task manager ip-10-97-44-186/10.97.44.186. Number of registered task managers 31. Number of available slots 31
Stopping JobManager with final application status SUCCEEDED and diagnostics: Flink YARN Client requested shutdown
Shutting down cluster with status SUCCEEDED : Flink YARN Client requested shutdown
Unregistering application from the YARN Resource Manager
Waiting for application to be successfully unregistered.
Can anybody help me understand why does yarn say my flink job was "SUCCEED"?

The reported application status in Yarn does not reflect the status of the executed job but the status of the Flink cluster since this is the Yarn application. Thus, the final status of the Yarn application only depends on whether the Flink cluster finished properly or not. Differently said, if a job fails, then it does not necessarily mean that the Flink cluster failed. These are two different things.

Related

Stucked HDP adding host

I have an HDP cluster with 2 nodes, we had some problem and 1 host heartbeat was lost without any chance to recover it as the machine failed so we finally re-install Ubuntu and configure it again.
It was impossible to recover the host in ambari (trying to give same FQDN, ip, configuration,...) so I tried to change hostname and add it as a completely new host.
I did get able to complete installation step 2 with 'SUCCESS' status, but it get stucked whith the following message 'Please wait while the hosts are being checked for potential problems...' for hours.
I attach ambari-server log, ambari-agent log ambari registration log and error image.
Do you have some idea about what is happening and how to solve it?
Thanks.
ambari-server.log
12 jun 2018 09:34:55,667 WARN [ambari-action-scheduler] ExecutionCommandWrapper:185 - Unable to lookup the cluster by ID; assuming that there is no cluster and therefore no configs for this execution command: Cluster not found, clusterName=clusterID=-1
12 jun 2018 09:34:56,675 WARN [ambari-action-scheduler] ExecutionCommandWrapper:185 - Unable to lookup the cluster by ID; assuming that there is no cluster and therefore no configs for this execution command: Cluster not found, clusterName=clusterID=-1
12 jun 2018 09:34:57,683 WARN [ambari-action-scheduler] ExecutionCommandWrapper:185 - Unable to lookup the cluster by ID; assuming that there is no cluster and therefore no configs for this execution command: Cluster not found, clusterName=clusterID=-1
ambari-agent.log
INFO 2018-06-12 09:00:16,026 Controller.py:512 - Registration response from bigdata was OK
INFO 2018-06-12 09:00:16,026 Controller.py:517 - Resetting ActionQueue...
INFO 2018-06-12 09:00:26,035 Controller.py:304 - Heartbeat (response id = 0) with server is running...
INFO 2018-06-12 09:00:26,036 Controller.py:311 - Building heartbeat message
INFO 2018-06-12 09:00:26,037 Heartbeat.py:90 - Adding host info/state to heartbeat message.
INFO 2018-06-12 09:00:26,099 logger.py:75 - Testing the JVM's JCE policy to see it if supports an unlimited key length.
INFO 2018-06-12 09:00:26,168 Hardware.py:176 - Some mount points were ignored: /dev, /run, /, /dev/shm, /run/lock, /sys/fs/cgroup, /boot, /run/user/1000, /run/user/0, /run/user/994
INFO 2018-06-12 09:00:26,169 Controller.py:320 - Sending Heartbeat (id = 0)
INFO 2018-06-12 09:00:26,174 Controller.py:332 - Heartbeat response received (id = 1)
INFO 2018-06-12 09:00:26,174 Controller.py:341 - Heartbeat interval is 10 seconds
INFO 2018-06-12 09:00:26,174 Controller.py:377 - Updating configurations from heartbeat
INFO 2018-06-12 09:00:26,174 Controller.py:386 - Adding cancel/execution commands
INFO 2018-06-12 09:00:26,174 Controller.py:403 - Adding recovery commands
INFO 2018-06-12 09:00:26,174 Controller.py:471 - Waiting 9.9 for next heartbeat
INFO 2018-06-12 09:00:36,075 Controller.py:478 - Wait for next heartbeat over
registration log
INFO 2018-06-12 09:34:38,350 Controller.py:512 - Registration response from bigdata was OK
INFO 2018-06-12 09:34:38,350 Controller.py:517 - Resetting ActionQueue...
', None)
Connection to master.es closed.
SSH command execution finished
host=master.es, exitcode=0
Command end time 2018-06-12 09:34:38
Registering with the server...
Registering with the server...
Do an ambari-agent reset on all nodes.
Change cluster name.

Data Node and Node Manager is not starting in pseudo-cluster mode(Apache Hadoop)

The Data Node and Node Manager is not starting in pseudo-cluster mode (Apache Hadoop).
Seeing this error in the log file:
***2017-08-22 17:15:08,403 ERROR org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Unexpected error starting NodeStatusUpdater
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Recieved SHUTDOWN signal from Resourcemanager ,Registration of NodeManager failed, Message from ResourceManager: NodeManager from archit doesn't satisfy minimum allocations, Sending SHUTDOWN signal to the NodeManager.***
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.registerWithRM(NodeStatusUpdaterImpl.java:278)
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.serviceStart(NodeStatusUpdaterImpl.java:197)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStart(NodeManager.java:272)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:496)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:543)
2017-08-22 17:15:08,404 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl failed in state STARTED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException:
The "ResourceManager: NodeManager from archit doesn't satisfy minimum allocations" error is seen when node on which node manager is being started does not have enough resources w.r.t yarn.scheduler.minimum-allocation-vcores and yarn.scheduler.minimum-allocation-mb configurations.
Reduce values of yarn.scheduler.minimum-allocation-vcores and yarn.scheduler.minimum-allocation-mb then restart yarn.

IO thread error : 1595 (Relay log write failure: could not queue event from master)

Slave status :
Last_IO_Errno: 1595
Last_IO_Error: Relay log write failure: could not queue event from master
Last_SQL_Errno: 0
from error log :
[ERROR] Slave I/O for channel 'db12': Unexpected master's heartbeat data: heartbeat is not compatible with local info; the event's data: log_file_name toku10-bin.000063<D1> log_pos 97223067, Error_code: 1623
[ERROR] Slave I/O for channel 'db12': Relay log write failure: could not queue event from master, Error_code: 1595
I tried to restarting the slave_io thread for many times, still its same.
we need to keep on start io_thread whenever it stopped manually, hope its bug from percona
I have simply written shell and scheduled the same for every 10mins to check if io_thread is not running , start slave io_thread for channel 'db12';. It's working as of now

Akka.Remote - cannot send messages to remote actor after dissassociation

I am using Akka.Remote to communicate between a server-side service application and multiple desktop client applications. The clients send a request message to the server (using Akka.net) and waits for the server to reply with a response message. The client applications are transient, meaning that they often connect to the server, stay connected for some time, disconnect and then reconnect again.
The problem I encountered is that sometimes when a client disconnects from the server actor (by shutting down its ActorSystem) and then reconnects back to the server, it does not receive any replies from the server for some time. After a few minutes the communication works without any problems. I found out that this issue occurs when the server sends a reply to a client that has disconnected during the request and is no longer reachable. The server cannot deliver the response message and it somehow marks the client endpoint as invalid.
In the log (on the server side) I am getting the following messages when the client is disconnected.
[DEBUG] 2016-01-21 13:04:58.6151 received AutoReceiveMessage <Terminated>: [akka.tcp://qb#client:8090/user/qb] - ExistenceConfirmed=True ServerActor
[DEBUG] 2016-01-21 13:04:58.6550 Stopped Akka.Remote.Transport.ProtocolStateActor
[ INFO] 2016-01-21 13:04:58.6550 Quarantined address [akka.tcp://qb#client:8090] is still unreachable or has not been restarted. Keeping it quarantined. Akka.Event.DummyClassForStringSources
[DEBUG] 2016-01-21 13:04:58.6725 Stopped Akka.Remote.ReliableDeliverySupervisor
[DEBUG] 2016-01-21 13:04:58.6725 no longer watched by [akka://myservice/system/endpointManager/reliableEndpointWriter-akka.tcp%3a%2f%2fqb%40client%3a8090-2] Akka.Remote.EndpointWriter
[DEBUG] 2016-01-21 13:04:58.6725 Disassociated [akka.tcp://myservice#server:8081] <- akka.tcp://qb#client:8090 Akka.Remote.EndpointWriter
[DEBUG] 2016-01-21 13:04:58.6725 Stopped Akka.Remote.EndpointWriter
And then when the client attempts to reconnect, I get:
[DEBUG] 2016-01-21 13:05:15.5883 ConnectResponse [akka.tcp://qb#client:8090/user/qb] ServerActor
[DEBUG] 2016-01-21 13:05:16.0467 Started (Akka.Remote.Transport.ProtocolStateActor) Akka.Remote.Transport.ProtocolStateActor
[DEBUG] 2016-01-21 13:05:16.0467 Stopped Akka.Remote.Transport.ProtocolStateActor
[ WARN] 2016-01-21 13:05:16.0467 AssociationError [akka.tcp://myservice#server:8081] -> akka.tcp://qb#client:8090: Error [Invalid address: akka.tcp://qb#client:8090] [] Akka.Remote.EndpointWriter
[ INFO] 2016-01-21 13:05:16.0467 Quarantined address [akka.tcp://qb#client:8090] is still unreachable or has not been restarted. Keeping it quarantined. Akka.Event.DummyClassForStringSources
[DEBUG] 2016-01-21 13:05:16.0643 Stopped Akka.Remote.ReliableDeliverySupervisor
[DEBUG] 2016-01-21 13:05:16.0711 no longer watched by [akka://myservice/system/endpointManager/reliableEndpointWriter-akka.tcp%3a%2f%2fqb%40client%3a8090-4] Akka.Remote.EndpointWriter
[DEBUG] 2016-01-21 13:05:16.0711 Disassociated [akka.tcp://myservice#server:8081] -> akka.tcp://qb#client:8090 Akka.Remote.EndpointWriter
[DEBUG] 2016-01-21 13:05:16.0711 Stopped Akka.Remote.EndpointWriter
[DEBUG] 2016-01-21 13:05:16.0867 received AutoReceiveMessage <Terminated>: [akka://myservice/system/endpointManager/reliableEndpointWriter-akka.tcp%3a%2f%2fqb%40client%3a8090-4] - ExistenceConfirmed=True Akka.Remote.EndpointManager
[DEBUG] 2016-01-21 13:05:16.0867 Terminated [akka.tcp://qb#client:8090/user/qb] ServerActor
I suspect that this behavior is a feature of Akka.net, however, I need to implement my system so that clients can disconnect and then reconnect back to the server without the need to wait. Is there any way to disable the quarantine mechanism or to gracefully close the client endpoint on the server so that the client endpoint doesn't get quarantined?
[ INFO] 2016-01-21 13:04:58.6550 Quarantined address [akka.tcp://qb#client:8090] is still unreachable or has not been restarted. Keeping it quarantined. - that says it all. The node was quarantined which requires a restart of the actor system.
However, IMHO - just upgrade to Akka.NET 1.0.6, which we released on Monday. We made the remoting policy manager much less brittle than it has been historically.

storm-YARN : Topology submitted but no supervisor nodes assigned

I'm running WordCount storm TOPOLOGY (https://github.com/nathanmarz/storm-starter/).
I'm launching this topology via storm yarn, but it's not running any worker nodes. In the storm ui I cannot see any supervisor nodes assigned. In my code I've configured to start three worker nodes.
Do we need to do something else to get the supervisors started?
I also looked at nimbus.log, stderr, stdout & ui.log there are no signs of error.
storm jar target/storm-starter-0.9.3-rc2-SNAPSHOT-jar-with-dependencies.jar storm.starter.WordCountTopology TEST-WC-TOPO
1045 [main] INFO backtype.storm.StormSubmitter - Successfully uploaded topology jar to assigned location: storm-local/nimbus/inbox/stormjar-4a08ebf7-022c-4b22-86be-1efc6e8f6be0.jar
1046 [main] INFO backtype.storm.StormSubmitter - Submitting topology WCT in distributed mode with conf {"topology.workers":3,"topology.debug":true}
1438 [main] INFO backtype.storm.StormSubmitter - Finished submitting topology: WCT