Node doesn't rejoin cluster after being downed - akka.net

I'm using Akka.NET's cluster (1.0.5) functionality to implement a service that consists of a master node that receives requests over HTTP and farms the work out to worker nodes that have joined the cluster.
The idea is to be able to easily accomplish the following:
add worker nodes to the cluster when demand is high (check)
be able to reboot the master node or take it offline (maintenance/misbehaviour/whatever) and have the workers reconnect when it becomes available (check)
upgrade/reboot a misbehaving worker and have it reconnect to the master node (fail!)
The first point works as you'd expect: a new instance (Azure Cloud Service worker role) is spun up, and joins the master - which is also the seed node.
For the second point, every worker node runs an actor that listens to cluster gossip and determines whether the master node has died; if so, the worker's actor system is rebooted.
The last point is where I'm stuck. The master node also listens to cluster gossip to determine when a worker has become unreachable (ClusterEvent.UnreachableMember) or is shutting down (Exiting status) and decides whether it should be downed. From what I've understood of the documentation, the only way to have a "new" incarnation of the same node rejoin the cluster is to down the old one first.
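The downing logic on the master looks roughly like the sketch below (a minimal reconstruction, not the exact production actor; the decision about whether to actually down a member is simplified away):

// Minimal sketch of a master-side actor that downs workers, assuming the Akka.Cluster 1.0.x API
using Akka.Actor;
using Akka.Cluster;

public class DowningActor : ReceiveActor
{
    private readonly Cluster _cluster = Cluster.Get(Context.System);

    public DowningActor()
    {
        Receive<ClusterEvent.UnreachableMember>(m =>
        {
            // Worker was marked unreachable: remove the old incarnation so a
            // restarted worker can rejoin (mirrors the Leave/Down calls in the log below).
            _cluster.Leave(m.Member.Address);
            _cluster.Down(m.Member.Address);
        });

        Receive<ClusterEvent.MemberExited>(m =>
        {
            // Worker announced it is shutting down (Exiting status).
            _cluster.Down(m.Member.Address);
        });
    }

    protected override void PreStart()
    {
        _cluster.Subscribe(Self, new[] { typeof(ClusterEvent.UnreachableMember), typeof(ClusterEvent.MemberExited) });
    }

    protected override void PostStop()
    {
        _cluster.Unsubscribe(Self);
    }
}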
Unfortunately this doesn't seem to be happening. In the test scenario I ran to reproduce the problem locally in the compute emulator, these were the steps:
Start the master node (port 8090)
Start the worker node (port 9090)
Do some work
Kill the worker node abruptly
Start the worker node back up
Below are relevant snippets from the logs I collected for both nodes during this test:
Master:
Worker becomes unreachable:
[WARNING][07/12/2015 20:39:35][Thread 0023][[akka://InventoryService/system/cluster/core/daemon]] Cluster Node [akka.tcp://InventoryService#127.0.0.1:8090] - Marking node(s) as UNREACHABLE [Member(address = akka.tcp://InventoryService#0.0.0.0:9090, status = Up]
Master node calls Cluster.Leave() and Cluster.Down() on the worker's address:
[DEBUG][07/12/2015 20:39:35][Thread 0021][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.ClusterUserAction+Leave
[INFO][07/12/2015 20:39:35][Thread 0021][[akka://InventoryService/system/cluster/core/daemon]] Marked address [akka.tcp://InventoryService#0.0.0.0:9090] as Leaving]
[DEBUG][07/12/2015 20:39:35][Thread 0021][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.ClusterUserAction+Down
[INFO][07/12/2015 20:39:35][Thread 0021][[akka://InventoryService/system/cluster/core/daemon]] Marking unreachable node [akka.tcp://InventoryService#0.0.0.0:9090] as Down
[DEBUG][07/12/2015 20:39:35][Thread 0020][[akka://InventoryService/system/cluster/core/daemon/heartbeatSender]] Cluster Node [akka.tcp://InventoryService#127.0.0.1:8090] - Heartbeat to [akka.tcp://InventoryService#0.0.0.0:9090]
[INFO][07/12/2015 20:39:36][Thread 0021][[akka://InventoryService/system/cluster/core/daemon]] Leader is removing unreachable node [akka.tcp://InventoryService#0.0.0.0:9090]
Master confirms the old node will no longer be allowed to join (the first line also seems to show a logging bug: "gated instead for akka.tcp://InventoryService#0.0.0.0:9090 ms" prints the address where I'd expect the gating duration in milliseconds):
[WARNING][07/12/2015 20:39:36][Thread 0013][remoting] Association to [akka.tcp://InventoryService#0.0.0.0:9090] with unknown UID is reported as quarantined, but address cannot be quarantined without knowing the UID, gated instead for akka.tcp://InventoryService#0.0.0.0:9090 ms
[DEBUG][07/12/2015 20:39:36][Thread 0015][[akka://InventoryService/system/endpointManager/reliableEndpointWriter-akka.tcp%3a%2f%2fInventoryService%400.0.0.0%3a9090-2/endpointWriter]] Disassociated [akka.tcp://InventoryService#127.0.0.1:8090] -> akka.tcp://InventoryService#0.0.0.0:9090
[DEBUG][07/12/2015 20:39:36][Thread 0020][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Association to [akka.tcp://InventoryService#0.0.0.0:9090] having UID [1198519768] is irrecoverably failed. UID is now quarantined and all messages to this UID will be delivered to dead letters. Remote actorsystem must be restarted to recover from this situation.
[WARNING][07/12/2015 20:39:36][Thread 0013][remoting] Association to [akka.tcp://InventoryService#0.0.0.0:9090] having UID [1198519768] is irrecoverably failed. UID is now quarantined and all messages to this UID will be delivered to dead letters. Remote actorsystem must be restarted to recover from this situation.
Worker boots and tries to connect to the master:
[DEBUG][07/12/2015 20:40:20][Thread 0013][remoting] Associated [akka.tcp://InventoryService#127.0.0.1:8090] <- akka.tcp://InventoryService#0.0.0.0:9090
[DEBUG][07/12/2015 20:40:21][Thread 0023][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
[DEBUG][07/12/2015 20:40:21][Thread 0020][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
[DEBUG][07/12/2015 20:40:23][Thread 0021][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
[DEBUG][07/12/2015 20:40:28][Thread 0020][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
[DEBUG][07/12/2015 20:40:33][Thread 0023][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
[DEBUG][07/12/2015 20:40:38][Thread 0022][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
[DEBUG][07/12/2015 20:40:43][Thread 0021][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
[DEBUG][07/12/2015 20:40:48][Thread 0020][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
[DEBUG][07/12/2015 20:40:53][Thread 0023][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
[DEBUG][07/12/2015 20:40:58][Thread 0022][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
[DEBUG][07/12/2015 20:41:03][Thread 0021][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
[DEBUG][07/12/2015 20:41:08][Thread 0021][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
[DEBUG][07/12/2015 20:41:13][Thread 0020][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
[DEBUG][07/12/2015 20:41:18][Thread 0023][[akka://InventoryService/system/cluster/core/daemon]] [Initialized] Received Akka.Cluster.InternalClusterAction+InitJoin
What is happening here?
Worker:
Booting back up after being killed:
[DEBUG][07/12/2015 20:40:18][Thread 0020][[akka://InventoryService/system/cluster/core/daemon]] [Uninitialized] Received Akka.Cluster.InternalClusterAction+JoinSeedNodes
[DEBUG][07/12/2015 20:40:18][Thread 0020][[akka://InventoryService/system/cluster/core/daemon]] [Uninitialized] Received Akka.Cluster.InternalClusterAction+Subscribe
[DEBUG][07/12/2015 20:40:18][Thread 0020][[akka://InventoryService/system/cluster/core/daemon]] [Uninitialized] Received Akka.Cluster.InternalClusterAction+JoinSeedNodes
[DEBUG][07/12/2015 20:40:20][Thread 0021][[akka://InventoryService/system/cluster/core/daemon]] [Uninitialized] Received Akka.Cluster.InternalClusterAction+Subscribe
[DEBUG][07/12/2015 20:40:18][Thread 0020][[akka://InventoryService/system/cluster/core/daemon]] [Uninitialized] Received Akka.Cluster.InternalClusterAction+Subscribe
[DEBUG][07/12/2015 20:40:21][Thread 0015][[akka://InventoryService/system/endpointManager/reliableEndpointWriter-akka.tcp%3a%2f%2fInventoryService%40127.0.0.1%3a8090-1/endpointWriter]] Drained buffer with maxWriteCount: 50, fullBackoffCount: 1,smallBackoffCount: 0, noBackoffCount: 0,adaptiveBackoff: 10000
And that's it... nothing else gets written to the log!
Full log files:
Master: http://pastebin.com/raw.php?i=WtjEhV1V
Worker: http://pastebin.com/raw.php?i=QGPxkqEd
Master cluster config:
cluster {
  seed-nodes = ["master's address here"]
  roles = [ InventoryServiceMaster, InventoryServiceWorker ]
  failure-detector {
    acceptable-heartbeat-pause = 5s
    threshold = 10.0
  }
}
Worker's config is the same, but only has the InventoryServiceWorker role.
What am I missing here? Is this a configuration problem? (I'm hoping it's not a bug - I've seen someone else report a similar problem on GitHub.)
EDIT:
Just to be clear, I'm not using the Akka.dll from NuGet since it contains a serialization bug - I checked out the current master, applied the fix, and did a Release build. The logs have debug information because I kept the PDB from the build.
EDIT 2:
In the worker log, after rebooting, the event Akka.Cluster.InternalClusterAction+JoinSeedNodes appears twice because I originally had a manual call to Cluster.JoinSeedNodes(). I've since removed this but the result is still the same.
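The removed call was along these lines (a sketch; the seed address is assumed to be the master's address shown in the logs):

// Hypothetical reconstruction of the manual join that was removed; the seed-nodes
// entry in the HOCON config is already enough to make the worker join on startup.
var cluster = Cluster.Get(actorSystem);
cluster.JoinSeedNodes(new[] { Address.Parse("akka.tcp://InventoryService@127.0.0.1:8090") });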

This has been resolved as of Akka.NET 1.1 - our UID system wasn't implemented correctly prior to that (1.0.5, at the time of this post) but it works fine now.

Related

Data Node and Node Manager is not starting in pseudo-cluster mode (Apache Hadoop)

The Data Node and Node Manager is not starting in pseudo-cluster mode (Apache Hadoop).
Seeing this error in the log file:
2017-08-22 17:15:08,403 ERROR org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Unexpected error starting NodeStatusUpdater
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Recieved SHUTDOWN signal from Resourcemanager ,Registration of NodeManager failed, Message from ResourceManager: NodeManager from archit doesn't satisfy minimum allocations, Sending SHUTDOWN signal to the NodeManager.
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.registerWithRM(NodeStatusUpdaterImpl.java:278)
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.serviceStart(NodeStatusUpdaterImpl.java:197)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStart(NodeManager.java:272)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:496)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:543)
2017-08-22 17:15:08,404 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl failed in state STARTED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException:
The "ResourceManager: NodeManager from archit doesn't satisfy minimum allocations" error is seen when node on which node manager is being started does not have enough resources w.r.t yarn.scheduler.minimum-allocation-vcores and yarn.scheduler.minimum-allocation-mb configurations.
Reduce values of yarn.scheduler.minimum-allocation-vcores and yarn.scheduler.minimum-allocation-mb then restart yarn.
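For example, in yarn-site.xml (values are illustrative only; size them so they fit within the NodeManager host's memory and vcores):

<!-- Illustrative values; pick numbers the node can actually satisfy -->
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>512</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-vcores</name>
  <value>1</value>
</property>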

IO thread error : 1595 (Relay log write failure: could not queue event from master)

Slave status :
Last_IO_Errno: 1595
Last_IO_Error: Relay log write failure: could not queue event from master
Last_SQL_Errno: 0
From the error log:
[ERROR] Slave I/O for channel 'db12': Unexpected master's heartbeat data: heartbeat is not compatible with local info; the event's data: log_file_name toku10-bin.000063<D1> log_pos 97223067, Error_code: 1623
[ERROR] Slave I/O for channel 'db12': Relay log write failure: could not queue event from master, Error_code: 1595
I tried restarting the slave IO thread many times; it's still the same.
We need to keep manually starting the io_thread whenever it stops; I suspect this is a bug in Percona.
I have simply written a shell script, scheduled every 10 minutes, that checks whether the io_thread is running and, if not, issues start slave io_thread for channel 'db12';. It's working as of now.
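The check is essentially the sketch below (a rough reconstruction, not the exact script; it assumes the mysql client can authenticate, e.g. via ~/.my.cnf):

#!/bin/bash
# Rough sketch of the cron'd watchdog: if the IO thread for channel 'db12'
# is not running, start it again. Scheduled e.g. every 10 minutes from cron.
RUNNING=$(mysql -e "SHOW SLAVE STATUS FOR CHANNEL 'db12'\G" | grep -c "Slave_IO_Running: Yes")
if [ "$RUNNING" -eq 0 ]; then
  mysql -e "START SLAVE IO_THREAD FOR CHANNEL 'db12';"
fi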

Datanode error : NameSystem.getDatanode

Help please, folks.
I am trying to set up my Hadoop multi-node environment (1 master, 1 secondary and 3 slaves - Hadoop 2.7.1/Ubuntu 14 on AWS) and I am getting a "NameSystem.getDatanode" ERROR message. I have browsed, read and tried, but I have reached my limits. Could you point me at least in some direction?
Logs (extract) from the master - XXX.XX.XX.141/142/143 are the IPs of the slaves:
Line 134: 2016-01-23 17:36:19,432 ERROR org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.getDatanode: Data node DatanodeRegistration(XXX.XX.XX.143:50010, datanodeUuid=6826238d-9213-4b19-a6eb-13115e3bea8d, infoPort=50075, infoSecurePort=0, ipcPort=50020, storageInfo=lv=-56;cid=CID-57295bbd-e78e-4265-99f7-fdacccbcb33a;nsid=1674724909;c=0) is attempting to report storage ID 6826238d-9213-4b19-a6eb-13115e3bea8d. Node 172.31.22.141:50010 is expected to serve this storage.
Line 135: 2016-01-23 17:36:19,457 ERROR org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.getDatanode: Data node DatanodeRegistration(XXX.XX.XX.142:50010, datanodeUuid=6826238d-9213-4b19-a6eb-13115e3bea8d, infoPort=50075, infoSecurePort=0, ipcPort=50020, storageInfo=lv=-56;cid=CID-57295bbd-e78e-4265-99f7-fdacccbcb33a;nsid=1674724909;c=0) is attempting to report storage ID 6826238d-9213-4b19-a6eb-13115e3bea8d. Node 172.31.22.141:50010 is expected to serve this storage.
Line 159: 2016-01-23 17:36:20,988 ERROR org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.getDatanode: Data node DatanodeRegistration(XXX.XX.XX.141:50010, datanodeUuid=6826238d-9213-4b19-a6eb-13115e3bea8d, infoPort=50075, infoSecurePort=0, ipcPort=50020, storageInfo=lv=-56;cid=CID-57295bbd-e78e-4265-99f7-fdacccbcb33a;nsid=1674724909;c=0) is attempting to report storage ID 6826238d-9213-4b19-a6eb-13115e3bea8d. Node XXX.XX.XX.143:50010 is expected to serve this storage.
Extract from the SLAVE2 server logs:
2016-01-23 17:36:14,812 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: STARTUP_MSG:
2016-01-23 17:36:18,607 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Unsuccessfully sent block report 0x3c90bbfe60c, containing 1 storage report(s), of which we sent 0. The reports had 1 total blocks and used 0 RPC(s). This took 4 msec to generate and 144 msecs for RPC and NN processing. Got back no commands.
2016-01-23 17:36:18,608 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Block pool BP-1050309752-MAST.XX.XX.169-1453113991010 (Datanode Uuid 6826238d-9213-4b19-a6eb-13115e3bea8d) service to master/MAST.XX.XX.169:9000 is shutting down
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.UnregisteredNodeException): Data node DatanodeRegistration(1XX.XX.XX.142:50010, datanodeUuid=6826238d-9213-4b19-a6eb-13115e3bea8d, infoPort=50075, infoSecurePort=0, ipcPort=50020, storageInfo=lv=-56;cid=CID-57295bbd-e78e-4265-99f7-fdacccbcb33a;nsid=1674724909;c=0) is attempting to report storage ID 6826238d-9213-4b19-a6eb-13115e3bea8d. Node 1XX.XX.XX.141:50010 is expected to serve this storage.
at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.getDatanode(DatanodeManager.java:495)
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.processReport(BlockManager.java:1791)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.blockReport(NameNodeRpcServer.java:1315)
at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolServerSideTranslatorPB.blockReport(DatanodeProtocolServerSideTranslatorPB.java:163)
at org.apache.hadoop.hdfs.protocol.proto.DatanodeProtocolProtos$DatanodeProtocolService$2.callBlockingMethod(DatanodeProtocolProtos.java:28543)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
at org.apache.hadoop.ipc.Client.call(Client.java:1476)
at org.apache.hadoop.ipc.Client.call(Client.java:1407)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy13.blockReport(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.blockReport(DatanodeProtocolClientSideTranslatorPB.java:199)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.blockReport(BPServiceActor.java:463)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:688)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:823)
at java.lang.Thread.run(Thread.java:745)
2016-01-23 17:36:18,610 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Ending block pool service for: Block pool BP-1050309752-MAST.XX.XX.169-1453113991010 (Datanode Uuid 6826238d-9213-4b19-a6eb-13115e3bea8d) service to master/MAST.XX.XX.169:9000
2016-01-23 17:36:18,611 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Removed Block pool BP-1050309752-MAST.XX.XX.169-1453113991010 (Datanode Uuid 6826238d-9213-4b19-a6eb-13115e3bea8d)
2016-01-23 17:36:18,611 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Removing block pool BP-1050309752-MAST.XX.XX.169-1453113991010
2016-01-23 17:36:20,611 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Exiting Datanode
2016-01-23 17:36:20,613 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 0
2016-01-23 17:36:20,614 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down DataNode at ip-SLAV-XX-XX-142/1XX.XX.XX.142
************************************************************/
It looks like you have three slaves:
172.31.22.141:50010
172.31.22.142:50010
172.31.22.143:50010
and you created two of them from a clone of the first slave, after that slave was already included in the cluster.
The two clones now have a copy of the DFS and use the same storage ID as the first slave. The NameNode expects only one slave with a given storage ID, and it is trying to tell you this by logging:
[...] is attempting to report storage ID [...].
Node [...]:50010 is expected to serve this storage.
You can try removing the dfs directory on two of the slaves and then restarting them.
I.e. stop the slaves and do an rm -rf on the dfs directory, like:
rm -rf /tmp/hadoop-hadoop/dfs/
You can then restart and check that all slaves are connecting and test file replication, e.g. by setting the replication level to 4 for some files like:
hdfs dfs -setrep -w 4 -R /user/somedir
The -w option causes the command to wait until replication has succeeded.

GridGain network connection: Is it possible to forward a node via SSH?

I would like to SSH into a remote machine running a gridgain instance and connect to it from a local gridgain instance. Can this be done?
How is the GridGain network connection made? As far as I can see, the node spins up and listens on the first available port in 47100-47200, but it opens some more ports too.
It does not seem to be sufficient to just, e.g., forward 47100 on the remote machine (the remote machine's GridGain port) to local 47100. Presumably the communication is not just client-server but symmetrical, with the remote node trying to connect to my home node?
Is there documentation on the network protocol?
I tried symmetrically forwarding the GridTcpCommunicationSpi.DFLT_PORT (47100+) and GridTcpDiscoverySpi.DFLT_PORT (47500+) ports.
The nodes are able to connect. On the local node I first get this warning:
WARN GridTcpCommunicationSpi - Connect timed out (consider increasing 'connTimeout' configuration property) [addr=/10.240.136.167:47100]
WARN GridTcpDiscoverySpi - Timed out waiting for message delivery receipt (most probably, the reason is in long GC pauses on remote node; consider tuning GC and increasing 'ackTimeout' configuration property). Will retry to send message with increased timeout. Current timeout: 5000.
WARN GridDhtPreloader - <gg-utility-sys-cache> Failed to wait for initial partition map exchange. Possible reasons are:
^-- Transactions in deadlock.
^-- Long running transactions (ignore if this is the case).
^-- Unreleased explicit locks.
WARN GridTcpDiscoverySpi - Timed out waiting for message to be read (most probably, the reason is in long GC pauses on remote node. Current timeout: 5000.
This is a timeout while it somehow tries to connect to 10.240.136.167:47100 - the remote machine's local IP, which is obviously unreachable from my side.
But it still looks promising, as I then get the following:
INFO GridDiscoveryManager - Topology snapshot [ver=2, nodes=2, CPUs=6, heap=2.7GB]
On executing the following broadcast test:
grid.compute().broadcast(new GridRunnable() {
    @Override
    public void run() {
        System.out.println("hello!");
    }
});
I get this fatal error on the remote machine, whatever it may be:
[SEVERE][gridgain-#9%pub-null%][GridJobProcessor] Task was not deployed or was redeployed since task execution [taskName=nix.GoogleGridRun$Test, taskClsName=at$
at org.gridgain.grid.kernal.processors.job.GridJobProcessor$JobExecutionListener.onMessage(GridJobProcessor.java:1732)
at org.gridgain.grid.kernal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:654)
at org.gridgain.grid.kernal.managers.communication.GridIoManager.access$1800(GridIoManager.java:62)
at org.gridgain.grid.kernal.managers.communication.GridIoManager$6.body(GridIoManager.java:615)
at org.gridgain.grid.util.worker.GridWorker.run(GridWorker.java:151)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
[19:58:02,237][SEVERE][gridgain-#11%pub-null%][GridJobProcessor] Task was not deployed or was redeployed since task execution [taskName=nix.GoogleGridRun$1, taskClsName=at.a$
For more information see:
Troubleshooting: http://bit.ly/GridGain-Troubleshooting
Documentation Center: http://bit.ly/GridGain-Documentation
class org.gridgain.grid.GridDeploymentException: Task was not deployed or was redeployed since task execution [taskName=nix.GoogleGridRun$1, taskClsName=at.ac.ait.is.infrase$
For more information see:
Troubleshooting: http://bit.ly/GridGain-Troubleshooting
Documentation Center: http://bit.ly/GridGain-Documentation
at org.gridgain.grid.kernal.processors.job.GridJobProcessor.processJobExecuteRequest(GridJobProcessor.java:1107)
at org.gridgain.grid.kernal.processors.job.GridJobProcessor$JobExecutionListener.onMessage(GridJobProcessor.java:1732)
at org.gridgain.grid.kernal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:654)
at org.gridgain.grid.kernal.managers.communication.GridIoManager.access$1800(GridIoManager.java:62)
at org.gridgain.grid.kernal.managers.communication.GridIoManager$6.body(GridIoManager.java:615)
On the client side I don't see anything but:
INFO GridDeploymentLocalStore - Class locally deployed: class nix.GoogleGridRun$1
hello!
When I trigger the broadcast again via the debugger, I get the following on the local machine and the same error message as before on the remote machine:
ERROR GridTaskWorker - Failed to obtain remote job result policy for result from GridComputeTask.result(..) method (will fail the whole task): GridJobResultImpl [job=o.g.g.kernal.processors.closure.GridClosureProcessor$10#7e89183d, sib=GridJobSiblingImpl [sesId=4c17983b841-43f8b9fa-87ae-4a20-99a1-8d36f5eb74a4, jobId=0d17983b841-ef0084a6-f6a7-4501-87a0-3c5eb7c72bca, nodeId=ef0084a6-f6a7-4501-87a0-3c5eb7c72bca, isJobDone=false], jobCtx=GridJobContextImpl [jobId=0d17983b841-ef0084a6-f6a7-4501-87a0-3c5eb7c72bca, attrs={}], node=GridTcpDiscoveryNode [id=ef0084a6-f6a7-4501-87a0-3c5eb7c72bca, addrs=[10.240.136.167, 127.0.0.1], sockAddrs=[/10.240.136.167:47500, /10.240.136.167:47500, /127.0.0.1:47500], discPort=47500, order=1, loc=false, ver=6.5.0#20140925-sha1:6dc3d773], ex=class o.g.g.GridDeploymentException: Task was not deployed or was redeployed since task execution [taskName=nix.GoogleGridRun$Test, taskClsName=nix.GoogleGridRun$Test, codeVer=0, clsLdrId=eb17983b841-43f8b9fa-87ae-4a20-99a1-8d36f5eb74a4, seqNum=1411761402302, depMode=SHARED, dep=null]
For more information see:
Troubleshooting: http://bit.ly/GridGain-Troubleshooting
Documentation Center: http://bit.ly/GridGain-Documentation
, hasRes=true, isCancelled=false, isOccupied=true]
class org.gridgain.grid.GridException: Remote job threw user exception (override or implement GridComputeTask.result(..) method if you would like to have automatic failover for this exception).
at org.gridgain.grid.compute.GridComputeTaskAdapter.result(GridComputeTaskAdapter.java:109)
at org.gridgain.grid.kernal.processors.task.GridTaskWorker$3.apply(GridTaskWorker.java:819)
at org.gridgain.grid.kernal.processors.task.GridTaskWorker$3.apply(GridTaskWorker.java:812)
at org.gridgain.grid.util.GridUtils.wrapThreadLoader(GridUtils.java:6093)
at org.gridgain.grid.kernal.processors.task.GridTaskWorker.result(GridTaskWorker.java:812)
at org.gridgain.grid.kernal.processors.task.GridTaskWorker.onResponse(GridTaskWorker.java:708)
at org.gridgain.grid.kernal.processors.task.GridTaskProcessor.processJobExecuteResponse(GridTaskProcessor.java:906)
at org.gridgain.grid.kernal.processors.task.GridTaskProcessor$JobMessageListener.onMessage(GridTaskProcessor.java:1138)
at org.gridgain.grid.kernal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:654)
at org.gridgain.grid.kernal.managers.communication.GridIoManager.access$1800(GridIoManager.java:62)
at org.gridgain.grid.kernal.managers.communication.GridIoManager$6.body(GridIoManager.java:615)
at org.gridgain.grid.util.worker.GridWorker.run(GridWorker.java:151)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: class org.gridgain.grid.GridDeploymentException: Task was not deployed or was redeployed since task execution [taskName=nix.GoogleGridRun$Test, taskClsName=nix.GoogleGridRun$Test, codeVer=0, clsLdrId=eb17983b841-43f8b9fa-87ae-4a20-99a1-8d36f5eb74a4, seqNum=1411761402302, depMode=SHARED, dep=null]
For more information see:
Troubleshooting: http://bit.ly/GridGain-Troubleshooting
Documentation Center: http://bit.ly/GridGain-Documentation
at org.gridgain.grid.kernal.processors.job.GridJobProcessor.processJobExecuteRequest(GridJobProcessor.java:1107)
at org.gridgain.grid.kernal.processors.job.GridJobProcessor$JobExecutionListener.onMessage(GridJobProcessor.java:1732)
at org.gridgain.grid.kernal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:654)
at org.gridgain.grid.kernal.managers.communication.GridIoManager.access$1800(GridIoManager.java:62)
at org.gridgain.grid.kernal.managers.communication.GridIoManager$6.body(GridIoManager.java:615)
at org.gridgain.grid.util.worker.GridWorker.run(GridWorker.java:151)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
... 1 more
On the local host side I have connections between the virtual and real ports
tcp6 0 0 127.0.0.1:47100 127.0.0.1:38272 VERBUNDEN 12280/java
tcp6 0 0 127.0.0.1:38272 127.0.0.1:47100 VERBUNDEN 12280/java
And some more to and from the SSH client (also Java):
tcp6 45832 0 78.101.12.107:47101 146.148.119.62:51867 VERBUNDEN 12280/java
tcp6 231 0 78.101.12.107:47501 146.148.119.62:46219 CLOSE_WAIT 12280/java
tcp6 48 0 78.101.12.107:37129 146.148.119.62:22 VERBUNDEN 12280/java
tcp6 1 0 78.101.12.107:47501 146.148.119.62:44391 CLOSE_WAIT 12280/java
78.101.12.107 = local ip
146.148.119.62 = remote ip
Looking at netstat on a successful local two-node grid, I see the following connections being made:
tcp6 0 0 ::1:47501 ::1:43143 VERBUNDEN 10218/java
tcp6 0 0 ::1:47500 ::1:34708 VERBUNDEN 9496/java
tcp6 0 0 ::1:34708 ::1:47500 VERBUNDEN 10218/java
tcp6 0 0 ::1:43143 ::1:47501 VERBUNDEN 9496/java
These are between the GridTcpCommunicationSpi.DFLT_PORT and GridTcpDiscoverySpi.DFLT_PORT ranges - so forwarding those should maybe be enough.
Any ideas on what could be wrong?
The home node should be reachable from the cluster as well. You have two options:
Set up a VPN.
Implement and configure GridAddressResolver for all nodes so that it maps their local addresses to external addresses (see the sketch after this list). This will also require setting up port forwarding in your home network.
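A very rough sketch of the second option (the package, interface and method names here are assumptions based on the GridGain 6.x API lineage - check the Javadoc of your exact version before relying on this):

// Hypothetical address resolver mapping a node's internal address to the externally
// reachable (forwarded) address; it would be registered on each node's GridConfiguration.
import java.net.InetSocketAddress;
import java.util.Collection;
import java.util.Collections;
import org.gridgain.grid.GridAddressResolver; // package assumed
import org.gridgain.grid.GridException;

public class ExternalAddressResolver implements GridAddressResolver {
    @Override
    public Collection<InetSocketAddress> getExternalAddresses(InetSocketAddress addr) throws GridException {
        // Map the internal bind address to the public address/port that forwards to it
        // (146.148.119.62 is the remote IP from the netstat output above).
        return Collections.singleton(new InetSocketAddress("146.148.119.62", addr.getPort()));
    }
}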

Netty UDT examples on amazon ec2

I am not a networking guru so I am probably missing something simple.
I have built both the message echo client and msg echo server as runnable jar files using the netty 4.0.11 http://dl.bintray.com/netty/downloads/netty-4.0.11.Final.tar.bz2
I was able to load all of the correct dependencies using Maven, and the project builds and runs correctly both locally and on the server. I can run a server and connect to it from a client on the same host (localhost), both on my local machine and on my Amazon EC2 instance. So it works when connecting to itself (localhost) on both my machine and my server.
The problem is that I cannot connect to the server from outside the machine it is running on. For example, I want to connect to my echo msg server (running on my EC2 instance) from the echo msg client running on my local computer.
I have set up the Amazon security settings to allow UDP on the correct port and from the IP of my local machine. I run the echo server on the EC2 instance and it starts up correctly:
REGISTERED
ACTIVE
DATAGRAM LISTENING bind=/0.0.0.0:1234 peer=null:0
But I just cannot connect from outside the local host. I have also turned off all firewalls (temporarily) on my local machine. Still, when I try to connect, here is the error message I get on the client machine:
CONNECT(/70.36.197.242:1234, null)
10:11:17.000 [connect-0] DEBUG com.barchart.udt.EpollUDT - ep 1 rem [id: 0x3287e50e] DATAGRAM CONNECTING bind=/0.0.0.0:55005 peer=null:0
10:11:17.000 [connect-0] DEBUG com.barchart.udt.EpollUDT - ep 1 add [id: 0x3287e50e] DATAGRAM CONNECTING bind=/0.0.0.0:55005 peer=null:0 ERROR_WRITE
10/24/13 10:11:19 AM ===========================================================
udt.echo.message.MsgEchoClientHandler:
rate:
count = 0
mean rate = 0.00 bytes/s
1-minute rate = 0.00 bytes/s
5-minute rate = 0.00 bytes/s
15-minute rate = 0.00 bytes/s
10:11:20.017 [connect-0] WARN com.barchart.udt.nio.SelectionKeyUDT - logic error :
[id: 0x3287e50e] poll=ERROR_WRITE ready=---- inter=-C-- DATAGRAM CONNECTOR CONNECTING bind=/0.0.0.0:55005 peer=null:0
java.lang.Exception: Unexpected error report.
at com.barchart.udt.nio.SelectionKeyUDT.logError(SelectionKeyUDT.java:436) [barchart-udt-bundle-2.3.0.jar:na]
at com.barchart.udt.nio.SelectionKeyUDT.doRead(SelectionKeyUDT.java:205) [barchart-udt-bundle-2.3.0.jar:na]
at com.barchart.udt.nio.SelectorUDT.doResultsRead(SelectorUDT.java:334) [barchart-udt-bundle-2.3.0.jar:na]
at com.barchart.udt.nio.SelectorUDT.doResults(SelectorUDT.java:309) [barchart-udt-bundle-2.3.0.jar:na]
at com.barchart.udt.nio.SelectorUDT.doEpollExclusive(SelectorUDT.java:234) [barchart-udt-bundle-2.3.0.jar:na]
at com.barchart.udt.nio.SelectorUDT.doEpollEnter(SelectorUDT.java:196) [barchart-udt-bundle-2.3.0.jar:na]
at com.barchart.udt.nio.SelectorUDT.select(SelectorUDT.java:455) [barchart-udt-bundle-2.3.0.jar:na]
at io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:596) [netty-all-4.0.11.Final.jar:na]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:306) [netty-all-4.0.11.Final.jar:na]
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:101) [netty-all-4.0.11.Final.jar:na]
at java.lang.Thread.run(Unknown Source) [na:1.7.0_06]
10:11:20.018 [connect-0] DEBUG com.barchart.udt.EpollUDT - ep 1 rem [id: 0x3287e50e] DATAGRAM CONNECTING bind=/0.0.0.0:55005 peer=null:0
10:11:20.018 [connect-0] ERROR c.barchart.udt.nio.SocketChannelUDT - connect failure : [id: 0x3287e50e] DATAGRAM CONNECTING bind=/0.0.0.0:55005 peer=null:0
10:11:20.018 [connect-0] INFO i.n.handler.logging.LoggingHandler - [id: 0x53f0f817] CLOSE()
Exception in thread "main" java.io.IOException
at com.barchart.udt.nio.SocketChannelUDT.finishConnect(SocketChannelUDT.java:236)
at io.netty.channel.udt.nio.NioUdtMessageConnectorChannel.doFinishConnect(NioUdtMessageConnectorChannel.java:132)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:228)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:502)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:417)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:348)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:101)
at java.lang.Thread.run(Unknown Source)
Never mind - after doing some more troubleshooting I found the solution. In addition to allowing the port in the Amazon security settings, I also had to allow the UDP port on the server's own firewall.
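On a typical Linux server that would be something like the following (illustrative only - the port comes from the bind address in the log above, and the exact tool depends on the distribution):

# Open UDP port 1234 in the host firewall, in addition to the EC2 security group rule
sudo iptables -A INPUT -p udp --dport 1234 -j ACCEPT
# or, if ufw is in use:
sudo ufw allow 1234/udp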